chacha-opt
ABOUT

This is an optimized library for ChaCha, a stream cipher with a 256-bit key and a 64-bit nonce.

HChaCha is also implemented; it is used to build XChaCha, a variant that extends the nonce from 64 bits to 192 bits. See Extending the Salsa20 nonce.

The most optimized version for the underlying CPU that passes internal tests is selected at runtime.

All assembler is PIC safe.

If you encrypt anything without using a MAC (HMAC, Poly1305, etc), you will be found, and made fun of.

INITIALIZING

The library can be initialized, i.e. the most optimized implementation that passes internal tests will be automatically selected, in two ways, neither of which is thread safe:

  1. int chacha_startup(void); explicitly initializes the library, and returns a non-zero value if no suitable implementation is found that passes internal tests (see the sketch after this list)

  2. Do nothing and use the library like normal. It will auto-initialize itself when needed, and hard exit if no suitable implementation is found.
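
For illustration, a minimal sketch of option 1, explicit initialization; it assumes only the chacha_startup() declaration described above and a chacha.h header that provides it:

#include <stdio.h>
#include <stdlib.h>
#include "chacha.h"

int main(void) {
    /* pick the fastest implementation that passes the internal tests */
    if (chacha_startup() != 0) {
        fprintf(stderr, "no suitable ChaCha implementation found\n");
        return EXIT_FAILURE;
    }

    /* the library is now ready; chacha()/xchacha() can be called as usual */
    return EXIT_SUCCESS;
}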

CALLING

Common assumptions:

  • chacha_key, chacha_iv, and chacha_iv24 variables can be accessed through their b member, which is an array of unsigned bytes (see the sketch after this list).

  • rounds is an even number 2 or greater.

  • If in is NULL, the output will be stored to out (useful for things like random number generation or generating intermediate keys).
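
A small sketch of filling a key and nonce through the b member; key_bytes and nonce_bytes are hypothetical stand-ins for material from your own random source or KDF:

#include <stdint.h>
#include <string.h>
#include "chacha.h"

/* copy externally supplied key material into the library's types */
static void load_key_and_iv(chacha_key *key, chacha_iv *iv,
                            const uint8_t key_bytes[32],
                            const uint8_t nonce_bytes[8]) {
    memcpy(key->b, key_bytes, 32);   /* 256-bit key */
    memcpy(iv->b, nonce_bytes, 8);   /* 64-bit nonce */
}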

ONE SHOT

in and out are assumed to be word aligned. The incremental interface has no alignment requirements, but it will obviously slow down if non-word-aligned pointers are passed.

void chacha(const chacha_key *key, const chacha_iv *iv, const uint8_t *in, uint8_t *out, size_t inlen, size_t rounds);

void xchacha(const chacha_key *key, const chacha_iv24 *iv, const uint8_t *in, uint8_t *out, size_t inlen, size_t rounds);

Encrypts inlen bytes from in to out, using key, iv, and rounds.

INCREMENTAL

Incremental in and out buffers are not required to be word aligned. Unaligned buffers will require copying to aligned buffers internally, however, which will incur a speed penalty.

void chacha_init(chacha_state *S, const chacha_key *key, const chacha_iv *iv, size_t rounds);

void xchacha_init(chacha_state *S, const chacha_key *key, const chacha_iv24 *iv, size_t rounds);

Initializes the chacha_state with key, iv, and rounds, and sets the internal block counter to 0.

size_t chacha_update(chacha_state *S, const uint8_t *in, uint8_t *out, size_t inlen);

size_t xchacha_update(chacha_state *S, const uint8_t *in, uint8_t *out, size_t inlen);

Generates/xors up to inlen + 63 bytes depending on how many bytes are in the internal buffer, and returns the number of encrypted bytes written to out.

size_t chacha_final(chacha_state *S, uint8_t *out);

size_t xchacha_final(chacha_state *S, uint8_t *out);

Generates/crypts any leftover data in the state to out and returns the number of bytes written.
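
A chunked-encryption sketch based on the descriptions above; encrypt_stream, CHUNK, src, and dst are illustrative names, the state is assumed to have been set up with chacha_init, and the per-call output buffer is sized for the inlen + 63 worst case:

#include <stdio.h>
#include <stdint.h>
#include "chacha.h"

#define CHUNK 256

static void encrypt_stream(chacha_state *S, FILE *src, FILE *dst) {
    uint8_t in[CHUNK], out[CHUNK + 63];
    size_t inlen, outlen;

    while ((inlen = fread(in, 1, CHUNK, src)) > 0) {
        outlen = chacha_update(S, in, out, inlen); /* may write 0..inlen+63 bytes */
        fwrite(out, 1, outlen, dst);
    }
    outlen = chacha_final(S, out);                 /* flush any buffered remainder */
    fwrite(out, 1, outlen, dst);
}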

HChaCha

void hchacha(const uint8_t key[32], const uint8_t iv[16], uint8_t out[32], size_t rounds);

Computes HChaCha into out, using key, iv, and rounds.
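
As a sketch of how HChaCha builds XChaCha (the library's xchacha() already does this internally; member names follow the assumptions above, and xchacha_by_hand is an illustrative name):

#include <stdint.h>
#include <string.h>
#include "chacha.h"

/* HChaCha the key with the first 128 bits of the 192-bit nonce to get a
   subkey, then run plain ChaCha with the subkey and the remaining 64 bits */
static void xchacha_by_hand(const chacha_key *key, const chacha_iv24 *iv24,
                            const uint8_t *in, uint8_t *out, size_t inlen,
                            size_t rounds) {
    chacha_key subkey;
    chacha_iv iv;

    hchacha(key->b, iv24->b, subkey.b, rounds);
    memcpy(iv.b, iv24->b + 16, 8);
    chacha(&subkey, &iv, in, out, inlen, rounds);
}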

Examples

ENCRYPTING WITH ONE CALL

const size_t rounds = 20;
chacha_key key = {{..}};
chacha_iv iv = {{..}};
uint8_t in[100] = {..}, out[100];

chacha(&key, &iv, in, out, 100, rounds);

ENCRYPTING INCREMENTALLY

Encrypting incrementally, i.e. with multiple calls to collect/write data. Note that passing in data to be encrypted will not always result in data being written out. The implementation collects data until there is at least 1 block (64 bytes) of data available.

const size_t rounds = 20;
chacha_state S;
chacha_key key = {{..}};
chacha_iv iv = {{..}};
uint8_t in[100] = {..}, out[100], *out_pointer = out;
size_t i, bytes_written;

chacha_init(&S, &key, &iv, rounds);

/* add one byte at a time, extremely inefficient */
for (i = 0; i < 100; i++) {
    bytes_written = chacha_update(&S, in + i, out_pointer, 1);
    out_pointer += bytes_written;
}
bytes_written = chacha_final(&S, out_pointer);

VERSIONS

The x86-64, SSE2-32, and SSSE3-32 versions are lightly modified from DJB's public domain implementations.

  • Reference
  • x86 (32 bit)
  • x86-64 (will almost always be slower than SSE2, but on some older AMDs it may be faster)
  • ARM

BUILDING

See asm-opt#configuring for full configure options.

If you would like to use Yasm with a gcc-compatible compiler, pass --yasm to configure.

The Visual Studio projects are generated assuming Yasm is available. You will need to have Yasm.exe somewhere in your path to build them.

STATIC LIBRARY

./configure
make lib

then make install-lib, or copy bin/chacha.lib and app/include/chacha.h to your desired location.

SHARED LIBRARY

./configure --pic
make shared
make install-shared

UTILITIES / TESTING

./configure
make util
bin/chacha-util [bench|fuzz]

BENCHMARK / TESTING

Benchmarking will implicitly test every available version. If any fail, it will exit with an error indicating which versions did not pass. Features tested include:

  • Partial block generation
  • Single block generation
  • Multi block generation
  • Counter handling when the 32-bit low half overflows to the upper half
  • Streaming and XOR modes
  • Incremental encryption
  • Input/Output alignment

FUZZING

Fuzzing tests every available implementation for the current CPU against the reference implementation. Features tested are:

  • HChaCha output
  • One-shot ChaCha
  • Incremental ChaCha with potentially unaligned output

BENCHMARKS

As I have not updated any benchmarks yet, raw cycle counts should have ~10-20 cycles added from the overhead of targets not being hardcoded.

ChaCha

Impl.       1 byte (cycles)       576 bytes (cpb)       8192 bytes (cpb)
Rounds      8     12    20        8     12    20        8     12    20
SSSE3-64    237   300   437       1.71  2.23  3.30      1.46  1.90  2.82
SSE2-64     262   337   500       1.98  2.65  3.97      1.68  2.29  3.42
SSSE3-32    287   350   487       2.04  2.69  3.99      1.72  2.37  3.59
SSE2-32     312   400   562       2.43  3.26  4.95      2.12  2.90  4.52

HChaCha

Impl.       8 rounds  12 rounds  20 rounds   (cycles)
SSSE3-64    162       237        362
SSSE3-32    175       250        375
SSE2-64     200       275        450
SSE2-32     200       275        450

ChaCha

Impl.       1 byte (cycles)       576 bytes (cpb)       8192 bytes (cpb)
Rounds      8     12    20        8     12    20        8     12    20
AVX-64      176   240   364       1.22  1.68  2.64      1.04  1.46  2.29
SSSE3-64    180   248   384       1.35  1.88  2.94      1.18  1.65  2.59
AVX-32      184   248   380       1.50  2.03  3.10      1.24  1.72  2.68
SSSE3-32    228   292   428       1.84  2.47  3.74      1.65  2.23  3.41

HChaCha

Impl.       8 rounds  12 rounds  20 rounds   (cycles)
AVX-64      116       180        308
AVX-32      128       192        320
SSSE3-64    128       192        328
SSSE3-32    136       204        336

Timings are with Turbo Boost and Hyperthreading enabled, so their accuracy is not concrete. For reference, OpenSSL and Crypto++ give ~0.8 cpb for AES-128-CTR, ~1.1 cpb for AES-256-CTR, and ~7.4 cpb for SHA-512.

ChaCha

Impl.       1 byte (cycles)       576 bytes (cpb)       8192 bytes (cpb)
Rounds      8     12    20        8     12    20        8     12    20
AVX2-64     146   194   313       0.68  0.97  1.48      0.52  0.71  1.08
AVX2-32     170   218   337       0.83  1.11  1.66      0.62  0.83  1.24
AVX-64      146   194   316       1.06  1.50  2.33      0.94  1.32  2.05
AVX-32      158   206   328       1.32  1.82  2.81      1.12  1.57  2.47

HChaCha

(these are all literally the same version, timing differences are noise)

Impl.       8 rounds  12 rounds  20 rounds   (cycles)
AVX2-64     81        155        251
AVX2-32     87        155        254
AVX-64      87        155        274
AVX-32      87        152        251

AMD FX-8120

Timings are with Turbo on, so accuracy is not concrete. I'm not sure how to adjust for it either; depending on clock speed (3.1 GHz vs 4.0 GHz), OpenSSL gives between 0.73-0.94 cpb for AES-128-CTR, 1.03-1.33 cpb for AES-256-CTR, and 10.96-14.1 cpb for SHA-512.

ChaCha

Impl.       1 byte (cycles)       576 bytes (cpb)       8192 bytes (cpb)
Rounds      8     12    20        8     12    20        8     12    20
XOP-64      194   269   418       1.09  1.47  2.25      0.93  1.22  1.80
AVX-64      245   344   544       1.41  1.97  3.14      1.20  1.63  2.51
XOP-32      247   322   471       1.44  1.96  3.01      1.26  1.70  2.59
AVX-32      276   375   573       1.88  2.53  3.78      1.62  2.16  3.23

HChaCha

Impl.       8 rounds  12 rounds  20 rounds   (cycles)
XOP-64      84        160        309
XOP-32      91        165        318
AVX-64      144       243        441
AVX-32      144       237        441

ZedBoard (Cortex-A9)

I don't have access to the cycle counter yet, so cycles are computed by multiplying microseconds by the clock speed (666 MHz) and dividing by 1 million. For comparison, on long messages, OpenSSL 1.0.0e gives 52.3 cpb for aes-128-cbc (woof), and djb's armneon6 Salsa20/20 implementation gives 8.2 cpb.

ChaCha

Impl.       1 byte (cycles)       576 bytes (cpb)       8192 bytes (cpb)
Rounds      8     12    20        8     12    20        8     12    20
NEON-32     460   573   814       3.53  4.73  7.13      3.06  4.26  6.47
ARMv6-32    437   565   793       5.33  7.07  10.87     5.07  6.93  10.73

HChaCha

NEON shares the same implementation as ARMv6, as NEON latencies are too high for a single block.

Impl.       8 rounds  12 rounds  20 rounds   (cycles)
NEON-32     294       446        658
ARMv6-32    294       446        658

LICENSE

Public Domain, or MIT
