yEnc SSE decode ideas #4

Closed
Safihre opened this issue Sep 16, 2017 · 54 comments

@Safihre

Safihre commented Sep 16, 2017

I was amazed to find this repo. I have been thinking of a way to do yEnc decoding (as a Python C extension) using SSE instructions, but my knowledge of C is just too rudimentary for now.

Do you think SSE can help compared to regular char-by-char decoding of the yEnc body?
How would you go about the decoding-escaping problem? I can imagine finding the escape chars, but how do you remove them later on when building the output string?
I tried to grasp your encoding code, but I think I probably miss the main idea among the included edge cases and optimizations.

Thanks!

EDIT: I think I'm understanding more and more of the code and how you handle the encoding-escaping here: https://github.com/animetosho/node-yencode/blob/master/yencode.cc#L718-L752
I don't completely understand the shuffle operations just yet and how they handle the extra chars; what are shufMixLUT and shufLUT?

@animetosho
Owner

animetosho commented Sep 17, 2017

SIMD isn't really designed for text processing tasks like yEnc, so the implementation isn't so straightforward, unfortunately. Because of this, a scalar (byte by byte) implementation usually isn't so bad, but doing a SIMD implementation is intellectually interesting and works for the epeen factor (as for Nyuu, I intend to integrate PAR2 generation, so reducing CPU usage may help a bit on fast connections).

Having said that, a SIMD implementation of yEnc encoding can usually beat the naive implementation by several factors, due to corner cases that an encoder has to deal with.

I've been thinking about writing up how the algorithm works. Unfortunately the code isn't the easiest to read (I'm a little lazy there, admittedly), but here's a rough diagram showing an overview of the main loop:

[diagram: overview of the encoder's main loop]

Note that the byte shuffle instruction requires a CPU with SSSE3 support (which most people should have these days). If the CPU doesn't have SSSE3 support, the algorithm just switches to doing the escaping using standard non-SIMD methods if special characters are found (the 16-bit mask != 0). Despite this, it's still a bit faster than a pure scalar implementation because most bytes aren't escaped in yEnc, and hence, end up hitting the fast route of: load 16 bytes, +42 to all bytes, check special characters, store to memory.
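To make that fast route concrete, here's a minimal sketch of the described loop body (SSE2 intrinsics; the function and variable names are mine, not the library's):

#include <emmintrin.h>

// Sketch of the fast route: load 16 bytes, +42, check for special characters.
// Returns the 16-bit mask; zero means the block was stored escape-free.
static int encode_block_fast(const unsigned char* src, unsigned char* dest) {
	__m128i data = _mm_loadu_si128((const __m128i*)src);
	data = _mm_add_epi8(data, _mm_set1_epi8(42)); // yEnc: add 42 to every byte
	// bytes needing escaping are those that *became* NUL, CR, LF or '='
	__m128i cmp = _mm_or_si128(
		_mm_or_si128(_mm_cmpeq_epi8(data, _mm_setzero_si128()),
		             _mm_cmpeq_epi8(data, _mm_set1_epi8('='))),
		_mm_or_si128(_mm_cmpeq_epi8(data, _mm_set1_epi8('\r')),
		             _mm_cmpeq_epi8(data, _mm_set1_epi8('\n'))));
	int mask = _mm_movemask_epi8(cmp); // one bit per byte
	if(mask == 0)
		_mm_storeu_si128((__m128i*)dest, data); // no escaping needed
	return mask; // non-zero: caller does the shuffle-based expansion
}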

Line endings are a bit complicated, since they have different rules to the rest of the line (more characters to escape). When the algorithm crosses a line boundary, it needs to 'revert' (back-track) until the last character of the line, process the line boundary, then go back to the main loop. A large portion of the code is there just to deal with this case.

Generating vectors for the shuffle and add operation isn't exactly fast, so we pre-generate these instead and just do a lookup when encoding data. shufLUT is a 256 x 16-byte table for the shuffle vectors, whilst shufMixLUT is the corresponding lookup table for add vectors.
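As an illustration of the idea (not the library's exact table layout), generating one shuffle entry for an 8-bit mask half could look like this; a shuffle index of 0xFF makes pshufb emit a zero, and the corresponding 'mix' entry would then add '=' at that slot and 64 to the byte after it:

// Build the shuffle vector for one 8-bit mask value (one LUT entry);
// 8 input bytes expand to 8+popcount(mask) output bytes.
static void build_shuf_entry(int mask, unsigned char shuf[16]) {
	int out = 0;
	for(int i = 0; i < 8; i++) {
		if(mask & (1 << i))
			shuf[out++] = 0xFF; // empty slot: pshufb writes 0, the add-vector turns it into '='
		shuf[out++] = (unsigned char)i; // copy source byte i
	}
	while(out < 16)
		shuf[out++] = 0xFF; // pad the unused tail
}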

Hope that all makes sense.


As for decoding, I've actually thought about it, and have tried a few things. A naive implementation is surprisingly fast, since decoding is somewhat simpler than encoding, though you could still probably beat it by a few factors with SIMD.

Here's my scalar implementation:

static size_t do_decode_scalar(const unsigned char* src, unsigned char* dest, size_t len) {
	unsigned char *p = dest; // destination pointer
	size_t i; // input position
	unsigned char c; // input character
	
	if(len == 0) return 0; // guard: 'len - 1' below would wrap around for size_t
	for(i = 0; i < len - 1; i++) {
		c = src[i];
		switch(c) {
			case '\n': case '\r': continue;
			case '=':
				i++; // skip the escape char and unescape the next byte
				c = src[i] - 64;
		}
		*p++ = c - 42;
	}
	// do final char; we process this separately to prevent an overflow if the final char is '='
	if(i < len) {
		c = src[i];
		if(c != '\n' && c != '\r' && c != '=') {
			*p++ = c - 42;
		}
	}
	
	return p - dest;
}

I see that in your implementation you handle two dots after a newline, which I don't. yEnc doesn't actually allow that to happen (the first dot must be escaped), but I suppose a valid decoder should still deal with it.

A general workflow for a SIMD version may be:

  1. load 16 bytes
  2. check for special characters, = \r \n similar to the encoding algorithm
  3. subtract 42 from all bytes
  4. if there's no special characters, simply store the vector to memory
  5. otherwise take all matches to =, shift one byte across and put the number 64 in each place. All other elements of this vector should be 0
  6. perform a vector subtract (data - vector_partially_filled_with_64)
  7. use shuffles to remove occurrences of = \r \n (similar idea to the encoding algorithm, just instead of expanding data, it compacts it)
  8. store to memory
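A rough sketch of steps 1-6 (SSE2; the shuffle-compaction of steps 7-8 is left out, and the names are mine, not the library's):

#include <emmintrin.h>

static void decode_block_sketch(const unsigned char* src, unsigned char* dest) {
	__m128i data = _mm_loadu_si128((const __m128i*)src);       // 1. load 16 bytes
	__m128i eq = _mm_cmpeq_epi8(data, _mm_set1_epi8('='));     // 2. find special chars
	__m128i cr = _mm_cmpeq_epi8(data, _mm_set1_epi8('\r'));
	__m128i lf = _mm_cmpeq_epi8(data, _mm_set1_epi8('\n'));
	data = _mm_sub_epi8(data, _mm_set1_epi8(42));              // 3. subtract 42
	int special = _mm_movemask_epi8(_mm_or_si128(eq, _mm_or_si128(cr, lf)));
	if(special == 0) {
		_mm_storeu_si128((__m128i*)dest, data);                // 4. plain store
		return;
	}
	// 5. shift the '=' matches one byte across and turn them into 64s
	__m128i sub64 = _mm_and_si128(_mm_slli_si128(eq, 1), _mm_set1_epi8(64));
	data = _mm_sub_epi8(data, sub64);                          // 6. extra -64 for escaped bytes
	// 7.-8. (not shown) shuffle out the = \r \n bytes, then store
}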

I actually have a somewhat working implementation of the above.
Against a valid yEnc encoder this works; however, if you want to be fully spec compliant, you'll need to handle invalid yEnc representations, such as a sequence like ====. A spec-compliant decoder must treat that as two characters, but the above algorithm won't, even though no good yEnc encoder should ever create such a sequence. There's also the \n.. sequence mentioned above.

I haven't yet thought of a good way to deal with such invalid sequences, but it may be doable. A compromise may be to detect such cases instead (rather than actually handle them) and, if encountered, fall back to scalar processing. As I expect most yEnc implementations to be reasonably valid, it's unlikely the slower scalar fallback will ever be invoked, but it remains there to be spec compliant.

Hope that's useful. I may eventually add decoding support to this library; any ideas are welcome.
Otherwise, if you wish to implement it yourself, I hope this gives you some ideas on how.

@Safihre
Author

Safihre commented Sep 17, 2017

That's a great overview, thank you! 👍

What would you do with the case where the = is the last character of the 16 byte block? So the character-to-unescape is actually in the next 16 byte block.

The double-dot thing is unrelated to yEnc but inherent to NNTP; the servers add this extra dot when transmitting the message. While I initially ignored it as an edge case, it actually happens quite often (almost once per file).

I was experimenting a bit with implementing this yesterday, but I am sort of stuck in the "finally understanding what pointers do and how to use them" stage, so this stuff is a bit above my head.
You say you already have a somewhat working implementation of it? That would make me and the SABnzbd users very happy 😃

@animetosho
Owner

animetosho commented Sep 17, 2017

What would you do with the case where the = is the last character of the 16 byte block? So the character-to-unescape is actually in the next 16 byte block.

That's not too hard to deal with. Once you have the bit mask indicating the location of = characters:

escape_first_char = mask >> 15;

// on next loop
data = read_next_16_bytes();
// check for special characters
if(escape_first_char)
  data = vector_subtract(data, make_vector(42, 42, 42... 42+64)); // note, first byte is at the end
else
  data = vector_subtract(data, make_vector(42, 42, 42... 42));

// ...or, you could optimize the above four lines with a small LUT based on escape_first_char
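Rendered with SSE2 intrinsics, that pseudocode might look roughly like this (a sketch that assumes the mask and src from above; note that _mm_set_epi8 lists lane 0 last, which is why the pseudocode puts the first byte at the end):

// previous block: remember whether its last byte was an '='
int escape_first_char = (mask >> 15) & 1;

// next block: lane 0 (the first byte) gets the extra 64 if needed
__m128i data = _mm_loadu_si128((const __m128i*)src);
__m128i sub = escape_first_char
	? _mm_set_epi8(42,42,42,42,42,42,42,42,42,42,42,42,42,42,42, 42+64)
	: _mm_set1_epi8(42);
data = _mm_sub_epi8(data, sub);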

While I initially ignored it as an edge case, it actually happens quite often (once per file almost).

That's interesting, because a correctly yEnc encoded message should never have a dot as the first character of a line.
Perhaps these messages were encoded incorrectly?

but I am sort of stuck in the "finally understanding what pointers do and how to use them"-stage

I'm sure you'll get it. It's likely a bit different to what you're used to, but you seem to be a very experienced programmer, so it won't take you long to understand.
Programming in SIMD may be more challenging, as it requires you to think a bit differently to how you'd usually write code, and stuff like yEnc can require a little creativity. But it boils down to yet another programming problem; use it enough and you should be fine too.

@Safihre
Author

Safihre commented Sep 17, 2017

Perhaps these messages were encoded incorrectly?

No, it's not yEnc et al. It's the NNTP server that has to add a dot, per the NNTP spec, to every line that starts with a dot during transmission; a leftover from ancient times, I think.

@animetosho
Owner

It's the NNTP server that has to add a dot to every line that starts with a dot

But there should never be a line that starts with a dot, no?

@Safihre
Author

Safihre commented Sep 17, 2017

Why not? It's not a special character in the already encoded data, just a char of which 42 can be subtracted.

@animetosho
Owner

animetosho commented Sep 17, 2017

Looks like I made a mistake - I had thought that yEnc requires dots to be escaped if they're the first character in a line, but re-reading the spec, it's actually optional. Sorry about that.
So it looks like that will need to be handled by most decoders.

@animetosho
Owner

So I've been thinking about this, and I think that dealing with invalid sequences can be solved. But the double-dot issue looks challenging.

Looking at RFC3977:

When a multi-line block is interpreted, the "dot-stuffing" MUST
be undone; i.e., the recipient MUST ensure that, in any line
beginning with the termination octet followed by octets other
than a CRLF pair, that initial termination octet is disregarded

This seems to define how invalid sequences such as \n.a are handled - the dot at the start of the line needs to be discarded, regardless of what comes after it. I can't tell if the document covers how sequences like a\n.. should be handled, other than saying that it's invalid. Presumably, from the language, a line must be terminated by \r\n, hence no special treatment should be done for a\n.., however \r\n.. should have the first dot removed.
In other words, dots should be removed if preceded by \r\n (and not a valid = before that), or the dot is the first character in the article/yEnc body (since it's implicitly at the beginning of a line).
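As a plain illustration of that rule (my own sketch, not the library's code), a scalar dot-unstuffing pass would look like:

#include <stddef.h>

// Drop any '.' that starts a line, i.e. directly follows a CRLF pair or
// begins the body (NNTP dot-unstuffing).
static size_t unstuff_dots(const unsigned char* src, unsigned char* dest, size_t len) {
	size_t out = 0;
	for(size_t i = 0; i < len; i++) {
		int at_line_start = (i == 0) ||
			(i >= 2 && src[i-1] == '\n' && src[i-2] == '\r');
		if(src[i] == '.' && at_line_start)
			continue; // the initial termination octet is disregarded
		dest[out++] = src[i];
	}
	return out;
}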

I had a quick look at NZBGet's code and it doesn't appear to handle lines starting with a dot. Presumably SABnzbd didn't do this for a long time either? All yEnc decoders I've seen, apart from SABYenc, don't handle this case (which isn't an issue if they go through some NNTP layer, but I have doubts that typically happens).
So is it really the case that most downloaders don't handle this, or am I missing something?

Interestingly, I haven't yet seen a yEnc encoder that doesn't escape dots which start a line. But there are obviously posts out there that have it.

I'll probably take the approach of stripping out all instances of \r\n., even though this is likely different to all existing implementations. It also seems difficult to do from a SIMD perspective.

This does also make incremental processing a little more difficult, as it may need to know the preceding two characters to be correct.

@Safihre
Author

Safihre commented Sep 18, 2017

NZBGet does it while receiving the data in the NNTP layer:
https://github.com/nzbget/nzbget/blob/develop/daemon/nntp/ArticleDownloader.cpp#L388-L392
NZBGet can have a worker thread per connection that decodes the stream as it comes in and does the check then. However, due to the Python GIL, the overhead of thread switching in Python is terrible and we cannot do this.
Maybe when we switch to Python 3 and can use async/greenlet/etc. approaches, but switching to Python 3 is still a long road for us.

I just checked some test articles, and it seems that usually less than 1% of all 16-byte blocks contain a dot.
So as a first 'solution' I was thinking of checking whether a dot is present in the 16-byte block, and if so, reverting to char-by-char processing for that whole block plus the blocks before/after it (in case the dot is at the start or end of the current block).
It's a bit convoluted, but it would still allow the vast majority to be done via SIMD.
I tried to implement something of the sort, but so far not much luck.

EDIT: All usenet downloaders must correct for this; 60% of the posts I tried have the \r\n.. sequence at least once.

animetosho added a commit that referenced this issue Sep 19, 2017
Not finished, mostly a PoC for now
Ref #4
@animetosho
Owner

I see, thanks for pointing that out!

I think a SIMD version should be fast enough to make threading less necessary. I mean, this yEnc library is single threaded and synchronous, as I feel that it's fast enough to not really warrant the need (I'll probably eventually add async support though).

I've committed a very rough implementation. From my quick testing, it seems to work and handles all edge cases I've tried so far. It still needs work though, including more testing, tidying up, optimizations etc. I've just pushed it up in case you're interested. Unfortunately the algorithm is a bit convoluted, but maybe it can be streamlined.

The algorithm supports incremental processing by supporting 4 possible previous states (the preceding character is a =, \r, \r\n or something else).

Quick speed test of the current code on a 3.8GHz Haswell CPU, using random input data:
Scalar version: ~400MB/s
SIMD version: ~1700MB/s

Once this is tidied up, maybe I can write up something to help others port the code across (though it's mostly copying the functions across, a few defines, and the lookup-table generation code).

@Safihre
Author

Safihre commented Sep 19, 2017

Coool! 💯
I will need to take a closer look and dissect it a bit so I understand it well!

@sanderjo

Wow, very impressive & cool.

I wanted to ask a noob question: is _mm_movemask_epi8() generic Intel SIMD / SSE, or is it Microsoft specific (because most Google hits are to microsoft.com)? ... But I can answer that question myself:

$ cat /usr/lib/gcc/x86_64-linux-gnu/6/include/emmintrin.h | grep _mm_or_si128
_mm_or_si128 (__m128i __A, __m128i __B)

So: also on Linux, so generic Intel SIMD/SSE. 👍

@animetosho
Owner

The Intel Intrinsics Guide (which obviously lacks AMD extensions) is a handy reference. The movemask instruction is one useful instruction SSE has over ARM NEON.

I just realized a mistake: I've been treating =\r\n as a non-newline sequence, but thinking about it, this is incorrect because the NNTP layer doesn't care about yEnc's escaping semantics. So a sequence like =\r\n. should have its dot removed (NNTP protocol), after which yEnc kicks in and unescapes the \r; it should be converted to \xA3, not \xA3 followed by a dot. I'll need to fix that up.

@hugbug

hugbug commented Sep 23, 2017

Super interesting topic 👍

I've tried the SSE decoder, but for me it produces incorrect output. Am I doing something wrong or is there a bug in the code?

I've made a test program to illustrate the issue:

curl https://gist.githubusercontent.com/hugbug/fd7d95d53e3a2ca4aafb0f811d929bfc/raw/457d22331669372bd6c92bab812730e2d18338be/decoder_test.cpp > decoder_test.cpp

g++ -std=c++11 decoder_test.cpp

./a.out

Testing scalar

Test 1: OK

Test 2: OK

Testing sse

Test 1: OK

Test 2: FAILURE
Source:
0f 1a b1 a4 0c d4 15 2a 61 47 c8 bf bf e4 d3 e9 e2 b2 1f e0 99 1d 79 9a 38 26 c0 8b 3d 40 42 cf c6 f9 85 34 8d 9c f2 55 ce 16 ec 4d 38 29 3d 4d 22 d8 bb cc ce 2a 91 c9 93 87 6f 0f fb 5b 2c d7 90 3c 22 4a ac a7 1a 57 1a bb 6b 64 23 e0 87 8f b2 3d 7d 94 30 c5 eb 2f cb e5 78 35 8e bc d0 0b 57 15 58 69 e3 9d fc f3 da 6b c1 07 3d 4d d2 6a 60 6f 43 a4 3d 4a 81 dc b7 ca 04 8a c1 f6 8d b8 
Expected:
e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 88 f5 b6 6f f3 4f 70 0e fc 96 61 d6 18 a5 9c cf 5b 0a 63 72 c8 2b a4 ec c2 23 0e ff e3 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 13 6a 06 9b c1 05 a1 bb 4e 0b 64 92 a6 e1 2d eb 2e 3f b9 73 d2 c9 b0 41 97 dd e3 a8 40 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e 
Result:
e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 b8 b8 b8 b8 b8 b8 b8 0e 0e 0e 0e 0e 0e 0e 9c 9c 9c 9c 9c 9c 9c 9c a4 a4 a4 a4 a4 a4 a4 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 88 88 88 88 88 88 a1 a1 a1 a1 a1 a1 a1 a1 2d 2d 2d 2d 2d 2d 2d 2d b0 b0 b0 b0 b0 b0 b0 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e 

NOTE: My system doesn't have __POPCNT__.

Another issue is that the code fails to compile if __SSSE3__ is undefined.

@animetosho
Owner

Thanks for the comment! Yeah, the SSE2 path was broken. I've pushed up the changes I've been working on, which should have that fixed.

As for your test failures, the SSE code requires some lookup tables to be generated before running the main function. In the old code, this was done in the node initialization function, in the new code, there's a decoder_init function which will do that. (alternatively, you could hard code the lookup tables; I just prefer dynamic generation because it's easier to maintain)

I've modified your program to use the new code. You'll need to copy over common.h, decoder.cc and decoder.h from the src directory as well. Compiles via g++ decoder_test.cpp decoder.cc -O2 -march=native

@sanderjo

sanderjo commented Sep 24, 2017

Cool, that works for me:

git clone https://github.com/animetosho/node-yencode.git
cd node-yencode/
git pull
cd src
curl https://gist.githubusercontent.com/animetosho/5012484c08204bd744be475e78131247/raw/af29e30a3f243f8ce46621322903b69aa6d2f7ad/decoder_test.cpp > decoder_test.cpp
g++ decoder_test.cpp decoder.cc -O2 -march=native -o decoder_test
./decoder_test 

with result:

$ ./decoder_test 
Testing scalar

Test 1: OK

Test 2: OK

Testing sse

Test 1: OK

Test 2: OK

@hugbug

hugbug commented Sep 24, 2017

I've tested the SSE decoder in NZBGet.

Conditions

  • hardware 1: MacBook 2010 with 2.4 GHz Core i5, SSD, __POPCNT__ isn't defined;
  • hardware 2: PVR with ARM 2 cores 1.7GHz, HDD;
  • downloading from NServ running on MacBook, cache in NServ enabled;
  • to minimise the influence of the disk sub-system, the writing of downloaded data was disabled in NZBGet (compiled with define SKIP_ARTICLE_WRITING).

Test case 1: decoding off

Article decoding was disabled via the compiler define SKIP_ARTICLE_DECODING. This test case represents the ideal goal: it shows how fast the program can download if yEnc decoding took no time at all.

Test case 2: classic decoder

This test case shows how fast the program can download using the current yEnc decoder.

Test case 3: SSE decoder

Now we use the SSE-decoder. With SSSE3 but without __POPCNT__.

Test case 4: improved classic decoder

During the experiments I realised that I can improve the current (classic) decoder. It checks for \n and \r although these characters can never occur in the input, because NZBGet processes downloaded data line by line. Therefore I've modified the decoder by removing those checks, which allowed the switch statement to be transformed into a single if.

Test case 5: CRC check off

This test is for reference: both decoding and CRC check disabled.

Test results

The numbers are download speed (MB/s) and download time (s). The download speed is calculated from the download time in seconds, which is an integer value; fractions of seconds cannot be measured here, unfortunately.

Test case | Intel 10GB file | ARM 2GB file
decoding off | 306MB/s (34s) | 69MB/s (30s)
classic decoder | 193MB/s (54s) | 59MB/s (35s)
SSE decoder | 281MB/s (37s) | -
improved classic | 260MB/s (40s) | 65MB/s (32s)
decoding off, crc check off | 335MB/s (31s) | 74MB/s (28s)

@hugbug

hugbug commented Sep 24, 2017

Update: I've checked the CPU capabilities using an Intel tool and it showed that the CPU supports POPCNT.
After recompiling with "-march=native" the define __POPCNT__ was active. Interestingly, the speed hasn't improved; after multiple test runs it actually seems the version without POPCNT performs even a little bit faster.

As a side note: now, knowing the CPU also supports PCLMULQDQ, I can try the fast CRC routine, which didn't compile before.

@Safihre
Author

Safihre commented Sep 25, 2017

@hugbug Did you use @animetosho's version or a modified one for SSE?
Since in NZBGet you don't have to deal with \r\n or the \r\n.. problem in the decoder, could that part of the code (and its checks) be left out?

@hugbug

hugbug commented Sep 25, 2017

The SSE decoder can be parametrized to work in raw mode (where it processes dots) or in clean mode. I used the latter. It still filters out \r and \n, but that happens at the same time as processing = and doesn't add much overhead. I tried removing that part but it didn't change the result.

I think the SSE decoder can't show its best due to line-by-line processing in NZBGet. With a typical line length of 128 bytes, that's the portion the decoder deals with. The lookup tables are probably never in the CPU cache.

Although I must say I also tested with a post encoded with 2048 bytes per line. All decoders speed up considerably but the difference remains similar.

I suppose the SSE decoder can perform much better on larger inputs. I guess in SAB you can feed it the whole article body (~500KB) and it should show its strengths.

My aim was to replace the decoder without much rework in NZBGet, and I wanted to share the results.

Also take into account that I tested on one CPU only. The results may be different on other (newer) CPUs.

animetosho added a commit that referenced this issue Sep 25, 2017
A lookup is likely faster everywhere else
Ref #4
@animetosho
Owner

animetosho commented Sep 25, 2017

I'm guessing your x86 CPU there is a Core i5 520M. I can't tell what the ARM CPU is from the details given.

The decoder definitely should work better on larger inputs, as the overhead from start/end alignment gets amortized more, so it isn't really designed for ~128 byte inputs. The fact that it can handle newlines as well as dot unstuffing means you don't have to do it elsewhere, which is another speed benefit.
Note that ignoring newlines isn't exactly the same behaviour if given badly formed data such as foo\nbar\rbaz (unless your newline splitting logic handles that), and the SSE algorithm is designed to give identical output to typical decoders by stripping all \r and \n characters.

I haven't done enough investigation into whether using the POPCNT instruction is worth it or not. As we only need to count bits in 8-bit integers, a lookup table is quite fast for that purpose. My intuition is that it should be about even, possibly slightly slower, on Intel CPUs (1 POPCNT/clock vs 2 loads/clock), likely faster on AMD Jaguar/Ryzen, probably slower on AMD Bulldozer family. Removing one instance of POPCNT may be beneficial (only running it across the 16-bit mask instead), and/or maybe expanding the pshufb_combine_table lookup to take in part of the mask instead of a count may have some positive effect.
Thanks to your info though, I've decided to disable that path for most CPUs. Also, thanks for posting the results!
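For reference, the 8-bit lookup-table popcount mentioned above can be as simple as this (a sketch, not the library's exact code):

#include <stdint.h>

static uint8_t popcnt_tbl[256];

static void init_popcnt(void) { // fill once at startup
	for(int i = 1; i < 256; i++)
		popcnt_tbl[i] = (uint8_t)(popcnt_tbl[i >> 1] + (i & 1));
}

static int popcnt16(uint16_t mask) { // two table loads per 16-bit mask
	return popcnt_tbl[mask & 0xFF] + popcnt_tbl[mask >> 8];
}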


I think I've got most of the optimizations in now, and can't think of many more ways to change the algorithm. So, some quick speed tests:

Tested by encoding 750KB of fixed random data into yEnc, then decoding that data and timing the decode. There'll obviously be some variation depending on how much escaping the random data generates. Measured speed is relative to the decoded output produced (not input to the decoder).
Compiled using GCC version 5/6.

Cleaned decoder:

Intel Ivy Bridge @3.8GHz (i5 3570S): 2007 - 2118 MB/s
Intel Haswell @3.8GHz (E3 1231v3): 2235 - 2398 MB/s
Intel Silvermont @2.6GHz (Atom C2750): 697 - 700 MB/s
AMD K10 @2.9GHz (Athlon II 245): 547 - 619 MB/s

Raw decoder:

Intel Ivy Bridge @3.8GHz (i5 3570S): 1943 - 1999 MB/s
Intel Haswell @3.8GHz (E3 1231v3): 2003 - 2114 MB/s
Intel Silvermont @2.6GHz (Atom C2750): 587 - 589 MB/s

@animetosho
Owner

I've ported the algorithm to use ARM NEON. I've been wondering whether it'd actually be faster or not, considering limitations of most ARM CPUs, but on my Cortex A53 running in armv7 mode, decoding does seem to be about 2x faster than my scalar code. Encoding actually gets more of a speed boost, most likely because there's less scalar <-> NEON interaction.

There's a number of issues to consider with ARM/NEON though:

  • the various CPU uArchs seem to differ a lot, and I'm struggling to find documented info on them. ARM doesn't seem to publish optimization manuals on all its designs, particularly for older cores
  • following on from above, I suspect this Cortex A53 has only 64-bit SIMD units. I'd expect a CPU with 128-bit SIMD units (Cortex A57 and up probably?) to perform much better
  • ARM doesn't have an equivalent instruction to PMOVMSKB, which the algorithm relies on, so it has to be emulated (using an AND and 3x pairwise adds; see the sketch after this list), which makes it expensive on ARM. Maybe there's some optimization potential here, though I can't think of much at the moment
  • I don't know whether the A53 is affected, but I believe some older ARM CPUs implement NEON on a separate unit, meaning that scalar <-> NEON transitions impose a sizable penalty, which this code does a fair bit. This can mean slower performance on those CPUs
  • ARM doesn't provide any nice way to do CPU detection, and it seems that GCC's -march=native doesn't pick up the CPU either (probably related). As such, I don't plan to offer runtime CPU detection on ARM, and the user may have to manually supply compiler flags to compile the correct version. Of course, if you're using a configure script, you could test for all these cases
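The PMOVMSKB emulation mentioned above can be sketched like this (my illustration, assuming the input is a comparison result, i.e. each lane is 0x00 or 0xFF):

#include <arm_neon.h>
#include <stdint.h>

static uint16_t neon_movemask(uint8x16_t v) {
	static const uint8_t bit_of_lane[16] =
		{1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
	uint8x16_t bits = vandq_u8(v, vld1q_u8(bit_of_lane)); // AND: one bit per lane
	uint16x8_t sum16 = vpaddlq_u8(bits);   // pairwise add 1: 16x8 -> 8x16
	uint32x4_t sum32 = vpaddlq_u16(sum16); // pairwise add 2: 8x16 -> 4x32
	uint64x2_t sum64 = vpaddlq_u32(sum32); // pairwise add 3: 4x32 -> 2x64
	return (uint16_t)(vgetq_lane_u64(sum64, 0) | (vgetq_lane_u64(sum64, 1) << 8));
}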

The code is otherwise basically identical to the SSE code, just uses NEON instructions instead.
I don't know whether NEON is worthwhile across all ARM CPUs. It should be good on more powerful ARM cores, but older/weaker cores may not benefit as much, or at all.

@hugbug

hugbug commented Oct 1, 2017

Thanks a lot!

In the meantime I've done many tests for performance optimisations in NZBGet, including the SSE decoder and the SSE CRC routine. I'm currently in the process of integrating them for good.

Producing binaries that work on all systems was the difficult part. I cannot compile the whole program with -march=native as it would not work on older systems. The solution I'm currently implementing is compiling the decoder unit with additional flags -msse3 -mssse3 and the crc_fold unit with -msse4.1 -mpclmul, whereas all other units are compiled with the default architecture (i686 or x86_64).

I was just about to ask you about ARM support, but you have that already. Just in time! 🥇

I absolutely need to do a runtime feature check on ARM too. I'm planning to achieve this by parsing /proc/cpuinfo on Linux. I don't support ARM on other platforms anyway.

My ARM device doesn't have crc but it has neon. I'll run tests of the neon decoder there.

@sanderjo

sanderjo commented Oct 1, 2017

@hugbug

I'm planning to achieve this by parsing /proc/cpuinfo on Linux.

Please note: ARM64 / Aarch64 always has NEON, but does not mention that in /proc/cpuinfo:

sander@nanopineo2:~$ cat /proc/cpuinfo
processor	: 0
Processor	: AArch64 Processor rev 4 (aarch64)
Hardware	: sun50iw1p1
BogoMIPS	: 48.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

I believe that when compiling, you should NOT mention "neon". I'll check.

... about compiling (#1)

I tried #4 (comment) on my ARM64, but that didn't work: see errors below.

The GCC on my ARM64 is older (5.4.0 20160609) than on my x86 Ubuntu (6.3.0 20170406) ... is that the cause?

sander@nanopineo2:~/git/node-yencode/src$ g++ decoder_test.cpp decoder.cc -O2 -march=native -o decoder_test
decoder_test.cpp:10:7: error: expected nested-name-specifier before ‘decode_func’
 using decode_func = size_t(*)(const unsigned char* src, unsigned char* dest, size_t len, char* state);
       ^
decoder_test.cpp:24:1: error: in C++98 ‘tests’ must be initialized by constructor, not by ‘{...}’
 };
 ^
decoder_test.cpp:24:1: error: could not convert ‘{{"7c 8b 9c 4b 44 31 2a 84 98 9d 3b 2b 37 2a 2a 2a 2a 2a 2a 2a dd b5 9e 4c bb 78 2a b4 29 69 30 2a 2a 8a ed 2c 8a 04 14 98 1e 8c 2e 73 3e 5a 4b 2a 4a 4a 2a 2a 2a 2a 2a 2a 2b 2a 2a 2a 79 9a 8f 98 6b 7e 80 57 8c 9f 93 96 8e 57 8c 99 a2 57 76 93 98 9f a2 57 8e 93 9d 95 5b 58 a0 8e 93 2a 1a 7d e6 a2 66 66 66 4a 79 9c 8b 8d 96 8f 4a 80 77 4a 80 93 9c 9e 9f 8b 96 6c 99 a2 4a 6e 93 9d 95 4a ", "52 61 72 21 1a 07 00 5a 6e 73 11 01 0d 00 00 00 00 00 00 00 b3 8b 74 22 91 4e 00 8a ff 3f 06 00 00 60 c3 02 60 da ea 6e f4 62 04 49 14 30 21 00 20 20 00 00 00 00 00 00 01 00 00 00 4f 70 65 6e 41 54 56 2d 62 75 69 6c 64 2d 62 6f 78 2d 4c 69 6e 75 78 2d 64 69 73 6b 31 2e 76 64 69 00 f0 53 bc 78 3c 3c 3c 20 4f 72 61 63 6c 65 20 56 4d 20 56 69 72 74 75 61 6c 42 6f 78 20 44 69 73 6b 20 "}, {"0f 1a b1 a4 0c d4 15 2a 61 47 c8 bf bf e4 d3 e9 e2 b2 1f e0 99 1d 79 9a 38 26 c0 8b 3d 40 42 cf c6 f9 85 34 8d 9c f2 55 ce 16 ec 4d 38 29 3d 4d 22 d8 bb cc ce 2a 91 c9 93 87 6f 0f fb 5b 2c d7 90 3c 22 4a ac a7 1a 57 1a bb 6b 64 23 e0 87 8f b2 3d 7d 94 30 c5 eb 2f cb e5 78 35 8e bc d0 0b 57 15 58 69 e3 9d fc f3 da 6b c1 07 3d 4d d2 6a 60 6f 43 a4 3d 4a 81 dc b7 ca 04 8a c1 f6 8d b8 ", "e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 88 f5 b6 6f f3 4f 70 0e fc 96 61 d6 18 a5 9c cf 5b 0a 63 72 c8 2b a4 ec c2 23 0e ff e3 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 13 6a 06 9b c1 05 a1 bb 4e 0b 64 92 a6 e1 2d eb 2e 3f b9 73 d2 c9 b0 41 97 dd e3 a8 40 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e "}}’ from ‘<brace-enclosed initializer list>’ to ‘std::vector<test_pair>’
decoder_test.cpp: In function ‘std::vector<unsigned char> unhex(const char*)’:
decoder_test.cpp:32:35: error: ‘strtol’ was not declared in this scope
   long val = strtol(p, &endptr, 16);
                                   ^
decoder_test.cpp: In function ‘void print_vector(const char*, std::vector<unsigned char>)’:
decoder_test.cpp:40:23: error: ‘printf’ was not declared in this scope
  printf("%s:\n", title);
                       ^
decoder_test.cpp:41:26: warning: range-based ‘for’ loops only available with -std=c++11 or -std=gnu++11
  for (unsigned char ch : vec)
                          ^
decoder_test.cpp: At global scope:
decoder_test.cpp:48:11: error: variable or field ‘test’ declared void
 void test(decode_func func)
           ^
decoder_test.cpp:48:11: error: ‘decode_func’ was not declared in this scope
sander@nanopineo2:~/git/node-yencode/src$

Am I doing something wrong?

UPDATE
More compiling (#2):

After upgrading to gcc 6.3.0 20170519, fewer error messages, but still two fatal ones:

sander@nanopineo2:~/git/node-yencode/src$ g++-6 decoder_test.cpp decoder.cc -O2  -o decoder_test
decoder_test.cpp:8:15: warning: ‘size_t do_decode_sse(const unsigned char*, unsigned char*, size_t, char*) [with bool isRaw = false; bool use_ssse3 = false]’ used but never defined
 static size_t do_decode_sse(const unsigned char* src, unsigned char* dest, size_t len, char* state);
               ^~~~~~~~~~~~~
decoder_test.cpp:6:15: warning: ‘size_t do_decode_scalar(const unsigned char*, unsigned char*, size_t, char*) [with bool isRaw = false]’ used but never defined
 static size_t do_decode_scalar(const unsigned char* src, unsigned char* dest, size_t len, char* state);
               ^~~~~~~~~~~~~~~~
/tmp/ccRoCy9t.o: In function `main':
decoder_test.cpp:(.text.startup+0x38): undefined reference to `unsigned long do_decode_sse<false, false>(unsigned char const*, unsigned char*, unsigned long, char*)'
decoder_test.cpp:(.text.startup+0x3c): undefined reference to `unsigned long do_decode_sse<false, false>(unsigned char const*, unsigned char*, unsigned long, char*)'
collect2: error: ld returned 1 exit status
sander@nanopineo2:~/git/node-yencode/src$ 

@hugbug

hugbug commented Oct 1, 2017

In the test app change do_decode_sse to do_decode_neon.

@Safihre
Author

Safihre commented Oct 3, 2017

The input for SABYenc is chunks of data from the socket, not lines. Whenever the socket reports that data can be read, we read it and put it in a Python list until we reach end-of-article. Then we hand this list of chunks to sabyenc. This way we can do all the work inside the C module (faster), where we can release Python's GIL, so we have actual multithreading. Usually the list has around 30 chunks for each 700K article.
On each chunk (except the last 1 or 2) we can then run the SIMD decoder, which also takes care of the \r\n.. thing. So it should be feasible.
Hoping to also see how much it improves things in NZBGet, or are these already the final results @hugbug? nzbget/nzbget#448 (comment)

@hugbug

hugbug commented Oct 3, 2017

I've added support for a memory cache to NServ, which slightly improved overall performance in tests (since NServ runs on the same computer and consumes CPU), but the proportions remained.

To estimate the possible performance improvement, I suggest disabling decoding in SABYenc and measuring the speed. You can't get more speed with the SSE decoder than with no decoder at all.

@animetosho
Owner

The input for SABYenc is chunks of data from the socket

Oh I see.
I see that it tries to search for the =yend sequence - if you simply parse the last few chunks for this marker (you probably only need about 1KB of data to search through), I presume you don't need the check in the main loop?

If you can simplify the main loop down, removing the end check from it, you could then replace your main loop with one that just loops through chunks and feeds it to the decoder.
Note that I don't search for null terminators in strings - the code relies on lengths (both faster and more reliable). If the Python API gives you a length, without needing to scan the string (a la strlen), this shouldn't be much of an issue.

I've wondered whether I should add support for detecting terminator sequences (\r\n.\r\n and \r\n=yend), as it'd make sense for cases when you directly feed socket data through, rather than having to buffer it up and scan for those elsewhere. It just makes the state machine of an incremental decoder a little trickier.

@hugbug

hugbug commented Oct 3, 2017

@animetosho
Is it safe to assume that the compiler will not emit any SIMD code on its own?

If I compile the decoder unit with -mssse3 (which is necessary to use SSSE3 intrinsics), can I be sure that the compiler will not emit any SSE3/SSSE3 code into routines which are meant for use when SSSE3 isn't available? It would be fatal if do_decode_scalar or decoder_init ended up containing SSSE3 opcodes because the compiler was smart enough to use certain optimisations.

The situation is even more problematic with ARM. The CRC unit for ARM must be compiled with -march=armv8-a+crc. If I put the CPU detection routine into the same unit, the compiler may use ARMv8 opcodes in that routine, even though the rest of the program is compiled for ARMv7 and the CRC init routine must work on ARMv7.

To solve that issue I'm putting CPU detection code into a separate unit compiled with default settings.

I wonder if I should put the SSSE3 decoder into a separate unit to make sure the SSE2 decoder is compiled with -msse2, without -mssse3.

And by the way, is the SSSE3 decoder faster in your tests? In my limited tests I see no difference between the SSE2 and SSSE3 decoders. Maybe I should keep only the SSE2 version; that would make things easier.

@animetosho
Owner

Technically, -mssse3 tells the compiler it can use any feature up to SSSE3 anywhere it likes. This means that the proper way to deal with CPU targeting is to split everything into separate files and dynamically dispatch to the correct function.

It's a pain to do this though. In practice, (S)SSE3 additions are quite purpose-specific, and intrinsics are just wrappers around single assembly instructions, so compilers generally follow them. As such, I can't see the compiler using SSSE3 instructions in the SSE2 version, even if you compile with -mssse3, though one could argue that a really clever compiler could eventually do so.
So yeah, if you want to be truly safe, you'll have to split it up. For me, I only provide Windows binaries (I can't get static nodejs builds working for other platforms). MSVC is interesting in that it only goes up to SSE2 (before going into AVX territory) but you can still write SSSE3 intrinsics. This means the compiler itself will never use more than SSE2, but of course it will use any SSSE3 code you specifically write. I don't know whether GCC/Clang has a similar feature - it would be nice if it did.

I'm not sure whether the same can be said with SSE2 over i686 as SSE2 is somewhat more general purpose.

I get significantly faster speeds for SSSE3 (like 1380MB/s SSE2 vs 2200MB/s SSSE3 on a modern Intel CPU). It only differs when removing characters though, so in your case, since you'll never get \r or \n characters, the only characters to remove are the = ones, which won't occur as often. Hence, you probably won't see as much of a speed difference, because the SSSE3 code won't be invoked that frequently.

To my understanding, 32-bit ARMv8 is largely the same as ARMv7, but the documentation lists some differences, so it can't really be relied on, unfortunately. Though I'm running in armv7 mode, and code compiled using armv8-a+crc seems to work... maybe it's because the CPU can understand the ARMv8 stuff.

I have to say that compiler tooling/support for dynamic dispatch is rather primitive, and a bit of a pain to deal with.
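For illustration, a minimal runtime-dispatch pattern could look like this (the function names are made up; __builtin_cpu_supports is a real GCC built-in, available since GCC 4.8):

#include <stddef.h>

// Each implementation lives in its own translation unit, compiled with the
// matching -m flags; only this dispatcher unit uses default flags.
size_t do_decode_scalar_unit(const unsigned char*, unsigned char*, size_t);
size_t do_decode_ssse3_unit(const unsigned char*, unsigned char*, size_t);

static size_t (*decode_ptr)(const unsigned char*, unsigned char*, size_t);

void decoder_dispatch_init(void) {
	decode_ptr = __builtin_cpu_supports("ssse3") ? do_decode_ssse3_unit
	                                             : do_decode_scalar_unit;
}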

@hugbug

hugbug commented Oct 6, 2017

Found the relevant GCC doc:

If you specify command-line switches such as -msse, the compiler could use the extended instruction sets even if the built-ins are not used explicitly in the program. For this reason, applications that perform run-time CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options.

So, yeah, I have to split SSE2 and SSSE3 into separate units. Not a big deal; I've already put the initialization part into a separate unit. I guess I have to eliminate the extra optimizations with tune- and AVX-ifdefs though.


I'm currently in the process of reworking the code which fetches data from the news server and feeds it into the decoder. My plan is to eliminate our own line detection and use the raw decoder instead (on large blocks, around 10KB). That saves data copying and allows decoding data directly from the buffer that receives from the socket. Everything is done in one pass instead of two (line detection, then decoding).

There is however one thing that stops me at the moment: the decoder must detect the end of the yEnc data or the end of the article. Currently I need to tell the decoder where to stop using len, but that means I have to scan the data first, which is an additional pass.

In yEnc, the end of data is marked with \r\n=yend. More generally, yEnc specifies =y as a reserved sequence; if that's easier to detect (I guess so), it would be sufficient. In addition, the decoder must detect the end of the NNTP article, which is marked with \r\n.\r\n; this is necessary for malformed posts that don't have proper yEnc end markers.

Once it detects the end of data (yend or article end), the decoder should stop processing and report that (for example via a new state value). It would also be helpful if it could report back the end position, since we need to process the yEnc trailing data (parse the CRC). It's not necessary for the decoder to deal with =ybegin or =ypart markers, as this part can and should be done by the caller before the decoding process starts.

if you simply parse the last few chunks for this marker (you probably only need about 1KB of data to search through), I presume you don't need the check in the main loop?

This doesn't work, unfortunately, because I can't detect the last chunk at the transmission level. Once the news server has sent the last chunk, my attempts to receive more data result in hanging, as no data is available (and the server doesn't close the connection, waiting for further commands). That's why I need to scan the received data for the end-of-article marker before attempting to receive more data.

Do you think you could extend the decoder with end of stream detection? That'd be great and highly appreciated.

If implemented, we would be able to do all the heavy number crunching using SIMD in one pass. Well, in two passes actually, as we need to calculate the CRC of the decoded data separately. Theoretically, decoding and CRC calculation could be done in one pass, but that would be too much effort I guess, especially as the Intel routine isn't designed for such use. So I'm not asking for this, just thinking out loud.

Thanks again. This discussion topic is a great pleasure to participate in.

@animetosho
Owner

Yeah, detecting terminator sequences is something I've been wondering about. I didn't realize that =y was considered reserved - reading the yEnc specification, it appears that \r\n=y is considered a 'keyword line', in which case it would make sense to stop there. I previously thought that \r\n=yend would need to be detected, which makes it a little trickier to maintain state in an incremental decoder (and potentially means dealing with sequences like \r\n=yenb), but \r\n=y will be easier.

So there'll need to be a second version of the function, with a different signature (since input may only be partially consumed), which stops at the end point. The non-raw decoder will only look for \r\n=y, whereas the raw one will also look for \r\n.\r\n (or maybe just \r\n.\r?).

This doesn't work unfortunately because I can't detect the last chunk on transmission level

Yes, this only works if the end sequence has been scanned elsewhere. SABYenc seems to take that approach.

Theoretically decoding and CRC calculation could be made in one pass but that would be too much of efforts I guess, especially the Intel routine isn't designed for such use.

I've been thinking about the possibility of such optimizations via function stitching. It's a little awkward because yEnc is variable length, and may require the algorithm to go back to memory, though I think it could still yield a net benefit. It's a little complicated to implement/maintain, and I don't know how much it'd yield over running it separately (perhaps with loop tiling). It's something I may look into eventually nonetheless.

This discussion topic is a great pleasure to participate in.

Same here! I'm glad to be a part of it!

@hugbug

hugbug commented Oct 7, 2017

Non-raw decoder will only look for \r\n=y whereas the raw will also look for \r\n.\r\n (or maybe just \r\n.\r?).

Indeed, NNTP mandates \r\n as the line terminator sequence. I think we can safely assume that neither of these characters appears separately in article content. Then looking for \n=y and \n.\r would be sufficient, if that makes decoding any easier.

perhaps with loop tiling

Regarding better CPU cache usage: if we choose a block size large enough for efficient SIMD processing but small enough to remain in cache between the decoder and CRC passes, we can process the data in two passes without unnecessarily complicating the decoder, right? I mean the caller executes the CRC function right after decoding, and you don't have to deal with integrating CRC calculation into the decoder. That's actually how it is currently processed in NZBGet (although in very small one-line chunks).

The only thing we need is an incremental CRC function. The ARM function is already incremental; crc_fold isn't yet. The Linux kernel has a CRC function which also uses PCLMULQDQ but supports an initial CRC parameter. Have you evaluated it yet? It's pure asm, which is scary, and probably not easily portable. From the amount of code it seems to be much smaller than crc_fold, probably not as sophisticated and not as fast.

@animetosho
Owner

My current plan is to bail out of the SIMD decoder when it sees \r\n=y or \r\n.\r. The two extra characters are slightly easier to handle: since we already search for \r\n, we can just shift the data across 2 bytes and match them against the next two bytes.
When it sees such a sequence, it'll break out into scalar code. The scalar code requires a \r\n.\r\n sequence to end though, so given a sequence like \r\n.\ra it won't end, but the rest of the message will be processed more slowly. This stays strictly spec compliant, and I think should be acceptable.
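In mask terms, the shift-and-match idea could look like this (my sketch; sequences straddling the 16-byte block boundary still need carried state, which is omitted here):

#include <emmintrin.h>
#include <stdint.h>

// Bit i is set when a terminator sequence starts at byte i of this block.
static uint32_t find_end_sequences(__m128i data) {
	uint32_t m_cr  = _mm_movemask_epi8(_mm_cmpeq_epi8(data, _mm_set1_epi8('\r')));
	uint32_t m_lf  = _mm_movemask_epi8(_mm_cmpeq_epi8(data, _mm_set1_epi8('\n')));
	uint32_t m_eq  = _mm_movemask_epi8(_mm_cmpeq_epi8(data, _mm_set1_epi8('=')));
	uint32_t m_dot = _mm_movemask_epi8(_mm_cmpeq_epi8(data, _mm_set1_epi8('.')));
	uint32_t m_y   = _mm_movemask_epi8(_mm_cmpeq_epi8(data, _mm_set1_epi8('y')));
	uint32_t yend   = m_cr & (m_lf >> 1) & (m_eq >> 2) & (m_y >> 3);   // \r\n=y
	uint32_t dotend = m_cr & (m_lf >> 1) & (m_dot >> 2) & (m_cr >> 3); // \r\n.\r
	return yend | dotend; // non-zero: break out into scalar handling
}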

Limiting processing size is useful for cache if multiple passes are necessary. Function stitching is still beneficial for a number of reasons though:

  • may result in fewer load/store operations
  • better usage of CPU's execution units
  • less loop overhead (as you're only processing one loop instead of two)
  • removes the need for loop tiling and selecting a tile size

I'm not intending for this decoder to be the fastest it can possibly be; I'm mostly looking for easy wins (and stitching CRC into the decoder is a little complex). A modern CPU can already run this at around 2GB/s per core, and PCLMUL CRC32 runs in excess of 10GB/s on one core of a Haswell CPU, which is a fair bit faster than what a 10Gbps connection would give you. I can't see exceeding 10Gbps being an issue for a while, and one can always multi-thread.

The CRC folding approach is incremental, if you look at the original code I took it from. The state isn't stored as a CRC32 though; rather, it's stored in 512 bits, which can be reduced to a 32-bit CRC.

I haven't tested the Linux code yet, thanks for linking to that. It looks like it largely uses the same idea. One could port this to C intrinsics if assembly is not desired.

@hugbug

hugbug commented Oct 9, 2017

I'm currently in the process of reworking the code which fetches data from the news server and feeds it into the decoder. My plan is to eliminate our own line detection and use the raw decoder instead (on large blocks, around 10KB). That saves data copying and allows decoding data directly from the buffer that receives from the socket. Everything is done in one pass instead of two (line detection, then decoding).

Done!

Since the current version of the raw decoder doesn't support end-of-stream detection yet, an extra scan of the incoming data before feeding the decoder is implemented to process the data properly. Therefore it's not one pass at the moment.

Nonetheless the overall efficiency is greatly improved: 268 MB/s -> 350 MB/s.

Detailed benchmark results

Decoder | CrcCheck Off | CrcCheck On
new-decoder (per-line) | 282 MB/s | 245 MB/s
sse-decoder (per-line) | 304 MB/s | 268 MB/s
sse-decoder (raw mode) | 404 MB/s | 350 MB/s
scalar-decoder (raw mode) | 329 MB/s | 292 MB/s
no-decoder (raw mode) | 449 MB/s | 399 MB/s
  • new-decoder (per-line) refers to improved classic decoder tested in previous tests;
  • sse-decoder (per-line) is the SIMD decoder running in clean mode (as opposed to raw mode). The line processing is performed by other program parts, just like in "new-decoder (per-line)";
  • sse-decoder (raw mode) is SIMD decoder running in raw mode with other program parts reworked to support it;
  • scalar-decoder (raw mode) is similar to the previous but without using SIMD CPU instructions;
  • no-decoder - decoding was disabled using compiler define SKIP_ARTICLE_DECODING.

Please note that these speeds represent overall download speed in NZBGet, not just decoding speed (the program of course has to do way more work in addition to decoding).

For test conditions see this topic.


I'm still using the non-SIMD CRC32 calculation routine (slice-by-4). Improvements on that front are the next item on my list.

The CRC folding approach is incremental, if you see the original code I took it from. The state isn't stored as a CRC32 though, rather it's stored in 512 bits which can be reduced to a 32-bit CRC.

That's great news. I'll adopt it.

@Safihre
Author

Safihre commented Oct 9, 2017

Cool, nice to see that my random thought bubble of opening this issue resulted in something useful. If it's possible for NZBGet, SABnzbd will also be able to take advantage of it 👍 All thanks to @animetosho!

@hugbug Did you possibly also run tests on ARM? I'd be very interested to see how much NEON adds!

I've shifted my focus to first converting SABnzbd to Python 3, so that it can also use VS2017 for compiling the C extensions (on 2.7 you're locked to VS2008) and use async stuff to start decoding in chunks, instead of the current convoluted approach of a list of chunks.
SABnzbd 2.3.1 maxes out a gigabit connection using my old i5, so I need to buy a slower device to even start testing the real-world impact of these things 😄

@hugbug

hugbug commented Oct 10, 2017

Reposting test results from NZBGet topic here as this is very much related to the discussion.

While testing on two ARM devices I discovered and fixed a performance bottleneck in NServ. The CPU usage of NServ has been greatly reduced, which in turn gives more CPU time to NZBGet and increases speed. All tests were rerun with the improved NServ (including the tests on Mac).

Test devices

  1. MacBook 2010 with Intel Core i5-520 (2 cores @ 2.4 GHz, 4 threads), macOS 64 bit;
  2. PVR with ARMv7 Cortex A-15 (2 cores @ 1.7GHz), Broadcom BCM 7251s, Linux 32 bit;
  3. NanoPi NEO2 with ARMv8 Cortex A-53 (4 cores @ 1.5GHz), Allwinner H5, Linux 64 bit. NZBGet runs in 32 bit.

Test results

All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. The overhead of CRC calculation shows how much improvement potential is still there - the CRC routine isn't optimised for SIMD yet.

Once again a reminder that the speeds below represent overall download speed in NZBGet, not just decoding speed.

Decoder | MacBook 2010 | PVR ARMv7 | NEO2 ARMv8
new-decoder per-line mode | 307 (354) | 89 (99) | 102 (118)
simd-decoder per-line mode | 336 (404) | 89 (100) | 101 (119)
simd-decoder raw mode | 467 (570) | 99 (109) | 121 (141)
scalar-decoder raw mode | 369 (421) | 93 (105) | 107 (124)
no-decoder raw mode | 544 (693) | 111 (128) | 145 (168)

Observations

  • significant improvement when going from scalar to SIMD on Intel: 369 -> 467 MB/s: +25%;
  • good speed increase on ARMv8: 107 -> 121 MB/s: +13%;
  • small difference on ARMv7: +6%;
  • raw mode increases speed (compared to line mode) even without SIMD decoder; and SIMD decoder can show full potential in raw mode (thanks to larger blocks);
  • CRC calculation eats a significant amount of speed and is worth improving. We can get a potential speed increase of up to 22% on Intel and 17% on ARMv8 (theoretical limit).

@animetosho
Owner

Thanks for those benchmarks - very interesting to know!

The ARM results are interesting. I don't know ARM uArchs anywhere near as well as I do x86, but I suspect lower performance on ARM cores which have a separate NEON unit (I believe older designs do this). Interestingly though, from these slides, it seems that the Cortex A15 fully integrates NEON into the regular pipeline, so I'm a bit surprised by the lack of speed gain there. Perhaps this could mean that the SIMD decoder is slower than the scalar decoder on something like a Cortex A7.

I can't find much information on the SIMD width of ARM chips. I believe the Cortex A57 and later use 128-bit units, so I'm guessing the A53 and A15 possibly use 64-bit units. That in itself will reduce the benefit of SIMD (half the throughput).

@hugbug

hugbug commented Oct 11, 2017

On the Wikipedia page Comparison of ARMv7-A cores, the Cortex A-12, A-15 and A-17 are listed as having 128-bit NEON.

What puzzles me in the benchmark results is that the performance difference between the slowest (new-decoder per-line mode, crc on) and the fastest (no-decoder raw mode, crc off) is only 29% (68 vs 88) on ARMv7, whereas on Intel it's 125% (307 vs 693). It's as if the CPU spends far more time doing other things and therefore doesn't respond as well to optimisations in the decoder. Maybe it's not the CPU itself but other system components, such as slow RAM; I don't know.

@Safihre
Author

Safihre commented Oct 12, 2017

@hugbug do you run NServ on the ARM devices themselves too? What if you run it on your Mac/Windows and then connect to the ARMv7 via Gigabit?
Just to see where the bottleneck is. In the end it's of course real-life performance, in a fully working setup, that counts.

@animetosho
Owner

On Wikipedia page Comparison of ARMv7-A cores Cortex A-12, A-15 and A-17 are listed as having 128 bit NEON.

That's an interesting and useful comparison table.

Somewhat interestingly though, I just found this article which seems to imply that A15 is not 128b:

Architecturally, the Cortex A57 is much like a tweaked Cortex A15 with 64-bit support. The CPU is still a 3-wide/3-issue machine with a 15+ stage pipeline. ARM has increased the width of NEON execution units in the Cortex A57 (128-bits wide now?)

Similarly, the same site mentions that the A7 and A53 are single-issue with 64-bit units. Of course, the site could be wrong, but if it's true, it'd make a little more sense.

Your PVR benchmarks do still seem a little odd - I'd expect a 1.7GHz 3-wide out-of-order core to easily beat a 1.5GHz 2-wide in-order CPU at almost everything, but it seems to be significantly weaker. ARM claims that the A53 should be about as performant as an A9, but the A15 should be more powerful than an A9.

The NEO2 benchmarks make more sense to me, though your point about there being much less gain with everything disabled is certainly on the mark still.

connect to the ARMv7 via Gigabit?

Gigabit maxes out at 125MB/s, which I suspect could be a bit of a bottleneck (perhaps not for the PVR). I'm not exactly knowledgeable about the test server, but if it runs on a different core, hopefully it doesn't have much of an effect. Unless there's a memory bottleneck or similar - it might be worth a try nonetheless (to get an idea of how network transfers affect the device).

@hugbug

hugbug commented Oct 12, 2017

I'm testing in NZBGet, which is multithreaded, and for the tests 10 connections were configured, meaning 10 threads all doing similar jobs (on different articles): downloading, decoding, computing CRC, then normally also writing to disk, though that last part was disabled during the tests.

Therefore 4 cores @ 1.5GHz beat 2 cores @ 1.7GHz.

What if you run it on your Mac/Windows and then connect to the ARMv7 via Gigabit?

Tried that too and got similar results. Now the process ksoftirqd takes roughly the same amount of CPU as NServ was taking (40% of the total 200%). My guess is that it has something to do with the ethernet chip.

Decoder | PVR ARMv7 via Ethernet
new-decoder per-line mode | 67 (74)
simd-decoder per-line mode | 68 (74)
simd-decoder raw mode | 73 (81)
scalar-decoder raw mode | 71 (78)
no-decoder raw mode | 82 (91)

When redoing the tests on the PVR I noticed that I'm now getting lower CPU usage in NServ and better speeds in the benchmark (97 MB/s in simd-decoder raw mode with crc on, vs 72 MB/s in the previous test). It seems in my previous test on the PVR I was using an older version of NServ which didn't include the latest optimisation.

I'm redoing tests on PVR and will post updated results.

@hugbug

hugbug commented Oct 12, 2017

Results for ARMv7 (NServ) updated directly in the benchmark post.

ARMv7 and ARMv8 can now be better compared because the same nzbget binaries and same NServ were used in all tests.

@hugbug

hugbug commented Oct 12, 2017

If you aren't tired of benchmarks, here are numbers for SIMD CRC, which I've just integrated (reposting from the NZBGet issue tracker).

SIMD CRC

Integrated SIMD CRC routines for Intel and ARMv8 into NZBGet.

  • scalar-crc - CRC routine used in previous test (slice by 4);
  • simd-crc - SIMD routine: on Intel crc_fold, a sophisticated routine which uses the PCLMUL CPU instruction. On ARMv8 it's a simple routine whose job is to execute the dedicated crc32 CPU instruction.

All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. All tests were performed 4 times; the worst result was discarded and the average of the remaining three results was taken.

For convenience the table also includes all previous measurements with scalar-crc routine.

Decoder | MacBook 2010 | PVR ARMv7 | NEO2 ARMv8
new-decoder per-line mode, scalar-crc | 307 (354) | 89 (99) | 102 (118)
simd-decoder per-line mode, scalar-crc | 336 (404) | 89 (100) | 101 (119)
simd-decoder raw mode, scalar-crc | 467 (570) | 99 (109) | 121 (141)
scalar-decoder raw mode, scalar-crc | 369 (421) | 93 (105) | 107 (124)
no-decoder raw mode, scalar-crc | 544 (693) | 111 (128) | 145 (168)
simd-decoder raw mode, simd-crc | 520 (563) | n/a | 136 (141)
scalar-decoder raw mode, simd-crc | 397 (420) | n/a | 118 (125)
no-decoder raw mode, simd-crc | 621 (684) | n/a | 165 (169)

Conclusion

  • Intel: with SIMD CRC routine the speed has improved from 467 MB/s -> 520 MB/s, significantly reducing the gap between CRC-ON (520 MB/s) and CRC-OFF (563 MB/s);
  • ARMv8: with dedicated crc-instruction we are getting CRC calculation almost for free: 136 MB/s CRC-ON vs. 141 MB/s CRC-OFF. Speed improvement: 121 MB/s -> 136 MB/s.

@animetosho
Owner

Thanks again for the benchmarks - they're quite interesting.

which is multithreaded

Oops, forgot about that, thanks for the correction!


I've gotten around to implementing the end-scanning version of the decoder. I've been a little busy with other stuff, so it took longer than expected.

Function signatures have changed to accommodate signaling how much input/output is consumed. I've also moved code around (which may be rather annoying), since the interfaces have changed anyway. The return value signals what end sequence was hit: if it returns 0, all input was consumed; otherwise *src is updated to point to the byte following \r\n=y or \r\n.\r\n.

Turns out that \r\n.=y also needs to be identified as a valid end sequence, because the NNTP layer removes the dot, after which the yEnc layer sees \r\n=y. As such, I've decided to handle \r\n.\r\n accurately in the SIMD portion as well.

Searching for the end sequence is sometimes noticeably slower than not doing it, but hopefully faster than a second scan of the data.
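Going by that description, a caller might use it roughly like this (a hypothetical sketch: do_decode_end and the state variable are paraphrased from the description above, not the actual API; treat this as pseudocode for the control flow only):

// Hypothetical names; the real signature in the library will differ.
const unsigned char* in = chunk;
unsigned char* out = buffer;
char state = 0;
int ended = do_decode_end(&in, &out, chunk_len, &state);
if(ended) {
	// 'in' now points just past \r\n=y or \r\n.\r\n: stop decoding and
	// parse the trailer (=yend line, CRC) from here
} else {
	// all input consumed; fetch the next chunk and call again
}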

@hugbug

hugbug commented Oct 15, 2017

Thanks so much! I'll integrate the new version and report back.


In the meantime I've done more tests, in particular on a Dell 2015 notebook running Linux. The numbers are crazy high (MB/s):

Improvement | MacBook macOS i5-520 | Dell Windows i7‑5600U | Dell Linux i7‑5600U | PVR Linux ARMv7 | NEO2 Linux ARMv8
improved decoder, scalar crc | 305 | 389 | 480 | 89 | 102
raw decoder, scalar crc | 369 | 414 | 636 | 93 | 107
simd decoder, scalar crc | 467 | 493 | 836 | 99 | 121
simd decoder, simd crc | 520 | 541 | 1011 | n/a | 136

For description of devices, test conditions and more results (not related to SIMD) please see original post.

@hugbug

hugbug commented Oct 19, 2017

Results for the one-pass simd decoder with end-of-stream detection (simd-end):

Improvement | MacBook macOS i5-520 | Dell Windows i7‑5600U | Dell Linux i7‑5600U | PVR Linux ARMv7 | NEO2 Linux ARMv8
simd decoder | 520 | 541 | 1011 | 99 | 136
simd-end decoder | 520 | 570 | 1140 | 106 | 157
  • scalar crc for ARMv7, simd crc for all other devices.

@Safihre
Author

Safihre commented Oct 28, 2017

Cool to see that within one month of creating this issue there's now a working implementation in NZBGet. I'd say this issue has served its purpose; if I have specific implementation questions for SABnzbd I'll open another topic!
Thanks all!

@Safihre Safihre closed this as completed Oct 28, 2017
@animetosho
Owner

It has been interesting - thanks for creating the topic!

Are you planning to migrate to Python 3 before using this decoder? I imagine that SABYenc could be changed to use it as-is, but I'd expect Python 3's API to be different - if that's the goal.
