Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sse3 lookup #3

Closed
aqrit opened this issue Dec 29, 2016 · 7 comments
Closed

sse3 lookup #3

aqrit opened this issue Dec 29, 2016 · 7 comments

Comments

@aqrit
Copy link

aqrit commented Dec 29, 2016

the obvious extension of the saturation method to pshufb luts...
it is a few instructions shorter than the current sse4 method and AFAIK should be faster, if unrolled.

;;; sse3 lookup ;;;
; xmm0  = in/out
; xmm12 = 0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,[...]  // mask_2F
; xmm13 = 0x01,0x50,0x54,0x46,0x25,0x25,0x05,0x05,[...]  // lut_roll_up
; xmm14 = 0xFF,0x7F,0x7F,0x76,0x66,0x66,0x66,0x66,[...]  // lut_roll_down
; xmm15 = 0x00,0x3F,0x3E,0x34,0x00,0x00,0x1A,0x1A,[junk] // lut_shift
;
;;; perfect hash for lut = ((src>>4)&0x2F)+((src==0x2F)?0xFF:0x00)
;;; 0000 = garbage
;;; 0001 = slash
;;; 0010 = plus
;;; 0011 = 0-9
;;; 0100 = A-Z
;;; 0101 = A-Z
;;; 0110 = a-z
;;; 0111 = a-z
;;; 1000 >= garbage

movdqa   xmm2, xmm12
movdqa   xmm3, xmm13
movdqa   xmm4, xmm14
movdqa   xmm5, xmm15
movdqa   xmm1, xmm0
psrld    xmm1, 4
pcmpeqb  xmm2, xmm0
pand     xmm1, xmm12
paddb    xmm1, xmm2
pshufb   xmm3, xmm1
pshufb   xmm4, xmm1
pshufb   xmm5, xmm1
paddb    xmm0, xmm3
psubsb   xmm0, xmm4
paddusb  xmm0, xmm5
pmovmskb  eax, xmm0
@WojciechMula
Copy link
Owner

Thanks, it is really clever. However, you're still stuck with a kind of range checking, although done with saturated add. I'll be happy to compare performance of your approach, though. Would you provide a C++ implementation?

BTW recently I came up with method with involve a single bit-and to find out invalid input char: https://github.com/WojciechMula/base64simd/blob/master/decode/lookup.sse.cpp (function lookup_pshufb_bitmask).

@aqrit
Copy link
Author

aqrit commented Jan 4, 2017

I will do a version with C++ intrinsics eventually.

The difference between the saturation method and the bitmask method will probably be insignificant when using only SSE3 instructions.

However with SSE4.1, lookup_pshufb_bitmask() needs to be reworked to place the _mm_movemask_epi8() behind a _mm_testz_si128(). If that can't be done then AFAIK it will be noticeably slower than the saturation method.


According to IACA, the unrolled sse3 saturation method will 'lookup' 64 bytes in 22.0 cycles on Nehalem, Westmere, and Sandy Bridge.

@WojciechMula
Copy link
Owner

Thank you, that's really interesting. I'm curious if your approach is faster on Skylake -- @lemire and I are working now on AVX2-only library (https://github.com/lemire/fastbase64). If you prepare an intrinsics version we will be happy to test it.

@lemire
Copy link
Contributor

lemire commented Jan 4, 2017

+1

@aqrit
Copy link
Author

aqrit commented Jan 8, 2017

bitmap method w/testz

const __m128i lut_lo = _mm_setr_epi8(
	0x15, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 
	0x11, 0x11, 0x13, 0x1A, 0x1B, 0x1B, 0x1B, 0x1A 
);
const __m128i lut_hi = _mm_setr_epi8(
	0x10, 0x10, 0x01, 0x02, 0x04, 0x08, 0x04, 0x08, 
	0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10
);
const __m128i lut_roll = _mm_setr_epi8(
	0,   16,  19,   4, -65, -65, -71, -71,
	0,   0,   0,   0,   0,   0,   0,   0
);
const __m128i mask_2F = _mm_set1_epi8( 0x2F );

// lookup
__m128i hi_nibbles, lo_nibbles, lo, hi, roll, eq_2F;
hi_nibbles = _mm_srli_epi32( values, 4 );
lo_nibbles = _mm_and_si128( values, mask_2F );
lo = _mm_shuffle_epi8( lut_lo, lo_nibbles );
eq_2F = _mm_cmpeq_epi8( values, mask_2F );
hi_nibbles = _mm_and_si128( hi_nibbles, mask_2F );
hi = _mm_shuffle_epi8( lut_hi, hi_nibbles );
roll = _mm_shuffle_epi8( lut_roll, _mm_add_epi8( eq_2F, hi_nibbles ) );
if( ! _mm_testz_si128( lo, hi ) ) goto invalid_char;
values = _mm_add_epi8( values, roll );

the saturation method might save a cycle or so when unrolled... but other than that it seems pointless.

@WojciechMula
Copy link
Owner

@aqrit Thank you very much, will check it next week and ping you back.

@lemire
Copy link
Contributor

lemire commented Mar 22, 2017

@aqrit We are working on a write-up for this code and we would like to credit you for this clever idea. You appear to be using the Web anonymously, under a pseudonym, which is fine, but it makes formally giving credit difficult. Would you give us a name? Short of that, if you are willing to email your name, please do so at lemire@gmail.com or wojciech_mula@poczta.onet.pl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants