sse3 lookup #3

aqrit · 2016-12-29T23:20:18Z

the obvious extension of the saturation method to pshufb luts...
it is a few instructions shorter than the current sse4 method and AFAIK should be faster, if unrolled.

;;; sse3 lookup ;;;
; xmm0  = in/out
; xmm12 = 0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,[...]  // mask_2F
; xmm13 = 0x01,0x50,0x54,0x46,0x25,0x25,0x05,0x05,[...]  // lut_roll_up
; xmm14 = 0xFF,0x7F,0x7F,0x76,0x66,0x66,0x66,0x66,[...]  // lut_roll_down
; xmm15 = 0x00,0x3F,0x3E,0x34,0x00,0x00,0x1A,0x1A,[junk] // lut_shift
;
;;; perfect hash for lut = ((src>>4)&0x2F)+((src==0x2F)?0xFF:0x00)
;;; 0000 = garbage
;;; 0001 = slash
;;; 0010 = plus
;;; 0011 = 0-9
;;; 0100 = A-Z
;;; 0101 = A-Z
;;; 0110 = a-z
;;; 0111 = a-z
;;; 1000 >= garbage

movdqa   xmm2, xmm12
movdqa   xmm3, xmm13
movdqa   xmm4, xmm14
movdqa   xmm5, xmm15
movdqa   xmm1, xmm0
psrld    xmm1, 4
pcmpeqb  xmm2, xmm0
pand     xmm1, xmm12
paddb    xmm1, xmm2
pshufb   xmm3, xmm1
pshufb   xmm4, xmm1
pshufb   xmm5, xmm1
paddb    xmm0, xmm3
psubsb   xmm0, xmm4
paddusb  xmm0, xmm5
pmovmskb  eax, xmm0

The text was updated successfully, but these errors were encountered:

WojciechMula · 2017-01-03T19:37:54Z

Thanks, it is really clever. However, you're still stuck with a kind of range checking, although done with saturated add. I'll be happy to compare performance of your approach, though. Would you provide a C++ implementation?

BTW recently I came up with method with involve a single bit-and to find out invalid input char: https://github.com/WojciechMula/base64simd/blob/master/decode/lookup.sse.cpp (function lookup_pshufb_bitmask).

aqrit · 2017-01-04T19:13:31Z

I will do a version with C++ intrinsics eventually.

The difference between the saturation method and the bitmask method will probably be insignificant when using only SSE3 instructions.

However with SSE4.1, lookup_pshufb_bitmask() needs to be reworked to place the _mm_movemask_epi8() behind a _mm_testz_si128(). If that can't be done then AFAIK it will be noticeably slower than the saturation method.

According to IACA, the unrolled sse3 saturation method will 'lookup' 64 bytes in 22.0 cycles on Nehalem, Westmere, and Sandy Bridge.

WojciechMula · 2017-01-04T20:16:00Z

Thank you, that's really interesting. I'm curious if your approach is faster on Skylake -- @lemire and I are working now on AVX2-only library (https://github.com/lemire/fastbase64). If you prepare an intrinsics version we will be happy to test it.

lemire · 2017-01-04T20:19:53Z

+1

aqrit · 2017-01-08T08:19:21Z

bitmap method w/testz

const __m128i lut_lo = _mm_setr_epi8(
	0x15, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 
	0x11, 0x11, 0x13, 0x1A, 0x1B, 0x1B, 0x1B, 0x1A 
);
const __m128i lut_hi = _mm_setr_epi8(
	0x10, 0x10, 0x01, 0x02, 0x04, 0x08, 0x04, 0x08, 
	0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10
);
const __m128i lut_roll = _mm_setr_epi8(
	0,   16,  19,   4, -65, -65, -71, -71,
	0,   0,   0,   0,   0,   0,   0,   0
);
const __m128i mask_2F = _mm_set1_epi8( 0x2F );

// lookup
__m128i hi_nibbles, lo_nibbles, lo, hi, roll, eq_2F;
hi_nibbles = _mm_srli_epi32( values, 4 );
lo_nibbles = _mm_and_si128( values, mask_2F );
lo = _mm_shuffle_epi8( lut_lo, lo_nibbles );
eq_2F = _mm_cmpeq_epi8( values, mask_2F );
hi_nibbles = _mm_and_si128( hi_nibbles, mask_2F );
hi = _mm_shuffle_epi8( lut_hi, hi_nibbles );
roll = _mm_shuffle_epi8( lut_roll, _mm_add_epi8( eq_2F, hi_nibbles ) );
if( ! _mm_testz_si128( lo, hi ) ) goto invalid_char;
values = _mm_add_epi8( values, roll );

the saturation method might save a cycle or so when unrolled... but other than that it seems pointless.

WojciechMula · 2017-01-08T19:22:02Z

@aqrit Thank you very much, will check it next week and ping you back.

lemire · 2017-03-22T17:59:49Z

@aqrit We are working on a write-up for this code and we would like to credit you for this clever idea. You appear to be using the Web anonymously, under a pseudonym, which is fine, but it makes formally giving credit difficult. Would you give us a name? Short of that, if you are willing to email your name, please do so at lemire@gmail.com or wojciech_mula@poczta.onet.pl.

aqrit closed this as completed Jan 8, 2017

WojciechMula added the enhancement label Jan 8, 2017

WojciechMula mentioned this issue Jan 8, 2017

Another decoding approach to test lemire/fastbase64#1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sse3 lookup #3

sse3 lookup #3

aqrit commented Dec 29, 2016

WojciechMula commented Jan 3, 2017

aqrit commented Jan 4, 2017

WojciechMula commented Jan 4, 2017

lemire commented Jan 4, 2017

aqrit commented Jan 8, 2017

WojciechMula commented Jan 8, 2017

lemire commented Mar 22, 2017

sse3 lookup #3

sse3 lookup #3

Comments

aqrit commented Dec 29, 2016

WojciechMula commented Jan 3, 2017

aqrit commented Jan 4, 2017

WojciechMula commented Jan 4, 2017

lemire commented Jan 4, 2017

aqrit commented Jan 8, 2017

WojciechMula commented Jan 8, 2017

lemire commented Mar 22, 2017