---
layout: post
title: Speeding up CRC-32 calculations in Mojo 
categories: [mojo]
date: "2024-05-99"
author: "Ferdinand Schenck"
draft: true
description: Or how I made CRC-32 calculations AA times faster than Python, and BB times slower than Python. 
---

In my [last post](https://fnands.com/blog/2024/mojo-png-parsing/) on parsing PNG images in Mojo, I very briefly mentioned cyclic redundancy checks (CRCs) and posted a rather cryptic-looking function, which I claimed was a bit inefficient.

In this post I want to follow up on that a bit and see how we can speed up these calculations, including a little bit of a look at the compile-time metaprogramming side of Mojo.    

For reference, this post was written with Mojo 24.4.0, so a few language details have changed since my last post (e.g. `math.bit` got moved to the top-level `bit`, and a few of its functions have been renamed).


## A bit of context
But first, let's go through a bit of background so we know what we're dealing with. 

### Cyclic redundancy checks

CRCs are error-detecting codes that are often used to detect corruption of data in digital files, PNG files being one example. In a PNG, the CRC-32 is calculated over each chunk's type and data fields and appended to the end of the chunk, so that whoever reads the file can verify that the data they read is the same as the data that was written.

A CRC check technically does "long division in the ring of polynomials of binary coefficients ($\Bbb{F}_2[x]$)" 😳.   

It's not as complicated as it sounds. I found the [Wikipedia article on Polynomial long division](https://en.wikipedia.org/wiki/Polynomial_long_division) to be helpful, and if you want an in-depth explanation then
[this post](https://github.com/komrad36/CRC) by [Kareem Omar](https://github.com/komrad36) does an excellent job of explaining both the concept and the implementation considerations. I won't go deep into the explanations, so I recommend you read at least the first part of Kareem's post for more background.

Did you read that post? Then welcome back, and we'll continue from there.

But tl;dr: 
In polynomial long division over the two-element finite field, the subtraction at each step is just an XOR, and XOR is a very efficient operation to calculate in hardware.
In practice, a CRC runs through a sequence of bytes and iteratively performs a lot of XORs and bit-shifts.
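
To make the "long division is just XOR" claim concrete, here is a small Python sketch (Python rather than Mojo, so it can be run anywhere). The function name `gf2_remainder` is mine, and the toy message and 4-bit divisor are chosen for illustration only; this is not the CRC-32 polynomial.

```python
def gf2_remainder(dividend: int, divisor: int) -> int:
    """Remainder of polynomial long division with coefficients in GF(2)."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        # Line the divisor up under the leading 1 of the dividend and subtract.
        # Over GF(2) there are no borrows, so subtraction is a plain XOR.
        shift = dividend.bit_length() - width
        dividend ^= divisor << shift
    return dividend


# A 14-bit message with three zero bits appended, divided by x^3 + x + 1 (0b1011)
message = 0b11010011101100
remainder = gf2_remainder(message << 3, 0b1011)
print(f"{remainder:03b}")  # prints 100

# Appending the remainder to the message makes the whole thing divisible again,
# which is how a receiver can check for corruption.
print(gf2_remainder((message << 3) | remainder, 0b1011))  # prints 0
```

The loop only ever shifts and XORs, which is exactly the pattern the CRC-32 functions in the rest of this post follow.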

## The original 

The CRC-32 check from my previous post looked something like this: 

In [3]:
from bit import bit_reverse

fn CRC32(data: List[SIMD[DType.uint8, 1]]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        crc32 = (bit_reverse(byte[]).cast[DType.uint32]() << 24) ^ crc32
        for i in range(8):
            
            if crc32 & 0x80000000 != 0:
                crc32 = (crc32 << 1) ^ 0x04c11db7
            else:
                crc32 = crc32 << 1
                
    return bit_reverse(crc32^0xffffffff)

I'll step through this in a moment, but the first thing you might notice is that I am reversing a lot of bits.

This is because when I was implementing this function (based on a C example), I implemented a little-endian version of the algorithm, while PNGs are encoded as big-endian. It's not a huge deal, but it does mean that I am constantly reversing the bits of each input byte, and then reversing the output again.

## The correct bit order


We can make this better by implementing the big-endian version:

In [25]:
fn CRC32_inv(owned data: List[SIMD[DType.uint8, 1]]) -> SIMD[DType.uint32, 1]:
    """Big endian CRC-32 check using 0xedb88320 as polynomial."""

    # Initialize crc32 as all 1s
    var crc32: UInt32 = 0xffffffff

    # Step through all bytes in bytestream
    for byte in data:

        # XOR new byte with crc32
        crc32 = (byte[].cast[DType.uint32]() ) ^ crc32

        # Step through crc32 8 times
        for i in range(8):
            
            # If the lowest bit is 1, shift right by 1 and XOR with the polynomial, otherwise just shift
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

    # Invert upon return
    return crc32^0xffffffff

This is very similar; we just use the bit-reversed form of the polynomial we used before (if you bit-reverse `0x04c11db7` you get `0xedb88320`). This also saves us one 24-bit shift per byte, as we are now working on the bottom 8 bits of the `UInt32` instead of the top 8.
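
If you want to convince yourself of the relationship between the two constants, a quick Python check does the job. The helper name `reverse_bits32` is made up for this sketch; it simply reverses the 32-character binary string.

```python
def reverse_bits32(value: int) -> int:
    """Reverse the bit order of a 32-bit unsigned integer."""
    # Format as exactly 32 binary digits, reverse the string, parse it back.
    return int(f"{value:032b}"[::-1], 2)


print(hex(reverse_bits32(0x04C11DB7)))  # prints 0xedb88320
```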

Just to verify that these implementations are equivalent, let's do a quick test:

In [5]:
var test_list = List[SIMD[DType.uint8, 1]](5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 42)

print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))                                           

0x382aa34e
0x382aa34e


And there we go, a more elegant version of the CRC-32 check I implemented last time.  
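
As an independent cross-check, the same two variants can be sketched in Python and compared against `zlib.crc32` from the standard library, which computes the same CRC-32 that PNG uses. The function names below are mine, and this is a rough sketch for verification rather than a translation of the Mojo code line by line.

```python
import zlib


def reverse_bits(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    return int(f"{value:0{width}b}"[::-1], 2)


def crc32_msb_first(data: bytes) -> int:
    """Top-bit-first variant: reverses every input byte and the final result."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= reverse_bits(byte, 8) << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return reverse_bits(crc ^ 0xFFFFFFFF, 32)


def crc32_lsb_first(data: bytes) -> int:
    """Bottom-bit-first variant: no reversals, uses the reversed polynomial."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


sample = b"123456789"
print(hex(crc32_msb_first(sample)))
print(hex(crc32_lsb_first(sample)))
print(hex(zlib.crc32(sample)))
```

All three lines should print the same value. Note that Python integers are unbounded, so the left-shifting variant needs the explicit `& 0xFFFFFFFF` mask that a `UInt32` gives you for free.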

As the theme of today's post is trying to speed things up, let's do a little bit of benchmarking to see if this change has saved us any time. Since we now do one fewer bit reversal and one fewer bit shift per byte, and skip the final reversal, we should see a bit of a performance uplift.


## A bit of benchmarking


So let's define a benchmarking function. This function will take two versions of the CRC32 function and benchmark their runtimes.

In [23]:
import benchmark

alias data_table_func = fn[data: List[UInt8], table: List[UInt32]]() -> None


fn bench[function_1: data_table_func, function_2: data_table_func,
         test_list: List[UInt8], test_table: List[UInt32]]():

    var report = benchmark.run[function_1[test_list, test_table]](max_runtime_secs=1.0
    ).mean(benchmark.Unit.ms)
    print("Function 1 runtime (ms): \t", report)

    var report_2 = benchmark.run[function_2[test_list, test_table]](max_runtime_secs=1.0
    ).mean(benchmark.Unit.ms)
    print("Function 2 runtime (ms): \t", report_2)

    print("Speedup factor: \t\t", (report/report_2))

The `table` parameter is a bit of future-proofing: we don't need it yet, but the reason for it will become apparent soon. Just ignore it for now.


Let's run both versions on 2**20 random bytes:

In [24]:
from random import rand, seed


fn run_32[data: List[UInt8], table: List[UInt32] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[UInt8], table: List[UInt32] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


alias fill_size = 2**20
alias g = UnsafePointer[UInt8].alloc(fill_size)
rand[DType.uint8](ptr =  g, size = fill_size)


alias rand_list = List[UInt8](unsafe_pointer = g, size = fill_size, capacity = fill_size)


alias dummy_table = List[UInt32](1)


bench[run_32, run_32_inv, rand_list, dummy_table]()

Function 1 runtime (ms): 	 8.5214284550000006
Function 2 runtime (ms): 	 6.918508535
Speedup factor: 		 1.2316857617347725


In [None]:
%%python

def py_crc32(data: bytearray) -> int: 
    crc32 = 0xffffffff


    for byte in data:
        crc32 = byte ^ crc32

        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

    return crc32^0xffffffff
        

py_test_list = [5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
    5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 42]
    
py_test_bytearray = bytearray(py_test_list)

print(hex(py_crc32(py_test_bytearray)))

0x382aa34e
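
To get a rough feel for how slow the bit-by-bit loop is in pure Python, here is a timing sketch using `timeit`, with `zlib.crc32` (implemented in C) as a point of comparison. The payload size and repeat counts are arbitrary choices of mine, and the absolute numbers will vary from machine to machine, so treat it as a sanity check and not as a proper benchmark.

```python
import os
import timeit
import zlib


def py_crc32_bitwise(data: bytes) -> int:
    """Same bit-by-bit loop as py_crc32 above, repeated so this block runs on its own."""
    crc32 = 0xFFFFFFFF
    for byte in data:
        crc32 ^= byte
        for _ in range(8):
            crc32 = (crc32 >> 1) ^ 0xEDB88320 if crc32 & 1 else crc32 >> 1
    return crc32 ^ 0xFFFFFFFF


payload = os.urandom(2**14)  # 16 KiB of random bytes (size chosen arbitrarily)

# Best of three runs for each implementation
bitwise_s = min(timeit.repeat(lambda: py_crc32_bitwise(payload), number=1, repeat=3))
zlib_s = min(timeit.repeat(lambda: zlib.crc32(payload), number=100, repeat=3)) / 100

print(f"bit-by-bit Python: {bitwise_s * 1e3:.3f} ms")
print(f"zlib.crc32:        {zlib_s * 1e3:.6f} ms")
```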


In [None]:

from time import sleep




In [None]:
from random import rand, seed
import benchmark
seed(614114419)

fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](unsafe_pointer = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

    print(100 * (report/report_2 -1))


bench()

65536
0.59500467449999994
0.41253443299999998
44.2315178815631


In [None]:
var little_endian_table = List[UInt32](capacity=256)

for i in range(256):

    var key = UInt8(i)
    var crc32 = key.cast[DType.uint32]()
    for i in range(8):
        if crc32 & 1 != 0:
            crc32 = (crc32 >> 1) ^ 0xedb88320
        else:
            crc32 = crc32 >> 1

    little_endian_table.append(crc32)

fn CRC32_table(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        var index = (crc32 ^ byte[].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table(test_list, little_endian_table)))

0x382aa34e
0x382aa34e
0x382aa34e
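
The idea behind the table version is that the eight inner shift-and-XOR steps depend only on the low byte of the register, so they can be precomputed once for all 256 possible byte values. A short Python sketch of the same technique, checked against `zlib.crc32`, looks like this (the names `make_table` and `crc32_table` are mine):

```python
import zlib


def make_table() -> list:
    """Precompute the result of the 8 shift/XOR steps for every byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = make_table()


def crc32_table(data: bytes) -> int:
    """One table lookup per input byte instead of eight conditional shifts."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


print(hex(TABLE[1]))  # prints 0x77073096
print(crc32_table(b"hello world") == zlib.crc32(b"hello world"))  # prints True
```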


In [None]:
from random import rand

fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn fill_table() -> List[UInt32]:

    var table = List[UInt32](capacity=256)

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table.append(crc32)
    return table

fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](unsafe_pointer = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.84526218549999999
0.76998419399999996
0.16062016688897607
9.7765632186470608
379.38201591598937
426.2491017608408


In [None]:
var little_endian_table_2_byte = List[UInt32](capacity=512)

for i in range(256):

    var key = UInt8(i)
    var crc32 = key.cast[DType.uint32]()
    for i in range(8):
        if crc32 & 1 != 0:
            crc32 = (crc32 >> 1) ^ 0xedb88320
        else:
            crc32 = crc32 >> 1

    little_endian_table_2_byte.append(crc32)

for i in range(256, 512):
    var crc32 = little_endian_table_2_byte[i-256]
    little_endian_table_2_byte.append((crc32 >> 8) ^ little_endian_table_2_byte[int(crc32.cast[DType.uint8]())])






In [None]:
from testing import assert_true


fn CRC32_table_2_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    #assert_true(len(data) % 2 == 0, "List must be divisible by two for 16-bit optimization.")

    var extra = len(data) % 2
    var leftover = List[SIMD[DType.uint8, 1]](capacity = extra)
    for i in range(extra):
        leftover.append(data[-(i + 1)])

    var result_length = len(data)//2
    var ptr_to_int8 = data.steal_data() 
    var ptr_to_uint16 = ptr_to_int8.bitcast[UInt16]()

    var result = List[UInt16]()
    result.data = ptr_to_uint16
    result.capacity = result_length
    result.size = result_length

    for byte in result:
        var index = (crc32 ^ byte[].cast[DType.uint32]()) #& 0xff
        crc32 =  table[int((index >> 8).cast[DType.uint8]())] ^ table[256 + int(index.cast[DType.uint8]())] ^ (crc32 >> 16)
    
    for byte in leftover:
        var index = (crc32 ^ byte[].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_2_byte(test_list, little_endian_table_2_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
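
The two-byte version works because the table lookup is linear under XOR: two consecutive one-byte steps can be folded into a single step that uses a second table. Below is a Python sketch of that idea under my own naming (`T0`, `T1`, `crc32_two_bytes`). It assembles each 16-bit value from two bytes explicitly instead of reinterpreting memory, so it does not depend on the byte order of the machine.

```python
import zlib

# T0 is the ordinary one-byte table
T0 = []
for byte in range(256):
    crc = byte
    for _ in range(8):
        crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    T0.append(crc)

# T1[b] is T0[b] pushed through one more (zero) byte
T1 = [(T0[b] >> 8) ^ T0[T0[b] & 0xFF] for b in range(256)]


def crc32_two_bytes(data: bytes) -> int:
    crc = 0xFFFFFFFF
    even = len(data) - len(data) % 2
    for i in range(0, even, 2):
        # Little-endian 16-bit value XORed into the low half of the register
        x = crc ^ (data[i] | (data[i + 1] << 8))
        crc = T1[x & 0xFF] ^ T0[(x >> 8) & 0xFF] ^ (crc >> 16)
    # Fall back to one byte at a time for an odd trailing byte
    for byte in data[even:]:
        crc = T0[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


print(crc32_two_bytes(b"odd") == zlib.crc32(b"odd"))    # prints True
print(crc32_two_bytes(b"even") == zlib.crc32(b"even"))  # prints True
```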


In [None]:
var f: UInt32 = (0xff << 8) | 0xff
print(hex(f))

0xffff


In [None]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte(data, table)
    benchmark.keep(a)


fn fill_table() -> List[UInt32]:

    var table = List[UInt32](capacity=256)

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table.append(crc32)
    return table

fn fill_table_2_byte() -> List[UInt32]:

    var table = List[UInt32](capacity=512)
    table.size = 512

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table[i] = crc32

    for i in range(256, 512):
        var crc32 = table[i-256]
        table[i] = (crc32 >> 8) ^ table[int(crc32.cast[DType.uint8]())]
    return table



fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](unsafe_pointer = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_2_byte()

    var report_4 = benchmark.run[run_32_table_2_byte[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.86205289350000003
0.7265950175
0.16436725754418979
0.092023411149999998
18.642830288882362
342.05581352153206
424.46752861848955
836.77563429466511


In [None]:
fn CRC32_table_2_byte_2(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var length = len(data)//2
    var extra = len(data) % 2

    for i in range(start = 0, end = length *2 , step = 2):
        
        var val: UInt32 = ((data[i + 1].cast[DType.uint32]() << 8) | data[i].cast[DType.uint32]())
        var index = crc32 ^ val
        crc32 =  table[int((index >> 8).cast[DType.uint8]())] ^ table[256 + int(index.cast[DType.uint8]())] ^ (crc32 >> 16)
    

    for i in range(2*length, 2*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_2_byte(test_list, little_endian_table_2_byte)))
print(hex(CRC32_table_2_byte_2(test_list, little_endian_table_2_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e


In [None]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)


fn fill_table() -> List[UInt32]:

    var table = List[UInt32](capacity=256)

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table.append(crc32)
    return table



fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](unsafe_pointer = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_2_byte()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)


    var report_5 = benchmark.run[run_32_table_2_byte[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.83095668599999994
0.68331465049999995
0.18584170618320112
0.11355653557158424
0.098940018650000003
21.60674228072914
267.68638457634177
347.13143409308253
631.75593268797638
739.85903513876076


In [None]:
fn fill_table_n_byte[n: Int]() -> List[UInt32]:

    var table = List[UInt32](capacity=256*n)
    table.size = 256*n

    for i in range(256*n):

        if i < 256: 
            var key = UInt8(i)
            var crc32 = key.cast[DType.uint32]()
            for i in range(8):
                if crc32 & 1 != 0:
                    crc32 = (crc32 >> 1) ^ 0xedb88320
                else:
                    crc32 = crc32 >> 1

            table[i] = crc32
        else:
            var crc32 = table[i-256]
            var index = int(crc32.cast[DType.uint8]())
            table[i] = (crc32 >> 8) ^ table[index]
            
    return table
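
One way to read the recurrence in `fill_table_n_byte` is that each extra block of 256 entries is the previous block pushed through one more zero byte. The Python sketch below mirrors the function above and checks that interpretation bit by bit; the helper `step_zero_byte` is a name I made up for this check.

```python
POLY = 0xEDB88320


def step_zero_byte(reg: int) -> int:
    """Advance the CRC register by one zero byte, one bit at a time."""
    for _ in range(8):
        reg = (reg >> 1) ^ POLY if reg & 1 else reg >> 1
    return reg


def fill_table_n_byte(n: int) -> list:
    """Python mirror of the Mojo function: n blocks of 256 entries each."""
    table = [0] * (256 * n)
    for i in range(256 * n):
        if i < 256:
            table[i] = step_zero_byte(i)
        else:
            prev = table[i - 256]
            table[i] = (prev >> 8) ^ table[prev & 0xFF]
    return table


table = fill_table_n_byte(4)

# Block k, entry b should equal b pushed through (k + 1) zero bytes.
for k in range(4):
    for b in range(256):
        reg = b
        for _ in range(k + 1):
            reg = step_zero_byte(reg)
        assert table[k * 256 + b] == reg

print("recurrence matches the bit-by-bit definition")
```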

In [None]:
#var t = fill_table_n_byte[1]()
alias t2 = fill_table_n_byte[2]()

In [None]:
fn CRC32_table_4_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 4

    #assert_true(len(data) % 2 == 0, "List must be divisible by two for 16-bit optimization.")
    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val: UInt32 =  (data[i + 3].cast[DType.uint32]() << 24) | (data[i + 2].cast[DType.uint32]() << 16) | (data[i + 1].cast[DType.uint32]() << 8) | data[i].cast[DType.uint32]()
        var index = crc32 ^ val.cast[DType.uint32]()
        crc32 = table[0*256 + int((index >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index >> 0).cast[DType.uint8]())] 
    
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:



var little_endian_table_4_byte  = fill_table_n_byte[4]()

In [None]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_2_byte(test_list, little_endian_table_2_byte)))
print(hex(CRC32_table_2_byte_2(test_list, little_endian_table_2_byte)))
print(hex(CRC32_table_4_byte(test_list, little_endian_table_4_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e


In [None]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)



fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](unsafe_pointer = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.79050925449999998
0.5410031475
0.12362461018886081
0.083568155249999998
0.0433552434
46.119159962927945
337.6176771546634
539.44327370767212
845.9455604053195
1723.3302191540688


In [None]:
fn CRC32_table_8_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 8

    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var index_1 = crc32 ^ val_1#.cast[DType.uint32]()
        var index_2 = val_2#.cast[DType.uint32]()
        crc32 = table[4*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_1 >> 0).cast[DType.uint8]())] ^
                table[0*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_2 >> 0).cast[DType.uint8]())] 
    
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:
fn CRC32_table_n_byte_compact[
    size: Int
](owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xFFFFFFFF

    alias step_size = 4
    alias units = size // step_size

    var length = len(data) // size
    var extra = len(data) % size

    var vals = List[UInt32](capacity=units)
    vals.size = units
    var n = 0
    for i in range(start=0, end=length * size, step=size):


        @unroll(units)
        for j in range(units):
            vals[j] = (
                (data[i + j * step_size + 3].cast[DType.uint32]() << 24)
                | (data[i + j * step_size + 2].cast[DType.uint32]() << 16)
                | (data[i + j * step_size + 1].cast[DType.uint32]() << 8)
                | (data[i + j * step_size + 0].cast[DType.uint32]() << 0)
            )

            if j == 0:
                vals[0] = vals[0] ^ crc32
                crc32 = 0

            n = size - j * step_size
            crc32 = (
                table[(n - 4) * 256 + int((vals[j] >> 24).cast[DType.uint8]())]
                ^ table[(n - 3) * 256 + int((vals[j] >> 16).cast[DType.uint8]())]
                ^ table[(n - 2) * 256 + int((vals[j] >> 8).cast[DType.uint8]())]
                ^ table[(n - 1) * 256 + int((vals[j] >> 0).cast[DType.uint8]())]
                ^ crc32
            )
    for i in range(size * length, size * length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xFF
        crc32 = table[int(index)] ^ (crc32 >> 8)

    return crc32 ^ 0xFFFFFFFF


In [None]:
var little_endian_table_8_byte  = fill_table_n_byte[8]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_8_byte(test_list, little_endian_table_8_byte)))
print(hex(CRC32_table_n_byte_compact[8](test_list, little_endian_table_8_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e
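
For reference, the general pattern behind the 4-byte, 8-byte and compact versions can be sketched in Python as follows. Byte `j` of an `n`-byte block is looked up in table `n - 1 - j`, and the running CRC is XORed into the first four bytes of the block before the lookups. The function names and the list-of-lists table layout are my own choices, and the sketch assumes `n >= 4`.

```python
import zlib


def make_tables(n: int) -> list:
    """n tables of 256 entries; tables[k][b] is b pushed through k + 1 zero bytes."""
    tables = [[0] * 256 for _ in range(n)]
    for b in range(256):
        crc = b
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        tables[0][b] = crc
    for k in range(1, n):
        for b in range(256):
            prev = tables[k - 1][b]
            tables[k][b] = (prev >> 8) ^ tables[0][prev & 0xFF]
    return tables


def crc32_slicing(data: bytes, tables: list) -> int:
    n = len(tables)  # bytes consumed per step, assumed to be at least 4
    crc = 0xFFFFFFFF
    full = len(data) - len(data) % n
    for i in range(0, full, n):
        block = bytearray(data[i:i + n])
        # Fold the current CRC into the first four bytes of the block
        first = int.from_bytes(block[:4], "little") ^ crc
        block[:4] = first.to_bytes(4, "little")
        crc = 0
        for j, byte in enumerate(block):
            crc ^= tables[n - 1 - j][byte]
    # Remaining bytes are handled one at a time with the first table
    for byte in data[full:]:
        crc = tables[0][(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


data = bytes(range(256)) * 3 + b"xyz"
for n in (4, 8, 16):
    print(n, crc32_slicing(data, make_tables(n)) == zlib.crc32(data))
```

Each line should end in `True`. In Python the inner loop over the block is of course slow; the point of the sketch is only to show which table each byte goes to.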


In [None]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte_compact[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_n_byte_compact[8](data, table)
    benchmark.keep(a)


fn bench():

    
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](unsafe_pointer = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()




    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    var report_7 = benchmark.run[run_32_table_8_byte_compact[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)



    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    print(100 * (report/report_6 -1))
    print(100 * (report/report_7 -1))

    #report.print_full()
    #report_2.print_full()


bench()

1048576
11.288160865671642
8.4133244950000012
2.082689491438356
1.669156014851485
0.77359028100000005
0.49301487150000001
0.49764872500000001
34.170040301906134
303.964418584047
441.99922321957672
576.2795547710391
1359.1911432857883
2189.618735298463
2168.2989624200568


In [None]:
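fn CRC32_table_16_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    # Slicing-by-16: consume 16 input bytes per iteration using 16 lookup
    # tables of 256 entries each, stored back to back in `table`.
    var crc32: UInt32 = 0xffffffff

    var size = 16

    var length = len(data)//size   # number of full 16-byte blocks
    var extra = len(data) % size   # trailing bytes, handled one at a time

    for i in range(start = 0, end = length*size, step = size):
        # Pack the block into four little-endian 32-bit words.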
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

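        # Only the first word is XORed with the running CRC; the other words
        # enter the lookups unchanged. The values are already UInt32.
        var index_1 = crc32 ^ val_1
        var index_2 = val_2
        var index_3 = val_3
        var index_4 = val_4

        # The byte at offset j in the block is looked up in sub-table (15 - j).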
        crc32 = table[0*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_1 >> 0).cast[DType.uint8]())] 
    
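    # Process the leftover bytes with the plain 1-byte table, which is the
    # first 256 entries of `table`.
    for i in range(size*length, size*length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)

    # Final inversion of the CRC register.
    return crc32^0xffffffff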

In [None]:
var little_endian_table_16_byte  = fill_table_n_byte[16]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_16_byte(test_list, little_endian_table_16_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
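
The 16 lookups above follow a regular pattern: the byte at offset `j` in a block goes through sub-table `15 - j`, and only the first four bytes are XORed with the running CRC. As a rough cross-check outside of Mojo, here is a minimal Python sketch of the same pattern written as a loop. Python is used only because it is the comparison point of this post and ships a reference `zlib.crc32`. The names `make_tables` and `crc32_sliced` are made up for this sketch and are not the `fill_table_n_byte` or `CRC32_table_16_byte` from the cells above. The table layout is an assumption based on the indexing above, and the loop form only holds for block sizes of at least 4 bytes.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial


def make_tables(n):
    """Build n sub-tables of 256 entries each, stored back to back.

    Sub-table 0 is the ordinary byte-wise table; sub-table k gives the
    effect of a byte that is followed by k further bytes in the block.
    """
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        table.append(c)
    for k in range(1, n):
        for i in range(256):
            prev = table[(k - 1) * 256 + i]
            table.append((prev >> 8) ^ table[prev & 0xFF])
    return table


def crc32_sliced(data, table, n):
    """CRC-32 that consumes n bytes per iteration (assumes n >= 4)."""
    crc = 0xFFFFFFFF
    full = len(data) // n * n
    for i in range(0, full, n):
        acc = 0
        for j in range(n):
            byte = data[i + j]
            if j < 4:
                # Only the first four bytes see the running CRC.
                byte ^= (crc >> (8 * j)) & 0xFF
            acc ^= table[(n - 1 - j) * 256 + byte]
        crc = acc
    # Leftover bytes go through the plain 1-byte table (sub-table 0).
    for b in data[full:]:
        crc = table[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Compare against the standard library for a few block sizes.
message = bytes(range(256)) * 4 + b"tail"
for n in (4, 8, 16, 32, 64):
    assert crc32_sliced(message, make_tables(n), n) == zlib.crc32(message)
```

If the asserts pass, the loop and the hand-unrolled version describe the same computation. The unrolled Mojo code just spells out every iteration of the inner loop.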


In [None]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)


fn run_32_table_16_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_16_byte(data, table)
    benchmark.keep(a)

fn bench():
    # Benchmark every variant on the same 1 MiB (2**20 bytes) of random data.
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)

    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)

    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

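    # Percentage speedups, computed as 100 * (t_reference / t_new - 1).
    print(100 * (report/report_2 - 1))    # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 - 1))  # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 - 1))    # 1-byte table vs. CRC32
    print(100 * (report/report_4 - 1))    # 2-byte table vs. CRC32
    print(100 * (report/report_5 - 1))    # 4-byte table vs. CRC32
    print(100 * (report/report_6 - 1))    # 8-byte table vs. CRC32
    print(100 * (report/report_7 - 1))    # 16-byte table vs. CRC32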


bench()

1048576
10.729803022222223
8.2542810899999992
1.9779388177299086
1.3382771795160382
0.71156432250000001
0.47510569199999997
0.236569169
29.990763644108263
317.3173111326812
442.47395956043158
701.76238423959717
1407.9175111709205
2158.4033832670275
4435.5880766619348
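
The last block of numbers is easy to misread, so here is a small Python sketch of how the percentages relate to plain "times faster" factors. The helper names `speedup_percent` and `speedup_factor` are made up for this sketch. The two timings are copied from the output above: the first one is the baseline `CRC32` and the last one is `CRC32_table_16_byte`, both in milliseconds.

```python
def speedup_percent(t_base_ms, t_new_ms):
    """Same formula as the print statements in bench()."""
    return 100 * (t_base_ms / t_new_ms - 1)


def speedup_factor(t_base_ms, t_new_ms):
    """How many times faster the new version is than the baseline."""
    return t_base_ms / t_new_ms


baseline_ms = 10.729803022222223  # CRC32, from the output above
table16_ms = 0.236569169          # CRC32_table_16_byte, from the output above

print(speedup_percent(baseline_ms, table16_ms))
print(speedup_factor(baseline_ms, table16_ms))
```

A speedup of roughly 4436% is the same thing as being about 45 times faster than the baseline, since the factor is just the percentage divided by 100, plus one.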


In [None]:
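fn CRC32_table_32_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    # Same scheme as CRC32_table_16_byte, but with 32 bytes per iteration
    # and 32 sub-tables of 256 entries each.
    var crc32: UInt32 = 0xffffffff

    var size = 32

    var length = len(data)//size   # number of full 32-byte blocks
    var extra = len(data) % size   # trailing bytes, handled one at a time

    for i in range(start = 0, end = length*size, step = size):
        # Pack the block into eight little-endian 32-bit words.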
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var val_5: UInt32 = (data[i + 19].cast[DType.uint32]() << 24) | 
                            (data[i + 18].cast[DType.uint32]() << 16) | 
                            (data[i + 17].cast[DType.uint32]() << 8) | 
                             data[i + 16].cast[DType.uint32]()

        var val_6: UInt32 = (data[i + 23].cast[DType.uint32]() << 24) | 
                            (data[i + 22].cast[DType.uint32]() << 16) | 
                            (data[i + 21].cast[DType.uint32]() << 8) | 
                             data[i + 20].cast[DType.uint32]()

        var val_7: UInt32 = (data[i + 27].cast[DType.uint32]() << 24) | 
                            (data[i + 26].cast[DType.uint32]() << 16) | 
                            (data[i + 25].cast[DType.uint32]() << 8) | 
                             data[i + 24].cast[DType.uint32]()

        var val_8: UInt32 = (data[i + 31].cast[DType.uint32]() << 24) | 
                            (data[i + 30].cast[DType.uint32]() << 16) | 
                            (data[i + 29].cast[DType.uint32]() << 8) | 
                             data[i + 28].cast[DType.uint32]()

        var index_1 = crc32 ^ val_1.cast[DType.uint32]()
        var index_2 = val_2.cast[DType.uint32]()
        var index_3 = val_3.cast[DType.uint32]()
        var index_4 = val_4.cast[DType.uint32]()
        var index_5 = val_5.cast[DType.uint32]()
        var index_6 = val_6.cast[DType.uint32]()
        var index_7 = val_7.cast[DType.uint32]()
        var index_8 = val_8.cast[DType.uint32]()

        crc32 = table[0*256 + int((index_8 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_8 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_8 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_8 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_7 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_7 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_7 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_7 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_6 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_6 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_6 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_6 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_5 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_5 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_5 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_5 >> 0).cast[DType.uint8]())] ^
                table[16*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[17*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[18*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[19*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[20*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[21*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[22*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[23*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[24*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[25*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[26*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[27*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[28*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[29*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[30*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[31*256 + int((index_1 >> 0).cast[DType.uint8]())] 
    
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:
var little_endian_table_32_byte  = fill_table_n_byte[32]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_32_byte(test_list, little_endian_table_32_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [None]:
fn run_32_table_32_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_32_byte(data, table)
    benchmark.keep(a)

fn bench():

    
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

    alias little_endian_table_32_byte = fill_table_n_byte[32]()

    var report_8 = benchmark.run[run_32_table_32_byte[rand_list, little_endian_table_32_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_8)

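    # Percentage speedups, computed as 100 * (t_reference / t_new - 1).
    print(100 * (report/report_2 - 1))    # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 - 1))  # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 - 1))    # 1-byte table vs. CRC32
    print(100 * (report/report_4 - 1))    # 2-byte table vs. CRC32
    print(100 * (report/report_5 - 1))    # 4-byte table vs. CRC32
    print(100 * (report/report_6 - 1))    # 8-byte table vs. CRC32
    print(100 * (report/report_7 - 1))    # 16-byte table vs. CRC32
    print(100 * (report/report_8 - 1))    # 32-byte table vs. CRC32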


bench()

1048576
8.6481844649999999
6.8747104100000005
1.9391274461538461
1.3276128694677871
0.7069171955000001
0.48138664549999999
0.2292578575
0.19133430500078627
25.79707288353983
254.52597113385275
345.98329429833007
551.40860441243535
1123.3659783708003
1696.5152431716137
3672.2521527969875
4419.9340834172217


In [None]:
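fn CRC32_table_64_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    # Same scheme as CRC32_table_16_byte, but with 64 bytes per iteration
    # and 64 sub-tables of 256 entries each.
    var crc32: UInt32 = 0xffffffff

    var size = 64

    var length = len(data)//size   # number of full 64-byte blocks
    var extra = len(data) % size   # trailing bytes, handled one at a time

    for i in range(start = 0, end = length*size, step = size):
        # Pack the block into sixteen little-endian 32-bit words.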
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var val_5: UInt32 = (data[i + 19].cast[DType.uint32]() << 24) | 
                            (data[i + 18].cast[DType.uint32]() << 16) | 
                            (data[i + 17].cast[DType.uint32]() << 8) | 
                             data[i + 16].cast[DType.uint32]()

        var val_6: UInt32 = (data[i + 23].cast[DType.uint32]() << 24) | 
                            (data[i + 22].cast[DType.uint32]() << 16) | 
                            (data[i + 21].cast[DType.uint32]() << 8) | 
                             data[i + 20].cast[DType.uint32]()

        var val_7: UInt32 = (data[i + 27].cast[DType.uint32]() << 24) | 
                            (data[i + 26].cast[DType.uint32]() << 16) | 
                            (data[i + 25].cast[DType.uint32]() << 8) | 
                             data[i + 24].cast[DType.uint32]()

        var val_8: UInt32 = (data[i + 31].cast[DType.uint32]() << 24) | 
                            (data[i + 30].cast[DType.uint32]() << 16) | 
                            (data[i + 29].cast[DType.uint32]() << 8) | 
                             data[i + 28].cast[DType.uint32]()


        var val_9: UInt32 = (data[i + 35].cast[DType.uint32]() << 24) | 
                            (data[i + 34].cast[DType.uint32]() << 16) | 
                            (data[i + 33].cast[DType.uint32]() << 8) | 
                             data[i + 32].cast[DType.uint32]()

        var val_10: UInt32 =(data[i + 39].cast[DType.uint32]() << 24) | 
                            (data[i + 38].cast[DType.uint32]() << 16) | 
                            (data[i + 37].cast[DType.uint32]() << 8) | 
                             data[i + 36].cast[DType.uint32]()

        var val_11: UInt32 =(data[i + 43].cast[DType.uint32]() << 24) | 
                            (data[i + 42].cast[DType.uint32]() << 16) | 
                            (data[i + 41].cast[DType.uint32]() << 8) | 
                             data[i + 40].cast[DType.uint32]()

        var val_12: UInt32 =(data[i + 47].cast[DType.uint32]() << 24) | 
                            (data[i + 46].cast[DType.uint32]() << 16) | 
                            (data[i + 45].cast[DType.uint32]() << 8) | 
                             data[i + 44].cast[DType.uint32]()

        var val_13: UInt32 =(data[i + 51].cast[DType.uint32]() << 24) | 
                            (data[i + 50].cast[DType.uint32]() << 16) | 
                            (data[i + 49].cast[DType.uint32]() << 8) | 
                             data[i + 48].cast[DType.uint32]()

        var val_14: UInt32 =(data[i + 55].cast[DType.uint32]() << 24) | 
                            (data[i + 54].cast[DType.uint32]() << 16) | 
                            (data[i + 53].cast[DType.uint32]() << 8) | 
                             data[i + 52].cast[DType.uint32]()

        var val_15: UInt32 =(data[i + 59].cast[DType.uint32]() << 24) | 
                            (data[i + 58].cast[DType.uint32]() << 16) | 
                            (data[i + 57].cast[DType.uint32]() << 8) | 
                             data[i + 56].cast[DType.uint32]()

        var val_16: UInt32 =(data[i + 63].cast[DType.uint32]() << 24) | 
                            (data[i + 62].cast[DType.uint32]() << 16) | 
                            (data[i + 61].cast[DType.uint32]() << 8) | 
                             data[i + 60].cast[DType.uint32]()

        var index_1 = crc32 ^ val_1.cast[DType.uint32]()
        var index_2 = val_2.cast[DType.uint32]()
        var index_3 = val_3.cast[DType.uint32]()
        var index_4 = val_4.cast[DType.uint32]()
        var index_5 = val_5.cast[DType.uint32]()
        var index_6 = val_6.cast[DType.uint32]()
        var index_7 = val_7.cast[DType.uint32]()
        var index_8 = val_8.cast[DType.uint32]()
        var index_9 = val_9.cast[DType.uint32]()
        var index_10 = val_10.cast[DType.uint32]()
        var index_11 = val_11.cast[DType.uint32]()
        var index_12 = val_12.cast[DType.uint32]()
        var index_13 = val_13.cast[DType.uint32]()
        var index_14 = val_14.cast[DType.uint32]()
        var index_15 = val_15.cast[DType.uint32]()
        var index_16 = val_16.cast[DType.uint32]()

        crc32 = table[0*256 + int((index_16 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_16 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_16 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_16 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_15 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_15 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_15 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_15 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_14 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_14 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_14 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_14 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_13 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_13 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_13 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_13 >> 0).cast[DType.uint8]())] ^
                table[16*256 + int((index_12 >> 24).cast[DType.uint8]())] ^
                table[17*256 + int((index_12 >> 16).cast[DType.uint8]())] ^
                table[18*256 + int((index_12 >> 8).cast[DType.uint8]())] ^
                table[19*256 + int((index_12 >> 0).cast[DType.uint8]())] ^
                table[20*256 + int((index_11 >> 24).cast[DType.uint8]())] ^
                table[21*256 + int((index_11 >> 16).cast[DType.uint8]())] ^
                table[22*256 + int((index_11 >> 8).cast[DType.uint8]())] ^
                table[23*256 + int((index_11 >> 0).cast[DType.uint8]())] ^
                table[24*256 + int((index_10 >> 24).cast[DType.uint8]())] ^
                table[25*256 + int((index_10 >> 16).cast[DType.uint8]())] ^
                table[26*256 + int((index_10 >> 8).cast[DType.uint8]())] ^
                table[27*256 + int((index_10 >> 0).cast[DType.uint8]())] ^
                table[28*256 + int((index_9 >> 24).cast[DType.uint8]())] ^
                table[29*256 + int((index_9 >> 16).cast[DType.uint8]())] ^
                table[30*256 + int((index_9 >> 8).cast[DType.uint8]())] ^
                table[31*256 + int((index_9 >> 0).cast[DType.uint8]())] ^
                table[32*256 + int((index_8 >> 24).cast[DType.uint8]())] ^
                table[33*256 + int((index_8 >> 16).cast[DType.uint8]())] ^
                table[34*256 + int((index_8 >> 8).cast[DType.uint8]())] ^
                table[35*256 + int((index_8 >> 0).cast[DType.uint8]())] ^
                table[36*256 + int((index_7 >> 24).cast[DType.uint8]())] ^
                table[37*256 + int((index_7 >> 16).cast[DType.uint8]())] ^
                table[38*256 + int((index_7 >> 8).cast[DType.uint8]())] ^
                table[39*256 + int((index_7 >> 0).cast[DType.uint8]())] ^
                table[40*256 + int((index_6 >> 24).cast[DType.uint8]())] ^
                table[41*256 + int((index_6 >> 16).cast[DType.uint8]())] ^
                table[42*256 + int((index_6 >> 8).cast[DType.uint8]())] ^
                table[43*256 + int((index_6 >> 0).cast[DType.uint8]())] ^
                table[44*256 + int((index_5 >> 24).cast[DType.uint8]())] ^
                table[45*256 + int((index_5 >> 16).cast[DType.uint8]())] ^
                table[46*256 + int((index_5 >> 8).cast[DType.uint8]())] ^
                table[47*256 + int((index_5 >> 0).cast[DType.uint8]())] ^
                table[48*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[49*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[50*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[51*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[52*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[53*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[54*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[55*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[56*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[57*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[58*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[59*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[60*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[61*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[62*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[63*256 + int((index_1 >> 0).cast[DType.uint8]())] 
    
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [None]:
var little_endian_table_64_byte  = fill_table_n_byte[64]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_64_byte(test_list, little_endian_table_64_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [None]:
fn run_32_table_64_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_64_byte(data, table)
    benchmark.keep(a)

fn bench():

    
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

    alias little_endian_table_32_byte = fill_table_n_byte[32]()

    var report_8 = benchmark.run[run_32_table_32_byte[rand_list, little_endian_table_32_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_8)

    alias little_endian_table_64_byte = fill_table_n_byte[64]()

    var report_9 = benchmark.run[run_32_table_64_byte[rand_list, little_endian_table_64_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_9)

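    # Percentage speedups, computed as 100 * (t_reference / t_new - 1).
    print(100 * (report/report_2 - 1))    # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 - 1))  # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 - 1))    # 1-byte table vs. CRC32
    print(100 * (report/report_4 - 1))    # 2-byte table vs. CRC32
    print(100 * (report/report_5 - 1))    # 4-byte table vs. CRC32
    print(100 * (report/report_6 - 1))    # 8-byte table vs. CRC32
    print(100 * (report/report_7 - 1))    # 16-byte table vs. CRC32
    print(100 * (report/report_8 - 1))    # 32-byte table vs. CRC32
    print(100 * (report/report_9 - 1))    # 64-byte table vs. CRC32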


bench()

1048576
8.0958940049999999
7.3598195150000008
2.1762406051159076
1.5418122323619632
0.73977778249999993
0.50607860449999997
0.2477206075
0.23779043850000001
0.32381718649999996
10.001257347409265
238.18960540018108
272.01281815844112
425.0894911241935
994.36836256974243
1499.730542451731
3168.1552361363392
3304.6339525127705
2400.1433964963439
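
In this run the 64-byte version is slower than the 32-byte one, so bigger tables stop paying off somewhere around here. To put some numbers on that, here is a small Python sketch that works out how much memory each table needs and what the timings above mean as throughput. The helper names `table_bytes` and `throughput_gib_s` are made up for this sketch. The timings are the last three mean values (in ms) from the output above, and the entry size assumes the tables hold `UInt32` values, as in the function signatures.

```python
ENTRY_BYTES = 4          # each table entry is a UInt32
ENTRIES_PER_BLOCK = 256  # one sub-table per byte position


def table_bytes(n):
    """Memory footprint of an n-byte slicing table, in bytes."""
    return n * ENTRIES_PER_BLOCK * ENTRY_BYTES


def throughput_gib_s(n_bytes, mean_ms):
    """Throughput in GiB/s for n_bytes processed in mean_ms milliseconds."""
    return n_bytes / (mean_ms / 1000) / 2**30


data_bytes = 2**20  # fill_size used in bench()

# Mean timings in ms from the output above, keyed by bytes per iteration.
timings_ms = {
    16: 0.2477206075,
    32: 0.23779043850000001,
    64: 0.32381718649999996,
}

for n, ms in timings_ms.items():
    kib = table_bytes(n) // 1024
    print(n, "byte table:", kib, "KiB,", round(throughput_gib_s(data_bytes, ms), 2), "GiB/s")
```

The tables take 16 KiB, 32 KiB and 64 KiB respectively, and throughput goes from roughly 4 GiB/s with the 32-byte table to roughly 3 GiB/s with the 64-byte one. One plausible explanation is that the largest table no longer fits comfortably in the L1 data cache, which is often only a few tens of KiB, but that is an assumption about the hardware and not something I measured here. The much longer unrolled loop body could also play a role.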


In [None]:
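fn CRC32_table_48_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    # Same scheme as CRC32_table_16_byte, but with 48 bytes per iteration,
    # to probe a size between the 32-byte and 64-byte versions.
    var crc32: UInt32 = 0xffffffff

    var size = 48

    var length = len(data)//size   # number of full 48-byte blocks
    var extra = len(data) % size   # trailing bytes, handled one at a time

    for i in range(start = 0, end = length*size, step = size):
        # Pack the block into little-endian 32-bit words.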
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var val_5: UInt32 = (data[i + 19].cast[DType.uint32]() << 24) | 
                            (data[i + 18].cast[DType.uint32]() << 16) | 
                            (data[i + 17].cast[DType.uint32]() << 8) | 
                             data[i + 16].cast[DType.uint32]()

        var val_6: UInt32 = (data[i + 23].cast[DType.uint32]() << 24) | 
                            (data[i + 22].cast[DType.uint32]() << 16) | 
                            (data[i + 21].cast[DType.uint32]() << 8) | 
                             data[i + 20].cast[DType.uint32]()

        var val_7: UInt32 = (data[i + 27].cast[DType.uint32]() << 24) | 
                            (data[i + 26].cast[DType.uint32]() << 16) | 
                            (data[i + 25].cast[DType.uint32]() << 8) | 
                             data[i + 24].cast[DType.uint32]()

        var val_8: UInt32 = (data[i + 31].cast[DType.uint32]() << 24) | 
                            (data[i + 30].cast[DType.uint32]() << 16) | 
                            (data[i + 29].cast[DType.uint32]() << 8) | 
                             data[i + 28].cast[DType.uint32]()


        var val_9: UInt32 = (data[i + 35].cast[DType.uint32]() << 24) | 
                            (data[i + 34].cast[DType.uint32]() << 16) | 
                            (data[i + 33].cast[DType.uint32]() << 8) | 
                             data[i + 32].cast[DType.uint32]()

        var val_10: UInt32 =(data[i + 39].cast[DType.uint32]() << 24) | 
                            (data[i + 38].cast[DType.uint32]() << 16) | 
                            (data[i + 37].cast[DType.uint32]() << 8) | 
                             data[i + 36].cast[DType.uint32]()

        var val_11: UInt32 =(data[i + 43].cast[DType.uint32]() << 24) | 
                            (data[i + 42].cast[DType.uint32]() << 16) | 
                            (data[i + 41].cast[DType.uint32]() << 8) | 
                             data[i + 40].cast[DType.uint32]()

        var val_12: UInt32 =(data[i + 47].cast[DType.uint32]() << 24) | 
                            (data[i + 46].cast[DType.uint32]() << 16) | 
                            (data[i + 45].cast[DType.uint32]() << 8) | 
                             data[i + 44].cast[DType.uint32]()



        var index_1 = crc32 ^ val_1.cast[DType.uint32]()
        var index_2 = val_2.cast[DType.uint32]()
        var index_3 = val_3.cast[DType.uint32]()
        var index_4 = val_4.cast[DType.uint32]()
        var index_5 = val_5.cast[DType.uint32]()
        var index_6 = val_6.cast[DType.uint32]()
        var index_7 = val_7.cast[DType.uint32]()
        var index_8 = val_8.cast[DType.uint32]()
        var index_9 = val_9.cast[DType.uint32]()
        var index_10 = val_10.cast[DType.uint32]()
        var index_11 = val_11.cast[DType.uint32]()
        var index_12 = val_12.cast[DType.uint32]()

        crc32 = table[0*256 + int((index_12 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_12 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_12 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_12 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_11 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_11 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_11 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_11 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_10 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_10 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_10 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_10 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_9 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_9 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_9 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_9 >> 0).cast[DType.uint8]())] ^
                table[16*256 + int((index_8 >> 24).cast[DType.uint8]())] ^
                table[17*256 + int((index_8 >> 16).cast[DType.uint8]())] ^
                table[18*256 + int((index_8 >> 8).cast[DType.uint8]())] ^
                table[19*256 + int((index_8 >> 0).cast[DType.uint8]())] ^
                table[20*256 + int((index_7 >> 24).cast[DType.uint8]())] ^
                table[21*256 + int((index_7 >> 16).cast[DType.uint8]())] ^
                table[22*256 + int((index_7 >> 8).cast[DType.uint8]())] ^
                table[23*256 + int((index_7 >> 0).cast[DType.uint8]())] ^
                table[24*256 + int((index_6 >> 24).cast[DType.uint8]())] ^
                table[25*256 + int((index_6 >> 16).cast[DType.uint8]())] ^
                table[26*256 + int((index_6 >> 8).cast[DType.uint8]())] ^
                table[27*256 + int((index_6 >> 0).cast[DType.uint8]())] ^
                table[28*256 + int((index_5 >> 24).cast[DType.uint8]())] ^
                table[29*256 + int((index_5 >> 16).cast[DType.uint8]())] ^
                table[30*256 + int((index_5 >> 8).cast[DType.uint8]())] ^
                table[31*256 + int((index_5 >> 0).cast[DType.uint8]())] ^
                table[32*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[33*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[34*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[35*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[36*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[37*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[38*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[39*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[40*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[41*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[42*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[43*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[44*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[45*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[46*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[47*256 + int((index_1 >> 0).cast[DType.uint8]())] 
                
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
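
The hand-unrolled loop above follows one pattern: the running CRC is XORed into the first four bytes of each 48-byte block, and byte `j` of the block is then looked up in sub-table `47 - j`. As a cross-check, here is a minimal Python sketch of that pattern for a general block size `n`, compared against `zlib.crc32`. The names `fill_table_n_byte_py` and `crc32_slicing` are made up for this illustration, and the table fill is assumed to follow the same recurrence as the Mojo table-filling function in this post.

```python
import zlib


def fill_table_n_byte_py(n):
    """Build n stacked 256-entry CRC-32 tables (reflected polynomial 0xEDB88320)."""
    table = [0] * (256 * n)
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table[i] = crc
    for i in range(256, 256 * n):
        prev = table[i - 256]
        table[i] = (prev >> 8) ^ table[prev & 0xFF]
    return table


def crc32_slicing(data, table, n):
    """Process n bytes per step (n >= 4), then finish the tail one byte at a time."""
    crc = 0xFFFFFFFF
    full = len(data) // n * n
    for start in range(0, full, n):
        block = bytearray(data[start:start + n])
        # Fold the running CRC into the first four bytes (little-endian).
        for k in range(4):
            block[k] ^= (crc >> (8 * k)) & 0xFF
        # Byte j of the block uses sub-table n - 1 - j.
        crc = 0
        for j, byte in enumerate(block):
            crc ^= table[(n - 1 - j) * 256 + byte]
    for byte in data[full:]:
        crc = table[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


data = bytes(range(256)) * 4 + b"tail"  # 1028 bytes: 21 full blocks plus a 20-byte tail
table_48 = fill_table_n_byte_py(48)
print(crc32_slicing(data, table_48, 48) == zlib.crc32(data))
```

The inner loop over `block` plays the role of the 48 explicit table lookups in the Mojo version. It is far slower in pure Python, so it is only useful as a correctness reference.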

In [None]:
var little_endian_table_48_byte  = fill_table_n_byte[48]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_48_byte(test_list, little_endian_table_48_byte)))



error: Expression [23]:1:36: use of unknown declaration 'fill_table_n_byte'
var little_endian_table_48_byte  = fill_table_n_byte[48]()
                                   ^~~~~~~~~~~~~~~~~
error: Expression [23]:2:11: use of unknown declaration 'CRC32'
print(hex(CRC32(test_list)))
          ^~~~~
error: Expression [23]:3:11: use of unknown declaration 'CRC32_inv'
print(hex(CRC32_inv(test_list)))
          ^~~~~~~~~
error: Expression [23]:4:11: use of unknown declaration 'CRC32_table_48_byte'
print(hex(CRC32_table_48_byte(test_list, little_endian_table_48_byte)))
          ^~~~~~~~~~~~~~~~~~~

expression failed to parse (no further compiler diagnostics)

In [None]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)


fn run_32_table_16_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_16_byte(data, table)
    benchmark.keep(a)

fn run_32_table_32_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_32_byte(data, table)
    benchmark.keep(a)

fn run_32_table_48_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_48_byte(data, table)
    benchmark.keep(a)

fn run_32_table_64_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_64_byte(data, table)
    benchmark.keep(a)

fn bench():

    
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

    alias little_endian_table_32_byte = fill_table_n_byte[32]()

    var report_8 = benchmark.run[run_32_table_32_byte[rand_list, little_endian_table_32_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_8)

    alias little_endian_table_48_byte = fill_table_n_byte[48]()

    var report_9 = benchmark.run[run_32_table_48_byte[rand_list, little_endian_table_48_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_9)

    alias little_endian_table_64_byte = fill_table_n_byte[64]()

    var report_10 = benchmark.run[run_32_table_64_byte[rand_list, little_endian_table_64_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_10)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    print(100 * (report/report_6 -1))
    print(100 * (report/report_7 -1))
    print(100 * (report/report_8 -1))
    print(100 * (report/report_9 -1))
    print(100 * (report/report_10 -1))




    #report.print_full()
    #report_2.print_full()


bench()

1048576
8.15321797
7.0219428749999997
1.9393683111291633
1.3442710239154616
0.70748446850000002
0.48288434749999998
0.234955997
0.18959147131244575
0.40115316100000004
0.25623588050000001
16.110571036224798
262.07371414208558
320.40585706244372
506.51593502715701
1052.4235983987542
1588.4411375541636
3370.104221259779
4200.4138918061035
1932.4451512922267
3081.9189233336115
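
The raw output above is easier to read with labels attached. The sketch below copies the ten mean runtimes (in ms) from that output, pairs them with the functions in the order `bench()` runs them, and recomputes the percentages using the same formula as the `print` calls. The helper name `percent_faster` is made up for this illustration. Note that the second percentage printed by `bench()` compares `CRC32_inv` with `CRC32_table`, whereas this sketch measures everything against the plain `CRC32` baseline.

```python
# Mean runtimes in ms, copied from the benchmark output above.
timings_ms = {
    "CRC32": 8.15321797,
    "CRC32_inv": 7.0219428749999997,
    "CRC32_table": 1.9393683111291633,
    "CRC32_table_2_byte_2": 1.3442710239154616,
    "CRC32_table_4_byte": 0.70748446850000002,
    "CRC32_table_8_byte": 0.48288434749999998,
    "CRC32_table_16_byte": 0.234955997,
    "CRC32_table_32_byte": 0.18959147131244575,
    "CRC32_table_48_byte": 0.40115316100000004,
    "CRC32_table_64_byte": 0.25623588050000001,
}


def percent_faster(baseline_ms, candidate_ms):
    """Same formula as the print calls in bench(): 100 * (baseline / candidate - 1)."""
    return 100 * (baseline_ms / candidate_ms - 1)


baseline = timings_ms["CRC32"]
for name, ms in timings_ms.items():
    pct = percent_faster(baseline, ms)
    print(f"{name:22s} {ms:8.3f} ms  {pct:8.1f} %  ({baseline / ms:.1f}x)")
```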


In [None]:
%%python
from timeit import timeit
import numpy as np
import zlib


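# np.random.randint excludes the upper bound, so use 256 to cover the full byte range.
# Note: this is 2**18 bytes, whereas the Mojo benchmark above uses 2**20 bytes.
py_rand_list = np.random.randint(0, 256, size=2**18, dtype=np.uint8)
print(hex(zlib.crc32(py_rand_list)))

# Mean time per call, in seconds.
secs = timeit(lambda: zlib.crc32(py_rand_list), number=100)/100

# Note: secs*100 is not milliseconds; multiply the printed value by 10 to get ms.
print(secs*100)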

0x5ada5791
0.021449088002555072
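
The Python timing above uses a different buffer size and a different unit from the Mojo benchmark. A like-for-like measurement could look like the sketch below: the same 2**20-byte buffer as `fill_size` in `bench()`, reported in milliseconds to line up with `benchmark.Unit.ms`. Using `os.urandom` instead of NumPy is an assumption made here to keep the sketch standard-library only, and the variable names are illustrative.

```python
import os
import zlib
from timeit import timeit

n_bytes = 2**20                 # same size as fill_size in the Mojo bench()
payload = os.urandom(n_bytes)   # assumption: stdlib random bytes instead of NumPy
runs = 100

# Mean wall-clock time per zlib.crc32 call, converted from seconds to milliseconds.
secs_per_call = timeit(lambda: zlib.crc32(payload), number=runs) / runs
ms_per_call = secs_per_call * 1000
print(f"{ms_per_call:.4f} ms per call over {n_bytes} bytes")
```

The result depends on the machine and on the zlib build that Python links against, so it is only a rough yardstick next to the Mojo numbers.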


In [None]:
print(secs*100)

0.021655022050254047


In [None]:
fn fill_table_n_byte_simd[n: Int]() -> DTypePointer[DType.uint32]:

    #alias size = 256*n
    #var table = SIMD[DType.uint32, size](0)
    var table = DTypePointer[DType.uint32]()
    table = table.alloc(256*n, alignment=64)

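    for i in range(256*n):

        if i < 256: 
            # First 256 entries: the standard byte-wise CRC-32 table (reflected polynomial 0xedb88320).
            var key = UInt8(i)
            var crc32 = key.cast[DType.uint32]()
            # Use `_` so the bit loop does not shadow the outer index `i`.
            for _ in range(8):
                if crc32 & 1 != 0:
                    crc32 = (crc32 >> 1) ^ 0xedb88320
                else:
                    crc32 = crc32 >> 1

            table[i] = crc32
        else:
            # Later entries: advance the entry 256 slots earlier by one more zero byte.
            var crc32 = table[i-256]
            var index = int(crc32.cast[DType.uint8]())
            table[i] = (crc32 >> 8) ^ table[index]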
            
    return table
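
One way to read the `else` branch of the table fill: entry `k*256 + b` is the raw CRC register after processing byte `b` followed by `k` zero bytes, starting from a register of 0 and without the final XOR. The Python sketch below checks that reading for the first four sub-tables. The names `raw_crc` and `fill_table` are made up for this illustration, and the recurrence is copied from the Mojo function above.

```python
POLY = 0xEDB88320  # reflected CRC-32 polynomial, as in the Mojo code


def raw_crc(data):
    """Bit-by-bit CRC register update with initial value 0 and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    return crc


def fill_table(n):
    """Same recurrence as the Mojo table fill: each row extends the previous one."""
    table = [raw_crc(bytes([b])) for b in range(256)]
    for i in range(256, 256 * n):
        prev = table[i - 256]
        table.append((prev >> 8) ^ table[prev & 0xFF])
    return table


table = fill_table(4)
ok = all(
    table[k * 256 + b] == raw_crc(bytes([b]) + bytes(k))
    for k in range(4)
    for b in range(256)
)
print(ok)
```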


In [None]:
alias let_simd = fill_table_n_byte_simd[32]()

In [None]:
print(let_simd.is_aligned[128]())

True
