---
layout: post
title: CRC calculations and compile time metaprogramming in Mojo
categories: [mojo]
date: "2024-05-99"
author: "Ferdinand Schenck"
draft: true
description: . 
---

In my [last post](https://fnands.com/blog/2024/mojo-png-parsing/) on parsing PNG images in Mojo, I very briefly mentioned cyclic redundancy checks (CRCs) and posted a rather cryptic-looking function which I claimed was a bit inefficient.

In this post I want to follow up on that a bit, and delve into the compile-time metaprogramming side of Mojo to see how we can speed up these calculations.

But first, let's go through a bit of background so we know what we're dealing with. 

## Cyclic redundancy checks

CRCs are error-detecting codes that are often used to detect corruption of data in digital files, PNG files being one example. In a PNG, a CRC32 is calculated over the type and data fields of each chunk and appended to the end of the chunk, so that whoever reads the file can verify that the data they read is the same as the data that was written.

A CRC check technically does "long division in the ring of polynomials of binary coefficients ($\Bbb{F}_2[x]$)" 😳.   

It's not as complicated as it sounds. I found the [Wikipedia article on polynomial long division](https://en.wikipedia.org/wiki/Polynomial_long_division) helpful, and if you want an in-depth explanation, [this post](https://github.com/komrad36/CRC) by [Kareem Omar](https://github.com/komrad36) does a really great job of explaining both the concept and the implementation considerations.

What you need to know is that, for polynomials with binary coefficients, addition and subtraction are both just XOR. Every step of the long division therefore comes down to a shift and an XOR, and XOR is a very efficient operation to calculate in hardware.
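
To make the XOR claim concrete, here is a minimal sketch in Python (chosen for brevity, not Mojo). The helper name `poly_add_gf2` and the bit-per-coefficient encoding are assumptions made for this illustration only: each integer stands for a polynomial whose coefficient of $x^i$ is bit $i$.

```python
def poly_add_gf2(a: int, b: int) -> int:
    """Add two polynomials over GF(2), coefficient by coefficient."""
    result = 0
    for i in range(max(a.bit_length(), b.bit_length())):
        # Add the two coefficients of x^i and reduce modulo 2.
        coeff = ((a >> i) & 1) + ((b >> i) & 1)
        result |= (coeff % 2) << i
    return result


a = 0b1011  # x^3 + x + 1
b = 0b0110  # x^2 + x

print(bin(poly_add_gf2(a, b)))      # 0b1101
print(poly_add_gf2(a, b) == a ^ b)  # True
```

Since $-1 \equiv 1 \pmod 2$, subtraction gives exactly the same result, which is why the division steps never need a borrow.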

The simplest example of a cyclic redundancy check is the [parity bit](https://en.wikipedia.org/wiki/Parity_bit), AKA CRC-1. The parity bit is used to detect whether an error has occurred while transmitting a byte-long message (it can be used for longer messages, but probably shouldn't be). 

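In the formalism of CRC checks, it can be calculated by successively applying XOR between your message and the relevant *generator polynomial*. For larger cases the choice of generator polynomial can get quite involved, but for the CRC-1 case it is $x + 1$, expressed in binary as 11. Notice that the generator polynomial always has one more bit than the CRC (its degree equals the width of the CRC). It is applied by bitshifting: line the generator up with the leftmost set bit of the message, XOR, and repeat until what is left is shorter than the generator. For the two 4-bit messages 1001 and 1011, first as a plain sum of bits and then as a division: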

```
1+0+0+1 (mod 2) = 0
1+0+1+1 (mod 2) = 1

1001 XOR 1100 = 0101
0101 XOR 0110 = 0011
0011 XOR 0011 = 0000  (remainder 0)

1011 XOR 1100 = 0111
0111 XOR 0110 = 0001  (remainder 1)
```
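
The same shift-and-XOR division can be sketched in a few lines of Python (used here for brevity, not Mojo). The function names `remainder_gf2` and `parity` are made up for this example. It reproduces the two worked messages above and compares the remainder with a direct bit count.

```python
def remainder_gf2(message: int, generator: int) -> int:
    """Remainder of message / generator, both read as polynomials over GF(2)."""
    gen_bits = generator.bit_length()
    while message.bit_length() >= gen_bits:
        # Line the generator up with the leftmost set bit and subtract (XOR).
        shift = message.bit_length() - gen_bits
        message ^= generator << shift
    return message


def parity(message: int) -> int:
    """Parity computed directly: number of set bits modulo 2."""
    return bin(message).count("1") % 2


for msg in (0b1001, 0b1011):
    print(f"{msg:04b}: remainder {remainder_gf2(msg, 0b11)}, parity {parity(msg)}")
# 1001: remainder 0, parity 0
# 1011: remainder 1, parity 1
```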


In [1]:
from math.bit import bitreverse
import benchmark

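fn CRC32(owned data: List[SIMD[DType.uint8, 1]]) -> SIMD[DType.uint32, 1]:
    # The register starts with all bits set.
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        # CRC-32 consumes each byte least-significant bit first, so reverse the
        # bits before XORing the byte into the top of the register.
        crc32 = (bitreverse(byte[]).cast[DType.uint32]() << 24) ^ crc32
        for _ in range(8):
            # If the top bit is set, shift it out and XOR in the generator.
            if crc32 & 0x80000000 != 0:
                crc32 = (crc32 << 1) ^ 0x04c11db7
            else:
                crc32 = crc32 << 1

    # Invert, then reverse the bits to get the conventional result.
    return bitreverse(crc32^0xffffffff)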
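
As a cross-check of the cell above, here is a rough Python port (Python rather than Mojo so the result can be compared with the standard library's `zlib.crc32`). The names `reverse_bits` and `crc32_msb_first` are assumptions for this sketch. Python integers are unbounded, so the register is masked back to 32 bits after each left shift.

```python
import zlib


def reverse_bits(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def crc32_msb_first(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        # Feed the bit-reversed byte into the top of the register.
        crc ^= reverse_bits(byte, 8) << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return reverse_bits(crc ^ 0xFFFFFFFF, 32)


message = b"123456789"
print(hex(crc32_msb_first(message)))                 # 0xcbf43926
print(crc32_msb_first(message) == zlib.crc32(message))  # True
```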

In [2]:
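fn CRC32_inv(owned data: List[SIMD[DType.uint8, 1]]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        # Reflected form: XOR the byte into the low end of the register and
        # shift right, so no per-byte bit reversal is needed.
        crc32 = (byte[].cast[DType.uint32]() ) ^ crc32
        for _ in range(8):
            # 0xedb88320 is 0x04c11db7 with its 32 bits reversed.
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

    return crc32^0xffffffff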
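
The constant `0xedb88320` in the cell above is `0x04c11db7` with its bits reversed. The following Python sketch (not Mojo, so it can be checked against `zlib.crc32`) derives the constant and runs the right-shifting loop. `reverse_bits32` and `crc32_reflected` are names assumed for this illustration.

```python
import zlib

POLY = 0x04C11DB7


def reverse_bits32(value: int) -> int:
    """Reverse all 32 bits of value."""
    return int(f"{value:032b}"[::-1], 2)


REFLECTED_POLY = reverse_bits32(POLY)


def crc32_reflected(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; XOR in the reflected generator if a 1 fell off.
            crc = (crc >> 1) ^ REFLECTED_POLY if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


print(hex(REFLECTED_POLY))                                # 0xedb88320
print(crc32_reflected(b"123456789") == zlib.crc32(b"123456789"))  # True
```

Working in the reflected domain removes two `bitreverse` calls per message byte and one at the end. The first benchmark in this post shows the two forms within a couple of percent of each other, so that saving is small.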

In [3]:
var test_list = List[SIMD[DType.uint8, 1]](5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                                           5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 42)

In [4]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))

0x382aa34e
0x382aa34e


In [6]:
from random import rand, seed

seed(614114419)

fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

    print(100 * (report/report_2 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.54345133014492752
0.53359242868010659
1.8476464310424934


In [7]:
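var little_endian_table = List[UInt32](capacity=256)
# Mark all 256 slots as in use so the indexed writes below stay within the list's length.
little_endian_table.size = 256

for i in range(256):

    # Entry i is the register after pushing the single byte i through
    # eight rounds of the reflected shift-and-XOR loop.
    var key = UInt8(i)
    var crc32 = key.cast[DType.uint32]()
    for _ in range(8):
        if crc32 & 1 != 0:
            crc32 = (crc32 >> 1) ^ 0xedb88320
        else:
            crc32 = crc32 >> 1

    little_endian_table[i] = crc32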

fn CRC32_table(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        var index = (crc32 ^ byte[].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
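
The table lookup works because the eight inner-loop rounds depend only on the low byte of the register, so their effect can be precomputed for all 256 byte values. Below is a minimal Python sketch of the same idea (Python rather than Mojo so it can be checked against `zlib.crc32`). `make_table` and `crc32_table` are names assumed for this example.

```python
import zlib


def make_table() -> list:
    """Precompute the effect of eight shift-and-XOR rounds for every byte."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = make_table()


def crc32_table(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        # One lookup replaces the eight-round inner loop.
        crc = TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


print(hex(TABLE[1]))                                         # 0x77073096
print(crc32_table(b"123456789") == zlib.crc32(b"123456789"))  # True
```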

In [8]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table(test_list, little_endian_table)))

0x382aa34e
0x382aa34e
0x382aa34e


In [9]:
from random import rand

fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn fill_table() -> List[UInt32]:

    var table = List[UInt32](capacity=256)
    # Mark all 256 slots as in use so the indexed writes below stay within the list's length.
    table.size = 256

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for _ in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table[i] = crc32
    return table

fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.63409343566433563
0.49859184247823446
0.13950786015076683
27.17685722907035
257.39336976382828
354.52165561070734


In [10]:
var little_endian_table_2_byte = List[UInt32](capacity=512)
# Mark all 512 slots as in use so the indexed reads and writes below stay within the list's length.
little_endian_table_2_byte.size = 512

for i in range(256):

    var key = UInt8(i)
    var crc32 = key.cast[DType.uint32]()
    for _ in range(8):
        if crc32 & 1 != 0:
            crc32 = (crc32 >> 1) ^ 0xedb88320
        else:
            crc32 = crc32 >> 1

    little_endian_table_2_byte[i] = crc32

# Entries 256..511: each first-table entry advanced by one extra zero byte.
for i in range(256, 512):
    var crc32 = little_endian_table_2_byte[i-256]
    little_endian_table_2_byte[i] = (crc32 >> 8) ^ little_endian_table_2_byte[int(crc32.cast[DType.uint8]())]






In [11]:
from testing import assert_true


fn CRC32_table_2_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    # Set aside the trailing byte (if the length is odd) before the buffer is
    # reinterpreted; it is processed one byte at a time at the end.
    var extra = len(data) % 2
    var leftover = List[SIMD[DType.uint8, 1]](capacity = extra)
    for i in range(extra):
        leftover.append(data[-(i + 1)])

    # Take ownership of the byte buffer and view it as 16-bit words.
    # This relies on the machine being little-endian.
    var result_length = len(data)//2
    var ptr_to_int8 = data.steal_data() 
    var ptr_to_uint16 = ptr_to_int8.bitcast[UInt16]()

    var result = List[UInt16]()
    result.data = ptr_to_uint16
    result.capacity = result_length
    result.size = result_length

    for word in result:
        # Two lookups consume 16 bits of input per iteration.
        var index = (crc32 ^ word[].cast[DType.uint32]())
        crc32 =  table[int((index >> 8).cast[DType.uint8]())] ^ table[256 + int(index.cast[DType.uint8]())] ^ (crc32 >> 16)
    
    for byte in leftover:
        var index = (crc32 ^ byte[].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
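
The two-bytes-at-a-time step can be sketched in Python as follows (Python rather than Mojo so the result can be compared with `zlib.crc32`). The names `make_table_2_byte` and `crc32_slice2` are assumptions for this illustration. Like the Mojo cell, it uses a flat 512-entry table and reads each pair of bytes as a little-endian 16-bit word.

```python
import zlib


def make_table_2_byte() -> list:
    table = [0] * 512
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table[i] = crc
    for i in range(256, 512):
        # Advance each first-table entry by one more (zero) byte.
        prev = table[i - 256]
        table[i] = (prev >> 8) ^ table[prev & 0xFF]
    return table


TABLE2 = make_table_2_byte()


def crc32_slice2(data: bytes) -> int:
    crc = 0xFFFFFFFF
    even_len = len(data) - len(data) % 2
    for i in range(0, even_len, 2):
        word = data[i] | (data[i + 1] << 8)  # little-endian 16-bit word
        index = (crc ^ word) & 0xFFFF
        # The first byte still has one byte after it, so it uses the second table.
        crc = TABLE2[256 + (index & 0xFF)] ^ TABLE2[index >> 8] ^ (crc >> 16)
    for byte in data[even_len:]:
        # Odd trailing byte: fall back to the one-byte step.
        crc = TABLE2[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


odd = b"123456789"    # 9 bytes, exercises the tail loop
even = b"1234567890"  # 10 bytes
print(crc32_slice2(odd) == zlib.crc32(odd), crc32_slice2(even) == zlib.crc32(even))
```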

In [12]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_2_byte(test_list, little_endian_table_2_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [13]:
var f: UInt32 = (0xff << 8) | 0xff
print(hex(f))

0xffff


In [14]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte(data, table)
    benchmark.keep(a)


fn fill_table() -> List[UInt32]:

    var table = List[UInt32](capacity=256)
    # Mark all 256 slots as in use, as fill_table_2_byte does below.
    table.size = 256

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for _ in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table[i] = crc32
    return table

fn fill_table_2_byte() -> List[UInt32]:

    var table = List[UInt32](capacity=512)
    table.size = 512

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table[i] = crc32

    for i in range(256, 512):
        var crc32 = table[i-256]
        table[i] = (crc32 >> 8) ^ table[int(crc32.cast[DType.uint8]())]
    return table



fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_2_byte()

    var report_4 = benchmark.run[run_32_table_2_byte[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.60503601359358661
0.5166418020723198
0.14394653191489362
0.098237166000000001
17.109380457931533
258.91229555830984
320.3199657156768
515.89318811740418


In [15]:
fn CRC32_table_2_byte_2(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var length = len(data)//2
    var extra = len(data) % 2

    for i in range(start = 0, end = length *2 , step = 2):
        
        var val: UInt32 = ((data[i + 1].cast[DType.uint32]() << 8) | data[i].cast[DType.uint32]())
        var index = crc32 ^ val
        crc32 =  table[int((index >> 8).cast[DType.uint8]())] ^ table[256 + int(index.cast[DType.uint8]())] ^ (crc32 >> 16)
    

    for i in range(2*length, 2*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [16]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_2_byte(test_list, little_endian_table_2_byte)))
print(hex(CRC32_table_2_byte_2(test_list, little_endian_table_2_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e


In [17]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)


fn fill_table() -> List[UInt32]:

    var table = List[UInt32](capacity=256)
    # Mark all 256 slots as in use so the indexed writes below stay within the list's length.
    table.size = 256

    for i in range(256):

        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for _ in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1

        table[i] = crc32
    return table



fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_2_byte()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)


    var report_5 = benchmark.run[run_32_table_2_byte[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.62060917593244191
0.50325528196930946
0.14514870670863206
0.096061115799999999
0.1023257679
23.318959217657877
246.71702792332263
327.56783026543769
546.05659716107721
506.50331648519386


In [18]:
fn fill_table_n_byte[n: Int]() -> List[UInt32]:

    # Builds n lookup tables of 256 entries each, stored back to back.
    var table = List[UInt32](capacity=256*n)
    table.size = 256*n

    for i in range(256*n):

        if i < 256: 
            # First table: the bitwise CRC of the single byte i.
            var key = UInt8(i)
            var crc32 = key.cast[DType.uint32]()
            for _ in range(8):
                if crc32 & 1 != 0:
                    crc32 = (crc32 >> 1) ^ 0xedb88320
                else:
                    crc32 = crc32 >> 1

            table[i] = crc32
        else:
            # Each later table is the previous one advanced by one more zero byte.
            var crc32 = table[i-256]
            var index = int(crc32.cast[DType.uint8]())
            table[i] = (crc32 >> 8) ^ table[index]
            
    return table
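
One way to read the table that `fill_table_n_byte` builds: entry `256*k + b` is the raw CRC register (zero start value, no final XOR) after processing byte `b` followed by `k` zero bytes. The Python sketch below (Python rather than Mojo so it can be run as a quick check) mirrors the function and tests that reading. `shift_byte` and `raw_crc` are helper names assumed for this illustration.

```python
POLY = 0xEDB88320  # reflected CRC-32 generator used throughout this post


def shift_byte(crc: int) -> int:
    """Eight rounds of the reflected shift-and-XOR loop."""
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    return crc


def fill_table_n_byte(n: int) -> list:
    table = [0] * (256 * n)
    for i in range(256 * n):
        if i < 256:
            table[i] = shift_byte(i)
        else:
            prev = table[i - 256]
            table[i] = (prev >> 8) ^ table[prev & 0xFF]
    return table


def raw_crc(data: bytes) -> int:
    """Bitwise CRC with a zero start value and no final XOR."""
    crc = 0
    for byte in data:
        crc = shift_byte(crc ^ byte)
    return crc


table = fill_table_n_byte(4)
ok = all(
    table[256 * k + b] == raw_crc(bytes([b]) + bytes(k))
    for k in range(4)
    for b in range(256)
)
print(ok)  # True
```

This is what lets the wider variants process several bytes per iteration: each byte in a block is looked up in the table matching the number of bytes that still follow it.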

In [19]:
#var t = fill_table_n_byte[1]()
alias t2 = fill_table_n_byte[2]()

In [20]:
fn CRC32_table_4_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 4

    # Number of whole 4-byte words, and the bytes left over at the end.
    var length = len(data)//size
    var extra = len(data) % size

    for i in range(start = 0, end = length*size, step = size):
        
        # Assemble four bytes into a little-endian 32-bit word.
        var val: UInt32 =  (data[i + 3].cast[DType.uint32]() << 24) | (data[i + 2].cast[DType.uint32]() << 16) | (data[i + 1].cast[DType.uint32]() << 8) | data[i].cast[DType.uint32]()
        var index = crc32 ^ val
        # All 32 bits of the register are consumed here, so no shifted crc32 term remains.
        crc32 = table[0*256 + int((index >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index >> 0).cast[DType.uint8]())] 
    
    # Handle the remaining bytes with the one-byte table step.
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
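
A Python sketch of the four-byte step follows (Python rather than Mojo so the result can be compared with `zlib.crc32`). `make_flat_table` and `crc32_slice4` are names assumed for this illustration. Unlike the one- and two-byte versions there is no `crc >> 8` or `crc >> 16` term, because a 32-bit word absorbs the whole register.

```python
import zlib


def make_flat_table(n: int) -> list:
    """n tables of 256 entries, stored back to back."""
    table = [0] * (256 * n)
    for i in range(256 * n):
        if i < 256:
            crc = i
            for _ in range(8):
                crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
            table[i] = crc
        else:
            prev = table[i - 256]
            table[i] = (prev >> 8) ^ table[prev & 0xFF]
    return table


TABLE4 = make_flat_table(4)


def crc32_slice4(data: bytes) -> int:
    crc = 0xFFFFFFFF
    whole = len(data) - len(data) % 4
    for i in range(0, whole, 4):
        index = crc ^ int.from_bytes(data[i:i + 4], "little")
        # The first byte (low bits) has three bytes after it, so it uses table 3.
        crc = (
            TABLE4[0 * 256 + (index >> 24)]
            ^ TABLE4[1 * 256 + ((index >> 16) & 0xFF)]
            ^ TABLE4[2 * 256 + ((index >> 8) & 0xFF)]
            ^ TABLE4[3 * 256 + (index & 0xFF)]
        )
    for byte in data[whole:]:
        crc = TABLE4[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


print(crc32_slice4(b"123456789") == zlib.crc32(b"123456789"))  # True
```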

In [21]:



var little_endian_table_4_byte  = fill_table_n_byte[4]()

In [22]:
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_2_byte(test_list, little_endian_table_2_byte)))
print(hex(CRC32_table_2_byte_2(test_list, little_endian_table_2_byte)))
print(hex(CRC32_table_4_byte(test_list, little_endian_table_4_byte)))

0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e
0x382aa34e


In [23]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)



fn bench():

    
    alias fill_size = 2**16
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    #report.print_full()
    #report_2.print_full()


bench()

65536
0.60933668494550408
0.51093877646534425
0.14623572986493247
0.10261496649999999
0.052074877086801967
19.258258134344963
249.39393877082026
316.68112540506002
493.80878416551849
1070.1164151185994


In [24]:
fn CRC32_table_8_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 8

    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var index_1 = crc32 ^ val_1#.cast[DType.uint32]()
        var index_2 = val_2#.cast[DType.uint32]()
        crc32 = table[4*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_1 >> 0).cast[DType.uint8]())] ^
                table[0*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_2 >> 0).cast[DType.uint8]())] 
    
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [25]:
var little_endian_table_8_byte  = fill_table_n_byte[8]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_8_byte(test_list, little_endian_table_8_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [26]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)

#fn run_32_table_8_byte_compact[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
#    var a = CRC32_table_8_byte_compact(data, table)
#    benchmark.keep(a)


fn bench():

    
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    #var report_7 = benchmark.run[run_32_table_8_byte_compact[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    #).mean(benchmark.Unit.ms)
    #print(report_7)


    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)




    print(100 * (report/report_2 -1))
    print(100 * (report_2/report_3 -1))
    print(100 * (report/report_3 -1))
    print(100 * (report/report_4 -1))
    print(100 * (report/report_5 -1))
    print(100 * (report/report_6 -1))
    #print(100 * (report/report_7 -1))

    #report.print_full()
    #report_2.print_full()


bench()

1048576
9.5133472319587629
10.088749911111112
2.3775413462857142
1.8442974057108863
0.93763144855967073
0.55137638930406352
-5.7034090865770821
324.33541384557378
300.13382929474079
415.82500753406595
914.61477711445752
1625.3816841824371


In [28]:
fn CRC32_table_16_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 16

    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var index_1 = crc32 ^ val_1.cast[DType.uint32]()
        var index_2 = val_2.cast[DType.uint32]()
        var index_3 = val_3.cast[DType.uint32]()
        var index_4 = val_4.cast[DType.uint32]()

        crc32 = table[0*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_1 >> 0).cast[DType.uint8]())] 
    
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
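
The 4-, 8- and 16-byte functions all follow one pattern: XOR the register into the first four bytes of the block, then look each byte up in the table that matches how many bytes follow it. Here is a hedged Python sketch of that pattern for any block size of at least four bytes (Python rather than Mojo so it can be checked against `zlib.crc32`). `make_tables` and `crc32_slice_n` are assumed names, and the tables are kept as a list of lists instead of one flat list.

```python
import zlib

POLY = 0xEDB88320


def make_tables(n: int) -> list:
    """Return n tables; table k advances a byte past k further zero bytes."""
    first = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        first.append(crc)
    tables = [first]
    for _ in range(1, n):
        tables.append([(v >> 8) ^ first[v & 0xFF] for v in tables[-1]])
    return tables


def crc32_slice_n(data: bytes, tables: list) -> int:
    n = len(tables)  # block size in bytes; this sketch assumes n >= 4
    crc = 0xFFFFFFFF
    whole = len(data) - len(data) % n
    for start in range(0, whole, n):
        block = data[start:start + n]
        # Only the first four bytes of a block see the old register value.
        head = (crc ^ int.from_bytes(block[:4], "little")).to_bytes(4, "little")
        crc = 0
        for position, byte in enumerate(head + block[4:]):
            crc ^= tables[n - 1 - position][byte]
    for byte in data[whole:]:
        crc = tables[0][(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


payload = bytes(range(256)) * 3 + b"tail!"
for n in (4, 8, 16):
    print(n, crc32_slice_n(payload, make_tables(n)) == zlib.crc32(payload))
```

In pure Python the inner loop adds overhead of its own, so this sketch only shows that the pattern is correct and says nothing about speed. The Mojo versions above write the lookups out by hand for each block size.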

In [29]:
var little_endian_table_16_byte  = fill_table_n_byte[16]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_16_byte(test_list, little_endian_table_16_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [30]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)


fn run_32_table_16_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_16_byte(data, table)
    benchmark.keep(a)

fn bench():

    
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)


    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

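    # Relative speedup in percent: 100 * (t_baseline / t_variant - 1)
    print(100 * (report/report_2 - 1))    # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 - 1))  # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 - 1))    # 1-byte table vs. CRC32
    print(100 * (report/report_4 - 1))    # 2-byte table vs. CRC32
    print(100 * (report/report_5 - 1))    # 4-byte table vs. CRC32
    print(100 * (report/report_6 - 1))    # 8-byte table vs. CRC32
    print(100 * (report/report_7 - 1))    # 16-byte table vs. CRC32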


    #report.print_full()
    #report_2.print_full()


bench()

1048576
13.255485929133858
12.07438032596685
3.0362552249999997
1.7011732215288613
0.83217084949999998
0.55661897399999993
0.28126769000000001
9.7819148584126872
297.6734309602333
336.57350739129186
679.19671914545074
1492.8803486805937
2281.4290472127996
4612.7652412311763


In [31]:
fn CRC32_table_32_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 32

    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var val_5: UInt32 = (data[i + 19].cast[DType.uint32]() << 24) | 
                            (data[i + 18].cast[DType.uint32]() << 16) | 
                            (data[i + 17].cast[DType.uint32]() << 8) | 
                             data[i + 16].cast[DType.uint32]()

        var val_6: UInt32 = (data[i + 23].cast[DType.uint32]() << 24) | 
                            (data[i + 22].cast[DType.uint32]() << 16) | 
                            (data[i + 21].cast[DType.uint32]() << 8) | 
                             data[i + 20].cast[DType.uint32]()

        var val_7: UInt32 = (data[i + 27].cast[DType.uint32]() << 24) | 
                            (data[i + 26].cast[DType.uint32]() << 16) | 
                            (data[i + 25].cast[DType.uint32]() << 8) | 
                             data[i + 24].cast[DType.uint32]()

        var val_8: UInt32 = (data[i + 31].cast[DType.uint32]() << 24) | 
                            (data[i + 30].cast[DType.uint32]() << 16) | 
                            (data[i + 29].cast[DType.uint32]() << 8) | 
                             data[i + 28].cast[DType.uint32]()

        # Each val_k is already a UInt32, so no further cast is needed.
        # Only the first word is XORed with the running CRC.
        var index_1 = crc32 ^ val_1
        var index_2 = val_2
        var index_3 = val_3
        var index_4 = val_4
        var index_5 = val_5
        var index_6 = val_6
        var index_7 = val_7
        var index_8 = val_8

        crc32 = table[0*256 + int((index_8 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_8 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_8 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_8 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_7 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_7 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_7 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_7 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_6 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_6 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_6 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_6 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_5 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_5 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_5 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_5 >> 0).cast[DType.uint8]())] ^
                table[16*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[17*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[18*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[19*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[20*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[21*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[22*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[23*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[24*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[25*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[26*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[27*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[28*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[29*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[30*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[31*256 + int((index_1 >> 0).cast[DType.uint8]())] 
    
    # Process the leftover `extra` bytes one at a time,
    # using only the first 256 table entries.
    for i in range(size*length, size*length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
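
As a cross-check on the indexing pattern in the unrolled function above, here is a small Python sketch of the same idea that works for any block size `n`. It is Python rather than Mojo so that it can be compared directly against `zlib.crc32`. The name `fill_table_n_byte` mirrors the Mojo function, but its body here is my own reconstruction: I'm assuming that sub-table `k` holds the CRC state of a byte followed by `k` zero bytes, which is what the `k*256 + byte` indexing above suggests.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, as used by PNG and zlib


def fill_table_n_byte(n):
    """Build n sub-tables of 256 entries each (assumed layout, see text)."""
    table = [0] * (256 * n)
    # Sub-table 0: the ordinary byte-wise CRC-32 table.
    for b in range(256):
        c = b
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        table[b] = c
    # Sub-table k: sub-table k-1 advanced by one additional zero byte.
    for k in range(1, n):
        for b in range(256):
            prev = table[(k - 1) * 256 + b]
            table[k * 256 + b] = (prev >> 8) ^ table[prev & 0xFF]
    return table


def crc32_table_n_byte(data, table, n):
    """CRC-32 of `data`, consuming n bytes per loop iteration."""
    crc = 0xFFFFFFFF
    full = len(data) // n * n
    for i in range(0, full, n):
        # For n < 4 part of the old CRC survives the block; for n >= 4 this is 0.
        acc = crc >> (8 * n)
        for j in range(n):
            byte = data[i + j]
            if j < 4:
                # Only the first four bytes of a block see the running CRC.
                byte ^= (crc >> (8 * j)) & 0xFF
            # The first byte of the block uses the last sub-table, and vice versa.
            acc ^= table[(n - 1 - j) * 256 + byte]
        crc = acc
    # Leftover bytes: plain byte-wise update using sub-table 0.
    for i in range(full, len(data)):
        crc = table[(crc ^ data[i]) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


data = bytes(range(256)) * 5 + b"some tail bytes"
for n in (1, 2, 4, 8, 16, 32, 48, 64):
    assert crc32_table_n_byte(data, fill_table_n_byte(n), n) == zlib.crc32(data)
```

The inner `for j in range(n)` loop is exactly what the Mojo version spells out by hand: byte `j` of the block is looked up in sub-table `n - 1 - j`, so the last byte of the block hits sub-table 0 and the first byte hits sub-table `n - 1`.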

In [32]:
var little_endian_table_32_byte  = fill_table_n_byte[32]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_32_byte(test_list, little_endian_table_32_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [33]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)


fn run_32_table_16_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_16_byte(data, table)
    benchmark.keep(a)

fn run_32_table_32_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_32_byte(data, table)
    benchmark.keep(a)

fn bench():
    # 2**20 bytes (1 MiB) of random input data
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr=g, size=fill_size)

    alias rand_list = List[SIMD[DType.uint8, 1]](data=g, size=fill_size, capacity=fill_size)

    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

    alias little_endian_table_32_byte = fill_table_n_byte[32]()

    var report_8 = benchmark.run[run_32_table_32_byte[rand_list, little_endian_table_32_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_8)

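    # Relative speedup in percent: 100 * (t_baseline / t_variant - 1)
    print(100 * (report/report_2 - 1))    # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 - 1))  # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 - 1))    # 1-byte table vs. CRC32
    print(100 * (report/report_4 - 1))    # 2-byte table vs. CRC32
    print(100 * (report/report_5 - 1))    # 4-byte table vs. CRC32
    print(100 * (report/report_6 - 1))    # 8-byte table vs. CRC32
    print(100 * (report/report_7 - 1))    # 16-byte table vs. CRC32
    print(100 * (report/report_8 - 1))    # 32-byte table vs. CRC32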



    #report.print_full()
    #report_2.print_full()


bench()

1048576
10.113304680000001
8.1286328650000002
2.4219192299999999
1.5149720617715619
0.83514784549999999
0.52576874750000002
0.2494913165
0.1966320271299381
24.415813187301573
235.62774366344169
317.57398656106307
567.55717383816386
1110.9598000513552
1823.5271643832348
3953.5698083103425
5043.264211641851


In [34]:
fn CRC32_table_64_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 64

    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var val_5: UInt32 = (data[i + 19].cast[DType.uint32]() << 24) | 
                            (data[i + 18].cast[DType.uint32]() << 16) | 
                            (data[i + 17].cast[DType.uint32]() << 8) | 
                             data[i + 16].cast[DType.uint32]()

        var val_6: UInt32 = (data[i + 23].cast[DType.uint32]() << 24) | 
                            (data[i + 22].cast[DType.uint32]() << 16) | 
                            (data[i + 21].cast[DType.uint32]() << 8) | 
                             data[i + 20].cast[DType.uint32]()

        var val_7: UInt32 = (data[i + 27].cast[DType.uint32]() << 24) | 
                            (data[i + 26].cast[DType.uint32]() << 16) | 
                            (data[i + 25].cast[DType.uint32]() << 8) | 
                             data[i + 24].cast[DType.uint32]()

        var val_8: UInt32 = (data[i + 31].cast[DType.uint32]() << 24) | 
                            (data[i + 30].cast[DType.uint32]() << 16) | 
                            (data[i + 29].cast[DType.uint32]() << 8) | 
                             data[i + 28].cast[DType.uint32]()


        var val_9: UInt32 = (data[i + 35].cast[DType.uint32]() << 24) | 
                            (data[i + 34].cast[DType.uint32]() << 16) | 
                            (data[i + 33].cast[DType.uint32]() << 8) | 
                             data[i + 32].cast[DType.uint32]()

        var val_10: UInt32 =(data[i + 39].cast[DType.uint32]() << 24) | 
                            (data[i + 38].cast[DType.uint32]() << 16) | 
                            (data[i + 37].cast[DType.uint32]() << 8) | 
                             data[i + 36].cast[DType.uint32]()

        var val_11: UInt32 =(data[i + 43].cast[DType.uint32]() << 24) | 
                            (data[i + 42].cast[DType.uint32]() << 16) | 
                            (data[i + 41].cast[DType.uint32]() << 8) | 
                             data[i + 40].cast[DType.uint32]()

        var val_12: UInt32 =(data[i + 47].cast[DType.uint32]() << 24) | 
                            (data[i + 46].cast[DType.uint32]() << 16) | 
                            (data[i + 45].cast[DType.uint32]() << 8) | 
                             data[i + 44].cast[DType.uint32]()

        var val_13: UInt32 =(data[i + 51].cast[DType.uint32]() << 24) | 
                            (data[i + 50].cast[DType.uint32]() << 16) | 
                            (data[i + 49].cast[DType.uint32]() << 8) | 
                             data[i + 48].cast[DType.uint32]()

        var val_14: UInt32 =(data[i + 55].cast[DType.uint32]() << 24) | 
                            (data[i + 54].cast[DType.uint32]() << 16) | 
                            (data[i + 53].cast[DType.uint32]() << 8) | 
                             data[i + 52].cast[DType.uint32]()

        var val_15: UInt32 =(data[i + 59].cast[DType.uint32]() << 24) | 
                            (data[i + 58].cast[DType.uint32]() << 16) | 
                            (data[i + 57].cast[DType.uint32]() << 8) | 
                             data[i + 56].cast[DType.uint32]()

        var val_16: UInt32 =(data[i + 63].cast[DType.uint32]() << 24) | 
                            (data[i + 62].cast[DType.uint32]() << 16) | 
                            (data[i + 61].cast[DType.uint32]() << 8) | 
                             data[i + 60].cast[DType.uint32]()

        # Each val_k is already a UInt32, so no further cast is needed.
        # Only the first word is XORed with the running CRC.
        var index_1 = crc32 ^ val_1
        var index_2 = val_2
        var index_3 = val_3
        var index_4 = val_4
        var index_5 = val_5
        var index_6 = val_6
        var index_7 = val_7
        var index_8 = val_8
        var index_9 = val_9
        var index_10 = val_10
        var index_11 = val_11
        var index_12 = val_12
        var index_13 = val_13
        var index_14 = val_14
        var index_15 = val_15
        var index_16 = val_16

        crc32 = table[0*256 + int((index_16 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_16 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_16 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_16 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_15 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_15 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_15 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_15 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_14 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_14 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_14 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_14 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_13 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_13 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_13 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_13 >> 0).cast[DType.uint8]())] ^
                table[16*256 + int((index_12 >> 24).cast[DType.uint8]())] ^
                table[17*256 + int((index_12 >> 16).cast[DType.uint8]())] ^
                table[18*256 + int((index_12 >> 8).cast[DType.uint8]())] ^
                table[19*256 + int((index_12 >> 0).cast[DType.uint8]())] ^
                table[20*256 + int((index_11 >> 24).cast[DType.uint8]())] ^
                table[21*256 + int((index_11 >> 16).cast[DType.uint8]())] ^
                table[22*256 + int((index_11 >> 8).cast[DType.uint8]())] ^
                table[23*256 + int((index_11 >> 0).cast[DType.uint8]())] ^
                table[24*256 + int((index_10 >> 24).cast[DType.uint8]())] ^
                table[25*256 + int((index_10 >> 16).cast[DType.uint8]())] ^
                table[26*256 + int((index_10 >> 8).cast[DType.uint8]())] ^
                table[27*256 + int((index_10 >> 0).cast[DType.uint8]())] ^
                table[28*256 + int((index_9 >> 24).cast[DType.uint8]())] ^
                table[29*256 + int((index_9 >> 16).cast[DType.uint8]())] ^
                table[30*256 + int((index_9 >> 8).cast[DType.uint8]())] ^
                table[31*256 + int((index_9 >> 0).cast[DType.uint8]())] ^
                table[32*256 + int((index_8 >> 24).cast[DType.uint8]())] ^
                table[33*256 + int((index_8 >> 16).cast[DType.uint8]())] ^
                table[34*256 + int((index_8 >> 8).cast[DType.uint8]())] ^
                table[35*256 + int((index_8 >> 0).cast[DType.uint8]())] ^
                table[36*256 + int((index_7 >> 24).cast[DType.uint8]())] ^
                table[37*256 + int((index_7 >> 16).cast[DType.uint8]())] ^
                table[38*256 + int((index_7 >> 8).cast[DType.uint8]())] ^
                table[39*256 + int((index_7 >> 0).cast[DType.uint8]())] ^
                table[40*256 + int((index_6 >> 24).cast[DType.uint8]())] ^
                table[41*256 + int((index_6 >> 16).cast[DType.uint8]())] ^
                table[42*256 + int((index_6 >> 8).cast[DType.uint8]())] ^
                table[43*256 + int((index_6 >> 0).cast[DType.uint8]())] ^
                table[44*256 + int((index_5 >> 24).cast[DType.uint8]())] ^
                table[45*256 + int((index_5 >> 16).cast[DType.uint8]())] ^
                table[46*256 + int((index_5 >> 8).cast[DType.uint8]())] ^
                table[47*256 + int((index_5 >> 0).cast[DType.uint8]())] ^
                table[48*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[49*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[50*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[51*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[52*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[53*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[54*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[55*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[56*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[57*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[58*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[59*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[60*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[61*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[62*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[63*256 + int((index_1 >> 0).cast[DType.uint8]())] 
    
    # Process the leftover `extra` bytes one at a time,
    # using only the first 256 table entries.
    for i in range(size*length, size*length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff
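
Writing out 64 table lookups by hand is easy to get wrong. As a sketch of one way to avoid the typing, the Python helper below emits the lookup lines in the same pattern as above for any block size that is a multiple of 4. The helper names are hypothetical and the script only produces text; it says nothing about how Mojo compiles the result.

```python
def unrolled_lookup_lines(size):
    """Return the `table[...]` terms for a block of `size` bytes."""
    if size % 4 != 0:
        raise ValueError("size must be a multiple of 4")
    lines = []
    slot = 0
    # The last 32-bit word of the block uses the lowest sub-tables,
    # and within each word the highest byte comes first.
    for word in range(size // 4, 0, -1):
        for shift in (24, 16, 8, 0):
            lines.append(
                f"table[{slot}*256 + int((index_{word} >> {shift}).cast[DType.uint8]())]"
            )
            slot += 1
    return lines


def unrolled_crc_update(size, indent=" " * 16):
    """Join the terms into a single `crc32 = ... ^ ...` statement."""
    return "crc32 = " + (" ^\n" + indent).join(unrolled_lookup_lines(size))


print(unrolled_crc_update(8))
```

Calling `unrolled_crc_update(64)` gives the 64 terms used in `CRC32_table_64_byte`, from `index_16 >> 24` with sub-table 0 down to `index_1 >> 0` with sub-table 63.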

In [35]:
var little_endian_table_64_byte  = fill_table_n_byte[64]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_64_byte(test_list, little_endian_table_64_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [36]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)


fn run_32_table_16_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_16_byte(data, table)
    benchmark.keep(a)

fn run_32_table_32_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_32_byte(data, table)
    benchmark.keep(a)

fn run_32_table_64_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_64_byte(data, table)
    benchmark.keep(a)

fn bench():
    # 2**20 bytes (1 MiB) of random input data
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr=g, size=fill_size)

    alias rand_list = List[SIMD[DType.uint8, 1]](data=g, size=fill_size, capacity=fill_size)

    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

    alias little_endian_table_32_byte = fill_table_n_byte[32]()

    var report_8 = benchmark.run[run_32_table_32_byte[rand_list, little_endian_table_32_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_8)

    alias little_endian_table_64_byte = fill_table_n_byte[64]()

    var report_9 = benchmark.run[run_32_table_64_byte[rand_list, little_endian_table_64_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_9)

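    # Relative speedup in percent: 100 * (t_baseline / t_variant - 1)
    print(100 * (report/report_2 - 1))    # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 - 1))  # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 - 1))    # 1-byte table vs. CRC32
    print(100 * (report/report_4 - 1))    # 2-byte table vs. CRC32
    print(100 * (report/report_5 - 1))    # 4-byte table vs. CRC32
    print(100 * (report/report_6 - 1))    # 8-byte table vs. CRC32
    print(100 * (report/report_7 - 1))    # 16-byte table vs. CRC32
    print(100 * (report/report_8 - 1))    # 32-byte table vs. CRC32
    print(100 * (report/report_9 - 1))    # 64-byte table vs. CRC32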



    #report.print_full()
    #report_2.print_full()


bench()

1048576
8.8548864700000003
8.0418802500000002
1.9753312542648254
1.4341844980435996
0.85713391500000002
0.54524593349999995
0.27567692199999999
0.22807946937116008
0.27024903449999998
10.109653398532025
307.11552721282737
348.27349594565061
517.41613314598874
933.08086578279892
1524.0169666483173
3112.0521390615349
3782.3689367630877
3176.5654413466555
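
The raw numbers above are hard to read at a glance, so here is a small Python sketch that turns them into a labelled summary with throughput. The timings are copied from the output above; the labels are my own shorthand for the functions being timed, in the order in which `bench()` prints them.

```python
FILL_SIZE = 2**20  # bytes hashed per run, as printed on the first output line

# Mean runtimes in ms, copied from the benchmark output above.
timings_ms = {
    "CRC32": 8.8548864700000003,
    "CRC32_inv": 8.0418802500000002,
    "table_1_byte": 1.9753312542648254,
    "table_2_byte": 1.4341844980435996,
    "table_4_byte": 0.85713391500000002,
    "table_8_byte": 0.54524593349999995,
    "table_16_byte": 0.27567692199999999,
    "table_32_byte": 0.22807946937116008,
    "table_64_byte": 0.27024903449999998,
}


def summarize(timings, n_bytes=FILL_SIZE, baseline="CRC32"):
    """Return (name, ms, speedup in % over baseline, GiB/s) per entry."""
    rows = []
    for name, ms in timings.items():
        speedup_pct = 100 * (timings[baseline] / ms - 1)
        gib_per_s = n_bytes / (ms / 1000) / 2**30
        rows.append((name, ms, speedup_pct, gib_per_s))
    return rows


rows = summarize(timings_ms)
for name, ms, speedup_pct, gib_per_s in rows:
    print(f"{name:>14}: {ms:8.4f} ms  {speedup_pct:9.1f} %  {gib_per_s:6.2f} GiB/s")

fastest = min(timings_ms, key=timings_ms.get)
print("fastest:", fastest)
```

In this run the 32-byte variant comes out fastest, at a bit over 4 GiB/s, while the 64-byte variant drops back to roughly the level of the 16-byte one. One plausible reason, which I have not verified, is table size: 64 sub-tables of 256 four-byte entries add up to 64 KiB, twice the 32 KiB of the 32-byte table. Note also that the baseline `CRC32` time differs noticeably between the three benchmark runs in this section, so the percentages are only comparable within a single run.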


In [38]:
fn CRC32_table_48_byte(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 48

    var length = len(data)//size
    var extra = len(data) % size



    for i in range(start = 0, end = length*size, step = size):
        
        var val_1: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | 
                            (data[i + 2].cast[DType.uint32]() << 16) | 
                            (data[i + 1].cast[DType.uint32]() << 8) | 
                             data[i + 0].cast[DType.uint32]()

        var val_2: UInt32 = (data[i + 7].cast[DType.uint32]() << 24) | 
                            (data[i + 6].cast[DType.uint32]() << 16) | 
                            (data[i + 5].cast[DType.uint32]() << 8) | 
                             data[i + 4].cast[DType.uint32]()

        var val_3: UInt32 = (data[i + 11].cast[DType.uint32]() << 24) | 
                            (data[i + 10].cast[DType.uint32]() << 16) | 
                            (data[i + 9].cast[DType.uint32]() << 8) | 
                             data[i + 8].cast[DType.uint32]()

        var val_4: UInt32 = (data[i + 15].cast[DType.uint32]() << 24) | 
                            (data[i + 14].cast[DType.uint32]() << 16) | 
                            (data[i + 13].cast[DType.uint32]() << 8) | 
                             data[i + 12].cast[DType.uint32]()

        var val_5: UInt32 = (data[i + 19].cast[DType.uint32]() << 24) | 
                            (data[i + 18].cast[DType.uint32]() << 16) | 
                            (data[i + 17].cast[DType.uint32]() << 8) | 
                             data[i + 16].cast[DType.uint32]()

        var val_6: UInt32 = (data[i + 23].cast[DType.uint32]() << 24) | 
                            (data[i + 22].cast[DType.uint32]() << 16) | 
                            (data[i + 21].cast[DType.uint32]() << 8) | 
                             data[i + 20].cast[DType.uint32]()

        var val_7: UInt32 = (data[i + 27].cast[DType.uint32]() << 24) | 
                            (data[i + 26].cast[DType.uint32]() << 16) | 
                            (data[i + 25].cast[DType.uint32]() << 8) | 
                             data[i + 24].cast[DType.uint32]()

        var val_8: UInt32 = (data[i + 31].cast[DType.uint32]() << 24) | 
                            (data[i + 30].cast[DType.uint32]() << 16) | 
                            (data[i + 29].cast[DType.uint32]() << 8) | 
                             data[i + 28].cast[DType.uint32]()


        var val_9: UInt32 = (data[i + 35].cast[DType.uint32]() << 24) | 
                            (data[i + 34].cast[DType.uint32]() << 16) | 
                            (data[i + 33].cast[DType.uint32]() << 8) | 
                             data[i + 32].cast[DType.uint32]()

        var val_10: UInt32 =(data[i + 39].cast[DType.uint32]() << 24) | 
                            (data[i + 38].cast[DType.uint32]() << 16) | 
                            (data[i + 37].cast[DType.uint32]() << 8) | 
                             data[i + 36].cast[DType.uint32]()

        var val_11: UInt32 =(data[i + 43].cast[DType.uint32]() << 24) | 
                            (data[i + 42].cast[DType.uint32]() << 16) | 
                            (data[i + 41].cast[DType.uint32]() << 8) | 
                             data[i + 40].cast[DType.uint32]()

        var val_12: UInt32 =(data[i + 47].cast[DType.uint32]() << 24) | 
                            (data[i + 46].cast[DType.uint32]() << 16) | 
                            (data[i + 45].cast[DType.uint32]() << 8) | 
                             data[i + 44].cast[DType.uint32]()



        # Each val_k is already a UInt32, so no further cast is needed.
        # Only the first word is XORed with the running CRC.
        var index_1 = crc32 ^ val_1
        var index_2 = val_2
        var index_3 = val_3
        var index_4 = val_4
        var index_5 = val_5
        var index_6 = val_6
        var index_7 = val_7
        var index_8 = val_8
        var index_9 = val_9
        var index_10 = val_10
        var index_11 = val_11
        var index_12 = val_12

        crc32 = table[0*256 + int((index_12 >> 24).cast[DType.uint8]())] ^
                table[1*256 + int((index_12 >> 16).cast[DType.uint8]())] ^
                table[2*256 + int((index_12 >> 8).cast[DType.uint8]())] ^
                table[3*256 + int((index_12 >> 0).cast[DType.uint8]())] ^
                table[4*256 + int((index_11 >> 24).cast[DType.uint8]())] ^
                table[5*256 + int((index_11 >> 16).cast[DType.uint8]())] ^
                table[6*256 + int((index_11 >> 8).cast[DType.uint8]())] ^
                table[7*256 + int((index_11 >> 0).cast[DType.uint8]())] ^
                table[8*256 + int((index_10 >> 24).cast[DType.uint8]())] ^
                table[9*256 + int((index_10 >> 16).cast[DType.uint8]())] ^
                table[10*256 + int((index_10 >> 8).cast[DType.uint8]())] ^
                table[11*256 + int((index_10 >> 0).cast[DType.uint8]())] ^
                table[12*256 + int((index_9 >> 24).cast[DType.uint8]())] ^
                table[13*256 + int((index_9 >> 16).cast[DType.uint8]())] ^
                table[14*256 + int((index_9 >> 8).cast[DType.uint8]())] ^
                table[15*256 + int((index_9 >> 0).cast[DType.uint8]())] ^
                table[16*256 + int((index_8 >> 24).cast[DType.uint8]())] ^
                table[17*256 + int((index_8 >> 16).cast[DType.uint8]())] ^
                table[18*256 + int((index_8 >> 8).cast[DType.uint8]())] ^
                table[19*256 + int((index_8 >> 0).cast[DType.uint8]())] ^
                table[20*256 + int((index_7 >> 24).cast[DType.uint8]())] ^
                table[21*256 + int((index_7 >> 16).cast[DType.uint8]())] ^
                table[22*256 + int((index_7 >> 8).cast[DType.uint8]())] ^
                table[23*256 + int((index_7 >> 0).cast[DType.uint8]())] ^
                table[24*256 + int((index_6 >> 24).cast[DType.uint8]())] ^
                table[25*256 + int((index_6 >> 16).cast[DType.uint8]())] ^
                table[26*256 + int((index_6 >> 8).cast[DType.uint8]())] ^
                table[27*256 + int((index_6 >> 0).cast[DType.uint8]())] ^
                table[28*256 + int((index_5 >> 24).cast[DType.uint8]())] ^
                table[29*256 + int((index_5 >> 16).cast[DType.uint8]())] ^
                table[30*256 + int((index_5 >> 8).cast[DType.uint8]())] ^
                table[31*256 + int((index_5 >> 0).cast[DType.uint8]())] ^
                table[32*256 + int((index_4 >> 24).cast[DType.uint8]())] ^
                table[33*256 + int((index_4 >> 16).cast[DType.uint8]())] ^
                table[34*256 + int((index_4 >> 8).cast[DType.uint8]())] ^
                table[35*256 + int((index_4 >> 0).cast[DType.uint8]())] ^
                table[36*256 + int((index_3 >> 24).cast[DType.uint8]())] ^
                table[37*256 + int((index_3 >> 16).cast[DType.uint8]())] ^
                table[38*256 + int((index_3 >> 8).cast[DType.uint8]())] ^
                table[39*256 + int((index_3 >> 0).cast[DType.uint8]())] ^
                table[40*256 + int((index_2 >> 24).cast[DType.uint8]())] ^
                table[41*256 + int((index_2 >> 16).cast[DType.uint8]())] ^
                table[42*256 + int((index_2 >> 8).cast[DType.uint8]())] ^
                table[43*256 + int((index_2 >> 0).cast[DType.uint8]())] ^
                table[44*256 + int((index_1 >> 24).cast[DType.uint8]())] ^
                table[45*256 + int((index_1 >> 16).cast[DType.uint8]())] ^
                table[46*256 + int((index_1 >> 8).cast[DType.uint8]())] ^
                table[47*256 + int((index_1 >> 0).cast[DType.uint8]())] 
                
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)


    return crc32^0xffffffff

In [39]:
var little_endian_table_48_byte  = fill_table_n_byte[48]()
print(hex(CRC32(test_list)))
print(hex(CRC32_inv(test_list)))
print(hex(CRC32_table_48_byte(test_list, little_endian_table_48_byte)))

0x382aa34e
0x382aa34e
0x382aa34e


In [43]:
fn run_32[data: List[SIMD[DType.uint8, 1]] ]():
    var a =  CRC32(data)
    benchmark.keep(a)


fn run_32_inv[data: List[SIMD[DType.uint8, 1]] ]():
    var a = CRC32_inv(data)
    benchmark.keep(a)


fn run_32_table[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table(data, table)
    benchmark.keep(a)


fn run_32_table_2_byte_2[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_2_byte_2(data, table)
    benchmark.keep(a)

fn run_32_table_4_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_4_byte(data, table)
    benchmark.keep(a)

fn run_32_table_8_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_8_byte(data, table)
    benchmark.keep(a)


fn run_32_table_16_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_16_byte(data, table)
    benchmark.keep(a)

fn run_32_table_32_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_32_byte(data, table)
    benchmark.keep(a)

fn run_32_table_48_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_48_byte(data, table)
    benchmark.keep(a)

fn run_32_table_64_byte[data: List[SIMD[DType.uint8, 1]], table: List[UInt32]]():
    var a = CRC32_table_64_byte(data, table)
    benchmark.keep(a)

fn bench():
    # Benchmark input: 2**20 (1 MiB) random bytes
    alias fill_size = 2**20
    alias g = UnsafePointer[SIMD[DType.uint8, 1]].alloc(fill_size)
    rand[DType.uint8](ptr =  g, size = fill_size)

    alias rand_list = List[SIMD[DType.uint8,1]](data = g, size = fill_size, capacity = fill_size)


    print(len(rand_list))

    var report = benchmark.run[run_32[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report)

    var report_2 = benchmark.run[run_32_inv[rand_list]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_2)

 
    
    alias little_endian_table = fill_table_n_byte[1]()

    var report_3 = benchmark.run[run_32_table[rand_list, little_endian_table]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_3)

    alias little_endian_table_2_byte = fill_table_n_byte[2]()

    var report_4 = benchmark.run[run_32_table_2_byte_2[rand_list, little_endian_table_2_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_4)

    alias little_endian_table_4_byte = fill_table_n_byte[4]()

    var report_5 = benchmark.run[run_32_table_4_byte[rand_list, little_endian_table_4_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_5)

    alias little_endian_table_8_byte = fill_table_n_byte[8]()

    var report_6 = benchmark.run[run_32_table_8_byte[rand_list, little_endian_table_8_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_6)

    alias little_endian_table_16_byte = fill_table_n_byte[16]()

    var report_7 = benchmark.run[run_32_table_16_byte[rand_list, little_endian_table_16_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_7)

    alias little_endian_table_32_byte = fill_table_n_byte[32]()

    var report_8 = benchmark.run[run_32_table_32_byte[rand_list, little_endian_table_32_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_8)

    alias little_endian_table_48_byte = fill_table_n_byte[48]()

    var report_9 = benchmark.run[run_32_table_48_byte[rand_list, little_endian_table_48_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_9)

    alias little_endian_table_64_byte = fill_table_n_byte[64]()

    var report_10 = benchmark.run[run_32_table_64_byte[rand_list, little_endian_table_64_byte]](max_runtime_secs=0.5
    ).mean(benchmark.Unit.ms)
    print(report_10)

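    # Percentage speedups. All are relative to the naive CRC32 (report),
    # except the second, which compares CRC32_inv with the 1-byte table.
    print(100 * (report/report_2 -1))   # CRC32_inv vs. CRC32
    print(100 * (report_2/report_3 -1)) # 1-byte table vs. CRC32_inv
    print(100 * (report/report_3 -1))   # 1-byte table vs. CRC32
    print(100 * (report/report_4 -1))   # 2-byte table vs. CRC32
    print(100 * (report/report_5 -1))   # 4-byte table vs. CRC32
    print(100 * (report/report_6 -1))   # 8-byte table vs. CRC32
    print(100 * (report/report_7 -1))   # 16-byte table vs. CRC32
    print(100 * (report/report_8 -1))   # 32-byte table vs. CRC32
    print(100 * (report/report_9 -1))   # 48-byte table vs. CRC32
    print(100 * (report/report_10 -1))  # 64-byte table vs. CRC32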




    #report.print_full()
    #report_2.print_full()


bench()

1048576
9.3230161299999992
7.9815732300000004
2.0583052525862069
1.4192005800970875
0.74990954899999995
0.52478961849999994
0.26935500350000002
0.20227008232553972
0.41525093600000001
0.28250161699999998
16.806748010003524
287.77403011391829
352.94623420339974
556.92025924638585
1143.2187511723496
1676.5244969303828
3361.2374037447566
4509.1918403410982
2145.1523456650316
3200.163811097761
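
The raw output above is unlabeled, so here is a small Python sketch (not part of the Mojo benchmark) that recomputes the speedups from the mean timings and attaches a name to each. The labels are my reading of the order of the `print` calls in `bench()`, and `speedup_percent` is a helper name introduced purely for illustration.

```python
def speedup_percent(baseline_ms, candidate_ms):
    """Percentage by which the candidate is faster than the baseline."""
    return 100 * (baseline_ms / candidate_ms - 1)


# Mean runtimes in milliseconds, copied from the benchmark output above.
timings_ms = {
    "CRC32": 9.3230161299999992,
    "CRC32_inv": 7.9815732300000004,
    "table 1-byte": 2.0583052525862069,
    "table 2-byte": 1.4192005800970875,
    "table 4-byte": 0.74990954899999995,
    "table 8-byte": 0.52478961849999994,
    "table 16-byte": 0.26935500350000002,
    "table 32-byte": 0.20227008232553972,
    "table 48-byte": 0.41525093600000001,
    "table 64-byte": 0.28250161699999998,
}

baseline = timings_ms["CRC32"]
for name, ms in timings_ms.items():
    pct = speedup_percent(baseline, ms)
    print(f"{name:>14}: {ms:8.4f} ms  {pct:8.1f}% faster than CRC32")

# The variant with the lowest mean runtime in this particular run.
fastest = min(timings_ms, key=timings_ms.get)
print("fastest:", fastest)
```

Read this way, the 32-byte table is the fastest variant in this run, at roughly 46 times the speed of the naive `CRC32`, while the 48-byte and 64-byte tables are slower again.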


In [37]:
fn CRC32_table_8_byte_compact(owned data: List[SIMD[DType.uint8, 1]], table: List[UInt32]) -> SIMD[DType.uint32, 1]:
    var crc32: UInt32 = 0xffffffff

    var size = 8       # bytes consumed per iteration
    var step_size = 4  # bytes per 32-bit word, really just 32/8
    var units = size//step_size  # 32-bit words per iteration

    var length = len(data)//size
    var extra = len(data) % size

    # Scratch space for the 32-bit words of the current block
    var vals = List[UInt32](capacity=units)
    vals.size = units
    var n = 0
    for i in range(start = 0, end = length*size, step = size):
        for j in range(units):
            # Assemble four bytes into a little-endian 32-bit word
            vals[j] =   (data[i + j*step_size + 3].cast[DType.uint32]() << 24) | 
                        (data[i + j*step_size + 2].cast[DType.uint32]() << 16) | 
                        (data[i + j*step_size + 1].cast[DType.uint32]() << 8) | 
                        (data[i + j*step_size + 0].cast[DType.uint32]() << 0)

            # Only the first word of a block is combined with the running CRC
            if j == 0:
                vals[0] = vals[0]^crc32
                crc32 = 0

            # Earlier words have more bytes after them, so they index the
            # higher sub-tables; n counts down from size in steps of 4
            n = size - j*step_size
            crc32 = table[(n-4)*256 + int((vals[j] >> 24).cast[DType.uint8]())] ^
                    table[(n-3)*256 + int((vals[j] >> 16).cast[DType.uint8]())] ^
                    table[(n-2)*256 + int((vals[j] >> 8).cast[DType.uint8]())] ^
                    table[(n-1)*256 + int((vals[j] >> 0).cast[DType.uint8]())] ^ crc32

    # Process the bytes that do not fill a whole block, one at a time
    for i in range(size*length, size*length + extra ):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)

    return crc32^0xffffffff
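
As a sanity check on the indexing in the compact version, here is a minimal Python reference model of the same loop, compared against `zlib.crc32`. This is a sketch and not the Mojo implementation: `fill_table` and `crc32_sliced` are hypothetical names, and I am assuming the table layout is the usual slicing one, where entry `k*256 + b` holds the CRC of byte `b` followed by `k` zero bytes.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, as used by PNG and zlib


def fill_table(n_slices):
    """Flat table with n_slices * 256 entries (assumed slicing layout)."""
    table = [0] * (n_slices * 256)
    # Sub-table 0 is the ordinary one-byte lookup table.
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table[i] = crc
    # Sub-table k extends sub-table k-1 by one extra zero byte.
    for k in range(1, n_slices):
        for i in range(256):
            prev = table[(k - 1) * 256 + i]
            table[k * 256 + i] = (prev >> 8) ^ table[prev & 0xFF]
    return table


def crc32_sliced(data, table, size=8):
    """CRC-32 that consumes `size` bytes per iteration (size divisible by 4)."""
    crc = 0xFFFFFFFF
    units = size // 4
    full = (len(data) // size) * size
    for i in range(0, full, size):
        acc = 0
        for j in range(units):
            start = i + 4 * j
            word = int.from_bytes(data[start:start + 4], "little")
            if j == 0:
                word ^= crc  # only the first word sees the running CRC
            n = size - 4 * j
            acc ^= (table[(n - 4) * 256 + (word >> 24)]
                    ^ table[(n - 3) * 256 + ((word >> 16) & 0xFF)]
                    ^ table[(n - 2) * 256 + ((word >> 8) & 0xFF)]
                    ^ table[(n - 1) * 256 + (word & 0xFF)])
        crc = acc
    # Leftover bytes go through the one-byte table.
    for b in data[full:]:
        crc = table[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    sample = bytes(range(256)) * 5 + b"tail"
    for size in (4, 8, 16, 32):
        ok = crc32_sliced(sample, fill_table(size), size) == zlib.crc32(sample)
        print(size, ok)
```

Because `size` is an ordinary argument here, the same function covers the 4, 8, 16 and 32 byte cases, which is the kind of generalization the hand-unrolled Mojo functions above do not allow.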