# 4. SSTable to Arrow

We parse the SSTables on disk directly using [Kaitai Struct](https://kaitai.io/), then send the data to the client analytics program as an Arrow IPC stream.

Data transformations:

1. SSTable on disk
2. Deserialized into a Kaitai object in C++
3. Client makes a request to the server (not to the C* DB)
4. Kaitai object serialized into an Arrow IPC stream
5. Sent across the network
6. Arrow IPC stream received by the client
7. Transformed into an Arrow Table / cuDF DataFrame
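
Steps 4 to 6 depend on how the tables are framed on the wire. The client code in this notebook implies a simple length-prefixed layout: an 8-byte big-endian table count, then each table as an 8-byte big-endian size followed by that many bytes. The sketch below is not the actual C++ server. It is a standard-library-only Python illustration of that framing, with hypothetical helper names (`frame_tables`, `read_exact`, `read_framed`, `roundtrip`) and opaque byte strings standing in for the Arrow IPC streams.

```python
import socket
import struct
import threading


def frame_tables(payloads):
    """Frame payloads as: 8-byte count, then (8-byte size, bytes) per payload.

    All integers are unsigned and big-endian, matching what the client reads.
    """
    parts = [struct.pack(">Q", len(payloads))]
    for payload in payloads:
        parts.append(struct.pack(">Q", len(payload)))
        parts.append(payload)
    return b"".join(parts)


def read_exact(sock, n):
    """Read exactly n bytes from sock, or raise EOFError if the peer closes early."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError("connection closed before all bytes were read")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_framed(sock):
    """Inverse of frame_tables: read the count, then each sized payload."""
    (count,) = struct.unpack(">Q", read_exact(sock, 8))
    payloads = []
    for _ in range(count):
        (size,) = struct.unpack(">Q", read_exact(sock, 8))
        payloads.append(read_exact(sock, size))
    return payloads


def roundtrip(payloads):
    """Send framed payloads over a local socket pair and read them back."""
    server, client = socket.socketpair()
    with server, client:
        # Send from a thread so large payloads cannot block the reader.
        sender = threading.Thread(target=server.sendall, args=(frame_tables(payloads),))
        sender.start()
        received = read_framed(client)
        sender.join()
    return received
```

In the real pipeline each payload would be a complete Arrow IPC stream, so the client can hand every received buffer straight to `pa.ipc.open_stream`.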

**Pros:**
- doesn't make requests to the C* DB, which lessens its load and leaves it free to run other operations
- Kaitai Struct format is almost self-documenting and easier to maintain (e.g. for DSE SSTable format)
    - can also generate cool images by exporting to graphviz:
    - ![](assets/data.png)
    - ![](assets/statistics.png)
    - ![](assets/index.png)
    - ![](assets/summary.png)
- works especially well with serverless since the files are mostly in one place
- more flexible for future developments like parallelization with CUDA

~~Cons~~

In [None]:
import socket

import pyarrow as pa
from blazingsql import BlazingContext
import cudf

import numpy as np # for visualization purposes

HOST = '127.0.0.1'
PORT = 9143

In [None]:
def read_bytes(sock, n):
    data = b''
    while len(data) < n:
        more = sock.recv(n - len(data))
        if not more:
            raise EOFError("Socket connection ended before reading specified number of bytes")
        data += more
    return data

# read an unsigned 8-byte big-endian integer ("u8" in Kaitai Struct naming)
def read_u8(sock):
    data = read_bytes(sock, 8)
    return int.from_bytes(data, byteorder='big')

# read data from socket
def fetch_data():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((HOST, PORT))
        sock.sendall(b'hello world\n')
        num_tables = read_u8(sock)
        table_buffers = []
        for i in range(num_tables):
            print('receiving table', i)
            table_size = read_u8(sock)
            buf = read_bytes(sock, table_size)
            table_buffers.append(buf)
    return table_buffers

In [None]:
buffers = fetch_data()
tables = [pa.ipc.open_stream(buf).read_all() for buf in buffers]
len(tables)

In [None]:
# turn the first Arrow table into a cuDF DataFrame
gdf = cudf.DataFrame.from_arrow(tables[0])
gdf

In [None]:
bc = BlazingContext()
bc.create_table("gpu_table", gdf)
bc.describe_table("gpu_table")
result = bc.sql("SELECT * FROM gpu_table")
result