# 4. SSTable to Arrow

We parse the SSTable files on disk ourselves using [Kaitai Struct](https://kaitai.io/) and then send the data to the client analytics program through an Arrow IPC stream.

Data transformations:

1. SSTable on disk
2. Deserialized into Kaitai object in C++
3. Client makes request to server (not to C* DB)
4. Kaitai object serialized via Arrow IPC stream
5. Sent across network
6. Arrow IPC stream received by client
7. Transformed into Arrow Table / cuDF

**Pros:**
- doesn't make requests to the C* DB, which reduces its load and leaves it free to run other operations
- the Kaitai Struct format is almost self-documenting and easier to maintain (e.g. for the DSE SSTable format)
    - can also generate cool diagrams by exporting to Graphviz:
    - ![](assets/data.png)
    - ![](assets/statistics.png)
    - ![](assets/index.png)
    - ![](assets/summary.png)
- works especially well with serverless setups, since most of the files are in one place
- more flexible for future developments like parallelization with CUDA

~~Cons~~

In [4]:
import pyarrow as pa
from blazingsql import BlazingContext
import cudf
from utils import fetch_data

In [39]:
# fetch the Arrow IPC buffers (one per table)
buffers = fetch_data()
# read each IPC stream fully into a pyarrow Table
tables = [pa.ipc.open_stream(buf).read_all() for buf in buffers]

receiving table 0


1

In [40]:
# flatten struct columns, then convert the first Arrow table into a cuDF DataFrame
gdf = cudf.DataFrame.from_arrow(tables[0].flatten())
gdf

Unnamed: 0,partition_key_part1.org.apache.cassandra.db.marshal.UUIDType,partition_key_part1.org.apache.cassandra.db.marshal.UTF8Type,partition_key_part2.org.apache.cassandra.db.marshal.UUIDType,partition_key_part2.org.apache.cassandra.db.marshal.UTF8Type,_ts_row_liveness,_del_time_row_liveness,_ttl_row_liveness,_local_del_time_partition,_marked_for_del_at_partition,_local_del_time_row,...,_ttl_data,sensor_value,_ts_sensor_value,_del_time_sensor_value,_ttl_sensor_value,station_id_part1,station_id_part2,_ts_station_id,_del_time_station_id,_ttl_station_id
0,1828142208147734908,dispersion,11081545588760870504,dispersion,1970-01-01 00:00:00.002,,,,,,...,,95.759791,,,,2945182322382029771,10904053516202300378,,,
1,8329893407965204367,solubility,12954906978328135592,solubility,1970-01-01 00:00:00.009,,,,,,...,,106.497951,,,,2945182322382029771,10904053516202300378,,,
2,4678114788215243590,fitness,11367188723867789301,fitness,1970-01-01 00:00:00.002,,,,,,...,,100.090271,,,,2945182322382029771,10904053516202300378,,,
3,1127822330704970463,phase_offset,10548166629717554178,phase_offset,1970-01-01 00:00:00.001,,,,,,...,,94.181632,,,,2945182322382029771,10904053516202300378,,,
4,4544790122793091762,periodicity,11718901302575831064,periodicity,1970-01-01 00:00:00.000,,,,,,...,,99.909094,,,,2945182322382029771,10904053516202300378,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,8122384078079413211,color_of_rust,9424322085127913799,color_of_rust,1970-01-01 00:00:00.003,,,,,,...,,105.890743,,,,2945182322382029771,10904053516202300378,,,
996,7192109222758598265,90pctile,9845011310262641079,90pctile,1970-01-01 00:00:00.006,,,,,,...,,103.857088,,,,2945182322382029771,10904053516202300378,,,
997,1658863039434342490,brightness,9747043485374406081,brightness,1970-01-01 00:00:00.000,,,,,,...,,95.420393,,,,2945182322382029771,10904053516202300378,,,
998,7120074136673206703,ratio,13437665513156663070,ratio,1970-01-01 00:00:00.001,,,,,,...,,103.726586,,,,2945182322382029771,10904053516202300378,,,


In [None]:
# register the cuDF DataFrame as a BlazingSQL table and query it with SQL
bc = BlazingContext()
bc.create_table("gpu_table", gdf)
bc.describe_table("gpu_table")
result = bc.sql("SELECT * FROM gpu_table")
result