# 3. Cassandra to Arrow

We reuse code from the Cassandra server to read the SSTable, but instead of serializing the data to CQL and deserializing it again on the client, we send it as an [Arrow IPC stream](http://arrow.apache.org/). Arrow stores data in a columnar format, which is better suited to analytics.

Data transformations:

1. SSTable on disk
2. Deserialized into Java objects by the C* server code
3. Client makes a request to the server (not to the C* DB)
4. Data serialized as an Arrow IPC stream
5. Stream sent across the network
6. Arrow IPC stream received by the client
7. Transformed into an Arrow Table / cuDF

**Pros:**
- doesn't make requests to the main Cassandra DB, which reduces its load and leaves it free to run other operations
- involves less serialization and deserialization, since the data travels as an Arrow IPC stream

**Cons:**
- requires starting Cassandra and running the JVM, which we would rather avoid
- complex architecture

In [1]:
import pyarrow as pa
import pandas as pd
import socket

HOST = '127.0.0.1'
PORT = 9143

In [2]:
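def read_bytes(sock, n):
    """Read exactly n bytes from sock, looping because recv() may return fewer."""
    data = b''
    while len(data) < n:
        more = sock.recv(n - len(data))
        if not more:
            # recv() returns b'' once the peer has closed the connection
            raise EOFError("Socket connection ended before reading specified number of bytes")
        data += more
    return data

def read_u8(sock):
    """Read an 8-byte big-endian unsigned integer (the 8 is a byte count, not a bit width)."""
    data = read_bytes(sock, 8)
    return int.from_bytes(data, byteorder='big')

def fetch_data():
    """Connect to the server and return the raw Arrow IPC bytes of each table."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((HOST, PORT))
        # send the request message
        sock.sendall(b'hello world\n')
        # the reply starts with the table count, then one (size, payload) pair per table
        num_tables = read_u8(sock)
        table_buffers = []
        for i in range(num_tables):
            print('receiving table', i)
            table_size = read_u8(sock)
            buf = read_bytes(sock, table_size)
            table_buffers.append(buf)
    return table_buffers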
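
The client code above implies a simple length-prefixed framing: an 8-byte big-endian table count, followed by an 8-byte size and the payload for each table. The sketch below exercises that framing without a Cassandra server. It is an illustration only: `send_frames`, `recv_exact`, `recv_frames` and `demo` are hypothetical names, the sender is a guess at what a matching server would emit rather than the real server code, and the payloads are placeholder bytes instead of Arrow data.

```python
import socket
import struct
import threading

def send_frames(sock, payloads):
    """Assumed server side: a count, then (size, bytes) for each payload."""
    sock.sendall(struct.pack('>Q', len(payloads)))  # '>Q' = 8-byte big-endian unsigned
    for payload in payloads:
        sock.sendall(struct.pack('>Q', len(payload)))
        sock.sendall(payload)

def recv_exact(sock, n):
    """Read exactly n bytes, or raise EOFError if the peer closes early."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError('connection closed with %d bytes still expected' % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)

def recv_frames(sock):
    """Client side: mirror of send_frames."""
    (count,) = struct.unpack('>Q', recv_exact(sock, 8))
    frames = []
    for _ in range(count):
        (size,) = struct.unpack('>Q', recv_exact(sock, 8))
        frames.append(recv_exact(sock, size))
    return frames

def demo():
    """Send placeholder payloads over a local socket pair and compare."""
    server_end, client_end = socket.socketpair()
    payloads = [b'first table', b'', b'x' * 100_000]
    # send from a thread so a large payload cannot block the reader
    sender = threading.Thread(target=send_frames, args=(server_end, payloads))
    sender.start()
    with client_end:
        received = recv_frames(client_end)
    sender.join()
    server_end.close()
    return received == payloads
```

Calling `demo()` should return `True`. Collecting chunks in a list and joining once avoids repeatedly copying a growing `bytes` object, which can matter when a single table is large.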

In [24]:
buffers = fetch_data()
tables = [pa.ipc.open_stream(buf).read_all() for buf in buffers]
len(tables)

receiving table 0
receiving table 1
receiving table 2


3

In [25]:
tables[0].to_pandas()

| | partition key | _ts_row | _del_time_row | _ttl_row | count | _ts_count | _del_time_count | _ttl_count | size | _ts_size | _del_time_size | _ttl_size | names | _ts_names | _del_time_names | _ttl_names | types | _ts_types | _del_time_types | _ttl_types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bar | 2021-07-20 19:45:23.434589 | NaT | NaT | 456.0 | NaT | NaT | NaT | | NaT | NaT | NaT | | | | | [cyan, green, red, violet, yellow] | [NaT, 2021-07-20T23:50:27.469790, NaT, 2021-07... | [NaT, NaT, NaT, NaT, NaT] | [NaT, NaT, NaT, NaT, NaT] |
| 1 | asdfqwer | 2021-07-20 19:12:09.531761 | NaT | NaT | 30.0 | NaT | NaT | NaT | big | NaT | NaT | NaT | | | | | [blue, orange, red] | [NaT, NaT, NaT] | [NaT, NaT, NaT] | [NaT, NaT, NaT] |
| 2 | foo | 2021-07-20 19:45:00.539860 | NaT | NaT | 123.0 | NaT | NaT | NaT | small | NaT | NaT | NaT | [jane, joe] | [2021-07-20T23:51:22.317024, 2021-07-20T23:51:... | [NaT, NaT] | [NaT, NaT] | | | | |
| 3 | emptyrow | 2021-07-20 19:45:39.919177 | NaT | NaT | | NaT | NaT | NaT | | NaT | NaT | NaT | | | | | | | | |
| 4 | testlist | 2021-07-20 19:43:40.532672 | NaT | NaT | | NaT | NaT | NaT | | NaT | NaT | NaT | [bob, alice] | [NaT, NaT] | [NaT, NaT] | [NaT, NaT] | | | | |
| 5 | zxcvasdf | 2021-07-20 19:12:34.702061 | NaT | NaT | 26.0 | NaT | NaT | NaT | big | NaT | NaT | NaT | | | | | [blue, purple] | [NaT, NaT] | [NaT, NaT] | [NaT, NaT] |


In [26]:
tables[1].to_pandas()

| | partition key | _ts_row | _del_time_row | _ttl_row | count | _ts_count | _del_time_count | _ttl_count |
|---|---|---|---|---|---|---|---|---|
| 0 | test | 2021-07-23 17:19:42.905355 | 1970-01-01 00:27:07.120782 | 0 days 16:40:00 | 13 | NaT | NaT | NaT |


In [27]:
tables[2].to_pandas()

| | partition key | _ts_row | _del_time_row | _ttl_row |
|---|---|---|---|---|
| 0 | testlist | 2021-07-23 17:23:42.355027 | 1970-01-01 00:27:07.061022 | NaT |
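
The `_del_time_row` values in the last two outputs fall in January 1970, which suggests an epoch value in seconds being displayed as if it were in microseconds. That reading is an inference from the printed output, not something this notebook confirms. Under that assumption, the intended time can be recovered with a small helper (`rescale_del_time` is a hypothetical name):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def rescale_del_time(displayed):
    """Reinterpret a displayed timestamp's microsecond count as whole seconds.

    Assumes the raw value was epoch seconds that got read as microseconds.
    """
    raw = (displayed - EPOCH) // timedelta(microseconds=1)
    return EPOCH + timedelta(seconds=raw)

# values copied from the tables[1] and tables[2] outputs
expiring = rescale_del_time(datetime(1970, 1, 1, 0, 27, 7, 120782))
deleted = rescale_del_time(datetime(1970, 1, 1, 0, 27, 7, 61022))

# write time plus TTL for the tables[1] row, truncated to whole seconds
ts_row = datetime(2021, 7, 23, 17, 19, 42, 905355)
ttl_row = timedelta(hours=16, minutes=40)
expected_expiry = (ts_row + ttl_row).replace(microsecond=0)
```

For `tables[1]` the rescaled value is 2021-07-24 09:59:42, which equals `_ts_row` plus `_ttl_row` truncated to the second. For `tables[2]` it is 2021-07-23 17:23:42, matching that row's `_ts_row` to the second. Both results are consistent with the seconds-versus-microseconds reading, though they do not prove it.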
