danielbeach/ArrowFlightWithDeltaLake

Arrow Flight with Delta Lake

Minimal example of serving a Delta Lake table over Apache Arrow Flight (gRPC). Demonstrates how to bridge delta-rs and PyArrow Flight so clients can stream columnar data without a Spark or cloud dependency.

What it does

  • Reads a CSV of Divvy bike-share trips into a Delta Lake table (one-time setup)
  • Serves that table over Arrow Flight on grpc://localhost:5005
  • Client connects, discovers available datasets, fetches all rows as Arrow record batches, and prints summary stats

Architecture

setup_delta.py        CSV → Delta Lake (Parquet on disk)
server.py             Delta Lake → Arrow Flight gRPC server
client.py             Arrow Flight client → in-memory Arrow table
main.py               CLI entrypoint (setup | server | client)
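The three subcommands could be wired up with argparse roughly as follows. This is a minimal sketch, not the repository's actual main.py: the names COMMANDS, build_parser, and parse_command are hypothetical, and the real entrypoint would call into setup_delta.py, server.py, or client.py instead of just returning the command name.

```python
import argparse

# Hypothetical help text for the three subcommands named in the README.
COMMANDS = {
    "setup": "convert the CSV into a Delta Lake table",
    "server": "start the Arrow Flight server",
    "client": "fetch the table and print summary stats",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts exactly one of: setup, server, client."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        subparsers.add_parser(name, help=help_text)
    return parser


def parse_command(argv: list[str]) -> str:
    """Return the selected subcommand; argparse exits on anything else."""
    return build_parser().parse_args(argv).command
```

A real entrypoint would then dispatch on the returned name, for example by mapping each command to the function that runs it.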

Arrow Flight concepts used

Concept            Role
FlightServerBase   Base class: handles gRPC; you implement the 3 RPCs
list_flights       Discovery: returns available datasets + their schema
get_flight_info    Returns schema + endpoint for a named dataset
do_get             Streams record batches to the client
Ticket             Opaque bytes the client holds and returns on do_get; the server encodes whatever it needs (dataset name, partition filter, etc.)
FlightInfo         Schema + endpoints + row/byte counts returned to client
GeneratorStream    Wraps a Python batch generator into the Flight streaming protocol

Ticket design: In this implementation a ticket is just b"divvy_trips", meaning the whole dataset. A ticket could instead encode partition filters (e.g. b"divvy_trips/year=2024") so that clients can request slices; the server would decode the ticket and apply the filters before scanning.

Data flow

client.list_flights()
  → server yields FlightInfo (schema + Ticket)

client.do_get(ticket)
  → server opens DeltaTable
  → scans Parquet files in 65,536-row batches
  → streams Arrow RecordBatches over gRPC
  → client reassembles into Arrow Table

Requirements

  • Python 3.12+
  • pyarrow >= 16.0.0
  • deltalake >= 0.10.0

Install with uv:

uv sync

Usage

1. Download data

Download a Divvy monthly CSV (e.g. 202604-divvy-tripdata.csv) from the Divvy trip data page and place it at:

data/202604-divvy-tripdata.csv

2. Setup — convert CSV to Delta Lake

python main.py setup

3. Start the Flight server

python main.py server

4. Run the client (separate terminal)

python main.py client

The client prints the total row count, a breakdown by rideable_type, and the member/casual split.
