Note: The
mainbranch contains content for the second edition of How Query Engines Work, due to be published in 2026. For the original project from the first edition, please see thefirst-editionbranch.
This is the companion repository for the book How Query Engines Work and contains source code for a simple in-memory query engine implemented in Kotlin.
The query engine is designed to be easy to learn and hack on rather than being optimized for performance, scalability, or robustness.
The query engine contains the following components:
- Logical Plan
- DataFrame API
- SQL Parser
- SQL Query Planner
- Logical Query Optimizer
- Physical Plan
- Arrow Flight Server
- JDBC Driver
The following operators are supported:
- Table Scan (Parquet and CSV)
- Projection
- Filter
- Hash Aggregate
- Hash Join
- Limit
The following expressions are supported:
- Literals
- Attributes
- Simple Aggregates (Min, Max, Sum, Count, Avg)
- Cast
- Boolean expressions (AND, OR, NOT)
- Simple math expressions (+, -, *, /)
- Java 11+
./gradlew build # Linux/macOS
gradlew.bat build # WindowsRun tests:
./gradlew testInstall to local Maven repository:
./gradlew publishToMavenLocalThe query engine includes an Arrow Flight server that allows remote query execution via gRPC.
Start the Flight server:
./gradlew :flight-server:runThis starts the Flight server listening on 0.0.0.0:50051.
Client usage:
The client module provides a Client class that can connect to the Flight server and submit logical plans:
val client = Client("localhost", 50051)
client.execute(logicalPlan)The client serializes the logical plan to protobuf, sends it to the server via Arrow Flight, and receives results as Arrow record batches.
The benchmark runs an aggregate SQL query against NYC taxi trip CSV files in parallel.
Build the benchmark:
./gradlew :benchmarks:installDistRun the benchmark:
BENCH_PATH=/path/to/csv/files \
BENCH_RESULT_FILE=/path/to/results.csv \
./benchmarks/build/install/benchmarks/bin/benchmarksBENCH_PATH- Directory containing NYC taxi trip CSV filesBENCH_RESULT_FILE- Output file for benchmark timing results
There is also a Docker-based benchmark setup in the docker/ directory.
The benchmarks module includes a TPC-H query runner that executes SQL queries against TPC-H data in Parquet format.
Build the benchmarks:
./gradlew :benchmarks:installDistRun a TPC-H query:
./benchmarks/build/install/benchmarks/bin/benchmarks /path/to/query.sql /path/to/tpch/dataArguments:
- First argument: Path to a SQL file containing the query
- Second argument: Path to directory containing TPC-H Parquet files
The data directory should contain these files:
customer.parquetlineitem.parquetnation.parquetorders.parquetpart.parquetpartsupp.parquetregion.parquetsupplier.parquet
Some of the examples in the book use the yellow_tripdata_2019-12.csv data set. The original data set
is no longer available from the original location (because the data is now, quite sensibly, provided
in Parquet format), but copies can be still be found on the internet.
As of December 2025, copies can be found at the following locations: