DataFusion: Modern Distributed Compute Platform implemented in Rust
DataFusion is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.
See my article How To Build a Modern Distributed Compute Platform to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building distributed systems but there are plenty of better choices if you need something mature and supported.
The original POC no longer works due to changes in Rust nightly since 11/3/18 and since then I have been contributing more code to the Apache Arrow project and decided to start implementing DataFusion from scratch based on that latest Arrow code and incorporating lessons learned from the first attempt. The original POC code is is now on the original_poc branch and supports single threaded SQL execution against Parquet and CSV files using Apache Arrow as the memory model.
The current task list:
- Delete existing code and update the README with the new plan
- Implement serializable logical query plan
- Implement data source for CSV
- Implement data source for Parquet
- Implement query execution: Projection
- Implement query execution: Selection
- Implement query execution: Sort
- Implement query execution: Aggregate
- Implement query execution: Scalar UDFs
- Implement query execution: Array UDFs
- Implement query execution: StructArray
- Implement parallel query execution (multithreaded, single process)
- Generate query plan from SQL
- Execute SQL against DataSource
- Implement worker node that can receive a query plan, execute the query, and return a result in Arrow IPC format
- Kubernetes support for spinning up worker nodes
- Implement distributed query execution
- Rust nightly (required by
There is a Gitter channel where you can ask questions about the project or make feature suggestions too.
Contributors are welcome! Please see CONTRIBUTING.md for details.