Modern distributed compute platform using Apache Arrow as the memory model
Switch branches/tags
Nothing to show
Clone or download

README.md

DataFusion: Modern Distributed Compute Platform implemented in Rust

License Version Build Status Coverage Status Gitter chat

DataFusion is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.

See my article How To Build a Modern Distributed Compute Platform to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building distributed systems but there are plenty of better choices if you need something mature and supported.

Status

The original POC no longer works due to changes in Rust nightly since 11/3/18 and since then I have been contributing more code to the Apache Arrow project and decided to start implementing DataFusion from scratch based on that latest Arrow code and incorporating lessons learned from the first attempt. The original POC code is is now on the original_poc branch and supports single threaded SQL execution against Parquet and CSV files using Apache Arrow as the memory model.

The current task list:

  • Delete existing code and update the README with the new plan
  • Implement serializable logical query plan
  • Implement data source for CSV
  • Implement data source for Parquet
  • Implement query execution: Projection
  • Implement query execution: Selection
  • Implement query execution: Sort
  • Implement query execution: Aggregate
  • Implement query execution: Scalar UDFs
  • Implement query execution: Array UDFs
  • Implement query execution: StructArray
  • Implement parallel query execution (multithreaded, single process)
  • Generate query plan from SQL
  • Execute SQL against DataSource
  • Implement worker node that can receive a query plan, execute the query, and return a result in Arrow IPC format
  • Kubernetes support for spinning up worker nodes
  • Implement distributed query execution

Prerequisites

  • Rust nightly (required by parquet-rs crate)

Building DataFusion

See BUILDING.md.

Gitter

There is a Gitter channel where you can ask questions about the project or make feature suggestions too.

Contributing

Contributors are welcome! Please see CONTRIBUTING.md for details.