SQL Query Execution against Apache Arrow, in Rust
Latest commit 2c610f4 Nov 11, 2018


DataFusion: SQL Query Execution in Rust


DataFusion is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.

See my article "How To Build a Modern Distributed Compute Platform" to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building distributed systems, but there are plenty of better choices if you need something mature and supported.

The following features are currently supported:

  • SQL Parser, Planner and Optimizer
  • DataFrame API
  • Columnar processing using Apache Arrow
  • Support for local CSV and Apache Parquet files
  • Single-threaded execution of SQL queries, supporting:
    • Projection
    • Selection
    • Scalar Functions
    • Aggregates (Min, Max, Count)
    • Grouping
  • User-defined Scalar Functions (UDFs)
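To make the relational operators in the list above concrete, here is a minimal sketch of selection, a scalar function, and grouped Min/Max/Count aggregates over column-oriented data. It uses plain `Vec` columns from the Rust standard library rather than Arrow arrays, and every name in it (`Batch`, `Agg`, `select_temp_gt`, `celsius_to_fahrenheit`, `group_min_max_count`) is hypothetical and chosen for illustration. It is not DataFusion's API and says nothing about how DataFusion implements these operators internally.

```rust
use std::collections::BTreeMap;

/// A tiny columnar batch: one Vec per column, all of equal length.
/// Hypothetical type for illustration; not an Arrow or DataFusion type.
pub struct Batch {
    pub city: Vec<String>,
    pub temp: Vec<f64>,
}

/// Per-group aggregate state for Min, Max and Count.
#[derive(Debug, Clone, PartialEq)]
pub struct Agg {
    pub min: f64,
    pub max: f64,
    pub count: usize,
}

/// Selection: evaluate `WHERE temp > threshold` into a boolean mask,
/// one entry per row, without copying any column data.
pub fn select_temp_gt(batch: &Batch, threshold: f64) -> Vec<bool> {
    batch.temp.iter().map(|t| *t > threshold).collect()
}

/// Scalar function: applied element-wise to a single column.
pub fn celsius_to_fahrenheit(col: &[f64]) -> Vec<f64> {
    col.iter().map(|c| c * 9.0 / 5.0 + 32.0).collect()
}

/// Grouping and aggregates: `GROUP BY key` with MIN, MAX and COUNT of
/// `values`, considering only rows whose mask entry is true.
/// Projection is implicit here: the caller passes only the columns it needs.
pub fn group_min_max_count(
    keys: &[String],
    values: &[f64],
    mask: &[bool],
) -> BTreeMap<String, Agg> {
    let mut groups: BTreeMap<String, Agg> = BTreeMap::new();
    for ((key, value), keep) in keys.iter().zip(values).zip(mask) {
        if !*keep {
            continue; // row was filtered out by the selection
        }
        groups
            .entry(key.clone())
            .and_modify(|a| {
                a.min = a.min.min(*value);
                a.max = a.max.max(*value);
                a.count += 1;
            })
            .or_insert(Agg { min: *value, max: *value, count: 1 });
    }
    groups
}
```

In SQL terms, calling `select_temp_gt(&batch, 0.0)` and passing the resulting mask to `group_min_max_count(&batch.city, &batch.temp, &mask)` corresponds roughly to `SELECT city, MIN(temp), MAX(temp), COUNT(temp) FROM t WHERE temp > 0 GROUP BY city`. Working on whole columns rather than row structs is the property that a columnar memory model such as Arrow is designed around.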

DataFusion can be used as a crate dependency in your project to add SQL support for custom data sources.

A Docker image is also available if you just want to run SQL queries against your CSV and Parquet files.

Project Home Page

The project home page is now at https://datafusion.rs and contains the roadmap as well as documentation for using this crate. I am using GitHub issues to track development tasks and feedback.

Prerequisites

  • Rust nightly (required by the parquet-rs crate)

Building DataFusion

See BUILDING.md.

Gitter

There is a Gitter channel where you can ask questions about the project or suggest features.

Contributing

Contributors are welcome! Please see CONTRIBUTING.md for details.