Merged
31 changes: 10 additions & 21 deletions README.md
@@ -24,7 +24,16 @@

This is a Python library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).

-DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems.
+DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples:
+
+- [Dask SQL](https://github.com/dask-contrib/dask-sql) uses DataFusion's Python bindings for SQL parsing, query
+planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
+- [DataFusion Ballista](https://github.com/apache/arrow-ballista) is a distributed SQL query engine that extends
+DataFusion's Python bindings for distributed use cases.
+
+It is also possible to use these Python bindings directly for DataFrame and SQL operations, but you may find that
+[Polars](http://pola.rs/) and [DuckDB](http://www.duckdb.org/) are more suitable for this use case, since they have
+more of an end-user focus and are more actively maintained than these Python bindings.

## Features

@@ -35,20 +44,6 @@ DataFusion's Python bindings can be used as an end-user tool as well as providin
- Serialize and deserialize query plans in Substrait format.
- Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.

-## Comparison with other projects
-
-Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable
-for your needs:
-
-- [DuckDB](http://www.duckdb.org/) is an open source, in-process analytic database. Like DataFusion, it supports
-very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is
-written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than
-as a library for building such database systems.
-
-- [Polars](http://pola.rs/) is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it
-is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL
-support, nor as many extension points.
-
## Example Usage

The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results
@@ -143,12 +138,6 @@ See [examples](examples/README.md) for more information.

- [Serialize query plans using Substrait](./examples/substrait.py)

-### Executing SQL against DataFrame Libraries (Experimental)
-
-- [Executing SQL on Polars](./examples/sql-on-polars.py)
-- [Executing SQL on Pandas](./examples/sql-on-pandas.py)
-- [Executing SQL on cuDF](./examples/sql-on-cudf.py)
-
## How to install (from pip)

### Pip
142 changes: 0 additions & 142 deletions datafusion/context.py

This file was deleted.

97 changes: 0 additions & 97 deletions datafusion/cudf.py

This file was deleted.

93 changes: 0 additions & 93 deletions datafusion/pandas.py

This file was deleted.
