Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
108 lines (78 sloc) 4 KB

Backends

Blaze backends include projects like streaming Python, Pandas, SQLAlchemy, MongoDB, PyTables, and Spark. Most Blaze expressions can run well on any of these backends, allowing developers to easily transition their computation to changing performance needs.

Existing Backends

Streaming Python

via toolz and cytoolz

Blaze can operate on core Python data structures like lists, tuples, ints and strings. This can be useful both in small cases like rapid prototyping or unit testing but also in large cases where streaming computation is desired.

The performance of Python data structures like dict make Python a surprisingly powerful platform for data-structure bound computations commonly found in split-apply-combine and join operations. Additionally, Python's support for lazy iterators (i.e. generators) means that it can easily support streaming computations that pull data in from disk, taking up relatively little memory.

Pandas

Pandas DataFrames are the gold standard for in-memory data analytics. They are fast, intuitive, and come with a wealth of additional features like plotting, and data I/O.

SQLAlchemy

Blaze levarages the SQLAlchemy project, which provides a uniform interface over the varied landscape of SQL systems. Blaze manipulates SQLAlchemy expressions which are then compiled down to SQL query strings of the appropriate backend.

The prevalance of SQL among data technologies makes this backend particularly useful. Databases like Impala and Hive have SQLAlchemy dialects, enabling easy Blaze interoperation.

MongoDB

Blaze drives MongoDB through the pymongo interface and is able to use many of the built in operations such as aggregration and group by.

PyTables

PyTables provides compressed Table objects backed by the popular HDF5 library. Blaze can compute simple expressions using PyTables, such as elementwise operations and row-wise selections.

Spark

Spark provides resilient distributed in-memory computing and easy access to HDFS storage. Blaze drives Spark through the PySpark interface.

Benefits of Backend Agnostic Computation

For maximum performance and expressivity it is best to use the backends directly. Blaze is here when absolute customization is not required.

Familiarity

Users within the numeric Python ecosystem may be familiar with the NumPy and Pandas interfaces but relatively unfamiliar with SQL or the functional idioms behind Spark or Streaming Python. In this case Blaze provides a familiar interface which can drive common computations in these more exotic backends.

Prototyping and Testing

Blaze allows you to prototype and test your computation on a small dataset using Python or Pandas and then scale that computation up to larger computational systems with confidence that nothing will break.

A changing hardware landscape drives a changing software landscape. Analytic code written for systems today may not be relevant for systems five years from now. Symbolic systems like Blaze provide some stability on top of this rapidly changing ecosystem.

Static Analysis

Not yet implemented

Blaze is able to inspect and optimize your computation before it is run. Common optimizations include loop fusion, rearranging joins and projections to minimize data flow, etc..