Expressions

Blaze expressions describe computational workflows symbolically. They allow developers to architect and check their computations rapidly before applying them to data. These expressions can then be compiled down to a variety of supported backends.

Tables

Table expressions track operations found in relational algebra or your standard Pandas/R DataFrame object. Operations include projecting columns, filtering, mapping and basic mathematics, reductions, split-apply-combine (group by) operations, and joining. This compact set of operations can express a surprisingly large range of common computations and is widely supported.

Symbol

A Symbol refers to a single collection of data. It must be given a name and a datashape.

>>> from blaze import symbol
>>> accounts = symbol('accounts', 'var * {id: int, name: string, balance: int}')

Projections, Selection, Arithmetic

Many operations follow from standard Python syntax, familiar from systems like NumPy and Pandas.

The following example defines a collection, accounts, and then selects the names of those accounts with a negative balance.

>>> accounts = symbol('accounts', 'var * {id: int, name: string, balance: int}')
>>> deadbeats = accounts[accounts.balance < 0].name

Internally this doesn't do any actual work because we haven't specified a data source. Instead it builds a symbolic representation of a computation to execute in the future.

>>> deadbeats
accounts[accounts.balance < 0].name
>>> deadbeats.dshape
dshape("var * string")
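
To make the deferred nature of the expression concrete, here is a minimal plain-Python sketch of what the selection and projection above would compute once bound to data. It does not use Blaze; the sample rows and the helper name deadbeat_names are assumptions made up for illustration.

```python
# Hypothetical sample rows matching the datashape
# 'var * {id: int, name: string, balance: int}'.
accounts_data = [
    {"id": 1, "name": "Alice", "balance": 100},
    {"id": 2, "name": "Bob", "balance": -200},
    {"id": 3, "name": "Charlie", "balance": 300},
    {"id": 4, "name": "Denis", "balance": -400},
]


def deadbeat_names(rows):
    """Select rows with a negative balance, then project the name field."""
    return [row["name"] for row in rows if row["balance"] < 0]


# Defining the function does no work; the computation only runs
# when it is applied to concrete data.
names = deadbeat_names(accounts_data)
print(names)
```

Like the Blaze expression, the function describes the computation independently of any particular data source; unlike the expression, it cannot be inspected or compiled to another backend.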

Split-apply-combine, Reductions

Blaze borrows the by operation from R and Julia. The by operation is a combined groupby and reduction, fulfilling split-apply-combine workflows.

>>> from blaze import by
>>> by(accounts.name,                 # Splitting/grouping element
...    total=accounts.balance.sum())  # Apply and reduction
by(accounts.name, total=sum(accounts.balance))

This operation groups the collection by name and then sums the balance of each group. It finds out how much all of the "Alice"s, "Bob"s, etc. of the world have in total.

Note the reduction sum in the keyword argument total. Blaze supports the standard reductions of NumPy like sum, min, max and also the reductions of Pandas like count and nunique.
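
As a rough illustration of the split-apply-combine semantics described above, the following plain-Python sketch groups rows by name and sums each group's balance. It is not how Blaze evaluates by; the sample rows and the helper name total_by_name are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; two accounts share each name.
accounts_data = [
    {"id": 1, "name": "Alice", "balance": 100},
    {"id": 2, "name": "Bob", "balance": -200},
    {"id": 3, "name": "Alice", "balance": 300},
    {"id": 4, "name": "Bob", "balance": -400},
]


def total_by_name(rows):
    """Split rows by name, then reduce each group's balances with a sum."""
    totals = defaultdict(int)
    for row in rows:
        # Split: the grouping key is the name.
        # Apply and combine: accumulate the running sum per key.
        totals[row["name"]] += row["balance"]
    return dict(totals)


totals = total_by_name(accounts_data)
print(totals)
```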

Join

Collections can be joined with the join operation, which allows for advanced queries to span multiple collections.

>>> from blaze import join
>>> cities = symbol('cities', 'var * {name: string, city: string}')
>>> join(accounts, cities, 'name')
Join(lhs=accounts, rhs=cities, _on_left='name', _on_right='name', how='inner', suffixes=('_left', '_right'))

If no join columns are given, join will join on all columns with shared names between the two collections.

>>> shared_names = join(accounts, cities)
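
The default how='inner' behaviour shown in the Join repr can be sketched in plain Python as follows. This is only an illustration of inner-join semantics on a single shared column, not Blaze's implementation; the sample rows and the helper name inner_join are assumptions.

```python
from collections import defaultdict

# Hypothetical sample rows for the two collections.
accounts_data = [
    {"id": 1, "name": "Alice", "balance": 100},
    {"id": 2, "name": "Bob", "balance": -200},
    {"id": 3, "name": "Charlie", "balance": 300},
]
cities_data = [
    {"name": "Alice", "city": "NYC"},
    {"name": "Bob", "city": "LA"},
    {"name": "Edith", "city": "Paris"},
]


def inner_join(left, right, on):
    """Pair every left row with every right row that has the same key."""
    # Index the right-hand rows by the join key for fast lookup.
    index = defaultdict(list)
    for right_row in right:
        index[right_row[on]].append(right_row)

    joined = []
    for left_row in left:
        # Rows without a match on the other side are dropped (inner join).
        for right_row in index.get(left_row[on], []):
            joined.append({**left_row, **right_row})
    return joined


joined = inner_join(accounts_data, cities_data, "name")
print(joined)
```

Charlie and Edith appear in only one collection each, so neither shows up in the result.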

Type Conversion

Type conversion of expressions can be done with the coerce expression. Here's how to compute the average account balance for all the deadbeats in the accounts table and then cast the result to a 64-bit integer:

>>> deadbeats = accounts[accounts.balance < 0]
>>> avg_delinquency = deadbeats.balance.mean()
>>> chopped = avg_delinquency.coerce(to='int64')
>>> chopped
mean(accounts[accounts.balance < 0].balance).coerce(to='int64')
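
For intuition, here is a plain-Python sketch of the same mean-then-cast computation on made-up data. It assumes the cast truncates toward zero, as Python's int() does; how a given Blaze backend rounds when coercing to int64 is not stated here and may differ.

```python
from statistics import mean

# Hypothetical sample rows.
accounts_data = [
    {"id": 1, "name": "Alice", "balance": 100},
    {"id": 2, "name": "Bob", "balance": -200},
    {"id": 3, "name": "Charlie", "balance": 300},
    {"id": 4, "name": "Denis", "balance": -405},
]

# Select the negative balances, then average them.
negative_balances = [row["balance"] for row in accounts_data if row["balance"] < 0]
avg_delinquency = mean(negative_balances)

# Assumption: the integer cast drops the fractional part (truncation toward zero).
chopped = int(avg_delinquency)
print(avg_delinquency, chopped)
```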

Other

Blaze supports a variety of other operations common to our supported backends. See our API docs for more details.