Here we give a quick overview of some of the more common query functionality.
We use the well known iris dataset
>>> from blaze import Data
>>> iris = Data('blaze/examples/data/iris.csv')
>>> iris
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
...
Select individual columns using attributes
>>> iris.species
species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
...
Or item access
>>> iris['species']
species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
...
Select many columns using a list of names
>>> iris[['sepal_length', 'species']]
sepal_length species
0 5.1 Iris-setosa
1 4.9 Iris-setosa
2 4.7 Iris-setosa
3 4.6 Iris-setosa
...
Use mathematical operators and functions as normal
>>> from blaze import log
>>> log(iris.sepal_length * 10)
sepal_length
0 3.931826
1 3.891820
2 3.850148
3 3.828641
...
Note that mathematical functions like log
should be imported from blaze
. These will translate to np.log
, math.log
, sqlalchemy.sql.func.log
, etc. based on the backend.
As with many Blaze operations reductions like sum
and mean
may be used either as methods or as base functions.
>>> iris.sepal_length.mean() # doctest: +ELLIPSIS
5.84333333333333...
>>> from blaze import mean
>>> mean(iris.sepal_length) # doctest: +ELLIPSIS
5.84333333333333...
The by
operation expresses split-apply-combine computations. It has the general format
>>> by(table.grouping_columns, name_1=table.column.reduction(),
... name_2=table.column.reduction(),
... ...) # doctest: +SKIP
Here is a concrete example. Find the shortest, longest, and average petal length by species.
>>> from blaze import by
>>> by(iris.species, shortest=iris.petal_length.min(),
... longest=iris.petal_length.max(),
... average=iris.petal_length.mean())
species average longest shortest
0 Iris-setosa 1.462 1.9 1.0
1 Iris-versicolor 4.260 5.1 3.0
2 Iris-virginica 5.552 6.9 4.5
This simple model can be extended to include more complex groupers and more complex reduction expressions.
Add new columns using the transform
function
>>> transform(iris, sepal_ratio = iris.sepal_length / iris.sepal_width,
... petal_ratio = iris.petal_length / iris.petal_width) # doctest: +SKIP
sepal_length sepal_width petal_length petal_width species \
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
sepal_ratio petal_ratio
0 1.457143 7.000000
1 1.633333 7.000000
2 1.468750 6.500000
3 1.483871 7.500000
...
Match text with glob strings, specifying columns with keyword arguments.
>>> iris.like(species='*versicolor') # doctest: +SKIP
sepal_length sepal_width petal_length petal_width species
50 7.0 3.2 4.7 1.4 Iris-versicolor
51 6.4 3.2 4.5 1.5 Iris-versicolor
52 6.9 3.1 4.9 1.5 Iris-versicolor
>>> iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH') # doctest: +SKIP
sepal_length sepal_width PETAL-LENGTH PETAL-WIDTH species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
Blaze can help solve many common problems that data analysts and scientists encounter. Here are a few examples of common issues that can be solved using blaze.
>>> from blaze import odo
>>> from pandas import DataFrame
>>> odo('blaze/examples/data/accounts_*.csv.gz', DataFrame)
id name amount
0 1 Alice 100
1 2 Bob 200
2 3 Charlie 300
3 4 Dan 400
4 5 Edith 500
>>> from blaze import Data, by
>>> t = Data('sqlite:///blaze/examples/data/iris.db::iris')
>>> t
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
...
>>> by(t.species, max=t.petal_length.max(), min=t.petal_length.min())
species max min
0 Iris-setosa 1.9 1.0
1 Iris-versicolor 5.1 3.0
2 Iris-virginica 6.9 4.5