# Getting started with the Blaze ecosystem

_The content in this section is taken from the talk [Christine Doig - Scale your data, not your process: Welcome to the Blaze Ecosystem](https://www.youtube.com/watch?v=QKBcnEhkCtk)._

## Five areas of Data Science
- Scientific Computing (Research/Computational Scientists) - Algorithms
- ML/Stats (Data Scientists) - Models
- Analytics (Data/Business Analyst) - Reports
- Web (Developers) - Applications
- Distributed systems (Architects) - Pipeline/Architecture

## The Data Science Triforce
- Data -> metadata and storage/containers (semantics)
- Expressions -> API/syntax/language (simplicity/compression and accessibility)
- Engine -> compute (performance)

##### EXAMPLE: numpy
- Data -> metadata (numpy dtypes - np.int32, np.float64, etc.) and storage/containers (np.ndarray)
- Engine -> compute (NumPy: Python + FFTW (C) + BLAS (Fortran))
- Expressions -> API, syntax, language (a = np.arange(15).reshape(3, 5))

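The three components can be seen in a few lines of NumPy. This is a minimal sketch; the explicit `dtype` and the matrix product are illustrative choices, not something taken from the talk.

```python
import numpy as np

# Expression: a short, readable description of the array we want.
# The dtype is set explicitly so the result does not depend on the platform default.
a = np.arange(15, dtype=np.int32).reshape(3, 5)

# Data: the container (np.ndarray) plus the metadata that describes its contents.
print(type(a).__name__)
print(a.dtype)
print(a.shape)

# Engine: the matrix product runs in compiled routines instead of being
# evaluated element by element in the Python interpreter.
gram = a.astype(np.float64) @ a.T.astype(np.float64)
print(gram.shape)
```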
Different tools (pandas, NumPy, Spark, SQL databases) each have limitations in one or more components of the DS triforce.

## Enter Blaze
To allow for rich and performant data analytics, Blaze uses and provides an interface to various external projects, including:
- datashape (data description language)
- DyND (dynamic, multidimensional arrays)
- odo (data migration)
- numba (code optimization)
- dask (parallel computing)
- bcolz (column store and query)

## Data

In [7]:
# Blaze can be used to query data on different storage systems
from blaze import Data

# You can query plain old CSV files
iris = Data('iris.csv')

# But you can also query JSON files, MongoDB, SQLite, S3, etc.
# iris = Data('sqlite:///flowers.db::iris')
# iris = Data('mongodb://localhost/mydb::iris')
# iris = Data('iris.json')
# iris = Data('s3://blaze-data/iris.csv')

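The source strings above follow a visible pattern: an optional URI scheme (`sqlite`, `mongodb`, `s3`) picks the storage system, and an optional `::name` suffix picks a table or collection inside it. The sketch below only illustrates that pattern. `split_resource` is a hypothetical helper written for this note, and it is not how Blaze or odo resolves sources internally.

```python
from urllib.parse import urlparse


def split_resource(uri):
    """Hypothetical helper: split 'resource::table' and guess the backend."""
    # Everything after the first '::' names a table or collection.
    resource, _, table = uri.partition("::")
    scheme = urlparse(resource).scheme
    if scheme:
        backend = scheme
    elif "." in resource:
        # No scheme: fall back to the file extension.
        backend = resource.rsplit(".", 1)[-1]
    else:
        backend = "unknown"
    return backend, resource, table or None


for uri in ["iris.csv", "sqlite:///flowers.db::iris", "s3://blaze-data/iris.csv"]:
    print(split_resource(uri))
```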
### Once a source is wrapped in `Data`, it can be queried in a uniform manner irrespective of where it is stored.

In [8]:
# Select columns
iris[['sepal_length', 'species']]

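|    | sepal_length | species |
|----|--------------|---------|
| 0 | 5.1 | Iris-setosa |
| 1 | 4.9 | Iris-setosa |
| 2 | 4.7 | Iris-setosa |
| 3 | 4.6 | Iris-setosa |
| 4 | 5.0 | Iris-setosa |
| 5 | 5.4 | Iris-setosa |
| 6 | 4.6 | Iris-setosa |
| 7 | 5.0 | Iris-setosa |
| 8 | 4.4 | Iris-setosa |
| 9 | 4.9 | Iris-setosa |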


In [9]:
# Filter
iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]

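|    | sepal_length | sepal_width | petal_length | petal_width | species |
|----|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
| 17 | 5.1 | 3.5 | 1.4 | 0.3 | Iris-setosa |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | Iris-setosa |
| 20 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa |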

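When the source is a SQL database, a filter like the one above corresponds naturally to a `WHERE` clause. The sketch below runs such a query with the standard-library `sqlite3` module on three rows copied from the outputs in this section. The SQL text is an assumption written for illustration; the source does not show the query Blaze generates.

```python
import sqlite3

# Three rows copied from the outputs in this section (rows 0, 1 and 5).
rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (5.4, 3.9, 1.7, 0.4, "Iris-setosa"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (sepal_length REAL, sepal_width REAL, "
    "petal_length REAL, petal_width REAL, species TEXT)"
)
conn.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", rows)

# Same condition as the Blaze expression: species match AND sepal_length > 5.0.
query = (
    "SELECT * FROM iris "
    "WHERE species = ? AND sepal_length > ? "
    "ORDER BY sepal_length"
)
selected = conn.execute(query, ("Iris-setosa", 5.0)).fetchall()
conn.close()

for row in selected:
    print(row)
```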

In [17]:
# Operate
from blaze import log
log(iris.sepal_length * 10)

|    | sepal_length |
|----|--------------|
| 0 | 3.931826 |
| 1 | 3.89182 |
| 2 | 3.850148 |
| 3 | 3.828641 |
| 4 | 3.912023 |
| 5 | 3.988984 |
| 6 | 3.828641 |
| 7 | 3.912023 |
| 8 | 3.78419 |
| 9 | 3.89182 |

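The values in this output are natural logarithms of `sepal_length * 10`. A quick check with the standard library on the first three sepal lengths reproduces the first three rows:

```python
import math

# First three sepal_length values from the column-selection output earlier.
sepal_lengths = [5.1, 4.9, 4.7]

# Natural log of ten times each value, rounded to the six decimals displayed.
logs = [round(math.log(x * 10), 6) for x in sepal_lengths]
print(logs)
```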

In [14]:
# Reduce
iris.sepal_length.mean()

In [16]:
# Split-apply-combine
from blaze import by
by(iris.species,
   shortest=iris.petal_length.min(),
   longest=iris.petal_length.max(),
   average=iris.petal_length.mean())

|    | species | average | longest | shortest |
|----|---------|---------|---------|----------|
| 0 | Iris-setosa | 1.464 | 1.9 | 1.0 |
| 1 | Iris-versicolor | 4.26 | 5.1 | 3.0 |
| 2 | Iris-virginica | 5.552 | 6.9 | 4.5 |

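Split-apply-combine means grouping the rows by a key, computing reductions within each group, and collecting one result row per group. The plain-Python sketch below spells out those three steps. The records are made-up values used only for illustration, not iris measurements.

```python
from collections import defaultdict

# Made-up (species, petal_length) records, for illustration only.
records = [("a", 1.0), ("a", 3.0), ("b", 4.0), ("b", 5.0), ("b", 6.0)]

# Split: collect the petal lengths belonging to each species.
groups = defaultdict(list)
for species, petal_length in records:
    groups[species].append(petal_length)

# Apply and combine: reduce each group, then gather one summary per species.
summary = {
    species: {
        "shortest": min(values),
        "longest": max(values),
        "average": sum(values) / len(values),
    }
    for species, values in groups.items()
}

for species, stats in summary.items():
    print(species, stats)
```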

In [18]:
# Add new columns
from blaze import transform
transform(iris,
          sepal_ratio=iris.sepal_length / iris.sepal_width,
          petal_ratio=iris.petal_length / iris.petal_width)

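|    | sepal_length | sepal_width | petal_length | petal_width | species | petal_ratio | sepal_ratio |
|----|--------------|-------------|--------------|-------------|---------|-------------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 7.0 | 1.457143 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 7.0 | 1.633333 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 6.5 | 1.46875 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 7.5 | 1.483871 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 7.0 | 1.388889 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 4.25 | 1.384615 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 4.666667 | 1.352941 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 7.5 | 1.470588 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 7.0 | 1.517241 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 15.0 | 1.580645 |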

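The new columns are element-wise ratios of existing ones, and the original columns are kept alongside them. The sketch below applies the same arithmetic to row 0 with a plain dictionary. `add_ratios` is a hypothetical helper for this note, and it returns a copy so the input row is left unchanged, mirroring how the later `relabel` cell still shows only the original five columns.

```python
def add_ratios(row):
    """Return a copy of row with two derived ratio columns added."""
    return dict(
        row,
        sepal_ratio=row["sepal_length"] / row["sepal_width"],
        petal_ratio=row["petal_length"] / row["petal_width"],
    )


# Row 0 of the iris data, as shown in the outputs in this section.
first = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
    "species": "Iris-setosa",
}

wide = add_ratios(first)
print(round(wide["sepal_ratio"], 6), round(wide["petal_ratio"], 6))
print(sorted(first))  # the input row still has only its original keys
```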

In [19]:
# Relabel columns
iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')

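|    | sepal_length | sepal_width | PETAL-LENGTH | PETAL-WIDTH | species |
|----|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |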


In [27]:
# Text matching
from blaze import like
like(iris.species, '*versicolor')

|    | species |
|----|---------|
| 0 | False |
| 1 | False |
| 2 | False |
| 3 | False |
| 4 | False |
| 5 | False |
| 6 | False |
| 7 | False |
| 8 | False |
| 9 | False |

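The pattern `'*versicolor'` is a glob-style wildcard: `*` stands for any run of characters. The first ten rows are all `Iris-setosa`, which is why the output above shows only `False`. The sketch below uses the standard-library `fnmatch` module as an analogous matcher on the three species names that appear in this section. It illustrates the pattern semantics and is not a claim about what Blaze calls internally.

```python
from fnmatch import fnmatchcase

# The three species names that appear in the split-apply-combine output.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# '*' matches any prefix, so only names ending in 'versicolor' match.
matches = [fnmatchcase(name, "*versicolor") for name in species]

for name, matched in zip(species, matches):
    print(name, matched)
```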