<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg" 
     width="30%" 
     align=right
     alt="Dask logo">

DataFrames and Timeseries
------------------

This notebook uses [Dask dataframe](http://dask.pydata.org/en/latest/dataframe.html), a parallel version of [Pandas](http://pandas.pydata.org) that runs on a cluster.  It demonstrates both the Dask dataframe API and how to work with the [distributed cluster](http://distributed.readthedocs.io/en/latest/api.html).

In [None]:
from dask.distributed import Client, progress

# With no arguments, Client() starts a local cluster and connects to it
c = Client()
c

In [None]:
import dask.dataframe as dd

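# Random timeseries from 2010 to 2016: one row every 10 seconds,
# one partition per week
df = dd.demo.make_timeseries('2010', '2016',
                             {'value': float, 'name': str, 'id': int},
                             freq='10s', partition_freq='7d', seed=1)

# Keep only rows with positive values and reorder the columns
df = df[df.value > 0][['id', 'value', 'name']]

df.head()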

In [None]:
# Compute the dataframe in the background and keep the result in cluster memory
df = df.persist()
progress(df)

In [None]:
%time len(df)

In [None]:
%time df.groupby(df.id).value.mean().nlargest(10).compute()

### Quickly get data for a particular time or date range

In [None]:
%time df.loc['2015-12-25'].head()

### Aggregations

In [None]:
df.value.std().compute()

### Filtering

In [None]:
df2 = df[df.name == 'Hannah']
df2.head()

### Groupby operations

In [None]:
df.groupby(df.name).value.min().compute()
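
Grouped reductions such as the `min` above, or the grouped `mean` computed earlier, can be evaluated on partitioned data by reducing each partition separately and then combining the small partial results. The following is a minimal pure-Python sketch of that idea for a grouped mean. It is an illustration only, not Dask's implementation; the helper names `partial_sums` and `combine_means` and the sample rows are made up for this example.

```python
from collections import defaultdict

def partial_sums(rows):
    """Per-partition step: map each name to [sum, count] of its values."""
    acc = defaultdict(lambda: [0.0, 0])
    for name, value in rows:
        acc[name][0] += value
        acc[name][1] += 1
    return acc

def combine_means(partials):
    """Combine step: add up the partial sums and counts, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for name, (s, n) in partial.items():
            total[name][0] += s
            total[name][1] += n
    return {name: s / n for name, (s, n) in total.items()}

# Three hypothetical partitions of (name, value) rows
partitions = [
    [("Alice", 1.0), ("Bob", 4.0)],
    [("Alice", 3.0)],
    [("Bob", 2.0), ("Bob", 6.0)],
]

means = combine_means(partial_sums(p) for p in partitions)
print(means)
```

Only the per-group sums and counts move between partitions, never the raw rows, which is why this kind of reduction parallelizes well.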

### Resample by day

In [None]:
df.value.resample('1d').mean().head()

### Rolling aggregations

In [None]:
df.value.rolling(100).mean().tail()

### Understanding algorithms with the `visualize` method

In this example we look at a smaller dataset and see how Dask dataframe resamples data that is partitioned by month into data that is organized by week.  This is a bit messy because weeks and months don't line up perfectly.  Fortunately, Dask's task schedulers are built for this sort of messy situation.
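
To make the mismatch concrete, here is a small sketch using only the Python standard library. It lists which monthly partitions each weekly bin would have to read. As a simplifying assumption, the bins here are plain 7-day windows counted from January 1, 2010, which is not how Pandas anchors its weekly frequency; the helper names are hypothetical.

```python
from datetime import date, timedelta

def month_partitions(year):
    """Return (start, end) date pairs, one per calendar month (end exclusive)."""
    starts = [date(year, m, 1) for m in range(1, 13)] + [date(year + 1, 1, 1)]
    return list(zip(starts[:-1], starts[1:]))

def week_bins(first_day, last_day):
    """Yield (start, end) pairs for consecutive 7-day bins (end exclusive)."""
    start = first_day
    while start <= last_day:
        yield start, start + timedelta(days=7)
        start += timedelta(days=7)

def partitions_needed(bin_start, bin_end, partitions):
    """Indices of partitions overlapping the half-open bin [bin_start, bin_end)."""
    return [i for i, (p_start, p_end) in enumerate(partitions)
            if p_start < bin_end and bin_start < p_end]

partitions = month_partitions(2010)
bins = list(week_bins(date(2010, 1, 1), date(2010, 12, 31)))

# Keep only the weekly bins that need data from more than one month
straddling = [(b, partitions_needed(b[0], b[1], partitions)) for b in bins]
straddling = [(b, needed) for b, needed in straddling if len(needed) > 1]

for (start, end), needed in straddling[:3]:
    print(start, "to", end, "reads partitions", needed)
```

Every bin in `straddling` depends on two input partitions, so the output of one week cannot be produced from a single month alone.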

In [None]:
df_small = dd.demo.make_timeseries('2010-01-01', '2010-12-31',
                                   {'value': float, 'name': str, 'id': int},
                                   freq='10s', partition_freq='1M', seed=1)
df_small.value.resample('1w').mean().visualize()

In [None]:
df_small.rolling(100).mean().visualize(rankdir='LR')
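
Rolling windows are another case where partitions depend on their neighbours: the first rows of a partition need the last few rows of the partition before it. The sketch below shows the general idea in pure Python by carrying the trailing `window - 1` values forward from one partition to the next. It is an illustration under that assumption, not Dask's implementation, and both function names are hypothetical.

```python
def rolling_mean(values, window):
    """Plain rolling mean; positions without a full window yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

def partitioned_rolling_mean(partitions, window):
    """Rolling mean computed one partition at a time, borrowing the
    trailing window - 1 values from the data seen so far."""
    results = []
    prev_tail = []
    for part in partitions:
        padded = prev_tail + part
        rolled = rolling_mean(padded, window)
        # Drop the results that belong to the borrowed rows
        results.extend(rolled[len(prev_tail):])
        prev_tail = padded[-(window - 1):] if window > 1 else []
    return results

flat = [float(x) for x in range(10)]
parts = [flat[0:3], flat[3:7], flat[7:10]]

whole = rolling_mean(flat, 3)
pieced = partitioned_rolling_mean(parts, 3)
print(whole == pieced)
```

The comparison at the end checks that the partition-by-partition result matches the rolling mean over the unsplit data.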