DataFrame
==========

In the last section we manipulated CSV files in parallel by building dask graphs by hand and running them with `dask` `get` functions. 

In this section we use `dask.dataframe` to build and execute dask graphs automatically.

### Setup

Create the data files if they don't already exist.

In [55]:
import os
if not os.path.exists(os.path.join('data', 'accounts.0.csv')):
    from prep_data import accounts_csvs
    accounts_csvs(3, 1000000, 500)

### `dask.dataframe.read_csv`

This works just like `pandas.read_csv`, except that it can read multiple CSV files at once.

In [None]:
import os
filename = os.path.join('data', 'accounts.*.csv')
filename

In [None]:
import dask.dataframe as dd
df = dd.read_csv(filename)

In [None]:
%%time 
len(df)
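
As a rough mental model (not a description of how dask is implemented), reading a glob pattern resembles expanding the pattern, reading each matching file with pandas, and combining the per-file results. The sketch below assumes tiny made-up files written to a temporary directory so that it runs anywhere; the file contents are invented for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three tiny CSV files into a temporary directory. These are stand-ins
# for the real data files; their contents are made up for illustration.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    part = pd.DataFrame({'names': ['Alice', 'Bob'], 'amount': [10 * i, -5]})
    part.to_csv(os.path.join(tmpdir, 'accounts.%d.csv' % i), index=False)

# Expand the glob pattern and read each file separately, the way a serial
# pandas loop would.
pattern = os.path.join(tmpdir, 'accounts.*.csv')
paths = sorted(glob.glob(pattern))
total_rows = sum(len(pd.read_csv(path)) for path in paths)

print(len(paths), total_rows)
```

With `dask.dataframe`, the per-file reads can run in parallel and the full dataset never has to sit in memory at once.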

### Exercise: Inspect dask graph

Dask `DataFrame` copies a subset of the Pandas API.  

However, unlike Pandas, operations on dask DataFrames don't trigger immediate computation. Instead, they add key-value pairs to an underlying dask graph.

In [None]:
df.visualize()

In [None]:
df.amount.sum().visualize()

Above we see graphs corresponding to a call to `dd.read_csv` and `df.amount.sum()` on the result.  

Below we see the resulting computations as dictionaries.  You'll note that these dictionaries are a bit more complex than what we built by hand in the last section.  However, if you look closely, you'll see all of the familiar elements: `pd.read_csv` and the filenames.
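
For comparison, here is a minimal sketch of a hand-built graph in the same dictionary style: keys map either to data or to a task tuple whose first element is a function. The `simple_get` helper and the key names are hypothetical and written only for this illustration; it is not dask's scheduler.

```python
from operator import add


def simple_get(graph, key):
    """Evaluate `key` in a dask-style graph (hypothetical helper)."""
    task = graph[key]
    # A task is a tuple whose first element is callable; anything else is data.
    if isinstance(task, tuple) and task and callable(task[0]):
        func, args = task[0], task[1:]
        # String arguments that are keys in the graph are evaluated recursively.
        resolved = [simple_get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task


graph = {
    'amounts-0': [100, -20, 30],        # stand-in for one loaded CSV partition
    'amounts-1': [-5, 50],              # stand-in for another partition
    'sum-0': (sum, 'amounts-0'),        # per-partition sums
    'sum-1': (sum, 'amounts-1'),
    'total': (add, 'sum-0', 'sum-1'),   # combine the partial results
}

total = simple_get(graph, 'total')
print(total)
```

The graphs that `dask.dataframe` generates follow the same idea, with per-file tasks feeding into combining tasks, only with more keys and more bookkeeping.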

Try changing around the expression `df.amount.sum()` and see how the dictionary and graph change.  Explore a bit with the Pandas syntax that you already know.

In [None]:
df.dask  # .dask attribute contains underlying graph

In [None]:
df.visualize()

In [None]:
df.amount.sum().dask

Exercise: Recall and use Pandas API
----------------------------------------

If you are already familiar with the Pandas API then you should have a firm grasp on how to use `dask.dataframe`.  There are a couple of small changes.

As noted above, computations on dask `DataFrame` objects don't perform any work immediately. Instead, they build up a dask graph.  We can evaluate this dask graph at any time using the `.compute()` method.

In [None]:
result = df.amount.mean()  # create lazily evaluated result
result

In [None]:
result.compute()           # perform actual computation

Try the following exercises

1.  Use the `head()` method to get the first ten rows
2.  Use the `drop_duplicates()` method to find all of the distinct names
3.  Use selections `df[...]` to find how many positive and negative amounts there are
4.  Use groupby `df.groupby(df.A).B.func()` to get the average amount per name
5.  Combine your answers to (3) and (4) to compute the average withdrawal (negative amount) per name

This section should be easy if you are familiar with Pandas.  If you aren't, that's OK too.  You may find the [pandas documentation](http://pandas.pydata.org/) a useful read in the future.  Don't worry, future sections in this tutorial will not depend on this knowledge.

In [None]:
# 1. Use the `head()` method to get the first ten rows
df.head(10)  # head() returns only five rows by default

In [None]:
# 2. Use the `drop_duplicates()` method to find all of the distinct names
df.names.drop_duplicates().compute()

In [None]:
# 3. Use selections `df[...]` to find how many positive and negative amounts there are
len(df[df.amount < 0])

In [None]:
# 3. Use selections `df[...]` to find how many positive and negative amounts there are
len(df[df.amount > 0])

In [None]:
# 4. Use groupby `df.groupby(df.A).B.func()` to get the average amount per name
df.groupby(df.names).amount.mean().compute()

In [None]:
# 5. Combine your answers to 3 and 4 to compute the average withdrawal (negative amount) per name
df2 = df[df.amount < 0]
df2.groupby(df2.names).amount.mean().compute()
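
One way to sanity-check answers like these is to run the same expressions with plain pandas on a tiny hand-made frame where the results can be worked out by eye. The sketch below assumes only the `names` and `amount` columns used above; the values are invented for illustration.

```python
import pandas as pd

# A tiny hand-made frame; the values are invented for illustration.
pdf = pd.DataFrame({
    'names': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
    'amount': [100, -50, -20, 30, -50],
})

# Selection, as in exercise 3: count the negative amounts.
n_negative = len(pdf[pdf.amount < 0])

# Groupby, as in exercise 4: average amount per name.
mean_per_name = pdf.groupby(pdf.names).amount.mean()

# Both combined, as in exercise 5: average withdrawal per name.
withdrawals = pdf[pdf.amount < 0]
mean_withdrawal = withdrawals.groupby(withdrawals.names).amount.mean()

print(n_negative)
print(mean_per_name)
print(mean_withdrawal)
```

Because pandas evaluates eagerly, no `.compute()` call is needed here; the dask versions above use the same expressions and add `.compute()` to trigger the work.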