# Grouping and reshaping data

We're going to look at some different ways of grouping and aggregating data. We're building towards thinking about 'split', 'apply', and 'combine' workflows, which look something like this:

![split-apply-combine](https://github.com/core-skills/02-getting-to-know-the-tools/blob/master/notebooks/split-apply-combine.png?raw=true)

(Taken from Jake VanderPlas' excellent [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook). Check out all the notebooks available on GitHub if you want more in-depth examples than what we've worked through today.)
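To make the three steps concrete, here is a minimal sketch that performs each one by hand and then compares the result with a single `groupby` call. The dataframe and its column names (`key`, `value`) are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'key' and 'value' are illustrative names only
toy = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'value': [1, 2, 3, 4, 5, 6],
})

# Split: one sub-dataframe per distinct key
pieces = {key: toy[toy['key'] == key] for key in toy['key'].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece['value'].sum() for key, piece in pieces.items()}

# Combine: stitch the per-group results back into one object
by_hand = pd.Series(sums, name='value').rename_axis('key')

# groupby performs the same three steps in one call
with_groupby = toy.groupby('key')['value'].sum()
```

Both `by_hand` and `with_groupby` should hold the same per-key totals; the `groupby` version just hides the bookkeeping.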

## Groupby

Find the pandas `groupby` method and work out how it works on your dataframe. Hint: try passing a categorical column from your data. 

If you don't have a categorical column but you do have a column of numbers, you can generate groups by binning the data using the `pandas.cut` function - something like this:

```python
import pandas
from numpy import inf
from random_data import random_dataframe

# Our faithful bogus dataframe
df = random_dataframe(30)

# Add a new column which bins the a values
df['how_big'] = pandas.cut(df.a, 
                           bins=[-inf, 50, inf],
                           labels=('low', 'high'))
```

`pandas.cut` can often be useful for investigating subsets of numerical data (e.g. ore grade in marginal blocks!).
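If you don't have the `random_data` helper to hand, the following sketch shows the same binning idea on a small hand-made column. The `grade` values and the cut-off of 1.0 are assumptions chosen only to make the behaviour easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical grades, chosen only to illustrate the binning
grades = pd.DataFrame({'grade': [0.2, 1.0, 1.4, 2.5, 0.6]})

# By default bins are closed on the right: (-inf, 1.0] is 'low',
# (1.0, inf] is 'high', so a value of exactly 1.0 counts as 'low'
grades['how_big'] = pd.cut(grades['grade'],
                           bins=[-np.inf, 1.0, np.inf],
                           labels=['low', 'high'])

# Count how many rows landed in each bin
counts = grades['how_big'].value_counts()
```

Checking `counts` after binning is a quick way to spot bins that are empty or badly unbalanced before you group on them.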

See the [`DataFrame.groupby` documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) for the full list of options.

Next try to generate some summary statistics about each of your groups. The `info` and `describe` methods of the pandas dataframe are good places to start - try something like this:

```python
from random_data import random_dataframe

# Yet more bogus data
df_rand = random_dataframe(100)

# Iterating in a for-loop
for category, grp_df in df_rand.groupby('category'):
    print(f"\nInfo for group {category}")
    print(grp_df.describe())
```

You don't have to iterate over the groups if you don't want to - you can call an aggregating function on the grouped object directly (which is often easier to read).

```python
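# Calculating an aggregation directly
# (numeric_only=True skips non-numeric columns such as 'how_big',
# which recent pandas versions refuse to sum)
df.groupby('category').sum(numeric_only=True)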
```

Try looking at some of the other pandas aggregations: `count`, `first`, `last`, `mean`, `median`, `min`, `max`, `std`, `var`, `prod`, `sum`, and `mad` (removed in pandas 2.0, so only available on older versions). What does each of these do?
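One way to compare several of these at once is to pass a list of aggregation names to `agg`. This is a small sketch on a hand-made dataframe; the column names and values are illustrative assumptions, not part of the course data.

```python
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    'category': ['x', 'x', 'y', 'y', 'y'],
    'a': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per group, one column per aggregation
summary = toy.groupby('category')['a'].agg(['count', 'mean', 'min', 'max'])
```

Putting the aggregations side by side like this makes it easier to see how each one treats the same group.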

Take a look at some of the more advanced grouping options - for example, you can set a category as the index and pass a mapping or a function that takes an index label and returns a group.

```python
# Calculating an aggregation by specifying a mapping 
# from index to group
mapping = {'x': 'first', 'y': 'first', 'z': 'second'}
df.set_index('category').groupby(mapping).mean(numeric_only=True)
```
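The dictionary above can be swapped for a function, which pandas calls once per index label. The sketch below assumes a hand-made dataframe and a hypothetical helper named `first_or_second`; neither comes from the course material.

```python
import pandas as pd

# Hypothetical toy data, indexed by category
toy = pd.DataFrame({
    'category': ['x', 'y', 'z', 'x', 'z'],
    'a': [1, 2, 3, 4, 5],
}).set_index('category')

def first_or_second(label):
    """Map an index label to a group name (illustrative rule only)."""
    return 'first' if label in ('x', 'y') else 'second'

# The function is applied to every index label to decide its group
grouped = toy.groupby(first_or_second)['a'].sum()
```

A function is handy when the grouping rule is easier to state as logic than to list out label by label.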

How might you write a small function to start to aggregate or summarize the data in your data's groups in more complex ways?
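As one possible starting point (a sketch, not the only answer), any function that reduces a group's values to a single number can be handed to `agg`. The `value_range` helper and the toy data here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    'category': ['x', 'x', 'x', 'y', 'y'],
    'a': [1.0, 5.0, 3.0, 10.0, 4.0],
})

def value_range(values):
    """Spread of a group: largest value minus smallest."""
    return values.max() - values.min()

# agg calls value_range once per group, passing that group's 'a' values
ranges = toy.groupby('category')['a'].agg(value_range)
```

The same pattern extends to more involved summaries, as long as the function returns one value per group.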

## Pivot table

Pivot tables are a lot like groupby operations but instead of ending up with one column of groups we can end up with multidimensional arrays of aggregations. 

In the diagram at the top of the page, you can think of a pivot table splitting the data using more than one column in the 'split' step.

This is generally more useful when we want to start to aggregate along multiple dimensions.

Using our example from above:

```python
import pandas
from numpy import inf
from random_data import random_dataframe

# Our faithful bogus dataframe
df = random_dataframe(3000, categories='uvwxyz')

# Add a new column which bins the a values
df['how_big'] = pandas.cut(df.a, 
                           bins=[-inf, 50, inf],
                           labels=('low', 'high'))

# make a new pivot table that calculates the mean for each
# of our subcategories - both 'u' to 'z' and 'low' and 'high'
pivot = df.pivot_table('b', index='category', columns='how_big', aggfunc='mean')
```
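To see how this relates to `groupby`, the sketch below builds the same table twice on a small hand-made dataframe: once with `pivot_table`, and once by grouping on two columns and unstacking the second. The data values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns
toy = pd.DataFrame({
    'category': ['x', 'x', 'y', 'y', 'x', 'y'],
    'how_big':  ['low', 'high', 'low', 'high', 'low', 'high'],
    'b':        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Pivot table: rows are categories, columns are size bins
pivot = toy.pivot_table('b', index='category', columns='how_big', aggfunc='mean')

# The same result via groupby on both columns, then moving
# the 'how_big' level of the index out into the columns
stacked = toy.groupby(['category', 'how_big'])['b'].mean().unstack()
```

Both objects should contain the same means, which is why a pivot table can be thought of as a two-column 'split' followed by a reshape.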

Try creating a pivot table on your own data (as before you can use `pandas.cut` to bin numerical data if that's more useful).

```python
import pandas
from numpy import inf
from random_data import random_dataframe

# Our faithful bogus dataframe
df = random_dataframe(3000, categories='uvwxyz')

# Add a new column which bins the a values
# (four bins need exactly four labels)
df['how_big'] = pandas.cut(df.a, 
                           bins=[-inf, 25, 50, 75, inf],
                           labels=('tiny', 'small', 'large', 'huge'))

# make a new pivot table that calculates the sum for each
# of our subcategories - both 'u' to 'z' and the 'how_big' bins
pivot = df.pivot_table('b', index='category', columns='how_big', aggfunc='sum')
```

```python
df.groupby('how_big').count()
```

## Plotting data

Next we're going to use [seaborn](https://seaborn.pydata.org) to generate some pretty plots of our data.

Most Python tutorials introduce [matplotlib](https://matplotlib.org) at this stage because it's the default, but seaborn is a much higher-level library with a nicer API, especially for exploratory visualization (matplotlib will probably make more sense to you if you're coming from the MATLAB world, though). The only hangover is that we need to include the `%matplotlib inline` magic to tell Jupyter to render the graphics inline for us.

We'll start by looking at our random dataset.  

```python
# %matplotlib inline
import pandas as pd
import seaborn as sns
from numpy import inf
from random_data import random_dataframe

sns.set()

# Set up our dataframe with a binned 'how_big' column
# (four bins need exactly four labels)
df = random_dataframe(3000, categories='uvwxyz')
df['how_big'] = pd.cut(df.a,
                       bins=[-inf, 25, 50, 75, inf],
                       labels=('tiny', 'small', 'large', 'huge'))
df.head()
```

For a one-dimensional dataset we can try `seaborn.histplot` (which replaces `seaborn.distplot`, deprecated since seaborn 0.11), `seaborn.kdeplot` and `seaborn.rugplot` to visualize the data.

```python
# sns.set_context('talk') # paper, notebook, talk, poster
# sns.set_palette('colorblind') # deep, muted, pastel, bright, dark, and colorblind
```

```python
# Histogram with a KDE curve overlaid
sns.histplot(df.a, kde=True)
```

A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions.
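To get a feel for what that curve is, here is a rough sketch of a one-dimensional Gaussian KDE written directly in NumPy. It assumes a fixed, hand-picked bandwidth of 0.5 and made-up sample values; seaborn chooses a bandwidth for you, so its curves will not match this sketch exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical samples and an evaluation grid wide enough to hold the tails
samples = [1.0, 2.0, 2.5, 4.0]
grid = np.linspace(-5.0, 11.0, 1601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
```

A smaller bandwidth gives a spikier curve that follows individual points; a larger one smooths them together.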


```python
sns.kdeplot(df.a)
```

```python
# Passing the data as x= keeps the plot horizontal across seaborn versions
sns.boxenplot(x=df['a'])
```

```python
sns.violinplot(x=df['a'])
```

We can also use `jointplot` to generate a scatter plot of two variables with a histogram of each along the margins.

```python
sns.jointplot(x='a', y='b', data=df)
```

```python
sns.jointplot(x='a', y='b', data=df, kind='hex')
```

For times when you want to 'plot everything against everything else' you can do something like this:

```python
sns.pairplot(df)
```

This is really useful for pulling out relationships between variables. For example, we can add two derived columns and plot again:

```python
df['c'] = df.a + df.b * df.a
df['d'] = df.c * df.b + df.a
```

```python
sns.pairplot(df)
```

```python
# g = sns.PairGrid(df)
# g.map_diag(sns.kdeplot)
# g.map_offdiag(sns.kdeplot, n_levels=6);
```

Seaborn also has a heap of support for categorical data. We can include more dimensions in the visualization by mapping further columns to color or point size.

```python
sns.stripplot(x='a', y='how_big', hue='category', data=df, jitter=True, dodge=True)
```

We can also visualize the pivot table we generated above with heatmaps. 

```python
pivot = df.pivot_table('b', index='category', columns='how_big', aggfunc='sum')
pivot
```

```python
sns.heatmap(pivot)
```

```python
sns.clustermap(pivot)
```

Seaborn can get a lot more complicated than this and it's worth digging through the examples to find useful ways of slicing and dicing your dataframes into pictures.

Now try this out on your own dataset!