# Groupby operations

Some imports:

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 10

## Recap: the groupby operation (split-apply-combine)

The "group by" concept: we want to **apply the same function to subsets of a dataframe, where some key determines how the dataframe is split into subsets**.

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure

<img src="img/splitApplyCombine.png">

This is similar to SQL's `GROUP BY`.

The example of the image in pandas syntax:

In [None]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

Using the filtering and reduction operations we have seen in the previous notebooks, we could do something like:


    df[df['key'] == "A"]['data'].sum()
    df[df['key'] == "B"]['data'].sum()
    ...

But pandas provides the `groupby` method to do this:

In [None]:
df.groupby('key').aggregate('sum')  # np.sum

In [None]:
df.groupby('key').sum()
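
To make the three steps concrete, the same result can be reproduced by hand. This is only a sketch of what `groupby` does conceptually, not how pandas implements it; the names `pieces`, `sums` and `manual` are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Split: one sub-dataframe per distinct key
pieces = {key: df[df['key'] == key] for key in df['key'].unique()}

# Apply: reduce the 'data' column of each piece to a single number
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name='data').rename_axis('key')

# The one-line groupby version gives the same numbers
grouped = df.groupby('key')['data'].sum()
print(manual)
print(grouped)
```

The manual version scans the dataframe once per key, which is one reason to prefer `groupby` on larger data.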

Pandas does not only let you group by a column name. In `df.groupby(grouper)`, the `grouper` can be many things:

- Series (or a string naming a column in df)
- function (applied to each index label)
- dict: maps index labels to group names
- `level=`: name or position of a level in a MultiIndex
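
As a rough illustration, each kind of grouper can be tried on the small example dataframe. The `rep` column, the `mapping` dict and the `is_large` Series are hypothetical helpers added only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Series with the same index as df: rows are grouped by its values
is_large = (df['data'] >= 10).rename('is_large')
by_series = df.groupby(is_large)['data'].sum()

# Function: called on each index label (here the integers 0..8)
by_function = df.groupby(lambda i: i % 2)['data'].sum()

# Dict: maps index labels to group names; labels missing from the dict are left out
mapping = {0: 'first', 1: 'first', 2: 'first', 3: 'rest', 4: 'rest'}
by_dict = df.groupby(mapping)['data'].sum()

# level=: group by one level of a MultiIndex
indexed = df.assign(rep=[1, 1, 1, 2, 2, 2, 3, 3, 3]).set_index(['key', 'rep'])
by_level = indexed.groupby(level='key')['data'].sum()

print(by_series, by_function, by_dict, by_level, sep='\n\n')
```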



In [None]:
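# select the numeric column: taking the mean of the string column 'key' fails in recent pandas
df.groupby(lambda x: x % 2)['data'].mean()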

## And now applying this on some real data

These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for it. You can download the data here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8). Put both files in the `data/` folder next to this notebook.

`cast` dataset: the roles played by actors and actresses in films

- title: title of the film
- year: year of the film
- name: name of the actor/actress
- type: actor/actress
- character: name of the character played
- n: the order of the role (n=1: leading role)

In [None]:
cast = pd.read_csv('data/cast.csv')
cast.head()

In [None]:
titles = pd.read_csv('data/titles.csv')
titles.head()

<div class="alert alert-success">
    <b>EXERCISE</b>: Using groupby(), plot the number of films that have been released each decade in the history of cinema.
</div>

In [None]:
titles.groupby(titles.year // 10 * 10).size().plot(kind='bar')
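
The expression `titles.year // 10 * 10` maps every year to the first year of its decade: integer division by 10 drops the last digit, and multiplying by 10 puts a zero back. A small sketch on made-up years (the values in `years` are not taken from the dataset):

```python
import pandas as pd

years = pd.Series([1948, 1950, 1959, 1960, 2001], name='year')  # made-up values

# 1948 -> 1940, 1950 -> 1950, 1959 -> 1950, 1960 -> 1960, 2001 -> 2000
decade = years // 10 * 10

# Grouping by the derived Series counts the rows per decade
counts = years.groupby(decade).size()
print(counts)
```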

<div class="alert alert-success">
    <b>EXERCISE</b>: Use groupby() to plot the number of "Hamlet" films made each decade.
</div>

In [None]:
hamlet = titles[titles['title'] == 'Hamlet']
hamlet.groupby(hamlet.year // 10 * 10).size().plot(kind='bar')

<div class="alert alert-success">
    <b>EXERCISE</b>: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?
</div>

In [None]:
cast1950 = cast[cast.year // 10 == 195]
cast1950 = cast1950[cast1950.n == 1]
cast1950.groupby(['year', 'type']).size()

<div class="alert alert-success">
    <b>EXERCISE</b>: List the 10 actors/actresses with the most leading roles (n=1) since the 1990s.
</div>

In [None]:
cast1990 = cast[cast['year'] >= 1990]
cast1990 = cast1990[cast1990.n == 1]
cast1990.groupby('name').size().nlargest(10)

<div class="alert alert-success">
    <b>EXERCISE</b>: Use groupby() to determine how many roles are listed for each of The Pink Panther movies.
</div>

In [None]:
c = cast
c = c[c.title == 'The Pink Panther']
c = c.groupby(['year'])[['n']].max()
c

<div class="alert alert-success">
    <b>EXERCISE</b>: List, in order by year, each of the films in which Frank Oz has played more than 1 role.
</div>

In [None]:
c = cast
c = c[c.name == 'Frank Oz']
g = c.groupby(['year', 'title']).size()
g[g > 1]

<div class="alert alert-success">
    <b>EXERCISE</b>: List each of the characters that Frank Oz has portrayed at least twice.
</div>

In [None]:
c = cast
c = c[c.name == 'Frank Oz']
g = c.groupby(['character']).size()
g[g > 1].sort_values()

## Transforms

Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with `transform`:

In [None]:
df

In [None]:
df.groupby('key').transform('mean')

In [None]:
def normalize(group):
    return (group - group.mean()) / group.std()

In [None]:
df.groupby('key').transform(normalize)

In [None]:
df.groupby('key').transform('sum')
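
The practical difference between an aggregation and a transform is the shape of the result: an aggregation returns one row per group, a transform returns one row per original row, aligned on the original index. The sketch below uses this to subtract the group mean from each value; the `centered` column is a name chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Aggregation: 3 groups -> 3 rows
agg = df.groupby('key')['data'].mean()

# Transform: the group mean is broadcast back to all 9 rows
trans = df.groupby('key')['data'].transform('mean')

# Because trans shares df's index, it can be combined with the original column
df['centered'] = df['data'] - trans
print(df)
```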

<div class="alert alert-success">
    <b>EXERCISE</b>: Add a column to the `cast` dataframe that indicates the number of roles for the film.
</div>

In [None]:
# group by title and year: different films can share a title (e.g. 'The Pink Panther')
cast['n_total'] = cast.groupby(['title', 'year'])['n'].transform('max')
cast.head()

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate the ratio of leading actor and actress roles to the total number of leading roles per decade.
</div>

Tip: you can do this in two steps with two groupby operations: first calculate the counts, then the ratios.

In [None]:
leading = cast[cast['n'] == 1]
sums_decade = leading.groupby([leading['year'] // 10 * 10, 'type']).size()
sums_decade

In [None]:
#sums_decade.groupby(level='year').transform(lambda x: x / x.sum())
ratios_decade = sums_decade / sums_decade.groupby(level='year').transform('sum')
ratios_decade

In [None]:
ratios_decade[:, 'actor'].plot()
ratios_decade[:, 'actress'].plot()
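
The ratio step can be checked on a tiny example. The counts below are invented and only mimic the `(year, type)` index of `sums_decade`; within each decade the ratios should add up to 1.

```python
import pandas as pd

# Made-up counts, indexed by (year, type) like sums_decade
index = pd.MultiIndex.from_product([[1950, 1960], ['actor', 'actress']],
                                   names=['year', 'type'])
counts = pd.Series([30, 10, 45, 15], index=index)

# Total per decade, repeated for every row of that decade: 40, 40, 60, 60
totals = counts.groupby(level='year').transform('sum')

# Element-wise division gives the share of each type within its decade
ratios = counts / totals
print(ratios)
```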

## Intermezzo: string manipulations

Python strings have a lot of useful methods available to manipulate or check the content of the string:

In [None]:
s = 'Bratwurst'

In [None]:
s.startswith('B')

In pandas, those methods (together with some additional methods) are also available for string Series through the `.str` accessor:

In [None]:
s = pd.Series(['Bratwurst', 'Kartoffelsalat', 'Sauerkraut'])

In [None]:
s.str.startswith('B')

For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
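
A few more `.str` methods, as a quick sketch on a small Series. Each call is applied element-wise and returns a new Series with the same index.

```python
import pandas as pd

s = pd.Series(['Bratwurst', 'Kartoffelsalat', 'Sauerkraut'])

lengths = s.str.len()                # number of characters per element
upper = s.str.upper()                # upper-cased copy of each string
has_kraut = s.str.contains('kraut')  # boolean mask, usable for filtering

print(s[has_kraut])
```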

<div class="alert alert-success">
    <b>EXERCISE</b>: We already plotted the number of 'Hamlet' films released each decade, but not all titles are exactly called 'Hamlet'. Give an overview of the titles that contain 'Hamlet', and of those that start with 'Hamlet':
</div>

In [None]:
hamlets = titles[titles['title'].str.contains('Hamlet')]
hamlets['title'].value_counts()

In [None]:
hamlets = titles[titles['title'].str.match('Hamlet')]
hamlets['title'].value_counts()
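
The two solutions differ in where they look: `str.contains` searches anywhere in the string, while `str.match` treats its argument as a regular expression that must match at the start. For a plain prefix, `str.startswith` gives the same result without involving regular expressions. A sketch on a few illustrative titles (not taken from `titles.csv`):

```python
import pandas as pd

t = pd.Series(['Hamlet', 'Hamlet 2', 'Royal Hamlet', 'Macbeth'])  # illustrative titles

contains = t[t.str.contains('Hamlet')]   # 'Hamlet' anywhere in the title
matches = t[t.str.match('Hamlet')]       # regex match anchored at the start
starts = t[t.str.startswith('Hamlet')]   # literal prefix test
exact = t[t == 'Hamlet']                 # exact equality, as in the earlier exercise

print(contains.tolist(), matches.tolist(), starts.tolist(), exact.tolist(), sep='\n')
```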

<div class="alert alert-success">
    <b>EXERCISE</b>: List the 10 movie titles with the longest name.
</div>

In [None]:
title_longest = titles['title'].str.len().nlargest(10)
title_longest

In [None]:
pd.options.display.max_colwidth = 210
titles.loc[title_longest.index]

## Value counts

A useful shortcut to calculate the number of occurrences of each value is `value_counts` (this is somewhat equivalent to `df.groupby(key).size()`).

For example, what are the most frequently occurring movie titles?

In [None]:
titles.title.value_counts().head()
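
The "somewhat equivalent" can be made precise on a toy Series: both approaches produce the same counts, but `value_counts` sorts by count (largest first), while `groupby(...).size()` sorts by the group key. The titles below are only placeholders.

```python
import pandas as pd

t = pd.Series(['Othello', 'Macbeth', 'Othello', 'Hamlet', 'Othello', 'Macbeth'],
              name='title')

vc = t.value_counts()     # ordered by count: Othello, Macbeth, Hamlet
gs = t.groupby(t).size()  # ordered by key:   Hamlet, Macbeth, Othello

# Sorting the groupby result by count reproduces the value_counts order here
gs_sorted = gs.sort_values(ascending=False)
print(vc)
print(gs_sorted)
```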

<div class="alert alert-success">
    <b>EXERCISE</b>: Which years saw the most films released?
</div>

In [None]:
t = titles
t.year.value_counts().head(3)

<div class="alert alert-success">
    <b>EXERCISE</b>: Plot the number of released films over time.
</div>

In [None]:
titles.year.value_counts().sort_index().plot()

<div class="alert alert-success">
    <b>EXERCISE</b>: Plot the number of "Hamlet" films made each decade.
</div>

In [None]:
t = titles
t = t[t.title == 'Hamlet']
(t.year // 10 * 10).value_counts().sort_index().plot(kind='bar')

<div class="alert alert-success">
    <b>EXERCISE</b>: What are the 11 most common character names in movie history?
</div>

In [None]:
cast.character.value_counts().head(11)

<div class="alert alert-success">
    <b>EXERCISE</b>: Which actors or actresses appeared in the most movies in the year 2010?
</div>

In [None]:
cast[cast.year == 2010].name.value_counts().head(10)

<div class="alert alert-success">
    <b>EXERCISE</b>: Plot how many roles Brad Pitt has played in each year of his career.
</div>

In [None]:
cast[cast.name == 'Brad Pitt'].year.value_counts().sort_index().plot()

<div class="alert alert-success">
    <b>EXERCISE</b>: Which 10 film titles that start with "The Life" have the most roles listed?
</div>

In [None]:
c = cast
c[c.title.str.startswith('The Life')].title.value_counts().head(10)

<div class="alert alert-success">
    <b>EXERCISE</b>: How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in the 2000s?
</div>

In [None]:
c = cast
c = c[c.year // 10 == 195]
c = c[c.n == 1]
c.type.value_counts()

In [None]:
c = cast
c = c[c.year // 10 == 200]
c = c[c.n == 1]
c.type.value_counts()