# Tidy Data

> Structuring datasets to facilitate analysis [(Wickham 2014)](http://www.jstatsoft.org/v59/i10/paper)

If there's one maxim I can impart, it's that your tools shouldn't get in the way of your analysis. Your problem is already difficult enough; don't let the data or your tools make it any harder.

## The Rules

In a tidy dataset...

1. Each variable forms a column
2. Each observation forms a row

Consistently following these rules when writing your data-processing functions makes the whole experience smoother.
We'll cover a few methods that help you tidy your data.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", context="talk")
plt.rcParams['figure.figsize'] = 12, 8
pd.options.display.max_rows = 10
%matplotlib inline

Earlier, I fetched some data:

```python
tables = pd.read_html("http://www.basketball-reference.com/leagues/NBA_2016_games.html")
games = tables[0]
games.to_csv('data/games.csv', index=False)
```

In [None]:
pd.read_html?

In [None]:
!head -n  5 data/games.csv

So the data is roughly like

| Date        | Visitor Team | Visitor Points | Home Team | Home Points |
| ----------- | ------------ | -------------- | --------- | ----------- |
| 2015-10-07  | Detroit      | 106            | Atlanta   | 94          |
| ...         | ...          | ...            | ...       | ...         |

Plus some extra junk.

[The Question](http://stackoverflow.com/questions/22695680/python-pandas-timedelta-specific-rows):
> **How many days of rest did each team get between each game?**

Whether or not your dataset is tidy depends on your question. Given our question, what is an observation?

---
<a href="#answer" class="btn btn-default" data-toggle="collapse">Show Answer</a>
<div id="answer" class="collapse">
An observation is a (team, game) pair. So no, we don't have a tidy dataset.
A tidy dataset would look like this:

<table>
<thead>
<tr class="header">
<th>Date</th>
<th>Team</th>
<th>Home / Away</th>
<th>Points</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>2015-10-07</td>
<td>Detroit</td>
<td>Away</td>
<td>106</td>
</tr>
<tr class="even">
<td>2015-10-07</td>
<td>Atlanta</td>
<td>Home</td>
<td>94</td>
</tr>
<tr class="odd">
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>

</div>




In [None]:
column_names = {'Date': 'date', 'Start (ET)': 'start',
                'Unnamed: 2': 'box', 'Visitor/Neutral': 'away_team',
                'PTS': 'away_points', 'Home/Neutral': 'home_team',
                'PTS.1': 'home_points', 'Unnamed: 7': 'n_ot'}

games = (
    pd.read_csv("data/games.csv")
      .rename(columns=column_names)
      .dropna(thresh=4)
      [['date', 'away_team', 'away_points', 'home_team', 'home_points']]
      .assign(date=lambda x: pd.to_datetime(x['date'], format='%a, %b %d, %Y'))
      .set_index('date', append=True)
      .rename_axis(["game_id", "date"])
      .sort_index()
)
games.head()

Above, we saw that we need to collapse the away / home teams down to two columns: one identifier and one for the value. Likewise with the points.
We'll also need to repeat the metadata fields, like the date and `game_id`, so that each observation is matched with the correct date. `pd.melt` does all this for us.
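
To see what `pd.melt` does before applying it to the real data, here is a minimal sketch on a made-up two-game frame. The frame `toy` and its values are hypothetical; only the column names mirror `games`.

```python
import pandas as pd

# Hypothetical two-game frame, shaped like `games` after reset_index().
toy = pd.DataFrame({
    "game_id": [0, 1],
    "date": pd.to_datetime(["2015-10-27", "2015-10-28"]),
    "away_team": ["Detroit", "Cleveland"],
    "home_team": ["Atlanta", "Chicago"],
})

# id_vars are repeated for every melted row; value_vars are stacked into
# a label column ("variable" by default) and a value column ("team").
toy_tidy = pd.melt(toy, id_vars=["game_id", "date"],
                   value_vars=["away_team", "home_team"],
                   value_name="team")
print(toy_tidy)
```

Two games become four rows, one per (game, team) pair, and the `variable` column records whether the team was home or away.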

In [None]:
tidy = pd.melt(games.reset_index(),
               id_vars=['game_id', 'date'], value_vars=['away_team', 'home_team'],
               value_name='team').sort_values(['game_id', 'date'])

tidy.head()

Now the translation from question to operation is direct:

In [None]:
# How many days of rest for each team
# For each team...  get number of days between games
tidy.groupby('team')['date'].diff().dt.days - 1

In [None]:
tidy['rest'] = tidy.groupby('team').date.diff().dt.days - 1
tidy.dropna().head()
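
The grouped `diff` is easier to check on a tiny example. This sketch uses hypothetical teams `A` and `B` and assumes, as in `tidy`, that rows are already sorted by date within each team.

```python
import pandas as pd

# Hypothetical schedule: team A plays on consecutive days, team B waits longer.
toy = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "date": pd.to_datetime(["2015-10-27", "2015-10-27",
                            "2015-10-28", "2015-10-30"]),
})

# diff() runs within each team, so every team's first game has no value.
# Subtracting 1 turns "days between games" into "full days off".
toy_rest = toy.groupby("team")["date"].diff().dt.days - 1
print(toy_rest.tolist())
```

Back-to-back games give 0 days of rest, and each team's first game stays missing, which is why the cell above calls `dropna()` before looking at the result.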

You can "invert" a `melt` with `pd.pivot_table`:

In [None]:
by_game = pd.pivot_table(tidy, values='rest',
                         index=['game_id', 'date'],
                         columns='variable').rename(
    columns={'away_team': 'away_rest', 'home_team': 'home_rest'}
)
by_game.columns.name = None

by_game.dropna().head()
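
As a small illustration of the long-to-wide direction, the sketch below pivots a hypothetical long frame (`long_toy`, with made-up rest values). Note that `pd.pivot_table` aggregates duplicates with the mean by default; here each (game, label) pair appears only once, so the values pass through unchanged.

```python
import pandas as pd

# Hypothetical long-form data: one row per (game, home/away label).
long_toy = pd.DataFrame({
    "game_id": [0, 0, 1, 1],
    "variable": ["away_team", "home_team", "away_team", "home_team"],
    "rest": [1.0, 0.0, 2.0, 1.0],
})

# Each distinct label in `variable` becomes its own column again.
wide_toy = pd.pivot_table(long_toy, values="rest",
                          index="game_id", columns="variable")
print(wide_toy)
```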

`concat` will merge two dataframes, expanding an `axis`, while aligning on the other axis.

In [None]:
df = pd.concat([games, by_game], axis='columns')
df.head()
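
The alignment behavior of `concat` is worth seeing in isolation. In this sketch the frames `left` and `right` are hypothetical and deliberately share only one index label.

```python
import pandas as pd

left = pd.DataFrame({"home_points": [94, 97]}, index=[0, 1])
right = pd.DataFrame({"home_rest": [1.0, 0.0]}, index=[1, 2])

# axis='columns' places the frames side by side; rows are matched by index
# label, and labels missing from one frame are filled with NaN (outer join).
both = pd.concat([left, right], axis="columns")
print(both)
```

Only label `1` has values in both columns. That is why `games` and `by_game` need the same `(game_id, date)` index for the concat above to line up.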

In [None]:
g = sns.FacetGrid(data=tidy.dropna(), col='team', col_wrap=5, hue='team')
g.map(sns.barplot, "variable", "rest");

In [None]:
delta = (df['home_rest'] - df['away_rest']).dropna().astype(int)
(delta.value_counts()
    .reindex(np.arange(delta.min(), delta.max() + 1), fill_value=0)
    .sort_index().plot(kind='bar', color='k', width=.9, rot=0, figsize=(12, 6)))
sns.despine()
plt.xlabel("Difference in Rest (home - away)")
plt.grid(axis='y');

# Stack / Unstack

Not all APIs expect tidy data, so you need to convert between "wide" and "long" form data.

In [None]:
rest = (tidy.groupby(['date', 'variable'])
            .rest.mean()
            .dropna())
rest.head()

`rest` is in "long" form. `DataFrame.plot`, for example, expects wide-form data.

In [None]:
rest.unstack().head()
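
A minimal sketch of what `unstack` does, on a hypothetical Series (`long_rest`) with the same two index levels as `rest`:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2015-10-28", "2015-10-29"]), ["away_team", "home_team"]],
    names=["date", "variable"],
)
# Made-up average rest values, one per (date, variable) pair.
long_rest = pd.Series([1.0, 0.5, 2.0, 1.5], index=idx, name="rest")

# unstack() moves the innermost index level ("variable") into the columns,
# giving one row per date and one column per label.
wide_rest = long_rest.unstack()
print(wide_rest)
```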

In [None]:
rest.unstack().rolling(14).mean().plot()

<div class="alert alert-success">
  <h1>Mini Project: Home Court Advantage?</h1>
</div>

What's the effect (in terms of probability to win) of being
the home team?
What are the components of the advantage?

We'll fit a logistic regression like

`home_win ~ home_strength + away_strength + home_rest + away_rest`
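
Written out, and assuming the standard logit link with the intercept that the formula interface adds by default, the formula corresponds to the model below. The coefficient names $\beta_0, \dots, \beta_4$ are notation introduced here for illustration.

$$
\Pr(\text{home\_win} = 1) = \frac{1}{1 + e^{-\eta}}, \qquad
\eta = \beta_0 + \beta_1\,\text{home\_strength} + \beta_2\,\text{away\_strength} + \beta_3\,\text{home\_rest} + \beta_4\,\text{away\_rest}
$$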

Our final dataframe will have one row per game (like `df`).
Most examples I've seen use a "team strength" variable in their regression estimating the home court advantage. We'll use the team's win percent as a proxy for team strength (which is cheating, but oh well).

## Step 0: Outcome variable

In [None]:
df['home_win'] = df.home_points > df.away_points

## Step 1: Calculate Win %


*Get each team's win percent as home and away*

- name the resulting DataFrame `wins`
- The output should be a DataFrame where
  * The index is a MultiIndex of `team`, `is_home` pairs
  * The columns are `win_pct`, `n_wins`, `n_games`

This is our final goal, but we have a few intermediate stages:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th></th>
      <th>n_wins</th>
      <th>n_games</th>
      <th>win_pct</th>
    </tr>
    <tr>
      <th>team</th>
      <th>is_home</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="2" valign="top">Atlanta Hawks</th>
      <th>away_team</th>
      <td>21.0</td>
      <td>41</td>
      <td>0.512195</td>
    </tr>
    <tr>
      <th>home_team</th>
      <td>27.0</td>
      <td>41</td>
      <td>0.658537</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">Boston Celtics</th>
      <th>away_team</th>
      <td>20.0</td>
      <td>41</td>
      <td>0.487805</td>
    </tr>
    <tr>
      <th>home_team</th>
      <td>28.0</td>
      <td>41</td>
      <td>0.682927</td>
    </tr>
    <tr>
      <th>Brooklyn Nets</th>
      <th>away_team</th>
      <td>7.0</td>
      <td>41</td>
      <td>0.170732</td>
    </tr>
  </tbody>
</table>

### 1.1: `pd.melt` `df` like before

Before, we melted to compute rest; this time we also have `home_win`.
Get a DataFrame with one row per `(game, team)` pair that includes the boolean `home_win` and whether that team was `home_team` or `away_team`.

In [None]:
id_vars = ...  # this changed, we have an extra column
value_vars = ...
value_name = 'team'
var_name = 'home_or_away'
games = pd.melt(df.reset_index(), id_vars=id_vars, value_vars=value_vars,
                var_name=var_name, value_name=value_name)
games.head()

In [None]:
%load -r 4:14 solutions/solutions_tidy.py

### 1.2: Assign a new column indicating whether the team in that row won.

Hint: The team in a row won if either

- `games.home_win` be True and `games.home_or_away == 'home_team'`
- `games.home_win` be False and `games.home_or_away == 'away_team'`

Goal:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>game_id</th>
      <th>date</th>
      <th>home_win</th>
      <th>home_or_away</th>
      <th>team</th>
      <th>win</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2015-10-27</td>
      <td>False</td>
      <td>home_team</td>
      <td>Atlanta Hawks</td>
      <td>False</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>2015-10-27</td>
      <td>True</td>
      <td>home_team</td>
      <td>Chicago Bulls</td>
      <td>True</td>
    </tr>
    <tr>
      <th>2</th>
      <td>3</td>
      <td>2015-10-27</td>
      <td>True</td>
      <td>home_team</td>
      <td>Golden State Warriors</td>
      <td>True</td>
    </tr>
    <tr>
      <th>3</th>
      <td>4</td>
      <td>2015-10-28</td>
      <td>True</td>
      <td>home_team</td>
      <td>Boston Celtics</td>
      <td>True</td>
    </tr>
    <tr>
      <th>4</th>
      <td>5</td>
      <td>2015-10-28</td>
      <td>False</td>
      <td>home_team</td>
      <td>Brooklyn Nets</td>
      <td>False</td>
    </tr>
  </tbody>
</table>

In [None]:
games['win'] = ...

In [None]:
%load -r 15:18 solutions/solutions_tidy.py

### 1.3: Aggregate

Use a `groupby` to get the

- number of wins
- number of total games
- win percent

for each team, at home and away.

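Hint: You can control the output names of `.agg()` with named aggregation (pandas 0.25 or later; the older dict-renaming form on a single column raises an error since pandas 1.0):

```python
groupby['column'].agg(
    output_name1=aggfunc1,
    output_name2=aggfunc2,
)
```

For example `n_wins='sum', n_games='count', ...`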
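
As a sketch of naming aggregation outputs on recent pandas versions, here is keyword-style named aggregation on a hypothetical frame. The names `group`, `flag`, `n_true`, `n_rows`, and `frac_true` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["a", "a", "b"],
    "flag": [True, False, True],
})

# Named aggregation: each keyword becomes an output column.
# The mean of a boolean column is the fraction of True values.
summary = toy.groupby("group")["flag"].agg(
    n_true="sum", n_rows="count", frac_true="mean"
)
print(summary)
```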

In [None]:
# Your solution
wins = games.groupby(...)['win'].agg(...)

In [None]:
%load -r 19:25 solutions/solutions_tidy.py

Quick vis

In [None]:
(wins.win_pct
    .unstack()
    .assign(**{'Home Win % - Away %': lambda x: x.home_team - x.away_team,
               'Overall %': lambda x: (x.home_team + x.away_team) / 2})
    .pipe((sns.regplot, 'data'), x='Overall %', y='Home Win % - Away %')
)
sns.despine()

In [None]:
# `size` was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(wins.reset_index(), hue='team', height=10, aspect=.5, palette=['k'])
g.map(sns.pointplot, 'home_or_away', 'win_pct').set(ylim=(0, 1));

## Step 2: Calculate the actual win percent.

Use `wins` to get `win_percent`.

The output should be a `Series` where the index is the team name and the values are the overall win percent (home and away games combined).

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>win_percent</th>
    </tr>
    <tr>
      <th>team</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Atlanta Hawks</th>
      <td>0.585366</td>
    </tr>
    <tr>
      <th>Boston Celtics</th>
      <td>0.585366</td>
    </tr>
    <tr>
      <th>Brooklyn Nets</th>
      <td>0.256098</td>
    </tr>
    <tr>
      <th>Charlotte Hornets</th>
      <td>0.585366</td>
    </tr>
    <tr>
      <th>Chicago Bulls</th>
      <td>0.512195</td>
    </tr>
  </tbody>
</table>


In [None]:
# Your code here
win_percent = ...

In [None]:
%load -r 26:35 solutions/solutions_tidy.py

## Step 3: Incorporate the `win_percent`  values

Bring the `win_percent` values into `df` for each team, for each game. Assign them to `away_strength` and `home_strength` in the `df` DataFrame.

Hint: Look up `pd.Series.map?`
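
To illustrate the hint without using the real data, this sketch maps a hypothetical column of keys through a lookup `Series`. The names `lookup` and `keys` and their values are invented for the example.

```python
import pandas as pd

# Hypothetical lookup table indexed by key, like `win_percent` indexed by team.
lookup = pd.Series({"x": 0.25, "y": 0.75})
keys = pd.Series(["y", "x", "y", "z"])

# Passing a Series to map() looks each value up in that Series' index;
# keys with no match come back as NaN.
mapped = keys.map(lookup)
print(mapped.tolist())
```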

Also calculate `point_diff` (home - away) and `rest_diff` (home - away).

In [None]:
win_percent

In [None]:
# Your code here


In [None]:
%load -r 36:43 solutions/solutions_tidy.py

Now we can fit the model

In [None]:
import statsmodels.formula.api as sm

In [None]:
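# patsy treats a boolean outcome as categorical, so cast `home_win` to int for the fit
mod = sm.logit('home_win ~ home_strength + away_strength + home_rest + away_rest',
               df.assign(home_win=df['home_win'].astype(int)))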
res = mod.fit()
res.summary()

# Recap

- Tidy data: one row per observation
    - melt / stack: wide to long
    - pivot_table / unstack: long to wide