# Pandas

Let's start with some data from Tyler Vigen's website [Spurious correlations](http://www.tylervigen.com/spurious-correlations).
This particular data set concerns the strong correlation between per capita consumption of mozzarella cheese and the number of awarded civil engineering doctorates.

In [None]:
years = range(2000, 2010)
cheese = [9.3, 9.7, 9.7, 9.7, 9.9, 10.2, 10.5, 11, 10.6, 10.6]
doctorates = [480, 501, 540, 552, 547, 622, 655, 701, 712, 708]

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame({'cheese': cheese, 'doctorates': doctorates})
df

We can add a column to the data set by assigning to a new column name. In fact, we could have started with an empty dataframe and added columns to it in the same way.

`head` returns the first rows of a dataframe (five by default).

In [None]:
df['year'] = years
df.head()
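
As a minimal sketch of the empty-dataframe route mentioned above, the following builds the same table one column at a time. The name `empty_start` is only for illustration.

```python
import pandas as pd

# Same values as the cheese and doctorates lists above.
cheese = [9.3, 9.7, 9.7, 9.7, 9.9, 10.2, 10.5, 11, 10.6, 10.6]
doctorates = [480, 501, 540, 552, 547, 622, 655, 701, 712, 708]

# Start with no rows and no columns, then add one column at a time.
# The first assignment fixes the number of rows.
empty_start = pd.DataFrame()
empty_start['cheese'] = cheese
empty_start['doctorates'] = doctorates
empty_start['year'] = list(range(2000, 2010))

empty_start.head()
```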

Now, `year` isn't really data, so let's make it the index.

In [None]:
df.set_index('year', inplace=True)
df.head()

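Here's another way we could have built such a data frame. Note that stacking the two lists into a single numpy array converts the doctorate counts to floats.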

In [None]:
pd.DataFrame(
    np.array([cheese, doctorates]).transpose(),
    index=years,
    columns='cheese doctorates'.split())

We can do elementwise math with columns, just as we did with numpy arrays.

In [None]:
df['cheesy doctorates'] = df['cheese']*df['doctorates']
df.head()
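
Since this data set is all about a spurious correlation, here is a small sketch of computing it. It rebuilds the table under the hypothetical name `demo` so that it runs on its own; `Series.corr` returns the Pearson correlation by default.

```python
import pandas as pd

cheese = [9.3, 9.7, 9.7, 9.7, 9.9, 10.2, 10.5, 11, 10.6, 10.6]
doctorates = [480, 501, 540, 552, 547, 622, 655, 701, 712, 708]
demo = pd.DataFrame({'cheese': cheese, 'doctorates': doctorates},
                    index=list(range(2000, 2010)))

# Arithmetic between columns is elementwise and aligned on the index.
ratio = demo['doctorates'] / demo['cheese']

# Reductions are available as methods; corr() is the Pearson correlation.
r = demo['cheese'].corr(demo['doctorates'])
print(r)
```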

We can get rid of columns with `pop`, which removes the column in place and returns it.

In [None]:
df.pop('cheesy doctorates')
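
If you would rather leave the original untouched, `drop` is an alternative that returns a new dataframe. A small sketch on a made-up table (`demo` and its columns are just for illustration):

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# pop removes the column in place and returns it as a Series.
popped = demo.pop('c')

# drop leaves demo as it is and returns a new dataframe without the column.
without_b = demo.drop(columns=['b'])
```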

## Plotting

Plotting with pandas is great, although it's a layer on top of matplotlib, which is a bit of a strange animal.

The [pandas plotting documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html) is very good so I'll just give two trivial examples.

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl

%matplotlib inline
%config InlineBackend.figure_formats=['svg']

These are Erick's preferences.

In [None]:
import seaborn as sns
sns.set_style('ticks')

mpl.rcParams.update({
    'font.size': 16, 'axes.titlesize': 17, 'axes.labelsize': 15,
    'xtick.labelsize': 10, 'ytick.labelsize': 13,
    'font.family': 'Lato', 'font.weight': 600,
    'axes.labelweight': 600, 'axes.titleweight': 600,
    'figure.autolayout': True})

In [None]:
df['cheese'].plot()

In [None]:
df.plot.scatter(x='cheese', y='doctorates')

## Reading, merging, and writing

Let's load everyone's favorite data set, [the cost of sequencing the human genome](http://www.genome.gov/sequencingcosts/), which I downloaded via [a plotly page](https://plot.ly/~Dreamshot/79/cost-per-genome/).

In [None]:
genome = pd.read_csv('data/cost-per-genome.csv')
genome.head()

In [None]:
# Truncate the decimal year to an integer year and use it as the index.
genome.index = genome['decimal year'].astype(int).to_numpy()
genome.tail()
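
If you don't have the CSV file at hand, the same steps can be tried on an in-memory string. This is only a sketch: the rows below are made up, and the column names are assumed to match the ones used in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/cost-per-genome.csv; the values are invented.
csv_text = """decimal year,genome cost
2001.75,95000000
2002.25,70000000
2003.25,50000000
"""

# read_csv accepts any file-like object, not just a path.
toy_genome = pd.read_csv(io.StringIO(csv_text))

# Truncate the decimal year to an integer and use it as the index.
toy_genome.index = toy_genome['decimal year'].astype(int).to_numpy()
toy_genome
```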

Here's how to do a simple merge on indices. 
Merging is a complex topic [with lots of pandas documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html).

In [None]:
merged = pd.merge(df, genome, left_index=True, right_index=True)
merged.pop('decimal year')
merged.index.names = ['year']
merged.head()
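
To see what merging on indices keeps and discards, here is a sketch on two tiny made-up frames (`left`, `right`, and the `cost` column are hypothetical). By default `pd.merge` does an inner join, so only index labels present in both frames survive.

```python
import pandas as pd

left = pd.DataFrame({'cheese': [9.3, 9.7, 9.7]}, index=[2000, 2001, 2002])
right = pd.DataFrame({'cost': [3.0, 2.0, 1.0]}, index=[2001, 2002, 2003])

# The default how='inner' keeps only index labels present in both frames.
inner = pd.merge(left, right, left_index=True, right_index=True)

# how='outer' keeps every label and fills the gaps with NaN.
outer = pd.merge(left, right, left_index=True, right_index=True, how='outer')
```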

It's easy to write to a variety of formats, e.g.

In [None]:
merged.to_csv('amazing_dataset.csv')
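
Reading the file back is the mirror image. A sketch of the round trip, using an in-memory buffer instead of a file so that nothing is written to disk (the `demo` frame is made up):

```python
import io
import pandas as pd

demo = pd.DataFrame({'cheese': [9.3, 9.7]},
                    index=pd.Index([2000, 2001], name='year'))

# to_csv writes the index as the first column by default.
buffer = io.StringIO()
demo.to_csv(buffer)

# index_col tells read_csv to turn that column back into the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='year')
```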

## Indexing

I use the indexing functionality in pandas as follows.
You may find some other means that you prefer.

### .iloc for integer indexing

If we just want to get entries and slices using numpy-style indexing and slicing, we can do so using `.iloc`.

In [None]:
merged.iloc[1, 1]

In [None]:
s = merged.iloc[:5, :3]
s

In [None]:
s.iloc[0, 0] += 1e6
s

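However, unlike in numpy, where a slice is a view onto the original array, here slicing returns a copy so that the original dataframe remains the same. Whether pandas hands back a view or a copy can depend on the pandas version, and older versions may emit a `SettingWithCopyWarning` for an assignment like the one above, so it is safer to call `.copy()` when you intend to modify a slice.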

In [None]:
merged.head()
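
To make the intent explicit rather than relying on how slicing happens to behave, one option is to copy the slice before modifying it, and to index the original directly when you do want to change it. A sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# An explicit copy is independent of the original dataframe.
piece = demo.iloc[:2, :].copy()
piece.iloc[0, 0] += 1e6

# To change the original, assign through .iloc on the original itself.
demo.iloc[0, 1] = -1.0
```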

### .loc for logical indexing 

Say we have a boolean list or array that we would like to use for indexing.

In [None]:
leap_year = [y % 4 == 0 for y in merged.index]

merged.loc[leap_year]

In [None]:
merged.loc[merged['cheese'] > 10].head()
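
Conditions like this can be combined. The sketch below uses four rows from the cheese data under the hypothetical name `demo`; note the parentheses, which are needed because `&` binds more tightly than the comparisons.

```python
import pandas as pd

demo = pd.DataFrame({'cheese': [9.3, 9.9, 10.5, 10.6],
                     'doctorates': [480, 547, 655, 712]},
                    index=[2000, 2004, 2006, 2008])

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
both = demo.loc[(demo['cheese'] > 10) & (demo['doctorates'] < 700)]

# ~ negates a mask.
not_cheesy = demo.loc[~(demo['cheese'] > 10)]
```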

We can also select ranges of columns.

In [None]:
merged.loc[leap_year, 'doctorates':'genome cost']

In [None]:
merged['year type'] = 'normal'
merged.loc[leap_year, 'year type'] = 'leap'
merged.head()
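
An alternative way to build the same kind of column, sketched here on a made-up frame, is to compute the mask with vectorized arithmetic on the index and let `np.where` choose between the two labels. This is a different technique from the `.loc` assignment above, not a replacement for it.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'cheese': [9.3, 9.7, 9.7, 9.7, 9.9]},
                    index=[2000, 2001, 2002, 2003, 2004])

# Arithmetic on the index is vectorized, so no list comprehension is needed.
is_leap = demo.index % 4 == 0

# np.where picks 'leap' where the mask is True and 'normal' elsewhere.
demo['year type'] = np.where(is_leap, 'leap', 'normal')
```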

## Multiindex

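Sometimes it's nice to be able to index things in various ways. A dataframe can have more than one index level; here we append `year type` as a second level alongside `year`.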

In [None]:
merged.set_index('year type', append=True, inplace=True)
merged.head()

As with the earlier plots, the index levels label the x axis rather than being plotted as data.

In [None]:
merged.plot(logy=True)

We can get back the values of an index.

In [None]:
merged.index.get_level_values('year')
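
Selecting rows from a multi-indexed dataframe can be sketched as follows, again on a small made-up frame. A tuple with one label per level picks out a row, and `xs` selects on a single level by name.

```python
import pandas as pd

demo = pd.DataFrame({'cheese': [9.3, 9.7, 9.9],
                     'year': [2000, 2001, 2004],
                     'year type': ['leap', 'normal', 'leap']})
demo = demo.set_index(['year', 'year type'])

# A tuple with one label per level picks out a single row.
row = demo.loc[(2001, 'normal')]

# xs selects on one level by name and drops that level from the result.
leap_rows = demo.xs('leap', level='year type')
```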

We can put the indices back as columns of the data frame.

In [None]:
merged.reset_index().head()

## Groupby

Groupby objects are handy ways to split up dataframes.

In [None]:
g = merged.reset_index().groupby('year type')
g

In [None]:
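# Iterating over a groupby yields (key, sub-dataframe) pairs.
# The loop variable is named group so that it does not overwrite df.
for yt, group in g:
    print(yt, len(group))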
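
Looping is not the only way to use a groupby. Aggregations such as `size` and `mean` are applied per group and return one row per key. A sketch on the first five years of the data, with `demo` and `g_demo` as hypothetical names:

```python
import pandas as pd

demo = pd.DataFrame({'year type': ['leap', 'normal', 'normal', 'normal', 'leap'],
                     'cheese': [9.3, 9.7, 9.7, 9.7, 9.9],
                     'doctorates': [480, 501, 540, 552, 547]})
g_demo = demo.groupby('year type')

# Number of rows in each group.
sizes = g_demo.size()

# Per-group mean of the selected columns.
means = g_demo[['cheese', 'doctorates']].mean()
```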

We might want to make a multi-panel plot for example.

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(5, 6))
axd = {'leap': axs[0], 'normal': axs[1]}

# One scatter plot per group, colored by genome cost.
for yt, group in g:
    group.plot.scatter(ax=axd[yt], x='cheese', y='doctorates',
                       c='genome cost', title=yt)

sns.despine()  # Removes the top and right spines from the plots.

In [None]:
fig.savefig('lovely.pdf')