# Exploratory data analysis in Jupyter

### Keyboard shortcuts available in Edit and Command modes
* `Enter` key to enter Edit mode (`Escape` to enter Command mode)
* `Ctrl`-`Enter`: run the cell
* `Shift`-`Enter`: run the cell and select the cell below
* `Alt`-`Enter`: run the cell and insert a new cell below
* `Ctrl`-`s`: save the notebook

### Useful keyboard shortcuts in **Command mode**
 - `Tab` key for code completion or indentation (this one works in Edit mode)
 - `m` and `y` to toggle between Markdown and Code cells
 - `d-d` to delete a cell
 - `z` to undo deleting
 - `a/b` to insert cells above/below current cell
 - `x/c/v` to cut/copy/paste cells
 - `Up/Down` or `k/j` to select previous/next cells

> Some of the following material has been adapted from an example in the [IPython Cookbook](http://ipython-books.github.io/), by Cyrille Rossant, Packt Publishing, 2014.


We start with three important Python packages:
- `matplotlib` is the standard Python package for plotting, the "grandfather of all Python visualization packages"
- `numpy` is the fundamental package for scientific computing with Python
- `pandas` is a more recently developed package for data manipulation and analysis

We will download and process a dataset about attendance on Montreal's bicycle tracks. 

The first step is to import the Python modules that will be used:


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Location of the dataset:

In [None]:
url = "https://github.com/ipython-books/cookbook-data/raw/master/bikes.csv"

Pandas defines a `read_csv` function that can read any CSV file. Given the URL of the file, Pandas automatically downloads and parses it and returns a `DataFrame` object. We need to specify a few options to make sure the dates are parsed correctly.

In [None]:
pd.read_csv?

In [None]:
#df = pd.read_csv(url, index_col='Date', parse_dates=True, dayfirst=True)
df = pd.read_csv("data/bikes.csv", index_col='Date', parse_dates=True, dayfirst=True) # in case of internet problems
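
To see what the date options do, here is a minimal sketch on a tiny inline CSV. The column names `TrackA` and `TrackB` and all values are made up for illustration; they are not taken from the bike dataset.

```python
import io
import pandas as pd

# Made-up sample with dates written as day/month/year (not the real bike data).
csv_text = """Date,TrackA,TrackB
01/02/2013,10,3
02/02/2013,12,5
"""

# index_col='Date' turns the Date column into the row index,
# parse_dates=True converts that index to timestamps,
# dayfirst=True reads 01/02/2013 as 1 February rather than 2 January.
demo = pd.read_csv(io.StringIO(csv_text), index_col='Date',
                   parse_dates=True, dayfirst=True)
print(demo.index)
```

Without `dayfirst=True`, an ambiguous date such as `01/02/2013` would typically be interpreted month-first.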

The `df` variable now contains a `DataFrame` object, a Pandas data structure that contains 2D tabular data. The `head(n)` method displays the first `n` rows of this table.

In [None]:
df.head(5)

Each row corresponds to one day of the year and contains the number of bicycles counted on each track of the city.

Get some summary statistics of the table with the `describe` method:

In [None]:
df.describe()

Pandas has plotting capabilities through a layer over `matplotlib`.  
Let's plot the daily attendance of two tracks. First, we select the two columns `'Berri1'` and `'PierDup'`. Then, we call the `plot` method.

In [None]:
df[['Berri1', 'PierDup']].plot(figsize=(8,4),
                               style=['-', '--']);

Let's now look at the attendance of all tracks as a function of the weekday. We can get the weekday easily with Pandas: the `index` attribute of the `DataFrame` contains the dates of all rows in the table. This index has a few date-related attributes, including `weekday`.

In [None]:
df.index.weekday

However, we would like to have names (Monday, Tuesday, etc.) instead of numbers between 0 and 6. First, we create an array `days` with all weekday names. Then, we index it by `df.index.weekday`. This operation replaces every integer in the index by the corresponding name in `days`. The first element, `Monday`, has index 0, so every 0 in `df.index.weekday` is replaced by `Monday`, and so on. We assign this new index to a new column `Weekday` in the `DataFrame`.

In [None]:
days = np.array(['Monday', 'Tuesday', 'Wednesday', 
                 'Thursday', 'Friday', 'Saturday', 
                 'Sunday'])
df['Weekday'] = days[df.index.weekday]
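
The lookup above relies on NumPy integer-array indexing. As a small sketch, with a hypothetical `codes` array standing in for `df.index.weekday`:

```python
import numpy as np

days_demo = np.array(['Monday', 'Tuesday', 'Wednesday',
                      'Thursday', 'Friday', 'Saturday',
                      'Sunday'])

# Hypothetical weekday numbers (0 = Monday, ..., 6 = Sunday).
codes = np.array([0, 6, 2, 0])

# Indexing an array with an integer array picks one element per integer,
# so the result has the same length as `codes`.
names = days_demo[codes]
print(names)
```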

In [None]:
df.head(5)

To get the attendance as a function of the weekday, we need to group the table by the weekday. The `groupby` method lets us do just that. Once grouped, we can sum all rows in every group.

In [None]:
df.groupby?

In [None]:
df_week = df.groupby('Weekday',sort=False).sum()

In [None]:
df_week
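
To make the grouping step concrete, here is a sketch on a toy table (the values and the column names `TrackA` and `TrackB` are invented). With `sort=False`, the groups appear in the order in which their keys first occur instead of alphabetically.

```python
import pandas as pd

toy = pd.DataFrame({
    'Weekday': ['Wednesday', 'Tuesday', 'Wednesday', 'Tuesday'],
    'TrackA': [10, 20, 30, 40],
    'TrackB': [1, 2, 3, 4],
})

# Rows sharing the same Weekday are collected into one group,
# then every remaining column is summed within each group.
toy_week = toy.groupby('Weekday', sort=False).sum()
print(toy_week)
```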

We can now display this information in a figure. We first need to reorder the table by weekday using `loc` (label-based indexing; the older `ix` indexer has been removed from recent Pandas versions). Then, we plot the table, specifying the line width and the figure size.

In [None]:
df_week.loc[days].plot(lw=3, figsize=(6,4));
plt.ylim(0);  # Set the bottom axis to 0.
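
Reordering by label works because selecting rows with a list of labels returns them in the order of that list. A minimal sketch with made-up totals, using label-based selection via `.loc`:

```python
import pandas as pd

# Made-up weekly totals, stored in an arbitrary order.
totals = pd.Series([5, 7, 3], index=['Wednesday', 'Monday', 'Tuesday'])

# The result follows the order of the requested labels.
ordered = totals.loc[['Monday', 'Tuesday', 'Wednesday']]
print(ordered)
```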

Finally, let's illustrate interactive capabilities through `widgets`. We plot a *smoothed* version of the track attendance as a function of time (**rolling mean**). The idea is to compute the mean value in the neighborhood of any day. The larger the neighborhood, the smoother the curve. We will create an interactive slider in the notebook to vary this parameter in real-time in the plot.

In [None]:
from ipywidgets import interact
#from IPython.html.widgets import interact # IPython < 4.x
@interact
def plot(n=(1, 30)):
    plt.figure(figsize=(8,4));
    # pd.rolling_mean has been removed from Pandas; use .rolling(n).mean() instead
    df['Berri1'].rolling(n).mean().dropna().plot();
    plt.ylim(0, 8000);
    plt.show();
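
As a sketch of what the rolling mean computes, consider a short made-up series and a window of 3. The first two entries have fewer than three values available, so they come out as `NaN`, which is why the plot above calls `dropna()`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output value is the mean of the current value and the two before it.
smoothed = s.rolling(3).mean()
print(smoothed)

# Dropping the incomplete windows leaves the means of (1,2,3), (2,3,4), (3,4,5).
print(smoothed.dropna().tolist())
```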

### <font color="red"> *Exercise* </font>

- Create a widget that computes the square of integers between 0 and 10!

### <font color="green"> *Solution* </font>

In [None]:
from ipywidgets import interact  # IPython.html.widgets before IPython 4.0
@interact(x=(0, 10))
def square(x):
    return("The square of %d is %d." % (x, x**2))

## Seaborn

- While `matplotlib` is extremely powerful, it can also be complex.
- Getting good-looking graphs sometimes takes a lot of effort.
- `seaborn` is a higher-level visualization package based on `matplotlib`.
- Its default styling is much more appealing than that of `matplotlib`!


### <font color="blue"> Demo: Seaborn </font>

We first load the `seaborn` module 

In [None]:
import seaborn as sns

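`Seaborn`'s `heatmap` plots a heatmap for 2D data such as a `numpy` array or a Pandas `DataFrame`; for a `DataFrame`, the index and column names are used as axis labels.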

In [None]:
ax = sns.heatmap(df_week.loc[days], linewidths=.5)

### <font color="red"> *Exercise* </font>

- Annotate each cell with the numeric value using integer formatting!

### <font color="green"> *Solution* </font>

In [None]:
sns.heatmap?

In [None]:
ax = sns.heatmap(df_week.loc[days], linewidths=.5, annot=True, fmt="d")

### Nobel prizes 

We move on to another dataset. We import the packages we need and load the dataset:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# dataset from http://oppnadata.se/en/dataset/nobel-prizes/resource/f3da8ba9-a17f-4911-9003-4bcef93619cc
nobel = pd.read_csv("data/nobels.csv")
nobel

Add a column of ones (one Nobel prize per laureate...)

In [None]:
nobel["number"]=1

### <font color="red"> *Exercise* </font>

- Use the groupby method and `sum()` to extract total numbers of Nobel prizes by country

### <font color="green"> *Solution* </font>

In [None]:
nobels_by_country = nobel.groupby('country',sort=False).sum()
nobels_by_country

Let's extract just the number of prizes per country

In [None]:
# extract Series from DataFrame:
print(type(nobels_by_country))
nobels_by_country = nobels_by_country["number"]
print(type(nobels_by_country))

In [None]:
nobels_by_country
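
The difference between the two types can be sketched on a toy table (invented values): indexing with a single column name returns a `Series`, while indexing with a list of names returns a `DataFrame`, even if the list has only one entry.

```python
import pandas as pd

toy = pd.DataFrame({'number': [1, 2], 'year': [1901, 1902]},
                   index=['A', 'B'])

col = toy['number']     # single label: one-dimensional Series
sub = toy[['number']]   # list of labels: two-dimensional DataFrame

print(type(col))
print(type(sub))
```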

Hmm, West Germany is listed separately. Let's unify Germany!

In [None]:
nobel = nobel.replace(to_replace="Federal Republic of Germany",value="Germany")
nobels_by_country = nobel.groupby('country',sort=False).sum()
nobels_by_country = nobels_by_country["number"]
nobels_by_country
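
The effect of `replace` followed by regrouping can be checked on a few made-up rows (not taken from the Nobel dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Germany', 'Federal Republic of Germany', 'France'],
    'number': [1, 1, 1],
})

# Every cell equal to the old name is replaced by the new one.
toy = toy.replace(to_replace='Federal Republic of Germany', value='Germany')

# Selecting the 'number' column before summing yields a Series directly.
counts = toy.groupby('country', sort=False)['number'].sum()
print(counts)
```

After the replacement, both German rows fall into the same group.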

How many prizes has Finland received?

In [None]:
nobels_by_country["Finland"]

Who was it?

In [None]:
nobel.loc?

In [None]:
nobel.loc[nobel['country'] == "Finland"]

OK, this dataset seems to be incomplete: according to [this link](https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country#Finland), Finland has received 5 prizes...

We move on. Let's extract four countries with the highest number of prizes, and generate some plots

In [None]:
countries = np.array(["Germany", "France", "USA", "United Kingdom"])

In [None]:
nobel2 = nobel.loc[nobel['country'].isin(countries)]
nobel2

In [None]:
sns.violinplot(y="year", x="country",inner="stick", data=nobel2);

We can also use multiple conditions. Let's extract only physics prizes

In [None]:
nobel3 = nobel.loc[nobel['country'].isin(countries) & nobel['category'].isin(['physics'])]
sns.violinplot(y="year", x="country",inner="stick", data=nobel3);
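
Row selection with conditions works through boolean masks. The sketch below uses an invented table to show how two masks are combined with `&`. Note that `&` binds more tightly than `==`, so a comparison written inline must be wrapped in parentheses, for example `(toy['category'] == 'physics')`.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['USA', 'France', 'Finland', 'USA'],
    'category': ['physics', 'chemistry', 'physics', 'peace'],
})

# Each condition yields a boolean Series with one entry per row.
in_countries = toy['country'].isin(['USA', 'France'])
is_physics = toy['category'] == 'physics'

# `&` keeps only the rows where both conditions are True.
both = toy.loc[in_countries & is_physics]
print(both)
```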

In [None]:
sns.swarmplot(y="year", x="country", data=nobel2, alpha=.5);

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x="country", y="year", col="category", data=nobel2, kind="swarm");

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x="country", col="category", data=nobel2, kind="count");

## Other types of media

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('j9YpkSX7NNM')

In [None]:
from IPython.display import Audio
Audio("data/GW150914_L1_whitenbp.wav")