# Jupyter Notebooks - part 3
* ### Data analysis and visualization

In this lesson you will learn:
- how to do some data analysis using pandas
- how to produce nice plots with seaborn
- how other types of media can be embedded in a notebook

## Exploratory data analysis in Jupyter

We will use four important Python packages:
1. `numpy` is the fundamental package for scientific computing with Python
2. `pandas` is a more recently developed package for data manipulation and analysis
 - a powerful high-level tool for data exploration
 - it provides two fundamental data structures that can be applied to many types of data: `Series` and `DataFrame`
3. `matplotlib` is the standard Python package for plotting, the "grandfather of all Python visualization packages"
4. `seaborn` is a higher-level visualization package based on `matplotlib`
 - while `matplotlib` is extremely powerful, it can also be complex, and it sometimes takes much effort to get good-looking graphs
 - the default `seaborn` styles are much more appealing than those of `matplotlib`

We will download and process a dataset on Nobel prizes. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

pandas defines a `read_csv` function that can read any CSV file and return a `DataFrame` object. If it is given a URL instead of a local path, pandas will automatically download and parse the file. Here we read a local copy of the dataset with the default options; the date column is converted to datetime format later on.

In [None]:
# dataset from http://oppnadata.se/en/dataset/nobel-prizes/resource/f3da8ba9-a17f-4911-9003-4bcef93619cc
nobel = pd.read_csv("data/nobels.csv")
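
To illustrate how `read_csv` options influence parsing, here is a minimal sketch that reads a small CSV text from memory instead of the real file. The column names mirror the dataset, but the two rows and the names `csv_text` and `demo` are invented for illustration; `parse_dates` is one of the options `read_csv` accepts for date columns.

```python
import io

import pandas as pd

# Invented sample in the same spirit as the Nobel dataset (not real file contents)
csv_text = """year,category,surname,born,bornCountry,share
1950,physics,Alpha,1900-01-31,Sweden,1
1975,chemistry,Beta,1920-06-15,France,2
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["born"])

print(demo.dtypes)   # "born" is parsed as a datetime column
print(demo.head(1))  # first row only
```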

The `nobel` variable now contains a `DataFrame` object, a pandas data structure that holds 2D tabular data. The `head(n)` method displays the first `n` rows of this table; called without an argument, it shows the first five.

In [None]:
nobel.head()

Each column (and each row) of the `DataFrame` is a `Series`:

In [None]:
nobel["year"]

In [None]:
type(nobel["year"])

In [None]:
nobel.loc[0]

In [None]:
type(nobel.loc[0])
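
The same behaviour can be tried out on a tiny table. This is only a sketch; the frame `demo` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical three-row table; the values are invented for illustration
demo = pd.DataFrame(
    {"surname": ["Alpha", "Beta", "Gamma"], "year": [1950, 1975, 2000]}
)

column = demo["year"]  # one column, indexed by the row labels 0, 1, 2
row = demo.loc[0]      # one row, indexed by the column names

print(type(column).__name__, list(column.index))
print(type(row).__name__, list(row.index))
```

Because a row mixes text and numbers, its `Series` has dtype `object`, whereas the `year` column keeps an integer dtype.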

We immediately have access to common statistical quantities

In [None]:
nobel["share"].min()

In [None]:
nobel["share"].max()

In [None]:
nobel["share"].mean()

In [None]:
nobel["share"].std()
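
One detail worth knowing when comparing with other tools: `Series.std()` computes the sample standard deviation (dividing by n - 1), whereas `numpy.std` divides by n unless told otherwise. A small sketch with invented numbers:

```python
import numpy as np
import pandas as pd

shares = pd.Series([1, 2, 2, 4])  # invented values, not taken from the dataset

sample_std = shares.std()                   # pandas default: ddof=1
population_std = np.std(shares.to_numpy())  # numpy default: ddof=0

print(shares.min(), shares.max(), shares.mean())
print(sample_std, population_std)

# describe() gathers count, mean, std, min, quartiles and max in one call
print(shares.describe())
```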

To calculate some more elaborate statistics, we first add a column `number` that is set to 1 in every row, so that each row (one laureate) counts as one Nobel prize:

In [None]:
nobel["number"]=1

In [None]:
nobel.count()

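The `count()` method returns the number of non-missing values in each column. The counts differ between columns, so the dataset is clearly not quite complete...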
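
A minimal sketch of how missing entries show up in these counts, using an invented table (`demo` and its values are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Invented table with two missing entries
demo = pd.DataFrame(
    {
        "surname": ["Alpha", "Beta", "Gamma"],
        "born": ["1900-01-01", None, "1950-06-15"],
        "bornCountry": ["Sweden", "France", np.nan],
    }
)

print(len(demo))          # total number of rows
print(demo.count())       # non-missing values per column
print(demo.isna().sum())  # missing values per column
```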

### Age statistics

Let's first look at statistics based on the age of prize recipients.  
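We first need to convert the `born` column to datetime format. With `errors='coerce'`, values that cannot be parsed become `NaT` (missing) instead of raising an error.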

In [None]:
type(nobel["born"][0])

In [None]:
nobel["born"] = pd.to_datetime(nobel["born"], errors='coerce')

In [None]:
type(nobel["born"][0])

In [None]:
nobel["born"].dt.year
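
The effect of `errors='coerce'` is easiest to see on a few hand-written strings. This is a sketch with invented input values, not entries from the dataset:

```python
import pandas as pd

# Invented strings: one valid date and two that cannot be parsed
raw = pd.Series(["1901-03-27", "0000-00-00", "unknown"])

born = pd.to_datetime(raw, errors="coerce")  # unparseable entries become NaT

print(born)
print(born.isna().tolist())
print(born.dt.year)  # NaT yields NaN, so the years come back as floats
```

Any quantity derived from such a column, for example an age, is also missing for those rows.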

We can now add a column to the `DataFrame` with the laureate's age when the prize was received, approximated as the prize year minus the birth year:

In [None]:
nobel["age"] = nobel["year"] - nobel["born"].dt.year
nobel[["surname","age"]].head(10)
#print(nobel["age"].to_string())

We can now plot a histogram of the age at which laureates received their prize, using the matplotlib-based plotting support built into pandas:

In [None]:
nobel.plot?

In [None]:
nobel["age"].plot.hist(bins=[20,30,40,50,60,70,80,90,100],alpha=0.6);

To extract the number of laureates in each age bin, use the `value_counts` method:

In [None]:
nobel["age"].value_counts(bins=[20,30,40,50,60,70,80,90,100])
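
To see how the binning works, here is a small sketch with invented ages. The names `ages`, `bins` and `counts` are only for illustration.

```python
import pandas as pd

ages = pd.Series([25, 35, 37, 62])  # invented ages
bins = [20, 30, 40, 50, 60, 70]

# sort=False keeps the bins in ascending order instead of sorting by count
counts = ages.value_counts(bins=bins, sort=False)
print(counts)

# pd.cut assigns each value to its interval, which makes the binning explicit
print(pd.cut(ages, bins=bins))
```

Note that these intervals are closed on the right, so an age of exactly 30 lands in the first bin. Histogram bins in matplotlib are closed on the left (except the last one), so counts at bin edges can differ slightly between the two.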

### Country statistics

We use the powerful `groupby` method to split the data into groups by country of birth, select the column `number`, and sum it to get the total number of Nobel prizes per country:

In [None]:
nobels_by_country = nobel.groupby('bornCountry',sort=True)["number"].sum()
nobels_by_country
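
The split, select, and sum steps can be sketched on a handful of invented rows (the frame `demo` is hypothetical):

```python
import pandas as pd

# Invented rows; each row stands for one laureate
demo = pd.DataFrame(
    {
        "bornCountry": ["Sweden", "France", "Sweden", "USA"],
        "number": [1, 1, 1, 1],
    }
)

# Split by country, pick the "number" column, and add it up per group
by_country = demo.groupby("bornCountry")["number"].sum()
print(by_country)

# Sorting by value instead of by country name gives a ranking
print(by_country.sort_values(ascending=False))
```

By default `groupby` leaves out rows whose group key is missing, so laureates without a `bornCountry` entry do not appear in the result.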

In [None]:
nobels_by_country.describe()

By default, a long pandas `Series` is displayed in truncated form. Let's print all rows:

In [None]:
print(nobels_by_country.to_string())

How many prizes have people born in Sweden received?

In [None]:
nobels_by_country["Sweden"]

Who were they?

In [None]:
nobel.loc[nobel['bornCountry'] == "Sweden"]

We move on. Let's extract four countries and generate some plots

In [None]:
countries = np.array(["France", "USA", "United Kingdom", "Sweden"])

In [None]:
nobel2 = nobel.loc[nobel['bornCountry'].isin(countries)]

We now group by both `bornCountry` and `category`

In [None]:
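# select the numeric column before summing; text and datetime columns cannot be summed meaningfully
nobels_by_country2 = nobel2.groupby(['bornCountry', 'category'], sort=True)["number"].sum()
nobels_by_country2.head(50)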

We can reshape the `DataFrame` a bit using the `pivot_table` method to create a spreadsheet-like representation:

In [None]:
table = nobel2.pivot_table(values="number", index="bornCountry", columns="category", aggfunc="sum")
table
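
The relation between the grouped result and the pivot table can be sketched on a few invented rows. The frame `demo` below is hypothetical and only mimics the column names of the dataset.

```python
import pandas as pd

# Invented rows; each row stands for one laureate
demo = pd.DataFrame(
    {
        "bornCountry": ["Sweden", "Sweden", "France", "Sweden"],
        "category": ["physics", "physics", "chemistry", "chemistry"],
        "number": [1, 1, 1, 1],
    }
)

# Long form: one value per (country, category) pair, indexed by a MultiIndex
grouped = demo.groupby(["bornCountry", "category"])["number"].sum()
print(grouped)

# Wide form: countries as rows, categories as columns
table = demo.pivot_table(
    values="number", index="bornCountry", columns="category", aggfunc="sum"
)
print(table)

# Unstacking the inner index level of the grouped result gives the same layout
print(grouped.unstack("category"))
```

Combinations without any laureate show up as `NaN` in the wide form; passing `fill_value=0` to `pivot_table` replaces them with zeros if that is preferred.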

This representation can be used to make a heatmap visualization

### Visualizing data with seaborn

In [None]:
sns.heatmap(table,linewidths=.5, annot=True);

Violin plots can reveal trends (but are not very quantitative)

In [None]:
sns.violinplot(y="year", x="bornCountry",inner="stick", data=nobel2);

We can also use multiple conditions. Let's extract only physics prizes

In [None]:
nobel3 = nobel.loc[nobel['bornCountry'].isin(countries) & nobel['category'].isin(['physics'])]
sns.violinplot(y="year", x="bornCountry",inner="stick", data=nobel3);
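
Selections like the one above are built from boolean masks. A minimal sketch with invented rows (the frame `demo` and the mask names are only for illustration):

```python
import pandas as pd

# Invented rows for illustration
demo = pd.DataFrame(
    {
        "bornCountry": ["Sweden", "USA", "Sweden", "Japan"],
        "category": ["physics", "physics", "peace", "physics"],
    }
)

in_countries = demo["bornCountry"].isin(["Sweden", "USA"])  # boolean Series
is_physics = demo["category"] == "physics"                  # boolean Series

# & and | combine masks element-wise; wrap comparisons in parentheses when inlined
both = demo.loc[in_countries & is_physics]
either = demo.loc[in_countries | is_physics]
inline = demo.loc[(demo["bornCountry"] == "Sweden") & (demo["category"] == "physics")]

print(both)
print(either)
print(inline)
```

The parentheses matter because `&` binds more tightly than `==` in Python, so an unparenthesized inline comparison would not be evaluated as intended.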

Swarmplots display categorical scatterplots with non-overlapping points

In [None]:
sns.swarmplot(y="year", x="bornCountry", data=nobel2, alpha=.5);

In [None]:
# factorplot was renamed to catplot in seaborn 0.9 and later removed
sns.catplot(x="bornCountry", y="year", col="category", data=nobel2, kind="swarm");

In [None]:
sns.catplot(x="bornCountry", col="category", data=nobel2, kind="count");

### Other visualization packages
* [Plotly](https://plot.ly/) - commercial online service for creating and sharing visualizations in notebooks
* [Bokeh](http://bokeh.pydata.org/en/latest/) - web-based, general-purpose and fast visualization toolkit
* [mpld3](http://mpld3.github.io/examples/index.html) - renders matplotlib figures as interactive graphics in the browser using D3.js; the example gallery must be seen...

## Other types of media

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('j9YpkSX7NNM')

In [None]:
from IPython.display import Audio
Audio("data/GW150914_L1_whitenbp.wav")

In [None]:
from IPython.display import IFrame
IFrame("http://jupyter.org",width='100%',height=350)