# Exploratory Data Analysis

Brief introduction to Pandas, Matplotlib and Seaborn

## 1. Data munging in Pandas

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("titanic-train.csv")

## Quick exploration

In [None]:
df.head(3)

In [None]:
df.info()

In [None]:
df.describe()

## Indexing

In [None]:
# .ix was removed in pandas 1.0; .loc selects the row by index label
df.loc[0]

In [None]:
df.iloc[3]

In [None]:
df.loc[0:4,'Ticket']

In [None]:
df['Ticket'].head()
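The difference between `.loc` and `.iloc` is easy to miss on the Titanic data, because its default index makes labels and positions coincide. Here is a minimal sketch on a made-up toy frame (the values and the index 10..13 are assumptions for illustration, not rows from the CSV) where the two differ:

```python
import pandas as pd

# Toy frame with made-up values and a non-default index,
# so that labels (10..13) and positions (0..3) are different things.
toy = pd.DataFrame(
    {"Ticket": ["A1", "B2", "C3", "D4"], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=[10, 11, 12, 13],
)

by_label = toy.loc[10:12, "Ticket"]    # label slice: both ends included
by_position = toy.iloc[0:2]["Ticket"]  # position slice: end excluded

print(list(by_label))     # ['A1', 'B2', 'C3']
print(list(by_position))  # ['A1', 'B2']
```

Note that the label slice returns three rows while the position slice returns two.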

## Selections

In [None]:
df[df.Age>65]

In [None]:
df[(df.Age==11)&(df.SibSp==5)]

In [None]:
df[(df.Age==11)|(df.SibSp==5)]
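The parentheses in the selections above are required: `&` and `|` bind more tightly than `==`, so each comparison has to be wrapped before the masks are combined. A small sketch on a made-up toy frame (not the Titanic file), with `DataFrame.query` shown as an alternative spelling of the same filter:

```python
import pandas as pd

# Made-up values, only for illustrating how boolean masks combine
toy = pd.DataFrame({"Age": [11, 70, 11, 30], "SibSp": [5, 0, 1, 5]})

both = toy[(toy["Age"] == 11) & (toy["SibSp"] == 5)]    # rows matching both conditions
either = toy[(toy["Age"] == 11) | (toy["SibSp"] == 5)]  # rows matching at least one
same_as_both = toy.query("Age == 11 and SibSp == 5")    # same filter as `both`

print(len(both), len(either))  # 1 3
```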

## Distinct Elements

In [None]:
df['Embarked'].unique()

## Basic Stats

In [None]:
print(df['Age'].mean())
print(df['Fare'].median())
print((df['Sex'] == 'female').sum())

## Missing Data

In [None]:
df.info()

In [None]:
df['Age'].fillna(30)
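Keep in mind that `fillna` returns a new Series and leaves `df` untouched, so the result has to be assigned back if you want to keep it. A minimal sketch on a made-up toy column; filling with the median instead of a fixed 30 is just one possible choice, not a recommendation from this notebook:

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

filled = toy["Age"].fillna(30)    # new Series; toy itself is unchanged
print(toy["Age"].isnull().sum())  # 2

# Assign the result back to actually update the column
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy["Age"].isnull().sum())  # 0
```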

## Groupby

In [None]:
# Find average age of passengers that survived vs. died
df.groupby('Survived')['Age'].mean()

## Pivot Tables

In [None]:
df.pivot_table(index='Sex', columns='Parch', values='Survived', aggfunc='sum')

In [None]:
df.pivot_table(index='Sex', columns='SibSp', values='Survived', aggfunc='sum')
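Since `Survived` is 0 or 1, `aggfunc='sum'` counts survivors in each cell, while `aggfunc='mean'` gives the survival rate. The sketch below uses a made-up toy frame (not the real passengers) and also shows that a pivot table is roughly a `groupby` followed by `unstack`:

```python
import pandas as pd

# Made-up rows that reuse the Titanic column names
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Parch": [0, 1, 0, 1, 1, 0],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Survival rate per (Sex, Parch) cell
rate = toy.pivot_table(index="Sex", columns="Parch", values="Survived", aggfunc="mean")

# The same table built step by step
same = toy.groupby(["Sex", "Parch"])["Survived"].mean().unstack()

print(rate)
```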

## Exercises:

- select passengers that died
- select passengers who paid a fare of less than 40 and were in third class
- locate the name of the passenger with PassengerId 674
- count the number of survived and the number of dead passengers
- count the number of survived and dead per each gender
- calculate average price paid by survived and dead people

In [None]:
df.groupby(['Sex', 'Survived'])['PassengerId'].count()

## 2. Data Visualization

In [None]:
df.Age.plot()

Let's give it a title and bigger fonts

In [None]:
df.Age.plot(fontsize=15)
plt.title('Line Plot', size=20)

What if I wanted to plot the data points?

In [None]:
df.Age.plot(style='o', fontsize=15)
plt.title('Point Plot', size=20)

How about looking at the distribution of this data?

In [None]:
df.Age.plot(kind='hist', fontsize=15)
plt.title('Histogram', size=20)

Here are various plot kinds:
    - 'line' : line plot (default)
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    - 'density' : same as 'kde'
    - 'area' : area plot
    - 'pie' : pie plot

Try doing a density plot or a box plot and see what happens.
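As a starting point, here is a minimal box plot sketch on a handful of made-up ages (an assumption for illustration; swap in `df.Age` for the real data). The density plot is left out here because `kind='kde'` additionally needs SciPy installed.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up ages, only to have something to draw
ages = pd.Series([2, 18, 22, 25, 30, 35, 41, 80], name="Age")

ax = ages.plot(kind="box")         # box spans the quartiles, line marks the median
ax.set_title("Box Plot", size=20)

# The numbers the box is built from (quartiles, min and max)
print(ages.describe())
```

Compare the quartiles printed by `describe()` with the edges of the box in the figure.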

## Exercises:
- plot the age histogram of the titanic passengers
- plot a pie chart of survived
- plot the age histogram of the two sub-populations of dead and survived passengers (in the same plot)
- plot the age histogram of the two sub-populations of male and female (in the same plot)
- plot a bar chart of the port of embarkation

## 3. Prettier plots with Seaborn

In [None]:
import seaborn as sns
df['Age'].plot(kind='hist')
plt.title('Histogram of Age')
plt.xlabel('Age')

What if I wanted to look at the influence of multiple variables on survival?

In [None]:
sns.set(style="ticks")
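# Keep fares below 100 and take both variables from the same rows
subset = df[df['Fare'] < 100]
x = subset['Age']
y = subset['Fare']
# Recent seaborn versions require x and y as keyword arguments
sns.jointplot(x=x, y=y, kind="hex", color="#4CB391", dropna=True)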

In [None]:
sns.jointplot(x=x, y=y, kind="kde", color="#4CB391")

In [None]:
sns.violinplot(x="Sex", y="Age", hue="Survived", data=df, palette="muted", split=True)

## Exercise:
- Explore the [Seaborn Gallery](https://stanford.edu/~mwaskom/software/seaborn/examples/index.html) and try out other plot types.

Copyright © Francesco Mosconi & Dataweekends.com