# 01 Introduction to Python: EDA 101

In [None]:
!pip install numpy pandas seaborn matplotlib

In [None]:
import numpy as np
import pandas as pd

## Pandas

The most commonly used Python library for working with tabular data.

In [None]:
url = 'https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv'
titanic = pd.read_csv(url)
titanic.info()

In [None]:
titanic

In [None]:
titanic.describe()

In [None]:
titanic.sort_values(by='age', ascending=False).head(5)

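Indexing can be tricky: square brackets with a list of names select columns, while .iloc selects rows and columns by integer position.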

In [None]:
titanic[['age', 'name']].head(5)

In [None]:
titanic.iloc[[2, 5, 6], 2:5]
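The difference between position-based and label-based indexing is easier to see on a small frame whose row labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical toy DataFrame (the names and values are made up for illustration, not taken from the Titanic data) to compare .iloc, .loc, and boolean masks.

```python
import pandas as pd

# Hypothetical toy frame; the row labels are deliberately not 0..n-1
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid", "Dee"],
        "age": [29, 41, 35, 52],
        "fare": [7.25, 71.3, 8.05, 53.1],
    },
    index=[10, 20, 30, 40],
)

# .iloc: integer positions, slice end is excluded
by_position = df.iloc[[0, 2], 0:2]

# .loc: labels, slice end is included
by_label = df.loc[[10, 30], "name":"age"]

# .loc with a boolean mask selects the rows where the condition holds
older = df.loc[df["age"] > 40, ["name", "age"]]

print(by_position)
print(older)
```

Both by_position and by_label pick the same cells here, but only because the positions and labels were chosen to match; after sorting or filtering a frame, the two generally diverge.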

In [None]:
type(titanic)

You can extract the underlying data as a NumPy array.

In [None]:
type(titanic.values)  # older accessor; to_numpy() is the recommended replacement
type(titanic.to_numpy())

In [None]:
ages = titanic.age.to_numpy()
ages.shape, ages.dtype
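One thing to watch for after converting to NumPy: if a column contains missing values, they arrive as NaN, and plain NumPy reductions propagate them. The sketch below uses a hypothetical three-element series (not the real age column) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series([22.0, np.nan, 38.0])
arr = s.to_numpy()

plain_mean = np.mean(arr)    # NaN propagates through the plain reduction
nan_mean = np.nanmean(arr)   # NaN-aware variant ignores missing entries
pandas_mean = s.mean()       # pandas skips NaN by default (skipna=True)

print(plain_mean, nan_mean, pandas_mean)
```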

For more details, see 10 Minutes to pandas (in practice it takes much longer):

http://pandas.pydata.org/pandas-docs/stable/10min.html

## Matplotlib

A workhorse of scientific visualization in Python.

In [None]:
from matplotlib import pyplot as plt

[usually unnecessary] Render figures inline in the notebook instead of in a pop-up window. Recent Jupyter versions do this by default.

In [None]:
# %matplotlib inline

## Seaborn

A high-level library for visualization and exploratory data analysis.

In [None]:
!pip install seaborn

In [None]:
import seaborn as sns

In [None]:
# sns.set() applies seaborn's default theme, which gives plots a more attractive color scheme
sns.set()

In [None]:
sns.catplot(x="pclass", kind="count", data=titanic)

In [None]:
sns.catplot(data=titanic, x="pclass", hue="sex", kind="count")

In [None]:
fg = sns.FacetGrid(titanic, hue="sex", aspect=3)
fg.map(sns.kdeplot, "age", fill=True)
fg.set(xlim=(0, 80));

In [None]:
fg = sns.FacetGrid(titanic, col="sex", row="pclass", hue="sex", height=2.5, aspect=2.5)
fg.map(sns.kdeplot, "age", fill=True)
fg.map(sns.rugplot, "age")
sns.despine(left=True)
fg.set(xlim=(0, 80));

See more examples of Seaborn visualizations for the Titanic dataset here:

https://gist.github.com/mwaskom/8224591

### Hands-on
1. Load data from the CSV file
2. Check column names
3. Look for dependencies between features and the target vector
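
As a starting point, the three steps might look roughly like the sketch below. It reads a hypothetical inline CSV through io.StringIO so that it runs without a file on disk; the column names and values are made up, and with a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a file on disk
csv_text = """survived,pclass,sex,age
1,1,female,29
0,3,male,22
1,2,female,35
0,3,male,40
0,1,male,54
1,3,female,18
"""

# 1. Load the data
df = pd.read_csv(io.StringIO(csv_text))

# 2. Check column names
columns = list(df.columns)
print(columns)

# 3. Look for dependencies between features and the target
# Categorical feature: compare the mean of the target per group
rate_by_sex = df.groupby("sex")["survived"].mean()
print(rate_by_sex)

# Numeric features: linear correlation with the target
corr_with_target = df.select_dtypes("number").corr()["survived"]
print(corr_with_target)
```

Group means and correlations are only a first look; the plots from the Seaborn section are usually a better way to see how a feature relates to the target.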

## Scikit-learn
A machine learning library.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Let's do a little bit of processing to create variables that are easier to plot and to feed into a model. Since this notebook is focused on visualization, we're going to do this without much comment.

In [None]:
titanic = titanic.drop(["name", "ticket", "cabin"], axis=1)
titanic["sex"] = titanic.sex.map({"male":0, "female":1})
titanic = pd.get_dummies(titanic, dummy_na=True, columns=['embarked',])
titanic.head(6)
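
To make the encoding step less opaque, here is a minimal sketch of what map and pd.get_dummies with dummy_na=True do, applied to a hypothetical three-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing port of embarkation
toy = pd.DataFrame(
    {"sex": ["male", "female", "male"], "embarked": ["S", np.nan, "C"]}
)

# Replace the two string categories with 0/1 codes
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# One indicator column per category; dummy_na=True adds one for missing values
encoded = pd.get_dummies(toy, dummy_na=True, columns=["embarked"])

print(encoded)
```

Each row has exactly one active indicator among the embarked_* columns, so a missing value becomes its own category instead of a NaN that dropna would later remove.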

In [None]:
titanic.count()

In [None]:
titanic.dropna(inplace=True)
titanic.head(6)

In [None]:
titanic.count()

In [None]:
# extract X - features & y - targets
X = titanic.drop('survived', axis=1)
y = titanic.survived

#### Now it's time to build a model

In [None]:
# initialize a classifier
clf = KNeighborsClassifier()

# train the classifier
clf.fit(X, y)

# calculate predictions
y_predicted = clf.predict(X)

# estimate accuracy (measured on the training data, so it is optimistic)
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))
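
The accuracy above is computed on the same rows the classifier was trained on, so it tends to overstate how well the model generalizes. A common remedy is to hold out part of the data for evaluation. The sketch below illustrates the idea on synthetic data from make_classification; the names X_demo, y_demo, and clf_demo are illustrative and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: 300 rows, 6 numeric features)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold out 25% of the rows; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf_demo = KNeighborsClassifier().fit(X_train, y_train)

# score() returns mean accuracy on the given data
train_acc = clf_demo.score(X_train, y_train)
test_acc = clf_demo.score(X_test, y_test)

print('Train accuracy: {:.3f}, test accuracy: {:.3f}'.format(train_acc, test_acc))
```

The same pattern should carry over to the Titanic features by passing X and y to train_test_split in place of the synthetic arrays.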

In [None]:
# you can also specify some parameters during initialization
clf = KNeighborsClassifier(n_neighbors=10)

clf.fit(X, y)
y_predicted = clf.predict(X)
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))

In [None]:
# you can also predict probabilities of belonging to a particular class
proba = clf.predict_proba(X)
proba_df = pd.DataFrame(proba, index=y.index, columns=[0, 1])
proba_df['true'] = y

fg = sns.FacetGrid(proba_df, hue="true", aspect=3)
fg.map(sns.kdeplot, 0, fill=True)
plt.xlabel('Predicted probability of class 0 (did not survive)')
plt.legend(['survived=0', 'survived=1'])
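
When reading predict_proba output, it helps to remember that its columns follow the order of the fitted classifier's classes_ attribute, so column 0 is the probability of the first class, not necessarily the class of interest. The sketch below shows this on a hypothetical one-feature toy problem; X_toy, y_toy, and knn are illustrative names.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, well-separated toy data: small values are class 0, large are class 1
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# One row per query point, one column per class, in the order of knn.classes_
proba_toy = knn.predict_proba(np.array([[0.05], [1.15]]))

# Look up the column for class 1 instead of assuming its position
col_class1 = list(knn.classes_).index(1)
p_class1 = proba_toy[:, col_class1]

print(knn.classes_)
print(proba_toy)
```

With k-nearest neighbors and uniform weights, each probability is the fraction of the k neighbors that belong to that class, so the values come in steps of 1/k.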