# Supervised Learning

Supervised learning is the branch of Machine Learning (ML) that involves predicting labels, such as 'Survived' or 'Not'. Such models learn from labelled data (a step called "model training"), which here means data that records whether each passenger survived, and then make predictions on unlabelled data.

* You want to build a model that learns patterns in the training set, and
* You then use the model to make predictions on the test set.
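
As a rough illustration of these two steps, here is a minimal sketch on a tiny, made-up dataset (the column names `is_male`, `pclass` and `survived` and all values are hypothetical, not the Titanic data used below). It assumes scikit-learn's `DecisionTreeClassifier`, which this notebook introduces later.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled training data: features plus the known label.
train = pd.DataFrame({
    'is_male': [1, 1, 0, 0, 1, 0],
    'pclass': [3, 1, 1, 3, 2, 2],
    'survived': [0, 0, 1, 1, 0, 1],
})
# Hypothetical unlabelled data: same features, but no label column.
unlabelled = pd.DataFrame({'is_male': [0, 1], 'pclass': [2, 3]})

features = ['is_male', 'pclass']

# Step 1: learn patterns from the labelled data ("model training").
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(train[features], train['survived'])

# Step 2: predict labels for the unlabelled data.
predictions = model.predict(unlabelled[features])
print(predictions)
```

In this toy data the label is fully determined by `is_male`, so a single split is enough; real data is rarely that clean.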

## End-to-End ML Project Steps:
1. Understand the problem;
2. Get the data;
3. Perform an Exploratory Data Analysis (EDA) on your data set;
4. Prepare the data for training;
5. Select a suitable model and train it;
6. Iterate over steps 3-5: do more EDA and build another model;
7. Engineer features: take the features that you already have and combine them, or extract more information from them, which leads to the last step;
8. Get a model that performs better and present your solution.

### The problem
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

***Apply the tools of machine learning to predict which passengers survived the tragedy.***

### Setup and Get the data

In [None]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style
%matplotlib inline
sns.set()

# Import data
df_train = ____
df_test = ____

### EDA

In [None]:
# view first lines of training data
____

* What are all these features? Check out the Kaggle data documentation [here](https://www.kaggle.com/c/titanic/data).

### Important note on terminology:

* The target variable is the one you are trying to predict (Survival);
* Other variables are known as features (or predictor variables).

In [None]:
# view first lines of test data
____

* Use the DataFrame .info() method to check out datatypes, missing values and more (of df_train).

In [None]:
____

* Use the DataFrame .describe() method to check out summary statistics of numeric columns (of df_train).

In [None]:
____

## Visualizations

* Use seaborn to build a bar plot of Titanic survival (your target variable).

In [None]:
____

**Take-away**: In the training set, fewer people survived than didn't. Let's then build a first model that **predicts that nobody survived**.

This is a bad model as we know that people survived. But it gives us a baseline: any model that we build later needs to do better than this one.
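
To make the idea of a baseline concrete, here is a small sketch that scores an "always predict the most common class" model. It uses scikit-learn's `DummyClassifier`, which is not otherwise used in this notebook, and the labels are hypothetical rather than taken from the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of 10 passengers did not survive (0), 4 did (1).
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# This baseline ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class.
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any later model should beat this number to be worth its extra complexity.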

* Create a column 'Survived' for df_test that encodes 'did not survive' for all rows;
* Save 'PassengerId' and 'Survived' columns of df_test to a .csv and submit to Kaggle.

In [None]:
____ = 0
df_test[['PassengerId', 'Survived']].____ #save at './data/predictions/bad_pred.csv'

## EDA on features

* Use seaborn to build a bar plot of the Titanic dataset feature 'Sex' of df_train

In [None]:
____

* Use seaborn to build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Sex'.

In [None]:
____

**Take-away**: Women were more likely to survive than men.

* Use pandas to figure out how many women and how many men survived.

In [None]:
____

* Use pandas to calculate the survival rate for males and females:

In [None]:
num_females = ____
females_survived = ____
female_survival = females_survived / num_females
num_males = ____
males_survived = ____
male_survival = males_survived / num_males
print('female survival = ', female_survival)
print('male survival = ', male_survival)
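
As an aside, the same kind of rate can be computed in one step with `groupby`. The sketch below runs on a small hypothetical DataFrame (the values are made up), so it does not replace the exercise above. Because 'Survived' is coded as 0/1, its mean within each group is the survival rate.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic training set.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = fraction of survivors per group.
survival_rate = toy.groupby('Sex')['Survived'].mean()
print(survival_rate)
```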

Let's now build a second model and predict that all women survived and all men didn't. Once again, this is an unrealistic model, but it will provide a baseline against which to compare future models.

* Create a column 'Survived' for df_test that encodes the above prediction.
* Save 'PassengerId' and 'Survived' columns of df_test to a .csv and submit to Kaggle.

In [None]:
df_test['Survived'] = ____

In [None]:
df_test[['PassengerId', 'Survived']].to_csv('data/predictions/women_survive.csv', index=False)

* Use ``seaborn`` to build bar plots of the Titanic dataset feature ``'Survived'`` split (faceted) over the feature ``'Pclass'``.

In [None]:
# catplot is the current name of the older factorplot function
sns.catplot(x='Survived', col='Pclass', kind='count', data=df_train);

**Take-away**: Passengers who travelled in first class were more likely to survive. On the other hand, passengers travelling in third class were less likely to survive.

* Use `seaborn` to build bar plots of the Titanic dataset feature `'Survived'` split (faceted) over the feature `'Embarked'`.

In [None]:
# catplot is the current name of the older factorplot function
sns.catplot(x='Survived', col='Embarked', kind='count', data=df_train);

**Take-away**: Passengers that embarked in Southampton were less likely to survive.

## EDA with Numeric features

* Use seaborn to plot a histogram of the 'Fare' column of df_train.

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(df_train.Fare, kde=False);

**Take-away**: Most passengers paid less than 100 to travel on the Titanic.

* Use a pandas plotting method to plot the column 'Fare' for each value of 'Survived' on the same plot.

In [None]:
df_train.groupby('Survived').Fare.hist(alpha=0.6)

**Take-away**: It looks as though those that paid more had a higher chance of surviving.

* Use seaborn to plot a histogram of the 'Age' column of df_train. You'll need to drop null values before doing so.

In [None]:
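# Note: dropna() drops every row that has a missing value in any column
df_train_drop = df_train.dropna()
# histplot replaces the deprecated distplot
sns.histplot(df_train_drop.Age, kde=False)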

* Plot a strip plot & a swarm plot of 'Fare' with 'Survived' on the x-axis.

In [None]:
sns.stripplot(x='Survived', y='Fare', data=df_train, alpha=0.3, jitter=True)

In [None]:
sns.swarmplot(x='Survived', y='Fare', data=df_train)

**Take-away**: Fare appears to be positively associated with survival aboard the Titanic.

* Use the DataFrame method .describe() to check out summary statistics of 'Fare' as a function of survival.

In [None]:
df_train.groupby('Survived').Fare.describe()

* Use seaborn to plot a scatter plot of 'Age' against 'Fare', colored by 'Survived'.

In [None]:
sns.lmplot(x='Age', y='Fare', hue='Survived', data=df_train, fit_reg=False, scatter_kws={'alpha':0.5});


**Take-away**: It looks like those who survived either paid quite a bit for their ticket or they were young.

* Use seaborn to create a pairplot of df_train, colored by 'Survived'. A pairplot is a great way to display most of the information that you have already discovered in a single grid of plots.


In [None]:
sns.pairplot(df_train_drop, hue='Survived');

# Build your first ML model

* Below, you will drop the target 'Survived' from the training dataset and create a new DataFrame data that consists of training and test sets combined;
* But first, you'll store the target variable of the training data for safe keeping.
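
The general pattern (store the target, drop it, stack the two sets, and split them apart again later by row position) can be sketched as follows. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins, and `pd.concat` is one reasonable way to combine them, not necessarily the intended solution to the exercise.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test sets.
toy_train = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Age': [22.0, 38.0, None],
})
toy_test = pd.DataFrame({'PassengerId': [4, 5], 'Age': [35.0, None]})

# Keep the target aside before dropping it from the features.
target = toy_train['Survived']

# Stack train (without the target) on top of test, so that any
# preprocessing is applied to both sets in the same way.
combined = pd.concat(
    [toy_train.drop(columns=['Survived']), toy_test], ignore_index=True)

# Later, recover the two sets by row position.
n_train = len(toy_train)
back_train = combined.iloc[:n_train]
back_test = combined.iloc[n_train:]
print(combined)
```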

In [None]:
# Store target variable of training data in a safe place
survived_train = ____

data = ____

* Check out your new DataFrame data using the info() method.

In [None]:
____

In [None]:
# Impute missing numerical variables, using the median
data['Age'] = ____
data['Fare'] = ____

data.info()

In [None]:
# change 'male' and 'female' to numbers using pandas function get_dummies
data = ____
data.head()
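
For reference, here is how median imputation and dummy encoding behave on a small hypothetical DataFrame (made-up values, not the `data` frame above). The `drop_first=True` argument is an assumption of this sketch: it keeps a single 'Sex_male' column instead of two redundant columns.

```python
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, None, 30.0, 40.0],
})

# Replace missing ages with the median of the observed ages (30.0 here).
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Encode 'Sex' as a numeric indicator column named 'Sex_male'.
toy = pd.get_dummies(toy, columns=['Sex'], drop_first=True)
print(toy)
```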

* Select the columns ['Sex_male', 'Fare', 'Age','Pclass', 'SibSp'] from your DataFrame to build your first machine learning model:

In [None]:
# Select the columns for ml model
data = ____
data.head()

## Let's Build a decision tree classifier

What is a decision tree classifier? It is a tree that allows you to classify data points (aka predict target variables) based on feature variables.

* You first fit such a model to your training data, which means deciding (based on the training data) which feature and threshold to split on at each branching point in the tree: e.g., that the first branch is on 'Male' or not and that 'Male' results in a prediction of 'Dead'.
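
To see what "deciding the splits" looks like in practice, the sketch below fits a one-level tree to a tiny hypothetical dataset (made-up values) and prints the learned rule with scikit-learn's `export_text`. It is an illustration only, separate from the model you will build below.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data in which sex fully determines survival.
toy = pd.DataFrame({
    'Sex_male': [1, 1, 1, 0, 0, 0],
    'Fare': [7.0, 8.0, 30.0, 7.5, 50.0, 80.0],
    'Survived': [0, 0, 0, 1, 1, 1],
})
feature_names = ['Sex_male', 'Fare']

# A tree of depth 1 makes exactly one decision.
clf_toy = DecisionTreeClassifier(max_depth=1, random_state=0)
clf_toy.fit(toy[feature_names], toy['Survived'])

# Print the learned branching rule as plain text.
rules = export_text(clf_toy, feature_names=feature_names)
print(rules)

# Predict for one male and one female passenger with the same fare.
new_passengers = pd.DataFrame({'Sex_male': [1, 0], 'Fare': [10.0, 10.0]})
toy_pred = clf_toy.predict(new_passengers)
```

Here the tree branches on 'Sex_male' because that split separates the two classes perfectly, while no single 'Fare' threshold does.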


* Before fitting a model to your data, split it back into training and test sets:

In [None]:
data_train = data.iloc[:891]
data_test = data.iloc[891:]

* You'll use scikit-learn, which works with array-like data, so transform your DataFrames into arrays:

In [None]:
X = ____
y = ____
test = ____

* Now you get to build your decision tree classifier! First create such a model with max_depth=3 and then fit it to your data:

In [None]:
# Instantiate model and fit to data
clf = ____
____

* Make predictions on your test set, create a new column 'Survived' and store your predictions in it. Save 'PassengerId' and 'Survived' columns of df_test to a .csv and submit to Kaggle.

In [None]:
# Make prediction and store in 'Survived' column of df_test
y_pred = ____
df_test['Survived'] = y_pred
df_test[['PassengerId', 'Survived']].to_csv('data/predictions/1st_dec_tree.csv', index=False)

## Why would you choose max_depth=3 ?

The depth of the tree is known as a hyperparameter, which means a parameter we need to decide before we fit the model to the data. If we choose a larger max_depth, we'll get a more complex decision boundary.

* If our decision boundary is too complex we can overfit to the data, which means that our model will be describing noise as well as signal.

* If our max_depth is too small, we may be underfitting the data, meaning that our model doesn't contain enough of the signal.

**How do we tell whether we're overfitting or underfitting?** Note: the tension between the two is known as the bias-variance trade-off; we won't go into details on that here.

One way is to hold out a test set from our training data. We can then fit the model to our training data, make predictions on our test set and see how well our prediction does on the test set.

* You'll now do this: split your original training data into training and test sets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

* Iterate over values of max_depth ranging from 1 to 8 and plot the accuracy of the models on training and test sets:

In [None]:
# Setup arrays to store train and test accuracies
dep = np.arange(1, 9)
train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))

# Loop over different values of max_depth (k)
for i, k in enumerate(dep):
    # Setup a Decision Tree Classifier
    clf = tree.DecisionTreeClassifier(max_depth=k)

    # Fit the classifier to the training data
    clf.fit(X_train, y_train)

    #Compute accuracy on the training set
    train_accuracy[i] = clf.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = clf.score(X_test, y_test)

# Generate plot
plt.title('clf: Varying depth of tree')
plt.plot(dep, test_accuracy, label = 'Testing Accuracy')
plt.plot(dep, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Depth of tree')
plt.ylabel('Accuracy')
plt.show()
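
One way to read such a curve numerically is to pick the depth with the highest held-out accuracy and to look at the gap between training and held-out accuracy, which tends to widen as the tree overfits. The sketch below does this on synthetic data from scikit-learn's `make_classification` (a stand-in, not the Titanic data); the names `X_syn`, `y_syn`, `depths` and `best_depth` are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic data), with some label noise.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42, stratify=y_syn)

depths = np.arange(1, 9)  # depths 1 through 8
train_acc = np.empty(len(depths))
test_acc = np.empty(len(depths))
for i, depth in enumerate(depths):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_acc[i] = model.score(X_tr, y_tr)
    test_acc[i] = model.score(X_te, y_te)

# Depth with the highest held-out accuracy (the first one in case of ties).
best_depth = depths[np.argmax(test_acc)]
# Gap between training and held-out accuracy at each depth.
gap = train_acc - test_acc
print('best depth:', best_depth)
print('train-test gap per depth:', np.round(gap, 2))
```

A single hold-out split gives a noisy estimate, so the chosen depth can change with a different `random_state`.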