---
# Crash Course Python for Data Science — Predictive Modelling
---
# 02 - Classification modelling
---
## STOP! BEFORE GOING ANY FURTHER...  

Remember, these exercises are open book, open neighbour, open everything! Try to do them on your own before looking at the sample solutions.

---

## Exercise

Run this cell to import the libraries used in this exercise:

In [23]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression       
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Evaluation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score 

# Set plot preference
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

# time
import time

print('Libraries imported.')

Libraries imported.


Then, train a [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), [Decision Tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), or [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model. Use any features and parameters you want. 

Try to get better than 78.0% accuracy on the test set! (This is not required, but encouraged.)

Do refer to the lecture notebook — but try not to copy-paste.

> You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons. —*[Learn Python the Hard Way](https://learnpythonthehardway.org/book/intro.html)*

After this, you may want to try [Kaggle's Titanic challenge](https://www.kaggle.com/c/titanic)!

Again, I've written an example solution covering everything, from exploratory data analysis (cleaning data and all that) to model evaluation, so that you can see the whole process. I strongly encourage you to come up with your own solution before looking at mine!

In [24]:
# Run this to split your data into train and test

train, test = train_test_split(sns.load_dataset('titanic').drop(columns=['alive']), random_state=0)
target = 'survived'

## 1. Exploratory Data Analysis


In [25]:
#@title Double click here for a sample solution
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 668 entries, 105 to 684
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     668 non-null    int64   
 1   pclass       668 non-null    int64   
 2   sex          668 non-null    object  
 3   age          535 non-null    float64 
 4   sibsp        668 non-null    int64   
 5   parch        668 non-null    int64   
 6   fare         668 non-null    float64 
 7   embarked     666 non-null    object  
 8   class        668 non-null    category
 9   who          668 non-null    object  
 10  adult_male   668 non-null    bool    
 11  deck         156 non-null    category
 12  embark_town  666 non-null    object  
 13  alone        668 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(4)
memory usage: 60.5+ KB


#### Feature description

- `survived`: Survival (0 = no, 1 = yes)
- `pclass`: Ticket class (1, 2 or 3)
- `sex`: Sex
- `age`: Age in years
- `sibsp`: # of siblings / spouses aboard the Titanic
- `parch`: # of parents / children aboard the Titanic
- `fare`: Passenger fare
- `class`: Ticket class as a category (same information as `pclass`)
- `deck`: Deck
- `embarked`: Port of embarkation
- `embark_town`: Town of embarkation
- `alone`: Whether the person was travelling alone
- `who`: Man, woman or child
- `adult_male`: Whether the person was an adult male

In [26]:
# Your Code Goes Here

### `age` vs `sex`

In [27]:
# Your code goes here

It seems pretty clear that `sex` is related to survival probability.
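
One quick way to check a claim like this is to compare survival rates per group. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the real `train` data) so it runs on its own; the same `groupby` call should work on `train`.

```python
import pandas as pd

# Hypothetical toy rows standing in for `train` (assumption, not real data)
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'survived': [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors in each group
survival_by_sex = toy.groupby('sex')['survived'].mean()
print(survival_by_sex)
```

On the real data, `sns.barplot(x='sex', y='survived', data=train)` would show the same comparison as a plot.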

### `embarked`, `pclass` and `sex`

In [28]:
# your code goes here

`embarked` seems to be correlated with survival depending on sex, as does `pclass`.

In [29]:
# your code goes here


`pclass` appears to contribute to survival.

In [30]:
# your code goes here

The assumption that `pclass` 1 contributes to survival appears to be true. There also seems to be a high probability of persons in `pclass` 3 not surviving.
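
A row-normalised cross-tabulation is one possible way to put numbers on this. The frame below is a hypothetical toy sample (an assumption for illustration), so the printed rates say nothing about the real Titanic data.

```python
import pandas as pd

# Hypothetical toy rows standing in for `train` (assumption, not real data)
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'survived': [1, 1, 0, 0, 0, 1],
})

# normalize='index' makes every row sum to 1, so each cell is the
# share of that class that did (1) or did not (0) survive
rates = pd.crosstab(toy['pclass'], toy['survived'], normalize='index')
print(rates)
```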

In [31]:
# Your code goes here

It seems that younger people travelling alone have a higher probability of survival, whereas survival among those not travelling alone is more evenly distributed across age groups.
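
To compare age groups, it can help to bin `age` first. This is only a sketch: the toy rows, the bin edges and the labels (`child`, `adult`, `senior`) are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for `train` (assumption, not real data)
toy = pd.DataFrame({
    'age': [5, 25, 40, 8, 30, 60],
    'alone': [False, True, True, False, True, False],
    'survived': [1, 1, 0, 1, 0, 0],
})

# Bin ages into coarse groups; edges and labels are arbitrary choices
toy['age_group'] = pd.cut(toy['age'], bins=[0, 18, 40, 80],
                          labels=['child', 'adult', 'senior'])

# Survival rate for every (alone, age_group) combination that occurs
rates = toy.groupby(['alone', 'age_group'], observed=True)['survived'].mean()
print(rates)
```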

## 2. Data cleaning

#### Drop `class`, `adult_male` and `who` as they are redundant. Also drop `sibsp` and `parch`, as `alone` already accounts for travelling alone or with family/friends.
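
A minimal sketch of the drop, shown on a hypothetical two-row frame (`toy` is made up so the snippet runs on its own). The same call would need to be applied to both `train` and `test`.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the Titanic data
toy = pd.DataFrame({
    'survived': [0, 1], 'pclass': [3, 1], 'class': ['Third', 'First'],
    'who': ['man', 'woman'], 'adult_male': [True, False],
    'sibsp': [1, 0], 'parch': [0, 0], 'alone': [False, True],
})

# Columns named in the heading above
redundant = ['class', 'adult_male', 'who', 'sibsp', 'parch']
cleaned = toy.drop(columns=redundant)
print(list(cleaned.columns))
```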

In [32]:
# Your code goes here

### Missing values `deck`
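
According to `train.info()` above, only 156 of 668 rows have a `deck` value. One option, sketched here, is to keep the column and mark missing entries with a placeholder. The placeholder `'0'` is an assumption inferred from the `deck_0` column mentioned later on; `deck` is a categorical column, so the placeholder has to be registered as a category before filling.

```python
import pandas as pd

# Hypothetical toy column with the same dtype as `deck` (category)
toy = pd.DataFrame({'deck': pd.Categorical(['C', None, 'E', None])})

# Share of missing values in the column
missing_share = toy['deck'].isna().mean()

# A categorical only accepts known categories, so add '0' first, then fill
toy['deck'] = toy['deck'].cat.add_categories('0').fillna('0')
print(missing_share, toy['deck'].tolist())
```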

In [33]:
# Your code goes here 



### Missing values `age`
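
A common choice for `age` is to fill the gaps with the median. This is a sketch of that idea, not necessarily what the sample solution does. The toy frames are hypothetical; the point is that the median is computed on the training data only and then reused for the test data.

```python
import pandas as pd

# Hypothetical toy frames standing in for `train` and `test`
toy_train = pd.DataFrame({'age': [22.0, None, 38.0, 30.0]})
toy_test = pd.DataFrame({'age': [None, 50.0]})

# Learn the fill value from the training data only
age_median = toy_train['age'].median()

toy_train['age'] = toy_train['age'].fillna(age_median)
toy_test['age'] = toy_test['age'].fillna(age_median)
print(age_median)
```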

In [34]:
# Your code goes here



### Missing values `embarked` and `embark_town`
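
With only two missing rows in each column (666 of 668 non-null), filling with the most frequent value is one simple option. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`
toy = pd.DataFrame({
    'embarked': ['S', 'C', None, 'S'],
    'embark_town': ['Southampton', 'Cherbourg', None, 'Southampton'],
})

for col in ['embarked', 'embark_town']:
    # mode() ignores missing values; [0] picks the most frequent entry
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```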

In [35]:
# Your code goes here

## 3. Preparing data for modelling

In [36]:
# Your code goes here


#### `fare` from `float64` to `int64`
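
A sketch of the conversion on hypothetical values. Rounding before the cast is an assumption on my part; a plain `astype('int64')` would truncate instead. The cast would fail on missing values, but `fare` has none according to `train.info()`.

```python
import pandas as pd

# Hypothetical fares standing in for `train['fare']`
toy = pd.DataFrame({'fare': [7.25, 71.2833, 8.05]})

# Round to the nearest whole number, then cast to integers
toy['fare'] = toy['fare'].round().astype('int64')
print(toy['fare'].tolist())
```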

In [37]:
# Your code goes here


#### `sex` to numeric
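
One way to encode `sex` is an explicit mapping. The choice of `0` for male and `1` for female is an assumption; any consistent coding would do.

```python
import pandas as pd

# Hypothetical toy column standing in for `train['sex']`
toy = pd.DataFrame({'sex': ['male', 'female', 'female']})

# Explicit mapping; values missing from the dict would become NaN
toy['sex'] = toy['sex'].map({'male': 0, 'female': 1})

# Sanity check that every value was mapped
all_mapped = bool(toy['sex'].notna().all())
print(toy['sex'].tolist(), all_mapped)
```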


In [38]:
# Your code goes here



#### `embarked` to numeric

In [39]:
# Your code goes here


#### `alone` from boolean to numeric
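
Booleans cast directly to integers, with `True` becoming `1` and `False` becoming `0`. A sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy column standing in for `train['alone']`
toy = pd.DataFrame({'alone': [True, False, True]})

# True -> 1, False -> 0
toy['alone'] = toy['alone'].astype('int64')
print(toy['alone'].tolist())
```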

In [40]:
# Your code goes here



#### Getting dummies for categorical `deck` and `embark_town`
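
`pd.get_dummies` creates one 0/1 column per category. The sketch below uses a hypothetical toy frame in which missing decks were already replaced by the placeholder `'0'` (an assumption), which is where a `deck_0` column would come from.

```python
import pandas as pd

# Hypothetical toy frame; '0' marks a missing deck (assumption)
toy = pd.DataFrame({
    'deck': ['C', '0', 'E'],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown'],
    'fare': [7, 71, 8],
})

# One indicator column per category; the original columns are replaced
encoded = pd.get_dummies(toy, columns=['deck', 'embark_town'], dtype=int)
print(list(encoded.columns))
```

If a category appears in `train` but not in `test`, the two frames end up with different columns. Reindexing the encoded test frame with `reindex(columns=..., fill_value=0)` against the training columns is one way to align them.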

In [41]:
# Your code goes here



### Multicollinearity

In [42]:
# Your code goes here



Areas of multicollinearity:

`embark_town_Queenstown` and `embark_town_Cherbourg` show a strong positive correlation with `embarked`, whilst `embark_town_Southampton` is strongly negatively correlated with `embarked`. This suggests that including only `embark_town` as a feature should be enough to control for the influence of the place of embarkation.

`deck_0`, which holds our `NaN` values, and `pclass` are perfectly positively correlated with each other, and both are negatively correlated with `fare`. This could mean different things. Perhaps it is because deck is correlated with ticket class and passengers without a recorded deck were all from a certain class. Thus, dropping `deck_0` should be OK, as it is already taken into account by `pclass`.

As expected, `sex` and `survived` also show a strong positive correlation.

Unsurprisingly, `fare` and `pclass` are strongly negatively correlated, so one will be dropped.
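
Observations like these can be read off a correlation matrix. The sketch below lists strongly correlated pairs on a hypothetical toy frame; the threshold of `0.8` is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the prepared `train` data
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'fare': [80, 70, 30, 8, 7, 9],
    'age': [40, 22, 35, 28, 19, 50],
})

corr = toy.corr()

# Keep only the upper triangle so each pair of features appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the chosen threshold
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

For a visual check on the real data, `sns.heatmap(train.corr(numeric_only=True), annot=True)` draws the same matrix as a heatmap.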

### Feature scaling

Finally, predictive features `X` and the target feature `y` can be separated, and `X` will be scaled with `StandardScaler` from `sklearn`.
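
A sketch of the split and scaling on hypothetical toy frames. The scaler is fitted on the training features only and then reused on the test features, so no information from the test set leaks into the model.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the prepared `train` and `test`
toy_train = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'pclass': [1, 2, 3],
                          'survived': [1, 0, 0]})
toy_test = pd.DataFrame({'age': [30.0], 'pclass': [3], 'survived': [1]})
target = 'survived'

# Separate predictive features X from the target y
X_train, y_train = toy_train.drop(columns=[target]), toy_train[target]
X_test, y_test = toy_test.drop(columns=[target]), toy_test[target]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on train
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
print(X_train_scaled.mean(axis=0), X_test_scaled)
```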

In [43]:
# Your code goes here 



## 4. Building a Machine Learning Model

### Random Forest
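
A minimal sketch of fitting and evaluating a random forest. It uses synthetic data (an assumption, so that the snippet runs on its own) in which only the first feature matters. The parameter values are illustrative, not tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled Titanic features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # the label depends only on the first feature

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on held-out data and the relative importance of each feature
accuracy = accuracy_score(y_test, model.predict(X_test))
importances = model.feature_importances_
print(accuracy, importances)
```

On the real data, `classification_report(y_test, model.predict(X_test))` gives precision and recall alongside accuracy.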

In [44]:
# Your code goes here



## 5. Conclusion


As expected, `sex` and `age` are by far the most important features. The model reaches an accuracy of 79.37%, which could be improved with some hyper-parameter tuning.
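
A cross-validated grid search is one possible way to do such tuning. This sketch runs on synthetic data from `make_classification`, and the parameter grid is an arbitrary assumption, so the result says nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# Small illustrative grid; real searches usually cover more values
param_grid = {'n_estimators': [20, 50], 'max_depth': [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X, y)

# Best parameter combination and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)
```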
