## Feature Engineering and Machine Learning

In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style
%matplotlib inline
sns.set()

# Import data
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

# Store target variable of training data in a safe place
survived_train = df_train.Survived

# Concatenate training and test sets
data = pd.concat([df_train.drop(['Survived'], axis=1), df_test])

# View head
data.head()

## Why feature engineer at all?

To extract more information from your data. For example, check out the 'Name' column:

In [None]:
# View head of 'Name' column
____

Notice that this column contains strings (text) that include a title such as 'Mr', 'Master' and 'Dona'. You can use regular expressions to extract the title (to learn more about regular expressions, check out my write-up of our last [FB Live code along event](https://www.datacamp.com/community/tutorials/web-scraping-python-nlp)):

In [None]:
# Extract Title from Name, store in column and plot barplot
# Raw string so that '\.' is read as a literal dot rather than an invalid string escape
data['Title'] = data.Name.apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);
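To see what the regular expression above actually captures, here is a minimal sketch that runs the same pattern on a few sample names. The names and the helper `extract_title` are hypothetical illustrations, and the `None` fallback is an addition that the notebook cell itself does not have.

```python
import re

# Hypothetical sample names in the "Surname, Title. Given names" layout
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
]

# A space, then a capitalised word (captured), then a literal dot
title_pattern = re.compile(r' ([A-Z][a-z]+)\.')

def extract_title(name):
    """Return the captured title, or None if the name has no 'Word.' part."""
    match = title_pattern.search(name)
    return match.group(1) if match else None

titles = [extract_title(n) for n in names]
print(titles)
```

The capture group `([A-Z][a-z]+)` keeps only the word itself, so the leading space and trailing dot are dropped from the result.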

* There are several titles, and it makes sense to group them into fewer buckets:

In [None]:
data['Title'] = data['Title'].replace({'Mlle':'Miss', 'Mme':'Mrs', 'Ms':'Miss'})
data['Title'] = data['Title'].replace(['Don', 'Dona', 'Rev', 'Dr',
                                            'Major', 'Lady', 'Sir', 'Col', 'Capt', 'Countess', 'Jonkheer'],'Special')
sns.countplot(x='Title', data=data);
plt.xticks(rotation=45);

* Check out your data again and make sure that we have a 'Title' column:

In [None]:
# View head of data
____

### Being cabinless may be important

* There are several NaNs (missing values) in the 'Cabin' column. It is reasonable to presume that passengers with a NaN here didn't have a cabin, which may tell us something about survival, so now create a new column that encodes this information:

In [None]:
# Did they have a Cabin?
____

# View head of data
data.head()

* Drop columns that contain no more useful information (or that we're not sure what to do with): `['Cabin', 'Name', 'PassengerId', 'Ticket']`:

In [None]:
# Drop columns and view head
____
data.head()

### Dealing with missing values

* Figure out if there are any missing values left:

In [None]:
____

* Impute missing values:

In [None]:
# Impute missing values for Age, Fare, Embarked
data.Age = ____
data.Fare = ____
data['Embarked'] = data['Embarked'].fillna('S')
data.info()

In [None]:
data.head()

### Bin numerical data

* Use the `pandas` function `qcut` to bin your numerical data:

In [None]:
# Binning numerical columns
data['CatAge'] = ____
data['CatFare']= ____
data.head()
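As a rough illustration of what `qcut` does, the sketch below bins a small made-up series of ages into quantile-based groups. The values, the choice of `q=4` and the name `age_bins` are assumptions for this example only, not the settings used in the exercise.

```python
import pandas as pd

# Hypothetical ages, not taken from the Titanic data
ages = pd.Series([2, 9, 18, 22, 27, 31, 40, 58])

# q=4 asks for quartiles; labels=False returns integer bin codes
# instead of interval labels
age_bins = pd.qcut(ages, q=4, labels=False)
print(age_bins.tolist())
```

Because the cut points are quantiles, each bin ends up with roughly the same number of rows, unlike `pd.cut`, which uses equal-width intervals.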

* You can now safely drop 'Age' and 'Fare' columns:

In [None]:
data = ____
data.head()

## Create a new column: number of members in family onboard

In [None]:
# Create column of number of Family members onboard
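# Use bracket notation: attribute assignment (data.Fam_Size = ...) would not create a new column
data['Fam_Size'] = ____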
data = data.drop(['SibSp','Parch'], axis=1)
data.head()

## Transform all variables into numerical variables

In [None]:
# Transform into binary variables
data_dum = ____
data_dum.head()
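One common way to turn categorical columns into numbers is one-hot (dummy) encoding. The sketch below applies `pd.get_dummies` to a tiny made-up frame; the frame, the column name `Has_Cabin` and the use of `drop_first=True` are assumptions for illustration and may differ from what you choose in the exercise.

```python
import pandas as pd

# Hypothetical toy frame with one boolean and two text columns
toy = pd.DataFrame({
    "Has_Cabin": [True, False, True],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Text columns are expanded into indicator columns; drop_first=True
# drops the first category of each to avoid a redundant column
toy_dum = pd.get_dummies(toy, drop_first=True)
print(toy_dum.columns.tolist())
```

Columns that are already numeric or boolean are passed through unchanged, so only the text columns are expanded.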

## Building models with our new dataset!

* As before, first you'll split your `data` back into training and test sets; then you'll transform them into arrays:

In [None]:
# Split into train and test
data_train = data_dum.iloc[:891]
data_test = data_dum.iloc[891:]

# Transform into arrays for scikit-learn
X = data_train.values
test = data_test.values
y = survived_train.values

You're now going to build a decision tree on your brand new feature-engineered dataset. To choose your hyperparameter `max_depth`, you'll use a variation on the train/test split called cross-validation.

<img src="img/cv.png" width="400">

We begin by splitting the dataset into 5 groups or *folds*. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest. Next we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest. Then similarly with the third, fourth and fifth.

As a result we get five values of accuracy, from which we can compute statistics of interest, such as the median and/or mean and 95% confidence intervals.

We do this for each value of each hyperparameter that we're tuning and choose the set of hyperparameters that performs the best. This is called _grid search_.
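The fold-by-fold procedure described above can be sketched in plain Python. This is only an illustration of how the indices are split; the helper `k_fold_indices` is hypothetical, and in practice scikit-learn's `GridSearchCV` handles the splitting for you through its `cv` argument.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs, holding out one contiguous fold at a time."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Ten hypothetical samples split into five folds
folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("held out:", test_idx, "fit on:", train_idx)
```

Every sample is held out exactly once, so each of the five accuracy values is computed on data the model did not see during fitting.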

* Let's get it! In the following, you'll use cross validation and grid search to choose the best `max_depth` for your new feature engineered dataset:

In [None]:
# Setup the hyperparameter grid
dep = ____
param_grid = ____

# Instantiate a decision tree classifier: clf
clf = ____

# Instantiate the GridSearchCV object: clf_cv
clf_cv = ____

# Fit it to the data
____

# Print the tuned parameter and score
print("Tuned Decision Tree Parameters: {}".format(clf_cv.best_params_))
print("Best score is {}".format(clf_cv.best_score_))


* Make predictions on your test set, create a new column "Survived" and store your predictions in it. Save 'PassengerId' and 'Survived' columns of `df_test` to a .csv and submit to Kaggle.

In [None]:
Y_pred = ____
df_test['Survived'] = Y_pred
df_test[['PassengerId', 'Survived']].to_csv('data/predictions/dec_tree_feat_eng.csv', index=False)

* What was the accuracy?

_Accuracy_ was 78.9%; an improvement, but by how much? By 1%!

## Next steps

See if you can do some more feature engineering and try some new models out to improve on this score.