# Lecture 8 - Project Example

In this lecture we are going to discuss and analyze the structure of a 
good-quality analysis for the final project of this course.

## Titanic, Learning from the disaster

---
__Suggestions__:
At this stage, you should answer the following questions (not necessarily all of them):

1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline learning)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions made so far and verify them.

---

The main goal of this project is to solve the following task:
    
> __Problem Statement__: Given information about a passenger who was aboard the Titanic, decide whether or not
  that passenger survived the shipwreck.
  
The solution provided in this work is developed as a stand-alone solution; therefore, the proposed approach has no requirements
regarding compatibility or integration with other models.

The target is a known binary label (survived or not), so the natural strategy is a classification algorithm; thus we are going to solve a supervised 
learning problem.

Since we decided to tackle the problem as a classification task, the performance can be measured with any classification-related measure,
e.g., accuracy, confusion matrix, recall... (initially you can be vague; you may want to return later for a further description).
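
To make these measures concrete, here is a minimal sketch that derives accuracy, precision, recall, and F1 from the four entries of a binary confusion matrix. The label vectors `y_true` and `y_pred` are made up for illustration and are not taken from the Titanic data.

```python
# Hypothetical ground truth and predictions (1 = survived, 0 = did not survive).
y_true = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))

# Entries of the 2x2 confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)    # fraction of correct predictions
precision = tp / (tp + fp)           # how many predicted survivors really survived
recall = tp / (tp + fn)              # how many actual survivors were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy alone can be misleading when the classes are unbalanced, which is one reason to report precision and recall next to it.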


## Get the Data

---
__Suggestions__:
In this section, you should address the following point (again, you are not required to address every single point):

0. Introduce the reader to the data. Provide a first qualitative analysis.
1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the
data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!).

---
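
Point 11 above (putting a test set aside before any exploration) could look like the following minimal sketch. The `toy` DataFrame is a made-up stand-in for the real table, and stratifying on the target is an assumption chosen here so that both splits keep the same survival ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic table: 20 passengers, 8 survivors.
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 2, 3] * 4,
    "Survived": [1, 0, 0, 1, 0] * 4,
})

# Hold out 25% of the rows; stratify keeps the class proportions in both splits.
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy["Survived"], random_state=42
)

print(train_df.shape, test_df.shape)
print(train_df["Survived"].mean(), test_df["Survived"].mean())
```

From this point on, only `train_df` would be used for exploration and model selection.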
The dataset contains information such as gender, ticket class, age, etc. about the Titanic passengers.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

from copy import copy


In [None]:
df = pd.read_csv('data/train.csv')
df.head()

Here is a more detailed description about the meaning of each attribute and its values:

* PassengerID - A column added by Kaggle to identify each row and make submissions easier
* Survived - Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes) __(*)__
* Pclass - The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)
* Sex - The passenger's sex
* Age - The passenger's age in years
* SibSp - The number of siblings or spouses the passenger had aboard the Titanic
* Parch - The number of parents or children the passenger had aboard the Titanic
* Ticket - The passenger's ticket number
* Fare - The fare the passenger paid
* Cabin - The passenger's cabin number
* Embarked - The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

---

__(*)__ this is the attribute containing the class we aim to predict

---

## Explore the Data

---
__Suggestions__:
1. Create a copy of the data for exploration (sampling it down to a manageable size
    if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics like:
    * Name, Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    * Percentage of missing values
    * Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    * Type of distribution, e.g., Gaussian, uniform, logarithmic, etc.
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to “Get the Data”).
10. Document what you have learned.
---

Since the training set has a rather small size we can explore the data considering the entire dataset.
The dataset dimensions are:

In [None]:
print(df.shape)

With the following instruction we check whether or not there are missing values.

In [None]:
df.info()

It should be noted that some attributes have missing values.

In [None]:
pd.DataFrame(df.isnull().sum()/df.shape[0]).T

As we can see, ``Age`` and ``Cabin`` (and, to a much lesser extent, ``Embarked``) present missing values. We will deal with this problem later.
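
As a side note, the fraction of missing values per column can also be obtained with `isnull().mean()`, since the mean of a boolean column is the proportion of `True` entries. The sketch below uses a tiny made-up DataFrame (`toy`), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row table with some missing entries.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 8.05, 13.0],
})

# Fraction of missing values per column, most incomplete column first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```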

Let's take a quick look at the numerical attributes:

In [None]:
df.describe()


---

The main goal of this stage is to understand which variables are important for predicting whether or not 
a person will survive the shipwreck.

Three attributes that may be very promising are the sex, the age, and the ticket class of a passenger. 

In fact, in this kind of scenario, women and children are generally given priority.
To verify this assumption for the ``Sex`` attribute, we can issue the following instructions.

In [None]:
df_sex = df.groupby(['Sex', 'Survived'])[['PassengerId']].count()
df_sex = df_sex.rename({'PassengerId': 'count'}, axis=1)
df_sex['prob'] = df_sex['count'] / df.groupby('Sex').count()['Survived']
df_sex

Also, one may expect that passengers with a higher ticket class had more privileges than lower-class passengers.

In [None]:
df_pclass = df.groupby(['Pclass', 'Survived'])[['PassengerId']].count()
df_pclass = df_pclass.rename({'PassengerId': 'count'}, axis=1)
df_pclass['prob'] = df_pclass['count'] / df.groupby('Pclass').count()['Survived']
df_pclass
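
The conditional survival frequencies computed above can also be obtained more compactly. The following sketch, on a small made-up table (`toy`), shows two equivalent options: `pd.crosstab` with row normalization, and the mean of the 0/1 target within each group.

```python
import pandas as pd

# Hypothetical data: 4 men (1 survivor) and 4 women (3 survivors).
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Each row sums to 1: P(Survived = s | Sex).
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)

# The mean of a 0/1 column within a group is the survival rate of that group.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

The same two calls work unchanged with ``Pclass`` in place of ``Sex``.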

Now we conduct the same kind of analysis for the age of the passengers.

In [None]:
df['Age'].describe()

``Age`` is a real-valued attribute. Therefore, it is better to plot the age distribution separately for each value of 
the ``Survived`` class.

In [None]:
survived = df[df["Survived"] == 1]
died = df[df["Survived"] == 0]
fig, axes = plt.subplots(1,2)
axes[0].hist(survived["Age"].dropna(), alpha=0.5, color='red', bins=30)
axes[1].hist(died["Age"].dropna() ,alpha=0.5,color='blue',bins=30)
axes[0].set_title('Survived')
axes[1].set_title('Not Survived')
axes[0].set_xlabel('Age')
axes[1].set_xlabel('Age')
plt.show()

It appears that only a small fraction of the passengers aged between 0 and 20 did not survive the shipwreck.

In order to gather some further insight we plot the same histograms, but this time we distinguish between 
male and female passengers.

In [None]:
survived = df[df["Survived"] == 1]
died = df[df["Survived"] == 0]
fig, axes = plt.subplots(1,2)
axes[0].hist(survived.loc[survived["Sex"] == "male"]["Age"].dropna(), label='Male', alpha=0.5, bins=30)
axes[0].hist(survived.loc[survived["Sex"] == "female"]["Age"].dropna(), label='Female', alpha=0.5, bins=30)

axes[1].hist(died.loc[died["Sex"] == "male"]["Age"].dropna(), label='Male', alpha=0.5, bins=30)
axes[1].hist(died.loc[died["Sex"] == "female"]["Age"].dropna(), label='Female', alpha=0.5, bins=30)

axes[0].set_title('Survived')
axes[1].set_title('Not Survived')

axes[0].set_xlabel('Age')
axes[1].set_xlabel('Age')

axes[0].legend()
axes[1].legend()
plt.show()

Since most male passengers between 20 and 40 had bad luck, another interesting attribute to focus on is
``SibSp`` (the number of siblings or spouses the passenger had aboard the Titanic), since it is likely
that men of that age were travelling with some relatives.



In [None]:
df_sib = pd.pivot_table(df, index='SibSp', values='Survived')

fig, ax = plt.subplots(1,1)

ax.plot(df_sib, marker='o')

ax.set_xlabel("SibSp")
ax.set_title('Prob. of Survival')

As partially unveiled by the previous analysis, as the number of siblings or spouses increases, the probability of
survival tends to decrease.

Finally, since there may be some hidden patterns that we have missed, we draw a pair plot (scatter-plot matrix) of a few numerical attributes.

In [None]:
_ = sns.pairplot(df,
                 x_vars = ['Parch', 'Fare', 'SibSp'],
                 y_vars = ['Parch', 'Fare', 'SibSp'],
                 hue='Survived', 
                 diag_kind='kde', 
                 diag_kws={'alpha':0.5})


### Recap
We can immediately see that females survived in much higher proportions than males did.

Analogously, the probability of survival increases with the ticket class: first-class passengers fared better than third-class ones.

Another important feature is the age of the passengers. 
In general, youngsters have a higher probability of survival.

Moreover, there seems to be some correlation between the probability of survival and the number
of relatives a person is travelling with.

This first stage of evaluation also highlights some aspects of the data that need to be considered in the 
preprocessing stage.

For instance:
* There are missing values in the ``Age`` and ``Cabin`` columns.
* The ``Age`` column is a real-valued attribute. We should discretize it.
* There are several columns with categorical attributes; we may want to represent them with a one-hot encoding.
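
As a quick illustration of the last two points, here is a minimal pandas sketch that discretizes a numeric column and one-hot encodes the result. The bin edges and labels are hypothetical choices made for this example only.

```python
import pandas as pd

ages = pd.Series([2, 15, 30, 47, 80], name="Age")

# Hypothetical bin edges and labels; intervals are right-closed, e.g. (0, 12].
age_group = pd.cut(
    ages,
    bins=[0, 12, 18, 40, 60, 120],
    labels=["child", "teen", "adult", "middle", "senior"],
)

# One indicator column per age group, exactly one of them set in each row.
one_hot = pd.get_dummies(age_group, prefix="Age")
print(one_hot)
```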

---
## Prepare the Data
---
__Suggestions__

1. Write functions for all data transformations you apply, for five reasons:
    * So you can easily prepare the data the next time you get a fresh dataset
    * So you can apply these transformations in future projects
    * To clean and prepare the test set
    * To clean and prepare new data instances once your solution is live
    * To make it easy to treat your preparation choices as hyperparameters
2. Data cleaning:
    * Fix or remove outliers (optional).
    * Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
3. Feature selection (optional):
    * Drop the attributes that provide no useful information for the task.
4. Feature engineering, where appropriate:
    * Discretize continuous features.
    * Decompose features (e.g., categorical, date/time, etc.).
    * Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    * Aggregate features into promising new features.
5. Feature scaling: standardize or normalize features.
---

__Dealing with ``Cabin`` and ``Embarked``__

Since about 77% of the ``Cabin`` values are missing, the simplest way of dealing with this column is to
drop it entirely.

Analogously, we can drop the few rows with a missing value in the ``Embarked`` column.

In [None]:
df = df.drop('Cabin', axis=1)
df = df.dropna(subset=['Embarked'], axis=0)
df.shape

__Dealing with ``Age``__

First, we fill in the missing ages with the median; then we discretize the age column and apply a one-hot encoding. 

In [None]:
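# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# sklearn.impute.SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer
from sklearn.pipeline import Pipeline

age_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('discretize', KBinsDiscretizer(n_bins=5, encode='onehot'))
    ])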

age_one_hot = age_pipeline.fit_transform(df[['Age']])
age_one_hot.shape

__Dealing with ``Pclass``__

Again, the best way of representing this attribute would be by a one-hot-encoding vector.

In [None]:
pclass_pipeline = Pipeline([
    ('onehot', OneHotEncoder())
])

pclass_one_hot = pclass_pipeline.fit_transform(df[['Pclass']])
pclass_one_hot.shape


__Dealing with ``Name``, ``Sex``, ``Ticket``, ``Embarked``__

These columns contain categorical values, so it is better to represent them as one-hot-encoding vectors.

In [None]:
cat_pipeline = Pipeline([
    ('onehot', OneHotEncoder())
])

cat_df = df.select_dtypes(['object'])
cat_df.head(10)

The ``Ticket`` column probably carries little useful information, so we can drop it.

In [None]:
df = df.drop('Ticket', axis=1)
cat_df = cat_df.drop('Ticket', axis=1)

It should be noted that the ``Name`` column could also carry some interesting information. 

In fact, titles such as Mr., Miss., Mrs., and Master. can denote the social class of an individual. This may have some degree of influence on
the probability of survival.

Therefore, we will replace the original column with another one containing only the title of a person (e.g., Mr., Mrs., ...).


In [None]:
from collections import Counter
names = cat_df.Name.str.split(' ')
lnames = []
for entry in names:
    lnames.extend(entry)
c = Counter(lnames)
print(c.most_common()[:20])

The most common titles seem to be: Mr., Miss., Mrs., and Master.

We will use a regular expression for extracting this information.

In [None]:
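import re
# Match the title as a whole word followed by a period. Requiring the period
# prevents 'Mrs.' from being captured as 'Mr' (and avoids matches inside names).
title_pattern = r'\b(Mrs|Mr|Miss|Master)\.'
cat_df['Name'] = cat_df['Name'].str.extract(title_pattern, flags=re.IGNORECASE, expand=False)
# Every other title (Dr., Rev., ...) is mapped to a single 'unk' category.
cat_df['Name'] = cat_df['Name'].fillna('unk')
assert cat_df['Name'].isnull().sum() == 0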
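
The order of the alternatives in such a regular expression matters. The following self-contained sketch, on a few made-up names in the same "Surname, Title. Given name" format, compares a naive pattern with one anchored on a word boundary and the trailing period (the names and both variable names are hypothetical).

```python
import pandas as pd

# Made-up names following the "Surname, Title. Given name" format.
names = pd.Series([
    "Doe, Mr. John",
    "Doe, Mrs. Jane",
    "Roe, Miss. Ann",
    "Roe, Master. Tim",
    "Poe, Dr. Sam",
])

# Naive pattern: 'Mr' is tried before 'Mrs', so 'Mrs.' is captured as 'Mr'.
naive = names.str.extract(r"(Mr|Miss|Mrs|Master)", expand=False)

# Anchored pattern: whole word followed by a period.
titles = names.str.extract(r"\b(Mrs|Mr|Miss|Master)\.", expand=False).fillna("unk")

print(pd.DataFrame({"name": names, "naive": naive, "anchored": titles}))
```

With the naive pattern the second name is labelled `Mr`, which would silently merge married women with men in the resulting feature.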

Before defining a pipeline that transforms every categorical column into a one-hot-encoding vector, we check that no missing values are left.

In [None]:
cat_df.info()

There are no more missing values.

In [None]:
cat_pipeline = Pipeline(
    [('onehot', OneHotEncoder())]
)
sex_tick_emb_one_hot = cat_pipeline.fit_transform(cat_df)
print(cat_df.columns,sex_tick_emb_one_hot.shape, df.columns)

---
Finally, we can merge all these transformations in a single pipeline and then obtain the final dataset.

In [None]:
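# .copy() makes Xdf independent from df, so the assignment below is safe
Xdf = df.loc[:, ["Name", "Sex", "Embarked", "Pclass", "Age", "SibSp", "Parch"]].copy()
ydf = df.loc[:, ["Survived"]]
# replace the name column with the extracted title
Xdf['Name'] = cat_df['Name']
Xdf.head()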

In [None]:
from sklearn.compose import ColumnTransformer

cat_attribs = list(cat_df.columns)

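# sparse_threshold=0 forces a dense output array, which some of the
# estimators used later (e.g., GaussianNB) require.
full_pipeline = ColumnTransformer([
        ("cat", cat_pipeline, cat_attribs),
        ("age_pipe", age_pipeline, ['Age']),
        ("pclass_pipe", pclass_pipeline, ["Pclass"])
    ], remainder='passthrough', sparse_threshold=0)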

X, y = full_pipeline.fit_transform(Xdf), ydf.values.reshape(-1)
print(X.shape, y.shape)

---
## Short-List Promising Models

---
__Suggestions__:

1. If the data is huge, you may want to sample smaller training sets so you can train
many different models in a reasonable time (be aware that this penalizes complex
models such as large neural nets or Random Forests).
Once again, try to automate these steps as much as possible.
2. Train many quick and dirty models from different categories (e.g., linear, naive
   Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
3. Measure and compare their performance. 
   For each model, use N-fold cross-validation and compute the mean and standard
   deviation of the performance measure on the N folds.
4. Analyze the most significant variables for each algorithm.
5. Analyze the types of errors the models make.
   What data would a human have used to avoid these errors?
6. Have a quick round of feature selection and engineering.
7. Have one or two more quick iterations of the five previous steps.
8. Short-list the top three to five most promising models, preferring models that
   make different types of errors.   
---

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

def compute_performance(estimators, X, y, scoring_funs, cross_val=True):
    '''
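    Compute every score function for each estimator.
    
    Parameters:
    ----------
        estimators : a list of predictors
        X : data used for the cross-validation (or for prediction if cross_val is False)
        y : target values
        scoring_funs : list of performance measures; scorer names (strings) if
            cross_val is True, (function, name) tuples otherwise
        cross_val : if True, score with 10-fold cross-validation; otherwise
            score the predictions of the already fitted estimators
    
    Returns:
    --------
        A dictionary containing the scores of each estimator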
    '''
    score_dict = {}
    for e in estimators:
        score_dict[e.__class__.__name__] = {}
        if not cross_val:
            # we call predict on the estimator
            y_pred = e.predict(X)

        for f in scoring_funs: 
            if cross_val:
                score = cross_val_score(e, X, y, cv=10, scoring=f).mean()
                name = f
            else:
                _f, name = f
                score = _f(y, y_pred)  # scikit-learn metrics expect (y_true, y_pred)
            score_dict[e.__class__.__name__][name] = score
    
    return score_dict

# a fixed random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
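
As suggested in point 3 above, both the mean and the standard deviation over the folds are worth reporting. One way to obtain several metrics in a single cross-validation run is scikit-learn's `cross_validate`; the sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features (an assumption made only to keep the example self-contained).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5, scoring=metrics
)

# cross_validate stores one array per metric under the key 'test_<metric>'.
for metric in metrics:
    values = scores["test_" + metric]
    print(f"{metric}: {values.mean():.3f} +/- {values.std():.3f}")
```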

In the following sections we will try several classification algorithms.

__SGDClassifier__

In [None]:
from sklearn.linear_model import SGDClassifier
# tol=None disables the tolerance-based stopping criterion, so exactly max_iter epochs are run
sgd_clf = SGDClassifier(max_iter=5, tol=None, random_state=42)
sgd_clf.fit(X_train, y_train)
compute_performance([sgd_clf], X_train, y_train, ['accuracy', 'precision', 'recall', 'f1'])

__NaiveBayes__


In [None]:
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
nb_clf = nb_clf.fit(X_train, y_train)
compute_performance([nb_clf], X_train, y_train, ['accuracy', 'precision', 'recall', 'f1'])

__LogisticRegression__

In [None]:
from sklearn.linear_model import LogisticRegression
logreg_clf = LogisticRegression(solver='lbfgs')
logreg_clf.fit(X_train, y_train)
compute_performance([logreg_clf], X_train, y_train, ['accuracy', 'precision', 'recall', 'f1'])

__Support Vector Machine__


In [None]:
from sklearn.svm import SVC
svc = SVC(gamma='auto')
svc.fit(X_train, y_train)
compute_performance([svc], X_train, y_train, ['accuracy', 'precision', 'recall', 'f1'])

---
## Fine-Tune the System

Notes:
0. You will want to use as much data as possible for this step, especially as you move
  toward the end of fine-tuning.
1. Treat your data transformation choices as hyperparameters, especially when
you are not sure about them (e.g., should I replace missing values with zero or
with the median value? Or just drop the rows?).
2. Unless there are very few hyperparameter values to explore, prefer random
search over grid search. If training is very long, you may prefer a Bayesian
optimization approach.
3. Try Ensemble methods. Combining your best models will often perform better
than running them individually.
4. Once you are confident about your final model, measure its performance on the
   test set to estimate the generalization error.
---
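
Point 2 above recommends random search when the hyperparameter space is not tiny. A minimal sketch with `RandomizedSearchCV` follows; the synthetic data, the search ranges, and the number of iterations are all assumptions chosen to keep the example small.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the prepared training set.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Continuous parameters are given as distributions, discrete ones as lists.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=5,
    scoring="accuracy", random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike a grid, the number of fits is fixed by `n_iter`, regardless of how many parameters are explored.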

It appears that the last two models are the most promising ones.

For this reason, in the following section we are going to fine-tune their parameters in
order to conduct a more involved analysis.


__Parameter Tuning for the LogisticRegression__

We decide to tune some of its parameters. More specifically: the ``solver`` and the ``penalty`` parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

param_logreg_grid = [
    {'penalty': ['l2'], 'solver': ['newton-cg', 'sag','lbfgs']},
  ]

grid_search = GridSearchCV(logreg_clf, param_logreg_grid, cv=10,
                           scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best Param:", grid_search.best_params_)
best_logreg_clf = grid_search.best_estimator_
compute_performance([best_logreg_clf], X_train, y_train, scoring_funs=['accuracy', 'recall','precision','f1'])

__Parameter Tuning for the SVC__

We decide to try different kernel functions.


In [None]:
param_svc_grid = [
    {'kernel': ['linear', 'poly', 'rbf']},
  ]

grid_search = GridSearchCV(svc, param_svc_grid, cv=10,
                           scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best Param:", grid_search.best_params_)
best_svc = grid_search.best_estimator_
compute_performance([best_svc], X_train, y_train, scoring_funs=['accuracy', 'recall','precision','f1'])

### Recap

From this analysis, it appears that the SVC is the best choice.

## Ensemble Learning

Before making the final decision about the best solution, we want to try some ensemble techniques.

Since the SVC behaves very well, we try a BaggingClassifier built on top of it.





In [None]:
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(
                SVC(kernel='rbf'), #its best configuration 
                n_estimators=50,
                bootstrap=True,
                n_jobs=-1,
                oob_score=True)

bag_clf.fit(X_train, y_train)
compute_performance([bag_clf], X_train, y_train, scoring_funs=['accuracy', 'recall','precision','f1'])

It appears that the BaggingClassifier is not able to boost the performance of the single learner.
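
Since the classifier above is created with `oob_score=True`, an additional estimate is available for free: each base learner can be evaluated on the training samples left out of its bootstrap sample. The sketch below shows how to read it; it uses synthetic data and a decision tree as base learner (both assumptions, chosen only to keep the example fast).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner, passed positionally
    n_estimators=50,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
bag.fit(X_toy, y_toy)

# Accuracy measured on the out-of-bag samples of each bootstrap replica.
print(round(bag.oob_score_, 3))
```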

For this reason we try other ensemble methods.

*Voting Classifier*

Since LogisticRegression and SVC classifiers share similar performance, we can try to combine them
in a VotingClassifier. 

We also include an additional DecisionTreeClassifier.

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

best_svc.probability = True

voting_clf = VotingClassifier(
    estimators=[('lr', best_logreg_clf), ('svc', best_svc), ('dct', DecisionTreeClassifier())],
    voting='soft')

voting_clf.fit(X_train, y_train)
compute_performance([voting_clf], X_train, y_train, scoring_funs=['accuracy', 'recall', 'precision', 'f1'])

This time we are able to get a slight boost in terms of accuracy and f1 score.


*Random Forest*


In [None]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
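# fit on the training split only: fitting on (X, y) would leak the test set into the model
rnd_clf.fit(X_train, y_train)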
compute_performance([rnd_clf], X_train, y_train, scoring_funs=['accuracy', 'recall', 'precision', 'f1'])

RandomForest appears to behave worse than other, simpler models. 
However, we can use it for computing the importance of each feature.

In [None]:
for i,x in enumerate(rnd_clf.feature_importances_):
    print("Position:{}: {}".format(i,x))


---
__Suggestion__:
For a more exhaustive evaluation, you should try other ensemble learning methods (see Lecture 7a).
However, if you get to this point, where no ensemble method seems to guarantee a considerable boost
in performance, you have basically two choices:

   > (i) You settle for the best performance recorded so far, but you should also explain why there was no improvement.

   > (ii) You can reconsider some decisions taken earlier. 
    For instance, we neither normalized nor standardized our dataset. 
    Another option would be to sample a subset of attributes from the original data, driven by the feature importances
    computed by the RandomForestClassifier (be careful with the actual attributes they refer to, especially with one-hot encoding).

Now, since I am the examiner and not the subject being examined, we will pretend the BaggingClassifier did an excellent job! :-)

---
---
## Present Your Solution
1. Document what you have done.
2. Create a nice presentation.
   Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don’t forget to present interesting points you noticed along the way. Describe what worked and what did not. List your assumptions and your system’s limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).

---

## Model Selection

Based on the previous analysis, three models have emerged as the most promising ones.

In this section we are going to test every one of them against the test-set in order to confirm
the results observed during the training stage.


In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score

In [None]:
score_dict = compute_performance([rnd_clf, voting_clf,bag_clf], X_test, y_test, scoring_funs=[(accuracy_score, "accuracy"), (f1_score, "f1")], cross_val=False)
score_df = pd.DataFrame(score_dict)
score_df.head()

___
__Note__:

Here is a list of suggestions for improving the quality of this analysis.

* You may want to remove some features according to the random forest feature importances
* You may want to show the performance of each classifier with respect to other measures (e.g., the ROC curve)
* If at first you don't succeed, you can also report the failure and try to explain why it happened. 
  Actually, in this case we did not obtain a considerable boost in performance from the ensemble methods.
  You could try to provide an explanation.
  
* Note also that ``train_test_split`` should be called earlier than we did. Also, it is better to use a pipeline for 
  automating the training episodes and the dataset transformation.
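
The last point can be sketched as follows: the data is split first, and the preprocessing is bundled with the classifier in a single `Pipeline`, so that imputation and encoding are fitted on the training split only. The `toy` DataFrame and the choice of columns are hypothetical; this is one possible arrangement, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw data with a categorical column and missing ages.
toy = pd.DataFrame({
    "Sex":      ["male", "female"] * 20,
    "Age":      [22.0, np.nan, 35.0, 4.0, 58.0] * 8,
    "Survived": [0, 1] * 20,
})
X_df, y_s = toy[["Sex", "Age"]], toy["Survived"]

# 1) Split before any transformation is fitted.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_df, y_s, test_size=0.25, stratify=y_s, random_state=42
)

# 2) Preprocessing and model live in one estimator.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ("age", SimpleImputer(strategy="median"), ["Age"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3) fit() learns the median and the categories from the training split only.
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

The same `model` object can be passed to `cross_val_score` or `GridSearchCV`, in which case the preprocessing is refitted inside every fold.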

