# ML Pipeline

This notebook introduces some of the concepts to consider when designing an ML pipeline. We will use the Titanic dataset to illustrate the sequence of tasks involved in solving a typical ML problem.

We will rely mostly on the scikit-learn API to build the pipelines and some of the tasks (including the model), but you will typically also need `pandas` and `numpy`. Optionally, you may find it useful to plot results with `matplotlib` or `seaborn`.

Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

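# The "seaborn" style was renamed "seaborn-v0_8" in matplotlib 3.6; use whichever is available.
plt.style.use("seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn")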

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

Read the data

In [3]:
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| PassengerId | 1 | 2 | 3 | 4 | 5 |
| Survived | 0 | 1 | 1 | 1 | 0 |
| Pclass | 3 | 1 | 3 | 1 | 3 |
| Name | Braund, Mr. Owen Harris | Cumings, Mrs. John Bradley (Florence Briggs Th... | Heikkinen, Miss. Laina | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Allen, Mr. William Henry |
| Sex | male | female | female | female | male |
| Age | 22.0 | 38.0 | 26.0 | 35.0 | 35.0 |
| SibSp | 1 | 1 | 0 | 1 | 0 |
| Parch | 0 | 0 | 0 | 0 | 0 |
| Ticket | A/5 21171 | PC 17599 | STON/O2. 3101282 | 113803 | 373450 |
| Fare | 7.25 | 71.2833 | 7.925 | 53.1 | 8.05 |


## Quantitative/Qualitative information about data

We need to look at the data shape, column types, correlations, missing values, etc. There are many ways of doing this in Python, and many libraries to help. Let's try the [`pandas_profiling` library](https://github.com/ydataai/pandas-profiling).

In [5]:
from pandas_profiling import ProfileReport
profile = ProfileReport(data, title="Pandas Profiling Report")
profile.to_widgets()


A simpler, though less informative, way of looking at the histograms is to use pandas' ability to plot the data that a DataFrame holds.
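A minimal sketch of that simpler view, assuming `DataFrame.hist` is what is meant here. The `toy` frame is a small stand-in with illustrative values so the snippet runs without a download; in the notebook you would call `.hist` on `data` instead.

```python
import pandas as pd

# Small stand-in for `data` so the snippet runs offline (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
    "Pclass": [3, 1, 3, 1, 3, 1],
})

# One histogram per numeric column, drawn on a grid of subplots.
axes = toy.hist(bins=5, figsize=(8, 6))
```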

## Basic Data Cleaning

We have missing values in the `Age`, `Embarked` and `Cabin` features. Let's count them.
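One way to count them is `isnull().sum()`, which gives the number of missing entries per column. The `toy` frame below is a hypothetical stand-in for `data`, with made-up values chosen only to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data` with a few missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Number of missing values per column; keep only columns that have any.
missing = toy.isnull().sum()
print(missing[missing > 0])
```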

One possible strategy is to impute, but we don't really know if that will work, so I will drop the rows with missing `Embarked` values, since there are only two of them.

I'm also dropping `Cabin` and `Ticket`, since they are just noise at this stage.
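A possible way to do both steps, sketched on a hypothetical `toy` frame (made-up rows standing in for `data`): `dropna(subset=...)` removes only the rows where `Embarked` is missing, and `drop(columns=...)` removes the two columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", None, "C", "Q"],
    "Cabin": [None, "C85", None, "C123"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
})

cleaned = (
    toy.dropna(subset=["Embarked"])         # drop only rows missing Embarked
       .drop(columns=["Cabin", "Ticket"])   # drop the two noisy columns
       .reset_index(drop=True)
)
print(cleaned)
```

Missing `Age` values are left untouched here, since they are handled separately.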

### Age NULLs

That column is a problem, so what can we do? First, count the missing values.

## Imputation

Let's work on `Age`. Ideally we would build a strong predictor to fill in the missing values of the `Age` column, but since we're just starting with ML, let's simply use the median age (per gender) to fill them.
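A minimal sketch of per-gender median imputation, assuming a `groupby` plus `transform` approach (one of several ways to do it). The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: two missing ages, one per gender.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Median age of each passenger's own gender, aligned row by row.
median_by_sex = toy.groupby("Sex")["Age"].transform("median")

# Fill only the missing ages; existing values are left unchanged.
toy["Age"] = toy["Age"].fillna(median_by_sex)
print(toy)
```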

Let's see what `Age` looks like now, using a density plot (`data["Age"].plot.density()`).

## Relationships between variables

Start by looking at the relation between Fare and Age for those who survived and those who didn't. Use a `matplotlib` _scatter_ plot with the `c` argument taking the values of the `Survived` field. To plot the legend we need to do something like:

```python
label_names = []  # list with the names of the labels, in sorted order
plt.legend(handles=scatter.legend_elements()[0], labels=label_names, title="Legend Title")
```
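The legend snippet assumes a `scatter` object already exists. A fuller sketch could look like the following, where the `toy` frame, the label names and the axis labels are illustrative choices rather than anything fixed by the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `data` (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
    "Survived": [0, 1, 1, 1, 0, 0],
})

fig, ax = plt.subplots()
# `c` maps each point's Survived value to a color through the colormap.
scatter = ax.scatter(toy["Age"], toy["Fare"], c=toy["Survived"], cmap="Dark2")
ax.set_xlabel("Age")
ax.set_ylabel("Fare")

# legend_elements() returns one handle per distinct value of `c`, in
# ascending order, so 0 gets the first name and 1 the second.
label_names = ["Did not survive", "Survived"]
handles, _ = scatter.legend_elements()
ax.legend(handles=handles, labels=label_names, title="Survived")
```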

But we can make a separate plot for each class much more easily using the pandas `DataFrame.plot.scatter` method. Let's draw three plots, one per `Pclass` value, to get an idea of how many people survived in 1st, 2nd and 3rd class.

Tip: Use `colormap='Dark2'`, or any other matplotlib colormap, to change the default colors.
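One way to draw the three per-class plots, sketched on a hypothetical `toy` frame. Laying the plots out side by side with `plt.subplots` is an assumption of this sketch; three separate figures would work just as well.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `data`: two passengers per class.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0],
    "Fare": [7.25, 71.28, 13.00, 53.10, 10.50, 8.05],
    "Pclass": [3, 1, 2, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 0, 1],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# groupby iterates over the classes in sorted order: 1, 2, 3.
for ax, (pclass, group) in zip(axes, toy.groupby("Pclass")):
    group.plot.scatter(x="Age", y="Fare", c="Survived", colormap="Dark2",
                       ax=ax, title=f"Pclass {pclass}")
```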

Work now with Fare and Age, for the different genders.

How about boxplots to see how people of different classes and ages survived? Maybe this plot is more informative.
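A possible reading of this, assuming one `Age` box per (`Pclass`, `Survived`) combination via `DataFrame.boxplot` with `by`. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `data`: two passengers per (class, outcome) pair.
toy = pd.DataFrame({
    "Age": [45.0, 60.0, 30.0, 41.0, 21.0, 33.0,
            35.0, 38.0, 24.0, 28.0, 4.0, 19.0],
    "Pclass": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    "Survived": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})

# One box of ages for every (Pclass, Survived) group.
ax = toy.boxplot(column="Age", by=["Pclass", "Survived"], figsize=(10, 4))
```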

And finally, histograms of how many people survived, but grouping by gender and class.

Tip: play with the `layout=(2, 3)` or the `sharey=True` arguments.
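A sketch of the grouped histograms using the `by`, `layout` and `sharey` arguments of `DataFrame.hist`. The `toy` frame is hypothetical and contains every gender and class combination, so the 2 by 3 layout has exactly one panel per group.

```python
import pandas as pd

# Hypothetical stand-in for `data`: two passengers per (Sex, Pclass) pair.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"] * 2,
    "Pclass": [1, 2, 3, 1, 2, 3] * 2,
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0],
})

# One histogram of Survived per (Sex, Pclass) group, sharing the y axis
# so the panel heights are directly comparable.
axes = toy.hist(column="Survived", by=["Sex", "Pclass"],
                layout=(2, 3), sharey=True, bins=2, figsize=(10, 6))
```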

What is the age distribution per class? We can use `df.plot(kind='kde')`.
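A minimal sketch that overlays one KDE curve per class on a single set of axes. It assumes SciPy is installed (pandas needs it for `kind='kde'`), and the `toy` values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `data`: four ages per class.
toy = pd.DataFrame({
    "Age": [38.0, 35.0, 54.0, 58.0, 26.0, 30.0,
            34.0, 27.0, 22.0, 4.0, 20.0, 14.0],
    "Pclass": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
})

fig, ax = plt.subplots()
# Draw the estimated age density of each class as its own line.
for pclass, group in toy.groupby("Pclass"):
    group["Age"].plot(kind="kde", ax=ax, label=f"Pclass {pclass}")
ax.set_xlabel("Age")
ax.legend()
```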

## Feature Engineering

**Family size**

Given that the variable `Parch` is the number of parents/children aboard, and `SibSp` is the number of siblings/spouses aboard, we can build a feature that tells us whether a passenger has family aboard, and how large it is.

We can also build a categorical feature that tells us whether someone is alone on the boat, has a normal family size (say <=4) or a large one (>4).

Tip: `df.loc[condition, new_field] = 'value'`
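A sketch of both features using the `df.loc` tip. The column name `FamilySize`, the category labels, and the choice to count the passenger themselves (the `+ 1`) are assumptions of this example; the `toy` frame is made up.

```python
import pandas as pd

# Hypothetical stand-in for `data`.
toy = pd.DataFrame({
    "SibSp": [0, 1, 1, 4, 0],
    "Parch": [0, 0, 2, 2, 1],
})

# Relatives aboard plus the passenger (assumption: the passenger counts too).
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Start everyone as "normal", then overwrite the two special cases.
toy["FamSize"] = "normal"
toy.loc[toy["FamilySize"] == 1, "FamSize"] = "alone"
toy.loc[toy["FamilySize"] > 4, "FamSize"] = "large"
print(toy)
```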

How many people survived, based on this new family-size criterion? Use a histogram to plot the results.

### **Convert** categoricals into numericals using one-hot encoding

We will use pandas `get_dummies`, applied to `Sex`, `FamSize` and `Embarked`. Don't forget to remove the original categorical variables after dummification.
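A minimal sketch on a hypothetical `toy` frame. When the `columns` argument is passed, `get_dummies` replaces the listed columns with their indicator columns, so no separate removal step is needed in that case.

```python
import pandas as pd

# Hypothetical stand-in for `data` after feature engineering.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "FamSize": ["alone", "normal", "alone", "large"],
    "Embarked": ["S", "C", "S", "Q"],
})

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Sex", "FamSize", "Embarked"])
print(encoded.columns.tolist())
```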

## Model Construction and Evaluation

Ensure reproducibility

In [37]:
np.random.seed(1234)

Define your features and target variable, and build your training and test sets.

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
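A sketch of the split, assuming `Survived` is the target and every other column is a feature. The 25% test size, the `random_state` value and stratifying on the target are illustrative choices, and the `toy` frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded `data` (40 made-up rows).
toy = pd.DataFrame({
    "Age": np.linspace(1, 79, 40),
    "Fare": np.linspace(5, 100, 40),
    "Sex_male": np.tile([0, 1, 1, 0], 10),
    "Survived": np.tile([0, 1], 20),
})

X = toy.drop(columns=["Survived"])  # features: everything except the target
y = toy["Survived"]                 # target variable

# Hold out 25% of the rows, keeping the class balance similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)
print(X_train.shape, X_test.shape)
```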

Linear Regression

In [52]:
linreg = LinearRegression()

Logistic Regression

In [2]:
logreg = LogisticRegression()
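A hedged sketch of fitting and scoring both models. It uses a synthetic dataset from `make_classification` instead of the Titanic features so that it runs on its own. Thresholding the linear regression output at 0.5 to obtain class labels is an assumption of this example, since `accuracy_score` needs discrete predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=4, random_state=1234)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

# Linear regression predicts a continuous score; cut it at 0.5 (assumption).
linreg = LinearRegression().fit(X_train, y_train)
linreg_pred = (linreg.predict(X_test) >= 0.5).astype(int)

# Logistic regression predicts class labels directly.
logreg = LogisticRegression().fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

linreg_acc = accuracy_score(y_test, linreg_pred)
logreg_acc = accuracy_score(y_test, logreg_pred)
print(f"Linear regression (thresholded) accuracy: {linreg_acc:.3f}")
print(f"Logistic regression accuracy: {logreg_acc:.3f}")
```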