---

# Let's practice

Before you start experimenting with the models we have seen, download the dataset you will work with: the Titanic dataset, widely known and used in machine learning courses.

To do so, run the following:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# Download the Titanic dataset from OpenML (requires an internet connection).
titanic = fetch_openml("titanic", version=1, as_frame=True, return_X_y=False)

# Join features and target into one DataFrame, then rename the target column.
df_titanic = pd.DataFrame(
    data=np.c_[titanic['data'], titanic['target']],
    columns=titanic['feature_names'] + ['target']
)
df_titanic = df_titanic.rename(columns={'target': 'survived'})
```

Using this same dataset (Titanic), you should train four models:

* Decision Tree
* SVM
* Random Forest
* Extra: XGBoost

You should also apply the following concepts:

* Train/Test Split.
* Feature engineering.
* Grid search or randomized search with cross-validation (GridSearchCV or RandomizedSearchCV).
* Metrics.

```python
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.svm import SVC
from matplotlib import style
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
```

```python
titanic = fetch_openml("titanic", version=1, as_frame=True, return_X_y=False)
df_titanic = pd.DataFrame(
    data=np.c_[titanic["data"], titanic["target"]],
    columns=titanic["feature_names"] + ["target"],
)
df_titanic = df_titanic.rename(columns={"target": "survived"})
df_titanic.head()
```

|   | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | survived |
|---|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|----------|
| 0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2.0 |  | St Louis, MO | 1 |
| 1 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11.0 |  | Montreal, PQ / Chesterville, ON | 1 |
| 2 | 1.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON | 0 |
| 3 | 1.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON | 0 |
| 4 | 1.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON | 0 |


## Exploration

**TODO:** Every dataset should be explored before modeling, and this one is no exception. Look at the column types, the missing values, and how the features relate to the target.
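
As a possible starting point, here is a minimal exploration sketch. It runs on a small hand-made frame (`df`, hypothetical rows that only mimic a few Titanic columns) so that it works without downloading anything; the same calls should apply to `df_titanic`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (not the real data).
df = pd.DataFrame(
    {
        "sex": ["female", "male", "female", "male", "male", "female"],
        "age": [29.0, np.nan, 2.0, 30.0, np.nan, 25.0],
        "fare": [211.34, 151.55, 151.55, 7.25, 8.05, np.nan],
        "survived": [1, 1, 0, 0, 0, 1],
    }
)

# Column types and summary statistics of the numeric columns.
print(df.dtypes)
print(df.describe())

# Missing values per column, most affected first.
missing = df.isna().sum().sort_values(ascending=False)
print(missing)

# Class balance of the target and survival rate per category.
target_counts = df["survived"].value_counts()
survival_by_sex = df.groupby("sex")["survived"].mean()
print(target_counts)
print(survival_by_sex)
```

Plots (for example with seaborn, which is already imported above) are a natural next step once you know which columns look informative.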

## Feature Engineering

**TODO:**

* Fill in the missing values using any criteria that you consider appropriate.
* Drop the features that you consider unnecessary.
* Encode the categorical features using LabelEncoder and/or OneHotEncoder.

## Hyperparameter Optimization

**TODO:**

* Split the dataset into 80% train and 20% test.
* Using GridSearchCV or RandomizedSearchCV, test different hyperparameter values for each model and keep the best configuration of each one.
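
A minimal sketch of the split and the two search strategies follows. It uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, and the parameter grids are arbitrary assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 80% train / 20% test, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Exhaustive search: every combination in the grid is cross-validated.
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="f1",
)
tree_search.fit(X_train, y_train)

# Randomized search: only n_iter sampled combinations are cross-validated.
forest_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [25, 50, 100], "max_depth": [2, 4, 6, None]},
    n_iter=4,
    cv=5,
    scoring="f1",
    random_state=0,
)
forest_search.fit(X_train, y_train)

for name, search in [("tree", tree_search), ("forest", forest_search)]:
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, `best_estimator_` holds the model refitted on the whole training set with the best parameters, ready to be evaluated on the test set.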

## Metrics

**TODO:**

* Evaluate the metrics of each model (accuracy, precision, recall, F1 score, ROC AUC) and choose the one with the best performance.
* Plot the precision-recall curve (tip: scikit-learn has a helper for this).
* Plot the ROC curve (tip: scikit-learn has a helper for this too).
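
One way to compute these metrics is sketched below, again on synthetic stand-in data and with an arbitrary random forest rather than your tuned models. The curve points are computed with `precision_recall_curve` and `roc_curve`; plotting them is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target (not the Titanic data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Hard predictions for the threshold-based metrics,
# positive-class probabilities for the ranking-based ones.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Points of the precision-recall and ROC curves.
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
```

To draw the curves directly, `PrecisionRecallDisplay.from_predictions(y_test, y_score)` and `RocCurveDisplay.from_predictions(y_test, y_score)` from `sklearn.metrics` produce matplotlib plots from the same inputs.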

---

# Pipeline with ColumnTransformer and GridSearchCV

Only toy datasets such as the __iris dataset__ contain purely numeric data. As we saw in the previous exercise, the __Titanic dataset__ mixes several data types, not just numeric ones.

Because the columns have different types, we cannot apply the same transformation to all of them; each type of data needs its own transformation.

Next we will see an example of how __ColumnTransformer__ simplifies applying these different transformations and, above all, lets us plug them into a __Pipeline__.

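Let’s use the Titanic dataset, which contains both numerical and categorical data, and apply:

* Impute the missing values of the numeric columns (`age`, `fare`) with SimpleImputer() and scale them with StandardScaler(); MinMaxScaler() is an alternative when the values should be normalized to a fixed range.
* Encode the categorical columns (`sex`, `cabin`) with OneHotEncoder().
* Optionally, group the `age` column with binning (the Titanic code below does not include this step).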
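
The Titanic cells that follow use StandardScaler and skip the binning step. To illustrate MinMaxScaler, OneHotEncoder and binning together, here is a minimal sketch on a hypothetical toy frame; the column names (`income`, `age`, `city`), the values and the choice of `KBinsDiscretizer` with three uniform bins are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, OneHotEncoder

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame(
    {
        "income": [20000.0, 35000.0, 50000.0, 80000.0],
        "age": [18.0, 35.0, 52.0, 70.0],
        "city": ["A", "B", "A", "C"],
    }
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Rescale income to the [0, 1] range.
        ("income", MinMaxScaler(), ["income"]),
        # Replace each age by the index of its equal-width bin.
        (
            "age",
            KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"),
            ["age"],
        ),
        # One indicator column per city.
        ("city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns follow the order of the transformers list:
# scaled income, age bin, then the one-hot city columns.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

Each transformer only sees the columns listed next to it, and the results are concatenated side by side.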

```python
titanic = fetch_openml("titanic", version=1, as_frame=True, return_X_y=False)
df_titanic = pd.DataFrame(
    data=np.c_[titanic["data"], titanic["target"]],
    columns=titanic["feature_names"] + ["target"],
)
df_titanic = df_titanic.rename(columns={"target": "survived"})

# Keep only the columns used in this example; copy to avoid modifying a view.
df_titanic = df_titanic[["sex", "cabin", "age", "fare", "survived"]].copy()

# np.c_ produced object columns, so cast the numeric ones back to float.
df_titanic["age"] = df_titanic["age"].astype("float64")
df_titanic["fare"] = df_titanic["fare"].astype("float64")

df_titanic.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sex       1309 non-null   object 
 1   cabin     295 non-null    object 
 2   age       1046 non-null   float64
 3   fare      1308 non-null   float64
 4   survived  1309 non-null   object 
dtypes: float64(2), object(3)
memory usage: 51.3+ KB
```


```python
df_titanic.isna().sum()
```

```
sex            0
cabin       1014
age          263
fare           1
survived       0
dtype: int64
```

```python
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Numeric features
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

# Categorical features
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["age", "fare"]),
        ("cat", categorical_transformer, ["sex", "cabin"]),
    ]
)
```

```python
X_train, X_test, y_train, y_test = train_test_split(
    df_titanic.drop("survived", axis=1),
    df_titanic.survived,
    test_size=0.2,
    random_state=0,
)
```

```python
my_pipe = Pipeline(
    [("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())]
)

my_params = {"classifier__max_depth": [2, 3, 4, 5, 6, 7, 8]}

grid = GridSearchCV(my_pipe, my_params, cv=5)
grid.fit(X_train, y_train)
score = grid.score(X_test, y_test)

print(f"Test score: {score}")
print(f"Best parameters: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")
```

```
Test score: 0.7977099236641222
Best parameters: {'classifier__max_depth': 5}
Best score: 0.7927044884939621
```
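
The `step__parameter` naming used in the grid also reaches inside the preprocessing, so the search can tune how the data is prepared and not only the classifier. The sketch below shows this on synthetic data; the frame `toy`, its target and the grid values are assumptions made for illustration, and the `toy_` names are used to avoid overwriting the objects defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with one categorical and two numeric columns.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame(
    {
        "age": rng.normal(30, 10, n),
        "fare": rng.exponential(30, n),
        "sex": rng.choice(["female", "male"], n),
    }
)
toy.loc[::7, "age"] = np.nan  # inject some missing values
toy_target = ((toy["sex"] == "female") | (toy["fare"] > 60)).astype(int)

toy_numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
toy_preprocessor = ColumnTransformer(
    [
        ("num", toy_numeric, ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ]
)
toy_pipe = Pipeline(
    [
        ("preprocessor", toy_preprocessor),
        ("classifier", DecisionTreeClassifier(random_state=0)),
    ]
)

# Each key walks down the nesting: pipeline step, transformer name,
# inner pipeline step, and finally the parameter.
toy_params = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 4, 6],
}

toy_grid = GridSearchCV(toy_pipe, toy_params, cv=5)
toy_grid.fit(toy, toy_target)
print(toy_grid.best_params_)
```

Because the preprocessing lives inside the pipeline, it is refitted on the training part of every cross-validation fold, which keeps the validation folds out of the imputation and scaling statistics.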


**TODO:**

Using __ColumnTransformer__ and __Pipeline__, build a pipeline that applies different transformations to different types of data. You can use the Titanic dataset again. Also, do some research on the [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) class.
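
As a starting point for that research, here is a minimal sketch of FeatureUnion on the iris dataset. Unlike ColumnTransformer, which sends a subset of columns to each transformer, FeatureUnion feeds the whole input to every transformer and concatenates the results. The choice of PCA and SelectKBest is only an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Both transformers receive all four iris features.
union = FeatureUnion(
    [
        ("pca", PCA(n_components=2)),  # two principal components
        ("kbest", SelectKBest(k=1)),   # the single best original feature
    ]
)

# The outputs are stacked side by side: 2 + 1 = 3 columns.
X_combined = union.fit(X, y).transform(X)
print(X_combined.shape)  # (150, 3)
```

A FeatureUnion is itself a transformer, so it can be used as a step inside a Pipeline or as one of the transformers of a ColumnTransformer.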