# Re-create your own _One Hot Encoder_ 

In [None]:
import pandas as pd
import seaborn as sns

## (1) The Titanic Dataset

In [None]:
# Load 100% of the dataset, with the rows shuffled.
# Set frac = 0.5 to load a random 50% of the rows instead

data = sns.load_dataset('titanic').sample(frac = 1) 
data.head()

In [None]:
from sklearn.model_selection import train_test_split

X = data.drop(columns = ['survived', 'alive', 'who', 'adult_male'])
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [None]:
X_train

## (2) A first pipeline

❓ Create a basic Pipeline which ***encodes categorical features*** and ***scales numerical features*** ❓

💡 Use [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) and [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)

In [None]:
num_features = ['age','fare','sibsp','parch']
cat_features = ['pclass','sex','embarked','class','embark_town','alone']

In [None]:
# YOUR CODE HERE

<details>
    <summary>👩🏻‍🏫 <i>Pipeline</i> vs. <i>make_pipeline</i></summary>

* When you create a Pipeline with `Pipeline()`, you have to:
    - specify all the ***sequential steps of the pipeline*** in a list
    - each step is a tuple with:
        - "name_of_the_step" (a string of your choice)
        - the Scikit-Learn transformer or estimator instance used for that step
    
```python
Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
```
  
* When you create a Pipeline with `make_pipeline()`,
    - you don't have to give a name to each step
    - you can simply chain all the steps together by passing the Scikit-Learn instances directly
    - the names of the steps are automatically generated by `make_pipeline` from the lowercased class names
    
```python
make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)
```
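
To see the difference in practice, the short sketch below builds the same two steps both ways and lists the resulting step names. The variable names `named` and `auto` are only illustrative.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit step names, chosen by hand
named = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Step names generated from the lowercased class names
auto = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]

print(named_steps)
print(auto_steps)
```

These generated names matter later: they are the prefixes you use when tuning hyperparameters, for example `simpleimputer__strategy`.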
    
</details>

<details>
    <summary>👩🏻‍🏫 <i>ColumnTransformer</i> vs. <i>make_column_transformer</i></summary>

* When you create a ColumnTransformer with `ColumnTransformer()`, you have to:
    - specify all the ***parallel steps of the columns' transformer*** in a list
    - each step is a tuple with:
        - "name_of_the_transformer"
        - the transformer
        - the columns which will be impacted by the transformer
    
```python
ColumnTransformer([
    ('num_transformer', num_transformer, num_features),
    ('cat_transformer', cat_transformer, cat_features)
])
```
  
* When you create a ColumnTransformer with `make_column_transformer()`,
    - you don't have to give a name to each parallel step
    - each step is a tuple with:
        - the transformer
        - the columns which will be impacted by the transformer
    
```python
make_column_transformer(
    (num_transformer, num_features),
    (cat_transformer, cat_features)
)
```
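
As a rough illustration, the sketch below applies `make_column_transformer` to a tiny hand-made DataFrame. The `toy` data is hypothetical and only mimics a few Titanic columns; it is not the real dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.92, 53.10],
    'sex': ['male', 'female', 'female', 'male'],
})

# Numerical columns are scaled, the categorical column is one-hot encoded
preprocessor = make_column_transformer(
    (StandardScaler(), ['age', 'fare']),
    (OneHotEncoder(handle_unknown='ignore'), ['sex'])
)

# Transformer names are generated from the lowercased class names
transformer_names = [name for name, _, _ in preprocessor.transformers]
toy_encoded = preprocessor.fit_transform(toy)

print(transformer_names)
print(toy_encoded.shape)
```

Columns that are not listed in any tuple are dropped by default (`remainder='drop'`).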
    
</details>

❓ Chain this preprocessing pipeline with a classifier and optimize it ❓

In [None]:
# YOUR CODE HERE

❓ What are the best params and the best score ❓

In [None]:
# YOUR CODE HERE
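
One possible approach, sketched under assumptions: chain the preprocessor with a `LogisticRegression` and tune it with `GridSearchCV`. The classifier, the parameter grid, and the synthetic `X_toy` / `y_toy` data are all illustrative choices, not the expected solution; with the real data you would fit on `X_train` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for X_train / y_train (synthetic, not the Titanic data)
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    'age': rng.normal(30, 10, n),
    'fare': rng.exponential(30, n),
    'sex': rng.choice(['male', 'female'], n),
    'embarked': rng.choice(['S', 'C', 'Q'], n),
})
X_toy.loc[::10, 'age'] = np.nan  # some missing values to impute
# Noisy target: mostly driven by 'sex', with about 20% of labels flipped
y_toy = ((X_toy['sex'] == 'female') ^ (rng.random(n) < 0.2)).astype(int)

# Numerical columns: impute then scale. Categorical columns: one-hot encode.
num_transformer = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
preprocessor = make_column_transformer(
    (num_transformer, ['age', 'fare']),
    (OneHotEncoder(handle_unknown='ignore'), ['sex', 'embarked'])
)

model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# Parameter names follow the auto-generated step names, joined with '__'
param_grid = {
    'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
    'logisticregression__C': [0.1, 1, 10],
}

search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

If you are unsure of a parameter name, `model.get_params().keys()` lists every tunable name in the pipeline.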

## (3) How could we design a Custom Encoder to keep track of the columns' names?

In [None]:
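from sklearn.preprocessing import OneHotEncoder

# By default, OneHotEncoder works with NumPy arrays and loses track of the columns' names...
# Note: `sparse_output` replaced the `sparse` argument in scikit-learn 1.2
ohe = OneHotEncoder(sparse_output=False)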
ohe.fit_transform(X_train[['sex']])

In [None]:
# ... however, we can access the one-hot-encoded names as follows
ohe.get_feature_names_out()

❓ Try to create your own OneHotEncoder so that it preserves the column names ❓

In [None]:
# YOUR CODE HERE

🏁 If you want to build a very advanced pipeline, feel free to explore the Optional Challenge dealing with the `cars dataset`!

💾 Don't forget to git add/commit/push your notebook.

👏 Congratulations, you are now a master at Pipeline and ColumnTransformer.