<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/notebooks/Feature_Engineering_with_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Feature Engineering with Scikit Learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
url = 'https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/'

In [None]:
df = pd.read_csv(url + 'titanic-train.csv')
df.head()

In [None]:
df.info()

### Feature inspection


In [None]:
df.isna().sum()

In [None]:
# Number of distinct values (NaN included) in each object-typed column
df.select_dtypes('O').nunique(dropna=False)

In [None]:
plt.figure(figsize=(10, 6))
# One histogram per numeric column, laid out on a 3x3 grid
for i, c in enumerate(df.select_dtypes('number').columns, start=1):
  plt.subplot(3, 3, i)
  df[c].plot.hist()
  plt.title(c)

plt.tight_layout()

Missing data and unused columns:
- drop the columns `'PassengerId', 'Name', 'Cabin', 'Ticket'`
- drop the 2 rows with missing `Embarked`
- impute missing `Age` values with the mean

In [None]:
df1 = df.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis=1)

In [None]:
df1 = df1.dropna(subset=['Embarked'])

In [None]:
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())

#### Scalers

Explore transformations of the `Age` and `Fare` columns.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer, KBinsDiscretizer, QuantileTransformer

In [None]:
df1['Age'].plot.hist(bins=30, title='Age Distribution');

In [None]:
scaled_age = StandardScaler().fit_transform(df1[['Age']])

In [None]:
plt.hist(scaled_age, bins=30)
plt.title('Scaled Age Distribution');

In [None]:
kbd = KBinsDiscretizer(n_bins=7, encode='onehot-dense')

age_discretized = kbd.fit_transform(df1[['Age']])

age_discretized

In [None]:
kbd.bin_edges_

In [None]:
pd.Series(age_discretized.argmax(axis=1)).value_counts(sort=False)

In [None]:
df1['Fare'].plot.hist(bins=30, title='Fare Distribution');

In [None]:
qt = QuantileTransformer(n_quantiles=100)

scaled_fare = qt.fit_transform(df1[['Fare']])

In [None]:
plt.hist(scaled_fare, bins=30)
plt.title('Scaled Fare Distribution');

In [None]:
plt.plot(qt.quantiles_);

### Exercise 1
- Create a label variable `y = df1['Survived']`
- Create a variable `X` that only contains the following raw features: `['Pclass', 'Age', 'Fare', 'Parch', 'SibSp']`

- Use the [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) to perform all of the following transformations at once on `df1`:
  - Feature engineering:
    - Binarize `Sex` with `OneHotEncoder` (a single `male` column)
    - One-hot encode `Pclass` and `Embarked` with `OneHotEncoder`
    - Scale `Age` with `StandardScaler`
    - Discretize `Age` with `KBinsDiscretizer` into 7 bins
    - Transform `Fare` with `QuantileTransformer` with 100 quantiles
    - Create boolean columns for `Parch` and `SibSp` using `FunctionTransformer` (0 if 0, 1 if > 0)

  - Also, use the `passthrough` option for:
    - `Age`
    - `SibSp`
    - `Parch`

- Create a new variable `X_new` with the transformed features.

If you've done everything correctly, you should end up with the following features, in this order:

```python
new_features = ['male',
                'Pclass_1',
                'Pclass_2',
                'Pclass_3',
                'Embarked_C',
                'Embarked_Q',
                'Embarked_S',
                'Age_scaled',
                'Age_bins_0',
                'Age_bins_1',
                'Age_bins_2',
                'Age_bins_3',
                'Age_bins_4',
                'Age_bins_5',
                'Age_bins_6',
                'Fare_transformed',
                'Parch_bool',
                'SibSp_bool',
                'Age',
                'SibSp',
                'Parch'
                ]
```

In [None]:
from sklearn.compose import ColumnTransformer
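
The exercise above can be sketched on a small synthetic table. This is one possible layout, not the reference solution: the `toy` DataFrame, the `is_positive` helper, and the transformer names are made up for illustration, and `drop='if_binary'` and `sparse_threshold=0` are choices assumed here to obtain a single `Sex` column and a dense output array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (FunctionTransformer, KBinsDiscretizer,
                                   OneHotEncoder, QuantileTransformer,
                                   StandardScaler)

# Synthetic stand-in for df1 (NOT the real Titanic data), built so that
# every category is guaranteed to appear.
n = 210
idx = np.arange(n)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Sex': np.where(idx % 2 == 0, 'male', 'female'),
    'Pclass': idx % 3 + 1,
    'Embarked': np.array(['C', 'Q', 'S'])[(idx // 3) % 3],
    'Age': rng.uniform(1, 80, n),
    'Fare': rng.exponential(30, n),
    'Parch': rng.integers(0, 3, n),
    'SibSp': rng.integers(0, 4, n),
})


def is_positive(values):
    """Map counts to 0 (zero) or 1 (greater than zero)."""
    return (values > 0).astype(int)


# Each entry is (name, transformer, columns); a column may be used more than once.
ct = ColumnTransformer(
    transformers=[
        ('sex', OneHotEncoder(drop='if_binary'), ['Sex']),
        ('onehot', OneHotEncoder(), ['Pclass', 'Embarked']),
        ('age_scaled', StandardScaler(), ['Age']),
        ('age_bins', KBinsDiscretizer(n_bins=7, encode='onehot-dense'), ['Age']),
        ('fare', QuantileTransformer(n_quantiles=100), ['Fare']),
        ('bool', FunctionTransformer(is_positive), ['Parch', 'SibSp']),
        ('raw', 'passthrough', ['Age', 'SibSp', 'Parch']),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_toy_new = ct.fit_transform(toy)
print(X_toy_new.shape)
```

The output columns follow the order of the `transformers` list, so listing the steps in the same order as `new_features` keeps the two aligned. Applying the same `ct` definition to `df1` should work the same way, provided the column names match.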

### Exercise 2

- Train a `DecisionTreeClassifier` on `X` and another on `X_new`, then compare their performance. For this exercise, we will not perform a train/test split; we simply evaluate performance on the whole dataset.
- Compare the feature importances of the 2 models using the fitted `model.feature_importances_` attribute.
- Bonus points if you plot the feature importances as a bar chart.
- Which model performs better?
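
As a rough sketch of the workflow for this exercise, the snippet below fits a tree on the full dataset and collects its training accuracy and feature importances. It runs on synthetic data, not the Titanic features; `X_demo`, `y_demo`, and the `fit_and_report` helper are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic features); the label depends mostly on 'Fare'.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Fare': rng.exponential(30, n),
    'Pclass': rng.integers(1, 4, n),
})
y_demo = (X_demo['Fare'] + rng.normal(0, 10, n) > 30).astype(int)


def fit_and_report(features, labels):
    """Fit a tree on the full dataset; return training accuracy and importances."""
    model = DecisionTreeClassifier(random_state=0).fit(features, labels)
    accuracy = model.score(features, labels)
    importances = pd.Series(model.feature_importances_, index=features.columns)
    return accuracy, importances.sort_values(ascending=False)


accuracy, importances = fit_and_report(X_demo, y_demo)
print(f'training accuracy: {accuracy:.3f}')
print(importances)
```

Calling `importances.plot.bar()` draws the bar chart. Since `X_new` is a NumPy array, it can be wrapped as `pd.DataFrame(X_new, columns=new_features)` before being passed to the helper. Keep in mind that an unpruned tree can come close to memorizing its training data, so scores computed without a train/test split are optimistic.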

### Exercise 3: Feature Selection

- Create a final model using the [`Recursive Feature Elimination`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) (`RFE`) transformer and re-train it on `X_new`.
- Which features are retained?
- Does the model performance drop significantly?
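
A minimal sketch of how `RFE` can wrap a decision tree is shown below, again on synthetic data instead of `X_new`. The feature names `f0` to `f5` and the choice of `n_features_to_select=3` are assumptions made for this example; the exercise does not prescribe how many features to keep.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: only 'f0' and 'f1' influence the label.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame(rng.normal(size=(n, 6)),
                      columns=[f'f{i}' for i in range(6)])
y_demo = (X_demo['f0'] + 0.5 * X_demo['f1'] > 0).astype(int)

# RFE repeatedly refits the estimator and drops the least important feature
# (step=1 by default) until n_features_to_select features remain.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=3)
rfe.fit(X_demo, y_demo)

retained = X_demo.columns[rfe.support_].tolist()
print('retained features:', retained)
print('training accuracy:', rfe.score(X_demo, y_demo))
```

`rfe.support_` is a boolean mask over the input columns and `rfe.ranking_` assigns rank 1 to every retained feature, so both can be matched against `new_features` to see which engineered columns survive.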