# Feature Engineering
  
In this chapter you will work with different types of features: you will modify existing features, create new ones, and handle missing data.


## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Scikit-Learn Documentation](https://scikit-learn.org/stable/)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Function</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>sklearn.ensemble.RandomForestRegressor</td>
    <td>Create a Random Forest Regressor model.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>sklearn.model_selection.KFold</td>
    <td>Split dataset into K consecutive folds for cross-validation.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>sklearn.model_selection.KFold.split</td>
    <td>Generate indices to split data into training and test sets using KFold.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>sklearn.metrics.mean_squared_error</td>
    <td>Calculate the mean squared error between true and predicted values.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>numpy.mean</td>
    <td>Compute the arithmetic mean along a specified axis.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>numpy.std</td>
    <td>Compute the standard deviation along a specified axis.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pandas.concat</td>
    <td>Concatenate DataFrames along a particular axis.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pandas.to_datetime</td>
    <td>Convert the argument to datetime.</td>
  </tr>
  <tr>
    <td>9</td>
    <td>sklearn.preprocessing.LabelEncoder</td>
    <td>Encode labels with a numeric value.</td>
  </tr>
  <tr>
    <td>10</td>
    <td>Series.map</td>
    <td>Apply a function to each element of a Series.</td>
  </tr>
  <tr>
    <td>11</td>
    <td>DataFrame.sample</td>
    <td>Randomly sample rows from a DataFrame.</td>
  </tr>
  <tr>
    <td>12</td>
    <td>DataFrame.drop_duplicates</td>
    <td>Remove duplicate rows from a DataFrame.</td>
  </tr>
  <tr>
    <td>13</td>
    <td>sklearn.impute.SimpleImputer</td>
    <td>Impute missing values using a specified strategy.</td>
  </tr>
</table>



  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: scikit-learn  
Version: 1.3.0  
Summary: A set of python modules for machine learning and data mining  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Runs the IPython interactive shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background and writes its output to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of the user-defined function `test` (requires `import inspect`).  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
```python
import math

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
n_rows = math.ceil(len(available) / 3)  # enough rows for every style
for i, style in enumerate(available):
    with plt.style.context(style):
        ax = fig.add_subplot(n_rows, 3, i + 1)
        ax.plot(x, y)
    ax.set_title(style)
```
  

In [8]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Setting a standard style
plt.style.use('ggplot')

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

## Feature engineering
  
Once we know the properties of the input data and have a reliable validation scheme, it's time to start building prediction models.
  
**Solution workflow**
  
Recall from the previous chapter the solution workflow for the competitions. We've already covered the first three blocks. Let's now consider the modeling stage.
  
<center><img src='../_images/feature-engineering-kaggle.png' alt='img' width='740'></center>
  
**Modeling stage**
  
This stage is the longest one in the competition, and kind of feels like a marathon.
  
<center><img src='../_images/feature-engineering-kaggle1.png' alt='img' width='740'></center>
  
During the modeling loop we pre-process data, create new features, enhance models, apply different tricks and iterate over and over again. The majority of the ideas and experiments will not work, but the goal is to find the subset of changes that improve both local validation and Public Leaderboard scores.
  
<center><img src='../_images/feature-engineering-kaggle2.png' alt='img' width='740'></center>
  
So, after any change we should look at the validation score. If we observe an improvement on local validation, then we keep our change, otherwise, discard it. The important rule is to tweak only a single thing at a time, because changing multiple things does not allow us to detect what actually works and what doesn't.
  
<center><img src='../_images/feature-engineering-kaggle3.png' alt='img' width='740'></center>
  
**Feature engineering**
  
This chapter is devoted to feature engineering, the process of creating new features. It gives our machine learning models additional information and consequently helps them predict the target variable better.
  
<center><img src='../_images/feature-engineering-kaggle4.png' alt='img' width='740'></center>
  
The ideas for new features can come from prior experience working with similar data. Another source is EDA. Having looked at the data, we could potentially generate ideas for new valuable features. One more source is domain knowledge of the problem we're solving. It allows us to use ideas and approaches that work for this particular domain.
  
<center><img src='../_images/feature-engineering-kaggle5.png' alt='img' width='740'></center>
  
**Feature types**
  
There are a number of different feature types. The most popular are listed below. Numerical features are ordinary numbers, measures and counts, for example price or number of bedrooms. Categorical features describe a group the observation belongs to, for example country or marital status. Datetime features hold date and time information. Coordinates describe geospatial data. Text features contain descriptions, addresses and so on. Finally, images hold visual data for each observation.
  
- Numerical
- Categorical
- Datetime
- Coordinates
- Text
- Images
  
**Creating features**
  
In some situations we need to generate features for train and test independently, and separately for each validation split in k-fold cross-validation. In the majority of cases, however, features are created for the train and test sets simultaneously. For this purpose, we concatenate the train and test DataFrames into a single DataFrame using the `pandas` `concat()` function. Then we generate the new features and split the DataFrame back into train and test. We can use the `.isin()` method to select the rows with the original train and test ids, respectively.
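
The concatenate, engineer and split pattern can be sketched on toy data as follows. The DataFrames, the column names and the `is_large` feature are hypothetical, and the sketch assumes that `id` values are unique and do not overlap between train and test; otherwise `.isin()` would select rows from both sets.

```python
import pandas as pd

# Hypothetical toy data: 'id' is a unique row identifier, 'price' is the target
train = pd.DataFrame({"id": [1, 2, 3], "price": [100, 150, 90], "rooms": [2, 3, 1]})
test = pd.DataFrame({"id": [4, 5], "rooms": [2, 4]})

# Stack train and test so that features are created once for both
data = pd.concat([train, test], ignore_index=True)

# Create a new feature on the combined DataFrame
data["is_large"] = (data["rooms"] >= 3).astype(int)

# Split back using the original ids
new_train = data[data["id"].isin(train["id"])]
new_test = data[data["id"].isin(test["id"])]

print(new_train.shape, new_test.shape)
```

The target column is missing in the test rows of the combined DataFrame, so it is filled with `NaN` there after concatenation.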
  
<center><img src='../_images/feature-engineering-kaggle6.png' alt='img' width='740'></center>
  
**Arithmetical features**
  
The simplest engineered features are arithmetical features. We take two numerical features, apply an arithmetical operation to them and obtain a new feature. Let's consider a subsample from the Two Sigma Connect dataset with only the number of bathrooms and bedrooms in each apartment, together with the price. Then we could generate, for example, the price per bedroom, or the total number of bedrooms and bathrooms, and so on.
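
A minimal sketch of these two features, assuming a toy DataFrame with hypothetical column names `bathrooms`, `bedrooms` and `price` (not the real dataset schema). Apartments with zero bedrooms would cause a division by zero, so the sketch turns them into missing values first; that handling is an assumption, not something prescribed by the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical listings
apartments = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 2.0],
    "bedrooms": [2, 0, 3],
    "price": [3000, 2400, 4500],
})

# Keep only positive bedroom counts; zero becomes NaN to avoid dividing by zero
bedrooms = apartments["bedrooms"].where(apartments["bedrooms"] > 0)

# Price per bedroom
apartments["price_per_bedroom"] = apartments["price"] / bedrooms

# Total number of bedrooms and bathrooms
apartments["rooms_number"] = apartments["bedrooms"] + apartments["bathrooms"]

print(apartments)
```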
  
<center><img src='../_images/feature-engineering-kaggle7.png' alt='img' width='740'></center>
  
**Datetime features**
  
Another type of data we will cover in this lesson is datetime. Let's look at the demand forecasting data, which contains item sales for each date. To generate features from the date, we first convert the date column to a datetime type using the `pandas` `to_datetime()` function. Then we can use the `.dt` accessor to obtain any date feature we'd like.
  
<center><img src='../_images/feature-engineering-kaggle8.png' alt='img' width='740'></center>
  
For example, we could start with the year, using the `.dt` accessor followed by the `.year` attribute. Next comes the month number: January is encoded as 1, February as 2, and so on up to December, encoded as 12. We can also get the number of the week within the year, and several day features: the number of the day within the year, the month and the week. Note that the day of the week encodes Monday as 0 and Tuesday as 1, proceeding to Sunday as 6.
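
The date features listed above can be sketched as follows. The `sales` DataFrame and its values are hypothetical. The week number is taken from `.dt.isocalendar().week`, because the older `.dt.week` attribute is not available in pandas 2.x.

```python
import pandas as pd

# Hypothetical daily sales
sales = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-12-31"],
    "sales": [13, 11, 20],
})

# Convert strings to datetime so that the .dt accessor becomes available
sales["date"] = pd.to_datetime(sales["date"])

sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month                # January = 1, ..., December = 12
sales["week"] = sales["date"].dt.isocalendar().week    # ISO week of the year
sales["dayofyear"] = sales["date"].dt.dayofyear
sales["dayofmonth"] = sales["date"].dt.day
sales["dayofweek"] = sales["date"].dt.dayofweek        # Monday = 0, ..., Sunday = 6

print(sales)
```

Note that ISO weeks do not always align with the calendar year: a date in early January can belong to the last ISO week of the previous year.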
  
<center><img src='../_images/feature-engineering-kaggle9.png' alt='img' width='740'></center>
  
**Let's practice!**
  
All right, let's get some practical experience creating new numerical and date features!

### Arithmetical features
  
To practice creating new features, you will be working with a subsample from the Kaggle competition called "House Prices: Advanced Regression Techniques". The goal of this competition is to predict the price of the house based on its properties. It's a regression problem with Root Mean Squared Error as an evaluation metric.
  
Your goal is to create new features and determine whether they improve your validation score. To get the validation score from 5-fold cross-validation, you're given the `get_kfold_rmse()` function. Use it with the `train` DataFrame, available in your workspace, as an argument.
  
---
  
1. Create a new feature representing the total area (basement, 1st and 2nd floors) of the house. The columns `"TotalBsmtSF"`, `"1stFlrSF"` and `"2ndFlrSF"` give the areas of the basement, 1st and 2nd floors, respectively.
2. Create a new feature representing the area of the garden. It is the difference between the total area of the property (`"LotArea"`) and the first floor area (`"1stFlrSF"`).
3. Create a new feature representing the total number of bathrooms in the house. It is a sum of full bathrooms (`"FullBath"`) and half bathrooms (`"HalfBath"`).

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Cross-validation splitter: 5 shuffled folds with a fixed seed
kf = KFold(n_splits=5, shuffle=True, random_state=123)

def get_kfold_rmse(train):
    """Return mean + std of the fold RMSEs from 5-fold cross-validation."""
    # Fill missing values and select the feature columns once, outside the loop
    train = train.fillna(0)
    feats = [x for x in train.columns if x not in ['Id', 'SalePrice', 'RoofStyle', 'CentralAir']]
    rmse_scores = []

    for train_index, test_index in kf.split(train):
        # KFold yields positional indices, so select rows with .iloc
        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]

        # Create a Random Forest object
        rf = RandomForestRegressor(n_estimators=10, min_samples_split=10, random_state=123)

        # Train a model
        rf.fit(X=fold_train[feats], y=fold_train['SalePrice'])

        # Get predictions for the held-out fold
        pred = rf.predict(fold_test[feats])

        # RMSE is the square root of the mean squared error
        fold_score = mean_squared_error(fold_test['SalePrice'], pred)
        rmse_scores.append(np.sqrt(fold_score))

    # Mean plus one standard deviation, so unstable fold scores raise the result
    return round(np.mean(rmse_scores) + np.std(rmse_scores), 2)

In [10]:
train = pd.read_csv('../_datasets/house_prices_train.csv')
test = pd.read_csv('../_datasets/house_prices_test.csv')

In [11]:
# Look at the initial RMSE
print('RMSE before feature engineering:', get_kfold_rmse(train))

# Find the total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']

# Look at the updated RMSE
print('RMSE with total area:', get_kfold_rmse(train))

# Find the area of the garden
train['GardenArea'] = train['LotArea'] - train['1stFlrSF']
print('RMSE with garden area:', get_kfold_rmse(train))

# Find total number of bathrooms
train['TotalBath'] = train['FullBath'] + train['HalfBath']
print('RMSE with number of bathrooms:', get_kfold_rmse(train))

RMSE before feature engineering: 36029.39
RMSE with total area: 35073.2
RMSE with garden area: 34413.55
RMSE with number of bathrooms: 34506.78


Nice! You've created three new features. Total area lowered the validation score by almost 1,000 (from 36029.39 to 35073.2), and garden area lowered it by about another 660 (to 34413.55). With the total number of bathrooms, however, the score rose to 34506.78. So you keep the two area features but do not add `"TotalBath"` as a new feature. Keep in mind that `get_kfold_rmse()` reports the mean of the fold RMSEs plus their standard deviation. Let's now work with the datetime features!

### Date features
  
You've built some basic features using numerical variables. Now, it's time to create features based on date and time. You will practice on a subsample from the Taxi Fare Prediction Kaggle competition data. The data represents information about the taxi rides and the goal is to predict the price for each ride.
  
Your objective is to generate date features from the pickup datetime. Recall that it's better to create new features for `train` and `test` data simultaneously. After the features are created, split the data back into the `train` and `test` DataFrames. Here it's done using `pandas`' `.isin()` method.
  
The `train` and `test` DataFrames are already available in your workspace.
  
---
  
1. Concatenate the `train` and `test` DataFrames into a single DataFrame `taxi`.
2. Convert the `"pickup_datetime"` column to a datetime object.
3. Create the day of week (using `.dayofweek` attribute) and hour (using `.hour` attribute) features from the `"pickup_datetime"` column.

In [12]:
train = pd.read_csv('../_datasets/taxi_train_chapter_4.csv')
test = pd.read_csv('../_datasets/taxi_test_chapter_4.csv')
print(train.shape)
print(test.shape)
train.head()

(20000, 8)
(9914, 7)


Unnamed: 0,id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,1,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,3,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,4,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [13]:
# Concatenate train and test together
taxi = pd.concat([train, test])
print(taxi.shape)
print(taxi.dtypes)

# Convert pickup date to datetime object
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])
print(taxi.dtypes)

# Create a day of week feature
taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek

# Create an hour feature
taxi['hour'] = taxi['pickup_datetime'].dt.hour

# Split back into train and test
new_train = taxi[taxi['id'].isin(train['id'])]
new_test = taxi[taxi['id'].isin(test['id'])]

new_train.head()

(29914, 8)
id                     int64
fare_amount          float64
pickup_datetime       object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        int64
dtype: object
id                                 int64
fare_amount                      float64
pickup_datetime      datetime64[ns, UTC]
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
dtype: object


Unnamed: 0,id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,dayofweek,hour
0,0,4.5,2009-06-15 17:26:21+00:00,-73.844311,40.721319,-73.84161,40.712278,1,0,17
1,1,16.9,2010-01-05 16:52:16+00:00,-74.016048,40.711303,-73.979268,40.782004,1,1,16
2,2,5.7,2011-08-18 00:35:00+00:00,-73.982738,40.76127,-73.991242,40.750562,2,3,0
3,3,7.7,2012-04-21 04:30:42+00:00,-73.98713,40.733143,-73.991567,40.758092,1,5,4
4,4,5.3,2010-03-09 07:51:00+00:00,-73.968095,40.768008,-73.956655,40.783762,1,1,7


Great! Now you know how to perform feature engineering for train and test DataFrames simultaneously. Having considered numerical and datetime features, move forward to master feature engineering for categorical ones!

## Categorical features
  
We started working with numerical features in the previous lesson. In this lesson, we will generate some new features from categorical variables.
  
**Label encoding**
  
Consider the example of a categorical feature shown below. The majority of machine learning models do not handle string values and categorical features automatically. So, before passing the data to the model, we need to encode the categorical features as meaningful numbers. There are lots of ways to do this. We'll consider two of the most popular options. The first one is label encoding. The idea is to map each category to an integer. In this case, A is mapped to 0, B is mapped to 1 and so on.
  
<center><img src='../_images/categorical-features-kaggle.png' alt='img' width='740'></center>
  
To apply label encoding we will use `LabelEncoder` from `sklearn`. First, create an object of this class. Then call the `.fit_transform()` method on the column to be encoded. Here `df` is the example DataFrame shown above. So, now we have label-encoded categories! The problem with label encoding is that we implicitly assume an ordering of the categories. For example, category C has label 2, which is higher than category A with label 0, although the categories have no natural order. Such an encoding is harmful to linear models, although it still works for tree-based models.
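
A minimal sketch of label encoding, assuming a toy DataFrame `df` with a hypothetical categorical column `cat`. `LabelEncoder` assigns integers in sorted order of the category values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical example DataFrame with one categorical column
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "cat": ["A", "B", "C", "A", "D"]})

# Create the encoder and encode the column in one step
le = LabelEncoder()
df["cat_enc"] = le.fit_transform(df["cat"])

# The fitted encoder remembers the categories and can map integers back
original = le.inverse_transform(df["cat_enc"])

print(df)
print(le.classes_)
```

The fitted encoder only remembers the last column it was fitted on, so a separate encoder per column is needed if the mapping has to be inverted later.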
  
<center><img src='../_images/categorical-features-kaggle1.png' alt='img' width='740'></center>
  
**One-Hot encoding**
  
To overcome the problem of ranking dependency between the categories, we could use one-hot encoding. In this type of encoding, we create a separate column for each of the categories. So, in this example, we created 4 columns instead of a single initial one. Then we set 1 for the corresponding category value and 0 for all other categories.
  
<center><img src='../_images/categorical-features-kaggle2.png' alt='img' width='740'></center>
  
There are multiple ways to implement one-hot encoding. We will consider the `pandas` `get_dummies()` function. We call it on the column to be encoded, specifying the `prefix=` parameter, which sets the prefix of the new column names. Then we drop the initial categorical column, because it is not needed anymore. Lastly, we concatenate the original features with the one-hot encoded features into a single DataFrame. The resulting DataFrame has the expected structure. The drawback of this approach shows up when the feature has a lot of different categories. For example, for a feature with 1,000 different categories we'll have to create 1,000 new columns.
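
The three steps (encode, drop, concatenate) can be sketched like this. The DataFrame `df`, the column `cat` and the prefix `ohe_cat` are hypothetical.

```python
import pandas as pd

# Hypothetical example DataFrame with one categorical column
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "cat": ["A", "B", "C", "A", "D"]})

# One new indicator column per category value
ohe = pd.get_dummies(df["cat"], prefix="ohe_cat")

# Drop the initial string column, which models cannot use directly
df = df.drop("cat", axis=1)

# Attach the indicator columns to the remaining features
df = pd.concat([df, ohe], axis=1)

print(df)
```

Each row has exactly one active indicator column, so the indicator values in a row always sum to one.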
  
<center><img src='../_images/categorical-features-kaggle3.png' alt='img' width='740'></center>
  
**Binary Features**
  
One special case of categorical features is binary features. It relates to categorical variables that have only two possible values. For example, Yes-No answers or whether some property is On or Off. For such features we always apply label encoding, substituting the first category with zero and the second category with one.
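
For a binary feature the encoding can be sketched with an explicit mapping. The sketch uses `Series.map` with a dictionary rather than `LabelEncoder`, which is a swapped-in choice that makes the assignment of 0 and 1 visible; the column name `binary_feat` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical binary feature with Yes/No answers
df = pd.DataFrame({"binary_feat": ["Yes", "No", "No", "Yes"]})

# Map the two categories to 0 and 1 explicitly
df["binary_feat_enc"] = df["binary_feat"].map({"No": 0, "Yes": 1})

print(df)
```

`LabelEncoder` would produce the same result here, because it numbers the categories in sorted order and `"No"` sorts before `"Yes"`.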
  
<center><img src='../_images/categorical-features-kaggle4.png' alt='img' width='740'></center>
  
**Other encoding approaches**
  
There is a long list of other categorical feature encoders.
  
<center><img src='../_images/categorical-features-kaggle5.png' alt='img' width='740'></center>
  
The most widely used on Kaggle is the target encoder. We will learn more about it in the next lesson.
  
**Let's practice!**
  
But for now, let's get some practical experience with label and one-hot encoders!

### Label encoding
  
Let's work on categorical variables encoding. You will again work with a subsample from the House Prices Kaggle competition.
  
Your objective is to encode categorical features `"RoofStyle"` and `"CentralAir"` using label encoding. The `train` and `test` DataFrames are already available in your workspace.
  
---
  
1. Concatenate `train` and `test` DataFrames into a single DataFrame `houses`.
2. Create a `LabelEncoder` object without arguments and assign it to `le`.
3. Create new label-encoded features for `"RoofStyle"` and `"CentralAir"` using the same `le` object.

In [14]:
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('../_datasets/house_prices_train.csv')
test = pd.read_csv('../_datasets/house_prices_test.csv')

# Concatenate train and test together
houses = pd.concat([train, test])

# Label encoder
le = LabelEncoder()

# Create new features
houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Look at new features
houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head()

Unnamed: 0,RoofStyle,RoofStyle_enc,CentralAir,CentralAir_enc
0,Gable,1,Y,1
1,Gable,1,Y,1
2,Gable,1,Y,1
3,Gable,1,Y,1
4,Gable,1,Y,1


All right! You can see that categorical variables have been label encoded. However, as you already know, label encoder is not always a good choice for categorical variables. Let's go further and apply One-Hot encoding.

### One-Hot encoding
  
The problem with label encoding is that it implicitly assumes that there is a ranking dependency between the categories. So, let's change the encoding method for the features `"RoofStyle"` and `"CentralAir"` to one-hot encoding. Again, the train and test DataFrames from House Prices Kaggle competition are already available in your workspace.
  
Recall that if you're dealing with binary features (categorical features with only two categories) it is suggested to apply label encoder only.
  
Your goal is to determine which of the mentioned features is not binary, and to apply one-hot encoding only to this one.
  
---
  
1. Determine the distribution of `"RoofStyle"` and `"CentralAir"` features using `pandas`' `.value_counts()` method.
2. Question  
Which of the features is binary?  
- [ ] "RoofStyle".
- [x] "CentralAir".
3. Since `"CentralAir"` is a binary feature, encode it with a label encoder (0 for one class and 1 for the other).
4. For the categorical feature `"RoofStyle"` let's use the one-hot encoder. Firstly, create one-hot encoded features using the `.get_dummies()` method. Then they are concatenated to the initial `houses` DataFrame.

In [15]:
# Concatenate train and test together
houses = pd.concat([train, test])

# Look at feature distributions
print(houses['RoofStyle'].value_counts(), '\n')
print(houses['CentralAir'].value_counts())

RoofStyle
Gable      2310
Hip         551
Gambrel      22
Flat         20
Mansard      11
Shed          5
Name: count, dtype: int64 

CentralAir
Y    2723
N     196
Name: count, dtype: int64


In [16]:
# Label encode binary 'CentralAir' feature
le = LabelEncoder()
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Create One-Hot encoded features
ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')

# Concatenate OHE features to houses
houses = pd.concat([houses, ohe], axis=1)

# Look at OHE features
houses[[col for col in houses.columns if 'RoofStyle' in col]].head(5)

Unnamed: 0,RoofStyle,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed
0,Gable,False,True,False,False,False,False
1,Gable,False,True,False,False,False,False
2,Gable,False,True,False,False,False,False
3,Gable,False,True,False,False,False,False
4,Gable,False,True,False,False,False,False


Congratulations! Now you've mastered one-hot encoding as well! The one-hot encoded features look as expected. Remember to drop the initial string column, because models will not handle it automatically. OK, we're done with simple categorical encoders. Let's move to the target encoder!

## Target encoding
  
We eventually come to one of the secret sauces of Kaggle competitions. It's called target encoding.
  
**High cardinality categorical features**
  
To begin with, let's discuss high cardinality categorical features. These are categorical features that have a large number of category values (as a rule of thumb, more than 10). A label encoder would encode each category with a separate number. With high cardinality, this gives a feature with lots of unordered integers. Another option is a one-hot encoder. In this case, we have to create a new feature for each category value, which means a large number of new columns. A better alternative to the two methods above is target encoding. Like a label encoder, it creates only a single column, but it also captures the relationship between the categories and the target variable.
  
- Label encoder provides only distinct number for each category
- One-hot encoder creates a new feature for each category value
  
**Mean target encoding**
  
There are various options for the encoding function. We will consider the most frequently used on Kaggle: the mean target encoding. Say we have a binary classification problem with a single categorical feature. On the left is our train data with known labels. On the right is our test data on which we'd like to make predictions.
  
<center><img src='../_images/target-encoding-kaggle.png' alt='img' width='740'></center>
  
To apply mean target encoding to a particular feature we need to perform the following steps: First, we calculate the mean target value for each category on the whole train data. Then we apply these statistics to the corresponding category in the test data. Next, we divide the train data into folds. For each fold, we calculate the target mean on all the folds except for this particular one. It's called 'out-of-fold' data. Further, out-of-fold statistics are applied to this particular fold. This prevents overfitting to the train set. Now, both train and test data have this new feature. So, we can add this mean target encoded feature to our model.
  
1. Calculate mean on the train, apply to the test
2. Split train into K folds, Calculate mean on (K-1) folds, apply to the K-th fold
3. Add mean target encoded feature to the model
  
**Calculate mean on the train**
  
To encode categories in the test data, we simply take the whole train data and calculate mean target values for each category.
  
In this case, for category A it equals about 0.67 (2 positive values out of 3 observations).
  
<center><img src='../_images/target-encoding-kaggle1.png' alt='img' width='740'></center>
  
And for category B it equals 0.25 (1 positive value out of 4 observations).
  
<center><img src='../_images/target-encoding-kaggle2.png' alt='img' width='740'></center>
  
**Test encoding**
  
These statistics are applied to the corresponding category in the test data. As a result, we've obtained a new feature.
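
The test encoding step can be sketched as follows. The rows are hypothetical and were chosen only to match the counts stated above (category A: 2 positives out of 3, category B: 1 positive out of 4); the column names are assumptions as well.

```python
import pandas as pd

# Hypothetical train data matching the stated counts
train = pd.DataFrame({
    "categorical": ["A", "A", "A", "B", "B", "B", "B"],
    "target":      [1,   1,   0,   0,   0,   1,   0],
})
test = pd.DataFrame({"categorical": ["A", "A", "B", "B"]})

# Mean target value per category, computed on the whole train data
category_means = train.groupby("categorical")["target"].mean()

# Apply the train statistics to the matching categories in the test data
test["mean_enc"] = test["categorical"].map(category_means)

print(category_means)
print(test)
```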
  
<center><img src='../_images/target-encoding-kaggle3.png' alt='img' width='740'></center>
  
**Train encoding using out-of-fold**
  
Now, we need to calculate this mean target encoded feature for the train data. As we said, we'll be using out-of-fold statistics. Let's split the train data into 2 folds: one and two.
  
<center><img src='../_images/target-encoding-kaggle4.png' alt='img' width='740'></center>
  
Take fold number 1. We calculate the target mean out of this fold, so using only fold number 2 observations.
  
<center><img src='../_images/target-encoding-kaggle5.png' alt='img' width='740'></center>
  
That's why category A obtains 0 and category B obtains 0.5.
  
<center><img src='../_images/target-encoding-kaggle6.png' alt='img' width='740'></center>
  
Now we calculate out-of-fold target means for the second fold using only the first fold observations.
  
<center><img src='../_images/target-encoding-kaggle7.png' alt='img' width='740'></center>
  
Thus, category A obtains 1 and category B obtains 0. We now have this mean encoded category in both the train and test data. So, we can use it as a new feature and pass to our model.
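
The out-of-fold calculation can be sketched by hand for two folds. The rows and the fold assignment are hypothetical, chosen to reproduce the values stated above; a real pipeline would typically generate the folds with `KFold` instead of a fixed `fold` column.

```python
import pandas as pd

# Hypothetical train data with a fixed assignment of rows to two folds
train = pd.DataFrame({
    "categorical": ["A", "A", "B", "B", "A", "B", "B"],
    "target":      [1,   1,   0,   0,   0,   1,   0],
    "fold":        [1,   1,   1,   1,   2,   2,   2],
})

train["mean_enc"] = float("nan")
for fold in train["fold"].unique():
    in_fold = train["fold"] == fold

    # Statistics come only from rows outside the current fold
    oof_means = train.loc[~in_fold].groupby("categorical")["target"].mean()

    # Apply the out-of-fold statistics to the rows of the current fold
    train.loc[in_fold, "mean_enc"] = train.loc[in_fold, "categorical"].map(oof_means)

print(train)
```

Because each row is encoded with statistics that never include its own target value, the encoded feature leaks less target information into training.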
  
<center><img src='../_images/target-encoding-kaggle8.png' alt='img' width='740'></center>
  
**Practical guides**
  
Before moving to practice, let's discuss some practical tips that are always applied together with mean target encoding.
  
The first one is smoothing. Initially, for a specific category, we took a simple mean. However, if we had some rare categories with only one or two values, they would get a strict 0 or 1 mean encoding. It could lead to overfitting. That's why we introduce regularization. We first calculate the global mean. It is the target mean value for the whole train data. Then assume that we add alpha new observations with this global mean to each category. Now, if the category is large, we will trust the mean encoding, otherwise, we will stick to the global mean. Alpha is a hyperparameter we have to specify manually. Usually, values from 5 to 10 work pretty well by default.
  
<center><img src='../_images/target-encoding-kaggle9.png' alt='img' width='740'></center>
  
Another practical tip concerns new categories in the test data. In this case, we do not know the target mean for the category. That's why new category values are simply filled with the global target mean.
  
Take a look at the example. Without smoothing, category A would get 0.5 and category B one third. However, with alpha equal to 5, category A gets about 0.43 and category B about 0.38, while the new category C in the test data gets the global mean, which equals 0.4.
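
These numbers can be checked with a short sketch. Smoothing here means computing `(category_sum + global_mean * alpha) / (category_size + alpha)` for each category. The rows are hypothetical, chosen to match the stated means (A: 1 positive out of 2, B: 1 positive out of 3, global mean 0.4).

```python
import pandas as pd

# Hypothetical train data matching the stated means
train = pd.DataFrame({
    "categorical": ["A", "A", "B", "B", "B"],
    "target":      [1,   0,   1,   0,   0],
})
# Category C never appears in the train data
test = pd.DataFrame({"categorical": ["A", "B", "C"]})

alpha = 5
global_mean = train["target"].mean()

# Smoothed mean: act as if alpha extra rows with the global mean were added
groups = train.groupby("categorical")["target"]
smoothed = (groups.sum() + global_mean * alpha) / (groups.size() + alpha)

# Unseen categories fall back to the global mean
test["mean_enc"] = test["categorical"].map(smoothed).fillna(global_mean)

print(smoothed)
print(test)
```

With only two or three observations per category, the smoothed values stay close to the global mean; for large categories the term `global_mean * alpha` has little influence.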
  
<center><img src='../_images/target-encoding-kaggle10.png' alt='img' width='740'></center>
  
**Let's practice!**
  
All right, let's transform these theoretical considerations into the Python code!

### Mean target encoding
  
First of all, you will create a function that implements mean target encoding. Remember that you need to develop the two following steps:
  
1. Calculate the mean on the train, apply to the test
2. Split train into K folds. Calculate the out-of-fold mean for each fold, apply to this particular fold
  
Each of these steps will be implemented in a separate function: `test_mean_target_encoding()` and `train_mean_target_encoding()`, respectively.
  
The final function `mean_target_encoding()` takes as arguments: the `train` and `test` DataFrames, the name of the categorical column to be encoded, the name of the target column and a smoothing parameter alpha. It returns two values: a new feature for train and test DataFrames, respectively.
  
---
  
1. You need to add smoothing to avoid overfitting. So, add the `alpha` parameter to the denominator in the `train_statistics` calculation.
2. You need to treat new categories in the test data. So, pass a global mean as an argument to the `.fillna()` method.
3. To calculate the train mean encoded feature you need to use out-of-fold statistics, splitting train into several folds. Specify the train and test indices for each validation split to access it.
4. Finally, you just calculate train and test target mean encoded features and return them from the function. So, return `train_feature` and `test_feature` obtained.

In [17]:
import pandas as pd
from sklearn.model_selection import KFold


def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    return test_feature.values


def train_mean_target_encoding(train, target, categorical, alpha=5):
    # Create 5-fold cross-validation
    kf = KFold(n_splits=5, random_state=123, shuffle=True)
    train_feature = pd.Series(index=train.index, dtype='float')
    
    # For each fold split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
        
        # Calculate out-of-fold statistics and apply to cv_test
        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target, 
                                                    categorical, alpha)
        
        # Save new feature for this particular fold
        train_feature.iloc[test_index] = cv_test_feature
    return train_feature.values


def mean_target_encoding(train, test, target, categorical, alpha=5):
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
    
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    
    # Return new features to add to the model
    return train_feature, test_feature

Great! You now have a function that performs mean target encoding of any categorical feature. Move on to see how to apply mean target encoding inside K-fold cross-validation using the `mean_target_encoding()` function you've just built!

### K-fold cross-validation
  
You will work with a binary classification problem on a subsample from a Kaggle playground competition. The objective of this competition is to predict whether the famous basketball player Kobe Bryant scored or missed a particular shot.
  
The train data is available in your workspace as the `bryant_shots` DataFrame. It contains data on 10,000 shots with their properties and a target variable `"shot_made_flag"`, which indicates whether the shot was scored or not.
  
One of the features in the data is `"game_id"`, the particular game in which the shot was made. There are 541 distinct games, so you are dealing with a high-cardinality categorical feature. Let's encode it using the target mean!
  
Suppose you're using 5-fold cross-validation and want to evaluate a mean target encoded feature on the local validation.
  
---
  
1. To achieve this, you need to repeat the encoding procedure for the `"game_id"` categorical feature inside each fold split separately. Your goal is to specify all the missing parameters for the `mean_target_encoding()` function call inside each fold split.
2. Recall that the `train` and `test` parameters expect the train and test DataFrames.
3. The `target` and `categorical` parameters expect the names of the target variable and of the categorical feature to be encoded.

In [20]:
bryant_shots = pd.read_csv('../_datasets/bryant_shots.csv')
print(bryant_shots.shape)

# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)

# For each fold split
for train_index, test_index in kf.split(bryant_shots):
    cv_train, cv_test = bryant_shots.iloc[train_index].copy(), bryant_shots.iloc[test_index].copy()
    
    # Create mean target encoded feature
    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(
        train=cv_train,
        test=cv_test,
        target='shot_made_flag',
        categorical='game_id',
        alpha=5
    )
    
    # Look at the encoding
    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))

print(bryant_shots.shape)

(10000, 10)
       game_id  shot_made_flag  game_id_enc
5245  20400027             0.0      0.47706
       game_id  shot_made_flag  game_id_enc
3780  20200842             0.0     0.343787


       game_id  shot_made_flag  game_id_enc
4649  20300500             1.0     0.311673
       game_id  shot_made_flag  game_id_enc
9333  20601057             1.0     0.285334
       game_id  shot_made_flag  game_id_enc
8442  20600340             1.0     0.421983
(10000, 10)


Nice! You can see different game encodings for each validation split in the output. The main takeaway: when using local cross-validation, you need to repeat the mean target encoding procedure inside each fold split separately. Go on to try other problem types beyond binary classification!

### Beyond binary classification
  
Of course, binary classification is just a single special case. Target encoding can be applied to any target variable type:
  
- For **binary classification**, mean target encoding is usually used
- For **regression**, the mean can be replaced by the median, quartiles, etc.
- For **multi-class classification** with $N$ classes, we create $N$ features with the target mean for each category in a one-vs-all fashion
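
The one-vs-all idea for the multi-class case can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the column names, and the helper `smoothed_mean_encode()` are hypothetical, and for brevity only the test-side encoding is shown (the train side would still use out-of-fold statistics).

```python
import pandas as pd


def smoothed_mean_encode(train, test, target, categorical, alpha=5):
    """Smoothed per-category mean of `target`, learned on train, applied to test."""
    global_mean = train[target].mean()
    groups = train.groupby(categorical)[target]
    stats = (groups.sum() + global_mean * alpha) / (groups.size() + alpha)
    return test[categorical].map(stats).fillna(global_mean)


# Hypothetical toy data: a 3-class target and one categorical feature
train = pd.DataFrame({
    "city":  ["X", "X", "Y", "Y", "Y", "Z"],
    "label": ["low", "high", "mid", "mid", "low", "high"],
})
test = pd.DataFrame({"city": ["X", "Y", "W"]})  # "W" is unseen in train

# One binary indicator per class, then one encoded feature per class
for cls in sorted(train["label"].unique()):
    indicator = f"is_{cls}"
    train[indicator] = (train["label"] == cls).astype(int)
    test[f"city_enc_{cls}"] = smoothed_mean_encode(train, test, indicator, "city")

print(test)
```

Because the class indicators sum to one in every row, the $N$ encoded features for a given category also sum to one, which makes for a quick sanity check.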
  
The `mean_target_encoding()` function you've created can be used for any of the target types above. Let's apply it to a regression problem, using the House Prices Kaggle competition as an example.
  
Your goal is to encode a categorical feature `"RoofStyle"` using mean target encoding. The `train` and `test` DataFrames are already available in your workspace.
  
---
  
1. Specify all the missing parameters for the `mean_target_encoding()` function call. The target variable name is `"SalePrice"`. Set the α hyperparameter to 10.
2. Recall that the `train=` and `test=` parameters expect the `train` and `test` DataFrames.
3. The `target` and `categorical` parameters expect the names of the target variable and of the feature to be encoded.

In [21]:
train = pd.read_csv('../_datasets/house_prices_train.csv')
test = pd.read_csv('../_datasets/house_prices_test.csv')

# Create mean target encoded feature
train['RoofStyle_enc'], test['RoofStyle_enc'] = mean_target_encoding(train=train,
                                                                     test=test, 
                                                                     target='SalePrice',
                                                                     categorical='RoofStyle',
                                                                     alpha=10)
# Look at the encoding
test[['RoofStyle', 'RoofStyle_enc']].drop_duplicates()

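<table>
  <tr><th></th><th>RoofStyle</th><th>RoofStyle_enc</th></tr>
  <tr><td>0</td><td>Gable</td><td>171565.947836</td></tr>
  <tr><td>1</td><td>Hip</td><td>217594.645131</td></tr>
  <tr><td>98</td><td>Gambrel</td><td>164152.950424</td></tr>
  <tr><td>133</td><td>Flat</td><td>188703.563431</td></tr>
  <tr><td>362</td><td>Mansard</td><td>180775.938759</td></tr>
  <tr><td>1053</td><td>Shed</td><td>188267.663242</td></tr>
</table>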


So, you can see that houses with the `Hip` roof are the most expensive, while houses with the `Gambrel` roof are the cheapest. That is exactly the goal of target encoding: you've encoded the categorical feature in such a way that the encoded values are now correlated with the target variable. We're done with categorical encoders. Now it's time to talk about missing data!

## Missing data
  
We've considered some feature types and how to create new features for them. In this lesson, we will deal with missing data.
  
**Missing data**
  
Some machine learning algorithms, like XGBoost or LightGBM, can handle missing data without any preprocessing. However, it is often worth implementing your own missing value imputation, since it can improve the model. For example, consider the data presented on the slide. Let's assume that we need to solve a binary classification problem with labels 0 and 1. We have one categorical feature and one numerical feature, and we'll consider how to impute the gaps in them. Here, the observations with IDs 4 and 5 have missing values. Note that missing values are denoted as `NaN` in `pandas` DataFrames.
  
<center><img src='../_images/missing-data-kaggle.png' alt='img' width='740'></center>
  
**Impute missing data**
  
For numerical features, the simplest method is mean or median imputation. It means that we fill each missing value with the mean or median of the available observations.
  
<center><img src='../_images/missing-data-kaggle1.png' alt='img' width='740'></center>
  
In this example, we would change the missing value to 4.72. However, imputation with the mean or median just assigns an average observation to the missing value, so we lose the information that this value was actually missing. To emphasize that the data was missing, special constant values are sometimes used.
  
<center><img src='../_images/missing-data-kaggle2.png' alt='img' width='740'></center>
  
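One example is -999. Such a value is not a good choice for linear models, which would treat it as an extreme value of the feature, but it usually works well for tree-based models, which can isolate it with a single split.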
  
<center><img src='../_images/missing-data-kaggle3.png' alt='img' width='740'></center>
  
To impute categorical features, we again have two choices. The first is to fill in the most frequent category in the data.
  
<center><img src='../_images/missing-data-kaggle4.png' alt='img' width='740'></center>
  
In this example, that would be category A. The second is to create a new category for the missing values. This again lets the model know that the observation had a missing value.
  
<center><img src='../_images/missing-data-kaggle5.png' alt='img' width='740'></center>
  
For example, create a new category `'MISS'` and fill in the missing value with it.
  
<center><img src='../_images/missing-data-kaggle6.png' alt='img' width='740'></center>
  
1. Numerical data
   - Mean/median imputation
   - Constant value imputation

2. Categorical data
   - Most frequent category imputation
   - New category imputation
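
As a quick illustration, the four strategies above can be sketched with plain `pandas` `.fillna()` calls (a swapped-in shortcut, not the scikit-learn approach used later in this lesson). The toy values and column names are hypothetical and are not the ones from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
df = pd.DataFrame({
    "categorical": ["A", "A", "B", np.nan, "A"],
    "numerical":   [1.0, 2.0, 3.0, 6.0, np.nan],
})

# Numerical data: mean, median, or a special constant
num_mean = df["numerical"].fillna(df["numerical"].mean())
num_median = df["numerical"].fillna(df["numerical"].median())
num_constant = df["numerical"].fillna(-999)

# Categorical data: most frequent category or a new category
cat_mode = df["categorical"].fillna(df["categorical"].mode()[0])
cat_new = df["categorical"].fillna("MISS")

print(pd.DataFrame({
    "num_mean": num_mean,
    "num_median": num_median,
    "num_constant": num_constant,
    "cat_mode": cat_mode,
    "cat_new": cat_new,
}))
```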
  
**Find missing data**
  
Let `df` be the `pandas` DataFrame from the example table on the previous slides. The `.isnull()` method returns a DataFrame of Booleans: `True` where the value is missing and `False` where it is present. Therefore, we can call the `.sum()` method on this DataFrame to obtain the number of missing values in each column. In this case, there is one missing value in the categorical feature and one in the numerical feature.
  
<center><img src='../_images/missing-data-kaggle7.png' alt='img' width='740'></center>
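
A minimal sketch of this check is shown here. The DataFrame is a hypothetical stand-in for the slide's table, so the values and column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example table
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "categorical": ["A", "A", "B", np.nan, "A"],
    "numerical": [1.0, 2.0, 3.0, 6.0, np.nan],
    "target": [1, 0, 1, 0, 0],
})

# Boolean DataFrame: True where a value is missing
missing_mask = df.isnull()

# Summing the Booleans column-wise counts the missing values per column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```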
  
**Numerical missing data**
  
Let's now consider the Python implementation. Again, we will use the scikit-learn package. Import `SimpleImputer` from the `sklearn.impute` module. To impute numerical data, we create an object of this class. For mean imputation, we set the `strategy=` parameter to `'mean'`. For constant imputation, we set `strategy=` to `'constant'` and specify the fill value through `fill_value=` (in this example, -999). Finally, we impute the values by applying the `.fit_transform()` method to the selected columns. Note that multiple columns can be imputed simultaneously by passing a list of columns. Even for a single column, we have to use double brackets, because the imputer expects two-dimensional input.
  
<center><img src='../_images/missing-data-kaggle8.png' alt='img' width='740'></center>
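
In case the slide image is unavailable, the steps can be sketched as follows. The DataFrame and its column names are hypothetical, and the snippet assumes scikit-learn is installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical columns with gaps
df = pd.DataFrame({
    "num":  [1.0, 2.0, np.nan, 6.0],
    "num2": [10.0, np.nan, 30.0, 50.0],
})

# Mean imputation on several columns at once
mean_imputer = SimpleImputer(strategy="mean")
df_mean = df.copy()
df_mean[["num", "num2"]] = mean_imputer.fit_transform(df_mean[["num", "num2"]])

# Constant imputation on a single column (note the double brackets)
constant_imputer = SimpleImputer(strategy="constant", fill_value=-999)
df_const = df.copy()
df_const[["num"]] = constant_imputer.fit_transform(df_const[["num"]])

print(df_mean)
print(df_const)
```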
  
**Categorical missing data**
  
Imputation of categorical missing data is very similar. Again, we can use two different strategies: the most frequent category or a constant category, for example, the category `'MISS'` for the missing data. Then we apply the selected imputer to the list of columns we'd like to impute.
  
<center><img src='../_images/missing-data-kaggle9.png' alt='img' width='740'></center>
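
A comparable sketch for the categorical case is shown here. The column name and values are hypothetical, and the column is stored with `object` dtype so that the gap is a plain `NaN`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one gap
df = pd.DataFrame({"cat": pd.Series(["A", "A", "B", np.nan, "A"], dtype=object)})

# Strategy 1: fill with the most frequent category
frequent_imputer = SimpleImputer(strategy="most_frequent")
df["cat_most_frequent"] = frequent_imputer.fit_transform(df[["cat"]]).ravel()

# Strategy 2: fill with a new constant category
constant_imputer = SimpleImputer(strategy="constant", fill_value="MISS")
df["cat_new_category"] = constant_imputer.fit_transform(df[["cat"]]).ravel()

print(df)
```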
  
**Let's practice!**
  
So, now you know the approaches to imputing missing data. Let's put them into practice!

### Find missing data
  
Let's impute missing data on a real Kaggle dataset. For this purpose, you will use a data subsample from the Kaggle "Two Sigma Connect: Rental Listing Inquiries" competition.
  
Before proceeding with any imputation, you need to know the number of missing values in each feature. Moreover, if a feature has missing values, you should explore its type.
  
---
  
1. Read the `"twosigma_train.csv"` file using `pandas`.
2. Find the number of missing values in each column.
3. Select the columns with the missing values and look at the head of the DataFrame.

In [22]:
# Read dataframe
twosigma = pd.read_csv('../_datasets/twosigma_rental_train_null.csv')

# Find the number of missing values in each column
print(twosigma.isnull().sum())

# Look at the head of the columns with missing values
twosigma[['building_id', 'price']].head()

id                 0
bathrooms          0
bedrooms           0
building_id       13
latitude           0
longitude          0
manager_id         0
price             32
interest_level     0
dtype: int64


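<table>
  <tr><th></th><th>building_id</th><th>price</th></tr>
  <tr><td>0</td><td>53a5b119ba8f7b61d4e010512e0dfc85</td><td>3000.0</td></tr>
  <tr><td>1</td><td>c5c8a357cba207596b04d1afd1e4f130</td><td>5465.0</td></tr>
  <tr><td>2</td><td>c3ba40552e2120b0acfc3cb5730bb2aa</td><td>2850.0</td></tr>
  <tr><td>3</td><td>28d9ad350afeaab8027513a3e52ac8d5</td><td>3275.0</td></tr>
  <tr><td>4</td><td>NaN</td><td>3350.0</td></tr>
</table>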


All right, you've found out that the `'building_id'` and `'price'` columns have missing values. Looking at the head of the DataFrame, we may conclude that `'price'` is a numerical feature, while `'building_id'` is a categorical feature that identifies buildings by hashes.
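
Instead of reading the types off the head of the DataFrame, they can also be checked programmatically. The following is a small sketch on a hypothetical stand-in DataFrame (`listings` and its values are assumptions, not the competition data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the rental listings data
listings = pd.DataFrame({
    "bedrooms": [1, 2, 3],
    "building_id": ["b1", np.nan, "b2"],
    "price": [3000.0, 5465.0, np.nan],
})

# Keep only the columns that contain at least one missing value
cols_with_missing = listings.columns[listings.isnull().any()]

# Inspect their dtypes and number of distinct values to judge the feature type
print(listings[cols_with_missing].dtypes)
print(listings[cols_with_missing].nunique())
```

A numeric dtype points to a numerical feature, while a string or `object` dtype with many distinct values suggests a categorical identifier.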

### Impute missing data
  
You've found that the `"price"` and `"building_id"` columns have missing values in the Rental Listing Inquiries dataset. So, before passing the data to the models, you need to impute these values.
  
The numerical feature `"price"` will be imputed with the mean of the non-missing prices.
  
Imputing the categorical feature `"building_id"` with the most frequent category is a bad idea, because it would mean that all the apartments with a missing `"building_id"` are located in the most popular building. A better idea is to impute it with a new category.
  
The competition data is available in the `twosigma` DataFrame loaded in the previous exercise.
  
---
  
1. Create a `SimpleImputer` object with `strategy=` set to `"mean"`. Impute missing prices with the mean value.
2. Create an imputer with `strategy=` set to `"constant"`. Use `"MISSING"` as `fill_value=`. Impute missing buildings with this constant value.

In [23]:
from sklearn.impute import SimpleImputer

# Create mean imputer
mean_imputer = SimpleImputer(strategy='mean')

# Price imputation
twosigma[['price']] = mean_imputer.fit_transform(twosigma[['price']])

# Create constant imputer
constant_imputer = SimpleImputer(strategy='constant', fill_value='MISSING')

# building_id imputation
twosigma[['building_id']] = constant_imputer.fit_transform(twosigma[['building_id']])

In [24]:
twosigma.isnull().sum()

id                0
bathrooms         0
bedrooms          0
building_id       0
latitude          0
longitude         0
manager_id        0
price             0
interest_level    0
dtype: int64

Nice! The data no longer contains missing values, so it is ready for the modeling step. Move on to the next chapter to build and improve your models!