**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**

---


# Introduction

In this exercise you'll apply more advanced encodings to encode the categorical variables ito improve your classifier model. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

You'll refit the classifier after each encoding to check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
# This can take a few seconds, thanks for your patience
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex2 import *

clicks = pd.read_parquet('../input/feature-engineering-data/baseline_data.pqt')

Here I'll define a couple functions to help test the new encodings.

In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [3]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Baseline model
Training model!
Validation AUC score: 0.9622743228943659


### 1) Categorical encodings and leakage

These encodings are all based on statistics calculated from the dataset like counts and means. Considering this, what data should you be using to calculate the encodings?

Uncomment the following line after you've decided your answer.

In [4]:
q_1.solution()

<IPython.core.display.Javascript object>

<span style="color:#33cc99">Solution:</span> You should calculate the encodings from the training set only. If you include data from the validation and test sets into the encodings, you'll overestimate the model's performance. You should in general be vigilant to avoid leakage, that is, including any information from the validation and test sets into the model. For a review on this topic, see our lesson on [data leakage](https://www.kaggle.com/alexisbcook/data-leakage)

### 2) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Using `CountEncoder` from the `category_encoders` library, fit the encoding using the categorical feature columns defined in `cat_features`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [5]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix("_count"))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix("_count"))

q_2.check()
train_encoded

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_count,app_count,device_count,os_count,channel_count
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,68,292254,1648091,370652,26760
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,4,60114,1648091,370652,41256
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,118,19564,1648091,370652,31221
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,29,292254,1648091,370652,26760
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,31,292254,1648091,370652,26760
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1840445,11718,18,1,17,26,2017-11-09 04:50:14,,0,9,4,50,14,28,129507,1648091,87078,39586
1840449,32849,15,1,19,75,2017-11-09 04:50:14,,0,9,4,50,14,51,140369,1648091,420998,84548
1840446,249422,48,1,19,103,2017-11-09 04:50:14,2017-11-09 04:53:15,1,9,4,50,14,1,845,1648091,420998,7403
1840443,232256,19,6,21,59,2017-11-09 04:50:14,2017-11-09 04:50:59,1,9,4,50,14,1,104701,492,15303,107747


In [6]:
# Uncomment if you need some guidance
#q_2.hint()
#q_2.solution()

In [7]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9653051135205329


Count encoding improved our model's score!

### 3) Why is count encoding effective?
At first glance, it could be surprising that Count Encoding helps make accurate models. 
Why do you think is count encoding is a good idea, or how does it improve the model score?

Uncomment the following line after you've decided your answer.

In [8]:
q_3.solution()

<IPython.core.display.Javascript object>

<span style="color:#33cc99">Solution:</span> 
    Rare values tend to have similar counts (with values like 1 or 2), so you can classify rare 
    values together at prediction time. Common values with large counts are unlikely to have 
    the same exact count as other values. So, the common/important values get their own 
    grouping.
    

### 4) Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.

In [9]:
cat_features = ['ip', 'app', 'device', 'os', 'channel'] #Validation AUC score: 0.9540530347873288
# cat_features = ['app', 'device', 'os', 'channel'] #Validation AUC score: 0.9627457957514338
train, valid, test = get_data_splits(clicks)

# Create the target encoder. You can find this easily by using tab completion.
# Start typing ce. the press Tab to bring up a list of classes and functions.
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set. Use the 'is_attributed' column as the target.
target_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

q_4.check()
train_encoded

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_target,app_target,device_target,os_target,channel_target
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,0.073529,0.028328,0.152087,0.138712,0.034043
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,0.247523,0.995841,0.152087,0.138712,0.950262
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,0.135593,0.009252,0.152087,0.138712,0.019378
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,0.103448,0.028328,0.152087,0.138712,0.034043
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,0.096774,0.028328,0.152087,0.138712,0.034043
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1840445,11718,18,1,17,26,2017-11-09 04:50:14,,0,9,4,50,14,0.071429,0.048499,0.152087,0.109913,0.043930
1840449,32849,15,1,19,75,2017-11-09 04:50:14,,0,9,4,50,14,0.098039,0.021686,0.152087,0.157243,0.009202
1840446,249422,48,1,19,103,2017-11-09 04:50:14,2017-11-09 04:53:15,1,9,4,50,14,0.197764,0.951479,0.152087,0.157243,0.970012
1840443,232256,19,6,21,59,2017-11-09 04:50:14,2017-11-09 04:50:59,1,9,4,50,14,0.197764,0.948415,0.953252,0.953408,0.953326


In [10]:
# Uncomment these if you need some guidance
#q_4.hint()
#q_4.solution()

In [11]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9540530347873288


### 5) Try removing IP encoding

Try leaving `ip` out of the encoded features and retrain the model with target encoding again. You should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?

Uncomment the following line after you've decided your answer.

In [12]:
q_5.solution()

<IPython.core.display.Javascript object>

<span style="color:#33cc99">Solution:</span> 
    Target encoding attempts to measure the population mean of the target for each 
    level in a categorical feature. This means when there is less data per level, the 
    estimated mean will be further away from the "true" mean, there will be more variance. 
    There is little data per IP address so it's likely that the estimates are much noisier
    than for the other features. The model will rely heavily on this feature since it is 
    extremely predictive. This causes it to make fewer splits on other features, and those
    features are fit on just the errors left over accounting for IP address. So, the 
    model will perform very poorly when seeing new IP addresses that weren't in the 
    training data (which is likely most new data). Going forward, we'll leave out the IP feature when trying
    different encodings.
    

### 6) CatBoost Encoding

The CatBoost encoder is supposed to working well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.

In [13]:
train, valid, test = get_data_splits(clicks)
cb_features = ['app', 'device', 'os', 'channel']

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cb_features)

# Learn encoding from the training set
cb_enc.fit(train[cb_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cb_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cb_features]).add_suffix('_cb'))
q_6.check()
train_encoded

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,app_cb,device_cb,os_cb,channel_cb
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,13,23,0.028329,0.152087,0.138712,0.034049
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,0.995828,0.152087,0.138712,0.950244
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,42,32,0.009261,0.152087,0.138712,0.019384
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,56,17,0.028329,0.152087,0.138712,0.034049
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,57,1,0.028329,0.152087,0.138712,0.034049
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1840445,11718,18,1,17,26,2017-11-09 04:50:14,,0,9,4,50,14,0.048500,0.152087,0.109914,0.043934
1840449,32849,15,1,19,75,2017-11-09 04:50:14,,0,9,4,50,14,0.021687,0.152087,0.157243,0.009204
1840446,249422,48,1,19,103,2017-11-09 04:50:14,2017-11-09 04:53:15,1,9,4,50,14,0.950588,0.152087,0.157243,0.969908
1840443,232256,19,6,21,59,2017-11-09 04:50:14,2017-11-09 04:50:59,1,9,4,50,14,0.948408,0.951720,0.953358,0.953319


In [14]:
# Uncomment these if you need some guidance
#q_6.hint()
#q_6.solution()

In [15]:
_ = train_model(train, valid)

Training model!
Validation AUC score: 0.9622743228943659


The CatBoost encodings work the best, so we'll keep those.

In [16]:
encoded = cb_enc.transform(clicks[cb_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

## Categorical feature embeddings with SVD

Now you'll create embeddings from pairs of columns using SVD to learn from a count matrix.

In [17]:
import itertools
from sklearn.decomposition import TruncatedSVD

### 7) Learn SVD components as embeddings.

Here you'll use SVD to learn embeddings for the categorical features from a matrix of counts. First, create the SVD transformer with `TruncatedSVD`. Then for each pair of features, create a count matrix and learn the SVD components. Remember you should be learning the embeddings from the train dataset to avoid leakage.

In [18]:
train, valid, test = get_data_splits(clicks)
cat_features = ['app', 'device', 'os', 'channel']

# Create the SVD transformer
svd = TruncatedSVD(n_components=5, random_state=7)

# Learn SVD feature vectors and store in svd_components as DataFrames
# Make sure you're only using the train set!
svd_components = {}
for col1, col2 in itertools.permutations(cat_features, 2):
    # Create the count matrix
    pair_counts = train.groupby([col1, col2])['is_attributed'].count()
    pair_matrix = pair_counts.unstack(fill_value=0)
    
    # Fit the SVD and transform to get the components
    svd_comp = pd.DataFrame(svd.fit_transform(pair_matrix))
    
    # Store the components in the dictionary. 
    svd_components['_'.join([col1, col2])] = svd_comp

q_7.check()
svd_components

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

{'app_device':                 0             1             2             3         4
 0    1.300105e-10  2.901272e+02  4.654852e-07  7.549226e-06  0.015482
 1    4.567491e+04 -5.192989e-07 -7.026271e+02  1.597962e+02  6.190715
 2    1.726700e+05 -2.828537e-06 -1.498031e+03  4.743132e+01 -5.236503
 3    2.832167e+05 -4.486057e-06  1.385305e+03 -1.540166e+03  3.161593
 4    4.904132e+02  6.423896e-08 -4.725334e-01  1.331522e+00  2.302140
 ..            ...           ...           ...           ...       ...
 366  9.992396e-01 -1.569303e-11  3.458743e-02 -1.799640e-02  0.000058
 367  4.239255e-13  9.982279e-01 -3.308542e-09 -1.665186e-08 -0.000022
 368  9.992396e-01 -1.569303e-11  3.458743e-02 -1.799640e-02  0.000058
 369  4.239255e-13  9.982279e-01 -3.308542e-09 -1.665186e-08 -0.000022
 370  9.992396e-01 -1.569303e-11  3.458743e-02 -1.799640e-02  0.000058
 
 [371 rows x 5 columns],
 'app_os':                  0           1          2            3            4
 0         0.200938  246.847

In [19]:
# Uncomment these if you need some guidance
#q_7.hint()
#q_7.solution()

### 8) Encode categorical features with SVD components

With the components learned from the train dataset, encode the categorical features and create a dataframe `svd_encodings`. The columns need to be named with the feature pair, `svd`, and the component index, such as `"os_device_svd_0"`.

In [20]:
svd_encodings = pd.DataFrame(index=clicks.index)

for feature in svd_components:
    # Get the feature column the SVD components are encoding
    col = feature.split('_')[0]
    feature_components = svd_components[feature]
    
    ## Use SVD components to encode the categorical features
    comp_cols = feature_components.reindex(clicks[col]).set_index(clicks.index)
    
    # Add prefix to column names
    svd_cols = comp_cols.add_prefix(feature + '_svd_')

    # Join encoded features with svd_encodings
    svd_encodings = svd_encodings.join(svd_cols)

# Fill null values with the mean
svd_encodings = svd_encodings.fillna(svd_encodings.mean())

q_8.check()
svd_encodings

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

Unnamed: 0,app_device_svd_0,app_device_svd_1,app_device_svd_2,app_device_svd_3,app_device_svd_4,app_os_svd_0,app_os_svd_1,app_os_svd_2,app_os_svd_3,app_os_svd_4,...,channel_device_svd_0,channel_device_svd_1,channel_device_svd_2,channel_device_svd_3,channel_device_svd_4,channel_os_svd_0,channel_os_svd_1,channel_os_svd_2,channel_os_svd_3,channel_os_svd_4
0,283216.718138,-4.486057e-06,1385.304677,-1540.165516,3.161593,102150.019578,-8.633965,51.333141,59.647089,1279.646762,...,26587.195203,-711.013555,754.972091,-397.795706,39.706117,9605.064364,-639.692285,56.155759,-338.290917,334.849976
1,60058.351768,-8.708913e-07,2079.688316,-1077.069410,6.152669,26427.370290,-2.390482,-1421.668517,-1635.306384,-1793.084266,...,41142.467644,-1093.126005,1331.917842,-669.993124,68.997836,19689.842015,-1232.489887,-1175.734245,-1840.713269,2246.722843
2,18740.410320,-2.848323e-07,-125.681549,-10.350328,0.345621,6639.894162,-0.542968,-32.530579,268.887232,550.518065,...,772.546174,-20.761715,16.562638,-9.748269,0.999530,78.679008,-6.979560,169.783860,-9.697992,3.655545
3,283216.718138,-4.486057e-06,1385.304677,-1540.165516,3.161593,102150.019578,-8.633965,51.333141,59.647089,1279.646762,...,26587.195203,-711.013555,754.972091,-397.795706,39.706117,9605.064364,-639.692285,56.155759,-338.290917,334.849976
4,283216.718138,-4.486057e-06,1385.304677,-1540.165516,3.161593,102150.019578,-8.633965,51.333141,59.647089,1279.646762,...,26587.195203,-711.013555,754.972091,-397.795706,39.706117,9605.064364,-639.692285,56.155759,-338.290917,334.849976
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300556,172669.961330,-2.828537e-06,-1498.030919,47.431321,-5.236503,63061.997716,-5.509504,95.700139,-1133.038948,-1611.494215,...,122.497195,-3.561528,-11.890012,3.698852,0.295177,48.520465,-3.325046,0.580528,0.197144,-0.480342
2300557,79928.764567,-1.205976e-06,1545.895064,-919.183624,4.333729,28752.787540,-2.443380,158.195767,-633.909891,-848.521571,...,95.230290,-2.698418,-5.307795,1.314516,-0.096705,41.842097,-2.662455,-1.770468,-3.110683,4.310531
2300558,189787.724105,-2.947984e-06,-3215.582416,733.272197,-4.208618,70473.135089,-5.627409,-709.120366,250.891669,2240.970109,...,40161.115471,-1130.602497,-1848.755738,431.205539,-28.881444,16899.283067,-1150.141776,-828.147677,-32.152357,-603.239424
2300559,189787.724105,-2.947984e-06,-3215.582416,733.272197,-4.208618,70473.135089,-5.627409,-709.120366,250.891669,2240.970109,...,27273.494469,-766.254400,-1173.615189,259.866933,-17.775070,10910.902308,-732.890408,-98.370534,-268.506719,56.509892


In [21]:
# Uncomment these if you need some guidance
#q_8.hint()
#q_8.solution()

Test the encoded data.

In [22]:
train, valid, test = get_data_splits(clicks.join(svd_encodings))
_ = train_model(train, valid)

Training model!
Validation AUC score: 0.9632510172357092


# Keep Going

Now you are ready to **[generating completely new features](https://www.kaggle.com/matleonard/feature-generation)** from the data itself.

---
**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**





*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*