## Categorical Feature Encoding Challenge WITH PYTHON
[Crislânio Macêdo](https://medium.com/sapere-aude-tech) - December 31, 2019

Kaggle kernel: [🐱 CatComp - Simple Target Encoding](https://www.kaggle.com/caesarlupum/catcomp-simple-target-encoding)

----------

## About this Competition

![](https://imgflip.com/s/meme/Smiling-Cat.jpg)

> #### In this competition, you will be predicting the probability [0, 1] of a binary target column.

The data contains binary features (`bin_*`), nominal features (`nom_*`), ordinal features (`ord_*`), and (potentially cyclical) day-of-the-week and month features. The string ordinal features `ord_3` to `ord_5` are lexically ordered according to `string.ascii_letters`.

Since the purpose of this competition is to explore various encoding strategies, the data has been simplified: (1) there are no missing values, and (2) the test set does not contain any unseen feature values. (Of course, in real-world settings both of these factors are often important to consider!)
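As a small illustration of the ordering rule for the string ordinal features, here is a minimal sketch that ranks values by their position in `string.ascii_letters` (lowercase first, then uppercase). The helper name `ascii_letters_rank` and the toy values are assumptions made for illustration. Ranking only the values that actually occur in a column is a design choice of this sketch, not something the competition prescribes.

```python
import string

import pandas as pd

# Position of each letter in string.ascii_letters ('a'..'z', then 'A'..'Z')
LETTER_POS = {ch: i for i, ch in enumerate(string.ascii_letters)}


def ascii_letters_rank(col: pd.Series) -> pd.Series:
    """Map each distinct string to its rank under string.ascii_letters ordering.

    Hypothetical helper: multi-letter values (as in ord_5) are compared
    letter by letter, like ordinary lexical sorting with a custom alphabet.
    """
    def sort_key(value):
        return [LETTER_POS[ch] for ch in value]

    ordered = sorted(col.unique(), key=sort_key)
    rank = {value: i for i, value in enumerate(ordered)}
    return col.map(rank)


# Toy values in the style of ord_4 and ord_5
demo = pd.DataFrame({
    "ord_4": ["D", "A", "R", "D"],
    "ord_5": ["kr", "bF", "Jc", "kW"],
})
encoded = demo.apply(ascii_letters_rank)  # applied column by column
print(encoded)
```

Note that under this alphabet every lowercase letter sorts before every uppercase one, so `kW` comes before `Jc`.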

#### Files
- train.csv - the training set
- test.csv - the test set; you must make predictions against this data
- sample_submission.csv - a sample submission file in the correct format

#### Target encoding, as implemented in contrib.scikit-learn.org/categorical-encoding, can be powerful, especially for encoding high-cardinality categorical features.
> This implementation assumes that the target is ordinal (which is the case here, as the outcome is binary, but is often not the case in multiclass classification).

Here we use it for all features as a starting point, but many of those features might better contribute to the overall predictive power when encoded with alternative techniques.

We encode out-of-fold with k-fold cross-validation to mitigate target leakage, which would otherwise almost certainly lead to overfitting. Alternatively, we could hold out part of the training set, but given its small size, a resampling technique seems preferable.
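To make the leakage concern concrete, here is a minimal sketch on synthetic data. It uses a high-cardinality feature that carries no information about the target and encodes it two ways: naively, on the same rows used to compute the category means, and out-of-fold. Everything here is an assumption for illustration: the column names, the plain random folds (rather than `StratifiedKFold`), and the unsmoothed category means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_categories, n_folds = 2000, 500, 5

# A high-cardinality feature that is pure noise with respect to the target
df = pd.DataFrame({
    "cat": rng.integers(0, n_categories, size=n_rows),
    "target": (rng.random(n_rows) < 0.3).astype(int),
})

# Naive encoding: category means computed on the very rows they are applied to
naive = df.groupby("cat")["target"].transform("mean")

# Out-of-fold encoding: each row is encoded with means from the other folds only
fold = rng.permutation(n_rows) % n_folds
oof = pd.Series(np.nan, index=df.index)
for k in range(n_folds):
    fit_rows, held_out = df[fold != k], df[fold == k]
    means = fit_rows.groupby("cat")["target"].mean()
    prior = fit_rows["target"].mean()  # fallback for categories unseen in fit_rows
    oof.loc[held_out.index] = held_out["cat"].map(means).fillna(prior)

naive_corr = np.corrcoef(naive, df["target"])[0, 1]
oof_corr = np.corrcoef(oof, df["target"])[0, 1]
print(f"naive: {naive_corr:.3f}  out-of-fold: {oof_corr:.3f}")
```

The naive encoding should correlate clearly with the target even though the feature is noise, because each row's own label leaks into its encoded value. The out-of-fold version should stay close to zero.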




## Import libs

In [1]:
import pandas as pd
import numpy as np
import category_encoders as ce
import lightgbm as lgb
from sklearn import linear_model
from sklearn.model_selection import StratifiedKFold
import gc

In [2]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML


## Read datasets

In [3]:
train = pd.read_csv('../input/cat-in-the-dat/train.csv')
test = pd.read_csv('../input/cat-in-the-dat/test.csv')
# Class shares computed from the data rather than a hard-coded row count
class_share = train['target'].value_counts(normalize=True)
print(class_share[0], class_share[1])
train.sort_index(inplace=True)
train_y = train['target']
test_id = test['id']
train.drop(['target', 'id'], axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

0.69412 0.30588


In [4]:
train

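| | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | ... | nom_8 | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | T | Y | Green | Triangle | Snake | Finland | Bassoon | ... | c389000ab | 2f4cb3d51 | 2 | Grandmaster | Cold | h | D | kr | 2 | 2 |
| 1 | 0 | 1 | 0 | T | Y | Green | Trapezoid | Hamster | Russia | Piano | ... | 4cd920251 | f83c56c21 | 1 | Grandmaster | Hot | a | A | bF | 7 | 8 |
| 2 | 0 | 0 | 0 | F | Y | Blue | Trapezoid | Lion | Russia | Theremin | ... | de9c9f684 | ae6800dd0 | 1 | Expert | Lava Hot | h | R | Jc | 7 | 2 |
| 3 | 0 | 1 | 0 | F | Y | Red | Trapezoid | Snake | Canada | Oboe | ... | 4ade6ab69 | 8270f0d71 | 1 | Grandmaster | Boiling Hot | i | D | kW | 2 | 1 |
| 4 | 0 | 0 | 0 | F | N | Red | Trapezoid | Lion | Canada | Oboe | ... | cb43ab175 | b164b72a7 | 1 | Grandmaster | Freezing | a | R | qP | 7 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299995 | 0 | 0 | 0 | T | N | Red | Trapezoid | Snake | India | Oboe | ... | 7508f4ef1 | e027decef | 1 | Contributor | Freezing | k | K | dh | 3 | 8 |
| 299996 | 0 | 0 | 0 | F | Y | Green | Trapezoid | Lion | Russia | Piano | ... | 397dd0274 | 80f1411c8 | 2 | Novice | Freezing | h | W | MO | 3 | 2 |
| 299997 | 0 | 0 | 0 | F | Y | Blue | Star | Axolotl | Russia | Oboe | ... | 5d7806f53 | 314dcc15b | 3 | Novice | Boiling Hot | o | A | Bn | 7 | 9 |
| 299998 | 0 | 1 | 0 | F | Y | Green | Square | Axolotl | Costa Rica | Piano | ... | 1f820c7ce | ab0ce192b | 1 | Master | Boiling Hot | h | W | uJ | 3 | 8 |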


# Encoding the Features

## Target encoding
 		
Target-based encoding converts a categorical variable to a number using the target. We replace the categorical variable with a single numerical variable in which each category is mapped to the probability of the target in that category (for a categorical target) or to the average of the target (for a numerical target). The main drawbacks of this method are its dependence on the distribution of the target and its lower predictive power compared to the binary encoding method.

For example:
![](http://www.renom.jp/notebooks/tutorial/preprocessing/category_encoding/renom_cat_target.png)
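The same idea can be sketched in a few lines of pandas. The toy rows below are made up for illustration. The smoothed variant uses additive (m-estimate) smoothing toward the global prior, which is one common scheme and not necessarily the one `ce.TargetEncoder` applies internally. The weight `m` is a hypothetical parameter of this sketch.

```python
import pandas as pd

# Made-up rows: one nominal feature and a binary target
toy = pd.DataFrame({
    "nom_0": ["Green", "Green", "Blue", "Red", "Red", "Red"],
    "target": [1, 0, 1, 0, 0, 1],
})

prior = toy["target"].mean()  # global probability of target == 1
stats = toy.groupby("nom_0")["target"].agg(["mean", "count"])

# Additive smoothing: rare categories are pulled toward the prior
m = 2.0  # hypothetical smoothing weight (acts like m pseudo-rows at the prior)
stats["smoothed"] = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

# Replace each category with its (smoothed) target probability
toy["nom_0_te"] = toy["nom_0"].map(stats["mean"])
toy["nom_0_te_smoothed"] = toy["nom_0"].map(stats["smoothed"])
print(stats)
```

`Blue` appears only once, so its raw encoding of 1.0 is an unreliable estimate. Smoothing moves it to 2/3, closer to the prior of 0.5.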




# Encoding Categories

In [5]:
HTML('<iframe width="680" height="620" src="https://www.youtube.com/embed/8odLEbSGXoI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [6]:
from sklearn.metrics import roc_auc_score
cat_feat_to_encode = train.columns.tolist()
# Class balance: target=0 is 69.412% of rows, target=1 is 30.588%
smoothing = 0.50

# Out-of-fold encoding: each fold is transformed by an encoder fitted on the other folds
oof_parts = []
skf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
for tr_idx, oof_idx in skf.split(train, train_y):
    ce_target_encoder = ce.TargetEncoder(cols=cat_feat_to_encode, smoothing=smoothing)
    ce_target_encoder.fit(train.iloc[tr_idx, :], train_y.iloc[tr_idx])
    oof_parts.append(ce_target_encoder.transform(train.iloc[oof_idx, :]))

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row index
oof = pd.concat(oof_parts)

# Fit on the full training set to encode the test set
ce_target_encoder = ce.TargetEncoder(cols=cat_feat_to_encode, smoothing=smoothing)
ce_target_encoder.fit(train, train_y)
train = oof.sort_index()  # restore the original row order so rows align with train_y
test = ce_target_encoder.transform(test)

## Logistic Regression

In [7]:
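# Unregularized logistic regression. penalty='none' matches the scikit-learn
# version used for this run; scikit-learn 1.2 and later expect penalty=None instead.
glm = linear_model.LogisticRegression(
  random_state=1, solver='lbfgs', max_iter=2019, fit_intercept=True, 
  penalty='none', verbose=0)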

glm.fit(train, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2019,
                   multi_class='warn', n_jobs=None, penalty='none',
                   random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
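The model above is fitted on the full training set without a validation score, although `roc_auc_score` is imported. As a hedged sketch of how a cross-validated estimate could be obtained, the example below uses synthetic data from `make_classification` as a stand-in for the encoded features. It assumes ROC AUC is the metric of interest, which the import suggests. The default L2 penalty is used here for simplicity, so this is not the exact model fitted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the target-encoded training matrix (about 70/30 classes)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.7, 0.3], random_state=1)

model = LogisticRegression(solver="lbfgs", max_iter=2019)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Every row is scored by a model that never saw it during fitting
val_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
auc = roc_auc_score(y, val_proba)
print(f"cross-validated AUC: {auc:.4f}")
```

With the real data, the out-of-fold encoded `train` and `train_y` would take the place of `X` and `y`. For a strict estimate, the encoder itself would also have to be refitted inside each validation fold.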

## Predict probabilities

In [8]:
from datetime import datetime
pd.DataFrame({'id': test_id, 'target': glm.predict_proba(test)[:,1]}).to_csv(
    'sub_' + datetime.now().strftime('%Y-%m-%d_%H-%M-%S') + '.csv', 
    index=False)

# General Findings

### Categorical Material
https://www.kaggle.com/c/cat-in-the-dat/discussion/110924#latest-638837

https://www.kaggle.com/c/cat-in-the-dat/discussion/105512#latest-656503

https://www.kaggle.com/c/cat-in-the-dat/discussion/111930#latest-666056

https://www.kaggle.com/c/cat-in-the-dat/discussion/113213#latest-666299

### Cyclic features
https://www.kaggle.com/c/cat-in-the-dat/discussion/106630#latest-648493

https://www.kaggle.com/c/cat-in-the-dat/discussion/105610#latest-647944

https://www.kaggle.com/c/cat-in-the-dat/discussion/108805#latest-629677
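One common way to handle potentially cyclical features such as `day` and `month` is to map them onto a circle with sine and cosine, so that the last value sits next to the first. This is a minimal sketch of that idea, not necessarily what the linked discussions recommend. The helper `add_cyclic` is hypothetical, and it assumes values start at 1 with periods of 7 and 12.

```python
import numpy as np
import pandas as pd


def add_cyclic(df: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
    """Return a copy of df with sine/cosine columns placing `col` on a circle."""
    angle = 2 * np.pi * (df[col] - 1) / period  # assumes values run from 1 to period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out


demo = pd.DataFrame({"day": [1, 4, 7], "month": [1, 6, 12]})
demo = add_cyclic(demo, "day", period=7)
demo = add_cyclic(demo, "month", period=12)

# Distance on the circle: day 7 is closer to day 1 than day 4 is
xy = demo[["day_sin", "day_cos"]].to_numpy()
dist_7_to_1 = np.linalg.norm(xy[2] - xy[0])
dist_4_to_1 = np.linalg.norm(xy[1] - xy[0])
print(dist_7_to_1, dist_4_to_1)
```

With a plain integer encoding, day 7 would be six units away from day 1. On the circle they are neighbours.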


### Techniques to handle categorical variables

https://www.kaggle.com/c/cat-in-the-dat/discussion/108805#latest-629677







# Final