# Categorical Feature Encoding Challenge
[Crislânio Macêdo](https://medium.com/sapere-aude-tech) - Last updated January 20th, 2020


- [**Github**](https://github.com/crislanio)
- [**Linkedin**](https://www.linkedin.com/in/crislanio/)
- [**Medium**](https://medium.com/sapere-aude-tech)
- [**Quora**](https://www.quora.com/profile/Crislanio)
- [**Ensina.AI**](https://medium.com/ensina-ai/an%C3%A1lise-dos-dados-abertos-do-governo-federal-ba65af8c421c)
- [**Hackerrank**](https://www.hackerrank.com/crislanio_ufc?hr_r=1)
- [**Blog**](https://medium.com/@crislanio.ufc)
- [**Personal Page**](https://crislanio.wordpress.com/about)
- [**Twitter**](https://twitter.com/crs_macedo)

----------



# About this Competition
![](http://img08.deviantart.net/3e2f/i/2016/121/7/8/beerus__god_of_destruction_by_liloutehcat-da0wye6.png)

> #### In this competition, you will be predicting the probability [0, 1] of a binary target column.

The data contains binary features (bin_*), nominal features (nom_*), ordinal features (ord_*) as well as (potentially cyclical) day (of the week) and month features. The string ordinal features ord_{3-5} are lexically ordered according to string.ascii_letters.
Since the purpose of this competition is to explore various encoding strategies, the data has been simplified in that (1) there are no missing values, and (2) the test set does not contain any unseen feature values. (Of course, in real-world settings both of these factors are often important to consider!)
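
To make the ordering of `ord_3` to `ord_5` concrete, here is a minimal sketch of an ordinal encoder that ranks strings by `string.ascii_letters` (lowercase letters first, then uppercase). The helper names and the sample values are hypothetical and are not taken from the competition data.

```python
import string

# Rank of every letter in string.ascii_letters: 'a' -> 0, ..., 'z' -> 25, 'A' -> 26, ..., 'Z' -> 51
LETTER_RANK = {ch: i for i, ch in enumerate(string.ascii_letters)}


def ascii_letters_key(value):
    """Sort key that compares strings character by character using LETTER_RANK."""
    return tuple(LETTER_RANK[ch] for ch in value)


def ordinal_encode(values):
    """Map each distinct string to its integer rank under ascii_letters_key."""
    ordered = sorted(set(values), key=ascii_letters_key)
    rank = {v: i for i, v in enumerate(ordered)}
    return [rank[v] for v in values]


# Hypothetical sample values, for illustration only
sample = ["b", "A", "a", "Z", "b"]
encoded = ordinal_encode(sample)
print(encoded)
```

Note that this differs from Python's default string sort, which places uppercase letters before lowercase ones.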

#### Files
- train.csv - the training set
- test.csv - the test set; you must make predictions against this data
- sample_submission.csv - a sample submission file in the correct format

> #### Inspired by:
- https://www.kaggle.com/felipeleiteantunes/h2o-ai-from-linear-models-to-deep-learning (upvote this !) Not only useful but also valuable


1. # Instructions to download:
> #### http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html

1. # Documentation:
> #### https://h2o-release.s3.amazonaws.com/h2o/rel-turan/4/docs-website/h2o-py/docs/intro.html

1. # A booklet:
> #### http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/PythonBooklet.pdf

1. # A presentation:
> #### https://pt.slideshare.net/0xdata/intro-to-h2o-in-python-data-science-la

And many more questions:
<html>
<body>

<p><font size="5" color="Blue">
If you find this kernel useful or interesting, please don't forget to upvote the kernel =)
</font></p>

</body>
</html>



In [1]:
conda install gxx_linux-64 gcc_linux-64 swig


# Load the H2O library and start up the H2O cluster locally on your machine

In [2]:
import h2o
h2o.init(ip="localhost", port=54323)

Checking whether there is an H2O instance running at http://localhost:54323 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_222"; OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1~deb9u1-b10); OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpt2q5du5b
  JVM stdout: /tmp/tmpt2q5du5b/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpt2q5du5b/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323 ... successful.


| | |
|---|---|
| H2O cluster uptime: | 02 secs |
| H2O cluster timezone: | Etc/UTC |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.26.0.5 |
| H2O cluster version age: | 4 months and 4 days !!! |
| H2O cluster name: | H2O_from_python_unknownUser_15rcz5 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.556 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |


# Import Libraries

In [3]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load files

In [4]:

train_data = h2o.import_file("/kaggle/input/cat-in-the-dat/train.csv")
test_data = h2o.import_file("/kaggle/input/cat-in-the-dat/test.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [5]:
test_id = h2o.import_file('/kaggle/input/cat-in-the-dat/test.csv')['id']

Parse progress: |█████████████████████████████████████████████████████████| 100%


### Import H2O GLM

In [6]:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

#### Train a default GLM
We first create an object of class H2OGeneralizedLinearEstimator.

In [7]:
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

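Now that the glm_fit1 object is initialized, we can prepare the data and train the model. The target is converted to a factor so that H2O treats the problem as classification: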

In [8]:
train_data["target"] = train_data["target"].asfactor()

In [9]:
train, valid, test = train_data.split_frame(ratios=[0.7, 0.15], seed=42)  
y = 'target'
x = list(train_data.columns)

In [10]:
id_var = 'id'
x.remove(id_var)  # remove the id column, which is not a predictor

In [11]:
x.remove(y)  #remove the response
print(x)

['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9', 'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5', 'day', 'month']


#### H2O Machine Learning
> Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algos:

- Generalized Linear Model (GLM)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
- Deep Learning (DL)

#### Generalized Linear Model

Let's start with a basic binomial Generalized Linear Model (GLM). By default, H2O's GLM uses a regularized, elastic net model.
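
For reference, the elastic net objective behind a regularized GLM is commonly written as follows. This is the standard textbook form; the exact scaling of the likelihood term inside H2O is an assumption here and may differ.

$$
\max_{\beta,\,\beta_0}\; \sum_{i=1}^{N} \log f\!\left(y_i;\ \beta^\top x_i + \beta_0\right) \;-\; \lambda\left(\alpha\,\lVert\beta\rVert_1 + \tfrac{1}{2}(1-\alpha)\,\lVert\beta\rVert_2^2\right)
$$

Here $\lambda \ge 0$ sets the overall regularization strength and $\alpha \in [0, 1]$ mixes the lasso penalty ($\alpha = 1$) with the ridge penalty ($\alpha = 0$).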

In [12]:
glm_fit1.train(x=x, y=y, training_frame=train)

glm Model Build progress: |███████████████████████████████████████████████| 100%


#### Train a GLM with lambda search
Next we will do some automatic tuning by passing in a validation frame and setting lambda_search = True. Since we are training a GLM with regularization, we should try to find the right amount of regularization (to avoid overfitting). The model parameter, lambda, controls the amount of regularization in a GLM model and we can find the optimal value for lambda automatically by setting lambda_search = True and passing in a validation frame (which is used to evaluate model performance using a particular value of lambda).

In [13]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True,balance_classes = True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


#### Evaluate model performance

Let's compare the performance of the two GLMs that were just trained.

In [14]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

# Retrieve test set Gini (Gini = 2 × AUC − 1)

In [15]:

print (glm_perf1.gini())
print (glm_perf2.gini())

0.4019330361359976
0.5077353028517264
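
The cells in this notebook print `.gini()`, not AUC. The two are related by Gini = 2 × AUC − 1, so either can be recovered from the other. The sketch below shows the conversion and a brute-force pairwise AUC for intuition; the function names and the toy labels and scores are hypothetical.

```python
def gini_to_auc(gini):
    """Convert a Gini coefficient to AUC using Gini = 2 * AUC - 1."""
    return (gini + 1.0) / 2.0


def auc_to_gini(auc):
    """Convert AUC to a Gini coefficient."""
    return 2.0 * auc - 1.0


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical data)
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc, auc_to_gini(toy_auc))

# Test-set Gini values printed above for glm_fit1 and glm_fit2, expressed as AUC
print(gini_to_auc(0.4019330361359976))
print(gini_to_auc(0.5077353028517264))
```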


# Compare test Gini to the training Gini and validation Gini

In [16]:

print (glm_fit2.gini(train=True))
print (glm_fit2.gini(valid=True))

0.5078651888270929
0.5058107081086602


### Random Forest

H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.

# Import H2O RF

In [17]:

from h2o.estimators.random_forest import H2ORandomForestEstimator

#### Train a default RF
First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.


# Initialize the RF estimator


In [18]:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1',   seed=1)


Now that the rf_fit1 object is initialized, we can train the model:

In [19]:
rf_fit1.train(x=x, y=y, training_frame=train,validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


#### Train an RF with more trees
Next we will increase the number of trees used in the forest by setting ntrees = 100. The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default. Usually, increasing the number of trees in an RF will increase performance as well. Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees grows. See the GBM example below for additional guidance on preventing overfitting using H2O's early stopping functionality.

In [20]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100,   seed=1)
rf_fit2.train(x=x, y=y, training_frame=train,validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


#### Compare model performance
Let's compare the performance of the two RFs that were just trained.

In [21]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

# Retrieve test set Gini

In [22]:

print(rf_perf1.gini())
print(rf_perf2.gini())

0.4869298838721965
0.5048942150568902


#### Cross-validate performance
Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation. Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O. No custom code or loops are required; you simply specify the number of desired folds in the nfolds argument. Since cross-validation does not need a separate test set, the original (full) dataset, train_data, could be used here; the cell below nevertheless trains on the train split. Note that this will take approximately k (nfolds) times longer than training a single RF model, since it trains k cross-validation models (each on n(k-1)/k rows) in addition to the final model trained on the full training_frame with n rows.
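
To illustrate the row counts mentioned above, here is a minimal pure-Python sketch of k-fold splitting. It uses a simple modulo assignment for determinism, which is an assumption made for this example; H2O's own fold assignment may differ. The function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_rows, nfolds):
    """Yield (train_idx, holdout_idx) pairs using a simple modulo fold assignment."""
    for fold in range(nfolds):
        holdout = [i for i in range(n_rows) if i % nfolds == fold]
        train = [i for i in range(n_rows) if i % nfolds != fold]
        yield train, holdout


n_rows, nfolds = 100, 5
splits = list(kfold_indices(n_rows, nfolds))

# Each cross-validation model sees n(k-1)/k rows and is scored on the remaining n/k rows
train_sizes = [len(train) for train, _ in splits]
holdout_sizes = [len(holdout) for _, holdout in splits]
print(train_sizes, holdout_sizes)

# k cross-validation models plus one final model trained on all n rows
models_trained = nfolds + 1
print(models_trained)
```

Every row lands in exactly one holdout fold, so the cross-validated metric is computed from predictions on all n rows.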

In [23]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


To evaluate the cross-validated Gini, do the following:

In [24]:
print( rf_fit3.gini(xval=True))

0.48041045727836984


# Import H2O GBM

In [25]:

from h2o.estimators.gbm import H2OGradientBoostingEstimator

#### Train a default GBM
First we will train a basic GBM model with default parameters. GBM will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. A seed is required for reproducibility.

# Initialize and train the GBM estimator


In [26]:

gbm_fit1 = H2OGradientBoostingEstimator(model_id='gbm_fit1',   seed=1)
gbm_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


#### Train a GBM with more trees
Next we will increase the number of trees used in the GBM by setting ntrees=500. The default number of trees in an H2O GBM is 50, so this GBM will be trained using ten times the default. Increasing the number of trees in a GBM is one way to increase performance of the model; however, you have to be careful not to overfit your model to the training data by using too many trees. To automatically find the optimal number of trees, you must use H2O's early stopping functionality. This example does not do that, but the following example does.

In [27]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500,   seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train,validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


#### Train a GBM with early stopping
This time we will set ntrees = 1000 as an upper bound and use early stopping to prevent overfitting (from too many trees). All of H2O's algorithms have early stopping available; however, with the exception of Deep Learning, it is not enabled by default.

There are several parameters that should be used to control early stopping. The three that are generic to all the algorithms are: stopping_rounds, stopping_metric and stopping_tolerance. The stopping metric is the metric by which you'd like to measure performance, and so we will choose AUC here. The score_tree_interval is a parameter specific to Random Forest and GBM. Setting score_tree_interval=5 will score the model after every five trees.

The parameters we have set below specify that the model will stop training after there have been three scoring intervals where the AUC has not increased by more than 0.0005. Since we have specified a validation frame, the stopping tolerance will be computed on validation AUC rather than training AUC.
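
The stopping rule described above can be sketched in a few lines of plain Python. This is a deliberately simplified version that compares each score to the best score so far using an absolute tolerance; it is not H2O's exact implementation, which may differ in detail (for example, in how scores are averaged or how the tolerance is applied). The function name and the AUC history are hypothetical.

```python
def early_stop_index(scores, stopping_rounds=3, stopping_tolerance=0.0005):
    """Return the index of the scoring event at which training would stop, or None.

    Training stops once `stopping_rounds` consecutive scoring events fail to
    beat the best score seen so far by more than `stopping_tolerance`.
    """
    best = float("-inf")
    rounds_without_gain = 0
    for i, score in enumerate(scores):
        if score > best + stopping_tolerance:
            best = score
            rounds_without_gain = 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= stopping_rounds:
                return i
    return None


# Hypothetical validation AUC, one value per scoring event
auc_history = [0.700, 0.720, 0.730, 0.7302, 0.7303, 0.7301, 0.7400]
stop_at = early_stop_index(auc_history)

# With score_tree_interval=5, scoring event i happens after (i + 1) * 5 trees
score_tree_interval = 5
trees_built = None if stop_at is None else (stop_at + 1) * score_tree_interval
print(stop_at, trees_built)
```

In this toy history the later jump to 0.7400 is never reached, which shows the trade-off: a small stopping_rounds saves time but can stop before a late improvement.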

# Now let's use early stopping to find optimal ntrees


In [28]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=1000, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1)
# The use of a validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


# Let's try XGBoost

In [29]:
# Let's try XGBoost
from h2o.estimators import H2OXGBoostEstimator
param = {
      "model_id": 'gbm_fit4'
    , "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}
gbm_fit4 = H2OXGBoostEstimator(**param)
gbm_fit4.train(x=x, y=y, training_frame=train, validation_frame=valid)

xgboost Model Build progress: |███████████████████████████████████████████| 100%


#### Compare model performance
Let's compare the performance of the three GBMs and the XGBoost model that were just trained.

In [30]:
gbm_perf1 = gbm_fit1.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)
gbm_perf4 = gbm_fit4.model_performance(test)

# Retrieve test set Gini

In [31]:

print (gbm_perf1.gini())
print (gbm_perf2.gini())
print (gbm_perf3.gini())
print (gbm_perf4.gini())

0.5104463406139268
0.4825652979538897
0.5174772134023606
0.44100651514704436


# Deep Learning
H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network. It can also be used to train an autoencoder; however, in the example below we will train a standard supervised prediction model.

# Import H2O DL

In [32]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

#### Train a default DL
First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. H2O's DL will not be reproducible if run on more than a single core, so in this example, the performance metrics below may vary slightly from what you see on your machine. In H2O's DL, early stopping is enabled by default; because a validation frame is passed below, early stopping is evaluated on it using the default stopping parameters.

# Initialize and train the DL estimator


In [33]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1',   seed=1,  balance_classes = True)
dl_fit1.train(x=x, y=y, training_frame=train,validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


#### Train a DL with new architecture and more epochs
Next we will increase the number of epochs used in the DL model by setting epochs=50 (the default is 10) and use two hidden layers of 10 units each (hidden=[10,10]). Increasing the number of epochs in a deep neural net may increase performance of the model; however, you have to be careful not to overfit your model. To automatically find the optimal number of epochs, you must use H2O's early stopping functionality. Unlike the rest of the H2O algorithms, H2O's DL uses early stopping by default, so we will first turn it off in the next example by setting stopping_rounds=0, for comparison.

In [34]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=50, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1,
                                   balance_classes = True)
dl_fit2.train(x=x, y=y, training_frame=train,validation_frame=valid)


deeplearning Model Build progress: |██████████████████████████████████████| 100%


#### Train a DL with early stopping
This example uses the same architecture as dl_fit2 (hidden=[10,10]) but raises epochs to 500 as an upper bound, turns on early stopping and specifies the stopping criterion. We will also pass a validation set, as is recommended for early stopping.

In [35]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=500, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=50,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1,  
                                   balance_classes = True)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


#### Compare model performance
Again, we will compare the performance of the three models using the test set and the Gini coefficient.

In [36]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

# Retrieve test set Gini

In [37]:
# Retrieve test set Gini
print (dl_perf1.gini())
print (dl_perf2.gini())
print( dl_perf3.gini())

0.5910557104401248
0.5887652509461268
0.589960949897901


In [38]:
# Score the full competition test frame. Passing only the `id` column (test_id) fails with
# "IllegalArgumentException: Test/Validation dataset has no columns in common with the training set",
# because the model needs the feature columns listed in x.
test_pred = gbm_fit4.predict(test_data)

In [39]:
test_pred

# General Findings

- Is data synthetic? by [cpmpml](https://www.kaggle.com/cpmpml)

source: https://www.kaggle.com/c/cat-in-the-dat/discussion/105713


- Encoding cyclical features using sin and cos transformation by [gogo827jz](https://www.kaggle.com/gogo827jz)
source: https://www.kaggle.com/c/cat-in-the-dat/discussion/105610

- CATEGORICAL MATERIAL MUST READ by [brunhs](https://www.kaggle.com/brunhs)

source: https://www.kaggle.com/c/cat-in-the-dat/discussion/105512

- CATEGORICAL MATERIAL SURVEY🐱 & Deduplication & Record Linkage. by [caesarlupum](https://www.kaggle.com/caesarlupum)

source: https://www.kaggle.com/c/cat-in-the-dat/discussion/111930
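
As a minimal sketch of the sin and cos transformation mentioned in the list above, the snippet below maps a cyclical value onto the unit circle so that the last and first values of a cycle end up close together. It assumes `month` takes values 1 to 12 and `day` takes values 1 to 7; the function names are hypothetical.

```python
import math


def cyclical_encode(value, period, start=1):
    """Return (sin, cos) coordinates of `value` on a cycle of length `period`."""
    angle = 2.0 * math.pi * (value - start) / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Assumption: month is coded 1..12
january = cyclical_encode(1, 12)
february = cyclical_encode(2, 12)
june = cyclical_encode(6, 12)
december = cyclical_encode(12, 12)

# December is as close to January as February is, and much closer than June
print(distance(december, january), distance(february, january), distance(june, january))

# Assumption: day is coded 1..7
day_features = [cyclical_encode(d, 7) for d in range(1, 8)]
```

Both coordinates are needed: using only the sine (or only the cosine) would map two different points of the cycle to the same value.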




# Top kernels

-  ### 🐱 Cat with Null Importance - Target Permutation by @CaesarLupum

Source: https://www.kaggle.com/caesarlupum/cat-with-null-importance-target-permutation

- ###  An Overview of Encoding Techniques by @shahules

Source: https://www.kaggle.com/shahules/an-overview-of-encoding-techniques

-  ### EDA & Feat Engineering - Encode & Conquer by @kabure

Source: https://www.kaggle.com/kabure/eda-feat-engineering-encode-conquer

- ###  Why Not Logistic Regression? by @peterhurford

Source: https://www.kaggle.com/peterhurford/why-not-logistic-regression

-  ### OH my Cat by @superant

Source: https://www.kaggle.com/superant/oh-my-cat

- ###  Entity embeddings to handle categories by @abhishek

Source: https://www.kaggle.com/abhishek/entity-embeddings-to-handle-categories

- ###  2nd place Solution - Categorical FE Callenge by @adaubas

Source: https://www.kaggle.com/adaubas/2nd-place-solution-categorical-fe-callenge

- ###  🐱 CatComp - Simple Target Encoding by @CaesarLupum

Source: https://www.kaggle.com/caesarlupum/catcomp-simple-target-encoding

-  ### Handling Categorical Variables:Encoding & Modeling by @vikassingh1996

Source: https://www.kaggle.com/vikassingh1996/handling-categorical-variables-encoding-modeling

-  ### R GLMNET by @ccccat

Source: https://www.kaggle.com/ccccat/r-glmnet

-  ### Exploring CATegorical encodings  by @artgor

Source: https://www.kaggle.com/artgor/exploring-categorical-encodings

- ### CatBoost Baseline with Feature Importance by @gogo827jz

Source: https://www.kaggle.com/gogo827jz/catboost-baseline-with-feature-importance



<html>
<body>

<p><font size="5" color="purple">If you like my kernel please consider upvoting it</font></p>
<p><font size="4" color="purple">Don't hesitate to give your suggestions in the comment section</font></p>

</body>
</html>


## Final