This example shows how to encode features for two simple datasets: one regression problem and one binary classification problem.  Feature encoding changes how data is represented in a table so that algorithms can pick up the underlying patterns more easily.  Most algorithms work only with numbers (not strings, dates, or categories), so the data must first be converted into a numeric form.
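As a minimal illustration of what "converting categories into numbers" means, the sketch below encodes a toy column in two common ways, ordinal and one-hot, using plain Python. The list of values is a hypothetical stand-in for a column such as 'workclass'; it does not use koleksyon or category_encoders.

```python
# Toy column of string categories (a hypothetical stand-in for 'workclass').
workclass = ["Private", "State-gov", "Private", "Self-emp-not-inc"]

# Ordinal encoding: map each distinct category to an integer code.
categories = sorted(set(workclass))
ordinal_map = {cat: code for code, cat in enumerate(categories)}
ordinal = [ordinal_map[value] for value in workclass]

# One-hot encoding: one 0/1 column per distinct category.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in workclass]

print(categories)  # ['Private', 'Self-emp-not-inc', 'State-gov']
print(ordinal)     # [0, 2, 0, 1]
print(one_hot)     # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Ordinal encoding keeps a single column but implies an order that may not exist; one-hot encoding avoids that at the cost of one column per category.  Which trade-off works better depends on the data and the model, which is why comparing encoders empirically is useful.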

In [1]:
import pandas as pd
import koleksyon.encode as ee

First, here is an example of using the library to encode data for a classification problem.  We will use the Adult dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data).  Each record describes an adult and is labeled by whether their income is greater than 50K (>50K) or at most 50K (<=50K).

In [2]:
# adult.data has no header row, so the column names are supplied explicitly.
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status',
                'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss',
                'hours-per-week', 'native-country', 'income']
df = pd.read_csv("../../data/adult.data", names=column_names)
df

,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


The target, that is, the variable we are trying to predict, is 'income'.  Many different encoders are available, so we will use koleksyon to compare them and see which one works best for this data.  It constructs a random forest (a simple baseline algorithm that we can improve on later) and places it in an encoding pipeline.  It then benchmarks one such pipeline for each type of encoder in the category_encoders library, which helps us determine the best way to handle our data.
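The benchmarking idea described above can be sketched in plain Python.  This is not koleksyon's implementation: the data rows are made up, the two encoders are hand-written stand-ins for category_encoders classes, and a 1-nearest-neighbour rule stands in for the random forest.  All function names here (fit_ordinal, fit_one_hot, predict_1nn, benchmark) are hypothetical.  The point is only the structure: one fixed train/test split, one fixed model, and a loop that swaps the encoder.

```python
import random

# Hypothetical toy rows: (workclass, hours-per-week), with an income label each.
rows = [("Private", 40), ("State-gov", 40), ("Private", 20), ("Self-emp", 60),
        ("Private", 45), ("State-gov", 38), ("Self-emp", 55), ("Private", 25)]
labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"]

def fit_ordinal(train_cats):
    """Learn an integer code per category; unseen categories map to -1."""
    codes = {cat: i for i, cat in enumerate(sorted(set(train_cats)))}
    return lambda cat: [codes.get(cat, -1)]

def fit_one_hot(train_cats):
    """Learn one 0/1 column per category; unseen categories become all zeros."""
    cats = sorted(set(train_cats))
    return lambda cat: [1 if cat == c else 0 for c in cats]

def predict_1nn(train_X, train_y, x):
    """Stand-in model: return the label of the closest training row."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def benchmark(encoder_fitters, rows, labels, seed=0):
    # Use the same train/test split for every encoder so scores are comparable.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(0.75 * len(idx))
    train, test = idx[:cut], idx[cut:]
    results = {}
    for name, fitter in encoder_fitters.items():
        # Fit the encoder on the training rows only, then encode both splits.
        encode = fitter([rows[i][0] for i in train])
        train_X = [encode(rows[i][0]) + [rows[i][1]] for i in train]
        train_y = [labels[i] for i in train]
        hits = sum(
            predict_1nn(train_X, train_y, encode(rows[i][0]) + [rows[i][1]]) == labels[i]
            for i in test
        )
        results[name] = hits / len(test)  # accuracy on the held-out rows
    return results

results = benchmark({"ordinal": fit_ordinal, "one_hot": fit_one_hot}, rows, labels)
print(results)
```

Fitting each encoder on the training rows only matters most for encoders that use the target, since fitting on the full data would leak test labels into the features.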

In [3]:
target_name = "income"
ep = ee.EncodePipeline(df, target_name, "classifier")

In [4]:
results = ep.evaluate_encoders()

*******************************************************************
Benchmarking Encoders...
*******************************************************************
Preparing Data...
Building Train/Test dataset
*******************************************************************
Building Simple Algorithm...
Evaluating Encoders...
*******************************************************************
<class 'category_encoders.backward_difference.BackwardDifferenceEncoder'>
Training....
Predicting....
{'accuracy_score': 0.8570551205281745, 'true_positives': 4646, 'false_positives': 363, 'false_negatives': 568, 'true_negatives': 936, 'f1_score': 0.7883933554653106, 'precision': 0.9275304451986425, 'recall': 0.8910625239739164, 'roc_auc': 0.7749354353652786}
*******************************************************************
<class 'category_encoders.basen.BaseNEncoder'>
Training....
Predicting....
{'accuracy_score': 0.8587440503608168, 'true_positives': 4658, 'false_positives': 351, 'false_negati