# Tutorial: Techniques for Encoding Categorical Variables

#### Import the required libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from category_encoders import BinaryEncoder
from category_encoders import HashingEncoder

#### The Adult dataset
We load the Adult dataset and assign the column names.

In [1]:
# Load the dataset (the file has no header row)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'adult/adult.data', encoding = 'utf-8',header = None)

# Assign a name to each variable
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num", 
              "marital-status", "occupation", "relationship", "race", "sex", 
              "capital-gain", "capital-loss", "hours-per-week", 
              "native-country", "class"]

We take a first look at the data by displaying the first five observations.

In [2]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


We display the variables and their types.

In [3]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
class             object
dtype: object

### Method 1: Ordinal encoding
Replaces each value of the variable with a distinct integer. We use it with ordinal variables such as "education", whose categories have a natural order.

In [4]:
df.education.value_counts()

 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64
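The labels in the output above carry a leading space, because the raw file separates fields with a comma followed by a space. The sketch below uses a few made-up rows in that layout (the `raw` string and the reduced column list are assumptions for illustration, not the real file) to show the effect, and how the `skipinitialspace` argument of `read_csv` would remove the space.

```python
import io
import pandas as pd

# Hypothetical sample in the same "comma + space" layout as adult.data
raw = (
    "39, State-gov, Bachelors\n"
    "50, Self-emp-not-inc, Bachelors\n"
    "38, Private, HS-grad\n"
)
cols = ["age", "workclass", "education"]

# Default parsing keeps the space after each comma as part of the value
as_is = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# skipinitialspace=True drops whitespace that follows the delimiter
stripped = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                       skipinitialspace=True)

print(repr(as_is["education"].iloc[0]))     # ' Bachelors'
print(repr(stripped["education"].iloc[0]))  # 'Bachelors'
```

This tutorial keeps the default parsing, so every category label passed to an encoder must include the leading space. If the data were loaded with `skipinitialspace=True` instead, the labels would have to be written without it.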

We create the encoder, passing it the order of the categories, and apply it to the "education" variable.

In [5]:
# Create the encoder, listing the categories from lowest to highest level
encoder = OrdinalEncoder(categories=[[" Preschool", " 1st-4th", " 5th-6th", 
                                      " 7th-8th", " 9th", " 10th", " 11th", 
                                      " 12th", " HS-grad", " Some-college", 
                                      " Assoc-voc", " Assoc-acdm", 
                                      " Bachelors", " Masters", 
                                      " Prof-school", " Doctorate"]])

# Fit the encoder on the education variable and transform it
encoder.fit(df[["education"]])
df["education-encoded"] = encoder.transform(df[["education"]])

In [6]:
df[["education", "education-encoded"]].head(10)

Unnamed: 0,education,education-encoded
0,Bachelors,12.0
1,Bachelors,12.0
2,HS-grad,8.0
3,11th,6.0
4,Bachelors,12.0
5,Masters,13.0
6,9th,4.0
7,HS-grad,8.0
8,Masters,13.0
9,Bachelors,12.0
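Passing the order explicitly matters because, when no `categories` argument is given, `OrdinalEncoder` sorts the labels it finds, which for strings means alphabetical order. A minimal sketch on a three-row toy frame (the `toy` data is an assumption for illustration) compares both behaviours.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: three education levels, labels keep the leading space of the dataset
toy = pd.DataFrame({"education": [" HS-grad", " Bachelors", " Masters"]})

# Default: categories are sorted, so " Bachelors" comes before " HS-grad"
default_codes = OrdinalEncoder().fit_transform(toy).ravel().tolist()

# Explicit order: the position in the list becomes the code
explicit = OrdinalEncoder(categories=[[" HS-grad", " Bachelors", " Masters"]])
explicit_codes = explicit.fit_transform(toy).ravel().tolist()

print(default_codes)   # [1.0, 0.0, 2.0]
print(explicit_codes)  # [0.0, 1.0, 2.0]
```

With the default ordering, " HS-grad" would receive a higher code than " Bachelors", which contradicts the real ranking of the two levels.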


The dataset's "education-num" variable already holds this same encoding, shifted by one: it starts at 1 rather than 0, so Bachelors is 13 there and 12.0 in our result.

In [7]:
df[["education-num"]].head()

Unnamed: 0,education-num
0,13
1,13
2,9
3,7
4,13
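The relationship between the two columns can be checked directly. The sketch below rebuilds the five rows shown above in a toy frame (an assumption made so the example runs without downloading the file; on the real data the same comparison would be applied to `df`) and tests whether the encoded value plus one equals "education-num".

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy copy of the first five rows (labels keep their leading space)
toy = pd.DataFrame({
    "education": [" Bachelors", " Bachelors", " HS-grad", " 11th", " Bachelors"],
    "education-num": [13, 13, 9, 7, 13],
})

# Same category order used in the tutorial
levels = [" Preschool", " 1st-4th", " 5th-6th", " 7th-8th", " 9th", " 10th",
          " 11th", " 12th", " HS-grad", " Some-college", " Assoc-voc",
          " Assoc-acdm", " Bachelors", " Masters", " Prof-school", " Doctorate"]

encoder = OrdinalEncoder(categories=[levels])
toy["education-encoded"] = encoder.fit_transform(toy[["education"]]).ravel()

# OrdinalEncoder counts from 0, education-num counts from 1
offset_matches = bool((toy["education-encoded"] + 1 == toy["education-num"]).all())
print(offset_matches)  # True
```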


### Method 2: One-Hot encoding
Creates a new binary variable for each category of the variable being encoded. Each new variable contains 1 for the observations that belong to its category and 0 for all the others.

In [8]:
df.race.value_counts()

 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: race, dtype: int64

We create the binary variables with the get_dummies method and set the drop_first argument to True. This drops the column of the first category (Amer-Indian-Eskimo here) and avoids redundancy, since that column is fully determined by the other four.

In [9]:
# Create the binary variables
dummies = pd.get_dummies(df['race'], drop_first = True)
dummies.head()

Unnamed: 0,Asian-Pac-Islander,Black,Other,White
0,0,0,0,1
1,0,0,0,1
2,0,0,0,1
3,0,1,0,0
4,0,1,0,0
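The redundancy removed by `drop_first` can be made concrete: in a full one-hot encoding every row sums to 1, so any single column can be rebuilt from the rest. The sketch below uses a small hand-made `race` series (an assumption for illustration) and passes `dtype=int` so the result is 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy series covering the five categories (labels keep their leading space)
race = pd.Series([" White", " Black", " Asian-Pac-Islander",
                  " Amer-Indian-Eskimo", " Other", " White"], name="race")

full = pd.get_dummies(race, dtype=int)                      # five columns
reduced = pd.get_dummies(race, drop_first=True, dtype=int)  # four columns

# The dropped column is 1 exactly where all remaining columns are 0
recovered = 1 - reduced.sum(axis=1)

print(full.columns.difference(reduced.columns).tolist())  # [' Amer-Indian-Eskimo']
print((recovered == full[" Amer-Indian-Eskimo"]).all())   # True
```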


We add these new variables to our DataFrame and drop the original "race" variable.

In [10]:
# Add the binary variables to the DataFrame
df = pd.concat([df, dummies], axis = 1)

# Drop the original race variable
df = df.drop(columns=['race'])

In [11]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country,class,education-encoded,Asian-Pac-Islander,Black,Other,White
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,Male,2174,0,40,United-States,<=50K,12.0,0,0,0,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Male,0,0,13,United-States,<=50K,12.0,0,0,0,1
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,Male,0,0,40,United-States,<=50K,8.0,0,0,0,1
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Male,0,0,40,United-States,<=50K,6.0,0,1,0,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Female,0,0,40,Cuba,<=50K,12.0,0,1,0,0
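As an alternative to the separate concat and drop steps, `get_dummies` can take the whole DataFrame together with a `columns` list, replacing the listed columns in one call. This is a sketch on a toy frame (the `toy` data is an assumption for illustration), not the approach used in this tutorial. Note that in this mode the new columns are prefixed with the original column name.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [39, 53, 28],
    "race": [" White", " Black", " Other"],
})

# Encode "race" in place: the original column is removed and the
# dummy columns are appended, prefixed with "race_"
encoded = pd.get_dummies(toy, columns=["race"], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['age', 'race_ Other', 'race_ White']
```

The prefix keeps the origin of each binary column visible, which helps once several categorical variables have been encoded in the same DataFrame.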
