 # **<font color="#07a8ed">Encoding categorical variables</font>**

<p align="justify">
👀 Categorical variables represent discrete categories or classes rather than continuous numerical values. They are common in many datasets and take values from a finite set, such as "red", "green" and "blue" for a variable that represents color.
<br><br>
Categorical variables come in two main types:
<br><br>
<ol align="justify">
<li>
<b>Nominal categorical variables:</b> These variables represent categories with no inherent order. For example, a nominal variable could take the values "cat", "dog" or "bird".
</li>
<li>
<b>Ordinal categorical variables:</b> These variables represent categories with an inherent order. For example, an ordinal variable could take the values "low", "medium" or "high".
</li>
</ol>
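
The difference between the two types can be sketched with pandas categoricals. This is only an illustration on made-up values (the `pets` and `levels` series are hypothetical and are not part of the dataset used in this Colab):

```python
import pandas as pd

# Nominal: the categories have no inherent order
pets = pd.Series(["cat", "dog", "bird", "dog"], dtype="category")

# Ordinal: the order low < medium < high is declared explicitly
levels = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

print(pets.cat.ordered)            # False
print(levels.cat.ordered)          # True
print(levels.min(), levels.max())  # low high
print(levels.cat.codes.tolist())   # [0, 2, 1, 0]
```

Because the order is declared, comparisons such as `min()` and `max()` are meaningful for `levels`, while they would not be for `pets`.
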
<br>
<p align="justify">
Categorical variables are fundamental in many machine learning and data analysis problems. However, the vast majority of machine learning algorithms are designed to work with numerical variables, so categorical variables must be preprocessed appropriately before they are used in a model.



<p align="justify">
👀 In this Colab, we present the typical ways of handling categorical variables by encoding them.
</p>



In [1]:
# Import pandas
import pandas as pd

 ## **<font color="#07a8ed">Loading and viewing the dataset</font>**

In [2]:
# Load the dataset
adult_census = pd.read_csv("https://raw.githubusercontent.com/cristiandarioortegayubro/BDS/main/datasets/adult_census.csv")

In [3]:
# Display the first rows of the dataset
adult_census.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


 ## **<font color="#07a8ed">Identifying the categorical variables</font>**

<p align="justify">
👀 A numerical variable is a quantity represented by a real or integer number. Machine learning algorithms handle these variables naturally, since they generally consist of a sequence of arithmetic operations such as additions and multiplications.
<br><br>
In contrast, categorical variables take discrete values, usually represented by strings drawn from a finite list of possible options.


<p align="justify">
👀 How can we easily recognize the categorical columns in the dataset? Part of the answer lies in the data type of each column.
</p>


In [4]:
# Display the data type of each variable
adult_census.dtypes

Unnamed: 0,0
age,int64
workclass,object
education,object
education-num,int64
marital-status,object
occupation,object
relationship,object
race,object
sex,object
capital-gain,int64


<p align="justify">
👀 We can also see this with the <code>info()</code> method...</p>

In [5]:
adult_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      48842 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  48842 non-null  object
 13  class           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB
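
The data type is only part of the answer, so it can help to look at how many distinct values each column holds as well. The following is a minimal sketch on a small hypothetical frame (`toy` is made up for illustration, reusing a few column names from the dataset), not a rule for deciding which columns are categorical:

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few columns of the dataset
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "education": ["11th", "HS-grad", "Assoc-acdm", "HS-grad"],
    "education-num": [7, 9, 12, 9],
    "sex": ["Male", "Male", "Female", "Male"],
})

# Non-numeric columns are the obvious categorical candidates
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per column
cardinality = toy.nunique()

print(non_numeric)  # ['education', 'sex']
print(cardinality)
```

In this toy frame `education-num` is stored as an integer yet has exactly as many distinct values as `education`, which suggests it may be a numeric code for the same categories.
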


In [6]:
adult_census.size

683788

We separate the explanatory variables from the target variable.

The pandas `drop()` method removes rows or columns from a DataFrame. It returns a new DataFrame and leaves the original unchanged.
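
As a small sketch of how `drop()` behaves, assuming a made-up frame (`toy` is hypothetical; only the `columns` and `index` parameters are shown):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 38, 28],
    "sex": ["Male", "Male", "Female"],
    "class": ["<=50K", "<=50K", ">50K"],
})

without_column = toy.drop(columns="class")  # removes the column named "class"
without_row = toy.drop(index=0)             # removes the row labeled 0

print(without_column.columns.tolist())  # ['age', 'sex']
print(without_row.index.tolist())       # [1, 2]
print(toy.shape)                        # (3, 3): the original frame is untouched
```
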

In [7]:
# Split the dataset into explanatory variables and target variable
data = adult_census.drop(columns="class")
target = adult_census["class"]

In [8]:
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [9]:
target.head()

Unnamed: 0,class
0,<=50K
1,<=50K
2,>50K
3,>50K
4,<=50K


In [10]:
target.value_counts()

Unnamed: 0_level_0,count
class,Unnamed: 1_level_1
<=50K,37155
>50K,11687


In [11]:
porcentaje = target.value_counts() / len(target)  # proportion of each class in the target variable
porcentaje

Unnamed: 0_level_0,count
class,Unnamed: 1_level_1
<=50K,0.760718
>50K,0.239282
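
The same proportions can also be obtained directly with the `normalize` parameter of `value_counts()`. A minimal sketch on a hypothetical four-element target (`toy_target` is made up for illustration):

```python
import pandas as pd

toy_target = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="class")

# Manual computation: counts divided by the number of samples
manual = toy_target.value_counts() / len(toy_target)

# Equivalent shortcut
normalized = toy_target.value_counts(normalize=True)

print(normalized)
print((normalized * 100).round(1))  # the same values expressed as percentages
```
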


 ## **<font color="#07a8ed">Selecting variables by data type</font>**

<p align="justify">
👀 The categorical columns can be defined manually by writing a list with their names.
<br><br>
However, this is not efficient when there are many variables and we want to automate the process. In that case it is advisable to use the pandas <code>select_dtypes()</code> method, which selects columns according to their data type automatically.
<br><br>
👀 Let's see how to use it:
</p>


In [12]:
# Split the explanatory variables into categorical and numerical variables
data_categorical = data.select_dtypes(include="object")
data_numerical = data.select_dtypes(exclude="object")

In [13]:
data_categorical.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [14]:
data_categorical.shape

(48842, 8)

In [15]:
data_numerical.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
0,25,7,0,0,40
1,38,9,0,0,50
2,28,12,0,0,40
3,44,10,7688,0,40
4,18,10,0,0,30


In [16]:
data_numerical.shape

(48842, 5)
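
A possible variant of this split selects on numeric types instead of on `object`. The sketch below uses a hypothetical frame (`toy` and the `toy_*` names are made up) and is not the selection used in the cells above:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 38],
    "hours-per-week": [40.0, 50.0],
    "workclass": ["Private", "Local-gov"],
})

# "number" matches integer and float columns alike
toy_numerical = toy.select_dtypes(include="number")
toy_categorical = toy.select_dtypes(exclude="number")

print(toy_numerical.columns.tolist())    # ['age', 'hours-per-week']
print(toy_categorical.columns.tolist())  # ['workclass']
```

Using `include` and `exclude` with the same selector ensures that every column ends up in exactly one of the two groups.
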

 ## **<font color="#07a8ed">Encoding nominal categories (without assuming an order)</font>**

<p align="justify">
👀 One-hot encoding is used when the categories of a variable have no inherent order. For a given variable, it creates as many new columns as there are distinct categories in that column. For a given sample, the column corresponding to its category is set to $1$, while the columns of all other categories are set to $0$.


![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*ggtP4a5YaRx6l09KQaYOnw.png?raw=true)
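
To make the definition concrete, one-hot encoding can be written by hand for a single column. This is only a sketch on a hypothetical `colors` series; it is not how pandas or scikit-learn implement the encoding:

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One new column per distinct category, set to 1 where the sample belongs to it
categories = sorted(colors.unique())
one_hot = pd.DataFrame(
    {f"color_{c}": (colors == c).astype(int) for c in categories}
)

print(one_hot)
print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1]: exactly one 1 per row
```
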


<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/250px-Pandas_logo.svg.png?raw=true">
</p>

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

<p align="justify">
👀 The pandas <code>get_dummies</code> function converts categorical variables into dummy (binary) variables: columns that indicate the presence or absence of a particular category in each row.

In [17]:
data_categorical

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States
...,...,...,...,...,...,...,...,...
48837,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States
48838,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
48839,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States
48840,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States


In [18]:
# Encode the categorical variables with pandas
data_categorical_encoded = pd.get_dummies(data_categorical, dtype=int)
data_categorical_encoded.head()

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


<p align="justify">
👀 After encoding, the number of features is more than $10$ times larger than in the original categorical data ($102$ columns instead of $8$), because some variables, such as occupation and native country, have many unique values.
</p>
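
The size of the encoded table follows from the number of distinct values in each column: one-hot encoding produces one column per category, so the total is the sum of the cardinalities. A minimal check on a hypothetical frame (`toy` is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "occupation": ["Sales", "Tech-support", "Adm-clerical", "Sales"],
})

encoded = pd.get_dummies(toy, dtype=int)

# 2 categories for "sex" + 3 categories for "occupation"
expected_columns = toy.nunique().sum()

print(toy.nunique().to_dict())
print(encoded.shape[1], expected_columns)  # 5 5
```
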

In [19]:
data_categorical_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Columns: 102 entries, workclass_ ? to native-country_ Yugoslavia
dtypes: int64(102)
memory usage: 38.0 MB


In [20]:
data_categorical_encoded.columns

Index(['workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov',
       'workclass_ Never-worked', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th',
       ...
       'native-country_ Portugal', 'native-country_ Puerto-Rico',
       'native-country_ Scotland', 'native-country_ South',
       'native-country_ Taiwan', 'native-country_ Thailand',
       'native-country_ Trinadad&Tobago', 'native-country_ United-States',
       'native-country_ Vietnam', 'native-country_ Yugoslavia'],
      dtype='object', length=102)

In [21]:
data_categorical["occupation"].value_counts()

Unnamed: 0_level_0,count
occupation,Unnamed: 1_level_1
Prof-specialty,6172
Craft-repair,6112
Exec-managerial,6086
Adm-clerical,5611
Sales,5504
Other-service,4923
Machine-op-inspct,3022
?,2809
Transport-moving,2355
Handlers-cleaners,2072


<p align="justify">
👀 The pandas <code>concat</code> function concatenates (joins) multiple objects such as DataFrames or Series along an axis (rows or columns).
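
One detail worth keeping in mind is that `concat` aligns rows by index label rather than by position when joining along the columns. A short sketch with hypothetical frames (`left`, `right` and `shifted` are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"age": [25, 38, 28]})
right = pd.DataFrame({"sex_Female": [0, 0, 1], "sex_Male": [1, 1, 0]})

# Same index labels (0, 1, 2): the rows line up as expected
side_by_side = pd.concat([left, right], axis="columns")
print(side_by_side.shape)  # (3, 3)

# Different index labels: unmatched rows are filled with NaN
shifted = right.set_axis([1, 2, 3], axis="index")
misaligned = pd.concat([left, shifted], axis="columns")
print(misaligned.shape)             # (4, 3)
print(misaligned.isna().sum().sum())  # 3
```

In this Colab both DataFrames share the same default index, so the concatenation lines up row by row.
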

In [22]:
# Concatenate the encoded categorical variables with the numerical variables
data_encoded = pd.concat([data_numerical, data_categorical_encoded], axis="columns")
data_encoded.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,25,7,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
1,38,9,0,0,50,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,28,12,0,0,40,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,44,10,7688,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,18,10,0,0,30,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


💗 The DataFrame is now ready for modeling and subsequent evaluation!

<p align="center">
<img src="https://github.com/cristiandarioortegayubro/BDS/blob/main/images/Logo%20Scikit-learn.png?raw=true">
</p>

https://scikit-learn.org/stable/


In [23]:
data_categorical.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


<p align="justify">
Next, we present another tool for encoding categorical data as numerical data that machine learning algorithms can use...
</p>


https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

<p align="justify">
👀 The <code>OneHotEncoder</code> class from scikit-learn performs one-hot encoding of categorical variables. Unlike <code>get_dummies</code>, <code>OneHotEncoder</code> is more flexible and belongs to scikit-learn's preprocessing module, so it integrates better with pipelines and other machine learning workflows.

In [24]:
from sklearn.preprocessing import OneHotEncoder

In [25]:
# Encode the categorical variables with scikit-learn
# sparse_output: if True (the default), a sparse matrix is returned; if False, a dense NumPy array
encoder = OneHotEncoder(sparse_output=False)
categorical_encoded = encoder.fit_transform(data_categorical)
categorical_encoded

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

<p align="justify">
👀 <code>sparse_output = False</code> is used in <code>OneHotEncoder</code> for teaching purposes, that is, to make the data easier to inspect.

In [26]:
# Convert the array into a DataFrame
# get_feature_names_out() is a method of the fitted encoder; it returns the names of the output columns
columns_encoded = encoder.get_feature_names_out()
categorical_encoded = pd.DataFrame(categorical_encoded, columns=columns_encoded)
categorical_encoded.head()

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [27]:
# Concatenate the encoded categorical variables with the numerical variables
data_encoded = pd.concat([data_numerical, categorical_encoded], axis="columns")
data_encoded.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,25,7,0,0,40,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,38,9,0,0,50,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,28,12,0,0,40,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,44,10,7688,0,40,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,18,10,0,0,30,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


💗 The DataFrame is now ready for modeling and subsequent evaluation!

 # **<font color="#07a8ed">Creating a pipeline</font>**

In [28]:
from sklearn.compose import ColumnTransformer

<p align="justify">
👀 The <code>ColumnTransformer</code> class applies different preprocessing steps to different columns, which makes it useful for datasets that have both categorical and numerical columns.

In [29]:
# Define the column transformer
pipeline = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), data_categorical.columns),
        ("numerical", "passthrough", data_numerical.columns),  # numerical columns pass through unchanged
    ]
)

In [30]:
pipeline

💗 The pipeline is now ready for subsequent modeling and evaluation!
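
As a rough sketch of what that later modeling step could look like, a column transformer can be chained with an estimator. Everything here is an assumption made for illustration: the toy data, the names `preprocessor` and `model`, and the choice of `LogisticRegression`, which this Colab does not prescribe:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature dataset
toy_data = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?", "Local-gov"],
})
toy_target = pd.Series(["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"])

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("numerical", "passthrough", ["age"]),
    ]
)

# Preprocessing and model are fitted together as a single estimator
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model.fit(toy_data, toy_target)

# "State-gov" was never seen during fit; handle_unknown="ignore" avoids an error
predictions = model.predict(pd.DataFrame({"age": [30], "workclass": ["State-gov"]}))
print(predictions.shape)  # (1,)
```
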

 # **<font color="#07a8ed">Summary</font>**

<p align="center">
<img src="https://github.com/cristiandarioortegayubro/BDS/blob/main/images/ColumnTransformers-001.png?raw=true" width="600">
</p>

<p align="justify">
👀 In this Colab we:<br>
<br>✅ Loaded the data from a <code>CSV</code> file using <code>Pandas</code>.
<br>✅ Examined the categorical variables.
<br>✅ Used strategies to encode categories.
<br>✅ Created the pipeline.
</p>

<p align="justify">



<br>
<br>
<p align="center"><b>
💗
<font color="#07a8ed">
We have reached the end of our Colab, keep on coding...
</font>
</p>
