<a href="https://colab.research.google.com/github/Viny2030/sklearn/blob/main/03_categorical_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding of categorical variables

In this notebook, we present some typical ways of dealing with **categorical
variables** by encoding them, namely **ordinal encoding** and **one-hot
encoding**.



Let's first load the entire adult dataset containing both numerical and
categorical data.


In [1]:
import pandas as pd

adult_census = pd.read_csv("https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

In [2]:
data

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States
48838,40,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States
48839,58,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States
48840,22,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States


In [3]:
target

Unnamed: 0,class
0,<=50K
1,<=50K
2,>50K
3,>50K
4,<=50K
...,...
48837,<=50K
48838,>50K
48839,<=50K
48840,<=50K



## Identify categorical variables

As we saw in the previous section, a numerical variable is a
quantity represented by a real or integer number. These variables can be
naturally handled by machine learning algorithms that are typically composed
of a sequence of arithmetic instructions such as additions and
multiplications.

In contrast, categorical variables have discrete values, typically
represented by string labels (but not only) taken from a finite list of
possible choices. For instance, the variable `native-country` in our dataset
is a categorical variable because it encodes the data using a finite list of
possible countries (along with the `?` symbol when this information is
missing):


In [4]:
data["native-country"].value_counts().sort_index()

native-country,count
?,857
Cambodia,28
Canada,182
China,122
Columbia,85
Cuba,138
Dominican-Republic,103
Ecuador,45
El-Salvador,155
England,127


How can we easily recognize the categorical columns in the dataset? Part of
the answer lies in the columns' data type:


In [5]:
data.dtypes

column,dtype
age,int64
workclass,object
education,object
marital-status,object
occupation,object
relationship,object
race,object
sex,object
capital-gain,int64
capital-loss,int64


If we look at the `"native-country"` column, we observe its data type is
`object`, meaning it contains string values.

## Select features based on their data type

In the previous notebook, we manually defined the numerical columns. We could
take a similar approach here. Instead, we use the scikit-learn helper function
`make_column_selector`, which allows us to select columns based on their data
type. We now illustrate how to use this helper.



In [6]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']

Here, we created the selector by passing the data type to include; we then
passed the input dataset to the selector object, which returned a list of
column names that have the requested data type. We can now filter out the
unwanted columns:


In [7]:
data_categorical = data[categorical_columns]
data_categorical.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [8]:
print(f"The dataset is composed of {data_categorical.shape[1]} features")

The dataset is composed of 8 features
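The same helper can also select the complementary set of columns. The sketch below is only an illustration on a hypothetical toy dataframe (the `toy` frame and its values are assumptions, not the adult census data); string columns are given an explicit `object` dtype so that the selection does not depend on the pandas version.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame; string columns are stored with an explicit object dtype.
toy = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
        "hours-per-week": [40, 50, 40],
        "sex": pd.Series(["Male", "Male", "Female"], dtype=object),
    }
)

# One selector keeps object columns, the other keeps everything else.
categorical_selector = make_column_selector(dtype_include=object)
numerical_selector = make_column_selector(dtype_exclude=object)

toy_categorical_columns = categorical_selector(toy)
toy_numerical_columns = numerical_selector(toy)

print(toy_categorical_columns)
print(toy_numerical_columns)
```

Both selectors return plain lists of column names, so they can be used directly to index the dataframe.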


In the remainder of this section, we will present different strategies to
encode categorical data into numerical data which can be used by a
machine-learning algorithm.

## Strategies to encode categories

### Encoding ordinal categories

The most intuitive strategy is to encode each category with a different
number. The `OrdinalEncoder` transforms the data in this manner. We start by
encoding a single column to understand how the encoding works.


In [9]:
from sklearn.preprocessing import OrdinalEncoder

education_column = data_categorical[["education"]]

encoder = OrdinalEncoder().set_output(transform="pandas")
education_encoded = encoder.fit_transform(education_column)
education_encoded

Unnamed: 0,education
0,1.0
1,11.0
2,7.0
3,15.0
4,15.0
...,...
48837,7.0
48838,11.0
48839,11.0
48840,11.0


We see that each category in `"education"` has been replaced by a numeric
value. We can check the mapping between the categories and the numerical
values by inspecting the fitted attribute `categories_`.



In [10]:
encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]
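To make the link between `categories_` and the encoded values explicit, here is a minimal sketch on a hypothetical toy column (the values in `toy_education` are assumptions chosen for illustration). The integer assigned to a category is its position in `categories_[0]`.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with three distinct categories.
toy_education = [["HS-grad"], ["Bachelors"], ["11th"], ["HS-grad"]]

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy_education)

# The code of a category is its index in the sorted `categories_[0]` array.
mapping = {
    category: code for code, category in enumerate(toy_encoder.categories_[0])
}

print(mapping)
print(toy_codes.ravel())
```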

Now, we can check the encoding applied on all categorical features.


In [11]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,4.0,1.0,4.0,7.0,3.0,2.0,1.0,39.0
1,4.0,11.0,2.0,5.0,0.0,4.0,1.0,39.0
2,2.0,7.0,2.0,11.0,0.0,4.0,1.0,39.0
3,4.0,15.0,2.0,7.0,0.0,2.0,1.0,39.0
4,0.0,15.0,4.0,0.0,3.0,4.0,0.0,39.0


In [12]:
print(f"The dataset encoded contains {data_encoded.shape[1]} features")

The dataset encoded contains 8 features


We see that the categories have been encoded for each feature (column)
independently. We also note that the number of features before and after the
encoding is the same.

However, be careful when applying this encoding strategy:
using this integer representation leads downstream predictive models
to assume that the values are ordered (0 < 1 < 2 < 3... for instance).

By default, `OrdinalEncoder` uses a lexicographical strategy to map string
category labels to integers. This strategy is arbitrary and often
meaningless. For instance, suppose the dataset has a categorical variable
named `"size"` with categories such as "S", "M", "L", "XL". We would like the
integer representation to respect the meaning of the sizes by mapping them to
increasing integers such as `0, 1, 2, 3`.
However, the lexicographical strategy used by default would map the labels
"S", "M", "L", "XL" to 2, 1, 0, 3, by following the alphabetical order.

The `OrdinalEncoder` class accepts a `categories` constructor argument to
pass categories in the expected ordering explicitly. You can find more
information in the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features)
if needed.
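The `"size"` example can be sketched as follows. The `sizes` data is hypothetical and only serves to compare the default lexicographical mapping with an explicit ordering passed through `categories`.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column dataset of clothing sizes.
sizes = [["S"], ["M"], ["L"], ["XL"]]

# Default behaviour: categories are sorted lexicographically (L, M, S, XL).
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes).ravel()

# Explicit ordering: one list of categories per feature, in the desired order.
ordered_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)
print(ordered_codes)
```

With the explicit ordering, the integers increase with the garment size, which is the property a downstream model relying on the order would need.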

If a categorical variable does not carry any meaningful order information
then this encoding might be misleading to downstream statistical models and
you might consider using one-hot encoding instead (see below).

### Encoding nominal categories (without assuming any order)

`OneHotEncoder` is an alternative encoder that prevents downstream models
from making a false assumption about the ordering of categories. For a
given feature, it creates as many new columns as there are possible
categories. For a given sample, the value of the column corresponding to its
category is set to `1` while the columns of all the other categories
are set to `0`.

We can encode a single feature (e.g. `"education"`) to illustrate how the
encoding works.



In [13]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
education_encoded = encoder.fit_transform(education_column)
education_encoded

Unnamed: 0,education_ 10th,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,education_ 7th-8th,education_ 9th,education_ Assoc-acdm,education_ Assoc-voc,education_ Bachelors,education_ Doctorate,education_ HS-grad,education_ Masters,education_ Preschool,education_ Prof-school,education_ Some-college
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
48839,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
48840,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p><tt class="docutils literal">sparse_output=False</tt> is used in the <tt class="docutils literal">OneHotEncoder</tt> for didactic purposes,
namely easier visualization of the data.</p>
<p class="last">Sparse matrices are efficient data structures when most of your matrix
elements are zero. They won't be covered in detail in this course. If you
want more details about them, you can look at
<a class="reference external" href="https://scipy-lectures.org/advanced/scipy_sparse/introduction.html#why-sparse-matrices">this</a>.</p>
</div>



We see that encoding a single feature gives a dataframe full of zeros
and ones. Each category (unique value) became a column; the encoding
returned, for each sample, a 1 to specify which category it belongs to.

Let's apply this encoding on the full dataset.


In [14]:
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()

The dataset is composed of 8 features


Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [15]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [16]:
print(f"The encoded dataset contains {data_encoded.shape[1]} features")

The encoded dataset contains 102 features


Look at how the `"workclass"` variable of the first 3 records has been encoded
and compare this to the original string representation.

The number of features after the encoding is more than 10 times larger than
in the original data because some variables such as `occupation` and
`native-country` have many possible categories.


### Choosing an encoding strategy

Choosing an encoding strategy depends on the underlying models and the type of
categories (i.e. ordinal vs. nominal).

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">In general <tt class="docutils literal">OneHotEncoder</tt> is the encoding strategy used when the
downstream models are <strong>linear models</strong> while <tt class="docutils literal">OrdinalEncoder</tt> is often a
good strategy with <strong>tree-based models</strong>.</p>
</div>


Using an `OrdinalEncoder` outputs ordinal categories. This means
that there is an order in the resulting categories (e.g. `0 < 1 < 2`). The
impact of violating this ordering assumption is really dependent on the
downstream models. Linear models would be impacted by misordered categories
while tree-based models would not.

You can still use an `OrdinalEncoder` with linear models but you need to be
sure that:
- the original categories (before encoding) have an ordering;
- the encoded categories follow the same ordering as the original
  categories.

The **next exercise** highlights the issue of misusing `OrdinalEncoder` with
a linear model.

One-hot encoding categorical variables with high cardinality can cause
computational inefficiency in tree-based models. Because of this, it is not
recommended to use `OneHotEncoder` in such cases even if the original
categories do not have a given order. We will show this in the **final
exercise** of this sequence.



## Evaluate our predictive pipeline

We can now integrate this encoder inside a machine learning pipeline like we
did with numerical data: let's train a linear classifier on the encoded data
and check the generalization performance of this machine learning pipeline using
cross-validation.

Before we create the pipeline, we need to take a closer look at the
`native-country` column. Let's recall some statistics regarding this column.




In [17]:
data["native-country"].value_counts()

native-country,count
United-States,43832
Mexico,951
?,857
Philippines,295
Germany,206
Puerto-Rico,184
Canada,182
El-Salvador,155
India,151
Cuba,138


We see that the `"Holand-Netherlands"` category occurs very rarely. This will
be a problem during cross-validation: if the sample ends up in the test set
during splitting, then the encoder would not have seen the category during
training and would not be able to encode it.

In scikit-learn, there are some possible solutions to bypass this issue:

* list all the possible categories and provide them to the encoder via the
  keyword argument `categories` instead of letting the estimator automatically
  determine them from the training data when calling fit;
* set the parameter `handle_unknown="ignore"`, i.e. if an unknown category is
  encountered during transform, the resulting one-hot encoded columns for this
  feature will be all zeros;
* adjust the `min_frequency` parameter to collapse the rarest categories
  observed in the training data into a single one-hot encoded feature. If you
  enable this option, you can also set `handle_unknown="infrequent_if_exist"`
  to encode the unknown categories (categories only observed at predict time)
  as ones in that last column.

In this notebook we only explore the second option, namely
`OneHotEncoder(handle_unknown="ignore")`. Feel free to evaluate the
alternatives on your own, for instance using a sandbox notebook.



<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">Be aware the <tt class="docutils literal">OrdinalEncoder</tt> exposes a parameter also named <tt class="docutils literal">handle_unknown</tt>.
It can be set to <tt class="docutils literal">use_encoded_value</tt>. If that option is chosen, you can define
a fixed value that is assigned to all unknown categories during <tt class="docutils literal">transform</tt>.
For example, <tt class="docutils literal"><span class="pre">OrdinalEncoder(handle_unknown='use_encoded_value',</span> <span class="pre">unknown_value=-1)</span></tt> would encode as <tt class="docutils literal"><span class="pre">-1</span></tt> every category encountered during <tt class="docutils literal">transform</tt>
that was not part of the data seen during the <tt class="docutils literal">fit</tt> call. You are
going to use these parameters in the next exercise.</p>
</div>
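The tip above can be illustrated with a minimal sketch. The `train` and `test` arrays are hypothetical; the category that was not seen during `fit` is encoded as `-1`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test columns.
train = np.array([["Cuba"], ["Mexico"], ["United-States"]], dtype=object)
test = np.array([["Mexico"], ["Holand-Netherlands"]], dtype=object)

ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
ordinal_encoder.fit(train)

# "Mexico" keeps its fitted code, the unknown category becomes -1.
test_codes = ordinal_encoder.transform(test).ravel()
print(test_codes)
```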

We can now create our machine learning pipeline.

In [18]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Here, we need to increase the maximum number of iterations to obtain a fully
converged <tt class="docutils literal">LogisticRegression</tt> and silence a <tt class="docutils literal">ConvergenceWarning</tt>. Contrary
to the numerical features, the one-hot encoded categorical features are all
on the same scale (values are 0 or 1), so they would not benefit from
scaling. In this case, increasing <tt class="docutils literal">max_iter</tt> is the right thing to do.</p>
</div>


Finally, we can check the model's generalization performance using only the
categorical columns.


In [19]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data_categorical, target)
cv_results

{'fit_time': array([1.52229452, 1.43866801, 0.61795759, 0.59785604, 0.62051582]),
 'score_time': array([0.29993486, 0.0836513 , 0.07827306, 0.08502126, 0.1055069 ]),
 'test_score': array([0.83232675, 0.83570478, 0.82831695, 0.83292383, 0.83497133])}

In [20]:
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

The accuracy is: 0.833 ± 0.003


As you can see, this representation of the categorical variables is
slightly more predictive of the revenue than the numerical variables
that we used previously.


In this notebook we have:
* seen two common strategies for encoding categorical features: **ordinal
  encoding** and **one-hot encoding**;
* used a **pipeline** to apply a **one-hot encoder** before fitting a logistic
  regression.
