<a href="https://colab.research.google.com/github/Viny2030/sklearn/blob/main/02_numerical_pipeline_introduction_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First model with scikit-learn

In this notebook, we present how to build predictive models on tabular
datasets, with only numerical features.

In particular we highlight:

* the scikit-learn API: `.fit(X, y)`/`.predict(X)`/`.score(X, y)`;
* how to evaluate the generalization performance of a model with a train-test
  split.

Here API stands for "Application Programming Interface" and refers to a set of
conventions to build self-consistent software. Notice that you can visit the
Glossary for more info on technical jargon.

## Loading the dataset with Pandas

We use the "adult_census" dataset described in the previous notebook. For more
details about the dataset see <http://www.openml.org/d/1590>.

Numerical data is the most natural type of data used in machine learning and
can (almost) directly be fed into predictive models. The file loaded here
contains categorical columns as well as numerical ones, so the categorical
columns are encoded as integers before a model is fitted.


In [1]:
import pandas as pd

adult_census = pd.read_csv("https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv")

Let's have a look at the first records of this dataframe:

In [2]:
adult_census.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
# Encode the target as integers: "<=50K" becomes 0 and ">50K" becomes 1.
# astype(int) raises an error if any label was not covered by the mapping.
label_to_int = {"<=50K": 0, ">50K": 1}
adult_census['class'] = adult_census['class'].str.strip().map(label_to_int).astype(int)

In [5]:
adult_census.head(6)

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,0
5,34,Private,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,0


We see that this CSV file contains all information: the target that we would
like to predict (i.e. `"class"`) and the data that we want to use to train our
predictive model (i.e. the remaining columns). The first step is to separate
columns to get on one side the target and on the other side the data.

## Separate the data and the target

In [6]:
target_name = "class"
target = adult_census[target_name]
target

Unnamed: 0,class
0,0
1,0
2,1
3,1
4,0
...,...
48837,0
48838,1
48839,0
48840,0


In [7]:
data = adult_census.drop(columns=[target_name])
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States


We can now take a closer look at the variables, also called features, that we
later use to build our predictive model. In addition, we can also check how
many samples are available in our dataset.

In [8]:
data.columns

Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [9]:
print(
    f"The dataset contains {data.shape[0]} samples and "
    f"{data.shape[1]} features"
)

The dataset contains 48842 samples and 13 features


## Fit a model and make predictions

We now build a classification model using the "K-nearest neighbors" strategy.
To predict the target of a new sample, a k-nearest neighbors model takes into
account its `k` closest samples in the training set and predicts the majority
target of these samples.
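
The majority vote described above can be sketched by hand. The toy arrays and the helper `knn_predict_one` are hypothetical names introduced only for this illustration, and the sketch assumes Euclidean distance, which is also the default metric of `KNeighborsClassifier`. It is a minimal illustration, not how scikit-learn implements the search internally.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy training set: two numerical features, binary target.
X_knn = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_knn = np.array([0, 0, 0, 1, 1, 1])


def knn_predict_one(x_new, X, y, k=3):
    """Predict one sample by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X - x_new, axis=1)  # Euclidean distance to each training sample
    nearest = np.argsort(distances)[:k]            # indices of the k closest samples
    votes = Counter(y[nearest].tolist())           # count the targets among those neighbors
    return votes.most_common(1)[0][0]


x_new = np.array([2.0, 2.0])
manual_prediction = knn_predict_one(x_new, X_knn, y_knn, k=3)

# Cross-check against scikit-learn with the same k.
sklearn_prediction = (
    KNeighborsClassifier(n_neighbors=3).fit(X_knn, y_knn).predict([x_new])[0]
)
print(manual_prediction, sklearn_prediction)
```

The three closest training samples to `x_new` all carry the target `0`, so both approaches predict `0`.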

<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">We use a K-nearest neighbors model here. However, be aware that it is seldom useful
in practice. We use it because it is an intuitive algorithm. In the next
notebook, we will introduce better models.</p>
</div>

The `fit` method is called to train the model from the input (features) and
target data.


In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
encoder = LabelEncoder()

# Assuming 'data' is a Pandas DataFrame
# Iterate through columns and encode any string columns
for column in data.columns:
    if data[column].dtype == 'object':  # Check if column is of object type (string)
        data[column] = encoder.fit_transform(data[column]) # Encode the string values to numerical

# Now fit the model with the encoded data
model = KNeighborsClassifier()
_ = model.fit(data, target)

Learning can be represented as follows:

![Predictor fit diagram](../figures/api_diagram-predictor.fit.svg)

In scikit-learn an object that has a `fit` method is called an **estimator**.
The method `fit` is composed of two elements: (i) a **learning algorithm** and
(ii) some **model states**. The learning algorithm takes the training data and
training target as input and sets the model states. These model states are
later used to either predict (for classifiers and regressors) or transform
data (for transformers).

Both the learning algorithm and the type of model states are specific to each
type of model.
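
To make the notion of model states concrete, the sketch below fits a classifier on a tiny hypothetical dataset (`X_toy`, `y_toy` are illustration names) and inspects two attributes that exist only after `fit`. By scikit-learn convention, learned attributes end with an underscore.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 4 samples, 2 numerical features.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1)

# Before fit, the estimator only holds its hyperparameters.
has_state_before = hasattr(knn, "classes_")

knn.fit(X_toy, y_toy)

# After fit, the learned state is available as attributes ending with "_".
print(has_state_before)    # False
print(knn.classes_)        # distinct target values seen during fit
print(knn.n_features_in_)  # number of features seen during fit
```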



<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Here and later, we use the name <tt class="docutils literal">data</tt> and <tt class="docutils literal">target</tt> to be explicit. In
scikit-learn documentation, <tt class="docutils literal">data</tt> is commonly named <tt class="docutils literal">X</tt> and <tt class="docutils literal">target</tt> is
commonly called <tt class="docutils literal">y</tt>.</p>
</div>

Let's use our model to make some predictions using the same dataset.

In [12]:
target_predicted = model.predict(data)

An estimator (an object with a `fit` method) with a `predict` method is called
a **predictor**. We can illustrate the prediction mechanism as follows:

![Predictor predict diagram](../figures/api_diagram-predictor.predict.svg)

To predict, a model uses a **prediction function** that uses the input data
together with the model states. As for the learning algorithm and the model
states, the prediction function is specific for each type of model.

Let's now have a look at the computed predictions. For the sake of simplicity,
we look at the first five predicted targets.

In [13]:
target_predicted[:5]

array([0, 0, 0, 1, 0])

Indeed, we can compare these predictions to the actual data...

In [14]:
target[:5]

Unnamed: 0,class
0,0
1,0
2,1
3,1
4,0


...and we could even check if the predictions agree with the real targets:

In [15]:
target[:5] == target_predicted[:5]

Unnamed: 0,class
0,True
1,True
2,False
3,True
4,True


In [16]:
print(
    "Number of correct predictions: "
    f"{(target[:5] == target_predicted[:5]).sum()} / 5"
)

Number of correct predictions: 4 / 5


Here, we see that our model makes a mistake when predicting for the third
sample.

To get a better assessment, we can compute the average success rate.

In [17]:
(target == target_predicted).mean()

0.8891732525285615

This result means that the model makes a correct prediction for approximately
89 samples out of 100. Note that we used the same data to train and evaluate
our model. Can this evaluation be trusted or is it too good to be true?

## Train-test data split

When building a machine learning model, it is important to evaluate the
trained model on data that was not used to fit it, as **generalization** is
more than memorization (meaning we want a rule that generalizes to new data,
without comparing to data we memorized). It is harder to conclude on
never-seen instances than on already seen ones.

Correct evaluation is easily done by leaving out a subset of the data when
training the model and using it afterwards for model evaluation. The data used
to fit a model is called training data while the data used to assess a model
is called testing data.
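
The gap between training and testing accuracy can be illustrated with a small sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption; this is not the adult census dataset), and all variable names are hypothetical. With `n_neighbors=1`, each training sample is its own nearest neighbor, so the training accuracy is perfect by construction and says nothing about unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the adult census dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out 25% of the samples; the model never sees them during fit.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_accuracy = knn_demo.score(X_tr, y_tr)  # evaluated on data the model memorized
test_accuracy = knn_demo.score(X_te, y_te)   # evaluated on data the model never saw
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

Only the second number is an estimate of how the model behaves on new samples.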

We can load more data, which was actually left out of the original data set.


In [18]:
# adult_census_test = pd.read_csv("../datasets/adult-census-numeric-test.csv")

From this new data, we separate our input features and the target to predict,
as in the beginning of this notebook.

In [19]:
# target_test = adult_census_test[target_name]
# data_test = adult_census_test.drop(columns=[target_name])

We can check the number of features and samples available in this new set.

In [20]:
# print(
#     f"The testing dataset contains {data_test.shape[0]} samples and "
#     f"{data_test.shape[1]} features"
# )

Instead of computing the predictions and manually computing the average
success rate, we can use the method `score`. For classifiers, this method
returns the mean accuracy on the given data and targets.

In [21]:
# accuracy = model.score(data_test, target_test)
# model_name = model.__class__.__name__

# print(f"The test accuracy using a {model_name} is {accuracy:.3f}")

We use the generic term **model** for objects whose goodness of fit can be
measured using the `score` method. Let's check the underlying mechanism when
calling `score`:

![Predictor score diagram](../figures/api_diagram-predictor.score.svg)

To compute the score, the predictor first computes the predictions (using the
`predict` method) and then uses a scoring function to compare the true target
`y` and the predictions. Finally, the score is returned.
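
The two steps described above can be reproduced by hand and compared with `score`. The toy arrays below are hypothetical; the sketch relies on the fact that, for scikit-learn classifiers, `score` returns the mean accuracy, which is what `accuracy_score` computes.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data with two numerical features.
X_fit = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_fit = np.array([0, 0, 0, 1, 1, 1])
X_eval = np.array([[0.5, 0.5], [5.5, 5.5], [4.5, 5.0], [1.0, 1.0]])
y_eval = np.array([0, 1, 0, 0])  # the third label deliberately disagrees with its neighbors

clf = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)

# Step 1: compute predictions. Step 2: compare them with the true targets.
manual_score = accuracy_score(y_eval, clf.predict(X_eval))

# score() performs both steps in a single call.
builtin_score = clf.score(X_eval, y_eval)
print(manual_score, builtin_score)
```

Three of the four evaluation samples are predicted correctly, so both values equal 0.75.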

If we compare with the accuracy obtained by wrongly evaluating the model on
the training set, we find that this evaluation was indeed optimistic compared
to the score obtained on a held-out test set.

This shows the importance of always testing the generalization performance of
predictive models on a different set than the one used to train these models.
We will discuss later in more detail how predictive models should be
evaluated.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">In this MOOC, we refer to <strong>generalization performance</strong> of a model when
referring to the test score or test error obtained by comparing the prediction
of a model and the true targets. Equivalent terms for <strong>generalization
performance</strong> are predictive performance and statistical performance. We refer
to <strong>computational performance</strong> of a predictive model when assessing the
computational costs of training a predictive model or using it to make
predictions.</p>
</div>

## Notebook Recap

In this notebook we:

* fitted a **k-nearest neighbors** model on a training dataset;
* evaluated its generalization performance on the testing data;
* introduced the scikit-learn API `.fit(X, y)` (to train a model),
  `.predict(X)` (to make predictions) and `.score(X, y)` (to evaluate a
  model);
* introduced the jargon for estimator, predictor and model.

In [23]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv")

In [24]:
df

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [26]:
# Encode the target as integers: "<=50K" becomes 0 and ">50K" becomes 1.
# astype(int) raises an error if any label was not covered by the mapping.
label_to_int = {"<=50K": 0, ">50K": 1}
df['class'] = df['class'].str.strip().map(label_to_int).astype(int)

In [27]:
df

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


In [29]:
df.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
education,0
education-num,0
marital-status,0
occupation,0
relationship,0
race,0
sex,0
capital-gain,0


In [31]:
# Select only columns with numeric data types:
df_num = df.select_dtypes(exclude=['object'])

In [32]:
df_cat = df.select_dtypes(include=['object'])

In [33]:
df_num

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,class
0,25,7,0,0,40,0
1,38,9,0,0,50,0
2,28,12,0,0,40,1
3,44,10,7688,0,40,1
4,18,10,0,0,30,0
...,...,...,...,...,...,...
48837,27,12,0,0,38,0
48838,40,9,0,0,40,1
48839,58,9,0,0,40,0
48840,22,9,0,0,20,0


In [34]:
df_cat

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States
...,...,...,...,...,...,...,...,...
48837,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States
48838,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
48839,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States
48840,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States


In [38]:
# Encode the categorical variables as integers, one column at a time
encoder = LabelEncoder()
for col in df_cat.columns:
    df_cat[col] = encoder.fit_transform(df_cat[col])


In [39]:
df_cat

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,4,1,4,7,3,2,1,39
1,4,11,2,5,0,4,1,39
2,2,7,2,11,0,4,1,39
3,4,15,2,7,0,2,1,39
4,0,15,4,0,3,4,0,39
...,...,...,...,...,...,...,...,...
48837,4,7,2,13,5,4,0,39
48838,4,11,2,7,0,4,1,39
48839,4,11,6,1,4,4,0,39
48840,4,11,4,1,3,4,1,39
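
The loop above reuses `LabelEncoder`, which scikit-learn documents as intended for target values. As a hedged alternative for feature columns, `OrdinalEncoder` produces the same kind of integer codes for all columns in one call and keeps the categories of each column. The miniature frame `cat_demo` below is hypothetical, built from values visible in the rows shown earlier. Note that with either encoder the integer order is arbitrary (alphabetical), which distance-based and linear models interpret literally.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature of two categorical columns.
cat_demo = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Private", "?"],
    "sex": ["Male", "Male", "Male", "Male", "Female"],
})

# One encoder handles every column and stores the categories of each one.
ordinal_encoder = OrdinalEncoder()
encoded = ordinal_encoder.fit_transform(cat_demo)

encoded_demo = pd.DataFrame(
    encoded, columns=cat_demo.columns, index=cat_demo.index
).astype(int)

print(ordinal_encoder.categories_)  # sorted categories, one array per column
print(encoded_demo)
```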


In [40]:
df = pd.concat([df_num, df_cat], axis=1)

In [41]:
df

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,class,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,25,7,0,0,40,0,4,1,4,7,3,2,1,39
1,38,9,0,0,50,0,4,11,2,5,0,4,1,39
2,28,12,0,0,40,1,2,7,2,11,0,4,1,39
3,44,10,7688,0,40,1,4,15,2,7,0,2,1,39
4,18,10,0,0,30,0,0,15,4,0,3,4,0,39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,12,0,0,38,0,4,7,2,13,5,4,0,39
48838,40,9,0,0,40,1,4,11,2,7,0,4,1,39
48839,58,9,0,0,40,0,4,11,6,1,4,4,0,39
48840,22,9,0,0,20,0,4,11,4,1,3,4,1,39


In [42]:
X = df.drop(columns=['class'])
y = df['class']

In [43]:
X

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,25,7,0,0,40,4,1,4,7,3,2,1,39
1,38,9,0,0,50,4,11,2,5,0,4,1,39
2,28,12,0,0,40,2,7,2,11,0,4,1,39
3,44,10,7688,0,40,4,15,2,7,0,2,1,39
4,18,10,0,0,30,0,15,4,0,3,4,0,39
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,12,0,0,38,4,7,2,13,5,4,0,39
48838,40,9,0,0,40,4,11,2,7,0,4,1,39
48839,58,9,0,0,40,4,11,6,1,4,4,0,39
48840,22,9,0,0,20,4,11,4,1,3,4,1,39


In [44]:
y

Unnamed: 0,class
0,0
1,0
2,1
3,1
4,0
...,...
48837,0
48838,1
48839,0
48840,0


In [46]:
from sklearn.model_selection import train_test_split

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
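
The call above holds out 20% of the rows, and `random_state=123` makes the shuffle reproducible. When the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that preserves the class proportions in both parts. It is not used in the call above; the sketch below shows it on a hypothetical target with 25% positive samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 75 samples of class 0, 25 samples of class 1.
X_bal = np.arange(200).reshape(100, 2)
y_bal = np.array([0] * 75 + [1] * 25)

X_a, X_b, y_a, y_b = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=123, stratify=y_bal
)

print(X_a.shape, X_b.shape)    # 80 training rows, 20 test rows
print(y_a.mean(), y_b.mean())  # share of class 1 in each part
```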

### ALGORITHMS

# 1-DECISION TREE

In [49]:
from sklearn.tree import DecisionTreeClassifier

In [50]:
# Instantiate under a separate name so the imported class is not shadowed
decision_tree = DecisionTreeClassifier()

In [53]:
model = decision_tree

In [54]:
model.fit(X_train, y_train)

In [56]:
model.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

In [57]:
y_pred = model.predict(X_test)

In [60]:
from sklearn.metrics import classification_report

In [62]:
target_names = ['class 0', 'class 1']

In [63]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.89      0.88      7417
     class 1       0.64      0.60      0.62      2352

    accuracy                           0.82      9769
   macro avg       0.76      0.75      0.75      9769
weighted avg       0.82      0.82      0.82      9769
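
The precision and recall columns of the report can be derived from the confusion matrix. The sketch below does so for class 1 on ten hypothetical labels (`y_true_demo`, `y_pred_demo`), then cross-checks the result with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for ten samples.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_class_1 = tp / (tp + fp)  # of the samples predicted as 1, the share that really is 1
recall_class_1 = tp / (tp + fn)     # of the samples that really are 1, the share that was found

print(precision_class_1, precision_score(y_true_demo, y_pred_demo))
print(recall_class_1, recall_score(y_true_demo, y_pred_demo))
```

A low recall for class 1, as in the report above, means that many high-income samples are predicted as class 0.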



## Optimize the model with grid search

In [70]:
# Each key maps to a list of candidate values. Every list here holds a single
# value, so the grid contains one combination; only max_depth differs from
# the defaults reported by get_params() above.
params = {
    'criterion': ['gini'],
    'max_depth': [4],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'min_weight_fraction_leaf': [0.0],
    'max_features': [None],
    'random_state': [None],
    'max_leaf_nodes': [None],
    'min_impurity_decrease': [0.0],
    'class_weight': [None],
    'ccp_alpha': [0.0],
    'monotonic_cst': [None]
}

In [71]:
from sklearn.model_selection import GridSearchCV

In [72]:
gridsearch = GridSearchCV(model, params, cv=5)

In [73]:
gridsearch.fit(X_train, y_train)

In [74]:
y_pred = gridsearch.predict(X_test)

In [75]:
target_names = ['class 0', 'class 1']

In [76]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.86      0.95      0.90      7417
     class 1       0.77      0.52      0.62      2352

    accuracy                           0.85      9769
   macro avg       0.81      0.74      0.76      9769
weighted avg       0.84      0.85      0.84      9769
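
The grid above contains a single candidate per hyperparameter, so nothing is actually compared. The sketch below shows what a search over several candidates might look like. It runs on synthetic data from `make_classification` (an assumption; not the adult census dataset), and the grid values and variable names are hypothetical choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the adult census dataset).
X_gs, y_gs = make_classification(n_samples=400, n_features=8, random_state=0)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=0
)

# Hypothetical grid: 3 x 2 = 6 combinations, each scored with 5-fold cross-validation.
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_gs_train, y_gs_train)

print(search.best_params_)                          # combination with the best mean CV accuracy
print(f"{search.best_score_:.3f}")                  # its mean cross-validated accuracy
print(f"{search.score(X_gs_test, y_gs_test):.3f}")  # accuracy of the refit model on held-out data
```

After fitting, `predict` and `score` on the search object delegate to the best estimator refit on the whole training set.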



# 2-RANDOM FOREST

In [77]:
from sklearn.ensemble import RandomForestClassifier

In [78]:
# Instantiate under a separate name so the imported class is not shadowed
random_forest = RandomForestClassifier()

In [79]:
model1 = random_forest

In [80]:
model1.fit(X_train, y_train)

In [82]:
model1.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [83]:
y_pred = model1.predict(X_test)

In [84]:
from sklearn.metrics import classification_report

In [85]:
target_names = ['class 0', 'class 1']

In [86]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.92      0.90      7417
     class 1       0.72      0.62      0.67      2352

    accuracy                           0.85      9769
   macro avg       0.80      0.77      0.79      9769
weighted avg       0.85      0.85      0.85      9769



## Optimize the model with grid search

In [92]:
# Single-value grid: a forest of 100 trees limited to depth 4
params = {
    'n_estimators': [100],
    # 'criterion': ['gini'],
    'max_depth': [4],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
}

In [93]:
from sklearn.model_selection import GridSearchCV

In [96]:

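# Wrap the random forest in a grid search (cv=5, as in the other searches of this notebook)
gridsearch1 = GridSearchCV(model1, params, cv=5)

# Fit the GridSearchCV object to the training data
gridsearch1.fit(X_train, y_train)

# Predict on the held-out test set with the best estimator found
y_pred = gridsearch1.predict(X_test)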

In [97]:
target_names = ['class 0', 'class 1']

In [98]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.85      0.97      0.90      7417
     class 1       0.81      0.47      0.59      2352

    accuracy                           0.85      9769
   macro avg       0.83      0.72      0.75      9769
weighted avg       0.84      0.85      0.83      9769



# 3-GradientBoostingClassifier

In [99]:
from sklearn.ensemble import GradientBoostingClassifier

In [100]:
# Instantiate under a separate name so the imported class is not shadowed
gradient_boosting = GradientBoostingClassifier()

In [101]:
model2 = gradient_boosting

In [102]:
model2.fit(X_train, y_train)

In [103]:
model2.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'log_loss',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [104]:
y_pred = model2.predict(X_test)

In [105]:
from sklearn.metrics import classification_report

In [106]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.95      0.92      7417
     class 1       0.79      0.60      0.69      2352

    accuracy                           0.87      9769
   macro avg       0.84      0.78      0.80      9769
weighted avg       0.86      0.87      0.86      9769
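
To make the columns of the report concrete: precision for a class is the fraction of predictions of that class that are correct, recall is the fraction of true members of that class that are found, and F1 is their harmonic mean. The sketch below recomputes these values for class 1 from a confusion matrix on a small hand-made label vector (an assumption for illustration, not the census data) and compares them with scikit-learn's own computation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels: six samples of class 0 and four of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_1 = tp / (tp + fp)  # correct class-1 predictions / all class-1 predictions
recall_1 = tp / (tp + fn)     # correct class-1 predictions / all true class-1 samples
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Reference values from scikit-learn; index 1 selects class 1.
ref_precision, ref_recall, ref_f1, support = precision_recall_fscore_support(y_true, y_hat)

print(precision_1, recall_1, f1_1)
print(ref_precision[1], ref_recall[1], ref_f1[1], support[1])
```

Read this way, the recall of 0.60 for class 1 in the report above means that roughly 60% of the 2352 class 1 test samples were identified.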



# Optimize the model with grid search

In [115]:
params = {
    'loss': ['log_loss'],
    'learning_rate': [0.1],
    'n_estimators': [100],
    'subsample': [1.0],
    'criterion': ['friedman_mse'],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'min_weight_fraction_leaf': [0.0],
    'max_depth': [3],
    'min_impurity_decrease': [0.0],
    'init': [None],
    'random_state': [None],
    'max_features': [None],
    'verbose': [0],
    'max_leaf_nodes': [None],
    'warm_start': [False],
    'validation_fraction': [0.1],
    'n_iter_no_change': [None],
    'tol': [0.0001],
    'ccp_alpha': [0.0]
}

In [116]:
from sklearn.model_selection import GridSearchCV

In [117]:
gridsearch2 = GridSearchCV(model2, params, cv=5)

In [118]:

# Fit the GridSearchCV object to your training data
gridsearch2.fit(X_train, y_train)

# Now you can make predictions
y_pred = gridsearch2.predict(X_test)

In [119]:
from sklearn.metrics import classification_report

In [120]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.95      0.92      7417
     class 1       0.79      0.60      0.69      2352

    accuracy                           0.87      9769
   macro avg       0.84      0.78      0.80      9769
weighted avg       0.86      0.87      0.86      9769
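
This report is identical to the one from the untuned model, which is expected: every entry in the grid holds one value (the default), so there is only one combination to try. A quick way to check how many candidates a grid contains is `ParameterGrid`; the wider grid below uses hypothetical values chosen purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# One value per key, as in the grid above: a single candidate.
single_value_grid = {
    "learning_rate": [0.1],
    "n_estimators": [100],
    "max_depth": [3],
}

# Two values per key (hypothetical choices): 2 * 2 * 2 candidates.
wider_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
    "max_depth": [2, 3],
}

n_single = len(ParameterGrid(single_value_grid))
n_wider = len(ParameterGrid(wider_grid))

print(n_single, n_wider)  # 1 8
```

With `cv=5`, the wider grid would require 8 × 5 = 40 fits plus one final refit, so the grid size is worth checking before launching a search.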



# 4-ExtraTreesClassifier

In [121]:
from sklearn.ensemble import ExtraTreesClassifier

In [122]:
# Instantiate under a new name; assigning the instance to
# `ExtraTreesClassifier` would shadow the imported class.
model3 = ExtraTreesClassifier()

In [124]:
model3.fit(X_train, y_train)

In [125]:
model3.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [126]:
y_pred = model3.predict(X_test)

In [127]:
from sklearn.metrics import classification_report

In [128]:
target_names = ['class 0', 'class 1']

In [129]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.92      0.90      7417
     class 1       0.70      0.61      0.65      2352

    accuracy                           0.84      9769
   macro avg       0.79      0.76      0.78      9769
weighted avg       0.84      0.84      0.84      9769



# Optimize the model with grid search

In [136]:
params = {
    'n_estimators': [100],
    'criterion': ['gini'],
    'max_depth': [None],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'min_weight_fraction_leaf': [0.0],
    'max_features': ['sqrt'],
    'max_leaf_nodes': [None],
    'min_impurity_decrease': [0.0],
    'bootstrap': [False],
    'oob_score': [False],
    'n_jobs': [None],
    'random_state': [None],
    'verbose': [0],
    'warm_start': [False],
    'class_weight': [None],
    'ccp_alpha': [0.0],
    'max_samples': [None],
    'monotonic_cst': [None],
}

In [137]:
from sklearn.model_selection import GridSearchCV

In [138]:
gridsearch3 = GridSearchCV(model3, params, cv=5)

In [139]:

# Fit the GridSearchCV object to your training data
gridsearch3.fit(X_train, y_train)

# Now you can make predictions
y_pred = gridsearch3.predict(X_test)

In [140]:
from sklearn.metrics import classification_report

In [141]:
target_names = ['class 0', 'class 1']

In [142]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.92      0.90      7417
     class 1       0.70      0.61      0.65      2352

    accuracy                           0.84      9769
   macro avg       0.79      0.76      0.77      9769
weighted avg       0.84      0.84      0.84      9769
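
The two ExtraTrees reports differ slightly (macro F1 of 0.78 versus 0.77) even though the grid contains only default values. A likely cause is `random_state=None`, which lets each fit draw different random splits. The sketch below, on a synthetic dataset (an assumption for illustration), shows that fixing `random_state` makes two separate fits produce the same predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two independent fits that share the same seed.
seeded_a = ExtraTreesClassifier(n_estimators=20, random_state=42).fit(X_tr, y_tr)
seeded_b = ExtraTreesClassifier(n_estimators=20, random_state=42).fit(X_tr, y_tr)

same_probabilities = np.allclose(
    seeded_a.predict_proba(X_te), seeded_b.predict_proba(X_te)
)
print("identical predictions with a fixed seed:", same_probabilities)
```

Passing the same `random_state` to the estimators in the notebook would make it easier to tell whether a change in score comes from the hyperparameters or from run-to-run noise.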



# 5-AdaBoostClassifier

In [143]:
from sklearn.ensemble import AdaBoostClassifier

In [144]:
# Instantiate under a new name; assigning the instance to
# `AdaBoostClassifier` would shadow the imported class.
model4 = AdaBoostClassifier()

In [146]:
model4.fit(X_train, y_train)



In [147]:
model4.get_params()

{'algorithm': 'SAMME.R',
 'estimator': None,
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}

https://skops.readthedocs.io/en/stable/model_card.html#model-card-content

In [148]:
from sklearn.metrics import classification_report

In [149]:
target_names = ['class 0', 'class 1']

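In [150]:
# Recompute y_pred for model4 before reporting. The output recorded below was
# produced without this step, so it repeats the ExtraTrees grid-search scores.
y_pred = model4.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))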

              precision    recall  f1-score   support

     class 0       0.88      0.92      0.90      7417
     class 1       0.70      0.61      0.65      2352

    accuracy                           0.84      9769
   macro avg       0.79      0.76      0.77      9769
weighted avg       0.84      0.84      0.84      9769
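
A report is only meaningful when `y_pred` comes from the model being evaluated. One way to rule out a stale variable is to wrap fitting, predicting, and reporting in a single function. The helper `fit_and_report` below is hypothetical (it is not part of scikit-learn or of this notebook), and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def fit_and_report(model, X_train, y_train, X_test, y_test, target_names):
    """Fit `model`, predict on the test set, and return the report as a dict."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)  # local variable: cannot leak from another model
    return classification_report(
        y_test, y_pred, target_names=target_names, output_dict=True
    )


# Synthetic stand-in for the notebook's data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

report = fit_and_report(
    AdaBoostClassifier(n_estimators=20, random_state=0),
    X_tr, y_tr, X_te, y_te,
    target_names=["class 0", "class 1"],
)
print(report["accuracy"], report["class 1"]["recall"])
```

Because `y_pred` lives inside the function, each call reports on exactly the model that was passed in.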



# Optimize the model with grid search

In [156]:
params = {'n_estimators':[50], 'learning_rate':[1.0], 'algorithm':['SAMME.R'], 'random_state':[None]}

In [157]:
from sklearn.model_selection import GridSearchCV

In [158]:
gridsearch4 = GridSearchCV(model4, params, cv=5)

In [159]:

# Fit the GridSearchCV object to your training data
gridsearch4.fit(X_train, y_train)

# Now you can make predictions
y_pred = gridsearch4.predict(X_test)



In [160]:
gridsearch4.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__algorithm': 'SAMME.R',
 'estimator__estimator': None,
 'estimator__learning_rate': 1.0,
 'estimator__n_estimators': 50,
 'estimator__random_state': None,
 'estimator': AdaBoostClassifier(),
 'n_jobs': None,
 'param_grid': {'n_estimators': [50],
  'learning_rate': [1.0],
  'algorithm': ['SAMME.R'],
  'random_state': [None]},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 0}
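
In this output, keys prefixed with `estimator__` are parameters of the wrapped `AdaBoostClassifier`, while unprefixed keys such as `cv` and `refit` belong to the `GridSearchCV` object itself. The same double-underscore names can be passed to `set_params`. The sketch below illustrates this on a freshly built search object; the values are arbitrary.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    AdaBoostClassifier(n_estimators=50),
    {"learning_rate": [0.5, 1.0]},  # grid keys name the estimator's parameters directly
    cv=3,
)

# Read a nested parameter through the `estimator__` prefix.
before = search.get_params()["estimator__n_estimators"]

# Update the wrapped estimator and the search object in one call.
search.set_params(estimator__n_estimators=25, cv=5)
after = search.estimator.n_estimators

print(before, after, search.cv)  # 50 25 5
```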

In [161]:
from sklearn.metrics import classification_report

In [162]:
target_names = ['class 0', 'class 1']

In [163]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.88      0.94      0.91      7417
     class 1       0.77      0.60      0.67      2352

    accuracy                           0.86      9769
   macro avg       0.82      0.77      0.79      9769
weighted avg       0.85      0.86      0.85      9769



# 6-HistGradientBoostingClassifier

In [176]:
from sklearn.ensemble import HistGradientBoostingClassifier

In [177]:
model5 = HistGradientBoostingClassifier()

In [178]:
model5.fit(X_train, y_train)

In [180]:
model5.get_params()

{'categorical_features': 'warn',
 'class_weight': None,
 'early_stopping': 'auto',
 'interaction_cst': None,
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'log_loss',
 'max_bins': 255,
 'max_depth': None,
 'max_features': 1.0,
 'max_iter': 100,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': None,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [179]:
y_pred = model5.predict(X_test)

In [181]:
from sklearn.metrics import classification_report

In [182]:
target_names = ['class 0', 'class 1']

In [183]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.89      0.94      0.92      7417
     class 1       0.79      0.64      0.71      2352

    accuracy                           0.87      9769
   macro avg       0.84      0.79      0.81      9769
weighted avg       0.87      0.87      0.87      9769



# Optimize the model with grid search

In [189]:
params = {
    'loss': ['log_loss'],
    'learning_rate': [0.1],
    'max_iter': [100],
    'max_leaf_nodes': [31],
    'max_depth': [None],
    'min_samples_leaf': [20],
    'l2_regularization': [0.0],
    'max_features': [1.0],
    'max_bins': [255],
    'categorical_features': ['warn'],
    'monotonic_cst': [None],
    'interaction_cst': [None],
    'warm_start': [False],
    'early_stopping': ['auto'],
    'scoring': ['loss'],
    'validation_fraction': [0.1],
    'n_iter_no_change': [10],
    'tol': [1e-07],
    'verbose': [0],
    'random_state': [None],
    'class_weight': [None]
}

In [190]:
from sklearn.model_selection import GridSearchCV

In [191]:
gridsearch5 = GridSearchCV(model5, params, cv=5)

In [192]:
gridsearch5.fit(X_train, y_train)

In [193]:
gridsearch5.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__categorical_features': 'warn',
 'estimator__class_weight': None,
 'estimator__early_stopping': 'auto',
 'estimator__interaction_cst': None,
 'estimator__l2_regularization': 0.0,
 'estimator__learning_rate': 0.1,
 'estimator__loss': 'log_loss',
 'estimator__max_bins': 255,
 'estimator__max_depth': None,
 'estimator__max_features': 1.0,
 'estimator__max_iter': 100,
 'estimator__max_leaf_nodes': 31,
 'estimator__min_samples_leaf': 20,
 'estimator__monotonic_cst': None,
 'estimator__n_iter_no_change': 10,
 'estimator__random_state': None,
 'estimator__scoring': 'loss',
 'estimator__tol': 1e-07,
 'estimator__validation_fraction': 0.1,
 'estimator__verbose': 0,
 'estimator__warm_start': False,
 'estimator': HistGradientBoostingClassifier(),
 'n_jobs': None,
 'param_grid': {'loss': ['log_loss'],
  'learning_rate': [0.1],
  'max_iter': [100],
  'max_leaf_nodes': [31],
  'max_depth': [None],
  'min_samples_leaf': [20],
  'l2_regularization': [0.0],
  

In [194]:
y_pred = gridsearch5.predict(X_test)

In [195]:
from sklearn.metrics import classification_report

In [196]:
target_names = ['class 0', 'class 1']

In [197]:
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.89      0.94      0.92      7417
     class 1       0.79      0.65      0.71      2352

    accuracy                           0.87      9769
   macro avg       0.84      0.80      0.81      9769
weighted avg       0.87      0.87      0.87      9769

