<a href="https://colab.research.google.com/github/Viny2030/sklearn/blob/main/02_numerical_pipeline_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://lms.fun-mooc.fr/courses/course-v1:inria+41026+session04/courseware/4594c1d8c9f847bdbc733c34d941c988/928e7401d2ed48a791036c555bca6d06/


https://cienciadedatos.net/documentos/py55-pandas-category-modelos-machine-learning


# First model with scikit-learn

In this notebook, we present how to build predictive models on tabular
datasets, with only numerical features.

In particular we highlight:

* the scikit-learn API: `.fit(X, y)`/`.predict(X)`/`.score(X, y)`;
* how to evaluate the generalization performance of a model with a train-test
  split.

Here API stands for "Application Programming Interface" and refers to a set of
conventions to build self-consistent software. Notice that you can visit the
Glossary for more info on technical jargon.

## Loading the dataset with Pandas

We use the "adult_census" dataset described in the previous notebook. For more
details about the dataset see <http://www.openml.org/d/1590>.

Numerical data is the most natural type of data used in machine learning and
can (almost) directly be fed into predictive models. The file loaded here also
contains categorical (string) columns, which are encoded as numbers before a
model is fitted.



In [1]:
##import pandas as pd

##adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")

In [2]:
import pandas as pd

In [3]:
url = "https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv"

In [4]:
adult_census = pd.read_csv(url)

Let's have a look at the first records of this dataframe:

In [5]:
adult_census.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


We see that this CSV file contains all information: the target that we would
like to predict (i.e. `"class"`) and the data that we want to use to train our
predictive model (i.e. the remaining columns). The first step is to separate
columns to get on one side the target and on the other side the data.

## Separate the data and the target

In [6]:
target_name = "class"
target = adult_census[target_name]
target

Unnamed: 0,class
0,<=50K
1,<=50K
2,>50K
3,>50K
4,<=50K
...,...
48837,<=50K
48838,>50K
48839,<=50K
48840,<=50K


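In [7]:
# Reassigning the loop variable would not modify the Series, so build an
# integer-encoded copy with `map` instead; `target` itself stays unchanged.
target_encoded = target.str.strip().map({"<=50K": 0, ">50K": 1})
print(target)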

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object


In [8]:
adult_census

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [9]:
data = adult_census.drop(columns=[target_name])
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States


We can now take a closer look at the variables, also called features, that we
later use to build our predictive model. We can also check how many samples
are available in our dataset.

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      48842 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  48842 non-null  object
dtypes: int64(5), object(8)
memory usage: 4.8+ MB


In [11]:
data.columns

Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [12]:
print(
    f"The dataset contains {data.shape[0]} samples and "
    f"{data.shape[1]} features"
)

The dataset contains 48842 samples and 13 features


## Fit a model and make predictions

We now build a classification model using the "K-nearest neighbors" strategy.
To predict the target of a new sample, a k-nearest neighbors model looks at
the `k` closest samples in the training set and predicts the majority target
among them.

<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">We use a K-nearest neighbors here. However, be aware that it is seldom useful
in practice. We use it because it is an intuitive algorithm. In the next
notebook, we will introduce better models.</p>
</div>
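
To make the majority-vote idea concrete, here is a minimal from-scratch sketch in plain Python. It is an illustration only, not how scikit-learn implements `KNeighborsClassifier`; the toy points, the helper name `knn_predict`, and the choice of Euclidean distance are assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # Return the most frequent label among them
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two features per sample, e.g. age and hours per week
train_points = [(25, 40), (30, 45), (28, 38), (50, 60), (55, 50), (48, 55)]
train_labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

prediction_young = knn_predict(train_points, train_labels, query=(27, 42), k=3)
prediction_older = knn_predict(train_points, train_labels, query=(52, 58), k=3)
print(prediction_young, prediction_older)
```

Each query point receives the label shared by most of its three closest training points.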

The `fit` method is called to train the model from the input (features) and
target data.



In [13]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# K-nearest neighbors needs numeric inputs, so the string columns are encoded first.

# 1. Identify the columns with string (object) dtype:
categorical_cols = data.select_dtypes(include=["object"]).columns

# 2. Replace each category with a numeric code. OrdinalEncoder is the
#    scikit-learn encoder meant for feature columns (LabelEncoder is meant
#    for the target).
encoder = OrdinalEncoder()
data[categorical_cols] = encoder.fit_transform(data[categorical_cols])

In [14]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
_ = model.fit(data, target)

Learning can be represented as follows:

![Predictor fit diagram](https://github.com/Viny2030/sklearn/blob/figures/api_diagram-predictor.fit.svg?raw=1)

In scikit-learn an object that has a `fit` method is called an **estimator**.
The method `fit` is composed of two elements: (i) a **learning algorithm** and
(ii) some **model states**. The learning algorithm takes the training data and
training target as input and sets the model states. These model states are
later used to either predict (for classifiers and regressors) or transform
data (for transformers).

Both the learning algorithm and the type of model states are specific to each
type of model.
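
As a small illustration of these model states: by scikit-learn convention, attributes set by `fit` end with an underscore. The sketch below uses a tiny made-up dataset (the values and variable names are assumptions for illustration only) and inspects two such attributes of a fitted `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny hypothetical dataset: 6 samples, 2 numerical features
X_toy = np.array([[25, 40], [30, 45], [28, 38], [50, 60], [55, 50], [48, 55]])
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])

toy_model = KNeighborsClassifier(n_neighbors=3)

# Before `fit`, the learned attributes do not exist yet
print(hasattr(toy_model, "classes_"))

toy_model.fit(X_toy, y_toy)

# After `fit`, the model states are available as attributes ending in "_"
print(toy_model.classes_)        # the distinct target labels seen during fit
print(toy_model.n_features_in_)  # the number of features seen during fit
```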


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Here and later, we use the name <tt class="docutils literal">data</tt> and <tt class="docutils literal">target</tt> to be explicit. In
scikit-learn documentation, <tt class="docutils literal">data</tt> is commonly named <tt class="docutils literal">X</tt> and <tt class="docutils literal">target</tt> is
commonly called <tt class="docutils literal">y</tt>.</p>
</div>


Let's use our model to make some predictions using the same dataset.


In [15]:
y_pred = model.predict(data)

An estimator (an object with a `fit` method) with a `predict` method is called
a **predictor**. We can illustrate the prediction mechanism as follows:

![Predictor predict diagram](https://github.com/Viny2030/sklearn/blob/figures/api_diagram-predictor.predict.svg?raw=1)

To predict, a model uses a **prediction function** that uses the input data
together with the model states. As for the learning algorithm and the model
states, the prediction function is specific for each type of model.

Let's now have a look at the computed predictions. For the sake of simplicity,
we look at the first five predicted targets.


In [18]:
y_pred[:5]

Indeed, we can compare these predictions to the actual data...


In [19]:
target[:5]

Unnamed: 0,class
0,<=50K
1,<=50K
2,>50K
3,>50K
4,<=50K


...and we could even check if the predictions agree with the real targets:


In [21]:
target[:5] == y_pred[:5]

In [23]:
print(
    "Number of correct predictions: "
    f"{(target[:5] == y_pred[:5]).sum()} / 5"
)


Here, we see that our model makes a mistake when predicting for the first
sample.

To get a better assessment, we can compute the average success rate.


In [25]:
(target == y_pred).mean()

This result means that the model makes a correct prediction for approximately
82 samples out of 100. Note that we used the same data to train and evaluate
our model. Can this evaluation be trusted or is it too good to be true?

## Train-test data split

When building a machine learning model, it is important to evaluate the
trained model on data that was not used to fit it, because **generalization**
is more than memorization: we want a rule that works on new data, not one that
merely looks up the samples it has already seen. Predicting never-seen
instances is harder than predicting already seen ones.

Correct evaluation is easily done by leaving out a subset of the data when
training the model and using it afterwards for model evaluation. The data used
to fit a model is called training data while the data used to assess a model
is called testing data.

Here we do not have a separate file of left-out samples, so we set aside part
of the dataset ourselves with `train_test_split`.
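
The gap between memorization and generalization can be illustrated on synthetic data. The sketch below is a toy under stated assumptions: it generates random features with random labels (so there is nothing real to learn), fits a 1-nearest-neighbor classifier, and compares accuracy on the training data with accuracy on held-out data. All variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): features and labels are independent noise
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

# With k=1, each training sample is its own nearest neighbor,
# so the model simply memorizes the training labels
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_accuracy = memorizer.score(X_tr, y_tr)
test_accuracy = memorizer.score(X_te, y_te)
print(f"Accuracy on training data: {train_accuracy:.2f}")
print(f"Accuracy on held-out data: {test_accuracy:.2f}")
```

The training accuracy is perfect because the model has memorized every sample, while the held-out accuracy should stay near chance level since the labels are random. Only the second number says anything about behavior on new data.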



In [26]:
from sklearn.model_selection import train_test_split

In [57]:
adult_census1 = pd.read_csv("https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv")

In [58]:
adult_census1

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [61]:
adult_census1['class'] = adult_census1['class'].astype(str)

In [62]:
# Assign the result back to the column; calling replace(..., inplace=True) on a
# column selection acts on a temporary copy.
adult_census1['class'] = adult_census1['class'].replace({'<=50K': 0, '>50K': 1})


In [66]:
import pandas as pd

# Assuming the DataFrame is already loaded in adult_census1

# Check the column dtype
print(adult_census1['class'].dtype)

# Convert to string if necessary
adult_census1['class'] = adult_census1['class'].astype(str)

# Replace the values and assign the result back to the column.
# `replace` only changes values that match a dictionary key exactly, so labels
# that differ in any way (for example by surrounding whitespace) are left as-is.
adult_census1['class'] = adult_census1['class'].replace({'<=50K': 0, '>50K': 1})

# Check the result
print(adult_census1['class'].head())

object
0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object


In [68]:
adult_census1['class'].value_counts()

Unnamed: 0_level_0,count
class,Unnamed: 1_level_1
<=50K,37155
>50K,11687


In [69]:
import pandas as pd

# Assuming the DataFrame is already loaded in adult_census1

# Check the column dtype
print(adult_census1['class'].dtype)

# Convert to string if necessary
adult_census1['class'] = adult_census1['class'].astype(str)

# Fill missing values (if any), assigning the result back instead of using inplace=True
adult_census1['class'] = adult_census1['class'].fillna('unknown')

# Replace the values and assign the result back to the column
adult_census1['class'] = adult_census1['class'].replace({'<=50K': 0, '>50K': 1})

# Check the result
print(adult_census1['class'].head())

object
0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object




In [72]:
import pandas as pd

# Load the DataFrame (adjust the path if necessary)
adult_census1 = pd.read_csv("https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv")

# Convert the 'class' column to string (if necessary)
adult_census1['class'] = adult_census1['class'].astype(str)

# Fill missing values (if any), assigning the result back instead of using inplace=True
adult_census1['class'] = adult_census1['class'].fillna('unknown')

# Build the replacement dictionary
replace_dict = {'<=50K': 0, '>50K': 1}

# Replace the values and assign the result back to the column
adult_census1['class'] = adult_census1['class'].replace(replace_dict)

# Check the result
print(adult_census1.head())

   age   workclass      education  education-num       marital-status  \
0   25     Private           11th              7        Never-married   
1   38     Private        HS-grad              9   Married-civ-spouse   
2   28   Local-gov     Assoc-acdm             12   Married-civ-spouse   
3   44     Private   Some-college             10   Married-civ-spouse   
4   18           ?   Some-college             10        Never-married   

           occupation relationship    race      sex  capital-gain  \
0   Machine-op-inspct    Own-child   Black     Male             0   
1     Farming-fishing      Husband   White     Male             0   
2     Protective-serv      Husband   White     Male             0   
3   Machine-op-inspct      Husband   Black     Male          7688   
4                   ?    Own-child   White   Female             0   

   capital-loss  hours-per-week  native-country   class  
0             0              40   United-States   <=50K  
1             0              5



In [73]:
adult_census1

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


From this data, we separate the input features and the target to predict, as
at the beginning of this notebook, and then split both into a training set and
a testing set.


In [74]:
X = adult_census.drop(columns=[target_name])
y = adult_census[target_name]

In [75]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [76]:
y_train

Unnamed: 0,class
5551,<=50K
36721,<=50K
2638,<=50K
36214,<=50K
27010,>50K
...,...
11284,<=50K
44732,<=50K
38158,<=50K
860,<=50K


We can check the number of features and samples available in this new set.


In [77]:
print(
    f"The testing dataset contains {X_test.shape[0]} samples and "
    f"{X_test.shape[1]} features"
)

The testing dataset contains 16118 samples and 13 features


Instead of computing the predictions and then the average success rate by
hand, we can use the `score` method. For classifiers, this method returns the
mean accuracy on the given data and targets.

In [78]:
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Column groups are taken from the training set so that nothing is derived
# from the test data
categorical_features = X_train.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X_train.select_dtypes(exclude=["object"]).columns.tolist()

# Pass numerical columns through unchanged and one-hot encode the categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features),
    ]
)

# Chain the preprocessing and the k-nearest neighbors classifier
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", KNeighborsClassifier()),
])

# Fit on the training data; string class labels are accepted as-is
pipeline.fit(X_train, y_train)

# `score` returns the mean accuracy on the held-out test data
accuracy = pipeline.score(X_test, y_test)

model_name = pipeline.named_steps["classifier"].__class__.__name__

print(f"The test accuracy using a {model_name} is {accuracy:.3f}")

The test accuracy using a KNeighborsClassifier is 0.852


We use the generic term **model** for objects whose goodness of fit can be
measured using the `score` method. Let's check the underlying mechanism when
calling `score`:

![Predictor score diagram](https://github.com/Viny2030/sklearn/blob/figures/api_diagram-predictor.score.svg?raw=1)

To compute the score, the predictor first computes the predictions (using the
`predict` method) and then uses a scoring function to compare the true target
`y` and the predictions. Finally, the score is returned.
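
The two steps described above can be reproduced by hand. The following sketch, on a tiny made-up dataset (all values and names are assumptions for illustration), checks that calling `score` on a classifier gives the same number as computing predictions and passing them to `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Tiny hypothetical dataset
X_fit = np.array([[25, 40], [30, 45], [28, 38], [50, 60], [55, 50], [48, 55]])
y_fit = np.array([0, 0, 0, 1, 1, 1])
X_eval = np.array([[27, 42], [52, 58], [29, 44], [26, 39]])
y_eval = np.array([0, 1, 1, 0])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)

# Step 1: compute predictions; step 2: compare them with the true targets
manual_score = accuracy_score(y_eval, clf.predict(X_eval))

# `score` performs both steps in one call
builtin_score = clf.score(X_eval, y_eval)

print(manual_score, builtin_score)
```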


In general, the accuracy obtained by wrongly evaluating a model on its own
training set is optimistic compared to the score obtained on a held-out test
set. Note that the two evaluations in this notebook are not directly
comparable, because the first model was fitted on integer-encoded columns
while the pipeline above one-hot encodes them.

This shows the importance of always testing the generalization performance of
predictive models on a different set than the one used to train these models.
We will discuss later in more detail how predictive models should be
evaluated.


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">In this MOOC, we refer to <strong>generalization performance</strong> of a model when
referring to the test score or test error obtained by comparing the prediction
of a model and the true targets. Equivalent terms for <strong>generalization
performance</strong> are predictive performance and statistical performance. We refer
to <strong>computational performance</strong> of a predictive model when assessing the
computational costs of training a predictive model or using it to make
predictions.</p>
</div>



## Notebook Recap

In this notebook we:

* fitted a **k-nearest neighbors** model on a training dataset;
* evaluated its generalization performance on the testing data;
* introduced the scikit-learn API `.fit(X, y)` (to train a model),
  `.predict(X)` (to make predictions) and `.score(X, y)` (to evaluate a
  model);
* introduced the jargon for estimator, predictor and model.

In [79]:
from sklearn.metrics import classification_report

In [None]:
from sklearn.metrics import classification_report

# Predict with the fitted pipeline, which applies the same preprocessing to X_test.
# (`model` was fitted on the integer-encoded `data`, so it cannot be used on the
# raw X_test.)
y_pred = pipeline.predict(X_test)

# y_pred and y_test now cover the same held-out samples
print(classification_report(y_test, y_pred))



✅ The pandas `replace` method replaces specific values in a DataFrame or Series with new values, which makes it possible to recode the data to fit the needs of the analysis.

Its main parameters are:

to_replace: the value or values to replace in the DataFrame or Series. It can be a scalar, a list of values, or a mapping dictionary.

value: the value or values to use as the replacement. It can be a scalar, a list of values, or a mapping dictionary.


# Lists of old and new values, assigned back to each column:
#df_clientes["Genero"] = df_clientes["Genero"].replace(['F','M'], [1,0])
#df_clientes["Auto"] = df_clientes["Auto"].replace(['Y','N'], [1,0])
#df_clientes["Propiedad"] = df_clientes["Propiedad"].replace(['Y','N'], [1,0])


# Mapping dictionary:
#df_clientes.replace({'Auto':{'Y':1,'N':0},'Propiedad':{'Y':1,'N':0}}, inplace=True)
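
A runnable sketch of the dictionary form on a small made-up Series follows. The values and the leading whitespace are assumptions chosen to show a common pitfall: `replace` leaves non-matching values untouched, so labels with stray spaces are silently not replaced unless they are stripped first.

```python
import pandas as pd

# Hypothetical labels with a leading space, as can happen in raw CSV files
labels = pd.Series([" <=50K", " >50K", " <=50K"], name="class")
mapping = {"<=50K": 0, ">50K": 1}

# The keys do not match exactly, so nothing is replaced
unchanged = labels.replace(mapping)

# Stripping the whitespace first lets the mapping apply; `map` turns any
# label missing from the dictionary into NaN, which makes mismatches visible
encoded = labels.str.strip().map(mapping)

print(unchanged.tolist())
print(encoded.tolist())
```

Using `map` instead of `replace` for a full recoding is a design choice: an unexpected label then shows up as a missing value rather than passing through unnoticed.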