# Practical Evaluation - Progress 1

Instructions

Carry out a regression study on the Abalone dataset, available at http://archive.ics.uci.edu/dataset/1/abalone. The dataset has 8 input attributes of abalone, a type of sea snail (sex, length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight), and one output: the number of rings in the shell. Adding 1.5 to the ring count gives an estimate of the animal's age in years.

Using scikit-learn, determine which model among SVR, KNN, and Gaussian processes achieves the lowest RMSE when predicting the number of rings. Tune the most important hyperparameters of each model to obtain the best result, evaluating with a 70%-30% hold-out split.

Submit a PDF printout of the Python notebook showing all the code used and highlighting the best model together with its (lowest) RMSE.
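
The two quantities named in the instructions, the RMSE and the age estimate (rings + 1.5), can be sketched as small helpers. The function names `rmse` and `rings_to_age` are hypothetical, and the predicted values below are made up purely to exercise the code; they are not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rings_to_age(rings):
    """Approximate age in years: ring count plus 1.5."""
    return np.asarray(rings, dtype=float) + 1.5

# Ring counts of the first five samples, with made-up predictions.
toy_true = [15, 7, 9, 10, 7]
toy_pred = [13, 8, 9, 11, 7]

toy_rmse = rmse(toy_true, toy_pred)
toy_ages = rings_to_age(toy_true)
print("RMSE:", toy_rmse)
print("Ages:", toy_ages)
```

Because the RMSE is expressed in the same units as the target, a value near 2 means predictions are off by roughly two rings on average.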

# Initial Setup: Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

# Obtain and Explore the Data

In [None]:
!pip install -U ucimlrepo



In [None]:
from ucimlrepo import fetch_ucirepo

In [None]:
abalone = fetch_ucirepo(id=1)

X = abalone.data.features
y = abalone.data.targets

In [None]:
# Metadata
print(abalone.metadata)

{'uci_id': 1, 'name': 'Abalone', 'repository_url': 'https://archive.ics.uci.edu/dataset/1/abalone', 'data_url': 'https://archive.ics.uci.edu/static/public/1/data.csv', 'abstract': 'Predict the age of abalone from physical measurements', 'area': 'Biology', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Tabular'], 'num_instances': 4177, 'num_features': 8, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['Rings'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C55C7W', 'creators': ['Warwick Nash', 'Tracy Sellers', 'Simon Talbot', 'Andrew Cawthorn', 'Wes Ford'], 'intro_paper': None, 'additional_info': {'summary': 'Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- 

In [None]:
print(abalone.variables)

             name     role         type demographic  \
0             Sex  Feature  Categorical        None   
1          Length  Feature   Continuous        None   
2        Diameter  Feature   Continuous        None   
3          Height  Feature   Continuous        None   
4    Whole_weight  Feature   Continuous        None   
5  Shucked_weight  Feature   Continuous        None   
6  Viscera_weight  Feature   Continuous        None   
7    Shell_weight  Feature   Continuous        None   
8           Rings   Target      Integer        None   

                   description  units missing_values  
0         M, F, and I (infant)   None             no  
1    Longest shell measurement     mm             no  
2      perpendicular to length     mm             no  
3           with meat in shell     mm             no  
4                whole abalone  grams             no  
5               weight of meat  grams             no  
6  gut weight (after bleeding)  grams             no  
7        

In [None]:
print("First 5 rows of the features (X):")
print(X.head())

First 5 rows of the features (X):
  Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  \
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010   
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485   
2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415   
3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140   
4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395   

   Shell_weight  
0         0.150  
1         0.070  
2         0.210  
3         0.155  
4         0.055  


In [None]:
print("\nFirst 5 rows of the target (y):")
print(y.head())


First 5 rows of the target (y):
   Rings
0     15
1      7
2      9
3     10
4      7


# Define X and y

In [None]:
X = abalone.data.features
y = abalone.data.targets

In [None]:
print("Independent variables (X):")
print(X.head())

Independent variables (X):
  Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  \
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010   
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485   
2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415   
3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140   
4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395   

   Shell_weight  
0         0.150  
1         0.070  
2         0.210  
3         0.155  
4         0.055  


In [None]:
print("\nTarget variable (y):")
print(y.head())


Target variable (y):
   Rings
0     15
1      7
2      9
3     10
4      7


# Data Cleaning

In [None]:
print("Missing values per column:")
print(X.isnull().sum())

Missing values per column:
Sex               0
Length            0
Diameter          0
Height            0
Whole_weight      0
Shucked_weight    0
Viscera_weight    0
Shell_weight      0
dtype: int64


In [None]:
print("\nMissing values in the target (y):")
print(y.isnull().sum())


Missing values in the target (y):
Rings    0
dtype: int64


In [None]:
print("\nNumber of duplicate rows:", pd.concat([X, y], axis=1).duplicated().sum())


Number of duplicate rows: 0


# Convert the Sex Column to Numeric Values

In [None]:
print("Original values of 'Sex':")
print(X['Sex'].head())

Original values of 'Sex':
0    M
1    M
2    F
3    M
4    I
Name: Sex, dtype: object


In [None]:
sex_mapping = {'M': 0, 'F': 1, 'I': 2}
X = X.copy()  # work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
X['Sex'] = X['Sex'].map(sex_mapping)


In [None]:
print("\nNumeric values of 'Sex' after label encoding:")
print(X['Sex'].head())


Numeric values of 'Sex' after label encoding:
0    0
1    0
2    1
3    0
4    2
Name: Sex, dtype: int64
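
The integer codes 0, 1, 2 impose an order on a category that has none: for distance-based models such as KNN or an RBF kernel, I (2) ends up farther from M (0) than F (1) is. One common alternative, shown here only as a sketch on a small hypothetical frame (`toy` is not part of the notebook), is one-hot encoding with `pd.get_dummies`. Whether it improves the RMSE on this dataset has not been checked here.

```python
import pandas as pd

# Small hypothetical frame mirroring the first rows of the dataset.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
toy_onehot = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(toy_onehot)
```

Each sample then differs from a sample of another sex by the same amount, regardless of which two categories are involved.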


In [None]:
print("Independent variables (X):")
print(X.head())

Independent variables (X):
   Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  \
0    0   0.455     0.365   0.095        0.5140          0.2245   
1    0   0.350     0.265   0.090        0.2255          0.0995   
2    1   0.530     0.420   0.135        0.6770          0.2565   
3    0   0.440     0.365   0.125        0.5160          0.2155   
4    2   0.330     0.255   0.080        0.2050          0.0895   

   Viscera_weight  Shell_weight  
0          0.1010         0.150  
1          0.0485         0.070  
2          0.1415         0.210  
3          0.1140         0.155  
4          0.0395         0.055  


In [None]:
print("\nTarget variable (y):")
print(y.head())


Target variable (y):
   Rings
0     15
1      7
2      9
3     10
4      7


# Standardize X

In [None]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

print("First 5 rows of standardized X:")
print(X_scaled.head())

First 5 rows of standardized X:
        Sex    Length  Diameter    Height  Whole_weight  Shucked_weight  \
0 -1.154346 -0.574558 -0.432149 -1.064424     -0.641898       -0.607685   
1 -1.154346 -1.448986 -1.439929 -1.183978     -1.230277       -1.170910   
2  0.053798  0.050033  0.122130 -0.107991     -0.309469       -0.463500   
3 -1.154346 -0.699476 -0.432149 -0.347099     -0.637819       -0.648238   
4  1.261943 -1.615544 -1.540707 -1.423087     -1.272086       -1.215968   

   Viscera_weight  Shell_weight  
0       -0.726212     -0.638217  
1       -1.205221     -1.212987  
2       -0.356690     -0.207139  
3       -0.607600     -0.602294  
4       -1.287337     -1.320757  


# Split into Training and Test Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"min= {y_train['Rings'].min()}, max= {y_train['Rings'].max()}")

Training set size: (2923, 8)
Test set size: (1254, 8)
min= 1, max= 29
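
Note that the scaler in this notebook is fitted on all rows before the split, so the mean and standard deviation of the test rows influence the preprocessing. A leakage-free variant would split first and fit `StandardScaler` on the training rows only. The sketch below illustrates that order on synthetic stand-in data (`X_demo`, `y_demo` and the other names are hypothetical); the reported results were not recomputed this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic stand-in for the features
y_demo = rng.integers(1, 30, size=200)                  # synthetic stand-in for Rings

# Split first, with the same 70%-30% hold-out proportions.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The statistics come from the training rows only.
demo_scaler = StandardScaler().fit(Xd_train)
Xd_train_s = demo_scaler.transform(Xd_train)
Xd_test_s = demo_scaler.transform(Xd_test)  # test rows reuse the training mean and std

print("Train column means:", Xd_train_s.mean(axis=0).round(6))
print("Test column means: ", Xd_test_s.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance, while the test columns are only approximately centered, which is the expected behavior.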


# Train Regression Models

In [None]:
from sklearn import metrics

## Support Vector Regression (SVR)

In [None]:
mdl = SVR(
    gamma=0.1,
    tol=1e-10,
    C=7,
    epsilon=1,
)
mdl.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array to avoid the column-vector warning
Y_hat = mdl.predict(X_test)

svr_mse = metrics.mean_squared_error(y_test, Y_hat)
print("SVR MSE =", svr_mse, ", RMSE = ", np.sqrt(svr_mse))


SVR MSE = 4.502144322268116 , RMSE =  2.121825704969217
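
The cell above shows only the final hyperparameter values. One way such values could be searched systematically is a hold-out loop over a grid, sketched below. It runs on synthetic data from `make_regression`, the candidate values are hypothetical (not the grid used to obtain the reported result), and the validation split is carved out of the training portion so the test set stays untouched during selection.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data; in the notebook this role would be played by X_train / y_train.
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

# Hypothetical candidate values for the three SVR hyperparameters tuned above.
svr_grid = ParameterGrid({"C": [1, 7, 50], "gamma": [0.01, 0.1], "epsilon": [0.1, 1]})

svr_results = []
for params in svr_grid:
    model = SVR(**params).fit(X_fit, y_fit)
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    svr_results.append((val_rmse, params))

# Keep the combination with the lowest validation RMSE.
best_rmse, best_params = min(svr_results, key=lambda item: item[0])
print("Best parameters:", best_params)
print("Validation RMSE:", best_rmse)
```

After selection, the chosen configuration would be refitted on the full training set and scored once on the test set.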


## K-Nearest Neighbors Regression

In [None]:
mdl = KNeighborsRegressor(
    n_neighbors=18,
    weights = 'uniform',
    algorithm = 'auto',
)
mdl.fit(X_train, y_train)
Y_hat = mdl.predict(X_test)

knr_mse = metrics.mean_squared_error(y_test, Y_hat)
print("KNN MSE =", knr_mse, ", RMSE = ", np.sqrt(knr_mse))

KNN MSE = 4.8315784551164676 , RMSE =  2.19808517922224
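
For KNN, the most influential settings are usually `n_neighbors` and `weights`. The sketch below sweeps both on synthetic data and keeps the pair with the lowest validation RMSE. The range of `k` and the data are assumptions made for illustration; the value `n_neighbors=18` used above was not derived from this code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data, split 70%-30% into fitting and validation parts.
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

knn_rmse = {}
for k in range(1, 31, 2):                    # odd neighbor counts from 1 to 29
    for weights in ("uniform", "distance"):  # plain average vs. inverse-distance weighting
        knn = KNeighborsRegressor(n_neighbors=k, weights=weights).fit(X_fit, y_fit)
        knn_rmse[(k, weights)] = np.sqrt(mean_squared_error(y_val, knn.predict(X_val)))

best_k, best_weights = min(knn_rmse, key=knn_rmse.get)
print("Best k:", best_k, "| weights:", best_weights)
print("Validation RMSE:", knn_rmse[(best_k, best_weights)])
```

Small `k` tends to overfit and large `k` to oversmooth, so plotting `knn_rmse` against `k` is a quick way to see where the error bottoms out.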


## Gaussian Process Regression

In [None]:
mdl = GaussianProcessRegressor(
    alpha= 1e-1,
    n_restarts_optimizer= 1,
    optimizer= 'fmin_l_bfgs_b'
)
mdl.fit(X_train, y_train)
Y_hat = mdl.predict(X_test)

gpr_mse = metrics.mean_squared_error(y_test, Y_hat)
print("GPR MSE =", gpr_mse, ", RMSE = ", np.sqrt(gpr_mse))

GPR MSE = 5.188118124548347 , RMSE =  2.2777440867113117
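
No `kernel` is passed above. According to the scikit-learn documentation, the default is then a constant kernel times an RBF kernel with both bounds marked as fixed, so no kernel hyperparameters are optimized and `n_restarts_optimizer` appears to have no effect. The sketch below shows how an explicit, trainable kernel could be supplied instead. The kernel composition, `normalize_y=True`, and the toy sine data are all assumptions chosen for illustration, not a configuration tested on Abalone.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(60, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.2, size=60)  # noisy synthetic target

# Hypothetical kernel: signal variance * RBF, plus a learned noise term.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    normalize_y=True,        # center and scale the target before fitting
    n_restarts_optimizer=2,  # extra optimizer starts; meaningful because bounds are not fixed
    random_state=0,
)
gpr.fit(X_toy, y_toy)

print("Kernel before fitting:", kernel)
print("Kernel after fitting: ", gpr.kernel_)
```

Exact Gaussian process fitting scales cubically with the number of training rows, so on the 2923 training samples used here a trainable kernel with restarts can take noticeably longer than the fixed default.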


## COMPARISON

In [None]:
print("SVR MSE =", svr_mse, ", RMSE = ", np.sqrt(svr_mse))
print("KNN MSE =", knr_mse, ", RMSE = ", np.sqrt(knr_mse))
print("GPR MSE =", gpr_mse, ", RMSE = ", np.sqrt(gpr_mse))

SVR MSE = 4.502144322268116 , RMSE =  2.121825704969217
KNN MSE = 4.8315784551164676 , RMSE =  2.19808517922224
GPR MSE = 5.188118124548347 , RMSE =  2.2777440867113117
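
The same comparison can be laid out as a small table sorted by RMSE, which makes the best model explicit. This sketch copies the MSE values printed above into a dictionary (`reported_mse` and `summary` are hypothetical names); in the notebook itself the variables `svr_mse`, `knr_mse`, and `gpr_mse` could be used directly.

```python
import numpy as np
import pandas as pd

# Test-set MSE values copied from the printed output of this notebook.
reported_mse = {
    "SVR": 4.502144322268116,
    "KNN": 4.8315784551164676,
    "GPR": 5.188118124548347,
}

summary = (
    pd.DataFrame({"MSE": reported_mse})
    .assign(RMSE=lambda df: np.sqrt(df["MSE"]))  # RMSE is the square root of the MSE
    .sort_values("RMSE")                         # lowest error first
)
best_model = summary.index[0]

print(summary)
print("Best model:", best_model)
```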


# CONCLUSION

Comparing the three regression models on the 30% test split, SVR gave the lowest RMSE after adjusting its hyperparameters (gamma, C, and epsilon):
1. SVR MSE = 4.502144322268116 , RMSE =  2.121825704969217
2. KNN MSE = 4.8315784551164676 , RMSE =  2.19808517922224
3. GPR MSE = 5.188118124548347 , RMSE =  2.2777440867113117