# TP3: Estimating the weight and dimensions of Mercado Libre shipments

# Course: Introduction to supervised learning

## Dataset analysis. Communication of results and conclusions

Based on the theory covered in the course and on the second lab, prepare a communication in textual or interactive format describing the solution to the activities proposed below. Optional (non-mandatory) activities that may be of interest are provided at the end.

### Proposed activities:

    1.	Define the target. In our case, we propose using the `SHP_WEIGHT` column, which represents the item's weight in grams.

    2.	Split the clean dataset into train and test. We recommend using train_test_split with 80% for training and 20% for test.

    3.	Choose 2 different models seen in class (suggestion: LinearRegression and kNN) and train them.

    4.	Evaluate and report metrics (we suggest: Mean Absolute Error, Median Absolute Error, Mean Squared Error and Root Mean Squared Error). It may be useful to see: link1, link2.

    5.	Split the training set into train and validation (suggestion: 80-20). Re-train on the train set and evaluate the chosen models on the validation set with different hyperparameters. Report the metrics of the different runs and describe which hyperparameter choice is best. Report the metrics of the best model on the test set and compare it with the models trained in point 3.

The communication should be aimed at a technical audience with no knowledge of the specific topic, such as classmates or project stakeholders. Ideally, besides the document, a short presentation for stakeholders should be produced, explaining the analysis performed on the data and the conclusions drawn from it.

The following aspects will be evaluated:

    ●	The report must contain a clear message, presented concisely.
    ●	The plots must apply the visual perception concepts seen in class.
    ●	The statistical significance of the work must be described or estimated.


## Data loading

In [58]:
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
import seaborn
import scipy as sc
from math import sqrt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from ast import literal_eval
from pandas.io.json import json_normalize
from fancyimpute import KNN
# extras for TP3
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error
from sklearn.metrics import mean_absolute_error, median_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
%matplotlib inline

In [5]:
#from visualization import plot_confusion_matrix, plot_learning_curve
#import os
#import sys
#sys.path.append(os.getcwd())
#from ml.visualization import plot_confusion_matrix, plot_learning_curve


In [13]:
random.seed(0)
DATASET = '../meli_dataset_20190426.csv'
df_original = pd.read_csv(DATASET, low_memory=False)

In [14]:
df= df_original
df = df.head(10000)

# Activity 1:

Define the target. In our case, we propose using the `SHP_WEIGHT` column, which represents the item's weight in grams.

Building on the work done in TP2, records with `STATUS` 404 or with a missing `SHP_WEIGHT` are removed, the data is grouped by `ITEM_ID`, and values are replaced by the per-item median. In addition, some categorical variables are encoded and missing values of the `PRICE` variable are imputed. Finally, the natural logarithm of the target variable `SHP_WEIGHT` is taken.
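
The per-item median described above can be illustrated on a toy frame. This is only a sketch: the `toy` data and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical shipments: item A was shipped three times, item B once
toy = pd.DataFrame({
    'ITEM_ID': ['A', 'A', 'A', 'B'],
    'SHP_WEIGHT': [100.0, 120.0, 500.0, 40.0],
})

# One row per item, carrying the median weight of its shipments
per_item = toy.groupby('ITEM_ID', as_index=False)['SHP_WEIGHT'].median()
print(per_item)
```

The median is used rather than the mean so that a single anomalous measurement (500 g for item A here) does not distort the per-item value.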

In [15]:
df = df[df.STATUS != "404"]
df = df.drop(columns=['STATUS'])

df = df.dropna(subset=['SHP_WEIGHT'])

# Group by ITEM_ID and take the median of the numeric columns
df_grouped = df.groupby(['ITEM_ID'], as_index=False).median(numeric_only=True)
# Sort the dataframe by ITEM_ID
df.sort_values('ITEM_ID', inplace=True)
# Drop rows with a repeated ITEM_ID
# (note: keep=False discards every occurrence; keep='first' would retain one row per item)
df.drop_duplicates(subset='ITEM_ID', keep=False, inplace=True)
# Update the dataframe with the median weight and dimensions per item
df.set_index('ITEM_ID', inplace=True)
# set_index(..., inplace=True) returns None, so pass the re-indexed frame directly
df.update(df_grouped.set_index('ITEM_ID'))
# Drop ITEM_ID and restore a positional index (reset_index is not in-place)
df = df.reset_index(drop=True)

# Binarization of CATALOG_PRODUCT_ID, CONDITION and DOMAIN_ID

column = 'CATALOG_PRODUCT_ID'
lb = LabelBinarizer()
lb_results = lb.fit_transform(df[column])
#pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_)).head(10)
CATALOG_PRODUCT_ID_ENCODED = pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_))

column = "CONDITION"
lb = LabelBinarizer()
lb_results = lb.fit_transform(df[column].astype(str))
pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_)).head(10)
CONDITION_ENCODED = pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_))

column = 'DOMAIN_ID'
lb = LabelBinarizer()
lb_results = lb.fit_transform(df[column].astype(str))
#pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_)).head(10)
DOMAIN_ID_ENCODED = pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_))
# Attach the encoded categorical variables to the dataset
df["id"]=CONDITION_ENCODED.index
df=df.set_index("id")
df = pd.concat([df,CATALOG_PRODUCT_ID_ENCODED, CONDITION_ENCODED, DOMAIN_ID_ENCODED], axis=1)


In [16]:
df

Unnamed: 0,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,ATTRIBUTES,CATALOG_PRODUCT_ID,CONDITION,DOMAIN_ID,PRICE,SELLER_ID,...,DOMAIN_ID_MLB-WIRELESS_ANTENNAS,DOMAIN_ID_MLB-WIRELESS_CHARGERS,DOMAIN_ID_MLB-WIRELESS_FM_TRANSMITTERS,DOMAIN_ID_MLB-WIRE_STRIPPERS,DOMAIN_ID_MLB-WOMEN_SWIMWEAR,DOMAIN_ID_MLB-WRENCHES,DOMAIN_ID_MLB-WRENCH_SETS,DOMAIN_ID_MLB-WRISTWATCHES,DOMAIN_ID_MLB-XENON_KITS,DOMAIN_ID_nan
0,775.0,50.0,20.0,10.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-ENGINE_GASKET_SETS,750.00,QD3YJ9751S,...,0,0,0,0,0,0,0,0,0,0
1,6100.0,70.0,25.0,5.0,"[{'id': 'BEDDING_SET_SIZE', 'name': 'Tamanho',...",H53U1H7Q5G,new,MLB-BEDDING_SETS,119.90,J3EY3QAB29,...,0,0,0,0,0,0,0,0,0,0
2,464.0,20.0,11.0,10.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-AUTOMOBILE_FUEL_PUMPS,349.90,NO4W1R9S3D,...,0,0,0,0,0,0,0,0,0,0
3,150.0,25.0,25.0,11.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-PENDRIVES,21.99,KIQX6YQZI4,...,0,0,0,0,0,0,0,0,0,0
4,3719.0,42.0,34.0,13.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",GITRVCM7WO,used,MLB-GAME_CONSOLES,849.00,ZQIKYCCZ7E,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3987,431.0,25.0,25.0,5.0,"[{'id': 'CLOSING', 'name': 'Fecho', 'value_id'...",H53U1H7Q5G,new,MLB-FANNY_PACKS,69.90,GPWP5IFQEN,...,0,0,0,0,0,0,0,0,0,0
3988,150.0,20.0,20.0,20.0,"[{'id': 'ITEM_CONDITION', 'name': 'Condição do...",H53U1H7Q5G,new,MLB-PORTABLE_ELECTRIC_MASSAGERS,7.50,OFLRK20BUP,...,0,0,0,0,0,0,0,0,0,0
3989,3880.0,36.0,24.0,13.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-ENGINE_OILS,145.90,MQICEHKRH5,...,0,0,0,0,0,0,0,0,0,0
3990,1040.0,28.0,18.0,8.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",CCNZQYJ1G6,new,MLB-ROUTERS,329.49,ANYX5441IO,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# Impute missing PRICE values with KNN
df_numeric = df.select_dtypes([np.number])
df_filled = pd.DataFrame(KNN(3).fit_transform(df_numeric))
df_filled.columns=df_numeric.columns
df=df_filled

Imputing row 1/3992 with 0 missing, elapsed time: 206.036
Imputing row 101/3992 with 0 missing, elapsed time: 206.058
Imputing row 201/3992 with 0 missing, elapsed time: 206.067
Imputing row 301/3992 with 0 missing, elapsed time: 206.075
Imputing row 401/3992 with 0 missing, elapsed time: 206.083
Imputing row 501/3992 with 0 missing, elapsed time: 206.090
Imputing row 601/3992 with 0 missing, elapsed time: 206.098
Imputing row 701/3992 with 0 missing, elapsed time: 206.107
Imputing row 801/3992 with 0 missing, elapsed time: 206.115
Imputing row 901/3992 with 0 missing, elapsed time: 206.122
Imputing row 1001/3992 with 0 missing, elapsed time: 206.130
Imputing row 1101/3992 with 0 missing, elapsed time: 206.137
Imputing row 1201/3992 with 0 missing, elapsed time: 206.144
Imputing row 1301/3992 with 0 missing, elapsed time: 206.151
Imputing row 1401/3992 with 1 missing, elapsed time: 206.159
Imputing row 1501/3992 with 0 missing, elapsed time: 206.167
Imputing row 1601/3992 with 0 missin

In [18]:
np.log(df["SHP_WEIGHT"])

0       6.652863
1       8.716044
2       6.139885
3       5.010635
4       8.221210
          ...   
3987    6.066108
3988    5.010635
3989    8.263590
3990    6.946976
3991    5.560682
Name: SHP_WEIGHT, Length: 3992, dtype: float64

In [19]:
df["SHP_WEIGHT_LOG"] = np.log(df["SHP_WEIGHT"])
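
Since the target is modelled on a natural-log scale, every error reported later in this notebook is in log units. The sketch below shows how the transform is undone and what a log-scale error means in grams. The two weights are the first two values displayed above; the error of 0.5 is a hypothetical value chosen only for illustration.

```python
import numpy as np

weights_g = np.array([775.0, 6100.0])   # first two SHP_WEIGHT values shown above, in grams
log_w = np.log(weights_g)               # same transform as SHP_WEIGHT_LOG

# exp undoes the log, recovering grams
recovered = np.exp(log_w)

# An absolute error of e on the log scale means the prediction is off by
# a multiplicative factor of exp(e) in grams (above or below the true weight).
log_error = 0.5                         # hypothetical value, for illustration only
factor = np.exp(log_error)

print(np.round(log_w, 6))
print(round(float(factor), 3))
```

Read this way, log-scale metrics describe relative rather than absolute errors in grams, which is worth keeping in mind when interpreting the MAE and RMSE values reported for each model.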

The following features are dropped:

    1. "SHP_WEIGHT", because its logarithm is used instead.
    
    2. "SHP_LENGTH", "SHP_WIDTH", "SHP_HEIGHT", because when estimating "SHP_WEIGHT" we assume this information is not available.
    
    3. "CATALOG_PRODUCT_ID_A0RY70BE19", 'CONDITION_nan', "DOMAIN_ID_nan", that is, one dummy per categorical variable, to avoid the dummy variable trap (perfect multicollinearity).
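
The dummy variable trap can be checked numerically. The sketch below uses a made-up two-level `CONDITION` column (not the real data) and compares the rank of the design matrix with and without one dummy dropped; `pd.get_dummies(..., drop_first=True)` is used here as a stand-in for dropping a column by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two levels
cond = pd.Series(['new', 'used', 'new', 'used'], name='CONDITION')

full = pd.get_dummies(cond, prefix='CONDITION').astype(float)
reduced = pd.get_dummies(cond, prefix='CONDITION', drop_first=True).astype(float)

# With an intercept column, the full set of dummies is linearly dependent
# (the dummy columns add up to the intercept column), so the matrix loses rank.
X_full = np.column_stack([np.ones(len(cond)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(cond)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)        # fewer than its 3 columns
rank_reduced = np.linalg.matrix_rank(X_reduced)  # equal to its 2 columns
print(X_full.shape[1], rank_full, X_reduced.shape[1], rank_reduced)
```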

In [20]:
df= df.drop(columns=["SHP_WEIGHT", "SHP_LENGTH", "SHP_WIDTH", "SHP_HEIGHT",  "CATALOG_PRODUCT_ID_A0RY70BE19", 'CONDITION_nan', "DOMAIN_ID_nan"])

# Activity 2:

Split the clean dataset into train and test. We recommend using train_test_split with 80% for training and 20% for test.

In [21]:
# split into instances and labels
X, y = df.iloc[:, :-1], df.SHP_WEIGHT_LOG

# split into training and evaluation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [22]:
X

Unnamed: 0,PRICE,CATALOG_PRODUCT_ID_A2H2JJFBXM,CATALOG_PRODUCT_ID_A4M0AP2TSK,CATALOG_PRODUCT_ID_A6X73QCLS9,CATALOG_PRODUCT_ID_A7Y7QKJ7EF,CATALOG_PRODUCT_ID_ADKMKF0FVM,CATALOG_PRODUCT_ID_AF4WQUGCVH,CATALOG_PRODUCT_ID_AFPLIBE9VN,CATALOG_PRODUCT_ID_AG9UI846DP,CATALOG_PRODUCT_ID_AGE41D6OTF,...,DOMAIN_ID_MLB-WINE_GLASSES,DOMAIN_ID_MLB-WIRELESS_ANTENNAS,DOMAIN_ID_MLB-WIRELESS_CHARGERS,DOMAIN_ID_MLB-WIRELESS_FM_TRANSMITTERS,DOMAIN_ID_MLB-WIRE_STRIPPERS,DOMAIN_ID_MLB-WOMEN_SWIMWEAR,DOMAIN_ID_MLB-WRENCHES,DOMAIN_ID_MLB-WRENCH_SETS,DOMAIN_ID_MLB-WRISTWATCHES,DOMAIN_ID_MLB-XENON_KITS
0,750.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,119.900000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,349.900000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,21.990000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,849.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3987,69.900000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3988,7.500000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3989,145.900000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3990,329.490000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
y

0       6.652863
1       8.716044
2       6.139885
3       5.010635
4       8.221210
          ...   
3987    6.066108
3988    5.010635
3989    8.263590
3990    6.946976
3991    5.560682
Name: SHP_WEIGHT_LOG, Length: 3992, dtype: float64

# Activity 3:

Choose 2 different models seen in class (suggestion: LinearRegression and kNN) and train them.

### Linear regression

In [24]:
reg = LinearRegression()
reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [38]:
neigh = KNeighborsRegressor(n_neighbors=5)

In [39]:
neigh.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

# Activity 4:

Evaluate and report some metrics (we suggest: Mean Absolute Error, Median
Absolute Error, Mean Squared Error and Root Mean Squared Error). Report at
least two metrics and understand the difference between them. For example, what
difference does it make to evaluate using the mean versus the median?

### Metrics for the fitted linear regression

In [40]:
# Compute predictions once and reuse them for every metric
pred_train = reg.predict(X_train)
pred_test = reg.predict(X_test)

print('Mean Absolute Error for training: %.2f' %
      mean_absolute_error(y_train, pred_train))
print('Median Absolute Error for training: %.2f' %
      median_absolute_error(y_train, pred_train))
print('Mean Squared Error for training: %.2f' %
      mean_squared_error(y_train, pred_train))
print('Root Mean Squared Error for training: %.2f' %
      sqrt(mean_squared_error(y_train, pred_train)))

print("###############################################")

print('Mean Absolute Error for test: %.2f' %
      mean_absolute_error(y_test, pred_test))
print('Median Absolute Error for test: %.2f' %
      median_absolute_error(y_test, pred_test))
print('Mean Squared Error for test: %.2f' %
      mean_squared_error(y_test, pred_test))
print('Root Mean Squared Error for test: %.2f' %
      sqrt(mean_squared_error(y_test, pred_test)))


Mean Absolute Error for training: 0.59
Median Absolute Error for training: 0.37
Mean Squared Error for training: 0.81
Root Mean Squared Error for training: 0.90
###############################################
Mean Absolute Error for test: 36796599.18
Median Absolute Error for test: 0.82
Mean Squared Error for test: 82267896196156352.00
Root Mean Squared Error for test: 286823806.89


### Metrics for the regression fitted with kNN

In [41]:
# Compute predictions once and reuse them for every metric
pred_train = neigh.predict(X_train)
pred_test = neigh.predict(X_test)

print('Mean Absolute Error for training: %.2f' %
      mean_absolute_error(y_train, pred_train))
print('Median Absolute Error for training: %.2f' %
      median_absolute_error(y_train, pred_train))
print('Mean Squared Error for training: %.2f' %
      mean_squared_error(y_train, pred_train))
print('Root Mean Squared Error for training: %.2f' %
      sqrt(mean_squared_error(y_train, pred_train)))

print("###############################################")

print('Mean Absolute Error for test: %.2f' %
      mean_absolute_error(y_test, pred_test))
print('Median Absolute Error for test: %.2f' %
      median_absolute_error(y_test, pred_test))
print('Mean Squared Error for test: %.2f' %
      mean_squared_error(y_test, pred_test))
print('Root Mean Squared Error for test: %.2f' %
      sqrt(mean_squared_error(y_test, pred_test)))


Mean Absolute Error for training: 0.85
Median Absolute Error for training: 0.71
Mean Squared Error for training: 1.14
Root Mean Squared Error for training: 1.07
###############################################
Mean Absolute Error for test: 1.06
Median Absolute Error for test: 0.91
Mean Squared Error for test: 1.75
Root Mean Squared Error for test: 1.32


On the mean-based metrics, kNN clearly has more predictive power than linear regression on the test set. Because the linear regression is not regularized, its performance collapses when moving from training to test: the Mean Absolute Error goes from 0.59 to 36796599.18. Its test performance nevertheless looks relatively acceptable under the Median Absolute Error (0.82). This is most likely because the median is more robust than the mean, which is heavily influenced by extreme values: a few wildly wrong predictions are enough to inflate the MAE and the RMSE while leaving the median almost untouched.
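
The effect of a single extreme prediction on the two kinds of metric can be reproduced on a small example. The numbers below are invented for illustration; only the metric functions are the ones used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Hypothetical log-weights and predictions; the last prediction is wildly off
y_true_toy = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
y_pred_toy = np.array([5.5, 6.5, 6.5, 8.5, 1000.0])

mae = mean_absolute_error(y_true_toy, y_pred_toy)
medae = median_absolute_error(y_true_toy, y_pred_toy)

print('MAE:   %.2f' % mae)
print('MedAE: %.2f' % medae)
```

Four of the five absolute errors are 0.5, so the median stays at 0.5, while the single error of 991 pulls the mean up to 198.6. The same mechanism would explain a test MAE in the millions alongside a median error below 1.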

# Activity 5:

Split the training set into train and validation (suggestion: 80-20). Re-train
on the train set and evaluate the chosen models on the validation set with
different hyperparameters. Report the metrics of the different runs and
describe which hyperparameter choice is best. Report the metrics of the best
model on the test set and compare it with the models trained in point 3.

In [58]:
# split of the training data into train and validation
# This is one option, but cv=5 can be used directly in GridSearchCV instead
# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
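
For comparison with the cross-validated searches used below, the explicit train/validation split that the activity asks for can be sketched as follows. This is an illustrative sketch on synthetic data from `make_regression`: the names `X_demo`, `y_demo`, `X_fit` and `X_val` are assumptions introduced here and do not refer to the notebook's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X, y); the notebook would use its own data instead
X_demo, y_demo = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# 80/20 train/test, then 80/20 train/validation inside the training part
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.2, random_state=0)

# Evaluate each candidate alpha on the validation set only
val_mae = {}
for alpha in [0.1, 1, 5, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_fit, y_fit)
    val_mae[alpha] = mean_absolute_error(y_val, model.predict(X_val))

best_alpha = min(val_mae, key=val_mae.get)

# Refit on the whole training part with the chosen alpha; touch the test set once
final = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_mae = mean_absolute_error(y_te, final.predict(X_te))
print(best_alpha, round(test_mae, 2))
```

A single validation split is cheaper than 5-fold cross-validation but gives a noisier estimate, which is one reason to prefer `cv=5` on a dataset of a few thousand rows.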

## Linear regression

It is advisable to fit regularized regressions to avoid the overfitting seen in the previous activities.

The following models are considered:

    ● Ridge (L2)
    
    ● Lasso (L1)
    
    ● Elastic Net (combination of L2 and L1)
       
    
For each of them, the optimal values of the regularization hyperparameters are searched for.
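
For reference, these are the objectives minimised by the three estimators as parameterised in scikit-learn, with $n$ samples, coefficient vector $w$, `alpha` written as $\alpha$ and `l1_ratio` written as $\rho$:

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

Because Lasso and Elastic Net scale the squared error by $1/(2n)$ while Ridge does not, the same numeric `alpha` does not represent the same amount of regularization across the three models, so a shared grid of values explores quite different regimes in each case.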
    

### Ridge

In [61]:
model =  Ridge()
parameters = {'alpha': [0.1, 1, 5, 10, 100]}
grid = GridSearchCV(estimator=model, param_grid = parameters, cv = 5, n_jobs=-1 )
grid.fit(X_train, y_train)    

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='warn', n_jobs=-1, param_grid={'alpha': [0.1, 1, 5, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [62]:
print("Best parameter set:")
print(grid.best_params_, end="\n\n")

Best parameter set:
{'alpha': 1}



In [63]:
# Compute predictions once and reuse them for every metric
pred_train = grid.predict(X_train)
pred_test = grid.predict(X_test)

print('Mean Absolute Error for training: %.2f' %
      mean_absolute_error(y_train, pred_train))
print('Median Absolute Error for training: %.2f' %
      median_absolute_error(y_train, pred_train))
print('Mean Squared Error for training: %.2f' %
      mean_squared_error(y_train, pred_train))
print('Root Mean Squared Error for training: %.2f' %
      sqrt(mean_squared_error(y_train, pred_train)))

print("###############################################")

print('Mean Absolute Error for test: %.2f' %
      mean_absolute_error(y_test, pred_test))
print('Median Absolute Error for test: %.2f' %
      median_absolute_error(y_test, pred_test))
print('Mean Squared Error for test: %.2f' %
      mean_squared_error(y_test, pred_test))
print('Root Mean Squared Error for test: %.2f' %
      sqrt(mean_squared_error(y_test, pred_test)))


Mean Absolute Error for training: 0.72
Median Absolute Error for training: 0.54
Mean Squared Error for training: 0.92
Root Mean Squared Error for training: 0.96
###############################################
Mean Absolute Error for test: 0.94
Median Absolute Error for test: 0.73
Mean Squared Error for test: 1.48
Root Mean Squared Error for test: 1.22


### Lasso

In [64]:
model =  Lasso()
parameters = {'alpha': [0.1, 1, 5, 10, 100]}
grid = GridSearchCV(estimator=model, param_grid = parameters, cv = 5, n_jobs=-1 )
grid.fit(X_train, y_train)    

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=None,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='warn', n_jobs=-1, param_grid={'alpha': [0.1, 1, 5, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [65]:
print("Best parameter set:")
print(grid.best_params_, end="\n\n")

Best parameter set:
{'alpha': 10}



In [66]:
# Compute predictions once and reuse them for every metric
pred_train = grid.predict(X_train)
pred_test = grid.predict(X_test)

print('Mean Absolute Error for training: %.2f' %
      mean_absolute_error(y_train, pred_train))
print('Median Absolute Error for training: %.2f' %
      median_absolute_error(y_train, pred_train))
print('Mean Squared Error for training: %.2f' %
      mean_squared_error(y_train, pred_train))
print('Root Mean Squared Error for training: %.2f' %
      sqrt(mean_squared_error(y_train, pred_train)))

print("###############################################")

print('Mean Absolute Error for test: %.2f' %
      mean_absolute_error(y_test, pred_test))
print('Median Absolute Error for test: %.2f' %
      median_absolute_error(y_test, pred_test))
print('Mean Squared Error for test: %.2f' %
      mean_squared_error(y_test, pred_test))
print('Root Mean Squared Error for test: %.2f' %
      sqrt(mean_squared_error(y_test, pred_test)))


Mean Absolute Error for training: 1.10
Median Absolute Error for training: 0.96
Mean Squared Error for training: 1.87
Root Mean Squared Error for training: 1.37
###############################################
Mean Absolute Error for test: 1.11
Median Absolute Error for test: 0.96
Mean Squared Error for test: 1.90
Root Mean Squared Error for test: 1.38


### ElasticNet

In [67]:
model =  ElasticNet()
parameters = {'alpha': [0.1, 1, 5, 10, 100],
             'l1_ratio': [0.2, 0.4, 0.6, 0.8]}
grid = GridSearchCV(estimator=model, param_grid = parameters, cv = 5, n_jobs=-1 )
grid.fit(X_train, y_train)    

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True,
                                  l1_ratio=0.5, max_iter=1000, normalize=False,
                                  positive=False, precompute=False,
                                  random_state=None, selection='cyclic',
                                  tol=0.0001, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'alpha': [0.1, 1, 5, 10, 100],
                         'l1_ratio': [0.2, 0.4, 0.6, 0.8]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [68]:
print("Best parameter set:")
print(grid.best_params_, end="\n\n")

Best parameter set:
{'alpha': 0.1, 'l1_ratio': 0.2}



In [69]:
# Compute predictions once and reuse them for every metric
pred_train = grid.predict(X_train)
pred_test = grid.predict(X_test)

print('Mean Absolute Error for training: %.2f' %
      mean_absolute_error(y_train, pred_train))
print('Median Absolute Error for training: %.2f' %
      median_absolute_error(y_train, pred_train))
print('Mean Squared Error for training: %.2f' %
      mean_squared_error(y_train, pred_train))
print('Root Mean Squared Error for training: %.2f' %
      sqrt(mean_squared_error(y_train, pred_train)))

print("###############################################")

print('Mean Absolute Error for test: %.2f' %
      mean_absolute_error(y_test, pred_test))
print('Median Absolute Error for test: %.2f' %
      median_absolute_error(y_test, pred_test))
print('Mean Squared Error for test: %.2f' %
      mean_squared_error(y_test, pred_test))
print('Root Mean Squared Error for test: %.2f' %
      sqrt(mean_squared_error(y_test, pred_test)))

Mean Absolute Error for training: 1.09
Median Absolute Error for training: 0.94
Mean Squared Error for training: 1.86
Root Mean Squared Error for training: 1.36
###############################################
Mean Absolute Error for test: 1.11
Median Absolute Error for test: 0.96
Mean Squared Error for test: 1.88
Root Mean Squared Error for test: 1.37


### KNN 

For this model, alternative values are tried for the following hyperparameters:

    ● n_neighbors
    
    ● weights
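
To make the role of `weights` concrete, the sketch below fits `KNeighborsRegressor` on three made-up one-dimensional points and predicts at a query point that is closer to one of them. The toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training set: three points on a line and their targets
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([0.0, 10.0, 30.0])
x_query = np.array([[1.5]])   # distances to the training points: 1.5, 0.5, 1.5

# 'uniform' averages the neighbours' targets with equal weight
knn_uniform = KNeighborsRegressor(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
# 'distance' weights each neighbour by the inverse of its distance to the query
knn_distance = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

pred_uniform = knn_uniform.predict(x_query)[0]
pred_distance = knn_distance.predict(x_query)[0]
print(round(pred_uniform, 3), round(pred_distance, 3))
```

With uniform weights the prediction is the plain average, 40/3. With distance weights the nearest point (target 10) counts three times as much as each of the other two, which moves the prediction to 12.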
    

In [54]:
model =  KNeighborsRegressor()
parameters = {'n_neighbors': [3,5, 7], 
              'weights'    : ['uniform', 'distance']
                }
#scoring = ['median_absolute_error', 'mean_squared_error']
grid = GridSearchCV(estimator=model, param_grid = parameters, cv = 5, n_jobs=-1 )
grid.fit(X_train, y_train)   

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30,
                                           metric='minkowski',
                                           metric_params=None, n_jobs=None,
                                           n_neighbors=5, p=2,
                                           weights='uniform'),
             iid='warn', n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 7],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [56]:
print("Best parameter set:")
print(grid.best_params_, end="\n\n")

Best parameter set:
{'n_neighbors': 7, 'weights': 'uniform'}



In [57]:
# Compute predictions once and reuse them for every metric
pred_train = grid.predict(X_train)
pred_test = grid.predict(X_test)

print('Mean Absolute Error for training: %.2f' %
      mean_absolute_error(y_train, pred_train))
print('Median Absolute Error for training: %.2f' %
      median_absolute_error(y_train, pred_train))
print('Mean Squared Error for training: %.2f' %
      mean_squared_error(y_train, pred_train))
print('Root Mean Squared Error for training: %.2f' %
      sqrt(mean_squared_error(y_train, pred_train)))

print("###############################################")

print('Mean Absolute Error for test: %.2f' %
      mean_absolute_error(y_test, pred_test))
print('Median Absolute Error for test: %.2f' %
      median_absolute_error(y_test, pred_test))
print('Mean Squared Error for test: %.2f' %
      mean_squared_error(y_test, pred_test))
print('Root Mean Squared Error for test: %.2f' %
      sqrt(mean_squared_error(y_test, pred_test)))


Mean Absolute Error for training: 0.89
Median Absolute Error for training: 0.75
Mean Squared Error for training: 1.24
Root Mean Squared Error for training: 1.12
###############################################
Mean Absolute Error for test: 1.03
Median Absolute Error for test: 0.88
Mean Squared Error for test: 1.67
Root Mean Squared Error for test: 1.29


Tuning the hyperparameters considerably improves predictive power in the linear regression case, because the regularization hyperparameters prevent the overfitting seen earlier: the test Mean Absolute Error drops from 36796599.18 for the unregularized model to 0.94 for Ridge and 1.11 for Lasso and Elastic Net. For kNN the improvement is much less marked (test Mean Absolute Error of 1.03 versus 1.06), because the original model already worked fairly well.
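
One caveat when reading these comparisons: the grid searches above leave `scoring=None`, so `GridSearchCV` ranks candidates with the estimator's own `score` method, which for scikit-learn regressors is R², not any of the error metrics reported here. A minimal sketch of selecting on mean absolute error instead, using synthetic data from `make_regression` (the names `X_demo`, `y_demo` and `grid_demo` are assumptions introduced for this example):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

grid_demo = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']},
    scoring='neg_mean_absolute_error',   # scorers are maximised, hence the negated MAE
    cv=5,
)
grid_demo.fit(X_demo, y_demo)

# Flip the sign back to obtain the cross-validated MAE of the best candidate
best_cv_mae = -grid_demo.best_score_
print(grid_demo.best_params_, round(best_cv_mae, 2))
```

Selecting and reporting on the same metric makes the choice of "best" hyperparameters easier to justify in the write-up.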