<center>
<h4>Diplomatura en CDAAyA 2020 - FaMAF - UNC</h4>
<h1>¿Caro o Barato? Análisis de Precios de Almacen en un Contexto Inflacionario</h1>
<h3>Introducción al Aprendizaje Automático & Aprendizaje Automático Supervisado</h3>
</center>
</left>
<h4>Sofía Luján y Julieta Bergamasco</h4>
</left>

## Introducción

En la siguiente notebook se presentará la consigna a seguir para el tercer práctico del proyecto, correspondiente a las materias Introducción al Aprendizaje Automático y Aprendizaje Automático Supervisado. El objetivo consiste en explorar la aplicación de diferentes métodos de aprendizaje supervisado aprendidos en el curso, así como también de métodos de _ensemble learning_, a través de experimentos reproducibles, y evaluando a su vez la conveniencia de uno u otro, así como la selección de diferentes hiperparámetros a partir del cálculo de las métricas pertinentes.

En el caso de nuestro proyecto, podemos plantear diferentes tipos de problemas, como la agrupación de productos, la estimación de un precio o la identificación de precios anómalos. Sin embargo, a los fines de este práctico, nos enfocaremos en la predicción de precios relativos.

Para ello, comenzaremos con las importaciones pertinentes.

## Importaciones

In [1]:
# Importación de las librerías necesarias
import numpy as np
import pandas as pd
# Puede que nos sirvan también
import matplotlib as mpl
mpl.get_cachedir()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#import plotly.express as px
import sklearn as skl
from io import StringIO

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.utils import shuffle#, print_eval
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn import ensemble #RandomForestClasifier, VotingClassifier
from sklearn import svm #LinearSVC, SVC
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV

np.random.seed(0)  # Para mayor determinismo

In [2]:
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)
pd.set_option('max_colwidth', 151)

## Consigna

### I. Preprocesamiento

A los fines de realizar este práctico, se utilizará el dataset limpio obtenido en la etapa anterior. La división entre train y test será realizada en este mismo práctico.
A continuación se detallan los pasos a seguir para el preprocesamiento de los datos.

#### 1. Obtención del Dataset

Cargar los datasets originales.

#### 2. Aplicar Script de Curación

Inicialmente, con el objetivo de preparar los datos que alimentarán los modelos de aprendizaje automático (ML) propuestos, deberán aplicar el script de curación obtenido en el práctico anterior.
En esta etapa, pueden adicionar los atributos que crean pertinentes a priori o que hayan encontrado interesantes por tener mayor correlación con la variable de interés (precio, precio relativo).

#### 3. Correlación Entre Variables Numéricas

Dado que inicialmente eran pocas las variables numéricas y ahora contamos con un grupo más amplio de estas caracteristicas, se propone obtener la correlación entre todas las variables numéricas. Representarla gráficamente utilizando un mapa de calor (heatmap).
¿Cuáles son las features más correlacionadas con el precio?

#### 4. Multicolinealidad Exacta

Las variables explicativas no deben estar muy correlacionadas entre ellas, ya que la variabilidad de una y otra estarán explicando la misma parte de variabilidad de la variable dependiente. Esto es lo que se conoce como multicolinealidad, lo cual deriva en la imposibilidad de estimar los parámetros cuando la misma es exacta o en estimaciones muy imprecisas cuando la misma es aproximada.
En el caso de encontrar multicolinealidad, responder: ¿Cómo se puede solucionar? ¿Qué decisión tomarían al respecto?

#### 5. Normalización de Atributos

Es posible que sea necesario normalizar las features de nuestro dataset, dado que muchos de los algoritmos de aprendizaje supervisado lo requieren. ¿En qué casos tendrá que implementarse normalización?

Aplicar a los datasets la normalización de atributos que consideren adecuada.

#### 6. Mezca Aleatória y División en Train/Test

Finalmente, están en condiciones de **dividir el dataset en Train y Test**, utilizando para este último conjunto un 20% de los datos disponibles. Previo a esta división, es recomendable que mezclen los datos aleatoriamente.
De este modo, deberán obtener cuatro conjuntos de datos, para cada uno de los datasets: ```X_train```, ```X_test```, ```y_train``` y ```y_test```.

Pensar si hacer de esta forma la división puede afectar la distribución espacial y temporal de los datos.
¿Cuáles pueden ser las consecuencias?


---

A modo de ayuda, **en esta notebook encontrarán una especie de template** que sigue los pasos propuestos y que deberán ir completando.

Recuerden que la ciencia de datos es un **proceso circular, continuo y no lineal**. Es decir, si los datos requieren de mayor procesamiento para satisfacer las necesidades de algoritmos de ML (cualesquiera de ellos), vamos a volver a la etapa inicial para, por ejemplo, crear nuevas features, tomar decisiones diferentes sobre valores faltantes o valores atípicos (outliers), descartar features, entre otras.

### II. Aplicación de Modelos de Aprendizaje Supervisado

Una vez finalizada la etapa de preprocesamiento, se propone implementar diferentes modelos de predicción para el precio relativo, utilizando la librería Scikit-Learn:

1. Linear Support Vector Regression ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR))
2. Stochastic Gradient Descent ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor))
3. Regression Based on k-nearest neighbors ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor))
4. Gaussian Process Regression ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html#sklearn.gaussian_process.GaussianProcessRegressor))
5. Prediction Voting Regressor ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html#sklearn.ensemble.VotingRegressor))

Para cada uno de ellos, se pide responder las siguientes consignas:
- Agregar vector de Bias, cuando lo crean pertinente. Cuándo hace falta y cuándo no? Por qué?
- Obtener MSE, MAE, RMSE, R Square

De estos tipos de modelos, cuál creen que es el más adecuado para nuestro caso de aplicación?

**Elegir el modelo que consideren que mejor aplica a nuestro problema.** Para ello, recuerden que los pasos a seguir en la selección pueden esquematizarse como sigue:

#### 1. Descripción de la Hipótesis

¿Cuál es nuestro problema? ¿Cómo se caracteriza? ¿Cuál es la hipótesis?

#### 2. Selección de Regularizador

 ¿Utilizarán algún regularizador?¿Cuál?


#### 3. Selección de Función de Costo

¿Cuál será la función de costo utilizada?

#### 4. Justificación de las Selecciones

¿Por qué eligieron el modelo, el regularizador y la función de costo previas?

Finalmente, para el modelo selecionado:

- Utilizar el método *Grid Search*, o de búsqueda exahustiva, con *cross-validation* para profundizar en la búsqueda y selección de hiperparámetros.
- Calcular métricas sobre el conjunto de entrenamiento y de evaluación para los mejores parámetros obtenidos:
    + MSE, MAE, RMSE, R Square
    + Comparar las métricas obtenidas en cada modelo y obtener conclusiones.

---

Si encuentran cualquier otro modelo que consideren apropiado y deseen aplicar, pueden hacerlo con total libertad.

**Opcional**
- Aplicar PCA para reducción de dimensionalidad (manteniendo *n* componentes principales) y entrenar nuevamente el modelo seleccionado como el más apropiado.

### Entregables

El entregable de este práctico consiste en **esta misma Notebook**, pero con el preprocesamiento aplicado y los modelos implementados, agregando las explicaciones que crean pertinentes y las decisiones tomadas, en caso de corresponder.

Sintetizar las conclusiones en un archivo de texto, como lo vienen haciendo con los anteriores prácticos.

**Fecha de Entrega: 31/08**

# Resolución

## I. Preprocesamiento

### 1. Carga de Datos

Para comenzar, importamos los datos que vamos a utilizar:

In [4]:
#from google.colab import drive
#drive.mount('/content/drive')

In [3]:
# Traemos el dataset limpio del repo
url = '../models/precio_sucursal_producto_400.pkl'
#url = '/content/drive/My Drive/DiploDatos2020-Mentoria/Aprendizaje automatico/precio_sucursal_producto_400.pkl'
dataset = pd.read_pickle(url, compression='zip')

In [4]:
dataset.sample(3)

Unnamed: 0,precio_producto_mean,fecha_20200412,fecha_20200419,fecha_20200426,fecha_20200502,fecha_20200518,provincia,prov_catamarca,prov_chaco,prov_chubut,prov_ciudad_autonoma_de_buenos_aires,prov_cordoba,prov_corrientes,prov_entre_rios,prov_formosa,prov_jujuy_,prov_la_pampa,prov_la_rioja,prov_mendoza,prov_misiones,prov_neuquen,prov_provincia_de_buenos_aires,prov_rio_negro,prov_salta,prov_san_juan,prov_san_luis,prov_santa_cruz,prov_santa_fe,prov_santiago_del_estero,prov_tierra_del_fuego,prov_tucuman,suctipo_autoservicio,suctipo_hipermercado,suctipo_minorista,suctipo_supermercado,banddesc_axion_energy,banddesc_changomas,banddesc_cooperativa_obrera_limitada_de_consumo_y_vivienda,banddesc_coto_cicsa,banddesc_deheza_saicf_e_i,banddesc_disco,banddesc_express,banddesc_hipermercado_carrefour,banddesc_la_anonima,banddesc_market,banddesc_otras_bandDesc,banddesc_simplicity,banddesc_supermercados_cordiez,banddesc_supermercados_dia,banddesc_vea,banddesc_walmart_supercenter,fecha_anterior,precio_anterior,producto_marca,producto_vino,producto_queso,producto_chocolate,producto_galletitas,producto_leche,producto_dulce,producto_tinto,producto_crema,producto_jabon,producto_blanco,producto_polvo,producto_desodorante,producto_fideos,producto_pan,producto_liquido,producto_shampoo,producto_vainilla,producto_jugo,producto_carrefour,producto_pollo,producto_light,...,producto_shoulders,producto_blend,producto_milanesa,producto_preparar,producto_pasas,producto_lario,producto_granja,producto_multiuso,producto_care,producto_bombon,producto_aperitivo,producto_musculo,producto_pategras,producto_cera,producto_bon,producto_johnsons,producto_trapo,producto_sardo,producto_vera,producto_rollo,producto_capilar,producto_merluza,producto_pepas,producto_bondiola,producto_salamin,producto_femenina,producto_pro,producto_castano,producto_largo,producto_escobillon,producto_milan,producto_dulcor,producto_gillette,producto_virgen,producto_biferdil,producto_filet,producto_crudo,producto_bosque,producto_afeitar,producto_roll,producto_procenex,producto_mineral,producto_cofler,producto_lima,producto_koleston,producto_piamontesa,producto_sec,producto_levite,producto_pimienta,producto_o,producto_plusbelle,producto_nalga,producto_bimbo,producto_cagnoli,producto_secas,producto_papa,producto_marineras,producto_vida,producto_caldo,producto_paquete,producto_fargo,producto_tresemme,producto_estilo,producto_flores,producto_insecticida,producto_otras,um_cc,um_gr,um_kg,um_lt,um_ml,um_mt,um_pack,um_un,precio_relativo
980486,86.315366,0,0,1,0,0,AR-R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,20200419,94.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.329674
1312857,30.563478,0,0,1,0,0,AR-T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,20200419,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.754717
1856710,112.695357,0,0,0,0,1,AR-B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,20200502,106.911,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.818011


In [5]:
dataset['precio_anterior']

0            29.900000
1            29.900000
2            39.900000
3           499.990000
4           533.323333
              ...     
2147288    1230.000000
2147289     664.000000
2147290      71.490000
2147291      20.990000
2147292      21.990000
Name: precio_anterior, Length: 2138389, dtype: float64

<div class="alert alert-block alert-info">
El dataset ya está **listo para trabajar!**
</div>

### 2. Aplicar Script de Curación

El siguiente paso implica aplicar el script que resultó del práctico anterior. También pueden adicionar campos calculados en base a otros atributos, según lo consideren pertinente.

In [4]:
# En vez de aplicar el script de curacion, vamos a quitar las columnas relacionadas a la fecha y a la unidad
# ya que la presentacion es un dato que quedó relacionado en la normalizacion del precio del producto
# y la fecha esta representada por la media del producto.

## DROP UNIDAD
dataset.drop(columns=['um_cc', 'um_gr', 'um_kg', 'um_lt', 'um_ml', 'um_mt', 'um_pack', 'um_un'], inplace=True)

## DROP FECHA
dataset.drop(columns=['fecha_anterior','fecha_20200412', 'fecha_20200419', 'fecha_20200426', 'fecha_20200502', 'fecha_20200518'], inplace=True)

## DROPEAMOS UNA PROVINCIA Y UNA SUCURSALTIPO por la DUMMY TRAP
dataset.drop(columns=['suctipo_autoservicio', 'prov_catamarca','provincia'], inplace=True) # Dummy trap!

dataset.drop(columns=['precio_producto_mean'], inplace=True)

### 5. Normalización de Atributos


Aplicar al dataset la normalización de atributos que consideren adecuada.

In [5]:
# Pueden utilizar los siguientes métodos, por ejemplo:

min_max_scaler = MinMaxScaler()
standard_scaler = StandardScaler()


### 7. Mezca Aleatória y División en Train/Test

Primeramente, deberán mezclar los datos aleatoriamente. Luego, para dividir en Train/Test el dataset, aplicar el split utilizando un 20% de datos para este último.

En este punto, deberán obtener cuatro conjuntos de datos, para ambos datasets: ```X_train```, ```X_test```, ```y_train``` y ```y_test```.

In [25]:
# Y luego el módulo:
# Con los datos mezclados tomamos solo una muestra para acelerar los tiempos.
def get_dataset_ready(dataset, muestras=10000, rnd=0):
    dataset = dataset.sample(muestras, random_state=rnd)
    y = dataset['precio_relativo']
    X = dataset.drop(columns=['precio_relativo'])
    del dataset 

    return train_test_split(X, y, test_size=0.20, random_state=0)

# Notar que X e y son np.arrays. Además, pueden usar el parámetro que incluye train_test_split para mezclar.

In [7]:
muestras = 1500000
X_train, X_test, y_train, y_test = get_dataset_ready(dataset, muestras)
del dataset

In [9]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1200000, 444), (1200000,), (300000, 444), (300000,))

In [10]:
X_train.sample(3)

Unnamed: 0,prov_chaco,prov_chubut,prov_ciudad_autonoma_de_buenos_aires,prov_cordoba,prov_corrientes,prov_entre_rios,prov_formosa,prov_jujuy_,prov_la_pampa,prov_la_rioja,prov_mendoza,prov_misiones,prov_neuquen,prov_provincia_de_buenos_aires,prov_rio_negro,prov_salta,prov_san_juan,prov_san_luis,prov_santa_cruz,prov_santa_fe,prov_santiago_del_estero,prov_tierra_del_fuego,prov_tucuman,suctipo_hipermercado,suctipo_minorista,suctipo_supermercado,banddesc_axion_energy,banddesc_changomas,banddesc_cooperativa_obrera_limitada_de_consumo_y_vivienda,banddesc_coto_cicsa,banddesc_deheza_saicf_e_i,banddesc_disco,banddesc_express,banddesc_hipermercado_carrefour,banddesc_la_anonima,banddesc_market,banddesc_otras_bandDesc,banddesc_simplicity,banddesc_supermercados_cordiez,banddesc_supermercados_dia,banddesc_vea,banddesc_walmart_supercenter,precio_anterior,producto_marca,producto_vino,producto_queso,producto_chocolate,producto_galletitas,producto_leche,producto_dulce,producto_tinto,producto_crema,producto_jabon,producto_blanco,producto_polvo,producto_desodorante,producto_fideos,producto_pan,producto_liquido,producto_shampoo,producto_vainilla,producto_jugo,producto_carrefour,producto_pollo,producto_light,producto_bandeja,producto_naranja,producto_frutilla,producto_agua,producto_malbec,producto_limpiador,producto_acondicionador,producto_jamon,producto_dia,producto_aceite,...,producto_infantil,producto_tratamiento,producto_head,producto_zapallo,producto_hoja,producto_cabo,producto_soft,producto_palitos,producto_integral,producto_shoulders,producto_blend,producto_milanesa,producto_preparar,producto_pasas,producto_lario,producto_granja,producto_multiuso,producto_care,producto_bombon,producto_aperitivo,producto_musculo,producto_pategras,producto_cera,producto_bon,producto_johnsons,producto_trapo,producto_sardo,producto_vera,producto_rollo,producto_capilar,producto_merluza,producto_pepas,producto_bondiola,producto_salamin,producto_femenina,producto_pro,producto_castano,producto_largo,producto_escobillon,producto_milan,producto_dulcor,producto_gillette,producto_virgen,producto_biferdil,producto_filet,producto_crudo,producto_bosque,producto_afeitar,producto_roll,producto_procenex,producto_mineral,producto_cofler,producto_lima,producto_koleston,producto_piamontesa,producto_sec,producto_levite,producto_pimienta,producto_o,producto_plusbelle,producto_nalga,producto_bimbo,producto_cagnoli,producto_secas,producto_papa,producto_marineras,producto_vida,producto_caldo,producto_paquete,producto_fargo,producto_tresemme,producto_estilo,producto_flores,producto_insecticida,producto_otras
1255116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,109.77,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1570940,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,129.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1491611,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## II. Aplicación de Modelos de Regresión

Utilizando los datos de train y test obtenidos, se aplicarán diferentes modelos de regresión para predecir el precio relativo.

In [8]:
## Creamos una funcion para imprimir los resultados
def get_scores(labels, predictions):
    mae = mean_absolute_error(labels, predictions)
    mse = mean_squared_error(labels, predictions)
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    r2 = r2_score(labels, predictions)
    
    return {'mae':mae,'mse':mse,'rmse':rmse,'r2':r2}

In [9]:
models_results = dict()

### 1. Linear Support Vector Regression ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR))

A continuación se aplicará el modelo

In [11]:
from sklearn.svm import LinearSVR

In [15]:
scaler = min_max_scaler.fit(X_train[['precio_anterior']])
X_train['precio_anterior'] = scaler.transform(X_train[['precio_anterior']])
X_test['precio_anterior'] = scaler.transform(X_test[['precio_anterior']])

In [16]:
svr = LinearSVR()
svr.fit(X_train, y_train)

LinearSVR()

In [20]:
y_train_pred = svr.predict(X_train)
y_pred = svr.predict(X_test)

models_results['LinearSVR - Train'] = get_scores(y_train, y_train_pred)
models_results['LinearSVR - Test'] = get_scores(y_test, y_pred)

In [21]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
LinearSVR - Train,2.80026,252.632688,15.894423,0.085178
LinearSVR - Test,2.813218,251.032759,15.844013,0.085124


### 2. Stochastic Gradient Descent ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor))

In [22]:
from sklearn.linear_model import SGDRegressor

In [23]:
sgd = SGDRegressor()
sgd.fit(X_train, y_train)

SGDRegressor()

In [24]:
y_train_pred = sgd.predict(X_train)
y_pred = sgd.predict(X_test)

models_results['SGD - Train'] = get_scores(y_train, y_train_pred)
models_results['SGD - Test'] = get_scores(y_test, y_pred)

In [25]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
LinearSVR - Train,2.80026,252.632688,15.894423,0.085178
LinearSVR - Test,2.813218,251.032759,15.844013,0.085124
SGD - Train,3.850074,223.908536,14.963574,0.189193
SGD - Test,3.861804,222.125707,14.903882,0.190474


Volvemos a cargar el dataset para usar variables sin escalar

In [26]:
dataset = pd.read_pickle(url, compression='zip')
X_train, X_test, y_train, y_test = get_dataset_ready(dataset, muestras)
del dataset

### 5. Prediction Voting Regressor ([Doc](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html#sklearn.ensemble.VotingRegressor))

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor

In [11]:
r1 = LinearRegression()
r2 = RandomForestRegressor(n_jobs=-1, n_estimators=10) #empezamos con pocos estimadores para que no demore tanto
voting = VotingRegressor([('linear', r1), ('rf', r2)])
voting.fit(X_train, y_train)

VotingRegressor(estimators=[('linear', LinearRegression()),
                            ('rf',
                             RandomForestRegressor(n_estimators=10,
                                                   n_jobs=-1))])

In [12]:
y_train_pred = voting.predict(X_train)
y_pred = voting.predict(X_test)

models_results['Voting - Train'] = get_scores(y_train, y_train_pred)
models_results['Voting - Test'] = get_scores(y_test, y_pred)

In [13]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
Voting - Train,2.04752,67.289485,8.203017,0.756334
Voting - Test,2.171182,81.369867,9.020525,0.703452


### 6. XGBoost ([Doc](https://xgboost.readthedocs.io/en/latest/))

In [12]:
from xgboost import XGBRegressor

In [15]:
xgb = XGBRegressor(random_state=0)
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [16]:
y_train_pred = xgb.predict(X_train)
y_pred = xgb.predict(X_test)

models_results['XGB - Train'] = get_scores(y_train, y_train_pred)
models_results['XGB - Test'] = get_scores(y_test, y_pred)

In [17]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
Voting - Train,2.04752,67.289485,8.203017,0.756334
Voting - Test,2.171182,81.369867,9.020525,0.703452
XGB - Train,2.08775,75.927855,8.713659,0.725054
XGB - Test,2.114084,81.909137,9.050367,0.701486


### 4. Selección del Modelo

#### 4.1. Selección y Descripción de Hipótesis

Describir el problema y la hipótesis del modelo.

#### 4.2. Selección de Regularizador

 ¿Utilizarán algún regularizador?¿Cuál?

#### 4.3. Selección de Función de Costo

¿Cuál será la función de costo utilizada?

#### 4.4. Justificación de las Selecciones

A continuación, se justifican las elecciones previas.

### 5. Selección de Parámetros y Métricas Sobre el Conjunto de Evaluación

Para la selección de hiperparámetros, pueden utilizar GridSearch. Además, deben calcular las métricas solicitadas.

In [19]:
exploring_params = {'max_depth':(3,6),
                    'learning_rate':(0.3, 0.5),
                    'n_estimators': (100, 150)
                   }


regressor = XGBRegressor(n_jobs=-1, random_state=0)
model = GridSearchCV(regressor, exploring_params, cv=3, scoring='r2', verbose=2)
model.fit(X_train, y_train)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] learning_rate=0.3, max_depth=3, n_estimators=100 ................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] . learning_rate=0.3, max_depth=3, n_estimators=100, total= 5.1min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.1min remaining:    0.0s


[CV] learning_rate=0.3, max_depth=3, n_estimators=100 ................
[CV] . learning_rate=0.3, max_depth=3, n_estimators=100, total= 4.2min
[CV] learning_rate=0.3, max_depth=3, n_estimators=100 ................
[CV] . learning_rate=0.3, max_depth=3, n_estimators=100, total= 4.5min
[CV] learning_rate=0.3, max_depth=3, n_estimators=150 ................
[CV] . learning_rate=0.3, max_depth=3, n_estimators=150, total= 6.2min
[CV] learning_rate=0.3, max_depth=3, n_estimators=150 ................
[CV] . learning_rate=0.3, max_depth=3, n_estimators=150, total= 6.0min
[CV] learning_rate=0.3, max_depth=3, n_estimators=150 ................
[CV] . learning_rate=0.3, max_depth=3, n_estimators=150, total= 5.9min
[CV] learning_rate=0.3, max_depth=6, n_estimators=100 ................
[CV] . learning_rate=0.3, max_depth=6, n_estimators=100, total= 6.9min
[CV] learning_rate=0.3, max_depth=6, n_estimators=100 ................
[CV] . learning_rate=0.3, max_depth=6, n_estimators=100, total= 7.0min
[CV] l

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 165.3min finished


GridSearchCV(cv=3,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=-1,
                                    num_parallel_tree=None, random_state=0,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameter

In [20]:
pd.DataFrame(model.cv_results_).sort_values('rank_test_score')[:10]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
7,596.123631,7.473002,9.484899,0.143734,0.5,6,150,"{'learning_rate': 0.5, 'max_depth': 6, 'n_estimators': 150}",0.752304,0.762007,0.699768,0.738026,0.027341,1
6,409.940566,3.149868,9.284077,0.300926,0.5,6,100,"{'learning_rate': 0.5, 'max_depth': 6, 'n_estimators': 100}",0.73565,0.731768,0.665807,0.711075,0.032048,2
3,601.912315,3.978979,9.603844,0.393422,0.3,6,150,"{'learning_rate': 0.3, 'max_depth': 6, 'n_estimators': 150}",0.702438,0.663539,0.656962,0.674313,0.020068,3
2,407.359612,2.359228,9.203044,0.325719,0.3,6,100,"{'learning_rate': 0.3, 'max_depth': 6, 'n_estimators': 100}",0.66096,0.638915,0.627705,0.642527,0.013814,4
5,357.491224,6.376868,8.890611,0.738408,0.5,3,150,"{'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 150}",0.519611,0.478076,0.489464,0.495717,0.017524,5
4,236.399512,1.034305,8.778594,0.424145,0.5,3,100,"{'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 100}",0.491388,0.437975,0.469461,0.466275,0.021922,6
1,352.374012,6.800867,9.373544,0.138477,0.3,3,150,"{'learning_rate': 0.3, 'max_depth': 3, 'n_estimators': 150}",0.468463,0.431183,0.447882,0.449176,0.015247,7
0,265.44993,24.049505,9.460469,0.130294,0.3,3,100,"{'learning_rate': 0.3, 'max_depth': 3, 'n_estimators': 100}",0.430328,0.388066,0.432238,0.416877,0.020387,8


In [11]:
def train_model(model, X_train, y_train):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    
    return y_train_pred, y_pred

In [22]:
y_train_pred, y_pred = train_model(model.best_estimator_, X_train, y_train)

In [23]:
models_results['BestXGB - Train'] = get_scores(y_train, y_train_pred)
models_results['BestXGB - Test'] = get_scores(y_test, y_pred)

In [24]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
Voting - Train,2.04752,67.289485,8.203017,0.756334
Voting - Test,2.171182,81.369867,9.020525,0.703452
XGB - Train,2.08775,75.927855,8.713659,0.725054
XGB - Test,2.114084,81.909137,9.050367,0.701486
BestXGB - Train,1.764548,57.186877,7.562201,0.792918
BestXGB - Test,1.807601,69.011172,8.307296,0.748492


In [25]:
model.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.5, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=150, n_jobs=-1, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
# Veamos si podemos mejorar el score y achicar la diferencia del train/test

In [26]:
exploring_params_2 = {'max_depth':(6,9),
                    'n_estimators': (150, 250)
                   }


regressor_2 = XGBRegressor(n_jobs=-1, random_state=0, learning_rate=0.5)
model_2 = GridSearchCV(regressor_2, exploring_params_2, cv=3, scoring='r2', verbose=2)
model_2.fit(X_train, y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] max_depth=6, n_estimators=150 ...................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................... max_depth=6, n_estimators=150, total=10.1min
[CV] max_depth=6, n_estimators=150 ...................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 10.1min remaining:    0.0s


[CV] .................... max_depth=6, n_estimators=150, total=10.1min
[CV] max_depth=6, n_estimators=150 ...................................
[CV] .................... max_depth=6, n_estimators=150, total=10.0min
[CV] max_depth=6, n_estimators=250 ...................................
[CV] .................... max_depth=6, n_estimators=250, total=16.3min
[CV] max_depth=6, n_estimators=250 ...................................
[CV] .................... max_depth=6, n_estimators=250, total=16.3min
[CV] max_depth=6, n_estimators=250 ...................................
[CV] .................... max_depth=6, n_estimators=250, total=16.3min
[CV] max_depth=9, n_estimators=150 ...................................
[CV] .................... max_depth=9, n_estimators=150, total=14.4min
[CV] max_depth=9, n_estimators=150 ...................................
[CV] .................... max_depth=9, n_estimators=150, total=14.5min
[CV] max_depth=9, n_estimators=150 ...................................
[CV] .

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 192.9min finished


GridSearchCV(cv=3,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=0.5, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=-1,
                                    num_parallel_tree=None, random_state=0,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameters

In [30]:
pd.DataFrame(model_2.cv_results_).sort_values('rank_test_score')[:10]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
3,1395.589718,2.069993,10.149041,0.19993,9,250,"{'max_depth': 9, 'n_estimators': 250}",0.81219,0.841548,0.783188,0.812309,0.023825,1
2,858.946238,2.442943,9.493098,0.322827,9,150,"{'max_depth': 9, 'n_estimators': 150}",0.798587,0.825552,0.760581,0.794907,0.026652,2
1,967.438531,1.206931,9.552765,0.217519,6,250,"{'max_depth': 6, 'n_estimators': 250}",0.783937,0.789487,0.73818,0.770535,0.02299,3
0,595.621111,3.298206,9.585379,0.191568,6,150,"{'max_depth': 6, 'n_estimators': 150}",0.752304,0.762007,0.699768,0.738026,0.027341,4


In [27]:
y_train_pred, y_pred = train_model(model_2.best_estimator_, X_train, y_train)

In [28]:
models_results['BestXGB_2 - Train'] = get_scores(y_train, y_train_pred)
models_results['BestXGB_2 - Test'] = get_scores(y_test, y_pred)

In [29]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
Voting - Train,2.04752,67.289485,8.203017,0.756334
Voting - Test,2.171182,81.369867,9.020525,0.703452
XGB - Train,2.08775,75.927855,8.713659,0.725054
XGB - Test,2.114084,81.909137,9.050367,0.701486
BestXGB - Train,1.764548,57.186877,7.562201,0.792918
BestXGB - Test,1.807601,69.011172,8.307296,0.748492
BestXGB_2 - Train,1.09712,16.28503,4.035471,0.941029
BestXGB_2 - Test,1.215706,34.061746,5.836244,0.875864


In [31]:
# Utilizamos el BestXGB_2 para ver si tocando los valores de regularizacion podemos acortar la brecha de 
# los score de train/test

In [47]:
xgb_lambda = XGBRegressor(n_jobs=-1, random_state=0, learning_rate=0.5, max_depth=9, n_estimators=250, reg_lambda=20)

In [48]:
y_train_pred, y_pred = train_model(xgb_lambda, X_train, y_train)

In [51]:
models_results['XGB_Lambda_20 - Train'] = get_scores(y_train, y_train_pred)
models_results['XGB_Lambda_20 - Test'] = get_scores(y_test, y_pred)

In [52]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
Voting - Train,2.04752,67.289485,8.203017,0.756334
Voting - Test,2.171182,81.369867,9.020525,0.703452
XGB - Train,2.08775,75.927855,8.713659,0.725054
XGB - Test,2.114084,81.909137,9.050367,0.701486
BestXGB - Train,1.764548,57.186877,7.562201,0.792918
BestXGB - Test,1.807601,69.011172,8.307296,0.748492
BestXGB_2 - Train,1.09712,16.28503,4.035471,0.941029
BestXGB_2 - Test,1.215706,34.061746,5.836244,0.875864
XGB_Gamma_1 - Train,1.091293,15.276696,3.908541,0.944681
XGB_Gamma_1 - Test,1.215235,35.364219,5.946782,0.871117


In [20]:
xgb_lambda = XGBRegressor(n_jobs=-1, random_state=0, learning_rate=0.5, max_depth=9, n_estimators=250, reg_alpha=100)
y_train_pred, y_pred = train_model(xgb_lambda, X_train, y_train)
models_results['XGB_Alpha_100 - Train'] = get_scores(y_train, y_train_pred)
models_results['XGB_Alpha_100 - Test'] = get_scores(y_test, y_pred)

In [22]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
XGB_Alpha_1 - Train,1.088867,15.653618,3.956465,0.943316
XGB_Alpha_1 - Test,1.209959,33.548025,5.792066,0.877736
XGB_Alpha_5 - Train,1.079857,13.488007,3.672602,0.951158
XGB_Alpha_5 - Test,1.206055,33.745839,5.809117,0.877015
XGB_Alpha_20 - Train,1.145304,16.855542,4.10555,0.938964
XGB_Alpha_20 - Test,1.26036,37.956277,6.160867,0.86167
XGB_Alpha_100 - Train,1.24582,18.584154,4.310934,0.932704
XGB_Alpha_100 - Test,1.345129,37.86797,6.153696,0.861992


## Cálculo de Métricas y Conclusiones

In [31]:
## Nos quedamos con el modelo XGBoost con regularizacion L2 (alpha=5) ya que logra los niveles
# mas bajos de error y mantiene altos scores en r2

In [None]:
# Para probar la efectividad de los hiperparametros sampleareamos 1M de registros del dataset
# y volveremos a entrenar/testear el modelo resultante para comprobar si se mantienen 
# los valores de errores/r2

In [42]:
dataset = pd.read_pickle(url, compression='zip')

In [43]:
dataset.drop(columns=['um_cc', 'um_gr', 'um_kg', 'um_lt', 'um_ml', 'um_mt', 'um_pack', 'um_un'], inplace=True)
dataset.drop(columns=['fecha_anterior','fecha_20200412', 'fecha_20200419', 'fecha_20200426', 'fecha_20200502', 'fecha_20200518'], inplace=True)
dataset.drop(columns=['suctipo_autoservicio', 'prov_catamarca','provincia'], inplace=True) # Dummy trap!
dataset.drop(columns=['precio_producto_mean'], inplace=True)

In [44]:
muestras = 1000000
# Utilizamos otra seed=42 para que el sampleo sea diferente al que se utilizo para los entrenamientos anteriores
X_train, X_test, y_train, y_test = get_dataset_ready(dataset, muestras, rnd=42)
del dataset

In [49]:
xgb_hyp = XGBRegressor(n_jobs=-1, random_state=0, learning_rate=0.5, max_depth=9, n_estimators=250,reg_alpha=10)
y_train_pred, y_pred = train_model(xgb_hyp, X_train, y_train)
models_results['XGB_250_A10 - Train'] = get_scores(y_train, y_train_pred)
models_results['XGB_250_A10 - Test'] = get_scores(y_test, y_pred)

In [50]:
pd.DataFrame(models_results).T

Unnamed: 0,mae,mse,rmse,r2
XGB_Alpha_1 - Train,1.088867,15.653618,3.956465,0.943316
XGB_Alpha_1 - Test,1.209959,33.548025,5.792066,0.877736
XGB_Alpha_5 - Train,1.079857,13.488007,3.672602,0.951158
XGB_Alpha_5 - Test,1.206055,33.745839,5.809117,0.877015
XGB_Alpha_20 - Train,1.145304,16.855542,4.10555,0.938964
XGB_Alpha_20 - Test,1.26036,37.956277,6.160867,0.86167
XGB_Alpha_100 - Train,0.81366,5.99352,2.448167,0.979003
XGB_Alpha_100 - Test,1.081556,42.547941,6.522878,0.828484
XGB_500_A5 - Train,0.852413,6.573736,2.56393,0.97697
XGB_500_A5 - Test,1.106381,38.353756,6.193041,0.845392


In [None]:
# seleccionamos el siguiente modelo:
XGBRegressor(n_jobs=-1, random_state=0, learning_rate=0.5, max_depth=9, n_estimators=250,reg_alpha=5)
#
# con estos resultados
#               	mae	        mse	        rmse	    r2
#XGB_250_A5 - Train	1.086918	13.353561	3.654252	0.953218
#XGB_250_A5 - Test	1.272563	42.471277	6.516999	0.828793

## Opcional: Aplicar PCA

In [None]:
from sklearn.decomposition import PCA
n = 
pca = PCA(n_components=n)
reduc_dim = pca.fit(X)
reduc_dim.components_
reduc_dim.explained_variance_