# TP5: Estimating the weight and dimensions of Mercado Libre shipments

# Course: Unsupervised Learning

## Dataset analysis. Communicating results and conclusions

Based on the course theory and the fourth lab, prepare a communication in textual or interactive format that describes the solution to the activities proposed below. Optional (non-mandatory) activities that may be of interest are listed at the end.

### Proposed activities:

1. Apply PCA (principal component analysis) to the feature set to reduce its dimensionality. Try several values of the parameter that sets the number of output components (`n_components`), for example 2, 5 and 10. For each resulting version, add the target, train the 3 best models found in the previous lab, and report metrics.

   Note: Remember that for PCA it is important that the features are normalized, so we recommend using `StandardScaler`.

2. Apply K-Means to cluster on the following proposed features: `SHP_WEIGHT` (physical weight of the item) and `SHP_LENGTH` (length of the item). Try several values of K, for example 5, 10 and 15.

   Then plot the domains (variable `DOMAIN_ID`) within each cluster, using a different color for each domain.

   Analyze:

   a. Which domains are the most frequent in each cluster?
   b. Are there domains that appear in more than one cluster? Which ones?

3. [Optional] Apply another clustering algorithm, such as a Gaussian mixture.
    
The communication should target a technical audience with no knowledge of this particular topic, for example your classmates or the project's stakeholders. Ideally, in addition to the document, a short presentation for stakeholders should be produced, explaining the analysis performed on the data and the conclusions drawn from it.

The following aspects will be evaluated:

- The report must convey a clear message, presented concisely.
- The plots must apply the visual perception concepts covered in class.
- The statistical significance of the work must be described or estimated.


## Loading libraries and data

In [11]:
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
import seaborn
import scipy as sc
from math import sqrt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, StandardScaler
from ast import literal_eval
from pandas.io.json import json_normalize
from fancyimpute import KNN
# extras for TP5
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import mean_absolute_error, median_absolute_error, mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

%matplotlib inline

In [4]:
random.seed(0)
DATASET = '../meli_dataset_20190426.csv'
df_original = pd.read_csv(DATASET, low_memory=False)

In [5]:
# Work on a copy of the first 10,000 rows so df_original stays untouched
df = df_original.head(10000).copy()

## Preprocessing

Based on the work done in TP2, records with `STATUS` 404 or with missing values in the `SHP` variables are removed, the data is grouped by `ITEM_ID`, and values are replaced by the per-item median. In addition, some categorical variables are encoded and missing values of `PRICE` are imputed.
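The "group by item and replace with the median" step can be illustrated on a toy table. This is only a sketch with hypothetical values, not the notebook's exact procedure: it uses `groupby(...).transform("median")` and keeps one row per item.

```python
import pandas as pd

# Toy data (hypothetical values): item "A" was measured twice, item "B" once.
toy = pd.DataFrame({
    "ITEM_ID": ["A", "A", "B"],
    "SHP_WEIGHT": [100.0, 300.0, 50.0],
    "SHP_LENGTH": [10.0, 20.0, 5.0],
})
shp_cols = ["SHP_WEIGHT", "SHP_LENGTH"]

# Replace each measurement with the median of its item.
toy[shp_cols] = toy.groupby("ITEM_ID")[shp_cols].transform("median")

# Keep a single row per item.
per_item = toy.drop_duplicates(subset="ITEM_ID", keep="first").reset_index(drop=True)
print(per_item)
```

With `keep="first"` every item survives with its median measurements, whereas `keep=False` would discard all rows of any item that appears more than once.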

In [6]:
# Drop records with status 404
df = df[df.STATUS != "404"]
df = df.drop(columns=['STATUS'])

# Drop records with missing values in the SHP variables
df = df.dropna(subset=['SHP_WEIGHT', 'SHP_LENGTH', 'SHP_WIDTH', 'SHP_HEIGHT'])

# Group by item id and replace with the median
# Median of the numeric columns for each item_id
df_grouped = df.groupby(['ITEM_ID'], as_index=False).median(numeric_only=True)
# Sort the dataframe by item_id
df = df.sort_values('ITEM_ID')
# Drop rows with a duplicated item_id
# (keep=False discards every row of a repeated item; keep='first' would retain one)
df = df.drop_duplicates(subset='ITEM_ID', keep=False)
# Update the dataframe with the median weights and dimensions.
# set_index must not use inplace=True here: it would return None and update nothing.
df = df.set_index('ITEM_ID')
df.update(df_grouped.set_index('ITEM_ID'))
# Back to a 0..n-1 index so the encoded frames below align row by row
df = df.reset_index(drop=True)

# Binarize CATALOG_PRODUCT_ID, CONDITION and DOMAIN_ID

column = 'CATALOG_PRODUCT_ID'
lb = LabelBinarizer()
lb_results = lb.fit_transform(df[column])
CATALOG_PRODUCT_ID_ENCODED = pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_))

column = "CONDITION"
lb = LabelBinarizer()
lb_results = lb.fit_transform(df[column].astype(str))
CONDITION_ENCODED = pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_))

column = 'DOMAIN_ID'
lb = LabelBinarizer()
lb_results = lb.fit_transform(df[column].astype(str))
DOMAIN_ID_ENCODED = pd.DataFrame(lb_results, columns=(column + '_') + pd.Series(lb.classes_))

# Append the encoded categorical variables to the dataset
df = pd.concat([df, CATALOG_PRODUCT_ID_ENCODED, CONDITION_ENCODED, DOMAIN_ID_ENCODED], axis=1)


In [7]:
df

Unnamed: 0,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,ATTRIBUTES,CATALOG_PRODUCT_ID,CONDITION,DOMAIN_ID,PRICE,SELLER_ID,...,DOMAIN_ID_MLB-WIRELESS_ANTENNAS,DOMAIN_ID_MLB-WIRELESS_CHARGERS,DOMAIN_ID_MLB-WIRELESS_FM_TRANSMITTERS,DOMAIN_ID_MLB-WIRE_STRIPPERS,DOMAIN_ID_MLB-WOMEN_SWIMWEAR,DOMAIN_ID_MLB-WRENCHES,DOMAIN_ID_MLB-WRENCH_SETS,DOMAIN_ID_MLB-WRISTWATCHES,DOMAIN_ID_MLB-XENON_KITS,DOMAIN_ID_nan
0,775.0,50.0,20.0,10.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-ENGINE_GASKET_SETS,750.00,QD3YJ9751S,...,0,0,0,0,0,0,0,0,0,0
1,6100.0,70.0,25.0,5.0,"[{'id': 'BEDDING_SET_SIZE', 'name': 'Tamanho',...",H53U1H7Q5G,new,MLB-BEDDING_SETS,119.90,J3EY3QAB29,...,0,0,0,0,0,0,0,0,0,0
2,464.0,20.0,11.0,10.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-AUTOMOBILE_FUEL_PUMPS,349.90,NO4W1R9S3D,...,0,0,0,0,0,0,0,0,0,0
3,150.0,25.0,25.0,11.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-PENDRIVES,21.99,KIQX6YQZI4,...,0,0,0,0,0,0,0,0,0,0
4,3719.0,42.0,34.0,13.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",GITRVCM7WO,used,MLB-GAME_CONSOLES,849.00,ZQIKYCCZ7E,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3987,431.0,25.0,25.0,5.0,"[{'id': 'CLOSING', 'name': 'Fecho', 'value_id'...",H53U1H7Q5G,new,MLB-FANNY_PACKS,69.90,GPWP5IFQEN,...,0,0,0,0,0,0,0,0,0,0
3988,150.0,20.0,20.0,20.0,"[{'id': 'ITEM_CONDITION', 'name': 'Condição do...",H53U1H7Q5G,new,MLB-PORTABLE_ELECTRIC_MASSAGERS,7.50,OFLRK20BUP,...,0,0,0,0,0,0,0,0,0,0
3989,3880.0,36.0,24.0,13.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-ENGINE_OILS,145.90,MQICEHKRH5,...,0,0,0,0,0,0,0,0,0,0
3990,1040.0,28.0,18.0,8.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",CCNZQYJ1G6,new,MLB-ROUTERS,329.49,ANYX5441IO,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# KNN imputation of missing PRICE values (only the numeric columns are kept)
df_numeric = df.select_dtypes([np.number])
df_filled = pd.DataFrame(KNN(3).fit_transform(df_numeric))
df_filled.columns = df_numeric.columns
df = df_filled

Imputing row 1/3992 with 0 missing, elapsed time: 254.758
Imputing row 101/3992 with 0 missing, elapsed time: 254.769
Imputing row 201/3992 with 0 missing, elapsed time: 254.776
Imputing row 301/3992 with 0 missing, elapsed time: 254.783
Imputing row 401/3992 with 0 missing, elapsed time: 254.790
Imputing row 501/3992 with 0 missing, elapsed time: 254.797
Imputing row 601/3992 with 0 missing, elapsed time: 254.804
Imputing row 701/3992 with 0 missing, elapsed time: 254.812
Imputing row 801/3992 with 0 missing, elapsed time: 254.820
Imputing row 901/3992 with 0 missing, elapsed time: 254.828
Imputing row 1001/3992 with 0 missing, elapsed time: 254.838
Imputing row 1101/3992 with 0 missing, elapsed time: 254.847
Imputing row 1201/3992 with 0 missing, elapsed time: 254.856
Imputing row 1301/3992 with 0 missing, elapsed time: 254.866
Imputing row 1401/3992 with 1 missing, elapsed time: 254.877
Imputing row 1501/3992 with 0 missing, elapsed time: 254.885
Imputing row 1601/3992 with 0 missin

In [9]:
df=df_filled
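The imputation above relies on `fancyimpute`. As a possible alternative, not the one used in this notebook, scikit-learn ships `KNNImputer` (available from version 0.22). The sketch below applies it to a small matrix with hypothetical values; the variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (hypothetical values); columns: weight, length, price.
# The third row has a missing price.
X_toy = np.array([
    [100.0, 10.0, 20.0],
    [110.0, 11.0, 22.0],
    [105.0, 10.0, np.nan],
    [900.0, 60.0, 300.0],
])

# Each missing value is filled with the mean of that column over the
# 2 nearest rows, where distance is computed on the observed columns.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X_toy)
print(X_imputed[2])
```

Because the neighbors are found with a Euclidean-type distance, columns on very different scales (such as weight and price) influence the result unevenly, so scaling before imputing may be worth considering.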

In [12]:
# split into instances and labels

X, y = df.iloc[:, 4:], df[['SHP_WEIGHT', 'SHP_LENGTH', 'SHP_WIDTH', 'SHP_HEIGHT']]

# split into training and evaluation sets
# stratify=y is not used because this is not a classification problem
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Activity 1

Apply PCA (principal component analysis) to the feature set to reduce its dimensionality. Try several values of the parameter that sets the number of output components (`n_components`), for example 2, 5 and 10. For each resulting version, add the target, train the 3 best models found in the previous lab, and report metrics.

Note: Remember that for PCA it is important that the features are normalized, so we recommend using `StandardScaler`.

First, the features are standardized with `StandardScaler`.

In [18]:
scaled_features = StandardScaler().fit_transform(X_train.values)

In [19]:
scaled_features_X_train = pd.DataFrame(scaled_features, index=X_train.index, columns=X_train.columns)

In [20]:
scaled_features_X_train

Unnamed: 0,PRICE,CATALOG_PRODUCT_ID_A0RY70BE19,CATALOG_PRODUCT_ID_A2H2JJFBXM,CATALOG_PRODUCT_ID_A4M0AP2TSK,CATALOG_PRODUCT_ID_A6X73QCLS9,CATALOG_PRODUCT_ID_A7Y7QKJ7EF,CATALOG_PRODUCT_ID_ADKMKF0FVM,CATALOG_PRODUCT_ID_AF4WQUGCVH,CATALOG_PRODUCT_ID_AFPLIBE9VN,CATALOG_PRODUCT_ID_AG9UI846DP,...,DOMAIN_ID_MLB-WIRELESS_ANTENNAS,DOMAIN_ID_MLB-WIRELESS_CHARGERS,DOMAIN_ID_MLB-WIRELESS_FM_TRANSMITTERS,DOMAIN_ID_MLB-WIRE_STRIPPERS,DOMAIN_ID_MLB-WOMEN_SWIMWEAR,DOMAIN_ID_MLB-WRENCHES,DOMAIN_ID_MLB-WRENCH_SETS,DOMAIN_ID_MLB-WRISTWATCHES,DOMAIN_ID_MLB-XENON_KITS,DOMAIN_ID_nan
549,-0.223862,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,28.235616,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
557,-0.196569,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
1590,-0.142136,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
602,-0.151441,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
442,-0.286359,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1130,0.022557,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
1294,0.190492,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
860,-0.289259,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,-0.443089
3507,-0.032477,-0.0177,-0.0177,-0.0177,-0.0177,-0.025035,-0.0177,-0.0177,-0.0177,0.0,...,-0.0177,-0.035416,-0.035416,-0.0177,-0.025035,-0.025035,0.0,-0.058796,-0.030667,2.256881
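The scaling above is fitted on `X_train` only. To evaluate on the test split later, the same fitted scaler would need to be reused rather than refitted. A minimal sketch on synthetic data, with all names and values illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test split reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The test split will generally not have exactly zero mean and unit variance after this transform, which is expected.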


In [21]:
# PCA is fitted on the training sample only, to avoid leakage
pca = PCA(n_components=5)
pca.fit(scaled_features_X_train)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)


[0.00283834 0.00209203 0.00178451 0.00173831 0.00171424]
[102.39953119  87.91248449  81.19422191  80.13629824  79.57973611]
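One way to read these ratios is cumulatively. The sketch below uses a synthetic one-hot matrix, not the Mercado Libre data, and is an assumption about why the ratios are so small: when almost every feature is a standardized one-hot column, the columns are nearly uncorrelated and each carries a similar share of the variance, so no direction dominates.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows, n_categories = 2000, 200

# Synthetic one-hot matrix: each row belongs to exactly one of 200 categories.
labels = rng.integers(0, n_categories, size=n_rows)
one_hot = np.eye(n_categories)[labels]

scaled = StandardScaler().fit_transform(one_hot)
pca_demo = PCA(n_components=10).fit(scaled)

# Running total of the variance explained by the first 1, 2, ..., 10 components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative.round(4))
```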


In [22]:
pca = PCA(n_components=10)
pca.fit(scaled_features_X_train)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

[0.0028381  0.00212601 0.00181481 0.00176011 0.00171951 0.00171516
 0.00171156 0.00170915 0.00170675 0.00170523]
[102.39529924  88.62357778  81.88069094  80.63719798  79.70180658
  79.6008885   79.51730671  79.46144524  79.40567409  79.37021381]


In [23]:
pca = PCA(n_components=15)
pca.fit(scaled_features_X_train)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

[0.00283798 0.00213064 0.00184288 0.00175918 0.00172141 0.00171641
 0.00171415 0.0017126  0.00171044 0.00170965 0.00170771 0.00170693
 0.00170164 0.00169638 0.00169208]
[102.39311529  88.71999616  82.51153288  80.61590399  79.74583593
  79.62993411  79.57762305  79.54148222  79.49137933  79.47313873
  79.42795336  79.40975308  79.28655879  79.16393698  79.06364457]


No principal component explains a large share of the variance: the first one accounts for about 0.28%, and the first 15 together for only about 2.7%.
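The remaining part of the activity, training models on each reduced version and reporting metrics, could be organized as a loop over `n_components`. The sketch below is hedged in two ways: it runs on synthetic data, and since the three best models from the previous lab are not named here, `RandomForestRegressor` is used as a stand-in. Wrapping the steps in a pipeline keeps the scaler and PCA fitted on the training split only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 20))                           # synthetic features
y_syn = X_syn[:, 0] * 3.0 + rng.normal(scale=0.1, size=300)  # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

mae_by_components = {}
for n in (2, 5, 10):
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n, random_state=0),
        RandomForestRegressor(n_estimators=50, random_state=0),  # stand-in model
    )
    model.fit(X_tr, y_tr)  # scaler and PCA see the training split only
    mae_by_components[n] = mean_absolute_error(y_te, model.predict(X_te))

for n, mae in mae_by_components.items():
    print(f"n_components={n}: MAE={mae:.3f}")
```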

## Activity 2

Apply K-Means to cluster on the following proposed features: `SHP_WEIGHT` (physical weight of the item) and `SHP_LENGTH` (length of the item). Try several values of K, for example 5, 10 and 15.

Then plot the domains (variable `DOMAIN_ID`) within each cluster, using a different color for each domain.

Analyze:

a. Which domains are the most frequent in each cluster?
b. Are there domains that appear in more than one cluster? Which ones?
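Both questions can be approached with a contingency table of domains against clusters. The sketch below uses a hypothetical toy table; it assumes a DataFrame that still holds the original `DOMAIN_ID` column next to a cluster label. The column name `cluster` is illustrative.

```python
import pandas as pd

# Hypothetical cluster labels and domains for seven items.
assignments = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 2],
    "DOMAIN_ID": ["MLB-ROUTERS", "MLB-ROUTERS", "MLB-PENDRIVES",
                  "MLB-ROUTERS", "MLB-ENGINE_OILS", "MLB-ENGINE_OILS",
                  "MLB-ENGINE_OILS"],
})

# Rows: domains, columns: clusters, cells: item counts.
counts = pd.crosstab(assignments["DOMAIN_ID"], assignments["cluster"])

# (a) Most frequent domain in each cluster.
top_domain = counts.idxmax(axis=0)

# (b) Domains present in more than one cluster.
n_clusters_per_domain = (counts > 0).sum(axis=1)
shared_domains = n_clusters_per_domain[n_clusters_per_domain > 1].index.tolist()

print(top_domain)
print(shared_domains)
```

Note that `idxmax` returns only the first domain when several are tied for the top count in a cluster.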

In [25]:
X_train

Unnamed: 0,PRICE,CATALOG_PRODUCT_ID_A0RY70BE19,CATALOG_PRODUCT_ID_A2H2JJFBXM,CATALOG_PRODUCT_ID_A4M0AP2TSK,CATALOG_PRODUCT_ID_A6X73QCLS9,CATALOG_PRODUCT_ID_A7Y7QKJ7EF,CATALOG_PRODUCT_ID_ADKMKF0FVM,CATALOG_PRODUCT_ID_AF4WQUGCVH,CATALOG_PRODUCT_ID_AFPLIBE9VN,CATALOG_PRODUCT_ID_AG9UI846DP,...,DOMAIN_ID_MLB-WIRELESS_ANTENNAS,DOMAIN_ID_MLB-WIRELESS_CHARGERS,DOMAIN_ID_MLB-WIRELESS_FM_TRANSMITTERS,DOMAIN_ID_MLB-WIRE_STRIPPERS,DOMAIN_ID_MLB-WOMEN_SWIMWEAR,DOMAIN_ID_MLB-WRENCHES,DOMAIN_ID_MLB-WRENCH_SETS,DOMAIN_ID_MLB-WRISTWATCHES,DOMAIN_ID_MLB-XENON_KITS,DOMAIN_ID_nan
549,82.30000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
557,99.90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1590,135.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
602,129.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
442,42.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1130,241.20000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1294,349.49000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,40.13000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3507,205.71175,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [34]:
kmeans=KMeans(n_clusters=5).fit(y_train[['SHP_WEIGHT','SHP_LENGTH']])
print(kmeans.labels_)
print(kmeans.cluster_centers_)
#kmeans.predict([[0, 0], [12, 3]])

[0 0 0 ... 0 0 0]
[[  577.86531842    25.76975043]
 [11880.26495726    57.60512821]
 [ 6714.86238532    52.92201835]
 [19989.6122449     57.42040816]
 [ 3033.86597938    41.65649485]]
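In the centers above, the weight coordinate ranges from roughly 578 to 19,990 while the length coordinate only ranges from about 26 to 58, so the Euclidean distance used by K-Means is dominated by weight. One option is to standardize both features before clustering and to compare the inertia across values of K. This is a sketch on synthetic data with illustrative names, not a result for this dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic weights and lengths on very different scales.
weight = rng.uniform(100, 20000, size=500)
length = rng.uniform(10, 70, size=500)
features = np.column_stack([weight, length])

# Put both features on a comparable scale before computing distances.
scaled = StandardScaler().fit_transform(features)

inertia_by_k = {}
for k in (5, 10, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertia_by_k[k] = km.inertia_  # within-cluster sum of squared distances

print(inertia_by_k)
```

Fixing `random_state` also makes the cluster labels reproducible between runs.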


In [35]:
kmeans=KMeans(n_clusters=10).fit(y_train[['SHP_WEIGHT','SHP_LENGTH']])
print(kmeans.labels_)
print(kmeans.cluster_centers_)

[7 1 1 ... 7 7 7]
[[9.73896429e+03 5.52380952e+01]
 [1.15109413e+03 3.07203274e+01]
 [4.48882000e+03 4.57440000e+01]
 [1.65470882e+04 5.46352941e+01]
 [6.89993939e+03 5.50227273e+01]
 [2.11100000e+04 5.82962963e+01]
 [2.58053517e+03 4.05033639e+01]
 [3.27877335e+02 2.35858655e+01]
 [3.10000000e+04 6.15000000e+01]
 [1.27572500e+04 6.11416667e+01]]


In [36]:
kmeans=KMeans(n_clusters=15).fit(y_train[['SHP_WEIGHT','SHP_LENGTH']])
print(kmeans.labels_)
print(kmeans.cluster_centers_)

[ 0  6  6 ...  0 12 12]
[[2.10516636e+02 2.19467652e+01]
 [9.74570149e+03 5.47910448e+01]
 [2.35566667e+04 6.43333333e+01]
 [4.52301361e+03 4.68489796e+01]
 [2.12697561e+03 3.67093496e+01]
 [1.47928800e+04 5.73600000e+01]
 [1.24941629e+03 3.20366516e+01]
 [1.21426000e+04 6.02622222e+01]
 [7.64396154e+03 5.60769231e+01]
 [3.19998204e+03 4.27550898e+01]
 [3.10000000e+04 6.15000000e+01]
 [1.73760500e+04 5.58300000e+01]
 [6.50524310e+02 2.73499343e+01]
 [2.04109524e+04 5.65714286e+01]
 [6.01427381e+03 5.23928571e+01]]
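For the optional activity, a Gaussian mixture can be fitted with `GaussianMixture`, which is already imported at the top of the notebook. The sketch below runs on two synthetic blobs rather than on the shipment data; unlike K-Means, it also returns a membership probability per component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs standing in for (weight, length) pairs.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(150, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(150, 2))
points = np.vstack([blob_a, blob_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm_labels = gmm.fit_predict(points)    # hard assignment per point
membership = gmm.predict_proba(points)  # soft assignment: one probability per component

print(np.bincount(gmm_labels))
print(membership[:3].round(3))
```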
