# Used car prices

Imagine you have a friend who wants to sell a car but isn't sure how much to ask for it. What would you do?

This is the problem Aditya ran into, so he used web scraping (downloading information from the web) to collect data from a car sales site that listed a range of characteristics for each car along with the final price it sold for.

 > ❓ How can we help?
 
Load the dataset from the file `cars.csv`; you can find the original dataset [here](https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes).

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport  # newer releases ship as `ydata-profiling` (import `ydata_profiling`)
import os
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
cars = pd.read_csv("cars.csv")

In [5]:
cars.head()

Unnamed: 0,maker,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,cclass,C Class,2020,Automatic,1200,Diesel,,,2.0,30495
1,cclass,C Class,2020,Automatic,1000,Petrol,,,1.5,29989
2,cclass,C Class,2020,Automatic,500,Diesel,,,2.0,37899
3,cclass,C Class,2019,Automatic,5000,Diesel,,,2.0,30399
4,cclass,C Class,2019,Automatic,4500,Diesel,,,2.0,29899


 > ⁉️ As for evaluation metrics: in regression we don't have many options; we can use RMSE or MSE (the evaluation further down uses MAE, the mean absolute error)
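
As a quick reference, here is a minimal sketch of how these metrics are computed by hand. The prices below are made up for illustration (they are not taken from `cars.csv`), and the helper names `mse`, `rmse`, and `mae` are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in price units."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative (made-up) real and predicted prices
y_true = [10000, 12000, 8000, 15000]
y_pred = [11000, 11500, 9000, 12000]

print(mse(y_true, y_pred))   # 2812500.0
print(rmse(y_true, y_pred))  # about 1677.05
print(mae(y_true, y_pred))   # 1375.0
```

Note how the single 3000 error dominates the MSE and RMSE, while the MAE weighs every error linearly.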

## Exploratory Data Analysis

Before diving into modeling, let's generate a report using the [Pandas Profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) library:

In [6]:
profile = ProfileReport(cars, title="Raw Car Dataset Analysis", explorative=True)
profile.to_file("cars-report.html")

In [8]:
print(os.getcwd() + "/cars-report.html")

/home/diana/Descargas/cf-ml-main/car-prices/cars-report.html


## Remove duplicates

One of the main warnings in our report is that the dataset contains duplicate rows, so let's start there.

 > 😉 Read the `drop_duplicates` documentation to see what the different parameters do

In [9]:
cars = cars.drop_duplicates(keep='first')
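
To get a feel for those parameters, here is a small sketch on a made-up DataFrame (the `toy` data is hypothetical, not part of `cars.csv`) comparing the `keep` and `subset` options of `drop_duplicates`:

```python
import pandas as pd

# Made-up data: rows 0 and 1 are exact duplicates,
# row 3 is the same car with a different price
toy = pd.DataFrame({
    "maker": ["ford", "ford", "audi", "ford"],
    "model": ["Focus", "Focus", "A3", "Focus"],
    "price": [9000, 9000, 15000, 9500],
})

first = toy.drop_duplicates(keep="first")   # keep the first copy of each duplicate
last = toy.drop_duplicates(keep="last")     # keep the last copy instead
neither = toy.drop_duplicates(keep=False)   # drop every row that has a duplicate
by_car = toy.drop_duplicates(subset=["maker", "model"])  # compare only these columns

print(list(first.index))    # [0, 2, 3]
print(list(last.index))     # [1, 2, 3]
print(list(neither.index))  # [2, 3]
print(list(by_car.index))   # [0, 2]
```

Keep in mind that `subset` can discard rows that differ in other columns (such as `price` here), so use it only when those differences really don't matter.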

## Split the dataset

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
rest, test = train_test_split(cars, test_size=0.2, shuffle=True) # 20% of 100 = 20
train, val = train_test_split(rest, test_size=0.25, shuffle=True) # 25% of 80 = 20
distributions = np.array([len(train), len(val), len(test)])

print(distributions)
print(distributions / len(cars))

[63759 21254 21254]
[0.59998871 0.20000565 0.20000565]
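
The second `test_size` is 0.25 rather than 0.2 because it applies to the 80% left after the first split. The sketch below reproduces that arithmetic. It assumes the held-out portion is rounded up to a whole row, which matches the counts printed above; 106267 is the sum of those three counts, and `split_counts` is a hypothetical helper.

```python
import math

def split_counts(n_rows, test_fraction, val_fraction):
    """Return (train, val, test) sizes for two chained splits.

    Assumption: each held-out part is rounded up to a whole row.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_rest = n_rows - n_test
    # Validation fraction relative to what is left after removing test
    second_size = val_fraction / (1.0 - test_fraction)
    n_val = math.ceil(n_rest * second_size)
    n_train = n_rest - n_val
    return n_train, n_val, n_test, second_size

n_train, n_val, n_test, second_size = split_counts(106267, 0.2, 0.2)
print(second_size)             # roughly 0.25
print(n_train, n_val, n_test)  # 63759 21254 21254
```

If you want the same rows in each set on every run, `train_test_split` also accepts a `random_state` argument.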


## *Feature engineering*

### One-hot encode categorical variables

We need a way to go from a categorical variable to numbers. For example, our dataset has a column called *"maker"*, the car's manufacturer: "bmw", "ford", "audi"...

We need to convert these values into numbers that our algorithm can use to train the model; this process is called *"encoding"*.

 > 📹 I have a video on variable types: [https://www.youtube.com/watch?v=SAWsQ3QmmJE](https://www.youtube.com/watch?v=SAWsQ3QmmJE)
 
One way to encode categorical variables is *One-Hot Encoding*, which expands our single categorical column into a vector of 1s and 0s (which we can represent as columns):

In [12]:
from sklearn.preprocessing import OneHotEncoder
maker_encoder = OneHotEncoder()

In [13]:
maker_encoder.fit(train[["maker"]])
mkr = maker_encoder.transform(train[["maker"]]).todense()

print(mkr.shape)

(63759, 11)


To inspect the categories, we can use the `categories_` attribute.

In [14]:
# categories_ holds one array per encoded column; take the one for "maker"
df = pd.DataFrame(mkr, columns=maker_encoder.categories_[0], index=train.index)
df["actual"] = train["maker"]
df.sample(5)

Unnamed: 0,audi,bmw,cclass,focus,ford,hyundi,merc,skoda,toyota,vauxhall,vw,actual
42803,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
93696,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,hyundi
641,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,cclass
93413,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,hyundi
32484,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,skoda


#### 🚨 `pd.get_dummies` ?

For *machine learning* I don't recommend `pd.get_dummies`, because it is not robust to missing categories and does not preserve state. For example, when we receive a single record to predict on in production:


In [15]:
test_maker = "audi"

In [16]:
pd.get_dummies([test_maker])

Unnamed: 0,audi
0,1


In [17]:
maker_encoder.transform([[test_maker]]).todense()

matrix([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
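
A related situation is a category the encoder never saw during `fit`. As a minimal sketch (the three-maker training list and the "tesla" value are made up for illustration), `OneHotEncoder` can be told to encode unknown categories as all zeros through its `handle_unknown` parameter, while still producing a vector of the same width:

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up training categories, just for this sketch
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["audi"], ["bmw"], ["ford"]])

known = encoder.transform([["bmw"]]).toarray()
unknown = encoder.transform([["tesla"]]).toarray()  # never seen during fit

print(known)    # [[0. 1. 0.]]
print(unknown)  # [[0. 0. 0.]]
```

With the default `handle_unknown="error"`, the second `transform` would raise instead, which may be what you prefer if an unknown maker should never reach the model silently.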

### Feature scaling

Some algorithms train on bare numbers, with no context at all, and some of them tend to give more importance to features whose values are larger. A safe bet is to rescale each feature so that all of them share the same scale while preserving the relative distances between values:

In [18]:
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, MaxAbsScaler
scaler = MaxAbsScaler()

In [19]:
scaler.fit(train[["mileage"]])
scaled = scaler.transform(train[["mileage"]])

In [20]:
values = pd.DataFrame({"mileage": train["mileage"].values, "scaled": scaled.squeeze() })
values.sample(5)

Unnamed: 0,mileage,scaled
27757,3000,0.009288
56082,18018,0.055783
2797,33500,0.103715
6793,15540,0.048111
50116,19550,0.060526
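
To see what the four imported scalers do, here is a sketch of their formulas (with default settings) written in plain NumPy. It uses only the five sampled mileages shown above, so the results differ from the `scaled` column, which was computed against the maximum of the whole training set.

```python
import numpy as np

# The five mileage values from the sample above
mileage = np.array([3000.0, 18018.0, 33500.0, 15540.0, 19550.0])

# MaxAbsScaler: divide by the largest absolute value
max_abs = mileage / np.abs(mileage).max()

# MinMaxScaler: map the minimum to 0 and the maximum to 1
min_max = (mileage - mileage.min()) / (mileage.max() - mileage.min())

# StandardScaler: subtract the mean, divide by the standard deviation
standard = (mileage - mileage.mean()) / mileage.std()

# RobustScaler: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(mileage, [25, 50, 75])
robust = (mileage - median) / (q3 - q1)

for name, values in [("max_abs", max_abs), ("min_max", min_max),
                     ("standard", standard), ("robust", robust)]:
    print(name, np.round(values, 3))
```

The robust version depends on the median and quartiles rather than the extremes, which is why it is usually less affected by a few cars with very high mileage.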


# Artifacts

We have seen a variety of tools that turn an observation from the real world, such as a line of dialogue spoken by a person or a car, into a group of numbers.

Things like `OneHotEncoder`, `CountVectorizer`, and `MaxAbsScaler` are part of this toolset: once prepared with `fit`, we must preserve them so we can reuse them in production. These tools are known as artifacts.

In [21]:
import pickle

with open("scaler.pickle", "wb") as wb:
    pickle.dump(scaler, wb)

In [22]:
with open("scaler.pickle", "rb") as rb:
    scaler_loaded = pickle.load(rb)

In [23]:
scaler_loaded.transform([[400]])

array([[0.00123839]])

# Pipelines  

Throughout modeling we create a lot of *artifacts* that we must keep to make sure we use the same values, parameters, and hyperparameters. One option would be to save each `OneHotEncoder`, `MinMaxScaler`, and any other object we create to train our ML model.

Another, slightly more organized way is to use a *scikit-learn* `Pipeline`:

In [24]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn import set_config

# One-hot encode maker, transmission, and fuelType
one_hot_encode = ColumnTransformer([
    (
        'one_hot_encode', # Name of the transformation
        OneHotEncoder(sparse=False), # Transformer to apply (scikit-learn >= 1.2 renames `sparse` to `sparse_output`)
        ["maker", "transmission", "fuelType"] # Columns involved
    )
])

# Robust-scale mileage
robust_encoding = ColumnTransformer([
    ('robust_encoding', RobustScaler(), ["mileage"])
])

# Impute missing values, then min-max scale mpg and tax (despite the `standard_scaling` name below)
impute_and_scale = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])

standard_scaling = ColumnTransformer([
    ('standard_scaling', impute_and_scale, ["mpg", "tax"])
])

# Just pass year and engineSize
passthrough = ColumnTransformer([('passthrough', 'passthrough', ['year', "engineSize"])])

# Assemble the full pipeline
pipe = Pipeline([
    (
        'features',
        FeatureUnion([
            ('one_hot_encode', one_hot_encode),
            ('robust_encoding', robust_encoding),
            ('just_passs', passthrough),
            ('scale_and_impute', standard_scaling)
        ])
    )
])
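
As an alternative sketch, not the approach used in this notebook: because `ColumnTransformer` accepts several transformers at once, the same kind of feature matrix can be built with a single one instead of a `FeatureUnion` of four. The `toy` DataFrame below is made up, and `sparse_threshold=0` is used here to ask for a dense array regardless of the encoder's output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler

# Made-up rows with the same column names as the notebook's features
toy = pd.DataFrame({
    "maker": ["ford", "audi", "ford", "bmw"],
    "transmission": ["Manual", "Automatic", "Manual", "Automatic"],
    "fuelType": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "mileage": [1000, 20000, 35000, 50000],
    "mpg": [55.0, float("nan"), 60.0, 45.0],
    "tax": [145.0, 150.0, float("nan"), 30.0],
    "year": [2019, 2017, 2015, 2018],
    "engineSize": [1.0, 2.0, 1.5, 3.0],
})

impute_then_scale = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])

# One transformer per group of columns, in the same order as the notebook
features = ColumnTransformer(
    [
        ("one_hot_encode", OneHotEncoder(), ["maker", "transmission", "fuelType"]),
        ("robust_encoding", RobustScaler(), ["mileage"]),
        ("just_pass", "passthrough", ["year", "engineSize"]),
        ("scale_and_impute", impute_then_scale, ["mpg", "tax"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = features.fit_transform(toy)
print(X.shape)  # (4, 12): 7 one-hot columns + 1 + 2 + 2
```

Columns that are not listed are dropped by default, so the target column never reaches the model through this step.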

In [25]:
from sklearn import set_config

set_config(display="diagram")
pipe

In [26]:
pipe.fit(train)

pd.DataFrame(pipe.transform(train))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.658575,2019.0,1.2,0.101169,0.250000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.291865,2018.0,2.4,0.076302,0.448276
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.515995,2015.0,1.5,0.153241,0.034483
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.244083,2012.0,1.8,0.091817,0.344828
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,-0.210114,2019.0,1.6,0.108608,0.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63754,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.655707,2015.0,1.5,0.080553,0.405172
63755,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.448905,2017.0,3.0,0.108608,0.250000
63756,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.742750,2017.0,2.4,0.076302,0.448276
63757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.567655,2016.0,1.0,0.132837,0.034483


In [27]:
pd.DataFrame(pipe.transform(test))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.126060,2017.0,1.2,0.127099,0.034483
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,-0.588012,2019.0,1.5,0.119447,0.250000
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.664311,2017.0,1.0,0.132837,0.034483
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.617295,2020.0,1.5,0.083953,0.250000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.388359,2018.0,1.2,0.093092,0.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,1.423661,2015.0,1.4,0.094580,0.275862
21250,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,-0.704459,2019.0,1.0,0.097768,0.250000
21251,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.595848,2019.0,2.3,0.116569,0.207450
21252,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.584296,2018.0,1.0,0.135813,0.258621


## Modeling

In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
lr = LinearRegression()

We create another pipeline that includes our chosen regression model:

In [30]:
predicting_pipeline = Pipeline([
    ('feature', pipe),
    ('estimator', lr)
])

In [31]:
# `train` still contains `price`, but the ColumnTransformers select only the feature columns
predicting_pipeline.fit(train, train['price'])

In [32]:
train_pred = predicting_pipeline.predict(train)
val_pred = predicting_pipeline.predict(val)

In [33]:
pd.DataFrame({'real':val['price'], 'predicted':val_pred})

Unnamed: 0,real,predicted
5557,10491,11220.3750
56574,4192,-2955.5625
6956,11784,13574.3125
25452,6750,2749.5000
63765,9487,14180.8750
...,...,...
55748,11270,10752.1875
64943,32000,35690.1875
19034,11499,16112.1250
8088,12700,16581.1250


## Model evaluation

In [34]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [35]:
# mean_absolute_error computes the MAE (mean absolute error), in price units
train_mae = mean_absolute_error(train['price'], train_pred)
val_mae = mean_absolute_error(val['price'], val_pred)

print(f"Training MAE:   {train_mae:.2f}\n"
      f"Validation MAE: {val_mae:.2f}")

Training MAE:   2932.78
Validation MAE: 2954.99


### Evaluation on the test data

In [36]:
test_pred = predicting_pipeline.predict(test)
test_mae = mean_absolute_error(test['price'], test_pred)

print(f"Test MAE: {test_mae:.2f}")

Test MAE: 2933.02


# Save the pipeline

In [37]:
from joblib import dump, load
dump(predicting_pipeline, 'car-prices.model') 

['car-prices.model']

## Predicting on our own car

In [38]:
saved_pipeline = load('car-prices.model')

In [39]:
maker = "ford"
model = "focus"
year = 2020
transmission = "Manual"
mileage = 50
fuelType = "Petrol"
tax = 100
mpg = 30
engineSize = 1.5

mi_automóvil = pd.DataFrame({
    "maker": [maker], "model": [model], "year": [year], "transmission": [transmission], 
    "mileage": [mileage], "fuelType": [fuelType], "tax": [tax], "mpg": [mpg], "engineSize": [engineSize],
})

price = saved_pipeline.predict(mi_automóvil).squeeze()

print(price)

ValueError: X has 9 features, but ColumnTransformer is expecting 10 features as input.
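
The error above comes from a mismatch in columns: the pipeline was fitted on `train`, which has 10 columns including `price`, while `mi_automóvil` has only 9. As the message indicates, the scikit-learn version used here checks that count. One way around it, sketched below on made-up data with a deliberately simplified pipeline (`toy`, `sketch_pipeline`, and `my_car` are hypothetical names), is to separate the target from the features before fitting, so that training and prediction see the same columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data, just for this sketch
toy = pd.DataFrame({
    "maker": ["ford", "audi", "ford", "bmw"],
    "mileage": [1000, 20000, 35000, 50000],
    "price": [15000, 18000, 9000, 12000],
})

# Separate the features from the target *before* fitting
X_train = toy.drop(columns="price")
y_train = toy["price"]

sketch_pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("one_hot_encode", OneHotEncoder(handle_unknown="ignore"), ["maker"]),
        ("just_pass", "passthrough", ["mileage"]),
    ])),
    ("estimator", LinearRegression()),
])
sketch_pipeline.fit(X_train, y_train)

# A new car has no price yet: same columns as X_train
my_car = pd.DataFrame({"maker": ["ford"], "mileage": [50]})
predicted = sketch_pipeline.predict(my_car)
print(predicted.shape)  # (1,)
```

In this notebook that would mean fitting and predicting with something like `train.drop(columns="price")` and dropping the same column from `val` and `test` before calling `predict`, then saving the pipeline again.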

## Homework

 - We built one model for all manufacturers; would it be worth building a separate model for each one? Try it and see whether the results improve.
 - Try another model, perhaps [XGBoost](https://xgboost.readthedocs.io/en/stable/), to see whether it gives you better results
 - Do you know Flask, FastAPI, or Django? Put your model behind an API

## Further learning

 - Take a look at my video on [variable types](https://www.youtube.com/watch?v=SAWsQ3QmmJE)
 - Learn [when to scale and when to normalize](https://datascience.stackexchange.com/questions/45900/when-to-use-standard-scaler-and-when-normalizer) your data
 - Check the [sklearn documentation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) on the different scalers
 - Learn [when it is valid to remove outliers](https://statisticsbyjim.com/basics/remove-outliers/) (extreme values)
 
