<h1> Used car prices </h1>
<h3> Using Machine Learning algorithms to predict used car prices from the dataset.</h3>

In [1]:
import pandas as pd

In [2]:
cars = pd.read_csv('C:/Users/chech/PC Febrero 2023/1. Data Scients/Machine Learning Projects/Casos practicos ML/data/cars.csv')
cars.head()

Unnamed: 0,maker,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,cclass,C Class,2020,Automatic,1200,Diesel,,,2.0,30495
1,cclass,C Class,2020,Automatic,1000,Petrol,,,1.5,29989
2,cclass,C Class,2020,Automatic,500,Diesel,,,2.0,37899
3,cclass,C Class,2019,Automatic,5000,Diesel,,,2.0,30399
4,cclass,C Class,2019,Automatic,4500,Diesel,,,2.0,29899


<h1> Feature Engineering </h1>

In [3]:
Inicial = len(cars)
Inicial

108540

In [4]:
cars = cars.drop_duplicates(keep='first')

In [5]:
SinDuplicados = len(cars)
SinDuplicados

106267

In [8]:
Eliminados = Inicial - SinDuplicados
print('Removed', Eliminados, 'duplicate records')

Removed 2273 duplicate records


<h3> Splitting the dataset</h3>

In [9]:
from sklearn.model_selection import train_test_split
import numpy as np

In [10]:
rest, test = train_test_split(cars, test_size=0.2, shuffle=True)  # 20% of the total goes to test
train, val = train_test_split(rest, test_size=0.25, shuffle=True)  # 25% of the remaining 80% = 20% of the total

distributions = np.array([len(train), len(val), len(test)])

print(distributions)
print(distributions/ len(cars))

[63759 21254 21254]
[0.59998871 0.20000565 0.20000565]
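
<p>The 60/20/20 proportions follow from simple arithmetic: the second split takes 25% of the 80% left after the first one. The sketch below checks the shares and reproduces the printed counts, assuming that train_test_split rounds the held-out size up (an assumption that is consistent with the numbers shown above).</p>

```python
import math
from fractions import Fraction

# Shares of the whole dataset after the two consecutive splits.
test_share = Fraction(20, 100)               # first split: test_size=0.2
rest_share = 1 - test_share                  # 80% remains
val_share = Fraction(25, 100) * rest_share   # second split: 25% of the remainder
train_share = rest_share - val_share

print(float(train_share), float(val_share), float(test_share))  # 0.6 0.2 0.2

# Row counts, assuming the held-out part of each split is rounded up.
n_total = 106267                         # rows left after dropping duplicates
n_test = math.ceil(0.2 * n_total)
n_rest = n_total - n_test
n_val = math.ceil(0.25 * n_rest)
n_train = n_rest - n_val

print(n_train, n_val, n_test)
```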


In [13]:
# Encode the categorical variables

from sklearn.preprocessing import OneHotEncoder
maker_encoder = OneHotEncoder()

In [17]:
maker_encoder.fit(train[["maker"]])
mkr = maker_encoder.transform(train[["maker"]]).toarray()  # dense ndarray instead of np.matrix
mkr.shape

(63759, 11)

In [18]:
# categories_ is a list with one array per encoded column; take the array for "maker"
df = pd.DataFrame(mkr, columns=maker_encoder.categories_[0], index=train.index)
df["actual"] = train["maker"]
df.sample(5)

Unnamed: 0,audi,bmw,cclass,focus,ford,hyundi,merc,skoda,toyota,vauxhall,vw,actual
35542,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
41607,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
88031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,vw
55336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,vauxhall
29533,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,skoda


<h3> Feature scaling </h3>
<p> Feature scaling is a fundamental data preprocessing step in Machine Learning, and it is essential for algorithms that are sensitive to the scale of the features. It consists of normalizing or standardizing the features of a dataset, that is, transforming them so that they share a uniform scale, which makes them easier to compare and analyze.</p>
<li> 1. Normalization (Min-Max Scaling) </li>
<li> 2. Standardization (z-score scaling) </li>
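
<p>As a rough illustration of the two approaches listed above, plus the max-abs variant that the next cell applies with MaxAbsScaler, the formulas can be written in plain Python. The function names and mileage values are hypothetical, and the sketch assumes a non-constant column so that no division by zero occurs.</p>

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def max_abs_scale(values):
    """Max-abs scaling: divide every value by the largest absolute value."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]

mileage = [0, 25_000, 50_000, 100_000]  # hypothetical mileages

print(min_max_scale(mileage))  # [0.0, 0.25, 0.5, 1.0]
print(max_abs_scale(mileage))
print(standardize(mileage))
```

<p>After standardization the column has mean 0 and standard deviation 1, whereas min-max scaling bounds it to the interval [0, 1].</p>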

In [19]:
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, MaxAbsScaler
scaler = MaxAbsScaler()

In [20]:
scaler.fit(train[["mileage"]])
scaled = scaler.transform(train[["mileage"]])

In [21]:
values = pd.DataFrame({"mileage": train["mileage"].values, "scaled": scaled.squeeze() })
values.sample(5)

Unnamed: 0,mileage,scaled
48860,31339,0.097025
53925,47840,0.148111
41657,2950,0.009133
41765,40782,0.12626
60679,63507,0.196616


<h3> Artifacts </h3>
<p> Objects such as the OneHotEncoder, CountVectorizer and MaxAbsScaler belong to the set of tools that, once fitted with fit, must be preserved so they can be reused in production. These tools are known as artifacts.</p>

In [22]:
import pickle

with open("scaler.pickle", "wb") as wb:
    pickle.dump(scaler, wb)
    
with open("maker_encoder.pickle", "wb") as wb:
    pickle.dump(maker_encoder, wb) 
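
<p>A saved artifact is only useful if it can be reloaded and applied to new data without fitting again. The following self-contained sketch shows that round trip with a scaler fitted on hypothetical mileages and written to a temporary file. The values and the temporary path are assumptions made for illustration only.</p>

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Fit a scaler on hypothetical training mileages and save it as an artifact.
train_mileage = np.array([[10_000.0], [20_000.0], [40_000.0]])
fitted_scaler = MaxAbsScaler().fit(train_mileage)

artifact_path = os.path.join(tempfile.mkdtemp(), "scaler.pickle")
with open(artifact_path, "wb") as f:
    pickle.dump(fitted_scaler, f)

# Later (for example in a serving process): load the artifact and call only
# transform, never fit, so new data is scaled with the training statistics.
with open(artifact_path, "rb") as f:
    loaded_scaler = pickle.load(f)

new_mileage = np.array([[20_000.0]])
scaled_new = loaded_scaler.transform(new_mileage)
print(scaled_new)  # 20000 / 40000, the training maximum
```

<p>Pickled scikit-learn objects are best loaded with the same library version that created them, and pickle files should only be loaded from trusted sources.</p>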

<h1> Pipelines </h1>
<p>In Machine Learning, a pipeline simplifies a series of sequential data processing and modeling steps by encapsulating them in a single object that can be used like any other machine learning model.</p>

<p>A Machine Learning pipeline is especially useful when several data transformations must be applied before training a model. These steps can include:</p>

<li> Data preprocessing </li>
<li> Modeling </li>
<li> Post-processing </li>
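
<p>Before building the full pipeline for the car data, a minimal sketch on hypothetical toy data may help to show the idea: a preprocessing step and an estimator are chained, and the resulting object exposes fit and predict like a single model. The data and step names below are invented for illustration.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data following y = 2x + 1.
X = np.array([[0.0], [5.0], [10.0]])
y = np.array([1.0, 11.0, 21.0])

toy_pipeline = Pipeline([
    ("scale", MinMaxScaler()),      # preprocessing step
    ("model", LinearRegression()),  # modeling step
])

# fit runs fit_transform on the scaler, then fits the regression on the scaled data.
toy_pipeline.fit(X, y)

# predict applies the already fitted scaler before calling the estimator.
prediction = toy_pipeline.predict(np.array([[2.0]]))
print(prediction)
```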

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn import set_config

In [35]:
# One-hot encode maker, transmission, fuelType
one_hot_encode = ColumnTransformer([
    (
        'maker-transmission-fuelType',  # Name of the transformation.
        OneHotEncoder(sparse=False),  # The transformer (on scikit-learn >= 1.2 this argument is named sparse_output)
        ["maker", "transmission", "fuelType"]  # Columns the transformation is applied to.
    )
])

In [25]:
one_hot_encode.fit(train)



In [28]:
variable = one_hot_encode.transform(train)
variable.shape

(63759, 20)

In [36]:
# Robust-scale mileage
robust_encoding = ColumnTransformer([
    ('mileage', RobustScaler(), ["mileage"])
])

In [37]:
# Impute mpg and tax with the mean, then min-max scale them
impute_and_scale = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])

standard_scaling = ColumnTransformer([
    ('mpg-tax', impute_and_scale, ["mpg", "tax"])
])

In [38]:
# Pass year and engineSize through unchanged
passthrough = ColumnTransformer([('pass', 'passthrough', ['year', "engineSize"])])

In [40]:
# Assemble the full pipeline
pipe = Pipeline([
    ('features',
        FeatureUnion([
            ('one_hot_encode', one_hot_encode),
            ('robust_encoding', robust_encoding),
            ('just_pass', passthrough),
            ('scale_and_impute', standard_scaling)
        ])
     )
])

In [41]:
from sklearn import set_config

set_config(display="diagram")
pipe

In [43]:
type(train)

pandas.core.frame.DataFrame

In [44]:
pipe.fit(train)

pd.DataFrame(pipe.transform(train))



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.421029,2019.0,1.0,0.116674,0.207286
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.335286,2018.0,1.4,0.099469,0.250000
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,-0.333233,2018.0,1.4,0.101169,0.250000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.086831,2018.0,1.6,0.096281,0.250000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,2.228126,2014.0,2.0,0.142402,0.034483
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63754,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.464143,2019.0,1.0,0.097768,0.250000
63755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.296641,2019.0,1.5,0.099469,0.250000
63756,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.114486,2017.0,1.2,0.110521,0.250000
63757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,-0.063321,2019.0,1.4,0.082678,0.250000


In [45]:
pd.DataFrame(pipe.transform(test))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,2.242095,2015.0,2.0,0.129862,0.215517
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.322766,2018.0,2.0,0.093092,0.258621
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,2.192702,2005.0,1.2,0.102869,0.250000
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,3.264698,2008.0,3.0,0.082678,0.413793
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,-0.468853,2019.0,2.2,0.082678,0.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21249,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.343176,2017.0,2.0,0.145802,0.034483
21250,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,-0.135539,2017.0,1.0,0.135813,0.250000
21251,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,-0.390395,2018.0,2.0,0.124548,0.250000
21252,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,-0.559346,2019.0,2.1,0.087566,0.250000


In [46]:
train_x = pipe.transform(train)
type(train_x)

numpy.ndarray

<h2> Modeling </h2>

In [47]:
from sklearn.linear_model import LinearRegression

In [48]:
lr = LinearRegression()

In [49]:
predicting_pipeline = Pipeline([
    ('feature', pipe),
    ('estimator', lr)
])

In [50]:
predicting_pipeline.fit(train, train['price'])



In [51]:
train_pred = predicting_pipeline.predict(train)
val_pred = predicting_pipeline.predict(val)

In [52]:
pd.DataFrame({'real':val['price'], 'predicted':val_pred})

Unnamed: 0,real,predicted
70918,52999,37703.25
24958,8500,10342.75
4431,11998,13112.75
43993,9799,13065.00
45210,10290,11960.25
...,...,...
17893,11000,13809.25
2518,27499,23382.25
83381,8882,8890.25
18709,18995,17415.75


<h2> Model evaluation </h2>

In [53]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [54]:
train_mae = mean_absolute_error(train['price'], train_pred)
val_mae = mean_absolute_error(val['price'], val_pred)

print(f"Training MAE:   {train_mae:.2f}\n"
      f"Validation MAE: {val_mae:.2f}")

Training MAE:   2957.45
Validation MAE: 2958.33
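
<p>The cell above calls mean_absolute_error, so its figures are average absolute errors expressed in price units. To make the difference from the mean squared error concrete, the sketch below computes both by hand on the first five rows shown in the validation table. It is only an illustration on a handful of rows, not a re-evaluation of the model.</p>

```python
import math

# (real, predicted) pairs copied from the first five rows of the validation table.
pairs = [
    (52999, 37703.25),
    (8500, 10342.75),
    (11998, 13112.75),
    (9799, 13065.00),
    (10290, 11960.25),
]

errors = [real - predicted for real, predicted in pairs]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error, in price units
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error, in squared price units
rmse = math.sqrt(mse)                            # square root of MSE, back in price units

print(f"MAE:  {mae:.2f}")  # MAE:  4637.90
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

<p>Because the errors are squared, the single large miss on the first row dominates the MSE far more than it does the MAE.</p>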


<h3> Evaluation on the test data </h3>

In [55]:
test_pred = predicting_pipeline.predict(test)
test_mae = mean_absolute_error(test['price'], test_pred)

print(f"Test MAE: {test_mae:.2f}")

Test MAE: 2965.58


In [56]:
from joblib import dump, load
dump(predicting_pipeline, 'car-prices.model') 

['car-prices.model']

<h3> Manual prediction for a specific car.</h3>

In [58]:
saved_pipeline = load('car-prices.model')

In [59]:
maker = "ford"
model = "focus"
year = 2020
transmission = "Manual"
mileage = 50
fuelType = "Petrol"
tax = 100
mpg = 30
engineSize = 1.5

mi_automóvil = pd.DataFrame({
    "maker": [maker], "model": [model], "year": [year], "transmission": [transmission], 
    "mileage": [mileage], "fuelType": [fuelType], "tax": [tax], "mpg": [mpg], "engineSize": [engineSize],
})

price = saved_pipeline.predict(mi_automóvil).squeeze()

print(price)

22050.75
