## Model construction

Building on the feature engineering work, we created a set of classes for defining pipelines, so that the preprocessing steps that run before model training can be reproduced and modified easily:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.experimental import enable_iterative_imputer

import sys
sys.path.append('src')
from models.pipeline import CarsPipeline

In [2]:
# Load and split the data
data = pd.read_csv('../datasets/Car details v3.csv')

data["selling_price_log"] = np.log(data["selling_price"])

X = data.drop(columns=['selling_price', 'selling_price_log'])
y = data['selling_price']
y_log = data['selling_price_log']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X, y_log, test_size=0.3, random_state=42)

In [3]:
# Fit and transform the data
final_pipeline = CarsPipeline()

X_train_processed = final_pipeline.fit_transform(X_train)
X_test_processed = final_pipeline.transform(X_test)



In [4]:
final_pipeline_log = CarsPipeline()

X_train_processed_log = final_pipeline_log.fit_transform(X_train_log)
X_test_processed_log = final_pipeline_log.transform(X_test_log)



The plan is to use Ridge as a first simple model, tuning the `alpha` hyperparameter with grid search. The cells below import the required tools and fit a decision tree regressor:

In [5]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
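
The grid search over `alpha` mentioned in the text could look roughly like the following. This is a minimal sketch on synthetic data: the arrays `X_demo` and `y_demo`, the candidate `alpha` values, and the choice of MAE as the scoring metric are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): a noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Illustrative candidate values, not tuned for the cars dataset.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    Ridge(),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the sign
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(f"Best alpha: {best_alpha}, cross-validated MAE: {best_mae:.3f}")
```

With the real data, `X_demo` and `y_demo` would presumably be replaced by `X_train_processed` and `y_train`, so that the test split stays untouched during tuning.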

In [6]:
# Create the tree
regression = DecisionTreeRegressor(criterion='squared_error', splitter='best',
                                   max_depth=None, min_samples_split=2, min_samples_leaf=1,
                                   random_state=42)
# And train it
regression.fit(X_train_processed, y_train)

In [7]:
regression.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 42,
 'splitter': 'best'}

In [8]:
from sklearn.tree import export_graphviz
export_graphviz(regression, out_file = "arbol_regression.dot",
                feature_names=final_pipeline.final_columns(),
                rounded=True,
                filled=True)

In [9]:
from sklearn.metrics import mean_absolute_error

y_pred_train = regression.predict(X_train_processed)
y_pred = regression.predict(X_test_processed)

mae_train = mean_absolute_error(y_train, y_pred_train)
mae = mean_absolute_error(y_test, y_pred)

print(f"Training error: {mae_train}")
print(f"Test error: {mae}")

Training error: 3132.088631506009
Test error: 82609.24565633338
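
The gap between the two errors above (roughly 3.1k on training versus 82.6k on test) suggests that the unpruned tree overfits. One possible way to check this without touching the test set is to compare cross-validated MAE for a few `max_depth` values with `cross_val_score`, which is already imported. The sketch below uses synthetic data, and the depth values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a nonlinear target with noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

cv_mae = {}
for depth in (None, 3, 5, 8):  # None grows the tree until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    scores = cross_val_score(
        tree, X_demo, y_demo, scoring="neg_mean_absolute_error", cv=5
    )
    cv_mae[depth] = -scores.mean()  # convert back to a positive MAE
    print(f"max_depth={depth}: CV MAE = {cv_mae[depth]:.3f}")
```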


In [10]:
# Create the tree
regression_log = DecisionTreeRegressor(criterion='squared_error', splitter='best',
                                       max_depth=None, min_samples_split=2, min_samples_leaf=1,
                                       random_state=42)
# And train it on the log-transformed target
regression_log.fit(X_train_processed_log, y_train_log)

In [11]:
from sklearn.tree import export_graphviz
export_graphviz(regression_log, out_file = "arbol_regression_log.dot",
                feature_names=final_pipeline_log.final_columns(),
                rounded=True,
                filled=True)

In [12]:
from sklearn.metrics import mean_absolute_error

y_pred_train_log = regression_log.predict(X_train_processed_log)
y_pred_log = regression_log.predict(X_test_processed_log)

y_train_inv = np.exp(y_train_log)
y_pred_train_inv = np.exp(y_pred_train_log)

y_test_inv = np.exp(y_test_log)
y_pred_inv = np.exp(y_pred_log)

mae_train = mean_absolute_error(y_train_inv, y_pred_train_inv)
mae = mean_absolute_error(y_test_inv, y_pred_inv)

print(f"Training error: {mae_train}")
print(f"Test error: {mae}")

Training error: 3135.7068671169136
Test error: 81767.63396613808
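
The manual `np.log` / `np.exp` round trip above can also be delegated to scikit-learn's `TransformedTargetRegressor`, which applies the transform before fitting and inverts it when predicting. This is an alternative to what the notebook does, sketched here on synthetic data; `X_demo`, `y_demo` and the other variable names are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a strictly positive, price-like target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
log_price = 10 + 0.8 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)
y_demo = np.exp(log_price)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The wrapper fits the tree on log(y) and returns exp(prediction).
log_tree = TransformedTargetRegressor(
    regressor=DecisionTreeRegressor(random_state=42),
    func=np.log,
    inverse_func=np.exp,
)
log_tree.fit(X_tr, y_tr)

# Predictions come back in the original scale, so MAE is directly comparable.
y_pred_demo = log_tree.predict(X_te)
mae_demo = mean_absolute_error(y_te, y_pred_demo)
print(f"Test MAE in the original scale: {mae_demo:.2f}")
```

With this wrapper there is no need for a second target column or a second train/test split, since the model receives the untransformed target.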


- Ridge
- Regression tree
- SVR
- Boosting (there are 2)
- Random Forest