<a href="https://colab.research.google.com/github/sergiomora03/AdvancedTopicsAnalytics/blob/main/exercises/E1-UsedVehiclePricePredictionDeployment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# E1 - Model Deployment in Used Vehicle Price Prediction

## Introduction

- 1.2 million listings scraped from TrueCar.com (Price, Mileage, Make, Model); dataset from Kaggle: [data](https://www.kaggle.com/jpayne/852k-used-car-listings)
- Each observation represents the price of a used car

In [2]:
!pip install -q optuna
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import optuna
from optuna.samplers import TPESampler

import warnings
warnings.filterwarnings("ignore")

In [3]:
data = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTrain_carListings.zip')

In [4]:
data.head()

Unnamed: 0,Price,Year,Mileage,State,Make,Model
0,21490,2014,31909,MD,Nissan,MuranoAWD
1,21250,2016,25741,KY,Chevrolet,CamaroCoupe
2,20925,2016,24633,SC,Hyundai,Santa
3,14500,2012,84026,OK,Jeep,Grand
4,32488,2013,22816,TN,Jeep,Wrangler


In [5]:
data.shape

(500000, 6)

In [6]:
data.columns

Index(['Price', 'Year', 'Mileage', 'State', 'Make', 'Model'], dtype='object')

# Exercise P0.1 (50%)

Develop a machine learning model that predicts the price of a car using ['Year', 'Mileage', 'State', 'Make', 'Model'] as inputs.

#### Evaluation:
- 25% - Performance of the models using a manually implemented K-Fold (K=10) cross-validation
- 25% - Notebook explaining the process for selecting the best model. You must specify how each parameter is calibrated and how these changes affect the performance of the model. A clear comparison of all implemented models is expected. Present the most relevant conclusions about the whole process.


In [7]:
max_year = np.array([2018])
interes = 8

# Configuration of the categorical features, numeric features, and target variable.
target_name = 'Price'
numerique_features = ['Year', 'Mileage']
categorical_features = ["State", "Make"]  # Adjust to your data

k = 10
random_state = 42

In [8]:
# Replace the manufacturing year with vehicle age so the relationship is easier for the model to learn
data['antiguedad'] = max_year - data['Year']  # Vehicle age in years

# Frequency-encode 'Model': relative frequency of each unique value in the column
freq_encoding = data["Model"].value_counts(normalize=True)
data["Model"] = data["Model"].map(freq_encoding)


# Create dummy (indicator) variables from the 'State' and 'Make' columns
state_dummies = pd.get_dummies(data['State'], prefix='is_state')
data = pd.concat([data.drop('State', axis=1), state_dummies], axis=1)

make_dummies = pd.get_dummies(data['Make'], prefix='is_make')
data = pd.concat([data.drop('Make', axis=1), make_dummies], axis=1)


# Keep vehicles from 2010 or earlier in data_train; newer vehicles go to data_test
# .copy() avoids modifying a view of `data` when 'Year' is dropped below
data_train = data[data['Year'] <= 2010].copy()
data_test = data[data['Year'] > 2010].copy()

# Drop the 'Year' column, now represented by 'antiguedad'
data_train.drop(['Year'], axis=1, inplace=True)
data_test.drop(['Year'], axis=1, inplace=True)

data_train.head()

Unnamed: 0,Price,Mileage,Model,antiguedad,is_state_ AK,is_state_ AL,is_state_ AR,is_state_ AZ,is_state_ CA,is_state_ CO,...,is_make_Pontiac,is_make_Porsche,is_make_Ram,is_make_Scion,is_make_Subaru,is_make_Suzuki,is_make_Tesla,is_make_Toyota,is_make_Volkswagen,is_make_Volvo
6,18995,69431,0.016758,8,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
33,7977,132160,0.006628,14,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
35,6795,87050,0.000858,8,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
45,5894,176083,0.007882,10,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
48,6097,138337,0.000416,14,False,False,False,False,False,True,...,False,False,False,False,True,False,False,False,False,False
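In the cell above, the frequency encoding of `Model` is computed on the full dataset before any split, so held-out rows contribute to the encoding. A minimal sketch of an alternative, assuming the split is made first, is to learn the frequencies from the training rows only and reuse that mapping elsewhere. The toy frame, the `Model_freq` column name, and the fallback of 0.0 for unseen models are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are made up for illustration)
toy = pd.DataFrame({
    "Model": ["Civic", "Civic", "Camry", "Civic", "Wrangler", "Camry"],
    "Price": [9000, 9500, 11000, 8700, 21000, 10500],
})

# Hypothetical split: first four rows for training, last two held out
toy_train, toy_test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn relative frequencies from the training rows only
train_freq = toy_train["Model"].value_counts(normalize=True)

# Apply the same mapping to both parts; unseen categories fall back to 0.0
toy_train["Model_freq"] = toy_train["Model"].map(train_freq)
toy_test["Model_freq"] = toy_test["Model"].map(train_freq).fillna(0.0)

print(toy_test[["Model", "Model_freq"]])
```

The same `train_freq` mapping would also be what a deployed model needs at prediction time, since new listings have to be encoded exactly as the training rows were.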


In [9]:
# Make a copy of 'data_train' as 'data_train2' and drop its 'Price' column.
# The remaining columns are the features used to train the model.
data_train2 = data_train.copy()
data_train2.drop(columns = ['Price'], inplace = True)
features = data_train2.columns.tolist()

In [10]:
# Split 'data_train' into training and test sets with scikit-learn's train_test_split.
X_train, X_test, y_train, y_test = train_test_split(data_train[features],
                                                    data_train[target_name],
                                                    random_state = random_state, test_size = 0.25)

## Model

A comparative study of tree-based regression models (random forests) will be carried out. First, a model is trained with a standard hyperparameter configuration; it is then refined through cross-validation and tuned to reduce the error metrics. This approach makes it possible to assess the impact of hyperparameter optimization on model performance.

In [11]:
# Fit the regression model
model_rf = RandomForestRegressor(n_estimators=10, n_jobs = -1, random_state=42)
model_rf.fit(X_train, y_train)

In [12]:
# Generate predictions with the model
y_pred = model_rf.predict(X_test)

In [13]:
# Evaluate the regression model with three error metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'MAPE: {mape:.2f}%')

MSE: 10784333.17839674
RMSE: 3283.9508489617715
MAPE: 17.30%
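The same three metrics are recomputed in several cells of this notebook. As a small sketch, they could be wrapped in one helper; the function name `regression_metrics` and the demo values are hypothetical, and the MAPE formula mirrors the one used above, which assumes strictly positive prices.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAPE (in percent) for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    rmse = float(np.sqrt(mse))
    # MAPE is undefined when a true value is 0; prices are assumed strictly positive here
    mape = float(np.mean(np.abs(errors / y_true)) * 100)
    return {"mse": mse, "rmse": rmse, "mape": mape}

# Made-up prices and predictions, only to exercise the helper
metrics_demo = regression_metrics([10000, 20000], [11000, 18000])
print(metrics_demo)
```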


In [14]:
X = data_train[features]
y = data_train[target_name]

kf = KFold(n_splits=k, shuffle=True, random_state=random_state)

In [15]:
mse_scores = []
rmse_scores = []
mape_scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Create and train the model
    model_rf = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=random_state)
    model_rf.fit(X_train, y_train)

    # Generate predictions
    y_pred = model_rf.predict(X_test)

    # Compute the metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

    mse_scores.append(mse)
    rmse_scores.append(rmse)
    mape_scores.append(mape)

# Print the average results
print(f'Average MSE: {np.mean(mse_scores)}')
print(f'Average RMSE: {np.mean(rmse_scores)}')
print(f'Average MAPE: {np.mean(mape_scores):.2f}%')

Average MSE: 10226896.908698281
Average RMSE: 3197.26986303443
Average MAPE: 16.49%
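The loop above delegates the fold construction to scikit-learn's `KFold`. Since the exercise asks for a manually implemented K-Fold, the following is one possible sketch of building the index pairs by hand with NumPy. The function name `manual_kfold_indices` is hypothetical, and the fold assignment will not match `KFold` for the same seed because the shuffling is done differently.

```python
import numpy as np

def manual_kfold_indices(n_samples, n_splits=10, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample is used as test exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split tolerates n_samples not being divisible by n_splits
    folds = np.array_split(shuffled, n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Small demo with 23 samples so the uneven fold sizes are visible
splits = list(manual_kfold_indices(n_samples=23, n_splits=10))
print(len(splits), [len(test) for _, test in splits])
```

Each pair can be used exactly like the `train_index, test_index` pair in the loop above, for example with `X.iloc[train_idx]` and `X.iloc[test_idx]`.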


In [71]:
def objective(trial):
    """
    Function to optimize hyperparameter searching with Optuna for a Random Forest Regressor.

    Args:
        trial (optuna.Trial): The Optuna trial object for hyperparameter search.

    Returns:
        float:  the mean squared error (MSE) of the trained model.
    """

    # Hyperparameter suggestions
    n_estimators = trial.suggest_int("n_estimators", 10, 200, step=10)
    # max_depth = trial.suggest_int("max_depth", 2, 10)
    # min_samples_split = trial.suggest_float("min_samples_split", 0.1, 1.0)
    # min_samples_leaf = trial.suggest_float("min_samples_leaf", 0.01, 0.5)
    # bootstrap = trial.suggest_categorical("bootstrap", ["True", "False"])
    # criterion = trial.suggest_categorical("criterion", ["mse", "mae"])

    # Create RandomForestRegressor with hyperparameters
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        # max_depth=max_depth,
        # min_samples_split=min_samples_split,
        # min_samples_leaf=min_samples_leaf,
        # bootstrap=bootstrap,
        # criterion=criterion
    )

    # Train the model on the current global training split (X_train, y_train)
    model.fit(X_train, y_train)

    # Make predictions on the held-out split (X_test), which serves as the validation set
    y_pred = model.predict(X_test)

    # Calculate the mean squared error (MSE) as the objective value
    mse = mean_squared_error(y_test, y_pred)

    return mse

In [73]:
study_name = "model_RandomForestRegressor"
storage_name = "sqlite:///{}.db".format(study_name)

study_rfr = optuna.create_study(study_name=study_name,
                                direction="minimize",
                                storage=storage_name,
                                pruner=optuna.pruners.HyperbandPruner(max_resource="auto"),
                                sampler=TPESampler())
study_rfr.optimize(objective, n_trials=10)

[I 2024-09-05 23:27:17,210] A new study created in RDB with name: model_RandomForestRegressor
[I 2024-09-05 23:27:29,435] Trial 0 finished with value: 11255667.395912578 and parameters: {'n_estimators': 10}. Best is trial 0 with value: 11255667.395912578.
[I 2024-09-05 23:30:28,649] Trial 1 finished with value: 10505320.790051203 and parameters: {'n_estimators': 150}. Best is trial 1 with value: 10505320.790051203.
[I 2024-09-05 23:32:23,743] Trial 2 finished with value: 10621926.481130186 and parameters: {'n_estimators': 100}. Best is trial 1 with value: 10505320.790051203.
[I 2024-09-05 23:35:37,128] Trial 3 finished with value: 10559236.316975875 and parameters: {'n_estimators': 180}. Best is trial 1 with value: 10505320.790051203.
[I 2024-09-05 23:36:08,649] Trial 4 finished with value: 10749607.055813737 and parameters: {'n_estimators': 30}. Best is trial 1 with value: 10505320.790051203.
[I 2024-09-05 23:37:03,171] Trial 5 finished with value: 10645784.301186003 and parameters: {

In [74]:
df = study_rfr.trials_dataframe(attrs=("number", "value", "params", "state"))
df = df.sort_values(by=['value'], ascending=True)
df.to_csv(f'{study_name}.csv', encoding = 'utf-8-sig', index = False)
df.head(10)

Unnamed: 0,number,value,params_n_estimators,state
1,1,10505320.0,150,COMPLETE
8,8,10518070.0,200,COMPLETE
6,6,10549630.0,100,COMPLETE
3,3,10559240.0,180,COMPLETE
2,2,10621930.0,100,COMPLETE
5,5,10645780.0,50,COMPLETE
7,7,10712040.0,40,COMPLETE
4,4,10749610.0,30,COMPLETE
9,9,10835830.0,20,COMPLETE
0,0,11255670.0,10,COMPLETE


In [75]:
rfr = study_rfr.best_params
rfr

{'n_estimators': 150}

In [76]:
# Fit the regression model with the best parameters found by Optuna
model_rf = RandomForestRegressor(n_estimators=rfr['n_estimators'],
                                #  max_depth=rfr['max_depth'],
                                #  min_samples_split=rfr['min_samples_split'],
                                #  min_samples_leaf=rfr['min_samples_leaf'],
                                 n_jobs = -1,
                                 random_state=random_state)
model_rf.fit(X_train, y_train)

In [77]:
# Fit the regression model
model_rf = RandomForestRegressor(n_estimators=rfr['n_estimators'],
                                 n_jobs = -1,
                                 random_state=random_state)
model_rf.fit(X_train, y_train)

In [78]:
# Generate predictions with the model
y_pred = model_rf.predict(X_test)

In [82]:
# Evaluate the regression model with three error metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'MAPE: {mape:.2f}%')

MSE: 10531777.969373938
RMSE: 3245.270091898968
MAPE: 16.26%


In [85]:
def objective(trial):
    """
    Function to optimize hyperparameter searching with Optuna for a Random Forest Regressor.

    Args:
        trial (optuna.Trial): The Optuna trial object for hyperparameter search.

    Returns:
        float:  the mean squared error (MSE) of the trained model.
    """

    # Hyperparameter suggestions
    n_estimators = trial.suggest_int("n_estimators", 10, 200, step=10)
    max_depth = trial.suggest_int("max_depth", 2, 10)
    min_samples_split = trial.suggest_float("min_samples_split", 0.1, 1.0)
    min_samples_leaf = trial.suggest_float("min_samples_leaf", 0.01, 0.5)

    # Create RandomForestRegressor with hyperparameters
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
    )

    # Train the model on the current global training split (X_train, y_train)
    model.fit(X_train, y_train)

    # Make predictions on the held-out split (X_test), which serves as the validation set
    y_pred = model.predict(X_test)

    # Calculate the mean squared error (MSE) as the objective value
    mse = mean_squared_error(y_test, y_pred)

    return mse

In [86]:
study_name = "model_RandomForestRegressor2"
storage_name = "sqlite:///{}.db".format(study_name)

study_rfr = optuna.create_study(study_name=study_name,
                                direction="minimize",
                                storage=storage_name,
                                pruner=optuna.pruners.HyperbandPruner(max_resource="auto"),
                                sampler=TPESampler())
study_rfr.optimize(objective, n_trials=10)

[I 2024-09-06 01:09:00,925] A new study created in RDB with name: model_RandomForestRegressor2
[I 2024-09-06 01:09:01,727] Trial 0 finished with value: 39909466.37320518 and parameters: {'n_estimators': 140, 'max_depth': 5, 'min_samples_split': 0.8944540473110245, 'min_samples_leaf': 0.18661622068066783}. Best is trial 0 with value: 39909466.37320518.
[I 2024-09-06 01:09:02,251] Trial 1 finished with value: 39909202.08399103 and parameters: {'n_estimators': 90, 'max_depth': 7, 'min_samples_split': 0.2148838355622768, 'min_samples_leaf': 0.3624856830894156}. Best is trial 1 with value: 39909202.08399103.
[I 2024-09-06 01:09:03,038] Trial 2 finished with value: 39909136.96544888 and parameters: {'n_estimators': 130, 'max_depth': 8, 'min_samples_split': 0.8926434154169709, 'min_samples_leaf': 0.33142759763036594}. Best is trial 2 with value: 39909136.96544888.
[I 2024-09-06 01:09:04,308] Trial 3 finished with value: 39909377.73400876 and parameters: {'n_estimators': 190, 'max_depth': 6, '

In [87]:
df = study_rfr.trials_dataframe(attrs=("number", "value", "params", "state"))
df = df.sort_values(by=['value'], ascending=True)
df.to_csv(f'{study_name}.csv', encoding = 'utf-8-sig', index = False)
df.head(10)

Unnamed: 0,number,value,params_max_depth,params_min_samples_leaf,params_min_samples_split,params_n_estimators,state
5,5,35376620.0,2,0.102161,0.438054,90,COMPLETE
4,4,38405600.0,7,0.316082,0.400491,30,COMPLETE
9,9,39909100.0,6,0.353794,0.874866,110,COMPLETE
2,2,39909140.0,8,0.331428,0.892643,130,COMPLETE
7,7,39909170.0,4,0.075286,0.873017,180,COMPLETE
1,1,39909200.0,7,0.362486,0.214884,90,COMPLETE
6,6,39909260.0,10,0.26743,0.769021,120,COMPLETE
3,3,39909380.0,6,0.159567,0.937607,190,COMPLETE
8,8,39909420.0,2,0.35587,0.918425,180,COMPLETE
0,0,39909470.0,5,0.186616,0.894454,140,COMPLETE


In [88]:
rfr = study_rfr.best_params
rfr

{'n_estimators': 90,
 'max_depth': 2,
 'min_samples_split': 0.4380536995774027,
 'min_samples_leaf': 0.10216145003988288}
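When `min_samples_split` and `min_samples_leaf` are given as floats, scikit-learn interprets them as fractions of the training set and converts them with `ceil(fraction * n_samples)`. The sketch below shows what the best parameters above would imply for a hypothetical training set of 50,000 rows; the row count and the helper name `effective_min_samples` are assumptions made only for illustration.

```python
import math

def effective_min_samples(fraction, n_samples):
    """Convert a fractional min_samples_* value into a sample count, as scikit-learn does."""
    return math.ceil(fraction * n_samples)

n_train = 50_000  # hypothetical training-set size, not taken from the notebook
best = {
    "min_samples_split": 0.4380536995774027,
    "min_samples_leaf": 0.10216145003988288,
}

split_needed = effective_min_samples(best["min_samples_split"], n_train)
leaf_needed = effective_min_samples(best["min_samples_leaf"], n_train)
# Each leaf must hold at least leaf_needed rows, which caps the number of leaves per tree
max_leaves = n_train // leaf_needed

print(split_needed, leaf_needed, max_leaves)
```

With every leaf required to hold roughly 10% of the data, each tree can produce only a handful of distinct predictions, which may explain why this study scores far worse than the one that tunes `n_estimators` alone.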

In [89]:
# Fit the regression model with the best parameters found by Optuna
model_rf = RandomForestRegressor(n_estimators=rfr['n_estimators'],
                                 max_depth=rfr['max_depth'],
                                 min_samples_split=rfr['min_samples_split'],
                                 min_samples_leaf=rfr['min_samples_leaf'],
                                 n_jobs = -1,
                                 random_state=random_state)
model_rf.fit(X_train, y_train)

In [90]:
# Generate predictions with the model
y_pred = model_rf.predict(X_test)

In [91]:
# Evaluate the regression model with three error metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'MAPE: {mape:.2f}%')

MSE: 35374285.673302166
RMSE: 5947.628575600713
MAPE: 39.30%


The main goal of this study was to select the best machine learning regression model for car prices. To that end, a rigorous methodology was followed, which included:

- Data preparation: The data were cleaned, duplicates were removed, new variables such as vehicle age were created, and the data were split into training and test sets.

- Model selection: Regression trees (random forests) were used in different configurations, iterating mainly over the number of trees without neglecting the other hyperparameters.

- Hyperparameter optimization: The Optuna library was used with two configurations: the first tuned only the number of trees, and the second tuned the number of trees together with the other hyperparameters.

- Model evaluation: The MSE, RMSE, and MAPE metrics were used.

- Selection of the best model: The model with the lowest values across the three metrics was chosen, namely the one that tunes only the number of trees and leaves the other hyperparameters at their defaults. This can be seen in the results stored in model_RandomForestRegressor.db, model_RandomForestRegressor.csv, model_RandomForestRegressor2.db, and model_RandomForestRegressor2.csv.

**Methodology**

- Cross-validation:
k-fold cross-validation was used to evaluate model performance robustly and to guard against overfitting. Ten folds were used, which gave a more reliable estimate of the generalization error. Compared with the single-split baseline, the average MAPE improved by about 0.8 percentage points (17.30% to 16.49%); note that the k-fold run also used 40 trees instead of 10, so part of the gain may come from the larger forest.

- Hyperparameter optimization with Optuna: The Optuna library was used to search for the best hyperparameter values. The optimization first focused on n_estimators alone, which improved MAPE by about 1 percentage point over the baseline (17.30% to 16.26%). Later experiments that also tuned max_depth, min_samples_split, and min_samples_leaf did not improve the results; the best trial of that study was considerably worse (MAPE 39.30%). This suggests that, within the search ranges explored, n_estimators has the greater impact on model performance in this particular case.

**Results**

The results of all experiments are stored in the .db and .csv files. The table below compares the evaluated models:

| Metric     | Baseline   | k-fold     | Optuna     |
|------------|------------|------------|------------|
| MSE        | 10784333.2 | 10226896.9 | 10531778.0 |
| RMSE       | 3283.95    | 3197.27    | 3245.27    |
| MAPE       | 17.30%     | 16.49%     | 16.26%     |
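To make the size of these differences explicit, the following sketch recomputes them from the rounded RMSE and MAPE values printed earlier in the notebook. It reports the MAPE change in percentage points and the RMSE change as a relative percentage; the helper name `change_vs_baseline` is hypothetical.

```python
# Rounded values taken from the outputs printed earlier in the notebook
baseline = {"rmse": 3283.95, "mape": 17.30}
runs = {
    "k-fold": {"rmse": 3197.27, "mape": 16.49},
    "Optuna": {"rmse": 3245.27, "mape": 16.26},
}

def change_vs_baseline(run):
    """Return (MAPE drop in percentage points, relative RMSE drop in percent)."""
    mape_points = baseline["mape"] - run["mape"]
    rmse_relative = (baseline["rmse"] - run["rmse"]) / baseline["rmse"] * 100
    return round(mape_points, 2), round(rmse_relative, 2)

changes = {name: change_vs_baseline(run) for name, run in runs.items()}
for name, (mape_points, rmse_relative) in changes.items():
    print(f"{name}: MAPE -{mape_points} points, RMSE -{rmse_relative}%")
```

Keeping percentage points and relative percentages apart avoids ambiguity when a metric such as MAPE is itself expressed in percent.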

**Conclusions**

Based on the results obtained, the model tuned with Optuna (n_estimators = 150) is considered the most suitable for predicting prices. It achieved the lowest MAPE (16.26%), although the k-fold average reported a lower MSE and RMSE, and it showed good generalization ability.

Hyperparameter optimization proved important for improving model performance. In particular, the n_estimators hyperparameter had a noticeable impact on the Random Forest model.


# Exercise P0.2 (50%)

Create an API of the model.

Example:
![](https://github.com/sergiomora03/AdvancedTopicsAnalytics/blob/main/notebooks/img/img015.PNG?raw=true)

#### Evaluation:
- 40% - API hosted on a cloud service
- 10% - Show screenshots of the model doing the predictions on the local machine
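Whatever framework serves the model, a prediction endpoint has to rebuild the same feature vector the model was trained on: mileage, the frequency-encoded `Model`, the vehicle age (`antiguedad`), and the `is_state_*` and `is_make_*` dummy columns. The following is a rough sketch of that step in plain Python. The function name `build_feature_row`, the demo feature list, the made-up frequency, and the fallback of 0.0 for unseen models are assumptions for illustration and do not come from the API in the repository.

```python
def build_feature_row(payload, feature_names, model_freq, max_year=2018):
    """Map a raw listing (dict) onto the ordered feature vector used in training."""
    row = dict.fromkeys(feature_names, 0)
    row["Mileage"] = payload["Mileage"]
    row["antiguedad"] = max_year - payload["Year"]
    # Unseen models fall back to frequency 0.0 (an assumption, not the notebook's rule)
    row["Model"] = model_freq.get(payload["Model"], 0.0)
    # Switch on the dummy columns only if they existed during training
    for column in (f"is_state_{payload['State']}", f"is_make_{payload['Make']}"):
        if column in row:
            row[column] = 1
    return [row[name] for name in feature_names]

# Shortened, hypothetical feature list; the real one is `features` from the notebook
demo_features = ["Mileage", "Model", "antiguedad",
                 "is_state_ MD", "is_state_ KY", "is_make_Nissan", "is_make_Jeep"]
demo_freq = {"MuranoAWD": 0.004}  # made-up frequency
demo_payload = {"Year": 2014, "Mileage": 31909, "State": " MD",
                "Make": "Nissan", "Model": "MuranoAWD"}

demo_row = build_feature_row(demo_payload, demo_features, demo_freq)
print(demo_row)
```

The dummy columns shown in the notebook output carry a leading space in the state code (for example `is_state_ AK`), so incoming state values would need the same formatting, or the training data would need to be stripped first.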


# **Note:**
The GitHub repository contains the source code of the API built with FastAPI, including usage and execution examples.
