![image info](https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/images/banner_1.png)

# Project 1 - Predicting used vehicle prices

In this project you will put into practice your knowledge of tree-based and ensemble predictive models, and of model deployment. As you work on it, follow the instructions given in the "Guía del proyecto 1: Predicción de precios de vehículos usados" (Project 1 guide: Predicting used vehicle prices).

**Submission**: The project must be submitted during week 4. However, it is important that you make progress during week 3 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report as a PDF to the project submission activity in week 4, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/b8be43cf89c540bfaf3831f2c8506614).

## Data for predicting used vehicle prices

This project uses the Car Listings dataset from Kaggle, where each observation records the price of a car together with variables such as year, make, and model, among others. The goal is to predict the car's price. For more details, see the following link: [data](https://www.kaggle.com/jpayne/852k-used-car-listings).

## Example: test set predictions for submission to Kaggle

This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import joblib
from flask import Flask
from flask_restx import Api, Resource, fields, reqparse

In [3]:
# Load the data from (zip-compressed) .csv files
dataTraining = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')
dataTesting = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTest_carListings.zip', index_col=0)

In [4]:
# Preview the training data
dataTraining.head()

|   | Price | Year | Mileage | State | Make | Model |
|---|---|---|---|---|---|---|
| 0 | 34995 | 2017 | 9913 | FL | Jeep | Wrangler |
| 1 | 37895 | 2015 | 20578 | OH | Chevrolet | Tahoe4WD |
| 2 | 18430 | 2012 | 83716 | TX | BMW | X5AWD |
| 3 | 24681 | 2014 | 28729 | OH | Cadillac | SRXLuxury |
| 4 | 26998 | 2013 | 64032 | CO | Jeep | Wrangler |


In [5]:
# Preview the test data
dataTesting.head()

| ID | Year | Mileage | State | Make | Model |
|---|---|---|---|---|---|
| 0 | 2014 | 31909 | MD | Nissan | MuranoAWD |
| 1 | 2017 | 5362 | FL | Jeep | Wrangler |
| 2 | 2014 | 50300 | OH | Ford | FlexLimited |
| 3 | 2004 | 132160 | WA | BMW | 5 |
| 4 | 2015 | 25226 | MA | Jeep | Grand |


In [6]:
# Encode each categorical column as integers and keep the fitted encoder for reuse
label_encoders = {}
for column in ['State', 'Make', 'Model']:
    le = LabelEncoder()
    dataTraining[column] = le.fit_transform(dataTraining[column])
    label_encoders[column] = le
# Separate predictors and target, then hold out 20% of the rows for evaluation
X = dataTraining[['Year', 'Mileage', 'State', 'Make', 'Model']]
y = dataTraining['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
# Base regressor whose hyperparameters will be tuned
xgb_model = xgb.XGBRegressor(objective='reg:squarederror')

param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 0.1, 0.5, 1, 1.5, 2],
    'colsample_bytree': [0.3, 0.5, 0.7, 1.0],
    'subsample': [0.5, 0.75, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.01, 0.1, 1]
}

# Evaluate 100 sampled hyperparameter combinations with 5-fold cross-validation
xgb_random = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=100,
    scoring='neg_mean_squared_error',
    cv=5,
    verbose=2,
    random_state=42
)

xgb_random.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV] END colsample_bytree=0.3, gamma=1.5, learning_rate=0.05, max_depth=9, n_estimators=400, reg_alpha=1, reg_lambda=0.01, subsample=0.5; total time=  10.3s
[CV] END colsample_bytree=0.3, gamma=1.5, learning_rate=0.05, max_depth=9, n_estimators=400, reg_alpha=1, reg_lambda=0.01, subsample=0.5; total time=  10.3s
[CV] END colsample_bytree=0.3, gamma=1.5, learning_rate=0.05, max_depth=9, n_estimators=400, reg_alpha=1, reg_lambda=0.01, subsample=0.5; total time=  10.0s
[CV] END colsample_bytree=0.3, gamma=1.5, learning_rate=0.05, max_depth=9, n_estimators=400, reg_alpha=1, reg_lambda=0.01, subsample=0.5; total time=  10.1s
[CV] END colsample_bytree=0.3, gamma=1.5, learning_rate=0.05, max_depth=9, n_estimators=400, reg_alpha=1, reg_lambda=0.01, subsample=0.5; total time=  10.1s
[CV] END colsample_bytree=0.3, gamma=0, learning_rate=0.05, max_depth=3, n_estimators=400, reg_alpha=1, reg_lambda=0.1, subsample=1.0; total time=   4.0

RandomizedSearchCV(cv=5,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          callbacks=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, device=None,
                                          early_stopping_rounds=None,
                                          enable_categorical=False,
                                          eval_metric=None, feature_types=None,
                                          gamma=None, grow_policy=None,
                                          importance_type=None,
                                          interaction_constraints=None,
                                          learning_rate=...
                                          random_state=None, ...),
                   n_iter=100,
                   param_distributions={'colsample_
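
To put `n_iter=100` in context, the sketch below counts how many combinations the search space above contains and draws one random candidate from it. It uses only the standard library and samples each parameter independently, which is a simplification and not necessarily how `RandomizedSearchCV` draws its candidates internally.

```python
import math
import random

# Same search space as in the cell above
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 0.1, 0.5, 1, 1.5, 2],
    'colsample_bytree': [0.3, 0.5, 0.7, 1.0],
    'subsample': [0.5, 0.75, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.01, 0.1, 1],
}

# Size of the full grid: product of the number of options per parameter
grid_size = math.prod(len(values) for values in param_dist.values())
n_iter, cv = 100, 5
total_fits = n_iter * cv

print(grid_size)   # 86400
print(total_fits)  # 500, matching "totalling 500 fits" in the log

# Illustrative draw of one candidate (independent choice per parameter)
rng = random.Random(42)
candidate = {name: rng.choice(values) for name, values in param_dist.items()}
print(candidate)
```

With 100 candidates out of 86,400 possible combinations, the search covers roughly 0.12% of the grid, so the reported best parameters are the best among those sampled, not necessarily the best in the whole space.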

In [16]:
best_params = xgb_random.best_params_
best_neg_mse = xgb_random.best_score_

# The scorer returns the negated MSE, so flip the sign before taking the square root
best_rmse = np.sqrt(-best_neg_mse)

print("Best parameters:", best_params)
print("Best RMSE:", best_rmse)

Best parameters: {'subsample': 1.0, 'reg_lambda': 1, 'reg_alpha': 1, 'n_estimators': 400, 'max_depth': 7, 'learning_rate': 0.2, 'gamma': 1.5, 'colsample_bytree': 0.7}
Best RMSE: 3719.911394065415
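
The conversion in the cell above relies on `scoring='neg_mean_squared_error'` reporting the MSE with its sign flipped, so that larger is better. The following standard-library sketch works through that arithmetic on a tiny example. The true prices are the first three rows of the training preview; the predicted prices are made up purely for illustration.

```python
import math

def mean_squared_error_py(y_true, y_pred):
    """Plain-Python MSE: average of the squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

y_true = [34995, 37895, 18430]   # prices from the training preview
y_pred = [33995, 38895, 20430]   # hypothetical predictions

mse = mean_squared_error_py(y_true, y_pred)
neg_mse = -mse                    # what a "neg_mean_squared_error" scorer would report
rmse = math.sqrt(-neg_mse)        # same conversion as np.sqrt(-best_neg_mse)

print(mse)    # 2000000.0
print(rmse)
```

Note that the value obtained this way from `best_score_` is the square root of the MSE averaged over folds, which is generally not identical to the average of the per-fold RMSE values.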


In [7]:
# Best hyperparameters found by the randomized search, hard-coded so the search need not be rerun
best_params = {'subsample': 1.0, 'reg_lambda': 1, 'reg_alpha': 1, 'n_estimators': 400, 'max_depth': 7, 'learning_rate': 0.2, 'gamma': 1.5, 'colsample_bytree': 0.7}
# Refit a model with those hyperparameters on the training split
best_model = xgb.XGBRegressor(objective='reg:squarederror', **best_params)
best_model.fit(X_train, y_train)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=1.5, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.2, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=400, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)

In [8]:
y_pred = best_model.predict(X_test)

In [9]:
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE on Test Set: {rmse_test}")

RMSE on Test Set: 3736.5575731107665


In [10]:
# Encode the Kaggle test set with the encoders fitted on the training data
for column in ['State', 'Make', 'Model']:
    dataTesting[column] = label_encoders[column].transform(dataTesting[column])
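
`LabelEncoder.transform` raises a `ValueError` when it meets a label that was not present during fitting, which matters both for the test set and for requests sent to the API later on. The sketch below is a pure-Python illustration of the same idea with a fallback code for unseen labels. `build_mapping` and `encode` are hypothetical helpers, not part of scikit-learn, and using `-1` as the fallback is an assumption that the model would have to be trained to handle.

```python
def build_mapping(train_values):
    """Assign an integer code to each distinct label, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(train_values)))}

def encode(values, mapping, unknown=-1):
    """Encode labels; any label not seen when building the mapping gets `unknown`."""
    return [mapping.get(value, unknown) for value in values]

# States taken from the training preview
train_states = ["FL", "OH", "TX", "OH", "CO"]
state_codes = build_mapping(train_states)

# "WA" does not appear in the toy training list above
encoded = encode(["OH", "WA", "CO"], state_codes)

print(state_codes)  # {'CO': 0, 'FL': 1, 'OH': 2, 'TX': 3}
print(encoded)      # [2, -1, 0]
```

Whether a fallback code, a rejected request, or a different encoding altogether is the right choice depends on the project; the sketch only makes the failure mode visible.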

In [11]:
X_testing = dataTesting[['Year', 'Mileage', 'State', 'Make', 'Model']]
y_pred_testing = best_model.predict(X_testing)
y_pred_testing

array([20939.326, 36699.324, 14271.449, ..., 24251.402, 16264.052,
       19251.799], dtype=float32)

In [12]:
# Collect the predictions in a DataFrame; its default 0..n-1 index is written out as the ID column
predicted_prices = pd.DataFrame({
    'Price': y_pred_testing
})
predicted_prices.reset_index(drop=True, inplace=True)

In [13]:
# Save the predictions in the format required by the Kaggle competition
predicted_prices.to_csv('test_submission.csv', index_label='ID')
predicted_prices.head()

|   | Price |
|---|---|
| 0 | 20939.326172 |
| 1 | 36699.324219 |
| 2 | 14271.449219 |
| 3 | 7605.625488 |
| 4 | 30662.025391 |
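
For reference, `to_csv(..., index_label='ID')` should produce a file with two columns, `ID` and `Price`. The sketch below writes the same layout with the standard `csv` module into an in-memory buffer, using the first three predictions shown above. It assumes the IDs are simply the row positions 0, 1, 2, ... in test-set order.

```python
import csv
import io

# First three predicted prices from the preview above
predictions = [20939.326172, 36699.324219, 14271.449219]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "Price"])            # header expected by the competition
for row_id, price in enumerate(predictions):
    writer.writerow([row_id, price])        # ID is the row position (assumption)

submission_text = buffer.getvalue()
print(submission_text)
```

Checking the header and the first few rows of the generated file this way before uploading can catch a missing or misnamed ID column early.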


In [14]:
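# Persist the fitted model and the encoders for the API; the model_deployment/ directory must already exist
joblib.dump(best_model, 'model_deployment/car_price_reg.pkl')
joblib.dump(label_encoders, 'model_deployment/label_encoders.pkl')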

['model_deployment/label_encoders.pkl']

In [15]:
# Reload the persisted artifacts, as a standalone API script would
best_model = joblib.load('model_deployment/car_price_reg.pkl')
label_encoders = joblib.load('model_deployment/label_encoders.pkl')

app = Flask(__name__)
api = Api(app, version='1.0', title='Model API',
          description='A simple API that uses the model to make predictions')

ns = api.namespace('predict', description='Model Prediction')

model = api.model('PredictionData', {
    'Year': fields.Integer(required=True, description='Year of the vehicle'),
    'Mileage': fields.Integer(required=True, description='Mileage of the vehicle'),
    'State': fields.String(required=True, description='State where the vehicle is registered'),
    'Make': fields.String(required=True, description='Make of the vehicle'),
    'Model': fields.String(required=True, description='Model of the vehicle'),
})

parser = reqparse.RequestParser()
parser.add_argument('Year', type=int, required=True, help='Year of the vehicle')
parser.add_argument('Mileage', type=int, required=True, help='Mileage of the vehicle')
parser.add_argument('State', type=str, required=True, help='State where the vehicle is registered')
parser.add_argument('Make', type=str, required=True, help='Make of the vehicle')
parser.add_argument('Model', type=str, required=True, help='Model of the vehicle')

@ns.route('/')
class CarPriceApi(Resource):
    @api.expect(model)
    @api.response(200, 'Success')
    def post(self):
        args = parser.parse_args()
        input_data = pd.DataFrame([args])

        # Apply the label encoders fitted during training
        for column in ['State', 'Make', 'Model']:
            input_data[column] = label_encoders[column].transform(input_data[column])

        # Predict with the model
        prediction = best_model.predict(input_data)

        # Return the result
        return {
            "result": float(prediction[0])
        }, 200

if __name__ == '__main__':
    app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


 * Running on all addresses.
 * Running on http://192.168.1.105:5000/ (Press CTRL+C to quit)
192.168.1.105 - - [21/Apr/2024 18:12:32] "GET / HTTP/1.1" 200 -
192.168.1.105 - - [21/Apr/2024 18:12:32] "GET /swagger.json HTTP/1.1" 200 -
192.168.1.105 - - [21/Apr/2024 18:12:32] "GET /swaggerui/swagger-ui.css HTTP/1.1" 200 -
192.168.1.105 - - [21/Apr/2024 18:12:32] "GET /swaggerui/favicon-16x16.png HTTP/1.1" 200 -
192.168.1.105 - - [21/Apr/2024 18:12:48] "POST /predict/ HTTP/1.1" 200 -
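
The last log line corresponds to a `POST /predict/` call. A client for that endpoint could be sketched as follows with the standard library only. The URL assumes the server is reachable on `localhost:5000`, `build_request` and `predict_price` are hypothetical helper names, and the payload values come from the first row of the training preview. The string fields have to match the labels seen during training exactly, otherwise the encoders will reject them.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/predict/"  # assumed address of the running Flask app

def build_request(payload, url=API_URL):
    """Build a JSON POST request carrying the five fields the API expects."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict_price(payload, url=API_URL):
    """Send the request and return the 'result' field of the JSON response."""
    with urllib.request.urlopen(build_request(payload, url), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))["result"]

# Values taken from the first row of the training preview
example_payload = {
    "Year": 2017,
    "Mileage": 9913,
    "State": "FL",
    "Make": "Jeep",
    "Model": "Wrangler",
}

request = build_request(example_payload)
print(request.get_method(), request.full_url)
# predict_price(example_payload)  # uncomment while the API server is running
```

The request is only built here, not sent, so the snippet runs without a live server; the commented call shows how it would be used against the running API.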
