# Project 2

# Used Vehicle Price Prediction

## Introduction

- 1.2 million listings scraped from TrueCar.com. The Price, Mileage, Make, Model dataset is available on Kaggle: [data](https://www.kaggle.com/jpayne/852k-used-car-listings)
- Each observation represents the price of a used car

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import scipy as sp
import category_encoders as ce
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from scipy.sparse import csr_matrix
from sklearn.metrics import mean_squared_error
from statistics import mean 
import matplotlib.pyplot as plt
from random import randrange

In [None]:
data = pd.read_csv('../datasets/dataTrain_carListings.zip')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.Price.describe()

In [None]:
data.plot(kind='scatter', y='Price', x='Year')

In [None]:
data.plot(kind='scatter', y='Price', x='Mileage')

In [None]:
data.columns

In [None]:
data = data.sample(n=10000, random_state=1)  # take a random sample of the dataset for training and parameter tuning

In [None]:
# Predictor set
X_cat = data[["State", "Make", "Model"]]
X_num = data[["Year","Mileage"]]

In [None]:
encoder = ce.BinaryEncoder().fit(X_cat)
X_cat = encoder.transform(X_cat)
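
To make the encoding step less opaque, here is a minimal pure-Python sketch of the idea behind binary encoding: each category gets an integer, and the integer is written out as binary digits, one column per bit. The function `binary_encode` and the sample states are illustrative assumptions; the column names and the exact integer assignment used by `category_encoders` may differ.

```python
from math import ceil, log2


def binary_encode(values):
    """Map each distinct category to an integer, then to a list of bits."""
    # Assign integers in order of first appearance, starting at 1
    ordinals = {}
    for v in values:
        if v not in ordinals:
            ordinals[v] = len(ordinals) + 1
    # Number of bits needed to represent the largest integer
    n_bits = max(1, ceil(log2(len(ordinals) + 1)))
    return {
        cat: [int(bit) for bit in format(idx, f"0{n_bits}b")]
        for cat, idx in ordinals.items()
    }


# Hypothetical sample of the State column
codes = binary_encode(["TX", "CA", "FL", "TX", "NY", "WA"])
for state, bits in codes.items():
    print(state, bits)
```

Five distinct states fit in three bit columns, whereas one-hot encoding would need five. This is why binary encoding keeps the feature matrix narrow for high-cardinality columns such as `Model`.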

In [None]:
X = pd.concat([X_num,X_cat],  axis=1, sort = False)

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y = data['Price']

# Exercise P2.1 (50%)

Develop a machine learning model that predicts the price of the car using as input ['Year', 'Mileage', 'State', 'Make', 'Model']

#### Evaluation:
- 25% - Performance of the models using a manually implemented K-Fold (K=10) cross-validation
- 25% - Notebook explaining the process for selecting the best model. You must specify how each parameter is calibrated and how these parameters change the performance of the model. A clear comparison of all implemented models is expected. Present the most relevant conclusions about the whole process.


In [None]:
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=.2, random_state=1)

In [None]:
def cross_validation(X_train, y_train, model, k=10):
    """Return the mean RMSE of `model` over k folds of the training data."""
    scores = []

    # shuffle=False: folds are contiguous blocks; the rows were already shuffled by train_test_split
    cv = KFold(n_splits=k, shuffle=False)

    for train_index, test_index in cv.split(X_train):
        X_train_cv, X_test_cv = X_train.iloc[train_index], X_train.iloc[test_index]
        y_train_cv, y_test_cv = y_train.iloc[train_index], y_train.iloc[test_index]
        X_train_cv = csr_matrix(X_train_cv)  # sparse matrix to speed up fitting
        model.fit(X_train_cv, y_train_cv)
        y_pr = model.predict(X_test_cv)
        scores.append(mean_squared_error(y_test_cv, y_pr) ** 0.5)  # RMSE for this fold

    return mean(scores)
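
Since the evaluation asks for a manually implemented K-Fold, the fold indices could also be produced without `KFold`. The following is a minimal sketch under that assumption; `manual_kfold_indices` is a hypothetical helper, not part of the original notebook.

```python
import random


def manual_kfold_indices(n_samples, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # The first (n_samples % k) folds get one extra element
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(manual_kfold_indices(n_samples=25, k=10))
print([len(test) for _, test in folds])
```

Each pair of index lists can be passed to `.iloc` exactly as `train_index` and `test_index` are used in `cross_validation` above.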

### Random forest
A random forest regressor is a meta-estimator that fits a number of decision tree regressors on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting.
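
The averaging idea can be sketched in pure Python with depth-1 trees (stumps) fitted on bootstrap samples. This is a toy illustration under simplifying assumptions (one feature, made-up data, no random feature subsets), not how scikit-learn implements `RandomForestRegressor`; all function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D data: one threshold, two leaf means."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lo, hi = mean(left), mean(right)
        sse = sum((y - lo) ** 2 for y in left) + sum((y - hi) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lo, hi)
    if best is None:  # all x identical: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, t, lo, hi = best
    return lambda x: lo if x < t else hi


def fit_forest(xs, ys, n_trees=25, seed=1):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """The ensemble prediction is the average of the individual trees."""
    return mean(tree(x) for tree in trees)


# Made-up data: mileage in thousands of miles vs. price
mileage = [10, 20, 30, 40, 50, 60, 70, 80]
price = [30000, 28000, 25000, 22000, 18000, 15000, 12000, 10000]
forest = fit_forest(mileage, price)
print(predict_forest(forest, 10), predict_forest(forest, 80))
```

Each stump alone is a crude step function; averaging many of them, each trained on a slightly different sample, gives a smoother and less sample-dependent prediction.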

In [None]:
# Random forest 
clf = RandomForestRegressor(random_state=1, n_jobs=-1)
clf.fit(X_train, y_train)
y_pr=clf.predict(X_test)
mean_squared_error(y_test, y_pr)**0.5

#### Parameter tuning
n_estimators is the number of trees in the forest. More trees usually improve the fit, with diminishing returns.
However, adding many trees can slow down training considerably, so we run a parameter search to find the sweet spot.
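
One possible way to formalize the "sweet spot" is to take the smallest parameter value whose RMSE is within a small tolerance of the best RMSE in the sweep. The helper below is a hypothetical sketch, and the scores are made up for illustration only; they are not results from this notebook.

```python
def smallest_within_tolerance(param_values, rmse_scores, tolerance=0.01):
    """Return the smallest parameter value whose RMSE is within
    `tolerance` (relative) of the best RMSE in the sweep."""
    best = min(rmse_scores)
    for value, score in sorted(zip(param_values, rmse_scores)):
        if score <= best * (1 + tolerance):
            return value


# Made-up sweep results, for illustration only
example_range = [10, 110, 210, 310]
example_rmse = [5600.0, 5420.0, 5400.0, 5395.0]
choice = smallest_within_tolerance(example_range, example_rmse)
print(choice)
```

With a tolerance of 1%, the cheaper setting is preferred over the one with the absolute minimum RMSE; setting `tolerance=0` returns the absolute minimum instead.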

In [None]:
# finding the best n_estimators:

# list of values to try for n_estimators
estimator_range = range(10, 1000, 100)

# list to store the average RMSE for each value of n_estimators
accuracy_scores = []

# use 10-fold cross-validation with each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    clf = RandomForestRegressor(n_estimators=estimator, random_state=1, n_jobs=-1)
    accuracy_scores.append(cross_validation(X_train, y_train, clf,k=10))
    

In [None]:
plt.plot(estimator_range, accuracy_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE')

In [None]:
X.columns

In [None]:
feature_cols = X.columns

max_features is the size of the random subsets of features to consider when splitting a node.

In [None]:
# list of values to try for max_features: 
feature_range = range(1, len(feature_cols)+1)

# list to store the average RMSE for each value of max_features
accuracy_scores = []

# use 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    clf = RandomForestRegressor(n_estimators=200, max_features=feature, random_state=1, n_jobs=-1)
    accuracy_scores.append(cross_validation(X_train, y_train, clf,k=10))
    

In [None]:
plt.plot(feature_range, accuracy_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE')

max_depth is the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information about the data it captures, at the risk of over-fitting.

In [None]:
# list of values to try for max_depth: 
max_depth_range = range(1, 21)

# list to store the average RMSE for each value of max_depth
accuracy_scores = []

for depth in max_depth_range:
    clf =  RandomForestRegressor(max_depth=depth, n_estimators=200,max_features=9, random_state=1,n_jobs=-1)
    accuracy_scores.append(cross_validation(X_train, y_train, clf,k=10))
    
    

In [None]:
plt.plot(max_depth_range, accuracy_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE')

In [None]:
# model with the optimized parameters 
clf_optimizado =  RandomForestRegressor(max_depth=15, n_estimators=100, max_features=9, random_state=1,n_jobs=-1)
clf_optimizado.fit(X_train, y_train)
y_pr=clf_optimizado.predict(X_test)
mean_squared_error(y_test, y_pr)**0.5

With the random forest model tuned using 10-fold cross-validation, the RMSE is reduced from 5431.5 to 5289.16.

### XGBoost

In [None]:
#XGBoost
from xgboost import XGBRegressor
from sklearn import metrics
xg = XGBRegressor()
xg

In [None]:
xg.fit(X_train, y_train)
y_pred = xg.predict(X_test)
metrics.mean_squared_error(y_pred, y_test.values)**0.5

n_estimators: the number of boosting rounds, i.e. the number of trees added to the ensemble. More trees usually fit the training data better.

In [None]:
# finding the best n_estimators:

# list of values to try for n_estimators
estimator_range = range(10, 1000, 100)

# list to store the average RMSE
accuracy_scores = []

for e in estimator_range:
    xg = XGBRegressor(n_estimators=e, random_state=1, n_jobs=-1)
    # cross_validation refits the model on every fold, so no separate fit is needed
    accuracy_scores.append(cross_validation(X_train, y_train, xg, k=10))

In [None]:
plt.plot(estimator_range, accuracy_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE')

learning_rate: one way to slow down learning in a gradient boosting model is to apply a weighting factor to the corrections made by each new tree as it is added to the model.

This weighting is called the shrinkage factor or the learning rate, depending on the literature or the tool. In XGBoost's native parameters it is also known as `eta`.
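
The effect of shrinkage can be seen in a deliberately simplified sketch where each "tree" predicts the current residual exactly and the update is scaled by the learning rate. This is an illustrative assumption, not XGBoost's actual tree-fitting procedure; `boost_constant` is a hypothetical name.

```python
def boost_constant(target, learning_rate, n_rounds):
    """Boosting with the simplest possible learner: every round predicts the
    current residual exactly, and the update is scaled by learning_rate."""
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residual = target - prediction
        prediction += learning_rate * residual  # shrunken correction
        history.append(prediction)
    return history


slow = boost_constant(target=100.0, learning_rate=0.1, n_rounds=10)
fast = boost_constant(target=100.0, learning_rate=1.0, n_rounds=10)
print(slow[-1], fast[0])
```

With a learning rate of 1.0 the target is reached in one round, while with 0.1 the remaining error shrinks by a factor of 0.9 per round. A smaller learning rate therefore needs more rounds (`n_estimators`) to reach the same fit, which is why the two parameters are usually tuned together.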

In [None]:
# finding the best value for learning_rate
learning_rate_range = np.arange(0, 1, 0.1)

# list to store the average RMSE
accuracy_scores = []

for lr in learning_rate_range:
    xg = XGBRegressor(eta=lr, n_estimators=200, random_state=1, n_jobs=-1)
    # cross_validation refits the model on every fold, so no separate fit is needed
    accuracy_scores.append(cross_validation(X_train, y_train, xg, k=10))

In [None]:
plt.plot(learning_rate_range, accuracy_scores)
plt.xlabel('learning_rate')
plt.ylabel('RMSE')

Gamma: a node is split only when the split yields a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split,
so larger values make the algorithm more conservative. Suitable values depend on the loss function and should be tuned.
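
As a sketch of how gamma enters the split decision, the function below computes the split gain in the form given in the XGBoost documentation, where `g` and `h` are the sums of first and second derivatives of the loss over the rows in each child. The function name and the numbers are illustrative assumptions.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children.
    g_* and h_* are sums of gradients and hessians of the loss."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Made-up gradient statistics for a candidate split
print(split_gain(-10.0, 5.0, 10.0, 5.0, gamma=0.0))   # positive: split is kept
print(split_gain(-10.0, 5.0, 10.0, 5.0, gamma=20.0))  # negative: split is rejected
```

The same candidate split is accepted with `gamma=0` and rejected with `gamma=20`, which is the sense in which a larger gamma makes the model more conservative.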

In [None]:
# tuning gamma
gamma_range = range(0, 50)
# list to store the average RMSE
accuracy_scores = []

for g in gamma_range:
    xg = XGBRegressor(gamma=g, eta=0.099, n_estimators=200, random_state=1, n_jobs=-1)
    # cross_validation refits the model on every fold, so no separate fit is needed
    accuracy_scores.append(cross_validation(X_train, y_train, xg, k=10))

In [None]:
plt.plot(gamma_range, accuracy_scores)
plt.xlabel('gamma')
plt.ylabel('RMSE')

In [None]:
# Tuned XGBoost

xg_op =  XGBRegressor(n_estimators=200,eta=0.099, random_state=1,n_jobs=-1)
xg_op.fit(X_train, y_train)
y_pred = xg_op.predict(X_test)
metrics.mean_squared_error(y_pred, y_test.values)**0.5

With the XGBoost model tuned using 10-fold cross-validation, the RMSE is reduced from 5647.7 to 5554.7.
Compared with the tuned random forest, which reached an RMSE of 5289.16, XGBoost reaches an RMSE of 5554.7. We therefore proceed with the random forest model.

# Exercise P2.2 (50%)

Create an API of the model.

Example:
![](https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/notebooks/images/img015.PNG)

#### Evaluation:
- 40% - API hosted on a cloud service
- 10% - Show screenshots of the model doing the predictions on the local machine


### Save model

In [None]:
import joblib

In [None]:
joblib.dump(clf_optimizado, 'model_deployment_proyecto/car_price_prediction.aaaa', compress=3)

In [None]:
joblib.dump(encoder, 'model_deployment_proyecto/encoder.aaaa', compress=3)

### Model in batch

In [None]:
from model_deployment_proyecto.m09_model_deployment import predict_price

In [None]:
predict_price(Year=2015,Mileage=54593,State='MS',Make='Toyota',Model='CamrySE')

### API

In [None]:
from flask import Flask
from flask_restx import Api, Resource, fields
import joblib
from model_deployment_proyecto.m09_model_deployment import predict_price

app = Flask(__name__)

api = Api(
    app, 
    version='1.0', 
    title='Car Prices Prediction API',
    description='Car Prices Prediction API')

ns = api.namespace('predict', 
     description='Price prediction')
   
parser = api.parser()

parser.add_argument(
    'Year', 
    type=int, 
    required=True, 
    help='Year to be analyzed', 
    location='args')

parser.add_argument(
    'Mileage', 
    type=int, 
    required=True, 
    help='Mileage to be analyzed', 
    location='args')

parser.add_argument(
    'State', 
    type=str, 
    required=True, 
    help='State to be analyzed', 
    location='args')

parser.add_argument(
    'Make', 
    type=str, 
    required=True, 
    help='Make to be analyzed', 
    location='args')


parser.add_argument(
    'Model', 
    type=str, 
    required=True, 
    help='Model to be analyzed', 
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.String,
})

@ns.route('/')
class CarPriceApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        
        return {
            "result": predict_price(args['Year'], args['Mileage'], args['State'], args['Make'], args['Model'])
        }, 200
    
    
if __name__ == '__main__':
    app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)
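
For a quick check of the running service, a client only needs to build a GET request against the `/predict/` route with the five query parameters defined above. The sketch below assumes the server is reachable at `http://localhost:5000`; `build_predict_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_predict_url(base_url, **params):
    """Build the GET URL for the /predict/ endpoint with query parameters."""
    return f"{base_url.rstrip('/')}/predict/?{urlencode(params)}"


url = build_predict_url(
    "http://localhost:5000",  # assumed local address of the Flask app
    Year=2015, Mileage=54593, State="MS", Make="Toyota", Model="CamrySE",
)
print(url)

# With the server running, the request could be sent with the standard library:
# from urllib.request import urlopen
# print(urlopen(url).read().decode())
```

The same URL can be pasted into a browser, or the parameters can be entered in the Swagger UI that `flask_restx` serves at the application root.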
