## Model training

In this notebook I tried different regression models, compared their performance, and tried to optimize their hyperparameters.
I used the R2 metric (coefficient of determination) for evaluation.
The training part of this notebook has been converted to a script.
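
For reference, the R2 score compares the model's squared error with that of a baseline that always predicts the mean of the target. The following is a minimal pure-Python sketch of that formula, not the scikit-learn implementation; the helper name `r2_manual` and the toy numbers are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of the "always predict the mean" baseline
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0, 400.0]
print(r2_manual(y_true, y_true))       # perfect predictions -> 1.0
print(r2_manual(y_true, [250.0] * 4))  # always predicting the mean -> 0.0
```

A score of 1.0 means perfect predictions, 0.0 means no better than predicting the mean, and negative values mean worse than that baseline.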

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')

In [2]:
SEED = 42

## Prepare dataset

In [3]:
data = pd.read_csv("../data/processed/data.csv")

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   car_make               1007 non-null   int64  
 1   year                   1007 non-null   int64  
 2   engine_size_l          1007 non-null   float64
 3   horsepower             1007 non-null   float64
 4   torque_lb_ft           1007 non-null   float64
 5   0_60_mph_time_seconds  1007 non-null   float64
 6   price_in_usd           1007 non-null   int64  
 7   age                    1007 non-null   int64  
dtypes: float64(4), int64(4)
memory usage: 63.1 KB


In [5]:
X = data.drop(columns='price_in_usd')
y = data['price_in_usd']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
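
With 1007 rows and `test_size=0.2`, the split sizes can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up and that the training set gets the remainder, which is my understanding of how `train_test_split` behaves; the variable names are illustrative.

```python
import math

n_samples = 1007  # rows reported by data.info()
test_size = 0.2   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # assumed rounding: ceiling
n_train = n_samples - n_test               # remainder goes to training

print(n_train, n_test)
```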

In [7]:
X_train.head()

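|     | car_make | year | engine_size_l | horsepower | torque_lb_ft | 0_60_mph_time_seconds | age |
|-----|----------|------|---------------|------------|--------------|-----------------------|-----|
| 29  | 11       | 2021 | 4.0           | 986.0      | 590.0        | 2.5                   | 2   |
| 280 | 26       | 2022 | 4.395781      | 1874.0     | 1696.0       | 1.9                   | 1   |
| 507 | 25       | 2021 | 6.0           | 764.0      | 738.0        | 2.6                   | 2   |
| 652 | 9        | 2021 | 6.2           | 650.0      | 650.0        | 3.5                   | 2   |
| 947 | 5        | 2022 | 2.5           | 394.0      | 354.0        | 3.6                   | 1   |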


In [8]:
R2_score_train = []
R2_score_test = []
CV = []

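def car_pred_model(model, name):
    """Fit `model`, record its train/test R2 and 5-fold CV mean, and print a summary."""
    model.fit(X_train, y_train)

    # R2 on the data the model was fitted on
    y_pred_train = model.predict(X_train)
    R2_score_train_v = r2_score(y_train, y_pred_train)
    R2_score_train.append(round(R2_score_train_v, 2))

    # R2 on the held-out test split
    y_pred_test = model.predict(X_test)
    R2_score_test_v = r2_score(y_test, y_pred_test)
    R2_score_test.append(round(R2_score_test_v, 2))

    # 5-fold cross-validation on the training split only (default scorer: R2 for regressors)
    cross_val = cross_val_score(model, X_train, y_train, cv=5)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean, 2))

    print("Model name:", name)
    # Search objects such as RandomizedSearchCV expose the refitted best estimator
    if hasattr(model, 'best_estimator_'):
        print("Model best parameters:", model.best_estimator_)
    print("Train R2 score:", round(R2_score_train_v, 2))
    print("Test R2 score:", round(R2_score_test_v, 2))
    print("Train cross validation scores :", cross_val)
    # Despite the "Test" label, this is the mean of the training-fold scores above
    print("Test cross validation  mean :", round(cv_mean, 2))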
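
`cross_val_score(..., cv=5)` splits the training data into five folds and scores the model on each fold after fitting it on the other four. The sketch below shows only the index bookkeeping behind such a split, in plain Python. `kfold_indices` is a hypothetical helper, and it assumes contiguous, unshuffled folds, which should match the default behaviour for regressors.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 805 is the training-set size assumed here for a 1007-row dataset split 80/20
for train_idx, val_idx in kfold_indices(805):
    print(len(train_idx), len(val_idx))
```

Every sample is used for validation exactly once, so the mean of the five scores is an estimate of generalization that does not touch the test split.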


## Linear regression

In [9]:
lr = LinearRegression()
car_pred_model(lr, "Linear Regression")

Model name: Linear Regression
Train R2 score: 0.66
Test R2 score: 0.59
Train cross validation scores : [0.62494452 0.66735624 0.67739147 0.64534328 0.65530797]
Test cross validation  mean : 0.65


## Lasso

In [10]:
ls = Lasso()
# 14 log-spaced candidates from 1e-3 to 1e3; the search samples n_iter=10 of them by default
alpha = np.logspace(-3, 3, num=14)
ls_rs = RandomizedSearchCV(estimator=ls, param_distributions=dict(alpha=alpha))
car_pred_model(ls_rs, "Lasso")

  model = cd_fast.enet_coordinate_descent(


Model name: Lasso
Model best parameters: Lasso(alpha=345.5107294592218)
Train R2 score: 0.66
Test R2 score: 0.59
Train cross validation scores : [0.62494518 0.66742589 0.67739144 0.64510127 0.65534739]
Test cross validation  mean : 0.65
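
The selected `alpha` of about 345.5 is one of the 14 candidates produced by `np.logspace(-3, 3, num=14)`. As a quick check, the grid can be reproduced in plain Python. This is only a sketch of what the log-spaced grid contains, and `log_grid` is a name made up for the example.

```python
def log_grid(start_exp, stop_exp, num):
    """Return `num` values evenly spaced on a log10 scale, endpoints included."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]

alphas = log_grid(-3, 3, 14)
# Position of the candidate closest to the alpha reported for the best Lasso model
best_idx = min(range(len(alphas)), key=lambda i: abs(alphas[i] - 345.5107294592218))
print(best_idx, round(alphas[best_idx], 2))
```

The chosen value is the second-largest candidate, yet the Lasso scores are practically identical to those of plain linear regression, which suggests that L1 regularization makes little difference on this feature set.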




## Random Forest

In [11]:
rf = RandomForestRegressor()
param_grid = { 
    'n_estimators': [25, 50, 100, 150], 
    'max_features': ['sqrt', 'log2', None], 
    'max_depth': [3, 6, 9], 
    'max_leaf_nodes': [3, 6, 9], 
} 
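rf_rs = RandomizedSearchCV(estimator=rf, param_distributions=param_grid)
# NOTE: the default-parameter `rf` is evaluated here, not the search object `rf_rs`,
# so the scores below are for an untuned forest; pass `rf_rs` to run the search.
car_pred_model(rf, 'Random Forest')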

Model name: Random Forest
Train R2 score: 0.99
Test R2 score: 0.86
Train cross validation scores : [0.98027906 0.95274512 0.90633257 0.87293852 0.9227835 ]
Test cross validation  mean : 0.93


## Gradient Boosting

In [12]:
gb = GradientBoostingRegressor()
param_grid = {
    "learning_rate": [0.001, 0.01, 0.05, 0.1, 0.3],
    "n_estimators": [25, 50, 100, 150],
    "max_depth": [2, 4, 6, 8],
    # None uses all features; it replaces 'auto', which recent scikit-learn releases no longer accept
    "max_features": [None, 'sqrt'],
}
gb_rs = RandomizedSearchCV(estimator=gb, param_distributions=param_grid)
car_pred_model(gb_rs, 'Gradient Boosting')



Model name: Gradient Boosting
Model best parameters: GradientBoostingRegressor(max_depth=8, max_features='sqrt', n_estimators=50)
Train R2 score: 1.0
Test R2 score: 0.88
Train cross validation scores : [0.970482   0.97343777 0.89357626 0.88053944 0.95728419]
Test cross validation  mean : 0.94
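
`RandomizedSearchCV` does not try every combination in `param_grid`; it evaluates a random subset (10 candidates by default, via `n_iter`). The sketch below imitates that sampling step in plain Python for the gradient boosting grid, with `None` meaning "use all features". It is an illustration of the idea rather than scikit-learn's code, and `sample_candidates` is a hypothetical helper.

```python
import itertools
import random

param_grid = {
    "learning_rate": [0.001, 0.01, 0.05, 0.1, 0.3],
    "n_estimators": [25, 50, 100, 150],
    "max_depth": [2, 4, 6, 8],
    "max_features": [None, "sqrt"],  # None = use all features
}

def sample_candidates(grid, n_iter=10, seed=42):
    """Draw `n_iter` distinct parameter combinations from a grid of lists."""
    keys = list(grid)
    all_combos = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]
    rng = random.Random(seed)
    return all_combos, rng.sample(all_combos, n_iter)

all_combos, candidates = sample_candidates(param_grid)
print(len(all_combos), "possible combinations,", len(candidates), "sampled")
for params in candidates[:3]:
    print(params)
```

Only 10 of the 160 combinations are scored per search, so repeated runs can select different parameters unless `random_state` is fixed on the search object.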




## CatBoost

In [13]:
catb = CatBoostRegressor()
param_grid = {
        "iterations": [100],
        "learning_rate":[0.001, 0.01, 0.05, 0.1, 0.3],
        "depth": [3, 6, 9], 
}
catb_rs = RandomizedSearchCV(estimator = catb, param_distributions = param_grid)
car_pred_model(catb_rs, 'CatBoosting')

0:	learn: 737444.6664915	total: 56.1ms	remaining: 5.55s
1:	learn: 732426.8960088	total: 57.4ms	remaining: 2.81s
2:	learn: 726997.8190377	total: 58.2ms	remaining: 1.88s
3:	learn: 721556.7064391	total: 59ms	remaining: 1.42s
4:	learn: 716652.7830482	total: 59.6ms	remaining: 1.13s
5:	learn: 711479.2387723	total: 60.2ms	remaining: 943ms
6:	learn: 705940.5064254	total: 60.9ms	remaining: 809ms
7:	learn: 700941.6345927	total: 61.5ms	remaining: 707ms
8:	learn: 695460.8755177	total: 62.2ms	remaining: 628ms
9:	learn: 690654.8332083	total: 62.7ms	remaining: 565ms
10:	learn: 686242.5866870	total: 63.5ms	remaining: 514ms
11:	learn: 681078.9086050	total: 64.2ms	remaining: 471ms
12:	learn: 675930.8636051	total: 64.7ms	remaining: 433ms
13:	learn: 671023.2462880	total: 65.4ms	remaining: 402ms
14:	learn: 666523.9584887	total: 66.9ms	remaining: 379ms
15:	learn: 662205.9741285	total: 68.2ms	remaining: 358ms
16:	learn: 657944.0748225	total: 68.7ms	remaining: 336ms
17:	learn: 654173.7274271	total: 69.9ms	rem

## Results

In [14]:
model = ["LinearRegression", "Lasso", "RandomForestRegressor", "GradientBoostingRegressor", "CatBoosting"]
results = pd.DataFrame({'Model': model,'R2(Train)': R2_score_train,'R2(Test)': R2_score_test,'CV score mean(Train)': CV})
results

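|   | Model                     | R2(Train) | R2(Test) | CV score mean(Train) |
|---|---------------------------|-----------|----------|----------------------|
| 0 | LinearRegression          | 0.66      | 0.59     | 0.65                 |
| 1 | Lasso                     | 0.66      | 0.59     | 0.65                 |
| 2 | RandomForestRegressor     | 0.99      | 0.86     | 0.93                 |
| 3 | GradientBoostingRegressor | 1.0       | 0.88     | 0.94                 |
| 4 | CatBoosting               | 0.99      | 0.85     | 0.94                 |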
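
To make the comparison explicit, the scores can be ranked by test R2, and the gap between train and test R2 gives a rough indication of overfitting. The snippet below simply re-enters the numbers from the results table; the variable names are illustrative.

```python
# (model, train R2, test R2, mean CV R2 on the training split), copied from the results table
results = [
    ("LinearRegression", 0.66, 0.59, 0.65),
    ("Lasso", 0.66, 0.59, 0.65),
    ("RandomForestRegressor", 0.99, 0.86, 0.93),
    ("GradientBoostingRegressor", 1.0, 0.88, 0.94),
    ("CatBoosting", 0.99, 0.85, 0.94),
]

# Rank by test R2, highest first
ranked = sorted(results, key=lambda row: row[2], reverse=True)
for name, r2_train, r2_test, cv_mean in ranked:
    gap = round(r2_train - r2_test, 2)  # larger gap = more overfitting
    print(f"{name:<26} test={r2_test:.2f} cv={cv_mean:.2f} train-test gap={gap:.2f}")

best_model = ranked[0][0]
print("Best by test R2:", best_model)
```

The tree ensembles all fit the training data almost perfectly while losing 0.12 to 0.14 on the test split, so some overfitting remains even for the best model.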


## Conclusion

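- In this notebook I tried linear models (linear regression, Lasso) and tree-based ensembles (random forest, gradient boosting, CatBoost)
- Tuned hyperparameters with randomized search for Lasso, gradient boosting and CatBoost (the random forest was evaluated with default parameters)
- Compared the models using the R2 score
- Selected the best model: GradientBoostingRegressor (test R2 0.88, CV mean 0.94)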