## Model 2 - Linear SVR (Baseline model)

## Load Dataset

In [1]:
import pandas as pd
dfclean2 = pd.read_csv('/Volumes/GoogleDrive/My Drive/MScA 2022 WINTER/MSCA 31008 5 Data Mining Principles/Project/clean_dataset2.csv')

In [2]:
dfclean2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120987 entries, 0 to 120986
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   price             120987 non-null  int64
 1   year              120987 non-null  int64
 2   condition         120987 non-null  int64
 3   cylinders         120987 non-null  int64
 4   fuel              120987 non-null  int64
 5   odometer          120987 non-null  int64
 6   title_status      120987 non-null  int64
 7   transmission      120987 non-null  int64
 8   drive             120987 non-null  int64
 9   type              120987 non-null  int64
 10  state             120987 non-null  int64
 11  MSRP              120987 non-null  int64
 12  car_age           120987 non-null  int64
 13  is_vintage        120987 non-null  int64
 14  is_color_neutral  120987 non-null  int64
dtypes: int64(15)
memory usage: 13.8 MB


In [3]:
dfclean2.head()

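|   | price | year | condition | cylinders | fuel | odometer | title_status | transmission | drive | type | state | MSRP | car_age | is_vintage | is_color_neutral |
|---|-------|------|-----------|-----------|------|----------|--------------|--------------|-------|------|-------|------|---------|------------|------------------|
| 0 | 22590 | 2010 | 2 | 4 | 2 | 71229 | 0 | 2 | 0 | 8 | 1 | 46110 | 11 | 0 | 0 |
| 1 | 4500 | 1992 | 0 | 3 | 2 | 192000 | 0 | 0 | 0 | 0 | 1 | 25695 | 29 | 0 | 0 |
| 2 | 14000 | 2012 | 0 | 3 | 2 | 95000 | 0 | 0 | 1 | 5 | 1 | 37775 | 9 | 0 | 1 |
| 3 | 32990 | 2019 | 2 | 4 | 4 | 6897 | 0 | 2 | 0 | 8 | 1 | 38400 | 2 | 0 | 1 |
| 4 | 2100 | 2006 | 1 | 1 | 2 | 97000 | 0 | 0 | 0 | 4 | 1 | 21495 | 15 | 0 | 0 |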


## Modeling

In [4]:
X = dfclean2.drop('price',axis=1)
y = dfclean2['price']

In [5]:
print(X.shape)
print(y.shape)

(120987, 14)
(120987,)


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=.2,
    random_state=0)
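
As a quick sanity check on the split, the sizes of the two subsets can be worked out by hand. The sketch below assumes that scikit-learn rounds the test fraction up to a whole number of rows and gives every remaining row to the training set; the variable names are illustrative and not part of the notebook.

```python
import math

n_samples = 120987   # rows reported by dfclean2.info()
test_size = 0.2      # fraction passed to train_test_split

# Assumption: the test count is the fraction of rows, rounded up,
# and the training set receives every remaining row.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Comparing these counts with `X_train.shape` and `X_test.shape` in the notebook is a cheap way to confirm the split behaved as expected.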

In [7]:
# LinearSVR without Standard Scaler
from sklearn.svm import LinearSVR

LSVR = LinearSVR(random_state=0)
LSVR.fit(X_train, y_train)
y_pred = LSVR.predict(X_test)



In [21]:
# LinearSVR with Standard Scaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LSVRs = make_pipeline(StandardScaler(), LinearSVR(random_state=0))
LSVRs.fit(X_train, y_train)
y_preds = LSVRs.predict(X_test)

The model fitted without StandardScaler failed to converge.
We therefore use the StandardScaler pipeline (LSVRs), which standardizes the features before fitting. This is the model we evaluate.
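
To illustrate what the scaler changes, here is a dependency-free sketch of standardization: each column has its mean subtracted and is divided by its population standard deviation, which is what StandardScaler does per feature. The helper `standardize` is hypothetical, and the values are the first five rows shown by `dfclean2.head()`, so the numbers describe only those rows, not the full dataset.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit variance (population std)."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]

# Values copied from the first five rows shown by dfclean2.head().
odometer = [71229, 192000, 95000, 6897, 97000]
car_age = [11, 29, 9, 2, 15]

odometer_scaled = standardize(odometer)
car_age_scaled = standardize(car_age)

# Before scaling, the two columns differ by several orders of magnitude;
# afterwards both are centred on 0 with a spread of 1.
print(max(odometer) / max(car_age))
print([round(v, 2) for v in odometer_scaled])
print([round(v, 2) for v in car_age_scaled])
```

Putting features on a comparable scale is a common remedy when a linear solver struggles to converge, which is consistent with what we observed here.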

## Evaluation

In [27]:
# Define a function for output statistics

import numpy as np
from sklearn import metrics

def reg_metrics(pred_model, y_pred, x_train, x_test, y_train, y_test):
    """Print regression metrics for a fitted model.

    Takes the fitted model, its test-set predictions, and the training
    and testing sets, and outputs the metrics below:
    1. R² (coefficient of determination)
    2. Adjusted R²
    3. Mean Squared Error (MSE)
    4. Root Mean Squared Error (RMSE)
    5. Mean Absolute Percentage Error (MAPE)
    6. Mean Absolute Error (MAE)
    """
    
    #1-2 Coefficient of Determination (R² & Adjusted R²)
    print("\n\t--- Coefficient of Determination (R² & Adjusted R²) ---")
    r2 = metrics.r2_score(y_pred=y_pred, y_true=y_test)
    # Note: n and p are taken from the training set, while r2 above is scored on the test set.
    adj_r2 = 1 - (1-r2)*(len(y_train)-1)/(len(y_train)-x_train.shape[1]-1)

    print(f"R²\t\t: {r2}")
    print(f"Adjusted R²\t: {adj_r2}")
    
    #3-6 MSE, RMSE, MAPE, MAE
    print("\n\t--- Error Metrics ---")
    
    # MSE, RMSE
    y_train_pred = pred_model.predict(x_train)
    train_mse = metrics.mean_squared_error(y_pred=y_train_pred, y_true=y_train)
    train_rmse = np.sqrt(train_mse)  # avoids the `squared` argument, deprecated in newer scikit-learn
    print(f"Train_MSE\t: {train_mse}")
    print(f"Train_RMSE\t: {train_rmse}")
    
    test_mse = metrics.mean_squared_error(y_pred=y_pred, y_true=y_test)
    test_rmse = np.sqrt(test_mse)  # avoids the `squared` argument, deprecated in newer scikit-learn
    print(f"Test_MSE\t: {test_mse}")
    print(f"Test_RMSE\t: {test_rmse}")

    # MAPE, reported as a fraction (0.43 means 43%); assumes y_test contains no zero prices
    residual = y_test - y_pred
    ape = np.abs(residual) / y_test
    mape = np.mean(ape)
    print(f"MAPE\t: {mape}")
    
    #MAE
    mae = metrics.mean_absolute_error(y_pred=y_pred, y_true=y_test)
    print(f"MAE\t: {mae}")
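
The MAPE step in the function divides each absolute residual by the true price, so it is undefined when a true price is zero. Below is a minimal standalone sketch of the same calculation with that case made explicit. The function name `mape` and the sample prices are hypothetical and chosen only for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction."""
    if any(actual == 0 for actual in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    errors = [abs(actual - pred) / abs(actual)
              for actual, pred in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Hypothetical prices and predictions, chosen only for illustration.
actual_prices = [10000, 20000, 4000]
predicted_prices = [9000, 23000, 5000]

print(mape(actual_prices, predicted_prices))
```

scikit-learn also provides `metrics.mean_absolute_percentage_error`, which could replace the manual calculation if the installed version includes it.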

In [28]:
reg_metrics(pred_model=LSVRs, 
            y_pred=y_preds, 
            x_train=X_train, 
            x_test=X_test, 
            y_train=y_train, 
            y_test=y_test)


	--- Coefficient of Determination (R² & Adjusted R²) ---
R²		: 0.6485865504476755
Adjusted R²	: 0.6485357125336312

	--- Error Metrics ---
Train_MSE	: 31506201.26783804
Train_RMSE	: 5613.0385058217835
Test_MSE	: 31652183.040777557
Test_RMSE	: 5626.027287596245
MAPE	: 0.42673753028759825
MAE	: 3906.0195903321282
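
As a worked check of the figures above, the adjusted R² and the test RMSE can be recomputed from the reported values alone. This sketch assumes the training set has 96789 rows (120987 rows minus a 20% test set rounded up) and 14 features, matching how `reg_metrics` takes n and p from the training data.

```python
import math

r2 = 0.6485865504476755   # test-set R² reported above
n_train = 96789           # assumed training rows (see lead-in)
n_features = 14           # columns in X

# Same formula as in reg_metrics: penalise R² for the number of features.
adj_r2 = 1 - (1 - r2) * (n_train - 1) / (n_train - n_features - 1)
print(adj_r2)

test_mse = 31652183.040777557  # Test_MSE reported above
print(math.sqrt(test_mse))     # RMSE is the square root of MSE
```

With this many rows and only 14 features, the adjustment is tiny, which is why R² and adjusted R² agree to about four decimal places.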


<b>Notes:</b>

We initially planned to use SVR, but the model failed to run on our machine. Kernel SVMs are generally known to scale poorly as the number of samples grows, so we use LinearSVR as an alternative.
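
A rough back-of-the-envelope calculation shows why sample count matters for kernel methods. The sketch assumes a dense float64 kernel matrix over the training rows were held in memory. libsvm does not necessarily do this (it works with a kernel cache), so treat the figure as an illustration of quadratic growth rather than a measurement of what SVR actually allocated.

```python
n_train = 96789          # assumed training rows (80% of 120987, test set rounded up)
bytes_per_float64 = 8

# A full kernel (Gram) matrix has one entry per pair of training rows.
gram_entries = n_train ** 2
gram_bytes = gram_entries * bytes_per_float64

print(f"{gram_entries:,} kernel entries")
print(f"{gram_bytes / 1e9:.1f} GB if stored densely")
```

A linear model such as LinearSVR avoids the pairwise kernel entirely and works with one weight per feature instead.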

<b>Reading:</b>

Tuning SVR to reduce running time, option: use LinearSVR <br>
https://datascience.stackexchange.com/questions/989/svm-using-scikit-learn-runs-endlessly-and-never-completes-execution<br>
https://stackoverflow.com/questions/47460201/scikit-learn-svr-runs-very-long<br>
https://stackoverflow.com/questions/15582669/how-to-speed-up-sklearn-svr?rq=1<br>

Function details:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR<br>
- Linear SVR is similar to SVR with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
- This class supports both dense and sparse input.<br>
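
To make the "choice of loss functions" concrete, LinearSVR's `loss` parameter accepts `'epsilon_insensitive'` and `'squared_epsilon_insensitive'`. The plain-Python sketch below shows what the two losses compute for a single prediction. The function names mirror the option names but are hypothetical helpers, and the price, prediction, and tube width are made up for illustration.

```python
def epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """L1 loss: errors inside the epsilon tube cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def squared_epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """L2 loss: the square of the epsilon-insensitive loss."""
    return epsilon_insensitive(y_true, y_pred, epsilon) ** 2

# Hypothetical price, prediction, and tube width for illustration.
price, prediction, tube = 22590, 21000, 500

print(epsilon_insensitive(price, prediction, tube))
print(squared_epsilon_insensitive(price, prediction, tube))
```

The squared variant penalises large misses much more heavily, which may matter for a price target with a long right tail.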


https://towardsdatascience.com/hyperparameter-tuning-for-support-vector-machines-c-and-gamma-parameters-6a5097416167<br>


SVM using Grid Search CV still takes a long time to run<br>
https://stackoverflow.com/questions/57495123/optimizing-svr-parameters-using-gridsearchcv<br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html