# 4. **ML Modeling - Price Prediction Regression Model**
---


###  **Machine Learning Regression Model for Price Prediction:**
To address the business problem, the proposed solution involves creating a robust machine learning model for predicting the resale price of used cars. This model will take into account various features such as the make, model, mileage, year of manufacture, and any other relevant factors that influence the pricing of used cars.

The model's implementation aims to empower purchasing agents with a tool that can provide a reasonable estimate of the resale value of a potential acquisition. This ensures that the company can make well-informed decisions when acquiring used cars, optimizing the balance between competitive pricing and maintaining healthy profit margins.




###  **Model Evaluation:**
####  Evaluation Metrics:

*   **Mean Absolute Error (MAE):**
    
    *   MAE represents the average absolute difference between the predicted prices and the actual prices.
    *   Business Relation: A low MAE indicates that, on average, the model's predictions are close to the true prices. This is crucial for the business, as it directly aligns with the goal of accurate price prediction, helping purchasing agents make informed decisions.
*   **Mean Squared Error (MSE):**
    
    *   MSE calculates the average squared difference between predicted and actual prices.
    *   Business Relation: MSE gives more weight to large errors. A low MSE indicates that the model is effective in minimizing significant deviations, which is important for ensuring that extreme pricing errors are minimized, contributing to more consistent and reliable predictions.
*   **R-squared (R2):**
    
    *   R2 measures the proportion of the variance in the target variable explained by the model.
    *   Business Relation: A high R2 suggests that a significant portion of the variability in used car prices is captured by the model. This is important for business success as it indicates that the model is effectively leveraging the provided features to predict prices, contributing to better decision-making.

In [1]:
#Library Imports

import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV, KFold

from sklearn.preprocessing import OneHotEncoder, RobustScaler
import category_encoders as ce
from category_encoders import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.compose import TransformedTargetRegressor

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_absolute_percentage_error


In [5]:
#Load Preprocessed Data
cars = pd.read_csv("1. Combined Dataset.csv")
cars.head()

Unnamed: 0.1,Unnamed: 0,model,year,price,transmission,mileage,fuelType,engineSize,tax,mpg,brand
0,0,C Class,2020,30495,Automatic,1200,Diesel,2.0,145.0,61.4,Mercedes
1,1,C Class,2020,29989,Automatic,1000,Petrol,1.5,145.0,46.3,Mercedes
2,2,C Class,2020,37899,Automatic,500,Diesel,2.0,145.0,61.4,Mercedes
3,3,C Class,2019,30399,Automatic,5000,Diesel,2.0,145.0,61.4,Mercedes
4,5,C Class,2019,29899,Automatic,4500,Diesel,2.0,145.0,61.4,Mercedes


## 4.1 Feature Selection

All columns in the dataset are potential features for the model. 
- The target variable is the price of the used car.
- The features are the remaining columns in the dataset.

In [20]:
cars.drop(cars[cars['fuelType'] == "Other"].index, inplace=True)

In [21]:
cars.drop(cars[cars['transmission'] == "Other"].index, inplace=True)

## 4.2 Train Test Split

In [22]:
#Column Selection
X = cars.drop(['price'], axis = 1)
y = cars['price']

# Data Splitting dengan proporsi test size 80:20
xtrain, xtest, ytrain, ytest = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state= 2023)

## 4.3 Columns Encoding

Encoding is a technique used in machine learning to convert categorical data into numerical format. In this project, we've aplied three methods: one-hot encoding for nominal categories, robust scaler Using the robust scaler to standardize the scale of numerical data. One-hot encoding is suitable for categories without a specific order and a limited set of unique values. Here's a breakdown of how each encoding type is applied.

- Onehot: Transmission, FuelType
- Robust: Mileage, Mpg,
- Binary: Model, Brand

In [23]:
transform = ColumnTransformer([
    ('Scaler', RobustScaler(), ['mileage', 'mpg']),
    ('OHE', OneHotEncoder(drop='first'), ['transmission', 'fuelType']),
    ('Binary Encoder', ce.BinaryEncoder(), ['model', 'brand'])
],remainder = "passthrough")

transform

## 4.4 Model Benchmarking

In the initial phase, we will perform modeling on the 7 selected benchmark models. The results from these benchmark models, in terms of scoring, will be based on the chosen evaluation metrics: RMSE, MAE, and MAPE. The selection of these three metrics is to compare the model's performance by examining its residuals.

THe benchmark models are:
- Linear Regression
- KNN Regressor
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost Regressor
- AdaBoost Regressor

In [24]:
# Define the algorithm

lr = LinearRegression()
knn = KNeighborsRegressor()
dt = DecisionTreeRegressor(random_state= 2023)
rf = RandomForestRegressor(random_state= 2023)
ada = AdaBoostRegressor(random_state= 2023)
xgb = XGBRegressor(random_state= 2023)
gbr = GradientBoostingRegressor(random_state= 2023)

models = [lr, knn, dt, rf, ada, xgb, gbr]

score_rmse = []
nilai_mean_rmse = []
nilai_std_rmse = []

score_mae = []
nilai_mean_mae = []
nilai_std_mae = []

score_mape = []
nilai_mean_mape = []
nilai_std_mape = []

score_r2 = []
nilai_mean_r2 = []
nilai_std_r2 = []

# Finding the best algorithm based on each metrics

for i in models:
    
    crossval = KFold(n_splits=5, shuffle=True, random_state=2023)

    estimator = Pipeline([
        ('preprocessing', transform),
        ('model', i)
    ])

    # RMSE
    model_cv_rmse = cross_val_score(
        estimator, 
        xtrain, 
        ytrain, 
        cv=crossval, 
        scoring='neg_root_mean_squared_error', 
        error_score='raise'
        )

    print(model_cv_rmse, i)

    score_rmse.append(model_cv_rmse)
    nilai_mean_rmse.append(model_cv_rmse.mean())
    nilai_std_rmse.append(model_cv_rmse.std())

    # MAE
    model_cv_mae = cross_val_score(
        estimator, 
        xtrain, 
        ytrain, 
        cv=crossval, 
        scoring='neg_mean_absolute_error', 
        error_score='raise'
        )

    print(model_cv_mae, i)

    score_mae.append(model_cv_mae)
    nilai_mean_mae.append(model_cv_mae.mean())
    nilai_std_mae.append(model_cv_mae.std())

    # MAPE
    model_cv_mape = cross_val_score(
        estimator, 
        xtrain, 
        ytrain, 
        cv=crossval, 
        scoring='neg_mean_absolute_percentage_error', 
        error_score='raise'
        )

    print(model_cv_mape, i)

    score_mape.append(model_cv_mape)
    nilai_mean_mape.append(model_cv_mape.mean())
    nilai_std_mape.append(model_cv_mape.std())
    
    model_cv_r2 = cross_val_score(
        estimator, 
        xtrain, 
        ytrain, 
        cv=crossval, 
        scoring='r2', 
        error_score='raise'
        )

    print(model_cv_mape, i)

    score_r2.append(model_cv_r2)
    nilai_mean_r2.append(model_cv_r2.mean())
    nilai_std_r2.append(model_cv_r2.std())

[-4547.21189856 -4496.94210906 -4401.99469956 -4316.44441987
 -4502.16356774] LinearRegression()
[-2912.36575168 -2832.03800874 -2941.99948347 -2850.37452259
 -2858.94742209] LinearRegression()
[-0.20943529 -0.20108219 -0.21822168 -0.21327009 -0.21260896] LinearRegression()
[-0.20943529 -0.20108219 -0.21822168 -0.21327009 -0.21260896] LinearRegression()
[-6906.26642262 -6946.38477817 -6911.29335666 -6872.58239958
 -7107.37596547] KNeighborsRegressor()
[-4328.27934228 -4433.53353527 -4397.26250108 -4346.50833405
 -4375.00067509] KNeighborsRegressor()
[-0.27568741 -0.28596332 -0.28504668 -0.29007608 -0.28333324] KNeighborsRegressor()
[-0.27568741 -0.28596332 -0.28504668 -0.29007608 -0.28333324] KNeighborsRegressor()
[-2769.47783918 -2581.49067933 -2773.66753421 -2529.84728296
 -2810.11695952] DecisionTreeRegressor(random_state=2023)
[-1586.1723929  -1564.03729987 -1563.12782345 -1533.97715275
 -1559.97948762] DecisionTreeRegressor(random_state=2023)
[-0.09465272 -0.09327911 -0.0938001  -

In [25]:
pd.DataFrame({
    'Model': ['Linear Regression', 'KNN Regressor', 'DecisionTree Regressor',
              'RandomForest Regressor', 'AdaBoost Regressor', 'XGBoost Regressor', 'GradientBoosting Regressor'],
    'Mean_RMSE': nilai_mean_rmse,
    'Std_RMSE': nilai_std_rmse,
    'Mean_MAE': nilai_mean_mae,
    'Std_MAE': nilai_std_mae,
    'Mean_MAPE': nilai_mean_mape,
    'Std_MAPE': nilai_std_mape,
    'Mean_R2' : nilai_mean_r2,
    'Std_R2' : nilai_std_r2
}).sort_values('Mean_MAPE',ascending = False)

Unnamed: 0,Model,Mean_RMSE,Std_RMSE,Mean_MAE,Std_MAE,Mean_MAPE,Std_MAPE,Mean_R2,Std_R2
3,RandomForest Regressor,-1983.487274,53.621151,-1200.399393,10.31039,-0.072317,0.000348,0.959165,0.002325
5,XGBoost Regressor,-1995.756143,44.490561,-1282.811456,6.481486,-0.077966,0.00086,0.958687,0.001539
2,DecisionTree Regressor,-2692.920059,114.127647,-1561.458831,16.598947,-0.094057,0.00061,0.924745,0.005294
6,GradientBoosting Regressor,-2939.193533,82.712433,-1974.390591,31.846728,-0.122304,0.001057,0.91041,0.003701
0,Linear Regression,-4452.951339,83.03349,-2879.145038,41.253013,-0.210924,0.005669,0.794409,0.00387
1,KNN Regressor,-6948.780585,82.676653,-4376.116878,37.175668,-0.284021,0.00472,0.499279,0.008035
4,AdaBoost Regressor,-6425.467376,630.238816,-5326.325576,699.248525,-0.464249,0.066525,0.56676,0.083249


> From the benchmarked model, we will select the top 3 models with the best performance to be further tuned and optimized.

## 4.5 Benchmarked Model Tuning & Evaluation

From the benchmarked model, we will select the top 3 models with the best performance to be further tuned and optimized. The tuning process will be done using the GridSearchCV method. The evaluation of the tuned model will be based on the same evaluation metrics as the benchmarked model.

In [40]:
# Benchmarking top 3 Best Model

rf = RandomForestRegressor(random_state= 42)
xgb = XGBRegressor(random_state= 42)
dt = DecisionTreeRegressor(random_state= 42)

models = [rf, xgb, dt]

score_rmse = []
score_mae = []
score_mape = []
score_r2 = []

# Finding the best algorithm based on each metrics

for i in models:
    model = Pipeline([
        ('preprocessing', transform),
        ('model', i)
    ])
    
    model.fit(xtrain, ytrain)
    y_pred = model.predict(xtest)
    
    score_rmse.append(np.sqrt(mean_squared_error(ytest, y_pred)))
    score_mae.append(mean_absolute_error(ytest, y_pred))
    score_mape.append(mean_absolute_percentage_error(ytest, y_pred))
    score_r2.append(r2_score(ytest, y_pred))
    
# Model Evaluation dataframe     
score_before_tuning = pd.DataFrame({'RMSE': score_rmse, 'MAE': score_mae, 'MAPE': score_mape, 'R2': score_r2}, index=['Random Forest', 'XGBoost', 'Decission Tree'])
score_before_tuning

Unnamed: 0,RMSE,MAE,MAPE,R2
Random Forest,2147.58246,1180.189002,0.071134,0.95489
XGBoost,2039.582079,1269.665987,0.07723,0.959313
Decission Tree,2850.889548,1535.581556,0.09143,0.920505


> Random Forest are found to be Best Model before Tuning

### 4.5.1 Random Forest Hyperparameter Tuning

In [27]:
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [None, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

# Create a Random Forest Regressor
rf = RandomForestRegressor(random_state=2023)

# Create a pipeline with data preprocessing and Random Forest model
pipe_rf = Pipeline([
    ('prep', transform),
    ('model', rf)
])

# Hyperparameter tuning with GridSearchCV
grid_rf = GridSearchCV(
    estimator=pipe_rf,
    param_grid=param_grid,
    cv=5,
    scoring=['neg_root_mean_squared_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'r2'],
    refit='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1,
)

In [28]:
#Fit the model from the best parameter
grid_rf.fit(xtrain, ytrain)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [29]:
#Display the best parameter and score
pd.DataFrame(grid_rf.cv_results_).sort_values(\
    by=['rank_test_neg_root_mean_squared_error','rank_test_neg_mean_absolute_error', 'rank_test_neg_mean_absolute_percentage_error', 'rank_test_r2']).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__max_depth,param_model__min_samples_leaf,param_model__min_samples_split,param_model__n_estimators,params,split0_test_neg_root_mean_squared_error,...,std_test_neg_mean_absolute_percentage_error,rank_test_neg_mean_absolute_percentage_error,split0_test_r2,split1_test_r2,split2_test_r2,split3_test_r2,split4_test_r2,mean_test_r2,std_test_r2,rank_test_r2
59,77.775148,4.713273,0.716884,0.081765,20,1,5,200,"{'model__max_depth': 20, 'model__min_samples_l...",-1882.366284,...,0.001152,2,0.960951,0.952739,0.959747,0.961682,0.95681,0.958386,0.003276,1
56,87.733534,4.190319,0.930115,0.081247,20,1,2,200,"{'model__max_depth': 20, 'model__min_samples_l...",-1886.686148,...,0.001176,1,0.960772,0.952919,0.959904,0.961474,0.956817,0.958377,0.003159,2
55,41.600623,2.237336,0.428455,0.020945,20,1,2,100,"{'model__max_depth': 20, 'model__min_samples_l...",-1890.075132,...,0.001244,3,0.960631,0.952908,0.959773,0.961369,0.956636,0.958263,0.003126,3
58,38.723313,2.762005,0.421675,0.053372,20,1,5,100,"{'model__max_depth': 20, 'model__min_samples_l...",-1889.062031,...,0.001208,4,0.960673,0.952732,0.959691,0.961571,0.956593,0.958252,0.00323,4
86,75.644242,2.695483,0.754584,0.099627,30,1,5,200,"{'model__max_depth': 30, 'model__min_samples_l...",-1883.394838,...,0.001185,5,0.960909,0.952516,0.959527,0.961467,0.956682,0.95822,0.003298,5


In [30]:
# Best parameter for Random Forest
print('Random Forest (by GridSearchCV')
print('Best_score:', grid_rf.best_score_)
print('Best_params:', grid_rf.best_params_)

Random Forest (by GridSearchCV
Best_score: -2001.1442276267935
Best_params: {'model__max_depth': 20, 'model__min_samples_leaf': 1, 'model__min_samples_split': 5, 'model__n_estimators': 200}


In [31]:
# Model Random Forest
model = {'RF': RandomForestRegressor(random_state= 2023)}

# Define model terhadap estimator terbaik
rf_tuning = grid_rf.best_estimator_

# Fitting model
rf_tuning.fit(xtrain, ytrain)

# Predict test set
y_pred_rf_tuning = rf_tuning.predict(xtest)

# Simpan nilai metrics RMSE, MAE & MAPE setelah tuning
rmse_rf_tuning = np.sqrt(mean_squared_error(ytest, y_pred_rf_tuning))
mae_rf_tuning = mean_absolute_error(ytest, y_pred_rf_tuning)
mape_rf_tuning = mean_absolute_percentage_error(ytest, y_pred_rf_tuning)
r2_rf_tuning = r2_score(ytest, y_pred_rf_tuning)

score_after_tuning_rf = pd.DataFrame({'RMSE': rmse_rf_tuning, 'MAE': mae_rf_tuning, 'MAPE': mape_rf_tuning, 'R2': r2_rf_tuning}, index=model.keys())
score_after_tuning_rf

Unnamed: 0,RMSE,MAE,MAPE,R2
RF,2143.942062,1173.063032,0.07079,0.955042


### 4.5.2 XGBOOST Hyperparameter Tuning

In [32]:
param_grid_xgb = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [3, 6, 9],  # XGBoost uses maximum depth instead of None
    'model__learning_rate': [0.01, 0.1, 0.2],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0],
}

# Create an XGBoost Regressor
xgb = XGBRegressor(random_state=2023)

# Create a pipeline with data preprocessing and XGBoost model
pipe_xgb = Pipeline([
    ('prep', transform),
    ('model', xgb)
])

# Hyperparameter tuning with GridSearchCV
grid_xgb = GridSearchCV(
    estimator=pipe_xgb,
    param_grid=param_grid_xgb,
    cv=5,
    scoring={
        'neg_root_mean_squared_error': 'neg_root_mean_squared_error',
        'neg_mean_absolute_error': 'neg_mean_absolute_error',
        'neg_mean_absolute_percentage_error': 'neg_mean_absolute_percentage_error',
        'r2': 'r2'
    },
    refit='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1,
)
grid_xgb.fit(xtrain, ytrain)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [33]:
pd.DataFrame(grid_xgb.cv_results_).sort_values(\
    by=['rank_test_neg_root_mean_squared_error','rank_test_neg_mean_absolute_error', 'rank_test_neg_mean_absolute_percentage_error', 'rank_test_r2']).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__colsample_bytree,param_model__learning_rate,param_model__max_depth,param_model__n_estimators,param_model__subsample,params,...,std_test_neg_mean_absolute_percentage_error,rank_test_neg_mean_absolute_percentage_error,split0_test_r2,split1_test_r2,split2_test_r2,split3_test_r2,split4_test_r2,mean_test_r2,std_test_r2,rank_test_r2
34,6.015324,0.424151,0.409106,0.058275,0.8,0.1,9,200,0.8,"{'model__colsample_bytree': 0.8, 'model__learn...",...,0.001619,2,0.964188,0.963792,0.964178,0.966106,0.960186,0.96369,0.001929,1
35,4.73335,0.316504,0.313163,0.049776,0.8,0.1,9,200,1.0,"{'model__colsample_bytree': 0.8, 'model__learn...",...,0.001394,1,0.963741,0.963452,0.962973,0.966446,0.960374,0.963397,0.001936,2
53,4.638005,0.214401,0.345875,0.045597,0.8,0.2,9,200,1.0,"{'model__colsample_bytree': 0.8, 'model__learn...",...,0.000918,3,0.962269,0.9643,0.961198,0.965495,0.960198,0.962692,0.001953,3
88,6.071372,0.5352,0.358841,0.05093,1.0,0.1,9,200,0.8,"{'model__colsample_bytree': 1.0, 'model__learn...",...,0.001133,4,0.963743,0.961212,0.962931,0.965719,0.959819,0.962685,0.002038,4
51,2.626181,0.28122,0.171941,0.044498,0.8,0.2,9,100,1.0,"{'model__colsample_bytree': 0.8, 'model__learn...",...,0.000987,7,0.962131,0.964177,0.961266,0.965686,0.959738,0.962599,0.002107,5


In [34]:
print('XGBoost (by GridSearchCV')
print('Best_score:', grid_xgb.best_score_)
print('Best_params:', grid_xgb.best_params_)

XGBoost (by GridSearchCV
Best_score: -1869.5205986853703
Best_params: {'model__colsample_bytree': 0.8, 'model__learning_rate': 0.1, 'model__max_depth': 9, 'model__n_estimators': 200, 'model__subsample': 0.8}


In [35]:
# Model Random Forest
model = {'XGB': XGBRegressor(random_state= 2023)}

# Define model terhadap estimator terbaik
xgb_tuning = grid_xgb.best_estimator_

# Fitting model
xgb_tuning.fit(xtrain, ytrain)

# Predict test set
y_pred_xgb_tuning = xgb_tuning.predict(xtest)

# Simpan nilai metrics RMSE, MAE & MAPE setelah tuning
rmse_xgb_tuning = np.sqrt(mean_squared_error(ytest, y_pred_xgb_tuning))
mae_xgb_tuning = mean_absolute_error(ytest, y_pred_xgb_tuning)
mape_xgb_tuning = mean_absolute_percentage_error(ytest, y_pred_xgb_tuning)
r2_xgb_tuning = r2_score(ytest, y_pred_xgb_tuning)

score_after_tuning_xgb = pd.DataFrame({'RMSE': rmse_xgb_tuning, 'MAE': mae_xgb_tuning, 'MAPE': mape_xgb_tuning, 'R2': r2_xgb_tuning}, index=model.keys())
score_after_tuning_xgb

Unnamed: 0,RMSE,MAE,MAPE,R2
XGB,1938.151081,1122.310301,0.067693,0.963259


### 4.5.3 Decision Tree Hyperparameter Tuning

In [36]:
param_grid_dt = {
    'model__max_depth': [None, 3, 6, 9],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

# Create a Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=2023)

# Create a pipeline with data preprocessing and Decision Tree model
pipe_dt = Pipeline([
    ('prep', transform),
    ('model', dt)
])

# Hyperparameter tuning with GridSearchCV
grid_dt = GridSearchCV(
    estimator=pipe_dt,
    param_grid=param_grid_dt,
    cv=5,
    scoring={
        'neg_root_mean_squared_error': 'neg_root_mean_squared_error',
        'neg_mean_absolute_error': 'neg_mean_absolute_error',
        'neg_mean_absolute_percentage_error': 'neg_mean_absolute_percentage_error',
        'r2': 'r2'
    },
    refit='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1,
)

# Fit the model
grid_dt.fit(xtrain, ytrain)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


In [37]:
pd.DataFrame(grid_dt.cv_results_).sort_values(\
    by=['rank_test_neg_root_mean_squared_error','rank_test_neg_mean_absolute_error', 'rank_test_neg_mean_absolute_percentage_error', 'rank_test_r2']).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__max_depth,param_model__min_samples_leaf,param_model__min_samples_split,params,split0_test_neg_root_mean_squared_error,split1_test_neg_root_mean_squared_error,...,std_test_neg_mean_absolute_percentage_error,rank_test_neg_mean_absolute_percentage_error,split0_test_r2,split1_test_r2,split2_test_r2,split3_test_r2,split4_test_r2,mean_test_r2,std_test_r2,rank_test_r2
5,0.726059,0.072291,0.046875,0.006213,,2,10,"{'model__max_depth': None, 'model__min_samples...",-2361.340998,-2679.068659,...,0.001179,4,0.938551,0.926381,0.932522,0.94151,0.934428,0.934678,0.0052,1
8,0.825788,0.087214,0.047074,0.004343,,4,10,"{'model__max_depth': None, 'model__min_samples...",-2277.461601,-2662.866466,...,0.001302,1,0.942839,0.927269,0.93299,0.939697,0.930012,0.934561,0.005852,2
6,0.784304,0.0956,0.049072,0.004906,,4,2,"{'model__max_depth': None, 'model__min_samples...",-2296.692321,-2681.063542,...,0.001246,2,0.941869,0.926272,0.934128,0.939204,0.929039,0.934102,0.005888,3
7,0.693147,0.064163,0.045678,0.003909,,4,5,"{'model__max_depth': None, 'model__min_samples...",-2296.692321,-2681.063542,...,0.001246,2,0.941869,0.926272,0.934128,0.939204,0.929039,0.934102,0.005888,3
2,0.761091,0.044979,0.044481,0.004829,,1,10,"{'model__max_depth': None, 'model__min_samples...",-2347.314052,-2692.568313,...,0.001273,5,0.939279,0.925638,0.930166,0.940691,0.933963,0.933947,0.005608,5


In [38]:
print('Decission Tree (by GridSearchCV')
print('Best_score:', grid_dt.best_score_)
print('Best_params:', grid_dt.best_params_)

Decission Tree (by GridSearchCV
Best_score: -2507.608495839723
Best_params: {'model__max_depth': None, 'model__min_samples_leaf': 2, 'model__min_samples_split': 10}


In [39]:
# Model Random Forest
model = {'DT': DecisionTreeRegressor(random_state= 2023)}

# Define model terhadap estimator terbaik
dt_tuning = grid_dt.best_estimator_

# Fitting model
dt_tuning.fit(xtrain, ytrain)

# Predict test set
y_pred_dt_tuning = dt_tuning.predict(xtest)

# Simpan nilai metrics RMSE, MAE & MAPE setelah tuning
rmse_dt_tuning = np.sqrt(mean_squared_error(ytest, y_pred_dt_tuning))
mae_dt_tuning = mean_absolute_error(ytest, y_pred_dt_tuning)
mape_dt_tuning = mean_absolute_percentage_error(ytest, y_pred_dt_tuning)
r2_dt_tuning = r2_score(ytest, y_pred_dt_tuning)

score_after_tuning_dt = pd.DataFrame({'RMSE': rmse_dt_tuning, 'MAE': mae_dt_tuning, 'MAPE': mape_dt_tuning, 'R2': r2_dt_tuning}, index=model.keys())
score_after_tuning_dt

Unnamed: 0,RMSE,MAE,MAPE,R2
DT,2766.608121,1443.036708,0.086504,0.925136


---

### 4.5.4 Top 3 Model Evaluation

In [8]:
# Model Evaluation dataframe
# Create dataframe from tuned models and rank them
score_after_tuning = pd.concat([score_after_tuning_rf, score_after_tuning_xgb, score_after_tuning_dt]).sort_values('MAPE', ascending=False)
score_after_tuning

NameError: name 'score_after_tuning_rf' is not defined

> ### XGBOOST are found to be Best Model after Tuning

## 4.6 Prediction output simulation

In [7]:
#Sample data test simulation
sample = cars.sample(200)
sample.head()

Unnamed: 0.1,Unnamed: 0,model,year,price,transmission,mileage,fuelType,engineSize,tax,mpg,brand
23826,27638,Kuga,2017,15950,Manual,24845,Diesel,2.0,145.0,60.1,Ford
33647,37510,Kuga,2017,16495,Automatic,19278,Diesel,2.0,150.0,54.3,Ford
52399,56614,RAV4,2017,18490,Automatic,30724,Petrol,2.0,205.0,43.5,Toyota
29254,33084,Fiesta,2017,8499,Manual,30940,Petrol,1.2,125.0,54.3,Ford
35227,39147,Focus,2016,10699,Automatic,26796,Petrol,1.0,125.0,51.4,Ford


In [None]:
sample_pred = pipe_xgb.predict(sample.drop('price', axis=1))
result_df = pd.DataFrame({"Prediction": sample_pred, "Actual": sample['price'].values})

result_df["diff"] = result_df["Prediction"] - result_df["Actual"]
pd.DataFrame(result_df)


In [65]:
pred = pipe_xgb.predict(sample)
result_df = pd.DataFrame({
    'Predicted': pred,
    'Actual': sample['price'],
    'Diff': pred - sample['price'],
    'Percentage': round(((pred - sample['price']) / sample['price']) * 100, 2)
})
result_df

Unnamed: 0,Predicted,Actual,Diff,Percentage
24534,7489.395996,7991,-501.604004,-6.28
29495,14489.902344,14499,-9.097656,-0.06
29794,13036.708008,15390,-2353.291992,-15.29
70852,44300.343750,47500,-3199.656250,-6.74
56910,22984.970703,21995,989.970703,4.50
...,...,...,...,...
44060,18150.208984,19000,-849.791016,-4.47
7905,28981.751953,30199,-1217.248047,-4.03
70116,6863.679688,6766,97.679688,1.44
34231,10122.005859,8999,1123.005859,12.48


## 4.7 Conclusion

- XGBOOST are found to be Best Model after Tuning, with MAE of  and R2 of 
- The model is able to predict the price of used cars with  accuracy.


## 4.8 Recommendation
- 

---