#### Notebook `Modelling`

#### Group:
- `Miguel Matos - 20221925`
- `André Nicolau - 20221861`
- `André Ferreira - 20250398`

---

#### Table of Contents <a class="anchor" id='toc'></a>
1. [Imports](#Imports)
2. [Modelling](#Modelling)
- 2.1 [Linear Regression](#linear-regression)
- 2.2 [Random Forest](#random-forest)
3. [Test data prediction](#Test-data-prediction)

# Imports
[Back to TOC](#toc)

In [1]:
from functions import *

In [2]:
# Training and validation splits, indexed by car ID
X_train = pd.read_csv("../data/Split_data/X_train.csv", index_col="carID")
y_train = pd.read_csv("../data/Split_data/y_train.csv", index_col="carID")

X_val = pd.read_csv("../data/Split_data/X_val.csv", index_col="carID")
y_val = pd.read_csv("../data/Split_data/y_val.csv", index_col="carID")

# Preprocessed test set used for the final submission
Test_data = pd.read_csv("../data/Split_data/test_preprocessed.csv", index_col="carID")

pd.set_option("display.max_columns", None)

# Modelling
[Back to TOC](#toc)


## Linear Regression

We fit the model twice: once with the variables 'paintQuality%', 'previousOwners' and 'Diesel', and once without them.

In [3]:
reg = LinearRegression(fit_intercept=True).fit(X_train, y_train)
print(reg.score(X_train, y_train))
y_pred = reg.predict(X_val)

0.7641349946098694


In [4]:
calculate_regression_metrics(y_val, y_pred)

Evaluation Metrics
------------------
R² Score : 0.7177
MAE      : 3266.61
RMSE     : 5129.15
------------------


(0.7177, 3266.61, 5129.15)
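`calculate_regression_metrics` comes from the `functions` module, whose source is not shown in this notebook. As a rough illustration of what the three reported numbers mean, here is a minimal sketch of such a helper. The name `regression_metrics_sketch`, the rounding, and the decision to skip the printed report are assumptions, not the actual implementation.

```python
import numpy as np

def regression_metrics_sketch(y_true, y_pred):
    """Hypothetical stand-in: return (R2, MAE, RMSE) as rounded floats."""
    # Flatten so column vectors of shape (n, 1) and 1-D arrays (n,) behave alike
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    r2 = 1 - ss_res / ss_tot
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    return round(float(r2), 4), round(float(mae), 2), round(float(rmse), 2)

# Toy example: one of three predictions is off by 1
metrics = regression_metrics_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

MAE and RMSE are in the units of the target (here, price), while R² is unitless, which is why the tuple mixes values of very different magnitude.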

Now let's fit the linear regression without the three variables that feature selection suggested excluding.

In [5]:
X_train_smaller = X_train.drop(columns=['paintQuality%', 'previousOwners', 'Diesel'])
X_val_smaller = X_val.drop(columns=['paintQuality%', 'previousOwners', 'Diesel'])

In [6]:
reg_1 = LinearRegression(fit_intercept=True).fit(X_train_smaller, y_train)
print(reg_1.score(X_train_smaller, y_train))
y_pred_1 = reg_1.predict(X_val_smaller)

0.7641028301165973


In [7]:
calculate_regression_metrics(y_val, y_pred_1)

Evaluation Metrics
------------------
R² Score : 0.7174
MAE      : 3267.55
RMSE     : 5132.39
------------------


(0.7174, 3267.55, 5132.39)
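The two fits above differ only in the columns passed to the model. The same comparison can be wrapped in a small function, sketched here on synthetic data. The column names `f0` to `f4` and the data itself are made up for illustration and are not the car data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: f0..f2 drive the target, f3 and f4 are pure noise
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
y = 3 * X["f0"] - 2 * X["f1"] + X["f2"] + rng.normal(size=500)

X_tr, X_va = X.iloc[:400], X.iloc[400:]
y_tr, y_va = y.iloc[:400], y.iloc[400:]

def validation_r2(columns):
    """Fit on the training rows using only `columns`; return R2 on the validation rows."""
    model = LinearRegression().fit(X_tr[columns], y_tr)
    return model.score(X_va[columns], y_va)

r2_full = validation_r2(["f0", "f1", "f2", "f3", "f4"])
r2_reduced = validation_r2(["f0", "f1", "f2"])
print(f"full: {r2_full:.4f}  reduced: {r2_reduced:.4f}  difference: {r2_full - r2_reduced:+.4f}")
```

When the dropped columns carry little signal, the two validation scores should be nearly identical, which is the pattern seen in the linear regression results above.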

## Random Forest 

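Let's define the parameters of a baseline random forest regressor: fully grown trees, as in scikit-learn's defaults, but with 200 estimators and `max_features="sqrt"`.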

In [3]:
params_baseline = {
    "n_estimators": 200,
    "max_depth": None,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "max_features": "sqrt",
    "bootstrap": True
}

Next, we define a more regularized random forest regressor that is less prone to overfitting.

In [4]:
params_regularized = {
    "n_estimators": 500,
    "max_depth": 20,
    "min_samples_split": 5,
    "min_samples_leaf": 2,
    "max_features": 0.5,
    "bootstrap": True
}

Finally, we try a random forest regressor with a higher capacity to fit the data, at the risk of overfitting.

In [5]:
params_high_capacity = {
    "n_estimators": 800,
    "max_depth": None,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "max_features": "log2",
    "bootstrap": False  # each tree is fit on the full training set, not a bootstrap sample
}
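The cells below pass these dictionaries to `test_params`, another helper from `functions` whose source is not shown. Here is a minimal sketch of what such a helper might do, assuming it instantiates the estimator class with the given parameters, fits it, and reports training and validation R². The name `evaluate_params_sketch`, its return values, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def evaluate_params_sketch(estimator_cls, params, X_train, X_val, y_train, y_val):
    """Hypothetical stand-in for `test_params`: fit one configuration and score it."""
    model = estimator_cls(**params)
    # np.ravel gives the 1-D target that scikit-learn estimators expect
    model.fit(X_train, np.ravel(y_train))
    train_r2 = r2_score(np.ravel(y_train), model.predict(X_train))
    val_r2 = r2_score(np.ravel(y_val), model.predict(X_val))
    print(f"Train R2 score: {train_r2:.4f}  Validation R2 score: {val_r2:.4f}")
    return model, train_r2, val_r2

# Tiny synthetic data set, only to show the call pattern
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

params_demo = {"n_estimators": 50, "max_depth": None, "max_features": "sqrt", "random_state": 0}
model, train_r2, val_r2 = evaluate_params_sketch(
    RandomForestRegressor, params_demo, X[:200], X[200:], y[:200], y[200:]
)
```

Passing the estimator class rather than an instance lets the same function evaluate any scikit-learn style regressor with a different parameter dictionary.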

In [85]:
baseline = test_params(RandomForestRegressor, params_baseline, X_train, X_val, y_train, y_val)
regularized = test_params(RandomForestRegressor, params_regularized, X_train, X_val, y_train, y_val)
highcap = test_params(RandomForestRegressor, params_high_capacity, X_train, X_val, y_train, y_val)

Train R2 score: 0.9898

Evaluation Metrics
------------------
R² Score : 0.9306
MAE      : 1463.64
RMSE     : 2543.34
------------------
(0.9306, 1463.64, 2543.34)




Train R2 score: 0.9707

Evaluation Metrics
------------------
R² Score : 0.9312
MAE      : 1454.79
RMSE     : 2532.78
------------------
(0.9312, 1454.79, 2532.78)




Train R2 score: 1.0000

Evaluation Metrics
------------------
R² Score : 0.9336
MAE      : 1432.19
RMSE     : 2488.46
------------------
(0.9336, 1432.19, 2488.46)




The regularized model is our preferred configuration. Its validation scores are slightly lower than those of the high-capacity model (R² 0.9312 vs. 0.9336), but the high-capacity model reaches a training R² of 1.0000, which suggests that it overfits the training data.
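One simple way to quantify the overfitting concern is the gap between training and validation R². The short calculation below uses only the scores printed above; the dictionary keys are labels chosen for this example.

```python
# (train R2, validation R2) copied from the outputs above
scores = {
    "baseline": (0.9898, 0.9306),
    "regularized": (0.9707, 0.9312),
    "high_capacity": (1.0000, 0.9336),
}

# A larger gap means the model fits the training data much better than unseen data
gaps = {name: round(train - val, 4) for name, (train, val) in scores.items()}
for name, gap in gaps.items():
    print(f"{name:<14} train-validation gap: {gap:.4f}")

smallest_gap = min(gaps, key=gaps.get)
print("Smallest gap:", smallest_gap)
```

The regularized configuration has the smallest gap (0.0395, against 0.0592 and 0.0664), which is consistent with choosing it despite its marginally lower validation score.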

Finally, let's run the same three configurations without the three columns that feature selection suggested dropping.

In [86]:
baseline = test_params(RandomForestRegressor, params_baseline, X_train_smaller, X_val_smaller, y_train, y_val)
regularized = test_params(RandomForestRegressor, params_regularized, X_train_smaller, X_val_smaller, y_train, y_val)
highcap = test_params(RandomForestRegressor, params_high_capacity, X_train_smaller, X_val_smaller, y_train, y_val)

Train R2 score: 0.9902

Evaluation Metrics
------------------
R² Score : 0.935
MAE      : 1408.35
RMSE     : 2460.66
------------------
(0.935, 1408.35, 2460.66)




Train R2 score: 0.9672

Evaluation Metrics
------------------
R² Score : 0.9334
MAE      : 1424.84
RMSE     : 2492.3
------------------
(0.9334, 1424.84, 2492.3)




Train R2 score: 0.9998

Evaluation Metrics
------------------
R² Score : 0.9345
MAE      : 1439.58
RMSE     : 2471.35
------------------
(0.9345, 1439.58, 2471.35)
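To compare these runs with the earlier ones, the validation RMSE values can be placed side by side. The numbers below are copied from the outputs in this notebook. Note that the parameter dictionaries do not set `random_state`, so unless `test_params` fixes a seed internally (an assumption we cannot check here), part of the difference may be run-to-run variation.

```python
# Validation RMSE copied from the outputs above
rmse_all_columns = {"baseline": 2543.34, "regularized": 2532.78, "high_capacity": 2488.46}
rmse_reduced = {"baseline": 2460.66, "regularized": 2492.30, "high_capacity": 2471.35}

# Negative delta = lower error after dropping the three columns
deltas = {name: round(rmse_reduced[name] - rmse_all_columns[name], 2) for name in rmse_all_columns}
for name, delta in deltas.items():
    print(f"{name:<14} RMSE change after dropping columns: {delta:+.2f}")
```

In these single runs the validation RMSE is lower for all three configurations once the columns are removed.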




In [18]:
from sklearn.model_selection import RandomizedSearchCV

model = RandomForestRegressor()

param_grid = {
    "n_estimators": [200, 400, 600, 800, 1000],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 8],
    # "auto" was removed in scikit-learn 1.3; for regressors it was equivalent to 1.0
    "max_features": ["sqrt", 0.5, 0.7, 1.0],
    "bootstrap": [True, False]
}


random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    cv=5,                 # CV done ONLY inside the training set
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

best_model = random_search.best_estimator_

val_pred = best_model.predict(X_val)


In [19]:
best_model

In [20]:
calculate_regression_metrics(y_val, val_pred)

Evaluation Metrics
------------------
R² Score : 0.935
MAE      : 1422.2
RMSE     : 2461.96
------------------


(0.935, 1422.2, 2461.96)
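Besides `best_estimator_`, a fitted `RandomizedSearchCV` exposes `best_params_` and `best_score_`. With `scoring='neg_root_mean_squared_error'` the score is the negated RMSE, so the sign has to be flipped before reading it as an error. The sketch below shows this on synthetic data with a deliberately tiny search space; the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Small search space so the example runs in seconds
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [10, 20],
        "max_depth": [None, 5],
        "max_features": ["sqrt", 1.0],
    },
    n_iter=4,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

# best_score_ is the mean cross-validated score of the best candidate (negated RMSE here)
best_cv_rmse = -search.best_score_
print("Best parameters:", search.best_params_)
print(f"Best cross-validated RMSE: {best_cv_rmse:.3f}")
```

Comparing this cross-validated RMSE with the validation RMSE gives a quick check on whether the selected configuration generalizes beyond the folds it was tuned on.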

# Test data prediction

In [31]:
#Test_data = Test_data.drop(columns=['paintQuality%', 'previousOwners', 'Diesel'])
Test_data = Test_data.reindex(columns=X_train.columns)
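`reindex(columns=...)` puts the test columns in the same order as the training columns and drops any extras, but it also silently creates all-NaN columns for names that are missing from the test data. A small sketch of that behaviour, using made-up column names:

```python
import pandas as pd

train_cols = ["year", "mileage", "tax"]  # hypothetical training column order

# Test frame with a different column order, one extra column and one missing column
test_df = pd.DataFrame({"tax": [145, 20], "extra": [1, 0], "year": [2017, 2019]})

aligned = test_df.reindex(columns=train_cols)
print(aligned)

# Columns that exist in training but not in the test data come back entirely NaN
missing = aligned.columns[aligned.isna().all()].tolist()
print("Columns absent from the test data:", missing)
```

Running a check like this before prediction helps catch a preprocessing mismatch early, since many estimators reject NaN inputs.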

In [32]:
# Final model; every hyperparameter value here also appears in the search grid above
model = RandomForestRegressor(
    bootstrap=False, max_depth=20, max_features=0.5,
    min_samples_split=10, n_estimators=200, random_state=42, n_jobs=-1,
)
model.fit(X_train, y_train)

prediction = model.predict(Test_data)

In [33]:
carID_test = Test_data.index.values

submission = pd.DataFrame({
    "price": prediction
}, index=carID_test)

submission.index.name = "carID"

In [34]:
submission

| carID | price |
|---|---|
| 89856 | 16205.066931 |
| 106581 | 26191.609904 |
| 80886 | 11789.616885 |
| 100174 | 17172.930875 |
| 81376 | 34561.242473 |
| ... | ... |
| 105775 | 14273.328718 |
| 81363 | 35106.083204 |
| 76833 | 38066.472558 |
| 91768 | 17129.410682 |


In [35]:
submission.to_csv("../data/Kaggle_submission_turbinada.csv")
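Before uploading, it can be worth confirming that the file has the expected `carID,price` header, unique IDs, and no missing prices. The sketch below does this round trip through an in-memory buffer; the IDs and prices are invented placeholders for `Test_data.index` and `prediction`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical IDs and predictions standing in for Test_data.index and `prediction`
car_ids = np.array([101, 102, 103])
prediction = np.array([15000.0, 22000.5, 9800.25])

submission = pd.DataFrame({"price": prediction}, index=pd.Index(car_ids, name="carID"))

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
submission.to_csv(buffer)
header = buffer.getvalue().splitlines()[0]
buffer.seek(0)
check = pd.read_csv(buffer, index_col="carID")

print(header)
print("Unique IDs:", check.index.is_unique, "| Missing prices:", int(check["price"].isna().sum()))
```

The same checks can be run on the real file by replacing the buffer with the path passed to `to_csv`.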