#### Notebook `Modelling`

#### Group:
- `Miguel Matos - 20221925`
- `André Nicolau - 20221861`
- `André Ferreira - 20250398`

---

#### <font> Table of Contents </font> <a class="anchor" id='toc'></a> 
1. [Imports](#Imports)
2. [Modelling](#Modelling)
- 2.1 [Linear Regression](#linear-regression)
- 2.2 [Random Forest](#random-forest)

# Imports
[Back to TOC](#toc)

In [1]:
from functions import *

In [2]:
X_train = pd.read_csv("../data/Split_data/X_train.csv", index_col= "carID")
y_train = pd.read_csv("../data/Split_data/y_train.csv", index_col= "carID")

X_val = pd.read_csv("../data/Split_data/X_val.csv", index_col= "carID")
y_val = pd.read_csv("../data/Split_data/y_val.csv", index_col= "carID")

Test_data = pd.read_csv("../data/Split_data/test_preprocessed.csv", index_col= "carID")

pd.set_option("display.max_columns", None)

# Modelling
[Back to TOC](#toc)


## Linear Regression

We have to try the model with the variables 'paintQuality%', 'previousOwners', 'Diesel' and without them.

In [3]:
reg = LinearRegression(fit_intercept=True).fit(X_train, y_train)
y_pred = reg.predict(X_val)

In [4]:
params = {"fit_intercept": True}
test_params(LinearRegression, params, X_train, X_val, y_train, y_val)

=== Train Metrics ===
R² Score : 0.7641
MAE      : 3080.24
RMSE     : 4740.73

=== Validation Metrics ===
R² Score : 0.7177
MAE      : 3266.61
RMSE     : 5129.15


Let's try now the linear regression without the three variables that were suggested to exclude during feature selection.

In [5]:
X_train_smaller = X_train.drop(columns=['paintQuality%', 'previousOwners', 'Diesel'])
X_val_smaller = X_val.drop(columns=['paintQuality%', 'previousOwners', 'Diesel'])

In [6]:
reg_1 = LinearRegression(fit_intercept=True).fit(X_train_smaller, y_train)
y_pred_1 = reg_1.predict(X_val_smaller)

In [7]:
params = {"fit_intercept": True}
test_params(LinearRegression, params, X_train, X_val, y_train, y_val)

=== Train Metrics ===
R² Score : 0.7641
MAE      : 3080.24
RMSE     : 4740.73

=== Validation Metrics ===
R² Score : 0.7177
MAE      : 3266.61
RMSE     : 5129.15


## Random Forest 

Let's define the parameters of a default Random forest regressor.

In [8]:
params_baseline = {
    "n_estimators": 200,
    "max_depth": None,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "max_features": "sqrt",
    "bootstrap": True, 
    "random_state" :42, 
    "n_jobs": -1
}

Now, we'll define parameters of a random forest regressor that tends more to not overfit.

In [9]:
params_regularized = {
    "n_estimators": 500,
    "max_depth": 20,
    "min_samples_split": 5,
    "min_samples_leaf": 2,
    "max_features": 0.5,
    "bootstrap": True, 
    "random_state" :42, 
    "n_jobs": -1
}

Finally, we'll try a random forest regression with higher capacity of understanding the data, but taking the risk of overfitting.

In [10]:
params_high_capacity = {
    "n_estimators": 800,
    "max_depth": None,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "max_features": "log2",
    "bootstrap": False,  # often gives slightly different behavior
    "random_state" :42, 
    "n_jobs": -1
}


In [11]:
print("\n===== BASELINE RANDOM FOREST =====")
baseline = test_params(
    RandomForestRegressor, 
    params_baseline, 
    X_train, X_val, y_train, y_val
)

print("\n===== REGULARIZED RANDOM FOREST =====")
regularized = test_params(
    RandomForestRegressor, 
    params_regularized, 
    X_train, X_val, y_train, y_val
)

print("\n===== HIGH-CAPACITY RANDOM FOREST =====")
highcap = test_params(
    RandomForestRegressor, 
    params_high_capacity, 
    X_train, X_val, y_train, y_val
)



===== BASELINE RANDOM FOREST =====


  return fit_method(estimator, *args, **kwargs)


=== Train Metrics ===
R² Score : 0.9898
MAE      : 546.25
RMSE     : 985.29

=== Validation Metrics ===
R² Score : 0.9306
MAE      : 1463.64
RMSE     : 2543.34

===== REGULARIZED RANDOM FOREST =====


  return fit_method(estimator, *args, **kwargs)


=== Train Metrics ===
R² Score : 0.9707
MAE      : 944.10
RMSE     : 1672.18

=== Validation Metrics ===
R² Score : 0.9312
MAE      : 1454.79
RMSE     : 2532.78

===== HIGH-CAPACITY RANDOM FOREST =====


  return fit_method(estimator, *args, **kwargs)


=== Train Metrics ===
R² Score : 1.0000
MAE      : 0.30
RMSE     : 21.57

=== Validation Metrics ===
R² Score : 0.9336
MAE      : 1432.19
RMSE     : 2488.46


The model that performed the best, was the most regulized model, because even though he performed slightly worse than the last model, the last model seemed to overfit the data.

Finally, let's test to run the models without the three, suggested  to deletefrom the feature selection, columns.

In [12]:
print("\n===== BASELINE RANDOM FOREST =====")
baseline = test_params(
    RandomForestRegressor, 
    params_baseline, 
    X_train_smaller, X_val_smaller, y_train, y_val
)

print("\n===== REGULARIZED RANDOM FOREST =====")
regularized = test_params(
    RandomForestRegressor, 
    params_regularized, 
    X_train_smaller, X_val_smaller, y_train, y_val
)

print("\n===== HIGH-CAPACITY RANDOM FOREST =====")
highcap = test_params(
    RandomForestRegressor, 
    params_high_capacity, 
    X_train_smaller, X_val_smaller, y_train, y_val
)



===== BASELINE RANDOM FOREST =====


  return fit_method(estimator, *args, **kwargs)


=== Train Metrics ===
R² Score : 0.9902
MAE      : 531.70
RMSE     : 966.22

=== Validation Metrics ===
R² Score : 0.9350
MAE      : 1408.35
RMSE     : 2460.66

===== REGULARIZED RANDOM FOREST =====


  return fit_method(estimator, *args, **kwargs)


=== Train Metrics ===
R² Score : 0.9672
MAE      : 1016.32
RMSE     : 1766.76

=== Validation Metrics ===
R² Score : 0.9334
MAE      : 1424.84
RMSE     : 2492.30

===== HIGH-CAPACITY RANDOM FOREST =====


  return fit_method(estimator, *args, **kwargs)


=== Train Metrics ===
R² Score : 0.9998
MAE      : 12.14
RMSE     : 139.25

=== Validation Metrics ===
R² Score : 0.9345
MAE      : 1439.58
RMSE     : 2471.35


Based on all the performances, the Random Forest with regulized parameters seemed to give better results, but due to the fact the the basline one gave better results in Kaggle, it is the one we'll keep here in the notebook for the final submission.

# Test data prediction

In [13]:
# Putting the columns in test data in the same order as we have in the train data
Test_data = Test_data.reindex(columns=X_train.columns)

Finally, let's build the final model and make predictions on the test data.

In [14]:
model = RandomForestRegressor(n_estimators = 200,
    max_depth= None,
    min_samples_split = 2,
    min_samples_leaf= 1,
    max_features = "sqrt",
    bootstrap= True, random_state=42, n_jobs=-1)

model.fit(X_train, y_train)

prediction = model.predict(Test_data)

  return fit_method(estimator, *args, **kwargs)


In [15]:
carID_test = Test_data.index.values # Getting an array with the CarId ordered.

submission = pd.DataFrame({
    "price": prediction
}, index=carID_test) #A dataframe with the index as the CarID and the only column present beeing the predictions.

submission.index.name = "carID" #Naming the index as "carID"

In [16]:
submission

Unnamed: 0_level_0,price
carID,Unnamed: 1_level_1
89856,15826.710
106581,25871.360
80886,11710.110
100174,15502.200
81376,30917.185
...,...
105775,14787.595
81363,36248.605
76833,36811.365
91768,18072.170


In [17]:
submission.to_csv("../data/Kaggle_submission_baseline_all_feat.csv")