# 4. Model Training and Evaluation

This notebook trains two regression models on the preprocessed data and evaluates them on the held-out test set. We compare a simple Linear Regression baseline with a more complex Random Forest model, using R-squared as the metric.

In [1]:
import os
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

## Data Loading

In [2]:
# Load the processed data
input_data_dir = os.path.join("..", "data", "processed")
X_train = pd.read_csv(os.path.join(input_data_dir, "X_train_processed.csv"))
y_train = pd.read_csv(os.path.join(input_data_dir, "y_train_processed.csv"))
X_test = pd.read_csv(os.path.join(input_data_dir, "X_test_processed.csv"))
y_test = pd.read_csv(os.path.join(input_data_dir, "y_test_processed.csv"))

# Convert y_train and y_test to 1D arrays
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


X_train shape: (431, 13)
y_train shape: (431,)
X_test shape: (108, 13)
y_test shape: (108,)


## Model Training and Evaluation

In [3]:
# Train and evaluate the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
r2_lr = r2_score(y_test, y_pred_lr)

# Train and evaluate the Random Forest model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
r2_rf = r2_score(y_test, y_pred_rf)

print("--- Final Model Performance Comparison ---")
print(f"Linear Regression R-squared: {r2_lr:.4f}")
print(f"Random Forest R-squared: {r2_rf:.4f}")

--- Final Model Performance Comparison ---
Linear Regression R-squared: 0.7110
Random Forest R-squared: 0.6267
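
For reference, the R-squared reported here is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true targets. It is 1 for perfect predictions, 0 for a model that always predicts the mean of the test targets, and negative for anything worse than that. The sketch below computes it by hand in plain Python. The name `r2_manual` and the toy values are illustrative assumptions, not part of the notebook. For a single unweighted target it should agree with `r2_score`, except that this sketch raises an error when the true targets are constant.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        # Choice made for this sketch: treat the constant-target case as undefined
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Toy values, not taken from the notebook's data
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]
print(f"R-squared: {r2_manual(y_true, y_pred):.4f}")  # R-squared: 0.9250
```

Read this way, a test score of about 0.71 means the model's squared error is roughly 29% of what the predict-the-mean baseline would incur on the same test targets.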


## Conclusion

After removing the data leakage and retraining the models, the R-squared scores are much more realistic. On the test set, the Linear Regression model achieves an R-squared of 0.7110 and the Random Forest model achieves 0.6267. Scores in this range are consistent with the leakage having been removed, although test-set R-squared alone does not rule out overfitting; confirming that requires comparing against the training-set scores. Interestingly, the simpler Linear Regression model outperforms the more complex Random Forest model on this dataset. The Random Forest was run with default hyperparameters and the test set has only 108 rows, so this suggests, but does not establish, that a linear model is a better fit for this particular problem.
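
One way to check the overfitting question directly is to compare each model's R-squared on the training data with its score on the test data. A large gap points to overfitting, which is common for tree ensembles. The following is a minimal sketch, assuming an already-fitted regressor that exposes scikit-learn's `score(X, y)` method (which returns R-squared for regressors). The helper name `r2_gap` is hypothetical and is not defined elsewhere in this notebook.

```python
def r2_gap(model, X_train, y_train, X_test, y_test):
    """Return (train R2, test R2, gap) for an already-fitted regressor.

    Assumes `model.score(X, y)` returns R-squared, as scikit-learn
    regressors do. A large positive gap suggests overfitting.
    """
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    return train_r2, test_r2, train_r2 - test_r2
```

With the variables from this notebook, it could be called as `r2_gap(rf_model, X_train, y_train, X_test, y_test)` and likewise for `lr_model`. No gap values are reported here because that comparison was not run in this notebook.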