# Model Building, Training and Evaluation

In this notebook, we will be building, training, and evaluating machine learning models for our housing prices prediction task.

This notebook continues our end-to-end ML workflow from the [previous notebook](2_data_cleaning_and_processing_and_feature_engineering.ipynb), where we performed several preprocessing and feature engineering steps to prepare our dataset for modeling. Here's a summary of what we've done so far:

- Data Cleaning: We handled missing values in the dataset. For some features, NaN values were replaced with 'None' or 0, depending on the feature's description in the data dictionary. For others, we used the mode or median value of the feature.

- Outlier Removal: We identified potential outliers in the dataset using scatter plots and removed them based on certain criteria. This step is crucial to prevent our models from being unduly influenced by extreme values.

- Categorical Feature Encoding: We converted categorical variables into a format that can be provided to a machine learning algorithm. We used ordinal encoding for ordinal features and one-hot encoding for nominal features.

- Feature Engineering: We created new features based on domain knowledge and exploratory data analysis. These new features can help improve the performance of our models by introducing new information or highlighting certain patterns.

- Normalization: We applied a log transformation to the numerical features in the dataset to stabilize the variance, reduce skewness, and diminish the influence of outliers.

- Feature Scaling: We standardized the features in the dataset to have a mean of 0 and a standard deviation of 1. This step is important for many machine learning algorithms that are sensitive to the scale of the features.

- Feature Selection: We calculated Mutual Information (MI) scores for the features and identified those with an MI score of 0. An estimated score of 0 means no measurable dependence on the target was detected, so these features are unlikely to add predictive power on their own and were dropped from the dataset.
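
To make the feature selection step concrete, here is a minimal sketch of how zero-MI features could be identified and dropped with scikit-learn's `mutual_info_regression`. The data is synthetic and the column names (`signal`, `noise_a`, `noise_b`) are hypothetical; the previous notebook's actual code may differ.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in for the housing data (hypothetical column names)
rng = np.random.default_rng(0)
n_samples = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n_samples),
    "noise_a": rng.normal(size=n_samples),
    "noise_b": rng.normal(size=n_samples),
})
y = 2.0 * X["signal"] + rng.normal(scale=0.1, size=n_samples)

# Estimate the MI between each feature and the target
mi_scores = pd.Series(
    mutual_info_regression(X, y, random_state=0),
    index=X.columns,
).sort_values(ascending=False)

# Drop only the features whose estimated MI is exactly zero
zero_mi_features = mi_scores[mi_scores == 0].index.tolist()
X_reduced = X.drop(columns=zero_mi_features)

print(mi_scores)
print(f"Dropped features: {zero_mi_features}")
```

The MI estimate is based on nearest neighbours, so pure-noise features can still receive a small positive score. A threshold slightly above zero is a possible alternative, though that is not what we did here.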

We have saved different versions of the dataset after each major step. This allows us to experiment with different combinations of preprocessing and feature engineering steps and see which one produces the best results. We have three main paths for our modeling phase (the section headings below call them Directions):

- Path 1 - Model with All Features: In this path, we will use the dataset that includes all features (original and engineered) without any feature selection. This will serve as our baseline.

- Path 2 - Model with Zero MI Features Dropped: In this path, we will use the dataset where features with an MI score of 0 have been dropped. This allows us to see if removing these non-informative features improves the performance of our models.

- Path 3 - Further Feature Selection and Dimensionality Reduction: In this path, we can explore other feature selection techniques and dimensionality reduction methods, such as Principal Component Analysis (PCA). This can help us reduce the dimensionality of our dataset and potentially improve the performance and interpretability of our models.
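
Path 3 is not implemented in this part of the notebook, so the following is only a sketch of what a PCA-based variant could look like. It uses synthetic data, and the choice of Ridge as the downstream model and of 95% retained variance are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data: 30 correlated features generated from 5 latent factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(400, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + rng.normal(scale=0.05, size=(400, 30))
y = latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Keep enough principal components to explain 95% of the variance,
# then fit Ridge on the reduced representation
pca_ridge = make_pipeline(PCA(n_components=0.95, svd_solver="full"), Ridge())
pca_ridge.fit(X_train, y_train)

n_components = pca_ridge.named_steps["pca"].n_components_
rmse = np.sqrt(mean_squared_error(y_val, pca_ridge.predict(X_val)))
print(f"Components kept: {n_components}")
print(f"PCA + Ridge RMSE: {rmse}")
```

Putting PCA inside the pipeline means the components are learned from the training split only, which avoids leaking information from the validation set.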

In this notebook, we will train several regression models on these datasets (linear regression, Ridge and Lasso regression, XGBoost, and a neural network) and evaluate their performance. This will help us understand the impact of our preprocessing and feature engineering decisions and guide us in improving our models.



In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


In [2]:

# Load the scaled train and test datasets
scaled_train_data_path = '../data_details/train_data_scaled.csv'
scaled_test_data_path = '../data_details/test_data_scaled.csv'

scaled_train_data = pd.read_csv(scaled_train_data_path)
scaled_test_data = pd.read_csv(scaled_test_data_path)

# Display the first few rows of the scaled train dataset
print(scaled_train_data.head())

# Display the first few rows of the scaled test dataset
print(scaled_test_data.head())






   MSSubClass  LotFrontage   LotArea  OverallQual  OverallCond  YearBuilt  \
0    0.425401    -0.076972 -0.119598     0.694385    -0.460159   1.045693   
1   -1.124498     0.561642  0.139916     0.031855     1.944960   0.164918   
2    0.425401     0.061643  0.462502     0.694385    -0.460159   0.980858   
3    0.646045    -0.322636  0.129296     0.694385    -0.460159  -1.870098   
4    0.425401     0.711952  0.944727     1.278778    -0.460159   0.948417   

   YearRemodAdd  MasVnrArea  ExterQual  ExterCond  ...  SaleCondition_Normal  \
0      0.879312    1.211932   1.060664  -0.238718  ...              0.466838   
1     -0.422181   -0.804570  -0.688649  -0.238718  ...              0.466838   
2      0.831422    1.139621   1.060664  -0.238718  ...              0.466838   
3     -0.713815   -0.804570  -0.688649  -0.238718  ...             -2.142069   
4      0.735570    1.432385   1.060664  -0.238718  ...              0.466838   

   SaleCondition_Partial  TotalArea   TotalSF  AgeAtSale

In [3]:
from sklearn.linear_model import Ridge, Lasso
import xgboost as xgb
from sklearn.neural_network import MLPRegressor

# Path 1 - Model with All Features

# Split the data into features and target variable
data_path = '../data_details/train_data_scaled.csv'
data = pd.read_csv(data_path)
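y_train_path = '../data_details/target_variable.csv'
# Note: y is loaded as a single-column DataFrame, which is why scikit-learn
# emits the column_or_1d warning below; y.values.ravel() would avoid it
y = pd.read_csv(y_train_path)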

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(data, y, test_size=0.2, random_state=42)

X_test = scaled_test_data

# Initialize the models
linear_regression = LinearRegression()
ridge_regression = Ridge()
lasso_regression = Lasso()
xgboost_regression = xgb.XGBRegressor(objective='reg:squarederror')

# Train the models
linear_regression.fit(X_train, y_train)
ridge_regression.fit(X_train, y_train)
lasso_regression.fit(X_train, y_train)
xgboost_regression.fit(X_train, y_train)

# Make predictions
lr_predictions = linear_regression.predict(X_val)
rr_predictions = ridge_regression.predict(X_val)
lasso_predictions = lasso_regression.predict(X_val)
xgboost_predictions = xgboost_regression.predict(X_val)

# Evaluate the models
lr_rmse = np.sqrt(mean_squared_error(y_val, lr_predictions))
rr_rmse = np.sqrt(mean_squared_error(y_val, rr_predictions))
lasso_rmse = np.sqrt(mean_squared_error(y_val, lasso_predictions))
xgboost_rmse = np.sqrt(mean_squared_error(y_val, xgboost_predictions))

print(f'Linear Regression RMSE: {lr_rmse}')
print(f'Ridge Regression RMSE: {rr_rmse}')
print(f'Lasso Regression RMSE: {lasso_rmse}')
print(f'XGBoost Regression RMSE: {xgboost_rmse}')

# Path 1 - Neural Network Model

# Initialize the model
neural_network = MLPRegressor()

# Train the model
neural_network.fit(X_train, y_train)

# Make predictions
nn_predictions = neural_network.predict(X_val)

# Evaluate the model
nn_rmse = np.sqrt(mean_squared_error(y_val, nn_predictions))

print(f'Neural Network RMSE: {nn_rmse}')


Linear Regression RMSE: 32136804843.254498
Ridge Regression RMSE: 0.12130846072149407
Lasso Regression RMSE: 0.41734378753081575
XGBoost Regression RMSE: 0.13091274763742594


  y = column_or_1d(y, warn=True)


Neural Network RMSE: 2.4241068400523513




## Updating the Neural Network Model in Direction 1

We received a warning during the training of the Neural Network model indicating that the maximum number of iterations was reached before convergence. This suggests that the model might need more iterations to find good parameters. Let's raise `max_iter` from its default of 200 to 500 and set `learning_rate_init` explicitly to see if this improves the model's performance. Note that 0.001 is already the scikit-learn default for `learning_rate_init`, so only `max_iter` effectively changes in this run.
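
As a rough illustration of how to check whether the iteration limit was actually the problem, the sketch below captures scikit-learn's `ConvergenceWarning` and compares the fitted model's `n_iter_` attribute against `max_iter`. The data is synthetic and `fit_and_report` is a hypothetical helper, not something defined elsewhere in this notebook. The target is passed as a 1-D array, which also avoids the `column_or_1d` warning.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Synthetic regression data with a 1-D target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)


def fit_and_report(max_iter):
    """Fit an MLP and report the iterations used and whether the limit was hit."""
    model = MLPRegressor(max_iter=max_iter, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model.n_iter_, hit_limit


results = {}
for max_iter in (20, 2000):
    results[max_iter] = fit_and_report(max_iter)
    n_iter, hit_limit = results[max_iter]
    print(f"max_iter={max_iter}: ran {n_iter} iterations, hit limit: {hit_limit}")
```

If `n_iter_` equals `max_iter`, the optimizer stopped because of the limit rather than because the loss had stopped improving.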

In [4]:
# Initialize the model with increased max_iter and adjusted learning_rate_init
neural_network = MLPRegressor(max_iter=500, learning_rate_init=0.001)

# Train the model
neural_network.fit(X_train, y_train)

# Make predictions
nn_predictions = neural_network.predict(X_val)

# Evaluate the model
nn_rmse = np.sqrt(mean_squared_error(y_val, nn_predictions))

print(f'Updated Neural Network RMSE: {nn_rmse}')

  y = column_or_1d(y, warn=True)


Updated Neural Network RMSE: 2.1244147160438627


## Summary of Direction 1 Results

In Direction 1, we trained our models using all available features. Here are the RMSE values we obtained:

- Linear Regression RMSE: 32136804843.254498
- Ridge Regression RMSE: 0.12130846072149407
- Lasso Regression RMSE: 0.41734378753081575
- XGBoost Regression RMSE: 0.13091274763742594
- Neural Network RMSE: 2.999916166728695

The Linear Regression model's RMSE is many orders of magnitude higher than the others. An error this large, next to RMSEs near 0.12 for the best models, points to numerically unstable coefficients, most likely caused by collinear columns such as the one-hot encoded features, rather than to a merely weaker fit. The Ridge Regression and XGBoost Regression models have the lowest RMSEs, indicating that they performed the best on our validation set. For Ridge this is expected, since its L2 penalty keeps the coefficients bounded even when features are collinear. The Lasso Regression model's RMSE is higher than Ridge and XGBoost; with the default `alpha=1.0`, the L1 penalty is probably too strong for a target on this scale, so a smaller `alpha` is worth trying. The Neural Network model's RMSE is quite high, suggesting that it's not performing well and might need more tuning, like adjusting the architecture (number of layers, neurons) or the learning rate.
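
One plausible explanation for the Lasso result, which we have not verified on the housing data, is that the default `alpha=1.0` shrinks every coefficient to zero when the target has a small spread. The sketch below reproduces that effect on synthetic data; the coefficients and noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Standardized features and a target with a small spread (std around 0.12)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 12.0 + 0.10 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.05, size=500)

# Default alpha=1.0: the L1 penalty outweighs every feature's covariance with y
default_lasso = Lasso().fit(X, y)

# A much smaller alpha lets the informative coefficients survive
weak_lasso = Lasso(alpha=0.001).fit(X, y)

print(f"Non-zero coefficients (alpha=1.0):   {np.count_nonzero(default_lasso.coef_)}")
print(f"Non-zero coefficients (alpha=0.001): {np.count_nonzero(weak_lasso.coef_)}")
print(f"Intercept with alpha=1.0: {default_lasso.intercept_:.4f} (mean of y: {y.mean():.4f})")
```

When all coefficients are zero, the model predicts the training mean for every sample, so its RMSE is roughly the standard deviation of the target and does not depend on which features are present.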


## Observations on Neural Network Model Performance in Direction 1

After rerunning the Neural Network model with `max_iter=500` and `learning_rate_init=0.001`, the RMSE fell from 2.4241068400523513 to 2.1244147160438627, a reduction of about 12%. Because no `random_state` is set, part of this difference may be run-to-run variation, and the RMSE is still far above the roughly 0.12 to 0.13 achieved by Ridge and XGBoost. The adjustment therefore did not meaningfully improve the model. Further tuning of the hyperparameters or changes to the model architecture will likely be required to achieve better performance.


These results provide a good baseline for comparison as we proceed to Direction 2.



## Preparing for Direction 2

In Direction 2, we will use the dataset where features with an MI score of 0 have been dropped. This will allow us to see if removing these non-informative features improves the performance of our models.

To do this, we will need to load the dataset with zero MI features dropped, then repeat the model training, prediction, and evaluation steps. We will compare the RMSE values obtained in Direction 2 with those from Direction 1 to see if the performance has improved.
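
Since the training, prediction, and evaluation steps are identical in both directions, they could be wrapped in a small helper. The sketch below shows one way to do it; `evaluate_models` is a hypothetical function that does not exist in the notebook, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_models(models, X_train, X_val, y_train, y_val):
    """Fit each model and return its validation RMSE keyed by model name."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_val)
        results[name] = float(np.sqrt(mean_squared_error(y_val, predictions)))
    return results


# Synthetic stand-in for one version of the dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
}
rmse_by_model = evaluate_models(models, X_train, X_val, y_train, y_val)
for name, rmse in rmse_by_model.items():
    print(f"{name} RMSE: {rmse:.4f}")
```

Calling the same helper once per dataset version would keep the comparison between directions consistent and make it easy to add further models to the dictionary.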

Let's proceed to Direction 2.

In [5]:
# Path 2 - Model with Zero MI Features Dropped

# Load the dataset where features with an MI score of 0 have been dropped
data_path = '../data_details/train_data_scaled_dropped.csv'
data = pd.read_csv(data_path)

# Load the target variable
y_train_path = '../data_details/target_variable.csv'
y = pd.read_csv(y_train_path)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(data, y, test_size=0.2, random_state=42)

# Initialize the models
linear_regression = LinearRegression()
ridge_regression = Ridge()
lasso_regression = Lasso()
xgboost_regression = xgb.XGBRegressor(objective='reg:squarederror')

# Train the models
linear_regression.fit(X_train, y_train)
ridge_regression.fit(X_train, y_train)
lasso_regression.fit(X_train, y_train)
xgboost_regression.fit(X_train, y_train)

# Make predictions
lr_predictions = linear_regression.predict(X_val)
rr_predictions = ridge_regression.predict(X_val)
lasso_predictions = lasso_regression.predict(X_val)
xgboost_predictions = xgboost_regression.predict(X_val)

# Evaluate the models
lr_rmse = np.sqrt(mean_squared_error(y_val, lr_predictions))
rr_rmse = np.sqrt(mean_squared_error(y_val, rr_predictions))
lasso_rmse = np.sqrt(mean_squared_error(y_val, lasso_predictions))
xgboost_rmse = np.sqrt(mean_squared_error(y_val, xgboost_predictions))

print(f'Linear Regression RMSE: {lr_rmse}')
print(f'Ridge Regression RMSE: {rr_rmse}')
print(f'Lasso Regression RMSE: {lasso_rmse}')
print(f'XGBoost Regression RMSE: {xgboost_rmse}')

# Path 2 - Neural Network Model

# Initialize the model
neural_network = MLPRegressor()

# Train the model
neural_network.fit(X_train, y_train)

# Make predictions
nn_predictions = neural_network.predict(X_val)

# Evaluate the model
nn_rmse = np.sqrt(mean_squared_error(y_val, nn_predictions))

print(f'Neural Network RMSE: {nn_rmse}')

# Summary of Direction 2 Results

print("In Direction 2, we trained our models using the dataset where features with an MI score of 0 have been dropped. Here are the RMSE values we obtained:")
print(f'Linear Regression RMSE: {lr_rmse}')
print(f'Ridge Regression RMSE: {rr_rmse}')
print(f'Lasso Regression RMSE: {lasso_rmse}')
print(f'XGBoost Regression RMSE: {xgboost_rmse}')
print(f'Neural Network RMSE: {nn_rmse}')



Linear Regression RMSE: 7282268088.739932
Ridge Regression RMSE: 0.1223206754782862
Lasso Regression RMSE: 0.41734378753081575
XGBoost Regression RMSE: 0.1336104636520608


  y = column_or_1d(y, warn=True)


Neural Network RMSE: 2.2308262104205774
In Direction 2, we trained our models using the dataset where features with an MI score of 0 have been dropped. Here are the RMSE values we obtained:
Linear Regression RMSE: 7282268088.739932
Ridge Regression RMSE: 0.1223206754782862
Lasso Regression RMSE: 0.41734378753081575
XGBoost Regression RMSE: 0.1336104636520608
Neural Network RMSE: 2.2308262104205774




## Updating the Neural Network Model in Direction 2

We received a warning during the training of the Neural Network model indicating that the maximum number of iterations was reached before convergence. This suggests that the model might need more iterations to find good parameters. Let's raise `max_iter` from its default of 200 to 500 and set `learning_rate_init` explicitly to see if this improves the model's performance. Note that 0.001 is already the scikit-learn default for `learning_rate_init`, so only `max_iter` effectively changes in this run.

In [6]:
# Initialize the model with increased max_iter and adjusted learning_rate_init
neural_network = MLPRegressor(max_iter=500, learning_rate_init=0.001)

# Train the model
neural_network.fit(X_train, y_train)

# Make predictions
nn_predictions = neural_network.predict(X_val)

# Evaluate the model
nn_rmse = np.sqrt(mean_squared_error(y_val, nn_predictions))

print(f'Updated Neural Network RMSE: {nn_rmse}')

  y = column_or_1d(y, warn=True)


Updated Neural Network RMSE: 2.1544314492905117



## Observations on Neural Network Model Performance in Direction 2

After rerunning the Neural Network model with `max_iter=500` and `learning_rate_init=0.001`, the RMSE fell from 2.2308262104205774 to 2.1544314492905117, a reduction of about 3%. Because no `random_state` is set, a difference this small may be within run-to-run variation, and the model remains far behind Ridge and XGBoost. Further tuning of the hyperparameters or changes to the model architecture will likely be required to achieve better performance.
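
As a starting point for that tuning, the sketch below runs a small cross-validated grid search over the network architecture and the L2 penalty using `GridSearchCV`. The grid values, the synthetic data, and the fixed `random_state` are assumptions made for illustration, not settings we have validated on the housing data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic regression data with a mildly non-linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical search space: two architectures and two L2 penalties
param_grid = {
    "hidden_layer_sizes": [(32,), (32, 16)],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPRegressor(max_iter=500, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# The scorer returns negated RMSE, so flip the sign for reporting
best_rmse = -search.best_score_
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated RMSE: {best_rmse}")
```

Fixing `random_state` makes the runs comparable, so differences between configurations reflect the hyperparameters rather than the random weight initialization.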