# Using Parallelism with Machine Learning: The Housing Prices Competition 

## Description of the competition

- The Housing Prices Competition dataset describes residential homes in Ames, Iowa, using both quantitative and categorical variables such as the size of the property, the number of rooms, the year built, and the neighborhood quality.
- It includes a set of 79 explanatory variables describing almost every aspect of the houses, allowing for in-depth analysis.
- *The primary goal* of the competition is to predict **the final price of each home**. In this lab we will use *Random Forests*.
- The models are evaluated on Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price, encouraging precise predictions over a range of housing prices.
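
The evaluation metric in the last bullet can be sketched as follows. This is a minimal illustration with made-up prices; the function name `rmse_of_logs` is an assumption for this example and is not part of the competition's tooling.

```python
from math import log, sqrt

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted price) and log(observed price)."""
    squared_errors = [(log(p) - log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices: both predictions are 10% too high.
observed = [100_000, 500_000]
predicted = [110_000, 550_000]

# The cheap and the expensive house contribute the same error,
# because the metric depends on the ratio predicted / observed.
print(rmse_of_logs(observed, predicted))
```

Note that the cells in this lab report RMSE in dollars rather than on this log scale.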

### File descriptions
- *train.csv*: the training set used to train the model.
- *test.csv*: the test set used to compute the performance of the model.
- *data_description.txt*: full description of each column.

### Useful data fields

Here's a brief version of what you'll find in the data description file.

- *SalePrice*: The property's sale price in dollars. This is the target variable that you're trying to predict.
- *MSSubClass*: The building class
- *MSZoning*: The general zoning classification

The dataset is accessible here: https://www.kaggle.com/code/dansbecker/random-forests/tutorial

## Read and prepare the data
*If you're curious about this step, the professor can explain it to you*.

In [12]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the training dataset
file_path = 'data/housing_prices_data/train.csv'
train_data = pd.read_csv(file_path, index_col="Id")

# Columns to be deleted
columns_to_delete = ['MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

# Delete the specified columns (columns= already implies axis=1)
train_data_cleaned = train_data.drop(columns=columns_to_delete)

# Define the input features (X) and the output (y)
X = train_data_cleaned.drop('SalePrice', axis=1)
y = train_data_cleaned['SalePrice']

# Identify the categorical columns in X
categorical_columns = X.select_dtypes(include=['object']).columns

# Initialize a LabelEncoder for each categorical column
label_encoders = {column: LabelEncoder() for column in categorical_columns}

# Apply Label Encoding to each categorical column
for column in categorical_columns:
    X[column] = label_encoders[column].fit_transform(X[column])

# Display the first few rows of X to confirm the encoding
print(X.head())


    MSSubClass  MSZoning  LotFrontage  LotArea  Street  LotShape  LandContour  \
Id                                                                              
1           60         3         65.0     8450       1         3            3   
2           20         3         80.0     9600       1         3            3   
3           60         3         68.0    11250       1         0            3   
4           70         3         60.0     9550       1         0            3   
5           60         3         84.0    14260       1         0            3   

    Utilities  LotConfig  LandSlope  ...  GarageQual  GarageCond  PavedDrive  \
Id                                   ...                                       
1           0          4          0  ...           4           4           2   
2           0          2          0  ...           4           4           2   
3           0          4          0  ...           4           4           2   
4           0          0        

## Split the data into training and validation sets

In [13]:
from sklearn.model_selection import train_test_split

# Split the dataset (X, y) into training and validation sets with a 70% - 30% split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=42)

# Fill NaN values in X_train and X_val with the median of the respective columns
X_train_filled = X_train.fillna(X_train.median())
X_val_filled = X_val.fillna(X_val.median())

(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

((1022, 70), (438, 70), (1022,), (438,))

## First RandomForest Model
This is the code for a simple trial.

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# Create a Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Train the model on the training data
rf_model.fit(X_train_filled, y_train)

# Make predictions on the validation data
y_val_pred_filled = rf_model.predict(X_val_filled)

# Calculate the RMSE on the validation data
rmse_filled = sqrt(mean_squared_error(y_val, y_val_pred_filled))

# Print the RMSE
print(f'RMSE on the validation data: {rmse_filled}')

RMSE on the validation data: 26057.941851126383


### Parameters of Random Forest Model
The three parameters that typically have the most impact on the performance of a Random Forest model are:

- *n_estimators*: This parameter specifies the number of trees in the forest. Generally, a higher number of trees increases the performance and makes the predictions more stable, but it also makes the computation slower. Selecting the right number of trees requires balancing between performance and computational efficiency.

- *max_features*: This parameter sets how many features are considered when searching for the best split at each node. Several options are available:

    - *sqrt*: This is commonly used and means that the maximum number of features used at each split is the square root of the total number of features.
    - *log2*: This is another typical option, meaning the log base 2 of the feature count is used.
    - *A specific integer or float*: You can specify an exact number or a proportion of the total.

- *max_depth*: This parameter specifies the maximum depth of each tree. Deeper trees can model more complex patterns, but they also risk overfitting. Limiting the depth of trees can improve the model's generalization and reduce overfitting. It's often useful to set this parameter to a finite value, especially when dealing with a large number of features.
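
As a rough illustration of the *max_features* options, the sketch below computes how many features each setting would consider per split for the 70 input columns used in this lab. The helper `features_per_split` is hypothetical. It is meant to mirror how scikit-learn resolves these options (rounding down, with a minimum of 1), which is worth checking against the version you use.

```python
from math import log2, sqrt

def features_per_split(max_features, n_features):
    """Hypothetical helper: number of features tried at each split."""
    if max_features is None:
        return n_features                              # all features
    if max_features == 'sqrt':
        return max(1, int(sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # proportion of the total
    return max_features                                # exact integer count

n_features = 70  # number of columns in X after preprocessing
for option in ['sqrt', 'log2', None, 0.5, 10]:
    print(option, '->', features_per_split(option, n_features))
```

With 70 features, *sqrt* and *log2* both restrict each split to a small subset of the columns, which makes individual trees cheaper to build and less correlated with one another.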

## Finding the best parameters sequentially

In [15]:
import time
from sklearn.metrics import mean_absolute_percentage_error

start_time = time.time()

# Define the parameter ranges
n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]  # None means using all features
max_depth_range = [1, 2, 5, 10, 20, None]  # None means no limit

# Initialize variables to store the best model and its RMSE and parameters
best_rmse = float('inf')
best_mape = float('inf')
best_model = None
best_parameters = {}

# Loop over all possible combinations of parameters
for n_estimators in n_estimators_range:
    for max_features in max_features_range:
        for max_depth in max_depth_range:
            # Create and train the Random Forest model
            rf_model = RandomForestRegressor(
                n_estimators=n_estimators,
                max_features=max_features,
                max_depth=max_depth,
                random_state=42
            )
            rf_model.fit(X_train_filled, y_train)
            
            # Make predictions and compute RMSE
            y_val_pred = rf_model.predict(X_val_filled)
            rmse = sqrt(mean_squared_error(y_val, y_val_pred))
            # Compute MAPE
            mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100
            # print(f"The parameters: {n_estimators}, {max_features}, {max_depth}. RMSE: {rmse}, MAPE: {mape}%")
            # If the model is better than the current best, update the best model and its parameters
            if rmse < best_rmse:
                best_rmse = rmse
                best_mape = mape
                best_model = rf_model
                best_parameters = {
                    'n_estimators': n_estimators,
                    'max_features': max_features,
                    'max_depth': max_depth
                }
print(f"The best parameters {best_parameters} for RMSE = {best_rmse}, MAPE: {best_mape}%")
end_time = time.time()
sequential_time = end_time - start_time
print(f"The sequential execution time is {sequential_time}")

The best parameters {'n_estimators': 100, 'max_features': None, 'max_depth': None} for RMSE = 26057.941851126383, MAPE: 9.868196740754167%
The sequential execution time is 59.58460569381714


As we can see, the process of sequentially finding the best parameters for the Random Forest Model is slow and inefficient. We can use parallelism to speed up this process.
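
For reference, the three nested loops above visit every combination of the parameter ranges. A compact way to enumerate the same grid, shown here as a sketch rather than a change to the lab code, is *itertools.product*. The variable name `grid` is introduced only for this example.

```python
from itertools import product

n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]
max_depth_range = [1, 2, 5, 10, 20, None]

# Each element is one (n_estimators, max_features, max_depth) tuple,
# in the same order as the nested loops would produce them.
grid = list(product(n_estimators_range, max_features_range, max_depth_range))

print(len(grid))   # 7 * 3 * 6 = 126 models to train
print(grid[0])     # (10, 'sqrt', 1)
```

Each of these 126 model fits is independent of the others, which is what makes the search a good candidate for parallel execution.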

## 2.d. Parallelize with threading and processes

In this section, we will parallelize the search for the best parameters of the Random Forest model using threading and multiprocessing. At the end, we will compare the performance of the two methods.

### Parallelizing with Threading:

In [16]:
# Start by importing the necessary libraries
import threading

In [17]:
# Initialize variables to store the best model and its parameters
best_rmse = float('inf')
best_mape = float('inf')
best_model = None
best_parameters = {}
lock = threading.Lock()

In [18]:
# Turn the training and evaluation into a function

def evaluate_model(n_estimators, max_features, max_depth):
    global best_rmse, best_mape, best_model, best_parameters
    rf_model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=max_features,
        max_depth=max_depth,
        random_state=42
    )
    rf_model.fit(X_train_filled, y_train)
    y_val_pred = rf_model.predict(X_val_filled)
    rmse = sqrt(mean_squared_error(y_val, y_val_pred))
    mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100
    
    with lock: # Lock the critical section
        if rmse < best_rmse:
            best_rmse = rmse
            best_mape = mape
            best_model = rf_model
            best_parameters = {
                'n_estimators': n_estimators,
                'max_features': max_features,
                'max_depth': max_depth
            }


In [19]:
start_time = time.time()
threads = []

for n_estimators in n_estimators_range:
    for max_features in max_features_range:
        for max_depth in max_depth_range:
            thread = threading.Thread(target=evaluate_model, args=(n_estimators, max_features, max_depth))
            threads.append(thread)
            thread.start()

for thread in threads:
    thread.join()

end_time = time.time()
threaded_time = end_time - start_time
print(f"The best parameters {best_parameters} for RMSE = {best_rmse}, MAPE: {best_mape}%")
print(f"The threaded execution time is {threaded_time}")

The best parameters {'n_estimators': 100, 'max_features': None, 'max_depth': None} for RMSE = 26057.941851126383, MAPE: 9.868196740754167%
The threaded execution time is 35.15937113761902
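
The threaded cell starts one thread per combination, so 126 threads run at once. A common alternative, sketched here under the assumption that a bounded pool is acceptable for the lab, is *concurrent.futures.ThreadPoolExecutor*. The function `score_stub` is a made-up stand-in for the model training so that the example runs on its own. In the notebook it would be replaced by code that fits the Random Forest and returns its RMSE.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]
max_depth_range = [1, 2, 5, 10, 20, None]

def score_stub(params):
    """Stand-in for training: returns (fake_rmse, params)."""
    n_estimators, max_features, max_depth = params
    fake_rmse = abs(n_estimators - 100) + (0 if max_depth is None else 1)
    return fake_rmse, params

grid = list(product(n_estimators_range, max_features_range, max_depth_range))

# At most 8 threads are alive at any time; map() returns results in grid order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(score_stub, grid))

# Each task returns its result, so no lock or global variable is needed.
best_rmse, best_params = min(results, key=lambda r: r[0])
print(best_params, best_rmse)
```

Because every task returns its own result and the comparison happens afterwards in the main thread, this variant does not need the lock used in the cell above.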


### Parallelizing with Multiprocessing:

In this section, we will be using multiprocessing to parallelize the process of finding the best parameters for the Random Forest Model. We will use the same approach as in the previous section, but instead of using threading, we will use the multiprocessing module to create separate processes for each model evaluation.

For each combination of hyperparameters, a separate process is spawned. Each process runs the *evaluate_model* function independently. This function:

- trains a RandomForestRegressor model with a specific set of hyperparameters;
- evaluates the model by predicting on the validation set and calculating the RMSE and MAPE;
- stores the hyperparameters and the corresponding evaluation metrics (RMSE and MAPE) in a JSON file, with one file per hyperparameter combination.

The main process then handles the remaining steps:

- Synchronization and Result Collection: After starting all processes, the main process waits for them to finish using the join() method. This ensures that all model evaluations are complete before the results are collected. Once all processes have completed, the main process reads the JSON files, each containing the results of a single model evaluation.

- Identifying the Best Model: The main process compares the RMSE from each model's evaluation results, keeping track of the best (lowest) RMSE and the corresponding MAPE and hyperparameters. After iterating through all files, it has identified the best model's hyperparameters and its performance metrics.

- Cleanup and Final Output: The temporary JSON files used to store the results of each subprocess are removed. The total execution time of the multiprocessing run is calculated and displayed, and the hyperparameters of the best model and its performance metrics (RMSE and MAPE) are printed as the final output.
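
The approach described here exchanges results through temporary JSON files. As a point of comparison, and not the approach this lab uses, a process pool can return each result directly to the main process. The sketch below assumes *multiprocessing.Pool* is acceptable. The names `score_stub`, `pick_best`, and `run_grid` are hypothetical, the grid is deliberately small, and the stub replaces the real model training so the example is self-contained.

```python
import multiprocessing
from itertools import product

def score_stub(params):
    """Stand-in for training one model: returns a result dictionary."""
    n_estimators, max_features, max_depth = params
    fake_rmse = float(abs(n_estimators - 100))
    return {'n_estimators': n_estimators, 'max_features': max_features,
            'max_depth': max_depth, 'rmse': fake_rmse}

def pick_best(results):
    """Return the result dictionary with the lowest RMSE."""
    return min(results, key=lambda r: r['rmse'])

def run_grid(workers=2):
    grid = list(product([10, 100, 400], ['sqrt', None], [5, None]))
    # Each worker process evaluates a share of the grid; the results
    # are pickled and sent back, so no temporary files are required.
    with multiprocessing.Pool(processes=workers) as pool:
        results = pool.map(score_stub, grid)
    return pick_best(results)

if __name__ == '__main__':
    # The guard keeps worker processes from re-running this line on import.
    print(run_grid())
```

On platforms where new processes are started with the spawn method, functions defined inside a notebook cell may not be visible to the workers. In that case the worker function may need to live in an importable .py file.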


In [20]:
# Start by importing the necessary libraries
import multiprocessing # For parallel processing
import json # For saving the best parameters to a file
import os # For checking if the file exists and deleting it after the run

In [21]:
# Wrap the evaluation in a function and save the results to a json file
def evaluate_model(n_estimators, max_features, max_depth, file_path):
    rf_model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=max_features,
        max_depth=max_depth,
        random_state=42
    )
    rf_model.fit(X_train_filled, y_train)
    y_val_pred = rf_model.predict(X_val_filled)
    rmse = sqrt(mean_squared_error(y_val, y_val_pred))
    mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100
    
    # Save results to a JSON file
    with open(file_path, 'w') as f:
        json.dump({
            'n_estimators': n_estimators,
            'max_features': max_features,
            'max_depth': max_depth,
            'rmse': rmse,
            'mape': mape
        }, f)


In [22]:
start_time = time.time() # Start the timer
processes = [] # Initialize a list to store the processes
temp_files = [] # Initialize a list to store the file paths
os.makedirs('results', exist_ok=True) # Make sure the output directory exists

for n_estimators in n_estimators_range: # Loop over the parameter ranges
    for max_features in max_features_range:
        for max_depth in max_depth_range:
            file_path = f'results/temp_results_{n_estimators}_{max_features}_{max_depth}.json'
            temp_files.append(file_path)
            process = multiprocessing.Process(target=evaluate_model, args=(n_estimators, max_features, max_depth, file_path)) # Create a process for each combination of parameters
            processes.append(process) # Add the process to the list
            process.start()

# Wait for all processes to finish
for process in processes:
    process.join()

In [23]:
# Initialize variables to store the best RMSE and its parameters
best_rmse = float('inf')
best_mape = float('inf')
best_parameters = {}

# Collect results from JSON files
for file_path in temp_files: # Loop over the temporary files
    with open(file_path, 'r') as f: 
        results = json.load(f) # Load the results from the file
        if results['rmse'] < best_rmse: # If the results are better than the current best, update the best RMSE and its parameters
            best_rmse = results['rmse']
            best_mape = results['mape']
            best_parameters = {
                'n_estimators': results['n_estimators'],
                'max_features': results['max_features'],
                'max_depth': results['max_depth']
            }
    os.remove(file_path)  # Clean up the temporary file

end_time = time.time() # End the timer
multiprocessing_time = end_time - start_time
print(f"The multiprocessing execution time is {multiprocessing_time}")
print(f"The best parameters {best_parameters} for RMSE = {best_rmse}, MAPE: {best_mape}%")

The multiprocessing execution time is 17.05547070503235
The best parameters {'n_estimators': 100, 'max_features': None, 'max_depth': None} for RMSE = 26057.941851126383, MAPE: 9.868196740754167%


- Sequential Execution Time: 59.58 seconds
- Threaded Execution Time: 35.16 seconds
- Multiprocessing Execution Time: 17.06 seconds
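
The speedup of each parallel version is the sequential time divided by the parallel time. The short calculation below uses the three timings listed above; the variable names are chosen for this example.

```python
sequential_time = 59.58
threaded_time = 35.16
multiprocessing_time = 17.06

# Speedup = sequential time / parallel time
threaded_speedup = sequential_time / threaded_time
multiprocessing_speedup = sequential_time / multiprocessing_time

print(f"Threading speedup: {threaded_speedup:.2f}x")
print(f"Multiprocessing speedup: {multiprocessing_speedup:.2f}x")
```

These figures come from a single run on one machine, so they are indicative only and will vary with the number of available CPU cores.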

Execution Time Changes:

- The threaded implementation is about 1.7 times faster than the sequential one (35.16 s vs. 59.58 s). Python's Global Interpreter Lock (GIL) prevents threads from running Python bytecode in parallel, so the gain likely comes from the parts of model training that run in compiled code and release the GIL.
- The multiprocessing implementation is the fastest, about 3.5 times faster than the sequential one (17.06 s vs. 59.58 s). Separate processes use multiple CPU cores effectively and avoid the limitations of the GIL.

Performance Metrics:

- All three methods (sequential, threaded, multiprocessing) should give the same best model accuracy (RMSE, MAPE), because they evaluate the same hyperparameter combinations with the same random seed and therefore find the same best set of hyperparameters.
- The main benefit of threading and multiprocessing is reducing the time it takes to find the best model, not improving the model's accuracy itself.

In summary, multiprocessing offered the best speed improvement in this hyperparameter tuning task, likely due to its ability to better utilize the CPU resources for this computation-heavy workload.