# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Balkarn Gill - 30202219

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [None]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected.

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [None]:
# Import dataset (1 mark)

# Load the dataset (pandas is already imported above as pd)
file_path = 'housing_price_dataset.csv'
housing_data = pd.read_csv(file_path)

# Feature matrix (X) - all columns except 'Price'
X = housing_data.drop('Price', axis=1)

# Target vector (y) - 'Price' column
y = housing_data['Price']
print(X.columns)

Index(['SquareFeet', 'Bedrooms', 'Bathrooms', 'Neighborhood', 'YearBuilt'], dtype='object')


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?

I got this dataset from Kaggle.

2. (1 mark) Why did you pick this particular dataset?

I chose this dataset because it offers a practical and relevant problem in the field of machine learning, particularly in regression analysis. Housing price prediction is a common real-world application of machine learning, making this dataset a valuable resource for understanding how to handle and analyze real-world data. Its mix of numerical and categorical data, along with a clear target variable (Price), provides a good opportunity to apply various data preprocessing and machine learning techniques.


3. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

The main challenge was making sure the dataset suited the intended analysis, in this case predictive modeling. A suitable dataset needs to be relevant to the topic of interest, have a good mix of features and a reasonable size, and contain little missing or inconsistent data. For open-source data, confirming that the dataset has a clear and permissive usage license can be a further hurdle. Finally, the dataset should be neither too simplistic nor overly complex for my current skill level.


## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [None]:
# Clean data (if needed)

# Check for missing values
missing_values = housing_data.isnull().sum()

print(missing_values)
print(housing_data.columns)



SquareFeet      0
Bedrooms        0
Bathrooms       0
Neighborhood    0
YearBuilt       0
Price           0
dtype: int64
Index(['SquareFeet', 'Bedrooms', 'Bathrooms', 'Neighborhood', 'YearBuilt',
       'Price'],
      dtype='object')


In [None]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = ['Neighborhood']

# Define transformers
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

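# Apply preprocessing to X
# Note: this fits the scaler and encoder on the full dataset, before the
# train/test split in Step 3, so test rows influence the fitted statistics.
X_preprocessed = preprocessor.fit_transform(X)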


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.

There were no missing or null values in the dataset. However, if there had been missing values, the approach to handle them would depend on the nature of the data and the specific column. For numerical columns, missing values could be replaced with the mean or median of the column, as this can preserve the general distribution of the data. The mean is suitable for normally distributed data, while the median is preferred for skewed distributions to avoid the influence of outliers. For categorical columns, missing values could be replaced with the mode, which is the most frequently occurring category. Alternatively, for both numerical and categorical data, another approach could be to use more sophisticated imputation methods like K-Nearest Neighbors (KNN) or regression imputation, which can predict missing values based on the other available data.
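
The imputation options described above can be sketched as follows. This is a minimal illustration, not part of the submitted analysis: the toy DataFrame and all of its values are invented, and only the column names are borrowed from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data (hypothetical values) with gaps to fill
toy = pd.DataFrame({
    "SquareFeet": [1000.0, 1500.0, np.nan, 2500.0],
    "Bedrooms": [2.0, 3.0, 3.0, 5.0],
    "Neighborhood": pd.Series(["Urban", "Rural", "Urban", np.nan], dtype=object),
})

# Median imputation for a numerical column (robust to outliers)
median_imputer = SimpleImputer(strategy="median")
square_feet = median_imputer.fit_transform(toy[["SquareFeet"]])

# Mode (most frequent value) imputation for a categorical column
mode_imputer = SimpleImputer(strategy="most_frequent")
neighborhood = mode_imputer.fit_transform(toy[["Neighborhood"]])

# KNN imputation: fill a gap from the rows that are closest on the other features
knn_imputer = KNNImputer(n_neighbors=2)
knn_filled = knn_imputer.fit_transform(toy[["SquareFeet", "Bedrooms"]])

print(square_feet.ravel())
print(neighborhood.ravel())
print(knn_filled)
```

In a real workflow these imputers would sit inside the preprocessing pipeline so that they are fitted on training data only.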





2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

The dataset contains both numerical data ('SquareFeet', 'Bedrooms', 'Bathrooms', 'YearBuilt') and categorical data ('Neighborhood'). For the numerical data, standardization or normalization is commonly applied. Standardization (StandardScaler in scikit-learn) rescales each feature to a mean of 0 and a standard deviation of 1, which is useful when features have different scales and units. Normalization (MinMaxScaler) rescales each feature to a fixed range, typically 0 to 1. Both help algorithms that are sensitive to feature scale, such as those trained with gradient descent.

For the categorical data, the categories must be converted to numbers because scikit-learn models require numerical input. One-hot encoding (OneHotEncoder) creates one binary column per category and is appropriate when the categories have no inherent order. Integer encoding assigns each category a unique integer; for feature columns this is done with OrdinalEncoder, since LabelEncoder is intended for target labels. Integer encoding implies an order, so it is generally reserved for ordinal data. For this dataset, one-hot encoding was used for the 'Neighborhood' column to capture the categorical information without implying any ordinal relationship.
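
As a small sketch of the preprocessing options discussed in this answer, the block below applies each transformer to made-up values. The arrays are hypothetical and not taken from the housing dataset. OrdinalEncoder is used for integer encoding of a feature column, because scikit-learn's LabelEncoder is meant for target labels.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical values, chosen only for illustration
square_feet = np.array([[1000.0], [2000.0], [3000.0]])
neighborhood = np.array([["Rural"], ["Suburb"], ["Urban"]])

# Standardization: each feature ends up with mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(square_feet)

# Normalization: each feature is rescaled to the range [0, 1]
normalized = MinMaxScaler().fit_transform(square_feet)

# One-hot encoding: one binary column per category, no order implied
one_hot = OneHotEncoder().fit_transform(neighborhood).toarray()

# Integer encoding: one integer per category, which implies an order
ordinal = OrdinalEncoder().fit_transform(neighborhood)

print(standardized.ravel())
print(normalized.ravel())
print(one_hot)
print(ordinal.ravel())
```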




## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [None]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)


# Linear Regression Pipeline
pipe_linear = Pipeline([('regressor', LinearRegression())])

# Decision Tree Regressor Pipeline
pipe_tree = Pipeline([('regressor', DecisionTreeRegressor())])

# Random Forest Regressor Pipeline
pipe_forest = Pipeline([('regressor', RandomForestRegressor())])


param_grid_linear = {
    'regressor__fit_intercept': [True, False],
    'regressor__positive': [True, False]
}

param_grid_tree = {'regressor__max_depth': [None, 10, 20, 30],
                   'regressor__min_samples_split': [2, 5, 10]}

param_grid_forest = {'regressor__n_estimators': [10, 50, 100],
                     'regressor__max_features': [1.0, 'sqrt', 'log2']}


grid_linear = GridSearchCV(pipe_linear, param_grid_linear, cv=5, scoring='neg_mean_squared_error')
grid_tree = GridSearchCV(pipe_tree, param_grid_tree, cv=5, scoring='neg_mean_squared_error')
grid_forest = GridSearchCV(pipe_forest, param_grid_forest, cv=5, scoring='neg_mean_squared_error')


# For Linear Regression
grid_linear.fit(X_train, y_train)
print("Best parameters for Linear Regression:", grid_linear.best_params_)
print("Best score for Linear Regression:", grid_linear.best_score_)

# For Decision Tree
grid_tree.fit(X_train, y_train)
print("Best parameters for Decision Tree Regressor:", grid_tree.best_params_)
print("Best score for Decision Tree Regressor:", grid_tree.best_score_)

# For Random Forest
grid_forest.fit(X_train, y_train)
print("Best parameters for Random Forest Regressor:", grid_forest.best_params_)
print("Best score for Random Forest Regressor:", grid_forest.best_score_)



Best parameters for Linear Regression: {'regressor__fit_intercept': True, 'regressor__positive': True}
Best score for Linear Regression: -2506412594.5019636
Best parameters for Decision Tree Regressor: {'regressor__max_depth': 10, 'regressor__min_samples_split': 10}
Best score for Decision Tree Regressor: -2772145233.9303474
Best parameters for Random Forest Regressor: {'regressor__max_features': 1.0, 'regressor__n_estimators': 100}
Best score for Random Forest Regressor: -2834640133.0141687


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?

For this dataset, regression models are needed. The reason is that the target variable, 'Price', is continuous and numerical. Regression models are designed to predict a continuous outcome, making them suitable for predicting house prices based on various features like square footage, number of bedrooms, bathrooms, neighborhood, and year built.


2. (2 marks) Which models did you select for testing and why?

The selected models were Linear Regression, Decision Tree Regressor, and Random Forest Regressor.

Linear Regression: This is a fundamental regression model and serves as a good baseline. It assumes a linear relationship between the independent variables and the dependent variable. It's useful to understand how well a simple model performs before moving to more complex ones.

Decision Tree Regressor: As a non-linear model, it can capture more complex relationships in the data that a linear model might miss. Decision trees are also easy to interpret and do not need feature scaling, although scikit-learn's implementation still requires categorical features to be encoded numerically.

Random Forest Regressor: This is an ensemble model that trains many decision trees on bootstrap samples of the data and averages their predictions to get a more accurate and stable result. It is generally less prone to overfitting than an individual decision tree.




3. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

Best Performing Model Based on Grid Search Results:

Among the three models tested, Linear Regression achieved the best cross-validated score. The scoring metric was negative mean squared error ('neg_mean_squared_error'), for which higher (least negative) is better: about -2.51e9 for Linear Regression, compared with about -2.77e9 for the Decision Tree and -2.83e9 for the Random Forest. The best parameters for the Linear Regression model were found to be {'regressor__fit_intercept': True, 'regressor__positive': True}.

Does the Best Performing Model Make Sense Theoretically and Contextually?


The fact that Linear Regression performed the best can be interpreted in several ways:


Simplicity of Relationships: The underlying relationship between the features and the target variable (house prices) in this dataset might be more linear than complex. This indicates that a simple linear model is sufficient to capture the trends in the data without the need for more complex, non-linear models.

Overfitting in Complex Models: It’s possible that the Decision Tree and Random Forest models overfitted the training data. While these models are capable of capturing more complex relationships, they can also fit the noise in the training data, leading to poorer performance on unseen data (test data).

Dataset Characteristics: The performance of Linear Regression suggests that the features have a significant linear relationship with the house prices. The dataset might not have enough variability or non-linear patterns to benefit from the more complex models.

Hyperparameter Choices: The grid search results are also influenced by the choice of hyperparameters and their ranges. It's possible that with a different set of hyperparameters or a broader search, the non-linear models could have performed better.

In summary, the Linear Regression model's superior performance in this context might suggest that for this particular dataset, a simple linear approach is more effective than more complex models. This aligns with the principle that sometimes simpler models can outperform complex ones, especially when the relationships in the data are not overly complex or when the dataset size and feature set do not warrant more sophisticated models.
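
One way to make the model comparison more trustworthy is to keep preprocessing inside the pipeline, so that every cross-validation fold fits the scaler and encoder on its own training portion only. The block below is a minimal sketch of that idea under stated assumptions: the data is synthetic, the column names merely mirror the housing dataset, and the price formula and parameter grid are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values and price formula)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "SquareFeet": rng.integers(800, 3000, n),
    "Bedrooms": rng.integers(1, 6, n),
    "Neighborhood": rng.choice(["Rural", "Suburb", "Urban"], n),
})
price = 100 * demo["SquareFeet"] + 5000 * demo["Bedrooms"] + rng.normal(0, 20000, n)

# Split the raw data first so test rows never influence preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(demo, price, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["SquareFeet", "Bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])

# Preprocessing is a pipeline step, so it is refitted inside every CV fold
pipe = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(random_state=0)),
])

grid = GridSearchCV(
    pipe,
    {"regressor__n_estimators": [10, 50], "regressor__max_depth": [None, 5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X_tr, y_tr)

print("Best parameters:", grid.best_params_)
print("Best CV score (negative MSE):", grid.best_score_)
print("Test R-squared:", grid.best_estimator_.score(X_te, y_te))
```

Setting random_state on the model also makes the search reproducible, which the tree-based models in this assignment would otherwise not be.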








## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [None]:
# Calculate testing accuracy (1 mark)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create the Linear Regression model with the best parameters from the grid search
best_linear_model = LinearRegression(fit_intercept=True, positive=True)

# Fit the model on the training data
best_linear_model.fit(X_train, y_train)

# Predict on the testing data
y_pred = best_linear_model.predict(X_test)

# Calculate the testing metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Testing Mean Squared Error: {mse}")
print(f"Testing R-squared: {r2}")


Testing Mean Squared Error: 2436390444.066854
Testing R-squared: 0.5755382856951105



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose?

The accuracy metrics used were Mean Squared Error (MSE) and R-squared. MSE is a measure of the average squared difference between the estimated values and the actual values, providing a quantifiable indication of the model's prediction error. R-squared, or the coefficient of determination, indicates the proportion of the variance in the dependent variable that is predictable from the independent variables, providing a measure of how well unseen samples are likely to be predicted by the model.
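
To make the two definitions concrete, the sketch below computes MSE and R-squared by hand and compares them with scikit-learn's functions. The prices are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([200000.0, 250000.0, 300000.0, 350000.0])
y_hat = np.array([210000.0, 240000.0, 320000.0, 330000.0])

# MSE: the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE (manual, sklearn):", mse_manual, mean_squared_error(y_true, y_hat))
print("R-squared (manual, sklearn):", r2_manual, r2_score(y_true, y_hat))
```

Because MSE is expressed in squared price units, its size depends on the scale of the target, whereas R-squared is unitless.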


2. (1 mark) How do these results compare to those in part 3? Did this model generalize well?

In part 3, the best model (Linear Regression) was selected by 5-fold cross-validation on the training data, where it reached a mean MSE of about 2.51e9 (a best score of -2506412594.50). On the held-out test set in part 4, the MSE is 2436390444.07 and the R-squared is 0.5755, so the model explains about 57.6% of the variance in the target variable.
The test MSE is slightly lower than the cross-validated MSE, so the error on unseen data matches the error estimated during training and the model generalizes consistently. The error itself is still large: taking the square root gives an RMSE of roughly 49,400 in the units of 'Price'.
This comparison suggests that the model does not overfit, but that it has moderate rather than high predictive power.
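
The MSE values reported in parts 3 and 4 are easier to compare once they are converted to root mean squared error (RMSE), which is in the same units as 'Price'. The short calculation below uses only the numbers printed in the outputs above; the variable names are illustrative.

```python
import math

# Values copied from the grid search and test outputs reported above
cv_neg_mse = -2506412594.5019636  # best_score_ for Linear Regression (negative MSE)
test_mse = 2436390444.066854      # testing Mean Squared Error

# GridSearchCV reports negative MSE, so flip the sign before taking the root
cv_rmse = math.sqrt(-cv_neg_mse)
test_rmse = math.sqrt(test_mse)

print(f"Cross-validation RMSE: {cv_rmse:,.0f}")
print(f"Test RMSE:             {test_rmse:,.0f}")
print(f"Test MSE / CV MSE:     {test_mse / -cv_neg_mse:.3f}")
```

Both values come out near 50,000 price units, with the test error slightly below the cross-validated error.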


3. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

Real-World Applicability: The model's performance, with an R-squared of 0.5755, indicates it is moderately effective but might not be reliable enough for high-stakes decisions in the real-world real estate market. The substantial MSE implies that the model's predictions can be off by a significant margin, which is critical when dealing with high-value assets like real estate.

Suggestions for Improvement:
Feature Engineering: More sophisticated feature engineering could potentially improve the model's performance. This might include creating new features, transforming existing features, or integrating external data that could be relevant (e.g., economic indicators, local housing market trends).

Model Complexity: Exploring more complex or different types of regression models might yield better results. Models that can capture more complex non-linear relationships or interactions between features (like Gradient Boosting or Support Vector Regression) might be more effective.

Hyperparameter Tuning: Extending the range and scope of hyperparameter tuning could also lead to improvements. This might involve exploring a wider range of values or different hyperparameters.

Cross-Validation: Using a more robust cross-validation strategy might provide a better understanding of how the model is likely to perform on various subsets of the data.

Data Quality and Size: Increasing the dataset size or improving data quality (if possible) can also enhance model performance. More data points can provide a more comprehensive representation of the underlying patterns.
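
Two of the suggestions above, trying a gradient boosting model and using a more robust cross-validation strategy, can be sketched together. This is only an illustration under stated assumptions: the data is synthetic with a deliberately non-linear target, so the scores say nothing about how these models would rank on the housing dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data with a non-linear target (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = 3 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.3, 300)

# Repeated k-fold: 5 folds repeated twice with different shuffles
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("gradient_boosting", GradientBoostingRegressor(random_state=0)),
]:
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean R-squared = {scores[name]:.3f} (std {fold_scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how stable each estimate is.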




## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?

From the examples and lectures provided in class, as well as online tools and libraries.

2. In what order did you complete the steps?

In the order that they were listed.


3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

I used AI to help debug errors in my code. I also used it to help me understand the results better and answer some questions. It was easier to use AI to understand the results than to dig through online resources or old lecture notes. I used prompts such as "What does this error mean and how do I fix it?" and "What do these results mean?"

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I did have some challenges remembering how to perform specific models. I had to use quite a bit of AI and old notes to help me be successful.




## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.



While working on this assignment, I found it particularly interesting to apply machine learning concepts to a practical scenario like housing price prediction. It was engaging to think through each step of the process, from data preprocessing to model selection and validation. The most challenging aspect was adapting the advice and code suggestions to the specific issues and errors I encountered, which required a deep understanding of both the theoretical and practical aspects of machine learning. This challenge was also motivating, as it pushed me to think critically and creatively to find solutions that were both accurate and applicable to my dataset and objectives.