In [1]:
# Load Data
import pandas as pd
from sklearn.preprocessing import StandardScaler

csv_file_path = '/Users/macbookpro/Desktop/(Use this) housing_price_dataset.csv' 
label_data = pd.read_csv(csv_file_path)

# Get labels (column index 5) and features (column indices 0-4) for all rows
labels = label_data.iloc[:, 5].values
print(labels)
print(labels.shape)
variables = label_data.iloc[:, 0:5]
print(variables)
print(variables.shape)

[215355.2836182  195014.22162585 306891.01207633 ... 384110.55559035
 380512.68595684 221618.58321807]
(50000,)
       SquareFeet  Bedrooms  Bathrooms  Neighborhood  YearBuilt
0            2126         4          1             1       1969
1            2459         3          2             1       1980
2            1860         2          1             2       1970
3            2294         2          1             3       1996
4            2130         5          2             2       2001
...           ...       ...        ...           ...        ...
49995        1282         5          3             1       1975
49996        2854         2          2             2       1988
49997        2979         5          3             2       1962
49998        2596         5          2             1       1984
49999        1572         5          3             1       2011

[50000 rows x 5 columns]
(50000, 5)


In [2]:
# Linear Regression

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# train/test split
X_train, X_test, y_train, y_test = train_test_split(variables, labels, test_size=0.3, random_state=42)

# create linear regression model
linear_regressor = LinearRegression()

# fit on the training split, then evaluate on the held-out test split
linear_regressor.fit(X_train, y_train)
y_pred = linear_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Square Error: {mse}")
print(f"R² score: {r2}")

Mean Square Error: 2468771544.275607

R² score: 0.5728435816569826


# Linear Regression Model Analysis 

## Model Selection

The Linear Regression model was chosen as a baseline for the prediction of housing prices. Linear Regression is a fundamental statistical approach that models the relationship between a scalar response and one or more explanatory variables. The simplicity of the model often makes it a first-line approach in predictive analysis, as it provides a clear interpretation of how each feature influences the target variable – in this case, housing prices.

Linear Regression assumes that the target is a linear function of the input variables plus noise, so it works best when the data follow a roughly linear trend. Because it is interpretable and cheap to fit, it serves as a good starting point for modeling relationships and making predictions.
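To make the interpretability point concrete, here is a minimal sketch that fits ordinary least squares with NumPy and reads the coefficients as "price change per unit of feature". The data are synthetic, and the feature names, true coefficients, and noise level are assumptions chosen for illustration, not values from the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical; not the housing dataset).
n = 500
square_feet = rng.uniform(1000, 3000, size=n)
bedrooms = rng.integers(2, 6, size=n).astype(float)
true_coefs = np.array([100.0, 5000.0])  # assumed price effect per unit
true_intercept = 20000.0
noise = rng.normal(0, 1000.0, size=n)
price = true_intercept + true_coefs[0] * square_feet + true_coefs[1] * bedrooms + noise

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), square_feet, bedrooms])

# Ordinary least squares: minimise ||X @ beta - price||^2.
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, coef_sqft, coef_bed = beta

print(f"intercept: {intercept:.1f}")
print(f"price change per extra square foot: {coef_sqft:.2f}")
print(f"price change per extra bedroom: {coef_bed:.1f}")
```

Each fitted coefficient is the estimated change in price for a one-unit change in that feature with the others held fixed, which is the kind of reading a linear model supports directly.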

## Model Performance

Upon applying the Linear Regression model to the training data and evaluating it on the test set, the following performance metrics were recorded:

- Mean Square Error (MSE): 2,468,771,544.28
- R² Score: 0.5728

These metrics indicate how well the predicted housing prices match the actual prices in the test dataset. MSE is expressed in squared price units; its square root (RMSE) is about 49,700, the typical size of a prediction error in the same units as the price. An R² of 0.57 means the model explains about 57% of the variance in test-set prices, whereas a value close to 1 would indicate that nearly all of the variance is explained.
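As a sketch of what the two metrics measure, they can be computed by hand with NumPy. The helper names `mse` and `r2` and the toy prices below are hypothetical and are not taken from the dataset.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values (hypothetical prices).
y_true = [200000.0, 250000.0, 300000.0, 350000.0]
y_pred = [210000.0, 240000.0, 310000.0, 330000.0]

toy_mse = mse(y_true, y_pred)
toy_rmse = toy_mse ** 0.5   # back in price units
toy_r2 = r2(y_true, y_pred)

print(f"MSE:  {toy_mse}")
print(f"RMSE: {toy_rmse:.2f}")
print(f"R²:   {toy_r2:.3f}")
```

Taking the square root of the MSE gives an error figure in the same units as the price, which is usually easier to interpret than the squared quantity.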

## Implications for Housing Price Prediction

The R² score provides insight into the goodness-of-fit for the linear model. At 0.57, it is well short of 1: about 43% of the variance in housing prices is not captured by a linear combination of the five features, which points to the need for a more flexible model or for additional features that better account for price variation.

## Assessment of Results

The Linear Regression model's performance should be judged on both MSE and R² score. A high MSE is undesirable because it means a larger gap between predicted and actual prices, while a high R² indicates that the model accounts for most of the variation in prices, at least within the scope of the current dataset.

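The results from the Linear Regression model can act as a benchmark for more complex models. If the Linear Regression model's performance is on par with more sophisticated models, it might be preferable due to its simplicity and interpretability. However, if more complex models significantly outperform it, they might be more suitable despite their complexity. In this notebook, the tuned LASSO and Random Forest models reach cross-validated R² of about 0.570 and 0.564, no better than the baseline's hold-out R² of 0.573, though the different evaluation protocols make the comparison approximate.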
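A benchmark comparison is most informative when every model is scored on identical splits. The following is a minimal sketch of that idea using scikit-learn's `cross_val_score` with one shared `KFold` splitter. The data come from `make_regression` as a stand-in, and the `alpha` value is an arbitrary assumption; in the notebook, `variables` and `labels` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical; swap in `variables` and `labels`).
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

# One shared splitter so every model is evaluated on identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),  # alpha chosen arbitrarily for illustration
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[name] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"{name}: R² = {results[name][0]:.4f} +/- {results[name][1]:.4f}")
```

Because the folds are fixed by `random_state`, differences between the reported means reflect the models and not the luck of the split.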

## Conclusion

The Linear Regression model, while simplistic, is an essential tool in the arsenal of statistical modeling and predictive analytics. Its performance on the housing price dataset provides a foundation for comparison with more complex models. Whether the Linear Regression model serves as a final predictive model or a stepping stone to more intricate models will depend on its relative performance and the specific requirements for interpretability and complexity in the task at hand.

Given the model's performance metrics, further steps might include exploring non-linear models, feature engineering to uncover more complex relationships within the data, or using regularization techniques to improve the model's predictive accuracy and prevent overfitting.

The interpretability of Linear Regression remains one of its strongest attributes, providing clear insight into the relationship between features and the target variable, which is invaluable for stakeholders making informed decisions in the real estate market.

In [3]:
# LASSO
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score

# train/test split (unused below: grid search and cross-validation run on the full dataset)
X_train, X_test, y_train, y_test = train_test_split(variables, labels, test_size=0.3, random_state=42)

# define param_grid parameters for LASSO
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10]
}

# create LASSO regressor
lasso_regressor = Lasso()

# use GridSearchCV to test all parameters
grid_search = GridSearchCV(lasso_regressor, param_grid, cv=2, scoring='neg_mean_squared_error')
grid_search.fit(variables, labels)

# get best parameters
best_params = grid_search.best_params_
print("Best Parameters: ", best_params)

# create a new regressor with best parameters
best_lasso_regressor = Lasso(**best_params)

# calculate mse in 5-folds using LASSO
cv = KFold(n_splits=5, shuffle=True, random_state=42)
mse_scores = cross_val_score(best_lasso_regressor, variables, labels, cv=cv, scoring='neg_mean_squared_error')
mean_mse = np.mean(-mse_scores)
std_mse = np.std(-mse_scores)

# calculate R² in 5 folds using LASSO
r2_scores = cross_val_score(best_lasso_regressor, variables, labels, cv=cv, scoring='r2')
mean_r2 = np.mean(r2_scores)
std_r2 = np.std(r2_scores)

print(f'5-Fold Cross Validation MSE: {mean_mse} +/- {std_mse}')
print(f'5-Fold Cross Validation R²: {mean_r2} +/- {std_r2}')


Best Parameters:  {'alpha': 10}

5-Fold Cross Validation MSE: 2492982836.592742 +/- 29111871.97167061

5-Fold Cross Validation R²: 0.5699567726086732 +/- 0.0040731464261042115


In [5]:
# Random Forest
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import mean_squared_error, r2_score

# define param_grid parameters
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

n_splits = 5
# create RandomForestRegressor
rf = RandomForestRegressor(random_state=42)

# use KFold for cross-validation
skf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# create GridSearchCV
grid_search = GridSearchCV(rf, param_grid, cv=skf, scoring='neg_mean_squared_error', n_jobs=-1)

# use GridSearchCV to test all parameters
grid_search.fit(variables, labels)

# get best parameters
best_params = grid_search.best_params_
print("Best parameters: ", best_params)

# use best parameters to evaluate rf
total_mse = []
total_r2 = []

for fold, (train_index, test_index) in enumerate(skf.split(variables, labels)):
    X_train, X_test = variables.iloc[train_index], variables.iloc[test_index]
    y_train, y_test = labels[train_index], labels[test_index]

    model = RandomForestRegressor(**best_params, random_state=42)

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    total_mse.append(mse)
    total_r2.append(r2)

# aggregate MSE and R² across folds
average_mse = np.mean(total_mse)
std_mse = np.std(total_mse)
average_r2 = np.mean(total_r2)
std_r2 = np.std(total_r2)
print(f"5-Fold Cross Validation MSE: {average_mse} +/- {std_mse}")
print(f"5-Fold Cross Validation R²: {average_r2} +/- {std_r2}")

Best parameters:  {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}

5-Fold Cross Validation MSE: 2526867451.760862 +/- 27138793.35331006

5-Fold Cross Validation R²: 0.5641104831619396 +/- 0.0037409341851049856


In [5]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score

# NOTE: the two assignments below overwrite `variables` and `labels` with 100 rows of
# random placeholder data, so every metric printed by this cell describes noise, not the
# housing dataset. Remove them to run on the data loaded in the first cell.
variables = pd.DataFrame(np.random.rand(100, 5), columns=['SquareFeet', 'Bedrooms', 'Bathrooms', 'Neighborhood', 'YearBuilt'])
labels = np.random.rand(100) * 1000000  # 100 random price labels

# train/test split
X_train, X_test, y_train, y_test = train_test_split(variables, labels, test_size=0.3, random_state=42)

# define param_grid parameters for Gradient Boosting Regressor
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [1, 2]
}

# create Gradient Boosting Regressor
gbr = GradientBoostingRegressor(random_state=42)

# use GridSearchCV to test all parameters
grid_search = GridSearchCV(gbr, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# get best parameters
best_params = grid_search.best_params_
print("Best Parameters: ", best_params)

# create a new regressor with best parameters
best_gbr = GradientBoostingRegressor(**best_params, random_state=42)

# calculate mse in 5-folds using Gradient Boosting Regressor
cv = KFold(n_splits=5, shuffle=True, random_state=42)
mse_scores = cross_val_score(best_gbr, X_train, y_train, cv=cv, scoring='neg_mean_squared_error')
mean_mse = np.mean(-mse_scores)
std_mse = np.std(-mse_scores)

# calculate R² in 5 folds using Gradient Boosting Regressor
r2_scores = cross_val_score(best_gbr, X_train, y_train, cv=cv, scoring='r2')
mean_r2 = np.mean(r2_scores)
std_r2 = np.std(r2_scores)

# Print results
print(f'5-Fold Cross Validation MSE: {mean_mse} +/- {std_mse}')
print(f'5-Fold Cross Validation R²: {mean_r2} +/- {std_r2}')

# Now we train the best model on the full training set and evaluate it on the test set
best_gbr.fit(X_train, y_train)
y_pred = best_gbr.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Print test set results
print(f'Test MSE: {test_mse}')
print(f'Test R²: {test_r2}')


Best Parameters:  {'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
5-Fold Cross Validation MSE: 106119777257.87607 +/- 15981507525.526875
5-Fold Cross Validation R²: -0.24658649241197478 +/- 0.30137800350673677
Test MSE: 100754861599.48439
Test R²: -0.1775434629129491


# Gradient Boosting Regressor Model Analysis Report

## Model Selection

The Gradient Boosting Regressor (GBR) was selected for predicting housing prices due to its robustness and effectiveness in handling various types of data distributions. GBR builds decision trees sequentially: each new tree is fitted to the errors of the ensemble built so far, and its contribution is scaled by a learning rate. Shallow trees, a small learning rate, and a limited number of estimators act together to control overfitting, which lets the model trade bias against variance.

This model is particularly suited to regression tasks where the relationship between features and the target variable may be complex and non-linear. The GBR model can capture intricate patterns in the data, making it a preferred choice for the housing price prediction task where such complexities are common due to the multifaceted nature of real estate markets.
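The sequential error-correcting idea behind gradient boosting can be sketched in a few lines of NumPy. This is an illustrative toy for squared-error loss with one-split trees ("stumps") on a single synthetic feature, not scikit-learn's implementation; `fit_stump`, `predict_stump`, and all data are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)  # synthetic non-linear target

learning_rate = 0.1
n_estimators = 100

# Start from the mean, then repeatedly fit a stump to the current residuals.
prediction = np.full_like(y, y.mean())
initial_mse = float(np.mean((y - prediction) ** 2))
stumps = []
for _ in range(n_estimators):
    residual = y - prediction  # negative gradient of squared-error loss
    stump = fit_stump(x, residual)
    stumps.append(stump)
    prediction = prediction + learning_rate * predict_stump(stump, x)

final_mse = float(np.mean((y - prediction) ** 2))
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

The falling training error shows the mechanism only; whether the fitted structure generalizes still has to be checked on held-out data, which is what the cross-validation in the cell above is for.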

## Model Performance

The GBR model, with the best hyperparameters obtained through cross-validation, exhibits the following performance metrics:

- 5-Fold Cross-Validation Mean Squared Error (MSE): 106,119,777,257.88 +/- 15,981,507,525.53
- 5-Fold Cross-Validation R²: -0.2466 +/- 0.3014
- Test MSE: 100,754,861,599.48
- Test R²: -0.1775

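The negative R² means the model fits worse than predicting the mean price for every house, and an MSE of roughly 1.0 × 10^11 corresponds to a root-mean-square error of about 320,000. These figures were produced on the 100 rows of random values that the cell assigns to `variables` and `labels`, so they describe that placeholder data and not the housing dataset.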

## Implications for Housing Price Prediction

The negative R² scores from both cross-validation and testing imply that the model's predictive power is worse than a simple mean-based prediction. The most direct explanation is in the code: the features and labels are drawn independently with `np.random.rand`, so there is no relationship for the model to learn and any structure it fits is noise. On real data, a low or negative R² could also stem from:
- Insufficient feature engineering, which might fail to capture all the influential factors that affect housing prices.
- A data set that does not represent the underlying distribution of actual housing prices, potentially due to outliers or noise.
- An overly simple model that cannot capture the complexity of the data, or an overly complex model that does not generalize well to unseen data.
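To illustrate what "worse than a simple mean-based prediction" means numerically, the sketch below computes R² for a constant mean prediction and for predictions that are unrelated to the labels. The `r2` helper and the random values are hypothetical stand-ins.

```python
import numpy as np

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(42)
y_test = rng.uniform(0, 1_000_000, size=30)  # hypothetical test labels

# Baseline: always predict the mean of the evaluated labels, which gives R² = 0.
mean_prediction = np.full_like(y_test, y_test.mean())
r2_mean = r2(y_test, mean_prediction)

# Predictions drawn independently of the labels typically score below 0.
unrelated_prediction = rng.uniform(0, 1_000_000, size=30)
r2_unrelated = r2(y_test, unrelated_prediction)

print(f"R² of mean baseline:          {r2_mean}")
print(f"R² of unrelated predictions:  {r2_unrelated:.3f}")
```

An R² of 0 is therefore the reference point: any model scoring below it adds error compared with ignoring the features entirely.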

Nevertheless, the use of GBR allows us to iteratively improve our model by tuning hyperparameters and refining features based on model feedback. The grid search selected the smallest learning rate in the grid (0.01) together with the fewest estimators (100), the most conservative combination on offer for those two parameters, which is consistent with a search that found little signal to fit. This iterative process is crucial in moving towards a model that can reliably predict housing prices.

## Assessment of Results

The results indicate that the current GBR model does not yet provide a reliable prediction for housing prices, as reflected by the negative R² score. Before deploying such a model in a real-world setting, it would be essential to address its shortcomings. This could involve collecting more representative data, engineering additional features that might be relevant to housing prices (such as location desirability, proximity to amenities, economic indicators, etc.), or exploring more complex models that can handle the data's non-linearity more effectively.

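Moreover, the MSE varies considerably across the folds (a standard deviation of about 15% of the mean), so the model's performance is inconsistent. Part of this is a matter of sample size: with 70 training rows, each of the five folds is validated on only 14 rows. Evaluating on more data, investigating the data's quality, and ensuring that the training set is representative of the broader market should all reduce this variability.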

## Conclusion

The GBR model's current performance indicates that further work is necessary to make reliable housing price predictions. The analysis provides a starting point from which to refine the model. By addressing the possible issues mentioned above, enhancing data preprocessing, and continuing to tune the model, it is likely that a more accurate and reliable predictive performance could be achieved.

The GBR remains a promising tool for predicting housing prices due to its ability to model complex relationships. However, its effectiveness is contingent upon high-quality data and careful model tuning, which are critical steps to be undertaken in the next phase of model development.