# Predicting house prices using non-linear regression models

The goal is to develop regression models to predict house prices based on their characteristics. 

**Why?**

Accurate price predictions can support real estate investment decisions, help buyers and sellers make informed choices, inform lending-risk assessments, reveal market trends, and automate valuations.

# Scope

The available data describes houses sold in King County, USA, from May 2014 to May 2015. It was downloaded from [Kaggle](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction/data). Four non-linear regression algorithms will be tested: k-nearest neighbours, random forest, gradient boosting, and polynomial regression. These will not be used to extrapolate beyond the range of the available data. Instead, predictions will be made for a randomly sampled holdout test set.
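
The holdout idea can be sketched as follows. This is a minimal illustration on a made-up toy DataFrame, assuming scikit-learn's `train_test_split`; the data values, the 20% test fraction, and the random seed are placeholders, not the settings used to create the files loaded in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the house sales table.
toy_df = pd.DataFrame({
    "sqft_living": [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000, 3500],
    "price": [200000, 240000, 310000, 350000, 420000, 450000, 500000, 560000, 610000, 700000],
})

# Randomly hold out a share of the rows; fraction and seed are assumptions.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=42)

print(f"Train records: {len(toy_train)}")
print(f"Test records: {len(toy_test)}")
```

Because the split is random, the test rows come from the same range as the training rows, which is what keeps the evaluation from becoming an extrapolation exercise.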

# Set-up

## Dependencies

In [1]:
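import sys
# Machine-specific path to the project folder; it is added to sys.path so the local `config` module can be imported.
CONFIG_DIRECTORY = 'C:\\Users\\billy\\OneDrive\\Documents\\Python Scripts\\1. Portfolio\\house-price-regression\\house-price-regression'
if CONFIG_DIRECTORY not in sys.path:
    sys.path.insert(0, CONFIG_DIRECTORY)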

import ast
import config
import datetime
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import statsmodels.api as sm
from IPython.display import display
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
import warnings
plt.style.use('seaborn-v0_8-muted')

## Import data

In [2]:
train_df = pd.read_csv(config.file_directory('cleaned') + 'train_df.csv')
train_df = train_df.loc[:, train_df.columns!='yr_renovated_bool']
print(f"Number of records: {train_df.shape[0]}")
print(f"Number of columns: {train_df.shape[1]}")
train_df.head()

Number of records: 14480
Number of columns: 18


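| | bathrooms | bedrooms | condition | floors | grade | sqft_above | sqft_basement | sqft_living | sqft_living15 | sqft_lot | sqft_lot15 | view | waterfront | yr_built | yr_renovated | zipcode | price | zipcode_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.75 | 3 | 4 | 1.5 | 8 | 1910 | 0 | 1910 | 1820 | 17003 | 14806 | 0 | 0 | 1963 | 0 | 98001 | 175000.0 | 9 |
| 1 | 1.75 | 2 | 4 | 1.0 | 7 | 1490 | 0 | 1490 | 2280 | 9874 | 9869 | 0 | 0 | 1963 | 0 | 98004 | 898000.0 | 9 |
| 2 | 1.75 | 4 | 4 | 1.0 | 8 | 1990 | 0 | 1990 | 2620 | 8900 | 8925 | 0 | 0 | 1972 | 0 | 98006 | 745000.0 | 9 |
| 3 | 2.5 | 4 | 3 | 2.0 | 8 | 2200 | 940 | 3140 | 2860 | 7260 | 8186 | 0 | 0 | 2004 | 0 | 98006 | 744000.0 | 9 |
| 4 | 3.5 | 5 | 3 | 2.0 | 11 | 3480 | 1320 | 4800 | 4050 | 14984 | 19009 | 2 | 0 | 1998 | 0 | 98006 | 1350000.0 | 9 |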


In [3]:
transformed_train_df = pd.read_csv(config.file_directory('cleaned') + 'transformed_train_df.csv')
transformed_train_df = transformed_train_df.loc[:, transformed_train_df.columns!='yr_renovated_bool']
print(f"Number of records: {transformed_train_df.shape[0]}")
print(f"Number of columns: {transformed_train_df.shape[1]}")
transformed_train_df.head()

Number of records: 14480
Number of columns: 18


| | bathrooms_qt | bedrooms_qt | condition_qt | floors_qt | grade_qt | price_qt | sqft_above_qt | sqft_basement_qt | sqft_living15_qt | sqft_living_qt | sqft_lot15_qt | sqft_lot_qt | view_qt | waterfront | yr_built_qt | yr_renovated_qt | zipcode_price_qt | zipcode_qt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.520661 | -0.332198 | -0.434861 | 0.74304 | -0.478025 | -0.002509 | 0.700711 | -5.199338 | 0.812654 | 0.368215 | -0.698693 | 0.063183 | -5.199338 | 0 | 0.868016 | -5.199338 | 0.091712 | -0.011291 |
| 1 | -1.321946 | -0.332198 | -0.434861 | -5.199338 | -0.478025 | -0.78333 | -1.075207 | -5.199338 | -2.004234 | -1.391202 | -0.607463 | -0.580895 | -5.199338 | 0 | -1.334066 | 2.325972 | -0.364191 | -0.199538 |
| 2 | 2.013671 | 1.60221 | -0.434861 | 0.74304 | 1.634747 | 1.54295 | 2.175718 | -5.199338 | 1.387908 | 1.887753 | 0.129582 | 0.163099 | 1.56699 | 0 | 1.192071 | -5.199338 | -0.364191 | -0.16001 |
| 3 | 0.067799 | 0.68635 | -0.434861 | 0.74304 | 0.421111 | -0.689529 | 0.756333 | -5.199338 | 0.621099 | 0.439001 | 0.298876 | -0.035923 | -5.199338 | 0 | 0.186756 | -5.199338 | -0.364191 | -0.199538 |
| 4 | 0.520661 | -0.332198 | -0.434861 | 0.74304 | -0.478025 | 0.397918 | 0.595437 | -5.199338 | 0.584984 | 0.241963 | -0.960196 | -0.933654 | -5.199338 | 0 | 0.821412 | -5.199338 | -0.364191 | -0.238089 |


In [4]:
test_df = pd.read_csv(config.file_directory('cleaned') + 'test_df.csv')
test_df = test_df.loc[:, test_df.columns!='yr_renovated_bool']
print(f"Number of records: {test_df.shape[0]}")
print(f"Number of columns: {test_df.shape[1]}")
test_df.head()

Number of records: 7133
Number of columns: 18


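| | bathrooms | bedrooms | condition | floors | grade | sqft_above | sqft_basement | sqft_living | sqft_living15 | sqft_lot | sqft_lot15 | view | waterfront | yr_built | yr_renovated | zipcode | price | zipcode_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 2 | 3 | 3.0 | 7 | 1430 | 0 | 1430 | 1430 | 1650 | 1650 | 0 | 0 | 1999 | 0 | 98125 | 297000.0 | 5 |
| 1 | 3.25 | 4 | 4 | 2.0 | 12 | 4670 | 0 | 4670 | 4230 | 51836 | 41075 | 0 | 0 | 1988 | 0 | 98005 | 1578000.0 | 9 |
| 2 | 0.75 | 2 | 3 | 1.0 | 7 | 1200 | 240 | 1440 | 1440 | 3700 | 4300 | 0 | 0 | 1914 | 0 | 98107 | 562100.0 | 8 |
| 3 | 1.0 | 2 | 4 | 1.0 | 8 | 1130 | 0 | 1130 | 1680 | 2640 | 3200 | 0 | 0 | 1927 | 0 | 98109 | 631500.0 | 8 |
| 4 | 2.5 | 4 | 3 | 2.0 | 9 | 3180 | 0 | 3180 | 2440 | 9603 | 15261 | 2 | 0 | 2002 | 0 | 98155 | 780000.0 | 1 |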


# K-nearest neighbours

## Background

**References:**
1. Practical Statistics for Data Scientists, Peter Bruce, Andrew Bruce & Peter Gedeck
2. https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning#other
3. https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
4. https://ocw.mit.edu/courses/6-034-artificial-intelligence-fall-2010/resources/lecture-10-introduction-to-learning-nearest-neighbors/
5. https://scikit-learn.org/stable/modules/neighbors.html#regression
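
As a minimal sketch of the technique these references describe: k-nearest neighbours regression predicts a new observation's target as the average target of the k closest training observations. The example below assumes scikit-learn's `KNeighborsRegressor` with its default uniform weighting; the toy data, the variable names, and the choice of k = 3 are made up for illustration and are not the settings used in this project.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: living area (sqft) and price (in thousands of dollars).
X_toy = np.array([[1000], [1500], [2000], [2500], [3000]])
y_toy = np.array([200.0, 300.0, 400.0, 500.0, 600.0])

# k = 3 is an arbitrary choice for illustration, not a tuned value.
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_toy, y_toy)

# The three nearest neighbours of 1600 sqft are 1500, 2000 and 1000 sqft,
# so the prediction is the mean of 300, 400 and 200.
knn_pred = knn.predict(np.array([[1600]]))
print(knn_pred)  # → [300.]
```

With several features measured on different scales, distances are dominated by the largest-scale feature, so the inputs are usually standardised or otherwise rescaled before fitting.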

## Model training

## Model evaluation

## Conclusion

# Random forest

## Background

**References:**
1. Practical Statistics for Data Scientists, Peter Bruce, Andrew Bruce & Peter Gedeck
2. https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote17.html
3. https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html
4. https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning#tree
5. https://ocw.mit.edu/courses/6-034-artificial-intelligence-fall-2010/resources/lecture-11-learning-identification-trees-disorder/
6. https://scikit-learn.org/stable/modules/ensemble.html#random-forests-and-other-randomized-tree-ensembles
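
As a minimal sketch of the technique these references describe: a random forest fits many decision trees on bootstrap samples of the training data and averages their predictions. The example below assumes scikit-learn's `RandomForestRegressor`; the synthetic data, the variable names, and the hyperparameter values are made up for illustration and are not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with a non-linear relationship.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = X_toy[:, 0] ** 2 + 10 * X_toy[:, 1] + rng.normal(0, 1, size=500)

X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# 100 trees and a fixed seed are illustrative choices, not tuned values.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_fit, y_fit)
forest_pred = forest.predict(X_hold)

# The forest prediction is the average of the individual trees' predictions.
tree_preds = np.array([tree.predict(X_hold) for tree in forest.estimators_])
print(np.allclose(forest_pred, tree_preds.mean(axis=0)))
print(f"Holdout R^2: {r2_score(y_hold, forest_pred):.3f}")
```

Averaging many decorrelated trees is what reduces the variance of a single deep tree, which on its own tends to overfit.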

## Model training

## Model evaluation

## Conclusion

# Gradient boosting

## Background

**References:**
1. Practical Statistics for Data Scientists, Peter Bruce, Andrew Bruce & Peter Gedeck
2. https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning#tree
3. https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote19.html
4. https://ocw.mit.edu/courses/6-034-artificial-intelligence-fall-2010/resources/lecture-17-learning-boosting/
5. https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosted-trees
6. https://xgboost.ai/
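
For intuition, the core loop of gradient boosting with a squared-error loss can be sketched in a few lines: each new shallow tree is fitted to the residuals of the current ensemble, and its predictions are added with a small learning rate. This is a simplified illustration on made-up data, not the scikit-learn or XGBoost implementation; `X_toy`, `y_toy`, `learning_rate`, `n_rounds` and `boosted_predict` are hypothetical names and values chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data with a non-linear relationship.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1  # illustrative value, not tuned
n_rounds = 50        # illustrative value, not tuned

# Start from a constant prediction: the mean of the target.
baseline = y_toy.mean()
ensemble_pred = np.full_like(y_toy, baseline)
trees = []
train_mse = [np.mean((y_toy - ensemble_pred) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y_toy - ensemble_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Take a small step in the direction suggested by the new tree.
    ensemble_pred = ensemble_pred + learning_rate * tree.predict(X_toy)
    trees.append(tree)
    train_mse.append(np.mean((y_toy - ensemble_pred) ** 2))


def boosted_predict(X_new):
    """Sum the baseline and the scaled contribution of every tree."""
    pred = np.full(X_new.shape[0], baseline)
    for fitted_tree in trees:
        pred += learning_rate * fitted_tree.predict(X_new)
    return pred


print(f"Training MSE before boosting: {train_mse[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {train_mse[-1]:.4f}")
```

Library implementations such as scikit-learn's `GradientBoostingRegressor` follow the same idea and add refinements such as row subsampling and alternative loss functions.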

## Model training

## Model evaluation

## Conclusion

# Polynomial regression

## Background

**References:**
1. Practical Statistics for Data Scientists, Peter Bruce, Andrew Bruce & Peter Gedeck
2. https://scikit-learn.org/stable/modules/preprocessing.html#polynomial-features
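
As a minimal sketch of the technique these references describe: polynomial regression expands each feature into its powers (and, with several features, their interactions) and then fits an ordinary linear model on the expanded features. The example below assumes scikit-learn's `PolynomialFeatures` and `LinearRegression` combined with `make_pipeline`; the noise-free toy data, the variable names, and the choice of degree 2 are made up for illustration and are not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data generated from y = 2 + 3x + 0.5x^2.
x_toy = np.arange(-5, 6, dtype=float).reshape(-1, 1)
y_toy = 2 + 3 * x_toy[:, 0] + 0.5 * x_toy[:, 0] ** 2

# A degree-2 expansion turns the single column x into the two columns [x, x^2].
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    np.array([[3.0]])
)
print(expanded)  # → [[3. 9.]]

# Chain the expansion and the linear fit so both are applied at predict time.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x_toy, y_toy)

# At x = 4 the generating function gives 2 + 12 + 8 = 22.
poly_pred = poly_model.predict(np.array([[4.0]]))
print(round(float(poly_pred[0]), 6))  # → 22.0
```

The model stays linear in its coefficients, so the non-linearity comes entirely from the expanded features; higher degrees add flexibility but also increase the risk of overfitting.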

## Model training

## Model evaluation

## Conclusion