# House Price Prediction using Linear Regression

House price prediction is a common task in real estate and property markets. It involves estimating the value of a house based on various factors such as the number of bedrooms, the size of the house and other relevant features. Linear regression is a widely used statistical technique that can be applied to predict house prices based on these features.

In this notebook, we build a simple linear regression model to predict house prices. We use a dataset that describes individual houses by their features and corresponding prices. The goal is to train a regression model that estimates house prices accurately from those features.

## Importing the libraries

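We use `pandas` and `numpy` for data handling and `scipy.stats` for z-score based outlier detection. The project's own modules under `src` provide the two regression models (`LinearRegression` and `GDLinearRegression`), the test helper `base_test`, and the `city_price_encoding` preprocessing function. scikit-learn's `LinearRegression` is imported under the alias `SklearnLinearRegression` and serves as a baseline for comparison.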


In [12]:
import pandas as pd
import numpy as np

from scipy import stats
from src.linear_regression import GDLinearRegression, LinearRegression
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from src.tester import base_test
from src.preprocessing import city_price_encoding

## Load the data

The house price dataset has 18 columns and covers the months from February to October 2014. One column, `price`, is the prediction target; the remaining columns are candidate features. The columns are:

1) date : the date of the data for each house
2) price : the actual price for the house
3) bedrooms: the number of bedrooms in the house
4) bathrooms: the number of bathrooms in the house
5) sqft_living: the total living area in square feet, i.e. the heated or cooled area where you spend your time
6) sqft_lot: the total lot area in square feet, i.e. the land you own within the boundary lines determined by the city
7) floors: the number of floors in the house
8) waterfront: indicates whether the house has direct access to a natural or man-made waterway such as a lake, river, channel or canal
9) view: a numerical rating of the view from the house
10) condition: a numerical value that rates the condition of the house
11) sqft_above: the total living area in square feet that is above ground
12) sqft_basement: the total area of the basement in square feet
13) yr_built: the year the house was built
14) yr_renovated: the year the house was renovated
15) street: the street address of the house
16) city: the city the house is in
17) statezip: the state and ZIP code of the house (e.g. WA 98133)
18) country: the country the house is located in


In [13]:
df = pd.read_csv("data/house-price-data.csv")
df.head(5)


Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA


## Preprocess the data


Outlier detection and removal

Houses whose price has an absolute z-score of 2.8 or more are treated as outliers and dropped.


In [14]:
df = df[(np.abs(stats.zscore(df["price"])) < 2.8)]
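To make the z-score rule concrete, here is a minimal sketch on made-up prices. The helper name `zscore_filter` and the toy values are assumptions for illustration only; the z-score is computed with the population standard deviation, which is the default behaviour of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_filter(values, limit=2.8):
    """Return a boolean mask that keeps entries with |z-score| < limit."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), as scipy.stats.zscore uses by default
    z = (values - values.mean()) / values.std()
    return np.abs(z) < limit

# Hypothetical prices: nineteen ordinary houses and one extreme listing
prices = np.array([300_000.0] * 19 + [5_000_000.0])
mask = zscore_filter(prices)
kept = prices[mask]
print(len(prices), "->", len(kept))
```

Only the extreme listing is far enough from the mean to be dropped; the ordinary houses all stay.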


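Encoding city names

Each city name is replaced by a numeric feature, `city_price`, that holds the mean price of the houses in that city.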


In [15]:
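# Mean price per city, computed on the outlier-filtered data
df2 = df[["city", "price"]]
average_prices = {city: p["price"].mean() for city, p in df2.groupby("city")}
# Append the encoded column at the end of the frame
df.insert(len(df.columns), "city_price", df["city"].map(average_prices))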
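The same mean-price encoding can be sketched on a tiny, made-up frame. The `toy` data is hypothetical, and `groupby(...).transform("mean")` is used here as a compact alternative to the dictionary-and-map approach; it is not the code in `src.preprocessing`.

```python
import pandas as pd

# Hypothetical miniature dataset, for illustration only
toy = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Kent", "Kent", "Redmond"],
    "price": [600_000.0, 800_000.0, 300_000.0, 340_000.0, 550_000.0],
})

# Mean price per city, broadcast back onto every row of that city
toy["city_price"] = toy.groupby("city")["price"].transform("mean")
print(toy)
```

Because this encoding is derived from the target, it is commonly fitted on the training split only, so that test prices do not leak into the features.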

Feature and label splitting


In [16]:
X, y = df.loc[:, df.columns != "price"], df["price"]


Dropping unnecessary columns


In [17]:
X = X.drop(columns=["date", "street", "statezip", "country", "city"])


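Normalization

Features and target are standardized to zero mean and unit standard deviation (z-score normalization).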


In [18]:
X = (X - X.mean()) / X.std()
y = (y - y.mean()) / y.std()
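As a quick check of what this transformation does, the sketch below standardizes two columns with NumPy. The `standardize` helper is an assumed name; the values are the `sqft_living` and `bedrooms` entries of the five rows printed earlier. `ddof=1` is passed because pandas' `.std()` uses the sample standard deviation by default.

```python
import numpy as np

def standardize(a):
    """Column-wise z-score: subtract the mean, divide by the sample std."""
    a = np.asarray(a, dtype=float)
    # ddof=1 matches the default of pandas DataFrame.std()
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

# sqft_living and bedrooms for the five rows shown in the data preview
features = np.array([
    [1340.0, 3.0],
    [3650.0, 5.0],
    [1930.0, 3.0],
    [2000.0, 3.0],
    [1940.0, 4.0],
])
scaled = standardize(features)
print(scaled.mean(axis=0))
print(scaled.std(axis=0, ddof=1))
```

After scaling, every column has mean 0 and standard deviation 1 up to floating-point error, so features measured in very different units become comparable.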


In [19]:
X.head(5)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,city_price
0,-0.434029,-0.851352,-0.861752,-0.191529,-0.017328,-0.074239,-0.301596,-0.667344,-0.566286,-0.680225,-0.53181,1.220353,-0.653448
2,-0.434029,-0.191513,-0.203272,-0.078575,-0.946951,-0.074239,-0.301596,0.812038,0.148135,-0.680225,-0.161249,-0.826435,-1.42147
3,-0.434029,0.138407,-0.125147,-0.188226,-0.946951,-0.074239,-0.301596,0.812038,-0.977987,1.553916,-0.262311,-0.826435,1.689589
4,0.674131,0.468327,-0.192111,-0.119082,-0.946951,-0.074239,-0.301596,0.812038,-0.808463,1.107087,0.175625,1.207082,0.84715
5,-1.54219,-1.511192,-1.375144,-0.234415,-0.946951,-0.074239,-0.301596,-0.667344,-1.123293,-0.680225,-1.104495,1.209123,0.186241


## Initialize models

We built two models for this project: a linear regression model fitted in closed form by least squares, and a linear regression model fitted by gradient descent. Both are compared with scikit-learn's linear regression model.

### Least Squares Linear Regression


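The model assumes that the target is a linear function of the features, with an intercept $\beta_0$ handled by a leading column of ones in $X$:

$$y = X\beta$$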

$$
y=
\begin{pmatrix}
y_{1}\\
y_{2} \\
\vdots \\
y_{n}
\end{pmatrix},
\quad
\beta=
\begin{pmatrix}
\beta_{0}\\
\beta_{1} \\
\beta_{2} \\
\vdots \\
\beta_{p}
\end{pmatrix},
\quad
X=
\begin{pmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p}
\end{pmatrix}
$$

The coefficients that minimize the sum of squared residuals are given by the normal equation:

$$\beta = (X^TX)^{-1}X^Ty$$
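The closed-form solution translates almost directly into NumPy. The following is a minimal sketch, not the implementation in `src.linear_regression`; the function names `fit_least_squares` and `predict` and the synthetic data are assumptions. It solves the linear system $(X^TX)\beta = X^Ty$ instead of forming the inverse explicitly, which gives the same result when $X^TX$ is invertible and is numerically better behaved.

```python
import numpy as np

def fit_least_squares(X, y):
    """Estimate beta (intercept first) from the normal equation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the column of ones that carries the intercept beta_0
    Xb = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) beta = X^T y without computing the inverse
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def predict(X, beta):
    """Apply y = X beta, adding the intercept column first."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

beta = fit_least_squares(X, y)
print(beta)
```

On noise-free data the recovered coefficients match the generating ones up to floating-point error.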

### Optimized Linear Regression using Gradient Descent

Gradient descent finds the regression coefficients that minimize the cost function by starting from an initial guess and repeatedly moving the coefficients in the direction of steepest descent, i.e. against the gradient of the cost.
The model has the following hyperparameters:
- Learning rate: the step size taken in each iteration of gradient descent.
- Number of iterations: the maximum number of times gradient descent updates the regression coefficients.
- Threshold: a stopping criterion for the algorithm. Once the magnitude of the gradient vector falls below the threshold, the algorithm terminates.
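The roles of the three hyperparameters can be illustrated with a short NumPy sketch. It is not the `GDLinearRegression` class from `src.linear_regression`: the function name `fit_gradient_descent`, the choice of mean squared error as the cost, the zero initialization, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_gradient_descent(X, y, learning_rate=0.05, threshold=1e-9, max_iter=2500):
    """Minimize the mean squared error of y = X beta by gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.column_stack([np.ones(len(X)), X])  # intercept column
    beta = np.zeros(Xb.shape[1])                # assumed starting point
    n = len(y)
    steps = 0
    for _ in range(max_iter):
        residual = Xb @ beta - y
        # Gradient of (1/n) * sum(residual^2) with respect to beta
        gradient = (2.0 / n) * (Xb.T @ residual)
        # Stop once the gradient magnitude falls below the threshold
        if np.linalg.norm(gradient) < threshold:
            break
        beta -= learning_rate * gradient
        steps += 1
    return beta, steps

# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

beta, steps = fit_gradient_descent(X, y)
print(beta, steps)
```

With standardized-looking inputs like these, the threshold stops the loop well before `max_iter` is reached. A learning rate that is too large would make the updates diverge, while one that is too small would exhaust the iteration budget first.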

In [20]:
models = {
    "LinearRegression": LinearRegression(),
    "GDLinearRegression": GDLinearRegression(
        learning_rate=0.05, threshold=1e-9, max_iter=2500
    ),
    "SklearnLinearRegression": SklearnLinearRegression(),
}

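## Test each model

Each model is passed to `base_test` together with the `city_price_encoding` preprocessing function. For every model, the output reports the mean squared error (MSE) as a value with a +/- spread.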


In [21]:
for name, model in models.items():
    print(f"Testing {name}")
    base_test(model, city_price_encoding)


Testing LinearRegression
mse: 0.38902 +/- 0.04520
Testing GDLinearRegression
mse: 0.38904 +/- 0.04516
Testing SklearnLinearRegression
mse: 0.38924 +/- 0.04532
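The internals of `base_test` are not shown in this notebook. As a rough illustration of how a figure in the form `mse: mean +/- spread` could be produced, the sketch below runs a plain k-fold cross-validation with NumPy and reports the mean and standard deviation of the per-fold MSE. The function `kfold_mse`, the helpers `fit` and `predict`, the fold count, and the synthetic data are all assumptions and may differ from what `base_test` actually does.

```python
import numpy as np

def kfold_mse(fit, predict, X, y, k=5, seed=0):
    """Return (mean, std) of the test MSE over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        params = fit(X[train_idx], y[train_idx])
        errors = predict(X[test_idx], params) - y[test_idx]
        scores.append(np.mean(errors ** 2))
    return float(np.mean(scores)), float(np.std(scores))

def fit(X, y):
    # Least-squares fit with an intercept column
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def predict(X, beta):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Synthetic data: linear signal plus Gaussian noise with variance 0.36
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.6, size=200)

mean_mse, std_mse = kfold_mse(fit, predict, X, y)
print(f"mse: {mean_mse:.5f} +/- {std_mse:.5f}")
```

Evaluating all models on the same folds keeps the comparison fair, since differences in the score then reflect the models rather than the split.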
