<a href="https://colab.research.google.com/github/aayomide/house-price-prediction/blob/main/Forecasting_the_Cost_of_Homes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project:** House Price Prediction with Machine Learning

### **Problem Statement**

The goal of this data science project is to use the house price dataset to construct a regression machine-learning system for forecasting the cost of homes.

### **Task**

- Implement a machine learning model that predicts house sale prices as accurately as possible.
- Build a REST API for the ML model, preferably with FastAPI, or deploy the model on a cloud platform such as Heroku; a client such as Postman can be used to test the endpoint.

**Link to Dataset**: [house_data](https://docs.google.com/spreadsheets/d/1KM55GGkAuMJNaS6IHPfMapo02SvYWWQOrCw7Nky--vw/edit?usp=sharing)

-------------------

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_columns = 25

In [None]:
house_df = pd.read_csv("house_data.csv.csv")

print(house_df.shape)
house_df.head()

(21613, 21)


| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [None]:
#check data info
house_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [None]:
# confirm there are no missing values
print(house_df.isnull().sum().sum())

#check duplicates
house_df.duplicated().sum()

0


0

Observations so far:
- Every column except `date` is numeric. The `date` column is stored as `object` and should be a datetime; it is converted in the next cell.
- No missing values and no duplicate rows.

In [None]:
#convert date column from object to datetime data type
house_df['date'] = pd.to_datetime(house_df.date)

## EDA

In [None]:
# Check how strongly each numeric feature correlates with price
house_df.drop(columns="id").corr(numeric_only=True)['price'].sort_values(ascending=False)

price            1.000000
sqft_living      0.702044
grade            0.667463
sqft_above       0.605566
sqft_living15    0.585374
bathrooms        0.525134
view             0.397346
sqft_basement    0.323837
bedrooms         0.308338
lat              0.306919
waterfront       0.266331
floors           0.256786
yr_renovated     0.126442
sqft_lot         0.089655
sqft_lot15       0.082456
yr_built         0.053982
condition        0.036392
long             0.021571
zipcode         -0.053168
Name: price, dtype: float64
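
The values above are Pearson correlation coefficients between each numeric column and `price`. As a rough illustration of what `corr()` computes for a single pair of columns, here is a dependency-free sketch. The function name `pearson_r` and the toy numbers are made up for this example and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# toy data: living area (sqft) and price for four hypothetical houses
sqft = [1000, 1500, 2000, 2500]
price = [200_000, 290_000, 410_000, 500_000]
r = pearson_r(sqft, price)
print(round(r, 3))
```

Pearson correlation only captures linear association, so a feature with a modest coefficient (such as `lat`) can still matter a lot to a tree-based model.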

## Feature Engineering
The only features I'll be engineering here are the date features.

In [None]:
house_df = house_df.assign(day=house_df['date'].dt.day,
                           month=house_df['date'].dt.month,
                           year=house_df['date'].dt.year)
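
To make the date features concrete, the sketch below parses one raw value from the `date` column with the standard library and pulls out the same three parts that the `.dt` accessors produce. The format string is an assumption inferred from the sample rows shown earlier (`20141013T000000`).

```python
from datetime import datetime

raw = "20141013T000000"  # first value of the date column shown earlier
# assumed layout: year, month, day, a literal "T", then hours, minutes, seconds
parsed = datetime.strptime(raw, "%Y%m%dT%H%M%S")

date_features = {"day": parsed.day, "month": parsed.month, "year": parsed.year}
print(date_features)  # {'day': 13, 'month': 10, 'year': 2014}
```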

## Modelling
I'll train three vanilla models here: linear regression, random forest, and XGBoost. Vanilla as in no hyperparameter tuning yet.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


In [None]:
#split data into training and test set
features = house_df.drop(['id', 'price', 'date'], axis=1)
target = house_df.price

# apply an 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

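# normalize with MinMaxScaler; fit on the training split only so test-set statistics do not leak into training
# note: the models below are fit on the unscaled X_train, so these scaled arrays are kept for reference only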
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
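
The reason for fitting the scaler on the training split only can be seen in a small, dependency-free sketch of min-max scaling. The helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical; `MinMaxScaler` applies the same idea per feature.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training values only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned range: (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

# toy sqft_living values, not taken from the dataset
train_sqft = [800, 1200, 2000, 2800]
test_sqft = [1000, 3200]

lo, hi = fit_min_max(train_sqft)            # range comes from the training data
train_scaled = apply_min_max(train_sqft, lo, hi)
test_scaled = apply_min_max(test_sqft, lo, hi)  # reuse the training range

print(train_scaled)  # [0.0, 0.2, 0.6, 1.0]
print(test_scaled)   # [0.1, 1.2]
```

A test value larger than anything seen in training lands above 1, which is expected: the test set is treated as unseen data and never influences the learned range.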

#### Modelling with Linear Regression

In [None]:
# fit the model to the training data
linreg_model = LinearRegression(fit_intercept=True)
linreg_model.fit(X_train, y_train)

# make predictions
predictions_lr = linreg_model.predict(X_test)

In [None]:
# # check the weight of each feature, i.e. how much influence a feature had on the linear model
# predictors = X_train.columns
# coef = pd.Series(linreg_model.coef_, predictors).sort_values()

# print(coef)

#### Modelling with Random Forest

In [None]:
rfr_model = RandomForestRegressor(random_state=42)
rfr_model.fit(X_train, y_train)

predictions_rf = rfr_model.predict(X_test)

#### Modelling with XGBoost

In [None]:
xgbr_model = XGBRegressor()
xgbr_model.fit(X_train, y_train)

predictions_xg = xgbr_model.predict(X_test)

#### Check Accuracy of Models

In [None]:
def calculate_accuracy(y_test, y_pred):
    """Return [MAE, RMSE, R-squared] for a set of predictions."""
    # mean absolute error, root mean squared error, r-squared
    MAE = round(mean_absolute_error(y_test, y_pred), 3)
    RMSE = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
    R2 = r2_score(y_test, y_pred)

    #print(f"Mean Absolute Error (MAE): {MAE} \n")
    #print(f"Root Mean Squared Error (RMSE): {RMSE} \n")
    #print(f"R-squared: {R2}")

    return [MAE, RMSE, R2]
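
As a worked example of the three metrics, the sketch below recomputes them by hand for three made-up predictions, using only the standard library. The numbers are illustrative and unrelated to the dataset.

```python
import math

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average size of the errors, in the same units as the target
mae = sum(abs(e) for e in errors) / n
# RMSE: square root of the average squared error; penalises large misses more
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(round(mae, 3), round(rmse, 3), round(r2, 3))  # 16.667 19.149 0.945
```

Lower MAE and RMSE are better, while a higher R-squared is better.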

In [None]:
accuracy_df = pd.DataFrame(index=['MAE', 'RMSE', 'r-squared'])
accuracy_df['Linear Regression'] = calculate_accuracy(y_test, predictions_lr)
accuracy_df['Random Forest'] = calculate_accuracy(y_test, predictions_rf)
accuracy_df['XGBoost'] = calculate_accuracy(y_test, predictions_xg)

In [None]:
accuracy_df

| | Linear Regression | Random Forest | XGBoost |
|---|---|---|---|
| MAE | 127046.861 | 73337.644 | 70738.553 |
| RMSE | 212139.636 | 151064.501 | 147245.167 |
| r-squared | 0.702664 | 0.849225 | 0.856753 |


    XGBoost slightly edges out the Random Forest model: it has the lower MAE and RMSE and the higher r-squared.

### Feature Importance
Let's check the most impactful drivers of house prices.

In [None]:
# check the importance of each feature, i.e. how much it contributed to the random forest model
predictors = X_train.columns
coef = pd.Series(rfr_model.feature_importances_, predictors).sort_values(ascending=False)

print(coef)

grade            0.313839
sqft_living      0.272860
lat              0.151332
long             0.062039
yr_built         0.032161
waterfront       0.031429
sqft_living15    0.029854
sqft_above       0.019093
zipcode          0.014207
sqft_lot         0.012766
sqft_lot15       0.011436
bathrooms        0.011327
view             0.010017
day              0.006400
sqft_basement    0.005570
month            0.005007
bedrooms         0.002897
condition        0.002735
yr_renovated     0.002183
floors           0.001703
year             0.001144
dtype: float64


In [None]:
# list(coef[:4].index)
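
One way to read the ranking is to ask how much of the total importance the leading features account for. The sketch below does this with the six largest values copied from the output above; the variable names are made up for illustration.

```python
# random-forest importances copied from the output above (six largest only)
importances = {
    "grade": 0.313839,
    "sqft_living": 0.272860,
    "lat": 0.151332,
    "long": 0.062039,
    "yr_built": 0.032161,
    "waterfront": 0.031429,
}

# rank by importance, highest first
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
top_four = [name for name, _ in ranked[:4]]
top_four_share = sum(score for _, score in ranked[:4])

print(top_four)                  # ['grade', 'sqft_living', 'lat', 'long']
print(round(top_four_share, 3))  # 0.8
```

In this run, grade, living area, and location (latitude and longitude) together carry roughly 80% of the random forest's total importance.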

## End to End Pipeline
Let's finalize the pipeline by adding our model to it. 

Then we save our model, load it and use it to predict our test data.

In [None]:
import pickle
from sklearn.pipeline import Pipeline

In [None]:
end_2_end_pipeline = Pipeline([
    ('preparation', MinMaxScaler()),
    ('regressor', XGBRegressor(random_state=42))
])

# fit the pipeline on the training set, so scaling and model training happen in one step
end_2_end_pipeline.fit(X_train, y_train)

# save the model
filename = 'house_price_model.pkl'
with open(filename, 'wb') as file:
    pickle.dump(end_2_end_pipeline, file)
    
# ====
# load the saved model
with open(filename, 'rb') as f:
    load_model = pickle.load(f)
    
    
# run the model on the test set
y_pred = load_model.predict(X_test)
y_pred

array([ 414595.62,  863093.56, 1063882.2 , ...,  308299.75,  575448.6 ,
        342928.84], dtype=float32)

## REST API

In [None]:
import requests

In [None]:
#TESTING THE ENDPOINT
# base_url = "http://127.0.0.1"
base_url = "https://predict-house-price-zummit.herokuapp.com"
endpoint = "/predict"
full_url = base_url + endpoint

json_ = {
  "date": "2022-07-21",
  "bedrooms": 0,
  "bathrooms": 0,
  "sqft_living": 0,
  "sqft_lot": 0,
  "floors": 0,
  "waterfront": 0,
  "view": 0,
  "condition": 0,
  "grade": 0,
  "sqft_above": 0,
  "sqft_basement": 0,
  "yr_built": 0,
  "yr_renovated": 0,
  "zipcode": 0,
  "lat": 0,
  "long": 0,
  "sqft_living15": 0,
  "sqft_lot15": 0
}


response = requests.post(full_url, json=json_)  # post the payload to the endpoint
status_code = response.status_code
msg = "Everything went well!" if status_code == 200 else "There was an error when handling the request."
print(msg)

print(response.json())

Everything went well!
{'prediction': [100339.59375]}
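
The payload carries a single `date` string, while the model was trained on separate `day`, `month`, and `year` columns, so the server has to rebuild those columns before calling the pipeline. The notebook does not show the server code, so the following is only a sketch of how that step might look. The helper `payload_to_row` is hypothetical, and the column order in `FEATURE_ORDER` is inferred from the training cells above.

```python
from datetime import date

# assumed training column order: raw features, then the engineered date parts
FEATURE_ORDER = [
    "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "waterfront",
    "view", "condition", "grade", "sqft_above", "sqft_basement", "yr_built",
    "yr_renovated", "zipcode", "lat", "long", "sqft_living15", "sqft_lot15",
    "day", "month", "year",
]

def payload_to_row(payload):
    """Turn a /predict JSON payload into one feature row (hypothetical helper)."""
    sale_date = date.fromisoformat(payload["date"])
    # add the engineered date parts alongside the raw fields
    values = dict(payload, day=sale_date.day, month=sale_date.month, year=sale_date.year)
    # emit the values in the order the model expects
    return [values[name] for name in FEATURE_ORDER]

# same shape as the test payload above: every numeric field set to 0
example = {name: 0 for name in FEATURE_ORDER[:18]}
example["date"] = "2022-07-21"
row = payload_to_row(example)
print(row[-3:])  # [21, 7, 2022]
```

The resulting row could then be wrapped in a one-row DataFrame with the same column names and passed to the loaded pipeline's `predict` method.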
