# House Pricing Forecast

This notebook predicts the actual value of houses in the Australian market. It uses a regression model with regularization techniques to counter overfitting.
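As a reminder of what regularization means for a regression model, a penalty on the size of the coefficients is added to the usual squared-error loss. The two most common forms are sketched below. The notation ($\beta$ for the coefficients, $\lambda$ for the penalty strength, $n$ rows, $p$ features) is assumed for illustration, and this part of the notebook does not state which penalty is eventually used.

$$
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

$$
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|
$$

With $\lambda = 0$ both reduce to ordinary least squares. A larger $\lambda$ shrinks the coefficients more strongly, which is what limits overfitting.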

In [1]:
## Imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
## Read the dataset

url = "https://raw.githubusercontent.com/adiraptor/house_pricing_assignment/main/data/train.csv"
housing_df = pd.read_csv(url)
housing_df.head()

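|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |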


## EDA

This section covers data cleaning, manipulation and exploratory analysis of the dataset.


In [4]:
## Taking a peek at the dataset

housing_df.describe()

|   | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


### Handling Missing Values

In [5]:
def calc_missing_values():
    # Print each column that has missing values, with the count and percentage of rows
    for col in housing_df.columns:
        # isnull() and isna() are aliases, so the missing values are counted only once
        n_miss = housing_df[col].isna().sum()
        perc = n_miss / housing_df.shape[0] * 100
        if n_miss > 0:
            print('-> %s, Missing: %d (%.1f%%)' % (col, n_miss, perc))

In [6]:
## Calculate the number of null values for each column

calc_missing_values()

-> LotFrontage, Missing: 259 (17.7%)
-> Alley, Missing: 1369 (93.8%)
-> MasVnrType, Missing: 8 (0.5%)
-> MasVnrArea, Missing: 8 (0.5%)
-> BsmtQual, Missing: 37 (2.5%)
-> BsmtCond, Missing: 37 (2.5%)
-> BsmtExposure, Missing: 38 (2.6%)
-> BsmtFinType1, Missing: 37 (2.5%)
-> BsmtFinType2, Missing: 38 (2.6%)
-> Electrical, Missing: 1 (0.1%)
-> FireplaceQu, Missing: 690 (47.3%)
-> GarageType, Missing: 81 (5.5%)
-> GarageYrBlt, Missing: 81 (5.5%)
-> GarageFinish, Missing: 81 (5.5%)
-> GarageQual, Missing: 81 (5.5%)
-> GarageCond, Missing: 81 (5.5%)
-> PoolQC, Missing: 1453 (99.5%)
-> Fence, Missing: 1179 (80.8%)
-> MiscFeature, Missing: 1406 (96.3%)


The data description shows that columns such as Alley, BsmtQual and GarageType use 'NA' as a legitimate category label, meaning the house does not have that feature. pandas has read these labels as null values.

I have therefore replaced the null values in those columns with the string 'None', so that they are treated as a category of their own.

In [7]:
## Replace the NA labels in these columns with the category 'None'

na_columns = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
              'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
              'PoolQC', 'Fence', 'MiscFeature']

for col in na_columns:
    housing_df[col] = housing_df[col].fillna('None')

In [8]:
## Recalculating missing values
calc_missing_values()

-> LotFrontage, Missing: 259 (17.7%)
-> MasVnrType, Missing: 8 (0.5%)
-> MasVnrArea, Missing: 8 (0.5%)
-> Electrical, Missing: 1 (0.1%)
-> GarageYrBlt, Missing: 81 (5.5%)


This gives a more accurate picture of which values are actually missing in the dataset.
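An alternative to restoring the 'NA' labels after loading is to stop pandas from converting them to nulls in the first place. The sketch below is an assumption about how that could look, not what this notebook does. It uses a tiny in-memory CSV with made-up rows in place of `train.csv`, and the names `csv_text` and `sample_df` are hypothetical.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv (hypothetical rows, not taken from the dataset).
csv_text = """Id,Alley,LotFrontage
1,NA,65.0
2,Pave,NA
3,NA,80.0
"""

# keep_default_na=False stops pandas from treating 'NA' as missing everywhere.
# The na_values dict then re-enables 'NA' as a null marker only for the listed
# numeric column, where 'NA' really does mean "not recorded".
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values={"LotFrontage": ["NA"]},
)

print(sample_df["Alley"].tolist())
print(sample_df["LotFrontage"].isna().sum())
```

With this approach Alley keeps the literal string 'NA' as a category, while LotFrontage still reports its missing entry. On the full dataset every numeric column that uses 'NA' for missing data would need to be listed in `na_values`.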

### Missing Value Treatment

Before imputing or removing data from any of these columns, I will look at the data description to make sense of what each missing value *may* imply:

*   Lot Frontage has all of its missing values tagged as NA, but none of those rows has a Lot Area of 0 or NA. This means that
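
The Lot Frontage observation above can be checked with a short filter. This is a minimal sketch on a hypothetical four-row frame (`sample_df`, with made-up missing entries). In the notebook the same two steps would be applied to `housing_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be housing_df itself.
sample_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Rows where LotFrontage is missing
missing_frontage = sample_df[sample_df["LotFrontage"].isna()]

# Among those rows, flag any whose LotArea is also missing or zero
bad_area = missing_frontage["LotArea"].isna() | (missing_frontage["LotArea"] == 0)
n_bad = int(bad_area.sum())

print(len(missing_frontage), n_bad)
```

If `n_bad` is 0, every row with a missing Lot Frontage still has a usable Lot Area.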

