## House Price Prediction using Linear, Ridge and Lasso Regression

### Business Objective

A housing company named 'Surprise Housing' has decided to enter the Australian market. The company wants to purchase houses below their actual value and sell them at a higher price. To that end, it has collected a data set from house sales in Australia.

The company wants to know:

- Which variables are significant in predicting the price of a house, and

- How well those variables describe the price of a house.

### 1. Data Understanding and Exploration
Let's first look at the dataset to understand its size, attribute names and data types.

In [None]:
# import libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error


In [None]:
# Read the dataset
housing_df = pd.read_csv('train.csv')

In [None]:
# summary of the dataset
print(housing_df.info())

In [None]:
# head
housing_df.head()

In [None]:
# Let's check the data type of each column against the data description.
# According to the data description, MSSubClass, OverallQual, OverallCond, BsmtFullBath, BsmtHalfBath,
# FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageCars and MoSold
# hold categories, so we convert them to object type. YrSold is treated as categorical as well.

for i in ['MSSubClass','OverallQual','OverallCond', 'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath',
          'BedroomAbvGr','KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces','GarageCars','MoSold','YrSold']:
    housing_df[i] = housing_df[i].astype('object')
    
    

#### Derived metrics

In [None]:
# Derived metrics: from the year columns we can derive the age of the house, the remodel and the garage
# (ages are measured relative to 2022)

housing_df['House_Age'] =  2022 - housing_df['YearBuilt']
housing_df['RemodAdd_Age'] =  2022 - housing_df['YearRemodAdd']
housing_df['Garage_Age'] =  2022 - housing_df['GarageYrBlt']

# Dropping the original year columns

housing_df.drop(['YearBuilt','YearRemodAdd','GarageYrBlt'], axis = 1, inplace = True)

In [None]:
housing_df.describe()

In [None]:
# Removing the Id column as it is of no use in prediction

housing_df.drop(['Id'], axis = 1, inplace = True)


#### Missing values treatment

In [None]:
# Checking and treating missing values in dataset 

for i in housing_df.columns:
    print(i, ':', (housing_df[i].isnull().sum()/len(housing_df))*100)
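
The per-column loop above can also be written as a single vectorised expression. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not rows from `train.csv`); it keeps only the columns that actually have missing values and sorts them.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = toy_df.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually contain missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

Sorting makes it easier to spot columns that are mostly empty and are therefore candidates for dropping.
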

In [None]:
# Replacing NA values with zero for numerical features
# (note: this also gives houses without a garage a Garage_Age of 0)
# Assigning the result back avoids chained inplace fillna, which newer pandas versions warn about
for i in ['LotFrontage', 'MasVnrArea', 'Garage_Age']:
    housing_df[i] = housing_df[i].fillna(0)

# Replacing NA with 'None' for categorical features
for i in ['Alley','MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
          'Electrical','GarageType','GarageFinish','GarageQual','GarageCond','FireplaceQu','PoolQC',
          'Fence','MiscFeature']:
    housing_df[i] = housing_df[i].fillna('None')


In [None]:
# Dropping columns in which a single category dominates
housing_df.drop(['Alley','Fence','PoolQC','MiscFeature','Utilities'], axis = 1, inplace = True)

#### Data Exploration
To perform linear regression, the (numeric) target variable should be linearly related to at least one other numeric variable. Let's see whether that's true in this case.

We'll first look at the correlations among all numeric variables on a heatmap.

In [None]:
# plotting correlations on a heatmap

plt.figure(figsize=(20,20))
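# restrict to numeric columns: corr() raises an error on object columns in pandas >= 2.0
sns.heatmap(housing_df.select_dtypes(include='number').corr(), annot = True, cmap = 'Greens')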
plt.show()

The heatmap shows some useful insights:

Correlation of price with independent variables:

- Price is highly (positively) correlated with TotalBsmtSF, 1stFlrSF, GrLivArea and GarageArea.

- Price is negatively correlated with House_Age, RemodAdd_Age and Garage_Age.

Correlation among independent variables:

- Several independent variables are highly correlated with each other, for example TotalBsmtSF and 1stFlrSF (positively).
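
With this many features the heatmap is hard to read, so it can help to rank the correlations with the target directly. The sketch below uses a made-up toy frame (`toy_df` and its values are assumptions for illustration); on the real data the same expression would be applied to `housing_df`.

```python
import pandas as pd

# Toy frame standing in for housing_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "House_Age": [60, 45, 30, 20, 5],
    "SalePrice": [100000, 150000, 210000, 240000, 310000],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave"],
})

# Correlation only makes sense for numeric columns
numeric_df = toy_df.select_dtypes(include="number")

# Correlation of every numeric feature with the target, strongest positive first
corr_with_price = (
    numeric_df.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_price)
```
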

In [None]:
# separating numerical and categorical columns for analysis

num_col = []
cat_col = []

for i in housing_df.columns:
    if housing_df[i].dtypes == 'object':
        cat_col.append(i)
    else:
        num_col.append(i)
        
print('Numerical Columns:', num_col)

print('Categorical Columns:', cat_col)


In [None]:
# plotting scatter plots for numerical variables

c = 1
plt.figure(figsize=(20,80))
for i in num_col:
    plt.subplot(13,3,c)
    sns.scatterplot(data = housing_df , y = "SalePrice" , x = housing_df[i])
    c = c+1
    
plt.show()

Observation:

- Price is highly (positively) correlated with TotalBsmtSF, 1stFlrSF, GrLivArea and GarageArea.

- Price is negatively correlated with House_Age, RemodAdd_Age and Garage_Age.


In [None]:
# checking boxplots for categorical variables

c = 1
plt.figure(figsize=(20,80))
for i in cat_col:
    plt.subplot(15,4,c)
    sns.boxplot(data = housing_df, x = housing_df[i], y = 'SalePrice')
    c = c+1
    
plt.show()

Observation:

- In MSZoning, prices are high for "RL" (Residential Low Density) properties.
- Street gives the type of road access to the property.

### 2. Data Preparation

Let's now prepare the data and build the model.

In [None]:
# convert into dummies - one hot encoding
housing_categorical = housing_df[cat_col]
housing_dummies = pd.get_dummies(housing_categorical, drop_first=True)
housing_dummies.head()
print(housing_dummies.shape)
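
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a single made-up column (`toy_cat` is an assumption for illustration). The first category in sorted order gets no column of its own and is represented by all remaining dummies being zero, which avoids perfectly collinear dummy columns.

```python
import pandas as pd

# One categorical column with three levels (values are made up for illustration)
toy_cat = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})

# Full one-hot encoding: one column per level
full = pd.get_dummies(toy_cat)

# drop_first=True drops the first level in sorted order ("FV" here)
reduced = pd.get_dummies(toy_cat, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```
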

In [None]:
# adding dummy variables to the dataset
housing_df = pd.concat([housing_df[num_col],housing_dummies],axis=1)
print(housing_df.shape)
housing_df.head()

In [None]:
# Splitting Data as 80% Train set and 20% Test set

housing_train, housing_test = train_test_split(housing_df, train_size = 0.8, random_state = 100 )
print(housing_train.shape)
print(housing_test.shape)

In [None]:
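# scaling the features - necessary before using Ridge or Lasso
# (num_col includes SalePrice, so the target is scaled to [0, 1] as well)

# 1. Instantiate the scaler
scaler = MinMaxScaler()

# 2. Fit on the training data and transform it
housing_train[num_col] = scaler.fit_transform(housing_train[num_col])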
housing_train.head()

In [None]:
y_train = housing_train.pop('SalePrice')
X_train = housing_train
print(y_train.shape)
print(X_train.shape)

In [None]:
# Transform the test data
housing_test[num_col] = scaler.transform(housing_test[num_col])

y_test = housing_test.pop('SalePrice')
X_test = housing_test
print(y_test.shape)
print(X_test.shape)
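
The scaler is fitted on the training set only and then reused on the test set, so no information from the test set leaks into the preprocessing. A small sketch with made-up numbers (`toy_train`, `toy_test` and `toy_scaler` are assumptions for illustration) shows one consequence: scaled test values can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
toy_train = np.array([[100.0], [200.0], [300.0]])
toy_test = np.array([[50.0], [250.0], [400.0]])

toy_scaler = MinMaxScaler()

# Learn min and max from the training data only
train_scaled = toy_scaler.fit_transform(toy_train)

# Reuse the training min and max on the test data
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values below the training minimum or above the training maximum map to values below 0 or above 1, which is expected and not an error.
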

## 3. Model Building and Evaluation

### Model 1 : Linear Regression Model

In [None]:
# Instantiate
lm = LinearRegression()

# Fit a line
lm.fit(X_train, y_train)

# Print the coefficients and intercept
print(lm.intercept_)
print(lm.coef_)

In [None]:
y_pred_train = lm.predict(X_train)
y_pred_test = lm.predict(X_test)

metric = []
r2_train_lr = r2_score(y_train, y_pred_train)
print(r2_train_lr)
metric.append(r2_train_lr)

r2_test_lr = r2_score(y_test, y_pred_test)
print(r2_test_lr)
metric.append(r2_test_lr)

rss1_lr = np.sum(np.square(y_train - y_pred_train))
print(rss1_lr)
metric.append(rss1_lr)

rss2_lr = np.sum(np.square(y_test - y_pred_test))
print(rss2_lr)
metric.append(rss2_lr)

mse_train_lr = mean_squared_error(y_train, y_pred_train)
print(mse_train_lr)
metric.append(mse_train_lr**0.5)

mse_test_lr = mean_squared_error(y_test, y_pred_test)
print(mse_test_lr)
metric.append(mse_test_lr**0.5)
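
The same six numbers are computed again for Ridge and Lasso further down, so the cell above could be wrapped in a helper. This is only a sketch; `regression_metrics` is a hypothetical name that does not appear in the notebook, and the toy inputs are made up.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error


def regression_metrics(y_train, pred_train, y_test, pred_test):
    """Return [R2 train, R2 test, RSS train, RSS test, RMSE train, RMSE test]."""
    y_train = np.asarray(y_train, dtype=float)
    pred_train = np.asarray(pred_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    pred_test = np.asarray(pred_test, dtype=float)
    return [
        r2_score(y_train, pred_train),
        r2_score(y_test, pred_test),
        float(np.sum(np.square(y_train - pred_train))),  # residual sum of squares
        float(np.sum(np.square(y_test - pred_test))),
        float(np.sqrt(mean_squared_error(y_train, pred_train))),  # RMSE
        float(np.sqrt(mean_squared_error(y_test, pred_test))),
    ]


# Toy check: one training prediction is off by 1, test predictions are exact
toy_metrics = regression_metrics([1, 2, 3], [1, 2, 4], [1, 2, 3], [1, 2, 3])
print(toy_metrics)
```

With such a helper, each model would need a single call such as `regression_metrics(y_train, lm.predict(X_train), y_test, lm.predict(X_test))`.
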

##### The vanilla linear regression model above is overfitting: it gives a very high R2 value on the train set but performs poorly on the test set

### Model 2 : Ridge Regression

In [None]:
# list of alphas to tune - if value too high it will lead to underfitting, if it is too low, 
# it will not handle the overfitting
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 
 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 
 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()

# cross validation
folds = 5
model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error',  
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
model_cv.fit(X_train, y_train) 
#https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
# Printing the best hyperparameter alpha
print(model_cv.best_params_)
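
Besides `best_params_`, the fitted search object exposes the mean train and validation scores for every alpha, which shows how quickly performance degrades away from the optimum. The sketch below runs on synthetic data (`X_toy`, `y_toy` and the shortened alpha grid are assumptions for illustration), so its numbers say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = X_toy @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=60)

toy_cv = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
)
toy_cv.fit(X_toy, y_toy)

# One row per alpha; scores are negative MAE, so values closer to zero are better
cv_table = pd.DataFrame(toy_cv.cv_results_)[
    ["param_alpha", "mean_train_score", "mean_test_score"]
]
print(cv_table)

# With the default refit=True, the best model is already refitted on all the data
best_ridge = toy_cv.best_estimator_
print(toy_cv.best_params_, best_ridge.alpha)
```

Using `best_estimator_` (or `best_params_['alpha']`) avoids copying the chosen alpha by hand into the next cell.
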

In [None]:
# Fitting the Ridge model with alpha = 8.0 and printing the penalised coefficients
alpha = 8.0
ridge = Ridge(alpha=alpha)

ridge.fit(X_train, y_train)
print(ridge.coef_)

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE
y_pred_train = ridge.predict(X_train)
y_pred_test = ridge.predict(X_test)

metric2 = []
r2_train_lr = r2_score(y_train, y_pred_train)
print(r2_train_lr)
metric2.append(r2_train_lr)

r2_test_lr = r2_score(y_test, y_pred_test)
print(r2_test_lr)
metric2.append(r2_test_lr)

rss1_lr = np.sum(np.square(y_train - y_pred_train))
print(rss1_lr)
metric2.append(rss1_lr)

rss2_lr = np.sum(np.square(y_test - y_pred_test))
print(rss2_lr)
metric2.append(rss2_lr)

mse_train_lr = mean_squared_error(y_train, y_pred_train)
print(mse_train_lr)
metric2.append(mse_train_lr**0.5)

mse_test_lr = mean_squared_error(y_test, y_pred_test)
print(mse_test_lr)
metric2.append(mse_test_lr**0.5)

### Model 3 : Lasso Regression

In [None]:
lasso = Lasso()

# cross validation
model_cv = GridSearchCV(estimator = lasso, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            

model_cv.fit(X_train, y_train) 

In [None]:
# Printing the best hyperparameter alpha
print(model_cv.best_params_)

In [None]:
# Fitting the Lasso model with alpha = 0.0001 and printing the penalised coefficients

alpha =0.0001

lasso = Lasso(alpha=alpha)
        
lasso.fit(X_train, y_train) 

lasso.coef_
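
Unlike Ridge, Lasso can shrink coefficients exactly to zero, so counting the non-zero entries of `coef_` shows how many features the model actually uses. The sketch below illustrates this on synthetic data where only two of six features matter (`X_toy`, `y_toy` and the three alpha values are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only features 0 and 3 influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=100)

nonzero_counts = {}
selected_features = {}
for alpha in [0.0001, 0.1, 5.0]:
    toy_lasso = Lasso(alpha=alpha).fit(X_toy, y_toy)
    # Indices of the features whose coefficient was not shrunk to zero
    kept = np.flatnonzero(toy_lasso.coef_)
    nonzero_counts[alpha] = len(kept)
    selected_features[alpha] = kept.tolist()

print(nonzero_counts)
print(selected_features)
```

A larger alpha removes more features; a very small alpha such as the one chosen above keeps most of them, so the resulting model behaves much like plain linear regression with mild shrinkage.
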

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE

y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

metric3 = []
r2_train_lr = r2_score(y_train, y_pred_train)
print(r2_train_lr)
metric3.append(r2_train_lr)

r2_test_lr = r2_score(y_test, y_pred_test)
print(r2_test_lr)
metric3.append(r2_test_lr)

rss1_lr = np.sum(np.square(y_train - y_pred_train))
print(rss1_lr)
metric3.append(rss1_lr)

rss2_lr = np.sum(np.square(y_test - y_pred_test))
print(rss2_lr)
metric3.append(rss2_lr)

mse_train_lr = mean_squared_error(y_train, y_pred_train)
print(mse_train_lr)
metric3.append(mse_train_lr**0.5)

mse_test_lr = mean_squared_error(y_test, y_pred_test)
print(mse_test_lr)
metric3.append(mse_test_lr**0.5)

#### Metrics evaluation and comparison for all models

In [None]:
# Creating a table which contains all the metrics
# (the last two entries of each metric list are square roots of the MSE, i.e. RMSE)

lr_table = {'Metric': ['R2 Score (Train)','R2 Score (Test)','RSS (Train)','RSS (Test)',
                       'RMSE (Train)','RMSE (Test)'], 
        'Linear Regression': metric
        }

lr_metric = pd.DataFrame(lr_table ,columns = ['Metric', 'Linear Regression'] )

rg_metric = pd.Series(metric2, name = 'Ridge Regression')
ls_metric = pd.Series(metric3, name = 'Lasso Regression')

final_metric = pd.concat([lr_metric, rg_metric, ls_metric], axis = 1)

final_metric

### Let's observe the changes in the coefficients after regularization

In [None]:
# one row per feature, one column per model
betas = pd.DataFrame(index=X_train.columns)
betas['Linear'] = lm.coef_
betas['Ridge'] = ridge.coef_
betas['Lasso'] = lasso.coef_

In [None]:
pd.set_option('display.max_rows', None)
betas

### Observation:

The following features have a positive impact on house price:

- LotArea: Lot size in square feet
- GrLivArea: Above grade (ground) living area in square feet
- Neighborhood_NoRidge: Physical location within Ames city limits (Northridge)
- OverallQual_10: Overall material and finish of the house (rating = 10)
- RoofMatl_WdShngl: Roof material (Wood Shingles)


The following features have a negative impact on house price (that is, the price drops as the values of these features increase):

- House_Age: Years since the house was built
- RemodAdd_Age: Years since the house was last remodelled
- Neighborhood_Edwards: Physical location within Ames city limits (Edwards)
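
Lists like the ones above can be read off a coefficient table by sorting it rather than scanning every row. The following is a minimal sketch; `toy_betas` and its coefficient values are made up for illustration and are not the fitted values from this notebook.

```python
import pandas as pd

# Made-up Lasso coefficients for illustration (not the notebook's fitted values)
toy_betas = pd.DataFrame(
    {"Lasso": [0.30, 0.12, 0.0, -0.05, -0.08]},
    index=["GrLivArea", "LotArea", "Street_Pave", "RemodAdd_Age", "House_Age"],
)

# Features that Lasso kept (non-zero coefficient)
selected = toy_betas[toy_betas["Lasso"] != 0]

# Largest positive and most negative coefficients
top_positive = selected["Lasso"].nlargest(2)
top_negative = selected["Lasso"].nsmallest(2)

print(top_positive)
print(top_negative)
```

Because the features were min-max scaled before fitting, coefficient magnitudes are roughly comparable across numeric features, which is what makes such a ranking meaningful.
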