# House Price Prediction Regression Project

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual value and flip them at a higher price. For this purpose, the company has collected a data set from the sale of houses in Australia.

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

The company wants to know:

- Which variables are significant in predicting the price of a house, and
- How well those variables describe the price of a house.

## Business Goal

You are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can accordingly manipulate the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

### Approach:

1.   Importing modules, Reading the data
2.   Analyzing Numerical Features
    *   Checking Statistical summary
    *   Checking Distribution of numerical features
    *   Outlier Treatment
    *   Inspecting Correlation
    *   Missing Value Handling
    *   Extracting new features and dropping redundant ones
    *   Correcting datatype
    *   Univariate and Bivariate Analysis, Data Visualization
3.  Analyzing Categorical Features
    *   Missing Value Handling
    *   Encoding Categorical Features
    *   Data Visualization
    *   Dropping Redundant Features
4.  Splitting data into Train and Test data
    *   Transformation of Target Variable
    *   Imputing Missing Values
    *   Feature Scaling
5.  Primary Feature Selection using RFE
6.  Ridge Regression
7.  Lasso Regression
8.  Comparing model coefficients
9.  Model Evaluation 
10. Choosing the final model and most significant features.

### Importing Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# for model building
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE
import statsmodels.api as sm
# for model evaluation
from sklearn.metrics import mean_squared_error, r2_score
# for suppressing warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# importing the dataset
house =  pd.read_csv('train.csv')
house.head()

In [None]:
house.info()

Summary of the dataset : 1460 rows, 81 columns

In [None]:
# Statistical summary of the numerical features
house.describe()

In [None]:
# Separating the numerical and categorical features
numeric_house = house.select_dtypes(include=['int64','float64'])
# Categorical features are stored with the object dtype
categorical_house = house.select_dtypes(include=['object'])

In [None]:
# Numerical columns
print("1 Numerical Data : ", numeric_house.columns)

# Categorical
print("2 Categorical Data :", categorical_house.columns)

### Analyzing Numerical Data

**Outlier Detection**

Checking the percentage of outliers in each numerical feature.

In [None]:
outliers_percentage={}

for feature in numeric_house:
    IQR=numeric_house[feature].quantile(.75)-numeric_house[feature].quantile(.25)
    outliers_count=numeric_house[(numeric_house[feature]>(numeric_house[feature].quantile(.75)+1.5*IQR)) | (numeric_house[feature]<(numeric_house[feature].quantile(.25)-1.5*IQR))].shape[0]
    outliers_percentage[feature]=round(outliers_count/numeric_house.shape[0]*100,2)
    
outliers_df = pd.DataFrame({"Features":list(outliers_percentage.keys()),"Percentage":list(outliers_percentage.values())})
outliers_df.sort_values(by="Percentage",ascending=False)

**Comment**

- The majority of the numerical features have outliers.
- Dropping all outliers would cause a loss of information.
- Hence, values outside the range **[25th percentile - 1.5 IQR, 75th percentile + 1.5 IQR]** are reassigned to the nearest bound of that range (the fixed minimum or maximum value).
- IQR, or interquartile range, is the difference between the 75th and 25th percentile values of a feature.
- The target column 'SalePrice' is excluded from this treatment.
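
The capping rule described above can be illustrated on a toy series. This is a minimal sketch, assuming pandas' default linear-interpolation quantiles; the helper name `cap_outliers_iqr` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical toy data: 100 lies far outside the bulk of the values
toy = pd.Series([1, 2, 3, 4, 5, 100])
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Only the extreme value is moved to the upper bound; the other rows are left unchanged, so no observation is lost.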

In [None]:
for feature in outliers_percentage:
    if feature != 'SalePrice':
        IQR = house[feature].quantile(.75) - house[feature].quantile(.25)
        max_value = house[feature].quantile(.75) + 1.5*IQR
        min_value = house[feature].quantile(.25) - 1.5*IQR
        # clip() caps values outside [min_value, max_value] and avoids chained assignment
        house[feature] = house[feature].clip(lower=min_value, upper=max_value)

In [None]:
# Checking the dataset after reassigning minimum and maximum values
house.describe()

**Correlation in Numerical Data**

In [None]:
plt.figure(figsize=(20,16))
sns.heatmap(numeric_house.corr(), annot=True)
plt.show()

**Comments**
- Some of the features are highly correlated with each other:
- GarageCars and GarageArea (0.88)
- GarageYrBlt and YearBuilt (0.83)
- TotRmsAbvGrd and GrLivArea (0.83)
- TotalBsmtSF and 1stFlrSF (0.82)

One feature from each of these pairs will be dropped after visualization.
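
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on synthetic data; the helper name `high_corr_pairs`, the 0.8 threshold and the toy columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear function of "a"
    "c": rng.normal(size=200),                      # independent of both
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to `numeric_house`, the same helper would return the pairs listed above, assuming the same threshold.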

**Univariate and Bivariate Analysis - Numerical Feature**

**Analyzing Numerical Features with Continuous Values**

In [None]:
fig=plt.subplots(figsize=(12,12))
for i, feature in enumerate(['MSSubClass','LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF']):
    plt.subplot(9,3,i+1)
    plt.subplots_adjust(hspace=2.0)
    sns.scatterplot(x=house[feature], y=house['SalePrice'])
    plt.tight_layout()

**Comments:**

- **LotFrontage**, **LotArea**, **TotalBsmtSF**, **1stFlrSF** and **2ndFlrSF** show a positive correlation with SalePrice.
- **MSSubClass** has discrete values.
- **BsmtFinSF2** has a single value and can be dropped.

In [None]:
fig=plt.subplots(figsize=(12,12))
for i, feature in enumerate(['LowQualFinSF','GrLivArea','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea']):
    plt.subplot(9,3,i+1)
    plt.subplots_adjust(hspace=2.0)
    sns.scatterplot(x=house[feature], y=house['SalePrice'])
    plt.tight_layout()

**Comments:**

- **'GrLivArea', 'GarageArea'** show a positive correlation with SalePrice.
- **'LowQualFinSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal'** have a single value and can be dropped.

**Visualizing the distribution of the numerical features**

In [None]:
fig=plt.subplots(figsize=(12,12))    
for i, feature in enumerate(['MSSubClass','LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','GarageCars','GarageArea','OpenPorchSF']):
    plt.subplot(9,3,i+1)
    plt.subplots_adjust(hspace=2.0)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(house[feature], kde=True)
    plt.tight_layout()

In [None]:
house[['LowQualFinSF','GrLivArea','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']].describe()

Removing the features that have a fixed value, as they won't contribute to predicting SalePrice.

In [None]:
house[['LowQualFinSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']].describe()

In [None]:
house.drop(['LowQualFinSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal'], axis=1, inplace=True)

# checking the remaining columns
house.columns

**Analyzing Numerical Features with Discrete Values**

In [None]:
house[['OverallQual','OverallCond','MoSold','YrSold','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageYrBlt','YearBuilt','YearRemodAdd']]

In [None]:
fig=plt.subplots(figsize=(12,12))    
for i, feature in enumerate(['OverallQual','OverallCond','MoSold','YrSold','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageYrBlt','YearBuilt','YearRemodAdd']):
    plt.subplot(9,3,i+1)
    plt.subplots_adjust(hspace=2.0)
    sns.barplot(x=house[feature], y=house['SalePrice'])
    plt.tight_layout()

**Comments**
- 'OverallQual': the higher the quality rating, the higher the SalePrice (target variable).
- 'OverallCond': SalePrice is highest for rating 5.
- 'MoSold' and 'YrSold': SalePrice does not show a strong trend with the month or year in which the property is sold.
- 'FullBath' = 3 and 'HalfBath' = 1 have the highest SalePrice.
- 'TotRmsAbvGrd': the more rooms above grade, the higher the SalePrice.
- 'GarageYrBlt', 'YearBuilt', 'YearRemodAdd', 'YrSold': new features will be extracted from these to identify any trend.
- 'BsmtFullBath', 'KitchenAbvGr': need further inspection for meaningful insight.

In [None]:
house[['BsmtFullBath','KitchenAbvGr','GarageYrBlt','YearBuilt','YearRemodAdd']].describe()

In [None]:
print(house['BsmtFullBath'].value_counts())
print(house['KitchenAbvGr'].value_counts())

In [None]:
# Dropping KitchenAbvGr as it does not carry useful information
house.drop(['KitchenAbvGr'],axis=1, inplace=True)

In [None]:
house[['GarageYrBlt','YearBuilt','YearRemodAdd','YrSold']].describe()

In [None]:
# Converting the year-related features into a number of years (relative to 2021)
for feature in ['GarageYrBlt','YearBuilt','YearRemodAdd','YrSold']:
    house[feature] = 2021 - house[feature]

In [None]:
fig=plt.subplots(figsize=(12,12))

for i, feature in enumerate(['GarageYrBlt','YearBuilt','YearRemodAdd','YrSold']):
    plt.subplot(4,2,i+1)
    plt.subplots_adjust(hspace=2.0)
    sns.scatterplot(x=house[feature], y=house['SalePrice'])
    plt.tight_layout()

**Comments:**
- For most of the properties, the garage was built within the last 20 years. SalePrice is higher for recently built garages.
- SalePrice is higher for lower values of YearBuilt, i.e. for more recently built houses.
- Recently remodelled houses (lower values of YearRemodAdd) have a higher SalePrice.
- YrSold still does not show any significant trend.

**Missing Value Handling - Numerical Features**

In [None]:
print("Feature: Percentage of Missing Value")
print("====================================")
for feat in house.select_dtypes(exclude=['object']).columns:
    if house[feat].isnull().any():
        print(feat," : ", round(house[feat].isnull().sum()/house.shape[0],2)*100)

In [None]:
# Since MasVnrArea has only 1% of its data missing, dropping the rows with null values in MasVnrArea
# Dropping the Id column as it does not contribute towards predicting SalePrice.

house = house[~house['MasVnrArea'].isnull()]
house.drop(['Id'],axis=1, inplace=True)
numeric_house.drop(['Id'],axis=1, inplace=True)

In [None]:
# Check the number of remaining columns
house.columns.shape

**Comments:**
- GarageCars and GarageArea (correlation coefficient = 0.88): dropping GarageCars.
- GarageYrBlt and YearBuilt (correlation coefficient = 0.83): dropping GarageYrBlt because of its high correlation and missing values.
- TotRmsAbvGrd and GrLivArea (correlation coefficient = 0.83): dropping GrLivArea.
- TotalBsmtSF and 1stFlrSF (correlation coefficient = 0.82): dropping TotalBsmtSF.
- Missing value imputation for house["LotFrontage"] will be done after splitting the data into train and test sets, to avoid data leakage.
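
Why imputing after the split avoids leakage can be seen on a toy column: the fill value is learned from the training rows only and then reused for the test rows. This is a minimal sketch with hypothetical numbers standing in for LotFrontage.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column standing in for LotFrontage
train = np.array([[60.0], [np.nan], [80.0]])
test = np.array([[np.nan], [70.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(train)                      # the mean is learned from the training rows only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)   # the test rows reuse the training mean
print(train_filled.ravel(), test_filled.ravel())
```

Had the imputer been fitted on the full data set, the test rows would have influenced the fill value used during training.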

In [None]:
house.drop(['GarageCars','GarageYrBlt','GrLivArea','TotalBsmtSF'], axis=1, inplace=True)

# Checking the number of remaining columns
print(house.columns.shape)

### Analyzing Categorical Features

In [None]:
# Categorical features in the DataFrame

categorical_house.columns

**Missing Value Handling - Categorical Features**

In [None]:
print("Feature: Percentage of Missing Value")
print("====================================")
for feat in house.select_dtypes(include=['object']).columns:
    if house[feat].isnull().any():
        print(feat, ':' , round(house[feat].isnull().sum()/house[feat].shape[0],2)*100)

In [None]:
house['Electrical'].isnull().sum()

In [None]:
house['PoolQC'].value_counts()

**Comments:**

- For 'Alley', NaN means "No alley access".
- For 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', NaN means "No Basement".
- For 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', NaN means "No Garage".
- For 'FireplaceQu' and 'Fence', NaN means "No Fireplace" and "No Fence".
- For 'MiscFeature', NaN means that no additional feature is mentioned.

All these features will be imputed with a meaningful value in place of the missing data.

In [None]:
mv_categorical_feat = ['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','GarageType','GarageFinish','GarageQual','GarageCond','FireplaceQu','Fence','MiscFeature']
print(house[mv_categorical_feat].isnull().sum())

In [None]:
# Imputing missing values with "Not_applicable"
house[mv_categorical_feat] = house[mv_categorical_feat].fillna(value="Not_applicable")

# Check after imputation
print(house[mv_categorical_feat].isnull().sum())

In [None]:
# Dropping "PoolQC" because of its very high data imbalance
house.drop(['PoolQC'], axis=1, inplace=True)

# Dropping the rows with null values in Electrical, as very few values are missing.
house.dropna(subset=["Electrical"],inplace=True)

In [None]:
print("Feature : Percentage of Missing Values")
print("======================================")
for feat in house.columns:
    print(feat, ':', round(house[feat].isnull().sum()/house[feat].shape[0], 2)*100)

Missing value imputation will be done after splitting the data into training and test sets, to avoid data leakage.

In [None]:
house.columns.shape

In [None]:
# Function to generate the boxplot for SalePrice against different features given the list of features

def generate_boxplot(feature_list):
    fig=plt.subplots(figsize=(20,16))
    for i, feature in enumerate(feature_list):
        plt.subplot(4, 2, i+1)
        plt.subplots_adjust(hspace=2.0)
        sns.boxplot(x=house[feature], y=house['SalePrice'])
        plt.tight_layout()

Dividing the ordinal features into smaller segments and visualizing their impact on SalePrice.

**Analyzing Ordered Features**

In [None]:
ext_features = ['LotShape', 'Utilities', 'LandSlope', 'HouseStyle', 'ExterQual', 'ExterCond']
generate_boxplot(ext_features)


**Comments**
- LotShape: houses with a slightly irregular lot shape have the highest SalePrice.
- Utilities: most of the houses in the dataset have all public utilities.
- LandSlope: houses on a severe land slope have the lowest SalePrice.
- HouseStyle: two-storey houses have the highest SalePrice.
- ExterQual: houses with excellent quality of exterior material have the highest SalePrice.
- ExterCond: houses with excellent condition of exterior material have the highest SalePrice.

In [None]:
int_features = ['HeatingQC','KitchenQual','Functional','FireplaceQu']
generate_boxplot(int_features)

**Comments**

- Houses with excellent heating quality and kitchen quality have the highest SalePrice.
- Houses with typical functionality have the highest SalePrice. There are very few houses that are severely damaged.
- The SalePrice range is largest for houses with average fireplace quality.

In [None]:
garage_feature = ['GarageFinish','GarageQual','GarageCond']
generate_boxplot(garage_feature)

**Comments**
- SalePrice is highest if the garage is finished.
- The range of SalePrice is widest for typical/average garage quality and condition.
- There are very few houses with a garage in excellent condition.

In [None]:
bassement_feature = ['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']
generate_boxplot(bassement_feature)

**Comment**

- Houses with an excellent quality basement have the highest SalePrice.
- Houses with good living quarters (BsmtFinType1 = GLQ) have the highest SalePrice.
- A lot of houses have an unfinished basement or no basement (label = Not_applicable).

**Encoding Categorical Features**

In [None]:
# LotShape into numerical values
house['LotShape']=house['LotShape'].map({'IR1':0,'IR2':1,'IR3':2,'Reg':3})
# Utilities into numerical values
house['Utilities']=house['Utilities'].map({'AllPub':3,'NoSewr':2,'NoSeWa':1,'ELO':0})
# LandSlope into numerical values
house['LandSlope']=house['LandSlope'].map({'Gtl':0,'Mod':1,'Sev':2})
# HouseStyle into numerical values
house['HouseStyle']=house['HouseStyle'].map({'1Story':0,'1.5Fin':1,'1.5Unf':2,'2Story':3,'2.5Fin':4,'2.5Unf':5,'SFoyer':6,'SLvl':7})
# ExterQual into numerical values
house['ExterQual']=house['ExterQual'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
# ExterCond into numerical values
house['ExterCond']=house['ExterCond'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
# BsmtQual into numerical values
house['BsmtQual']=house['BsmtQual'].map({'Not_applicable':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
# BsmtCond into numerical values
house['BsmtCond']=house['BsmtCond'].map({'Not_applicable':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
# BsmtExposure into numerical values
house['BsmtExposure']=house['BsmtExposure'].map({'Not_applicable':0,'No':1,'Mn':2,'Av':3,'Gd':4})                                                                              
# BsmtFinType1 into numerical values
house['BsmtFinType1']=house['BsmtFinType1'].map({'Not_applicable':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
# BsmtFinType2 into numerical values
house['BsmtFinType2']=house['BsmtFinType2'].map({'Not_applicable':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
# HeatingQC into numerical values
house['HeatingQC']=house['HeatingQC'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
# CentralAir into numerical values
house['CentralAir']=house['CentralAir'].map({'N':0,'Y':1})
# KitchenQual into numerical values
house['KitchenQual']=house['KitchenQual'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
# GarageFinish into numerical values
house['GarageFinish']=house['GarageFinish'].map({'Not_applicable':0,'Unf':1,'RFn':2,'Fin':3})
# GarageQual into numerical values
house['GarageQual']=house['GarageQual'].map({'Not_applicable':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
# GarageCond into numerical values
house['GarageCond']=house['GarageCond'].map({'Not_applicable':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
# Functional into numerical values
house['Functional']=house['Functional'].map({'Typ':0,'Min1':1,'Min2':3,'Mod':4,'Maj1':5,'Maj2':6,'Fa':7,'Sev':8,'Sal':9})
# FireplaceQu into numerical values
house['FireplaceQu']=house['FireplaceQu'].map({'Not_applicable':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})


In [None]:
# Checking the features after encoding
house[['LotShape', 'Utilities', 'LandSlope', 'HouseStyle', 'ExterQual', 'ExterCond','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
'HeatingQC','KitchenQual','Functional','FireplaceQu','GarageFinish','GarageQual','GarageCond']].info()
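
`Series.map` silently returns NaN for any label that is missing from the mapping dictionary, so a small guard can make the ordinal encoding above safer. This is a minimal sketch; the helper `encode_ordinal` and the constant `QUALITY_ORDER` are hypothetical names, and the toy series is not taken from the data set.

```python
import pandas as pd

# Shared ordering for the Po/Fa/TA/Gd/Ex quality scale
QUALITY_ORDER = {'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}

def encode_ordinal(series, mapping):
    """Map labels to integers and fail loudly on labels missing from the mapping."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels in {series.name}: {list(unmapped)}")
    return encoded

toy = pd.Series(['TA', 'Gd', 'Ex', 'Fa'], name='ExterQual')
encoded = encode_ordinal(toy, QUALITY_ORDER)
print(encoded.tolist())
```

Reusing one dictionary for every column on the same scale also keeps the encodings consistent with each other.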

In [None]:
unordered_feature = ['MSZoning','Street','Alley','LandContour','LotConfig','Neighborhood','Condition1','Condition2','BldgType','RoofStyle',
'RoofMatl','Exterior1st','Exterior2nd','MasVnrType','Foundation','Heating','Electrical','GarageType','PavedDrive','Fence',
'MiscFeature','SaleType','SaleCondition']

In [None]:
generate_boxplot(['MSZoning','Street','Alley','LandContour','LotConfig','Neighborhood'])

**Comments:**
- Most of the houses do not have an alley.
- Neighborhood has a lot of labels; using one-hot encoding directly would lead to a high number of additional columns.
- Houses classified as MSZoning = RL (Residential Low Density) have the highest SalePrice.

In [None]:
generate_boxplot(['Condition1','Condition2','BldgType','RoofStyle',
'RoofMatl','Exterior1st','Exterior2nd'])

**Comments**

- Houses in normal condition (Condition1 = Norm and Condition2 = Norm) are likely to have a high SalePrice.
- Features like 'RoofMatl', 'Exterior1st' and 'Exterior2nd' have labels with very few data points, which cannot contribute to predicting SalePrice.

In [None]:
generate_boxplot(['MasVnrType','Foundation','Heating','Electrical','GarageType','PavedDrive','Fence',
'MiscFeature'])

**Comments**

- Houses with a poured concrete foundation (Foundation = PConc), standard circuit breakers (Electrical = SBrkr) and/or Heating = GasA have the highest price.
- Houses with an attached or built-in garage have a high SalePrice.
- Most of the houses do not have a fence (Fence = Not_applicable).

In [None]:
generate_boxplot(['SaleType','SaleCondition'])

**Comment**
- Newly built houses and houses sold with a warranty deed tend to have a high SalePrice.
- SaleCondition = Normal leads to a high SalePrice.

**Encoding Categorical Variables**

In [None]:
dummy_df = pd.get_dummies(house[unordered_feature],drop_first=True)

In [None]:
dummy_df.shape

**Comment:**
- Adding 144 features to the existing dataset will make the model complex.
- The boxplots above show that, for some categorical features, a single label dominates the others.
- Any column in dummy_df in which 95% or more of the values are 0 will be dropped, as such features are highly imbalanced.
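
The 95% rule can be sketched on a toy categorical column. The column name `kind`, its labels and their counts are hypothetical; `dtype=int` is passed so that the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical feature: label "z" occurs in only 3 of 100 rows
toy = pd.DataFrame({'kind': ['x'] * 50 + ['y'] * 47 + ['z'] * 3})
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

# Share of zeros per dummy column; columns at or above 95% are nearly constant
zero_share = (dummies == 0).mean()
kept = dummies.loc[:, zero_share < 0.95]
print(zero_share.to_dict())
print(list(kept.columns))
```

Here `kind_z` is zero in 97% of the rows and is dropped, while `kind_y` is kept.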

In [None]:
dummies_to_drop = []
for feat in dummy_df.columns:
    # Share of zeros in the column; works for both integer and boolean dummies
    if (dummy_df[feat] == 0).mean() >= 0.95:
        dummies_to_drop.append(feat)
print(dummies_to_drop)
print(len(dummies_to_drop))

In [None]:
# Dropping the highly imbalanced dummy variables
dummy_df = dummy_df.drop(dummies_to_drop, axis=1)

print(dummy_df.shape)

In [None]:
house.shape

In [None]:
# Adding the dummy variables to the original dataframe
house = pd.concat([house,dummy_df],axis=1)

# Dropping the redundant columns
house = house.drop(unordered_feature, axis=1)

In [None]:
house.shape

### Splitting into Train and Test Data

In [None]:
X = house.drop(['SalePrice'], axis=1)
X.head()

In [None]:
plt.title('Distribution of SalePrice')
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(house["SalePrice"], kde=True)
plt.show()

**Comment:** Since SalePrice is highly skewed, checking the distribution of the log-transformed SalePrice.
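
The effect of a log transform on a right-skewed variable can be checked numerically through the sample skewness. This is a minimal sketch on synthetic log-normal "prices"; the distribution parameters are assumptions and are not fitted to the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed prices drawn from a log-normal distribution
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))
log_prices = np.log(prices)

skew_raw = prices.skew()
skew_log = log_prices.skew()
print("skewness before:", round(skew_raw, 2), "after:", round(skew_log, 2))

# np.exp inverts the transform, so predictions can be mapped back to prices
recovered = np.exp(log_prices)
```

A skewness close to zero after the transform supports modelling the logarithm of the target rather than the raw value.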

In [None]:
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(np.log(house['SalePrice']), kde=True)
plt.title('Distribution of log transformed SalePrice')
plt.show()

In [None]:
# The log-transformed SalePrice is approximately normally distributed, hence the transformed data will be used for model building

y = np.log(house['SalePrice'])
print(y)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
X['LotFrontage'].isnull().any()

In [None]:
# Imputing LotFrontage after splitting the data into training and test sets, to prevent data leakage

si = SimpleImputer(missing_values=np.nan, strategy='mean')
si.fit(X_train[['LotFrontage']])

In [None]:
X_train[['LotFrontage']] = si.transform(X_train[['LotFrontage']])

In [None]:
X_test[['LotFrontage']] = si.transform(X_test[['LotFrontage']])


### Feature Scaling

In [None]:
X_train.values

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(X_train)

In [None]:
X_tr_scaled = pd.DataFrame(data=ss.transform(X_train), columns=X_train.columns)
X_te_scaled = pd.DataFrame(data=ss.transform(X_test), columns=X_test.columns)

In [None]:
# Checking the features after scaling

print(X_tr_scaled)
print(X_te_scaled)

### Initial Feature Selection with RFE

In [None]:
# Given the number of features n, the function prints and returns the top n features selected by RFE
def top_n_features(n, X_tr_scaled, y_train):
    top_n_cols = []

    linear_m = LinearRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn versions;
    # without it RFE keeps half of the features
    rfe = RFE(linear_m, n_features_to_select=n)
    rfe = rfe.fit(X_tr_scaled, y_train)

    print("Top %d features: " % n)
    rfe_ranking = list(zip(X_tr_scaled.columns, rfe.support_, rfe.ranking_))

    for i in rfe_ranking:
        if i[1]:
            top_n_cols.append(i[0])
    print(top_n_cols)
    return top_n_cols


In [None]:
# Checking top 45, 50, and 55 features
top_45 = top_n_features(45, X_tr_scaled, y_train)
top_50 = top_n_features(50, X_tr_scaled, y_train)
top_55 = top_n_features(55, X_tr_scaled, y_train)

In [None]:
# Given the training data and a list of features, this provides the statistical summary of the model
# This will be used to check the adjusted R-squared value for the top 45, 50 and 55 features

def build_regressor(X_train, y_train, cols):
    X_train_ols = sm.add_constant(X_train[cols])
    lin_reg = sm.OLS(y_train.values.reshape(-1, 1), X_train_ols).fit()
    print(lin_reg.summary())

In [None]:
build_regressor(X_tr_scaled,y_train,top_45)

In [None]:
build_regressor(X_tr_scaled,y_train,top_50)

In [None]:
build_regressor(X_tr_scaled,y_train,top_55)

**Comments:** Comparing the adjusted R-squared values of the linear regression models built with the top 45, 50 and 55 features, 50 features seem to be optimal, as the models with 50 and 55 features have the same adjusted R-squared value on the training data.
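
How RFE narrows a feature set down to a requested size can be seen on synthetic data where the informative columns are known in advance. This is a minimal sketch; the column names `f0` to `f5` and the coefficients are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
# Only f0 and f1 drive the target; the remaining columns are pure noise
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.01, size=200)

# RFE repeatedly drops the feature with the smallest coefficient magnitude
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
selected = list(X.columns[rfe.support_])
print(selected)
```

`rfe.ranking_` assigns rank 1 to every selected feature and higher ranks to features eliminated earlier.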

In [None]:
X_train_rfe = X_tr_scaled[top_50]
X_test_rfe = X_te_scaled[top_50]


In [None]:
def build_model(X_train, X_test, y_train, params, model='ridge'):
    if model == 'ridge':
        estimator_model = Ridge()
    else:
        estimator_model = Lasso()
    model_cv = GridSearchCV(estimator=estimator_model,
                           param_grid=params,
                           scoring='neg_mean_absolute_error',
                           cv=5,
                           return_train_score=True,
                           verbose=1)
    model_cv.fit(X_train, y_train)
    alpha = model_cv.best_params_['alpha']
    print("Optimum alpha for %s is %f" % (model, alpha))
    final_model = model_cv.best_estimator_
    
    final_model.fit(X_train, y_train)
    y_train_pred = final_model.predict(X_train)
    y_test_pred = final_model.predict(X_test)
    
    # Model Evaluation
    print(model, "Regression with alpha", alpha)
    print("===========================")
    print('R2 score (train):', r2_score(y_train, y_train_pred))
    print('R2 score (test):', r2_score(y_test, y_test_pred))
    print('RMSE (train):', np.sqrt(mean_squared_error(y_train, y_train_pred)))
    print('RMSE (test):', np.sqrt(mean_squared_error(y_test, y_test_pred)))
    
    return final_model, y_test_pred


### Ridge Regression

In [None]:
# List of alphas to tune
params = {'alpha': [0.0001,0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10, 20, 50, 100, 500, 1000]}
ridge_final_model, y_test_predicted = build_model(X_train_rfe, X_test_rfe, y_train, params, model='ridge')


**Comments:** The Ridge regression model achieves an R2 score of 0.87 on the test data, i.e. 87% of the variance in the test data is explained by the model. The root mean square error on the test data is 0.1531, which means that the model's predictions of log(SalePrice) are typically off by about 0.1531 units.
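
Because the target is log(SalePrice), the RMSE is expressed in log units. A rough way to read it on the price scale is as a multiplicative factor obtained by exponentiating it; this is an approximation for intuition, not an exact error bound.

```python
import numpy as np

rmse_log = 0.1531              # test RMSE reported above, in units of log(SalePrice)
factor = np.exp(rmse_log)      # corresponding multiplicative error on the price scale
pct = (factor - 1) * 100
print(f"typical multiplicative error: x{factor:.3f} (about {pct:.1f}%)")
```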

### Lasso Regression

In [None]:
params ={'alpha':[0.000001,0.00001,0.0001,0.001,0.01,0.1,1.0,10,100,500,10000]}

lasso_final_model, y_test_predicted = build_model(X_train_rfe, X_test_rfe, y_train, params, model='lasso')

### Comparing Model Coefficients

In [None]:
model_coefficients = pd.DataFrame(index=X_test_rfe.columns)

model_coefficients['Ridge (alpha=9.0)'] = ridge_final_model.coef_
model_coefficients['Lasso (alpha=0.0001)'] = lasso_final_model.coef_
pd.set_option('display.max_rows', None)
model_coefficients

In [None]:
# Converting the predictions back to the original scale (antilog)

test_prediction = np.around(np.exp(y_test_predicted)).astype(int)
print(test_prediction[:5])

### Final Model

Lasso regression produces a slightly higher R2 score on the test data than Ridge regression, so Lasso is chosen as the final model.

In [None]:
# The 50 features ordered by their Lasso coefficient (largest first)
model_coefficients[['Lasso (alpha=0.0001)']].sort_values(by='Lasso (alpha=0.0001)',ascending=False)

In [None]:
model_coefficients[['Lasso (alpha=0.0001)']].sort_values(by='Lasso (alpha=0.0001)',ascending=False).index[:10]

## Summary

- First the housing data is read and analyzed dividing the features into numerical and categorical types.


- SalePrice is the target column here.


- All the features are then analyzed; missing data handling, outlier detection and data cleaning are done. The trend of SalePrice is observed for changes in individual features.


- New features are extracted, redundant features dropped and categorical features are encoded accordingly.


- Then the data is split into train and test data and feature scaling is performed.


- The target variable SalePrice is right-skewed. Its natural log is approximately normally distributed, hence the natural log of SalePrice is used for model building.


- Creating dummy variables greatly increased the number of features, so highly imbalanced dummy columns are dropped.


- Top 50 features are selected through RFE and adjusted R-square. 50 features : 
['MSSubClass', 'LotArea', 'LandSlope', 'OverallQual', 'OverallCond', 'YearBuilt', 'BsmtQual', 'BsmtExposure', 'BsmtFinSF1', 'BsmtUnfSF', 'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF', 'BsmtFullBath', 'HalfBath', 'KitchenQual', 'Functional', 'Fireplaces', 'GarageFinish', 'GarageArea', 'GarageQual', 'OpenPorchSF', 'MSZoning_RL', 'Street_Pave', 'LotConfig_CulDSac', 'Neighborhood_Edwards', 'Neighborhood_NAmes', 'Neighborhood_NWAmes', 'Neighborhood_NridgHt', 'Neighborhood_Somerst', 'Condition1_Feedr', 'Condition1_Norm', 'Condition2_Norm', 'BldgType_TwnhsE', 'RoofStyle_Gable', 'RoofStyle_Hip', 'Exterior1st_HdBoard', 'Exterior1st_Wd Sdng', 'Exterior2nd_HdBoard', 'Exterior2nd_Wd Sdng', 'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'Foundation_PConc', 'Heating_GasA', 'GarageType_Not_applicable', 'PavedDrive_Y', 'SaleCondition_Normal', 'SaleCondition_Partial']


- Ridge and Lasso regression models are built with the optimum alpha found with GridSearchCV.
Optimum alpha = 9.0 for the ridge model and 0.0001 for the lasso model.


- Model evaluation is done with R2 score and Root Mean Square Error.


- Lasso Regression is chosen as final model for having slightly better R-square value on test data.


- Out of 50 features in the final model, top 10 features in order of descending importance are ['1stFlrSF', '2ndFlrSF', 'OverallQual', 'OverallCond', 'SaleCondition_Partial', 'LotArea', 'BsmtFinSF1','SaleCondition_Normal', 'MSZoning_RL', 'Neighborhood_Somerst']


- Model coefficients are listed in a table along with the corresponding features. For example, the natural log of SalePrice changes by 0.124911 for a unit change in the (standardized) feature '1stFlrSF' when all other features are held constant. A negative coefficient signifies a negative association between the predictor and the target variable.


- The predicted value of SalePrice is transformed back to its original scale by taking the antilog.


### FAQ

**Question 1**

What is the optimal value of alpha for ridge and lasso regression? What will be the changes in the model if you choose double the value of alpha for both ridge and lasso? What will be the most important predictor variables after the change is implemented?

**Answers:**
The optimal values of alpha are 9.0 for ridge and 0.0001 for lasso regression.


In [None]:
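# Refit both models with doubled alpha values
# (multiplying the fitted coefficients by 2 would not double alpha)
ridge_doubled = Ridge(alpha=2 * ridge_final_model.alpha).fit(X_train_rfe, y_train)
lasso_doubled = Lasso(alpha=2 * lasso_final_model.alpha).fit(X_train_rfe, y_train)

# Coefficients of the refitted models, indexed by feature name
doubled_alpha_ridge = pd.Series(ridge_doubled.coef_, index=X_train_rfe.columns)
doubled_alpha_lasso = pd.Series(lasso_doubled.coef_, index=X_train_rfe.columns)
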


In [None]:
doubled_alpha_ridge

**Ridge Regression:**

- If you double the alpha value in Ridge regression, it will increase the regularization strength. This means that the model will be penalized more for having large coefficient values. As a result, the coefficients of the predictors will tend to become smaller.
- This increased regularization leads to a simpler model that is less likely to overfit the training data. It also mitigates the effect of multicollinearity by keeping coefficients small but non-zero.
- After doubling alpha, the most important predictor variables will likely remain the same as they were before the change. However, their coefficient values will decrease in magnitude.

In [None]:
doubled_alpha_lasso

**Lasso Regression:**

- Doubling the alpha value in Lasso regression will also increase the regularization strength. Lasso uses L1 regularization, which encourages some coefficients to be exactly zero. Increasing alpha makes it more likely that Lasso will set more coefficients to zero.
- The impact on the model will be sparsity in the coefficient vector. Many predictor variables may become irrelevant (have coefficients set to zero), effectively performing feature selection. Only a subset of the most important predictor variables will have non-zero coefficients.
- After doubling alpha, the most important predictor variables will be those that Lasso retains with non-zero coefficients.
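
The shrinkage described above can be observed directly by fitting each model at an alpha value and at twice that value. This is a minimal sketch on synthetic data; the alpha values (10 and 0.5) and the true coefficients are assumptions chosen for illustration and are unrelated to the optimum values found in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Hypothetical target: only columns 0 and 3 matter
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

ridge_base = Ridge(alpha=10.0).fit(X, y)
ridge_doubled = Ridge(alpha=20.0).fit(X, y)
lasso_base = Lasso(alpha=0.5).fit(X, y)
lasso_doubled = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks the L2 norm of the coefficients; Lasso shrinks the L1 norm
print("ridge L2 norm:", np.linalg.norm(ridge_base.coef_), "->", np.linalg.norm(ridge_doubled.coef_))
print("lasso L1 norm:", np.abs(lasso_base.coef_).sum(), "->", np.abs(lasso_doubled.coef_).sum())
print("lasso coefficients at doubled alpha:", np.round(lasso_doubled.coef_, 3))
```

Ridge coefficients shrink but stay non-zero, whereas Lasso can set weak coefficients to exactly zero as alpha grows.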

**Question 2:**

You have determined the optimal value of lambda for ridge and lasso regression during the assignment. Now, which one will you choose to apply and why?

**Answers:**

**Ridge Regression:**

- Ridge adds L2 regularization to the linear regression, which penalizes the sum of squared coefficients.
- It is effective when we believe that most of the features are relevant, but we want to limit the effect of multicollinearity and control the magnitude of the coefficients.
- Ridge can be a good choice when we have a large number of features, and we want to avoid feature selection.

**Lasso Regression:**

- Lasso adds L1 regularization, which can lead to sparse coefficient vectors by setting some coefficients to exactly zero.
- It is useful when we suspect that many features are irrelevant, and we want automatic feature selection.
- Lasso can be a good choice when we have a high-dimensional dataset and want to simplify the model by eliminating unimportant predictors.
- In this problem, Lasso has a slightly higher R2 score on the test data, which suggests that Lasso's feature selection capability may suit this dataset better by reducing the impact of irrelevant predictors. Choosing Lasso as the final model is consistent with the goal of better predictive performance.

**Question 3**

After building the model, you realised that the five most important predictor variables in the lasso model are not available in the incoming data. You will now have to create another model excluding the five most important predictor variables. Which are the five most important predictor variables now?

In [None]:
model_coefficients[['Lasso (alpha=0.0001)']].sort_values(by='Lasso (alpha=0.0001)',ascending=False).index[:5]
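
The cell above lists the five largest coefficients of the existing Lasso model. To answer the question, those columns would be dropped and the model refitted on the remaining features. The following is a minimal sketch of that procedure on synthetic data, using two features instead of five; the data, the `alpha` value, the helper `top_k_features` and the ranking by absolute coefficient are assumptions for illustration, not results from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
cols = [f"f{i}" for i in range(6)]
X = pd.DataFrame(rng.normal(size=(5000, 6)), columns=cols)
true_coef = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
y = X.to_numpy() @ true_coef + rng.normal(scale=0.1, size=5000)

def top_k_features(model, columns, k):
    """Rank features by absolute coefficient and return the k largest."""
    coefs = pd.Series(model.coef_, index=columns)
    return list(coefs.abs().sort_values(ascending=False).index[:k])

full_model = Lasso(alpha=0.01).fit(X, y)
top = top_k_features(full_model, X.columns, k=2)

# Refit without the most important predictors, then rank what is left
X_reduced = X.drop(columns=top)
reduced_model = Lasso(alpha=0.01).fit(X_reduced, y)
new_top = top_k_features(reduced_model, X_reduced.columns, k=2)
print(top, new_top)
```

In the notebook, the same steps would apply to `X_train_rfe` with the five top Lasso features removed, ideally followed by a fresh search for alpha.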

Reference link: git@github.com:anwarshaikh042/Assignment_Advanced_Regression.git