# Introduction
In this notebook, I will take you through a step-by-step approach to solving a house price regression problem. This notebook aims to:

1. Provide insights into the housing data
2. Show the importance of preprocessing
3. Introduce feature engineering
4. Use an ensembling algorithm

I hope that after reading this notebook, beginners will be more comfortable tackling learning problems and able to use the techniques shown here to solve a problem from start to end. For non-beginners, I hope you can still take something away from this notebook and gain new insights along the way :)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.impute import SimpleImputer

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
sns.set(font_scale=1)

In [None]:
#importing data from csv file using pandas
train=pd.read_csv('../input/home-data-for-ml-course/train.csv')
test=pd.read_csv('../input/home-data-for-ml-course/test.csv')

train.head()

## 1. Understanding Data

In [None]:
print("Train",train.shape)
print("Test",test.shape)

A quick glance at the top rows of the dataframe shows that some columns contain **NaN** values, which pandas uses to mark missing data. We will investigate this later on.

Here I first concatenate the train and test sets so that we can extract insights from the housing data as a whole. This is also more convenient for the preprocessing steps later on, since we only need one data reference.

In [None]:
X = pd.concat([train.drop("SalePrice", axis=1),test], axis=0)
y = train[['SalePrice']]

In [None]:
X.info()

Let's isolate the numerical and categorical columns, since we will apply different visualization techniques to each.

In [None]:
numeric_ = X.select_dtypes(exclude=['object']).drop(['MSSubClass'], axis=1).copy()
numeric_.columns

In [None]:
cat_train = X.select_dtypes(include=['object']).copy()
cat_train['MSSubClass'] = X['MSSubClass']   #MSSubClass is nominal
cat_train.columns

## 2. Data Visualization

In [None]:
#lets create scatterplot of GrLivArea and SalePrice
sns.scatterplot(x='GrLivArea',y='SalePrice',data=train)
plt.show()

In [None]:
# the plot above shows two outliers (very large GrLivArea, low SalePrice) that could distort our model, so let's remove them
train=train.drop(train.loc[(train['GrLivArea']>4000) & (train['SalePrice']<200000)].index, axis=0)
train.reset_index(drop=True, inplace=True)

In [None]:
# let's see how it looks after removing the outliers
sns.scatterplot(x='GrLivArea',y='SalePrice',data=train)
plt.show()

In [None]:
# before drawing a heatmap, let's see which features SalePrice is most correlated with
# (numeric_only=True skips text columns; it needs pandas >= 1.5)
corr=train.drop('Id',axis=1).corr(numeric_only=True).sort_values(by='SalePrice',ascending=False).round(2)
print(corr['SalePrice'])

In [None]:
# SalePrice is most strongly correlated with OverallQual, GrLivArea, TotalBsmtSF, GarageCars, 1stFlrSF and GarageArea
plt.subplots(figsize=(12, 9))
sns.heatmap(corr, vmax=.8, square=True);

In [None]:
#now lets create heatmap for top 10 correlated features
cols =corr['SalePrice'].head(10).index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1)
hm = sns.heatmap(cm, annot=True, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
#lets see relation of 10 feature with SalePrice through Pairplot
sns.pairplot(train[corr['SalePrice'].head(10).index])
plt.show()

## Extra Analysis

In [None]:
#lets store number of test and train rows
trainrow=train.shape[0]
testrow=test.shape[0]

In [None]:
#copying id data
testids=test['Id'].copy()

In [None]:
# copying the sale price
y_train=train['SalePrice'].copy()

In [None]:
#combining train and test data
data=pd.concat((train,test)).reset_index(drop=True)
data=data.drop('SalePrice',axis=1)

In [None]:
#dropping id columns
data=data.drop('Id',axis=1)

## Missing Values

In [None]:
#checking missing data
missing=data.isnull().sum().sort_values(ascending=False)
missing=missing.drop(missing[missing==0].index)
missing

In [None]:
# PoolQC is the pool quality; most houses do not have a pool, so fill with 'NA'
data['PoolQC']=data['PoolQC'].fillna('NA')
data['PoolQC'].unique()

In [None]:
# MiscFeature: most houses do not have one, so fill with 'NA'
data['MiscFeature']=data['MiscFeature'].fillna('NA')
data['MiscFeature'].unique()

In [None]:
# Alley, Fence, FireplaceQu: most houses do not have these, so fill with 'NA'
data['Alley']=data['Alley'].fillna('NA')
data['Alley'].unique()

data['Fence']=data['Fence'].fillna('NA')
data['Fence'].unique()

data['FireplaceQu']=data['FireplaceQu'].fillna('NA')
data['FireplaceQu'].unique()

In [None]:
# LotFrontage (linear feet of street connected to the property): every house has some frontage, so fill with the mean
data['LotFrontage']=data['LotFrontage'].fillna(data['LotFrontage'].dropna().mean())

In [None]:
#GarageCond,GarageQual,GarageFinish
data['GarageCond']=data['GarageCond'].fillna('NA')
data['GarageCond'].unique()

data['GarageQual']=data['GarageQual'].fillna('NA')
data['GarageQual'].unique()

data['GarageFinish']=data['GarageFinish'].fillna('NA')
data['GarageFinish'].unique()

In [None]:
# GarageYrBlt, GarageArea, GarageCars are numeric, so fill with 0
# GarageType is categorical, so fill with 'NA' to keep the column a single type
data['GarageYrBlt']=data['GarageYrBlt'].fillna(0)
data['GarageType']=data['GarageType'].fillna('NA')
data['GarageArea']=data['GarageArea'].fillna(0)
data['GarageCars']=data['GarageCars'].fillna(0)

In [None]:
#BsmtExposure,BsmtCond,BsmtQual,BsmtFinType2,BsmtFinType1 
data['BsmtExposure']=data['BsmtExposure'].fillna('NA')
data['BsmtCond']=data['BsmtCond'].fillna('NA')
data['BsmtQual']=data['BsmtQual'].fillna('NA')
data['BsmtFinType2']=data['BsmtFinType2'].fillna('NA')
data['BsmtFinType1']=data['BsmtFinType1'].fillna('NA')

#BsmtFinSF1,BsmtFinSF2 
data['BsmtFinSF1']=data['BsmtFinSF1'].fillna(0)
data['BsmtFinSF2']=data['BsmtFinSF2'].fillna(0)

In [None]:
#MasVnrType,MasVnrArea
data['MasVnrType']=data['MasVnrType'].fillna('NA')
data['MasVnrArea']=data['MasVnrArea'].fillna(0)

In [None]:
# MSZoning, Utilities, Functional, SaleType, Electrical:
# fill with the most frequent value (the mode) of each column
for col in ['MSZoning','Utilities','Functional','SaleType','Electrical']:
    data[col]=data[col].fillna(data[col].mode()[0])

# BsmtFullBath, BsmtHalfBath, BsmtUnfSF, TotalBsmtSF: treat missing as "no basement" and fill with 0
for col in ['BsmtFullBath','BsmtHalfBath','BsmtUnfSF','TotalBsmtSF']:
    data[col]=data[col].fillna(0)

# Exterior1st, Exterior2nd: fill with 'NA'
for col in ['Exterior1st','Exterior2nd']:
    data[col]=data[col].fillna('NA')


In [None]:
# KitchenQual: fill with the most frequent value (the mode)
data['KitchenQual']=data['KitchenQual'].fillna(data['KitchenQual'].mode()[0])


Now let's check whether any missing values remain.

In [None]:
# let's check whether any missing values remain
missing=data.isnull().sum().sort_values(ascending=False)
missing=missing.drop(missing[missing==0].index)
missing

As you can see, no missing values remain in the data.
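
The column-by-column fills above follow three rules: `'NA'` where a feature is absent, `0` where an area or count is absent, and the most frequent value otherwise. Here is a minimal sketch of the same idea on a toy frame. The helper `fill_missing`, the toy values, and the way the columns are grouped are my own illustration, not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd

def fill_missing(df, none_cols, zero_cols, mode_cols):
    """Hypothetical helper: apply the three fill rules used above."""
    out = df.copy()
    out[none_cols] = out[none_cols].fillna('NA')   # feature is absent
    out[zero_cols] = out[zero_cols].fillna(0)      # no area / no count
    for col in mode_cols:                          # most frequent value
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'PoolQC': [np.nan, 'Gd', np.nan],
    'GarageArea': [480.0, np.nan, 0.0],
    'MSZoning': ['RL', 'RL', np.nan],
})

filled = fill_missing(toy, ['PoolQC'], ['GarageArea'], ['MSZoning'])
print(filled)
```

Grouping the columns by rule keeps the decision for each column visible in one place, which makes it easier to review than a long run of individual `fillna` calls.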

## 3. Feature Engineering

Feature engineering is a technique by which we create new features that could potentially aid in predicting our target variable, which in this case is SalePrice. In this notebook, we will create additional features based on our **domain knowledge** of the housing features.

Based on the features we have, the first additional feature we can add is **TotalLot**, which sums LotFrontage and LotArea as a measure of the land available as lot. We can also calculate the total surface area of the house, **TotalSF**, by adding the basement area and the 2nd floor area. **TotalBath** tells us how many bathrooms there are in the house in total. We can also add up the different types of porches around the house into a total porch area, **TotalPorch**.

* TotalLot = LotFrontage + LotArea
* TotalSF = TotalBsmtSF + 2ndFlrSF
* TotalBath = FullBath + HalfBath
* TotalPorch = OpenPorchSF + EnclosedPorch + ScreenPorch
* TotalBsmtFin = BsmtFinSF1 + BsmtFinSF2

In [None]:
# some features are highly correlated with SalePrice, so let's create polynomial features (powers 2 to 4) from them
data['GrLivArea_2']=data['GrLivArea']**2
data['GrLivArea_3']=data['GrLivArea']**3
data['GrLivArea_4']=data['GrLivArea']**4

data['TotalBsmtSF_2']=data['TotalBsmtSF']**2
data['TotalBsmtSF_3']=data['TotalBsmtSF']**3
data['TotalBsmtSF_4']=data['TotalBsmtSF']**4

data['GarageCars_2']=data['GarageCars']**2
data['GarageCars_3']=data['GarageCars']**3
data['GarageCars_4']=data['GarageCars']**4

data['1stFlrSF_2']=data['1stFlrSF']**2
data['1stFlrSF_3']=data['1stFlrSF']**3
data['1stFlrSF_4']=data['1stFlrSF']**4

data['GarageArea_2']=data['GarageArea']**2
data['GarageArea_3']=data['GarageArea']**3
data['GarageArea_4']=data['GarageArea']**4
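
The repeated lines above all follow the same pattern, so they could also be generated in a loop. A minimal sketch on a toy frame (the values are made up, and only two of the columns are shown):

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({'GrLivArea': [1000, 2000], 'GarageCars': [1, 2]})

# For each base column, add its 2nd, 3rd and 4th powers as <column>_<power>
for col in ['GrLivArea', 'GarageCars']:
    for power in (2, 3, 4):
        toy[f'{col}_{power}'] = toy[col] ** power

print(toy.columns.tolist())
```

Note that the higher powers of area columns become very large numbers, which is one reason scaling the features before modeling is useful.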

In [None]:
# let's add 1stFlrSF and 2ndFlrSF to create a new feature, Floorfeet
data['Floorfeet']=data['1stFlrSF']+data['2ndFlrSF']
data=data.drop(['1stFlrSF','2ndFlrSF'],axis=1)

In [None]:
#MSSubClass,MSZoning
data=pd.get_dummies(data=data,columns=['MSSubClass'],prefix='MSSubClass')
data=pd.get_dummies(data=data,columns=['MSZoning'],prefix='MSZoning')
data.head()

In [None]:
X['TotalLot'] = X['LotFrontage'] + X['LotArea']
X['TotalBsmtFin'] = X['BsmtFinSF1'] + X['BsmtFinSF2']
X['TotalSF'] = X['TotalBsmtSF'] + X['2ndFlrSF']
X['TotalBath'] = X['FullBath'] + X['HalfBath']
X['TotalPorch'] = X['OpenPorchSF'] + X['EnclosedPorch'] + X['ScreenPorch']

### Binary Columns

We also add some simple engineered features: binary columns that indicate the presence (1) or absence (0) of certain features of the house.

In [None]:
colum = ['MasVnrArea','TotalBsmtFin','TotalBsmtSF','2ndFlrSF','WoodDeckSF','TotalPorch']

for col in colum:
    col_name = col+'_bin'
    X[col_name] = X[col].apply(lambda x: 1 if x > 0 else 0)
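
The `apply` with a lambda above can also be written as a vectorized comparison, which gives the same 0/1 result. A minimal sketch on toy data (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), including one missing value
toy = pd.DataFrame({'WoodDeckSF': [0, 120, np.nan],
                    'MasVnrArea': [0.0, 0.0, 35.5]})

for col in ['WoodDeckSF', 'MasVnrArea']:
    # The comparison yields booleans; NaN > 0 is False, so a missing value maps to 0
    toy[col + '_bin'] = (toy[col] > 0).astype(int)

print(toy)
```

Missing values end up as 0 in both versions, so it is worth deciding deliberately whether "missing" should really mean "absent" for a given column.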

### Converting Categorical to Numerical
Lastly, because most machine learning algorithms only learn from numerical data, we will convert the remaining categorical columns into one-hot encoded numerical columns using the get_dummies() method, so that they are suitable for feeding into our machine learning algorithm.

In [None]:
X = pd.get_dummies(X)
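
To see what `get_dummies()` does, here is a small sketch on a toy frame (the values are made up). Numeric columns are left alone, and each category of a text column becomes its own indicator column:

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                    'Street': ['Pave', 'Grvl', 'Pave']})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())   # ['LotArea', 'Street_Grvl', 'Street_Pave']
```

This is also where concatenating train and test earlier pays off: encoding them together means both end up with exactly the same set of indicator columns.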

### SalePrice Distribution

In [None]:
plt.figure(figsize=(10,6))
plt.title("Before transformation of SalePrice")
# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable plot
dist = sns.histplot(train['SalePrice'], kde=True)


The distribution is skewed to the right: the tail on the right-hand side of the curve is longer than the tail on the left-hand side, and the mean is greater than the mode. This is also called positive skewness.
A skewed target can hurt the performance of our machine learning model. One way to alleviate this is to apply a **log transformation** to the skewed target, in our case SalePrice, to reduce the skewness of the distribution.

In [None]:
plt.figure(figsize=(10,6))
plt.title("After transformation of SalePrice")
# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable plot
dist = sns.histplot(np.log(train['SalePrice']), kde=True)

In [None]:
y["SalePrice"] = np.log(y['SalePrice'])
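
To put a number on the effect of the log transformation, we can compare the skewness before and after. The sketch below uses synthetic log-normal "prices" rather than the real SalePrice column, so the exact values are only illustrative. It also shows that a model trained on the log scale needs `np.exp` to turn its predictions back into prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); NOT the real SalePrice data
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))

log_prices = np.log(prices)
print("skew before:", round(prices.skew(), 2))
print("skew after: ", round(log_prices.skew(), 2))

# Values on the log scale are mapped back to the original scale with np.exp
recovered = np.exp(log_prices)
```

A skewness close to 0 indicates a roughly symmetric distribution, which is what we are aiming for after the transformation.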

Now that we are satisfied with our final data, we will proceed to the part where we solve this regression problem: modeling.

## 4. Modeling

This section covers scaling the data for better optimization during training and introduces the ensembling methods used in this notebook to predict house prices. Hyperparameter tuning is only touched on briefly, as I will dedicate a separate notebook to explaining the tuning process in more detail, along with the mathematical side of the ensemble algorithms.
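
As a small taste of hyperparameter tuning, here is a minimal sketch using scikit-learn's `GridSearchCV` with a random forest. The data is synthetic and the grid values are arbitrary choices for illustration, not tuned recommendations for the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))                            # synthetic features
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=80)   # synthetic target

# Illustrative grid: every combination is scored with 3-fold cross-validation
param_grid = {'n_estimators': [20, 50], 'max_depth': [2, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=1),
                      param_grid, cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

print(search.best_params_)
```

The scoring is the negative MAE because scikit-learn always maximizes the score, so a value closer to 0 is better.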

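### Split back into train and test sets

In [None]:
# X was built as the train rows followed by the test rows (with repeated index labels),
# so we split it back by position rather than by label
n_train = len(y)
x = X.iloc[:n_train].copy()
test = X.iloc[n_train:].copy()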

### Scaling of Data

In [None]:
# let's import RobustScaler from sklearn for feature scaling
from sklearn.preprocessing import RobustScaler


In [None]:
# fit the scaler on the numeric training columns, then apply it to both train and test
cols = x.select_dtypes(np.number).columns
transformer = RobustScaler().fit(x[cols])
x[cols] = transformer.transform(x[cols])
test[cols] = transformer.transform(test[cols])
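
With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range (IQR), so a single extreme value has little influence on the scaling. A minimal sketch on a toy column (the values are made up) that checks this by hand:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column (made up for illustration); 100 is an outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Same result by hand: subtract the median, divide by the IQR
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

The outlier is still far from the rest after scaling, but it did not distort the scale of the other values, which is the point of using the median and IQR instead of the mean and standard deviation.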

In [None]:
# correlation of each numeric feature with SalePrice
corr = train.select_dtypes(exclude='object').corr()
print(corr['SalePrice'].sort_values(ascending=False))

In [None]:
# Create target object and call it y
y = train.SalePrice
# Create X
#features = ['OverallQual','LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF','FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'GrLivArea','GarageCars', 'GarageArea']
featurestop=['OverallQual','TotalBsmtSF', 'YearBuilt','YearRemodAdd','GarageYrBlt','Fireplaces', '1stFlrSF', 'MasVnrArea','FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'GrLivArea','GarageCars', 'GarageArea']
X = train[featurestop].copy()  # copy, so that filling values later does not touch train
sns.heatmap(X.isnull(),yticklabels=False, cbar=False, cmap='viridis')

In [None]:
##Check TestData
# path to file you will use for predictions
test_data_path = '/kaggle/input/home-data-for-ml-course/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

# create test_X from test_data, keeping only the columns used for prediction.
# The list of columns is stored in the variable featurestop
test_X = test_data[featurestop].copy()
#test_X.dropna(inplace=True)
test_X.info()

In [None]:
GarageYrBltmean=X.loc[:,"GarageYrBlt"].mean()
MasVnrAreamean=X.loc[:,"MasVnrArea"].mean()
print(GarageYrBltmean,MasVnrAreamean)

In [None]:
# assign the result back instead of using inplace=True on a single column
X['GarageYrBlt'] = X['GarageYrBlt'].fillna(GarageYrBltmean)
X['MasVnrArea'] = X['MasVnrArea'].fillna(MasVnrAreamean)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)  # (y_true, y_pred)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
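
A single random forest is already an ensemble of trees, but we can also combine different models. Here is a minimal sketch that averages a random forest and a gradient boosting model with scikit-learn's `VotingRegressor`. It runs on synthetic data, and the choice of these two models and their settings is my own assumption for illustration, not something tuned for the housing data.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))                     # synthetic features
y_demo = (2 * X_demo[:, 0] + X_demo[:, 1] ** 2
          + rng.normal(scale=0.1, size=200))           # synthetic target

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=1)

# The ensemble prediction is the average of the two models' predictions
ensemble = VotingRegressor([
    ('rf', RandomForestRegressor(n_estimators=50, random_state=1)),
    ('gb', GradientBoostingRegressor(random_state=1)),
])
ensemble.fit(X_tr, y_tr)

val_mae = mean_absolute_error(y_val, ensemble.predict(X_val))
print("Validation MAE for the averaged ensemble: {:.3f}".format(val_mae))
```

Averaging tends to help most when the individual models make different kinds of errors, so it is worth comparing the ensemble's validation MAE against each model on its own.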

In [None]:
#working with missing Values
GarageCarsmean=test_X.loc[:,"GarageCars"].mean()
GarageAreamean=test_X.loc[:,"GarageArea"].mean()
GarageYrBltmean=test_X.loc[:,"GarageYrBlt"].mean()
MasVnrAreamean=test_X.loc[:,"MasVnrArea"].mean()
TotalBsmtSFmean=test_X.loc[:,"TotalBsmtSF"].mean()
print(GarageYrBltmean,MasVnrAreamean)
print(GarageCarsmean,GarageAreamean)

In [None]:
# assign the result back instead of using inplace=True on a single column
test_X['GarageArea'] = test_X['GarageArea'].fillna(GarageAreamean)
test_X['GarageYrBlt'] = test_X['GarageYrBlt'].fillna(GarageYrBltmean)
test_X['MasVnrArea'] = test_X['MasVnrArea'].fillna(MasVnrAreamean)
test_X['GarageCars'] = test_X['GarageCars'].fillna(GarageCarsmean)
test_X['TotalBsmtSF'] = test_X['TotalBsmtSF'].fillna(TotalBsmtSFmean)
test_X.info()
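
The cells above fill the test set with means computed from the test set itself. A common alternative is to learn the fill values from the training data and reuse them on the test data, so both are processed in exactly the same way. Here is a minimal sketch with `SimpleImputer` on toy frames (the values and the names `train_demo` and `test_demo` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration)
train_demo = pd.DataFrame({'GarageArea': [400.0, 600.0, np.nan],
                           'MasVnrArea': [0.0, 100.0, 200.0]})
test_demo = pd.DataFrame({'GarageArea': [np.nan, 500.0],
                          'MasVnrArea': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train_demo)                 # learn the column means from the training rows only

# Reuse the training means to fill the gaps in the test rows
test_filled = pd.DataFrame(imputer.transform(test_demo),
                           columns=test_demo.columns)
print(test_filled)
```

Here the missing test values become 500 and 100, the means of the training columns, regardless of what the test set looks like.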

In [None]:
rf_model_on_full_data = RandomForestRegressor(random_state=1)
rf_model_on_full_data.fit(X, y)

In [None]:
# make the predictions that we will submit
test_preds = rf_model_on_full_data.predict(test_X)

# save the predictions in the format used for competition scoring
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

I hope you have learnt what the whole process of solving a regression problem looks like, understood the importance of data preprocessing, and gained some insight into the ensembling algorithms that you can use in future regression problems :)

# Please upvote this notebook if it has helped you in any way! Thank you :)