In this notebook I perform Exploratory Data Analysis (EDA) on the housing dataset and try to identify the relationships between a house's sale price and its other features. After the EDA, I pre-process the data to handle the missing values, and then apply several regression models to make the predictions.

I hope you find this kernel helpful and some **<font color='red'>UPVOTES</font>** would be very much appreciated



In [None]:
import warnings
warnings.filterwarnings('ignore')

## **Importing Required Libraries**

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

%matplotlib inline

## **Loading the Training and Testing Dataset**

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

### **Describing the training dataset**

In [None]:
train.head(3)

#### **Dimensions of training dataset**

In [None]:
print('Number of rows in training set: ',train.shape[0])
print('Number of columns in training set: ', train.shape[1])

### **Describing the test dataset**

In [None]:
test.head(3)

#### **Dimensions of test dataset**

In [None]:
print('Number of rows in test dataset: ', test.shape[0])
print('Number of columns in test dataset: ', test.shape[1])

**Concatenating the training and test sets for exploratory data analysis**

Since the training set contains one extra column, **'SalePrice'**, I remove it before concatenating.

In [None]:
df = pd.concat([train.drop('SalePrice', axis = 1),test], axis = 0)

#### **Peeking into the dataset**

In [None]:
df.head(3)

#### **Dimensions of combined dataset**

In [None]:
print('Number of rows in dataset: ', df.shape[0])
print('Number of columns in dataset: ', df.shape[1])

#### **Describing the dataset**

Since the **'Id'** column is of no use in describing the dataset, I will remove it during describing

In [None]:
df.drop('Id', axis = 1).describe().T                   #T = transpose of the dataset

**Total Number of Categorical Attributes**

In [None]:
print('No. of categorical attributes: ', df.select_dtypes(exclude = ['int64','float64']).columns.size)

**Total Number of Numerical Attributes**

In [None]:
print('No. of numerical attributes: ', df.select_dtypes(exclude = ['object']).columns.size)

**Checking for Null Values in the dataset**

In [None]:
plt.figure(figsize=(20,6))
sns.heatmap(df.select_dtypes(exclude=['object']).isnull(), yticklabels=False, cbar = False, cmap = 'viridis')
plt.title('Null Values present in Numerical Attributes',fontsize=18)
plt.show()

plt.figure(figsize=(20,6))
sns.heatmap(df.select_dtypes(exclude=['int64','float64']).isnull(), yticklabels=False, cbar = False, cmap = 'viridis')
plt.title('Null Values present in Categorical Attributes',fontsize=18)
plt.show()

#### **Plotting the percentage (%) of null values**

Only columns with at least one null value (missing % greater than 0) are included

In [None]:
null_val = df.isnull().sum()/len(df)*100
null_val.sort_values(ascending = False, inplace = True)
null_val = pd.DataFrame(null_val, columns = ['missing %'])
null_val = null_val[null_val['missing %'] > 0]

sns.set_style('whitegrid')
plt.figure(figsize=(10,6))
sns.barplot(x = null_val.index, y = null_val['missing %'], palette='Set1')
plt.xticks(rotation = 90)
plt.show()

## **Exploratory Data Analysis and Visualization**

### **1. Plotting the distribution of all Numerical Attributes**

In [None]:
sns.set_style('whitegrid')
df.hist(bins = 30, figsize = (20,15), color = 'darkgreen')
plt.tight_layout()                                     #must be called before plt.show() to take effect
plt.show()

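### **2. Correlation heatmap of the numerical features**

In [None]:
plt.figure(figsize=(30,20))
#Correlation is only defined for numerical columns, so the object columns are excluded first
sns.heatmap(df.select_dtypes(exclude=['object']).corr(), annot = True,cmap='GnBu')
plt.title('Correlation Heatmap of Numerical Features',fontsize=18)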
plt.show()

### **3. Pairplot between various features**

In [None]:
sns.set_style('whitegrid')
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols])
plt.show()

### **Plotting the relationships between 'SalePrice' and numerical features**

#### **1. SalePrice vs. 1stFlrSF**

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='1stFlrSF',y='SalePrice', data = train,color = 'orange')
plt.title('SalePrice vs. 1stFlrSF')
plt.show()

#### **2. SalePrice vs. GrLivArea**

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='GrLivArea',y='SalePrice', data = train,color = 'limegreen')
plt.title('SalePrice vs. GrLivArea')
plt.show()

#### **3. SalePrice vs. TotalBsmtSF**

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='TotalBsmtSF',y='SalePrice', data = train,color = 'royalblue')
plt.title('SalePrice vs. TotalBsmtSF')
plt.show()

#### **4. SalePrice vs. GarageArea**

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='GarageArea',y='SalePrice', data = train,color = 'royalblue')
plt.title('SalePrice vs. GarageArea')
plt.show()

### **Plotting SalePrice relationships with categorical features**

#### **1. SalePrice vs. OverallQual**

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,6))
sns.boxplot(x='OverallQual', y='SalePrice', data = train,palette='magma')
plt.show()

#### **2. SalePrice vs. Street**

In [None]:
plt.figure(figsize=(5,6))
sns.boxplot(x='Street', y='SalePrice', data = train,palette='magma')
plt.title('SalePrice vs. Street')
plt.show()

#### **3. SalePrice vs. YearBuilt**

In [None]:
plt.figure(figsize=(20,12))
sns.boxplot(x='YearBuilt', y='SalePrice', data = train)
plt.xticks(rotation = 90)
plt.title('SalePrice vs. YearBuilt', fontsize=15)
plt.show()

### **Preparing the Data**

**Filling in the missing values**

In [None]:
#Group by neighborhood and fill each missing LotFrontage with the median LotFrontage of its neighborhood
df['LotFrontage'] = df.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
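
To make the group-wise imputation above easier to follow, here is a minimal sketch on a toy DataFrame. The neighborhood names and frontage values are made up for illustration and are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical neighborhoods and frontages, for illustration only
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'LotFrontage': [60.0, np.nan, 80.0, 20.0, 30.0, np.nan],
})

# transform returns a Series aligned with the original rows,
# so each NaN is replaced by the median of its own group
toy['LotFrontage'] = toy.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median())
)
print(toy)
```

The missing value in neighborhood A is filled with the median of 60 and 80, and the one in B with the median of 20 and 30, so a house is imputed from its own neighborhood rather than from the whole dataset.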

In [None]:
#GarageType, GarageFinish, GarageQual and GarageCond: replace missing values with 'None'
for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    df[col] = df[col].fillna('None')

In [None]:
#GarageYrBlt, GarageArea and GarageCars: replace missing values with zero
for col in ['GarageYrBlt', 'GarageArea', 'GarageCars']:
    df[col] = df[col].fillna(0)

In [None]:
#BsmtFinType2, BsmtExposure, BsmtFinType1, BsmtCond and BsmtQual: replace missing values with 'None'
for col in ('BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtCond', 'BsmtQual'):
    df[col] = df[col].fillna('None')

In [None]:
#MasVnrArea: replace missing values with zero
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)

In [None]:
#MasVnrType: replace missing values with 'None'
df['MasVnrType'] = df['MasVnrType'].fillna('None')

In [None]:
#Fill missing 'Electrical' values with the most frequent value (the mode)
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])

In [None]:
#Utilities is not needed, so the column is dropped
df = df.drop(['Utilities'], axis=1)

In [None]:
df['PoolQC'] = df['PoolQC'].fillna('None')

In [None]:
df['MiscFeature'] = df['MiscFeature'].fillna('None')

In [None]:
df['Alley'] = df['Alley'].fillna('None')

In [None]:
df['Fence'] = df['Fence'].fillna('None')

In [None]:
df['FireplaceQu'] = df['FireplaceQu'].fillna('None')

In [None]:
#Assign the result back instead of using inplace=True, which may not update df in recent pandas versions
df['KitchenQual'] = df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])

In [None]:
df['BsmtFullBath'] = df['BsmtFullBath'].fillna(0)

In [None]:
df['FullBath'] = df['FullBath'].fillna(df['FullBath'].mode()[0])

In [None]:
for col in ['SaleType','KitchenQual','Exterior2nd','Exterior1st','Electrical']:
    df[col] = df[col].fillna(df[col].mode()[0])

In [None]:
df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])

In [None]:
df['Functional'] = df['Functional'].fillna(df['Functional'].mode()[0])

In [None]:
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    df[col] = df[col].fillna(0)

In [None]:
#Check whether any null values remain
plt.figure(figsize=(15, 4))
sns.heatmap(df.isnull(),yticklabels=False)
plt.show()

There are no remaining Null Values in the dataset
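
A heatmap can hide a handful of missing cells in a frame with thousands of rows, so a numeric count is a useful companion check. The sketch below runs on a small made-up frame standing in for the combined dataset; the same two counting lines could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined dataset (values are made up)
toy = pd.DataFrame({
    'GarageType': ['Attchd', None, 'Detchd'],
    'GarageArea': [548.0, np.nan, 460.0],
})

# Same style of imputation as above: 'None' for categories, 0 for areas
toy['GarageType'] = toy['GarageType'].fillna('None')
toy['GarageArea'] = toy['GarageArea'].fillna(0)

# Count the remaining nulls per column and keep only the non-zero entries
remaining = toy.isnull().sum()
remaining = remaining[remaining > 0]
print('Columns with nulls left:', list(remaining.index))
```

An empty list means every column has been fully imputed.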

### **Label Encoding the categorical features**

In [None]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold', 'MSZoning', 'LandContour', 'LotConfig', 'Neighborhood',
        'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
        'Exterior2nd', 'MasVnrType', 'Foundation', 'GarageType', 'MiscFeature', 
        'SaleType', 'SaleCondition', 'Electrical', 'Heating')

In [None]:
from sklearn.preprocessing import LabelEncoder
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(df[c].values))
    df[c] = lbl.transform(list(df[c].values))
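
To show what the encoder in the loop above actually produces, here is a minimal sketch on a made-up list of quality grades (the list is hypothetical, not a column from the dataset).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical quality grades; in quality terms Po < Fa < TA < Gd < Ex
grades = ['TA', 'Gd', 'Ex', 'Fa', 'TA', 'Po']

lbl = LabelEncoder()
codes = lbl.fit_transform(grades)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(lbl.classes_))
print(list(codes))
```

The codes follow the sorted (alphabetical) order of the labels, not the quality order, so 'Ex' gets 0 and 'TA' gets 4. Tree-based models can usually cope with this, but a linear model will read the codes as magnitudes, which is worth keeping in mind when comparing the scores later on.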

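### **Splitting the combined data back into the original training and test sets**

In [None]:
n_train = len(train)                   #number of rows that came from the training set
train_data = df.iloc[:n_train,:]
test_data = df.iloc[n_train:,:]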

In [None]:
train_data.shape

In [None]:
test_data.shape

In [None]:
X = train_data
y = train['SalePrice']

### **Splitting the training data into training and validation sets**

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.21, random_state = 7)

## **Regression Models**

### **1. Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

In [None]:
lin_reg.score(X_test,y_test)

In [None]:
prediction = lin_reg.predict(test_data)

### **2. Random Forest Regressor**

In [None]:
#Train the model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=1000)

In [None]:
model.fit(X_train, y_train)

In [None]:
model.score(X_test,y_test)

### **3. Gradient Boosting Regressor**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)

In [None]:
GBR.fit(X_train, y_train)

In [None]:
GBR.score(X_test,y_test)
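
The `score` method used above returns the R² of the predictions for scikit-learn regressors. Assuming this data comes from the Kaggle House Prices competition, submissions there are evaluated on the RMSE between the logarithms of the predicted and observed prices, so it can help to look at that number as well. The helper below is a hypothetical sketch (the name `rmse_log` and the sample prices are mine, not part of the notebook).

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the logs of the true and predicted prices (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices, for illustration only
y_true = [200000, 150000, 320000]
y_pred = [210000, 140000, 300000]
print(rmse_log(y_true, y_pred))
```

In this notebook it could be called as `rmse_log(y_test, GBR.predict(X_test))`. It only works when all predictions are positive, which is not guaranteed for the linear model.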

**Since the Gradient Boosting Regressor has the highest score, I use it to make the final predictions**

In [None]:
GBR.fit(X,y)

In [None]:
predictions = GBR.predict(test_data)

In [None]:
submission = pd.DataFrame({'Id':test_data['Id'],'SalePrice':predictions})

In [None]:
submission.to_csv('housepricesub.csv',index=False)