# House Price Prediction Problem from Kaggle 

### My workflow (according to Assignment Task):

1. Load the training data set
2. Deal with missing data
3. Handle categorical data as stated in the assignment
4. Explore the data after encoding, looking at attributes and feature correlation
5. Tune some features if the data needs it
6. Apply different supervised learning models to obtain predictions. The algorithms are:

    * Linear regression with L1 and L2 regularization
    * Random Forest
    * Decision Tree
    * Gradient Boosting

7. Fine-tune these algorithms, apply PCA followed by a Decision Tree for comparison, and discuss the other given tasks.

## Importing libraries and loading the dataset

In [51]:
# importing necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, skew
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

In [52]:
# train and test data from given dataset

test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

In [53]:
test.shape,train.shape

((1459, 80), (1460, 81))

##### Combining train and test data and recording their sizes

In [54]:
# record the train/test sizes, keep the target, and stack both sets for joint preprocessing

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
train_test_data = pd.concat((train, test)).reset_index(drop=True)
train_test_data.drop(['SalePrice'], axis=1, inplace=True)
print("train_test_data size is : {}".format(train_test_data.shape))

train_test_data size is : (2919, 80)


# Missing Data Handling

In [55]:
# summary of non-null counts and dtypes per column

train_test_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             2919 non-null   int64  
 1   MSSubClass     2919 non-null   int64  
 2   MSZoning       2915 non-null   object 
 3   LotFrontage    2433 non-null   float64
 4   LotArea        2919 non-null   int64  
 5   Street         2919 non-null   object 
 6   Alley          198 non-null    object 
 7   LotShape       2919 non-null   object 
 8   LandContour    2919 non-null   object 
 9   Utilities      2917 non-null   object 
 10  LotConfig      2919 non-null   object 
 11  LandSlope      2919 non-null   object 
 12  Neighborhood   2919 non-null   object 
 13  Condition1     2919 non-null   object 
 14  Condition2     2919 non-null   object 
 15  BldgType       2919 non-null   object 
 16  HouseStyle     2919 non-null   object 
 17  OverallQual    2919 non-null   int64  
 18  OverallC

In [56]:
train_test_data.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           4
LotFrontage      486
LotArea            0
                ... 
MiscVal            0
MoSold             0
YrSold             0
SaleType           1
SaleCondition      0
Length: 80, dtype: int64

In [57]:
# percentage of missing values per column (columns with nothing missing are removed)

train_test_data_na= (train_test_data.isnull().sum() / len(train_test_data)) * 100
train_test_data_na = train_test_data_na.drop(train_test_data_na[train_test_data_na == 0].index)
missing_data = pd.DataFrame({'Missing Percentage' :train_test_data_na})
missing_data.shape

(34, 1)

In [58]:
missing_data.head(34)

| | Missing Percentage |
|---|---|
| MSZoning | 0.137 |
| LotFrontage | 16.65 |
| Alley | 93.217 |
| Utilities | 0.069 |
| Exterior1st | 0.034 |
| Exterior2nd | 0.034 |
| MasVnrType | 0.822 |
| MasVnrArea | 0.788 |
| BsmtQual | 2.775 |
| BsmtCond | 2.809 |


### Dropping the `Id` column and features with more than 50% missing data


In [59]:
# drop the identifier column and the features that are mostly missing

train_test_data_new = train_test_data.drop(columns=['Id', 'Alley', 'PoolQC', 'Fence', 'MiscFeature'])

In [60]:
train_test_data_new.shape

(2919, 75)

In [61]:
train_test_data_new['MSZoning'] = train_test_data_new['MSZoning'].fillna(train_test_data_new['MSZoning'].mode()[0])
train_test_data_new['LotFrontage'] = train_test_data_new['LotFrontage'].fillna(train_test_data_new['LotFrontage'].mean())
train_test_data_new['Utilities'] = train_test_data_new['Utilities'].fillna(train_test_data_new['Utilities'].mode()[0])
train_test_data_new['Exterior1st'] = train_test_data_new['Exterior1st'].fillna(train_test_data_new['Exterior1st'].mode()[0])
train_test_data_new['Exterior2nd'] = train_test_data_new['Exterior2nd'].fillna(train_test_data_new['Exterior2nd'].mode()[0])
train_test_data_new['MasVnrType'] = train_test_data_new['MasVnrType'].fillna(train_test_data_new['MasVnrType'].mode()[0])
train_test_data_new['MasVnrArea'] = train_test_data_new['MasVnrArea'].fillna(train_test_data_new['MasVnrArea'].mean())
train_test_data_new['BsmtQual'] = train_test_data_new['BsmtQual'].fillna(train_test_data_new['BsmtQual'].mode()[0])
train_test_data_new['BsmtCond'] = train_test_data_new['BsmtCond'].fillna(train_test_data_new['BsmtCond'].mode()[0])
train_test_data_new['BsmtExposure'] = train_test_data_new['BsmtExposure'].fillna(train_test_data_new['BsmtExposure'].mode()[0])
train_test_data_new['BsmtFinType1'] = train_test_data_new['BsmtFinType1'].fillna(train_test_data_new['BsmtFinType1'].mode()[0])
train_test_data_new['BsmtFinSF1'] = train_test_data_new['BsmtFinSF1'].fillna(train_test_data_new['BsmtFinSF1'].mean())
train_test_data_new['BsmtFinType2'] = train_test_data_new['BsmtFinType2'].fillna(train_test_data_new['BsmtFinType2'].mode()[0])
train_test_data_new['BsmtFinSF2'] = train_test_data_new['BsmtFinSF2'].fillna(train_test_data_new['BsmtFinSF2'].mean())
train_test_data_new['BsmtUnfSF'] = train_test_data_new['BsmtUnfSF'].fillna(train_test_data_new['BsmtUnfSF'].mean())
train_test_data_new['TotalBsmtSF'] = train_test_data_new['TotalBsmtSF'].fillna(train_test_data_new['TotalBsmtSF'].mean())
train_test_data_new['Electrical'] = train_test_data_new['Electrical'].fillna(train_test_data_new['Electrical'].mode()[0])
train_test_data_new['BsmtFullBath'] = train_test_data_new['BsmtFullBath'].fillna(train_test_data_new['BsmtFullBath'].mode()[0])
train_test_data_new['BsmtHalfBath'] = train_test_data_new['BsmtHalfBath'].fillna(train_test_data_new['BsmtHalfBath'].mode()[0])
train_test_data_new['KitchenQual'] = train_test_data_new['KitchenQual'].fillna(train_test_data_new['KitchenQual'].mode()[0])
train_test_data_new['Functional'] = train_test_data_new['Functional'].fillna(train_test_data_new['Functional'].mode()[0])
train_test_data_new['FireplaceQu'] = train_test_data_new['FireplaceQu'].fillna(train_test_data_new['FireplaceQu'].mode()[0])
train_test_data_new['GarageType'] = train_test_data_new['GarageType'].fillna(train_test_data_new['GarageType'].mode()[0])
train_test_data_new['GarageYrBlt'] = train_test_data_new['GarageYrBlt'].fillna(train_test_data_new['GarageYrBlt'].mean())
train_test_data_new['GarageFinish'] = train_test_data_new['GarageFinish'].fillna(train_test_data_new['GarageFinish'].mode()[0])
train_test_data_new['GarageCars'] = train_test_data_new['GarageCars'].fillna(train_test_data_new['GarageCars'].mean())
train_test_data_new['GarageArea'] = train_test_data_new['GarageArea'].fillna(train_test_data_new['GarageArea'].mean())
train_test_data_new['GarageQual'] = train_test_data_new['GarageQual'].fillna(train_test_data_new['GarageQual'].mode()[0])
train_test_data_new['GarageCond'] = train_test_data_new['GarageCond'].fillna(train_test_data_new['GarageCond'].mode()[0])
train_test_data_new['SaleType'] = train_test_data_new['SaleType'].fillna(train_test_data_new['SaleType'].mode()[0])
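
The cell above repeats one pattern: categorical columns (and the basement bath counts) are filled with the mode, and the remaining numeric columns with the mean. A minimal sketch of the same idea as a loop is shown below. The helper name `fill_missing` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_missing(df, mode_cols, mean_cols):
    """Return a copy of df with mode_cols filled by their mode and mean_cols by their mean."""
    out = df.copy()
    for col in mode_cols:
        # mode() returns a Series; take the first (most frequent) value
        out[col] = out[col].fillna(out[col].mode()[0])
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    return out

# Toy frame standing in for train_test_data_new (values are made up).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", None, "RM"],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
})

filled = fill_missing(toy, mode_cols=["MSZoning"], mean_cols=["LotFrontage"])
print(filled)
```

With the real data, the two lists would name the same columns the cell above fills by mode and by mean, which keeps the choice of strategy per column explicit.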

# Handling Categorical Data

In [62]:
columns=['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood',
         'Condition2','BldgType','Condition1','HouseStyle','SaleType',
        'SaleCondition','ExterCond',
         'ExterQual','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
        'RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','Heating','HeatingQC',
         'CentralAir',
         'Electrical','KitchenQual','Functional',
         'FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive']

In [63]:
print(len(columns))

39


In [64]:
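# one-hot encode each listed categorical column and append the dummies to the remaining columns

def category_onehot_multcols(multcolumns):
    # note: this reads and modifies the global train_test_data_new in place
    dummy_frames = []
    for fields in multcolumns:
        print(fields)
        # no prefix is used, so dummy columns are named after the category values alone
        df1 = pd.get_dummies(train_test_data_new[fields], drop_first=True)
        train_test_data_new.drop([fields], axis=1, inplace=True)
        dummy_frames.append(df1)

    # remaining (non-encoded) columns first, then all dummy columns in order
    df_final = pd.concat([train_test_data_new] + dummy_frames, axis=1)
    return df_final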

In [65]:
train_test_data_final= category_onehot_multcols(columns)

MSZoning
Street
LotShape
LandContour
Utilities
LotConfig
LandSlope
Neighborhood
Condition2
BldgType
Condition1
HouseStyle
SaleType
SaleCondition
ExterCond
ExterQual
Foundation
BsmtQual
BsmtCond
BsmtExposure
BsmtFinType1
BsmtFinType2
RoofStyle
RoofMatl
Exterior1st
Exterior2nd
MasVnrType
Heating
HeatingQC
CentralAir
Electrical
KitchenQual
Functional
FireplaceQu
GarageType
GarageFinish
GarageQual
GarageCond
PavedDrive


In [66]:
train_test_data_final.shape

(2919, 236)

In [67]:
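# keep only the first column for each repeated name; dummies from different features that
# share a category label (e.g. quality grades such as "Gd" or "TA") collapse into one column here
train_test_data_final = train_test_data_final.loc[:, ~train_test_data_final.columns.duplicated()]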

In [68]:
train_test_data_final.shape

(2919, 176)
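
The drop from 236 to 176 columns comes from dummy columns that share a name across different features. A small sketch of one way to avoid the collision is to let `pd.get_dummies` prefix each dummy with its source column. The toy frame below is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Toy frame: two quality features that share the labels "Gd" and "TA".
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Gd"],
    "KitchenQual": ["TA", "Gd", "Ex"],
    "LotArea": [8450, 9600, 11250],
})

# Encoding each column separately without a prefix yields repeated column names.
no_prefix = pd.concat(
    [pd.get_dummies(toy[c], drop_first=True) for c in ["ExterQual", "KitchenQual"]],
    axis=1,
)
print(no_prefix.columns.duplicated().any())

# Passing the column list to get_dummies prefixes every dummy with its feature name.
encoded = pd.get_dummies(toy, columns=["ExterQual", "KitchenQual"], drop_first=True)
print(list(encoded.columns))
```

With prefixed names no de-duplication step is needed, so no encoded feature is discarded. The resulting column count and model scores would differ from the ones reported in this notebook.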

In [69]:
# finalize data set for performing prediction 

train_test_data_final

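| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | Min1 | Min2 | Typ | Attchd | Basment | BuiltIn | CarPort | Detchd | RFn | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 65.000 | 8450 | 7 | 5 | 2003 | 2003 | 196.000 | 706.000 | 0.000 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 20 | 80.000 | 9600 | 6 | 8 | 1976 | 1976 | 0.000 | 978.000 | 0.000 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 60 | 68.000 | 11250 | 7 | 5 | 2001 | 2002 | 162.000 | 486.000 | 0.000 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 70 | 60.000 | 9550 | 7 | 5 | 1915 | 1970 | 0.000 | 216.000 | 0.000 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 60 | 84.000 | 14260 | 8 | 5 | 2000 | 2000 | 350.000 | 655.000 | 0.000 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2914 | 160 | 21.000 | 1936 | 4 | 7 | 1970 | 1970 | 0.000 | 0.000 | 0.000 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2915 | 160 | 21.000 | 1894 | 4 | 5 | 1970 | 1970 | 0.000 | 252.000 | 0.000 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2916 | 20 | 160.000 | 20000 | 5 | 7 | 1960 | 1996 | 0.000 | 1224.000 | 0.000 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2917 | 85 | 62.000 | 10441 | 5 | 5 | 1992 | 1992 | 0.000 | 337.000 | 0.000 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |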


#### Splitting the processed data back into train and test

In [70]:
train,test = train_test_data_final[:ntrain],train_test_data_final[ntrain:]

In [71]:
train.shape, test.shape

((1460, 176), (1459, 176))

# Applying Supervised Learning Models for prediction

### Cross Validation


In [80]:
# importing libraries for models and evaluations 

from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor

from sklearn.decomposition import PCA

from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error


In [81]:
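# cross-validated RMSE on the raw SalePrice scale (despite the name, no log transform is applied)

n_folds = 5
def rmsle_cv(model):
    # note: get_n_splits() returns the integer 5, so cross_val_score receives cv=5
    # (unshuffled K-fold); passing the KFold object itself would apply the shuffle
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse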
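
A minimal sketch of the scoring helper with the `KFold` splitter passed directly to `cross_val_score`, so that shuffling and `random_state` take effect. The function name `rmse_cv` and the synthetic data from `make_regression` are assumptions made for illustration; they are not the notebook's data or its helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

def rmse_cv(model, X, y, n_folds=5, seed=42):
    """Per-fold RMSE using shuffled K-fold cross-validation."""
    # Pass the splitter object itself, not the result of get_n_splits().
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(-neg_mse)

# Synthetic regression data standing in for train.values and y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

scores = rmse_cv(Ridge(alpha=0.6), X_demo, y_demo)
print(scores.mean())
```

If an error on the logarithmic scale is wanted, one option is to pass a log-transformed target such as `np.log1p(y)` as `y`. Scores computed that way would not be comparable with the RMSE values reported in this notebook.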

## Linear Regression with L1 and L2


Linear regression with Lasso (L1) and Ridge (L2) regularization.

### Lasso Regression 

In [87]:
# tuning lasso model 

lasso = Lasso(alpha =0.0005, random_state=1)

In [88]:
# finding prediction score

score = rmsle_cv(lasso)
score.mean()


43770.658888231905

### Ridge Regression

In [89]:
# tuning Ridge Model 

ridge = Ridge(alpha=0.6)

In [90]:
# finding predictive score

score = rmsle_cv(ridge)
score.mean()

33779.02293193526

## Decision Tree 



### Decision Tree with no bound or restriction 

In [91]:
# tuning Decision Tree model (no restriction)

decisiontree_without_restriction = DecisionTreeRegressor(random_state=0)

In [92]:
# finding score for Decision Tree

score = rmsle_cv(decisiontree_without_restriction)
score.mean()

41909.25098000234

### Decision Tree with maximum depth

In [93]:
# tuning Decision Tree model (max_depth)

decisiontree_with_restriction = DecisionTreeRegressor(random_state=0,max_depth=4) 

In [94]:
# finding score

score = rmsle_cv(decisiontree_with_restriction)
score.mean()

42701.98183721271

## Random Forest

**estimator_100**

In [96]:
# tuning Random Forest model 

randomforest_estimator_100 = RandomForestRegressor(n_estimators=100)

In [97]:
# finding predictive score

score = rmsle_cv(randomforest_estimator_100)
score.mean()

30257.176194089698

**estimator_300**

In [98]:
# tuning Random Forest model

randomforest_estimator_300 = RandomForestRegressor(n_estimators=300)

In [99]:
# finding predictive score

score = rmsle_cv(randomforest_estimator_300)
score.mean()

30129.970496494836

**estimator_500**

In [100]:
# tuning Random Forest model

randomforest_estimator_500 = RandomForestRegressor(n_estimators=500)

In [101]:
# finding predictive score

score = rmsle_cv(randomforest_estimator_500)
score.mean()

30226.930103155562

## Gradient Boosting

In [102]:
# tuning GB model

GBoost = GradientBoostingRegressor(n_estimators=3000)

In [103]:
# finding predictive score

score = rmsle_cv(GBoost)
score.mean()

25378.489830172894

**Comparing the models applied to this data set, each gives a different mean cross-validated RMSE. Gradient Boosting has the lowest error (about 25,378, against roughly 30,100 to 30,300 for Random Forest, 33,779 for Ridge, and 41,900 to 43,800 for the Decision Trees and Lasso), so GB is the best model among those tried here.**
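
The comparison above can also be collected in a single loop that scores every model the same way and sorts the results. The sketch below runs on synthetic data from `make_regression` with smaller ensembles so that it finishes quickly; the data, the `models` dictionary, and the reduced `n_estimators` values are assumptions for illustration, so its numbers are unrelated to the house-price scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the processed training set.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)

models = {
    "Lasso": Lasso(alpha=0.0005, random_state=1, max_iter=10000),
    "Ridge": Ridge(alpha=0.6),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold RMSE per model.
results = {}
for name, model in models.items():
    neg_mse = cross_val_score(model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5)
    results[name] = np.sqrt(-neg_mse).mean()

# Lower RMSE is better, so sort ascending.
for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:>18}: {rmse:.2f}")

best = min(results, key=results.get)
print("lowest RMSE:", best)
```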

### PCA and Decision Tree

In [104]:
# applying PCA
# note: train_pca is not passed to rmsle_cv below, which always reads the global train

pca = PCA(n_components=13)
train_pca = pca.fit_transform(train)

In [107]:
# tuning Decision Tree 

decisiontree_pca = DecisionTreeRegressor(max_depth=4)

In [108]:
# finding predictive score after tuning Decision tree

score = rmsle_cv(decisiontree_pca)
score.mean()

42569.7021759482

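**Here the Decision Tree from the PCA step (RMSE about 42,570) is compared with L2 regularization (Ridge, RMSE about 33,779): Ridge performs better because its cross-validated error is lower. Note that `rmsle_cv` scores the model on `train`, not on `train_pca`, so this figure does not yet reflect the PCA-reduced features.**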
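
One way to make sure the tree is actually trained on the PCA-reduced features is to chain both steps in a scikit-learn pipeline, so that PCA is refit inside every cross-validation fold. This is a sketch on synthetic data from `make_regression`; the variable names and data are illustrative assumptions, and the score it prints says nothing about the house-price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the processed training set.
X_demo, y_demo = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

# PCA to 13 components, then a depth-limited tree on the components.
pca_tree = make_pipeline(
    PCA(n_components=13),
    DecisionTreeRegressor(max_depth=4, random_state=0),
)

neg_mse = cross_val_score(pca_tree, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5)
pca_rmse = np.sqrt(-neg_mse)
print(pca_rmse.mean())
```

Fitting PCA inside the pipeline also keeps the held-out fold from influencing the components, which fitting PCA once on the full training set would not.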

#### Answering the given Question

Question: What is ensemble learning? What is the difference between Bagging and Boosting?

Answer: 

Ensemble Learning: Ensemble learning combines several models (for example, decision trees) to produce better predictive performance than any single one of them. The main principle is that a group of weak learners comes together to form a strong learner, thus increasing the accuracy of the model. When we try to predict the target variable with any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias.

Bagging: Bagging (bootstrap aggregating) is used when the objective is to reduce the variance of a decision tree. The idea is to create several subsets of the training sample, each drawn randomly with replacement. Each subset is used to train its own decision tree, so we end up with an ensemble of different models. The average of the predictions from all the trees is used, which is more robust than a single decision tree.

Boosting: Boosting is another ensemble procedure for building a collection of predictors. Here we fit trees consecutively, often on random samples of the data, and at each step the objective is to correct the net error of the prior trees. If a given input is misclassified by one hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into a better-performing model.


Comparison between Bagging and Boosting:

1. Bagging and Boosting are both ensemble methods that get N learners from one base learner. In Bagging the learners are built independently, while Boosting adds new models that do well where previous models fail.
2. Both generate several training data sets by random sampling, but only Boosting assigns weights to the data to tip the scales in favor of the most difficult cases.
3. Both make the final decision by averaging the N learners (or taking a majority vote). The average is equally weighted for Bagging and weighted for Boosting, with more weight given to learners that perform better on the training data.
4. Both are good at reducing variance and provide higher stability, but only Boosting tries to reduce bias. On the other hand, Bagging may reduce over-fitting, while Boosting can increase it.
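
The contrast between the two approaches can be tried out with scikit-learn's `BaggingRegressor` (independent trees on bootstrap samples, averaged) and `GradientBoostingRegressor` (shallow trees fitted one after another on the remaining error). This is a minimal sketch on synthetic data; the data set, the hyperparameters, and the variable names are assumptions chosen for illustration, and which model scores lower here does not generalize.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem, split into train and held-out parts.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Bagging: 50 full-depth trees, each trained independently on a bootstrap sample.
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bagging.fit(X_tr, y_tr)

# Boosting: 50 shallow trees added sequentially, each correcting the current error.
boosting = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)
boosting.fit(X_tr, y_tr)

rmse_bagging = np.sqrt(mean_squared_error(y_te, bagging.predict(X_te)))
rmse_boosting = np.sqrt(mean_squared_error(y_te, boosting.predict(X_te)))
print(f"bagging RMSE:  {rmse_bagging:.2f}")
print(f"boosting RMSE: {rmse_boosting:.2f}")
```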