## Introduction

We need to predict the final sale price of each home from several explanatory variables.

In [1]:
import numpy as np
import pandas as pd

Load the training data with the read_csv function.

In [2]:
housing = pd.read_csv('train.csv')

In [3]:
housing.isnull().sum()[housing.isnull().sum() != 0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

Missing values in numerical columns are filled with the median, and missing values in categorical columns are filled with the most frequent value (the mode).

In [4]:
Lotfront_median = housing['LotFrontage'].median()
MSZoning_mode = housing['MSZoning'].mode()[0]
Utilities_mode = housing['Utilities'].mode()[0]
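As a minimal sketch of the median/mode rule described above, the toy frame below stands in for train.csv. Its values are made up for illustration, and the names `toy`, `frontage_median` and `electrical_mode` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are invented for illustration).
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA'],
})

# Median for the numerical column, mode for the categorical one.
frontage_median = toy['LotFrontage'].median()
electrical_mode = toy['Electrical'].mode()[0]

# Assign the filled column back instead of filling in place.
toy['LotFrontage'] = toy['LotFrontage'].fillna(frontage_median)
toy['Electrical'] = toy['Electrical'].fillna(electrical_mode)

print(toy)
```

Computing the fill values once and keeping them in variables makes it possible to reuse the same training-set statistics on the test set later.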

In some columns a missing value means that the feature itself is absent (for example, a house with no garage or no basement). For those columns it makes sense to fill the missing values with a single placeholder, 'NO', and to fill the related numerical columns with 0.

In [5]:
# Numerical column: fill with the median computed above.
# Columns are assigned back rather than filled in place, which newer pandas versions require.
housing['LotFrontage'] = housing['LotFrontage'].fillna(Lotfront_median)

# Categorical columns where a missing value means the feature is absent.
absent_cols = ['FireplaceQu', 'GarageType', 'GarageQual', 'GarageCond', 'GarageFinish',
               'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
               'MasVnrType', 'Alley', 'PoolQC', 'Fence', 'MiscFeature']
housing[absent_cols] = housing[absent_cols].fillna('NO')

# Numerical columns where an absent feature is recorded as 0.
housing[['GarageYrBlt', 'MasVnrArea']] = housing[['GarageYrBlt', 'MasVnrArea']].fillna(0)

# Categorical column with a genuinely missing value: fill with the mode.
Electrical_mode = housing['Electrical'].mode()[0]
housing['Electrical'] = housing['Electrical'].fillna(Electrical_mode)

In [6]:
housing.isnull().sum()[housing.isnull().sum()!=0]

Series([], dtype: int64)

When the distribution of a numerical column is highly skewed, it is harder for a linear model to fit. To reduce the skew and bring the distribution closer to normal, we sometimes take the logarithm of the column.

Some models also work better when the values in numerical columns are standardized.

In [7]:
housing['SalePrice_log'] = np.log10(housing['SalePrice'])
housing['LotArea_log'] = np.log10(housing['LotArea'])
housing['Area_Qual'] = np.log10(housing['LotArea_log']*housing['OverallQual'])
housing['Age'] = housing['YrSold'] - housing['YearBuilt']
housing['Area_Cond'] = housing['LotArea_log']*housing['OverallCond']
housing['GrLivArea_log'] = np.log10(housing['GrLivArea'])
housing['MoSold_sqr'] = housing['MoSold']**2
housing['A/C'] = (housing['CentralAir'] == 'Y').astype(int)

housing['LotArea_sqr'] = housing['LotArea']**2
housing['TotalBsmtSF_log'] = np.log10(1+housing['TotalBsmtSF'])
housing['LotArea_std'] = (housing['LotArea'] - housing['LotArea'].mean())/(housing['LotArea'].std())
housing['Area_Qual_std'] = (housing['Area_Qual'] - housing['Area_Qual'].mean())/(housing['Area_Qual'].std())
housing['Area_Cond_std'] = (housing['Area_Cond'] - housing['Area_Cond'].mean())/(housing['Area_Cond'].std())
housing['GrLivArea_log_std'] = (housing['GrLivArea_log'] - housing['GrLivArea_log'].mean())/(housing['GrLivArea_log'].std())
housing['LotArea_sqr_std'] = (housing['LotArea_sqr'] - housing['LotArea_sqr'].mean())/(housing['LotArea_sqr'].std())
housing['TotalBsmtSF_log_std'] = (housing['TotalBsmtSF_log'] -\
                                  housing['TotalBsmtSF_log'].mean())/(housing['TotalBsmtSF_log'].std())                                                                         
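The effect of the two transforms can be sketched on synthetic data. The log-normal sample below is an assumption standing in for a skewed column such as LotArea, and `standardize` is a hypothetical helper, not part of the notebook. It also shows one common alternative to standardizing each dataset with its own statistics: reusing the mean and standard deviation of the training part.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "areas" (log-normal), not the real LotArea column.
area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000))
area_log = np.log10(area)

# Skewness close to 0 indicates a roughly symmetric distribution.
print("skew before log:", area.skew())
print("skew after log: ", area_log.skew())

def standardize(values, reference):
    """Z-score `values` using the mean and std of `reference`."""
    return (values - reference.mean()) / reference.std()

# Statistics come from the first 700 rows only and are reused for the rest.
train_part, holdout_part = area_log.iloc[:700], area_log.iloc[700:]
train_std = standardize(train_part, train_part)
holdout_std = standardize(holdout_part, train_part)

print("train mean/std after standardizing:", train_std.mean(), train_std.std())
```

A helper like this would also replace the repeated `(x - x.mean()) / x.std()` expressions with a single call per column.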

For categorical variables, one-hot encoding is performed using the get_dummies function.

In [8]:
housing_dummies = pd.get_dummies(housing,columns= ['MSZoning','MSSubClass','Street','Alley','LandContour',
                        'Neighborhood','Condition1','Condition2','HouseStyle','MasVnrType','ExterCond','Heating',
                                                   'Functional','MiscFeature','SaleType','BsmtCond','BsmtExposure',
                                                   'BsmtFinType1','HeatingQC','CentralAir','KitchenQual','FireplaceQu',
                                                   'GarageCond',])
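To make the naming of the generated columns concrete, here is a small sketch on a made-up frame (the rows are illustrative, not taken from train.csv). Each encoded column is replaced by one column per category, named `<column>_<category>`.

```python
import pandas as pd

# Made-up rows; only the column names mirror the housing data.
toy = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave'],
    'LotArea': [8450, 9600, 11250],
})

# Only the listed columns are encoded; the others are kept unchanged.
toy_dummies = pd.get_dummies(toy, columns=['Street'])
print(list(toy_dummies.columns))
print(toy_dummies)
```

This naming scheme is why the feature list later in the notebook contains entries such as 'Street_Grvl' and 'MSZoning_RL'.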

The Condition1 and Condition2 columns give information about proximity to various facilities, and both take the same set of values. So, to avoid duplicated features, the corresponding dummy columns are added together.

In [9]:
housing_cols = list(housing.columns)
housing_dummy_cols = list(housing_dummies.columns)

In [10]:
for col in ['Feedr','RRAn','PosN','RRAe','PosA','RRNn','RRNe']:
    col1 = "Condition1_" + col
    col2 = "Condition2_" + col
    if col2 in housing_dummy_cols:
        housing_dummies[col] = housing_dummies[col1] +housing_dummies[col2]
    else:
        housing_dummies[col] = housing_dummies[col1]
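The merging step can be checked on a three-row toy frame. The rows are invented for illustration, and `dtype=int` is passed so that the dummy columns add up as 0/1 integers.

```python
import pandas as pd

# Invented rows: each house can list a condition in either column.
toy = pd.DataFrame({
    'Condition1': ['Feedr', 'Norm', 'RRAn'],
    'Condition2': ['Norm', 'Feedr', 'Norm'],
})
dummies = pd.get_dummies(toy, columns=['Condition1', 'Condition2'], dtype=int)

for col in ['Feedr', 'RRAn']:
    col1, col2 = 'Condition1_' + col, 'Condition2_' + col
    dummies[col] = dummies[col1]
    # Condition2 may lack a category entirely, so its dummy column may not exist.
    if col2 in dummies.columns:
        dummies[col] = dummies[col] + dummies[col2]

print(dummies[['Feedr', 'RRAn']])
```

With this scheme a house that lists the same condition in both columns would get the value 2 rather than 1.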

There are several categorical features in this dataset. Each feature is converted into dummy columns, the new columns are fed into the model, and the performance of the model is analysed. Finally, the best-performing features are selected.

In [11]:
features_all = list(housing_dummies)
label_cols = ['Id', 'SalePrice','SalePrice_log']

In [12]:
feature_data = housing_dummies[features_all]
label_data = housing_dummies[label_cols]

After preprocessing, the data are split into training and validation sets, and linear regression is then fitted on these features to predict the sale price.

In [13]:
from sklearn.model_selection import train_test_split

feature_train,feature_cv,label_train,label_cv =  train_test_split(feature_data.values,label_data.values,test_size = 0.3,random_state = 0)

In [14]:
feature_train = pd.DataFrame(feature_train,columns = features_all)
feature_cv = pd.DataFrame(feature_cv,columns = features_all)
label_train = pd.DataFrame(label_train,columns = label_cols)
label_cv = pd.DataFrame(label_cv,columns = label_cols)
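As a side note, train_test_split also accepts DataFrames directly and returns DataFrames, so column names survive the split without being rebuilt afterwards. The sketch below uses made-up toy frames; the names `toy_features`, `toy_labels`, `X_tr` and `X_val` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frames with ten rows each (values invented for illustration).
toy_features = pd.DataFrame({'Age': np.arange(10), 'GarageCars': np.arange(10) % 3})
toy_labels = pd.DataFrame({'Id': np.arange(1, 11),
                           'SalePrice_log': np.linspace(5.0, 5.9, 10)})

# Passing DataFrames keeps the column names and row index in the outputs.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_features, toy_labels, test_size=0.3, random_state=0)

print(X_tr.shape, X_val.shape)
print(list(X_tr.columns))
```

Features and labels are shuffled together, so each row of `X_tr` still lines up with the same row of `y_tr`.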

In [15]:
for df in [feature_train,feature_cv,label_train,label_cv]:
    print(df.isnull().sum()[df.isnull().sum() != 0])

Series([], dtype: int64)
Series([], dtype: int64)
Series([], dtype: int64)
Series([], dtype: int64)


In [16]:
features = ['LotArea_std','Area_Qual_std','Age','OverallCond','Area_Cond_std','GrLivArea_log_std',
           'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath','BedroomAbvGr',
           'KitchenAbvGr','Fireplaces','GarageCars','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 
            'ScreenPorch','MoSold_sqr','MoSold','MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM',
            'MSSubClass_20', 'MSSubClass_30', 'MSSubClass_40', 'MSSubClass_45', 'MSSubClass_50', 'MSSubClass_60',
            'MSSubClass_70', 'MSSubClass_75', 'MSSubClass_80', 'MSSubClass_85', 'MSSubClass_90', 'MSSubClass_120', 
            'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190','Street_Grvl','Alley_Grvl', 'Alley_Pave',
           'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
            'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor',
            'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV',
            'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 
            'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 
            'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
            'Neighborhood_Timber', 'Neighborhood_Veenker', 'Feedr','RRAn','PosN','RRAe','PosA','RRNn','RRNe'
            , 'HouseStyle_1.5Fin', 'HouseStyle_1.5Unf', 'HouseStyle_1Story', 'HouseStyle_2.5Unf',
            'HouseStyle_2Story', 'HouseStyle_SFoyer', 'HouseStyle_SLvl','MasVnrType_BrkCmn', 'MasVnrType_BrkFace',
            'MasVnrType_Stone','ExterCond_Ex', 'ExterCond_Fa', 'ExterCond_Gd', 'ExterCond_Po',
            'Heating_GasA', 'Heating_GasW', 'Heating_Grav', 'Heating_Wall','Functional_Maj1', 'Functional_Maj2',
            'Functional_Min1', 'Functional_Min2', 'Functional_Mod', 'Functional_Sev', 'BsmtCond_Fa', 'BsmtCond_Gd',
            'BsmtCond_NO', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Av', 'BsmtExposure_Gd', 'BsmtExposure_Mn',
            'BsmtFinType1_ALQ', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType1_Rec', 
            'BsmtFinType1_Unf', 'HeatingQC_Ex', 'HeatingQC_Fa', 'HeatingQC_Gd', 'HeatingQC_TA','CentralAir_Y',
            'KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd','FireplaceQu_Ex', 'FireplaceQu_Fa', 
            'FireplaceQu_Gd', 'FireplaceQu_Po', 'FireplaceQu_TA','GarageCond_Ex', 'GarageCond_Fa', 'GarageCond_Gd',
             'GarageCond_Po', 'GarageCond_TA','LotArea_sqr_std','TotalBsmtSF_log_std','YrSold'
           
         ]

In [17]:
X_train = pd.DataFrame(feature_train,columns = features).values
X_cv = pd.DataFrame(feature_cv,columns = features).values
y_train = pd.Series(label_train['SalePrice_log']).values
y_cv = pd.Series(label_cv['SalePrice_log']).values

In [18]:
pd.DataFrame(feature_train,columns = features).isnull().sum()[pd.DataFrame(feature_train,columns = features).isnull().sum()!=0]

Series([], dtype: int64)

In [19]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

reg.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [20]:
pred_train = reg.predict(X_train)
pred_cv = reg.predict(X_cv)

In [21]:
print("Train_score = " + str(reg.score(X_train,y_train)))
print("Crossvalidation_score = " + str(reg.score(X_cv,y_cv)))

Train_score = 0.9293752039827229
Crossvalidation_score = 0.893922392154645
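For a regressor, `score` returns the coefficient of determination R², so the two numbers above are R² values on the training and validation sets. The sketch below recomputes R² by hand on synthetic data (the data and coefficients are assumptions for illustration) to show what the method measures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic regression problem: three features plus a little noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```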


In [22]:
from sklearn.metrics import mean_squared_error
print("Train_error = " + str(mean_squared_error(y_train,pred_train)))
print("Crossvalidation_error = " + str(mean_squared_error(y_cv,pred_cv)))

Train_error = 0.0021520969062887555
Crossvalidation_error = 0.0030913780749118646
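Because the target is log10(SalePrice), these errors are in squared log10 units. One way to read them, sketched below with the two values printed above, is to take the square root and convert it into a multiplicative factor on the price scale. The variable names are hypothetical.

```python
import numpy as np

# Mean squared errors reported above, in squared log10(price) units.
mse_values = {
    'train': 0.0021520969062887555,
    'validation': 0.0030913780749118646,
}

factors = {}
for name, mse in mse_values.items():
    rmse_log10 = np.sqrt(mse)
    # An error of r in log10 units corresponds to a factor of 10**r in price.
    factors[name] = 10 ** rmse_log10
    print(f"{name}: RMSE = {rmse_log10:.4f} log10 units, "
          f"a typical factor of about {factors[name]:.3f}")
```

A factor of 1.10, for instance, would mean that predictions are typically off by about 10% in either direction.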


After loading the test set, the same preprocessing steps are applied to it.

In [23]:
housing_test = pd.read_csv('test.csv')

In [24]:
MSZoning_mode = housing['MSZoning'].mode()[0]
Utilities_mode = housing['Utilities'].mode()[0]

In [25]:
# Columns that are complete in train.csv but have gaps in test.csv.
# Columns are assigned back rather than filled in place, which newer pandas versions require.
housing_test['MSZoning'] = housing_test['MSZoning'].fillna(MSZoning_mode)
housing_test['Utilities'] = housing_test['Utilities'].fillna(Utilities_mode)
test_fill_values = {'Exterior1st': 'Other', 'Exterior2nd': 'Other',
                    'BsmtFinSF1': 0, 'BsmtFinSF2': 0, 'BsmtUnfSF': 0, 'TotalBsmtSF': 0,
                    'BsmtFullBath': 0, 'BsmtHalfBath': 0, 'GarageCars': 0, 'GarageArea': 0,
                    'KitchenQual': 'Po', 'Functional': 'Typ', 'SaleType': 'Oth'}
housing_test = housing_test.fillna(value=test_fill_values)

# Same fills as for the training set, using the training-set median.
housing_test['LotFrontage'] = housing_test['LotFrontage'].fillna(Lotfront_median)

absent_cols = ['FireplaceQu', 'GarageType', 'GarageQual', 'GarageCond', 'GarageFinish',
               'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
               'MasVnrType', 'Alley', 'PoolQC', 'Fence', 'MiscFeature']
housing_test[absent_cols] = housing_test[absent_cols].fillna('NO')
housing_test[['GarageYrBlt', 'MasVnrArea']] = housing_test[['GarageYrBlt', 'MasVnrArea']].fillna(0)

# The mode is taken from the training set, not from the test set.
Electrical_mode = housing['Electrical'].mode()[0]
housing_test['Electrical'] = housing_test['Electrical'].fillna(Electrical_mode)



In [26]:
housing_test['LotArea_log'] = np.log10(housing_test['LotArea'])
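# log10 is applied as in the training set (In [7]) so the feature matches what the model was fitted on
housing_test['Area_Qual'] = np.log10(housing_test['LotArea_log']*housing_test['OverallQual'])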
housing_test['Age'] = housing_test['YrSold'] - housing_test['YearBuilt']
housing_test['Area_Cond'] = housing_test['LotArea_log']*housing_test['OverallCond']
housing_test['GrLivArea_log'] = np.log10(housing_test['GrLivArea'])
housing_test['MoSold_sqr'] = housing_test['MoSold']**2
housing_test['A/C'] = (housing_test['CentralAir'] == 'Y').astype(int)
housing_test['LotArea_sqr'] = housing_test['LotArea']**2

housing_test['TotalBsmtSF_log'] = np.log10(1+housing_test['TotalBsmtSF'])


In [27]:
housing_test['LotArea_std'] = (housing_test['LotArea'] - housing_test['LotArea'].mean())/(housing_test['LotArea'].std())
housing_test['Area_Qual_std'] = (housing_test['Area_Qual'] - housing_test['Area_Qual'].mean())/(housing_test['Area_Qual'].std())
housing_test['Area_Cond_std'] = (housing_test['Area_Cond'] - housing_test['Area_Cond'].mean())/(housing_test['Area_Cond'].std())
housing_test['GrLivArea_log_std'] = (housing_test['GrLivArea_log'] - housing_test['GrLivArea_log'].mean())/(housing_test['GrLivArea_log'].std())
housing_test['LotArea_sqr_std'] = (housing_test['LotArea_sqr'] - housing_test['LotArea_sqr'].mean())/(housing_test['LotArea_sqr'].std())
housing_test['TotalBsmtSF_log_std'] = (housing_test['TotalBsmtSF_log'] - housing_test['TotalBsmtSF_log'].mean())/(housing_test['TotalBsmtSF_log'].std())


In [28]:
housing_test_dummies = pd.get_dummies(housing_test,columns= ['MSZoning','MSSubClass','Street','Alley','LandContour',
                        'Neighborhood','Condition1','Condition2','HouseStyle','MasVnrType','ExterCond','Heating',
                                                   'Functional','MiscFeature','SaleType','BsmtCond','BsmtExposure',
                                                   'BsmtFinType1','HeatingQC','CentralAir','KitchenQual','FireplaceQu',
                                                   'GarageCond',])

In [29]:
housing_dummy_cols = list(housing_test_dummies.columns)

In [30]:
for col in ['Feedr','RRAn','PosN','RRAe','PosA','RRNn','RRNe']:
    col1 = "Condition1_" + col
    col2 = "Condition2_" + col
    if col2 in housing_dummy_cols:
        housing_test_dummies[col] = housing_test_dummies[col1] +housing_test_dummies[col2]
    else:
        housing_test_dummies[col] = housing_test_dummies[col1]

In [31]:
test_data = housing_test_dummies[features]
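Selecting `features` from the test dummies only works if every one of those columns exists in the test set. If a category appears in train.csv but not in test.csv, the selection raises a KeyError. A hedged sketch of one way to guard against this, using made-up toy frames and hypothetical names, is to reindex the test dummies to the training columns and fill the gaps with 0.

```python
import pandas as pd

# Toy data: 'Grav' and 'GasW' occur only in training, 'Wall' only in test.
train_toy = pd.DataFrame({'Heating': ['GasA', 'GasW', 'Grav']})
test_toy = pd.DataFrame({'Heating': ['GasA', 'Wall']})

train_d = pd.get_dummies(train_toy, columns=['Heating'], dtype=int)
test_d = pd.get_dummies(test_toy, columns=['Heating'], dtype=int)

# Columns the model was fitted on (here: all training dummy columns).
model_cols = list(train_d.columns)

# Missing columns are created and filled with 0; extra test columns are dropped.
test_aligned = test_d.reindex(columns=model_cols, fill_value=0)
print(test_aligned)
```

In the notebook, `features` would play the role of `model_cols`.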

Since the base-10 logarithm was applied to the sale price in the training set, the model's output for the test set is also the logarithm of the sale price. So, 10 has to be raised to the power of the output to get the actual sale price. Finally, the dataframe is written to a CSV file using the to_csv function.

In [32]:
housing_test['SalePrice_log'] = reg.predict(test_data.values)

housing_test['SalePrice'] = 10**housing_test['SalePrice_log']

In [33]:
housing_submit = housing_test[['Id','SalePrice']]
housing_submit.head()
housing_submit.to_csv('Final_Predn_complete_data_quality_var_interxn_and_standardization_on_whole.csv',index = False)
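The inverse transform and the export can be sketched end to end on a few made-up rows. The prices and Ids below are illustrative only, and the CSV is written to an in-memory buffer instead of a file.

```python
import io
import numpy as np
import pandas as pd

# Illustrative prices; in the notebook the log values come from reg.predict.
prices = pd.Series([208500.0, 181500.0, 223500.0])
prices_log = np.log10(prices)

# 10 ** log10(x) recovers x, which undoes the training-time transform.
recovered = 10 ** prices_log

# Hypothetical Ids, used only to build a two-column submission frame.
submit = pd.DataFrame({'Id': [1461, 1462, 1463], 'SalePrice': recovered})

buffer = io.StringIO()
submit.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
print(buffer.getvalue())
```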