## Target Value Analysis

I want to focus on the target variable, SalePrice. Let's create a histogram to see whether it is normally distributed. Strictly speaking, multiple linear regression assumes that the residuals are normally distributed, not the features or the target themselves. In practice, though, a heavily skewed target often leads to skewed residuals with non-constant variance, so its distribution is worth checking first.

In [None]:
# Histogram to see how the data is distributed
sns.set_context("paper", rc={"font.size":8,"axes.titlesize":8,"axes.labelsize":5}) 
sns.distplot(train_df['SalePrice'],fit=norm)

# Also get the QQ-plot
fig = plt.figure()
res = stats.probplot(train_df['SalePrice'], plot=plt)
plt.show()

## Plotting the box plot. 
sns.boxplot(train_df['SalePrice']);

These three charts above can tell us a lot about our target variable.

- Our target variable, SalePrice, is not normally distributed.
- Our target variable is right-skewed.
- There are multiple outliers in the variable.

In [None]:
#skewness and kurtosis
print("Skewness: " + str(train_df['SalePrice'].skew()))
print("Kurtosis: " + str(train_df['SalePrice'].kurt()))

In [None]:
#view correlation between train_df features and sales price
sns.set(font_scale=3)
plt.figure(figsize=(50,35))
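# Restrict to numeric columns: recent pandas versions raise an error on string columns in corr()
sns.heatmap(train_df.select_dtypes(include='number').corr(),annot=True,annot_kws={'size':30},fmt='.1f',
           cmap='PiYG',linewidths=.5)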

In [None]:
## Getting the correlation of all the features with target variable. 
sns.set(font_scale=1)
corr_new_train=train_df.select_dtypes(include='number').corr()
plt.figure(figsize=(5,15))
sns.heatmap(corr_new_train[['SalePrice']].sort_values(by=['SalePrice'],ascending=False).head(30),annot_kws={"size": 16},vmin=-1, cmap='PiYG', annot=True)

#(train_df.corr()**2)["SalePrice"].sort_values(ascending = False)[1:]

In [None]:
fig, axes = plt.subplots(40, 2,figsize=(15,300))
fig.subplots_adjust(hspace=0.6)
for i,ax in zip(all_data_df.columns,axes.flatten()):
    sns.scatterplot(x=train_df[i], y=train_df["SalePrice"],ax=ax)
    # Label this subplot's own axes; plt.xlabel/plt.ylabel would only touch the current axes
    ax.set_xlabel(i,fontsize=16)
    ax.set_ylabel('SalePrice',fontsize=16)
    ax.set_yticks(np.arange(0,900001,100000))
    ax.set_title('SalePrice'+' - '+str(i),color=color,fontweight='bold',size=16)

In [None]:
catagorical_mode=[]
catagorical_ordinal=[]
catagorical_nominal=[]
numrical=[]
numrical_mean=[]
numrical_median=[]
qual_count=all_data_df.OverallQual.value_counts().sort_index()
len(qual_count)

Ideally, if the assumptions are met, the residuals are randomly scattered around the zero centerline with no apparent pattern, like an unstructured cloud of points. Our residual plot is anything but that. Even though the relationship between the response variable and the predictor looks linear, the residual plot is shaped like a funnel: as GrLivArea increases, the variance of the errors also increases. This property is known as heteroscedasticity. Let's break this down.
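
To make the funnel shape concrete, here is a minimal sketch on synthetic data (not the housing data). The noise scale is made to grow with the predictor, so the absolute residuals of a straight-line fit end up correlated with the predictor. The variable names, the value range, and the noise model are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictor, loosely shaped like a living-area column (assumption).
x = rng.uniform(500, 4000, size=2000)
# Noise whose standard deviation is proportional to x: heteroscedastic by construction.
y = 100.0 * x + rng.normal(0.0, 20.0 * x)

# Ordinary least-squares straight line and its residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Under constant variance, |residual| should be roughly uncorrelated with x.
abs_resid_corr = np.corrcoef(x, np.abs(residuals))[0, 1]

# Residual spread in the lower and upper halves of x.
median_x = np.median(x)
spread_low = residuals[x <= median_x].std()
spread_high = residuals[x > median_x].std()

print(f"corr(x, |residual|) = {abs_resid_corr:.2f}")
print(f"residual std, low x = {spread_low:.0f}, high x = {spread_high:.0f}")
```

A clearly positive correlation and a larger spread in the upper half are the numerical counterpart of the funnel seen in a residual plot.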

In [None]:
## Transforming the target variable using numpy.log1p
train_df["SalePrice"] = np.log1p(train_df["SalePrice"])
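
Because the target is now stored as log1p(SalePrice), any prediction made on this scale has to be mapped back with np.expm1 before it can be read as a price. The sketch below uses made-up prices (not values from the data set) to show that the transform reduces skewness and that the round trip is exact.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices for illustration only.
prices = pd.Series([90_000, 120_000, 135_000, 150_000, 180_000, 250_000, 755_000])

log_prices = np.log1p(prices)      # log(1 + x)
restored = np.expm1(log_prices)    # exp(x) - 1, the inverse of log1p

print("skew before:", round(prices.skew(), 2))
print("skew after: ", round(log_prices.skew(), 2))
print("round trip ok:", np.allclose(restored, prices))
```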

In [None]:
# Histogram to see how the data is distributed
sns.set_context("paper", rc={"font.size":8,"axes.titlesize":8,"axes.labelsize":5}) 
sns.distplot(train_df['SalePrice'],fit=norm)

# Also get the QQ-plot
fig = plt.figure()
res = stats.probplot(train_df['SalePrice'], plot=plt)
plt.show()

## Plotting the box plot. 
sns.boxplot(train_df['SalePrice']);

Now, let's make sure that the target variable follows a normal distribution.

Let's compare the residual plots before and after the transformation.

In [None]:

sns.residplot(x = train_df.GrLivArea, y = train_df.SalePrice);

Here, we see that the pre-transformed chart on the left shows heteroscedasticity, and the post-transformed chart on the right shows homoscedasticity (an almost equal amount of variance on both sides of the zero line). It looks like a blob of data points and doesn't reveal any pattern. That is the kind of residual plot we want to see, because it suggests that the constant-variance assumption is reasonably satisfied.
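
The before/after comparison can also be expressed as a number. The sketch below is an illustration on synthetic data with multiplicative noise (an assumption, not the housing data): it fits a straight line to the raw target and to its log1p, then compares the residual spread in the upper and lower halves of the predictor. The helper name `spread_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with multiplicative noise (assumption): spread grows with x.
x = rng.uniform(500, 4000, size=2000)
y = 100.0 * x * rng.lognormal(mean=0.0, sigma=0.25, size=x.size)

def spread_ratio(x, y):
    """Residual std in the upper half of x divided by that in the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    upper = x > np.median(x)
    return residuals[upper].std() / residuals[~upper].std()

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, np.log1p(y))

print(f"high/low residual spread, raw target: {ratio_raw:.2f}")
print(f"high/low residual spread, log target: {ratio_log:.2f}")
```

A ratio near 1 means the residual spread is similar across the range of the predictor; a ratio well above 1 corresponds to the funnel shape.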

In [3]:
### Feature engineering

In [4]:
#### Imputing Missing Values

In [None]:
## Some missing values are intentional. For example, a blank in the Alley feature
## means that the house has no alley access.
missing_val_col = ["Alley", 
                   "PoolQC", 
                   "MiscFeature",
                   "Fence",
                   "FireplaceQu",
                   "GarageType",
                   "GarageFinish",
                   "GarageQual",
                   "GarageCond",
                   'BsmtQual',
                   'BsmtCond',
                   'BsmtExposure',
                   'BsmtFinType1',
                   'BsmtFinType2',
                   'MasVnrType']

for i in missing_val_col:
    all_data_df[i] = all_data_df[i].fillna('None')

In [None]:
## In the following features the null values are there for a purpose, so we replace them with "0"
missing_val_col2 = ['BsmtFinSF1',
                    'BsmtFinSF2',
                    'BsmtUnfSF',
                    'TotalBsmtSF',
                    'BsmtFullBath', 
                    'BsmtHalfBath', 
                    'GarageYrBlt',
                    'GarageArea',
                    'GarageCars',
                    'MasVnrArea']

for i in missing_val_col2:
    all_data_df[i] = all_data_df[i].fillna(0)
    
## Replace missing LotFrontage values with the mean LotFrontage of each neighborhood.
all_data_df['LotFrontage'] = all_data_df.groupby('Neighborhood')['LotFrontage'].transform( lambda x: x.fillna(x.mean()))
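
As a small illustration of how the group-wise fill behaves, the sketch below applies the same `groupby(...).transform(...)` pattern to a toy frame. The neighborhood labels and frontage values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; neighborhood names and values are made up for illustration.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
})

# Fill each gap with the mean of its own neighborhood.
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda s: s.fillna(s.mean()))
)

print(toy)
```

The missing value in neighborhood A becomes 70.0 and the one in B becomes 35.0. Swapping `s.mean()` for `s.median()` gives a median-based fill, which is less sensitive to unusually wide lots.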

In [None]:
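## MSSubClass (the building class) is coded as a number but is really categorical, so convert it to a string.
## Then fill missing MSZoning values with the most common zoning within each MSSubClass.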
all_data_df['MSSubClass'] = all_data_df['MSSubClass'].astype(str)
all_data_df['MSZoning'] = all_data_df.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))

## Important years and months that should be categorical variables not numerical. 
# all_data['YearBuilt'] = all_data['YearBuilt'].astype(str)
# all_data['YearRemodAdd'] = all_data['YearRemodAdd'].astype(str)
# all_data['GarageYrBlt'] = all_data['GarageYrBlt'].astype(str)
all_data_df['YrSold'] = all_data_df['YrSold'].astype(str)
all_data_df['MoSold'] = all_data_df['MoSold'].astype(str)

all_data_df['Functional'] = all_data_df['Functional'].fillna('Typ') 
all_data_df['Utilities'] = all_data_df['Utilities'].fillna('AllPub') 
all_data_df['Exterior1st'] = all_data_df['Exterior1st'].fillna(all_data_df['Exterior1st'].mode()[0]) 
all_data_df['Exterior2nd'] = all_data_df['Exterior2nd'].fillna(all_data_df['Exterior2nd'].mode()[0])
all_data_df['KitchenQual'] = all_data_df['KitchenQual'].fillna("TA") 
all_data_df['SaleType'] = all_data_df['SaleType'].fillna(all_data_df['SaleType'].mode()[0])
all_data_df['Electrical'] = all_data_df['Electrical'].fillna("SBrkr") 


In [None]:
# Missing values with count and percentage
null_counts = all_data_df.isnull().sum().sort_values(ascending=False)
count = null_counts[null_counts != 0]
percent = round(null_counts / len(all_data_df) * 100, 2)
percent = percent[percent != 0]
missing_value_df=pd.concat([count,percent],axis=1)

#View missing df
missing_value_df

In [None]:
### Fixing Skewness

In [None]:
numeric_feats = all_data_df.dtypes[all_data_df.dtypes != "object"].index

skewed_feats = all_data_df[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)

skewed_feats

In [None]:
sns.set(font_scale=1)
sns.distplot(all_data_df['1stFlrSF']);

In [None]:
## Fixing skewed features using a Box-Cox transformation.


def fixing_skewness(df):
    """
    Apply a Box-Cox (1 + x) transformation, in place, to every numeric
    column of df whose absolute skewness exceeds 0.5. Returns None.
    """
    ## Import necessary modules 
    from scipy.stats import skew
    from scipy.special import boxcox1p
    from scipy.stats import boxcox_normmax
    
    ## Getting all the data that are not of "object" type. 
    numeric_feats = df.dtypes[df.dtypes != "object"].index

    # Check the skew of all numerical features
    skewed_feats = df[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)
    high_skew = skewed_feats[abs(skewed_feats) > 0.5]
    skewed_features = high_skew.index

    for feat in skewed_features:
        df[feat] = boxcox1p(df[feat], boxcox_normmax(df[feat] + 1))

fixing_skewness(all_data_df)
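
To see what the transformation does to a single column, here is a minimal sketch on a synthetic, right-skewed feature (an assumption, not a column from the data set). It uses the same two SciPy calls as the function above: `boxcox_normmax` estimates lambda on the shifted data, and `boxcox1p` applies the transform.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax, skew

rng = np.random.default_rng(1)

# Synthetic right-skewed, strictly positive feature (assumption).
feature = rng.lognormal(mean=7.0, sigma=0.6, size=1000)

# Estimate lambda on the shifted data, then apply the (1 + x) Box-Cox transform.
lam = boxcox_normmax(feature + 1)
transformed = boxcox1p(feature, lam)

print(f"lambda estimate: {lam:.3f}")
print(f"skew before: {skew(feature):.2f}")
print(f"skew after:  {skew(transformed):.2f}")
```

Note that the transformed values are on a different scale from the original ones, which matters if the column is later combined with other features.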

In [None]:
sns.distplot(all_data_df['1stFlrSF']);

In [None]:
# Feature engineering: a new feature "TotalSF"
all_data_df['TotalSF'] = (all_data_df['TotalBsmtSF'] 
                       + all_data_df['1stFlrSF'] 
                       + all_data_df['2ndFlrSF'])

all_data_df['YrBltAndRemod'] = all_data_df['YearBuilt'] + all_data_df['YearRemodAdd']

all_data_df['Total_sqr_footage'] = (all_data_df['BsmtFinSF1'] 
                                 + all_data_df['BsmtFinSF2'] 
                                 + all_data_df['1stFlrSF'] 
                                 + all_data_df['2ndFlrSF']
                                )
                                 

all_data_df['Total_Bathrooms'] = (all_data_df['FullBath'] 
                               + (0.5 * all_data_df['HalfBath']) 
                               + all_data_df['BsmtFullBath'] 
                               + (0.5 * all_data_df['BsmtHalfBath'])
                              )
                               

all_data_df['Total_porch_sf'] = (all_data_df['OpenPorchSF'] 
                              + all_data_df['3SsnPorch'] 
                              + all_data_df['EnclosedPorch'] 
                              + all_data_df['ScreenPorch'] 
                              + all_data_df['WoodDeckSF']
                             )

In [None]:
all_data_df['haspool'] = all_data_df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
all_data_df['has2ndfloor'] = all_data_df['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
all_data_df['hasgarage'] = all_data_df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
all_data_df['hasbsmt'] = all_data_df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
all_data_df['hasfireplace'] = all_data_df['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

In [None]:
## Deleting features

In [None]:
all_data_df = all_data_df.drop(['Utilities', 'Street', 'PoolQC',], axis=1)

In [None]:
### Creating Dummy Variables

In [None]:
## Creating dummy variable 
final_features = pd.get_dummies(all_data_df).reset_index(drop=True)
final_features.shape

In [None]:
y=train_df.SalePrice
X = final_features.iloc[:len(y), :]

X_sub = final_features.iloc[len(y):, :]
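
The positional split above relies on the combined frame holding the training rows first, followed by the test rows (an assumption here, since the concatenation is not shown in this section). The toy sketch below, with made-up column and category names, shows why the dummies are built on the combined frame: both parts end up with the same columns, even for a category that appears in only one of them.

```python
import pandas as pd

# Toy train and test frames; the column name and categories are made up.
toy_train = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Gable"]})
toy_test = pd.DataFrame({"RoofStyle": ["Flat", "Gable"]})

# Encode the stacked frame so both parts share one set of dummy columns.
combined = pd.concat([toy_train, toy_test]).reset_index(drop=True)
encoded = pd.get_dummies(combined)

# Slice back by position, using the training length as the boundary.
n_train = len(toy_train)
X_toy = encoded.iloc[:n_train, :]
X_toy_sub = encoded.iloc[n_train:, :]

print(list(X_toy.columns))
print(X_toy.shape, X_toy_sub.shape)
```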

In [None]:
outliers = [30, 88, 462, 631, 1322]
X = X.drop(X.index[outliers])
y = y.drop(y.index[outliers])

In [None]:
counts = X.BsmtUnfSF.value_counts()

In [None]:
counts.iloc[0]