# Surprise Housing
A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia.

 

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

 

The company wants to know:

* Which variables are significant in predicting the price of a house, and

* How well those variables describe the price of a house.

 

Also, determine the optimal value of lambda for ridge and lasso regression.

### Business Goal 
 

We are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can accordingly manipulate the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

This assignment is divided into the following parts:
* Data preparation
* Data cleaning
* EDA
* Modeling

## Know the data

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = None  # show every column when displaying a DataFrame

In [None]:
#Reading data
main_df = pd.read_csv('train.csv')

In [None]:
main_df.head() #For a glimpse of the data

In [None]:
main_df.shape

So we have 1460 records and 81 attributes

In [None]:
main_df.describe()

In [None]:
main_df.info()

We can see that there are many null values in the data set.

In [None]:
(main_df.isnull().sum()/len(main_df)*100).sort_values(ascending=False)[:10]

PoolQC, MiscFeature, Alley and Fence have a large share of missing values. We could safely drop these columns, or we can impute the NA values with appropriate replacements.

In [None]:
df = main_df.copy()

In [None]:
# Let's impute the missing data
# For PoolQC, NA simply means None, so impute it that way
df["PoolQC"] = df["PoolQC"].fillna("None")

In [None]:
#Similarly, for MiscFeature, Alley, Fence, FireplaceQu
df["MiscFeature"] = df["MiscFeature"].fillna("None")
df["Alley"] = df["Alley"].fillna("None")
df["FireplaceQu"] = df["FireplaceQu"].fillna("None")
df["Fence"] = df["Fence"].fillna("None")

In [None]:
df["GarageType"] = df["GarageType"].fillna("None")
df["GarageFinish"] = df["GarageFinish"].fillna("None")
df["GarageQual"] = df["GarageQual"].fillna("None")
df["GarageCond"] = df["GarageCond"].fillna("None")

In [None]:
#For some of the numerical attributes like GarageArea and GarageCars, we can replace nulls with 0
df["GarageArea"] = df["GarageArea"].fillna(0)
df["GarageCars"] = df["GarageCars"].fillna(0)

In [None]:
# For LotFrontage, impute missing values with the median LotFrontage of the same Neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
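
The group-wise median fill can be checked on a small toy frame. This is only a sketch: the column names mirror the data set, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with the same column names as the housing data.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan],
})

# Each missing value is replaced by the median of its own neighbourhood.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

A neighbourhood whose values are all missing would stay missing, because its median is itself NaN.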

In [None]:
# GarageYrBlt is imputed with its most frequent value (the mode)
#df["GarageYrBlt"]=df["GarageYrBlt"].astype(str)
df["GarageYrBlt"]=df["GarageYrBlt"].fillna(df["GarageYrBlt"].mode()[0])

In [None]:
#df = df.drop(['PoolQC','MiscFeature','Alley','Fence'],axis=1)

In [None]:
# Let's see how many missing values are still left
(df.isnull().sum()/len(df)*100).sort_values(ascending=False)[:10]

In [None]:
# Again, BsmtFinType2, BsmtExposure, BsmtQual, BsmtFinType1 and BsmtCond can be imputed with None
df["BsmtFinType2"] = df["BsmtFinType2"].fillna("None")
df["BsmtExposure"] = df["BsmtExposure"].fillna("None")
df["BsmtQual"] = df["BsmtQual"].fillna("None")
df["BsmtFinType1"] = df["BsmtFinType1"].fillna("None")
df["BsmtCond"] = df["BsmtCond"].fillna("None")

In [None]:
# For MasVnrType and MasVnrArea, we can impute with None and 0 respectively
df["MasVnrType"] = df["MasVnrType"].fillna("None")
df["MasVnrArea"] = df["MasVnrArea"].fillna(0)

In [None]:
# For the Electrical column, impute with the most frequently occurring value, i.e. the mode.
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

In [None]:
# Create new columns that give more information about the data set: total area, bathrooms and year average
# Note: in this data set GrLivArea already covers 1stFlrSF and 2ndFlrSF, so TotalArea counts the above-ground area twice

df['TotalArea'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF'] + df['GrLivArea'] +df['GarageArea']
df['Bathrooms'] = df['FullBath'] + df['HalfBath']*0.5 
df['Year average']= (df['YearRemodAdd']+df['YearBuilt'])/2

In [None]:
df.shape

Let's look at the heat map that shows the correlation between the columns.

In [None]:
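plt.subplots(figsize=(12,9))
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(df.select_dtypes(include='number').corr(), square=True)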

From the heat map above we can say:
* Garage Year built and Year Built of the house are highly correlated.
* Total Rooms Above Grade and Above Grade living area are highly correlated.
* Garage Area and Garage Cars are highly correlated.
* Sale Price and Overall Quality are highly correlated.
* Sale Price and Above Grade living area are highly correlated.
* Garage Year built and Enclosed Porch are negatively correlated.
* Garage Year built and Overall Condition are negatively correlated.
* Year Built and Enclosed Porch are negatively correlated.

## Data Preparation

In [None]:
#Extracting Numerical columns
df_num = df.select_dtypes(include=['float64', 'int64'])


We could have drawn a pair plot here, but with 38 attributes it would be computationally heavy. Instead, we select a few attributes and look at their relationship with SalePrice (the target variable).

In [None]:
#sns.pairplot(df_num)

In [None]:
df_num.columns
df_num =df_num.drop('Id', axis=1)

Finding outliers using scatter plot

In [None]:
# Since we are focused on finding the SalePrice, we will try to find the pattern with other attributes.
plt.scatter(df.GrLivArea,df.SalePrice)
plt.xlabel('Living Area')
plt.ylabel('Sale Price')

Here we can see that the points on the right-hand side do not follow the trend: they have a large `Living Area` but a low `Sale Price`. These points are outliers.

In [None]:
plt.scatter(df.TotalBsmtSF,df.SalePrice)
plt.xlabel('Basement Area')
plt.ylabel('Sale Price')

In [None]:
plt.scatter(df['1stFlrSF'],df.SalePrice)
plt.xlabel('First Floor Area')
plt.ylabel('Sale Price')

In [None]:
plt.scatter(df.MasVnrArea,df.SalePrice)
plt.xlabel('Masonry veneer area')
plt.ylabel('Sale Price')

Let's see some outliers using Box Plot.

In [None]:
plt.figure(figsize=(12,50))
i=0
for col in df_num.columns:
    i = i+1
    plt.subplot(30,4,i)
    sns.boxplot(y=col,data=df)
    #plt.show()
plt.tight_layout()

Let's treat the outliers.

In [None]:
# We use the z-score approach to remove the outliers.
from scipy import stats
df_outliers_treated = df.copy()
# Keep only the rows whose absolute z-score is below 3.5 in every numeric column
z_scores = np.abs(stats.zscore(df_outliers_treated[df_num.columns]))
df_outliers_treated = df_outliers_treated[(z_scores < 3.5).all(axis=1)]
df_outliers_treated = df_outliers_treated.dropna(axis=0)
df_outliers_treated.shape

With a z-score threshold of 3, nearly 500 rows were dropped, which is far more than the number of real outliers. So we have increased the threshold to 3.5.
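
A row is dropped as soon as any one of its numeric columns exceeds the threshold, so the number of rows lost grows with the number of columns. The sketch below illustrates this on synthetic normal data (not the housing data). The helper `rows_kept` is a hypothetical name, and the z-score is computed with NumPy using the population standard deviation, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the numeric columns: 1,000 rows, 30 independent features.
toy_num = rng.normal(size=(1000, 30))

def rows_kept(values, threshold):
    """Count rows whose absolute z-score is below `threshold` in every column."""
    z = np.abs((values - values.mean(axis=0)) / values.std(axis=0))
    return int((z < threshold).all(axis=1).sum())

kept_3 = rows_kept(toy_num, 3.0)
kept_35 = rows_kept(toy_num, 3.5)
print(f"threshold 3.0 keeps {kept_3} rows, threshold 3.5 keeps {kept_35} rows")
```

Raising the threshold can only keep the same number of rows or more, which is the trade-off made above.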

In [None]:
df = df_outliers_treated

#### Scatter plots after the outlier treatment

In [None]:
# Since we are focused on finding the SalePrice, we will try to find the pattern with other attributes.
plt.scatter(df.GrLivArea,df.SalePrice)
plt.xlabel('Living Area')
plt.ylabel('Sale Price')

In [None]:
plt.scatter(df.TotalBsmtSF,df.SalePrice)
plt.xlabel('Basement Area')
plt.ylabel('Sale Price')

In [None]:
plt.scatter(df['1stFlrSF'],df.SalePrice)
plt.xlabel('First Floor Area')
plt.ylabel('Sale Price')

In [None]:
plt.scatter(df.MasVnrArea,df.SalePrice)
plt.xlabel('Masonry veneer area')
plt.ylabel('Sale Price')

We can see that the obvious outliers are gone. The number of rows removed is shown below.

In [None]:
print('We removed ',main_df.shape[0]- df.shape[0],'outliers')

In [None]:
df.head()

In [None]:
df_non_num = df.select_dtypes(include = 'object')

In [None]:
df_non_num.head()

In [None]:
col_data = []  # List to store each column's name, unique values and unique count
for col in df_non_num.columns:
    col_data.append([col,df_non_num[col].unique(),df_non_num[col].nunique()])

In [None]:
df_non_num_col_data = pd.DataFrame(col_data, columns=['ColName','UniqueValues','UniqueCounts'])

In [None]:
df_non_num_col_data.sort_values(by='UniqueCounts')

Note that `PoolQC` and `Utilities` each have only one unique value. These columns show no variation, so we can remove them.
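
As a minimal sketch, columns without variation can also be listed programmatically rather than read off the table. The toy frame below uses made-up values, and `constant_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame (made-up values): two constant columns and one that varies.
toy = pd.DataFrame({
    "PoolQC": ["None", "None", "None"],
    "Utilities": ["AllPub", "AllPub", "AllPub"],
    "Street": ["Pave", "Grvl", "Pave"],
})

# nunique() counts distinct values per column; a count of 1 means no variation.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
print(constant_cols)
```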

In [None]:
df.drop(['PoolQC','Utilities'], axis = 1, inplace = True)

Let's create dummy variables for each non-numerical variable.

In [None]:
df_dummies = pd.get_dummies(df, prefix_sep='_', drop_first=True)

In [None]:
df_dummies.head()

In [None]:
df_dummies.drop('Id', inplace=True,axis = 1)
df_dummies.shape

There are 238 columns after creating dummies.
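
To see what `drop_first=True` does, here is a small sketch on a toy frame with made-up values. Each categorical column with k levels becomes k - 1 indicator columns, because the first level (in sorted order) is dropped and acts as the baseline.

```python
import pandas as pd

# Toy frame (made-up values): one numeric and two categorical columns.
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 180000],
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "N", "Y"],
})

# Numeric columns pass through unchanged; each two-level category yields one dummy.
dummies = pd.get_dummies(toy, prefix_sep="_", drop_first=True)
print(dummies.columns.tolist())
```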

### Checking for Skew in the target (Sale Price)

In [None]:
plt.hist(df_dummies.SalePrice)  

We can see that the histogram is slightly shifted to the left-hand side, i.e. the distribution is right-skewed. This needs to be fixed.

In [None]:
plt.hist(np.log(df_dummies.SalePrice))  # Log-transform to reduce the skewness

In [None]:
y = np.log(df_dummies.pop('SalePrice'))  # The log of SalePrice is the target variable
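
The effect of the log transform can be quantified with the sample skewness. The sketch below uses synthetic log-normal "prices" (an assumption made for illustration, not the real data) and also shows that `np.exp` undoes the transform, which is how predictions on the log scale are turned back into prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic, log-normally distributed prices (not the real data).
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

skew_raw = prices.skew()          # clearly positive: long right tail
skew_log = np.log(prices).skew()  # close to zero after the transform
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")

# Values on the log scale are mapped back to prices with np.exp.
recovered = np.exp(np.log(prices))
```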

### Scaling using Robust Scaler
Since we have seen that the data are skewed and contain outliers, Robust Scaler is a better choice for this scenario.

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
# Scale every feature, keeping the original column names and row index
X = pd.DataFrame(scaler.fit_transform(df_dummies),
                 columns=df_dummies.columns, index=df_dummies.index)
X.head()
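
The cell above fits the scaler on all rows before the train/test split. A common alternative, sketched here on synthetic data, is to split first and learn the scaling parameters (median and interquartile range) from the training rows only, so that nothing from the test rows influences preprocessing. The names `features`, `target`, `X_tr` and `X_te` are hypothetical stand-ins, not variables from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
# Synthetic features and target standing in for df_dummies and y (made-up values).
features = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
target = pd.Series(rng.normal(size=200))

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, train_size=0.7, random_state=100
)

scaler = RobustScaler()
# Learn the median and IQR from the training rows only...
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# ...and reuse them, unchanged, for the test rows.
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)
```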

### Splitting data in Test and Train set

In [None]:
# split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.7,
                                                    test_size = 0.3, random_state=100)

# Model Building and Evaluation
Here we use `Ridge` and `Lasso` regression
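
For reference, the two penalised objectives can be written as below. The notation is introduced here and is not part of the original notebook: $X$ is the feature matrix, $y$ the log sale price, $\beta$ the coefficient vector, $n$ the number of training rows and $\lambda$ the regularisation strength, which scikit-learn calls `alpha`. The intercept is omitted for brevity, and the scaling of the squared-error term follows the convention in scikit-learn's documentation.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$$

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$$

The squared penalty of ridge shrinks coefficients towards zero without reaching it, while the absolute-value penalty of lasso can set them exactly to zero.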

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

ridge = Ridge()
# Candidate values of alpha (the regularisation strength, i.e. lambda)
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1,
                    0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0,
                    4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500, 1000]}

# cross validation
folds = 5
model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
model_cv.fit(X_train, y_train) 
print("The best value of Alpha is: ",model_cv.best_params_)

### Now, let's see the R² score for different values of alpha

In [None]:
alphas = [0.0001, 0.001, 1, 5, 10]
for alp in alphas:
    ridge_mod = Ridge(alpha=alp)
    ridge_mod.fit(X_train, y_train)
    # Ridge.score returns R2 (the coefficient of determination), not a classification accuracy
    print("R2 score for train set for alpha ", alp, " is ", ridge_mod.score(X_train, y_train), " and R2 score for test is ", ridge_mod.score(X_test, y_test))

We can see that the scores are pretty good for alpha = 10 (alpha = 7 may give an even better result).

For alpha = 0.0001, 0.001 and 1, the test score is poor while the train score is very good, which is a sign of overfitting.
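
The `score` method of a scikit-learn regressor reports the coefficient of determination R², not a classification accuracy. The sketch below checks this on synthetic data by computing R² by hand; all names and values are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Synthetic regression problem (made-up coefficients and noise).
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([1.0, 0.5, 0.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0).fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_from_score = model.score(X_toy, y_toy)
print(r2_manual, r2_from_score)
```

A large gap between the train and test values of this score is what signals overfitting.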

### Plot showing the variation of test and train score with various alpha values

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
#cv_results.head()
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')

# plotting
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')
plt.title("Negative Mean Absolute Error and alpha")
plt.legend(['train score', 'test score'], loc='upper left')
plt.show()

### We can see the root mean square error values below:

In [None]:
import math
import sklearn.metrics as sklm
# ridge_mod is the last model fitted in the loop above, i.e. the one with alpha = 10
y_pred_train = ridge_mod.predict(X_train)
y_pred_test = ridge_mod.predict(X_test)
# The target is log(SalePrice), so these errors are on the log scale
print('Root Mean Square Error train = ' + str(round(math.sqrt(sklm.mean_squared_error(y_train, y_pred_train)),2)))
print('Root Mean Square Error test = ' + str(round(math.sqrt(sklm.mean_squared_error(y_test, y_pred_test)),2)))
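
Because the target is the log of the sale price, the RMSE printed above is in log units. A rough sketch of how the same error could be expressed in price units is to exponentiate the predictions first. The prices and prediction errors below are made up for illustration.

```python
import numpy as np

# Made-up true prices and log-scale predictions, for illustration only.
true_price = np.array([150000.0, 200000.0, 320000.0])
y_true_log = np.log(true_price)
y_pred_log = y_true_log + np.array([0.10, -0.05, 0.02])

# RMSE on the log scale: roughly a relative (percentage-like) error.
rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

# RMSE in price units: map predictions back with np.exp before comparing.
rmse_price = np.sqrt(np.mean((true_price - np.exp(y_pred_log)) ** 2))
print(f"RMSE (log scale): {rmse_log:.3f}, RMSE (price units): {rmse_price:.0f}")
```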

In [None]:
plt.figure(figsize=(20, 10))
coefs = pd.Series(ridge_mod.coef_, index = X.columns)

imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Feature importance in the Ridge Model")
plt.show()

Note that Ridge regularisation does not select columns; we have simply picked the 10 largest and 10 smallest coefficients of the Ridge model.

The plot suggests the following:
* Overall quality of the house is an important factor for the Sale Price.
* A larger Living Area is associated with a higher Sale Price.
* If the Neighbourhood is Crawford, the price is expected to be higher.
* If the Neighbourhood is Edwards, the price is expected to be much lower.
* Good Pool quality is also associated with a lower expected price.

## Lasso Regularisation

In [None]:
lasso = Lasso()

# cross validation
model_cv = GridSearchCV(estimator = lasso, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            

model_cv.fit(X_train, y_train) ;

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results.head()

In [None]:
# plotting mean test and train scores with alpha
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')

# plotting
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')

plt.title("Negative Mean Absolute Error and alpha")
plt.legend(['train score', 'test score'], loc='upper left')
plt.show()

It is quite difficult to read the best alpha value from the plot, but we can find it with a grid search.
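
One way to avoid reading the value off the chart is to look it up in the results table. The sketch below uses a hypothetical table shaped like `cv_results_`, with made-up scores, and picks the row with the highest mean test score.

```python
import pandas as pd

# Hypothetical scores shaped like GridSearchCV's cv_results_ (values are made up).
toy_results = pd.DataFrame({
    "param_alpha": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "mean_test_score": [-0.090, -0.082, -0.095, -0.150, -0.300],
})

# Scores are negative errors, so the best alpha has the highest (least negative) score.
best_row = toy_results.loc[toy_results["mean_test_score"].idxmax()]
best_alpha = best_row["param_alpha"]
print("best alpha:", best_alpha)
```

Since the alpha grid spans several orders of magnitude, calling `plt.xscale('log')` before `plt.show()` may also make the plot itself easier to read.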

In [None]:
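from sklearn.linear_model import Lasso

# Coarser grid of alpha values, scored by negative mean squared error
parameters = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}

lasso = Lasso()
lasso_reg = GridSearchCV(lasso, param_grid=parameters, scoring='neg_mean_squared_error', cv=5)
# Note: this search is fitted on the full data (X, y), not only on the training split
lasso_reg.fit(X, y)
print('The best value of Alpha is: ', lasso_reg.best_params_)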

Fitting the Lasso model with alpha = 0.001

In [None]:
alpha =0.001

lasso_mod = Lasso(alpha=alpha)
        
lasso_mod.fit(X_train, y_train) 

Below are the coefficients generated by the Lasso model.

In [None]:
lasso_mod.coef_

Let's visualize them for a better understanding.

In [None]:
plt.figure(figsize=(20, 10))
coefs = pd.Series(lasso_mod.coef_, index = X.columns)

imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Feature importance in the Lasso Model")
plt.show()

In [None]:
print("Lasso Model selected",sum(coefs != 0), "important features and dropped the other", sum(coefs == 0)," features")

Unlike Ridge, Lasso can shrink coefficients exactly to zero, so it also works as a feature selector.
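
The difference in behaviour between the two penalties can be illustrated on synthetic data in which only three of ten features matter. This is a sketch under made-up coefficients and alpha values, not a reproduction of the models fitted in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
# Only the first three features influence the target; the other seven are noise.
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=200)

ridge_toy = Ridge(alpha=1.0).fit(X_toy, y_toy)
lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Ridge shrinks coefficients but leaves them non-zero; Lasso zeroes some out.
n_zero_ridge = int(np.sum(ridge_toy.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_toy.coef_ == 0))
print(f"zero coefficients: ridge={n_zero_ridge}, lasso={n_zero_lasso}")
```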

### We can see the root mean square error values below:

In [None]:
y_pred_train=lasso_mod.predict(X_train)
y_pred_test=lasso_mod.predict(X_test)
import math
print('Root Mean Square Error train = ' + str(round(math.sqrt(sklm.mean_squared_error(y_train, y_pred_train)),2)))
print('Root Mean Square Error test = ' + str(round(math.sqrt(sklm.mean_squared_error(y_test, y_pred_test)),2)))

So the RMSE is the same as for Ridge.

## Understanding from the analysis
* A property with a high Total Area will cost more, and hence the profit margin can be better.
* If the Overall Quality is good, the price of the property will be higher.
* A property in the Crawford neighbourhood (`Crawfor`) has a higher Sale Price.
* For Sale Type New, the price is expected to be higher.
* A property in good condition will sell for a higher price.
* The absence of a fireplace is associated with a lower property price. `Surprise` can invest in a fireplace to increase the price of the property.
* `Surprise` should stay away from the Townhouse building type.
* It should also avoid properties in the Edwards neighbourhood.
* `Surprise` can invest in the heating system to enhance its quality, after which the property can be sold at a better price.