# Advanced Linear Regression
## House Price Prediction - Assignment Solution

#### Problem Statement:

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia.

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

Essentially, the company wants to know:

- Which variables are significant in predicting the price of a house.
- How well those variables describe the price of a house.

The solution is divided into the following sections: 
- Data understanding and exploration
- Data Visualisation 
- Data preparation
- Model building and evaluation
- Subjective question solutions

## 1. Data Understanding and Exploration

Let's first import the required libraries, then load the dataset and check its size, attribute names and data types.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Ridge, Lasso

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reading the dataset
df = pd.read_csv("train.csv")

# Let's look at the first few rows
df.head()

In [None]:
# Getting insights of the features
df.describe()

In [None]:
# Summary of the dataset: 1460 rows, 81 columns, many features have null values
df.info()

#### Understanding the Data Dictionary and parts of Data Preparation

The data dictionary contains the meaning of various attributes; some of which are explored and manipulated here:

In [None]:
# From the above stats we can say that there are columns with more than 80% null values, therefore, let's drop those columns
df1 = df.drop(df.columns[(df.isnull().sum()/len(df.index))>0.80], axis=1)

In [None]:
df1.info()

In [None]:
# Let's drop a few more columns
# MasVnrType has more than 59% of its rows empty
# FireplaceQu has more than 47% of its rows empty
# The Id column has no influence on the target variable, i.e. SalePrice

df1 = df1.drop(["MasVnrType","FireplaceQu", "Id"], axis=1)
df1[df1.columns[df1.isnull().sum()>0]].isnull().sum()

In a few features, `NA` carries a meaning of its own (for example, "no basement" or "no garage"). Therefore, instead of dropping these null values, let's convert them into meaningful category labels.
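
The cells that follow apply this idea one column at a time. As a more compact alternative, the same replacement can be sketched with a single `fillna` call. The toy frame `toy` and the mapping `fill_labels` below are illustrative assumptions, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df1 (an assumption; the real columns come from train.csv)
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],
    "GarageType": ["Attchd", "Detchd", np.nan],
})

# Hypothetical mapping: column name -> label meaning "feature not present"
fill_labels = {"BsmtQual": "Nb", "GarageType": "NG"}

# Passing a dict to fillna fills each listed column with its own label
toy_filled = toy.fillna(value=fill_labels)
print(toy_filled)
```

Extending `fill_labels` with the remaining basement and garage columns would cover all of them in one step.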

In [None]:
# `NA` in basement related features represents no basement, let's replace `NA` rows by `Nb`

df1["BsmtQual"] = df1['BsmtQual'].apply(lambda x: x if isinstance(x,str) else "Nb")
df1['BsmtQual'].value_counts()

In [None]:
df1['BsmtCond'] = df1['BsmtCond'].apply(lambda x: x if isinstance(x,str) else "Nb")
df1['BsmtCond'].value_counts()

In [None]:
df1['BsmtExposure'] = df1['BsmtExposure'].apply(lambda x: x if isinstance(x,str) else "Nb")
df1['BsmtExposure'].value_counts()

In [None]:
df1['BsmtFinType1'] = df1['BsmtFinType1'].apply(lambda x: x if isinstance(x,str) else "Nb")
df1['BsmtFinType1'].value_counts()

In [None]:
df1['BsmtFinType2'] = df1['BsmtFinType2'].apply(lambda x: x if isinstance(x,str) else "Nb")
df1['BsmtFinType2'].value_counts()

In [None]:
# `NA` in garage related features represents no garage, let's replace `NA` rows by `NG`
df1['GarageType'] = df1['GarageType'].apply(lambda x: x if isinstance(x,str) else "NG")
df1['GarageType'].value_counts()

In [None]:
df1['GarageFinish'] = df1['GarageFinish'].apply(lambda x: x if isinstance(x,str) else "NG")
df1['GarageFinish'].value_counts()

In [None]:
df1['GarageQual'] = df1['GarageQual'].apply(lambda x: x if isinstance(x,str) else "NG")
df1['GarageQual'].value_counts()

In [None]:
df1['GarageCond'] = df1['GarageCond'].apply(lambda x: x if isinstance(x,str) else "NG")
df1['GarageCond'].value_counts()

In [None]:
df1 = df1.drop("GarageYrBlt", axis=1)

In [None]:
# Listing all features that still have null values
df1[df1.columns[df1.isnull().sum()>0]].isnull().sum()

In [None]:
#  Let's drop rows containing null values for MasVnrArea and Electrical
df1 = df1.drop(df1.index[df1['MasVnrArea'].isnull()].tolist(), axis=0)
df1[df1.columns[df1.isnull().sum()>0]].isnull().sum()

In [None]:
df1 = df1.drop(df1.index[df1['Electrical'].isnull()].tolist(), axis=0)
df1[df1.columns[df1.isnull().sum()>0]].isnull().sum()

In [None]:
# Replacing LotFrontage null values by its mode
df1['LotFrontage'].mode()

In [None]:
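# Fill missing LotFrontage values with the mode of the column (60.0 in the original run)
df1["LotFrontage"] = df1["LotFrontage"].fillna(df1["LotFrontage"].mode()[0])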

In [None]:
df1.shape

In [None]:
# Assigning string values to the months instead of numeric values, which may wrongly suggest an order.
# A function has been created to map the month numbers to categorical levels.
def object_map(x):
    return x.map({1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'})

# Applying the function to the MoSold column
df1[['MoSold']] = df1[['MoSold']].apply(object_map)

In [None]:
df1['MoSold'].astype('category').value_counts()

In [None]:
df1['YrSold'].astype('category').value_counts()

In [None]:
# All categorical variables in the dataset
df1_categorical=df1.select_dtypes(exclude=['float64','datetime64','int64'])
print(df1_categorical.columns)
print(len(df1_categorical.columns))

In [None]:
# Box plots of SalePrice against each categorical feature
cat_cols = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
            'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
            'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond',
            'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
            'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional',
            'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'MoSold',
            'SaleType', 'SaleCondition']

plt.figure(figsize=(20, 50))
# One subplot per feature on a 13 x 3 grid, filled in order without skipping a slot
for pos, col in enumerate(cat_cols, start=1):
    plt.subplot(13, 3, pos)
    sns.boxplot(x=col, y='SalePrice', data=df1)
plt.show()

In [None]:
df1 = df1.drop(["LandSlope", "Exterior2nd", "BsmtFinType2", 'Condition2'], axis=1)

In [None]:
# All numeric variables in the dataset
df1_numeric = df1.select_dtypes(include=['float64', 'int64'])
df1_numeric.pop('SalePrice')
df1_numeric.head()

In [None]:
plt.figure(figsize=(20, 30))  
plt.subplot(3,2,1)
col = df1_numeric.columns[:7]
col = col.insert(0,"SalePrice")
sns.heatmap(df1[col].corr(), annot=True)
plt.subplot(3,2,2)
col = df1_numeric.columns[7:14]
col = col.insert(0,"SalePrice")
sns.heatmap(df1[col].corr(), annot=True)
plt.subplot(3,2,3)
col = df1_numeric.columns[14:21]
col = col.insert(0,"SalePrice")
sns.heatmap(df1[col].corr(), annot=True)
plt.subplot(3,2,4)
col = df1_numeric.columns[21:28]
col = col.insert(0,"SalePrice")
sns.heatmap(df1[col].corr(), annot=True)
plt.subplot(3,2,5)
col = df1_numeric.columns[28:]
col = col.insert(0,"SalePrice")
sns.heatmap(df1[col].corr(), annot=True)
plt.show()

The heatmap shows some useful insights:

Correlation of SalePrice with independent variables:
- SalePrice is highly (positively) correlated with 'OverallQual' and 'GrLivArea', and also shows a high correlation with 'GarageCars', 'GarageArea', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF', 'MasVnrArea' and 'FullBath'.

- SalePrice is not strongly negatively correlated with any feature.

Correlation among independent variables:
- Some of the independent variables are highly correlated with each other (look at the top-left part of the matrix): 'OverallQual', 'YearBuilt' and 'YearRemodAdd' are highly (positively) correlated, and '1stFlrSF' and 'TotalBsmtSF' also show a high correlation. Several other features are strongly correlated with one another as well.


Thus, while building the model, we'll have to pay attention to multicollinearity.
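
The heatmaps only show pairwise correlations. One common way to quantify multicollinearity per feature is the variance inflation factor (VIF), defined as 1 / (1 - R²) where R² comes from regressing that feature on all the others. This notebook does not compute it; the following is a minimal sketch on synthetic data, and the helper `vif` and the arrays `a`, `b`, `c` are illustrative assumptions:

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data (an assumption): b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
vif_values = vif(np.column_stack([a, b, c]))
print(vif_values)
```

A large VIF for a column suggests that it is largely explained by the other columns, which makes it a candidate for dropping.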

In [None]:
df1 = df1.drop(["YearBuilt","YearRemodAdd", "1stFlrSF", "BsmtFinSF1", "MasVnrArea", "BsmtUnfSF", "FullBath", "HalfBath", "GarageArea","TotRmsAbvGrd"], axis=1)

In [None]:
df1.shape

## 3. Data Preparation

Let's now prepare the data for model building. The remaining categorical (object) columns are converted into dummy variables, and the first level of each is dropped to avoid redundant columns.

In [None]:
# Subset all categorical variables
df1_categorical=df1.select_dtypes(include=['object'])
df1_categorical

In [None]:
# Convert into dummies
df1_dummies = pd.get_dummies(df1_categorical, drop_first=True)
df1_dummies.head()

In [None]:
# Drop categorical variable columns
df1 = df1.drop(list(df1_categorical.columns), axis=1)

In [None]:
# Concatenate dummy variables with the original dataframe
df1 = pd.concat([df1, df1_dummies], axis=1)
df1.head()

In [None]:
df1.info()

## 4. Model Building and Evaluation

Let's start building the model. The first step in model building is the usual train-test split, so let's perform that.

In [None]:
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test= train_test_split(df1, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
from sklearn.preprocessing import MinMaxScaler 

### Scaling

Now that we have done the train-test split, we need to scale the variables so that the coefficients are comparable and the regularisation penalty treats all features alike. We only need to scale the numeric columns, not the dummy variables, so let's take a look at the numeric variables. Also, the scaler has to be fitted only on the train dataset, as we don't want it to learn anything from the test data; the test set is then transformed with the already fitted scaler.

Let's scale all these columns using MinMaxScaler. You can use any other scaling method as well; it is totally up to you.
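
To make the fit-on-train-only point concrete, here is a minimal sketch of min-max scaling done by hand. The arrays `train_col` and `test_col` are toy values assumed for illustration, not taken from the dataset:

```python
import numpy as np

# Toy values standing in for one numeric column (an assumption for illustration)
train_col = np.array([10.0, 20.0, 30.0])
test_col = np.array([5.0, 25.0, 40.0])

# "Fit": learn the minimum and maximum from the training data only
col_min, col_max = train_col.min(), train_col.max()

# "Transform": apply the same training statistics to both sets
train_scaled = (train_col - col_min) / (col_max - col_min)
test_scaled = (test_col - col_min) / (col_max - col_min)

print(train_scaled)  # training values span exactly [0, 1]
print(test_scaled)   # test values may fall outside [0, 1]
```

Test values outside the 0 to 1 range are expected and not a bug: they simply lie beyond the range seen during training.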

In [None]:
scaler = MinMaxScaler()

In [None]:
# Select the numeric columns to scale; the dummy variables are left as they are
var = df_train.select_dtypes(include=['float64', 'int64'])
var

In [None]:
df_train[var.columns] = scaler.fit_transform(df_train[var.columns])

In [None]:
df_train["SalePrice"]

In [None]:
y_train = df_train.pop("SalePrice")
X_train = df_train

In [None]:
df_test[var.columns] = scaler.transform(df_test[var.columns])

In [None]:
y_test = df_test.pop("SalePrice")
X_test = df_test

In [None]:
reg = LinearRegression() 
reg.fit(X_train,y_train)

In [None]:
# Predictions on the basis of the model
y_pred = reg.predict(X_train)
y_pred

In [None]:
r2_score(y_train, y_pred)

In [None]:
# Predictions on the basis of the model
y_test_pred = reg.predict(X_test)
r2_score(y_test, y_test_pred)

In [None]:
#Residual Sum of Squares = Mean_Squared_Error * Total number of datapoints
rss = np.sum(np.square(y_train - y_pred))
print(rss)
mse = mean_squared_error(y_train, y_pred)
print(mse)
# Root Mean Squared Error
rmse = mse**0.5
print(rmse)

### Residual analysis

In [None]:
# Plot the distribution of the residuals to check that the error terms are roughly normal and centred at zero
sns.histplot((y_train-y_pred))

In [None]:
# Applying Ridge Regression while varying the hyperparameter 'lambda' (called alpha in scikit-learn)

lambdas = [0, 0.0001, 0.001, 0.01, 1, 10, 100]  # the higher the value of lambda, the stronger the regularization
for alpha in lambdas:  # each lambda gives a different set of model coefficients
    ridgereg = Ridge(alpha=alpha)  # Initialize the Ridge Regression model with a specific lambda
    ridgereg.fit(X_train, y_train)
    print("alpha = " + str(alpha))
    # Computing the r2 score on the train and test sets
    y_pred = ridgereg.predict(X_train)
    print("r2 score = " + str(r2_score(y_train, y_pred)))
    y_test_pred = ridgereg.predict(X_test)
    print("test score = " + str(r2_score(y_test, y_test_pred)))
    # Count the predictors with a non-zero coefficient
    predictors = [p for p, c in zip(X_train.columns, ridgereg.coef_) if c != 0]
    print('no of predictors :' + str(len(predictors)))

In [None]:
# Applying Lasso Regression while varying the hyperparameter 'lambda' (called alpha in scikit-learn)

# Note: alpha = 0 is plain least squares; scikit-learn warns against using Lasso for it
lambdas = [0, 0.0001, 0.001, 0.01, 1, 10, 100]
for alpha in lambdas:
    lassoreg = Lasso(alpha=alpha)  # Initialize the Lasso Regression model with a specific lambda
    lassoreg.fit(X_train, y_train)
    # Compute R^2 on the train and test sets
    print("alpha = " + str(alpha))
    y_pred = lassoreg.predict(X_train)
    print("r2 score = " + str(r2_score(y_train, y_pred)))
    y_test_pred = lassoreg.predict(X_test)
    print("test score = " + str(r2_score(y_test, y_test_pred)))
    # Count the predictors with a non-zero coefficient
    predictors = [p for p, c in zip(X_train.columns, lassoreg.coef_) if c != 0]
    print('no of predictors: ' + str(len(predictors)))

In [None]:
lassoreg = Lasso(alpha = 0.0001)
lassoreg.fit(X_train, y_train)
y_pred = lassoreg.predict(X_train)
print("r2 score = " + str(r2_score(y_train, y_pred)))
y_test_pred = lassoreg.predict(X_test)
print("test score = " + str(r2_score(y_test, y_test_pred)))
# print(lassoreg.coef_)
# Pair each retained predictor with the absolute value of its coefficient.
# A list of pairs is used because a dict keyed by coefficient would lose predictors with equal values.
selected = [(abs(c), p) for p, c in zip(X_train.columns, lassoreg.coef_) if c != 0]
print('no of predictors: ' + str(len(selected)))
# Print the predictors from the largest to the smallest absolute coefficient
for c, p in sorted(selected, key=lambda pair: pair[0], reverse=True):
    print(p)
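
The predictor counts printed by the Ridge and Lasso loops reflect a general difference between the two penalties: Ridge (L2) shrinks coefficients towards zero but rarely makes them exactly zero, while Lasso (L1) can set them to exactly zero and so acts as a feature selector. The following is a minimal sketch on synthetic data; the names `X_demo`, `y_demo` and the choice `alpha=0.5` are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption): only the first of five features drives y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero under each penalty
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
print("ridge zero coefficients:", ridge_zeros)
print("lasso zero coefficients:", lasso_zeros)
```

This is why the final model above uses Lasso when the goal is to list the most significant predictors.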