# Advanced Regression
## House price prediction case study

#### Problem Statement:
A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

The company wants to know:

    1. Which variables are significant in predicting the price of a house, and
    2. How well those variables describe the price of a house.

Also, determine the optimal value of lambda for ridge and lasso regression.

 

#### Business Goal:
You are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can accordingly manipulate the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

#### Brief outline of different steps involved in modelling:
1. Partitioning the data into train/validation/test chunks 
2. Load the data and understand variables 
3. Data inspection
4. Exploratory Data Analysis
5. Pre-processing data - missing value imputation, scaling, dropping variables, etc.
6. Modelling using Regression
7. Hyper parameter tuning and regularization - ridge/lasso
8. Model evaluation


## 1. Loading and inspecting the Data

Given that we already have the data in the form of train.csv and test.csv, we can skip the step of splitting the data. We will use train.csv to understand and preprocess the data and to perform some EDA on it in the next few steps.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [None]:
housing = pd.read_csv('train.csv')
housing.head()

In [None]:
#let's also take a look at the shape of the data
housing.shape

In [None]:
#some info about data types of the variables
housing.dtypes

In [None]:
#summary of the data
housing.info()

We have 81 columns and 1460 rows in total. 38 of the columns are numeric (float64 and int64) and 43 are of object type (strings). Some of the columns also contain null values; we will treat them during pre-processing in the subsequent steps.
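The counts above can also be read off programmatically rather than from the `info()` printout. The following is a minimal sketch on a toy frame standing in for `housing`; the values in `toy` are made up for illustration, and the same two expressions can be applied to the real dataframe.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `housing` (values are made up for illustration)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "MSZoning": ["RL", "RL", "RM", "RL"],
})

# Number of columns per dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Null count and percentage, only for columns with at least one null
null_counts = toy.isnull().sum()
null_summary = pd.DataFrame({
    "n_null": null_counts,
    "pct_null": (100 * null_counts / len(toy)).round(1),
})
null_summary = null_summary[null_summary["n_null"] > 0]
null_summary = null_summary.sort_values("n_null", ascending=False)
print(null_summary)
```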

In [None]:
#let's take a look at some of the numeric variables 
housing.describe()

This gives us a summary of the distribution of the numeric variables. Referring to the data dictionary, we can identify some important variables, see how their mean, median and inter-quartile range are distributed, and spot the presence of outliers. Some of the variables that could be important to our modelling:
1. LotArea - which ranges from 1300 sq.feet to 215245 sq.feet, having a median of 9478 sq.feet
2. OverallQual and OverallCond - ratings of material/finish and condition of the house which ranges from 1-10, 1 being the worst and 10 being the best.
3. YearBuilt - Year in which the house is built, ranging from 1872 to 2010, with average year of building of the house being 1971.

These are only a few of the numerical variables present; we shall explore and visualise more of them in the later steps to see which could be important for us.

#### Understanding the data dictionary

Let's take a look at some of the variables that have sub-categories as mentioned in the data dictionary.

In [None]:
#MSSubClass has various categories under it which are all represented as numeric variables. Let's see what it looks like

ms_subclass_dic = { 
        20:	"1-STORY 1946 & NEWER ALL STYLES",
        30:	"1-STORY 1945 & OLDER",
        40:	"1-STORY W/FINISHED ATTIC ALL AGES",
        45:	"1-1/2 STORY - UNFINISHED ALL AGES",
        50:	"1-1/2 STORY FINISHED ALL AGES",
        60:	"2-STORY 1946 & NEWER",
        70:	"2-STORY 1945 & OLDER",
        75:	"2-1/2 STORY ALL AGES",
        80:	"SPLIT OR MULTI-LEVEL",
        85:	"SPLIT FOYER",
        90:	"DUPLEX - ALL STYLES AND AGES",
       120:	"1-STORY PUD (Planned Unit Development) - 1946 & NEWER",
       150:	"1-1/2 STORY PUD - ALL AGES",
       160:	"2-STORY PUD - 1946 & NEWER",
       180:	"PUD - MULTILEVEL - INCL SPLIT LEV/FOYER",
       190:	"2 FAMILY CONVERSION - ALL STYLES AND AGES"
    }

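ms_subclass = housing['MSSubClass'].value_counts()
#let's map the category description to get a better idea
#naming both columns explicitly avoids relying on the default 'index' column name,
#which differs between pandas versions
ms_subclass = ms_subclass.rename_axis('MSSubClass').reset_index(name='count')
ms_subclass['MSSubClass'] = ms_subclass['MSSubClass'].map(ms_subclass_dic)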

#let's take a look at the value counts of different categories present
print(ms_subclass)

In [None]:
#Let's repeat the same thing for the variable MSZoning
ms_zoning_dic = {
       "A":	"Agriculture",
       "C (all)":	"Commercial",
       "FV":	"Floating Village Residential",
       "I":	"Industrial",
       "RH":	"Residential High Density",
       "RL":	"Residential Low Density",
       "RP":	"Residential Low Density Park", 
       "RM":	"Residential Medium Density"
}

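ms_zoningclass = housing['MSZoning'].value_counts()
#let's map the category description to get a better idea
#naming both columns explicitly avoids relying on the default 'index' column name,
#which differs between pandas versions
ms_zoningclass = ms_zoningclass.rename_axis('MSZoning').reset_index(name='count')
ms_zoningclass['MSZoning'] = ms_zoningclass['MSZoning'].map(ms_zoning_dic)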

#let's take a look at the value counts of different categories present
print(ms_zoningclass)


## 2. Exploratory Data Analysis
In this step, let's visualise and make sense of the different numeric and categorical variables in our data with the help of different plots and statistics.

In [None]:
# all numeric (float and int) variables in the dataset
housing_num = housing.select_dtypes(include=['float64', 'int64'])
housing_num.head()
# [['LotFrontage', 'LotArea', 'YearBuilt', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'YrSold']]

Some of the variables like ```MSSubClass``` are represented numerically, but they have discrete categories that they're mapped to. We will convert such variables to categorical types in the later steps; for now, let's leave them out and visualise the rest.

In [None]:
housing_num = housing_num.drop(['MSSubClass','Id', 'OverallQual', 'OverallCond', 'MoSold'], axis=1)
housing_num.head()

In [None]:
#let's first understand how different feature variables are related to the target variable (features vs saleprice) 
# by plotting and visualising them.
#x and y are passed as keywords; recent seaborn versions do not accept them positionally
sns.scatterplot(x='LotFrontage', y='SalePrice', data=housing_num)
plt.show()


1. Looking at the above plot, there is a cluster of values between 0-150 feet of LotFrontage and 0-400000 dollars. Even though the relationship might not look strictly linear, we could draw a straight line through the points to approximate the trend.
2. We can also spot the outliers in the data: there are two data points at the extreme end of LotFrontage and two points which exceed a sale price of ```$700,000```.


In [None]:
#x and y are passed as keywords; recent seaborn versions do not accept them positionally
sns.scatterplot(x='LotArea', y='SalePrice', data=housing_num)
plt.show()

1. The plot of ```LotArea``` vs ```SalePrice``` is different from what we observed in the first case. The distribution is more compact and for a given value of area, the price ranges from under 100000 to over 400000 dollars. We could say that the relationship is not strictly linear and there is perhaps a more complicated mapping of feature variable to the target variable.
2. We can also spot the outliers from this plot - the ones having lot area exceeding 150000 sq.ft and the ones exceeding sale price of 700000 dollars.

Alright, let's select some of the variables and visualise pairplots between them to get a better idea of relationship between 2 or more variables.

In [None]:
housing_num_filtered = housing_num[['LotFrontage', 'LotArea', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'SalePrice']]
sns.pairplot(housing_num_filtered)
plt.show()

The above is a set of pair-plots that show the relationship between each of the variables with one another. Let's particularly focus on the last row of plots which show us the relationship between ```SalePrice``` and other features.
1. When we look at total basement sq.feet vs price plot, we can observe that there are various data points which have the value 0 on the x-axis, meaning these are houses that do not have a basement.
2. The plot of second floor sq.feet vs price also has data points at 0 on the x-axis, meaning there are a number of houses that do not have a second floor. The same is observed with garage area vs price.
3. The plots of ```1stFlrSF```, ```2ndFlrSF``` and ```GrLivArea``` have somewhat of a linear increasing trend. As the area increases, we also see prices go up. Similar trends are observed with ```GarageArea``` vs ```SalePrice``` as well.

Let's take a look at the relationship between sale prices and the year in which the house was built/sold.

In [None]:
#plot sale price against the year the house was built (2-D histogram)
sns.histplot(data=housing_num, x="YearBuilt", y="SalePrice")
plt.show()

In [None]:
#plot sale price against the year the house was sold (2-D histogram)
sns.histplot(data=housing_num, x="YrSold", y="SalePrice")
plt.show()

Let's take a look at the categorical variables and some relevant plots to analyse them.

In [None]:
# all categorical (object) variables in the dataset
housing_cpy = housing.copy()
housing_cpy['Id'] = housing_cpy['Id'].astype(str)

housing_cat = housing_cpy.select_dtypes(include=['object'])
housing_cat.head()

As mentioned earlier, some of the variables are represented numerically even though they occupy certain discrete values to represent the data. Let's add them to the dataframe ```housing_cat``` and convert them to object types.

In [None]:
housing_num = housing.select_dtypes(include=['float64', 'int64'])
housing_num_filtered = housing_num[['MSSubClass', 'Id', 'OverallQual', 'OverallCond', 'MoSold', 'SalePrice']]

housing_cat['Id'] = housing_cat['Id'].astype('int64')
housing_cat_merge = pd.merge(housing_num_filtered, housing_cat, how='inner', on='Id')
housing_cat_merge.head()


In [None]:
#convert numeric types to object types
housing_cat_merge['MSSubClass'] = housing_cat_merge['MSSubClass'].astype(str)

#let's plot the type of dwelling involved in the sale vs the price the house is sold at, using a box plot
plt.figure(figsize=(60, 20))
plt.subplot(2,4,1)
sns.boxplot(x = 'MSSubClass', y = 'SalePrice', data = housing_cat_merge)

#plot zoning type vs price 
plt.subplot(2,4,2)
sns.boxplot(x = 'MSZoning', y = 'SalePrice', data = housing_cat_merge)

plt.show()

1. Category ```60``` (2-STORY 1946 & NEWER) seems to have higher median and quartile values of house price than all other categories, followed by ```120``` (1-STORY PUD (Planned Unit Development) - 1946 & NEWER) and ```75``` (2-1/2 STORY ALL AGES).
2. We can also see the outliers in our data: the black points that lie beyond the whiskers of each box, i.e. more than 1.5 times the IQR below Q1 or above Q3.
3. The right plot shows us zoning type vs sale price. ```FV``` (Floating Village Residential) has the maximum median sale price followed by ```RL``` (Residential Low Density). Commercial property seems to have the least median sale price.
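To make the outlier rule behind the box plots concrete, here is a minimal sketch of the usual 1.5 x IQR criterion that box plots use by default to draw individual points. The helper `iqr_outliers` and the `prices` values are hypothetical and only illustrate the rule; they are not taken from the data set.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up sale prices with one extreme value
prices = pd.Series([100_000, 120_000, 130_000, 140_000, 150_000, 160_000, 755_000])
flagged = iqr_outliers(prices)
print(flagged.tolist())
```

Applied per category (for example with `groupby('MSSubClass')`), the same function would list the points that each box draws individually.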

In [None]:
plt.figure(figsize=(60, 20))
plt.subplot(2,4,1)
sns.boxplot(x = 'OverallQual', y = 'SalePrice', data = housing_cat_merge)

plt.subplot(2,4,2)
sns.boxplot(x = 'OverallCond', y = 'SalePrice', data = housing_cat_merge)

plt.show()

The above plots show us the trend of overall quality and condition vs sale price. 
1. The left plot is like how we would expect it to be - i.e. higher the rating of house quality, higher the price of the house and we can see a linear increase in the values.
2. The plot on the right however, has steadily increasing median values till ```5``` after which the value drops and remains almost constant for categories ```6```,```7``` and ```8```. We see an increase in sale prices for category ```9```.

In [None]:
plt.figure(figsize=(60, 20))
plt.subplot(2,4,1)
sns.boxplot(x = 'BldgType', y = 'SalePrice', data = housing_cat_merge)

# plt.subplot(2,4,2)
# sns.boxplot(x = 'Neighborhood', y = 'SalePrice', data = housing_cat_merge)

plt.show()

In [None]:
#get numeric variables to see how they're correlated.
housing_num_fil = housing_num.drop(['MSSubClass','Id', 'OverallQual', 'OverallCond', 'MoSold'], axis=1)
housing_num_fil.head()

Let's now take a look at how the variables are correlated by plotting a heatmap.

In [None]:
plt.figure(figsize = (40, 20))
sns.heatmap(housing_num_fil.corr(), annot = True, cmap="YlGnBu")
plt.show()

## 3. Data pre-processing and preparation

Earlier when we had explored the dataset, we observed some rows and columns having null values, categorical data being cast as numerical values and so on. Here, we will treat these issues present in the data before making it ready for modelling. Some of the steps performed include:
1. missing/null value imputation
2. dropping rows/columns if necessary
3. dummy variables for categorical variables
4. scaling of numeric variables

In [None]:
#lets check missing values in our data
housing.info()

Let's keep only the columns in which at least ```85%``` of the values are non-null, i.e. remove columns with more than ```15%``` null values.

In [None]:
housing_na = housing.dropna(thresh=len(housing)*0.85 , axis=1)
housing_na.info()

We have 74 different columns now, after dropping the columns whose share of null values exceeds the threshold.
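The `thresh` argument of `dropna` counts non-null values, which is easy to misread as a limit on nulls. A small sketch on a made-up frame (the column names are hypothetical) shows which columns survive an 85% threshold:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                    # 0% missing
    "few_missing": [1, np.nan, 3, 4, 5, 6, 7, 8, 9, 10],        # 10% missing
    "many_missing": [1, np.nan, np.nan, 4, 5, 6, 7, 8, 9, 10],  # 20% missing
})

# thresh is the minimum number of NON-null values a column needs to be kept:
# 85% of 10 rows is 8.5, so a column needs at least 9 non-null values
min_non_null = int(np.ceil(0.85 * len(toy)))
kept = toy.dropna(thresh=min_non_null, axis=1)

# Equivalent formulation in terms of the share of nulls
kept_alt = toy.loc[:, toy.isnull().mean() <= 0.15]

print(list(kept.columns))
```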

In [None]:
#let's look at columns which still have some missing values and perform imputation depending on the nature of data.

#select columns which still have some missing values
df_null = housing_na.loc[:, housing_na.isnull().any()]      #columns with at least one null value
df_null.head()


In [None]:
#let's fill the null values in all of the above columns:
#mode for the categorical columns, median for the numeric columns
mode_cols = ["MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
             "BsmtFinType2", "Electrical", "GarageType", "GarageFinish",
             "GarageQual", "GarageCond"]
median_cols = ["MasVnrArea", "GarageYrBlt"]

#computing the statistic per column avoids calling median() on non-numeric columns
values = {col: df_null[col].mode(dropna=True)[0] for col in mode_cols}
values.update({col: df_null[col].median(skipna=True) for col in median_cols})

housing_na = housing_na.fillna(value=values)
housing_na.info()


We now have non-null values in all the columns. Let's proceed to perform the other pre-processing steps as mentioned earlier.
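The imputation rule used above (mode for categorical columns, median for numeric ones) can also be written without listing the columns by hand. This is a sketch only: `impute_by_dtype` is a hypothetical helper, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls with the median (numeric columns) or the mode (other columns)."""
    out = df.copy()
    for col in out.columns[out.isnull().any()]:
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median()
        else:
            fill = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(fill)
    return out

# Made-up rows using two of the column names from the data set
toy = pd.DataFrame({
    "MasVnrArea": [0.0, 196.0, np.nan, 162.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd"],
})
filled = impute_by_dtype(toy)
print(filled)
```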

#### Dummy variable creation

In [None]:
#lets split the data now into X(feature variables) and Y(target variable) before creating dummy variables
Y = housing_na['SalePrice']
X = housing_na.drop(['Id', 'SalePrice'],axis=1)
print(X.shape, Y.shape)


In [None]:
X_cpy = X.copy()

#lets convert the categorical variables that have been represented as numeric values
X_cpy['MSSubClass'] = X_cpy['MSSubClass'].astype(str)
X_cpy['OverallQual'] = X_cpy['OverallQual'].astype(str)
X_cpy['OverallCond'] = X_cpy['OverallCond'].astype(str)
X_cpy['MoSold'] = X_cpy['MoSold'].astype(str)

X_cat = X_cpy.select_dtypes(include=['object'])
X_cat.head()



In [None]:
#changing the category name of MSSubClass based on data dict

ms_subclass_dic = { 
        "20":	"1-STORY 1946 & NEWER ALL STYLES",
        "30":	"1-STORY 1945 & OLDER",
        "40":	"1-STORY W/FINISHED ATTIC ALL AGES",
        "45":	"1-1/2 STORY-UNFINISHED ALL AGES",
        "50":	"1-1/2 STORY FINISHED ALL AGES",
        "60":	"2-STORY 1946 & NEWER",
        "70":	"2-STORY 1945 & OLDER",
        "75":	"2-1/2 STORY ALL AGES",
        "80":	"SPLIT OR MULTI-LEVEL",
        "85":	"SPLIT FOYER",
        "90":	"DUPLEX-ALL STYLES AND AGES",
       "120":	"1-STORY PUD-1946 & NEWER",
       "150":	"1-1/2 STORY PUD - ALL AGES",
       "160":	"2-STORY PUD-1946 & NEWER",
       "180":	"PUD-MULTILEVEL-INCL SPLIT LEV/FOYER",
       "190":	"2 FAMILY CONVERSION-ALL STYLES AND AGES"
    }

X_cat['MSSubClass'] = X_cat['MSSubClass'].map(ms_subclass_dic)
X_cat.head()



In [None]:
# convert into dummies - one hot encoding
X_cat_dummies = pd.get_dummies(X_cat, drop_first=True)
X_cat_dummies.head()

In [None]:
# drop categorical variables 
X = X.drop(list(X_cat.columns), axis=1)
X.head()

In [None]:
# concat dummy variables with X
X = pd.concat([X, X_cat_dummies], axis=1)
X.head()

Now that we have created dummy variables, let's move to the step of scaling these values using a StandardScaler.
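One point worth keeping in mind when test.csv is processed later: the scaling statistics should come from the training data and be reused on the test data. The sketch below shows the idea with plain pandas on made-up numbers; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` computes, and the frames `train` and `test` are hypothetical.

```python
import pandas as pd

# Made-up values; `train` and `test` stand in for the two data sets
train = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0]})
test = pd.DataFrame({"GrLivArea": [1500.0, 2500.0]})

# Statistics are estimated on the training data only
mean = train.mean()
std = train.std(ddof=0)

# The same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled["GrLivArea"].round(3).tolist())
print(test_scaled["GrLivArea"].round(3).tolist())
```

With scikit-learn this corresponds to calling `fit_transform` on the training features and `transform` on the test features with the same fitted scaler.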

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols = X.columns.to_list()
print(cols)
X[cols] = scaler.fit_transform(X[cols])
X.head()


Now that we have scaled all of our values, let's go back to seeing correlation between variables as some of the variables can be dropped based on their correlation with other variables.

In [None]:
plt.figure(figsize = (40, 20))
sns.heatmap(housing_num_fil.corr(), annot = True, cmap="YlGnBu")
plt.show()

Looking at the above heatmap, we can observe two things.
1. ```SalePrice``` is correlated with a number of other features.
2. There is some correlation between the independent variables as well. We shall identify these and drop some of them before modelling, as this reduces the number of features in our model and helps limit overfitting.

In [None]:
housing_num_fil = housing_num_fil.drop(['SalePrice'],axis=1)
corr_mat = housing_num_fil.corr().abs()
corr_mat.head()

In [None]:
#get the upper triangle, as the matrix is a mirror image about the diagonal
#(the np.bool alias was removed from recent NumPy versions, so the built-in bool is used)
upper_tri = corr_mat.where(np.triu(np.ones(corr_mat.shape),k=1).astype(bool))
print(upper_tri)

In [None]:
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.70)]
print(to_drop)

In [None]:
X = X.drop(to_drop,axis=1)
X.head()

## 4. Modelling

Let's dive into building a model using linear regression, first with a simple LinearRegression combined with RFE, evaluated using metrics such as RMSE and R2 score. We will follow that up with ridge and lasso regression and hyperparameter tuning to see how our model performs.

In [None]:
from sklearn.feature_selection import RFE
import statsmodels.api as sm

#sanity check
print(X.shape, Y.shape)

#we'll use the sklearn LinearRegression estimator for RFE 
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=60)             # running RFE
rfe = rfe.fit(X, Y)

In [None]:
#list of features from rfe
list(zip(X.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X.columns[rfe.support_]
col

In [None]:
X.columns[~rfe.support_]

In [None]:
# Creating the X_train dataframe with the RFE-selected variables
X_train_rfe = X[col]
X_train_rfe.head()

In [None]:
#fit the new dataset 

# Add a constant
X_train_lm = sm.add_constant(X_train_rfe)

# Create a first fitted model
lr = sm.OLS(Y, X_train_lm).fit()

In [None]:
#let's check the intercept and coefficients
lr.params

In [None]:
# Print a summary of the linear regression model obtained
print(lr.summary())

In [None]:
#drop the const column before calculating VIF
X_train_lm.drop(['const'], axis=1, inplace=True)


In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X_new = X_train_lm
vif['Features'] = X_new.columns
vif['VIF'] = [variance_inflation_factor(X_new.values, i) for i in range(X_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Some of our features have a VIF of ```inf```, which means they are perfectly collinear with (i.e. exact linear combinations of) other features. Let's drop the columns that have infinite VIF and retrain our model.
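To see why a VIF can be infinite, recall that the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes this by hand with NumPy on synthetic data. The function `vif` is a hypothetical helper that adds an intercept to each auxiliary regression, so its values may differ slightly from those of the statsmodels function used in the notebook.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: regress it on the other columns (plus an intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    # R^2 of 1 means the column is an exact linear combination of the others
    return np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

independent = np.column_stack([a, b])      # two unrelated features
collinear = np.column_stack([a, b, a + b]) # third column is a + b exactly

print(vif(independent, 0))  # close to 1
print(vif(collinear, 2))    # inf
```

A full set of dummy columns that always sums to a constant, or two dummies that encode the same condition, produces exactly this situation.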

In [None]:
cols_to_drop = vif['Features'].where(vif['VIF'] == float('inf'))
cols_to_drop.dropna(inplace=True)
cols_to_drop = cols_to_drop.to_list()
cols_to_drop

In [None]:
X_train_new = X_new.drop(columns=cols_to_drop)
X_train_new.head()

In [None]:
#let's fit the model again with new set of features

# Add a constant
X_train_lm = sm.add_constant(X_train_new)

# Refit the model on the reduced set of features
lr = sm.OLS(Y, X_train_lm).fit()

In [None]:
#Let's see the summary of the new model
print(lr.summary())

In [None]:
#drop the const column before calculating VIF
X_train_lm.drop(['const'], axis=1, inplace=True)

In [None]:
vif = pd.DataFrame()
X_new = X_train_lm
vif['Features'] = X_new.columns
vif['VIF'] = [variance_inflation_factor(X_new.values, i) for i in range(X_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

There are still features which have a high VIF. Let's remove them and repeat the same process again.

In [None]:
cols_to_drop = ['RoofMatl_CompShg', 'MasVnrType_None', 'MasVnrType_BrkFace', 'Exterior2nd_CmentBd', 'Exterior1st_CemntBd']
X_train_new = X_new.drop(columns=cols_to_drop)
X_train_new.head()

In [None]:
#let's fit the model again with new set of features

# Add a constant
X_train_lm = sm.add_constant(X_train_new)

# Refit the model on the reduced set of features
lr = sm.OLS(Y, X_train_lm).fit()

In [None]:
#Let's see the summary of the new model
print(lr.summary())

In [None]:
#drop the const column before calculating VIF
X_train_lm.drop(['const'], axis=1, inplace=True)

vif = pd.DataFrame()
X_new = X_train_lm
vif['Features'] = X_new.columns
vif['VIF'] = [variance_inflation_factor(X_new.values, i) for i in range(X_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#lets remove features which have VIF>5
cols_to_drop = ['GarageType_Attchd', 'GarageType_Detchd', 'BsmtQual_TA']
X_train_new = X_new.drop(columns=cols_to_drop)
X_train_new.head()

In [None]:
#let's fit the model again with new set of features

# Add a constant
X_train_lm = sm.add_constant(X_train_new)

# Refit the model on the reduced set of features
lr = sm.OLS(Y, X_train_lm).fit()

In [None]:
#drop the const column before calculating VIF
X_train_lm.drop(['const'], axis=1, inplace=True)

vif = pd.DataFrame()
X_new = X_train_lm
vif['Features'] = X_new.columns
vif['VIF'] = [variance_inflation_factor(X_new.values, i) for i in range(X_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Let's check the summary of our model. Specifically, we'll look at the p-values of the coefficients to determine whether they're significant. Variables whose coefficients have a p-value above 0.05 can be dropped, and the remaining variables retained.
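Reading p-values off the printed summary becomes tedious with dozens of features. A fitted statsmodels OLS result exposes them as a pandas Series through its `pvalues` attribute, so the 0.05 rule can be applied directly. The sketch below uses a hand-written Series with hypothetical p-values in place of `lr.pvalues`; the numbers are illustrative and are not results from this model.

```python
import pandas as pd

# Hypothetical p-values, shaped like `lr.pvalues` from a fitted OLS model
pvalues = pd.Series({
    "const": 0.000,
    "GrLivArea": 0.001,
    "LotArea": 0.030,
    "MoSold_5": 0.410,
    "PoolArea": 0.082,
})

alpha = 0.05  # significance level

# Features to retain (the intercept is not a feature, so it is excluded)
significant = pvalues[pvalues < alpha].index.drop("const", errors="ignore").tolist()

# Candidates for removal
insignificant = pvalues[pvalues >= alpha].index.tolist()

print("keep:", significant)
print("drop:", insignificant)
```

Dropping one insignificant variable at a time and refitting is generally safer than dropping all of them at once, because p-values change when correlated features are removed.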

In [None]:
print(lr.summary())

In [None]:
#let's evaluate the base model
from sklearn.metrics import r2_score, mean_squared_error

metrics_lr = []
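#predict with the same design matrix the final model was fitted on (constant + remaining features)
y_pred_train = lr.predict(sm.add_constant(X_train_new))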

r2_train_lr = r2_score(Y, y_pred_train)
print("R2 Score:", r2_train_lr)
metrics_lr.append(r2_train_lr)

rss1_lr = np.sum(np.square(Y - y_pred_train))
print("RSS value:", rss1_lr)
metrics_lr.append(rss1_lr)

mse_train_lr = mean_squared_error(Y, y_pred_train)
print("Mean squared error:", mse_train_lr)
#store the root mean squared error (RMSE) for the comparison table
metrics_lr.append(mse_train_lr**0.5)



In [None]:
#ridge regression

# params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 
#  0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 
#  4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500, 1000 ]}

# ridge = Ridge()

# # cross validation
# folds = 5
# model_cv = GridSearchCV(estimator = ridge, 
#                         param_grid = params, 
#                         scoring= 'neg_mean_absolute_error',  
#                         cv = folds, 
#                         return_train_score=True,
#                         verbose = 1)            
# model_cv.fit(X, Y)

In [None]:
# Printing the best hyperparameter alpha
# print(model_cv.best_params_)

In [None]:
#Fitting Ridge model for the chosen alpha and printing coefficients which have been penalised
# alpha = 10
# ridge = Ridge(alpha=alpha)

# ridge.fit(X, Y)
# print(ridge.coef_)

In [None]:
# y_pred_train = ridge.predict(X)

# metrics_ridge = []

# r2_train_lr = r2_score(Y, y_pred_train)
# print("R2 Score:", r2_train_lr)
# metrics_ridge.append(r2_train_lr)

# rss1_lr = np.sum(np.square(Y - y_pred_train))
# print("RSS value:", rss1_lr)
# metrics_ridge.append(rss1_lr)

# mse_train_lr = mean_squared_error(Y, y_pred_train)
# print("Mean squared error:", mse_train_lr)
# metrics_ridge.append(mse_train_lr**0.5)


In [None]:
#lasso regression

# lasso = Lasso()

# # cross validation
# model_cv = GridSearchCV(estimator = lasso, 
#                         param_grid = params, 
#                         scoring= 'neg_mean_absolute_error', 
#                         cv = folds, 
#                         return_train_score=True,
#                         verbose = 1)            

# model_cv.fit(X, Y) 

In [None]:
# Printing the best hyperparameter alpha
# print(model_cv.best_params_)

In [None]:
# alpha =500
# lasso = Lasso(alpha=alpha)        
# lasso.fit(X, Y) 

In [None]:
# metrics_lasso = []

# y_pred_train = lasso.predict(X)

# r2_train_lr = r2_score(Y, y_pred_train)
# print("R2 Score:", r2_train_lr)
# metrics_lasso.append(r2_train_lr)

# rss1_lr = np.sum(np.square(Y - y_pred_train))
# print("RSS value:", rss1_lr)
# metrics_lasso.append(rss1_lr)

# mse_train_lr = mean_squared_error(Y, y_pred_train)
# print("Mean squared error:", mse_train_lr)
# metrics_lasso.append(mse_train_lr**0.5)

Let's compare the different metrics in the form of a table.
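The three quantities collected for each model (R2, RSS and the square root of the MSE) are closely related, which a small worked example makes explicit. The function `regression_metrics` is a hypothetical helper written with NumPy only, and the input numbers are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, RSS, MSE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    mse = rss / len(y_true)
    return {"R2": 1 - rss / tss, "RSS": rss, "MSE": mse, "RMSE": np.sqrt(mse)}

# Every prediction is off by exactly 10
m = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(m)
```

Because RMSE is in the same unit as the target (dollars here), it is usually easier to interpret than MSE or RSS when comparing models.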

In [None]:
# # Creating a table which contain all the metrics

# lr_table = {'Metric': ['R2 Score (Train)','RSS (Train)',
#                        'RMSE (Train)'], 
#         'Linear Regression': metrics_lr
#         }

# lr_metric = pd.DataFrame(lr_table ,columns = ['Metric', 'Linear Regression'] )

# rg_metric = pd.Series(metrics_ridge, name = 'Ridge Regression')
# ls_metric = pd.Series(metrics_lasso, name = 'Lasso Regression')

# final_metric = pd.concat([lr_metric, rg_metric, ls_metric], axis = 1)

# final_metric

#### Selecting the values of lambda for ridge and lasso regression
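In scikit-learn the regularisation strength lambda is the `alpha` parameter of `Ridge` and `Lasso`. One common way to choose it is a cross-validated grid search that keeps the value with the best validation score. The sketch below runs on synthetic data (`X_demo`, `y_demo` and the alpha grid are assumptions made for illustration), so the values it prints say nothing about the housing data; on the notebook's data the same loop would be fitted on `X` and `Y`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data: 10 features, only the first 3 influence the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=200)

# Candidate values of lambda (called alpha in scikit-learn)
params = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}

best_alpha = {}
for name, estimator in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    search = GridSearchCV(estimator, params,
                          scoring="neg_mean_absolute_error", cv=5)
    search.fit(X_demo, y_demo)
    best_alpha[name] = search.best_params_["alpha"]
    # Lasso can set coefficients exactly to zero; ridge only shrinks them
    n_zero = int(np.sum(search.best_estimator_.coef_ == 0))
    print(name, "best alpha:", best_alpha[name], "| zeroed coefficients:", n_zero)
```

Plotting `search.cv_results_['mean_test_score']` against alpha is a useful check that the chosen value sits in the interior of the grid rather than at one of its edges.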