# Video 5 - Introduction to the Apex Dataset

Let's begin by importing the necessary libraries. For L1 regularization, we will need to import Lasso and LassoCV from sklearn.linear_model.

In [1]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso, LassoCV

Now load the dataset into a DataFrame named df, and take a quick look at our data's structure using the .info() method: 

In [2]:
# Read the dataset
df = pd.read_csv("Energy_Efficiency_Overfit_Dataset_Updated.csv")

In [3]:
# Get data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Wall_Area                 200 non-null    float64
 1   Roof_Area                 200 non-null    float64
 2   Window_Area               200 non-null    float64
 3   Overall_Height            200 non-null    float64
 4   Outdoor_Temperature       200 non-null    float64
 5   Humidity                  200 non-null    float64
 6   Energy_Efficiency_Rating  200 non-null    float64
 7   Noise_Feature_1           200 non-null    float64
 8   Noise_Feature_2           200 non-null    float64
 9   Noise_Feature_3           200 non-null    float64
 10  Noise_Feature_4           200 non-null    float64
 11  Noise_Feature_5           200 non-null    float64
 12  Noise_Feature_6           200 non-null    float64
 13  Noise_Feature_7           200 non-null    float64
 14  Noise_Feat

With our data loaded, we separate our features, or independent variables, from our target variable, 'Energy Efficiency Rating'. Our goal is to __predict this rating using the features available to us__.

In [4]:
# Separating the dependent and independent variables
X = df.drop('Energy_Efficiency_Rating', axis=1)
y = df['Energy_Efficiency_Rating']

Now, let's do a 70:30 train-test split.

In [5]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
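
As a quick sanity check on what a 70:30 split produces, here is a minimal sketch. It uses hypothetical stand-in arrays (X_demo, y_demo) with 200 rows, matching the row count reported by df.info(), rather than the actual dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 3 columns (not the real dataset)
X_demo = np.arange(600).reshape(200, 3)
y_demo = np.arange(200)

# Same split settings as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Show how many rows ended up in each part
print(X_tr.shape, X_te.shape)
```

With 200 rows, a test_size of 0.3 leaves 140 rows for training and 60 for testing.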

Now, let's apply a simple Linear Regression model and check its R-squared scores on the train and the test sets.

In [6]:
# Initialize the Linear Regression model
linear_reg = LinearRegression()

In [7]:
# Fit the Linear Regression model to the full training data
linear_reg.fit(X_train, y_train)

LinearRegression()

In [8]:
# R-squared scores for the Linear Regression model
r2_train_linear = linear_reg.score(X_train, y_train)
r2_test_linear = linear_reg.score(X_test, y_test)

In [9]:
# view the r2-scores
(r2_train_linear, r2_test_linear)

(0.9480121596373762, 0.8604514137224497)

As we can see, the simple linear regression model is overfitting: there is a gap of almost 9 percentage points between its performance on the train data (R-squared of about 0.95) and on the test data (about 0.86).
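
The size of that gap can be checked with a couple of lines of arithmetic. This is only a sketch that copies the two scores printed above into literals, so it runs on its own; gap_points is a name introduced here for illustration.

```python
# R-squared scores reported above for the plain linear regression
r2_train_linear = 0.9480121596373762
r2_test_linear = 0.8604514137224497

# Train-test gap, expressed in percentage points
gap_points = (r2_train_linear - r2_test_linear) * 100
print(f"Train-test gap: {gap_points:.1f} percentage points")
```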

With that, let us take a break here. In the next video, we’ll deal with this issue using regularization. We’ll start things off with an introduction to regularization and its types.

# Video 6 - L1 Regularization

<p style = 'color:green'><b>Run all the cells above before you begin</b></p>

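We will start by creating a NumPy array named alphas. It will hold 20 equally spaced values between 0 and 10, including both 0 and 10 as the first and last values respectively. Note that the range of 0 to 10 is chosen here simply as a convention; it is not a requirement of the method. Also keep in mind that alpha = 0 means no regularization at all, which is equivalent to ordinary least squares, and the scikit-learn documentation advises against using alpha = 0 with Lasso for numerical reasons.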

In [10]:
# Create an array of 20 equally spaced numbers between 0 and 10
alphas = np.linspace(0, 10, 20)
alphas

array([ 0.        ,  0.52631579,  1.05263158,  1.57894737,  2.10526316,
        2.63157895,  3.15789474,  3.68421053,  4.21052632,  4.73684211,
        5.26315789,  5.78947368,  6.31578947,  6.84210526,  7.36842105,
        7.89473684,  8.42105263,  8.94736842,  9.47368421, 10.        ])

The alphas array holds the candidate values for the alpha parameter. The alpha parameter controls the strength of L1 regularization: the larger the alpha, the stronger the penalty on the coefficients.
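
For reference, the role of alpha can be written out explicitly. The objective below is the one scikit-learn documents for its Lasso estimator; the symbols (n for the number of training samples, w for the coefficient vector, with the intercept left out for brevity) are notation introduced here for illustration:

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The first term is the usual squared-error fit, and the second is the L1 penalty: the sum of the absolute values of the coefficients, scaled by alpha. Setting alpha to 0 removes the penalty entirely, while larger values shrink the coefficients more strongly and can set some of them to exactly zero.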

Now, let us create a LassoCV object for our model. We will use the following parameters:
- alphas = alphas: this sets the candidate alpha values to the 20 values that we generated in the alphas NumPy array.
- cv = 10: this states that we will use 10-fold cross-validation. It is a recommended practice to use cross-validation to select the hyperparameter.
- max_iter = 100000: this sets the maximum number of iterations for the Lasso regression solver, that is, the upper limit on how many iterations the optimization algorithm may run while searching for the optimal coefficients.


You can save this object in a variable called lasso_cv.

So, in the next cell, let's run this code.

In [11]:
# Initialize LassoCV to find the best alpha
lasso_cv = LassoCV(alphas=alphas,
                   cv=10,
                   max_iter=100000)

And now, let us fit the LassoCV model to the train data. This will help us find the best alpha value. Alpha is the key hyperparameter for Lasso, and the best alpha is the candidate value that gives the lowest average cross-validated error (mean squared error).





In [12]:
# fit the lasso_cv model to our training data
lasso_cv.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent_gram(


LassoCV(alphas=array([ 0.        ,  0.52631579,  1.05263158,  1.57894737,  2.10526316,
        2.63157895,  3.15789474,  3.68421053,  4.21052632,  4.73684211,
        5.26315789,  5.78947368,  6.31578947,  6.84210526,  7.36842105,
        7.89473684,  8.42105263,  8.94736842,  9.47368421, 10.        ]),
        cv=10, max_iter=100000)
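
To make the selection step more concrete, here is a minimal sketch of how the chosen alpha relates to the cross-validation results. It runs on synthetic data from make_regression rather than on our dataset, and the names X_demo, y_demo, demo_alphas and demo_cv are hypothetical. It also assumes a grid that starts slightly above 0, since scikit-learn advises against alpha = 0 for Lasso.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (assumption: not the course dataset)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=10, n_informative=4, noise=10.0, random_state=0
)

# Hypothetical grid that avoids alpha = 0
demo_alphas = np.linspace(0.1, 10, 20)

demo_cv = LassoCV(alphas=demo_alphas, cv=5, max_iter=100000)
demo_cv.fit(X_demo, y_demo)

# mse_path_ has one row per alpha and one column per fold;
# alphas_ holds the grid in the order used for those rows
mean_mse = demo_cv.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print("Alpha with lowest mean CV error:", demo_cv.alphas_[best_idx])
print("Alpha selected by LassoCV:      ", demo_cv.alpha_)
```

The two printed values should agree, since LassoCV picks the alpha whose mean squared error, averaged over the folds, is smallest.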

In [13]:
# Find the best alpha value
best_alpha = lasso_cv.alpha_
best_alpha

0.5263157894736842

We can see the best alpha comes out to be about 0.53. The LassoCV instance automatically refits the model on the full training data using the best alpha. Let's look at the coefficients of the best model.

In [14]:
# Collect the coefficients after L1 regularization
coefficients_after_l1 = pd.DataFrame({
    'Feature': X_train.columns,
    'After L1 Regularization': lasso_cv.coef_
})


In [15]:
coefficients_after_l1

               Feature  After L1 Regularization
0            Wall_Area                 0.299125
1            Roof_Area                 0.245946
2          Window_Area                 0.261297
3       Overall_Height                     -0.0
4  Outdoor_Temperature                -0.072988
5             Humidity                -0.050996
6      Noise_Feature_1                     -0.0
7      Noise_Feature_2                      0.0
8      Noise_Feature_3                     -0.0
9      Noise_Feature_4                      0.0


As we can observe, there are now many features whose coefficients have been reduced to zero. With this, let's construct a new model with alpha equal to best_alpha, using only the features with non-zero coefficients.

In [17]:
# Identify non-zero features from the Lasso model
non_zero_mask = lasso_cv.coef_ != 0
non_zero_features = X_train.columns[non_zero_mask]

# Print the features with non-zero coefficients 
non_zero_features.tolist()

['Wall_Area', 'Roof_Area', 'Window_Area', 'Outdoor_Temperature', 'Humidity']

We get a list of 5 variables whose coefficients are not equal to zero. Based on this, we will create new training and test sets that contain only these 5 variables.

In [18]:
# Reduce training and test sets to non-zero features
X_train_reduced = X_train[non_zero_features]
X_test_reduced = X_test[non_zero_features]

And finally, let's create the Lasso model with the best alpha as its parameter and fit it on the reduced train data.

Here, we're using the Lasso class instead of LassoCV to fit the model because we have already found the best alpha value.

In [19]:
# Retrain the Lasso model on the reduced set of features
lasso_best_reduced = Lasso(alpha=best_alpha)
lasso_best_reduced.fit(X_train_reduced, y_train)

Lasso(alpha=0.5263157894736842)

Now, let us print the R-squared score of the new model for both the train and test data.

In [20]:
# R-squared scores for the model with the reduced feature set
r2_train_reduced = lasso_best_reduced.score(X_train_reduced, y_train)
r2_test_reduced = lasso_best_reduced.score(X_test_reduced, y_test)

In [21]:
# R-squared scores for L1 model
r2_train_reduced, r2_test_reduced

(0.9378427802510159, 0.873573116999379)

And that is it. The unregularized linear regression gave us a score of 0.95 on the train set and 0.86 on the test set, a gap of approximately 9 percentage points. With L1 regularization, not only has the performance on the test set increased by about 1 percentage point (to 0.87), but the gap between train and test has also narrowed to about 6 percentage points. This has reduced the overfitting.

By introducing a penalty proportional to the sum of the absolute values of the coefficients, L1 regularization has helped us with feature selection, identifying the most significant predictors for 'Energy Efficiency Rating'.
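
The feature-selection effect can be reproduced in isolation. The following is a minimal sketch on synthetic data, not on our dataset: the array X_demo, the true coefficients, and the chosen alpha values are all assumptions made up for illustration. Only the first three columns influence the target; the remaining five are pure noise.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (assumption): 3 informative columns followed by 5 noise columns
X_demo = rng.normal(size=(200, 8))
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + 1.5 * X_demo[:, 2]
    + rng.normal(scale=0.5, size=200)
)

# Count the non-zero coefficients at increasing regularization strengths
nonzero_counts = {}
for alpha in [0.01, 0.1, 0.5, 1.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} non-zero coefficients")
```

In a setup like this, the coefficients of the noise columns are the ones driven to exactly zero as alpha grows, which mirrors what we saw with the Noise_Feature columns above.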

But our model is still overfitting and cannot be left as it is. 
This is where L2 regularization, or Ridge Regression, can help us. In our next video, we will explore L2 regularization, and try to overcome the overfitting of the model.
Stay tuned, see you there!
