# Machine Learning Foundation

## Section 2, Part e: Regularization LAB


## Learning objectives

By the end of this lesson, you will be able to:

*   Implement data standardization
*   Implement variants of regularized regression
*   Combine data standardization with the train-test split procedure
*   Implement regularization to prevent overfitting in regression problems


In [1]:
# import piplite
# await piplite.install(['tqdm', 'seaborn', 'pandas', 'numpy'])



In [2]:
%pip install wget
import wget

# from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

np.set_printoptions(precision=3, suppress=True)

Note: you may need to restart the kernel to use updated packages.


In the following cell we set the random seed and define some helper and plotting functions. The data itself is loaded further below.


In [3]:
np.random.seed(72018)



def to_2d(array):
    return array.reshape(array.shape[0], -1)


    
def plot_exponential_data():
    data = np.exp(np.random.normal(size=1000))
    plt.hist(data)
    plt.show()
    return data
    
def plot_square_normal_data():
    data = np.square(np.random.normal(loc=5, size=1000))
    plt.hist(data)
    plt.show()
    return data

### Loading in Boston Data


In [4]:
# from pyodide.http import pyfetch
 
# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())
# path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML240EN-SkillsNetwork/labs/data/boston_housing_clean.pickle"
 
#you will need to download the dataset; if you are running locally, please comment out the following 
# await download(path, "boston_housing_clean.pickle")
 
 
# Import pandas library
# import pandas as pd

In [5]:
from pathlib import Path

data_path = Path('boston_housing_clean.pickle')

if not data_path.is_file():
    url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML240EN-SkillsNetwork/labs/data/boston_housing_clean.pickle"
    downloaded_file = wget.download(url)

In [6]:
with open('boston_housing_clean.pickle', 'rb') as to_read:
    boston = pd.read_pickle(to_read)
boston_data = boston['dataframe']
boston_description = boston['description']

# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
boston_data.head()

The first 5 rows of the dataframe


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [7]:
boston_description



## Data standardization


**Standardizing** data refers to rescaling each variable so that it has mean 0 and standard deviation 1, the same mean and standard deviation as a **standard** normal distribution.

To do this, we subtract each column's mean and divide by that column's standard deviation. This shifts and rescales the values; it does not change the shape of their distribution.
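As a compact way to write this down, here is the transformation in symbols. The notation is introduced here only for illustration: $x_{ij}$ is the value of feature $j$ for row $i$, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of column $j$.

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$

After this transformation, each column of $z$ has mean 0 and standard deviation 1.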

The [`StandardScaler`](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01#sklearn.preprocessing.StandardScaler) object in SciKit Learn can do this.


**Generate X and y**:

First we generate `X` and `y`, our features and our outcome variable. We set `y_col` to the string `"MEDV"`, which is the median home value. We then drop that column from the Boston data to get `X`, our feature variables, and select it on its own to get `y`, our outcome variable.


In [8]:
y_col = "MEDV"

X = boston_data.drop(y_col, axis=1)
y = boston_data[y_col]

**Import, fit, and transform using `StandardScaler`**

Next we import `StandardScaler` and call `fit_transform` on `X`. As always, we start by creating the object, `s = StandardScaler()`, and then call `s.fit_transform(X)`. Keep in mind that with a train/test split we would do this a bit differently; we discuss that in a few more cells.


In [9]:
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
X_ss = s.fit_transform(X)

### Exercise:

**Confirm standard scaling**

First, we want to confirm that the standard scaling works as expected by recreating the standardization ourselves with NumPy. As a hint, take the array `a` below, which has two rows and three columns. Calling `a.mean` with an `axis` argument gives the mean of each column (`axis=0`) or of each row (`axis=1`). Because we standardize each column, we want the mean and the standard deviation of each column. Both are available as methods on NumPy arrays: `.mean` and `.std`.

In [10]:
#Hint:

a = np.array([[1, 2, 3], 
              [4, 5, 6]]) 
print(a) # 2 rows, 3 columns

[[1 2 3]
 [4 5 6]]


In [11]:
a.mean(axis=0) # mean along the *columns*

array([2.5, 3.5, 4.5])

In [12]:
a.mean(axis=1) # mean along the *rows*

array([2., 5.])

We set `X2` to the array version of `X`, our pandas DataFrame of features, and compute a manual transformation called `man_transform`: from `X2` we subtract the column means (`axis=0`) and divide by the column standard deviations. Looking back at the hint, this subtracts each column's mean from that column and divides by that column's standard deviation, which is all that `StandardScaler` does under the hood.

We then check that the two results agree. `X_ss` is the version of our DataFrame scaled with `StandardScaler` above, and `man_transform` is the array we scaled manually. `np.allclose` tells us whether every pair of corresponding values in two arrays is equal within a small tolerance, so tiny differences from floating-point rounding (say, in the tenth decimal place) are ignored. That is useful whenever we reach the same result by different means and want to confirm that the outputs match. Calling `np.allclose` here returns `True`.

In [13]:
### BEGIN SOLUTION
X2 = np.array(X)
man_transform = (X2-X2.mean(axis=0))/X2.std(axis=0)
np.allclose(man_transform, X_ss)
### END SOLUTION

True
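The following is a small sketch, using the hint array rather than the Boston data, of two details behind this check. First, `np.allclose` tolerates floating-point rounding where an exact `==` comparison does not. Second, the match depends on using the population standard deviation (`ddof=0`, the NumPy default); the `ddof=1` variant is shown only for contrast. All variable names here are illustrative.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Manual standardization with the population standard deviation (ddof=0, NumPy's default)
z_population = (a - a.mean(axis=0)) / a.std(axis=0)

# Same formula with the sample standard deviation (ddof=1), for contrast
z_sample = (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

# Exact equality is fragile with floats; allclose compares within a tolerance
exactly_equal = bool((np.array([0.1 + 0.2]) == np.array([0.3])).all())
close_enough = bool(np.allclose(np.array([0.1 + 0.2]), np.array([0.3])))

print(z_population)
print("ddof=0 and ddof=1 agree:", np.allclose(z_population, z_sample))
print("exact equality:", exactly_equal, "| allclose:", close_enough)
```

If a manual check like the exercise ever fails, the `ddof` setting is one of the first things worth checking, since pandas' own `.std()` defaults to `ddof=1`.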

### Coefficients with and without scaling

Now that we have both the scaled and the non-scaled versions of our data, we create a linear regression object and fit it to both versions of `X`. First we import `LinearRegression`, instantiate it, and reset `X` and `y` to their original form.

In [14]:
from sklearn.linear_model import LinearRegression

In [15]:
lr = LinearRegression()

y_col = "MEDV"

X = boston_data.drop(y_col, axis=1)
y = boston_data[y_col]

Then we call `lr.fit(X, y)`, where `lr` is our linear regression object, and look at the coefficients it has learned. The coefficients are on wildly different scales, which raises the question below: is this a bad thing?

In [16]:
lr.fit(X, y)
print(lr.coef_) # min = -18

[ -0.107   0.046   0.021   2.689 -17.796   3.805   0.001  -1.476   0.306
  -0.012  -0.953   0.009  -0.525]


#### Discussion (together):

The coefficients are on widely different scales. Is this "bad"?

It's not necessarily a bad thing. For interpretability, the unscaled coefficients tell us directly how much a one-unit change in each column changes the outcome variable. For the first coefficient, a one-unit increase in that feature subtracts about 0.107 from the predicted median home value. What we cannot tell from these coefficients is which feature has the largest effect on the outcome, that is, which feature is most important, because each coefficient depends on the scale of its original feature, as we've discussed in prior lectures.
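To make the dependence on feature scale concrete, here is a minimal sketch on synthetic data (not the Boston data). The "meters" and "kilometers" labels and the helper `fitted_slope` are hypothetical, chosen only for illustration. Expressing the same feature in units 1000 times larger multiplies its fitted coefficient by 1000, even though the relationship is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x_meters = rng.normal(size=200)                        # hypothetical feature in meters
y = 3.0 * x_meters + rng.normal(scale=0.1, size=200)   # outcome with a true slope of 3

def fitted_slope(x, y):
    """Least-squares slope of y on x, with an intercept column."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0]

slope_m = fitted_slope(x_meters, y)
slope_km = fitted_slope(x_meters / 1000.0, y)          # same feature, in kilometers

print("slope per meter:    ", slope_m)
print("slope per kilometer:", slope_km)
```

The two fits make identical predictions; only the units of the coefficient differ, which is why raw coefficients cannot be compared across features.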

#### Scale Our Data

So now let's scale our data. As before, we use `StandardScaler`, and `X_ss` is the transformed version of our features. We create a new linear regression object, `lr2`, fit it, and print the coefficients. Now they are all on the same scale, so a coefficient with a larger absolute value, positive or negative, has a larger impact on the outcome variable. Each coefficient tells us how much the predicted median home value changes when that feature varies by one standard deviation.

In [17]:
from sklearn.preprocessing import StandardScaler

In [18]:
s = StandardScaler()
X_ss = s.fit_transform(X)

In [19]:
lr2 = LinearRegression()
lr2.fit(X_ss, y)
print(lr2.coef_) # coefficients now "on the same scale"

[-0.92   1.081  0.143  0.682 -2.06   2.671  0.021 -3.104  2.659 -2.076
 -2.062  0.857 -3.749]


### Exercise:

Based on these results, what is the most "impactful" feature (this is intended to be slightly ambiguous)? "In what direction" does it affect "y"?

**Hint:** Recall from last week that we can "zip up" the names of the features of a DataFrame `df` with a model `model` fitted on that DataFrame using:

```python
dict(zip(df.columns.values, model.coef_))
```


Using what we have here, we want to find the most impactful feature. The coefficients do not come with feature names attached, so we zip together `X.columns` and the coefficients to align them, and put the pairs into a DataFrame. We see that LSTAT, which measures the lower status of the area, has the most negative coefficient, and RM, the number of rooms, has the largest positive coefficient. That makes sense: the more rooms, the higher the median home value, and the lower the status of the surrounding area, the lower the median home value. That's the idea behind looking at the magnitude of each coefficient.

In [20]:
### BEGIN SOLUTION
pd.DataFrame(zip(X.columns, lr2.coef_)).sort_values(by=1)
### END SOLUTION

Unnamed: 0,0,1
12,LSTAT,-3.74868
7,DIS,-3.104448
9,TAX,-2.075898
10,PTRATIO,-2.062156
4,NOX,-2.060092
0,CRIM,-0.920411
6,AGE,0.021121
2,INDUS,0.142967
3,CHAS,0.682203
11,B,0.85664


Looking just at the magnitude of the standardized coefficients, LSTAT, DIS, RM, and RAD are the most "impactful". scikit-learn does not provide built-in statistical significance for these variables, which would help make this claim stronger or weaker.
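One way to rank features by impact regardless of sign is to sort on the absolute value of the coefficient. The sketch below hard-codes the rounded `lr2.coef_` values printed above, paired with the feature columns in DataFrame order, so it runs without the dataset. The variable names are illustrative.

```python
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Rounded standardized coefficients, copied from the printed lr2.coef_ output
standardized_coefs = [-0.92, 1.081, 0.143, 0.682, -2.06, 2.671, 0.021,
                      -3.104, 2.659, -2.076, -2.062, 0.857, -3.749]

# Sort by absolute value, largest first, keeping the sign for direction
ranked = sorted(zip(feature_names, standardized_coefs),
                key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked[:4]:
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:>6}: {coef:+.3f} ({direction} y)")
```

With a fitted model at hand, the same ranking could be obtained by sorting the zipped DataFrame on the absolute value of its coefficient column.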


### Lasso with and without scaling


We discussed Lasso in lecture.

Let's review together:

1.  What is different about Lasso vs. regular Linear Regression?

The difference is in the cost function: Lasso adds the absolute values of the coefficients to the cost function, which limits the size of each coefficient.

Cost Function: Lasso adds a penalty to the cost function based on the absolute values of the coefficients, which helps limit their size.

2.  Is standardization more or less important with Lasso vs. Linear Regression? Why?

Linear regression does not penalize the size of the coefficients, whereas Lasso adds a larger penalty the larger the coefficients are. The scale of each feature affects the scale of its coefficient. If we do not put the features on the same scale first, a coefficient can be larger or smaller, and therefore penalized more or less, purely because of the scale of its feature. That makes standardization much more important for Lasso.

Standardization: It's crucial to standardize features before applying Lasso, as the scale of features affects the coefficients. Without standardization, coefficients can vary significantly, leading to inconsistent penalization.
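For reference, the cost function described in the two answers above can be written as follows. This is the form given in the scikit-learn `Lasso` documentation, where $n$ is the number of samples, $w$ is the coefficient vector, and $\alpha$ is the regularization strength:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_j \lvert w_j \rvert
$$

Setting $\alpha = 0$ recovers the ordinary least-squares objective. Because the penalty acts on $\lvert w_j \rvert$ directly, a feature measured in small units (which needs a large coefficient) is penalized more than the same feature measured in large units, which is why the features should be standardized first.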

In [21]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

#### Create polynomial features


[`PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)

We instantiate a `PolynomialFeatures` object with `degree=2`, as we have seen before. Setting `include_bias=False` means we do not add the intercept term, which would be a column of ones with its own learned coefficient. We don't need it because Lasso fits an intercept automatically. We then call `fit_transform` on our original `X` to get the polynomial-transformed version.

In [22]:
pf = PolynomialFeatures(degree=2, include_bias=False,)
X_pf = pf.fit_transform(X)

**Note:** We use `include_bias=False` since `Lasso` includes a bias by default.
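To get a feel for how many columns a degree-2 expansion produces, here is a plain-Python sketch that does not call scikit-learn. The helper `poly2_terms` is hypothetical and only enumerates term names: every original feature plus every product of two features, squares included. For the 13 Boston features this gives 104 columns, which matches the number of coefficients the models report later in this lab.

```python
from itertools import combinations_with_replacement
from math import comb

def poly2_terms(names):
    """Names of the degree-1 and degree-2 terms, without a bias column."""
    linear = list(names)
    quadratic = ["*".join(pair) for pair in combinations_with_replacement(names, 2)]
    return linear + quadratic

small_example = poly2_terms(["a", "b"])
print(small_example)

# Closed form: all monomials of degree <= 2 in n variables, minus the constant term
n_features = 13
n_poly_features = comb(n_features + 2, 2) - 1
print(n_poly_features)
```

The count grows quadratically with the number of features, which is part of why regularization becomes useful once polynomial terms are added.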


Then we standardize by calling `fit_transform` on the polynomial features with the `StandardScaler` we defined earlier as `s`, and store the result as `X_pf_ss`.

In [23]:
X_pf_ss = s.fit_transform(X_pf)

### Lasso

Model Fitting: The lecture discusses fitting a Lasso model using polynomial features and standard scaling. The default alpha value is set to 1.0, which results in high regularization and many coefficients being zeroed out.

[`Lasso` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)

We then use the default version of Lasso, without passing any hyperparameters. We set `las` to this default Lasso, call `fit` on the polynomial-transformed, standard-scaled `X` with our `y`, and look at the coefficients it comes up with. By default, `alpha` is 1.0. In this case that is a fairly high value, and it zeroes out most of our coefficients. A higher `alpha` means more regularization, which means a simpler model.

In [24]:
las = Lasso()
las.fit(X_pf_ss, y)
las.coef_ 

array([-0.   ,  0.   , -0.   ,  0.   , -0.   ,  0.   , -0.   , -0.   ,
       -0.   , -0.   , -0.991,  0.   , -0.   , -0.   ,  0.   , -0.   ,
        0.068, -0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,
       -0.   , -0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   , -0.   ,  0.   ,
       -0.   , -0.   , -0.   , -0.05 , -0.   , -0.   , -0.   , -0.   ,
       -0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,
       -0.   , -0.   ,  0.   , -0.   ,  3.3  , -0.   , -0.   , -0.   ,
       -0.   , -0.   ,  0.42 , -3.498, -0.   , -0.   , -0.   , -0.   ,
       -0.   ,  0.   , -0.   , -0.   , -0.   , -0.146, -0.   , -0.   ,
       -0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,
       -0.   , -0.   , -0.   ,  0.   , -0.   ,  0.   , -0.   , -0.   ])

### Exercise

Compare

*   Sum of magnitudes of the coefficients:

    We compare the sum of the magnitudes of the coefficients, that is, the sum of their absolute values. The larger that sum, the more complex the model.

*   Number of coefficients that are zero:

    We also count the coefficients that are equal to zero to see how many of our terms are zeroed out.

for Lasso with alpha 0.1 vs. 1.
We do this for Lasso with `alpha=0.1`, a smaller value, versus `alpha=1`, a larger value with more regularization that should therefore give a less complex model.

Magnitude and Coefficients: The magnitude of coefficients and the count of non-zero coefficients are compared for different alpha values (0.1 vs. 1.0). Higher alpha leads to lower magnitude and more zeros.

Before doing the exercise, answer the following questions in one sentence each:

*   Which do you expect to have greater magnitude?
*   Which do you expect to have more zeros?

With a larger `alpha` we regularize more, so we expect a smaller magnitude and more zeroed-out coefficients for the higher `alpha`.

Now we set `las01` to a Lasso with `alpha=0.1` rather than the default of 1, and fit it on the polynomial-transformed, standard-scaled `X` and our `y`. We take the absolute value of each coefficient and sum them to get the magnitude. We also check, for each coefficient, whether it is not equal to zero. That gives an array of `True`/`False` values the same size as the coefficient array. Each `True` counts as one when summed, so the sum is the number of non-zero coefficients. Running this, we get 23 non-zero coefficients and a magnitude of 26.17.

In [25]:
### BEGIN SOLUTION
las01 = Lasso(alpha = 0.1)
las01.fit(X_pf_ss, y)
print('sum of coefficients:', abs(las01.coef_).sum() )
print('number of coefficients not equal to 0:', (las01.coef_!=0).sum())

sum of coefficients: 26.17241511542676
number of coefficients not equal to 0: 23


To show what this looks like, the comparison returns an array of `True` and `False` values. Each `True` counts as one, so calling `sum` gives 23, the number of coefficients that are not equal to zero.

In [26]:
las01.coef_!=0

array([False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False,  True, False,  True, False,
        True, False, False, False, False, False,  True,  True,  True,
        True,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False,  True, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False,  True, False,  True,
       False, False, False, False, False,  True, False, False, False,
       False,  True, False,  True, False, False, False, False, False,
       False, False, False, False,  True, False, False,  True, False,
       False, False,  True,  True, False, False, False,  True, False,
       False, False, False, False,  True])

Here we set `alpha` to 1, so we have a higher `alpha` and more regularization. We run the same steps and, as expected, get a lower magnitude of 8.47 and fewer non-zero coefficients: only seven.

In [27]:
las1 = Lasso(alpha = 1)
las1.fit(X_pf_ss, y)
print('sum of coefficients:',abs(las1.coef_).sum() )
print('number of coefficients not equal to 0:',(las1.coef_!=0).sum())
### END SOLUTION

sum of coefficients: 8.472405227760158
number of coefficients not equal to 0: 7


Again, more regularization meant lower magnitude and more zeros, that is, fewer non-zero coefficients.

With more regularization (higher alpha) we will expect the penalty for higher weights to be greater and thus the coefficients to be pushed down. Thus a higher alpha means lower magnitude with more coefficients pushed down to 0.
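The same comparison can be run over several `alpha` values in one loop. The sketch below uses synthetic data rather than the Boston data, with only three of ten features actually related to the target; all names and the chosen `alpha` grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])  # only 3 informative features
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)
X_demo_s = StandardScaler().fit_transform(X_demo)

summary = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=100000).fit(X_demo_s, y_demo)
    magnitude = np.abs(model.coef_).sum()          # sum of absolute coefficient values
    n_nonzero = int((model.coef_ != 0).sum())      # count of non-zero coefficients
    summary[alpha] = (magnitude, n_nonzero)
    print(f"alpha={alpha:<5} magnitude={magnitude:7.3f} non-zero={n_nonzero}")
```

On this synthetic data the magnitude shrinks as `alpha` grows, and for a large enough `alpha` every coefficient is pushed to zero.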


### Exercise: $R^2$

Model Performance: Regularization generally reduces model complexity, which can lower performance on the training set but is intended to improve performance on unseen data.

Calculate the $R^2$ of each model without train/test split.

Recall that we import $R^2$ using:

```python
from sklearn.metrics import r2_score
```


Now, let's calculate the `r2_score` of our model. We import `r2_score`, then run it on `y` and the predicted values for `X_pf_ss`. Here `las` is the Lasso defined above with the default `alpha` of 1.0, which is fairly high regularization, and we see that the $R^2$ score is about 0.72.

In [28]:
### BEGIN SOLUTION
from sklearn.metrics import r2_score
r2_score(y,las.predict(X_pf_ss))
### END SOLUTION

0.7207000461229028

#### Discuss:

Will regularization ever increase model performance if we evaluate on the same dataset that we trained on?

The idea of regularization is to reduce complexity so that we don't overfit to our training set. But by guarding against overfitting, we also reduce how well we can fit the training set itself. So if we evaluate on the same data we trained on, increasing regularization will not raise $R^2$; we should expect it to go down. The key with regularization is how well the model performs on holdout sets, on new data coming in, which is generally the problem you are trying to solve. You'll never be trying to predict values you already have available. With that in mind, we want to introduce `train_test_split`.
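A small sketch on synthetic data (not the Boston data) illustrates the point. Ordinary least squares minimizes the training error by construction, so on the data it was fit to its $R^2$ is at least as high as that of any Lasso fit, and the Lasso training $R^2$ drops as `alpha` increases. The data and names below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

# R^2 evaluated on the same data the model was trained on
ols = LinearRegression().fit(X_demo, y_demo)
train_r2 = {"ols": r2_score(y_demo, ols.predict(X_demo))}

for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    train_r2[alpha] = r2_score(y_demo, model.predict(X_demo))

for name, score in train_r2.items():
    print(name, round(score, 4))
```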

## With train/test split

Train-Test Split: Splitting the data into training and testing sets is essential, and the scaler must be fit only on the training set before it is used to transform the test set.

#### Discuss

Are there any issues with what we've done so far?

**Hint:** Think about the way we have done feature scaling.

Discuss in groups of two or three.


In [29]:
from sklearn.model_selection import train_test_split

We call `train_test_split` and pass in `X_pf` and `y`. We do not pass the scaled version because we may only fit the scaler on the training set; we then use that fit to transform the test set. So we use the polynomial transform but not the `StandardScaler` transform to get `X_train`, `X_test`, `y_train`, and `y_test`, holding out 30 percent for testing.

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, 
                                                    random_state=72018)

Next we run `fit_transform` and `fit`. We call `fit_transform` only on the training set, and then fit the `las` model we defined earlier on the standardized `X_train_s` and `y_train`. We then just transform `X_test` using the `StandardScaler` that was fit to `X_train`, make predictions on the standardized `X_test_s`, and compute the `r2_score` on the holdout set. Running this, we see an `r2_score` of about 0.68.

In [31]:
X_train_s = s.fit_transform(X_train)
las.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred = las.predict(X_test_s)
r2_score(y_test, y_pred)

0.6780325981174931
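The fit-on-train, transform-on-test pattern can also be bundled with scikit-learn's `make_pipeline`, which fits the scaler only on the data passed to `fit`. This is offered as an alternative sketch, not as the approach this lab uses. It runs on synthetic data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Three synthetic features on very different scales
X_demo = rng.normal(loc=10.0, scale=[1.0, 50.0, 0.1], size=(150, 3))
y_demo = X_demo @ np.array([1.0, 0.02, 5.0]) + rng.normal(scale=0.3, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Manual version: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler().fit(X_tr)
manual_model = Lasso(alpha=0.1).fit(scaler.transform(X_tr), y_tr)
manual_pred = manual_model.predict(scaler.transform(X_te))

# Pipeline version: the same two steps, applied in order by fit and predict
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

print("same predictions:", np.allclose(manual_pred, pipe_pred))
```

The pipeline makes it harder to accidentally call `fit_transform` on the test set, because scaling happens inside `fit` and `predict`.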

Now, let's repeat the same steps for an `alpha` of 0.1. We run this and see a higher `r2_score`. Of these two models, the one above (`alpha=1`) had higher regularization and this one has lower regularization. So what was the problem with the model above? It included too much regularization: it had too much bias and could not capture the underlying relationship between `X` and `y`. We simplified the model too much and reduced its complexity too far. Again, the goal is to find the right balance.

In [32]:
X_train_s = s.fit_transform(X_train)
las01.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred = las01.predict(X_test_s)
r2_score(y_test, y_pred)

0.799926134284606

Next we will try different values of `alpha` to home in on the value that gives the highest `r2_score`, or the best value of whatever metric we choose, on our holdout set.
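A sketch of such a search, assuming synthetic data and an arbitrary grid of `alpha` values (neither comes from this lab), could look like the following. The scaler is fit on the training split only, and each model is scored on the holdout split.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 20))
coef = np.zeros(20)
coef[:4] = [3.0, -2.0, 1.5, 1.0]            # only 4 of the 20 features matter
y_demo = X_demo @ coef + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)          # fit on the training split only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

alphas = [0.001, 0.01, 0.1, 1.0]
holdout_r2 = {}
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=100000).fit(X_tr_s, y_tr)
    holdout_r2[alpha] = r2_score(y_te, model.predict(X_te_s))
    print(f"alpha={alpha:<6} holdout R^2={holdout_r2[alpha]:.4f}")

best_alpha = max(holdout_r2, key=holdout_r2.get)
print("best alpha on this holdout set:", best_alpha)
```

Picking `alpha` on the same holdout set that is later used to report performance gives an optimistic estimate, so in practice a separate validation set or cross-validation is usually preferred for the search.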

### Exercise

In this section, we look at another `alpha` value for Lasso, see how well it performs on a holdout set, and compare it to linear regression. We check the $R^2$ score for each to see which is higher, and in Part 3 we look at which model has the smaller magnitude of coefficients and which has fewer non-zero coefficients. You should already expect Lasso to reduce the magnitude of the coefficients, as well as the number of non-zero coefficients in the resulting model.

#### Part 1:

Do the same thing with Lasso of:

*   `alpha` of 0.001
*   Increase `max_iter` to 100000 to ensure convergence.

Calculate the $R^2$ of the model.

Feel free to copy-paste code from above, but write a one sentence comment above each line of code explaining why you're doing what you're doing.

#### Part 2:

Do the same procedure as before, but with Linear Regression.

Calculate the $R^2$ of this model.

#### Part 3:

Compare the sums of the absolute values of the coefficients for both models, as well as the number of coefficients that are zero. Based on these measures, which model is a "simpler" description of the relationship between the features and the target?


#### Solution

##### Part 1:
First we instantiate our object: `las001` is a Lasso regression with `alpha=0.001`, and we increase `max_iter` from its default. With a smaller `alpha`, the solver may not reach the optimal solution within the default number of iterations. We briefly touched on the fact that Lasso takes longer to fit because it is solved iteratively (scikit-learn's `Lasso` uses coordinate descent), stepping toward the optimal solution. We want to allow enough iterations to get there. Otherwise the solver can stop short without converging, in which case you will probably get a convergence warning.

We set `X_train_s` to the result of `s.fit_transform(X_train)`, which fits the standard scaler on the training set, and fit our new Lasso regression on `X_train_s` and `y_train`. We then transform `X_test` using `s.transform`, not `fit_transform`, and use the result as input to `predict` to get predictions on the test set. Finally we compute the $R^2$ score of those predictions against `y_test` and print it; you'll see the result once we run all the code here.

##### Part 2:
Part 2 is to do the same thing for linear regression. We instantiate a `LinearRegression` object and fit it to the training set using the standardized `X_train_s`. We then predict using the standardized test set, `X_test_s`, that we defined in Part 1. Finally we print the $R^2$ score for linear regression, computed from `y_test` and `y_pred_lr`.

##### Part 3:
We then use the same code as above to get the magnitude of the coefficients, by summing their absolute values, and to count how many are non-zero. We do this first for Lasso and then for linear regression.

In [33]:
### BEGIN SOLUTION

# Part 1

# Decreasing regularization and ensuring convergence
las001 = Lasso(alpha = 0.001, max_iter=100000)

# Transforming training set to get standardized units
X_train_s = s.fit_transform(X_train)

# Fitting model to training set
las001.fit(X_train_s, y_train)

# Transforming test set using the parameters defined from training set
X_test_s = s.transform(X_test)

# Finding prediction on test set
y_pred = las001.predict(X_test_s)

# Calculating r2 score
print("r2 score for Lasso, alpha = 0.001:", r2_score(y_test, y_pred))


# Part 2

# Using vanilla Linear Regression
lr = LinearRegression()

# Fitting model to training set
lr.fit(X_train_s, y_train)

# predicting on test set
y_pred_lr = lr.predict(X_test_s)

# Calculating r2 score
print("r2 score for Linear Regression:", r2_score(y_test,y_pred_lr))


# Part 3
print('Magnitude of Lasso coefficients:', abs(las001.coef_).sum())
print('Number of coefficients not equal to 0 for Lasso:', (las001.coef_!=0).sum())

print('Magnitude of Linear Regression coefficients:', abs(lr.coef_).sum())
print('Number of coefficients not equal to 0 for Linear Regression:', (lr.coef_!=0).sum())
### END SOLUTION

r2 score for Lasso, alpha = 0.001: 0.8847893236874363
r2 score for Linear Regression: 0.8689110469231008
Magnitude of Lasso coefficients: 435.572322904404
Number of coefficients not equal to 0 for Lasso: 90
Magnitude of Linear Regression coefficients: 1183.8918138649396
Number of coefficients not equal to 0 for Linear Regression: 104


So let's look at these results. The Lasso fit iterates, so it takes a second to run. The $R^2$ score for Lasso was 0.885 and the $R^2$ score for linear regression was 0.869, so on the holdout set Lasso explained more of the variation. The magnitude of the Lasso coefficients is about 436, compared to about 1,184 for linear regression, which is nearly triple. Lasso has 90 non-zero coefficients, versus 104 for linear regression. So we were able to reduce the complexity of the model, with fewer coefficients and a smaller total magnitude, and still come up with a solution that does a better job of explaining variation on our holdout set. That is the idea of bringing down the variance of our model so that it generalizes better to new data.

## L1 vs. L2 Regularization


As mentioned in the deck: `Lasso` and `Ridge` regression have the same syntax in SciKit Learn.

Now we're going to compare the results from Ridge vs. Lasso regression:


[`Ridge`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [35]:
from sklearn.linear_model import Ridge

### Exercise

Now, let's quickly introduce the same concept using Ridge regression. We saw earlier that Lasso and Ridge have a similar formulation, so we just import the `Ridge` object and set its alpha to 0.001, the same alpha we used for Lasso. That alpha doesn't necessarily translate: the same alpha does not imply the same amount of regularization, because Ridge penalizes the squared coefficients rather than their absolute values. It can therefore have a different effect, and we may want to loop through different values of alpha, but here we just look at 0.001. We do the same `fit_transform` with our `StandardScaler` on the training set, fit our Ridge model, which is just `r`, to `X_train_s` and `y_train`, build `X_test_s` with `s.transform`, and then come up with predictions using `r` on the test set.
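To see why the same alpha need not mean the same regularization strength, it can help to write the two objectives side by side. The forms below follow the objectives given in the scikit-learn documentation for `Ridge` and `Lasso`; the symbols ($n$ samples, coefficient vector $w$) are notation chosen here rather than taken from the lab.

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad\qquad
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Besides the different penalty (squared versus absolute coefficients), the Lasso loss is divided by $2n$ while the Ridge loss is not, so a given $\alpha$ carries much more weight relative to the data term in Lasso than in Ridge.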

Following the Ridge documentation from above:

1.  Define a Ridge object `r` with the same `alpha` as `las001`.
2.  Fit that object on the scaled training data (`X_train_s` and `y_train`) and print out the resulting coefficients.


In [36]:
### BEGIN SOLUTION
# Ridge with the same alpha as las001
r = Ridge(alpha = 0.001)
X_train_s = s.fit_transform(X_train)
r.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_r = r.predict(X_test_s)

# Displaying the fitted coefficients
r.coef_
### END SOLUTION

array([  7.692,   9.889, -25.005,   5.31 ,  -2.61 ,  14.958,  22.314,
       -22.858,  27.765,  -1.565,  17.077,  21.858,  11.574,   1.05 ,
         0.431,  13.786,   1.814,  -8.345,   4.994,  -3.564,  -3.436,
       -16.336,  -7.047,   6.609,  -1.477,   4.686,  -1.304,  -0.059,
        -0.304, -12.842,   1.977,   1.081,  -0.676,  -1.079,   4.492,
        -4.267,   4.675,  -1.281,   8.666,  -0.274,  -8.12 ,  11.783,
         6.538,   1.323,   2.06 ,   0.899,   1.789,   4.744,  -4.664,
         5.31 ,  -3.236,  -8.668,   0.973,   1.136,   0.29 ,  -1.631,
        -2.926,   2.923,  -0.735,  11.896,   0.754,  -7.531,  18.306,
       -22.168,  35.651, -22.381,  -7.378,   1.732,   4.068, -12.254,
        -3.721,  -5.537, -16.169,  -5.872,  -4.515,  -9.903,   0.942,
        -0.095,  17.18 , -14.268,  -2.904,  -2.824,  -5.418,  11.908,
         0.065,  -4.433,  -5.679,  -3.515,   0.822, -29.082,  49.345,
       -21.72 ,  -1.428,  -8.864, -15.682,  11.982,  -0.818,   2.924,
        -0.701,  -4.

We see that we have all of these coefficients for r.

As a reminder, we're working with the same alpha as above. Displaying the `las001` object shows its alpha value.

In [37]:
las001 # same alpha as Ridge above

Then we can look at the Lasso coefficients, and we see that we have a ton more zeros. Those zeros come from the L1 penalty that Lasso uses. Lasso will almost always zero out more coefficients than Ridge, which shrinks coefficients toward zero but rarely makes them exactly zero.

In [38]:
las001.coef_

array([  0.   ,   0.   , -17.008,   2.592,   0.   ,  13.431,  10.062,
       -19.542,   9.237,   0.   ,   6.159,  17.087,  11.486,   1.208,
         0.219,  10.827,   2.185,  -7.106,   4.312,  -1.526,  -2.092,
        -9.669,   0.   ,   0.   ,  -1.177,   3.865,   0.378,   0.194,
        -0.296,  -3.478,   0.291,   0.72 ,  -0.795,  -0.74 ,   2.407,
        -0.892,   2.835,  -0.984,   3.864,  -0.963,   6.914,   6.2  ,
         4.197,   0.892,   2.028,   2.608,  -3.975,   2.574,  -4.568,
         2.086,  -2.   ,  -7.412,   1.604,   1.676,  -1.248,  -0.   ,
        -0.   ,   2.576,  -0.851,   3.206,  -0.   ,  -2.429,   9.276,
        -7.056,   0.   ,  -5.955,  -5.326,   0.531,   3.344,  -7.575,
        -0.   ,  -6.762, -10.2  ,  -6.428,  -2.835,  -9.561,   0.288,
         0.549,  11.094,  -6.731,  -1.161,  -2.301,  -3.555,   9.12 ,
        -0.014,  -2.384,  -2.042,  -1.691,   0.798, -20.822,  27.09 ,
         0.   ,  -1.527,  -6.793,  -5.084,  15.577,   0.   ,  -0.   ,
         0.569,  -3.

We're then going to look at both the magnitude of our coefficients as well as the number that are zeroed out.

In [39]:
print(np.sum(np.abs(r.coef_)))
print(np.sum(np.abs(las001.coef_)))

print(np.sum(r.coef_ != 0))
print(np.sum(las001.coef_ != 0))

792.867375512679
435.572322904404
104
90


As expected, Ridge kept more non-zero values. In fact, Ridge didn't zero out any coefficients (all 104 are non-zero), while Lasso left only 90 non-zero. The magnitude was also higher for Ridge (about 793 versus about 436 for Lasso), so Lasso gives us the simpler model. Here Lasso is providing stronger overall regularization than Ridge for this particular alpha. Again, this is not necessarily true in general; it depends on your features. But we can test it using the methods above.

**Conclusion:** Ridge does not make any coefficients 0. In addition, on this particular dataset, Lasso provides stronger overall regularization than Ridge for this value of `alpha` (not necessarily true in general).
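One way to build intuition for why Lasso produces exact zeros while Ridge does not is to look at a one-coefficient version of each problem. The sketch below is a simplification, not what scikit-learn runs on the lab data: for a single coefficient with unpenalized estimate `z`, it returns the minimizer of `0.5 * (w - z)**2 + alpha * |w|` (L1) and of `0.5 * (w - z)**2 + 0.5 * alpha * w**2` (L2). The function names are hypothetical.

```python
import numpy as np


def lasso_1d(z, alpha):
    """Minimizer of 0.5 * (w - z)**2 + alpha * |w|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


def ridge_1d(z, alpha):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * alpha * w**2: proportional shrinkage."""
    return z / (1.0 + alpha)


# Unpenalized estimates (illustrative values only)
z = np.array([0.5, -0.2, 2.0, -3.0])

w_l1 = lasso_1d(z, alpha=0.5)
w_l2 = ridge_1d(z, alpha=0.5)

print("L1:", w_l1, "non-zero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "non-zero:", np.count_nonzero(w_l2))
```

The L1 solution sets every estimate with absolute value at or below `alpha` to exactly zero and moves the rest toward zero by `alpha`. The L2 solution rescales every estimate by the same factor, so a non-zero estimate stays non-zero.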


We're then just going to look at the r2 score for each, computed here on the full scaled dataset `X_pf_ss`. The first one is the r2 score for Ridge, and the second one is the r2 score for Lasso. Lasso scores slightly higher, but this will obviously depend on the alphas chosen, and you can iterate through different alphas to see if certain values work better.

In [40]:
y_pred = r.predict(X_pf_ss)
print(r2_score(y, y_pred))

y_pred = las001.predict(X_pf_ss)
print(r2_score(y, y_pred))

0.9076091395029868
0.9103503442034426


**Conclusion**: Ignoring issues of overfitting, Lasso does slightly better than Ridge when `alpha` is set to 0.001 for each (not necessarily true in general).
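Iterating through several alphas, as suggested above, might look like the following. This is a sketch under stated assumptions: the data is synthetic, the alpha grid is arbitrary, and scores are computed on a holdout split rather than on the full dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 15 features, the first 4 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 15))
y_demo = X_demo[:, :4] @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

alphas = [0.001, 0.01, 0.1, 1.0]  # arbitrary grid for illustration
scores = {"Ridge": [], "Lasso": []}
for alpha in alphas:
    for name, Model in (("Ridge", Ridge), ("Lasso", Lasso)):
        model = Model(alpha=alpha).fit(X_tr_s, y_tr)
        scores[name].append(r2_score(y_te, model.predict(X_te_s)))

for name, vals in scores.items():
    print(name, dict(zip(alphas, np.round(vals, 4))))
```

Because the two penalties are on different scales, the alpha that works best for Ridge is usually not the one that works best for Lasso, so each model gets its own sweep.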


# Example: Does it matter when you scale?

Now, does it matter when you scale? We've been homing in on the recommended order, and we've discussed why: first you fit and transform on your training set, then you only transform your test set. Here we check whether the order matters for plain linear regression. In the first version, we take a dataset that was standard-scaled in full (`X_ss`) before splitting, run linear regression, and get an r2 score of about 0.698. That scaling used the same mean and standard deviation, computed from the full dataset, for both the train and test sets. In the second version we do it as recommended: we split first, use only the train set to come up with the mean and standard deviation via `fit_transform`, and then `transform` the test set using those values.

We end up with the same r2 score. The conclusion to take away is that yes, the result is the same for plain vanilla linear regression. But as soon as we work with models where the size of the coefficients enters the cost function, such as regularized models, we may end up with different r2 scores or different predictions depending on when we fit and transform our data.
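The invariance for plain linear regression can be checked directly with NumPy. In this sketch the data is synthetic and `ols_predict` is a hypothetical helper; the point is only that least squares with an intercept gives the same test predictions whichever mean and standard deviation are used, because either scaling is an invertible affine change of the features.

```python
import numpy as np


def ols_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept on the training data, predict on X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


# Synthetic data (an assumption), split into 80 training and 40 test rows.
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=120)
X_tr, X_te, y_tr, y_te = X_demo[:80], X_demo[80:], y_demo[:80], y_demo[80:]

# Option A: statistics from the full dataset (scale, then split).
mu_all, sd_all = X_demo.mean(axis=0), X_demo.std(axis=0)
pred_all = ols_predict((X_tr - mu_all) / sd_all, y_tr, (X_te - mu_all) / sd_all)

# Option B: statistics from the training rows only (split, then scale).
mu_tr, sd_tr = X_tr.mean(axis=0), X_tr.std(axis=0)
pred_tr = ols_predict((X_tr - mu_tr) / sd_tr, y_tr, (X_te - mu_tr) / sd_tr)

print("max difference in predictions:", np.max(np.abs(pred_all - pred_tr)))
```

The printed difference should be at the level of floating-point round-off, which is why the two r2 scores agree for unregularized regression.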

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.3, 
                                                    random_state=72018)

In [42]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
r2_score(y_test, y_pred)

0.6982083583132748

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=72018)

In [44]:
s = StandardScaler()
lr_s = LinearRegression()
X_train_s = s.fit_transform(X_train)
lr_s.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_s = lr_s.predict(X_test_s)
r2_score(y_test, y_pred_s)

0.6982083583132748

To make this clear, imagine we change both models to Lasso regression. Using the entire dataset to compute the mean and standard deviation now gives different predictions than using only the training set. They'll be very similar, because the two sets of means and standard deviations should be fairly close, but they will be different. So again, with most models you want to run `fit_transform` only on your training set, and then just `transform`, not `fit_transform`, on your test set.
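A self-contained version of this comparison, on synthetic data rather than the housing data, might look like the sketch below. The data and `alpha=0.1` are assumptions. Both options use the same `random_state`, so they split the rows identically and only the scaling statistics differ.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption).
rng = np.random.default_rng(3)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=120)

# Option A: scale the full dataset, then split.
X_all_s = StandardScaler().fit_transform(X_demo)
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_all_s, y_demo, test_size=0.3,
                                              random_state=0)
pred_a = Lasso(alpha=0.1).fit(Xa_tr, ya_tr).predict(Xa_te)

# Option B: split first, then fit the scaler on the training rows only.
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                              random_state=0)
scaler = StandardScaler().fit(Xb_tr)
pred_b = Lasso(alpha=0.1).fit(scaler.transform(Xb_tr), yb_tr).predict(
    scaler.transform(Xb_te))

max_diff = np.max(np.abs(pred_a - pred_b))
print("max difference in predictions:", max_diff)
```

Unlike the unregularized case, the difference here is small but not zero, because the L1 penalty acts on coefficients whose scale depends on the standard deviations used.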

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.3, 
                                                    random_state=72018)

In [46]:
lr = Lasso()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
r2_score(y_test, y_pred)

0.6253107263912051

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=72018)

In [48]:
s = StandardScaler()
lr_s = Lasso()
X_train_s = s.fit_transform(X_train)
lr_s.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_s = lr_s.predict(X_test_s)
# Score the train-only-scaled predictions (y_pred_s, not y_pred from In [46])
r2_score(y_test, y_pred_s)

**Conclusion:** For Linear Regression, it doesn't matter whether you scale before or after the train-test split, in terms of the raw predictions. However, it matters for other algorithms, such as regularized regression. Plus, as we'll see later, we can make scaling part of a `Pipeline`.
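As a preview, here is a minimal sketch of putting the scaler inside a pipeline using scikit-learn's `make_pipeline`. The data and `alpha=0.1` are assumptions. Calling `fit` on the pipeline fits the scaler on the training rows only, and `score` reuses those training statistics on the test rows, which matches the manual `fit_transform` / `transform` sequence.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption).
rng = np.random.default_rng(4)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(150, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

# Pipeline: the scaler is fit on X_tr only when pipe.fit is called.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
pipe.fit(X_tr, y_tr)
pipe_r2 = pipe.score(X_te, y_te)

# Manual equivalent of the same steps, for comparison.
scaler = StandardScaler().fit(X_tr)
manual = Lasso(alpha=0.1).fit(scaler.transform(X_tr), y_tr)
manual_r2 = manual.score(scaler.transform(X_te), y_te)

print("pipeline r2:", pipe_r2)
print("manual r2:  ", manual_r2)
```

The two scores should agree, and the pipeline version removes the chance of accidentally fitting the scaler on test data.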


***

### Machine Learning Foundation (C) 2020 IBM Corporation
