# Regression Week 5: LASSO Assignment 1

In this assignment, you will use LASSO to select features, building on a pre-implemented LASSO solver (the course uses Turi Create; this notebook uses scikit-learn). You will:

    Run LASSO with different L1 penalties
    Choose the best L1 penalty using a validation set
    Choose the best L1 penalty using a validation set, with an additional limit on the number of nonzero weights

In the second assignment, you will implement your own LASSO solver, using coordinate descent.

In [2]:
import numpy as np
from sklearn import linear_model
from math import sqrt, log

In [1]:
import pandas as pd

dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

sales = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)

In [3]:
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']
sales['floors_square'] = sales['floors']*sales['floors']

# 1. Create new features by performing the following transformations on the inputs:

      i) Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently, this variable will mostly affect houses with many bedrooms.

      ii) On the other hand, taking the square root of sqft_living decreases the separation between big and small houses. The owner may not be exactly twice as happy for getting a house that is twice as big.
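The effect of the two transformations can be checked with a little arithmetic. The sketch below uses the bedroom counts from the text; the 1000 and 2000 sqft house sizes are illustrative values, not taken from the dataset.

```python
from math import sqrt

# Squaring widens the gap between few and many bedrooms.
few, many = 1, 4
gap_raw = many - few                  # gap on the original scale
gap_squared = many ** 2 - few ** 2    # gap after squaring

# The square root shrinks the ratio between a big and a small house.
small, big = 1000, 2000               # illustrative sizes in sqft (assumed)
ratio_raw = big / small
ratio_sqrt = sqrt(big) / sqrt(small)

print(gap_raw, gap_squared)
print(ratio_raw, round(ratio_sqrt, 3))
```

A house that is twice as big is only about 1.41 times as large on the square-root scale, while the bedroom gap grows from 3 to 15 after squaring.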
      
      
# 2. Using the entire house dataset, learn regression weights using an L1 penalty of 5e2. Make sure to add "normalize=True" when creating the Lasso object. Refer to the following code snippet for the list of features.

In [6]:
from sklearn import linear_model  # using scikit-learn

all_features = ['bedrooms', 'bedrooms_square', 'bathrooms', 'sqft_living', 
                'sqft_living_sqrt', 'sqft_lot', 'sqft_lot_sqrt', 'floors', 
                'floors_square','waterfront', 'view',  'condition', 
                'grade', 'sqft_above', 'sqft_basement', 'yr_built', 
                'yr_renovated']

model_all = linear_model.Lasso(alpha=5e2, normalize=True) # set parameters
model_all.fit(sales[all_features], sales['price']) # learn weights

Lasso(alpha=500.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=True, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
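It may help to see what `normalize=True` does to the inputs. As far as the scikit-learn documentation for this option describes it, each feature column is centered and then divided by its l2 norm before fitting. The sketch below reproduces that preprocessing with NumPy on synthetic data; the helper name `l2_normalize_columns` and the random matrix are assumptions made for illustration.

```python
import numpy as np

def l2_normalize_columns(X):
    """Center each column, then divide it by its l2 norm (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0           # leave constant columns untouched
    return centered / norms

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

X_norm = l2_normalize_columns(X_demo)
col_norms = np.linalg.norm(X_norm, axis=0)

# Standardizing (zero mean, unit standard deviation) differs only by sqrt(n).
n_samples = X_demo.shape[0]
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
same_up_to_scale = np.allclose(X_std, np.sqrt(n_samples) * X_norm)
```

Newer scikit-learn releases no longer accept the `normalize` argument. A commonly suggested substitute is a pipeline of `StandardScaler` followed by `Lasso`, with `alpha` multiplied by `sqrt(n_samples)`. By the scaling relation above this should give matching predictions, but treat it as an assumption to verify against your installed version.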

# 3. Quiz Question: Which features have been chosen by LASSO, i.e. which features were assigned nonzero weights?

In [10]:
model_all.coef_

array([    0.        ,     0.        ,     0.        ,   134.43931396,
           0.        ,     0.        ,     0.        ,     0.        ,
           0.        ,     0.        , 24750.00458561,     0.        ,
       61749.10309071,     0.        ,     0.        ,    -0.        ,
           0.        ])
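To answer the quiz question without counting array positions by hand, the coefficients can be paired with the feature names. The sketch below copies the coefficient values from the output above so that it runs on its own; with a fitted model you would use `model_all.coef_` instead.

```python
import numpy as np

all_features = ['bedrooms', 'bedrooms_square', 'bathrooms', 'sqft_living',
                'sqft_living_sqrt', 'sqft_lot', 'sqft_lot_sqrt', 'floors',
                'floors_square', 'waterfront', 'view', 'condition',
                'grade', 'sqft_above', 'sqft_basement', 'yr_built',
                'yr_renovated']

# Coefficients copied from the model_all.coef_ output above.
coef = np.array([0., 0., 0., 134.43931396, 0., 0., 0., 0., 0., 0.,
                 24750.00458561, 0., 61749.10309071, 0., 0., -0., 0.])

# Keep the names whose weight is nonzero (-0. compares equal to 0).
selected = [name for name, weight in zip(all_features, coef) if weight != 0]
print(selected)
```

With these values the selected features are sqft_living, view, and grade.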

# 4. To find a good L1 penalty, we will explore multiple values using a validation set. Let us do a three-way split into train, validation, and test sets.

In [7]:
testing = pd.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)
training = pd.read_csv('wk3_kc_house_train_data.csv', dtype=dtype_dict)
validation = pd.read_csv('wk3_kc_house_valid_data.csv', dtype=dtype_dict)

# Make sure to create the 4 features as we did in step 1:

In [8]:
testing['sqft_living_sqrt'] = testing['sqft_living'].apply(sqrt)
testing['sqft_lot_sqrt'] = testing['sqft_lot'].apply(sqrt)
testing['bedrooms_square'] = testing['bedrooms']*testing['bedrooms']
testing['floors_square'] = testing['floors']*testing['floors']

training['sqft_living_sqrt'] = training['sqft_living'].apply(sqrt)
training['sqft_lot_sqrt'] = training['sqft_lot'].apply(sqrt)
training['bedrooms_square'] = training['bedrooms']*training['bedrooms']
training['floors_square'] = training['floors']*training['floors']

validation['sqft_living_sqrt'] = validation['sqft_living'].apply(sqrt)
validation['sqft_lot_sqrt'] = validation['sqft_lot'].apply(sqrt)
validation['bedrooms_square'] = validation['bedrooms']*validation['bedrooms']
validation['floors_square'] = validation['floors']*validation['floors']
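The same four transformations are applied to each of the three splits, so they can also be wrapped in one function. This is a minimal sketch; the helper name `add_transformed_features` and the two-row demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def add_transformed_features(df):
    """Return a copy of df with the four derived columns from step 1 (hypothetical helper)."""
    out = df.copy()
    out['sqft_living_sqrt'] = np.sqrt(out['sqft_living'])
    out['sqft_lot_sqrt'] = np.sqrt(out['sqft_lot'])
    out['bedrooms_square'] = out['bedrooms'] ** 2
    out['floors_square'] = out['floors'] ** 2
    return out

# Tiny illustrative frame (values are not from the dataset).
demo = pd.DataFrame({'sqft_living': [400.0, 900.0],
                     'sqft_lot': [1600, 2500],
                     'bedrooms': [1.0, 4.0],
                     'floors': [1.0, 2.0]})
demo = add_transformed_features(demo)
```

With such a helper, each split would need a single call, for example `training = add_transformed_features(training)`.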

# 5. Now for each l1_penalty in [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7] (to get this in Python, type np.logspace(1, 7, num=13).)

    Learn a model on TRAINING data using the specified l1_penalty. Make sure to specify normalize=True in the constructor.

    Compute the RSS on VALIDATION for the current model (print or save the RSS).

Report which L1 penalty produced the lowest RSS on VALIDATION.
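The RSS (residual sum of squares) is the sum of squared differences between predictions and observed prices. A minimal NumPy sketch, with a hypothetical helper name and made-up numbers:

```python
import numpy as np

def residual_sum_of_squares(predictions, targets):
    """Sum of squared prediction errors (hypothetical helper)."""
    residuals = np.asarray(predictions, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.sum(residuals ** 2))

# Errors of 1, 0 and 2 give 1 + 0 + 4.
demo_rss = residual_sum_of_squares([2.0, 4.0, 7.0], [1.0, 4.0, 5.0])
print(demo_rss)
```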

In [12]:
l1_values = np.logspace(1, 7, num=13)
l1_values

array([1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02,
       1.00000000e+03, 3.16227766e+03, 1.00000000e+04, 3.16227766e+04,
       1.00000000e+05, 3.16227766e+05, 1.00000000e+06, 3.16227766e+06,
       1.00000000e+07])

In [17]:
def lasso_reg(l1_values, train_data, valid_data):
    """Fit one LASSO model per penalty on train_data; return each model's RSS on valid_data."""
    rss_list = []
    for l1_penalty in l1_values:
        # all_features is the list defined in step 2
        lasso_model = linear_model.Lasso(alpha=l1_penalty, normalize=True)
        lasso_model.fit(train_data[all_features], train_data['price'])
        # Residual sum of squares on the validation set
        residuals = lasso_model.predict(valid_data[all_features]) - valid_data['price']
        rss_list.append((residuals ** 2).sum())
    return rss_list

In [18]:
lasso_reg(l1_values, training, validation)

[398213327300134.9,
 399041900253346.9,
 429791604072559.6,
 463739831045121.1,
 645898733633800.8,
 1222506859427163.0,
 1222506859427163.0,
 1222506859427163.0,
 1222506859427163.0,
 1222506859427163.0,
 1222506859427163.0,
 1222506859427163.0,
 1222506859427163.0]
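Rather than reading the best penalty off the list by eye, the position of the smallest RSS can be looked up with `np.argmin`. The sketch below copies the RSS values from the output above so that it is self-contained.

```python
import numpy as np

l1_values = np.logspace(1, 7, num=13)

# Validation RSS values copied from the output above, one per penalty.
validation_rss = [398213327300134.9, 399041900253346.9, 429791604072559.6,
                  463739831045121.1, 645898733633800.8, 1222506859427163.0,
                  1222506859427163.0, 1222506859427163.0, 1222506859427163.0,
                  1222506859427163.0, 1222506859427163.0, 1222506859427163.0,
                  1222506859427163.0]

best_index = int(np.argmin(validation_rss))
best_l1_penalty = l1_values[best_index]
print(best_index, best_l1_penalty)
```

The RSS stops changing from the sixth penalty onward, presumably because every coefficient has been driven to zero by then and only the intercept remains.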

# 6. Quiz Question: Which was the best value for the l1_penalty, i.e. which value of l1_penalty produced the lowest RSS on VALIDATION data?

> # 10

# 7. Now that you have selected an L1 penalty, compute the RSS on TEST data for the model with the best L1 penalty.

In [21]:
l1_penalty = 10
model2 = linear_model.Lasso(alpha=l1_penalty, normalize=True)
model2.fit(training[all_features], training['price'])

test_prediction = model2.predict(testing[all_features])
test_prediction

array([296066.64518244, 573310.90252684, 469890.60540606, ...,
       582482.45746782, 346556.81148311, 368373.5502992 ])

In [25]:
test_error = test_prediction - testing['price']
test_error.mean()

-4852.15776158293

In [27]:
test_rss = (test_error ** 2).sum()
test_rss

98467402552698.75

# 8. Quiz Question: Using the best L1 penalty, how many nonzero weights do you have? Count the number of nonzero coefficients first, and add 1 if the intercept is also nonzero. A succinct way to do this is

In [28]:
np.count_nonzero(model2.coef_) + np.count_nonzero(model2.intercept_)

15

# 9. What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive "a rule of thumb", that is, an interpretable model that has only a few features in it.

You are going to implement a simple, two phase procedure to achieve this goal:

    Explore a large range of ‘l1_penalty’ values to find a narrow region of ‘l1_penalty’ values where models are likely to have the desired number of non-zero weights.
    Further explore the narrow region you found to find a good value for ‘l1_penalty’ that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for ‘l1_penalty’.

# 10. Assign 7 to the variable ‘max_nonzeros’.

In [None]:
max_nonzeros = 7

# 11. Exploring large range of l1_penalty

    For l1_penalty in np.logspace(1, 4, num=20):

    Fit a regression model with a given l1_penalty on TRAIN data. Add "alpha=l1_penalty" and "normalize=True" to the parameter list.

    Count the number of non-zero weights (coefficients plus intercept) and save the count in a list.

In [31]:
def large_range(training_data):
    num = []
    for l1_penalty in np.logspace(1, 4, num=20):
        model = linear_model.Lasso(alpha=l1_penalty, normalize=True)
        model.fit(training_data[all_features], training_data['price'])
        non_zero = np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)
        num.append(non_zero)
    return num

In [32]:
large_range(training)

[15, 15, 15, 15, 13, 12, 11, 10, 7, 6, 6, 6, 5, 3, 3, 2, 1, 1, 1, 1]

In [33]:
np.logspace(1, 4, num=20)

array([   10.        ,    14.38449888,    20.69138081,    29.76351442,
          42.81332399,    61.58482111,    88.58667904,   127.42749857,
         183.29807108,   263.66508987,   379.26901907,   545.55947812,
         784.75997035,  1128.83789168,  1623.77673919,  2335.72146909,
        3359.81828628,  4832.93023857,  6951.92796178, 10000.        ])

# 12. Out of this large range, we want to find the two ends of our desired narrow range of l1_penalty. At one end, we will have l1_penalty values that have too few non-zeros, and at the other end, we will have an l1_penalty that has too many non-zeros.

More formally, find:

    The largest l1_penalty that has more non-zeros than ‘max_nonzeros’ (if we pick a penalty smaller than this value, we will definitely have too many non-zero weights). Store this value in the variable ‘l1_penalty_min’ (we will use it later).
    The smallest l1_penalty that has fewer non-zeros than ‘max_nonzeros’ (if we pick a penalty larger than this value, we will definitely have too few non-zero weights). Store this value in the variable ‘l1_penalty_max’ (we will use it later).

Hint: there are many ways to do this, e.g.:

    Programmatically within the loop above
    Creating a list with the number of non-zeros for each value of l1_penalty and inspecting it to find the appropriate boundaries.
    
 


# 13. Quiz Question: What values did you find for l1_penalty_min and l1_penalty_max?

## l1_penalty_min (largest l1_penalty with more than max_nonzeros non-zeros) = 127.42749857

## l1_penalty_max (smallest l1_penalty with fewer than max_nonzeros non-zeros) = 263.66508987
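The two boundaries can also be found programmatically from the list of non-zero counts. This sketch copies the counts from the output of `large_range(training)` above and uses boolean masks; it is one possible approach, not the only one.

```python
import numpy as np

max_nonzeros = 7
penalties = np.logspace(1, 4, num=20)

# Non-zero counts copied from the large_range(training) output above.
nonzero_counts = np.array([15, 15, 15, 15, 13, 12, 11, 10, 7, 6,
                           6, 6, 5, 3, 3, 2, 1, 1, 1, 1])

# Largest penalty that still leaves too many non-zero weights.
l1_penalty_min = penalties[nonzero_counts > max_nonzeros].max()
# Smallest penalty that already leaves too few non-zero weights.
l1_penalty_max = penalties[nonzero_counts < max_nonzeros].min()

print(l1_penalty_min, l1_penalty_max)
```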

# 14. Exploring narrower range of l1_penalty

We now explore the region of l1_penalty we found: between ‘l1_penalty_min’ and ‘l1_penalty_max’. We look for the L1 penalty in this range that produces exactly the right number of nonzeros and also minimizes RSS on the VALIDATION set.

For l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):

    Fit a regression model with a given l1_penalty on TRAIN data. As before, use "alpha=l1_penalty" and "normalize=True".
    Measure the RSS of the learned model on the VALIDATION set

Find the model that has the lowest RSS on the VALIDATION set and has sparsity equal to ‘max_nonzeros’. (Again, take account of the intercept when counting the number of nonzeros.)

In [48]:
l1_penalty_min = 127.42749857
l1_penalty_max = 263.66508987

def final_model(training_data,validation_data):
    sparsity = []
    valid_rss = []
    l1_penalty_list = []
    
    for l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):
        l1_penalty_list.append(l1_penalty)
        all_features = ['bedrooms', 'bedrooms_square', 'bathrooms', 'sqft_living', 
                        'sqft_living_sqrt', 'sqft_lot', 'sqft_lot_sqrt', 'floors', 
                        'floors_square','waterfront', 'view',  'condition', 
                        'grade', 'sqft_above', 'sqft_basement', 'yr_built', 
                        'yr_renovated']
        model = linear_model.Lasso(alpha=l1_penalty, normalize=True)
        model.fit(training_data[all_features], training_data['price'])
        non_zeros = np.count_nonzero(model.coef_) + np.count_nonzero(model.intercept_)
        
        sparsity.append(non_zeros)
        
        validation_rss = (((model.predict(validation_data[all_features]))-validation_data['price'])**2).sum()
        valid_rss.append(validation_rss)
        
    df = pd.DataFrame({'l1_penalty': l1_penalty_list, "Sparsity" : sparsity, "Validation_rss" : valid_rss})
    return df
        

In [50]:
df = final_model(training, validation)
df

|   | l1_penalty | Sparsity | Validation_rss |
|---|---|---|---|
| 0 | 127.427499 | 10 | 435374700000000.0 |
| 1 | 134.597898 | 10 | 437009200000000.0 |
| 2 | 141.768298 | 8 | 438236100000000.0 |
| 3 | 148.938697 | 8 | 439158900000000.0 |
| 4 | 156.109097 | 7 | 440037400000000.0 |
| 5 | 163.279496 | 7 | 440777500000000.0 |
| 6 | 170.449896 | 7 | 441566700000000.0 |
| 7 | 177.620295 | 7 | 442406400000000.0 |
| 8 | 184.790695 | 7 | 443296700000000.0 |
| 9 | 191.961094 | 7 | 444239800000000.0 |


In [51]:
df.sort_values(by=['Validation_rss'])

|   | l1_penalty | Sparsity | Validation_rss |
|---|---|---|---|
| 0 | 127.427499 | 10 | 435374700000000.0 |
| 1 | 134.597898 | 10 | 437009200000000.0 |
| 2 | 141.768298 | 8 | 438236100000000.0 |
| 3 | 148.938697 | 8 | 439158900000000.0 |
| 4 | 156.109097 | 7 | 440037400000000.0 |
| 5 | 163.279496 | 7 | 440777500000000.0 |
| 6 | 170.449896 | 7 | 441566700000000.0 |
| 7 | 177.620295 | 7 | 442406400000000.0 |
| 8 | 184.790695 | 7 | 443296700000000.0 |
| 9 | 191.961094 | 7 | 444239800000000.0 |


# 15. Quiz Question: What value of l1_penalty in our narrow range has the lowest RSS on the VALIDATION set and has sparsity equal to ‘max_nonzeros’?

> ## 156.109097
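The selection can be done in pandas by filtering on sparsity first and then taking the row with the smallest validation RSS. The sketch below rebuilds a frame from the rows displayed above only, so it is self-contained; with the real result you would filter `df` directly.

```python
import pandas as pd

max_nonzeros = 7

# Rows copied from the displayed output above.
results = pd.DataFrame({
    'l1_penalty': [127.427499, 134.597898, 141.768298, 148.938697, 156.109097,
                   163.279496, 170.449896, 177.620295, 184.790695, 191.961094],
    'Sparsity': [10, 10, 8, 8, 7, 7, 7, 7, 7, 7],
    'Validation_rss': [435374700000000.0, 437009200000000.0, 438236100000000.0,
                       439158900000000.0, 440037400000000.0, 440777500000000.0,
                       441566700000000.0, 442406400000000.0, 443296700000000.0,
                       444239800000000.0],
})

# Keep only models with the desired sparsity, then pick the lowest RSS.
candidates = results[results['Sparsity'] == max_nonzeros]
best_row = candidates.loc[candidates['Validation_rss'].idxmin()]
best_l1_penalty = best_row['l1_penalty']
print(best_l1_penalty)
```

Filtering before taking the minimum matters here: the overall lowest RSS belongs to a model with 10 non-zero weights, which does not satisfy the sparsity requirement.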

# 16. Quiz Question: What features in this model have non-zero coefficients?

In [54]:
all_features = ['bedrooms', 'bedrooms_square', 'bathrooms', 'sqft_living', 
                'sqft_living_sqrt', 'sqft_lot', 'sqft_lot_sqrt', 'floors', 
                'floors_square','waterfront', 'view',  'condition', 
                'grade', 'sqft_above', 'sqft_basement', 'yr_built', 
                'yr_renovated']
l1_penalty = 156.109097
model_test = linear_model.Lasso(alpha=l1_penalty, normalize=True)
model_test.fit(training[all_features], training['price'])
model_test.coef_

array([-0.00000000e+00, -0.00000000e+00,  1.06108902e+04,  1.63380252e+02,
        0.00000000e+00, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  5.06451687e+05,  4.19600436e+04,  0.00000000e+00,
        1.16253554e+05,  0.00000000e+00,  0.00000000e+00, -2.61223488e+03,
        0.00000000e+00])

# bathrooms, sqft_living, waterfront, view, grade, yr_built (6 features; together with the intercept this gives 7 non-zero weights)