# Regression and Feature Selection using LASSO (Interpretation)

Lasso regression adds an L1 penalty to the least-squares objective. Depending on the size of the L1 penalty parameter, it drives more or fewer coefficients to exactly 0, which reduces the set of predictors.
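For reference, the textbook form of the objective is sketched below. The symbols ($y_i$ for the price of house $i$, $x_{ij}$ for its $j$-th feature, $w_j$ for the weights, $w_0$ for the intercept, $\lambda$ for the L1 penalty) are notation introduced here. How GraphLab Create scales the penalty term internally is not shown in this notebook, so treat this as the standard formulation rather than the library's exact one.

$$
\min_{w_0, w} \; \sum_{i=1}^{N} \Big( y_i - w_0 - \sum_{j=1}^{D} w_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{D} |w_j|
$$

A larger $\lambda$ makes every nonzero weight more expensive, so more weights end up at exactly 0. The intercept $w_0$ is left unpenalized in this form, which is the usual convention.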

# Import Modules

In [1]:
import graphlab
import numpy as np



# Load in house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')



# Create new features

In [3]:
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(np.sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(np.sqrt)
sales['bedrooms_square'] = sales['bedrooms'] * sales['bedrooms']

# In the dataset, 'floors' is stored as a string; convert it to float before squaring
sales['floors'] = sales['floors'].astype(float)
sales['floors_square'] = sales['floors'] * sales['floors']

# Learn regression weights with L1 penalty

Fit a model with all the available features, plus the features just created above.

In [4]:
all_features = ['bedrooms', 'bedrooms_square', 'bathrooms', 'sqft_living', 'sqft_living_sqrt', 'sqft_lot', 
                'sqft_lot_sqrt', 'floors', 'floors_square', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
                'sqft_basement', 'yr_built', 'yr_renovated']

Implement LASSO in graphlab with the l1_penalty parameter:

In [5]:
model_all = graphlab.linear_regression.create(
    sales, target = 'price', features = all_features, validation_set = None, l2_penalty = 0., l1_penalty = 1e10)
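
GraphLab Create's solver is not shown in this notebook. As a rough illustration of why an L1 penalty produces exact zeros, here is a minimal NumPy sketch of coordinate descent with soft-thresholding on synthetic data. The function names, the synthetic data, and the penalty values are all assumptions made for this example; it is not a description of how `graphlab.linear_regression.create` works internally.

```python
import numpy as np

def soft_threshold(rho, threshold):
    # Shrink rho toward 0 by `threshold`; anything inside [-threshold, threshold] becomes 0
    return np.sign(rho) * max(abs(rho) - threshold, 0.0)

def lasso_coordinate_descent(X, y, l1_penalty, n_iters=200):
    # Minimizes RSS + l1_penalty * sum(|w_j|), leaving the intercept unpenalized
    n, d = X.shape
    A = np.hstack([np.ones((n, 1)), X])   # column 0 is the intercept
    w = np.zeros(d + 1)
    z = (A ** 2).sum(axis=0)              # squared norm of each column
    for _ in range(n_iters):
        for j in range(d + 1):
            # Residual with feature j's current contribution added back
            residual_without_j = y - np.dot(A, w) + w[j] * A[:, j]
            rho = np.dot(A[:, j], residual_without_j)
            if j == 0:
                w[j] = rho / z[j]         # intercept: plain least-squares update
            else:
                w[j] = soft_threshold(rho, l1_penalty / 2.0) / z[j]
    return w

# Synthetic data (hypothetical): only features 0 and 3 truly matter
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = 1.0 + np.dot(X, true_w) + 0.1 * rng.randn(200)

w_none = lasso_coordinate_descent(X, y, l1_penalty=0.0)   # ordinary least squares
w_mid = lasso_coordinate_descent(X, y, l1_penalty=100.0)  # irrelevant features zeroed
w_huge = lasso_coordinate_descent(X, y, l1_penalty=1e6)   # everything but the intercept zeroed
```

Comparing `np.count_nonzero(w_none[1:])`, `np.count_nonzero(w_mid[1:])`, and `np.count_nonzero(w_huge[1:])` shows the same pattern as the notebook: the larger the penalty, the fewer features keep a nonzero weight.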

Find which features received non-zero weights.

In [6]:
model_all.get('coefficients').print_rows(num_rows = 18)

+------------------+-------+---------------+--------+
|       name       | index |     value     | stderr |
+------------------+-------+---------------+--------+
|   (intercept)    |  None |  274873.05595 |  None  |
|     bedrooms     |  None |      0.0      |  None  |
| bedrooms_square  |  None |      0.0      |  None  |
|    bathrooms     |  None | 8468.53108691 |  None  |
|   sqft_living    |  None | 24.4207209824 |  None  |
| sqft_living_sqrt |  None | 350.060553386 |  None  |
|     sqft_lot     |  None |      0.0      |  None  |
|  sqft_lot_sqrt   |  None |      0.0      |  None  |
|      floors      |  None |      0.0      |  None  |
|  floors_square   |  None |      0.0      |  None  |
|    waterfront    |  None |      0.0      |  None  |
|       view       |  None |      0.0      |  None  |
|    condition     |  None |      0.0      |  None  |
|      grade       |  None | 842.068034898 |  None  |
|    sqft_above    |  None | 20.0247224171 |  None  |
|  sqft_basement   |  None |

# Selecting an L1 penalty

To find a good L1 penalty, explore multiple values and compare them on a validation set.
First, split the data into training, validation, and test sets:

In [7]:
(training_and_validation, testing) = sales.random_split(0.9, seed = 1) 
(training, validation) = training_and_validation.random_split(0.5, seed = 1) 
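
The two nested splits give roughly 45% training, 45% validation, and 10% test data (0.9 × 0.5 = 0.45). The NumPy sketch below mimics that arithmetic on a hypothetical table of 10,000 rows. It assumes each row is assigned independently at random, which is an assumption about `random_split` rather than something this notebook demonstrates, so the resulting proportions are approximate.

```python
import numpy as np

rng = np.random.RandomState(1)
n_rows = 10000  # hypothetical row count, not the size of the real dataset

# First split: about 90% go to training + validation, the rest to testing
in_train_valid = rng.rand(n_rows) < 0.9
# Second split: about half of the 90% go to training, the other half to validation
in_train = in_train_valid & (rng.rand(n_rows) < 0.5)
in_valid = in_train_valid & ~in_train
in_test = ~in_train_valid

fractions = (in_train.mean(), in_valid.mean(), in_test.mean())
```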

Next, loop through penalties [10^1, 10^1.5, 10^2, 10^2.5, ..., 10^7], and find the value with the lowest RSS on the validation set.
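
As a quick check of that grid, `np.logspace(1, 7, num = 13)` produces 13 values whose base-10 exponents step by 0.5 from 1 to 7. The variable names in this small sketch are illustrative only.

```python
import numpy as np

penalties = np.logspace(1, 7, num=13)
# Taking log10 recovers the evenly spaced exponents 1, 1.5, 2, ..., 7
exponents = np.log10(penalties)
```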

In [8]:
best_lambda = 1.0
best_rss = 9.99e20

for penalty in np.logspace(1, 7, num = 13):
    mod = graphlab.linear_regression.create(training, 
                                            target = 'price', 
                                            features = all_features, 
                                            validation_set = None, 
                                            l2_penalty = 0., 
                                            l1_penalty = penalty,
                                            verbose = False)
    preds = mod.predict(validation)
    error = preds - validation['price']
    rss = sum(error ** 2)
    if rss < best_rss:
        best_rss = rss
        best_lambda = penalty
    print 'rss:', rss, 'lambda:', penalty 

print ''
print 'best lambda:', best_lambda, '(rss:', best_rss, ')'

rss: 6.25766285142e+14 lambda: 10.0
rss: 6.25766285362e+14 lambda: 31.6227766017
rss: 6.25766286058e+14 lambda: 100.0
rss: 6.25766288257e+14 lambda: 316.227766017
rss: 6.25766295212e+14 lambda: 1000.0
rss: 6.25766317206e+14 lambda: 3162.27766017
rss: 6.25766386761e+14 lambda: 10000.0
rss: 6.25766606749e+14 lambda: 31622.7766017
rss: 6.25767302792e+14 lambda: 100000.0
rss: 6.25769507644e+14 lambda: 316227.766017
rss: 6.25776517727e+14 lambda: 1000000.0
rss: 6.25799062845e+14 lambda: 3162277.66017
rss: 6.25883719085e+14 lambda: 10000000.0

best lambda: 10.0 (rss: 6.25766285142e+14 )
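
The RSS values above are nearly identical across the whole grid. The sketch below takes the printed (lambda, rss) pairs, picks the minimum the same way the loop does, and measures how far apart the best and worst values are. The variable names are made up for this example; the numbers are copied from the output above.

```python
# (lambda, validation RSS) pairs copied from the printed output
rss_by_penalty = [
    (10.0, 6.25766285142e+14),
    (31.6227766017, 6.25766285362e+14),
    (100.0, 6.25766286058e+14),
    (316.227766017, 6.25766288257e+14),
    (1000.0, 6.25766295212e+14),
    (3162.27766017, 6.25766317206e+14),
    (10000.0, 6.25766386761e+14),
    (31622.7766017, 6.25766606749e+14),
    (100000.0, 6.25767302792e+14),
    (316227.766017, 6.25769507644e+14),
    (1000000.0, 6.25776517727e+14),
    (3162277.66017, 6.25799062845e+14),
    (10000000.0, 6.25883719085e+14),
]

# Lowest validation RSS wins
best_penalty, best_rss = min(rss_by_penalty, key=lambda pair: pair[1])
worst_rss = max(rss for _, rss in rss_by_penalty)

# Relative gap between the worst and the best RSS on this grid
relative_spread = (worst_rss - best_rss) / best_rss
```

The relative spread comes out below 0.1%, so penalties in this range barely change the validation fit.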


In [11]:
best_lambda

10000000.0


In [12]:
best_mod = graphlab.linear_regression.create(training, 
                                             target = 'price', 
                                             features = all_features, 
                                             validation_set = None, 
                                             l2_penalty = 0., 
                                             l1_penalty = best_lambda,
                                             verbose = False)

In [14]:
coef_table = best_mod.get('coefficients')
# print_rows() prints the table itself and returns None, so it is not wrapped in print
coef_table[['name', 'value']].print_rows(num_rows = 18)
# nnz() counts the non-zero entries (the intercept is included)
print 'Number of nonzero coefficients:', best_mod['coefficients']['value'].nnz()

+------------------+------------------+
|       name       |      value       |
+------------------+------------------+
|   (intercept)    |  18993.4272128   |
|     bedrooms     |  7936.96767903   |
| bedrooms_square  |  936.993368193   |
|    bathrooms     |  25409.5889341   |
|   sqft_living    |  39.1151363797   |
| sqft_living_sqrt |  1124.65021281   |
|     sqft_lot     | 0.00348361822299 |
|  sqft_lot_sqrt   |  148.258391011   |
|      floors      |   21204.335467   |
|  floors_square   |  12915.5243361   |
|    waterfront    |  601905.594545   |
|       view       |  93312.8573119   |
|    condition     |  6609.03571245   |
|      grade       |  6206.93999188   |
|    sqft_above    |  43.2870534193   |
|  sqft_basement   |  122.367827534   |
|     yr_built     |  9.43363539372   |
|   yr_renovated   |  56.0720034488   |
+------------------+------------------+
[18 rows x 2 columns]

Number of nonzero coefficients: 18


# Limit the number of nonzero weights

Suppose we want or need to limit the model to, say, 7 nonzero weights.

In [15]:
max_nonzeros = 7

## Explore a larger range of values to find a narrow range with the desired sparsity

Define a wide range of possible `l1_penalty_values`:

In [16]:
l1_penalty_values = np.logspace(8, 10, num = 20)

For each `l1_penalty` value, track the number of non-zero coefficients (the count includes the intercept):

In [17]:
for penalty in l1_penalty_values:
    mod = graphlab.linear_regression.create(training, 
                                            target = 'price', 
                                            features = all_features, 
                                            validation_set = None, 
                                            l2_penalty = 0., 
                                            l1_penalty = penalty,
                                            verbose = False)
    mod_coeffs = mod['coefficients']['value']
    print 'no. coeffs:', mod_coeffs.nnz(), 'lambda:', penalty


no. coeffs: 18 lambda: 100000000.0
no. coeffs: 18 lambda: 127427498.57
no. coeffs: 18 lambda: 162377673.919
no. coeffs: 18 lambda: 206913808.111
no. coeffs: 17 lambda: 263665089.873
no. coeffs: 17 lambda: 335981828.628
no. coeffs: 17 lambda: 428133239.872
no. coeffs: 17 lambda: 545559478.117
no. coeffs: 17 lambda: 695192796.178
no. coeffs: 16 lambda: 885866790.41
no. coeffs: 15 lambda: 1128837891.68
no. coeffs: 15 lambda: 1438449888.29
no. coeffs: 13 lambda: 1832980710.83
no. coeffs: 12 lambda: 2335721469.09
no. coeffs: 10 lambda: 2976351441.63
no. coeffs: 6 lambda: 3792690190.73
no. coeffs: 5 lambda: 4832930238.57
no. coeffs: 3 lambda: 6158482110.66
no. coeffs: 1 lambda: 7847599703.51
no. coeffs: 1 lambda: 10000000000.0


"Zoom in" to the area around 7 coefficients to get finer resolution of penalty values.

In [18]:
l1_penalty_min = 2976351441.63
l1_penalty_max = 3792690190.73
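
The two bounds above were read off the printed output by hand. As a sketch, they can also be derived from the (lambda, count) pairs: the lower bound is the largest penalty that still leaves more than `max_nonzeros` coefficients, and the upper bound is the smallest penalty that leaves fewer. The names `sparsity_by_penalty`, `lower_bound`, and `upper_bound` are introduced only for this example; the numbers are copied from the output above.

```python
# (lambda, number of non-zero coefficients) pairs copied from the printed output
sparsity_by_penalty = [
    (100000000.0, 18), (127427498.57, 18), (162377673.919, 18),
    (206913808.111, 18), (263665089.873, 17), (335981828.628, 17),
    (428133239.872, 17), (545559478.117, 17), (695192796.178, 17),
    (885866790.41, 16), (1128837891.68, 15), (1438449888.29, 15),
    (1832980710.83, 13), (2335721469.09, 12), (2976351441.63, 10),
    (3792690190.73, 6), (4832930238.57, 5), (6158482110.66, 3),
    (7847599703.51, 1), (10000000000.0, 1),
]
max_nonzeros = 7

# Largest penalty whose model is still too dense
lower_bound = max(p for p, nnz in sparsity_by_penalty if nnz > max_nonzeros)
# Smallest penalty whose model is already too sparse
upper_bound = min(p for p, nnz in sparsity_by_penalty if nnz < max_nonzeros)
```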

In [20]:
# Look at linearly spaced values within this range
l1_penalty_values = np.linspace(l1_penalty_min, l1_penalty_max, 20)

In [21]:
for l1_penalty in l1_penalty_values:
    mod = graphlab.linear_regression.create(training, 
                                            target = 'price', 
                                            features = all_features, 
                                            validation_set = None, 
                                            l2_penalty = 0., 
                                            l1_penalty = l1_penalty,
                                            verbose = False)
    mod_coeffs = mod['coefficients']['value']
    num_params = mod_coeffs.nnz()
    preds = mod.predict(validation)
    error = preds - validation['price']
    rss = sum(error ** 2)
    print 'rss:', rss, 'no. params:', num_params, 'lambda:', l1_penalty
    

rss: 9.66925692362e+14 no. params: 10 lambda: 2976351441.63
rss: 9.74019450085e+14 no. params: 10 lambda: 3019316638.95
rss: 9.81188367942e+14 no. params: 10 lambda: 3062281836.27
rss: 9.89328342459e+14 no. params: 10 lambda: 3105247033.59
rss: 9.98783211266e+14 no. params: 10 lambda: 3148212230.91
rss: 1.00847716702e+15 no. params: 10 lambda: 3191177428.24
rss: 1.01829878055e+15 no. params: 10 lambda: 3234142625.56
rss: 1.02824799221e+15 no. params: 10 lambda: 3277107822.88
rss: 1.03461690923e+15 no. params: 8 lambda: 3320073020.2
rss: 1.03855473594e+15 no. params: 8 lambda: 3363038217.52
rss: 1.04323723787e+15 no. params: 8 lambda: 3406003414.84
rss: 1.04693748875e+15 no. params: 7 lambda: 3448968612.16
rss: 1.05114762561e+15 no. params: 7 lambda: 3491933809.48
rss: 1.05599273534e+15 no. params: 7 lambda: 3534899006.8
rss: 1.06079953176e+15 no. params: 7 lambda: 3577864204.12
rss: 1.0657076895e+15 no. params: 6 lambda: 3620829401.45
rss: 1.06946433543e+15 no. params: 6 lambda: 366379
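
Among the penalties that give exactly 7 non-zero coefficients, the natural choice is the one with the lowest validation RSS. The sketch below makes that selection from the complete rows of the output above (the final, cut-off row is left out). The names `results` and `chosen_penalty` are introduced only for this example.

```python
# (lambda, number of non-zero coefficients, validation RSS) from the complete output rows
results = [
    (2976351441.63, 10, 9.66925692362e+14),
    (3019316638.95, 10, 9.74019450085e+14),
    (3062281836.27, 10, 9.81188367942e+14),
    (3105247033.59, 10, 9.89328342459e+14),
    (3148212230.91, 10, 9.98783211266e+14),
    (3191177428.24, 10, 1.00847716702e+15),
    (3234142625.56, 10, 1.01829878055e+15),
    (3277107822.88, 10, 1.02824799221e+15),
    (3320073020.2, 8, 1.03461690923e+15),
    (3363038217.52, 8, 1.03855473594e+15),
    (3406003414.84, 8, 1.04323723787e+15),
    (3448968612.16, 7, 1.04693748875e+15),
    (3491933809.48, 7, 1.05114762561e+15),
    (3534899006.8, 7, 1.05599273534e+15),
    (3577864204.12, 7, 1.06079953176e+15),
    (3620829401.45, 6, 1.0657076895e+15),
]
max_nonzeros = 7

# Keep only the models with the desired sparsity, then take the lowest RSS
with_target_sparsity = [row for row in results if row[1] == max_nonzeros]
chosen_penalty, _, chosen_rss = min(with_target_sparsity, key=lambda row: row[2])
```

Because the RSS grows steadily with the penalty in this range, the selection ends up being the smallest penalty that reaches 7 non-zero coefficients.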

In [22]:
# Lowest validation RSS among the penalties that gave 7 non-zero coefficients
best_lambda7 = 3448968612.16
best_mod = graphlab.linear_regression.create(training, 
                                             target = 'price', 
                                             features = all_features, 
                                             validation_set = None, 
                                             l2_penalty = 0., 
                                             l1_penalty = best_lambda7,
                                             verbose = False)
best_mod['coefficients'].print_rows(num_rows = 18)

+------------------+-------+---------------+--------+
|       name       | index |     value     | stderr |
+------------------+-------+---------------+--------+
|   (intercept)    |  None | 222253.192544 |  None  |
|     bedrooms     |  None | 661.722717782 |  None  |
| bedrooms_square  |  None |      0.0      |  None  |
|    bathrooms     |  None | 15873.9572593 |  None  |
|   sqft_living    |  None | 32.4102214513 |  None  |
| sqft_living_sqrt |  None | 690.114773313 |  None  |
|     sqft_lot     |  None |      0.0      |  None  |
|  sqft_lot_sqrt   |  None |      0.0      |  None  |
|      floors      |  None |      0.0      |  None  |
|  floors_square   |  None |      0.0      |  None  |
|    waterfront    |  None |      0.0      |  None  |
|       view       |  None |      0.0      |  None  |
|    condition     |  None |      0.0      |  None  |
|      grade       |  None | 2899.42026975 |  None  |
|    sqft_above    |  None | 30.0115753022 |  None  |
|  sqft_basement   |  None |