## Lasso Regression on House Sales Data

### Fire up GraphLab Create

In [1]:
import graphlab

### Load in house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [98]:
sales = graphlab.SFrame('kc_house_data.gl/')

### Explore house sales data

In [99]:
sales[0:1]

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650,1,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0


### Import Numpy

In [100]:
import numpy as np

### Create new features

In [101]:
from math import log, sqrt
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']

# In the dataset, 'floors' is stored as a string,
# so convert it to float before creating the squared feature.
sales['floors'] = sales['floors'].astype(float)
sales['floors_square'] = sales['floors']*sales['floors']

* Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently, this feature mostly affects houses with many bedrooms.
* Taking the square root of sqft_living, on the other hand, decreases the separation between big and small houses. An owner may not be exactly twice as happy with a house that is twice as big.
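
The effect of the two transformations can be checked with a little arithmetic. The snippet below is a minimal sketch using made-up values (the bedroom counts and square footages are illustrative assumptions, not rows from the dataset).

```python
from math import sqrt

# Hypothetical houses: few vs. many bedrooms.
bedrooms_few, bedrooms_many = 1.0, 4.0
gap_raw = bedrooms_many - bedrooms_few                # gap of 3 on the raw scale
gap_squared = bedrooms_many ** 2 - bedrooms_few ** 2  # gap of 15 after squaring

# Hypothetical houses: small vs. big living area.
sqft_small, sqft_big = 1000.0, 4000.0
ratio_raw = sqft_big / sqft_small                     # the big house is 4x the small one
ratio_sqrt = sqrt(sqft_big) / sqrt(sqft_small)        # only about 2x after the square root
```

Squaring stretches the upper end of the scale, while the square root compresses it.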

### Selected features

In [102]:
all_features = ['bedrooms', 'bedrooms_square',
            'bathrooms',
            'sqft_living', 'sqft_living_sqrt',
            'sqft_lot', 'sqft_lot_sqrt',
            'floors', 'floors_square',
            'waterfront', 'view', 'condition', 'grade',
            'sqft_above',
            'sqft_basement',
            'yr_built', 'yr_renovated']

## Model with a chosen l1 penalty

### Linear regression model with a single l1 penalty (lasso)

In [103]:
model_all = graphlab.linear_regression.create(sales, target='price', features=all_features,
                                              validation_set=None, l1_penalty=1e10, l2_penalty=0., verbose=False)

### Explore coefficients in the model

In [105]:
# print_rows() prints the table itself and returns None, so it should not be wrapped in print.
model_all['coefficients'].print_rows(num_rows=18)
print "Number of non-zero coefficients:", model_all['coefficients']['value'].nnz()

+------------------+-------+---------------+--------+
|       name       | index |     value     | stderr |
+------------------+-------+---------------+--------+
|   (intercept)    |  None |  274873.05595 |  None  |
|     bedrooms     |  None |      0.0      |  None  |
| bedrooms_square  |  None |      0.0      |  None  |
|    bathrooms     |  None | 8468.53108691 |  None  |
|   sqft_living    |  None | 24.4207209824 |  None  |
| sqft_living_sqrt |  None | 350.060553386 |  None  |
|     sqft_lot     |  None |      0.0      |  None  |
|  sqft_lot_sqrt   |  None |      0.0      |  None  |
|      floors      |  None |      0.0      |  None  |
|  floors_square   |  None |      0.0      |  None  |
|    waterfront    |  None |      0.0      |  None  |
|       view       |  None |      0.0      |  None  |
|    condition     |  None |      0.0      |  None  |
|      grade       |  None | 842.068034898 |  None  |
|    sqft_above    |  None | 20.0247224171 |  None  |
|  sqft_basement   |  None |

Note that the majority of the weights have been set to zero. By setting a large enough L1 penalty, we are effectively performing feature (subset) selection.
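
One way to see why an L1 penalty produces exact zeros is the soft-thresholding operator used by coordinate-descent lasso solvers. The sketch below is illustrative only: it is not necessarily how GraphLab Create fits the model, and the weights and threshold are hypothetical values.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; anything inside the band becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

# Hypothetical unpenalized weights and a hypothetical threshold of 1.0.
unpenalized = [5.0, -0.3, 1.2, 0.05, -2.5]
shrunk = [soft_threshold(w, 1.0) for w in unpenalized]

# Weights whose magnitude is at most the threshold are zeroed out entirely.
num_nonzero = sum(1 for w in shrunk if w != 0.0)
```

Raising the threshold widens the band of values mapped to zero, which is why larger penalties leave fewer non-zero coefficients.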

### Splitting the data

In [106]:
(training_and_validation, testing) = sales.random_split(.9,seed=1)
(training, validation) = training_and_validation.random_split(0.5, seed=1)

## Model with the best L1 penalty selected from a range of L1 penalties

In [141]:
max_nonzeros = 7  # maximum number of non-zero weights allowed

### Explore a large range of l1 penalty values to find a narrow range with the desired sparsity

In [112]:
l1_penalty_values = np.logspace(8, 10, num=20)
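
For readers unfamiliar with `np.logspace`, the grid above is evenly spaced in the exponent rather than in the value itself. The following is a pure-Python sketch of the same computation; the names `grid`, `step`, and `ratio` are introduced here only for illustration.

```python
# 20 exponents evenly spaced from 8 to 10, each used as a power of 10.
num = 20
start_exp, stop_exp = 8.0, 10.0
step = (stop_exp - start_exp) / (num - 1)

grid = [10 ** (start_exp + i * step) for i in range(num)]

# Consecutive values differ by a constant factor (roughly 1.27), not a constant amount.
ratio = grid[1] / grid[0]
```

A logarithmic grid is a reasonable first pass when the right order of magnitude for the penalty is unknown.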

In [113]:
non_zeros = []
l1_penalties = []
for l1_penalty in l1_penalty_values:
    model = graphlab.linear_regression.create(training, target='price', features=all_features, validation_set=None, 
                                              l1_penalty=l1_penalty, l2_penalty=0., verbose=False)
    non_zeros.append(model['coefficients']['value'].nnz())
    l1_penalties.append(l1_penalty)

The L1 penalties applied to the models, and the corresponding number of non-zero coefficients in each model:

In [114]:
l1_penalties

[100000000.0,
 127427498.57031322,
 162377673.91887242,
 206913808.11147901,
 263665089.87303555,
 335981828.62837881,
 428133239.8719396,
 545559478.11685145,
 695192796.17755914,
 885866790.41008317,
 1128837891.6846883,
 1438449888.2876658,
 1832980710.8324375,
 2335721469.0901213,
 2976351441.6313128,
 3792690190.7322536,
 4832930238.5717525,
 6158482110.6602545,
 7847599703.5146227,
 10000000000.0]

In [115]:
non_zeros

[18, 18, 18, 18, 17, 17, 17, 17, 17, 16, 15, 15, 13, 12, 10, 6, 5, 3, 1, 1]

Out of this large range, we want to find the two ends of our desired narrow range of `l1_penalty`. At one end, the `l1_penalty` yields too many non-zeros, and at the other end, it yields too few:

* `l1_penalty_min`: the largest `l1_penalty` that has more non-zeros than `max_nonzeros`
* `l1_penalty_max`: the smallest `l1_penalty` that has fewer non-zeros than `max_nonzeros`
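
The two ends can also be derived from the lists instead of being read off by eye. The sketch below copies the `l1_penalties` and `non_zeros` values printed above so that it runs on its own; in the notebook itself the existing lists would be used directly.

```python
max_nonzeros = 7

# Values copied from the notebook output above.
l1_penalties = [100000000.0, 127427498.57031322, 162377673.91887242,
                206913808.11147901, 263665089.87303555, 335981828.62837881,
                428133239.8719396, 545559478.11685145, 695192796.17755914,
                885866790.41008317, 1128837891.6846883, 1438449888.2876658,
                1832980710.8324375, 2335721469.0901213, 2976351441.6313128,
                3792690190.7322536, 4832930238.5717525, 6158482110.6602545,
                7847599703.5146227, 10000000000.0]
non_zeros = [18, 18, 18, 18, 17, 17, 17, 17, 17, 16, 15, 15, 13, 12, 10, 6, 5, 3, 1, 1]

pairs = list(zip(l1_penalties, non_zeros))

# Largest penalty that still leaves more non-zeros than allowed.
l1_penalty_min = max(p for p, nz in pairs if nz > max_nonzeros)
# Smallest penalty that leaves fewer non-zeros than allowed.
l1_penalty_max = min(p for p, nz in pairs if nz < max_nonzeros)
```

With these values the result corresponds to positions 14 and 15 of `l1_penalties`.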

In [128]:
l1_penalty_min = l1_penalties[14]
l1_penalty_max = l1_penalties[15]
print "Min l1 penalty: " ,l1_penalty_min
print "Max l1 penalty: " ,l1_penalty_max

Min l1 penalty:  2976351441.63
Max l1 penalty:  3792690190.73


### Explore the narrow range of l1 penalty values to find a solution with the right number of non-zeros that has the lowest RSS on the validation set

In [129]:
l1_penalty_values = np.linspace(l1_penalty_min, l1_penalty_max, 20)

In [133]:
all_rss = []
all_models = []
all_penalties = []
for l1_penalty in l1_penalty_values:
    model = graphlab.linear_regression.create(training, target='price', features=all_features, validation_set=None, 
                                              l1_penalty=l1_penalty, l2_penalty=0., verbose=False)
    predicted_price = model.predict(validation)
    residuals = predicted_price - validation['price']
    rss = (residuals*residuals).sum()
    all_rss.append(rss)
    all_models.append(model)
    all_penalties.append(l1_penalty)
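
The RSS computed in the loop is the sum of squared differences between predicted and actual prices. As a small worked example, here is the same calculation in plain Python; the helper name `residual_sum_of_squares` and the prices are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def residual_sum_of_squares(predictions, targets):
    """Sum of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

# Hypothetical predicted and actual prices for three houses.
predicted = [210000.0, 540000.0, 180000.0]
actual = [221900.0, 538000.0, 180000.0]

# Residuals are -11900, 2000 and 0, so RSS = 141610000 + 4000000 + 0.
rss_example = residual_sum_of_squares(predicted, actual)
```

Because errors are squared, one large miss contributes far more to the RSS than several small ones.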

### Explore all models whose number of non-zeros equals max_nonzeros, with their corresponding RSS and l1 penalty

In [135]:
# Keep only the models whose number of non-zero coefficients equals max_nonzeros (7)
selected_models = []
selected_rss = []
selected_penalties = []
for index, model in enumerate(all_models):
    if model['coefficients']['value'].nnz() == max_nonzeros:
        selected_models.append(model)
        selected_rss.append(all_rss[index])
        selected_penalties.append(all_penalties[index])

In [138]:
# From the selected models, pick the model and l1 penalty with the lowest validation RSS
lowest_rss, index = min((val, idx) for (idx, val) in enumerate(selected_rss))
best_model = selected_models[index]
best_l1_penalty = selected_penalties[index]

### Best l1 penalty

In [139]:
best_l1_penalty

3448968612.1634364

### Explore coefficients in the best model

In [140]:
# print_rows() prints the table itself and returns None, so it should not be wrapped in print.
best_model['coefficients'].print_rows(num_rows=18)
print "Number of non-zero coefficients:", best_model['coefficients']['value'].nnz()

+------------------+-------+---------------+--------+
|       name       | index |     value     | stderr |
+------------------+-------+---------------+--------+
|   (intercept)    |  None | 222253.192544 |  None  |
|     bedrooms     |  None | 661.722717782 |  None  |
| bedrooms_square  |  None |      0.0      |  None  |
|    bathrooms     |  None | 15873.9572593 |  None  |
|   sqft_living    |  None | 32.4102214513 |  None  |
| sqft_living_sqrt |  None | 690.114773313 |  None  |
|     sqft_lot     |  None |      0.0      |  None  |
|  sqft_lot_sqrt   |  None |      0.0      |  None  |
|      floors      |  None |      0.0      |  None  |
|  floors_square   |  None |      0.0      |  None  |
|    waterfront    |  None |      0.0      |  None  |
|       view       |  None |      0.0      |  None  |
|    condition     |  None |      0.0      |  None  |
|      grade       |  None | 2899.42026975 |  None  |
|    sqft_above    |  None | 30.0115753022 |  None  |
|  sqft_basement   |  None |