# Week 4_1: Ridge regression (interpretation)

In this notebook, we will run ridge regression multiple times with different L2 penalties to see which one produces the best fit. We will revisit the example of polynomial regression as a means to see the effect of L2 regularization. In particular, we will:
* Use a pre-built implementation of regression (GraphLab Create) to run polynomial regression
* Use matplotlib to visualize polynomial regressions
* Use a pre-built implementation of regression (GraphLab Create) to run polynomial regression, this time with L2 penalty
* Use matplotlib to visualize polynomial regressions under L2 regularization
* Choose the best L2 penalty using cross-validation
* Assess the final fit using test data

We will continue to use the House data from previous notebooks.  (In the next programming assignment for this module, you will implement your own ridge regression learning algorithm using gradient descent.)
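For reference, the textbook ridge objective that the L2 penalty refers to can be written as below. This is the standard formulation, stated here as an assumption: the symbols $w$, $\lambda$, and $N$ are introduced only for illustration, and the exact scaling GraphLab Create applies to `l2_penalty` (and whether the intercept is penalized) is not confirmed by this notebook.

```latex
\text{cost}(w) \;=\; \underbrace{\sum_{i=1}^{N} \bigl(y_i - \hat{y}_i(w)\bigr)^2}_{\text{RSS}(w)} \;+\; \lambda \, \lVert w \rVert_2^2
```

With $\lambda = 0$ this reduces to ordinary least squares; as $\lambda$ grows, large coefficients become more expensive and the fit is pulled toward smaller weights.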

In [1]:
import graphlab
import numpy as np


[INFO] Start server at: ipc:///tmp/graphlab_server-94534 - Server binary: /usr/local/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1455184509.log
[INFO] GraphLab Server Version: 1.8.1


## Revisited polynomial model

In [2]:
def polynomial_sframe(feature, degree):
    poly_sframe = graphlab.SFrame()
    poly_sframe['power_1'] = feature
    if degree > 1:
        for power in range(2, degree+1):
            name = 'power_' + str(power)
            tmp = feature.apply(lambda x: x**power)
            poly_sframe[name] = tmp
    return poly_sframe
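To make the column layout concrete, here is a minimal NumPy-only sketch of the same idea. The helper name `polynomial_columns` is hypothetical and returns a plain dict rather than an SFrame; it is an illustration, not the notebook's implementation.

```python
import numpy as np

def polynomial_columns(feature, degree):
    """Return a dict mapping 'power_k' to feature**k for k = 1..degree."""
    feature = np.asarray(feature, dtype=float)
    columns = {}
    for power in range(1, degree + 1):
        columns['power_' + str(power)] = feature ** power
    return columns

# Tiny example: three inputs raised to powers 1 through 3
cols = polynomial_columns([1.0, 2.0, 3.0], 3)
print(sorted(cols.keys()))
print(cols['power_3'])
```

Each higher power is computed directly from the original feature, mirroring the `power_1` ... `power_15` naming used above.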

In [3]:
sales = graphlab.SFrame('kc_house_data.gl/')

In [4]:
# Sort by sqft_living, breaking ties by price
sales = sales.sort(['sqft_living', 'price'])

In [5]:
poly15_data = polynomial_sframe(sales['sqft_living'], 15)
poly15_data['price'] = sales['price']

## Create ridge model

In [6]:
def ridge_model(data, l2_penalty, degree):
    features = []
    for i in range(1, degree+1):
        features.append('power_' + str(i))
        
    model = graphlab.linear_regression.create(data, 
                                      target='price', 
                                      features=features,
                                     l2_penalty=l2_penalty,
                                     l1_penalty=0.,
                                     validation_set=None,
                                     verbose=False)
    # Row 0 holds the intercept; row 1 is the coefficient on power_1
    return model.get('coefficients')[1]

In [54]:
print ridge_model(poly15_data, 1e-5, 15)

{'index': None, 'stderr': 4735.64035967028, 'name': 'power_1', 'value': 103.09095591973461}


In [7]:
#l2_small_penalty = default 0.01
print ridge_model(poly15_data, 0.01, 15)

{'index': None, 'stderr': 4736.296739437046, 'name': 'power_1', 'value': 410.287462537506}


In [8]:
split1, split2 = poly15_data.random_split(0.5, seed=0)
set_1, set_2 = split1.random_split(0.5, seed=0)
set_3, set_4 = split2.random_split(0.5, seed=0)

In [53]:
#l2_small_penalty=1e-5
print ridge_model(set_1, 1e-5, 15)
print ridge_model(set_2, 1e-5, 15)
print ridge_model(set_3, 1e-5, 15)
print ridge_model(set_4, 1e-5, 15)

{'index': None, 'stderr': 6003.288764621023, 'name': 'power_1', 'value': 585.8658233938417}
{'index': None, 'stderr': 9293.984717634332, 'name': 'power_1', 'value': 783.493800280331}
{'index': None, 'stderr': nan, 'name': 'power_1', 'value': -759.2518428541024}
{'index': None, 'stderr': 9978.427912873512, 'name': 'power_1', 'value': 1247.5903454090083}


In [52]:
#l2_large_penalty=1.e5
print ridge_model(set_1, 1e5, 15)
print ridge_model(set_2, 1e5, 15)
print ridge_model(set_3, 1e5, 15)
print ridge_model(set_4, 1e5, 15)

{'index': None, 'stderr': 9034.214550768973, 'name': 'power_1', 'value': 2.5873887567286933}
{'index': None, 'stderr': 12809.151526769316, 'name': 'power_1', 'value': 2.0447047418193693}
{'index': None, 'stderr': nan, 'name': 'power_1', 'value': 2.268904218765791}
{'index': None, 'stderr': 13195.254864203233, 'name': 'power_1', 'value': 1.9104093824432018}
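The outputs above show the `power_1` coefficient swinging widely across the four subsets with the small penalty and staying near 2 with the large one. The shrinkage effect can be reproduced on synthetic data with a closed-form ridge solve. This is a sketch under stated assumptions: `ridge_closed_form` is a hypothetical helper, it omits the intercept, the data are made up, and it is not the solver GraphLab Create uses.

```python
import numpy as np

def ridge_closed_form(X, y, l2_penalty):
    """Solve min_w ||y - Xw||^2 + l2_penalty * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    A = X.T.dot(X) + l2_penalty * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

# Synthetic data (illustrative only): a noisy linear signal, fit with degree-5 features
rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x ** p for p in range(1, 6)])
y = 3.0 * x + rng.normal(scale=0.1, size=50)

w_small = ridge_closed_form(X, y, 1e-5)  # nearly unregularized
w_large = ridge_closed_form(X, y, 1e5)   # heavily regularized

# The coefficient norm shrinks as the penalty grows
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A heavier penalty trades variance for bias: the coefficients become small and stable across data subsets, at the cost of a less flexible fit.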


## Selecting an L2 penalty via cross-validation

In [26]:
train_valid, test = sales.random_split(0.9, seed=1)
train_valid_shuffled = graphlab.toolkits.cross_validation.shuffle(train_valid, random_seed=1)

In [41]:
degree = 15
features = []
for i in range(1, degree+1):
    features.append('power_' + str(i))

def ridge_model_2(data, l2_penalty, output):        
    model = graphlab.linear_regression.create(data,
                                            target=output, 
                                            features=features,
                                            l2_penalty=l2_penalty,
                                            l1_penalty=0.,
                                            validation_set=None,
                                            verbose=False)
    return model
    
    
def k_fold_cross_validation(k, l2_penalty, data, output):
    rss_list = []
    n = len(data)

    for i in range(k):      
        start = (n*i)//k          # first index of fold i (integer division)
        end = (n*(i+1))//k - 1    # last index of fold i, inclusive
        validation_set = data[start:end+1]
        training_set = data[0:start].append(data[end+1:n])
    
        model = ridge_model_2(training_set, l2_penalty, output)
        errors_squared = (validation_set[output] - model.predict(validation_set))**2
        rss_list.append(errors_squared.sum())
    # Average the validation RSS over the k folds
    cross_validation_error = np.sum(rss_list)/len(rss_list)
    return cross_validation_error
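The fold boundaries computed inside `k_fold_cross_validation` can be checked in isolation. The sketch below uses a hypothetical helper, `fold_bounds`, that applies the same start/end arithmetic with explicit integer division so that it behaves identically under Python 2 and Python 3.

```python
def fold_bounds(n, k):
    """Return (start, end) index pairs, end inclusive, for k contiguous folds of n rows."""
    bounds = []
    for i in range(k):
        start = (n * i) // k
        end = (n * (i + 1)) // k - 1
        bounds.append((start, end))
    return bounds

# 10 rows split into 3 folds: sizes 3, 3 and 4
bounds = fold_bounds(10, 3)
print(bounds)  # [(0, 2), (3, 5), (6, 9)]
```

Every row lands in exactly one validation fold, and when `n` is not divisible by `k` the fold sizes differ by at most one row.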

In [46]:
poly15_train_valid_shuffled = polynomial_sframe(train_valid_shuffled['sqft_living'], 15)
poly15_train_valid_shuffled['price'] = train_valid_shuffled['price']

for l2 in np.logspace(1, 7, num=13):
    print k_fold_cross_validation(10, l2, poly15_train_valid_shuffled, 'price'), 'from ', l2

4.91826427769e+14 from  10.0
2.87504229919e+14 from  31.6227766017
1.60908965822e+14 from  100.0
1.22090967326e+14 from  316.227766017
1.21192264451e+14 from  1000.0
1.2395000929e+14 from  3162.27766017
1.36837175248e+14 from  10000.0
1.71728094842e+14 from  31622.7766017
2.2936143126e+14 from  100000.0
2.52940568729e+14 from  316227.766017
2.58682548441e+14 from  1000000.0
2.62819399742e+14 from  3162277.66017
2.64889015378e+14 from  10000000.0
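Rather than reading the minimum off the printout, the best penalty can be picked programmatically. The sketch below copies the cross-validation errors printed above into an array; the variable names are illustrative.

```python
import numpy as np

# Same penalty grid as the loop above
l2_penalties = np.logspace(1, 7, num=13)

# Average validation RSS for each penalty, copied from the output above
cv_errors = np.array([
    4.91826427769e+14, 2.87504229919e+14, 1.60908965822e+14,
    1.22090967326e+14, 1.21192264451e+14, 1.2395000929e+14,
    1.36837175248e+14, 1.71728094842e+14, 2.2936143126e+14,
    2.52940568729e+14, 2.58682548441e+14, 2.62819399742e+14,
    2.64889015378e+14,
])

best_index = int(np.argmin(cv_errors))
best_l2 = l2_penalties[best_index]
print(best_l2, cv_errors[best_index])
```

The minimum falls at an L2 penalty of 1000, which is the value used to train the final model in the next cell.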


In [47]:
best_model = ridge_model_2(poly15_train_valid_shuffled, 1000, 'price')

In [49]:
poly15_test = polynomial_sframe(test['sqft_living'], 15)


errors_squared = (test['price'] - best_model.predict(poly15_test))**2
errors_squared.sum()

128780855058449.25