<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Gradient Descent in Sklearn

_Authors: Kiefer Katovich (SF)_

---

Until now we've been using specific sklearn model classes to perform regression and classification, such as `LinearRegression` and `LogisticRegression`. These methods work well on smaller datasets with relatively few columns, but once you start getting into "Medium Data" they take up so much memory that fitting them becomes painfully slow (especially on a laptop).

Luckily, sklearn comes with stochastic gradient descent solvers for regression and classification:
- `SGDRegressor`
- `SGDClassifier`

Because these solvers minimize the loss function iteratively on small portions of the data, they avoid the severe slowdown other models suffer on large datasets.
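To make "iteratively on smaller portions of the data" concrete, here is a minimal sketch of mini-batch stochastic gradient descent for a least squares fit, written in plain Python. It is not how sklearn implements its solvers; the function name `sgd_least_squares`, the learning rate, the batch size and the toy data are all assumptions made for illustration.

```python
import random

def sgd_least_squares(X, y, lr=0.01, epochs=50, batch_size=8, seed=0):
    """Fit y ~ X.w + b by mini-batch stochastic gradient descent (illustrative only)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    idx = list(range(n))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the rows in a new random order each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grad_w, grad_b = [0.0] * p, 0.0
            for i in batch:
                # prediction error for one row
                err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(p):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            # step against the average gradient of this batch only
            for j in range(p):
                w[j] -= lr * grad_w[j] / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# hypothetical toy data generated from y = 3*x1 - 2*x2 + 1 (no noise)
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [3 * x1 - 2 * x2 + 1 for x1, x2 in X]

w, b = sgd_least_squares(X, y, lr=0.1, epochs=200)
print(w, b)
```

Each update only touches `batch_size` rows, so the memory needed per step does not grow with the size of the dataset.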

> **Note:** The gradient descent solvers are very flexible and can fit a variety of different model types not covered here. I highly recommend reading their documentation in detail.

---

### SF assessor data

This lab uses data from the SF assessor's office on housing prices in San Francisco - it's already cleaned up.

You can see that the dataset has 250k rows. When expanding this with dummy-coded categorical columns it can become quite large. Be careful that you don't exceed the memory on your computer.


In [37]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd
import scipy.stats as stats

import patsy

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

### 1. Load the data.

Examine the columns.

In [38]:
prop = pd.read_csv('./datasets/assessor_sample.csv')

In [39]:
# A:
prop.shape

(250000, 17)

In [40]:
prop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 17 columns):
baths             250000 non-null int64
beds              250000 non-null int64
lot_depth         250000 non-null float64
basement_area     250000 non-null float64
front_ft          250000 non-null float64
owner_pct         250000 non-null float64
rooms             250000 non-null int64
property_class    250000 non-null object
neighborhood      250000 non-null object
tax_rate          250000 non-null float64
volume            250000 non-null int64
sqft              250000 non-null int64
stories           250000 non-null int64
year_recorded     250000 non-null int64
year_built        250000 non-null int64
zone              250000 non-null object
value             250000 non-null float64
dtypes: float64(6), int64(8), object(3)
memory usage: 32.4+ MB


### 2. Sample down the data

Despite this already being a sample of the full assessor dataset, you should sample the data down further for the sake of speed and your computer's memory.

Use the `.sample()` function for pandas dataframes to subset this down to 25,000 rows or fewer.

Sampling down large datasets is a common procedure. Fitting on a larger subset may change the selected hyperparameters and will get you closer to the best coefficients, but past a certain point the returns are marginal.
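If you want the same subsample every time you rerun the notebook, fix the random seed. The sketch below shows the idea with the standard library only; `sample_rows` and the seed value are hypothetical. With pandas you would get the same effect by passing `random_state` to `.sample()`.

```python
import random

def sample_rows(n_rows, n_keep, seed=42):
    """Pick n_keep distinct row positions out of n_rows, reproducibly."""
    rng = random.Random(seed)  # a fixed seed gives the same rows on every run
    return sorted(rng.sample(range(n_rows), n_keep))

# 25,000 row positions out of 250,000, sampled without replacement
keep = sample_rows(250000, 25000)
print(len(keep))
```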

In [41]:
prop_samp = prop.sample(n=25000)

### 3. Regression with stochastic gradient descent

Below I set up X, y data predicting value (housing price) from the remaining variables. After patsy dummy-codes the categorical columns there are 25,000 rows and 163 columns (including the intercept).


The `SGDRegressor` is very general and flexible, and can be customized with a variety of keyword arguments.

**Arguments**
- `loss`: `['squared_loss','huber', ...]`
    - The `'squared_loss'` loss corresponds to solving a regression with the least squares loss (scikit-learn 1.0 and later call it `'squared_error'`). This is what I expect you'll use, but there are other options. Huber loss is a "robust" regression loss that is less sensitive to outliers.
- `penalty`: `['none','l1','l2','elasticnet']`
    - The regularization penalty applied to the coefficients. `'l1'` and `'l2'` correspond to the Lasso and Ridge penalties, and `'elasticnet'` is a weighted combination of the two.
- `alpha`
    - The regularization strength to be used with a chosen penalty. Same as in Lasso and Ridge.
- `l1_ratio`
    - The mix of the Lasso and Ridge penalties when elasticnet is chosen as the penalty.
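- `n_iter`
    - The number of training "epochs" over the data. This is the number of passes that the gradient descent algorithm will make over the data to iteratively fit the weights (defaults to 5). From scikit-learn 0.19 on this argument is being replaced by `max_iter`, which is why both names show up in the estimator output in this notebook.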
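The loss and penalty arguments are easier to reason about once you see them written out. Below is a minimal sketch of the textbook forms in plain Python. The function names are mine, the `epsilon=0.1` default is taken from the estimator output in this notebook, and sklearn's internal scaling may differ in detail, so treat this as an illustration rather than the library's code.

```python
def squared_loss(residual):
    """Least squares loss for one observation (residual = prediction - target)."""
    return 0.5 * residual ** 2

def huber_loss(residual, epsilon=0.1):
    """Quadratic near zero, linear beyond epsilon, so outliers count for less."""
    a = abs(residual)
    if a <= epsilon:
        return 0.5 * residual ** 2
    return epsilon * a - 0.5 * epsilon ** 2

def elastic_net_penalty(w, alpha, l1_ratio):
    """alpha scales the whole penalty; l1_ratio mixes the Lasso and Ridge parts."""
    l1 = sum(abs(wj) for wj in w)   # Lasso part
    l2 = sum(wj ** 2 for wj in w)   # Ridge part
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

# a large residual is punished far less by Huber than by squared loss
print(squared_loss(10.0), huber_loss(10.0))

# l1_ratio=1 is pure Lasso, l1_ratio=0 is pure Ridge
print(elastic_net_penalty([1.0, -2.0], alpha=0.1, l1_ratio=1.0))
print(elastic_net_penalty([1.0, -2.0], alpha=0.1, l1_ratio=0.0))
```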

`SGDRegressor` is most often used in tandem with grid searching to find the optimal parameters for certain models. 
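Before launching a grid search it is worth counting how many models it will fit, since every combination of parameters is refit once per cross-validation fold. The sketch below enumerates a hypothetical grid of two losses, two penalties and 25 alpha values in plain Python; the alpha list is a stand-in for `np.logspace(-5, 1, 25)`.

```python
from itertools import product

# hypothetical grid: 2 losses x 2 penalties x 25 regularization strengths
param_grid = {
    'loss': ['squared_loss', 'huber'],
    'penalty': ['l1', 'l2'],
    'alpha': [10 ** (-5 + 6 * i / 24.0) for i in range(25)],  # log-spaced, 1e-5 to 1e1
}

# every combination of one value per parameter is one candidate model
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_folds = 5
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

Adding one more parameter with 10 values would multiply the number of fits by 10, so keep the grid coarse at first and refine it around the best region.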

**It is up to you how you want to define the model. You should:**

1. Choose a target to estimate (this should be continuous).
2. Select predictors to use.
3. Standardize your predictor matrix.
4. Build a stochastic gradient descent solver to fit your model. You will likely want to do some kind of gridsearch to find the optimal parameters for your model.
5. Describe the model selected through gridsearch and compare the performance to baseline.
6. Examine and interpret the coefficients.
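Standardizing matters here because gradient descent and the regularization penalties are both sensitive to the scale of each column. As a rough sketch of what `StandardScaler` does to a single column (subtract the mean, divide by the population standard deviation), here is a plain-Python version; the `standardize` helper and the `sqft` values are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale one column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # a constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

sqft = [800, 1200, 1500, 2500]  # made-up square footages
z = standardize(sqft)
print(z)
```

After this step a coefficient describes the change in the target per one standard deviation of the predictor, which makes coefficient magnitudes comparable across columns.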

In [42]:
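# build a patsy formula predicting value from every other column
f = 'value ~ ' + ' + '.join([c for c in prop_samp.columns if c != 'value'])
print(f)

# dmatrices dummy-codes the categorical columns and adds an intercept
y, X = patsy.dmatrices(f, data=prop_samp, return_type='dataframe')
y = y.values.ravel()

print(y.shape, X.shape)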

value ~ baths + beds + lot_depth + basement_area + front_ft + owner_pct + rooms + property_class + neighborhood + tax_rate + volume + sqft + stories + year_recorded + year_built + zone
(25000,) (25000, 163)


In [43]:
from sklearn.linear_model import SGDRegressor, SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [44]:
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

In [45]:
sgd_params = {
    'loss':['squared_loss','huber'],
    'penalty':['l1','l2'],
    'alpha':np.logspace(-5,1,25)
}

sgd_reg = SGDRegressor()
sgd_reg_gs = GridSearchCV(sgd_reg, sgd_params, cv=5, verbose=False)

In [46]:
sgd_reg_gs.fit(Xs, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', max_iter=None, n_iter=None, penalty='l2',
       power_t=0.25, random_state=None, shuffle=True, tol=None, verbose=0,
       warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'loss': ['squared_loss', 'huber'], 'alpha': array([  1.00000e-05,   1.77828e-05,   3.16228e-05,   5.62341e-05,
         1.00000e-04,   1.77828e-04,   3.16228e-04,   5.62341e-04,
         1.00000e-03,   1.77828e-03,   3.16228e-03,   5.62341e-03,
         1.00000e-...2341e-01,
         1.00000e+00,   1.77828e+00,   3.16228e+00,   5.62341e+00,
         1.00000e+01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=False)

In [47]:
print(sgd_reg_gs.best_params_)
print(sgd_reg_gs.best_score_)
# get the best model
sgd_reg = sgd_reg_gs.best_estimator_

{'penalty': 'l2', 'alpha': 1.0, 'loss': 'squared_loss'}
0.202694718146


In [48]:
value_coefs = pd.DataFrame({'coef':sgd_reg.coef_,
                            'mag':np.abs(sgd_reg.coef_),
                            'pred':X.columns})
value_coefs.sort_values('mag', ascending=False, inplace=True)
value_coefs.iloc[0:10, :]

| | coef | mag | pred |
|---|---|---|---|
| 151 | 28330.63143 | 28330.63143 | beds |
| 159 | 28300.755719 | 28300.755719 | sqft |
| 123 | 27200.744223 | 27200.744223 | zone[T.RH1S] |
| 63 | 26712.685032 | 26712.685032 | neighborhood[T.07B] |
| 70 | 21218.939825 | 21218.939825 | neighborhood[T.08E] |
| 10 | 19891.479713 | 19891.479713 | neighborhood[T.01F] |
| 54 | 19213.867708 | 19213.867708 | neighborhood[T.05K] |
| 4 | 18173.063253 | 18173.063253 | property_class[T.Z] |
| 158 | -17663.037711 | 17663.037711 | volume |
| 129 | 17649.710185 | 17649.710185 | zone[T.RHZ] |


### 4. Classification with stochastic gradient descent

The `SGDClassifier` is very similar to the `SGDRegressor`. The main difference is that the loss functions are changed to classification loss functions.

**Arguments**
- `loss`: `['log', ...]`
    - The `'log'` loss corresponds to solving a logistic regression classifier (scikit-learn 1.1 and later call it `'log_loss'`). This is what I expect you'll use, but there are many other options.
- `penalty`: `['none','l1','l2','elasticnet']`
    - The regularization penalty applied to the coefficients. `'l1'` and `'l2'` correspond to the Lasso and Ridge penalties, and `'elasticnet'` is a weighted combination of the two.
- `alpha`
    - The regularization strength to be used with a chosen penalty. Same as in Lasso and Ridge.
- `l1_ratio`
    - The mix of the Lasso and Ridge penalties when elasticnet is chosen as the penalty.
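- `n_iter`
    - The number of training "epochs" over the data. This is the number of passes that the gradient descent algorithm will make over the data to iteratively fit the weights (defaults to 5). From scikit-learn 0.19 on this argument is being replaced by `max_iter`, which is why both names show up in the estimator output in this notebook.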
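For intuition about what the `'log'` loss is doing, here is a minimal plain-Python sketch of the logistic loss for a single observation and its gradient with respect to the linear score. The function names are mine and this is the textbook form, not a copy of sklearn's implementation.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, z):
    """Logistic loss for one observation; y_true is 0 or 1, z is the linear score."""
    p = sigmoid(z)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def log_loss_gradient(y_true, z):
    """Derivative of the loss with respect to z: predicted probability minus label."""
    return sigmoid(z) - y_true

# a confident correct prediction costs little, a confident wrong one costs a lot
print(log_loss(1, 3.0), log_loss(1, -3.0))
print(log_loss_gradient(1, 0.0))
```

The gradient is just the prediction error, which is why each SGD step for logistic regression looks so much like the step for least squares.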

Like `SGDRegressor`, `SGDClassifier` is most often used in tandem with grid searching to find the optimal parameters for certain models. 

**It is up to you how you want to define the model. You should:**

1. Choose a target to classify (you may need to engineer one from existing variables).
2. Calculate the baseline accuracy.
3. Select predictors to use.
4. Standardize your predictor matrix.
5. Build a stochastic gradient descent solver to fit your model. You will likely want to do some kind of gridsearch to find the optimal parameters for your model.
6. Describe the model selected through gridsearch and compare the performance to baseline.
7. Examine and interpret the coefficients.
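The baseline accuracy is what you would score by always guessing the most common class, so it is the number your classifier has to beat. A minimal sketch, with a hypothetical `baseline_accuracy` helper and made-up labels:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

# made-up target: 9 houses in class 0, 1 house in class 1
labels = [0] * 9 + [1] * 1
print(baseline_accuracy(labels))
```

With an imbalanced target like this one, a model with 90% accuracy has learned nothing beyond the class proportions.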

In [49]:
# A:
prop_samp.columns

Index([u'baths', u'beds', u'lot_depth', u'basement_area', u'front_ft',
       u'owner_pct', u'rooms', u'property_class', u'neighborhood', u'tax_rate',
       u'volume', u'sqft', u'stories', u'year_recorded', u'year_built',
       u'zone', u'value'],
      dtype='object')

In [50]:
prop_samp['year_built'].value_counts()

1941    975
1940    933
1925    886
1924    758
1926    749
1927    677
1923    674
1939    619
1948    612
1947    583
1908    515
1928    485
1950    462
1946    450
1922    448
1938    437
1951    430
1907    429
1906    409
1931    404
1910    357
1929    357
1930    348
1937    347
1936    345
1944    333
1949    319
1942    307
1912    301
1955    253
       ... 
1992    117
1965    100
1995     95
1933     87
1985     85
1977     84
1998     79
1973     71
1974     69
1918     68
1902     64
1966     63
1994     59
1970     58
1967     54
2000     51
1976     51
1968     45
1903     42
1934     38
1971     36
1969     21
1901     15
1999     15
2002      6
2001      4
2003      4
2005      2
2004      1
2006      1
Name: year_built, Length: 106, dtype: int64

In [51]:
# let's see if we can predict whether a house was built in 1980 or later
prop_samp['built_past1980'] = (prop_samp.year_built >= 1980).astype(int)

In [52]:
# make the target and calculate the baseline accuracy:
# the share of the majority class (houses built before 1980)
y = prop_samp.built_past1980.values
print(1. - np.mean(y))

0.89296


In [53]:
f = '''
~ baths + beds + lot_depth + basement_area + front_ft + owner_pct +
rooms + property_class + neighborhood + tax_rate + volume + sqft + stories +
zone + value
'''


In [54]:
X = patsy.dmatrix(f, data=prop_samp, return_type='dataframe')

Xs = scaler.fit_transform(X)
print(y.shape, Xs.shape)

(25000,) (25000, 162)


In [55]:
sgd_cls_params = {
    'loss':['log'],
    'penalty':['l1','l2'],
    'alpha':np.logspace(-5,2,50)
}

sgd_cls = SGDClassifier()
sgd_cls_gs = GridSearchCV(sgd_cls, sgd_cls_params, cv=5, verbose=1)

In [56]:
sgd_cls_gs.fit(Xs, y)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   52.3s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'loss': ['log'], 'alpha': array([  1.00000e-05,   1.38950e-05,   1.93070e-05,   2.68270e-05,
         3.72759e-05,   5.17947e-05,   7.19686e-05,   1.00000e-04,
         1.38950e-04,   1.93070e-04,   2.68270e-04,   3.72759e-04,
         5.17947e-04,   7.19686e-04,...    1.93070e+01,   2.68270e+01,   3.72759e+01,   5.17947e+01,
         7.19686e+01,   1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [57]:
print(sgd_cls_gs.best_params_)
print(sgd_cls_gs.best_score_)
sgd_cls = sgd_cls_gs.best_estimator_

{'penalty': 'l1', 'alpha': 0.00019306977288832496, 'loss': 'log'}
0.96276


In [58]:
value_coefs = pd.DataFrame({'coef':sgd_cls.coef_[0],
                            'mag':np.abs(sgd_cls.coef_[0]),
                            'pred':X.columns})
value_coefs.sort_values('mag', ascending=False, inplace=True)
value_coefs.iloc[0:10, :]

| | coef | mag | pred |
|---|---|---|---|
| 108 | -437.584062 | 437.584062 | zone[T.NCR] |
| 116 | 284.052943 | 284.052943 | zone[T.RC4NC3] |
| 97 | -230.655541 | 230.655541 | zone[T.CRNC] |
| 134 | 222.399478 | 222.399478 | zone[T.RM2RM3] |
| 110 | -169.58714 | 169.58714 | zone[T.OTCLEM] |
| 22 | -149.629774 | 149.629774 | neighborhood[T.03D] |
| 39 | -145.394195 | 145.394195 | neighborhood[T.04M] |
| 138 | -137.643788 | 137.643788 | zone[T.RM3RM4] |
| 132 | -137.489194 | 137.489194 | zone[T.RM1RM4] |
| 73 | -135.792977 | 135.792977 | neighborhood[T.08H] |
