
# Gradient Descent in Scikit-Learn

_Authors: Kiefer Katovich (SF)_

---

Until now we've been using specific scikit-learn model classes such as `LinearRegression` and `LogisticRegression` to perform regression and classification. These methods work well on smaller data sets with relatively few features, but fitting slows to a crawl once you start getting into "medium data." They can also take up so much memory that fitting becomes mind-numbingly slow, especially on a laptop.

Luckily, scikit-learn comes with stochastic gradient descent (SGD) solvers for regression and classification:
- `SGDRegressor`
- `SGDClassifier`

Because they minimize the loss function iteratively on small portions of the data, the SGD solvers avoid the severe slowdown other models suffer on large data sets.
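
To make "iteratively on smaller portions of the data" concrete, here is a minimal NumPy sketch of mini-batch gradient descent for least squares. The synthetic data, the step size, and the batch size are all assumptions chosen for illustration, and this is not scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + noise.
X = rng.normal(size=(10_000, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(2)        # weights to learn
learning_rate = 0.05   # hypothetical step size
batch_size = 32        # rows used for each update

for epoch in range(5):                      # each epoch is one pass over the data
    order = rng.permutation(len(X))         # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        # Gradient of the mean squared error computed on this batch only.
        grad = 2 * X[idx].T @ residual / len(idx)
        w -= learning_rate * grad

print(w)  # should land close to true_w
```

Each update touches only `batch_size` rows, so the cost of a step does not grow with the size of the full data set.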

> **Note:** The gradient descent solvers are flexible and can fit a variety of different model types not covered here. We highly recommend reading their documentation in detail.

---

### San Francisco Assessor Data

This lab uses data on housing prices in San Francisco from the S.F. Assessor's Office. The set has already been cleaned.

You can see that the set has 250,000 rows. Once you expand it with dummy-coded categorical columns, it can become quite large. Be careful not to exceed the memory on your computer.
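
One way to gauge the memory risk before dummy-coding the full set is to try it on a small frame and extrapolate. The sketch below uses a hypothetical stand-in frame with made-up column names (`value`, `neighborhood`) and `pd.get_dummies`; the lab's own columns and encoding approach may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the assessor data: one numeric, one categorical column.
df = pd.DataFrame({
    "value": rng.normal(size=1_000),
    "neighborhood": rng.choice([f"n{i}" for i in range(50)], size=1_000),
})

# Dummy-coding turns one categorical column into one column per category.
dummies = pd.get_dummies(df, columns=["neighborhood"])

# memory_usage(deep=True) also counts the bytes held by object (string) columns.
print(df.shape, df.memory_usage(deep=True).sum())
print(dummies.shape, dummies.memory_usage(deep=True).sum())

# Rough size of a dense float64 design matrix: rows * columns * 8 bytes.
n_rows, n_cols = 250_000, dummies.shape[1]
print(n_rows * n_cols * 8 / 1e6, "MB")
```

The last line is only a back-of-the-envelope estimate; it assumes every column is stored as a 64-bit float.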


In [None]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd
import scipy.stats as stats

import patsy

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

### 1) Load the data.

Examine the columns.

In [None]:
prop = pd.read_csv('datasets/assessor_sample.csv')

In [None]:
# A:

### 2) Sample down the data.

Even though this is already only a sample of the full assessor data set, you should sample the data down further for the sake of speed and your computer's memory.

Use the `.sample()` method of pandas DataFrames to subset the data down to fewer than 25,000 rows.

Sampling down large data sets is a common procedure. Fitting on a larger subset may change the selected hyperparameters and results, and it will get you closer to the best coefficients, but the returns become marginal at a certain point.
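
As a quick illustration of how `.sample()` behaves, the sketch below draws from a hypothetical stand-in frame (`prop_demo`, not the real `prop`). The `random_state` value is an arbitrary choice that makes the draw repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `prop`; the real frame comes from the CSV loaded earlier.
prop_demo = pd.DataFrame({"value": np.arange(100_000)})

# n draws an exact number of rows (without replacement by default).
samp_n = prop_demo.sample(n=20_000, random_state=42)

# frac draws a proportion of the rows instead.
samp_frac = prop_demo.sample(frac=0.1, random_state=42)

print(samp_n.shape, samp_frac.shape)
```

The sampled frames keep the original index labels, so call `.reset_index(drop=True)` if you need a clean 0-based index afterward.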

In [None]:
# prop_samp = prop.sample(n=25000)

### 3) Regression with stochastic gradient descent.

Below are x, y data predicting value (housing price) from the remaining variables. There are ~75,000 rows with 170 columns.


The `SGDRegressor` is general and flexible and can be customized with a variety of keyword arguments.

**Arguments**
- `loss`: `['squared_loss','huber', ...]`
    - The `'squared_loss'` option corresponds to solving a regression with the least squares loss. This is what you'll probably use, but there are other options. Huber loss is a "robust" regression loss that is less sensitive to outliers. Note that scikit-learn 1.0 and later name the least squares loss `'squared_error'`.
- `penalty`: `['none','l1','l2','elasticnet']`
    - This defines the penalty on the regression you'd like to solve. `'l1'` and `'l2'` are the lasso and ridge penalties, while the elastic net is a combination of both.
- `alpha`
    - The regularization strength to be used with the chosen penalty. It plays the same role as in the lasso and ridge.
- `l1_ratio`
    - The mix of the lasso and ridge penalties when elastic net is chosen as the penalty.
- `n_iter`
    - The number of training epochs over the data. This is the number of passes that the gradient descent algorithm will make over the data to iteratively fit the weights (defaults to five). Newer scikit-learn releases replace this argument with `max_iter`, an upper bound on the number of epochs, used together with the stopping tolerance `tol`.

`SGDRegressor` is most often used in tandem with grid searching to find the optimal parameters for certain models. 
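
The arguments above can be sketched on synthetic data as follows. The data set, the specific `alpha` and `l1_ratio` values, and the use of `max_iter`/`tol` (the newer replacement for `n_iter`) are assumptions for illustration. The `loss` argument is left at its default, which is the least squares loss.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption; not the assessor set).
X, y = make_regression(n_samples=2_000, n_features=10, noise=5.0, random_state=0)

# SGD is sensitive to feature scale, so standardize the predictors first.
Xs = StandardScaler().fit_transform(X)

sgd_reg = SGDRegressor(
    penalty="elasticnet",  # mix of lasso (l1) and ridge (l2)
    alpha=0.001,           # regularization strength
    l1_ratio=0.5,          # 0 is pure ridge, 1 is pure lasso
    max_iter=1000,         # upper bound on epochs (older releases used n_iter)
    tol=1e-3,              # stop early once the loss stops improving by this much
    random_state=0,
)
sgd_reg.fit(Xs, y)

print(sgd_reg.score(Xs, y))  # R^2 on the training data
```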

**It's up to you how you want to define the model. You should:**

1) Choose a target to estimate (this should be continuous).
2) Select predictors to use.
3) Standardize your predictor matrix.
4) Build a stochastic gradient descent solver to fit your model. You'll likely want to perform some kind of grid search to find the optimal parameters for your model.
5) Describe the model selected through grid search and compare the performance to the baseline.
6) Examine and interpret the coefficients.
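
One possible shape for this workflow, sketched on synthetic data rather than the assessor set. The pipeline step names (`scale`, `sgd`), the parameter grid, and the choice of a mean-prediction baseline are assumptions, not the lab's required solution.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the chosen target and predictors.
X, y = make_regression(n_samples=2_000, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scaling inside the pipeline keeps each CV fold's scaler fit on its own training rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)),
])

param_grid = {
    "sgd__penalty": ["l1", "l2", "elasticnet"],
    "sgd__alpha": np.logspace(-5, -1, 5),
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

# Baseline: always predict the training mean.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = np.mean((y_test - baseline_pred) ** 2)
model_mse = np.mean((y_test - search.predict(X_test)) ** 2)

print(search.best_params_)
print(baseline_mse, model_mse)

# Coefficients are on the standardized scale: change in target per one standard deviation.
coefs = search.best_estimator_.named_steps["sgd"].coef_
print(coefs)
```

Because the predictors are standardized, coefficient magnitudes are directly comparable across features.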

In [None]:
# A:

### 4) Classification with stochastic gradient descent.

The `SGDClassifier` is very similar to the `SGDRegressor`. The main difference is that the loss functions are changed to classification loss functions.

**Arguments**
- `loss`: `['log', ...]`
    - The `'log'` loss corresponds to fitting a logistic regression classifier. This is what you'll probably use, but there are other options. Note that scikit-learn 1.1 and later name this loss `'log_loss'`.
- `penalty`: `['none','l1','l2','elasticnet']`
    - This defines the penalty on the model you'd like to fit. `'l1'` and `'l2'` are the lasso and ridge penalties, while the elastic net is a combination of both.
- `alpha`
    - The regularization strength to be used with the chosen penalty. It plays the same role as in the lasso and ridge.
- `l1_ratio`
    - The mix of the lasso and ridge penalties when elastic net is chosen as the penalty.
- `n_iter`
    - The number of training epochs over the data. This is the number of passes that the gradient descent algorithm will make over the data to iteratively fit the weights (defaults to five). Newer scikit-learn releases replace this argument with `max_iter`, an upper bound on the number of epochs, used together with the stopping tolerance `tol`.

Like `SGDRegressor`, `SGDClassifier` is most often used in tandem with grid searching to find the optimal parameters for certain models. 
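
A minimal sketch of a logistic-loss `SGDClassifier` on synthetic data. The data set and hyperparameter values are assumptions. Because the name of the logistic loss differs between scikit-learn releases, the sketch picks it from the installed version; the helper variable `log_loss_name` is introduced here only for illustration.

```python
import re

import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# The logistic loss is called "log_loss" from scikit-learn 1.1 on and "log" before that.
major, minor = (int(p) for p in re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

# Synthetic two-class data (an assumption; not the assessor set).
X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
Xs = StandardScaler().fit_transform(X)

sgd_clf = SGDClassifier(
    loss=log_loss_name,  # logistic regression fit by SGD
    penalty="l2",        # ridge penalty
    alpha=0.001,         # regularization strength
    max_iter=1000,       # upper bound on epochs (older releases used n_iter)
    tol=1e-3,
    random_state=0,
)
sgd_clf.fit(Xs, y)

print(sgd_clf.score(Xs, y))            # training accuracy
probs = sgd_clf.predict_proba(Xs[:3])  # class probabilities, available with the logistic loss
print(probs)
```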

**It's up to you how you want to define the model. You should:**

1) Choose a target to classify (you may need to engineer one from existing variables).
2) Calculate the baseline accuracy.
3) Select predictors to use.
4) Standardize your predictor matrix.
5) Build a stochastic gradient descent solver to fit your model. You'll likely want to perform some kind of grid search to find the optimal parameters for your model.
6) Describe the model selected through grid search and compare the performance to the baseline.
7) Examine and interpret the coefficients.
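
A sketch of the classification workflow on synthetic data. The engineered target (whether a continuous value is above its median), the pipeline step names, and the parameter grid are all assumptions for illustration, not the lab's required solution.

```python
import re

import numpy as np
import sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The logistic loss is called "log_loss" from scikit-learn 1.1 on and "log" before that.
major, minor = (int(p) for p in re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

# Engineer a binary target from a continuous one: is the value above the median?
X, value = make_regression(n_samples=2_000, n_features=10, noise=5.0, random_state=0)
y = (value > np.median(value)).astype(int)

# Baseline accuracy: the share of the majority class.
baseline = max(y.mean(), 1 - y.mean())

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(loss=log_loss_name, max_iter=1000, tol=1e-3, random_state=0)),
])

param_grid = {
    "sgd__penalty": ["l1", "l2"],
    "sgd__alpha": np.logspace(-5, -1, 5),
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(baseline, search.best_score_)
print(search.best_params_)

# With the logistic loss, each coefficient is a change in log-odds per standard deviation.
coefs = search.best_estimator_.named_steps["sgd"].coef_.ravel()
print(coefs)
```

A model is only worth interpreting if its cross-validated accuracy clearly beats the majority-class baseline.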

In [None]:
# A: