# sklearn and stochastic gradient descent

Until now we've been using the more standard LassoCV, RidgeCV, etc. (or GridSearchCV) to find our optimal parameters. These methods work well on smaller datasets with relatively few columns, but once you start getting into "Medium Data" they slow to a crawl and take up so much memory that fitting them becomes untenable.

This is where stochastic gradient descent (SGD) comes in. Because it updates the model iteratively from small portions of the data rather than all of it at once, it sidesteps the memory and speed problems of large datasets. It is one of the most common algorithms for fitting models on large datasets.
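To make "fitting iteratively on portions of the data" concrete, here is a minimal sketch using the `partial_fit` method of `SGDRegressor`. The data is synthetic and the batch size, number of passes, and `true_coef` values are arbitrary assumptions chosen for illustration, not settings from this lesson.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stand-in data (an assumption, not the assessor dataset).
rng = np.random.RandomState(42)
X = rng.normal(size=(10_000, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.1, size=10_000)

model = SGDRegressor(penalty='l2', alpha=1e-4, random_state=42)

# Feed the model one chunk at a time; only the current chunk
# has to be handed to the estimator on each call.
batch_size = 500
for epoch in range(5):
    order = rng.permutation(len(X))  # reshuffle the rows on each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        model.partial_fit(X[idx], y[idx])

print(np.round(model.coef_, 2))
```

In a real out-of-core setting each chunk would be read from disk (for example with the `chunksize` argument of `pd.read_csv`) instead of being sliced from an array already in memory.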

---

### Import the packages

In [1]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd

import patsy

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

---

### Load the data

I've provided data from the SF assessor's office on housing prices in San Francisco - already cleaned up. However, if you want to try this out on data you've been having trouble fitting, such as the Yelp data for project 5, feel free!


In [2]:
prop = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/sf_assessor_value/assessor_value_cleaned.csv')

---

### Sample down the data

For demonstration purposes and the sake of speed, I am sampling down this large dataset to a more reasonable number of rows.

Stochastic gradient descent is much faster, but the real benefit in my opinion is that it can fit much larger datasets. That being said, it is still slow to fit on huge datasets in sklearn, and I don't recommend fitting on the entire data. This is actually my recommendation in general: tuning on more and more data will shift the optimal hyperparameters, but often with marginal returns.
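One way to sample down is `DataFrame.sample`. The sketch below uses a made-up DataFrame called `prop_full` as a stand-in for the loaded data; the `sqft` column and the 200,000-row size are hypothetical, and only the `value` column name comes from this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full assessor DataFrame.
rng = np.random.RandomState(0)
prop_full = pd.DataFrame({
    'value': rng.lognormal(mean=13, sigma=0.5, size=200_000),
    'sqft': rng.randint(400, 5000, size=200_000),
})

# Draw a fixed number of rows without replacement.
# random_state makes the subsample reproducible across runs.
prop_small = prop_full.sample(n=75_000, random_state=1).reset_index(drop=True)

print(prop_small.shape)
```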

---

### SGD with regression

Below I set up X, y data predicting value (housing price) from the remaining variables. There are ~75,000 rows, with 170 columns.
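A minimal sketch of that setup on a tiny, made-up DataFrame is shown here. Only the target name `value` comes from the text; the other columns and every number are hypothetical. It uses `pd.get_dummies` rather than patsy, which is just one of several ways to build the design matrix.

```python
import pandas as pd

# Hypothetical miniature of the assessor data.
prop_demo = pd.DataFrame({
    'value': [850000.0, 1200000.0, 640000.0, 990000.0],
    'sqft': [900, 1500, 700, 1100],
    'neighborhood': ['Mission', 'Marina', 'Sunset', 'Mission'],
})

# Target: the housing value.
y = prop_demo['value'].values

# Predictors: everything else, with categorical columns one-hot encoded.
# drop_first=True leaves out one level per category to avoid a redundant column.
X = pd.get_dummies(prop_demo.drop(columns='value'), drop_first=True)

print(X.shape, y.shape)
print(list(X.columns))
```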

---

### Import the modeling classes

`SGDRegressor` and `SGDClassifier` are the models used in this solution. These are the very general, flexible stochastic gradient descent classes.

In [3]:
from sklearn.linear_model import (LinearRegression, LogisticRegression, 
                                  Lasso, Ridge,
                                  SGDRegressor, SGDClassifier)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

import scipy.stats as stats


---

### Standardize the data

Always a necessary step when performing regularization, because the penalty is applied to every coefficient on the same scale.
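A small sketch of standardization with `StandardScaler`, on a toy matrix whose values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with columns on very different scales (illustrative values).
X = np.array([[1200.0, 2.0],
              [800.0, 1.0],
              [2500.0, 4.0],
              [1500.0, 3.0]])

scaler = StandardScaler()
Xs = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

print(Xs.mean(axis=0).round(6))
print(Xs.std(axis=0).round(6))
```

SGD is also sensitive to feature scale on its own, since a single learning rate is shared across all coefficients, so this step matters here even apart from the penalty.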

---

### Run the gridsearch on parameters

SGDRegressor and SGDClassifier can be tuned with GridSearchCV or RandomizedSearchCV (the latter is introduced in the next section) to find the optimal parameters. If you are still unsure of how to use GridSearchCV, please look it up in the sklearn documentation. I want you to get used to looking things up online, since you will be doing this literally every day on the job.
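As a starting point, here is a hedged sketch of a grid search over `SGDRegressor`. It runs on synthetic data from `make_regression` rather than the assessor data, and the grid of `alpha` values is an assumption picked for illustration, not a recommended range.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic regression data as a stand-in (assumption).
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

# Every combination of penalty and alpha is tried: 2 x 7 = 14 candidates.
sgd_params = {
    'penalty': ['l1', 'l2'],
    'alpha': np.logspace(-5, 1, 7),
}

sgd_reg_gs = GridSearchCV(SGDRegressor(random_state=0), sgd_params, cv=5)
sgd_reg_gs.fit(Xs, y)

print(sgd_reg_gs.best_params_)
print(sgd_reg_gs.best_score_)  # mean cross-validated R^2 of the best candidate
```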

---

### Visualize or otherwise look at the SGDRegressor results

How you choose to examine the results is up to you. Visualizations are always a good idea, but not entirely necessary (or easy) in this case.
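One non-visual option is to turn the `cv_results_` attribute of a fitted search into a DataFrame and tabulate the scores. The sketch below does this on synthetic data with an illustrative grid; both are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

grid = GridSearchCV(SGDRegressor(random_state=0),
                    {'penalty': ['l1', 'l2'], 'alpha': np.logspace(-4, 1, 6)},
                    cv=3)
grid.fit(Xs, y)

results = pd.DataFrame(grid.cv_results_)
results['param_alpha'] = results['param_alpha'].astype(float)

# Mean cross-validated R^2 for every alpha/penalty combination.
score_table = results.pivot_table(index='param_alpha',
                                  columns='param_penalty',
                                  values='mean_test_score')
print(score_table.round(3))
```

The same table can be passed to a plotting call such as `score_table.plot(logx=True)` if a picture is easier to read.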

---

### RandomizedSearchCV

This class is very similar to GridSearchCV in the way it is initialized with parameters. The big difference is that instead of searching across a strictly specified grid, it searches over random values that are defined by distributions.

Below I have set up for you an example of parameters and calls to the class.

    uniform: this random variable is from scipy.stats; with loc=0.01 and scale=20000 it pulls random values from the uniform distribution from 0.01 to 20000.01 (loc to loc + scale)
    sgd_rand_params: the only difference here is that alpha gets the random variable. It will pull random values from that distribution
    RandomizedSearchCV: this takes an n_iter parameter that specifies how many random parameter settings to try
    
RandomizedSearchCV is often faster than GridSearchCV while finding comparably good parameters. However, this is not _always_ the case. Note that the search is not adaptive: each candidate is drawn independently from the distributions you supply, so it does not start to favor values closer to its current optimum.
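To see what kind of candidates such a search would try, the sketch below draws a few settings with `ParameterSampler` from `sklearn.model_selection`, the helper that RandomizedSearchCV uses to generate its candidates. The `random_state` and the choice of five draws are arbitrary.

```python
import scipy.stats as stats
from sklearn.model_selection import ParameterSampler

uniform = stats.uniform(loc=0.01, scale=20000)

sgd_rand_params = {
    'penalty': ['l1', 'l2'],   # lists are sampled uniformly
    'alpha': uniform,          # distributions are sampled with .rvs()
}

# Draw five candidate parameter settings without fitting any model.
candidates = list(ParameterSampler(sgd_rand_params, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```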

For more information, see:

[RandomizedSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html#sklearn.grid_search.RandomizedSearchCV)

[scipy.stats available distributions and functions](http://docs.scipy.org/doc/scipy/reference/stats.html)


In [4]:
uniform = stats.uniform(loc=0.01, scale=20000)

sgd_rand_params = {
    'loss':['squared_error'],  # named 'squared_loss' before scikit-learn 1.0
    'penalty':['l1','l2'],
    'alpha':uniform
}

sgd_reg = SGDRegressor()
sgd_reg_rand_gs = RandomizedSearchCV(sgd_reg, sgd_rand_params, cv=5, verbose=2, n_iter=50)

---

### Visualize/examine the results from your RandomizedSearchCV
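A possible way to do this is sketched here: fit a small randomized search and list its best-ranked candidates. The data is synthetic, and the narrower `alpha` range and `n_iter=10` are assumptions made so the example runs quickly.

```python
import pandas as pd
import scipy.stats as stats
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

rand_params = {
    'penalty': ['l1', 'l2'],
    'alpha': stats.uniform(loc=0.0001, scale=1.0),  # hypothetical, narrower range
}
rand_gs = RandomizedSearchCV(SGDRegressor(random_state=0), rand_params,
                             cv=3, n_iter=10, random_state=0)
rand_gs.fit(Xs, y)

# One row per sampled candidate; show the five best by cross-validated score.
results = pd.DataFrame(rand_gs.cv_results_)
columns = ['param_alpha', 'param_penalty', 'mean_test_score', 'std_test_score']
top = results.sort_values('rank_test_score')[columns].head(5)

print(top)
print(rand_gs.best_params_)
```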

---

### SGDClassifier

Set up an X, y classification problem and tune it using either GridSearchCV or RandomizedSearchCV. Depending on the data you loaded in, you may need to get new data.

Find the optimal parameters.
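If you don't have a classification dataset handy, a sketch like the following shows the general shape of the exercise. It uses synthetic data from `make_classification`, and the hinge loss and `alpha` range are illustrative assumptions rather than recommended settings.

```python
import scipy.stats as stats
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
Xs = StandardScaler().fit_transform(X)

clf_params = {
    'loss': ['hinge'],                            # hinge loss gives a linear SVM
    'penalty': ['l1', 'l2'],
    'alpha': stats.uniform(loc=1e-5, scale=0.1),  # hypothetical range
}

sgd_clf_rand = RandomizedSearchCV(SGDClassifier(random_state=0), clf_params,
                                  cv=5, n_iter=20, random_state=0)
sgd_clf_rand.fit(Xs, y)

print(sgd_clf_rand.best_params_)
print(sgd_clf_rand.best_score_)  # mean cross-validated accuracy
```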

---

### Visualize/examine the results of the classifier

Just like above.
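For a classifier, a confusion matrix and a classification report on held-out data are a common complement to the cross-validated score. The sketch below uses synthetic data and an `alpha` chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
clf = SGDClassifier(alpha=1e-3, random_state=0)  # alpha is illustrative only
clf.fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```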