# Scikit-Learn: Regression Algorithms

*Regression algorithms* predict the value of a numerical (quantitative) attribute by fitting a continuous functional model to that attribute.
 
There are a number of algorithms built into `scikit-learn` (imported as `sklearn`, and referred to as `scikit` throughout this lab) that are tailor-made to solve regression 
problems; a few of these are enumerated below, along with their ideal use cases.

## NumPy and SciPy

`numpy` is the most commonly used general-purpose Python library for mathematics and numerical processing. 
A sister library, `scipy`, contains functions that are useful for scientific data processing and analysis. 
Both find use within `scikit` applications in various contexts.
This lab will assume basic familiarity with both.

## References
1. [Scikit Documentation](http://scikit-learn.org/stable/user_guide.html)
    * [LinearRegression](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)
    * [SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)
    * [Lasso and ElasticNet](http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_and_elasticnet.html#sphx-glr-auto-examples-linear-model-plot-lasso-and-elasticnet-py)
    * [Ridge/OLS](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge_variance.html#sphx-glr-auto-examples-linear-model-plot-ols-ridge-variance-py)

---
## Large datasets: linear regression, SGDRegressor

Linear regression algorithms fit a linear model (i.e. a trendline) to the training data.
There are a number of these algorithms included in the `scikit` library, 
so we'll cover two of the simpler implementations: `LinearRegression` and `SGDRegressor`.

### Linear Regression
`LinearRegression` uses *least-squares regression* to fit its linear model, 
and is best suited to large data sets (to reduce the effect of random noise and outliers).

*This example comes from the `scikit` documentation for `LinearRegression`.*

In [None]:
# load dataset and libraries
from sklearn import datasets, linear_model
diabetes = datasets.load_diabetes()
import numpy as np
import matplotlib.pyplot as plt

# reduce the dataset to a single feature
diabetes_X = diabetes.data[:, np.newaxis, 2] # index 2 --> mono-feature array

# separate out training and test data sets
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]

# separate target data into training and test sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]

# create and fit a model using the training sets
model = linear_model.LinearRegression().fit(X_train, y_train)

# plot the model
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, model.predict(X_test), color='blue', linewidth=3)
plt.show()

### Notes

The above line represents the least-squares regression on the dataset, and indicates a positive trend in the data.
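
As a quick check on what "least squares" means here, the sketch below refits the same single-feature model and compares its slope and intercept with a direct least-squares solve from `numpy`. The names `A`, `slope`, and `intercept` are illustrative and not part of the original example.

```python
import numpy as np
from sklearn import datasets, linear_model

# Same single-feature setup as the example above
diabetes = datasets.load_diabetes()
X_train = diabetes.data[:-20, [2]]
y_train = diabetes.target[:-20]

model = linear_model.LinearRegression().fit(X_train, y_train)

# Direct least-squares solve on the design matrix [x, 1]
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coeffs, *_ = np.linalg.lstsq(A, y_train, rcond=None)
slope, intercept = coeffs

print("LinearRegression:", model.coef_[0], model.intercept_)
print("np.linalg.lstsq: ", slope, intercept)
```

Both approaches minimize the same sum of squared residuals, so the two printed lines should agree up to floating-point error.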

---
### Stochastic Gradient Descent (SGD)
*Stochastic gradient descent* (SGD) is an efficient and (typically) effective method for fitting linear regression models, with some advantages over `LinearRegression` on large datasets because it updates the model from one sample at a time; 
a mathematical explanation can be found here: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

SGD is natively implemented within `scikit` via `SGDRegressor`, 
and is best suited to large datasets with low dimensionality (i.e. small number of relevant variables).
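
To make the update rule concrete, here is a minimal sketch of SGD for a one-feature linear model with squared-error loss, written in plain `numpy`. It is a simplified illustration under assumed synthetic data and a fixed learning rate, not a description of how `SGDRegressor` is implemented internally.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0          # slope and intercept, both starting at zero
learning_rate = 0.01

for epoch in range(50):                  # one epoch = one pass over the data
    for i in rng.permutation(len(x)):    # visit the samples in random order
        error = (w * x[i] + b) - y[i]    # prediction error for one sample
        w -= learning_rate * error * x[i]  # gradient step for the slope
        b -= learning_rate * error         # gradient step for the intercept

print(w, b)
```

Each update uses a single sample, which is what makes the method cheap per step on large datasets.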

##### Data considerations and a note on other algorithms

The native implementation of `SGDRegressor` is designed to work with `numpy` arrays of floating point values for training data.

In [None]:
%matplotlib inline
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20

# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

X = np.array(dataset.iloc[:,:-1])[:, [1]]
y = np.array(dataset.quality)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4) # split into training/test sets

clf = SGDRegressor(max_iter=250, tol=None) # instantiate SGDRegressor, 250 passes (n_iter was replaced by max_iter in later scikit-learn releases)
clf.fit(X_train, y_train) # fit a linear model

# plot and display 
plt.scatter(X_test, y_test, color='lightblue')
sns.regplot(x=X_test.ravel(), y=clf.predict(X_test), scatter=False) # recent seaborn versions need keyword arguments and 1-D data
plt.show()

### A word of warning
If you were to instantiate the `SGDRegressor` without parameters, you might run into some issues:

In [None]:
clf = SGDRegressor() # instantiate SGDRegressor with defaults (only 5 passes in the older scikit-learn releases this lab targets)
clf.fit(X_train, y_train) # fit a linear model

# plot and display 
plt.scatter(X_test, y_test, color='lightblue')
sns.regplot(x=X_test.ravel(), y=clf.predict(X_test), scatter=False) # recent seaborn versions need keyword arguments and 1-D data
plt.show()

Note that the model isn't even fit in the right direction this time (compare to the other regressor examples: the slope of the regression line should be negative). 
The `scikit` documentation explains why:

![image](./n_iter.png)

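In the scikit-learn release this lab was written for, the default constructor for `SGDRegressor` makes only 5 passes over the training data (the `n_iter` parameter), 
and since the data are so striated, this can lead to severe underfitting. 
Later releases replaced `n_iter` with `max_iter` (default 1000) plus a stopping tolerance `tol`, so the default fit may look better there. 
Thus, the 'correct' example above uses a much larger number of passes; try tweaking it yourself and plotting the results to see what happens.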
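
One way to run that experiment without plotting is to print the fitted slope for several pass counts. The sketch below uses hypothetical stand-in data (`X_demo`, `y_demo`) with a negative trend rather than the wine dataset, and uses `max_iter`, the name the pass-count parameter has in current scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical stand-in data: one feature with a negative trend plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.1, 1.6, size=(500, 1))
y_demo = 6.5 - 1.7 * X_demo.ravel() + rng.normal(scale=0.7, size=500)

slopes = {}
for passes in (1, 5, 50, 250):
    # tol=None disables early stopping, so exactly `passes` epochs are run
    sgd = SGDRegressor(max_iter=passes, tol=None, random_state=0)
    sgd.fit(X_demo, y_demo)
    slopes[passes] = sgd.coef_[0]
    print(passes, "passes -> slope", slopes[passes])
```

The slope used to generate the data is -1.7, so the printed values show how the estimate changes as the number of passes grows.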

---
## Lasso / ElasticNet

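`Lasso` fits a linear model with an L1 penalty on the coefficients, which tends to drive many of them to exactly zero; 
`ElasticNet` mixes L1 and L2 penalties, with `l1_ratio` controlling the balance between the two.

*The following example is based on the Lasso and Elastic Net demonstration in the `scikit` documentation.*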

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20

# generate sparse, multi-featural data
np.random.seed(56)
n_samples, n_features = 75, 250
X = np.random.randn(n_samples, n_features) # Generate a random matrix
coef = 3 * np.random.randn(n_features)
inds = np.arange(n_features) # feature indices, shuffled below
np.random.shuffle(inds)
coef[inds[10:]] = 0 # sparsify coef: only 10 randomly chosen coefficients stay non-zero
y = np.dot(X, coef) # y is the dot product of sparse coef and X vals

# throw in some random noise
y += 0.01 * np.random.normal(size=n_samples) # one noise value per sample

# split data into training and test set (note that test_size is the fraction of the dataset to use for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# import models
from sklearn.linear_model import Lasso, ElasticNet

# construct lasso model
lasso = Lasso(alpha=0.1)
lassomodel = lasso.fit(X_train, y_train).predict(X_test)

# construct ElasticNet model
enet = ElasticNet(alpha = 0.1, l1_ratio = 0.7)
enetmodel= enet.fit(X_train, y_train).predict(X_test)

# plot models and compare
plt.plot(enet.coef_, color='red', linewidth=2, 
    label="ENet coefficients")
plt.plot(lasso.coef_, color='green', linewidth=2,
    label="Lasso coefficients")
plt.plot(coef, '--', color='blue', label="original coefficients")
plt.legend(loc="best")
plt.show()
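
The plot compares the recovered coefficients visually; the same comparison can be made numerically by counting non-zero coefficients. The sketch below regenerates similar sparse data (it uses `np.random.default_rng`, so the values differ from the cell above) and fits on the full dataset for brevity.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(56)
n_samples, n_features = 75, 250
X = rng.standard_normal((n_samples, n_features))

# Sparse ground truth: only 10 of the 250 coefficients are non-zero
coef = np.zeros(n_features)
support = rng.choice(n_features, size=10, replace=False)
coef[support] = 3 * rng.standard_normal(10)
y = X @ coef + 0.01 * rng.standard_normal(n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)

print("true non-zero coefficients:      ", np.count_nonzero(coef))
print("Lasso non-zero coefficients:     ", np.count_nonzero(lasso.coef_))
print("ElasticNet non-zero coefficients:", np.count_nonzero(enet.coef_))
```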

---
## Ridge regression

`Ridge` is another model within `sklearn.linear_model` that produces a linear model using the linear least squares function, 
similar to `LinearRegression` above, but with an added L2 penalty on the size of the coefficients (`LinearRegression` itself applies no regularization).

The specific computational method is a parameter for the model, 
namely `solver` - see [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) for explanations of each.
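
As a rough illustration of how the penalty behaves, the sketch below fits `Ridge` with an explicitly chosen `solver` and several values of `alpha` (the penalty strength) on assumed synthetic data, and compares the slopes with ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic single-feature data (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 1))
y_demo = 2.0 * X_demo.ravel() + rng.normal(scale=0.5, size=30)

ols_slope = LinearRegression().fit(X_demo, y_demo).coef_[0]
print("OLS slope:", ols_slope)

ridge_slopes = []
for alpha in (0.1, 1.0, 10.0, 100.0):
    # 'cholesky' is one of the documented solver options for Ridge
    ridge = Ridge(alpha=alpha, solver="cholesky").fit(X_demo, y_demo)
    ridge_slopes.append(ridge.coef_[0])
    print("alpha =", alpha, "-> ridge slope:", ridge.coef_[0])
```

Larger `alpha` values shrink the slope further toward zero, while a very small `alpha` gives a result close to ordinary least squares.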

*This example is adapted from the `scikit` documentation, 
and compares models produced by ordinary least squares regression to those produced by ridge regression. 
We will be using the wine quality dataset referenced in the introductory Regression lab from module 1.*

In [None]:
%matplotlib inline
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20

# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

X = np.array(dataset.iloc[:,:-1])[:, [1]]
y = np.array(dataset.quality)

# Split training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05)

# Fit models
ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge().fit(X_train, y_train)

# Visualize data
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, ols.predict(X_test), color='red', linewidth=8,
        label='ordinary least squares')
plt.plot(X_test, ridge.predict(X_test), color='blue', linewidth=2,
        label='ridge regression')
plt.ylim(4.1,6.6)
plt.legend(loc='best')
plt.show()

As you can see, ordinary least squares and ridge regression produce nearly identical fits here: with a single feature and the default penalty strength (`alpha=1.0`), the regularization barely changes the slope.
It's also worth noting that for all of the above regression algorithms, 
the only real differences in the initial setup come in during the *preparation* phase - selecting and slicing your dataset prior to training. 
Building machine learning systems in the `scikit` API works more or less the same across different models.
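
The shared interface can be seen by looping over several estimators with identical `fit` and `score` calls. This is a small sketch on assumed synthetic data from `make_regression`; the model settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.model_selection import train_test_split

# Synthetic, nearly linear data
X, y = make_regression(n_samples=300, n_features=3, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "SGDRegressor": SGDRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out data
    print(name, scores[name])
```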

---
## Algorithms of last resort: SVR w/ RBF kernel, EnsembleRegressors

### SVR

*Support vector regression* (SVR) uses support vector machines for regression. 
`SVR` is part of the `sklearn.svm` package and is notable for allowing the use of different 
functional *kernels* for model production.
As these move up in complexity from the linear to the RBF (radial basis function) kernel, computational costs increase.

In [None]:
%matplotlib inline
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20

# generate some roughly linear data
X, y = make_regression(n_samples=200, n_features=1, n_informative = 1,
                        n_targets=1, noise=0.8, random_state=2) # RNG seed locked for demonstrative purposes

# Split training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Construct models using different kernels
rbf = SVR(kernel='rbf', C=100, gamma=0.1)
poly = SVR(kernel='poly', C=100, degree=2, epsilon=0.1)
lin = SVR(kernel='linear', C=1e3)

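# Sort the test set by feature value so plt.plot draws each model as a clean curve
order = np.argsort(X_test.ravel())
X_test, y_test = X_test[order], y_test[order]

# Fit models on the training set only, then predict on the held-out test set
y_rbf = rbf.fit(X_train, y_train).predict(X_test)
y_poly = poly.fit(X_train, y_train).predict(X_test)
y_lin = lin.fit(X_train, y_train).predict(X_test)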

# Visualize data
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_rbf, color='red', linewidth=3,
        label='rbf')
plt.plot(X_test, y_poly, color='blue', linewidth=3,
        label='polynomial')
plt.plot(X_test, y_lin, color='green', linewidth=3,
        label='linear')
plt.legend(loc='best')
plt.show()

### Discussion
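Note that in the example, for roughly linear (degree 1) data, the degree-2 polynomial kernel did an *extremely* poor job of approximating the test set. 
With `SVR`'s default `coef0=0`, a degree-2 polynomial kernel on a single feature can only produce curves that are symmetric about x = 0, so it cannot follow a trend that rises steadily from left to right. 
This is a compelling reason to carefully consider your available kernels when using support vector regression. 
(Intuitively, it makes very little sense to try and approximate a line with a curve.)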
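
The visual comparison can be backed up with a number. The sketch below fits the three kernels on a training split and prints each model's R² score on the held-out data. The `random_state` on the split and the `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Roughly linear single-feature data, as in the example above
X, y = make_regression(n_samples=200, n_features=1, n_informative=1,
                       noise=0.8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

kernels = {
    "rbf": SVR(kernel="rbf", C=100, gamma=0.1),
    "poly": SVR(kernel="poly", C=100, degree=2, epsilon=0.1),
    "linear": SVR(kernel="linear", C=100),
}

scores = {}
for name, svr in kernels.items():
    svr.fit(X_train, y_train)
    scores[name] = svr.score(X_test, y_test)  # R^2 on the held-out data
    print(name, scores[name])
```

An R² close to 1 indicates a good fit, while a value near or below 0 means the model does no better than predicting the mean.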

### EnsembleRegressors

*Ensemble* regression combines the predictions of several base regressors into a single model. 
There are a number of ensemble-based regressors in `scikit` (in `sklearn.ensemble`), 
but we'll focus on `GradientBoostingRegressor` as an example:

*This example comes from the `scikit` documentation for `GradientBoostingRegressor`:*

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes # load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle
from sklearn.ensemble import GradientBoostingRegressor

# load dataset (diabetes is used here in place of the removed Boston housing data)
dataset = load_diabetes()
X = dataset.data
Y = dataset.target
X, Y = shuffle(X, Y, random_state=2)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2)

# construct regressor (500 estimators, squared-error loss; 'ls' was renamed 'squared_error')
gbr = GradientBoostingRegressor(n_estimators=500, loss='squared_error', verbose=0)
gbr.fit(X_train, Y_train)

# build and fill out an array of test-set loss values, one per boosting stage
test_score = np.zeros(500, dtype=np.float64)
for i, Y_pred in enumerate(gbr.staged_predict(X_test)):
    test_score[i] = mean_squared_error(Y_test, Y_pred) # stands in for gbr.loss_, which newer releases no longer provide

# plot deviance
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.plot(np.arange(500) + 1, gbr.train_score_, 'b-', label="Training Set Deviance")
plt.plot(np.arange(500) + 1, test_score, 'r-', label="Test Set Deviance")
plt.xlabel('Estimators')
plt.ylabel('Deviance')
plt.legend(loc='best')
plt.show()
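
To make the idea of combining several regressors concrete, here is a simplified sketch of boosting for squared-error loss: each small decision tree is fit to the residuals left by the trees before it. The data, tree depth, and learning rate are assumptions for illustration, and this is not how `GradientBoostingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # stage 0: constant model
mse_per_stage = [np.mean((y_demo - prediction) ** 2)]

for stage in range(100):
    residual = y_demo - prediction                # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)                    # weak learner fits the residual
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_stage.append(np.mean((y_demo - prediction) ** 2))

print("training MSE before boosting:", mse_per_stage[0])
print("training MSE after 100 stages:", mse_per_stage[-1])
```

The training error falls as stages are added, which mirrors the training-set deviance curve plotted above.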