# Part 2: Regression

Just as we classified objects based on their features, we can also work on a continuous scale and predict a value from known features.

## 1) Boston Dataset

For demonstration, we will use scikit-learn's [Boston](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) dataset. Instead of predicting discrete categories as we would in classification, with this dataset we can attempt to predict price, a continuous variable. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below requires an older version.

In [None]:
from sklearn.datasets import load_boston

boston = load_boston()

If you are going to follow along with other tutorials in the scikit-learn documentation, you will need to know the data structures used as inputs to the models. Let's see what's in the Boston dataset:

In [None]:
boston.keys()

The description will tell us more about the dataset:

In [None]:
boston.DESCR

So we are predicting the median value of a home from 506 observations and 13 covariates: crime rate, lot size, industry/commercial proportion, presence of the Charles River, nitric oxide concentration, rooms per dwelling, units built before 1940, distance to employment centers, access to highways, tax rate, pupil-teacher ratio (a school proxy), proportion of Black residents, and percentage of lower-status population. To get the variable names, we can ask for them in the dictionary:

In [None]:
print(boston.feature_names)
print()
print(type(boston.feature_names))
print()
print(len(boston.feature_names))

We see the input is a numpy array of strings for the variable labels. To get the variable data, we ask the dictionary for the data:

In [None]:
print(boston.data)
print()
print(type(boston.data))
print()
print(len(boston.data))

The data is a two-dimensional numpy array containing one inner array per observation (506 arrays, one for each house, *not* 13 arrays, one for each variable). The values in each inner array *must* line up with the order of the variables, and that order must be the same in every inner array. **ORDER MATTERS**
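To make this layout concrete, here is a minimal sketch using a made-up array of 3 observations and 2 features. The names `toy_names` and `toy_data` and all values are illustrative assumptions, not taken from the Boston data. Each row is one observation, and column *j* of every row belongs to name *j*.

```python
import numpy as np

# Hypothetical toy data: 3 observations (rows) x 2 features (columns)
toy_names = np.array(["rooms", "age"])
toy_data = np.array([[6.5, 65.2],
                     [6.4, 78.9],
                     [7.1, 61.1]])

# shape is (number of observations, number of features)
n_samples, n_features = toy_data.shape
print(n_samples, n_features)

# Column j of every row belongs to toy_names[j]
rooms_column = toy_data[:, 0]
print(toy_names[0], rooms_column)
```

The same indexing applies to `boston.data`, where slicing a column returns one value per house for a single variable.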

The target (price), or *y*, is accessed in the dictionary as well:

In [None]:
print(boston.target)
print()
print(type(boston.target))
print()
print(len(boston.target))

The target array has only one dimension and is lined up in order with the observations in the data array.

Now that we're familiar with the input data, we need to split it up for training and testing, but first things first: **set the random seed!**

In [None]:
import numpy as np

np.random.seed(10)

Now we can use the `train_test_split` function:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    train_size=0.75, test_size=0.25)

Now we have 75% of the data as training data, and 25% of the data as testing data:

In [None]:
print(len(X_train), len(y_train))
print()
print(len(X_test), len(y_test))
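As an alternative to the global seed, `train_test_split` also accepts a `random_state` argument. The sketch below uses small synthetic arrays as assumed stand-ins for `boston.data` and `boston.target`, and checks that two calls with the same `random_state` give the same split and that rows stay aligned with their targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for boston.data and boston.target (assumed, not the real data)
X = np.arange(40).reshape(20, 2)   # 20 observations, 2 features; row i is [2i, 2i + 1]
y = np.arange(20)                  # target i belongs to row i

# Passing random_state makes this particular split reproducible on its own
split_a = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=10)
split_b = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=10)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))

# Rows and targets are shuffled together, so they remain aligned
print(np.array_equal(X_tr[:, 0] // 2, y_tr))
```

Passing `random_state` directly keeps the split reproducible even if other code draws from numpy's global random generator first.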

In scikit-learn, as soon as you have `X_train`, `X_test`, `y_train`, and `y_test`, everything else is a matter of choosing a model and its parameters. This should not be trivialized: selecting a model and that model's parameters is *very* important. While we will not cover it here, you should always select the model and parameters best suited to your data.

## 2) Building models

The syntax in scikit-learn does not change from model to model; only the parameters do. It is also not very different from the classification model syntax. Examples of various models are given below:

### Linear Regression

We'll start with a basic OLS linear regression model:

In [None]:
from sklearn import linear_model
lin_reg = linear_model.LinearRegression(n_jobs=1)  # n_jobs: number of CPUs to use

model = lin_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))
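For scikit-learn regressors, `score` returns the coefficient of determination, R^2, rather than an accuracy. The following is a minimal pure-Python sketch of that formula; the helper name `r_squared` and the numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                  # predictions match exactly
mean_only = r_squared(y_true, [6.0] * 4)             # always predict the mean of y_true
reversed_fit = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than predicting the mean

print(perfect, mean_only, reversed_fit)
```

A score of 1 is a perfect fit, 0 is no better than always predicting the mean, and the score can be negative for a model that does worse than that.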

### GLM - Ridge Regression

In [None]:
from sklearn import linear_model
ridge_reg = linear_model.Ridge(alpha=1.0,  # regularization strength
                               normalize=True,  # normalize X regressors
                               solver='auto',  # options: 'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag'
                               random_state=10)

model = ridge_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))
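To see what the `alpha` penalty does, here is a rough numpy sketch of the closed-form ridge solution. It makes simplifying assumptions that differ from the cell above: no intercept, no normalization, and synthetic data. The helper `ridge_coefficients` is hypothetical and is not how scikit-learn's solvers are implemented.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (assumed): 50 observations, 3 features, known coefficients plus noise
rng = np.random.RandomState(10)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_coefficients(X, y, alpha=0.0)     # alpha = 0 reduces to ordinary least squares
w_ridge = ridge_coefficients(X, y, alpha=10.0)  # a larger alpha shrinks the coefficients

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The coefficient vector for `alpha=10.0` has a smaller norm than the OLS one, which is the shrinkage that regularization is meant to provide.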

### GLM - Elastic Net Regression

In [None]:
elastic_reg = linear_model.ElasticNet(alpha=1.0,  # penalty, 0 is OLS
                                      random_state=10,
                                      selection='cyclic')  # or 'random', which converges faster

model = elastic_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

### Support Vector Regression

In [None]:
from sklearn import svm

sv_reg = svm.SVR(kernel='linear',  # ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’
                 degree=3,  # only used for 'poly' above
                 gamma='auto',  # kernel coeff, default auto is 1/n_features
                 C=1.0)

model = sv_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

### K-nearest neighbors regression

In [None]:
from sklearn import neighbors

knn_reg = neighbors.KNeighborsRegressor(n_neighbors=5,
                                        weights='uniform',  # ‘distance’ weights points by inverse of their distance
                                        algorithm='auto',  # out of ‘ball_tree’, ‘kd_tree’, ‘brute’
                                        leaf_size=30)  # for tree algorithms

model = knn_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))
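The idea behind k-nearest neighbors regression with `weights='uniform'` can be sketched in a few lines of plain Python: find the `k` closest training points and average their targets. The function `knn_predict` and the toy data are hypothetical, and the sketch ignores the tree-based search structures that scikit-learn can use.

```python
def knn_predict(train_X, train_y, query, k):
    """Predict by averaging the targets of the k closest training points."""
    # Squared Euclidean distance gives the same neighbor ordering as Euclidean distance
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), target)
        for row, target in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = [target for _, target in distances[:k]]
    return sum(nearest) / k  # uniform weights: a plain average

# Toy data for illustration: one feature, four training points
train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]

prediction = knn_predict(train_X, train_y, query=[2.1], k=3)
print(prediction)
```

The distant point at `10.0` is not among the three nearest neighbors of the query, so it has no influence on the prediction.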

### Random Forest Regression

In [None]:
from sklearn import ensemble

rf_reg = ensemble.RandomForestRegressor(n_estimators=10,  # number of trees
                                        criterion='mse',  # how to measure fit
                                        max_depth=None,  # how deep tree nodes can go
                                        min_samples_split=2,  # samples needed to split node
                                        min_samples_leaf=1,  # samples needed for a leaf
                                        min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                                        max_features='auto',  # max feats
                                        max_leaf_nodes=None,  # max nodes
                                        n_jobs=1,  # how many to run parallel
                                        random_state=10)

model = rf_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

### Boosting - AdaBoost Regression

In [None]:
ab_reg = ensemble.AdaBoostRegressor(base_estimator=None,  # default is DT 
                                    n_estimators=50,  # number to try before stopping
                                    learning_rate=1.0,  # decrease influence of each additional estimator
                                    random_state=10,
                                    loss='linear')  # also ‘square’, ‘exponential’


model = ab_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## 3) Grid Search

As with classification, you can also use grid search on regression models.

In [None]:
param_grid = {'n_estimators': range(10,50),
              'learning_rate': np.arange(0.01, 1, .1)}

In [None]:
from sklearn.model_selection import GridSearchCV

model_c = GridSearchCV(ensemble.AdaBoostRegressor(), param_grid)
model_c.fit(X_train, y_train)

In [None]:
best_index = np.argmax(model_c.cv_results_["mean_test_score"])

print(model_c.cv_results_["params"][best_index])
print(max(model_c.cv_results_["mean_test_score"]))
print(model_c.score(X_test, y_test))
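Grid search fits one model per parameter combination and per cross-validation fold, so it helps to know how large the grid is before running it. The sketch below counts the candidates in the `param_grid` used above. The fold count of 5 is an assumption and depends on the `cv` argument and the scikit-learn version.

```python
from itertools import product

import numpy as np

param_grid = {'n_estimators': range(10, 50),
              'learning_rate': np.arange(0.01, 1, .1)}

# Every combination of the listed values is one candidate model
candidates = list(product(*param_grid.values()))

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

With 40 values for `n_estimators` and 10 for `learning_rate`, the search evaluates 400 candidates, which explains why the fitting cell can take a while to run.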

## Challenge

Choose three algorithms and use grid search to determine the best model for this dataset.