# Regression Algorithms
You cannot know beforehand which algorithms are best suited to your problem. You must trial a number of methods and focus attention on those that prove the most promising. In this lab you will discover seven machine learning algorithms that you can use when spot-checking your regression problem in Python with scikit-learn. After completing this lesson you will know:
1. How to spot-check machine learning algorithms on a regression problem.
2. How to spot-check four linear regression algorithms.
3. How to spot-check three nonlinear regression algorithms.
<br>

Let’s get started.
<br>

# Algorithms Overview
In this lesson we are going to take a look at seven regression algorithms that you can spot-check on your dataset. We start with four linear machine learning algorithms:
- Linear Regression.
- Ridge Regression.
- LASSO Linear Regression.
- Elastic Net Regression.
<br>

Then we look at three nonlinear machine learning algorithms:
- k-Nearest Neighbors.
- Classification and Regression Trees.
- Support Vector Machines.

# Linear Machine Learning Algorithms
Each recipe is demonstrated on the Boston House Price dataset. This is a regression problem where all attributes are numeric. A test harness with 10-fold cross-validation is used to demonstrate how to spot-check each machine learning algorithm, and mean squared error is used to indicate algorithm performance. Note that the mean squared error values are negated. This is a quirk of the cross_val_score() function, which follows the convention that a larger score is always better, so error metrics are reported as negative values. The recipes assume that you know about each machine learning algorithm and how to use it. We will not go into the API or parameterization of each algorithm.
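
The harness described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the data comes from `make_regression` rather than the housing file, only two of the seven models are included, and `shuffle=True` is set so that `random_state` is accepted by recent scikit-learn releases. The variable names are hypothetical.

```python
# Sketch of a spot-check harness on synthetic data (not the housing dataset).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data: 200 rows, 5 numeric inputs, noisy linear target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

# 10-fold cross-validation; shuffle=True is an assumption of this sketch.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
models = {"LR": LinearRegression(), "Ridge": Ridge()}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring="neg_mean_squared_error")
    mean_scores[name] = scores.mean()
    # Negate the score to recover the usual (positive) mean squared error.
    print(f"{name}: score={scores.mean():.3f}  MSE={-scores.mean():.3f}")

# Because scores are negated errors, the largest score is the best model.
best = max(mean_scores, key=mean_scores.get)
print("best:", best)
```

Since every score is a negated error, all values are zero or below, and the model with the score closest to zero is the most promising one.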

This section provides examples of how to use four different linear machine learning algorithms for regression in Python with scikit-learn.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

## Linear Regression
Linear regression assumes that the input variables have a Gaussian distribution. It is also assumed that input variables are relevant to the output variable and that they are not highly correlated with each other (a problem called collinearity). You can construct a linear regression model using the LinearRegression class.
<br>
Running the example provides an estimate of the mean squared error.

In [5]:
# Linear Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LinearRegression 
filename = 'housing.csv' 
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
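# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
# the first 13 columns are the inputs; the last column (MEDV) is the target
X = array[:,0:13]
Y = array[:,13]
# random_state is omitted: it has no effect without shuffle=True and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)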
model = LinearRegression() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())

-34.70525594452486
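
The collinearity concern mentioned above can be checked before fitting. The following is a minimal sketch in plain Python that computes the Pearson correlation between pairs of input columns; the three columns are made-up values for illustration, not taken from the housing data.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical input columns: col_b is roughly twice col_a (highly correlated),
# while col_c is only weakly related to col_a.
col_a = [1.0, 2.0, 3.0, 4.0, 5.0]
col_b = [2.1, 3.9, 6.2, 7.8, 10.1]
col_c = [3.0, 1.0, 4.0, 1.0, 5.0]

print("a vs b:", round(pearson(col_a, col_b), 3))
print("a vs c:", round(pearson(col_a, col_c), 3))
```

A correlation close to 1 or -1 between two inputs is a sign of collinearity. With the data loaded into a pandas DataFrame, `dataframe.corr()` computes the same coefficient for every pair of columns.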


## Ridge Regression
Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model, measured as the sum of the squared coefficient values (the squared L2-norm). You can construct a ridge regression model by using the Ridge class.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
<br>

Running the example provides an estimate of the mean squared error.

In [8]:
# Ridge Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import Ridge 
filename = 'housing.csv' 
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
kfold = KFold(n_splits=10)
model = Ridge() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())

-34.07824620925938
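
To see what the L2 penalty does, consider the simplest possible case: one input, no intercept. Under that assumption the ridge coefficient has a closed form, sketched below in plain Python. The function name and data are hypothetical; this is not how scikit-learn solves the general problem.

```python
def ridge_coefficient(x, y, alpha):
    """Minimizer of sum((y_i - w * x_i)^2) + alpha * w^2 for one input, no intercept."""
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / (sum_xx + alpha)

# Hypothetical data that follows y = 2x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 14.0):
    print(f"alpha={alpha}: w={ridge_coefficient(x, y, alpha):.4f}")
```

With `alpha=0` the result is the ordinary least squares coefficient of 2. Larger values of `alpha` shrink the coefficient toward zero without ever making it exactly zero.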


## LASSO Regression
The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model, measured as the sum of the absolute coefficient values (the L1-norm). You can construct a LASSO model by using the Lasso class.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html 
<br>

Running the example provides an estimate of the mean squared error.

In [10]:
# Lasso Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import Lasso 
filename = 'housing.csv' 
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
kfold = KFold(n_splits=10)
model = Lasso() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())

-34.46408458830231
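
The "selection" part of LASSO comes from the L1 penalty driving some coefficients to exactly zero. A minimal sketch for one input and no intercept shows this through the soft-thresholding step. It assumes the objective `(1 / (2n)) * sum((y_i - w * x_i)^2) + alpha * |w|`; the helper names and data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping to zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coefficient(x, y, alpha):
    """Minimizer of (1/(2n)) * sum((y_i - w*x_i)^2) + alpha * |w| (one input, no intercept)."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n   # correlation of input with target
    z = sum(xi * xi for xi in x) / n                 # scale of the input
    return soft_threshold(rho, alpha) / z

# Hypothetical data that follows y = 2x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 10.0):
    print(f"alpha={alpha}: w={lasso_coefficient(x, y, alpha):.4f}")
```

Unlike the ridge penalty, a large enough `alpha` here sets the coefficient to exactly zero, which effectively removes the input from the model.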


## ElasticNet Regression
ElasticNet is a form of regularized regression that combines the properties of both ridge regression and LASSO regression. It seeks to minimize the complexity of the regression model (the magnitude and number of regression coefficients) by penalizing the model using both the L2-norm (sum of squared coefficient values) and the L1-norm (sum of absolute coefficient values). You can construct an ElasticNet model using the ElasticNet class.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
<br>
Running the example provides an estimate of the mean squared error.

In [14]:
# ElasticNet Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import ElasticNet 
filename = 'housing.csv' 
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
kfold = KFold(n_splits=10)
model = ElasticNet() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())

-31.16457371424975
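
The way the two penalties are mixed can be written out directly. The sketch below follows the parameterization documented for scikit-learn's ElasticNet, where `alpha` scales the whole penalty and `l1_ratio` sets the balance between L1 and L2. The function name and coefficient vector are hypothetical, and only the penalty term is computed, not the full loss.

```python
def elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Penalty term: alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    l1 = sum(abs(w) for w in coefs)          # sum of absolute coefficient values
    l2_squared = sum(w * w for w in coefs)   # sum of squared coefficient values
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

# Hypothetical coefficient vector.
coefs = [3.0, -4.0, 0.0]

print("mixed (l1_ratio=0.5):", elastic_net_penalty(coefs))
print("pure L1 (l1_ratio=1):", elastic_net_penalty(coefs, l1_ratio=1.0))
print("pure L2 (l1_ratio=0):", elastic_net_penalty(coefs, l1_ratio=0.0))
```

Setting `l1_ratio=1.0` leaves only the LASSO-style penalty, and `l1_ratio=0.0` leaves only the ridge-style penalty.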


# Nonlinear Machine Learning Algorithms
This section provides examples of how to use three different nonlinear machine learning algorithms for regression in Python with scikit-learn.


## K-Nearest Neighbors
The k-Nearest Neighbors algorithm (or KNN) locates the k most similar instances in the training dataset for a new data instance. From the k neighbors, the mean or median output variable is taken as the prediction. Of note is the distance metric used (the metric argument). The Minkowski distance is used by default, which is a generalization of both the Euclidean distance (used when all inputs have the same scale) and the Manhattan distance (used when the scales of the input variables differ). You can construct a KNN model for regression using the KNeighborsRegressor class.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
<br>
Running the example provides an estimate of the mean squared error.

In [15]:
# KNN Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.neighbors import KNeighborsRegressor 
filename = 'housing.csv' 
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
kfold = KFold(n_splits=10)
model = KNeighborsRegressor() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())

-107.28683898039215
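
The relationship between the Minkowski, Manhattan and Euclidean distances, and the averaging of neighbor outputs, can be shown in a few lines of plain Python. This is a brute-force sketch with made-up data and hypothetical function names, not the search structure scikit-learn uses internally.

```python
def minkowski(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k=3, p=2):
    """Predict the mean target of the k training rows closest to query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    nearest_targets = [target for _, target in ranked[:k]]
    return sum(nearest_targets) / k

# Hypothetical training data: three nearby rows and one distant outlier.
train_X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
train_y = [1.0, 2.0, 3.0, 50.0]

print("Manhattan:", minkowski([0.0, 0.0], [3.0, 4.0], p=1))
print("Euclidean:", minkowski([0.0, 0.0], [3.0, 4.0], p=2))
print("prediction:", knn_predict(train_X, train_y, [1.0, 1.0], k=3))
```

The distant row is ignored with `k=3`, so the prediction is the mean of the three nearby targets. Because distances are computed on raw values, inputs with large scales dominate the neighbor search, which is one reason KNN can score poorly on unscaled data.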


## Classification and Regression Trees
Decision trees, or Classification and Regression Trees (CART, as they are known), use the training data to select the best points at which to split the data in order to minimize a cost metric. The default cost metric for regression decision trees is the mean squared error, specified in the criterion parameter. You can create a CART model for regression using the DecisionTreeRegressor class.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
<br>
Running the example provides an estimate of the mean squared error.

In [17]:
# Decision Tree Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.tree import DecisionTreeRegressor 
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
kfold = KFold(n_splits=10)
model = DecisionTreeRegressor() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())

-34.368815686274516
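
The idea of selecting the best split point can be sketched for a single input. The code below tries every midpoint between consecutive sorted values and keeps the one with the lowest total squared error around the mean of each side. It is an illustration with made-up data and hypothetical function names, not scikit-learn's implementation, which repeats this search over all inputs and recursively over the resulting subsets.

```python
def sum_squared_error(values):
    """Squared error of predicting the mean for every value in the group."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, cost) of the single split that minimizes squared error."""
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical input values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sum_squared_error(left) + sum_squared_error(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Hypothetical data with two clear groups of target values.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]

threshold, cost = best_split(x, y)
print("split at:", threshold, "cost:", round(cost, 3))
```

The chosen threshold falls between the two groups, where predicting each side's mean leaves the least error.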


## Support Vector Machines
Support Vector Machines (SVM) were developed for binary classification. The technique has been extended to the prediction of real-valued quantities, where it is called Support Vector Regression (SVR). Like the classification example, SVR is built upon the LIBSVM library. You can create an SVM model for regression using the SVR class.
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
<br>
Running the example provides an estimate of the mean squared error.

In [18]:
# SVM Regression 
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.svm import SVR 
filename = 'housing.csv' 
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] 
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
kfold = KFold(n_splits=10)
model = SVR() 
scoring = 'neg_mean_squared_error' 
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print(results.mean())




-91.04782433324428


# Summary
In this lab you discovered how to spot-check machine learning algorithms for regression problems in Python using scikit-learn. Specifically, you learned about four linear machine learning algorithms: Linear Regression, Ridge Regression, LASSO Linear Regression and Elastic Net Regression. You also learned about three nonlinear algorithms: k-Nearest Neighbors, Classification and Regression Trees and Support Vector Machines.