<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/Multi_Output_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we explore two methods for creating linear regression models with more than one output.

When we think of linear regression, we usually think of predicting a single output, for example a size or a weight.<br>
Sometimes we want to predict more than one output for the same input.<br>
The outputs are often dependent upon each other, so either one model predicts all outputs together or a separate model is used for each output.<br>
We will look at both methods.

We need to check the version of sklearn. Versions of the library older than 0.22 will not work in this notebook.

In [None]:
# check scikit-learn version
import sklearn
print(sklearn.__version__)
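
The version requirement can also be checked programmatically. This is a minimal sketch that assumes the version string starts with a numeric `major.minor` prefix; the name `version_ok` is only illustrative.

```python
import re
import sklearn

# Extract the leading "major.minor" part of the version string, e.g. "1.3.2" -> (1, 3)
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
if match is None:
    raise RuntimeError("Unrecognized scikit-learn version: %s" % sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))

# This notebook needs scikit-learn 0.22 or newer
version_ok = (major, minor) >= (0, 22)
if not version_ok:
    raise RuntimeError("scikit-learn >= 0.22 is required, found %s" % sklearn.__version__)
print("scikit-learn %s is recent enough" % sklearn.__version__)
```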

# Create synthetic data

Problem Input: 10 numeric variables.<br>
Problem Output: 2 numeric variables.

In [None]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

Notice the data shape is 1000 rows and 10 columns<br>
The label shape is 1000 rows and **2** columns

In [None]:
# create datasets
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)
# summarize dataset
print(X.shape, y.shape)

In [None]:
X[0]

**Some regression machine learning algorithms support multiple outputs directly.**<br>

This includes most of the popular machine learning algorithms implemented in the scikit-learn library, such as:<br>

>LinearRegression (and related)<br>
KNeighborsRegressor<br>
DecisionTreeRegressor<br>
RandomForestRegressor (and related)<br>





# Use the sklearn Linear Regression model

1. Select the model<br>
2. Train it on the data and the labels<br>
3. Select a row of data and make a prediction<br>
<br>
The output from the linear regression model is two values, one for each target.

In [None]:
# define model
model = LinearRegression()
# fit model
model.fit(X, y)
# make a prediction
yhat = model.predict([X[0]])
# summarize prediction
print(yhat[0])
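
One way to see why the model returns two values is to inspect the fitted parameters. The sketch below refits the same model on the same synthetic data and checks that there is one row of coefficients and one intercept per target. The variable `manual` is only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# same synthetic dataset as above: 10 inputs, 2 outputs
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)
model = LinearRegression().fit(X, y)

# one row of coefficients and one intercept per target
print(model.coef_.shape)       # (n_targets, n_features)
print(model.intercept_.shape)  # (n_targets,)

# the prediction is the usual linear formula applied once per target
yhat = model.predict(X[:1])
manual = X[:1] @ model.coef_.T + model.intercept_
print(np.allclose(yhat, manual))
```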

# Use the KNN Regression model on the data

In [None]:
# k-nearest neighbors for multioutput regression
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

Use the dataset generated above.<br>
The model predicts two values.

In [None]:
model = KNeighborsRegressor()
# fit model
model.fit(X, y)
# make a prediction
yhat = model.predict([X[4]])
# summarize prediction
print(yhat[0])

# Use the Decision Tree Regression Model

In [None]:
# decision tree for multioutput regression
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

Use the dataset generated above.<br>
The model predicts two values.

In [None]:
# define model
model = DecisionTreeRegressor()
# fit model
model.fit(X, y)
# make a prediction
yhat = model.predict([X[0]])
# summarize prediction
print(yhat[0])

# Use Cross Validation on Multi Output Regression

In [None]:
# evaluate multioutput regression model with k-fold cross-validation
from numpy import absolute
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

In [None]:
# define model
model = DecisionTreeRegressor()
# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

Running the example evaluates the performance of the decision tree model for multioutput regression on the test problem. The mean and standard deviation of the MAE are calculated across all folds and all repeats, and are reported by the next code cell.

**The error is reported as a single score averaged across both output variables**,<br>
rather than as separate error scores for each output variable.



In [None]:
# force the scores to be positive
n_scores = absolute(n_scores)
# summarize performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
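
If a separate error for each output is wanted, one option is `mean_absolute_error` with `multioutput="raw_values"`. The sketch below is an assumption-laden illustration: it uses a single hold-out split (via `train_test_split`) instead of the repeated k-fold procedure above, so the numbers are not directly comparable with the cross-validated score.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# same synthetic dataset as above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)

# single hold-out split (an assumption for brevity, not the notebook's k-fold setup)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
yhat = model.predict(X_test)

# one MAE per output column
per_output = mean_absolute_error(y_test, yhat, multioutput="raw_values")
# default behaviour: uniform average over the output columns
overall = mean_absolute_error(y_test, yhat)
print("MAE per output:", per_output)
print("MAE averaged over outputs: %.3f" % overall)
```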

# Support Vector Regression does not support multiple outputs

The following example results in an error, because `LinearSVR` expects a one-dimensional target.

In [None]:
# failure of support vector regression for multioutput regression (causes an error)
from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR

In [None]:
# define model
model = LinearSVR()
# fit model
# (THIS WILL CAUSE AN ERROR!)
model.fit(X, y)
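
To observe the failure without halting the notebook, the call can be wrapped in a `try`/`except`. This sketch assumes the error is raised as a `ValueError`, which is how scikit-learn normally reports a target of the wrong shape. The name `error_message` is only illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR

# same synthetic dataset as above: y has 2 columns
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)

model = LinearSVR()
error_message = None
try:
    # LinearSVR only accepts a 1D target, so this call is expected to fail
    model.fit(X, y)
except ValueError as err:
    error_message = str(err)
    print("LinearSVR rejected the 2-column target:", error_message)
```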

# Using regression models designed for predicting one value for multioutput regression

We do this by dividing the multioutput regression problem into multiple sub-problems.

**For example**,<br>
if a multioutput regression problem requires:<br>
>the prediction of three outputs y1, y2 and y3 given an input X,<br>
the problem can be partitioned into three single-output regression problems:<br>

Problem 1: Given X, predict y1.<br>
Problem 2: Given X, predict y2.<br>
Problem 3: Given X, predict y3.<br>
<br>



There are two main approaches to partitioning the problem.

1. **Direct Method:** this involves developing a separate regression model for each output value to be predicted. In other words each target value is modeled directly.

2. **Chain of Models**: this is an extension of the first method except the models are organized into a chain. The prediction from the first model is taken as part of the input to the second model, and the process of output-to-input dependency repeats along the chain of models.

In [None]:
# example of evaluating direct multioutput regression with an SVM model
from numpy import mean
from numpy import std
from numpy import absolute
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

In [None]:
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)

Define the base model, in this case the support vector regression model

In [None]:
model = LinearSVR()

Define the wrapper model

In [None]:
wrapper = MultiOutputRegressor(model)
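
To make the Direct Method concrete, the sketch below fits one copy of the base model per target column by hand and compares the result with the wrapper. Fixing `random_state` and raising `max_iter` are assumptions added here so the two fits are repeatable; the names `manual_models` and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

# same synthetic dataset as above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)
base = LinearSVR(random_state=1, max_iter=10000)

# Direct Method by hand: an independent copy of the base model for each target column
manual_models = [clone(base).fit(X, y[:, i]) for i in range(y.shape[1])]
manual_pred = np.column_stack([m.predict(X[:5]) for m in manual_models])

# the wrapper does the same bookkeeping for us
wrapper = MultiOutputRegressor(base).fit(X, y)
wrapper_pred = wrapper.predict(X[:5])

print(manual_pred.shape, wrapper_pred.shape)
print(np.allclose(manual_pred, wrapper_pred))
```

After fitting, `wrapper.estimators_` holds one fitted model per output.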

Use cross validation to evaluate the Direct approach model

In [None]:
# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(wrapper, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force the scores to be positive
n_scores = absolute(n_scores)
# summarize performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In [None]:
# example of making a prediction with the direct multioutput regression model
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

In [None]:
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)

In [None]:
# define base model
model = LinearSVR()
# define the direct multioutput wrapper model
wrapper = MultiOutputRegressor(model)
# fit the model on the whole dataset
wrapper.fit(X, y)
# make a single prediction
yhat = wrapper.predict([X[0]])
# summarize the prediction
print('Predicted: %s' % yhat[0])

For comparison, the prediction from our Linear Regression model for the same row:<br>
[50.06781692 64.56497273]
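
The Chain of Models approach described earlier can be tried in a similar way. One option is scikit-learn's `RegressorChain`, sketched below under a few assumptions: the chain order `[0, 1]` (first target, then second), a fixed `random_state`, and a raised `max_iter` are choices made here for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.multioutput import RegressorChain
from sklearn.svm import LinearSVR

# same synthetic dataset as above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)

# define base model
model = LinearSVR(random_state=1, max_iter=10000)
# define the chained wrapper: the model for the second target also
# receives the first target as an extra input feature
chain = RegressorChain(model, order=[0, 1])
# fit the chain on the whole dataset
chain.fit(X, y)
# make a single prediction
yhat = chain.predict(X[:1])
# summarize the prediction
print('Predicted: %s' % yhat[0])
```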