Regression

Christian Kahr edited this page May 23, 2018 · 15 revisions

Standard regression problems

In a regression problem, we typically have some input vectors x and some desired output values y. Unlike in classification problems, the output values y are not restricted to class labels; they can be continuous variables or vectors.

Models

Linear Regression

Let's say we have a univariate, continuous set of input data and a corresponding univariate, continuous set of output data, such as a set of points in R². A simple linear regression fits a line relating the input variable to the output variable such that the sum of squared errors between the line and the actual output points is minimized.

// Declare some sample test data.
double[] inputs = { 80, 60, 10, 20, 30 };
double[] outputs = { 20, 40, 30, 50, 60 };

// Use Ordinary Least Squares to learn the regression
OrdinaryLeastSquares ols = new OrdinaryLeastSquares();

// Use OLS to learn the simple linear regression
SimpleLinearRegression regression = ols.Learn(inputs, outputs);

// Compute the output for a given input:
double y = regression.Transform(85); // The answer will be 28.088

// We can also extract the slope and the intercept term
// for the line. Those will be about -0.26 and 50.59, respectively.
double s = regression.Slope;     // -0.264706
double c = regression.Intercept; // 50.588235

If you prefer an interpreter such as PowerShell:

using namespace Accord.Statistics.Models.Regression.Linear

# Declare some sample test data.
$inputs = @( 80, 60, 10, 20, 30 );
$outputs = @( 20, 40, 30, 50, 60 );

# Use Ordinary Least Squares to learn the regression
$ols = [OrdinaryLeastSquares]::new();

# Use OLS to learn the simple linear regression
$regression = $ols.Learn($inputs, $outputs);

# Compute the output for a given input:
$y = $regression.Transform(85); # The answer will be 28.088

# We can also extract the slope and the intercept term
# for the line. Those will be about -0.26 and 50.59, respectively.
$s = $regression.Slope;     # -0.264706
$c = $regression.Intercept; # 50.588235
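
The values reported above can be checked by hand. The following is a worked sketch using the standard closed-form least-squares formulas for a single input variable. The symbols for the means and the sums S_xy and S_xx are notation introduced here for illustration; they are not part of the framework's API.

```latex
\begin{align*}
\bar{x} &= \tfrac{1}{5}(80 + 60 + 10 + 20 + 30) = 40, \qquad
\bar{y} = \tfrac{1}{5}(20 + 40 + 30 + 50 + 60) = 40 \\
S_{xy} &= \sum_i (x_i - \bar{x})(y_i - \bar{y}) = -800 + 0 + 300 - 200 - 200 = -900 \\
S_{xx} &= \sum_i (x_i - \bar{x})^2 = 1600 + 400 + 900 + 400 + 100 = 3400 \\
\text{slope} &= \frac{S_{xy}}{S_{xx}} = \frac{-900}{3400} \approx -0.264706 \\
\text{intercept} &= \bar{y} - \text{slope} \cdot \bar{x} \approx 40 + 10.588235 = 50.588235 \\
\hat{y}(85) &= \text{intercept} + \text{slope} \cdot 85 \approx 50.588235 - 22.5 = 28.088
\end{align*}
```

These agree with the slope, intercept and prediction shown in the code comments.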

See Simple Linear Regression

Multivariate Linear Regression

Multivariate linear regression is a generalization of multiple linear regression: not only the input variables are multivariate, but the output (dependent) variables are as well.

In the following example, we will perform a regression of a 2-dimensional output variable over a 3-dimensional input variable.

double[][] inputs = 
{
    // variables:  x1  x2  x3
    new double[] {  1,  1,  1 }, // input sample 1
    new double[] {  2,  1,  1 }, // input sample 2
    new double[] {  3,  1,  1 }, // input sample 3
};

double[][] outputs = 
{
    // variables:  y1  y2
    new double[] {  2,  3 }, // corresponding output to sample 1
    new double[] {  4,  6 }, // corresponding output to sample 2
    new double[] {  6,  9 }, // corresponding output to sample 3
};

The same in PowerShell:

$inputs = 
@(
    # variables:  x1  x2  x3
    @(  1,  1,  1 ), # input sample 1
    @(  2,  1,  1 ), # input sample 2
    @(  3,  1,  1 ) # input sample 3
);

$outputs = 
@(
    # variables:  y1  y2
    @(  2,  3 ), # corresponding output to sample 1
    @(  4,  6 ), # corresponding output to sample 2
    @(  6,  9 ) # corresponding output to sample 3
);

A quick visual inspection shows that the first output variable y1 is always double the first input variable, and the second output variable y2 is always triple the first input variable. The other input variables are unused. Nevertheless, we will fit a multivariate regression model to confirm this impression:

// Use Ordinary Least Squares to create the regression
OrdinaryLeastSquares ols = new OrdinaryLeastSquares();

// Now, compute the multivariate linear regression:
MultivariateLinearRegression regression = ols.Learn(inputs, outputs);

// We can obtain predictions using
double[][] predictions = regression.Transform(inputs);

// The prediction error is
double error = new SquareLoss(outputs).Loss(predictions); // 0

The same in PowerShell:

# Use Ordinary Least Squares to create the regression
$ols = [OrdinaryLeastSquares]::new();

# Now, compute the multivariate linear regression:
$regression = $ols.Learn($inputs, $outputs);

# We can obtain predictions using
$predictions = $regression.Transform($inputs);

using namespace Accord.Math.Optimization.Losses

# The prediction error is
$NumericError = [SquareLoss]::new($outputs).Loss($predictions); # 0
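
To see why the error is zero, it may help to write the model in matrix form. The sketch below assumes the usual formulation with a weight matrix W (3 inputs by 2 outputs) and an intercept vector b; both symbols are notation introduced here. Because x2 and x3 are constant across all samples, the exact solution is not unique, so the coefficients actually returned by the framework may differ from the particular solution shown.

```latex
\begin{align*}
\hat{y} &= W^{\top} x + b, \qquad
W \in \mathbb{R}^{3 \times 2}, \quad b \in \mathbb{R}^{2} \\[4pt]
\text{one exact solution:} \quad
W &= \begin{pmatrix} 2 & 3 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
b = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \\[4pt]
x = (1, 1, 1) &\mapsto (2, 3), \qquad
x = (2, 1, 1) \mapsto (4, 6), \qquad
x = (3, 1, 1) \mapsto (6, 9)
\end{align*}
```

Every sample is reproduced exactly, which is consistent with the squared error of 0 shown in the code comments.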

See Multivariate Linear Regression

Multiple Linear Regression

We will try to model a plane as an equation in the form "ax + by + c = z". We have two input variables (x and y) and we will be trying to find two parameters a and b and an intercept term c.

// We will use Ordinary Least Squares to create a
// linear regression model with an intercept term
var ols = new OrdinaryLeastSquares()
{
    UseIntercept = true
};

// Now suppose you have some points
double[][] inputs = 
{
    new double[] { 1, 1 },
    new double[] { 0, 1 },
    new double[] { 1, 0 },
    new double[] { 0, 0 },
};

// located in the same Z (z = 1)
double[] outputs = { 1, 1, 1, 1 };

// Use Ordinary Least Squares to estimate a regression model
MultipleLinearRegression regression = ols.Learn(inputs, outputs);

// As a result, we will be given the following:
double a = regression.Coefficients[0]; // a = 0
double b = regression.Coefficients[1]; // b = 0
double c = regression.Intercept; // c = 1

// This is the plane described by the equation
// ax + by + c = z => 0x + 0y + 1 = z => 1 = z.

// We can compute the predicted points using
double[] predicted = regression.Transform(inputs);

// And the squared error loss using 
double error = new SquareLoss(outputs).Loss(predicted);

The same in PowerShell:

using namespace Accord.Statistics.Models.Regression.Linear

# We will use Ordinary Least Squares to create a
# linear regression model with an intercept term
$ols = [OrdinaryLeastSquares]::new();
$ols.UseIntercept = $true;

# Now suppose you have some points
$inputs = 
@(
    @( 1, 1 ),
    @( 0, 1 ),
    @( 1, 0 ),
    @( 0, 0 )
);

# located in the same Z (z = 1)
$outputs = @( 1, 1, 1, 1 );

# Use Ordinary Least Squares to estimate a regression model
$regression = $ols.Learn($inputs, $outputs);

# As a result, we will be given the following:
$a = $regression.Coefficients[0]; # a = 0
$b = $regression.Coefficients[1]; # b = 0
$c = $regression.Intercept; # c = 1

# This is the plane described by the equation
# ax + by + c = z => 0x + 0y + 1 = z => 1 = z.

# We can compute the predicted points using
$predicted = $regression.Transform($inputs);

using namespace Accord.Math.Optimization.Losses

# And the squared error loss using 
$NumError = [SquareLoss]::new($outputs).Loss($predicted);

See Multiple Linear Regression and Partial Least Squares

Logistic Regression

Suppose we have the following (fictional) data about some patients. The first variable is continuous and represents the patient's age. The second variable is dichotomous and indicates whether the patient smokes. We also know whether each patient has had lung cancer, and we would like to know whether smoking has any connection with lung cancer.

double[][] input =
{              // age, smokes?, had cancer?
    new double[] { 55,    0  }, // false - no cancer
    new double[] { 28,    0  }, // false
    new double[] { 65,    1  }, // false
    new double[] { 46,    0  }, // true  - had cancer
    new double[] { 86,    1  }, // true
    new double[] { 56,    1  }, // true
    new double[] { 85,    0  }, // false
    new double[] { 33,    0  }, // false
    new double[] { 21,    1  }, // false
    new double[] { 42,    1  }, // true
};

bool[] output = // Whether each patient had lung cancer or not
{
    false, false, false, true, true, true, false, false, false, true
};

To verify this hypothesis, we are going to create a logistic regression model for those two inputs (age and smoking), learned using a method called "Iteratively Reweighted Least Squares":

// Create a new Iterative Reweighted Least Squares algorithm
var learner = new IterativeReweightedLeastSquares<LogisticRegression>()
{
    Tolerance = 1e-4,  // Let's set some convergence parameters
    Iterations = 100,  // maximum number of iterations to perform
    Regularization = 0
};

// Now, we can use the learner to finally estimate our model:
LogisticRegression regression = learner.Learn(input, output);

At this point, we can compute the odds ratios of our variables. In the model, index 0 is always the intercept term, and the input variables follow in order: index 1 is the age and index 2 is whether the patient smokes.

// For the age variable, each additional year of age multiplies
//   the odds of getting lung cancer by about 1.021, controlling
//   for cigarette smoking.
double ageOdds = regression.GetOddsRatio(1); // 1.0208597028836701

// For the smoking/non-smoking categorical variable, however, we
//   have that individuals who smoke have about 5.858 times the
//   odds of developing lung cancer compared to those who do not
//   smoke, controlling for age (remember, this is completely
//   fictional and for demonstration purposes only).
double smokeOdds = regression.GetOddsRatio(2); // 5.8584748789881331

// If we would like to use the model to predict a probability for
// each patient regarding whether they are at risk of cancer or not,
// we can use the Probability function:

double[] scores = regression.Probability(input);

// Finally, if we would like to arrive at a conclusion regarding
// each patient, we can use the Decide method, which will transform
// the probabilities (from 0 to 1) into actual true/false values:

bool[] actual = regression.Decide(input);
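
The odds ratios above relate to the model coefficients as sketched below. This uses the standard textbook form of logistic regression and assumes that the odds ratio reported for index i is the exponential of coefficient i; the symbols β0, β1 and β2 are notation introduced here, and the coefficient values are back-computed from the odds ratios in the code comments rather than read from the framework.

```latex
\begin{align*}
p &= P(\text{cancer} \mid \text{age}, \text{smokes})
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \,\text{age} + \beta_2 \,\text{smokes})}} \\
\frac{p}{1 - p} &= e^{\beta_0 + \beta_1 \,\text{age} + \beta_2 \,\text{smokes}} \\
\text{OR}_i &= e^{\beta_i} \\
\beta_1 &= \ln(1.0208597) \approx 0.0206, \qquad
\beta_2 = \ln(5.8584749) \approx 1.768
\end{align*}
```

Under this reading, an odds ratio above 1 corresponds to a positive coefficient, so both age and smoking increase the predicted odds in this fictional data set.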

See Logistic regression, Logistic Regression Analysis and Generalized Linear Models.

Multinomial Logistic Regression (Softmax)

See Multinomial Logistic Regression.

Support Vector Machines

// Declare a very simple regression problem 
// with only 2 input variables (x and y):
double[][] inputs =
{
    new[] { 3.0, 1.0 },
    new[] { 7.0, 1.0 },
    new[] { 3.0, 1.0 },
    new[] { 3.0, 2.0 },
    new[] { 6.0, 1.0 },
};

// The task is to output a weighted sum of those numbers 
// plus an independent constant term: 7.4x + 1.1y + 42
double[] outputs =
{
    7.4*3.0 + 1.1*1.0 + 42.0,
    7.4*7.0 + 1.1*1.0 + 42.0,
    7.4*3.0 + 1.1*1.0 + 42.0,
    7.4*3.0 + 1.1*2.0 + 42.0,
    7.4*6.0 + 1.1*1.0 + 42.0,
};

In the next example, we will create a kernel support vector machine with a Gaussian kernel to learn this regression function. Since this is a fairly easy linear problem, we set the machine's Complexity parameter to a very high value, which forces the learning algorithm toward hard-margin solutions; on noisier data such solutions would not generalize well. When training on real-world problems, either leave the properties UseKernelEstimation and UseComplexityHeuristic set to true or perform a grid search to find suitable values for the kernel and complexity parameters.

// Create a LibSVM-based support vector regression algorithm
var teacher = new FanChenLinSupportVectorRegression<Gaussian>()
{
    Tolerance = 1e-5,
    // UseKernelEstimation = true, 
    // UseComplexityHeuristic = true
    Complexity = 10000,
    Kernel = new Gaussian(0.1)
};

// Use the algorithm to learn the machine
var svm = teacher.Learn(inputs, outputs);

// Get machine's predictions for inputs
double[] prediction = svm.Score(inputs);

// Compute the error in the prediction (should be 0.0)
double error = new SquareLoss(outputs).Loss(prediction);

Since this is a linear problem, we can also take advantage of more specialized algorithms for learning with a linear kernel, as shown below:

// Create Newton-based support vector regression 
var teacher = new LinearRegressionNewtonMethod()
{
    Tolerance = 1e-5,
    // UseComplexityHeuristic = true
    Complexity = 10000
};

// Use the algorithm to learn the machine
var svm = teacher.Learn(inputs, outputs);

// Get machine's predictions for inputs
double[] prediction = svm.Score(inputs);

// Compute the error in the prediction (should be 0.0)
double error = new SquareLoss(outputs).Loss(prediction);

See Sequential Minimal Optimization for Regression, L1-regularized logistic regression, L2-regularized logistic regression in the dual and L2-regularized L2-loss logistic regression.

Neural Networks

The framework's Neural Networks module is in the process of being deprecated in favor of a more modern, GPU-based solution.

See Levenberg-Marquardt with Bayesian Regularization and Resilient Backpropagation.

Variations

Regression models censored in time

The framework also offers models for performing survival analysis, such as Cox's Proportional Hazards models. In survival analysis we are interested not only in the outcome of an experiment or event, but also in how long it takes to reach this outcome.

As an example, consider the survival problem below. Each row in the table represents a patient under care in a hospital. The first column holds the patient's age (a single feature here, although there could be many, such as age, height and weight), the second holds the time until an event happened (for example, death), and the third holds the event outcome, that is, what exactly happened after this amount of time: did the patient die, or did they simply leave the hospital so that no further data could be collected?

object[,] data =
{
    //    input         time until           outcome 
    // (features)     event happened     (what happened?)
    {       50,              1,         SurvivalOutcome.Censored  },
    {       70,              2,         SurvivalOutcome.Failed    },
    {       45,              3,         SurvivalOutcome.Censored  },
    {       35,              5,         SurvivalOutcome.Censored  },
    {       62,              7,         SurvivalOutcome.Failed    },
    {       50,             11,         SurvivalOutcome.Censored  },
    {       45,              4,         SurvivalOutcome.Censored  },
    {       57,              6,         SurvivalOutcome.Censored  },
    {       32,              8,         SurvivalOutcome.Censored  },
    {       57,              9,         SurvivalOutcome.Failed    },
    {       60,             10,         SurvivalOutcome.Failed    },
}; // Note: Censored means that we stopped recording data for that person,
   // so we do not know what actually happened to them, except that things
   // were going fine until the time given in the "time until event" column.

// Parse the data above
double[][] inputs = data.GetColumn(0).ToDouble().ToJagged();
double[] time = data.GetColumn(1).ToDouble();
SurvivalOutcome[] output = data.GetColumn(2).To<SurvivalOutcome[]>();

// Create a new PH Newton-Raphson learning algorithm
var teacher = new ProportionalHazardsNewtonRaphson()
{
    ComputeBaselineFunction = true,
    ComputeStandardErrors = true,
    MaxIterations = 100
};

// Use the learning algorithm to infer a Proportional Hazards model
ProportionalHazards regression = teacher.Learn(inputs, time, output);

// Use the regression to make predictions (problematic)
SurvivalOutcome[] prediction = regression.Decide(inputs);

// Use the regression to make score estimates 
double[] score = regression.Score(inputs);

// Use the regression to make probability estimates 
double[] probability = regression.Probability(inputs);
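
For reference, the model being fitted can be sketched in its standard textbook form. The symbols h, h0 and β below are notation introduced here; the framework's internal parameterization (for example, any centering of the inputs) may differ from this sketch.

```latex
\begin{align*}
h(t \mid x) &= h_0(t) \, e^{\beta^{\top} x} \\
\frac{h(t \mid x + 1)}{h(t \mid x)} &= e^{\beta}
\quad \text{(single feature, here the patient's age)}
\end{align*}
```

Here h0(t) is the baseline hazard shared by all patients, and the exponential of β is the hazard ratio associated with a one-unit increase in the feature, which in this example would be one additional year of age.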

See Cox's Proportional Hazards Model
