## Gradient Descent and Multivariable Linear Regression

In this notebook we are going to focus on performing a multivariable linear regression with Kotlin and DiffKt, specifically where we have two input variables and one output variable. 

A **linear regression** fits a linear function through data, often using a least squares method, to make predictions on new data. Using DiffKt we will perform gradient descent to optimize the coefficients. 

Bring in the DiffKt library, which provides the tensor types and automatic differentiation.

In [1]:
@file:DependsOn("../kotlin/api/build/libs/api.jar")

We are going to need a few imports for this notebook. 

In [2]:
import java.net.URL
import org.diffkt.*
import kotlin.random.Random

Next up let's bring in our dataset. This dataset has two input variable columns `x1` and `x2` as well as an output variable column `y`. It only has 15 records. You can find the CSV data here: https://bit.ly/35ebET5.

To start, use `URL` and Kotlin's collection operators to parse the CSV into `Point` objects, for which we also declare a data class.

In [3]:
data class Point(val x1: Float, val x2: Float, val y: Float)

val points = URL("https://bit.ly/35ebET5")    // read CSV
    .readText().split(Regex("\\r?\\n"))       // split lines using regular expression
    .filter { it.matches(Regex("[-,.0-9]+")) }  // filter only numeric records using regular expression
    .map { it.split(",").map{ it.toFloat()} } // split commas into columns
    .map { (x1,x2,y) -> Point(x1,x2,y) }      // map to Point objects
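
To see what the regular-expression filter in this pipeline does, here is a minimal sketch that runs the same steps on an inline string instead of the downloaded file. The CSV text, its header row, and the names `sampleCsv`, `SamplePoint`, and `samplePoints` are all made up for illustration; the real file's layout is an assumption.

```kotlin
// Hypothetical CSV text standing in for the downloaded file (values are made up).
// It mixes a header row, Windows and Unix line endings, and a blank line.
val sampleCsv = "x1,x2,y\n2,10,3.0\r\n4,20,5.0\n\n10,30,7.0"

data class SamplePoint(val x1: Float, val x2: Float, val y: Float)

val samplePoints = sampleCsv
    .split(Regex("\\r?\\n"))                        // split on \n or \r\n
    .filter { it.matches(Regex("[-,.0-9]+")) }      // keep purely numeric lines only
    .map { line -> line.split(",").map { it.toFloat() } }
    .map { (x1, x2, y) -> SamplePoint(x1, x2, y) }  // destructure the three columns
```

The header line contains letters and the blank line is empty, so neither matches the pattern and only the three numeric records survive.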


Below is a visualization of the data in a 3D scatterplot. `x1` and `x2` are mapped to the horizontal axes, and the output variable `y` to the vertical axis. 

![](./resources/HiQXwPkaiO.mp4)

We are going to need to map these points to DiffKt tensors. We will use the `tensorOf()` function and map the `points` inside it. 

When we map to the input tensor `x` and the output tensor `y`, we use lambdas to specify which columns to generate and from which values. Notice, however, that for the `x` tensor below we add a third column that simply returns $ 1 $. This adds a column of 1's next to our `x1` and `x2` input variables. Why is this necessary? It serves as a placeholder for the intercept coefficient: multiplying the intercept by 1 adds it unchanged to every prediction. Without it, we would only fit the slopes for `x1` and `x2`, with no intercept.

In [4]:
// map variables to input and output variable tensors
// add a placeholder "1" column to generate intercept on input tensor
val x = tensorOf(points.flatMap { listOf(it.x1, it.x2, 1f) }.map(::FloatScalar) ).reshape(points.size, 3)
val y = tensorOf(points.map { it.y }.map(::FloatScalar) ).reshape(points.size, 1)
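
The `flatMap` followed by `reshape(points.size, 3)` relies on the flat list being laid out row by row. The sketch below shows that layout in plain Kotlin lists, without DiffKt. The records and the names `records`, `flatInputs`, and `designRows` are hypothetical, and row-major ordering is an assumption about what the cell above depends on.

```kotlin
// Hypothetical records (x1, x2, y); the values are made up for illustration
val records = listOf(
    Triple(2f, 10f, 3f),
    Triple(4f, 20f, 5f),
    Triple(10f, 30f, 7f)
)

// Flatten each record to [x1, x2, 1], giving one long list in row order
val flatInputs: List<Float> = records.flatMap { (x1, x2, _) -> listOf(x1, x2, 1f) }

// Regroup into rows of 3 to see the n x 3 matrix the flat list represents
val designRows: List<List<Float>> = flatInputs.chunked(3)
```

Each row of `designRows` ends in `1f`, which is the placeholder column for the intercept.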

Before we dive into the DiffKt operations and perform differentiation and gradient descent, let's look at the mathematical side of things. 

A linear function with two inputs and one output will have three coefficients: $ \beta_1 $ and $ \beta_2 $ for the slopes of $ x_1 $ and $ x_2 $ respectively, and $ \beta_0 $ for the intercept. Here is the linear function below. 

$ y = \beta_1x_1 + \beta_2x_2 + \beta_0 $

We need to solve for those three coefficients $ \beta_0 $, $ \beta_1 $, and $ \beta_2 $. Rather than track them as three separate scalar values, we can consolidate them into a single tensor holding three values, ordered to match the columns of `x`: $ \beta_1 $, $ \beta_2 $, then $ \beta_0 $. Let's initialize the `betas` tensor here. 

In [5]:
// initialize coefficients
var betas: DTensor = FloatTensor.random(Random,Shape(3,1))

To visualize the tensor operations, let's say our betas were initialized with the following values. 

$ \beta = \left[\begin{matrix}0.1\\0.2\\0.5\end{matrix}\right] $ 

And let's say we have 3 records of $ X $ inputs with the added column of 1's. 

$ X = \left[\begin{matrix}2 & 10 & 1\\4 & 20 & 1\\10 & 30 & 1\end{matrix}\right] $ 

To get the predicted `Y` values, we apply matrix multiplication (dot products) between the input $ X $ variables (with the additional column of 1's) and the $ \beta $ coefficients. 

$ \hat{Y} = X \cdot \beta $ 

$ \hat{Y} = \left[\begin{matrix}2 & 10 & 1\\4 & 20 & 1\\10 & 30 & 1\end{matrix}\right] \cdot \left[\begin{matrix}0.1\\0.2\\0.5\end{matrix}\right] $ 

$ \hat{Y} =  \left[\begin{matrix}(2 \times 0.1) + (10 \times 0.2) + (1 \times 0.5) \\(4 \times 0.1) + (20 \times 0.2) + (1 \times 0.5) \\(10 \times 0.1) + (30 \times 0.2) + (1 \times 0.5) \end{matrix}\right] $

$ \hat{Y} = \left[\begin{matrix}2.7\\4.9\\7.5\end{matrix}\right] $

So that would yield predictions of $ 2.7 $, $ 4.9 $, and $ 7.5 $. 
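
The worked example above can be reproduced in a few lines of plain Kotlin. This is only a sketch using arrays rather than DiffKt tensors, and `predictAll`, `exampleX`, `exampleBeta`, and `exampleYHat` are names introduced here for illustration.

```kotlin
// Multiply an n x k matrix (as a list of rows) by a length-k coefficient vector.
// Each prediction is the dot product of one row with the coefficients.
fun predictAll(rows: List<DoubleArray>, beta: DoubleArray): List<Double> =
    rows.map { row -> row.indices.sumOf { j -> row[j] * beta[j] } }

// The X matrix from the example, including the column of 1's
val exampleX = listOf(
    doubleArrayOf(2.0, 10.0, 1.0),
    doubleArrayOf(4.0, 20.0, 1.0),
    doubleArrayOf(10.0, 30.0, 1.0)
)

// Coefficients in column order: beta1, beta2, beta0
val exampleBeta = doubleArrayOf(0.1, 0.2, 0.5)

// Approximately [2.7, 4.9, 7.5], up to floating-point rounding
val exampleYHat = predictAll(exampleX, exampleBeta)
```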

To get predictions on all data given the current `betas` coefficients, use DiffKt's `matmul()` function: 

In [6]:
val yPredictions = x.matmul(betas)
yPredictions

tensorOf(
-0.30586803f, -0.32428598f, -3.6170025f, -1.7390538f, 0.83997166f, -0.20851892f, -0.2585467f, 0.0964275f, 1.623082f, -0.58209056f,
-0.10559827f, 2.7798853f, 0.2563885f, -0.3351602f, -1.3683783f, 2.598876f, -2.7025952f, 2.1805964f, -0.37243187f, -0.452739f,
-2.4604197f, -1.1078262f, 0.14497766f, -1.6079535f, 0.10575348f, -1.8638943f, 1.1504334f, -0.4303024f, 1.4628142f, -2.2788105f,
-1.9212686f, -0.91113997f, 1.016962f, -0.8875239f, 0.016145647f, 1.0592449f, -0.9645479f, -0.4120273f, -0.7363904f, 1.0662426f,
-0.10019079f, 1.9239367f, 2.454463f, -0.057665467f, -0.6313351f).reshape(Shape(45, 1))

To calculate the total loss, let's use the sum of squared errors. Subtract the predicted $ \hat{Y} $ values from the actual $ Y $ values. Take those differences, square them, and sum them. 

$ E = \sum{(Y - \hat{Y})^2 } $ 

Let's say we have these predicted $ \hat{Y} $ and actual $ Y $ values. 

$ \hat{Y} = \left[\begin{matrix}2.7\\4.9\\7.5\end{matrix}\right] $

$ Y = \left[\begin{matrix}3.0\\5.0\\7.0\end{matrix}\right] $

Here is how we would calculate the sum of squares. 

$ E = \sum{(Y - \hat{Y})^2 } $ 

$ E = \sum{(\left[\begin{matrix}3.0\\5.0\\7.0\end{matrix}\right] - \left[\begin{matrix}2.7\\4.9\\7.5\end{matrix}\right])^2 } $ 

$ E = \sum{(\left[\begin{matrix}0.3\\0.1\\-0.5\end{matrix}\right])^2} $ 

$ E = \sum{\left[\begin{matrix}0.09\\0.01\\0.25\end{matrix}\right]} $ 

$ E = 0.35 $ 
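
The same arithmetic can be checked in plain Kotlin. This sketch works on ordinary lists instead of DiffKt tensors, and `sumOfSquaredErrors` and `exampleLoss` are hypothetical names used only here.

```kotlin
// Sum of squared differences between actual and predicted values
fun sumOfSquaredErrors(actual: List<Double>, predicted: List<Double>): Double {
    require(actual.size == predicted.size) { "actual and predicted must have the same length" }
    return actual.zip(predicted).sumOf { (y, yHat) -> (y - yHat) * (y - yHat) }
}

// Values from the worked example; the result is approximately 0.35
val exampleLoss = sumOfSquaredErrors(
    actual = listOf(3.0, 5.0, 7.0),
    predicted = listOf(2.7, 4.9, 7.5)
)
```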

We can implement this as a `loss()` function in Kotlin using DiffKt as shown below. Remember that the predicted $ \hat{Y} $ values are the dot products of $ X $ and the $ \beta $ coefficients. 

In [7]:
fun loss(betas: DTensor): DScalar =
    (y - (x.matmul(betas))).pow(2).sum()

Finally we are ready to perform gradient descent. For $ 1,000 $ iterations, we will use a learning rate of $ .001 $ and take the reverse derivative of the `loss()` function with respect to the `betas` tensor. This returns the gradient for each $ \beta $ coefficient, which we multiply by the learning rate and subtract from the `betas` tensor. We subtract because the gradient points in the direction of increasing loss, and we want to descend toward lower loss.
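
For reference, this particular loss has a well-known closed-form gradient, so it is possible to see what the automatic differentiation step evaluates. Writing the learning rate as $ \eta $ (a symbol introduced here for illustration, not used elsewhere in the notebook), the loss, its gradient, and the update applied on each iteration are:

$ E(\beta) = \sum{(Y - X \cdot \beta)^2 } $

$ \nabla_{\beta} E = -2 X^T \cdot (Y - X \cdot \beta) $

$ \beta \leftarrow \beta - \eta \nabla_{\beta} E $

The gradient has one entry per coefficient, so it has the same shape as the `betas` tensor and can be subtracted from it directly.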

In [9]:
// The learning rate
val lr = .001F

// The number of iterations to perform gradient descent
val iterations = 1_000

// Perform gradient descent
for (i in 0..iterations) {

    // get the gradient of the loss with respect to each beta coefficient
    val betaGradients = reverseDerivative(betas, ::loss)

    // update the betas by subtracting (learning rate) * (gradient)
    betas -= betaGradients * lr
}
print("betas=$betas")

betas=tensorOf(0.3192737f, 0.6086176f, -1.0212873f).reshape(Shape(3, 1))

We should get the following coefficient values from our `betas` tensor after training finishes. 

$ \beta_1 = 0.3192737 $ 

$ \beta_2 = 0.6086176 $ 

$ \beta_0 = -1.0212873 $ 

$ y = \beta_1 x_1 +  \beta_2 x_2  + \beta_0 $ 

$ y = 0.3192737 x_1 +  0.6086176 x_2 - 1.0212873 $ 
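
Once the coefficients are known, predicting on new data is a single expression. The sketch below hard-codes the values printed above; the names `fittedBeta1`, `fittedBeta2`, `fittedBeta0`, and `predictY` are hypothetical helpers, not part of DiffKt.

```kotlin
// Coefficients copied from the training output above
val fittedBeta1 = 0.3192737
val fittedBeta2 = 0.6086176
val fittedBeta0 = -1.0212873

// Evaluate the fitted plane at a new (x1, x2) input
fun predictY(x1: Double, x2: Double): Double =
    fittedBeta1 * x1 + fittedBeta2 * x2 + fittedBeta0
```

For example, `predictY(0.0, 0.0)` returns the intercept alone, since both slope terms vanish.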

If we were to visualize this linear regression fit as a plane in a 3D scatterplot, here is what it looks like. 

![](./resources/cCVILwOVBr.mp4)
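
As a cross-check of the whole procedure, here is a minimal plain-Kotlin sketch of the same gradient descent loop. It swaps DiffKt's `reverseDerivative` for the hand-derived gradient of the sum of squared errors, $ -2 X^T \cdot (Y - X \cdot \beta) $, and trains on a small synthetic dataset generated from known coefficients. The data, the learning rate of 0.01, the 2,000 iterations, the zero starting point, and every name in the block are assumptions chosen for this illustration, not values from the notebook.

```kotlin
// Synthetic inputs with a trailing column of 1's (made up for illustration)
val trainX = listOf(
    doubleArrayOf(0.0, 0.0, 1.0),
    doubleArrayOf(1.0, 0.0, 1.0),
    doubleArrayOf(0.0, 1.0, 1.0),
    doubleArrayOf(1.0, 1.0, 1.0),
    doubleArrayOf(2.0, 1.0, 1.0),
    doubleArrayOf(1.0, 2.0, 1.0)
)

// Outputs generated exactly from y = 2*x1 + 3*x2 + 1, so the true betas are known
val trainY = trainX.map { 2.0 * it[0] + 3.0 * it[1] + 1.0 }

// Hand-derived gradient of the sum of squared errors: -2 * X^T (Y - X beta)
fun sseGradient(beta: DoubleArray): DoubleArray {
    val grad = DoubleArray(beta.size)
    for (i in trainX.indices) {
        val row = trainX[i]
        val residual = trainY[i] - row.indices.sumOf { j -> row[j] * beta[j] }
        for (j in beta.indices) grad[j] += -2.0 * row[j] * residual
    }
    return grad
}

// Repeatedly step against the gradient, starting from zeros rather than random values
fun fitByGradientDescent(learningRate: Double = 0.01, iterations: Int = 2_000): DoubleArray {
    val beta = DoubleArray(3)
    repeat(iterations) {
        val grad = sseGradient(beta)
        for (j in beta.indices) beta[j] -= learningRate * grad[j]
    }
    return beta
}

// Approaches [2.0, 3.0, 1.0], the coefficients used to generate trainY
val fittedBetas = fitByGradientDescent()
```

Because the synthetic outputs contain no noise, the recovered coefficients should match the generating ones closely, which makes this a convenient sanity check before trusting the loop on real data.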