## Gradient Descent and Linear Regression

A **linear regression** fits a linear function through data, often using a least squares method, to make predictions on new data. In this notebook we will learn how to build a linear regression in Kotlin. Along the way we will use gradient descent and DiffKT to perform automatic differentiation.

Here we focus only on fitting a linear regression, using the least squares methodology and gradient descent. Validating and analyzing linear function models is beyond the scope of this notebook; see the callout box below for pointers.

> Just fitting a line through some points is by no means a guarantee that it will produce good predictions. Concepts like statistical significance, correlation, and prediction intervals can provide tools for further validation from a statistics standpoint. In machine learning, a train/test split is often performed where the linear function is fitted to a majority portion of the data, known as the training data, and measured for prediction performance on the remaining portion, known as the test data.

To visualize the linear regression and the least squares method, let's use [Lets-Plot](https://blog.jetbrains.com/kotlin/2020/12/lets-plot-in-kotlin/). It ships with the Kotlin Jupyter kernel, which is required to run this notebook and can be installed with the command below.

```shell
conda install kotlin-jupyter-kernel -c jetbrains
```

Then let's enable Lets-Plot in this notebook.

In [1]:
%use lets-plot

Also let's bring in the DiffKT library to aid our gradient descent implementation.

In [2]:
@file:DependsOn("../kotlin/api/build/libs/api.jar")

Let's first declare some data and plot it using Kotlin code. We will also count the number of records $ n $ in our data.

In [3]:
val xData = listOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val yData = listOf(5.0, 10.0, 10.0, 15.0, 14.0, 15.0, 19.0, 18.0, 25.0, 23.0)
val n = xData.size

val xScaleMin = 1
val xScaleMax = 10
val yScaleMin = 4
val yScaleMax = 26

// create the scatterplot 
val p = letsPlot { x=xData; y=yData; } + ggsize(600, 500) + geomPoint(shape=1, size=3) +
    scaleXContinuous("", (xScaleMin..xScaleMax).toList()) + 
    scaleYContinuous("", (yScaleMin..yScaleMax).toList())

p

Just looking at this plot, we can see a visible linear correlation in the data. But how do we fit a linear function to it?

First let's draw a line through the points. We will establish how we fit this line later but let's draw a line with a slope of $ 1.93939 $ and an intercept of $ 4.73333 $. 

In [4]:
fun testLine(x: Double) = 1.93939 * x + 4.73333

p + geomABLine(slope=1.93939, intercept=4.73333, color="red") 

Note that the line does not pass perfectly through the points, simply because the points do not lie on a straight line. There is a linear pattern, but noise has pushed each point above or below the line by some margin. The differences between the line and the points are called **residuals**. Let's plot the residuals as shown below.

In [5]:
var residualGraph = p + geomPoint(shape=1, size=3) + 
    geomABLine(slope=1.93939, intercept=4.73333, color="red")  

// generate residuals 
for ((x, actualY) in xData.zip(yData)) { 
    residualGraph += geomSegment(x=x, y=actualY, xend=x, yend=testLine(x), size=1)
}

residualGraph

Numerically speaking, we can calculate the residuals by taking the difference between the actual $ y $ values and the predicted $ \hat{y} $ values as shown below. 

In [6]:
for ((x,actualY) in xData.zip(yData)) { 
    val predictedY = testLine(x.toDouble())
    val residual = actualY - predictedY
    println("Actual Y: $actualY, Predicted Y: $predictedY, Residual: $residual")
}

Actual Y: 5.0, Predicted Y: 6.67272, Residual: -1.67272
Actual Y: 10.0, Predicted Y: 8.61211, Residual: 1.3878900000000005
Actual Y: 10.0, Predicted Y: 10.5515, Residual: -0.5515000000000008
Actual Y: 15.0, Predicted Y: 12.49089, Residual: 2.5091099999999997
Actual Y: 14.0, Predicted Y: 14.43028, Residual: -0.4302799999999998
Actual Y: 15.0, Predicted Y: 16.36967, Residual: -1.3696699999999993
Actual Y: 19.0, Predicted Y: 18.30906, Residual: 0.6909400000000012
Actual Y: 18.0, Predicted Y: 20.24845, Residual: -2.2484499999999983
Actual Y: 25.0, Predicted Y: 22.187839999999998, Residual: 2.812160000000002
Actual Y: 23.0, Predicted Y: 24.127229999999997, Residual: -1.1272299999999973


How do we measure these residuals in total to see how well the line fits the points? We do not want to simply sum the residuals, as the negatives will cancel out the positives. We do not want to sum the absolute values either, because absolute values are mathematically difficult to work with. The absolute value function has a sharp corner at zero where it is not differentiable, and gradient descent relies on derivatives.
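
The cancellation described above is easy to check numerically. The following is a minimal sketch in plain Kotlin, runnable as its own notebook cell; the names `xs`, `ys`, and `candidateLine` are hypothetical and simply redeclare the data and test line from the earlier cells so the snippet stands alone.

```kotlin
import kotlin.math.abs

// Same data and test line as the cells above, redeclared under
// hypothetical names so this sketch is self-contained.
val xs = listOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val ys = listOf(5.0, 10.0, 10.0, 15.0, 14.0, 15.0, 19.0, 18.0, 25.0, 23.0)
fun candidateLine(x: Double) = 1.93939 * x + 4.73333

// residual = actual y minus predicted y
val residuals = xs.zip(ys).map { (x, y) -> y - candidateLine(x) }

// Signed residuals largely cancel each other out
val signedSum = residuals.sum()

// Absolute values do not cancel, but abs() has a corner at zero
val absoluteSum = residuals.sumOf { abs(it) }

println("Sum of residuals: $signedSum")
println("Sum of absolute residuals: $absoluteSum")
```

The signed sum comes out close to zero even though every individual residual is clearly nonzero, which is why it cannot serve as a measure of fit.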

Instead we square each of the residuals and sum them, giving us the total error. In mathematical notation, each prediction $ \hat{y}_i $ is subtracted from the actual value $ y_i $, and then squared. This is what we call our **loss function**.

$ E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $

We can visualize this literally by taking each residual and overlaying a square with a matching side length. The sum of all those square areas will be our **loss**, or total error, measuring how far off our points are from the line in aggregate. Notice that by squaring the residuals, we penalize larger residuals by making them even larger (e.g. $ 2^2 = 4 $ but $ 9^2 = 81 $). The larger the residual, the more it is amplified.

In [7]:
// Overlay squares onto residuals to visualize sum of squares
var squaresGraph = residualGraph + geomPoint(shape = 1, size = 3) + 
    geomABLine(slope=1.93939, intercept = 4.73333, color="red")

for ((x,actualY) in xData.zip(yData)) {
    
    val predictedY = testLine(x) 
    val residual = actualY - predictedY
    squaresGraph += geomRect(xmin=x, 
                             ymin=if (residual < 0) actualY else predictedY, 
                             xmax=x+abs(residual) * (xScaleMax.toDouble() / yScaleMax), // account for axis scaling
                             ymax=if (residual < 0) predictedY else actualY, 
                             color="orange", 
                             fill="orange",
                             alpha=.3, 
                            )
}

squaresGraph
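
The total area of the squares drawn above can also be computed directly. The sketch below uses plain Kotlin with hypothetical names (`xPoints`, `yPoints`, `sumOfSquares`) and compares the line drawn so far against two arbitrary alternatives, chosen here only for illustration.

```kotlin
// Same data as above, redeclared under hypothetical names
val xPoints = listOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val yPoints = listOf(5.0, 10.0, 10.0, 15.0, 14.0, 15.0, 19.0, 18.0, 25.0, 23.0)

// Hypothetical helper: sum of squared residuals for the line y = m*x + b
fun sumOfSquares(m: Double, b: Double): Double =
    xPoints.zip(yPoints).sumOf { (x, y) ->
        val residual = y - (m * x + b)
        residual * residual
    }

// The line drawn above, plus two arbitrary alternatives for comparison
val lossOfDrawnLine = sumOfSquares(1.93939, 4.73333)
val lossOfFlatterLine = sumOfSquares(1.5, 4.73333)   // smaller slope
val lossOfShiftedLine = sumOfSquares(1.93939, 6.0)   // larger intercept

println("Drawn line:   $lossOfDrawnLine")
println("Flatter line: $lossOfFlatterLine")
println("Shifted line: $lossOfShiftedLine")
```

Both alternatives produce a larger loss than the drawn line, which is the sense in which a line can be "better" or "worse" under least squares.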

Now that we have an objective function, how do we minimize it? Which line produces the smallest possible total area of these squares? One way to find it is by means of **gradient descent**, which optimizes the parameters by following their derivatives to the minimum loss.

Given that the $ x $ and $ y $ values are provided in the data, we are left to solve for the slope $ m $ and intercept $ b $ in the function $ y = mx + b $.

$$
y = mx + b
$$

By differentiating our loss function (the sum of squared errors) with respect to the slope and the intercept, we get a gradient for each parameter. We then repeatedly adjust each parameter by subtracting a fraction of its gradient, where that fraction is known as the **learning rate**, until the gradients approach 0. When all gradients are 0, the loss has been minimized. When we plot the loss for a simple linear regression with one slope and one intercept, we can see it forms a convex surface. Gradient descent walks down this surface toward the point where every gradient is 0, which is its lowest point.

![](resources/rWkCTQwALn.png)
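
For reference, the gradients mentioned above can be written out by hand. Substituting $ \hat{y}_i = m x_i + b $ into the sum of squared errors and applying the chain rule gives the partial derivatives below. The symbol $ \alpha $ for the learning rate is an assumption made here for notation; the code later calls it `lr`.

$$
\frac{\partial E}{\partial m} = \sum_{i=1}^{n} -2 x_i \left( y_i - (m x_i + b) \right)
\qquad
\frac{\partial E}{\partial b} = \sum_{i=1}^{n} -2 \left( y_i - (m x_i + b) \right)
$$

Each gradient descent step then moves both parameters against their gradients:

$$
m \leftarrow m - \alpha \frac{\partial E}{\partial m}
\qquad
b \leftarrow b - \alpha \frac{\partial E}{\partial b}
$$

Automatic differentiation computes the same quantities without requiring us to derive them manually.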


We need to get to the lowest point on this surface to minimize loss. Thankfully we can use DiffKT to perform automatic differentiation for us and guide us down the gradients to the optimal linear regression parameters. Let's start by declaring two scalar tensors holding the slope and intercept coefficients for our line. We will initialize them at $ 0 $, and gradient descent will optimize them. Let's also re-declare our `xData` and `yData` as tensors and express the sum of squares operation in the `f()` function declared below. The function accepts the slope and intercept as parameters and returns the sum of squares.

In [8]:
import org.diffkt.*

val xData = tensorOf(1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f, 10f)
val yData = tensorOf(5f, 10f, 10f, 15f, 14f, 15f, 19f, 18f, 25f, 23f)

// Start m and b at 0
var slope: DScalar = FloatScalar.ZERO
var intercept: DScalar = FloatScalar.ZERO

// calculate sum of squares of the error with given slope and intercept for a line
fun f(slope: DScalar, intercept: DScalar): DScalar =
    (((xData * slope) + intercept) - yData).pow(2).sum()

To apply gradient descent to this squared error `f()` function, we will adjust the `slope` and `intercept` iteratively. On each pass of the loop, we pass `f()` to the `reverseDerivative()` function to get the gradients for the slope and intercept respectively. Since this function has more inputs than outputs (2 inputs, 1 output), we will get more efficiency using reverse differentiation. We then multiply each gradient by a learning rate `lr` of $ 0.0025 $ and subtract the result from the respective coefficient. With `iterations` set to 1000, the inclusive range `0..iterations` makes the loop run 1001 times.

In [9]:
// The learning rate
val lr = .0025F

// The number of iterations to perform gradient descent
val iterations = 1000

// Perform gradient descent
// (0..iterations is an inclusive range, so the body runs iterations + 1 times)
for (i in 0..iterations) {

    // get gradients for line slope and intercept
    val (slopeGradient, interceptGradient) = reverseDerivative(slope, intercept, ::f)

    // update m and b by subtracting (learning rate) * (gradient)
    slope -= slopeGradient * lr
    intercept -= interceptGradient * lr
}
print("slope=$slope intercept=$intercept") 

slope=1.9394095 intercept=4.7332234

After the iterations complete, we get a slope of $ 1.9394095 $ and an intercept of $ 4.7332234 $, closely matching the line we drew earlier. Let's graph this line against our scatterplot.

In [11]:
p + geomABLine(slope=(slope as FloatScalar).value.toDouble(), 
               intercept = (intercept as FloatScalar).value.toDouble(), 
               color="red")

Looks pretty good! The line fits comfortably along the points. We have now performed a linear regression by applying gradient descent with DiffKT.
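
As a cross-check, the same descent can be sketched without DiffKT by coding the partial derivatives of the sum of squared errors by hand. This is a minimal illustration in plain Kotlin, not a replacement for automatic differentiation; all names (`xValues`, `yValues`, `slopeGradientOf`, `interceptGradientOf`, `manualSlope`, `manualIntercept`, `stepSize`) are hypothetical, and it uses `Double` rather than the `Float` tensors above, so the last digits will differ.

```kotlin
// Same data as above, redeclared under hypothetical names
val xValues = listOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val yValues = listOf(5.0, 10.0, 10.0, 15.0, 14.0, 15.0, 19.0, 18.0, 25.0, 23.0)

// Hand-derived partial derivative of the loss with respect to the slope m
fun slopeGradientOf(m: Double, b: Double): Double =
    xValues.zip(yValues).sumOf { (x, y) -> -2.0 * x * (y - (m * x + b)) }

// Hand-derived partial derivative of the loss with respect to the intercept b
fun interceptGradientOf(m: Double, b: Double): Double =
    xValues.zip(yValues).sumOf { (x, y) -> -2.0 * (y - (m * x + b)) }

var manualSlope = 0.0
var manualIntercept = 0.0
val stepSize = 0.0025   // same learning rate as the DiffKT version

repeat(1000) {
    // compute both gradients before updating either parameter
    val gm = slopeGradientOf(manualSlope, manualIntercept)
    val gb = interceptGradientOf(manualSlope, manualIntercept)
    manualSlope -= stepSize * gm
    manualIntercept -= stepSize * gb
}

println("slope=$manualSlope intercept=$manualIntercept")
```

The result should land close to the DiffKT values. The convenience of automatic differentiation is that the two gradient functions never have to be derived or maintained by hand when the loss function changes.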