## Linear Regression with User Defined Types and Gradient Descent

**Copyright (c) Meta Platforms, Inc. and affiliates.**
 
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.

A **linear regression** fits a linear function to data, typically by the method of least squares, so that it can make predictions on new data. Here we implement linear regression with DiffKt's support for user-defined types, specifically the `Differentiable` interface (which extends `Wrappable`). We will create custom classes that implement this interface, then use DiffKt's automatic differentiation to drive gradient descent, along with operator functions that keep the least squares code concise.

To visualize the linear regression and the least squares method, let's use [Lets-Plot](https://blog.jetbrains.com/kotlin/2020/12/lets-plot-in-kotlin/). It comes with the Kotlin Jupyter kernel, which is required to run this notebook and can be installed with the command below.

```shell
conda install kotlin-jupyter-kernel -c jetbrains
```

Then let's enable Lets-Plot in this notebook.

In [1]:
%use lets-plot

Let's also bring in the DiffKt library to support our gradient descent implementation.

In [2]:
@file:DependsOn("../kotlin/api/build/libs/api.jar")

As stated, we are going to take an object-oriented approach to using DiffKt, and leverage its features to bridge the gap between numerical and object-oriented computing. The `Wrappable` and `Differentiable` interfaces make this possible.

Let's `import` DiffKt and then declare a `Point` class. This will hold two properties `x` and `y` which are both of type `Float`. We will also throw in a convenient overloaded constructor that accepts `x` and `y` as `Int` types and then converts them to `Float` values as they are passed to the primary constructor. 

In [3]:
import org.diffkt.*

data class Point(val x: Float, val y: Float) {
    constructor(x: Int, y: Int): this(x.toFloat(), y.toFloat())
}

Let's then declare some data as `Point` objects in a `List<Point>` and plot them.

In [4]:
val trainingData = listOf(
    Point(1,5),
    Point(2,10),
    Point(3,10),
    Point(4,15),
    Point(5,14),
    Point(6,15),
    Point(7,19),
    Point(8,18),
    Point(9,25),
    Point(10,23)
)

val xScale = 1..10
val yScale = 4..26

// create the scatterplot 
val p = letsPlot { 
        x=trainingData.map(Point::x); 
        y=trainingData.map(Point::y); 
    } + ggsize(600, 500) + geomPoint(shape=1, size=3) +
    scaleXContinuous("", xScale.toList()) + 
    scaleYContinuous("", yScale.toList())

p

Just looking at this plot, we can see a visible linear correlation in the data. But how do we fit a linear function to it?

First let's draw a line through the points. We will establish how we find this best fit line later, but let's draw a line with a slope of $ 1.93939 $ and an intercept of $ 4.73333 $. 

In [5]:
val m = 1.93939
val b = 4.73333

var residualGraph = p + geomPoint(shape=1, size=3) + 
    geomABLine(slope=m, intercept=b, color="red")  

// generate residuals 
for ((x, actualY) in trainingData) { 
    residualGraph += geomSegment(x=x.toDouble(), y=actualY.toDouble(), xend=x.toDouble(), yend=m*x + b, size=1)
}

residualGraph

Numerically speaking, we can calculate the residuals by taking the difference between the actual $ y $ values and the predicted $ \hat{y} $ values as shown below. 

In [6]:
for ((x,actualY) in trainingData) { 
    val predictedY = m*x + b 
    val residual = actualY - predictedY
    println("Actual Y: $actualY, Predicted Y: $predictedY, Residual: $residual")
}

Actual Y: 5.0, Predicted Y: 6.67272, Residual: -1.67272
Actual Y: 10.0, Predicted Y: 8.61211, Residual: 1.3878900000000005
Actual Y: 10.0, Predicted Y: 10.5515, Residual: -0.5515000000000008
Actual Y: 15.0, Predicted Y: 12.49089, Residual: 2.5091099999999997
Actual Y: 14.0, Predicted Y: 14.43028, Residual: -0.4302799999999998
Actual Y: 15.0, Predicted Y: 16.36967, Residual: -1.3696699999999993
Actual Y: 19.0, Predicted Y: 18.30906, Residual: 0.6909400000000012
Actual Y: 18.0, Predicted Y: 20.24845, Residual: -2.2484499999999983
Actual Y: 25.0, Predicted Y: 22.187839999999998, Residual: 2.812160000000002
Actual Y: 23.0, Predicted Y: 24.127229999999997, Residual: -1.1272299999999973


How do we measure these residuals in total to see how well the line fits the points? We do not want to simply sum the residuals, because the negatives will cancel out the positives. We do not want to sum the absolute values either, because absolute values are mathematically awkward to work with: the absolute value function has a sharp corner at zero where it is not differentiable, and gradient descent relies on derivatives.
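
To make that corner at zero concrete, compare the derivative of the absolute residual with the derivative of the squared residual. Here $ r $ is shorthand for a single residual $ y_i - \hat{y}_i $ (a symbol introduced only for this comparison):

$ \frac{d}{dr}|r| = \begin{cases} -1 & r < 0 \\ \text{undefined} & r = 0 \\ 1 & r > 0 \end{cases} \qquad \frac{d}{dr} r^2 = 2r $

The squared form has a derivative everywhere, and that derivative shrinks smoothly toward zero as the residual does.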

Instead we square each of the residuals and sum them, giving us the total error. In mathematical notation, each prediction $ \hat{y}_i $ is subtracted from the actual value $ y_i $, and then squared. This is what we call our **loss function**.

$ E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $
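
As a quick numeric check of this formula, here is a minimal standalone sketch that evaluates the loss in plain `Double` arithmetic, without DiffKt. The names `samples` and `sumOfSquaredResiduals` are illustrative assumptions and not part of the notebook's code.

```kotlin
// Standalone sketch: plain Doubles, no DiffKt. All names here are illustrative.
val samples: List<Pair<Double, Double>> = listOf(
    1.0 to 5.0, 2.0 to 10.0, 3.0 to 10.0, 4.0 to 15.0, 5.0 to 14.0,
    6.0 to 15.0, 7.0 to 19.0, 8.0 to 18.0, 9.0 to 25.0, 10.0 to 23.0
)

// Sum of squared residuals E for the line y = slope * x + intercept
fun sumOfSquaredResiduals(slope: Double, intercept: Double): Double =
    samples.sumOf { (x, y) ->
        val residual = y - (slope * x + intercept)
        residual * residual
    }

fun main() {
    // Loss for the red line drawn earlier
    println(sumOfSquaredResiduals(1.93939, 4.73333))
    // Loss for a flat line through the origin, for comparison
    println(sumOfSquaredResiduals(0.0, 0.0))
}
```

The first value comes out at roughly 28.1, while the flat line through the origin scores 2710.0, which shows how strongly the loss separates a good fit from a poor one.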

We can visualize this literally by taking each residual and overlaying a square with a matching side length. The sum of all those square areas is our **loss**, the total error measuring how far off our points are from the line in aggregate. Notice below that squaring penalizes larger residuals disproportionately (e.g. $ 2^2 = 4 $ but $ 9^2 = 81 $): the larger the residual, the more it is amplified.

In [7]:
// Overlay squares onto residuals to visualize sum of squares
var squaresGraph = residualGraph + geomPoint(shape = 1, size = 3) + 
    geomABLine(slope=m, intercept=b, color="red")

for ((x,actualY) in trainingData) {
    
    val predictedY = m*x + b  
    val residual = actualY - predictedY
    squaresGraph += geomRect(xmin=x.toDouble(), 
                             ymin=if (residual < 0) actualY.toDouble() else predictedY, 
                             // account for axis scaling
                             xmax=x+abs(residual) * (xScale.last.toDouble() / yScale.last), 
                             ymax=if (residual < 0) predictedY.toDouble() else actualY.toDouble(), 
                             color="orange", 
                             fill="orange",
                             alpha=.3, 
                            )
}

squaresGraph

By taking the squares of the residuals and summing them, we have an effective loss function. Before we start modeling that though, let's create two custom types: a `Line` and a `Tangent`. 

In [8]:
data class Tangent(val dm: DScalar, val db: DScalar) {
    operator fun times(float: Float) = Tangent(dm * float, db * float)
}

// a simple line with slope and intercept
data class Line(val m: DScalar, val b: DScalar): Differentiable<Line> {
    constructor(m: Float, b: Float): this(FloatScalar(m), FloatScalar(b))
    override fun wrap(wrapper: Wrapper) = Line(wrapper.wrap(m), wrapper.wrap(b))

    operator fun plus(tangent: Tangent) = Line(m + tangent.dm, b + tangent.db)
    operator fun minus(tangent: Tangent) = Line(m - tangent.dm, b - tangent.db)
}

A `Line` holds a slope `m` and an intercept `b` as `DScalar` properties. These two coefficients are what we are solving for, so it is important that we implement DiffKt's `Differentiable` interface. This lets us take derivatives with respect to the `Line` object's two coefficients directly, rather than holding those coefficients in a plain tensor. Implementing `Differentiable` also requires us to override the `wrap()` function, which declares how to create a new `Line` object from the wrapped `DScalar` values of `m` and `b`. Remember that `DScalar` values (like all tensor types in DiffKt) are immutable, so as DiffKt performs differentiation it needs to know how to create new values and the resulting `Line` object. Note also that we add a convenience constructor that accepts two `Float` coefficients and passes them to the primary constructor as `FloatScalar` values.

The `Tangent` is the result of differentiating with respect to that `Line`, holding the derivative values for `m` and `b` as `dm` and `db` respectively. These are also of type `DScalar`. Note that we implement [operator overloads](https://kotlinlang.org/docs/operator-overloading.html) on both classes: `plus()` and `minus()` on `Line` accept a `Tangent`, and `times()` on `Tangent` accepts a `Float`. This will make for cleaner code when we perform gradient descent shortly. Note also that both `Line` and `Tangent` are [data classes](https://kotlinlang.org/docs/data-classes.html), which automatically provide `toString()`, destructuring, and other conveniences based on their properties.
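
To see what these overloads buy us at the call site, here is a small sketch using hypothetical `Float`-based stand-ins, `PlainLine` and `PlainTangent`, which are not part of the notebook and do not use DiffKt types. It only illustrates the Kotlin language features involved.

```kotlin
// Hypothetical Float-based stand-ins for Tangent and Line (no DiffKt types),
// used only to show how the operator overloads read at the call site.
data class PlainTangent(val dm: Float, val db: Float) {
    operator fun times(factor: Float) = PlainTangent(dm * factor, db * factor)
}

data class PlainLine(val m: Float, val b: Float) {
    operator fun minus(tangent: PlainTangent) = PlainLine(m - tangent.dm, b - tangent.db)
}

fun main() {
    var line = PlainLine(2f, 5f)
    val tangent = PlainTangent(4f, 8f)

    // Expands to: line = line.minus(tangent.times(0.25f))
    line -= tangent * 0.25f

    // Data classes supply toString() and destructuring via componentN()
    val (m, b) = line
    println(line)          // prints "PlainLine(m=1.0, b=3.0)"
    println("m=$m, b=$b")  // prints "m=1.0, b=3.0"
}
```

Because `PlainLine` defines `minus()` but no `minusAssign()`, Kotlin resolves `line -= ...` by reassigning the `var`, which fits immutable value types like the ones in this notebook.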

Next let's create the `sumOfSquares()` function, which accepts a `Line` as input and returns a `DScalar`. It reads the training data and calculates the squared residual for each `Point` object against the given `Line`. The `reduce()` function then sums all the squared residual `DScalar` values together, resulting in a single `DScalar`. We will also declare an initial `Line` with a slope and intercept of $ 0 $. It is a mutable `var` rather than a `val` because it will be reassigned throughout gradient descent.

In [9]:
// calculates sum of squares for given training x and y values and a line
fun sumOfSquares(line: Line): DScalar =
    trainingData.map {
            (it.y - (it.x * line.m + line.b)).pow(2f)
        }.reduce { x1,x2 -> x1 + x2 }

// declare the line 
var line = Line(0f, 0f)

To apply gradient descent to this `sumOfSquares()` function, we iteratively adjust `m` and `b`. On each pass we hand both the current `Line` and the `sumOfSquares()` function to `primalAndReverseDerivative()`, which packages the derivatives of the loss with respect to `m` and `b` into a `Tangent`. We then scale that `Tangent` by a learning rate `lr` of $ .0025 $ and subtract it from the `Line`. We do this for about 1000 iterations (the inclusive range `0..iterations` in the loop below actually runs 1001 passes).
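
For reference, the derivatives that end up in the `Tangent` can also be written out by hand. Substituting $ \hat{y}_i = m x_i + b $ into the loss $ E $ and differentiating gives the following, where the sums run over all training points:

$ \frac{\partial E}{\partial m} = -2\sum_{i} x_i \left(y_i - (m x_i + b)\right) $

$ \frac{\partial E}{\partial b} = -2\sum_{i} \left(y_i - (m x_i + b)\right) $

Each gradient descent step then applies the update below, with $ lr $ denoting the learning rate:

$ m \leftarrow m - lr \cdot \frac{\partial E}{\partial m}, \qquad b \leftarrow b - lr \cdot \frac{\partial E}{\partial b} $

DiffKt computes these same quantities through automatic differentiation, so we never have to derive them manually.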

In [10]:
// The learning rate
val lr = .0025F

// The number of iterations to perform gradient descent
val iterations = 1000

// Perform gradient descent
for (i in 0..iterations) {

    val (_, tangent) = primalAndReverseDerivative(
        x = line,
        f = ::sumOfSquares,
        extractDerivative= { input, output, extractTangent ->
            Tangent(
                dm = extractTangent(input.m, output) as DScalar,
                db = extractTangent(input.b, output) as DScalar
            )
        }
    )
    line -= tangent * lr
}

println(line)

Line(m=1.9394101, b=4.7332225)
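
For comparison, the same descent can be sketched without automatic differentiation by coding the partial derivatives of the sum of squares by hand. This is a standalone sketch in plain `Double` arithmetic and not DiffKt's method; the names `xs`, `ys`, and `fitByGradientDescent` are illustrative assumptions.

```kotlin
// Standalone sketch with hand-derived gradients instead of DiffKt's autodiff.
// All names here are illustrative and not part of the notebook.
val xs = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)
val ys = doubleArrayOf(5.0, 10.0, 10.0, 15.0, 14.0, 15.0, 19.0, 18.0, 25.0, 23.0)

// Runs gradient descent on the sum of squared residuals and returns (m, b)
fun fitByGradientDescent(
    learningRate: Double = 0.0025,
    iterations: Int = 1000
): Pair<Double, Double> {
    var m = 0.0
    var b = 0.0
    repeat(iterations) {
        var dm = 0.0
        var db = 0.0
        for (i in xs.indices) {
            val residual = ys[i] - (m * xs[i] + b)
            dm += -2.0 * xs[i] * residual  // contribution to dE/dm
            db += -2.0 * residual          // contribution to dE/db
        }
        // Step against the gradient, scaled by the learning rate
        m -= learningRate * dm
        b -= learningRate * db
    }
    return m to b
}

fun main() {
    val (m, b) = fitByGradientDescent()
    println("m=$m, b=$b")
}
```

With the same learning rate and starting point, this should land within about $ 10^{-3} $ of the coefficients printed above. The trade-off is that the derivative code must be rewritten by hand whenever the loss function changes, which is the work DiffKt takes over.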


You might notice that the most abstract parameter in `primalAndReverseDerivative()` is `extractDerivative`. It tells DiffKt how to relate an input object (in this case a `Line`) to an output object (a `DScalar` representing the sum of squares), and which derivatives to pull out. We call `extractTangent()` on each of the `DScalar` properties `m` and `b` together with the sum of squares `output`, which yields the derivative of the output with respect to that property. Those derivative values are then packaged into our `Tangent` object.

After the iterations complete, we get a slope of approximately $ 1.9394 $ and an intercept of approximately $ 4.7332 $, as printed above. These closely match the slope and intercept of the line we drew earlier. Let's graph this line against our scatterplot.
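
As a sanity check on these coefficients, simple linear regression also has a closed-form least squares solution: the slope is the sum of products of the centered $ x $ and $ y $ values divided by the sum of squared centered $ x $ values, and the intercept follows from the means. The sketch below is a standalone illustration in plain Kotlin, and the function name `closedFormFit` is an assumption that does not appear in the notebook.

```kotlin
// Standalone sketch: closed-form ordinary least squares for one predictor.
// The function name and parameters are illustrative, not from the notebook.
fun closedFormFit(xs: List<Double>, ys: List<Double>): Pair<Double, Double> {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two paired points" }
    val meanX = xs.average()
    val meanY = ys.average()
    // Sum of products of centered values, and sum of squared centered x values
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX
    return slope to intercept
}

fun main() {
    val xs = (1..10).map { it.toDouble() }
    val ys = listOf(5.0, 10.0, 10.0, 15.0, 14.0, 15.0, 19.0, 18.0, 25.0, 23.0)
    val (slope, intercept) = closedFormFit(xs, ys)
    println("slope=$slope, intercept=$intercept")
}
```

For this data set the closed form gives a slope of $ 64/33 \approx 1.93939 $ and an intercept of $ 71/15 \approx 4.73333 $, so gradient descent has converged to within roughly $ 10^{-4} $ of the exact answer.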

In [11]:
p + geomABLine(slope=(line.m as FloatScalar).value.toDouble(), 
               intercept = (line.b as FloatScalar).value.toDouble(), 
               color="red")

Looks pretty good! That line is comfortably fitted along those points. We have now done linear regression by applying gradient descent using DiffKt and user-defined types. DiffKt lets numerical scalar properties on custom classes be differentiated directly, without converting them to explicit tensors. This is ideal because we can keep our numerical computing code organized with OOP and functional paradigms, rather than having objects lose their identity in grids of tensors and matrix operations. It's the best of both worlds!

Below is an animation of gradient descent in action, showing how the loss landscape is navigated as coefficients for `m` and `b` are explored. 

![](./resources/ddSBpqSOSE.mp4)