## Logistic Regression and Gradient Descent

In a previous notebook we examined a linear regression using DiffKt and gradient descent. Linear regressions are not appropriate for classification and probability prediction because their outputs are unbounded: a straight line extends toward positive and negative infinity. We will instead use a model called **logistic regression**, which bends a linear function into an S-shaped curve so that its outputs fall between $ 0.0 $ and $ 1.0 $ and can therefore be read as a probability between $ 0\% $ and $ 100\% $.

In this notebook, we will learn how to build a logistic regression from scratch in Kotlin. We will use DiffKt to calculate the derivatives needed by gradient descent. 

To visualize the logistic regression and the maximum likelihood estimation (MLE), let's use [Lets-Plot](https://blog.jetbrains.com/kotlin/2020/12/lets-plot-in-kotlin/). It comes with the Kotlin Jupyter kernel, which is required to run this notebook and can be installed with the command below.

```shell
conda install kotlin-jupyter-kernel -c jetbrains
```

Then let's enable Lets-Plot in this notebook.

In [None]:
%use lets-plot

Also let's bring in the DiffKt library to aid our gradient descent implementation.

In [None]:
@file:DependsOn("../kotlin/api/build/libs/api.jar")

Let's say there was a small industrial accident, and we are trying to figure out how many hours of exposure it takes before an individual shows symptoms. In our sample of possible patients, we capture how many hours they were exposed and whether they exhibited symptoms ($ 1 = true $, $ 0 = false $). The data is stored here (https://bit.ly/3KfPIao) but it is also displayed below in a table for convenience. 

|Hours of Exposure  |Has Symptoms  |
|---|---|
|1.0|0  |
|1.5|0  |
|2.1|0  |
|2.4|0  |
|2.5|1  |
|3.1|0  |
|4.2|0  |
|4.4|1  |
|4.6|1  |
|4.9|0  |
|5.2|1  |
|5.6|0  |
|6.1|1  |
|6.4|1  |
|6.6|1  |
|7.0|0  |
|7.6|1  |
|7.8|1  |
|8.4|1  |
|8.8|1  |
|9.2|1  |

Let's bring this data into Kotlin and visualize it using Lets-Plot.

In [None]:
val xData = floatArrayOf(1.0f, 1.5f, 2.1f, 2.4f, 2.5f, 3.1f, 4.2f, 4.4f, 4.6f, 4.9f, 5.2f, 5.6f, 6.1f, 6.4f, 6.6f, 7.0f, 7.6f, 7.8f, 8.4f, 8.8f, 9.2f)
val yData = floatArrayOf(0.0f, 0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f, 1.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f, 1.0f, 1.0f, 0.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f)

val n = xData.size 

// declare scatterplot 
letsPlot { x = xData; y = yData } + ggsize(600, 500) +  geomPoint(shape = 1, size = 5) 

A **logistic regression** outputs the probability that an item with given attributes belongs to a certain class. We fit the **logistic function**, an S-shaped curve also called the sigmoid curve, to the binary data and use it to make probability predictions. In this case we are measuring the probability that a patient shows symptoms given the number of hours of exposure.

This is the formula to get a probability $ y $ for a given logistic function: 

$$
y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)} }
$$

Note that the expression in the exponent, $ \beta_0 + \beta_1 x $, is a linear function. A logistic regression effectively compresses this linear function, called the log-odds function, into the range between $ 0.0 $ and $ 1.0 $. Let's plot a logistic curve with $ \beta_0 = -3.1044 $ and $ \beta_1 = 0.6783 $ against our scatterplot below.

In [None]:
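import kotlin.math.exp

// logistic (sigmoid) function for a single x with intercept b0 and slope b1
fun logistic(x: Float, b0: Float, b1: Float) = 1f / (1f + exp(-(b0 + b1*x)))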

val xSigmoid = generateSequence(0f) { it + .01f }.takeWhile { it <= 10.0f }.toList()
val ySigmoid = xSigmoid.map { logistic(it, -3.1044f, 0.6783f) }

letsPlot { x = xData.plus(xSigmoid); y = yData.plus(ySigmoid)} + 
    ggsize(600, 500) + geomPoint(shape = 1, size = 5)
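
To see why the linear part is called the log-odds function, it can help to solve the logistic function for $ \beta_0 + \beta_1 x $. The short derivation below is a standard identity and uses only the symbols already defined above:

$$
y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\quad\Rightarrow\quad
\frac{y}{1 - y} = e^{\beta_0 + \beta_1 x}
\quad\Rightarrow\quad
\ln\left(\frac{y}{1 - y}\right) = \beta_0 + \beta_1 x
$$

The ratio $ y / (1 - y) $ is the odds of showing symptoms, so the straight line $ \beta_0 + \beta_1 x $ is the logarithm of those odds.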

We will discuss how to find this best-fit curve using maximum likelihood estimation (MLE) and gradient descent with DiffKt. But first let's understand the end result and how to interpret it. Based on our patient data, what is the probability we would see symptoms for a patient with 6 hours of exposure? We simply trace along the x-axis to 6 hours and read off the probability the curve produces. With the coefficients above, the exponent is $ -3.1044 + 0.6783 \times 6 = 0.9654 $, which yields a probability of about $ 0.7242 $, as plotted below.

In [None]:
// probability of symptoms at 6 hours of exposure, computed once and reused
val p6 = logistic(6.0F, -3.1044F, 0.6783F).toDouble()

letsPlot { x = xData.plus(xSigmoid); y = yData.plus(ySigmoid)} + 
    ggsize(600, 500) + geomPoint(shape = 1, size = 5) + 
    geomSegment(x=6.0,y=0.0,xend=6.0,yend=p6, size = 3, color="red") + 
    geomSegment(x=0.0,y=p6,xend=6.0,yend=p6, size = 3.5, color="red") +
    geomText(label= "(6.0, ${"%.4f".format(p6)})",x = 7.0, y=p6)

Of course this does not necessarily mean the logistic regression is going to make good predictions. We should worry about statistical significance, correlation strength, and other measures to ensure this is a useful model. From a machine learning perspective, we would also consider a train/test split of the data and evaluate performance on the test dataset. However, in this notebook we will put all these concerns aside and focus on fitting the logistic regression using DiffKt.

Let's study the logistic function again. Notice that the $ x $ and $ y $ values are provided in the data. It is the $ \beta_0 $ and $ \beta_1 $ coefficients we need to solve for. 


$$
y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)} }
$$

To use DiffKt for our gradient descent to find these $ \beta_0 $ and $ \beta_1 $ values, first re-declare `xData` and `yData` as tensors `xTensor` and `yTensor`. Let's also declare our $ \beta_0 $ and $ \beta_1 $ for our intercept and slope coefficients respectively as scalars. We will then model the logistic function as `f()` accepting these two coefficients. 

In [None]:
import org.diffkt.*

val xTensor = tensorOf(*xData)
val yTensor = tensorOf(*yData)

// Start slope and intercept at 0
var beta0: DScalar = FloatScalar.ZERO // intercept
var beta1: DScalar = FloatScalar.ZERO // slope

// calculate the logistic function likelihoods for a given slope and intercept
fun f(beta1: DScalar, beta0: DScalar): DTensor =
    FloatScalar.ONE / (1f + exp(-(beta0 + beta1*xTensor)))

Now let's fit the sigmoid curve using maximum likelihood estimation. **Maximum likelihood estimation (MLE)** is a technique that assumes a probability distribution and then adjusts that distribution's parameters to maximize the probability of observing the provided data. In the case of logistic regression, the assumed distribution is the Bernoulli distribution, which has a single parameter $ p $ describing the probability of observing an event. The likelihood function is the probability of the observed outcomes given a value for $ p $. Therefore the function we want to optimize is the product of the Bernoulli probability mass function (PMF) evaluated at each outcome, with $ p $ set to $ f(x_i) $ for that observation.


Let's find the parameters $ \beta_0 $ and $ \beta_1 $ that produce the logistic curve under which the observed data is most likely. We are trying to maximize this formula: 

$$
\text{likelihood} = \prod_{i=1}^{n} \left( f(x_i)^{y_i} \times (1 - f(x_i))^{1 - y_i} \right)
$$

And $ f(x) $ is defined as the logistic function.  

$$ 
f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}}
$$

MLE differs from the sum of squares used in linear regression. In each iteration we adjust the values of $ \beta_0 $ and $ \beta_1 $, calculate the probability of producing each observation, and multiply those probabilities together to get the joint probability. The $ \beta_0 $ and $ \beta_1 $ values that produce the highest joint probability are the best fit.

We take the logistic function $ f(x) $ and pass every x-value $ x_i $ through it. This gives the corresponding Bernoulli probability for each x-value, based on the binary outcome of a [Bernoulli trial](https://en.wikipedia.org/wiki/Bernoulli_trial). However, we need to handle the false cases too and maximize those. This means $ x_i $ values that correspond to a false $ y_i $ will be expressed as $ 1 - f(x_i) $, which allows the false observations to be maximized as well.

But how do we handle both the true (1) and false (0) cases in a single math expression, without resorting to `if-then` code? This is what the exponents $ y_i $ and $ 1 - y_i $ are for. We need to express that `if-then` in a mathematical, function-friendly way for DiffKt to work. Knowing that any nonzero number raised to the exponent $ 0 $ yields $ 1 $, we can raise the true case to the binary $ y_i $ power and the false case to the $ 1 - y_i $ power. This way one of those expressions will always be 1, and when you multiply them together only one of the cases applies (since multiplying by 1 does nothing). When $ y_i $ is true (1), the right side of the expression $ (1 - f(x_i))^{1 - y_i} $ turns to 1 as its exponent becomes 0. When $ y_i $ is false (0), the left side of the expression $ f(x_i)^{y_i} $ turns to 1 as that exponent is 0. This is a clever way to express binary conditions in a single expression, and we multiply all of them together. 
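
As a minimal sketch of the exponent trick, the snippet below evaluates the joint likelihood in plain Kotlin, without DiffKt. The names `hours`, `symptoms`, `sigmoid`, and `jointLikelihood` are illustrative and not part of the notebook; the data is the same table shown earlier, stored as doubles.

```kotlin
import kotlin.math.exp
import kotlin.math.pow

// Same observations as the table above (hypothetical variable names).
val hours = doubleArrayOf(1.0, 1.5, 2.1, 2.4, 2.5, 3.1, 4.2, 4.4, 4.6, 4.9, 5.2,
    5.6, 6.1, 6.4, 6.6, 7.0, 7.6, 7.8, 8.4, 8.8, 9.2)
val symptoms = doubleArrayOf(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0,
    0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)

// Logistic function for one x-value.
fun sigmoid(x: Double, b0: Double, b1: Double): Double =
    1.0 / (1.0 + exp(-(b0 + b1 * x)))

// Joint likelihood: each factor is f(x)^y * (1 - f(x))^(1 - y),
// so exactly one of the two powers differs from 1 for every observation.
fun jointLikelihood(b0: Double, b1: Double): Double =
    hours.indices.fold(1.0) { acc, i ->
        val p = sigmoid(hours[i], b0, b1)
        acc * p.pow(symptoms[i]) * (1.0 - p).pow(1.0 - symptoms[i])
    }

fun main() {
    // Flat curve: every factor is 0.5, so the product is 0.5^21.
    println(jointLikelihood(0.0, 0.0))
    // The curve plotted earlier in the notebook.
    println(jointLikelihood(-3.1044, 0.6783))
}
```

The second value should come out larger than the first, which is what "a better fit" means under MLE: the observed outcomes are more probable under that curve.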


There's one more complication though. We need to avoid floating point underflow, which can happen when a computer multiplies many small decimals together: the product shrinks toward zero until it can no longer be represented with the finite precision available. A mathematical hack is to take the logarithm `log()` or `ln()` of each element; because the logarithm of a product is the sum of the logarithms, you can _add_ the elements together rather than multiply them. The result is the log-likelihood, which peaks at the same $ \beta_0 $ and $ \beta_1 $ because the logarithm is monotonically increasing. 

$$ 
\text{log-likelihood} = \sum_{i=1}^{n} \ln \left( f(x_i)^{y_i} \times (1 - f(x_i))^{1 - y_i} \right)
$$
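
The underflow problem is easy to reproduce. The sketch below uses 200 made-up probabilities of 0.5 each (illustrative values, not the patient data) and compares multiplying them directly with summing their logarithms. The helper names `product` and `logSum` are hypothetical.

```kotlin
import kotlin.math.ln

// 200 hypothetical probabilities of 0.5 each (not the patient data).
val probs = FloatArray(200) { 0.5f }

// Direct product: 0.5^200 is far smaller than the smallest positive Float,
// so the running product collapses to zero.
fun product(ps: FloatArray): Float = ps.fold(1f) { acc, p -> acc * p }

// Sum of logarithms: the same information, kept in a representable range.
fun logSum(ps: FloatArray): Float = ps.fold(0f) { acc, p -> acc + ln(p) }

fun main() {
    println(product(probs)) // 0.0, the product has underflowed
    println(logSum(probs))  // roughly 200 * ln(0.5), about -138.6
}
```

With only 21 observations the patient data would not underflow, but the log form is the safer habit and is also what the fitting code in this notebook optimizes.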

Here is what the likelihood landscape looks like for input variables $ \beta_0 $ and $ \beta_1 $. Note we are trying to reach the highest peak, which maximizes the likelihood of observing the data given that logistic function. 

![](resources/NUrlyYatqP.png)

What does this look like in Kotlin? Let's declare the likelihood estimation function that joins all those per-observation probabilities together. It accepts the $ \beta_1 $ and $ \beta_0 $ parameters and passes them to the logistic function $ f(x) $. It also uses the "binary power and multiply" trick to accommodate both true and false cases, then takes the logarithm of each term and sums the results. 

In [None]:
fun likelihood_est(beta1: DScalar, beta0: DScalar): DScalar =
    ln(
        f(beta1, beta0).pow(yTensor) * (1f - f(beta1, beta0)).pow(1f - yTensor)
    ).sum()

Finally let's perform gradient descent. A learning rate $ L $ of $ 0.01 $ and 10,000 iterations should be sufficient to fit our logistic function. Since we are trying to maximize rather than minimize our objective function, we add the gradient to the slope $ \beta_1 $ and intercept $ \beta_0 $ rather than subtract it (strictly speaking, this is gradient ascent). Use DiffKt's `forwardDerivative()` function to find the gradients for both, and multiply them by the learning rate before adding them to the coefficients.  
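
For reference, the gradients being computed here also have a simple closed form. This is a standard result for logistic regression rather than anything specific to DiffKt, and it assumes the log-likelihood is written as a sum over observations with $ f(x_i) $ as the logistic function:

$$
\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \ln f(x_i) + (1 - y_i) \ln (1 - f(x_i)) \right]
$$

$$
\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right),
\qquad
\frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right) x_i
$$

Each observation pushes the coefficients in proportion to how far the predicted probability is from the observed outcome, so the updates shrink as the curve approaches the data.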

In [None]:
// The learning rate
val L = .01f

// The number of iterations to perform gradient descent
val iterations = 10_000

// Perform gradient descent
for (i in 0 until iterations) {

    // get the gradients of the log-likelihood with respect to slope and intercept
    val (slopeGradient, interceptGradient) = forwardDerivative(beta1, beta0, ::likelihood_est)

    // update slope and intercept by adding (learning rate) * (gradient), since we are maximizing
    beta1 += slopeGradient * L
    beta0 += interceptGradient * L
}
print("slope=$beta1 intercept=$beta0") // slope=0.692664F intercept=-3.1757221F

We should get a slope $ \beta_1 $ of $ 0.692664 $ and an intercept $ \beta_0 $ of $ -3.1757221 $. What does this look like when plotted? 

In [None]:
val xSigmoid = generateSequence(0f) { it + .01f }.takeWhile { it <= 10.0f }.toList()
val ySigmoid = xSigmoid.map { logistic(it, beta0.value, beta1.value) }

letsPlot { x = xData.plus(xSigmoid); y = yData.plus(ySigmoid)} + 
    ggsize(600, 500) + geomPoint(shape = 1, size = 5)

Looks good! We can now use the logistic regression to make probability predictions given an input variable. Granted, there are many other measures we would need to take to ensure the logistic regression model is useful, like correlation coefficients, p-values, and train/test splits. But this demonstrates how to use DiffKt to train the logistic regression on a training dataset. 
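
As a sanity check that does not depend on DiffKt, the same fit can be sketched in plain Kotlin using hand-derived gradients instead of automatic differentiation. This swaps in the standard closed-form gradient of the logistic log-likelihood, $ \sum (y_i - f(x_i)) $ for the intercept and $ \sum (y_i - f(x_i)) x_i $ for the slope. The names `xs`, `ys`, `sigmoid`, `logLikelihood`, and `fit` are illustrative, and doubles are used instead of floats, so the result should land close to, but not necessarily digit-for-digit on, the values reported above.

```kotlin
import kotlin.math.exp
import kotlin.math.ln

// Same observations as the table earlier in the notebook (hypothetical names).
val xs = doubleArrayOf(1.0, 1.5, 2.1, 2.4, 2.5, 3.1, 4.2, 4.4, 4.6, 4.9, 5.2,
    5.6, 6.1, 6.4, 6.6, 7.0, 7.6, 7.8, 8.4, 8.8, 9.2)
val ys = doubleArrayOf(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0,
    0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)

fun sigmoid(z: Double): Double = 1.0 / (1.0 + exp(-z))

// Log-likelihood in the equivalent form y*ln(p) + (1 - y)*ln(1 - p).
fun logLikelihood(b0: Double, b1: Double): Double {
    var total = 0.0
    for (i in xs.indices) {
        val p = sigmoid(b0 + b1 * xs[i])
        total += ys[i] * ln(p) + (1.0 - ys[i]) * ln(1.0 - p)
    }
    return total
}

// Gradient ascent with hand-derived gradients; returns (intercept, slope).
fun fit(learningRate: Double = 0.01, iterations: Int = 10_000): Pair<Double, Double> {
    var b0 = 0.0
    var b1 = 0.0
    repeat(iterations) {
        var g0 = 0.0
        var g1 = 0.0
        for (i in xs.indices) {
            val residual = ys[i] - sigmoid(b0 + b1 * xs[i])
            g0 += residual          // d(logLikelihood)/d(b0)
            g1 += residual * xs[i]  // d(logLikelihood)/d(b1)
        }
        // Add the gradient because we are maximizing.
        b0 += learningRate * g0
        b1 += learningRate * g1
    }
    return b0 to b1
}

fun main() {
    val (b0, b1) = fit()
    println("slope=$b1 intercept=$b0 logLikelihood=${logLikelihood(b0, b1)}")
}
```

If the two approaches agree, that is reasonable evidence the DiffKt gradients and the update loop are wired up as intended.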