## Exponential Distribution and Gradient Descent

**Copyright (c) Meta Platforms, Inc. and affiliates.**

This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.

The **exponential distribution** models the elapsed time between events, such as the time between ad clicks or video views. For any given $ x $, the distribution's density gives the relative likelihood that $ x $ units of time pass between two consecutive events.

We will fit an exponential distribution using maximum likelihood estimation (MLE) and gradient descent, with DiffKt computing the derivatives for us.

Bring in DiffKt with the following command: 

In [1]:
@file:DependsOn("../kotlin/api/build/libs/api.jar")

The exponential distribution is modeled with the following function. Note that $ x $ has to be greater than 0. 

$$
y = \lambda e^{-\lambda x}, \quad x > 0
$$

To visualize how the exponential distribution morphs with different $ \lambda $ values, here is an animation showing lambda values ranging from $ 0 $ to $ 5 $. 

![](./resources/MLPgNgVTsy.mp4)

Let's say we are running a video streaming website and tracking traffic to a particular video. We are trying to predict the length of time $ x $ between consecutive video views. Logically, $ x $ cannot be negative because a negative length of time between events does not make sense.

Lambda $ \lambda $ is the rate parameter, which in this case is the number of views per minute. Therefore $ \lambda = 2 $ means there are 2 views every minute on average, and $ x $ is measured in minutes.

$$
f(x) = 2 e^{-2 x}, \quad x > 0
$$

![](./resources/IUiwhJeriX.png)



To interpret this, let's pose a question: what is the probability that we get a video view within 1 minute or less? To answer it, we find the area under the curve for that range. Since this is a probability distribution, the area under the entire curve is $ 1.0 $, but we are only interested in the area over $ 0 \le x \le 1 $.

Let's visualize the area of interest $ 0 \le x \le 1 $ below, which yields a probability of $ 0.8647 $.

![](./resources/ZEKHazQuWs.mp4)
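
As a quick check of that figure, the area can be worked out directly by integrating the density with $ \lambda = 2 $ over the interval:

$$
P(0 \le X \le 1) = \int_0^1 2 e^{-2x} \, dx = \left[ -e^{-2x} \right]_0^1 = 1 - e^{-2} \approx 0.8647
$$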

That's how you interpret an exponential distribution. But how do you fit one when presented with a dataset? What is the correct $ \lambda $ parameter? This is where we can use DiffKt with gradient descent. Below is a sample of 15 observed times between video views, declared in Kotlin code.

In [2]:
val xData = floatArrayOf(1.8929f, 
       6.3687f, 
       3.228f, 
       1.2192f, 
       0.2585f, 
       0.4404f, 
       3.0278f, 
       1.9918f, 
       3.4013f, 
       3.0343f, 
       1.0201f, 
       2.436f, 
       1.8981f, 
       2.9764f, 
       1.3621f
)

Here is the data visualized on the x-axis below. 

![](./resources/ZhsZtuABdv.png)

What is the best exponential distribution to fit to this data? What is the right $ \lambda $ value? To answer that we use maximum likelihood estimation, which finds the exponential function most likely to output the observed data. But rather than go through a lot of derivative calculations (as demonstrated in Josh Starmer's video [here](https://www.youtube.com/watch?v=p3T-_LMrvBc)) we can instead use DiffKt to do all the derivative work for us.

DiffKt can take the derivative of the likelihood with respect to $ \lambda $ for the observed $ x $ values. We then use gradient *ascent* to step up the slope toward the maximum likelihood, that is, the value of $ \lambda $ under which the observed data, taken together, are most likely. Multiplying the individual likelihoods together tends to cause floating point underflow, so we work in log space instead: we take the log of each likelihood and sum rather than multiply.

Set the learning rate to `0.01` and use `25` iterations, which should be sufficient to find the exponential distribution producing the maximum likelihood of observing those 15 points.
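
For reference, the log-likelihood being maximized can also be written out by hand. This is the standard derivation and is not something DiffKt requires; $ n $ (the number of observations) and $ x_i $ (the individual data points) are notation introduced here for illustration:

$$
\ell(\lambda) = \sum_{i=1}^{n} \ln\left(\lambda e^{-\lambda x_i}\right) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i
$$

$$
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i
$$

Each gradient ascent step moves $ \lambda $ in the direction of this derivative, which is zero at $ \lambda = n / \sum_{i=1}^{n} x_i $.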

In [3]:
import org.diffkt.*

val xTensor = tensorOf(*xData)

// Start lambda at 1
var lambda: DScalar = FloatScalar.ONE

// Exponential density evaluated at every data point for a given lambda
fun f(lambda: DScalar) = lambda * exp(-lambda*xTensor)

// Total log-likelihood: take the natural log of the density
// at each data point, then sum them
fun likelihood_est(lambda: DScalar): DScalar = ln(f(lambda)).sum()

// The learning rate
val L = .01F

// The number of gradient ascent iterations to perform
val iterations = 25

// Perform gradient ascent: we add the gradient because we are maximizing
for (i in 0 until iterations) {

    // get the gradient of the log-likelihood with respect to lambda
    val lambdaGradient = forwardDerivative(lambda, ::likelihood_est)

    // update lambda by adding (learning rate) * (slope)
    lambda += lambdaGradient * L
}
print("lambda=$lambda")

lambda=0.43408304

Running the code above, we get a lambda value of about $ 0.434 $. Here is a visualization of the gradient ascent at work below.

![](./resources/LPUPIctltW.mp4)

This looks pretty good! As you can observe from the function above, we are much more likely to observe times between video views close to 0 minutes, while gaps longer than about 4 minutes are less likely.
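
As a sanity check on the fitted value, the exponential distribution has a well-known closed-form maximum likelihood estimate: the number of observations divided by their sum. The sketch below uses plain Kotlin only (no DiffKt) and a hand-written gradient; the names `waitTimes`, `closedFormLambda`, `logLikelihoodGradient`, and `gradientAscentLambda` are hypothetical helpers introduced for illustration, not part of DiffKt or the notebook above.

```kotlin
// The same 15 observed times as xData above, stored as Doubles.
val waitTimes = doubleArrayOf(
    1.8929, 6.3687, 3.228, 1.2192, 0.2585,
    0.4404, 3.0278, 1.9918, 3.4013, 3.0343,
    1.0201, 2.436, 1.8981, 2.9764, 1.3621
)

// Closed-form MLE for the exponential rate: lambda = n / sum(x).
fun closedFormLambda(xs: DoubleArray): Double = xs.size / xs.sum()

// Hand-written gradient of the log-likelihood
// n * ln(lambda) - lambda * sum(x) with respect to lambda.
fun logLikelihoodGradient(lambda: Double, xs: DoubleArray): Double =
    xs.size / lambda - xs.sum()

// Plain gradient ascent using the hand-written gradient above.
// Defaults mirror the settings used earlier (start at 1, rate 0.01, 25 steps).
fun gradientAscentLambda(
    xs: DoubleArray,
    start: Double = 1.0,
    learningRate: Double = 0.01,
    iterations: Int = 25
): Double {
    var lambda = start
    repeat(iterations) {
        lambda += learningRate * logLikelihoodGradient(lambda, xs)
    }
    return lambda
}
```

Calling `closedFormLambda(waitTimes)` gives about 0.434, and `gradientAscentLambda(waitTimes)` lands on essentially the same value, which agrees with the result DiffKt produced.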