# **Section 3: Gradient Descent**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.1, May 5 2025

# Linear Regression: Gradient Descent

In this lab we will implement gradient descent step by step.

We need the following modules, so please run this code:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import random as rnd

And import some functions written for this notebook:

In [None]:
from modules.RegressionFunctions import * 

Gradient descent uses the sum of squared errors as its error measure, which is calculated by the function `sumOfSquaredErrors()`.


In [None]:
help(sumOfSquaredErrors)


The function `regPlot()` plots the values, regression line and residuals. Take a look at the information: 

In [None]:
help(regPlot)

We initialize `random` with a fixed seed so that we get the same results every time we run the code.

In [None]:
rnd.seed(0)

# Learning the intercept

We start with the example of the flipped classroom exercise. The `X` and `Y` values are defined as follows:

In [None]:
X_class=np.array([1,1.5,3])
Y_class=np.array([2.5 , 3.25, 5.5 ])

As in the lecture and the video, we start by learning only the intercept. We assume that the slope of the function is known. That is, we need to learn the value $b$ for the following function:

$f(x)=1.5 \cdot x+b$

Implement a Python function that calculates $f(x)$. Replace `return x` in the following function by the code calculating $f(x)$:

In [None]:
def function1(x,b):
    return x

To learn the intercept using gradient descent, we need the derivative with respect to the intercept of the sum of squared errors. The derivative is:

$\frac{\partial}{\partial b} \left( \sum_{i} (y_i-(a \cdot x_i + b))^2 \right) = \sum_{i} (-2 \cdot (y_i-(a \cdot x_i + b))) = \sum_{i} (-2 \cdot (y_i - \hat{y}_i))$

The slope $a$ enters this derivative only through the predictions $\hat{y}_i = a \cdot x_i + b$, so the function below needs just `y` and `ypred`.
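
The formula can be checked numerically. The following minimal sketch compares it with a central finite difference on the classroom data at $b=0$. The helper names `sse` and `analytic_derivative` are hypothetical and are not part of `modules.RegressionFunctions`.

```python
import numpy as np

# classroom data from above; the slope a = 1.5 is assumed to be known
X = np.array([1, 1.5, 3])
Y = np.array([2.5, 3.25, 5.5])
A = 1.5

def sse(b):
    """Sum of squared errors for intercept b (hypothetical helper)."""
    return np.sum((Y - (A * X + b)) ** 2)

def analytic_derivative(b):
    """Derivative of the sum of squared errors with respect to b."""
    return np.sum(-2 * (Y - (A * X + b)))

b0 = 0.0
h = 1e-6
numeric = (sse(b0 + h) - sse(b0 - h)) / (2 * h)  # central difference
analytic = analytic_derivative(b0)
print("numeric:", numeric, "analytic:", analytic)
```

At $b=0$ all three residuals equal 1, so the derivative is $-2 \cdot 3 = -6$. The negative sign says that increasing $b$ reduces the error.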

Implement a Python function to calculate the derivative with respect to the intercept. Replace `pass` in the following function with the code to calculate the derivative:

In [None]:
def derivativeSumOfSquaredErrorIntercept(y,ypred):
    pass

Now we set the learning rate $\alpha=0.1$.

We can use gradient descent to learn the intercept (the correct value is 1). The steps are the following:

1. Start with an arbitrary value for the `intercept` (we take 0)
2. Let `interceptSlope` be the derivative of the loss function at the current intercept value
3. Calculate the `stepSize` as: `interceptSlope` times learning rate `alpha`
4. Calculate the new `intercept` as: `intercept - stepSize`
5. Increment the loop counter
6. Go back to step 2 until either `abs(stepSize) < 0.001` (`abs()` is the absolute value) or the loop was executed 1000 times
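
The steps above can be sketched as a stand-alone loop. This is only one possible arrangement, not the reference solution: the function name `gd_intercept_sketch` and its parameters `slope`, `precision` and `max_iter` are assumptions made for this illustration, and it returns the results instead of printing and plotting them.

```python
import numpy as np

def gd_intercept_sketch(X, Y, slope=1.5, alpha=0.1, precision=0.001, max_iter=1000):
    """Hypothetical stand-alone version of the steps above (no plotting)."""
    intercept = 0.0   # initial value
    step_size = 1.0   # any value >= precision, so that the loop starts
    counter = 0
    while abs(step_size) >= precision and counter < max_iter:
        y_pred = slope * X + intercept
        # derivative of the sum of squared errors w.r.t. the intercept
        intercept_slope = np.sum(-2 * (Y - y_pred))
        step_size = intercept_slope * alpha
        intercept = intercept - step_size
        counter += 1
    return intercept, counter

X_class = np.array([1, 1.5, 3])
Y_class = np.array([2.5, 3.25, 5.5])
intercept, counter = gd_intercept_sketch(X_class, Y_class)
print("learned intercept:", intercept, "iterations:", counter)
```

On the classroom data each update shrinks the distance to the optimum by the factor 0.4 ($0 \to 0.6 \to 0.84 \to 0.936 \dots$), so the loop ends after a handful of iterations.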

Implement gradient descent in the following function by adding the initial intercept, the condition of the `while` loop, and the code to calculate the `stepSize`, update the intercept and increment the loop counter.


In [None]:
def gradientIntercept(X,Y,alpha=0.1):
    # define initial intercept
    
    stepSize=1
    counter=0
    # define condition for while loop
    while :
        yPred=function1(X,intercept)
        interceptSlope=derivativeSumOfSquaredErrorIntercept(Y,yPred)
        # calculate the stepSize
        
        # calculate the new Intercept
       
        # increment the loop counter

        
    yPred=function1(X,intercept)
    print("Results")
    print("learned intercept", intercept)
    print("sum of Squared errors",sumOfSquaredErrors(Y,yPred))
    print("loop counter: ",counter)
    regPlot(X,Y,yPred,1.5,intercept)

If you implemented the function correctly, the learned intercept is close to 1.

In [None]:
# call the code
gradientIntercept(X_class,Y_class)

Now we run the code above to learn the intercept (1) with other data. We use more `X` values:

In [None]:
X2=np.array(list(range(0,21)))/4

In [None]:
Y2=1.5*X2+1

And now we add some random noise:

In [None]:
Y2_rand=Y2+np.array([rnd.random()-0.5 for i in range(21)])

and create a scatter plot:

In [None]:
plt.plot(X2,Y2_rand,'.')


Call the function to learn the intercept using gradient descent for `X2` and `Y2_rand`:

In [None]:
gradientIntercept(X2,Y2_rand)

Now we get an error. As we may remember from the video, this can happen when the learning rate is too large. So let's test a smaller learning rate:

In [None]:
gradientIntercept(X2,Y2_rand,alpha=0.001)
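
A possible explanation, sketched here on the noise-free values `Y2` with the hypothetical helper `intercept_errors`: with a known slope and $n$ data points, each update multiplies the distance to the optimal intercept by $(1 - 2 n \alpha)$. For $n = 21$ and $\alpha = 0.1$ this factor is $-3.2$, so the distance grows in every step. For $\alpha = 0.001$ it is $0.958$, so the distance shrinks slowly.

```python
import numpy as np

X2 = np.arange(21) / 4
Y2 = 1.5 * X2 + 1  # noise-free values, true intercept is 1

def intercept_errors(alpha, steps=5):
    """Distance |b - 1| after each of the first few updates (hypothetical helper)."""
    b = 0.0
    errors = []
    for _ in range(steps):
        grad = np.sum(-2 * (Y2 - (1.5 * X2 + b)))
        b = b - alpha * grad
        errors.append(abs(b - 1))
    return errors

diverging = intercept_errors(0.1)     # distance grows from step to step
converging = intercept_errors(0.001)  # distance shrinks from step to step
print("alpha=0.1:  ", diverging)
print("alpha=0.001:", converging)
```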

In the video it was proposed to reduce the learning rate with every step to avoid the problems above. Modify the gradient descent function so that it adapts the learning rate in each step (by multiplying it by 0.99):

In [None]:
def gradientIntercept2(X,Y,alpha=0.1):
    # define initial intercept
    
    stepSize=1
    counter=0
    # define condition for while loop
    while :
        yPred=function1(X,intercept)
        interceptSlope=derivativeSumOfSquaredErrorIntercept(Y,yPred)
        # calculate the stepSize
        
        # calculate the new Intercept
       
        # adapt the learning rate

        # increment the loop counter

        
    yPred=function1(X,intercept)
    print("Results")
    print("learned intercept", intercept)
    print("sum of Squared errors",sumOfSquaredErrors(Y,yPred))
    print("loop counter: ",counter)
    regPlot(X,Y,yPred,1.5,intercept)

Let's test whether the problem occurs again:

In [None]:
gradientIntercept2(X2,Y2_rand)
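
To get a feeling for how quickly the factor 0.99 reduces the learning rate, the following sketch counts the decay steps until `alpha` falls below $1/n$. For the intercept-only case with $n$ data points, the updates contract only below that value. The variable names are chosen for this illustration only.

```python
alpha = 0.1        # initial learning rate, as in the call above
decay = 0.99       # factor applied after every update
n = 21             # number of data points in X2
threshold = 1 / n  # below this value the intercept update contracts

steps = 0
while alpha >= threshold:
    alpha *= decay
    steps += 1
print("decay steps:", steps, "alpha:", alpha)
```

With these values the learning rate needs 74 decay steps to fall below $1/21 \approx 0.048$, so the first iterations may still move away from the optimum before the updates start to converge.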

# Learning the slope and intercept

Now we will implement gradient descent to learn both $a$ and $b$ of the function $f(x)=a \cdot x + b$.

Implement the function that calculates $a\cdot x + b$:

In [None]:
def function2(x,a,b):
    return x

We now need the gradient, that is, the derivatives of the sum of squared errors with respect to the intercept

$\frac{\partial}{\partial b} \left( \sum_{i} (y_i-(a \cdot x_i + b))^2 \right) = \sum_{i} (-2 \cdot (y_i-(a \cdot x_i + b))) = \sum_{i} (-2 \cdot (y_i - \hat{y}_i))$

(as before, the slope $a$ enters this derivative only through the prediction $\hat{y}_i$) and with respect to the slope

$\frac{\partial}{\partial a} \left( \sum_{i} (y_i-(a \cdot x_i + b))^2 \right) = \sum_{i} (-2 \cdot x_i \cdot (y_i-(a \cdot x_i + b))) = \sum_{i} (-2 \cdot x_i \cdot (y_i - \hat{y}_i))$
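
Both partial derivatives can be checked with finite differences. The sketch below does this at $a=0$, $b=0$ on the classroom data. The names `sse`, `grad_intercept` and `grad_slope` are hypothetical and used only here.

```python
import numpy as np

X = np.array([1, 1.5, 3])
Y = np.array([2.5, 3.25, 5.5])

def sse(a, b):
    """Sum of squared errors of the line a*x + b."""
    return np.sum((Y - (a * X + b)) ** 2)

def grad_intercept(a, b):
    """Partial derivative of the sum of squared errors w.r.t. b."""
    return np.sum(-2 * (Y - (a * X + b)))

def grad_slope(a, b):
    """Partial derivative of the sum of squared errors w.r.t. a."""
    return np.sum(-2 * X * (Y - (a * X + b)))

a0, b0, h = 0.0, 0.0, 1e-6
numeric_b = (sse(a0, b0 + h) - sse(a0, b0 - h)) / (2 * h)
numeric_a = (sse(a0 + h, b0) - sse(a0 - h, b0)) / (2 * h)
print("intercept:", numeric_b, grad_intercept(a0, b0))
print("slope:    ", numeric_a, grad_slope(a0, b0))
```

At this point the two derivatives are $-22.5$ and $-47.75$. The slope derivative is larger in magnitude because every residual is weighted by its $x_i$.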

We already implemented the function `derivativeSumOfSquaredErrorIntercept(y,ypred)`. Now we implement the function `derivativeSumOfSquaredErrorSlope(x,y,ypred)`. This function requires the value of `x` as an additional parameter:

In [None]:
def derivativeSumOfSquaredErrorSlope(x,y,ypred):
    pass

The gradient descent algorithm is similar to learning the intercept alone:

1. Start with initial values for the `intercept` and `slope`. We define default values of 0 in the function.
2. Let `interceptSlope` be the derivative of the loss function with respect to the intercept at the current intercept value (treating the slope as given)
3. Let `slopeSlope` be the derivative of the loss function with respect to the slope at the current slope value (treating the intercept as given)
4. Calculate the `stepSizeIntercept` as: `interceptSlope` times learning rate `alpha`
5. Calculate the `stepSizeSlope` as: `slopeSlope` times learning rate `alpha`
6. Calculate the new `intercept` as: old intercept - step size intercept
7. Calculate the new `slope` as: old slope - step size slope
8. Adapt the learning rate `alpha` by multiplying it by 0.99 and increment the loop counter
9. Go back to step 2 until either both `abs(stepSizeIntercept) < 0.001` **and** `abs(stepSizeSlope) < 0.001` (`abs()` is the absolute value; 0.001 is the default of the parameter `precision`) or the loop was executed 1000 times
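
The steps above can be sketched as follows. This is one possible arrangement under stated assumptions, not the reference solution: the name `gd_line_sketch` and the parameter `max_iter` are hypothetical, and the function returns its results instead of printing and plotting them.

```python
import numpy as np

def gd_line_sketch(X, Y, intercept=0.0, slope=0.0, alpha=0.01,
                   precision=0.001, max_iter=1000):
    """Hypothetical stand-alone version of the steps above (no plotting)."""
    step_intercept = step_slope = 1.0
    counter = 0
    # continue while at least one step is still large and the limit is not reached
    while counter < max_iter and (
        abs(step_intercept) >= precision or abs(step_slope) >= precision
    ):
        y_pred = slope * X + intercept
        intercept_slope = np.sum(-2 * (Y - y_pred))
        slope_slope = np.sum(-2 * X * (Y - y_pred))
        step_intercept = intercept_slope * alpha
        step_slope = slope_slope * alpha
        intercept -= step_intercept
        slope -= step_slope
        alpha *= 0.99  # shrink the learning rate in every iteration
        counter += 1
    return intercept, slope, counter

X_class = np.array([1, 1.5, 3])
Y_class = np.array([2.5, 3.25, 5.5])
initial_sse = np.sum((Y_class - 0.0) ** 2)  # error of the starting line a=0, b=0
intercept, slope, counter = gd_line_sketch(X_class, Y_class)
final_sse = np.sum((Y_class - (slope * X_class + intercept)) ** 2)
print(intercept, slope, counter, initial_sse, final_sse)
```

Because `alpha` shrinks in every iteration, the loop can stop before the minimum is reached, so the returned values should be read as approximations.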

Implement gradient descent in the following function:

In [None]:
# starting with alpha 0.1 is too large, we set it to 0.01 by default
def gradientInterceptSlope(X,Y,intercept=0,slope=0,alpha=0.01, precision=0.001):
    stepSizeIntercept=1
    stepSizeSlope=1
    counter=0
    # define the condition for while
    while :
        yPred=function2(X,slope,intercept)
        # calculate the interceptSlope
        
        # calculate the slopeSlope
        
        # calculate the stepSizeIntercept
        
        # calculate the stepSizeSlope
        
        # calculate the new Intercept
        
        # calculate the new Slope
        
        # adapt the learning rate alpha
        
        # increment the loop counter
        
        
    yPred=function2(X,slope,intercept)
    print("Results")
    print("learned intercept", intercept)
    print("learned slope", slope)
    print("sum of Squared errors",sumOfSquaredErrors(Y,yPred))
    print("loop counter: ",counter)
    regPlot(X,Y,yPred,slope,intercept)

Now we call the function, first with the default value for `precision`:

In [None]:
gradientInterceptSlope(X_class,Y_class)

Remember that the function was defined as $f(x)=1.5\cdot x + 1$. Compare the learned intercept and slope with these values and take a look at the plot. You will see that the function was not learned precisely. Try other values for `precision`. (The function call will be `gradientInterceptSlope(X_class,Y_class,precision=....)` with the new values.)

In [None]:
# your code


Please call gradient descent now using the correct intercept (1) so that the function only needs to learn the slope. Compare the results.

In [None]:
# your code


Now run the code with the larger set of values (`X2` and `Y2`):

In [None]:
# your code


Test the gradient descent with other functions and values of your choice with and without random noise.

In [None]:
# your code

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.