In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
import jax.numpy as np
from jax import jit
import numpy.random as npr
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider
from pyprojroot import here

## Linear Regression

Linear regression is foundational to deep learning. Most of us have been exposed to it before. Even so, it is worth going through it again, with a view to how linear regression connects to the neural diagrams shown later in this notebook.

### Discussion

- In our machine learning toolkit, where do we use linear regression? 
- What are its advantages? 
- What are its disadvantages?

### Equation Form

Linear regression, as a model, is expressed as follows:

$$y = wx + b$$

Here:

- The **model** is the equation, $y = wx + b$.
- $y$ is the output data.
- $x$ is our input data.
- $w$ is a slope parameter.
- $b$ is our intercept parameter.
- Implicit in the model is the fact that we have transformed $y$ by another function, the "identity" function, $f(x) = x$.

In this model, $y$ and $x$ are, in a sense, "fixed", because this is the data that we have obtained. On the other hand, $w$ and $b$ are the parameters of interest, and *we are interested in **learning** the parameter values for $w$ and $b$ that let our model best explain the data*.

I will reveal the punchline early:

> The **learning** in deep learning is about figuring out parameter values for a given model.

### Make Simulated Data

To explore this idea in a bit more depth as applied to a linear regression model, let us start by making some simulated data with a bit of injected noise.

You can specify a true $w$ and a true $b$ as you wish, or you can just follow along.

### Exercise: Simulate Data

Fill in `w_true` and `b_true` with values that you like, or else leave them alone and follow along.

In [None]:
x = np.linspace(-5, 5, 100)
w_true = 2  # exercise: specify ground truth w.
b_true = 20  # exercise: specify ground truth b.

def noise(n):
    return npr.normal(size=(n))

# exercise: write the linear equation down.
y = w_true * x + b_true + noise(len(x))

# Plot ground truth data
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y');

### Exercise: Take bad guesses

Now, let's plot what would be a very bad estimate of $w$ and $b$.
Replace the values assigned to `w` and `b` with something of your preference,
or feel free to leave them alone and go on.

In [None]:
# Plot a very bad estimate
w = -5  # exercise: fill in a bad value for w
b = 3   # exercise: fill in a bad value for b
y_est = w * x + b  # exercise: fill in the equation.
plt.plot(x, y_est, color='red', label='bad model')
plt.scatter(x, y, label='data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend();

## Regression Loss Function

How bad is our model?
We can quantify this by looking at a metric called the "mean squared error".
The mean squared error is defined as "the average of the squared errors", where each error is the difference between an observed value and the model's prediction for it.
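
Written as an equation, the mean squared error looks like this. The symbols $n$ (the number of data points), $y_i$ (the $i$-th observed value) and $\hat{y}_i$ (the model's prediction for that point) are notation introduced here for illustration:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Because each error is squared, large misses are penalized much more heavily than small ones, and misses above and below the data cannot cancel each other out.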

"Mean squared error" is but one of many **loss functions**
that are available in deep learning frameworks.
It is commonly used for regression tasks.

Loss functions are designed to quantify how bad our model is in predicting the data.

### Exercise: Mean Squared Error

Implement the mean squared error function in NumPy code.

In [None]:
def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Implement the function here"""

from dl_workshop.answers import mse

# Calculate the mean squared error between the data and the bad estimate.
print(mse(y, y_est))

### Activity: Optimize model by hand.

Now, we're going to optimize this model by hand.
(This is best done in a live notebook, and not on the website;
use the sliders provided to adjust the model.)

In [None]:
import pandas as pd
from ipywidgets import interact, FloatSlider
import seaborn as sns

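@interact(w=FloatSlider(value=0, min=-10, max=10), b=FloatSlider(value=0, min=-10, max=30))
def plot_model(w, b):
    y_est = w * x + b
    plt.scatter(x, y)
    plt.plot(x, y_est)
    # Show the loss in the title so that it updates as the sliders move.
    plt.title(f"MSE: {float(mse(y, y_est)):.2f}")
    sns.despine()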

### Discussion

As you were optimizing the model, what did you notice about the MSE score?

## Detour: Gradient-Based Optimization

Implicit in what you were doing was something we formally call "gradient-based optimization". This is a very important point to understand. If you get this for a linear model, you will understand how this works for more complex models. Hence, we are going to go into a small crash-course detour on what gradient-based optimization is.

### Derivatives

At the risk of ticking off mathematicians for a sloppy definition, for this workshop's purposes, a useful way of defining the derivative is:

> How much our output changes as we take a small step on the inputs, taken in the limit of going to very small steps.

If we have a function:

$$f(w) = w^2 + 3w - 5$$

What is the derivative of $f(w)$ with respect to $w$? From first-year undergraduate calculus, we should be able to calculate this:

$$f'(w) = 2w + 3$$

(We will use apostrophe marks to indicate derivatives: one apostrophe denotes the first derivative, two apostrophes denote the second derivative.)
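
To connect the "small step" definition with the formula, here is a minimal sketch that estimates the derivative numerically. The names `poly` and `finite_difference` are made up for this illustration, and the step sizes are arbitrary choices.

```python
def poly(w):
    """The example function f(w) = w^2 + 3w - 5 (called `poly` here for illustration)."""
    return w**2 + 3 * w - 5


def finite_difference(func, w, step):
    """Change in output per unit change in input, for a step of the given size."""
    return (func(w + step) - func(w)) / step


# Shrink the step and watch the estimate approach f'(2) = 2 * 2 + 3 = 7.
for step in (1.0, 0.1, 0.001):
    print(step, finite_difference(poly, 2.0, step))
```

As the step gets smaller, the estimate should get closer to the value given by the formula for $f'(w)$.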

### Minimizing $f(w)$ Analytically

What is the value of $w$ that minimizes $f(w)$? Again, from undergraduate calculus, we know that at a minimum of a function (whether global or local), the first derivative will be equal to zero, i.e. $f'(w) = 0$. By taking advantage of this property, we can analytically solve for the value of $w$ at the minimum.

$$2w + 3 = 0$$

Hence, 

$$w = -\frac{3}{2} = -1.5$$

To check whether the value of $w$ at the place where $f'(w) = 0$ is a minimum or a maximum, we can use another piece of knowledge from first-year undergraduate calculus: the sign of the second derivative at that point tells us which one it is.

- If the second derivative is positive, then the point is a minimum. (Smiley faces are positive!)
- If the second derivative is negative, then the point is a maximum. (Frowning faces are negative!)

Hence, 

$$f''(w) = 2$$

We can see that $f''(w) > 0$ for all $w$, hence the stationary point we found is a minimum. Because the second derivative is positive everywhere, it is in fact the global minimum.
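
As a quick sanity check on the analytical result, we can evaluate the function at the stationary point $w = -\frac{3}{2}$ and at two nearby points. This is only a sketch: the names `poly`, `w_star` and `nudge` are introduced here, and the offset of 0.1 is an arbitrary choice.

```python
def poly(w):
    """The example function f(w) = w^2 + 3w - 5."""
    return w**2 + 3 * w - 5


w_star = -1.5  # stationary point, from solving 2w + 3 = 0
nudge = 0.1    # arbitrary small offset on either side

# If w_star is a minimum, the middle value should be the smallest of the three.
print(poly(w_star - nudge), poly(w_star), poly(w_star + nudge))
```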

### Minimizing $f(w)$ Computationally

An alternative way of looking at this is to take advantage of $f'(w)$, the gradient, evaluated at a particular $w$. A known property of the gradient is that if you repeatedly take small steps in the negative direction of the gradient, you will move towards a minimum of the function (a local one, in general). If you take small steps in the positive direction of the gradient, you will move towards a maximum (if one exists).

### Exercise: Implement gradient functions by hand

Let's implement this using the function $f(w)$, done using NumPy.

Firstly, implement the aforementioned function $f$ below.

In [None]:
# Exercise: Write f(w) as a function.
def f(w):
    return w**2 + 3 * w - 5

This function is the **objective function** that we wish to optimize,
where "optimization" means finding the minimum or maximum.

Now, implement the gradient function $\frac{df}{dw}$ below in the function `df`:

In [None]:
# Exercise: Write df(w) as a function. 
def df(w):
    """
    The derivative of f with respect to w.
    """
    return 2 * w + 3

This function is the **gradient of the objective w.r.t. the parameter of interest**.
It will help us find out the direction in which to change the parameter $w$
in order to optimize the objective function.

Now, pick a number at random to start with.
You can specify a number explicitly,
or use a random number generator to draw a number.

In [None]:
# Exercise: Pick a number to start w at.
w = 10.0  # start with a float

This gives us a starting point for optimization.

Finally, write an "optimization loop",
in which you adjust the value of $w$
in the negative direction of the gradient of $f$ w.r.t. $w$ (i.e. $\frac{df}{dw}$).

In [None]:
# Now, adjust the value of w 1000 times, taking small steps in the negative direction of the gradient.
for i in range(1000):
    w = w - df(w) * 0.01  # 0.01 is the size of the step taken.
    
print(w)

Congratulations, you have just implemented **gradient descent**!

Gradient descent is an **optimization routine**: a way of programming a computer to do optimization for you so that you don't have to do it by hand.
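
The size of the step matters. The sketch below wraps the loop above in a function so that we can try different step sizes. The names `df_by_hand` and `gradient_descent`, and the step sizes other than 0.01, are assumptions made for this illustration.

```python
def df_by_hand(w):
    """Hand-derived gradient of f(w) = w^2 + 3w - 5."""
    return 2 * w + 3


def gradient_descent(w_start, step_size, n_steps):
    """Repeatedly step against the gradient and return the final w."""
    w_current = w_start
    for _ in range(n_steps):
        w_current = w_current - df_by_hand(w_current) * step_size
    return w_current


print(gradient_descent(10.0, step_size=0.01, n_steps=1000))  # settles near -1.5
print(gradient_descent(10.0, step_size=1.0, n_steps=1000))   # bounces between two values
print(gradient_descent(10.0, step_size=1.1, n_steps=50))     # moves away from the minimum
```

For this particular function, a small step size converges, while a step size that is too large overshoots the minimum on every iteration and never settles.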

### Minimizing $f(w)$ with `jax`

`jax` is a Python package for automatically computing gradients; 
it provides what is known as an "automatic differentiation" system
on top of the NumPy API.
This way, we do not have to specify the gradient function by hand-calculating it;
rather, `jax` will know how to automatically take the derivative of a Python function
w.r.t. the first argument,
leveraging the chain rule to help calculate gradients.
With `jax`, our example above needs only a small modification:

In [None]:
from jax import grad
import jax
from tqdm.autonotebook import tqdm

def f(w):
    return w**2 + 3 * w - 5

# This is what changes: we use jax's `grad` function to automatically return a gradient function.
df = grad(f)

# Exercise: Pick a number to start w at.
w = -10.0

# Now, adjust the value of w 1000 times, taking small steps in the negative direction of the gradient.
for i in tqdm(range(1000)):
    w = w - df(w) * 0.01  # 0.01 is the size of the step taken.

print(w)
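
To build trust in automatic differentiation, it can help to compare it against the hand-derived gradient at a few points. This is a minimal sketch; the names `poly`, `dpoly_by_hand` and `dpoly_auto`, and the test points, are chosen here for illustration.

```python
from jax import grad


def poly(w):
    """The example function f(w) = w^2 + 3w - 5."""
    return w**2 + 3 * w - 5


def dpoly_by_hand(w):
    """Derivative worked out with pen and paper."""
    return 2 * w + 3


# `grad` returns a new function that evaluates the derivative of `poly`.
dpoly_auto = grad(poly)

for w_test in (-1.5, 0.0, 2.0):
    print(w_test, float(dpoly_auto(w_test)), dpoly_by_hand(w_test))
```

Note that the test points are floats: `grad` expects floating-point inputs, which is one reason to start `w` at a float rather than an integer.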

## Back to Optimizing Linear Regression

### What are we optimizing?

In linear regression, we are:

- minimizing (i.e. optimizing) the loss function 
- _with respect to_ the linear regression parameters.

Here are the parallels to the example above:

- In the example above, we minimized $f(w)$, the polynomial function. With linear regression, we are minimizing the mean squared error.
- In the example above, we minimized $f(w)$ with respect to $w$, where $w$ is the key parameter of $f$. With linear regression, we minimize mean squared error of our model prediction with respect to the linear regression parameters. (Let's call the parameters collectively $\theta$, such that $\theta = (w, b)$.)

### Ingredients for "Optimizing" a Model

At this point, we have learned what the ingredients are for optimizing a model:

1. A model, which is a function that maps inputs $x$ to outputs $y$, together with its parameters.
    1. Not to belabour the point, but in our linear regression case, the parameters are $w$ and $b$.
    1. Usually, in the literature, we call this **parameter set** $\theta$, such that $\theta$ encompasses all parameters of the model.
2. Loss function, which tells us how bad our predictions are.
3. Optimization routine, which tells the computer how to adjust the parameter values to minimize the loss function.

The last ingredient, "how to adjust the parameter values to minimize the loss function", is the key point to understand here.

**Keep note:** Because we are optimizing the loss w.r.t. two parameters, finding the $w$ and $b$ coordinates that minimize the loss is like finding the lowest point of a bowl.
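
The bowl picture can be made concrete by evaluating the mean squared error on a grid of candidate $(w, b)$ values. The sketch below assumes noise-free data generated with $w = 2$ and $b = 20$; every name ending in `_demo` or `_grid`, and the grid ranges, are choices made for this illustration. Plain NumPy is imported under the alias `onp` so that the notebook's `np` (which points to `jax.numpy`) is left alone.

```python
import numpy as onp  # plain NumPy, aliased to avoid clobbering the notebook's `np`

x_demo = onp.linspace(-5, 5, 100)
y_demo = 2 * x_demo + 20  # noise-free data with w = 2, b = 20 (assumed)

w_grid = onp.linspace(-10, 10, 41)
b_grid = onp.linspace(-10, 30, 41)

# loss_surface[i, j] is the MSE when w = w_grid[i] and b = b_grid[j].
predictions = w_grid[:, None, None] * x_demo[None, None, :] + b_grid[None, :, None]
loss_surface = onp.mean((predictions - y_demo) ** 2, axis=-1)

# The bottom of the bowl is the grid point with the smallest loss.
i_best, j_best = onp.unravel_index(onp.argmin(loss_surface), loss_surface.shape)
print(w_grid[i_best], b_grid[j_best], loss_surface[i_best, j_best])
```

A grid search like this only works when there are very few parameters, which is why we turn to gradient-based optimization instead.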

### Writing this in JAX/NumPy

How do we optimize the parameters of our linear regression model using JAX?
Let's explore how to do this.

### Exercise: Define the linear regression model

Firstly, let's define our model function.
Write it out as a Python function,
named `linear_model`,
such that the parameters $\theta$ are the first argument,
and the data `x` are the second argument.
It should return the model prediction.

What should the data type of $\theta$ be?
You can decide, as long as it's a built-in Python data type,
or NumPy data type, or some combination of the two.

In [None]:
# Exercise: Define the model in this function
def linear_model(theta, x):
    pass

from dl_workshop.answers import linear_model

### Exercise: Initialize linear regression model parameters using random numbers

Using a random number generator,
such as the `numpy.random.normal` function,
write a function that returns 
random number starting points for each linear model parameter.
Make sure it returns the parameters in a form that is accepted by
the `linear_model` function defined above.

Hint: NumPy's random module (which is distinct from JAX's) has been imported for you in the namespace `npr`.

In [None]:
def initialize_linear_params():
    pass

# Comment this out if you fill in your answer above.
from dl_workshop.answers import initialize_linear_params
theta = initialize_linear_params()

### Exercise: Define the mean squared error loss function with linear model parameters as first argument

Now, define the mean squared error loss function, called `mseloss`,
such that 
1. the parameters $\theta$ are accepted as the first argument,
2. `model` function as the second argument,
3. `x` as the third argument,
4. `y` as the fourth argument, and
5. returns a scalar valued result.

This is the function we will be differentiating,
and JAX's `grad` function will take the derivative of the function w.r.t. the first argument.
Thus, $\theta$ must be the first argument!

In [None]:
# Differentiable loss function w.r.t. 1st argument
def mseloss(theta, model, x, y):
    pass

from dl_workshop.answers import mseloss

Now, we generate a new function called `dmseloss`, by calling `grad` on `mseloss`!
The new function `dmseloss` will have the exact same signature
as `mseloss`,
but will instead return the value of the gradient
evaluated at each of the parameters in $\theta$,
in the same data structure as $\theta$.

In [None]:
# Put your answer here.


# The actual dmseloss function is also present in the answers,
# but seriously, go fill the one-liner to get dmseloss defined!
# If you fill out the one-liner above,
# remember to comment out the answer
# so that mine doesn't clobber over yours!
from dl_workshop.answers import dmseloss

I've provided an execution of the function below,
so that you have an intuition of what's being returned.
In my implementation,
because `theta` is passed in as a dictionary with keys `w` and `b`,
the gradients are returned as a dictionary with the same keys.
The return type will match up with how you pass in the parameters.

In [None]:
dmseloss(dict(w=0.3, b=0.5), linear_model, x, y)
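
To see the "same data structure in, same data structure out" behaviour in isolation, here is a small sketch using a made-up objective. The function `toy_objective` and its parameter values are hypothetical and have nothing to do with the workshop's answers module.

```python
from jax import grad


def toy_objective(params):
    """A made-up scalar function of two named parameters."""
    return params["w"] ** 2 + 3.0 * params["b"]


dtoy_objective = grad(toy_objective)

# The gradient comes back as a dictionary with the same keys as the input:
# one partial derivative per parameter.
toy_grads = dtoy_objective({"w": 1.0, "b": 2.0})
print(toy_grads)
```

The entry under `"w"` is the partial derivative with respect to `w` (here $2w$), and the entry under `"b"` is the partial derivative with respect to `b` (here $3$).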

### Exercise: Write the optimization routine

Finally, write the optimization routine!

Make it run for 3,000 iterations,
and record the loss on each iteration.
Don't forget to update your parameters!
(How you do so will depend on how you've set up the parameters.)

In [None]:
# Write your optimization routine below.


# And if you implemented your optimization loop,
# feel free to comment out the next two lines
from dl_workshop.answers import model_optimization_loop
losses, theta = model_optimization_loop(theta, linear_model, mseloss, x, y, n_steps=3000)
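
If your parameters live in a dictionary, one way to do the update step is to loop over the keys. This is a minimal sketch and not necessarily how `model_optimization_loop` is written; the helper `gradient_step` and the numbers below are hypothetical.

```python
def gradient_step(params, grads, step_size):
    """Return new parameters after one small step against the gradient."""
    return {name: value - step_size * grads[name] for name, value in params.items()}


# Hypothetical values, only to show the bookkeeping.
params_demo = {"w": 0.3, "b": 0.5}
grads_demo = {"w": -4.0, "b": -39.0}

params_demo = gradient_step(params_demo, grads_demo, step_size=0.01)
print(params_demo)
```

Negative gradients mean that increasing the parameter decreases the loss, so both parameters move upwards after the step. For more deeply nested parameter containers, `jax.tree_util.tree_map` can apply the same update without writing the loop by hand.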

Now, let's plot the loss score over time. It should be going downwards.

In [None]:
plt.plot(losses)
plt.xlabel('iteration')
plt.ylabel('mse');

Inspect your parameters to see if they've become close to the true values!

In [None]:
print(theta)

## Recap: Ingredients

1. Model specification ("equations", e.g. $y = wx + b$) and the parameters of the model to be optimized ($w$ and $b$, or more generally, $\theta$).
2. Loss function: tells us how wrong our model parameters are w.r.t. the data ($MSE$)
3. Optimization routine (for-loop)

## Linear Regression In Pictures

Linear regression can be expressed pictorially, not just in equation form. Here are two ways of visualizing linear regression.

### Matrix Form

Linear regression in one dimension looks like this:

![](../figures/linreg-scalar.png)

Linear regression in higher dimensions looks like this:

![](../figures/linreg-matrices.png)

This is also known in the statistical world as "multiple linear regression".
The general idea, though, should be pretty easy to catch.
You can do linear regression that projects any arbitrary number of input dimensions
to any arbitrary number of output dimensions.
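
In code, the matrix form is a single matrix multiplication plus an intercept vector. The sketch below uses made-up sizes and random values purely to show how the shapes line up; all names ending in `_demo` are introduced here, and plain NumPy is aliased as `onp` so that the notebook's `np` is left alone.

```python
import numpy as onp  # plain NumPy, aliased to avoid clobbering the notebook's `np`

rng = onp.random.default_rng(0)

n_samples, n_inputs, n_outputs = 8, 3, 2  # hypothetical sizes

X_demo = rng.normal(size=(n_samples, n_inputs))  # one row per data point
W_demo = rng.normal(size=(n_inputs, n_outputs))  # one column per output dimension
b_demo = rng.normal(size=(n_outputs,))           # one intercept per output dimension

# (8, 3) @ (3, 2) gives (8, 2); the intercepts are broadcast across the rows.
Y_demo = X_demo @ W_demo + b_demo
print(Y_demo.shape)
```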

### Neural Diagram

We can draw a "neural diagram" based on the matrix view, with the implicit "identity" function included in orange.

![](../figures/linreg-neural.png) 

The neural diagram is one that we commonly see in the introductions to deep learning. As you can see here, linear regression, when visualized this way, can be conceptually thought of as the baseline model for understanding deep learning. 

The neural diagram also expresses the "compute graph" that transforms input variables to output variables.

## Break (10 min.)

Let's take a break here, and reconvene in 10 minutes.

--------

## The Modular Components of Modelling

A key idea from this tutorial is to treat the aforementioned ingredients (the model and its parameters, the loss function, and the optimization routine) as **modular** components of machine learning. This means that we can swap out the model (and its associated parameters) to fit the problem that we have at hand.

## Logistic Regression

Logistic regression builds upon linear regression.
We use logistic regression to perform **binary classification**, 
that is, distinguishing between two classes. 
Typically, we label one of the classes with the integer 0, 
and the other class with the integer 1.

What does the model look like?
To help you build intuition, let's visualize logistic regression using pictures again.

### Matrix Form

Here is logistic regression in matrix form.

![](../figures/logreg-matrices.png)

### Neural Diagram

Now, here's logistic regression in a neural diagram:

![](../figures/logreg-neural.png)

### Interactive Activity

As should be evident from the pictures, 
logistic regression builds upon linear regression 
simply by **changing the activation function 
from an "identity" function to a "logistic" function**. 
In the one-dimensional case, 
it has the same two parameters as one-dimensional linear regression,  $w$ and $b$. 
Let's use an interactive visualization 
to visualize how the parameters $w$ and $b$ affect the shape of the curve.

(Note: this exercise is best done in a live notebook!)

In [None]:
from dl_workshop.answers import logistic
logistic??
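
For reference, here is a minimal sketch of the standard logistic (sigmoid) function, $\frac{1}{1 + e^{-z}}$. It assumes that the `logistic` function in the answers module follows this standard definition; the name `logistic_sketch` is made up, and this version only handles single numbers rather than arrays.

```python
import math


def logistic_sketch(z):
    """Standard logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Large negative inputs map close to 0, zero maps to 0.5,
# and large positive inputs map close to 1.
for z_value in (-5.0, 0.0, 5.0):
    print(z_value, logistic_sketch(z_value))
```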

In [None]:
@interact(w=FloatSlider(value=0, min=-5, max=5), b=FloatSlider(value=0, min=-5, max=5))
def plot_logistic(w, b):
    x = np.linspace(-10, 10, 1000)
    z = w * x + b  # linear transform on x
    y = logistic(z)    
    plt.plot(x, y)

### Discussion

1. What do $w$ and $b$ control, respectively? 
2. Where else do you see logistic-shaped curves?

### Make simulated data

In this section, we're going to show that we can optimize a logistic regression model using the same set of ingredients. 

First off, let's start by simulating some data.

In [None]:
x = np.linspace(-5, 5, 100)
w = 2
b = 1
z = w * x + b + npr.random(size=len(x))
y_true = np.round(logistic(z))
plt.scatter(x, y_true, alpha=0.3);

## Loss Function

How would we quantify how good or bad our model is? In this case, we use the logistic loss function, also known as the binary cross entropy loss function.

Expressed in equation form, it looks like this:

$$L = -\sum \left( y \log(p) + (1-y)\log(1-p) \right)$$

Here:

- $y$ is the actual class, namely $1$ or $0$.
- $p$ is the predicted probability that the class is $1$.

If you're staring at this equation, and thinking that it looks a lot like the Bernoulli distribution log likelihood, you are right!

### Discussion

Let's think about the loss function for a moment:

- What happens to the term $y \log(p)$ when $y=0$ and $y=1$? What about the $(1-y)\log(1-p)$ term?
- What happens to both terms when $p \approx 0$ and when $p \approx 1$ (but still bounded between 0 and 1)?
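
A few numbers may help with the questions above. The sketch below computes the loss contribution of a single data point for confident-and-right versus confident-and-wrong predictions. The function name `per_example_loss` and the probabilities 0.9 and 0.1 are choices made for this illustration.

```python
import math


def per_example_loss(y_actual, p_predicted):
    """Logistic loss contribution of a single data point."""
    return -(
        y_actual * math.log(p_predicted)
        + (1 - y_actual) * math.log(1 - p_predicted)
    )


# Rows: (actual class, predicted probability of class 1).
for y_actual, p_predicted in [(1, 0.9), (1, 0.1), (0, 0.9), (0, 0.1)]:
    print(y_actual, p_predicted, round(per_example_loss(y_actual, p_predicted), 3))
```

Only one of the two terms is ever active for a given data point, and the loss grows quickly as the prediction becomes confidently wrong.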

### Exercise: Write down the logistic regression model

Using the same patterns as you did before for the linear model,
define a function called `logistic_model`,
which accepts parameters `theta` and data `x`.

In [None]:
# Exercise: Define logistic model
def logistic_model(theta, x):
    pass

from dl_workshop.answers import logistic_model

### Exercise: Write down the logistic loss function

Now, write down the logistic loss function.
It is defined as the negative binary cross entropy
between the ground truth and the predicted values.

The binary cross entropy function is available for you to use:

In [None]:
from dl_workshop.answers import binary_cross_entropy

binary_cross_entropy??

In [None]:
# Exercise: Define logistic loss function, using flattened parameters
def logistic_loss(p, model, x, y):
    pass

from dl_workshop.answers import logistic_loss
logistic_loss??

Now define the gradient of the loss function, using `grad`!

In [None]:
# Exercise: Define gradient of loss function.
dlogistic_loss = grad(logistic_loss)

### Exercise: Initialize logistic regression model parameters using random numbers

Because the parameters are identical to linear regression,
you probably can use the same `initialize_linear_params` function.

In [None]:
from dl_workshop.answers import initialize_linear_params

theta = initialize_linear_params()
theta

### Exercise: Write the training loop!

This will look very similar to the linear model training loop,
because the same two parameters are being optimized.
The thing that should change is the loss function and gradient of the loss function.

In [None]:
from dl_workshop.answers import model_optimization_loop

losses, theta = model_optimization_loop(
    theta,
    logistic_model,
    logistic_loss,
    x,
    y_true,
    n_steps=5000,
    step_size=0.0001
)

In [None]:
print(theta)

You'll notice that the values are off from the true value. Why is this so? Partly it's because of the noise that we added, and we also rounded off values.

In [None]:
plt.plot(losses);

In [None]:
plt.scatter(x, y_true, alpha=0.3)
plt.plot(x, logistic_model(theta, x), color='red');

### Exercise

What if we did not round off the values, and did not add noise to the original data? Try re-running the model without those two.

## Neural Networks

Neural networks are basically very powerful versions of logistic regressions. Like linear and logistic regression, they also take our data and map it to some output, but do so without ever knowing what the true equation form is.

That's all a neural network model is: an arbitrarily powerful model.
What do feed forward neural networks look like?
To give you an intuition for this, let's see one example of a deep neural network in pictures.

### Matrix diagram

As usual, we start with a matrix diagram.

![](../figures/deepnet_regressor-matrices.png)

### Neural diagram

And for correspondence, let's visualize this in a neural diagram.

![](../figures/deepnet_regressor-neural.png)

### Discussion

- When would we want to use a neural network? When would we not want to use a neural network?

### Real Data

We are going to try using some real data from the UCI Machine Learning Repository to do something related to our work: QSAR modelling!

With the dataset below, we want to predict whether a compound is biodegradable based on a series of 41 chemical descriptors.

The dataset was taken from: https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation#. I have also prepared the data such that it is split into X (the predictors) and Y (the class that we are trying to predict), so that you do not have to worry about manipulating pandas DataFrames.

Let's read in the data.

In [None]:
import pandas as pd

X = pd.read_csv(here() / 'data/biodeg_X.csv', index_col=0)
y = pd.read_csv(here() / 'data/biodeg_y.csv', index_col=0)

### Neural network model definition

Now, let's write a neural network model. 
This neural network model starts with 41 input nodes,
has 1 hidden layer with 20 nodes, and 1 output layer with 1 node.
Because this is a classification problem,
we will use a logistic activation function right at the end.

The parameter shapes define the size of the neural network,
while the model function determines the transformations.

Let's start with the model parameters.

In [None]:
def noise(size):
    return npr.normal(size=size)

params = dict()
params['w1'] = noise((41, 20))
params['b1'] = noise((20,))
params['w2'] = noise((20, 1))
params['b2'] = noise((1,))

Now, let's define the model as a Python function:

In [None]:
def neural_network_model(theta, x):
    # "a1" is the activation from layer 1
    a1 = np.tanh(np.dot(x, theta['w1']) + theta['b1'])
    # "a2" is the activation from layer 2.
    # Explain why we need logistic at the end.
    a2 = logistic(np.dot(a1, theta['w2']) + theta['b2'])
    return a2
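
To check that the parameter shapes and the transformations line up, we can trace a small batch of fake inputs through the two layers. This is a sketch in plain NumPy (aliased as `onp` so the notebook's `np` is left alone); `logistic_sketch`, `theta_demo` and `x_batch` are hypothetical stand-ins, and the batch of 5 random "compounds" is made up.

```python
import numpy as onp  # plain NumPy, aliased to avoid clobbering the notebook's `np`

rng = onp.random.default_rng(0)


def logistic_sketch(z):
    """Standard logistic function, assumed to match the workshop's `logistic`."""
    return 1.0 / (1.0 + onp.exp(-z))


# Same parameter shapes as the `params` dictionary above.
theta_demo = {
    "w1": rng.normal(size=(41, 20)),
    "b1": rng.normal(size=(20,)),
    "w2": rng.normal(size=(20, 1)),
    "b2": rng.normal(size=(1,)),
}

x_batch = rng.normal(size=(5, 41))  # 5 made-up compounds, 41 descriptors each

# Layer 1: (5, 41) @ (41, 20) gives (5, 20).
a1_demo = onp.tanh(x_batch @ theta_demo["w1"] + theta_demo["b1"])
# Layer 2: (5, 20) @ (20, 1) gives (5, 1), squashed into the range 0 to 1.
a2_demo = logistic_sketch(a1_demo @ theta_demo["w2"] + theta_demo["b2"])

print(a1_demo.shape, a2_demo.shape)
```

Each row of the output is a single number between 0 and 1, which we can read as the predicted probability of the positive class for that compound.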

### Optimization loop

Now, write the optimization loop!
It will look very similar to the optimization loop
that we wrote for the logistic regression classification model.
The difference here is the model that is used,
as well as the initialized set of parameters.

In [None]:
losses, params = model_optimization_loop(
    params, 
    neural_network_model, 
    logistic_loss,
    X.values,
    y.values,
    step_size=0.0001
)

In [None]:
plt.plot(losses)
plt.title(f"final loss: {losses[-1]:.2f}");

### Visualize trained model performance

We can use a confusion matrix to see how "confused" a model was.
Read more on [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix).

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = neural_network_model(params, X.values)
confusion_matrix(y, np.round(y_pred))

In [None]:
import seaborn as sns

sns.heatmap(confusion_matrix(y, np.round(y_pred)))
plt.xlabel('predicted')
plt.ylabel('actual');
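
To make the layout of the confusion matrix concrete, here is a small pure-Python sketch that counts the four possible outcomes for a binary classifier. It follows the same convention as scikit-learn's `confusion_matrix` (rows are actual classes, columns are predicted classes); the function `confusion_counts` and the six labels are made up for this illustration.

```python
def confusion_counts(actual, predicted):
    """2x2 table of counts: rows are actual classes, columns are predicted classes."""
    table = [[0, 0], [0, 0]]
    for actual_class, predicted_class in zip(actual, predicted):
        table[actual_class][predicted_class] += 1
    return table


# Hypothetical labels for six compounds.
actual_demo = [0, 0, 0, 1, 1, 1]
predicted_demo = [0, 0, 1, 1, 1, 0]

print(confusion_counts(actual_demo, predicted_demo))
```

The diagonal entries count the correct predictions, while the off-diagonal entries count the two kinds of mistakes.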

## Recap

Deep learning, and more generally any modelling, has the following ingredients:

1. A model and its associated parameters to be optimized.
2. A loss function against which we are optimizing parameters.
3. An optimization routine.

You have seen these three ingredients at play with 3 different models: a linear regression model, a logistic regression model, and a deep feed forward neural network model.

### In Pictures

Here is a summary of what we've learned in this tutorial!

![](../figures/infographic.png) 

## Caveats of this tutorial

Deep learning is an active field of research. I have only shown you the basics here. In addition, I have also intentionally omitted certain aspects of machine learning practice, such as 

- splitting our data into training and testing sets, 
- performing model selection using cross-validation,
- tuning hyperparameters, such as trying out optimizers,
- regularizing the model, using L1/L2 regularization or dropout,
- etc.

## Parting Thoughts

Deep learning is nothing more than optimization of a model with a really large number of parameters.

In its current state, it is not artificial intelligence.
You should not be afraid of it;
it is nothing more than a really powerful model that maps X to Y.