In [2]:
import numpy as np
import plotly.graph_objects as go

# **Deep Learning**
## **Goodfellow, Bengio, and Courville**

Christopher La Valle

## **1 - Introduction**

## **2 - Linear Algebra**

## **3 - Probability and Information Theory**

## **4 - Numerical Computation**

See numerical linear algebra notebook

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula to provide a symbolic expression for the correct solution. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.

### **4.1 - Overflow and Underflow**


The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

**Underflow** occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid dividing by zero or taking the logarithm of zero.

**Overflow** occurs when numbers with large magnitude are approximated as $\infty$ or $-\infty$. Further arithmetic will usually change these infinite values into not-a-number values.
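
Both effects are easy to reproduce. The following is a minimal sketch, assuming NumPy's default 64-bit floats; the variable names are illustrative only.

```python
import numpy as np

# Silence NumPy's floating-point warnings so the results print cleanly.
with np.errstate(over="ignore", under="ignore", invalid="ignore"):
    tiny = np.exp(np.float64(-1000.0))  # underflow: rounded to 0.0
    huge = np.exp(np.float64(1000.0))   # overflow: approximated as inf
    not_a_number = huge - huge          # further arithmetic: inf - inf is nan

print(tiny, huge, not_a_number)
```

The last line illustrates how an infinite intermediate value turns into a not-a-number value after one more operation.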

One example of a function that must be stabilized against underflow and overflow is the **softmax function**. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be

$$\text{softmax}(\mathbf{x})_i=\frac{\exp(x_i)}{\sum_{j=1}^n\exp(x_j)}\tag{4.1}$$

Consider what happens when all the $x_i$ are equal to some constant $c$. Analytically, we can see that all the outputs should be equal to $\frac{1}{n}$. Numerically, this may not occur when $c$ has large magnitude. If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become $0$, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating $\text{softmax}(\mathbf{z})$ where $\mathbf{z}=\mathbf{x}-\max_i x_i$. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting $\max_i x_i$ results in the largest argument to $\exp$ being $0$, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of $1$, which rules out the possibility of underflow in the denominator leading to a division by zero.
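
The constant-input case described above can be checked numerically. This is a minimal sketch, not a library implementation: `naive_softmax` and `stable_softmax` are hypothetical helper names, and the inputs $c=\pm 1000$ are chosen only to force overflow and underflow in 64-bit floats.

```python
import numpy as np

def naive_softmax(x):
    """Direct transcription of equation 4.1; unstable for large |x_i|."""
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    """Softmax evaluated on z = x - max_i x_i, as described in the text."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x_large = np.full(4, 1000.0)   # every x_i equals c = 1000
x_small = np.full(4, -1000.0)  # every x_i equals c = -1000

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_large = naive_softmax(x_large)  # inf / inf
    naive_small = naive_softmax(x_small)  # 0 / 0

print(naive_large, naive_small)
print(stable_softmax(x_large), stable_softmax(x_small))
```

The naive version returns not-a-number values in both cases, while the shifted version returns $\frac{1}{n}=0.25$ for every entry.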

There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement $\log\text{softmax}(\mathbf{x})$ by first running the softmax subroutine then passing the result to the log function, we could erroneously obtain $-\infty$. Instead, we must implement a separate function that calculates $\log\text{softmax}$ in a numerically stable way. The $\log\text{softmax}$ function can be stabilized using the same trick as we used to stabilize the softmax function.
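
A minimal sketch of the difference, assuming the identity $\log\text{softmax}(\mathbf{x})_i = z_i - \log\sum_j\exp(z_j)$ with $\mathbf{z}=\mathbf{x}-\max_i x_i$. The function names are hypothetical, and the input is chosen so that the numerator of the first entry underflows.

```python
import numpy as np

def stable_softmax(x):
    """Softmax evaluated on the shifted input z = x - max_i x_i."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def log_softmax(x):
    """log softmax computed directly, without ever taking log(0)."""
    z = x - np.max(x)
    # The sum contains exp(0) = 1, so the argument of log is at least 1.
    return z - np.log(np.sum(np.exp(z)))

x = np.array([0.0, 1000.0])

with np.errstate(divide="ignore"):
    two_step = np.log(stable_softmax(x))  # first entry is log(0) = -inf

direct = log_softmax(x)
print(two_step, direct)
```

The two-step version reports $-\infty$ for the first entry, whereas the direct version returns the finite value $-1000$.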

Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.

### **4.2 - Poor Conditioning**

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.

Consider the function $f(\mathbf{x})=\mathbf{A}^{-1}\mathbf{x}$. When $\mathbf{A}\in\mathbb{R}^{n\times n}$ has an eigenvalue decomposition, its **condition number** is

$$\underset{i,j}{\max}\left|\frac{\lambda_i}{\lambda_j}\right|\tag{4.2}$$

This is the ratio of the magnitudes of the largest and smallest eigenvalues. When this number is large, matrix inversion is particularly sensitive to error in the input.

This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
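
The amplification can be seen on a small example. This is a sketch under stated assumptions: the diagonal matrix `A`, the input `x`, and the perturbation `dx` are made up for illustration, with the perturbation placed along the direction of the smallest eigenvalue to show the worst case.

```python
import numpy as np

# A diagonal matrix whose eigenvalues are 1 and 1e-6.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])

eigvals = np.linalg.eigvals(A)
cond = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))  # equation 4.2

x = np.array([1.0, 0.0])
dx = np.array([0.0, 1e-6])  # small error in the input

y = np.linalg.solve(A, x)                 # A^{-1} x
y_perturbed = np.linalg.solve(A, x + dx)  # A^{-1} (x + dx)

relative_input_change = np.linalg.norm(dx) / np.linalg.norm(x)
relative_output_change = np.linalg.norm(y_perturbed - y) / np.linalg.norm(y)

print(cond, relative_input_change, relative_output_change)
```

Here a relative input error of about $10^{-6}$ becomes a relative output error of about $1$, an amplification equal to the condition number.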

### **4.3 - Gradient-Based Optimization**

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function $f(\mathbf{x})$ by altering $\mathbf{x}$. We usually phrase most optimization problems in terms of minimizing $f(\mathbf{x})$. Maximization may be accomplished via a minimization algorithm by minimizing $-f(\mathbf{x})$.

The function we want to minimize or maximize is called the **objective function**, or **criterion**. When we are minimizing it, we may also call it the **cost function**, **loss function**, or **error function**.

We often denote the value that minimizes or maximizes a function with a superscript $*$: $\mathbf{x}^*=\arg\min f(\mathbf{x})$ for example.

Suppose we have a function $y=f(x)$, where both $x$ and $y$ are real numbers. The **derivative** $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: $f(x+\epsilon)\approx f(x)+\epsilon f'(x)$.
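
The quality of this first-order approximation can be checked on a simple function. The sketch below assumes $f(x)=x^2$, chosen only because its derivative $f'(x)=2x$ is known in closed form.

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x0 = 3.0
for eps in (1e-1, 1e-2, 1e-3):
    exact = f(x0 + eps)
    linear = f(x0) + eps * f_prime(x0)  # first-order approximation
    print(eps, exact - linear)
```

For this particular function the approximation error is exactly $\epsilon^2$ (up to rounding), so it shrinks much faster than $\epsilon$ itself.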

The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. For example, we know that $f(x-\epsilon\text{sign}(f'(x)))$ is less than $f(x)$ for small enough $\epsilon$. We can thus reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative. This technique is called **gradient descent**.
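
A minimal one-dimensional sketch of gradient descent follows. The quadratic objective, the starting point, the step size, and the number of steps are all illustrative assumptions. The update also uses a step proportional to $f'(x)$ rather than a fixed-size step in the direction of $-\text{sign}(f'(x))$, which is the usual form of the method.

```python
def f(x):
    return (x - 2.0) ** 2 + 1.0  # minimized at x = 2

def f_prime(x):
    return 2.0 * (x - 2.0)

def gradient_descent(x0, step_size=0.1, n_steps=100):
    """Repeatedly move x a small step against the sign of the derivative."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * f_prime(x)
    return x

x_min = gradient_descent(x0=-5.0)
print(x_min, f(x_min))
```

With these settings the distance to the minimizer shrinks by a factor of $0.8$ per step, so the iterate ends up very close to $x=2$.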

When $f'(x)=0$, the derivative provides no information about which direction to move. Points where $f'(x)=0$ are known as **critical points**, or **stationary points**. A **local minimum** is a point where $f(x)$ is lower than at all neighboring points, so it is no longer possible to decrease $f(x)$ by making infinitesimal steps. A **local maximum** is a point where $f(x)$ is higher than at all neighboring points, so it is not possible to increase $f(x)$ by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as **saddle points**.



In [5]:
# Plot a function with several critical points, plus a separate saddle point example
import numpy as np
import plotly.graph_objects as go

# f(x) = 0.1x^6 - 0.8x^4 + 1.5x^2 - 2x + 1 (coefficients in descending powers)
f = np.poly1d([0.1, 0, -0.8, 0, 1.5, -2, 1])
df = f.deriv()    # f'(x)
d2f = df.deriv()  # f''(x)

# Define x range and evaluate f on it
x = np.linspace(-4, 4, 1000)
y = f(x)

# Critical points are the real roots of f'(x) = 0
roots = df.roots
real_roots = np.sort(roots[np.abs(roots.imag) < 1e-9].real)

# Classify each critical point with the second-derivative test.
# f grows without bound as |x| grows, so the lowest critical value is the global minimum.
x_global = min(real_roots, key=f)
critical_points = []
for r in real_roots:
    if d2f(r) > 0:
        label = "Global Minimum" if r == x_global else "Local Minimum"
    elif d2f(r) < 0:
        label = "Local Maximum"
    else:
        label = "Degenerate Critical Point"
    critical_points.append((r, f(r), label))

# Create the main plot
fig = go.Figure()

# Add the main function curve
fig.add_trace(go.Scatter(
    x=x,
    y=y,
    mode='lines',
    name='f(x) = 0.1x⁶ - 0.8x⁴ + 1.5x² - 2x + 1',
    line=dict(color='blue', width=3)
))

# Add a marker, a dashed guide line, and an annotation for each critical point
for x_pt, y_pt, label in critical_points:
    color = 'red' if 'Maximum' in label else 'green' if 'Minimum' in label else 'orange'

    fig.add_trace(go.Scatter(
        x=[x_pt],
        y=[y_pt],
        mode='markers',
        name=label,
        marker=dict(
            size=10,
            color=color,
            symbol='circle',
            line=dict(width=2, color='black')
        )
    ))

    # Dashed line from the bottom of the curve up to the critical point
    fig.add_trace(go.Scatter(
        x=[x_pt, x_pt],
        y=[min(y), y_pt],
        mode='lines',
        line=dict(color=color, width=2, dash='dash'),
        showlegend=False
    ))

    # Place maxima labels above the point and minima labels below it
    fig.add_annotation(
        x=x_pt, y=y_pt,
        text=label,
        showarrow=True,
        arrowhead=2,
        arrowcolor=color,
        ax=0,
        ay=-40 if 'Maximum' in label else 40
    )

# Saddle point example (separate function drawn on the same axes):
# g(x) = x^3 has g'(0) = 0, but x = 0 is neither a minimum nor a maximum
x_saddle = np.linspace(-2, 2, 100)
y_saddle = x_saddle**3
x_offset = 6  # Shift right so the curve does not overlap the main function

fig.add_trace(go.Scatter(
    x=x_saddle + x_offset,
    y=y_saddle,
    mode='lines',
    name='g(x) = x³, shifted right by 6 (Saddle Point Example)',
    line=dict(color='purple', width=3)
))

# Mark the saddle point (x = 0 for the cubic, drawn at x = 6)
saddle_x, saddle_y = x_offset, 0
fig.add_trace(go.Scatter(
    x=[saddle_x],
    y=[saddle_y],
    mode='markers',
    name='Saddle Point',
    marker=dict(
        size=10,
        color='orange',
        symbol='diamond',
        line=dict(width=2, color='black')
    )
))

# Add dashed line for saddle point
fig.add_trace(go.Scatter(
    x=[saddle_x, saddle_x],
    y=[-10, saddle_y],
    mode='lines',
    line=dict(color='orange', width=2, dash='dash'),
    showlegend=False
))

fig.add_annotation(
    x=saddle_x, y=saddle_y,
    text="Saddle Point<br>(g'(x)=0 but neither min nor max)",
    showarrow=True,
    arrowhead=2,
    arrowcolor="orange",
    ax=0,
    ay=40
)

# Update layout
fig.update_layout(
    title='Critical Points in Optimization: Minima, Maxima, and Saddle Points',
    xaxis_title='x',
    yaxis_title='f(x)',
    width=1000,
    height=600,
    showlegend=True,
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01
    )
)

fig.show()
