Ben Steves, CS344, 2-16-22

# Softmax, part 1

Task: practice using the `softmax` function.

**Why**: The softmax is a building block used throughout machine learning, statistics, data modeling, and even statistical physics. This activity is designed to help you get comfortable with how it works at both a high and a low level.

**Note**: Although "softmax" is the conventional name in machine learning, you may also see it called "soft *arg* max". The [Wikipedia article](https://en.wikipedia.org/w/index.php?title=Softmax_function&oldid=1065998663) has a good explanation.

## Setup

In [1]:
import torch
from torch import tensor
import ipywidgets as widgets
import matplotlib.pyplot as plt
%matplotlib inline

## Task

The following function defines `softmax` by using PyTorch built-in functionality.

In [2]:
def softmax_torch(x):
    return torch.softmax(x, dim=0)  # compute softmax along the only dimension

Let's try it on an example tensor.

In [3]:
x = tensor([1., 2., 3.])  # example inputs
softmax_torch(x)

tensor([0.0900, 0.2447, 0.6652])

1. Start by playing with the interactive widget below. Describe the outputs when:

    1. All of the inputs are the same.
    2. One input is much bigger than the others.
    3. One input is much smaller than the others.

Finally, describe the input that gives the largest possible value for output 1.

In [4]:
r = 2.0
@widgets.interact(x0=(-r, r), x1=(-r, r), x2=(-r, r))
def show_softmax(x0, x1, x2):
    x = tensor([x0, x1, x2])
    xs = softmax_torch(x)
    plt.barh([2, 1, 0], xs)
    plt.xlim(0, 1)
    plt.yticks([2, 1, 0], ['output 0', 'output 1', 'output 2'])
    plt.ylabel("softmax(x)")
    return xs

A. When all inputs are the same, all outputs are the same: each of the three probabilities is 1/3 (0.333...).

B. When one input is much bigger than the others, its output takes most of the probability and the other two outputs shrink toward 0.

C. When one input is much smaller than the others, its output is close to 0. The remaining probability is split between the other two outputs according to the difference between their inputs.

To make output 1 as large as possible, set x1 to the slider maximum (2.0) and x0 and x2 to the slider minimum (-2.0).
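
The answers above can be checked without the widget. The sketch below is a plain-Python stand-in for the notebook's `softmax_torch`; the helper name `softmax_plain` and the specific test inputs are assumptions made for illustration. With the sliders limited to [-2, 2], output 1 tops out near 0.965 rather than reaching 1.

```python
import math

def softmax_plain(values):
    """Plain-Python softmax for a short list of floats (hypothetical helper)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

equal_inputs = softmax_plain([0.5, 0.5, 0.5])   # case A: all inputs equal
one_large = softmax_plain([-2.0, 2.0, -2.0])    # slider extremes that favor output 1
one_small = softmax_plain([1.0, -2.0, 0.0])     # case C: x1 much smaller than the rest

print(equal_inputs)
print(one_large)
print(one_small)
```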

2. Fill in the following function to implement softmax yourself:

In [5]:
def softmax(xx):
    # Exponentiate x so all numbers are positive.
    expos = xx.exp()
    assert expos.min() >= 0
    # Normalize (divide by the sum).
    return expos / expos.sum()

3. Evaluate `softmax(x)` and verify that it is close to the `softmax_torch(x)` you evaluated above.

In [6]:
softmax(x)

tensor([0.0900, 0.2447, 0.6652])

4. Evaluate `softmax_torch(__)` for each of the following expressions. Observe how each output relates to `softmax_torch(x)`.

- `x + 1`
- `x - 100`
- `x - x.max()`
- `x * 0.5`
- `x * 3.0`

In [7]:
q4_1 = softmax_torch(x+1)  # no change: adding a constant does not affect softmax
q4_2 = softmax_torch(x-100)  # no change
q4_3 = softmax_torch(x-x.max())  # no change
q4_4 = softmax_torch(x*0.5)  # changes: the first two outputs go up and the last goes down (flatter)
q4_5 = softmax_torch(x*3.0)  # changes: almost all of the probability goes to the last output (sharper)
print("x+1: %s \nx-100: %s \nx-x.max: %s \nx*0.5: %s \nx*3.0: %s" 
      %(q4_1, q4_2, q4_3, q4_4, q4_5))

x+1: tensor([0.0900, 0.2447, 0.6652]) 
x-100: tensor([0.0900, 0.2447, 0.6652]) 
x-x.max: tensor([0.0900, 0.2447, 0.6652]) 
x*0.5: tensor([0.1863, 0.3072, 0.5065]) 
x*3.0: tensor([0.0024, 0.0473, 0.9503])


5. *Numerical issues*. Assign `x2 = 50 * x`. Try `softmax(x2)` and observe that the result includes the dreaded `nan` -- "not a number". Something went wrong. **Evaluate the first mathematical operation in `softmax`** for this particularly problematic input. You should see another kind of abnormal value.

In [8]:
x2 = 50 * x
softmax(x2)

tensor([0., nan, nan])

In [9]:
x2.exp() #same as e^x

tensor([5.1847e+21,        inf,        inf])

6. *Fixing numerical issues*. Now try `softmax(x2 - 150.0)`. Observe that you now get valid numbers. Also observe how the constant we subtracted relates to the value of `x2`.

In [10]:
softmax(x2-150.0)

tensor([3.7835e-44, 1.9287e-22, 1.0000e+00])

7. Copy your `softmax` implementation to a new function, `softmax_stable`, and change it so that it subtracts `xx.max()` before exponentiating. (Don't use any in-place operations.) Verify that `softmax_stable(x2)` now works, and obtains the same result as `softmax_torch(x2)`.

In [11]:
def softmax_stable(xx):
    # Shift so the largest input is 0, then exponentiate so all numbers are positive.
    expos = (xx - xx.max()).exp()  # every exponent is <= 0, so exp() cannot overflow
    assert expos.min() >= 0
    # Normalize (divide by the sum).
    return expos / expos.sum()

softmax_stable(x2)

tensor([3.7835e-44, 1.9287e-22, 1.0000e+00])

In [13]:
softmax_torch(x2)

tensor([3.7835e-44, 1.9287e-22, 1.0000e+00])

## Analysis

Consider the following situation:

In [14]:
x2 = tensor([1., 0.,])
x3 = x2 - 1
x3

tensor([ 0., -1.])

In [15]:
x4 = x2 * 2
x4

tensor([2., 0.])

1. Are `softmax(x2)` and `softmax(x3)` the same or different? How could you tell without having to evaluate them?


`softmax(x2)` and `softmax(x3)` are the same. `x3` is `x2` with a constant subtracted, and subtracting a constant c multiplies every exponential by the same factor exp(-c). That factor appears in both the numerator and the denominator, so it cancels and the proportions do not change.

2. Are `softmax(x2)` and `softmax(x4)` the same or different? How could you tell without having to evaluate them?


`softmax(x2)` and `softmax(x4)` are different. Softmax depends on the differences between the inputs, and multiplying `x2` by 2 raises `x4[0]` to 2 while `x4[1]` stays at 0, so the gap between the two inputs doubles from 1 to 2. The larger input, `x4[0]`, now gets more weight, so its output is larger than in `softmax(x2)` and the output for `x4[1]` is smaller.
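
A quick numeric check of the two answers above, as a sketch in plain Python rather than PyTorch. `softmax_plain` is a hypothetical helper, and the lists mirror the tensors `x2`, `x3`, and `x4` from the Analysis cells.

```python
import math

def softmax_plain(values):
    """Plain-Python softmax for a short list of floats (hypothetical helper)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

x2 = [1.0, 0.0]
x3 = [v - 1 for v in x2]  # shifted by a constant
x4 = [v * 2 for v in x2]  # scaled by a constant

p2 = softmax_plain(x2)
p3 = softmax_plain(x3)
p4 = softmax_plain(x4)

print(p2)  # roughly [0.73, 0.27]
print(p3)  # matches p2 up to rounding
print(p4)  # roughly [0.88, 0.12]
```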

3. Explain why `softmax(x2)` failed.

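This refers to the earlier `x2 = 50 * x`, that is, `tensor([50., 100., 150.])`. The first step of `softmax` computes e^50, e^100, and e^150. PyTorch stores these as 32-bit floats by default, whose largest finite value is about 3.4e38, so e^100 and e^150 overflow to `inf` while e^50 (about 5.2e21) still fits. Dividing by the sum, which is `inf`, then gives 0 for the first element and `inf / inf = nan` for the other two.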
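
The arithmetic behind this failure can be sketched in plain Python. Python floats are 64-bit, so the example compares each exponential against the float32 limit instead of overflowing itself; the constant name `FLOAT32_MAX` is introduced here for illustration.

```python
import math

FLOAT32_MAX = 3.4028235e38  # largest finite 32-bit float

for exponent in (50, 100, 150):
    value = math.exp(exponent)  # computed in 64-bit, so it stays finite
    status = "overflows float32" if value > FLOAT32_MAX else "fits in float32"
    print(exponent, value, status)

inf = float("inf")
finite_over_inf = math.exp(50) / inf  # finite / inf gives 0.0
inf_over_inf = inf / inf              # inf / inf gives nan

print(finite_over_inf, inf_over_inf)
```

This matches the pattern in the notebook output `tensor([0., nan, nan])`: one finite numerator over an infinite sum, then two infinite numerators over an infinite sum.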

4. Use your observations in \#1-2 above to explain why `softmax_stable` still gives the correct answer even though we changed the input.

The numbers passed to `softmax` before we wrote `softmax_stable` were too large to exponentiate in 32-bit floating point. Because subtracting a constant does not affect the proportions, we subtract `xx.max()` to shrink the values before exponentiating. It doesn't matter whether it's e^0/(e^1+e^0) or e^63/(e^64+e^63); both give the same proportion.

5. Explain why `softmax_stable` doesn't give us infinity or Not A Number anymore.

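Going off of the previous answer, the numbers are no longer too large because we subtract the largest element of the tensor, `xx.max()`, from every element. The largest exponent is then 0, so every exponential lies between 0 and 1 and nothing overflows to `inf`. The sum is at least 1 (the largest element contributes e^0 = 1), so we never divide by 0 or compute `inf / inf`.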
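
The same reasoning can be sketched in plain Python. The helper names `softmax_naive` and `softmax_shifted` are hypothetical, and the inputs are larger than in the notebook because Python's 64-bit floats only overflow past an exponent of about 709. Note also that `math.exp` raises `OverflowError` where PyTorch would return `inf`.

```python
import math

def softmax_naive(values):
    """Exponentiate directly; fails when an input is too large."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_shifted(values):
    """Subtract the largest input first, as softmax_stable does."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # every exponent is <= 0
    total = sum(exps)  # at least 1, because the largest input contributes exp(0)
    return [e / total for e in exps]

big = [500.0, 1000.0, 1500.0]

try:
    softmax_naive(big)
    naive_failed = False
except OverflowError:
    naive_failed = True  # math.exp(1000) is out of range for a 64-bit float

stable = softmax_shifted(big)
print(naive_failed, stable)
```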

## Extension *optional*

Try to prove your observation in Analysis \#1 by symbolically simplifying the expression `softmax(logits + c)` and seeing if you can get `softmax(logits)`. Remember that `softmax(x) = exp(x) / exp(x).sum()` and `exp(a + b) = exp(a)exp(b)`.
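
One way the simplification could go, written per element, with $x$ standing for `logits` and $c$ for the added constant (the index names $i$ and $j$ are introduced here):

$$
\begin{aligned}
\operatorname{softmax}(x + c)_i
  &= \frac{\exp(x_i + c)}{\sum_j \exp(x_j + c)} \\
  &= \frac{\exp(c)\,\exp(x_i)}{\exp(c)\,\sum_j \exp(x_j)} \\
  &= \frac{\exp(x_i)}{\sum_j \exp(x_j)} \\
  &= \operatorname{softmax}(x)_i
\end{aligned}
$$

The factor $\exp(c)$ is the same for every term, so it can be pulled out of the sum and cancelled. Taking $c = -\max(x)$ gives the shift used in `softmax_stable`.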