## A Prediction Function
A prediction for a binary event (true vs. false) is expressed as a floating-point number between 0.0 (cannot happen) and 1.0 (certain to happen). A _prediction function_ estimates the probability $\hat{Y}$ of an outcome `C`, given data `X`:

$\hat{Y} = P(C | X)$

A machine learning classification model is essentially a prediction function. For example, a super-simple spam classifier for emails could make keyword-based predictions, given the message body. By convention, a prediction `>= 0.5` is `True` and `< 0.5` is `False`, although you could set the threshold higher or lower depending on the situation.

```python
# Each spam keyword found in the mail body adds 0.25 to the probability
# (capped at 1.0), so 2 or more keywords reach the 0.5 spam threshold.
def predict_spam(body):
    spam_words = ['credit', 'card', 'loan', 'meds', 'viagra']
    probability = 0.0
    for word in spam_words:
        if word in body:  # case-sensitive substring match
            probability += 0.25
    return min(probability, 1.0)

spam = "Get a loan for our low price viagra meds!"
ham = "Congratulations on your promotion to staff scientist!"
d = {True: "is", False: "is not"}

for x in [spam, ham]:
    p_spam = predict_spam(x)
    is_spam = p_spam >= 0.5
    print(f"Probability '{x}' is spam = {p_spam}. Probably it {d[is_spam]} spam.")
```

```
Probability 'Get a loan for our low price viagra meds!' is spam = 0.75. Probably it is spam.
Probability 'Congratulations on your promotion to staff scientist!' is spam = 0.0. Probably it is not spam.
```
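The effect of moving the True/False threshold can be sketched in a few lines. The `classify` helper and the `0.9` cut-off are illustrative assumptions, not part of the classifier above; the score `0.75` is the value reported for the spam message.

```python
def classify(probability, threshold=0.5):
    """Hypothetical helper: turn a probability into a True/False decision."""
    return probability >= threshold

p_spam = 0.75  # the score reported above for the spam message

print(classify(p_spam))                 # True with the conventional 0.5 threshold
print(classify(p_spam, threshold=0.9))  # False with a stricter, assumed threshold
```

A higher threshold flags fewer legitimate messages as spam but lets more spam through; which trade-off is right depends on the application.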


## Multi-class prediction

### A naive approach
When an outcome can be one of several classes, a classification model makes an array of predictions. We assign one position in the array to each class; thus each element in the prediction array represents the prediction for its corresponding class.

The sample code below is one way to predict the probabilities for each face when you roll a die.

```python
import numpy as np

def predict_die_classes(seed):
    # Stand-in for a model: six random scores in [0, 1), one per die face
    np.random.seed(seed)
    return np.random.uniform(size=(6,))

predict_die_classes(11)
```

```
array([0.18026969, 0.01947524, 0.46321853, 0.72493393, 0.4202036 ,
       0.4854271 ])
```

```python
def predict_spots(class_predictions):
    # argmax returns a 0-based index; die faces are numbered from 1
    return np.argmax(class_predictions) + 1

for i in range(6):
    print(predict_spots(predict_die_classes(i)))
```

```
2
2
3
6
3
4
```


### The problem with naivete
We have a problem: the class probabilities for an event must sum to 1.0. But the sum of the array elements returned by the `predict_die_classes` function does not equal 1.0:

```python
np.sum(predict_die_classes(11))
```

```
2.293528088810396
```

### The softmax solution
One way to solve the problem would be proportional scaling: simply divide each element by the sum of the array. That works only when every element is non-negative, however. A better way is the softmax function, which accepts unbounded real numbers (as the raw predictions of real-world machine learning models typically are), always returns positive values that sum to 1.0, and pushes the highest probability toward 1.0 and the smaller ones toward 0.0:

$\mathrm{softmax}([y_0, y_1, \ldots, y_k]) = \frac{[e^{y_0}, e^{y_1}, \ldots, e^{y_k}]}{\sum_{n=0}^{k} e^{y_n}}$
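To see why plain proportional scaling falls short for unbounded scores, here is a minimal sketch. The helper name `normalize_by_sum` and both score arrays are made up for illustration.

```python
import numpy as np

def normalize_by_sum(scores):
    """Hypothetical helper: divide each score by the total of all scores."""
    scores = np.asarray(scores, dtype=float)
    return scores / np.sum(scores)

positive_scores = np.array([0.2, 0.1, 0.7, 1.0])   # all non-negative
mixed_scores = np.array([2.0, -0.5, 2.5, -1.0])    # some negative, as raw model outputs can be

print(normalize_by_sum(positive_scores))  # non-negative entries that sum to 1.0
print(normalize_by_sum(mixed_scores))     # sums to 1.0 but contains negative "probabilities"
```

Exponentiating each score first, as softmax does, guarantees that every entry is positive before the division.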

```python
def predict_die_classes_v2(seed):
    # Stand-in for raw model outputs: six unbounded real numbers, positive or negative
    np.random.seed(seed)
    return np.random.randn(6) * 4

print(predict_die_classes_v2(42))

def softmax(y):
    exps = np.exp(y)  # exponentiate once, then normalize
    return exps / np.sum(exps)

print()
print(softmax(predict_die_classes_v2(42)))
```

```
[ 1.98685661 -0.5530572   2.59075415  6.09211943 -0.9366135  -0.93654783]

[1.57049260e-02 1.23869772e-03 2.87279915e-02 9.52640149e-01
 8.44090392e-04 8.44145826e-04]
```
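One practical caveat: `np.exp` overflows for large inputs, so a direct implementation of the formula can return `nan`. A common remedy, sketched here under the hypothetical name `softmax_stable`, is to subtract the maximum score before exponentiating. This does not change the result, because adding the same constant to every score leaves the softmax output unchanged.

```python
import numpy as np

def softmax_stable(y):
    """Softmax that shifts scores by max(y) first (hypothetical name)."""
    y = np.asarray(y, dtype=float)
    shifted = y - np.max(y)   # largest entry becomes 0, so exp() cannot overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])  # np.exp(1000.0) alone would overflow to inf

print(softmax_stable(small))
print(softmax_stable(large))  # same probabilities as for `small`
```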


## Measuring Prediction Accuracy with a Loss Function
Now that we have a class prediction array whose probabilities sum to 1.0, we can design a _loss function_ that measures the accuracy of the prediction. The basic idea is that the more accurate the prediction, the lower the loss; conversely, the less accurate the prediction, the higher the loss. Machine learning is, at its most basic, a way to gradually adjust the parameters of a model so that its loss decreases as it iteratively compares training data examples to their corresponding predictions. As the model's loss decreases, we expect (with certain caveats) its accuracy in predicting unseen examples to increase.

The process for adjusting parameters is a complex topic for another day. For today, let's just see how we can measure the loss of a classification model. 

### Step 1: Class labels in one-hot format
Remember that the model's prediction is simply the index of the highest probability in the prediction array. A naive way to measure loss would be to set the loss to 0.0 when the index matches the ground truth label and to 1.0 when it does not. However, this would not be very helpful, because we want to adjust the model parameters in proportion to the size of the error.

For example, if the actual die roll is `3` and the softmax prediction array is `[0.01, 0.95, 0.01, 0.01, 0.01, 0.01]` ==> `2`, we would like to make a big adjustment to the model parameters that disfavored a `3`.

If the actual die roll is `3` and the prediction array is `[0.01, 0.01, 0.475, 0.01, 0.01, 0.485]` ==> `6`, we have a different situation: the model almost chose the right class, so the amount by which we want to adjust the parameters that disfavored a `3` is much smaller.
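The limitation of the 0/1 loss can be sketched with two illustrative prediction arrays in the same spirit as the examples above. The function `zero_one_loss` and the array names are assumptions made for this sketch.

```python
import numpy as np

def zero_one_loss(prediction, true_face):
    """Hypothetical naive loss: 0.0 if the most probable face is correct, else 1.0."""
    predicted_face = int(np.argmax(prediction)) + 1  # faces are numbered from 1
    return 0.0 if predicted_face == true_face else 1.0

true_face = 3
confident_miss = np.array([0.01, 0.95, 0.01, 0.01, 0.01, 0.01])  # very sure of a 2
near_miss = np.array([0.02, 0.02, 0.45, 0.02, 0.02, 0.47])       # almost picked the 3

for name, prediction in [("confident miss", confident_miss), ("near miss", near_miss)]:
    loss = zero_one_loss(prediction, true_face)
    p_true = prediction[true_face - 1]
    print(f"{name}: 0/1 loss = {loss}, probability given to the true face = {p_true}")
```

Both predictions receive the same 0/1 loss, even though the second one assigns far more probability to the correct face. A useful loss should tell these two cases apart.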

The way forward is to put the ground truth label in _one-hot_ format. This means that the label is an array with a value of 1.0 at the index corresponding to the correct class, and zeros everywhere else. For example, a ground truth label of `3` for a die roll would be as follows in one-hot format:

```
[0,0,1,0,0,0]
```

A ground truth label of `1` would of course be:

```
[1,0,0,0,0,0]
```

By putting the label in one-hot format, we can now make an element-wise comparison between the prediction array and the one-hot ground truth array.



```python
def one_hot(label, num_classes):
    the_array = np.zeros((num_classes,))
    the_array[label - 1] = 1.0  # labels start at 1, array indices at 0
    return the_array

one_hot(3, 6)
```

```
array([0., 0., 1., 0., 0., 0.])
```
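When there are many labels to encode at once, one possible shortcut is to index the rows of an identity matrix. The name `one_hot_batch` is hypothetical; like `one_hot` above, this sketch assumes labels start at 1.

```python
import numpy as np

def one_hot_batch(labels, num_classes):
    """Hypothetical batch version of one_hot: one row per 1-based label."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i + 1
    return np.eye(num_classes)[labels - 1]

print(one_hot_batch([3, 1, 6], 6))
```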

```python
label_6 = one_hot(6, 6)
diffs_6 = softmax(predict_die_classes_v2(42)) - label_6
print(diffs_6) # not very good

label_4 = one_hot(4, 6)
diffs_4 = softmax(predict_die_classes_v2(42)) - label_4
print(diffs_4) # much better!
```

```
[ 1.57049260e-02  1.23869772e-03  2.87279915e-02  9.52640149e-01
  8.44090392e-04 -9.99155854e-01]
[ 0.01570493  0.0012387   0.02872799 -0.04735985  0.00084409  0.00084415]
```
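A difference array still has to be reduced to a single number before it can act as a loss. As a small side observation, simply summing it cannot work: the prediction and the one-hot label each sum to 1.0, so their element-wise differences always sum to zero. The arrays below are made up for illustration.

```python
import numpy as np

prediction = np.array([0.1, 0.2, 0.4, 0.1, 0.1, 0.1])  # illustrative, sums to 1.0
label_3 = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])     # the model's top choice is correct
label_5 = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0])     # the model's top choice is wrong

for label in (label_3, label_5):
    diffs = prediction - label
    # The plain sum is (numerically) zero for any label;
    # the sum of absolute differences is larger for the worse prediction.
    print(np.sum(diffs), np.sum(np.abs(diffs)))
```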


### Step 2: The cross-entropy loss function
The simple difference array does not work well as a loss function because it makes the optimization of model parameters very difficult. (See [this Quora post](https://qr.ae/TW4uPv) for an explanation.) The cross-entropy loss function (symbolized as `J`) is much more effective for the purpose:

Let the prediction array be `P` and the one-hot ground truth be `Y`:

$J(P, Y) = -\sum_{i} Y_i \log(P_i)$

See [this blog post](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/) for a deeper explanation of why cross-entropy is a useful construct for measuring loss.
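Because `Y` is one-hot, every term of the sum vanishes except the one for the correct class. Writing `c` for the index of the correct class (a symbol introduced here only for illustration), the loss reduces to:

$J(P, Y) = -\sum_{i} Y_i \log(P_i) = -\log(P_c)$

The loss is therefore 0 when the model assigns probability 1.0 to the correct class, and it grows without bound as that probability approaches 0.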

```python
def cross_entropy_loss(prediction, actual):
    """ Returns the CE loss, given a prediction array and a one-hot ground truth array"""
    return -np.sum(np.log(prediction) * actual)

loss_predicting_6 = cross_entropy_loss(softmax(predict_die_classes_v2(42)), label_6)
print(loss_predicting_6)

loss_predicting_4 = cross_entropy_loss(softmax(predict_die_classes_v2(42)), label_4)
print(loss_predicting_4)
```

```
7.077185298608924
0.04851804518010041
```
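One edge case worth noting: if any predicted probability is exactly 0.0, `np.log` returns `-inf`, and multiplying `-inf` by a 0 in the one-hot array gives `nan`. A common workaround, sketched here under the hypothetical name `cross_entropy_loss_clipped` with an assumed tolerance `eps`, is to clip the predictions away from zero first.

```python
import numpy as np

def cross_entropy_loss_clipped(prediction, actual, eps=1e-12):
    """Cross-entropy with predictions clipped to [eps, 1.0]; eps is an assumed tolerance."""
    prediction = np.clip(prediction, eps, 1.0)
    return -np.sum(np.log(prediction) * actual)

prediction = np.array([0.0, 0.1, 0.2, 0.7, 0.0, 0.0])  # illustrative; contains exact zeros
label_4 = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
label_1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

loss_4 = cross_entropy_loss_clipped(prediction, label_4)
print(loss_4, -np.log(0.7))  # the two agree: only the true-class term contributes
print(cross_entropy_loss_clipped(prediction, label_1))  # large but finite
```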
