In [None]:
%load_ext autoreload
%autoreload 2

import edunn as nn
import numpy as np


# Cross Entropy Layer

In this exercise, you will implement the `CrossEntropyWithLabels` error layer, which measures the error of a model that outputs probabilities by comparing the predicted distribution against the true one.

In this case, `WithLabels` indicates that the true probability distribution (obtained from the dataset) is actually encoded with labels. For example, for a problem with `C=3` classes, if an example is of class 2 (counting from 0), then its label is `2`. This is a convenient way to specify that its encoding as a probability distribution would be `[0,0,1]`, which is a vector of `3` elements, where element `2` (again, counting from 0) has a probability of 1, and the rest are 0.
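To make the correspondence between labels and distributions concrete, here is a minimal NumPy sketch. The helper `labels_to_onehot` is hypothetical and written only for illustration; it is not part of `edunn`.

```python
import numpy as np

def labels_to_onehot(labels, n_classes):
    """Hypothetical helper: encode integer labels as one-hot rows."""
    onehot = np.zeros((len(labels), n_classes))
    # For each row i, set column labels[i] to 1
    onehot[np.arange(len(labels)), labels] = 1.0
    return onehot

labels = np.array([2, 0, 1])
onehot = labels_to_onehot(labels, n_classes=3)
print(onehot)
```

The first row corresponds to the example in the text: label `2` becomes `[0, 0, 1]`. Taking `argmax` along each row recovers the original labels.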



# Forward Method

The `forward` method of the `CrossEntropyWithLabels` layer assumes that its input `y` is a probability distribution, i.e., `C` non-negative values that sum to 1, where `C` is the number of classes. In turn, `y_true` is a label indicating which of the `C` classes is the correct one.

For example, if $y=(0.3,0.4,0.3)$ and $y_{true}=2$, then there will be a considerable error since the value $y_{true}=2$ indicates that the distribution $y=(0,0,1)$ is expected. So, the values $0.3$ and $0.4$ for classes 0 and 1 should decrease, and the value $0.3$ for class 2 should increase.

Cross entropy quantifies this error as the negative logarithm of the probability assigned to the correct class, $-\ln(y_{y_{true}})$, which in this case is class 2, $-\ln(y_2)$. So,

$$CrossEntropy(y,y_{true}) = CrossEntropy((0.3,0.4,0.3),2) = -\ln(0.3) \approx 1.20$$

Again, in this case, the value $0.3$ was chosen because it is at index 2 of the vector $y$, meaning another way to write the above would be:

$$E(y,y_{true}) = -\ln(y_{y_{true}}) = -\ln(0.3) \approx 1.20$$
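The same calculation can be reproduced directly with NumPy. This is just a sketch of the arithmetic for a single example, not the layer itself:

```python
import numpy as np

y = np.array([0.3, 0.4, 0.3])  # predicted distribution
y_true = 2                     # label of the correct class

# Pick the probability at the index given by the label and take -ln
error = -np.log(y[y_true])
print(f"{error:.4f}")  # 1.2040
```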

The reason for using the function $-\ln(x)$ as a penalty is that if the probability of the correct class is 1, then

$$-\ln(y_{y_{true}}) = -\ln(1) = 0$$

and there is no penalty. Otherwise, the output of $-\ln$ is positive and indicates an error, which grows as the probability of the correct class moves away from 1. This can be seen easily in a graph of the function $-\ln(x)$:

<img src="img/cross_entropy.png" width="400">
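For a numeric view of the same curve, the following sketch evaluates $-\ln(p)$ at a few probabilities chosen arbitrarily for illustration:

```python
import numpy as np

# Probabilities assigned to the correct class, from very wrong to nearly right
p = np.array([0.01, 0.1, 0.5, 0.9, 0.99])
penalty = -np.log(p)

for pi, e in zip(p, penalty):
    print(f"p={pi:.2f}  -ln(p)={e:.3f}")
```

The penalty is large when the correct class receives a small probability and shrinks toward 0 as that probability approaches 1.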

Finally, since the values of $y$ are normalized, there is no need to penalize the other probabilities for being greater than 0: as training pushes the probability of the correct class toward 1, the rest are necessarily pushed toward 0. For this reason (among others), cross entropy pairs well with the softmax function when training classification models.

In the case of a batch of examples, the calculation is independent for each example.
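As a rough illustration of the batch case, the sketch below uses NumPy integer indexing to pick, for each row, the probability of its correct class. The function `cross_entropy_with_labels` is a hypothetical standalone helper, not `edunn`'s implementation, and returning one column per example is an assumption based on the shape of the expected values used in the check for this exercise.

```python
import numpy as np

def cross_entropy_with_labels(y, y_true):
    """Hypothetical standalone sketch: per-example cross entropy with labels."""
    n = y.shape[0]
    # Row i contributes the probability at column y_true[i]
    p_correct = y[np.arange(n), y_true]
    # One error value per example, as a column vector (assumed shape)
    return -np.log(p_correct).reshape(n, 1)

y = np.array([[0.3, 0.4, 0.3],
              [0.7, 0.2, 0.1]])
y_true = np.array([2, 0])
E = cross_entropy_with_labels(y, y_true)
print(E)
```

Each row is processed on its own: the first error is $-\ln(0.3)$ and the second is $-\ln(0.7)$.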

Implement the `forward` method of the `CrossEntropyWithLabels` class:


In [None]:
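# Batch of 3 examples with 2 classes; each row is a probability distribution
y = np.array([[1, 0],
              [0.5, 0.5],
              [0.5, 0.5],
              ])
# Correct class of each example
y_true = np.array([0, 0, 1])

layer = nn.CrossEntropyWithLabels()
# Expected error per example: -ln of the probability of the correct class
E = -np.log(np.array([[1], [0.5], [0.5]]))

nn.utils.check_same(E, layer.forward(y_true, y))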

# Backward Method

Since the derivation of the equations for the `backward` method of cross entropy is a bit long, we provide [this note](http://facundoq.github.io/guides/crossentropy_derivative) with the derivation of all cases.

Again, since the error is computed separately for each example, the calculations are independent for each row.
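As a sanity check on the row-wise structure, the sketch below differentiates $E = -\ln(y_{y_{true}})$ with respect to each entry of $y$, treating the entries as independent variables: the derivative is $-1/y_{y_{true}}$ at the index of the correct class and 0 elsewhere. The function `cross_entropy_grad` is a hypothetical helper, and the return format expected by `edunn`'s `backward` may differ. The result is compared against central finite differences.

```python
import numpy as np

def cross_entropy_grad(y, y_true):
    """Hypothetical sketch: dE/dy for E = -ln(y[i, y_true[i]]), row by row."""
    rows = np.arange(y.shape[0])
    dE_dy = np.zeros_like(y)
    # Only the entry of the correct class affects the error of each row
    dE_dy[rows, y_true] = -1.0 / y[rows, y_true]
    return dE_dy

y = np.array([[0.3, 0.4, 0.3],
              [0.7, 0.2, 0.1]])
y_true = np.array([2, 0])
analytic = cross_entropy_grad(y, y_true)

# Central finite differences, perturbing one entry of y at a time
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(y.shape[0]):
    for j in range(y.shape[1]):
        y_plus, y_minus = y.copy(), y.copy()
        y_plus[i, j] += eps
        y_minus[i, j] -= eps
        e_plus = -np.log(y_plus[i, y_true[i]])
        e_minus = -np.log(y_minus[i, y_true[i]])
        numeric[i, j] = (e_plus - e_minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```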

Implement the `backward` method of the `CrossEntropyWithLabels` class:

In [None]:
# Number of random values of x and δEδy to generate and test gradients
samples = 100
batch_size = 2
features_in = 3
features_out = 5
input_shape = (batch_size, features_in)


layer = nn.CrossEntropyWithLabels()
nn.utils.check_gradient.cross_entropy_labels(layer, input_shape, samples=samples, tolerance=1e-5)    


# Logistic Regression Applied to Flower Classification

Now that we have all the elements, we can define and train our first logistic regression model to classify flowers in the [Iris dataset](https://www.kaggle.com/uciml/iris).

This time we train it with cross entropy. Although the accuracy in this case ends up similar, the cross-entropy error of this model is convex, which makes optimization easier.


In [None]:

# Load data with labels as outputs
# (note: class labels start at 0)
x, y, classes = nn.datasets.load_classification("iris")
# Normalize the data
x = (x - x.mean(axis=0)) / x.std(axis=0)
n, din = x.shape
# Calculate the number of classes
classes = y.max() + 1
print("Sizes of x and y:", x.shape, y.shape)

# Logistic Regression model, 
# with `din` input dimensions (4 for Iris)
# and `classes` output dimensions (3 for Iris)
model = nn.LogisticRegression(din, classes)
# Mean cross-entropy error over the examples
error = nn.MeanError(nn.CrossEntropyWithLabels())
# Optimization algorithm
optimizer = nn.GradientDescent(lr=0.1, epochs=1000, batch_size=32)

# Train the model and keep the error history
history = optimizer.optimize(model, x, y, error)
nn.plot.plot_history(history, error_name=error.name)


print("Model Metrics:")
y_pred = model.forward(x)
y_pred_labels = nn.utils.onehot2labels(y_pred)
nn.metrics.classification_summary(y, y_pred_labels)