<img src="https://www.th-koeln.de/img/logo.svg" style="float: right;" width="200">

# 4th exercise: <font color="#C70039">Multi-Class classification of newswires</font>
* Course: DIS21a.1
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook modifications and adaptations: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Date: 08.08.2023

<img src="https://miro.medium.com/max/700/1*HgXA9v1EsqlrRDaC_iORhQ.png" style="float: center;" width="400">

---------------------------------
**GENERAL NOTE 1**: 
Please make sure you read the entire notebook, since it contains a lot of information about your tasks (e.g. regarding the setting of certain parameters or specific computational tricks). The markdown cells and the code comments also explain how the individual parts work together as a whole. 

**GENERAL NOTE 2**: 
* When commenting source code, please use English only. 
* When describing an observation (for instance, after you have run through your test plan) you may use German.

This applies to all exercises in DIS 21a.1.  

--------------------

### <font color="#ce33ff">DESCRIPTION</font>:
The previous exercises dealt with binary classification problems, i.e. classifying data into two mutually exclusive classes using a densely connected neural network. In this exercise you will deal with a problem that has more than two classes. 

In this notebook you classify Reuters newswires into 46 different, mutually exclusive topics (a multi-class classification problem). Since each data point should be assigned to exactly one category, the problem is, more precisely, a "single-label, multi-class classification" problem (compare lecture slide p.163). 
If each data point could belong to several categories (in our case, topics) at once, we would be facing a "multi-label, multi-class classification" problem.
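
The difference between the two problem types shows up in the shape of the target vector. The following is a minimal sketch, assuming a hypothetical toy setup with 5 topics instead of 46; the helper names `one_hot` and `multi_hot` are made up for illustration and are not part of the notebook.

```python
# Hypothetical toy setup: 5 topics instead of the 46 Reuters topics.
NUM_TOPICS = 5

def one_hot(label, num_classes=NUM_TOPICS):
    """Single-label target: exactly one position is set to 1."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector

def multi_hot(labels, num_classes=NUM_TOPICS):
    """Multi-label target: one position is set to 1 for every topic that applies."""
    vector = [0] * num_classes
    for label in labels:
        vector[label] = 1
    return vector

single_label_target = one_hot(3)        # the newswire belongs to topic 3 only
multi_label_target = multi_hot([1, 3])  # the newswire belongs to topics 1 and 3

print(single_label_target)  # [0, 0, 0, 1, 0]
print(multi_label_target)   # [0, 1, 0, 1, 0]
```

In the single-label case the entries of the target vector always sum to 1, which is the situation in this exercise.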

The example works with the so-called _Reuters dataset_, a set of short newswires and their topics, published by Reuters in 1986. It is a very simple, widely used toy data set for text classification. There are 46 different topics. Some topics are more strongly represented than others, but each topic has at least 10 examples in the training set. Like IMDB and MNIST, the Reuters data set comes packaged as part of Keras.

-----------------------

### <font color="#FFC300">TASKS</font>:
Within this notebook, the tasks that you need to work on are always listed as bullet points below. 
If a task is more challenging and consists of several steps, this is indicated as well. 
Make sure you have worked through the task list and commented on what you did. 
This should be done using markdown.<br> 
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook before submitting it.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab.
2. make sure you have specified your name and your matriculation number in the header below my name and the date.
    * set the date too and remove mine.
3. read the entire notebook carefully.
    * add comments wherever you feel they are necessary for better understanding
    * run the notebook for the first time and note the result in your markdown result table (your test plan). 
4. decode the newswire back to English words, using what you have learned in the previous exercise.
5. go into the section 'building the ANN'. 
    * add the missing code that creates a network as shown in the image in the lecture slides on page 172 (File: 'DIS21a.1-7.HANDS_ON.First.DLNetwork.Architectures.for.Solving.Three.Interesting.Problems.pdf')
    * set the activation function of the hidden layers to ReLU
    * set the correct activation function in the last layer/the output. What is correct when doing a single-label, multi-class classification?
    * add the missing code for compiling the network by setting
        * the loss function 
        * the optimizer
        * an evaluation metric that makes sense
6. optimize the hyperparameters, build a new model and evaluate it on the test data. 
    * determine the minimum number of epochs and train a new model from scratch for this number of epochs.
    * evaluate it on the test set from the data set you have loaded.
7. combine these settings according to your test plan. Make sure you vary them systematically and with a clear rationale, not at random.
    * Try using smaller or larger layers: 32 units, 128 units...
    * We were using two hidden layers. Now try to use a single hidden layer, or three hidden layers.
8. comment your observations.
    * when is the accuracy increasing/decreasing? Describe your findings!

## START OF THE NOTEBOOK CODE
----------------------------------------------------------------------------------------------------------------------

In [None]:
import tensorflow
tensorflow.keras.__version__

### loading the reuters newswire data set

In [None]:
from tensorflow.keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

As with the IMDB data set, the argument `num_words=10000` restricts the data to the 10,000 most frequently occurring words found in the data.

There are 8,982 training examples and 2,246 test examples.

In [None]:
len(train_data)

In [None]:
len(test_data)

As with the IMDB reviews, each example is a list of integers (word indices).

In [None]:
train_data[10]

#### <font color="#00ff00">Task 4:</font> decode one entry back to words

In [None]:
# add your code here

In [None]:
decoded_newswire
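
As a reminder of the decoding idea from the previous exercise, here is a minimal sketch on toy data. It assumes the usual Keras convention that the indices in the data are offset by 3, because 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown". The dictionary `toy_word_index` and the sequence `toy_sequence` are hypothetical stand-ins; in the notebook the real word index would come from `reuters.get_word_index()`.

```python
# Hypothetical stand-in for the word index (word -> integer index).
toy_word_index = {"the": 1, "oil": 2, "price": 3}

# Reverse the mapping so that an integer index maps back to its word.
reverse_word_index = {index: word for word, index in toy_word_index.items()}

# Hypothetical encoded newswire: 1 marks the start, 2 marks an unknown word.
toy_sequence = [1, 4, 5, 6, 2]

# Subtract the offset of 3; reserved indices have no word and become "?".
toy_decoded = " ".join(reverse_word_index.get(i - 3, "?") for i in toy_sequence)

print(toy_decoded)  # ? the oil price ?
```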

The label associated with an example is a topic index, i.e. an integer between 0 and 45, since there are 46 topics.

In [None]:
train_labels[10]

### data preparation

The data is going to be vectorized with the exact same code as in the previous exercise.

In [None]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
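
To see what this vectorization does, it can help to run the same function on a tiny input. The sketch below repeats `vectorize_sequences` so that it is self-contained and applies it to two hypothetical toy sequences with `dimension=5` instead of 10000.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per possible word index.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column whose word index occurs in the sequence to 1.
        results[i, sequence] = 1.
    return results

# Hypothetical toy input: two "newswires" over a vocabulary of 5 words.
toy_sequences = [[1, 3, 3], [0, 2]]
toy_vectors = vectorize_sequences(toy_sequences, dimension=5)

print(toy_vectors)
# [[0. 1. 0. 1. 0.]
#  [1. 0. 1. 0. 0.]]
```

Note that word 3 occurs twice in the first sequence but still yields a single 1: this encoding records only whether a word is present, not how often or in which order.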

To vectorize the labels, one-hot encoding is used. It is a widely used format for categorical data and is often also referred to as "categorical encoding". 

Here, one-hot encoding turns each label into an all-zero vector with a 1 at the position of the label index.

In [None]:
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

# Our vectorized training labels
one_hot_train_labels = to_one_hot(train_labels)
# Our vectorized test labels
one_hot_test_labels = to_one_hot(test_labels)

<font color=red>Note</font> that Keras has a built-in and more elegant way to do this, which you have already seen in action in the MNIST example:

In [None]:
from tensorflow.keras.utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

### building the ANN

This topic classification problem looks very similar to our previous movie review classification problem: in both cases, we are trying to classify short snippets of text. There is however a new constraint here: the number of output classes has gone from 2 to 46, i.e. the dimensionality of the output space is much larger. 

#### <font color ="#00ff00">Task 5:</font> build and compile the ANN
as described in the task list

In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential()

'''ADD THE MISSING CODE HERE'''
'''LOOK AT THE TEXT ABOVE TO SEE WHAT PARAMETERS THE NETWORK SHALL HAVE'''

# your code


### Validating the ANN model

Let's set apart 1000 samples from the training data to use as a validation set. The testing data remains untouched.

In [None]:
x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

Now let's train the ANN for 20 epochs with batch_size=512 and use the history object.

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

In [None]:
history.history.keys()

Visualize the loss and accuracy by using pyplot.

In [None]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='training loss')
plt.plot(epochs, val_loss, 'b', label='validation loss')
plt.title('training / validation loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear the previous figure so that the two plots do not overlap

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plt.plot(epochs, acc, 'bo', label='training acc')
plt.plot(epochs, val_acc, 'b', label='validation acc')
plt.title('training / validation accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()

plt.show()

As you can see, the ANN starts overfitting after some epochs: the training loss keeps falling while the validation loss stops improving. 
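
Instead of reading the turning point off the plot, it can also be computed from the recorded validation losses. The following is a minimal sketch: the list `toy_val_loss` contains made-up numbers and the helper `best_epoch` is hypothetical; in the notebook you could pass `history.history['val_loss']` instead.

```python
# Hypothetical validation losses, one value per epoch (made-up numbers).
toy_val_loss = [1.70, 1.30, 1.10, 0.98, 0.93, 0.95, 0.99, 1.04]

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

print(best_epoch(toy_val_loss))  # 5
```

Training for longer than this epoch lowers the training loss further but no longer improves the loss on unseen data.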

#### <font color ="#00ff00">Task 6:</font> Optimize the hyperparameters, build a new model and evaluate it on the test data 
1. determine the minimum number of epochs and train a new model from scratch for this number of epochs
    * set all missing hyperparameters and other parameters that are needed
2. evaluate it on the test set

In [None]:
model = models.Sequential()

#add your parameters here
model.add(layers.Dense(xxx, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(xxx, activation='relu'))
model.add(layers.Dense(xxx, activation='softmax'))

#add your parameters here
model.compile(ADD YOUR params HERE)

#add your parameters here
model.fit(partial_x_train,
          partial_y_train,
          epochs=xxxx,
          batch_size=xxxx,
          validation_data=(x_val, y_val))

results = model.evaluate(x_test, one_hot_test_labels)

In [None]:
results

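The approach reaches an accuracy of ~78%. With a balanced binary classification problem, a purely random classifier would reach an accuracy of 50%. With 46 unevenly represented topics the random baseline is much lower, so your results are good by comparison. The next cell estimates this baseline by comparing the test labels with a randomly shuffled copy of themselves.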

In [None]:
import copy

test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
float(np.sum(np.array(test_labels) == np.array(test_labels_copy))) / len(test_labels)
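
The shuffled-label baseline above changes slightly on every run. Its expected value can also be computed directly: a shuffled label matches the true label of class k with probability equal to the share of class k, so the expected accuracy is the sum of the squared class shares. The sketch below illustrates this on hypothetical toy labels; the helper names are made up for illustration, and a majority-class baseline is included for comparison.

```python
from collections import Counter

def expected_shuffle_accuracy(labels):
    """Expected accuracy when the labels are compared with a shuffled copy."""
    counts = Counter(labels)
    total = len(labels)
    return sum((count / total) ** 2 for count in counts.values())

def majority_class_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical, unbalanced toy labels: 6x class 0, 3x class 1, 1x class 2.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1

print(round(expected_shuffle_accuracy(toy_labels), 2))  # 0.46
print(round(majority_class_accuracy(toy_labels), 2))    # 0.6
```

With unbalanced classes the majority-class baseline is the stricter of the two, so it is worth reporting both when judging a model.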

### <font color="#C70039">Include your result table here and reflect a good test plan (see task list)</font>

### generating predictions on new data

The `predict` method of your model returns a probability distribution over all 46 topics (which sums to 1). 
Now, generate topic predictions for all of the test data.

In [None]:
predictions = model.predict(x_test)

Each entry in predictions is a vector of length 46.

In [None]:
predictions[0].shape

The coefficients in this vector sum to 1.

In [None]:
np.sum(predictions[0])

The largest entry is the predicted class, i.e. the class with the highest probability.

In [None]:
np.argmax(predictions[0])
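
The two properties shown above (the entries sum to 1, and the largest entry marks the predicted class) follow from the softmax activation in the output layer. Here is a minimal sketch in plain Python, using hypothetical raw scores for 4 classes instead of 46; the function `softmax` below is written out for illustration and is not the Keras implementation.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw output scores for 4 classes.
toy_scores = [2.0, 0.5, -1.0, 0.1]
toy_probs = softmax(toy_scores)

# The predicted class is the index of the largest probability (argmax).
predicted_class = max(range(len(toy_probs)), key=lambda i: toy_probs[i])

print(round(sum(toy_probs), 6))  # 1.0
print(predicted_class)           # 0
```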