<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/katas/algorithms/DNN_ReutersNews.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Kata: Reuters News Dataset

The [Reuters](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection) dataset is a set of short newswires and their topics, published by Reuters in 1987 and widely used for text classification. There are 46 different topics, some of which are far better represented than others. These topics are mutually exclusive: each newswire belongs to exactly one topic.

The goal is to classify news articles by their topic.

![Reuters logo](images/Reuters-logo.png)

## Package setup

In [None]:
# The mlkatas package contains various utility functions required by all katas
!pip install mlkatas

In [None]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import mlkatas

In [None]:
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10
%config InlineBackend.figure_format = 'retina'
sns.set()

In [None]:
# Import ML packages (edit this list if needed)
import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.datasets import reuters
from tensorflow.keras.utils import to_categorical

print(f'TensorFlow version: {tf.__version__}')
print(f'Keras version: {tf.keras.__version__}')

## Step 1: Loading the data

### Question

* Load the Reuters dataset included with Keras. Limit yourself to the 10,000 most frequent words.
* Print shapes of training data and labels.
* Print the first training sample.
* Print the first 10 labels.

In [None]:
# The following code prevents a loading error caused by an API change in NumPy
# https://stackoverflow.com/questions/55890813/how-to-fix-object-arrays-cannot-be-loaded-when-allow-pickle-false-for-imdb-loa

# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# YOUR CODE HERE

# restore np.load for future normal usage
np.load = np_load_old

In [None]:
# Show the first training sample as text

# word_index is a dictionary mapping words to an integer index
word_index = reuters.get_word_index()
# We reverse it, mapping integer indices to words
reverse_word_index = {value: key for key, value in word_index.items()}
# We decode the news; note that our indices were offset by 3
# because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
decoded_news = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decoded_news)
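
The index offset used in the decoding step can be checked on a miniature example. The sketch below uses a hypothetical three-word index (`toy_word_index`) in place of the real one returned by `reuters.get_word_index()`; the sample values are made up for illustration.

```python
# Hypothetical miniature word index (word -> rank), standing in for reuters.get_word_index()
toy_word_index = {"oil": 1, "prices": 2, "rose": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# Encoded sample: 1 marks the start of the sequence, 2 marks an unknown word,
# and every real word is stored as its rank + 3
toy_sample = [1, 4, 5, 2, 6]

# Reserved indices have no entry after subtracting 3, so they decode to '?'
decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in toy_sample)
print(decoded)  # ? oil prices ? rose
```

The leading `?` in a decoded newswire is therefore the start-of-sequence marker, not a missing word.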

## Step 2: Preparing the data

### Question

Prepare the data for training: one-hot encode the labels, then set apart the first 1,000 training examples for validation. Store the data subsets in variables named `x_train`/`y_train`, `x_val`/`y_val` and `x_test`/`y_test`.
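
The encoding and splitting steps can be sketched on toy data with NumPy alone. This is an illustration under assumptions, not the kata's solution: `multi_hot` and `one_hot` are hypothetical helpers, `multi_hot` is only assumed to mirror what `mlkatas.vectorize_sequences` does, and the toy values are made up. In the notebook itself, the imported `to_categorical` can handle the labels.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Encode each sequence of word indices as a vector of 0s and 1s."""
    results = np.zeros((len(sequences), dimension))
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0  # set every index that occurs in the sequence
    return results

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows."""
    results = np.zeros((len(labels), num_classes))
    results[np.arange(len(labels)), labels] = 1.0
    return results

# Toy data: 5 samples, a 6-word vocabulary and 3 topics (illustrative values only)
toy_data = [[1, 3], [0, 2, 2], [5], [4, 1], [3]]
toy_labels = [0, 2, 1, 2, 0]

x = multi_hot(toy_data, dimension=6)
y = one_hot(toy_labels, num_classes=3)

# Hold out the first 2 samples for validation and train on the rest
x_val_toy, x_train_toy = x[:2], x[2:]
y_val_toy, y_train_toy = y[:2], y[2:]

print(x_train_toy.shape, y_train_toy.shape)
print(x_val_toy.shape, y_val_toy.shape)
```

The same slicing pattern, applied with 1,000 instead of 2, yields the shapes checked by the assertions in this step.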

In [None]:
# Turn each newswire into a vector of 0s and 1s (multi-hot encoding over the vocabulary)
x_train = mlkatas.vectorize_sequences(train_data)
x_test = mlkatas.vectorize_sequences(test_data)

# YOUR CODE HERE

In [None]:
# Show a sample of encoded input
df_x_train = pd.DataFrame(x_train)
df_x_train.sample(n=10)

In [None]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_val: {x_val.shape}. y_val: {y_val.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

# Assert shapes of prepared data
assert x_train.shape == (7982, 10000)
assert y_train.shape == (7982, 46)
assert x_val.shape == (1000, 10000)
assert y_val.shape == (1000, 46)
assert x_test.shape == (2246, 10000)
assert y_test.shape == (2246, 46)

## Step 3: Training a model

### Question

Train a model on the data to obtain a training accuracy > 95%. Store the training history in a variable named `history`.
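
For reference, the accuracy targeted here is the fraction of samples whose most probable predicted class matches the true class. The NumPy sketch below illustrates that computation with one-hot labels; the probabilities and targets are made-up values, not model output.

```python
import numpy as np

# Illustrative predicted probabilities for 4 samples and 3 classes (made-up values)
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
])
# One-hot true labels for the same samples
targets = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
])

# A prediction counts as correct when the most probable class matches the true class
predicted = probabilities.argmax(axis=1)
expected = targets.argmax(axis=1)
accuracy = (predicted == expected).mean()
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
```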

In [None]:
# YOUR CODE HERE

In [None]:
# Show training history
mlkatas.plot_loss_acc(history)

In [None]:
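# Retrieve final training accuracy
# (the history key is 'acc' or 'accuracy', depending on the Keras version
# and on the metric name passed to compile())
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
train_acc = history.history[acc_key][-1]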

# Assert final accuracy
assert train_acc > 0.95

In [None]:
# Evaluate model performance on test data
test_loss, test_acc = model.evaluate(x_test, y_test)

print(f'Test accuracy: {test_acc * 100:.2f}%')

## Step 4: Tuning the model

### Question

If necessary, tune your model to obtain a validation accuracy > 82%.
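
One way to decide what to tune is to compare the training and validation curves stored in the training history. The sketch below works on a hypothetical dictionary (`toy_history`) shaped like `history.history`; its values are made up, and the `'acc'`/`'val_acc'` keys are assumed to match those used elsewhere in this notebook.

```python
# Hypothetical history dictionary, shaped like `history.history` (made-up values)
toy_history = {
    "acc": [0.55, 0.78, 0.90, 0.96, 0.98],
    "val_acc": [0.60, 0.76, 0.82, 0.81, 0.80],
}

val_accs = toy_history["val_acc"]

# Index of the epoch with the highest validation accuracy
best_epoch = max(range(len(val_accs)), key=val_accs.__getitem__)

# Difference between final training and validation accuracy
gap = toy_history["acc"][-1] - val_accs[-1]

print(f"Best validation accuracy {val_accs[best_epoch]:.2f} at epoch {best_epoch + 1}")
print(f"Final train/validation gap: {gap:.2f}")
```

A validation accuracy that peaks early while the gap keeps widening usually points to overfitting. In that case, training for fewer epochs or adding regularization (for instance with the `Dropout` layer or the `regularizers` module imported above) may help.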

In [None]:
# YOUR CODE HERE

In [None]:
# Show training history
mlkatas.plot_loss_acc(history)

In [None]:
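# Retrieve final validation accuracy
# (the history key is 'val_acc' or 'val_accuracy', depending on the Keras version
# and on the metric name passed to compile())
val_acc_key = 'val_acc' if 'val_acc' in history.history else 'val_accuracy'
val_acc = history.history[val_acc_key][-1]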

# Assert final accuracy
assert val_acc > 0.82

In [None]:
# Evaluate model performance on test data
test_loss, test_acc = model.evaluate(x_test, y_test)

print(f'Test accuracy: {test_acc * 100:.2f}%')