<a href="https://colab.research.google.com/github/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemyst_neural_network_foundations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Neural Networks Foundations
https://nlpdemystified.org<br>
https://github.com/nitinpunjabi/nlp-demystified

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy and download a statistical model for English.
<br><br>
**IMPORTANT**<br>
If you're running this for free in the cloud rather than using a paid tier or a local Jupyter server on your machine, then the notebook will *time out* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm

In this demo, we're once again going to use the **20 newsgroups** dataset. This is so we can see how a neural network approach compares against our previous model.<br>
https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset

In [None]:
# Download the *train* dataset without headers, footers, and quotes to make the problem more challenging.
from sklearn.datasets import fetch_20newsgroups
train_corpus = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Tokenization

We're going to use **Tensorflow/Keras** to build our model, but stick with spaCy for text preprocessing. While Keras does come with a basic tokenizer, it lacks spaCy's useful, specialist linguistic features.
<br><br>
To that end, we'll load the small English statistical model and create a tokenizer function as we did in the previous videos.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# We don't need named entity recognition nor parsing. Removing them will speed up processing.
unwanted_pipes = ['ner', 'parser']

def spacy_tokenizer(doc):
  # select_pipes is the spaCy 3 replacement for the deprecated disable_pipes.
  with nlp.select_pipes(disable=unwanted_pipes):
    return [t.lemma_.lower() for t in nlp(doc) if \
            len(t) > 2 and \
            not t.is_punct and \
            not t.is_space and \
            not t.is_stop and \
            t.is_alpha]

The function below takes some text, runs it through the spaCy tokenizer, then _joins_ the tokens back using a '|' character. The reason why we're doing this is further below.

In [None]:
def preprocess_text(text):
  tokens = spacy_tokenizer(text)
  return "|".join(tokens)

Preprocess each post in the training corpus. We'll end up with a collection of posts where each token is delimited with '|'.

In [None]:
%%time
preprocessed_train_corpus = [preprocess_text(post) for post in train_corpus.data]

In [None]:
preprocessed_train_corpus[0]

As before, we'll split the corpus into a training set and validation set.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split
train_data, val_data, train_labels, val_labels = train_test_split(preprocessed_train_corpus, train_corpus.target, train_size=0.85, random_state=1)

In [None]:
print(len(train_data), len(val_data))

At this point, we'll bring in **Keras**. Keras is a deep-learning framework built on top of Tensorflow, and makes it easy to compose models and iterate fast. Most of the time, Keras will provide everything you need but you can drop down to Tensorflow directly for more low-level customization.<br>
https://keras.io/<br>
https://www.tensorflow.org/
<br><br>
As an aside, the word _tensor_ in Tensorflow simply refers to a mathematical object. It's a generalization of scalars and vectors.  A scalar is a zero-rank tensor, a vector is a first-rank tensor, and so on.
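
To make the idea of rank a little more concrete, here's a minimal sketch that uses NumPy arrays as stand-ins for tensors. The variable names are made up for illustration; the `ndim` attribute reports the number of axes, which is what rank means here.

```python
import numpy as np

# NumPy arrays as stand-ins for tensors of increasing rank.
scalar = np.array(5.0)                 # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])     # rank 1: a list of numbers
matrix = np.array([[1, 2], [3, 4]])    # rank 2: a table of numbers
cube = np.zeros((2, 3, 4))             # rank 3: a stack of tables

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)
```

The bag-of-words matrices we build later in this notebook are rank-2 tensors: one row per post and one column per vocabulary entry.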

We're going to use Keras' basic tokenizer to split our posts into sequences of tokens with no further processing. By doing this, Keras will generate an internal vocabulary which we can use to encode the posts into vectors as we'll see.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
<br><br>
There are other ways to do this rather than a two-step tokenization process (e.g. use spaCy to encode our posts as integer sequences and pass that directly to Keras) but this is the most straightforward for our purpose.

Here, we're initializing a tokenizer to do nothing but split text on the '|' character.
<br><br>
We're also including an Out-of-Vocabulary token **('OOV')**. Recall that during testing or inference, it's possible for our model to encounter words it didn't see during training. When that happens, the new word is fed into the model as an **'OOV'** token.

In [None]:
from tensorflow import keras
tokenizer = keras.preprocessing.text.Tokenizer(filters="", lower=False, split='|', oov_token='OOV')

Calling _fit_on_texts_ generates an internal vocabulary.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts

In [None]:
tokenizer.fit_on_texts(train_data)
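
If you're curious what building a vocabulary with an OOV token roughly involves, here's a simplified pure-Python sketch. It is not the Keras implementation. `build_vocab`, `encode`, and the toy corpus are hypothetical names and data made up for illustration. Like the Keras tokenizer, it leaves index 0 unused and gives the OOV token index 1, with the remaining tokens ordered by frequency.

```python
from collections import Counter

def build_vocab(texts, split="|", oov_token="OOV"):
    """Map each token to an integer id, most frequent tokens first.

    Index 0 is left unused and index 1 is reserved for the OOV token.
    """
    counts = Counter(tok for text in texts for tok in text.split(split))
    vocab = {oov_token: 1}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab) + 1
    return vocab

def encode(text, vocab, split="|", oov_token="OOV"):
    """Replace each token with its id, falling back to the OOV id."""
    return [vocab.get(tok, vocab[oov_token]) for tok in text.split(split)]

toy_corpus = ["car|engine|car", "engine|oil"]
toy_vocab = build_vocab(toy_corpus)
print(toy_vocab)

# 'brake' never appeared in the toy corpus, so it maps to the OOV id.
print(encode("car|brake|oil", toy_vocab))
```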

We can look at the tokenizer's internals using the _get_config_ method.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#get_config
<br><br>
We can see information such as how many documents were processed to generate the vocabulary, the frequency of each token, and various indices. Note that _num_words_ does NOT mean the number of words in the vocabulary. It's a parameter we can pass to the tokenizer upon initialization to keep only the most frequent words (up to the limit given by _num_words_) and drop the rest. Here, we didn't set any limit.

In [None]:
tokenizer.get_config()

# Vectorization

The next step is to vectorize our text with a bag-of-words (BoW) approach.
<br><br>


---


**NOTE:**<br>
Now that we understand how neural networks work, there are **MUCH** better ways to vectorize text than bag-of-words for neural network models. But since we haven't learned them yet and this demo is just to get a feel of building models, we'll stick with BoW for now.


---



The Keras tokenizer's _texts_to_matrix_ method builds a BoW. It can create different BoW types including binary (default), TF-IDF, and others.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_matrix 

In [None]:
# Vectorize the first post using binary. We're using [:1] here because the 
# tokenizer expects an *array* of sequences.
print(train_data[:1])

# The resulting binary BoW has a 1 set for every word present in the sequence.
binary_bow = tokenizer.texts_to_matrix(train_data[:1])
binary_bow
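
To see the difference between the binary and count modes without Keras in the way, here's a small pure-Python sketch. The `bag_of_words` helper and the toy token ids are assumptions made up for illustration, and only two modes are covered.

```python
def bag_of_words(token_ids, vocab_size, mode="binary"):
    """Turn a list of integer token ids into a fixed-length vector."""
    row = [0] * vocab_size
    for idx in token_ids:
        if mode == "binary":
            row[idx] = 1       # only record presence
        elif mode == "count":
            row[idx] += 1      # record how often the token occurs
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return row

toy_ids = [2, 4, 2, 3]   # a made-up encoded post; id 2 appears twice
print(bag_of_words(toy_ids, vocab_size=6))
print(bag_of_words(toy_ids, vocab_size=6, mode="count"))
```

Either way, the word order of the post is thrown away. Only the vocabulary positions survive.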

In [None]:
# Get indices where the binary BoW is set to 1 indicating the associated word is
# present in the sequence.
import numpy as np
present_tokens = np.where(binary_bow[0] == 1)[0]
present_tokens

In [None]:
# Retrieve the words.
" ".join(tokenizer.index_word[n] for n in present_tokens)

In [None]:
# Vectorize the first post using TF-IDF and look at the scores for present tokens.
# https://numpy.org/doc/stable/user/basics.indexing.html
tfidf_bow = tokenizer.texts_to_matrix(train_data[:1], mode='tfidf')
print(tfidf_bow)
tfidf_bow[0][present_tokens]

For simplicity, we'll stick to binary BoW. Feel free to experiment with different modes to see if you can squeeze better performance.

In addition to vectorizing the text into binary BoWs, we're going to also store them in Tensorflow **sparse matrices**. Our vocabulary is quite large and for each post, very few indices in each vector will be set to 1. This means we'll have large matrices of mostly zeros which is expensive to store and can be problematic for environments such as this free tier of Colab.
<br><br>
Tensorflow **sparse tensors** store these types of data structures more efficiently, and Keras can work seamlessly with them.<br>
https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor
<br><br>
In the future, we'll learn different vectorization techniques to create smaller, _dense_ vectors that can pack in more information beyond just simply indicating whether a word is present.
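
For intuition on why sparse storage helps, here's a rough pure-Python sketch of the idea: keep only the coordinates and values of the nonzero entries plus the overall shape. `to_sparse` and the toy matrix are hypothetical and made up for illustration; a Tensorflow `SparseTensor` holds the same three pieces of information (indices, values, and dense shape) in a far more efficient form.

```python
def to_sparse(dense_rows):
    """Keep only the coordinates and values of the nonzero entries."""
    indices, values = [], []
    for r, row in enumerate(dense_rows):
        for c, value in enumerate(row):
            if value != 0:
                indices.append((r, c))
                values.append(value)
    shape = (len(dense_rows), len(dense_rows[0]))
    return indices, values, shape

toy_dense = [
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
]
indices, values, shape = to_sparse(toy_dense)
print(indices)
print(values, shape)
print("stored entries:", len(values), "of", shape[0] * shape[1])
```

With a vocabulary in the tens of thousands and posts that contain only a few dozen distinct words, the gap between stored entries and total entries becomes very large.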

In [None]:
# Vectorize the training dataset.
import tensorflow as tf
x_train = tf.sparse.from_dense(tokenizer.texts_to_matrix(train_data))

The shape of the tensor corresponds to the number of tokenized documents (rows) and vocabulary (columns).

In [None]:
x_train.shape

We also need to vectorize our labels. Since our goal is multiclass classification, we'll one-hot encode the labels. That is, each label vector will be an array of length 20 (corresponding to the 20 categories) with one index set to 1 to indicate the correct category. The rest will be zero.
<br><br>
Keras has a _to_categorical_ method to help with this.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical

In [None]:
y_train = keras.utils.to_categorical(train_labels)
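
As a rough illustration of what one-hot encoding produces, here's a NumPy sketch. The `one_hot` helper and the toy labels are made up for illustration and this isn't how Keras implements _to_categorical_, but the resulting layout is the same idea.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a single 1 at the label's index."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

toy_labels = [2, 0, 3]
toy_encoded = one_hot(toy_labels, num_classes=4)
print(toy_encoded)

# argmax recovers the original integer labels.
print(np.argmax(toy_encoded, axis=1))
```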

The shape of the tensor corresponds to the number of documents (rows) and categories (columns).

In [None]:
y_train.shape

Looking at the first entry in the vectorized labels, we can see its corresponding category.<br>
https://numpy.org/doc/stable/reference/generated/numpy.argmax.html

In [None]:
print(y_train[0])
print(train_corpus.target_names[np.argmax(y_train[0])])

We'll vectorize the validation data and labels as well.

In [None]:
x_val = tf.sparse.from_dense(tokenizer.texts_to_matrix(val_data))
y_val = keras.utils.to_categorical(val_labels)

# Building an Initial Model

We'll use the **layers** API from Keras to build our models. Through **layers**, we can describe a model layer by layer, including the number of units in each layer, which activation function to use, any regularization steps, and more.<br>
https://keras.io/api/layers/
<br><br>
The **Sequential** class groups a stack of layers and provides training/inference features:<br>
https://keras.io/api/models/sequential/<br>
https://keras.io/guides/sequential_model/
<br><br>
You can alternatively use the **Functional** API for more flexibility but we'll stick with **Sequential** for now.<br>
https://keras.io/guides/functional_api/

We'll build a simple model with two hidden layers. The output layer uses **softmax** since we're performing multiclass classification.

In [None]:
NUM_CATEGORIES = len(train_corpus.target_names)

# Import layers from the same Keras that ships with Tensorflow.
from tensorflow.keras import layers
model = keras.Sequential([
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(NUM_CATEGORIES, activation='softmax')
])
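
As a refresher on what the softmax in the output layer does, here's a minimal pure-Python sketch. The `softmax` helper and the toy scores are made up for illustration. Subtracting the largest score first is a common trick to avoid overflow and doesn't change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]
toy_probs = softmax(toy_logits)
print(toy_probs)
print(sum(toy_probs))
```

In our model, the output layer produces one such probability per newsgroup category.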

After specifying the layers, we'll compile the model and specify which optimizer, loss function, and performance metric we want to use.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#compile

In [None]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
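
To get a feel for the loss we just selected, here's a simplified per-example version of categorical cross-entropy in pure Python. It's a sketch under assumptions: the `eps` floor and the toy predictions are made up for illustration, and Keras additionally averages the loss over each batch.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: -sum(true * log(predicted))."""
    # eps guards against log(0) when a predicted probability is zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

one_hot_label = [0.0, 1.0, 0.0]
confident_right = [0.05, 0.90, 0.05]
confident_wrong = [0.90, 0.05, 0.05]

print(categorical_crossentropy(one_hot_label, confident_right))
print(categorical_crossentropy(one_hot_label, confident_wrong))
```

Because the label is one-hot, only the probability assigned to the correct category contributes, so a confident wrong answer is penalized much more heavily than a confident right one.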

Similar to our previous experience with Scikit-learn, we can train a Keras model by calling its _fit_ method and specifying a number of parameters. Here, we're also passing in our validation data on which the model will evaluate the loss after each epoch.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit

In [None]:
history = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_data=(x_val, y_val))


---

**Note:**<br>
Because we didn't set a fixed seed value and there is some randomness with neural networks, your model training output may look **different** though I don't expect the differences to be dramatic.

---
<br><br>
During training, our model outputted a history of loss and accuracy metrics for both the training set and validation set. We can see the training and validation metrics get better for a certain number of epochs before they start diverging. Performance on the training set keeps improving while performance on the validation set starts degrading at some point, signalling that the model is starting to overfit.
<br><br>
We can plot this information as well.

In [None]:
training_losses = history.history['loss']
validation_losses = history.history['val_loss']

training_accuracy = history.history['accuracy']
validation_accuracy = history.history['val_accuracy']

epochs = range(1, len(training_losses) + 1)

import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(2)
fig.set_figheight(15)
fig.set_figwidth(15)
fig.tight_layout(pad=5.0)

# Plot training vs. validation loss.
ax1.plot(epochs, training_losses, 'bo', label='Training Loss')
ax1.plot(epochs, validation_losses, 'b', label='Validation Loss')
ax1.title.set_text('Training vs. Validation Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()

# Plot training vs. validation accuracy.
ax2.plot(epochs, training_accuracy, 'bo', label='Training Accuracy')
ax2.plot(epochs, validation_accuracy, 'b', label='Validation Accuracy')
ax2.title.set_text('Training vs. Validation Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()

plt.show()

Our current model, the way it's trained, has overfit on the data.
<br><br>
Since we have an idea of when that overfitting begins, we can now train a _new_ model that stops training at or right before that point. In the following cell, we'll retrain an identical model but this time with the number of epochs equalling the point where the divergence began in our previous model.
<br><br>
Again, your results may look slightly different so modify accordingly.
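
If you'd rather not eyeball the plot, one simple heuristic is to pick the epoch with the lowest validation loss. Here's a minimal sketch. The `best_epoch` helper is hypothetical and the loss values are made-up numbers, not results from this notebook; in practice you'd pass in `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(val_losses)
    return val_losses.index(lowest) + 1

# Made-up numbers for illustration only.
toy_val_losses = [1.92, 1.31, 1.08, 1.02, 1.05, 1.13, 1.24]
print(best_epoch(toy_val_losses))
```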

In [None]:
model = keras.Sequential([
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(NUM_CATEGORIES, activation='softmax')
])

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=<DESIRED_EPOCHS_HERE>, batch_size=128, validation_data=(x_val, y_val))

We can look at a summary of our model using the _summary_ method.


In [None]:
model.summary()

Looking at the preceding summary, there's an outsized number of parameters in the input layer because the BoW encoding results in a wide vocabulary array. This isn't great.
<br><br>
If you're wondering where that _param_ number comes from, here's how it's calculated:



In [None]:
# Size of vocabulary. The '+ 1' is because the zero index is reserved for padding.
v = (len(tokenizer.word_index) + 1)
print('Size of BoW array(v): {}'.format(v))

n = 128
print('Number of units in the input layer(n): {}'.format(n))

print('')

# The '+ n' accounts for the number of biases. Each unit has one.
p = v * n + n
print('Number of params in the input layer(p) = v * n + n = {}'.format(p))
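
The same arithmetic extends to every Dense layer in the stack: each layer has (inputs × units) weights plus one bias per unit, and its outputs become the next layer's inputs. Here's a small sketch with a hypothetical vocabulary size of 1,000 to keep the numbers readable. `dense_param_counts` is a made-up helper, not a Keras function.

```python
def dense_param_counts(input_size, layer_units):
    """Parameters per Dense layer: inputs * units weights plus one bias per unit."""
    counts = []
    for units in layer_units:
        counts.append(input_size * units + units)
        input_size = units  # this layer's outputs feed the next layer
    return counts

# Hypothetical vocabulary size of 1,000 with our 128-128-20 architecture.
toy_counts = dense_param_counts(input_size=1000, layer_units=[128, 128, 20])
print(toy_counts)
print(sum(toy_counts))
```

Even in this toy setup, the first layer dominates the total, and the effect is far stronger with our real vocabulary.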

We can look at the weights in each layer as well using the _get_weights_ method. Here are the weights of the input layer. It's a two-element array where the first contains the non-bias weights and the second the bias weights.<br>
https://keras.io/api/layers/base_layer/#getweights-method

In [None]:
model.layers[0].get_weights()

And these are the weights connecting the first input feature (vocabulary index 0) to each unit in the first layer, without the biases. There's one weight per unit.

In [None]:
ws = model.layers[0].get_weights()[0][0]
print(len(ws))
ws

Let's try our model on the test set.

In [None]:
test_corpus = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [None]:
%%time
preprocessed_test_corpus = [preprocess_text(post) for post in test_corpus.data]

In [None]:
x_test = tf.sparse.from_dense(tokenizer.texts_to_matrix(preprocessed_test_corpus))
y_test = keras.utils.to_categorical(test_corpus.target)

Since we're evaluating the model on the test set, we'll use the _evaluate_ method.<br>
https://keras.io/api/models/model_training_apis/#evaluate-method

In [None]:
results = model.evaluate(x_test, y_test)

_evaluate_ returns a two-element list with the loss as the first entry and the metric of interest as the second entry.

In [None]:
results

Random guessing would result in an accuracy of ~5% (since there are 20 categories and the data is balanced), so the results are much better than that. But it's still not very satisfying.
<br><br>
Let's take a look at a confusion matrix and classification report. To generate those, we'll need the actual predictions from the model which we'll generate using the _predict_ method.
<br>
https://keras.io/api/models/model_training_apis/#predict-method

In [None]:
y_pred_probs = model.predict(x_test, verbose=1)

The output layer ends with a softmax, so each element of y_pred_probs is a probability distribution over the categories. We need to convert each element into a single category number in order to plot a confusion matrix.

In [None]:
# Look at the softmax output for the first item.
print(y_pred_probs[0])

y_pred = np.argmax(y_pred_probs, axis=1)

# Look at the most probable category for the first item.
print('Category with highest probability (for the first item): {}'.format(y_pred[0]))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

# Not normalizing this time. Just looking at raw numbers.
cm = confusion_matrix(test_corpus.target, y_pred)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=test_corpus.target_names)
fig, ax = plt.subplots(figsize=(15, 15))
cmd.plot(ax=ax, xticks_rotation='vertical')

A few observations:
- As before, there's a cluster of brighter squares around the technology-related subjects (pc.hardware, mac.hardware, electronics, etc.), and subjects such as atheism, christianity, guns, and politics are confused for each other, which drags the overall accuracy down.
- The more focused topics have brighter diagonal squares.



---

**NOTE:**<br>
Again, your results may differ.


---



In [None]:
print(classification_report(test_corpus.target, y_pred, target_names=test_corpus.target_names))

Let's take a look at some posts in categories with a high discrepancy between precision and recall.
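
As a reminder of what the two numbers measure, here's a minimal pure-Python sketch that computes precision and recall for one category from parallel lists of labels. The `precision_recall` helper and the toy labels are made up for illustration; the classification report computes the same quantities for every category.

```python
def precision_recall(y_true, y_pred, category):
    """Precision and recall for one category from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == category and p == category)
    predicted = sum(1 for p in y_pred if p == category)
    actual = sum(1 for t in y_true if t == category)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 1, 1, 2, 1]

for c in sorted(set(toy_true)):
    p, r = precision_recall(toy_true, toy_pred, c)
    print(c, round(p, 2), round(r, 2), "gap:", round(abs(p - r), 2))
```

In this toy data, category 1 has perfect recall but low precision because the model predicts it too often, which is the kind of discrepancy worth inspecting.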

In [None]:
# The category with a high discrepancy.
category_of_interest = test_corpus.target_names.index(<DESIRED_TARGET_NAME>)
category_of_interest

In [None]:
# Get the indices of predictions that match the category of interest.
category_pred = np.where(y_pred == category_of_interest)[0]
category_pred

In [None]:
# Get the indices of incorrect predictions.
incorrect_pred = np.nonzero(test_corpus.target != y_pred)[0]
incorrect_pred

In [None]:
# Get the indices where the model predicted the category of interest, but was wrong.
# np.isin replaces np.in1d, which is deprecated in newer NumPy releases.
incorrect_category_pred = category_pred[np.isin(category_pred, incorrect_pred)]
incorrect_category_pred

In [None]:
def incorrect_pred_posts(post_idx):
  print("Predicted category: {}".format(test_corpus.target_names[y_pred[post_idx]]))
  print("Actual category: {}".format(test_corpus.target_names[test_corpus.target[post_idx]]))
  print("Post: {}".format(preprocessed_test_corpus[post_idx]))


In [None]:
# Take a look at a few of the posts (up to 10, in case there are fewer).
for post_idx in incorrect_category_pred[:10]:
  incorrect_pred_posts(post_idx)
  print()

# Making Another Attempt

Before building another model, let's throw away any preprocessed posts with fewer than five words.

In [None]:
def filter_short_texts(text, min_len, split_char):
  tokens = text.split(split_char)
  return len(tokens) >= min_len

In [None]:
print('Before filtering short texts (train): {}'.format(len(preprocessed_train_corpus)))

# Filter training corpus.
z = zip(preprocessed_train_corpus, train_corpus.target)
f = filter(lambda t: filter_short_texts(t[0], 5, '|'), z)
preprocessed_train_corpus, train_corpus.target = zip(*f)

print('After filtering short texts (train): {}'.format(len(preprocessed_train_corpus)))

In [None]:
print('Before filtering short texts (test): {}'.format(len(preprocessed_test_corpus)))

# Do the same for the test corpus.
z = zip(preprocessed_test_corpus, test_corpus.target)
f = filter(lambda t: filter_short_texts(t[0], 5, '|'), z)
preprocessed_test_corpus, test_corpus.target = zip(*f)

print('After filtering short texts (test): {}'.format(len(preprocessed_test_corpus)))

In [None]:
# Resplit the training data into train/validation sets.
train_data, val_data, train_labels, val_labels = train_test_split(preprocessed_train_corpus, train_corpus.target, train_size=0.85, random_state=1)

In [None]:
# Re-vectorize the training, validation, and test data.
x_train = tf.sparse.from_dense(tokenizer.texts_to_matrix(train_data))
y_train = keras.utils.to_categorical(train_labels)

x_val = tf.sparse.from_dense(tokenizer.texts_to_matrix(val_data))
y_val = keras.utils.to_categorical(val_labels)

x_test = tf.sparse.from_dense(tokenizer.texts_to_matrix(preprocessed_test_corpus))
y_test = keras.utils.to_categorical(test_corpus.target)

For the next attempt, add another layer with some **dropout** regularization and use **He initialization**.<br>
https://keras.io/api/layers/regularization_layers/dropout/<br>
https://keras.io/api/layers/initializers/#layer-weight-initializers
<br><br>
We'll also leverage **early stopping** to halt training once our validation loss stops improving. This'll save us the trouble of manually training another model with fewer epochs as we did with the previous model. This is done through a **callback**. Here, the **patience** parameter specifies how many epochs to process with no improvement before training stops. Since we saw in the earlier graphs that validation loss diverges pretty sharply, we're setting it to 1. If you saw that validation loss tends to plateau for a bit before improving again, you could consider setting a higher value. There are other settings worth reading about as well.<br>
https://keras.io/api/callbacks/<br>
https://keras.io/api/callbacks/early_stopping/<br>
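
To build intuition for **patience**, here's a simplified pure-Python simulation of the stopping rule. It's a sketch, not the Keras implementation: `stopping_epoch` is a hypothetical helper, the loss values are made up, and options such as a minimum improvement threshold are ignored.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would halt.

    Returns None if training would run through every epoch.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up losses with a brief bump at epoch 4 before a new low at epoch 5.
toy_val_losses = [1.9, 1.3, 1.1, 1.15, 1.05, 1.2, 1.3, 1.4]
print(stopping_epoch(toy_val_losses, patience=1))
print(stopping_epoch(toy_val_losses, patience=2))
```

With these made-up numbers, a patience of 1 halts at the first bump and misses the later improvement, while a patience of 2 rides it out. That's the trade-off to keep in mind when choosing a value.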



In [None]:
es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)

initializer = tf.keras.initializers.HeNormal()

model_next = keras.Sequential([
  layers.Dense(128, activation='relu', kernel_initializer=initializer),
  layers.Dense(128, activation='relu', kernel_initializer=initializer),
  layers.Dense(128, activation='relu', kernel_initializer=initializer),
  layers.Dropout(0.3),
  layers.Dense(NUM_CATEGORIES, activation='softmax')
])

model_next.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model_next.fit(x_train, y_train, epochs=15, batch_size=128, validation_data=(x_val, y_val), callbacks=[es_callback])
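
For a rough idea of what the two new ingredients do, here's a NumPy sketch. It's a simplification under stated assumptions: `he_normal` and `dropout` are made-up helpers, the plain normal distribution stands in for the truncated normal that Keras' `HeNormal` draws from, and the dropout shown is the training-time behaviour only (at inference, dropout passes values through unchanged).

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    """Draw weights with standard deviation sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def dropout(activations, rate):
    """Zero a random fraction of activations and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    # Dividing by (1 - rate) keeps the expected activation unchanged.
    return activations * keep_mask / (1.0 - rate)

toy_weights = he_normal(fan_in=128, fan_out=128)
print("target std:", np.sqrt(2.0 / 128), "sample std:", toy_weights.std())

toy_activations = np.ones((4, 10))
print(dropout(toy_activations, rate=0.3))
```

Each surviving activation is scaled up by 1 / (1 - 0.3), so the layer's expected output stays roughly the same whether or not dropout is active.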

In [None]:
results = model_next.evaluate(x_test, y_test)

In [None]:
y_pred_probs = model_next.predict(x_test, verbose=1)
y_pred = np.argmax(y_pred_probs, axis=1)

# Not normalizing this time. Just looking at raw numbers.
cm = confusion_matrix(test_corpus.target, y_pred)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=test_corpus.target_names)
fig, ax = plt.subplots(figsize=(15, 15))
cmd.plot(ax=ax, xticks_rotation='vertical')

In [None]:
print(classification_report(test_corpus.target, y_pred, target_names=test_corpus.target_names))

In [None]:
target_names = test_corpus.target_names.copy()

def classify_post(post):
  # Reuse preprocess_text so inference goes through the same steps as training.
  vectorized_post = tokenizer.texts_to_matrix([preprocess_text(post)])
  probs = model_next.predict(vectorized_post)
  pred = np.argmax(probs, axis=1)[0]
  return target_names[pred], probs[0][pred]

In [None]:
# Post from r/medicine.
s = "New primary care attending here. Why are all my new patients age 60-80 yo on Ambien? Serious question, why? Was there a strong marketing push at this time frame? Was it given out like candy to anyone who said they had some trouble with sleep? Was there any discussion of risks and duration of therapy? Has anyone had success/tips for weaning them off of it?"
classify_post(s)

In [None]:
# Post from r/space.
s = "James Webb Space Telescope has successfully deployed its forward sunshield pallet! Next up: aft sunshield deployment"
classify_post(s)

In [None]:
# Post from r/cars.
s = "Cars made in the last 10 years with a 4 Speed Manual Transmission? As per the title really, I’m wondering if any vehicles have been made in the last 10 years that still utilise a 4 speed (or less) manual transmission. My Google research has thus far not turned up any results."
classify_post(s)

In [None]:
# Post from r/electronics.
s = "This project is powered by an ATTiny85. Five of its pins were used, three of them for the MAX7219 module controling the 7-segment display and one for the button and piezo buzzer respectively. The user can give input through the button. A normal short press to count one up and a long 6-second press to reset it to 0. I also added a simple switch. The microcontroller stores the value in its EEPROM so it doesn't lose it when powered off. I used a charger of an old phone as a power supply. My dad was really excited when he got it for Christmas and it should certainly help him quit smoking :)"
classify_post(s)

So at this point, we have a model that roughly matches the performance of the Naive Bayes classifier. You could experiment further with other techniques we've covered:
- Use a different tokenizer mode (e.g. count or TF-IDF).
- Filter out words based on frequency (e.g. bottom and top 20%).
- Train with much more data.
- Use another optimizer.
- Use more layers (deeper network) or more units in a layer (wider network).
- Tweak the regularization.
<br><br>

That being said, it's going to be difficult to squeeze much more performance because:
- Stripped of metadata, a lot of these posts are ambiguous, and even a human would have a hard time classifying them.
- Our BoW encoding is also subpar in that it's large but encodes no information beyond whether a word is present. Especially with the overlapping topics which drag the overall accuracy down, throwing away context makes the posts much harder to classify.
- Because our input vectors are extremely wide and sparse, we're forced to reduce them aggressively to a manageable number of units in the input layer. If we were to have an input layer with 20,000 units for example (roughly half of the vocabulary size), that layer alone would have over 800 million parameters which is absurd for a problem of this nature.
<br><br>
But this dataset was used because, beyond putting what we learned into practice, it's important to keep in mind that dirty/low-signal data and information loss during vectorization have a heavy influence on downstream work and performance.
<br><br>
In the rest of the course, we'll learn vectorization techniques which encode much more information in a much smaller space.

# Additional Reading
If you're curious about how to build a custom model using the low-level features of Tensorflow, here are a few links to work through:<br>
https://www.tensorflow.org/tutorials/customization/basics<br>
https://www.tensorflow.org/tutorials/customization/custom_layers<br>
https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough<br>
https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit

# Practice

Tensorflow comes with a number of dataset loaders, one of which is a collection of ~11,000 Reuters news articles in 46 categories.
<br><br>
Retrieve the data, vectorize both the articles and labels, and build a model to classify the articles.

In [None]:
from tensorflow.keras.datasets import reuters

In [None]:
# Call the load_data method to retrieve the train and test sets. Explore the load_data
# method to see what options there are (e.g. limiting the number of words).
# https://www.tensorflow.org/api_docs/python/tf/keras/datasets/reuters/load_data
#
# NOTE: The load_data method doesn't return arrays of strings, but rather
# arrays of integers. Each news article is encoded as a sequence of integers. There's 
# no need to tokenize. You can recreate the article using get_word_index.
# https://www.tensorflow.org/api_docs/python/tf/keras/datasets/reuters/get_word_index
#


In [None]:
# Vectorize x_train and x_test (i.e. the articles) as some bag of words matrices.
# Maybe you can use the Keras Tokenizer?
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer


In [None]:
# Vectorize y_train and y_test (i.e. the labels) as one-hot/categorical encodings.


In [None]:
# Create your model architecture here.
from tensorflow import keras 
from tensorflow.keras import layers

model = keras.Sequential([
  # Your layers here
])

In [None]:
# Compile your model here specifying an optimizer, loss function, and performance metric.


In [None]:
# Fit your model on your training set using early stopping. Optionally divide the training set
# into train/validation splits and pass the validation data to the fit method.


In [None]:
# If you're satisfied, evaluate the model on the test set and see what you get.
