It is highly recommended to use a powerful **GPU**. You can use one for free by uploading this notebook to [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).
<table align="center">
 <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ezponda/intro_deep_learning/blob/main/class/NLP/text_classification_rnn.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ezponda/intro_deep_learning/blob/main/class/NLP/text_classification_rnn.ipynb">
        <img src="https://i.ibb.co/xfJbPmL/github.png"  height="70px" style="padding-bottom:5px;"  />View Source on GitHub</a></td>
</table>

In [None]:
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt

## Loading the IMDB Dataset


We’ll work with the IMDB dataset: a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews. The size of the vocabulary (how many different words we keep) is set later, through the `max_tokens` argument of the preprocessing layer.


We are going to download the dataset using [TFDS](https://www.tensorflow.org/datasets). TFDS provides a collection of ready-to-use datasets for use with TensorFlow, Jax, and other Machine Learning frameworks.


In [None]:
#!pip install -q tensorflow_datasets

In [None]:
dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Initially this returns a dataset of `(text, label)` pairs:

In [None]:
for example, label in train_dataset.take(1):
    print('text: ', example.numpy())
    print('--'*50)
    print('label: ', label.numpy())

Next shuffle the data for training and create batches of these `(text, label)` pairs:

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 512

In [None]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
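
To make the role of `BUFFER_SIZE` and `BATCH_SIZE` more concrete, here is a minimal pure-Python sketch of buffer-based shuffling followed by batching. The helper names `buffered_shuffle` and `make_batches` are hypothetical; this only imitates the idea behind `shuffle(...).batch(...)` and is not how `tf.data` is implemented.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer first
            buffer.append(item)
        else:
            # Emit a random buffered element and replace it with the new one
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer


def make_batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch


shuffled = list(buffered_shuffle(range(10), buffer_size=4))
batches = list(make_batches(shuffled, batch_size=4))
print(batches)
```

A small buffer only mixes nearby elements, so a larger `BUFFER_SIZE` gives a shuffle closer to a full random permutation.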

In [None]:
for example, label in train_dataset.take(1):
    print('texts: ', example.numpy()[:3])
    print()
    print('labels: ', label.numpy()[:3])

## Preprocessing layer

The raw text loaded by `tfds` needs to be processed before it can be used in a model. The simplest way to process text for training is using the [`experimental.preprocessing.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer. It transforms strings into arrays of word indexes.

```python
tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=None, standardize=LOWER_AND_STRIP_PUNCTUATION,
    split=SPLIT_ON_WHITESPACE, ngrams=None, output_mode=INT,
    output_sequence_length=None, pad_to_max_tokens=True, vocabulary=None, **kwargs
)
```

- **output_sequence_length**: If set, the output will have its time dimension padded or truncated to exactly output_sequence_length values, resulting in a tensor of shape `[batch_size, output_sequence_length]`.
- **max_tokens**: The maximum size of the vocabulary for this layer
- **standardize**: Standardize each sample (usually lowercasing + punctuation stripping)
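
The three parameters above can be illustrated with a rough pure-Python sketch of what such a layer does: standardize, split on whitespace, build a frequency-sorted vocabulary, then encode and pad. The functions `standardize`, `build_vocabulary` and `vectorize` are hypothetical helpers written for this illustration; the real layer may differ in details such as the exact punctuation rules and how ties in frequency are ordered.

```python
import re
from collections import Counter


def standardize(text):
    # Lowercase and strip punctuation (roughly the default standardization)
    return re.sub(r"[^\w\s]", "", text.lower())


def build_vocabulary(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding and index 1 for unknown tokens,
    # so only max_tokens - 2 real words fit in the vocabulary
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocabulary, output_sequence_length=None):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    if output_sequence_length is not None:
        ids = ids[:output_sequence_length]                # truncate
        ids += [0] * (output_sequence_length - len(ids))  # pad with zeros
    return ids


corpus = ["The film is good.", "The film is bad!", "Good film"]
vocab_sketch = build_vocabulary(corpus, max_tokens=6)
encoded = vectorize("the film is great", vocab_sketch, output_sequence_length=6)
print(vocab_sketch)
print(encoded)
```

In this toy corpus "bad" does not fit into the six-token vocabulary, and "great" never appears at all, so both would be encoded as the unknown index 1.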


Create the layer, and pass the dataset's text to the layer's `.adapt` method:

In [None]:
vocab_size = 5000
max_sequence_length = None  # or a fixed length such as 100

preprocessing = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_sequence_length)
preprocessing.adapt(train_dataset.map(lambda text, label: text))

The `.adapt` method sets the layer's vocabulary. Here are the first 20 tokens. After the padding and unknown tokens they're sorted by frequency: 

In [None]:
vocab = np.array(preprocessing.get_vocabulary())
vocab[:20]

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed `output_sequence_length`):

In [None]:
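voc = preprocessing.get_vocabulary()
# Map each vocabulary token to its integer index
word_index = dict(zip(voc, range(len(voc))))

text = 'the film is good asfadf'
# Index 1 is the unknown token, used here for out-of-vocabulary words
[word_index.get(w, 1) for w in text.split()]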

As you can see, the layer produces the same encoding:

In [None]:
preprocessing([text])

Let's see some examples of the preprocessing step:

In [None]:
processed_example = preprocessing(example).numpy()
for n in range(2):
    print("Original: ", example[n].numpy())
    print()
    print("Preprocessed: ", processed_example[n])
    print()
    print("Round-trip: ", " ".join(vocab[processed_example[n]]))
    print()
    print()

### Embedding layer


```python
tf.keras.layers.Embedding(
    input_dim,
    output_dim,
    input_length=None,
    mask_zero=False,
)
```

- **input_dim**: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
- **output_dim**: Integer. Dimension of the dense embedding.
- **input_length**: Length of input sequences, when it is constant.
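
Conceptually, an embedding layer is a trainable lookup table with `input_dim` rows and `output_dim` columns. The NumPy sketch below shows only the lookup and the resulting shapes; the table is filled with random values and the word indices are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, output_dim = 1000, 64  # vocabulary size, embedding dimension
embedding_table = rng.normal(size=(input_dim, output_dim))

# A batch of 2 sequences with 10 word indices each (hypothetical values)
word_ids = rng.integers(low=0, high=input_dim, size=(2, 10))

# The lookup is plain integer indexing: one table row per word index
embedded = embedding_table[word_ids]
print(embedded.shape)  # (2, 10, 64)
```

Each integer index is replaced by a dense vector, so a batch of shape `[batch_size, sequence_length]` becomes `[batch_size, sequence_length, output_dim]`.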


This layer takes integer word indices as input, so it is typically placed at the start of a model, directly after any text preprocessing:

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=1000, output_dim=64, input_length=10))
...
```

## Create the model

The embedding layer [uses masking](https://www.tensorflow.org/guide/keras/masking_and_padding) to handle the varying sequence-lengths. Configure the embedding layer with `mask_zero=True`.
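
As a rough illustration of what masking achieves, the NumPy sketch below runs a toy recurrence over padded sequences with and without a mask. The function `masked_running_state` and its update rule are assumptions made for this example and are not the GRU equations; the point is only that masked (padded) steps leave the state untouched.

```python
import numpy as np


def masked_running_state(inputs, mask):
    """Toy recurrence: state = 0.5 * state + x, skipping masked-out steps."""
    batch_size, timesteps = inputs.shape
    state = np.zeros(batch_size)
    for t in range(timesteps):
        candidate = 0.5 * state + inputs[:, t]
        # Where the mask is False (padding), keep the previous state
        state = np.where(mask[:, t], candidate, state)
    return state


word_ids = np.array([[5, 8, 3, 0, 0],
                     [7, 2, 0, 0, 0]])
mask = word_ids != 0             # index 0 is the padding token
inputs = word_ids.astype(float)  # stand-in for the embedded values

with_mask = masked_running_state(inputs, mask)
without_mask = masked_running_state(inputs, np.ones_like(mask))
print(with_mask)     # state after the last real token of each sequence
print(without_mask)  # state after also processing the padding
```

Without the mask, the final state keeps changing while the padding is processed, so it depends on how much padding a sequence received.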


In [None]:
vocab_size = 4000
max_sequence_length = 120 
embedding_size = 64


preprocessing = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_sequence_length)
preprocessing.adapt(train_dataset.map(lambda text, label: text))


model = tf.keras.Sequential()
model.add(preprocessing)
model.add(tf.keras.layers.Embedding(
        input_dim=len(preprocessing.get_vocabulary()),
        output_dim=embedding_size,
        # Use masking to handle the variable sequence lengths
        mask_zero=True))
model.add(tf.keras.layers.GRU(32))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [None]:
print([layer.supports_masking for layer in model.layers])

Compile the Keras model to configure the training process:

In [None]:
model.compile(loss='BinaryCrossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset, 
                    validation_steps=5)
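
For reference, the binary cross-entropy loss and the accuracy metric passed to `compile` can be sketched in NumPy as follows. The functions and the example probabilities are illustrative assumptions; the Keras implementations differ in details such as numerical clipping.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


def binary_accuracy(y_true, y_prob, threshold=0.5):
    predicted_positive = y_prob > threshold
    return float(np.mean(predicted_positive == (y_true == 1)))


y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])  # hypothetical sigmoid outputs

print(binary_crossentropy(y_true, y_prob))
print(binary_accuracy(y_true, y_prob))
```

The loss penalizes confident wrong predictions more heavily than hesitant ones, whereas accuracy only looks at which side of the threshold each prediction falls on.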

In [None]:
results = model.evaluate(test_dataset)

print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

In [None]:
import pandas as pd
def show_loss_accuracy_evolution(history):
    
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.plot(hist['epoch'], hist['loss'], label='Train Loss')
    ax1.plot(hist['epoch'], hist['val_loss'], label='Val Loss')
    ax1.grid()
    ax1.legend()

    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.plot(hist['epoch'], hist['accuracy'], label='Train Accuracy')
    ax2.plot(hist['epoch'], hist['val_accuracy'], label = 'Val Accuracy')
    ax2.grid()
    ax2.legend()

    plt.show()

show_loss_accuracy_evolution(history)

Run a prediction on a new sentence:

If the prediction is >= 0.5, the review is classified as positive; otherwise it is classified as negative.
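
The thresholding rule can be written as a small helper. The name `to_sentiment` and the example probabilities are hypothetical; the input mimics the `[n_reviews, 1]` array that `model.predict` returns for a single sigmoid output.

```python
import numpy as np


def to_sentiment(probabilities, threshold=0.5):
    # Flatten the [n, 1] predictions and compare against the threshold
    probabilities = np.asarray(probabilities).flatten()
    return np.where(probabilities >= threshold, "positive", "negative")


example_predictions = np.array([[0.03], [0.97], [0.5]])
print(to_sentiment(example_predictions))
```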

In [None]:
reviews = ['the film was really bad and i am very disappointed',
           'The film was very funny entertaining and good we had a great time . brilliant film',
           'this film was just brilliant',
          'This movie has been a disaster',
           'the movie is not bad']
predictions = model.predict(np.array(reviews))

for review, pred in zip(reviews, predictions.flatten()):
    print()
    print(review)
    print('Sentiment: ', np.round(pred, 2))

### Question 1: Change `vocab_size`, `max_sequence_length` and the embedding dimension to compare the results

In [None]:
vocab_size = 1000  # Number of words
max_sequence_length = 50  # Max length of a sentence (alternatives: None, 100)
embedding_size = 64  # Embedding dimension

In [None]:
preprocessing = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_sequence_length)
preprocessing.adapt(train_dataset.map(lambda text, label: text))

model = tf.keras.Sequential()
model.add(preprocessing)
model.add(tf.keras.layers.Embedding(
        input_dim=len(preprocessing.get_vocabulary()),
        output_dim=embedding_size,
        # Use masking to handle the variable sequence lengths
        mask_zero=True))
model.add(tf.keras.layers.GRU(32))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [None]:
model.compile(loss='BinaryCrossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, 
                    validation_steps=5)
show_loss_accuracy_evolution(history)

results = model.evaluate(test_dataset)

print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

In [None]:
reviews = ['the film was really bad and i am very disappointed',
           'The film was very funny entertaining and good we had a great time . brilliant film',
           'this film was just brilliant',
          'This movie has been a disaster', 'the movie is not bad']
predictions = model.predict(np.array(reviews))

for review, pred in zip(reviews, predictions.flatten()):
    print()
    print(review)
    print('Sentiment: ', np.round(pred, 2))

### Question 2: Use a convolutional network instead of an RNN

```python
tf.keras.layers.Conv1D(
    filters, kernel_size
)
```

```python
tf.keras.layers.MaxPool1D(
    pool_size=2
)
```

```python
tf.keras.layers.Flatten()
```
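
When stacking these layers it helps to keep track of how the sequence length shrinks. The sketch below computes output lengths under the default settings (`'valid'` padding, convolution stride 1, pooling stride equal to `pool_size`). The helper functions and the filter and kernel sizes are hypothetical values chosen only to show the arithmetic, not a recommended architecture.

```python
def conv1d_output_length(input_length, kernel_size, strides=1):
    # 'valid' padding: the kernel must fit entirely inside the sequence
    return (input_length - kernel_size) // strides + 1


def maxpool1d_output_length(input_length, pool_size=2):
    # By default the pooling stride equals pool_size ('valid' padding)
    return (input_length - pool_size) // pool_size + 1


seq_len, filters, kernel_size = 600, 32, 5  # hypothetical values

after_conv = conv1d_output_length(seq_len, kernel_size)  # 596
after_pool = maxpool1d_output_length(after_conv, 2)      # 298
flattened = after_pool * filters                         # 9536

print(after_conv, after_pool, flattened)
```

`Flatten` turns the `[batch_size, length, filters]` tensor into `[batch_size, length * filters]`, which is why a fixed `output_sequence_length` is needed before a `Dense` layer.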

In [None]:
vocab_size = 5000  # Number of words
max_sequence_length = 600  # Max length of a sentence (alternatives: None, 100)
embedding_size = 300  # Embedding dimension

In [None]:
from tensorflow.keras import layers

preprocessing = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_sequence_length)
preprocessing.adapt(train_dataset.map(lambda text, label: text))

model = tf.keras.Sequential()
model.add(preprocessing)
model.add(tf.keras.layers.Embedding(
        input_dim=len(preprocessing.get_vocabulary()),
        output_dim=embedding_size,
        input_length=max_sequence_length))

model.add(layers.Conv1D(..., ..., activation=...))
model.add(layers.MaxPooling1D(...))
...
model.add(layers.Flatten())
...
model.add(layers.Dense(1, activation='sigmoid'))


In [None]:
model.compile(loss='BinaryCrossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset, 
                    validation_steps=5)
show_loss_accuracy_evolution(history)

results = model.evaluate(test_dataset)

print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

In [None]:
reviews = ['the film was really bad and i am very disappointed',
           'The film was very funny entertaining and good we had a great time . brilliant film',
           'this film was just brilliant',
          'This movie has been a disaster',
          'very good',
          'film very good',
           'the film is very good',
           'the film is not good',
           'the film is not very good',
           'the movie is not bad',
           'the movie is not very bad']
predictions = model.predict(np.array(reviews))

for review, pred in zip(reviews, predictions.flatten()):
    print()
    print(review)
    print('Sentiment: ', np.round(pred, 2))

## Generalization

We are now going to see how the trained model generalizes to a new dataset.

We will use the Large Yelp Review Dataset, a dataset for binary sentiment classification. It provides 560,000 highly polar Yelp reviews for training and 38,000 for testing. The reviews are extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset



In [None]:
dataset_yelp, info = tfds.load('yelp_polarity_reviews', with_info=True,
                          as_supervised=True)
train_dataset_yelp, test_dataset_yelp = dataset_yelp['train'], dataset_yelp['test']

train_dataset_yelp.element_spec

Initially this returns a dataset of `(text, label)` pairs:

In [None]:
for example, label in test_dataset_yelp.take(2):
    print('text: ', example.numpy())
    print('label: ', label.numpy())

Next shuffle the data for training and create batches of these `(text, label)` pairs:

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 512
train_dataset_yelp = train_dataset_yelp.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset_yelp = test_dataset_yelp.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [None]:
for example, label in test_dataset_yelp.take(1):
    print('text: ', example.numpy()[0])
    print('label: ', label.numpy()[0])

### Generalization of the IMDB model

In [None]:
results = model.evaluate(test_dataset_yelp)

print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))
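
One possible reason (a hypothesis, not something measured here) for a drop in accuracy on a new domain is that many of its words are missing from the vocabulary learned on the original data and are therefore all mapped to the unknown token. The sketch below measures that out-of-vocabulary rate on made-up toy sentences; `oov_rate`, the vocabulary and the reviews are all hypothetical.

```python
def oov_rate(texts, vocabulary):
    """Fraction of whitespace-separated tokens that are not in the vocabulary."""
    known = set(vocabulary)
    tokens = [tok for text in texts for tok in text.lower().split()]
    unknown = sum(tok not in known for tok in tokens)
    return unknown / len(tokens)


movie_vocab = {"the", "film", "was", "great", "plot", "acting", "bad"}
movie_reviews = ["the film was great", "the acting was bad"]
restaurant_reviews = ["the pizza was great", "the waiter was rude"]

print(oov_rate(movie_reviews, movie_vocab))
print(oov_rate(restaurant_reviews, movie_vocab))
```

The same measurement could be applied to the real datasets by passing the result of `preprocessing.get_vocabulary()` as the vocabulary.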

### Question 3: Create a model for the Yelp dataset and obtain `val_accuracy>0.92`

In [None]:
output_dim = 200
max_sequence_length = 100
vocab_size = 5000
preprocessing_yelp = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_sequence_length)
preprocessing_yelp.adapt(train_dataset_yelp.map(lambda text, label: text))
model_yelp = tf.keras.Sequential()


In [None]:
model_yelp.compile(loss='BinaryCrossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model_yelp.fit(train_dataset_yelp, epochs=2,
                    validation_data=test_dataset_yelp, 
                    validation_steps=10)
show_loss_accuracy_evolution(history)

results = model_yelp.evaluate(test_dataset_yelp)

print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

In [None]:
reviews = ['the film was really bad and i am very disappointed',
           'The film was very funny entertaining and good we had a great time . brilliant film',
           'this film was just brilliant',
          'This movie has been a disaster',
          'very good',
           'the film isn\'t very good',
           'the film is very good',
           'the movie is not bad',
           'the film is not good',
           'the movie is  bad']
predictions = model_yelp.predict(np.array(reviews))

for review, pred in zip(reviews, predictions.flatten()):
    print()
    print(review)
    print('Sentiment: ', np.round(pred, 2))

In [None]:
results = model_yelp.evaluate(test_dataset_yelp)

print('Yelp-model in Yelp dataset Test Loss: {}'.format(results[0]))
print('Yelp-model in Yelp dataset Test Accuracy: {}'.format(results[1]))
print('--'*50)

results = model.evaluate(test_dataset_yelp)

print('IMDB-model in Yelp dataset Test Loss: {}'.format(results[0]))
print('IMDB-model in Yelp dataset Test Accuracy: {}'.format(results[1]))

In [None]:
results = model_yelp.evaluate(test_dataset)

print('Yelp-model in IMDB dataset Test Loss: {}'.format(results[0]))
print('Yelp-model in IMDB dataset Test Accuracy: {}'.format(results[1]))
print('--'*50)

results = model.evaluate(test_dataset)

print('IMDB-model in IMDB dataset Test Loss: {}'.format(results[0]))
print('IMDB-model in IMDB dataset Test Accuracy: {}'.format(results[1]))

## Practice



In [None]:
from tensorflow.keras.datasets import reuters

((train_seqs, train_labels), (test_seqs, test_labels)) = reuters.load_data(
    path='reuters.npz', test_split=0.15,  index_from=0
)
word_index = tf.keras.datasets.reuters.get_word_index()
index_word = {wid: w for w, wid in word_index.items()}


def seq2sentence(seq, index_word):
    return ' '.join([index_word[wid] for wid in seq])


train_sentences = np.array([seq2sentence(seq, index_word) for seq in train_seqs])
test_sentences = np.array([seq2sentence(seq, index_word) for seq in test_seqs])

labels = np.array(['cocoa', 'grain', 'veg-oil', 'earn', 'acq', 'wheat', 'copper', 'housing', 'money-supply',
          'coffee', 'sugar', 'trade', 'reserves', 'ship', 'cotton', 'carcass', 'crude', 'nat-gas',
          'cpi', 'money-fx', 'interest', 'gnp', 'meal-feed', 'alum', 'oilseed', 'gold', 'tin',
          'strategic-metal', 'livestock', 'retail', 'ipi', 'iron-steel', 'rubber', 'heat', 'jobs',
          'lei', 'bop', 'zinc', 'orange', 'pet-chem', 'dlr', 'gas', 'silver', 'wpi', 'hog', 'lead'])

num_classes = 46
# Print both examples (a cell only displays its last bare expression)
print(train_sentences[0])
print(test_sentences[0])

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels))

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [None]:
for example, label in train_dataset.take(1):
    print('texts: ', example.numpy()[:3])
    print()
    print('labels: ', label.numpy()[:3], labels[label.numpy()[:3]])

## Extra
### Transfer learning: pre-trained text embeddings

For this example we will use a **pre-trained text embedding model** from [TensorFlow Hub](https://tfhub.dev) called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2).

[TensorFlow Hub](https://tfhub.dev/) has hundreds of trained, ready-to-deploy machine learning models.  You can find more [text embedding models](https://tfhub.dev/s?module-type=text-embedding) on TFHub.

One way to represent the text is to convert sentences into embeddings vectors. We can use a pre-trained text embedding as the first layer, which will have three advantages:

*   we don't have to worry about text preprocessing,
*   we can benefit from transfer learning,
*   the embedding has a fixed size, so it's simpler to process.
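
The third point, a fixed-size embedding, can be illustrated with a toy sentence embedder that averages per-word vectors. This is a simplified stand-in with random vectors and a made-up vocabulary; the actual TensorFlow Hub module has its own vocabulary, trained weights and way of combining word vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 50
toy_vocab = ["the", "film", "was", "ok", "really", "good"]
# Random vectors stand in for trained word embeddings
word_vectors = {w: rng.normal(size=embedding_dim) for w in toy_vocab}


def embed_sentence(sentence):
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(embedding_dim)
    # Averaging gives the same output size for any sentence length
    return np.mean([word_vectors[t] for t in tokens], axis=0)


short = embed_sentence("The film was ok")
long = embed_sentence("The film was really really good")
print(short.shape, long.shape)  # (50,) (50,)
```

Because every sentence maps to a vector of the same size, the layers that follow can be plain `Dense` layers without any padding or masking.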

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences.

In [None]:
!pip install tensorflow-hub

In [None]:
import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)


In [None]:
hub_layer(['The film was ok'])

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(32, activation='relu'))
# A single output unit needs a sigmoid: softmax over one unit always outputs 1
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()

In [None]:
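# This binary model expects the IMDB data. If the Practice cells were run,
# `train_dataset` and `test_dataset` hold Reuters and must be reloaded first.
model.compile(loss='BinaryCrossentropy',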
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=1,
                    validation_data=test_dataset, 
                    validation_steps=5)
show_loss_accuracy_evolution(history)

results = model.evaluate(test_dataset)

print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))