Natural Language Processing (NLP) deals with getting computers to process human language, as in spellcheck or autocomplete. To do so, we will use something called a Recurrent Neural Network (RNN) - **a deep learning model that is trained to process a sequential data input and convert it into a specific sequential data output**.

Up until this point we have been using something called **feed-forward** neural networks. This simply means that all our data is fed forwards (all at once) from left to right through the network. This was fine for the problems we considered before but won't work very well for processing text. After all, even we (humans) don't process text all at once. We read word by word from left to right and keep track of the current meaning of the sentence so we can understand the meaning of the next word. Well, this is exactly what a recurrent neural network is designed to do. When we say recurrent neural network, all we really mean is a network that contains a loop. An RNN will process one word at a time while maintaining an internal memory of what it has already seen. This allows it to treat words differently based on their order in a sentence and to slowly build an understanding of the entire input, one word at a time.

This is why we are treating our text data as a sequence! So that we can pass one word at a time to the RNN.
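To make the idea of "a loop with memory" concrete, here is a minimal sketch of a recurrent step in plain Python. It is a toy under stated assumptions, not what Keras does internally: the scalar weights `w_x`, `w_h` and the bias `b` are made-up values, and each "word" is just a number.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence one element at a time, carrying a hidden state."""
    h = 0.0  # internal memory, empty before anything has been read
    states = []
    for x in inputs:
        # The new memory depends on the current input AND the previous memory
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Same "words", different order (weights are hypothetical, not learned)
forward = run_rnn([1.0, 2.0, 3.0])
backward = run_rnn([3.0, 2.0, 1.0])
print(forward[-1], backward[-1])
```

The two final states differ even though both sequences contain the same values, which is the sense in which the network is sensitive to word order.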

Let's have a look at what a recurrent layer might look like.

![alt text](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)
*Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/*

As an example, we will now do a sentiment analysis of movie reviews to see whether each one is positive or negative.

Can you believe it, Keras has a movie reviews database! Let's use it! It turns out all the words are already encoded as numbers - yay for us!

This dataset contains 25,000 reviews from IMDB where each one is already preprocessed and has a label as either positive or negative. Each review is encoded as a list of integers that represent how common each word is in the entire dataset. For example, in the word index a word mapped to the integer 3 is the 3rd most common word in the dataset.
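As a rough illustration of encoding words by how common they are, the sketch below builds a frequency-ranked index from three made-up sentences. The helper `build_word_index` and the toy corpus are hypothetical, and this is not necessarily how the IMDB dataset itself was prepared.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank (1 = most common)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() lists words from the highest count to the lowest
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Toy corpus (made up for illustration)
toy_reviews = [
    "the movie was great",
    "the movie was bad",
    "the movie had the best ending",
]
toy_index = build_word_index(toy_reviews)
toy_encoded = [[toy_index[w] for w in r.lower().split()] for r in toy_reviews]
print(toy_encoded[0])
```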

In [4]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

#loading in our data

VOCAB_SIZE = 88584

MAXLEN = 250 #max length of a review
BATCH_SIZE = 64

#the data will be the review arrays encoded by integers, while the labels will be a 0 or 1, depending on whether it is negative or positive, respectively
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)


Let's look at an example: the third review in the dataset (index 2).

In [5]:
train_data[2]


[1,
 14,
 47,
 8,
 30,
 31,
 7,
 4,
 249,
 108,
 7,
 4,
 5974,
 54,
 61,
 369,
 13,
 71,
 149,
 14,
 22,
 112,
 4,
 2401,
 311,
 12,
 16,
 3711,
 33,
 75,
 43,
 1829,
 296,
 4,
 86,
 320,
 35,
 534,
 19,
 263,
 4821,
 1301,
 4,
 1873,
 33,
 89,
 78,
 12,
 66,
 16,
 4,
 360,
 7,
 4,
 58,
 316,
 334,
 11,
 4,
 1716,
 43,
 645,
 662,
 8,
 257,
 85,
 1200,
 42,
 1228,
 2578,
 83,
 68,
 3912,
 15,
 36,
 165,
 1539,
 278,
 36,
 69,
 44076,
 780,
 8,
 106,
 14,
 6905,
 1338,
 18,
 6,
 22,
 12,
 215,
 28,
 610,
 40,
 6,
 87,
 326,
 23,
 2300,
 21,
 23,
 22,
 12,
 272,
 40,
 57,
 31,
 11,
 4,
 22,
 47,
 6,
 2307,
 51,
 9,
 170,
 23,
 595,
 116,
 595,
 1352,
 13,
 191,
 79,
 638,
 89,
 51428,
 14,
 9,
 8,
 106,
 607,
 624,
 35,
 534,
 6,
 227,
 7,
 129,
 113]

The 25,000 reviews all have different lengths. This is a problem because a neural network expects every input to have the same length. We will therefore force every review to be exactly 250 words long:

- if a review is longer than 250 words, trim off the extra words
- if a review is shorter than 250 words, add as many 0s as needed to bring it up to 250

Luckily for us, Keras has a function that does this (adding the 0s is called padding):

In [6]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
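To show what trimming and padding do to a single review, here is a small plain-Python sketch. It assumes the default behaviour of `pad_sequences` as I understand it (zeros are added at the front, and over-long sequences lose items from the front); `pad_or_trim` is a hypothetical helper, not a Keras function.

```python
def pad_or_trim(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                      # drop items from the front
    return [value] * (maxlen - len(seq)) + seq    # add zeros at the front

print(pad_or_trim([5, 6, 7], 5))              # [0, 0, 5, 6, 7]
print(pad_or_trim([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```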

As an example, let's look at the first review after padding. The 0s at the start are the padding:

In [12]:
train_data[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     1,    14,    22,    16,
          43,   530,   973,  1622,  1385,    65,   458,  4468,    66,
        3941,     4,   173,    36,   256,     5,    25,   100,    43,
         838,   112,    50,   670, 22665,     9,    35,   480,   284,
           5,   150,     4,   172,   112,   167, 21631,   336,   385,
          39,     4,   172,  4536,  1111,    17,   546,    38,    13,
         447,     4,   192,    50,    16,     6,   147,  2025,    19,
          14,    22,     4,  1920,  4613,   469,     4,    22,    71,
          87,    12,    16,    43,   530,    38,    76,    15,    13,
        1247,     4,    22,    17,   515,    17,    12,    16,   626,
          18, 19193,     5,    62,   386,    12,     8,   316,     8,
         106,     5,

Now let us create our model for the data. Our first layer will be a word embedding layer, followed by an LSTM layer.

In [None]:
model = tf.keras.Sequential([
    # The Embedding layer converts each word index into a 32-dimensional vector,
    # allowing the model to learn useful representations of words during training.
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),

    # LSTM with 32 units: this 32 is the size of the LSTM's internal state.
    # The input size (the 32-dimensional embeddings) is inferred automatically.
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")  # this makes the final prediction
])
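One way to picture the embedding layer is as a lookup table with one row of numbers per word index. The sketch below is only an illustration in plain Python: the table sizes `VOCAB` and `DIM` are made up, and the values are random rather than learned.

```python
import random

random.seed(0)
VOCAB = 6   # toy vocabulary size (hypothetical)
DIM = 4     # toy embedding size; the model in this notebook uses 32

# One row of DIM numbers per word index; training would adjust these values
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(DIM)]
                   for _ in range(VOCAB)]

def embed(ids):
    """Replace each integer with its row from the table."""
    return [embedding_table[i] for i in ids]

vectors = embed([1, 4, 4, 2])
print(len(vectors), len(vectors[0]))
```

The same word index always maps to the same vector, so the two occurrences of index 4 produce identical rows.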

We are trying to predict the sentiment of the review. The model outputs a number between 0 and 1, and if that number is greater than 0.5 we classify the review as positive. The activation function __sigmoid__ is perfect for this since it squishes all values into the range between 0 and 1, so the output can be read as how confident the model is that the review is positive.
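A short sketch of the squishing and the 0.5 cut-off, written in plain Python rather than Keras. The input values are arbitrary and the helper `label` is hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def label(p, threshold=0.5):
    """Turn a sigmoid output into a sentiment label."""
    return "positive" if p > threshold else "negative"

for z in (-3.0, 0.0, 2.0):
    p = sigmoid(z)
    print(z, round(p, 3), label(p))
```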

In [13]:
model.summary()

Now let us train our model! model.compile sets up the loss function, optimizer, and metrics that we would like to track. Binary crossentropy measures how far our predictions are from the correct 0 or 1 labels. We use 'rmsprop' as the optimizer here, although 'adam' would also work. The 0.2 in validation_split means that 20% of the training data is set aside for validation.
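To see what "how far away" means in numbers, here is a hand-rolled sketch of binary crossentropy in plain Python. The labels and predictions are made up, and the clipping constant `eps` is my own choice, so treat this as an illustration of the formula rather than a copy of the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1 to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1]
mostly_right = binary_crossentropy(labels, [0.9, 0.1, 0.8])
mostly_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.2])
print(mostly_right, mostly_wrong)
```

Predictions close to the true labels give a small loss, while confident wrong predictions give a large one.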

In [9]:
model.compile(loss="binary_crossentropy",optimizer="rmsprop",metrics=['acc'])

history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 22s 34ms/step - acc: 0.6549 - loss: 0.5908 - val_acc: 0.8438 - val_loss: 0.3931
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 34ms/step - acc: 0.8809 - loss: 0.3022 - val_acc: 0.8836 - val_loss: 0.2999
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 22s 34ms/step - acc: 0.9179 - loss: 0.2235 - val_acc: 0.8600 - val_loss: 0.3172
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 22s 34ms/step - acc: 0.9348 - loss: 0.1767 - val_acc: 0.8884 - val_loss: 0.3213
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 22s 34ms/step - acc: 0.9491 - loss: 0.1491 - val_acc: 0.8736 - val_loss: 0.3046
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 34ms/step - acc: 0.9520 - loss: 0.1367 - val_acc: 0.8838 - val_loss: 0.2969
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 22s

Now let's use the untouched testing dataset, with another 25,000 reviews that our model has never seen before.

In [10]:
results = model.evaluate(test_data, test_labels)
print(results)

782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 8ms/step - acc: 0.8141 - loss: 0.5924
[0.6005488038063049, 0.8129600286483765]


How about we try to make some predictions?

In [None]:
word_index = imdb.get_word_index()


In [None]:
import tensorflow as tf

# Create the vectorization layer
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=MAXLEN
)
# Adapt the vectorizer on sample text so it can build a vocabulary.
# Note: adapting on this one sentence gives indices specific to that sentence;
# they do not match the IMDB word index the model was trained on.
vectorizer.adapt(["that movie was just amazing, so amazing"])

def encode_text(text):
    # The vectorizer will tokenize, index, and pad automatically
    return vectorizer([text])[0].numpy()

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded) 

[4 6 3 7 2 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
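The integers printed above come from a vocabulary built on a single sentence, and the zeros sit at the end, whereas the training reviews were padded at the front. For the model to make sense of new text, the encoding has to follow the same convention as the training data. Below is a plain-Python sketch of such an encoder. It assumes, based on my understanding of the `imdb.load_data` defaults, that 1 marks the start of a review, 2 stands for unknown words, and every rank from the word index is shifted up by 3. `toy_word_index` is a made-up stand-in for `imdb.get_word_index()` with invented ranks, and `encode_with_index` is a hypothetical helper.

```python
import re

INDEX_FROM = 3   # offset added to word ranks (assumed load_data default)
START_CHAR = 1   # marks the beginning of a review (assumed default)
OOV_CHAR = 2     # stands in for words missing from the index (assumed default)

def encode_with_index(text, word_index, maxlen):
    """Encode text IMDB-style: start marker, shifted ranks, front padding."""
    # Lowercase and keep runs of letters/apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    ids = [START_CHAR] + [
        word_index[w] + INDEX_FROM if w in word_index else OOV_CHAR
        for w in words
    ]
    ids = ids[-maxlen:]                      # trim from the front if too long
    return [0] * (maxlen - len(ids)) + ids   # pad with zeros at the front

toy_word_index = {"that": 1, "movie": 2, "was": 3, "amazing": 4}  # hypothetical
print(encode_with_index("That movie was just amazing!", toy_word_index, 8))
# [0, 0, 1, 4, 5, 6, 2, 7]
```

With the real word index and `MAXLEN` in place of the toy values, the result would have the same shape and layout as one row of `train_data`.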


Let's make a decode function while we are at it - going from integers back to words.

In [2]:
word_index = imdb.get_word_index()  # defined here so this cell runs on its own
reverse_word_index = {value: key for (key, value) in word_index.items()}
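A reversed index is only half of a decode function, so here is a plain-Python sketch of the remaining step. It assumes, as I understand the `imdb.load_data` defaults, that 0 is padding, 1 is the start marker, 2 is the unknown-word marker, and real words are stored as rank + 3. `decode_integers` and the toy index are hypothetical and only stand in for the real `reverse_word_index`.

```python
def decode_integers(ids, reverse_index, index_from=3):
    """Turn a padded, IMDB-style integer sequence back into text."""
    words = []
    for i in ids:
        if i == 0 or i == 1:   # skip padding and the start marker
            continue
        # Undo the offset; anything not in the index becomes "?"
        words.append(reverse_index.get(i - index_from, "?"))
    return " ".join(words)

toy_word_index = {"that": 1, "movie": 2, "was": 3, "amazing": 4}  # hypothetical
toy_reverse = {rank: word for word, rank in toy_word_index.items()}
print(decode_integers([0, 0, 1, 4, 5, 6, 2, 7], toy_reverse))
# that movie was ? amazing
```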