# Classifying Movie Reviews with Neural Networks (Text Vectorization)

In [1]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import losses
import matplotlib.pyplot as plt

import os
import re
import string

* We will use a dataset containing the text of 50,000 reviews from the [Internet Movie Database (IMDb)](https://www.imdb.com/).

* These are split into 25,000 reviews for training and 25,000 reviews for testing.

* Our goal is to build a model that reads a review and decides whether it is positive or negative.

* For more information see Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics. Available at <https://aclanthology.org/P11-1015>.

In [2]:
dataset_dir = 'aclImdb'

* The data is contained in two subdirectories, `train` and `test`.

In [3]:
os.listdir(dataset_dir)

['test', 'README', 'train']

* Let's see what is in the `train` directory.

In [4]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['neg', 'pos']

* The `pos` directory contains positive reviews, each in a separate file.

* Likewise, the `neg` directory contains negative reviews, each in a separate file.

* Let's look at one review.

In [5]:
sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_file) as f:
    print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


* To get the data into TensorFlow we will use the `text_dataset_from_directory()` function.

* This assumes that our data is placed in directories, one for each distinct class, i.e.:

 ```
 main_directory/
 ...class_a/
 ......a_text_1.txt
 ......a_text_2.txt
 ...class_b/
 ......b_text_1.txt
 ......b_text_2.txt
 ```

* This is exactly what we already have.

* We will take the training subset.

* We will keep 20% of it for validation (in addition to the separate test data).

In [6]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.



* Let's look at some reviews and the class they belong to.

In [7]:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print("Review", text_batch.numpy()[i])
        print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

* To see what the two classes `0` and `1` correspond to we can use the `class_names` property.

In [8]:
print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos
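
* The mapping between labels and class names follows from the directory layout. Below is a minimal sketch of the idea using only the standard library, assuming class names are the subdirectory names taken in sorted order. It is not the Keras implementation; the helper `list_labeled_files()` and the toy files are hypothetical.

```python
import os
import tempfile


def list_labeled_files(main_directory):
    """Return (class_names, [(file_path, label), ...]) for a class-per-directory layout."""
    class_names = sorted(
        name for name in os.listdir(main_directory)
        if os.path.isdir(os.path.join(main_directory, name))
    )
    samples = []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(main_directory, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, file_name), label))
    return class_names, samples


# Build a tiny throwaway directory tree that mimics aclImdb/train.
with tempfile.TemporaryDirectory() as root:
    toy_reviews = {"neg": ["bad"], "pos": ["good", "great"]}
    for class_name, texts in toy_reviews.items():
        os.makedirs(os.path.join(root, class_name))
        for i, text in enumerate(texts):
            with open(os.path.join(root, class_name, f"{i}.txt"), "w") as f:
                f.write(text)
    class_names, samples = list_labeled_files(root)
    labels = [label for _, label in samples]

print(class_names, labels)
```

* Because `neg` sorts before `pos`, negative reviews get label `0` and positive reviews get label `1` in this sketch.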


* Having taken the training subset, we will now take the validation subset.

* Note that we use the same seed to ensure that there will be no overlap between the training and validation data.

In [9]:
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='validation', 
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
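
* To illustrate why the seed matters, here is a simplified sketch of a seeded split. It assumes that the files are shuffled once with the given seed and that the last 20% are kept for validation; the function `take_subset()` is hypothetical and only mimics the idea, not the actual Keras code.

```python
import random


def take_subset(n_files, validation_split, subset, seed):
    """Shuffle file indices with a fixed seed and return one side of the split."""
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_files)
    if subset == "training":
        return indices[:-n_val]
    return indices[-n_val:]


train_idx = take_subset(25000, 0.2, "training", seed=42)
val_idx = take_subset(25000, 0.2, "validation", seed=42)
print(len(train_idx), len(val_idx))
print("overlap with same seed:", len(set(train_idx) & set(val_idx)))

# With a different seed the shuffle differs, so the two subsets can overlap.
other_val_idx = take_subset(25000, 0.2, "validation", seed=7)
print("overlap with different seeds:", len(set(train_idx) & set(other_val_idx)))
```

* With the same seed both calls see the same shuffled order, so the two subsets are complementary.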


* Finally, we also load the test data.

In [10]:
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=batch_size)

Found 25000 files belonging to 2 classes.


* As we saw above, the reviews contain HTML line-break tags (`<br />`) in addition to text.

* We will remove these.

* We will also lowercase all characters and remove punctuation.

In [11]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation),
                                    '')
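
* To see the effect of these three steps on a concrete string, here is a plain-Python sketch that mirrors the function above with the `re` module instead of `tf.strings`. The name `standardize_py()` and the sample text are our own, for illustration only.

```python
import re
import string


def standardize_py(text):
    """Lowercase, replace '<br />' with a space, and strip punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)


sample = "Great movie!<br /><br />Loved it."
cleaned = standardize_py(sample)
print(repr(cleaned))
```

* Note that each `<br />` becomes a space, so two consecutive tags leave two spaces; this is harmless because the text is later split on whitespace.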

* Next we will make a `TextVectorization` layer.

* This layer will do the following preprocessing:

 * It will call `custom_standardization()` on each review.

 * It will split each string corresponding to a review into individual tokens, using whitespace characters as separators.

 * It will assign an integer to each of the most frequently occurring tokens, creating a vocabulary of size 10,000.

 * It will ensure that each resulting array of integers (representing each review) will have the same length.

In [12]:
max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

* We call the `adapt()` method of `vectorize_layer` on the training set to construct the mapping (vocabulary) between tokens and integers.

In [13]:
# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
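
* The preprocessing steps can be sketched in plain Python to make them concrete. This is a toy illustration, not the `TextVectorization` implementation: the functions `build_vocabulary()` and `vectorize()` are hypothetical, and breaking frequency ties alphabetically is our assumption. Index `0` is reserved for padding and index `1` for unknown tokens.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, after the padding and unknown entries."""
    counts = Counter(token for text in texts for token in text.split())
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ["", "[UNK]"] + ranked[:max_tokens - 2]


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integers, then truncate or zero-pad to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["the movie was great", "the movie was bad", "the plot was thin"]
toy_vocab = build_vocabulary(toy_texts, max_tokens=6)
toy_ids = vectorize("the movie was awful", toy_vocab, sequence_length=6)
print(toy_vocab)
print(toy_ids)
```

* In this sketch "awful" is not in the toy vocabulary, so it is mapped to `1`, and the two trailing zeros are padding.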

* To better see what `vectorize_layer` does, we'll write a helper function so we can call it on our data.

In [14]:
def vectorize_text(text, label):
    # text is a tensor with shape (), we need to make it with shape (1,)
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

In [15]:
# retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have finally found that special someone.', shape=(), dtype=string)
Label neg
Vectorized review (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[  86,   17,  260,    2,  222,    1,  571,   31,  229,   11, 2418,
           1,   51,   22,   25,  404,  251,   12,  306,  282,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
       

* As we see, each token is represented by an integer.

* The number 0 is used for padding, so that each review has the same length.

* The number 1 is used for unknown words, i.e. words that are not among the most frequent ones held in our 10,000-entry vocabulary.

In [16]:
vocabulary = vectorize_layer.get_vocabulary()
print("0 ---> ", vocabulary[0])
print("1 ---> ", vocabulary[1])
print("1287 ---> ", vocabulary[1287])
print("9999 ---> ", vocabulary[9999])
print("Vocabulary size: ", vectorize_layer.vocabulary_size())

0 --->  
1 --->  [UNK]
1287 --->  silent
9999 --->  rushes
Vocabulary size:  10000


* To improve speed, we will use `cache()` and `prefetch()` methods.

* The `cache()` method keeps the data in memory after it is read for the first time, so it does not have to be read from disk again in later epochs.

* The `prefetch()` method prepares the next batch while the network is still processing the current one, so no time is wasted waiting for data (loading runs in parallel with processing).

In [17]:
AUTOTUNE = tf.data.AUTOTUNE

raw_train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
raw_val_ds = raw_val_ds.cache().prefetch(buffer_size=AUTOTUNE)
raw_test_ds = raw_test_ds.cache().prefetch(buffer_size=AUTOTUNE)
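
* The idea behind caching can be sketched in a few lines of plain Python. This toy class is not how `tf.data` is implemented and it does not cover prefetching; `CachedSource` and `read_reviews()` are hypothetical names used only for illustration.

```python
class CachedSource:
    """Read items from a slow source once, then serve them from memory."""

    def __init__(self, read_from_disk):
        self._read_from_disk = read_from_disk
        self._items = None

    def __iter__(self):
        if self._items is None:
            # First pass: perform the expensive read and remember the result.
            self._items = list(self._read_from_disk())
        return iter(self._items)


reads = {"count": 0}


def read_reviews():
    """Stand-in for reading review files from disk."""
    for text in ["first review", "second review"]:
        reads["count"] += 1
        yield text


source = CachedSource(read_reviews)
for epoch in range(3):
    batch = list(source)

print(reads["count"])
```

* Although we iterate over the data three times, each item is read from the "disk" only once.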

* We proceed to build our model.

* The first layer we put in our model is `vectorize_layer`.

In [18]:
model = tf.keras.Sequential()

model.add(vectorize_layer)

* Next we will add an *embedding layer*.

* This layer converts the integer representing each token into a 16-dimensional *vector* (the dimension is our choice).

* Therefore we go from a "word-number" representation to a "word-vector" representation.

* This vector representation, the *embedding*, will in some way capture the meaning of each token.

* How is the vector representation of each word derived? The network will learn it during training!

* The input of the embedding layer is a sequence of integers of length 250.

* The output of the layer will be an array of dimensions $250 \times 16$.

In [19]:
embedding_dim = 16

model.add(layers.Embedding(max_features, embedding_dim))
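
* Conceptually, an embedding layer is a lookup table with one row per vocabulary entry. The following sketch uses toy sizes and randomly initialized values; the names `table` and `embed()` are ours, and the initialization range is an arbitrary assumption.

```python
import random

rng = random.Random(0)
toy_vocab_size, toy_dim = 6, 4

# One row of toy_dim numbers per vocabulary entry.
table = [[rng.uniform(-0.05, 0.05) for _ in range(toy_dim)]
         for _ in range(toy_vocab_size)]


def embed(token_ids, table):
    """Replace each integer by the corresponding row of the table."""
    return [table[token_id] for token_id in token_ids]


sequence = [2, 4, 3, 1, 0, 0]
embedded = embed(sequence, table)
print(len(embedded), len(embedded[0]))
```

* With `max_features = 10000` and `embedding_dim = 16`, the table has $10000 \times 16 = 160000$ entries, all of which are trainable weights.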

* A dropout layer follows, which randomly sets 20% of its inputs to zero during training to reduce overfitting.

In [20]:
model.add(layers.Dropout(0.2))

* Each review is represented by a $250 \times 16$ matrix.

* From it we will produce a vector with 16 dimensions.

* We will do this with a `GlobalAveragePooling1D` layer.

* From the 250 vectors of 16 dimensions we will take their average.

* Intuitively, this will be the vector representation of the "average" meaning of the tokens of each review.

* Even more intuitively, this will correspond to the meaning (in an ideal world) that sums up the entire review.

In [21]:
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dropout(0.2))
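
* The averaging step can be written out by hand. This sketch uses a toy $3 \times 2$ matrix instead of $250 \times 16$; the function name `global_average_pooling_1d()` is ours.

```python
def global_average_pooling_1d(matrix):
    """Average the rows of a (sequence_length x dim) matrix into one dim-vector."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


toy_matrix = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
pooled = global_average_pooling_1d(toy_matrix)
print(pooled)  # [3.0, 4.0]
```

* Since the `Embedding` layer above is created without `mask_zero=True`, our understanding is that the padded positions are also included in the average.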

* Finally, we will add a densely connected layer with a single neuron as the last layer to do the classification.

In [22]:
model.add(layers.Dense(1))
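
* A single dense neuron without activation computes a weighted sum of its inputs plus a bias. Here is a sketch with a toy 3-dimensional input standing in for the 16-dimensional pooled vector; all values and names are made up for illustration.

```python
def dense_single_neuron(inputs, weights, bias):
    """Return the raw output (logit) w . x + b of one neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


pooled_review = [0.5, -1.0, 2.0]   # toy stand-in for the pooled review vector
toy_weights = [0.2, 0.4, 0.1]
toy_bias = 0.05

logit = dense_single_neuron(pooled_review, toy_weights, toy_bias)
n_parameters = len(toy_weights) + 1  # one weight per input, plus the bias
print(logit, n_parameters)
```

* With 16 inputs, such a neuron has 16 weights and 1 bias, i.e. 17 trainable parameters.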

* Let's see briefly what we have:

In [23]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 250)              0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 250, 16)           160000    
                                                                 
 dropout (Dropout)           (None, 250, 16)           0         
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense (Dense)               (None, 1)                 1

* As usual, we define optimizer, loss, and metric.

In [24]:
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])
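
* Why `from_logits=True` and `threshold=0.0`? The last layer outputs a raw score (logit) $z$, not a probability. The sketch below checks numerically that thresholding the logit at 0 is the same as thresholding the sigmoid probability at 0.5, and that the cross-entropy can be computed directly from the logit. The helper functions are our own and only illustrate the arithmetic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(y, p):
    """Binary cross-entropy given a probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(y, z):
    """Same loss computed from the logit z, in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


logits = [-2.0, -0.1, 0.3, 4.0]
same_decision = all((z > 0.0) == (sigmoid(z) > 0.5) for z in logits)
max_difference = max(
    abs(bce_from_logit(y, z) - bce_from_probability(y, sigmoid(z)))
    for z in logits for y in (0, 1)
)
print(same_decision, max_difference)
```

* Both formulations give the same decisions and, up to rounding, the same loss.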

* We proceed to train for ten epochs.

In [25]:
epochs = 10
history = model.fit(
    raw_train_ds,
    validation_data=raw_val_ds,
    epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


* After we train, we can see the performance on the test data.

* Because our model produces logits, we will add a layer that applies a sigmoid activation, so that the output is a probability.

In [26]:
sigmoid_model = tf.keras.Sequential([
  model,
  layers.Activation('sigmoid')
])

sigmoid_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = sigmoid_model.evaluate(raw_test_ds)
print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.310350239276886
Accuracy:  0.8730400204658508


* Let's go back a bit to the vector representation of words in our model.

* The vectors for each word are the weights of the `Embedding` layer.

* The weight matrix of this layer has shape `(vocabulary_size, embedding_dim)`.

* The neural network learned the weights, i.e. it learned the vector representations of the words during training for the classification.

In [27]:
embedding = model.layers[1]
weights = embedding.get_weights()[0]
weights.shape

(10000, 16)

* So from where we started by representing words through integers, we end up representing each word as a point in a 16-dimensional space.

In [28]:
vocabulary = vectorize_layer.get_vocabulary()
for num in range(6):
    word = vocabulary[num]
    vec = weights[num]
    print(word, num, vec)

 0 [-0.00942429  0.00145162  0.00879299 -0.00617482 -0.01833293 -0.01385313
  0.0061789   0.02245351  0.00614921 -0.00450425  0.00639349  0.01372139
  0.00953717  0.00095779  0.01652874 -0.0068995 ]
[UNK] 1 [-0.0311668   0.05364797  0.07109778  0.13044252  0.00565497  0.08358029
  0.07035164  0.03607412  0.08179341  0.00629571  0.02268083  0.02624291
 -0.10860924  0.01658307 -0.01757364  0.03755026]
the 2 [ 0.00751571 -0.02789821  0.00926731 -0.02363539 -0.06039637 -0.09125661
 -0.07214823 -0.06925654 -0.04588158 -0.01215377 -0.04814544 -0.03682938
  0.02616501 -0.06814333  0.06467235 -0.00741657]
and 3 [ 0.17544998 -0.19089417 -0.20559171 -0.27843451 -0.17202763 -0.17646806
 -0.20626883 -0.20862857 -0.1764741  -0.12801513 -0.2540157  -0.27600306
  0.24608037 -0.2930975   0.18469746 -0.19249779]
a 4 [ 0.15124586 -0.06932954 -0.06003192 -0.01276331 -0.08331891 -0.00961464
 -0.02208956 -0.05247774 -0.07412761 -0.05018733 -0.06016735 -0.04011693
 -0.01103566 -0.02208388  0.00852778 -0.047
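
* Going the other way, from a word to its vector, is a dictionary lookup followed by a row lookup. The sketch below uses a toy vocabulary and made-up 2-dimensional weights, not the trained values; `word_vector()` is a hypothetical helper. Words outside the vocabulary fall back to the `[UNK]` row at index 1.

```python
toy_vocabulary = ["", "[UNK]", "the", "and"]
toy_weights = [
    [0.0, 0.0],    # padding
    [0.1, -0.1],   # [UNK]
    [0.3, 0.2],    # the
    [-0.2, 0.4],   # and
]

word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}


def word_vector(word):
    """Return the embedding row for a word, or the [UNK] row if it is unknown."""
    return toy_weights[word_to_index.get(word, 1)]


print(word_vector("and"))
print(word_vector("zebra"))
```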

* Vector representations of words are the foundation of neural networks that process language.

* In our simple example we used a small dataset to learn the vectors.

* In practice, pretrained word vectors are available that have been trained on huge text corpora.