<a href="https://colab.research.google.com/github/ch00226855/CMP414765Spring2022/blob/main/Week13_AnalyzingTexts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 13
# Analyzing Texts

This notebook classifies movie reviews as positive or negative using the text of the review.

We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the Internet Movie Database. These reviews are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

**Please turn on GPU computing from the menu (in Colab: Runtime > Change runtime type, then select a GPU hardware accelerator).**

In [1]:
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
# tf.config.list_physical_devices is the stable (non-experimental) name for this call
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.8.0
Eager mode:  True
Hub version:  0.12.0
GPU is available


## Download the dataset

In [2]:
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
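The split sizes quoted in the comment can be checked with a little arithmetic. This is a minimal sketch that only reproduces the index ranges implied by the percentage slices; it assumes the 25,000-review training set described above and does not call `tfds` itself.

```python
# Number of reviews in the original IMDB training set (from the description above)
train_size = 25_000

# 'train[:60%]' keeps the first 60% of the examples, 'train[60%:]' keeps the rest
cut = train_size * 60 // 100
splits = {
    "train[:60%]": range(0, cut),
    "train[60%:]": range(cut, train_size),
}

for name, indices in splits.items():
    print(name, len(indices))
```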

## Explore the Data

In [3]:
?train_data

In [15]:
# Turn the training set into an iterator
iterator = iter(train_data.batch(10))

In [18]:
# Extract the first batch of 10 reviews
train_examples_batch, train_labels_batch = next(iterator) # The next() function returns the next item of an iterator

In [19]:
# Print a review
print(train_examples_batch[4].numpy())

b'Hilarious, evocative, confusing, brilliant film. Reminds me of Bunuel\'s L\'Age D\'Or or Jodorowsky\'s Holy Mountain-- lots of strange characters mucking about and looking for..... what is it? I laughed almost the whole way through, all the while keeping a peripheral eye on the bewildered and occasionally horrified reactions of the audience that surrounded me in the theatre. Entertaining through and through, from the beginning to the guts and poisoned entrails all the way to the end, if it was an end. I only wish i could remember every detail. It haunts me sometimes.<br /><br />Honestly, though, i have only the most positive recollections of this film. As it doesn\'t seem to be available to take home and watch, i suppose i\'ll have to wait a few more years until Crispin Glover comes my way again with his Big Slide Show (and subsequent "What is it?" screening)... I saw this film in Atlanta almost directly after being involved in a rather devastating car crash, so i was slightly dazed 

In [20]:
# Display the labels of the first 10 reviews
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 1, 1, 0, 1, 0, 1, 1, 1, 0])>

## Building the Model
- Represent words as vectors using a pre-trained encoder
- Decide the number of hidden layers
- Decide the number of hidden units for each layer

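For this example we will use a pre-trained text embedding model from TensorFlow Hub called `gnews-swivel-20dim`. It embeds each word as a vector of length 20 and combines the word vectors, so every input text, whatever its length, is represented by a single vector of length 20.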
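To make the idea of a fixed-length text vector concrete, here is a toy sketch. The three-dimensional vectors, the tiny vocabulary, and the plain averaging rule are all made up for illustration; the real module has its own learned vocabulary and may combine word vectors differently.

```python
# Hypothetical 3-dimensional word vectors (illustration only, not from the real module)
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "movie": [0.2, 0.8, 0.5],
    "boring": [-0.7, 0.2, 0.4],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for words outside the toy vocabulary


def embed_text(text):
    """Average the word vectors of a whitespace-tokenized text."""
    tokens = text.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [toy_vectors.get(token, UNKNOWN) for token in tokens]
    # zip(*vectors) groups the values of each dimension together
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Texts of different lengths map to vectors of the same length
print(embed_text("great movie"))
print(embed_text("boring movie with a great ending"))
```

Because the output length does not depend on the number of words, the vector can be fed directly into a dense layer with a fixed number of inputs.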

# Word Embedding

## Why transform words into vectors?

## Challenges for word embedding
- curse of dimensionality
- performance metrics
- training algorithm

## Popular embedding models
- Word2Vec
- BERT
- Train your own embedding

In [21]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 1.4830992 , -3.295094  ,  3.3016534 , -0.3216796 , -4.401221  ,
        -2.4952629 , -2.941581  ,  1.3125231 ,  3.715375  , -0.42447683,
        -4.0236163 ,  0.51593536,  0.45462236,  0.16177966, -3.9231987 ,
         1.8843338 ,  2.8799458 , -1.5778295 , -2.9341567 , -0.7681518 ],
       [ 4.042172  , -5.434198  ,  1.8562293 , -3.501284  , -5.3146043 ,
        -1.636114  , -3.4733946 ,  3.3484647 ,  6.48501   , -0.11143868,
        -4.1651607 ,  2.0633504 , -1.1959407 , -0.7932419 , -6.702236  ,
         2.6196687 ,  4.458931  , -3.0663729 , -5.538564  , -1.2161295 ],
       [ 3.065759  , -6.345902  ,  5.8373117 , -3.1535432 , -6.0480514 ,
        -2.0733445 , -2.2367308 ,  0.8322462 ,  6.4001536 , -2.9436262 ,
        -3.6775544 , -0.07057961, -2.0837922 , -0.6685285 , -5.250441  ,
         1.3003087 ,  6.1873426 , -2.3623805 , -4.021395  , -1.3652462 ]],
      dtype=float32)>

In [22]:
train_examples_batch[:3]

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'During a sleepless night, I was switching through the channels & found this embarrassment of a movie. What were they thinking?<br /><br />If this is life after "Remote Control" for Kari (Wuhrer) Salin, no wonder she\'s gone nowhere.<br /><br />And why did David Keith take this role? It\'s pathetic!<br /><br />Anyway, I turned on the movie near the end, so I didn\'t get much of the plot. But this must\'ve been the best part. This nerdy college kid brings home this dominatrix-ish girl...this scene is straight out of the comic books -- or the cheap porn movies. She calls the mother anal retentive and kisses the father "Oh, I didn\'t expect tongue!" Great lines!<br /><br />After this, I had to see how it ended..<br /><br />Well, of course, this bitch from hell has a helluva past, so the SWAT team is upstairs. And yes...they surround her! And YES YES! The kid blows her brains out!!!! AHAHHAHAHAHA!!<br /><br />This is must-see TV. <br /><

In [24]:
vectors = hub_layer(train_examples_batch[:3])
vectors.shape

TensorShape([3, 20])

In [25]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 20)                400020    
                                                                 
 dense (Dense)               (None, 16)                336       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
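The parameter counts in the summary can be reproduced by hand. The sketch below recomputes the two dense layers from their input and output sizes and checks the total. Reading the embedding count as a table of 20-value rows is an inference from the number 400,020, not something the summary states.

```python
embedding_dim = 20   # output size of the hub layer
hidden_units = 16    # size of the first Dense layer

# A Dense layer has one weight per (input, output) pair plus one bias per output
dense_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

# Taken from the summary; 400,020 / 20 suggests a table with 20,001 rows (an inference)
embedding_params = 400_020
embedding_rows = embedding_params // embedding_dim

total = embedding_params + dense_params + output_params
print(dense_params, output_params, embedding_rows, total)
```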


In [26]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
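The last layer has no activation, so the model outputs raw scores (logits), and `from_logits=True` tells the loss to apply the sigmoid itself. The following is a sketch of the underlying math for a single example, written in plain Python; it is not the Keras implementation, and the function names are made up for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(label, logit):
    """Binary cross-entropy computed directly from a logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, p):
    """Textbook binary cross-entropy computed from a probability."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# Both routes give the same loss for moderate logits
for label, logit in [(1, 2.0), (0, 2.0), (1, -0.5)]:
    direct = bce_from_logit(label, logit)
    via_sigmoid = bce_from_probability(label, sigmoid(logit))
    print(label, logit, round(direct, 4), round(via_sigmoid, 4))
```

Working from the logit avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1, which is the usual reason to leave the sigmoid out of the model and let the loss handle it.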

In [27]:
# The fit() method returns a History object that records the loss and metrics
# for each epoch, which is useful for evaluating the model.
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## Evaluate the model

In [28]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.3f}")

49/49 - 3s - loss: 0.3108 - accuracy: 0.8620 - 3s/epoch - 54ms/step
loss: 0.311
accuracy: 0.862


In [30]:
# How about my own reviews?
my_review = np.array(["This movie is the worst action movie I have ever watched in my entire life.",
                      "I really enjoyed the plot, but the lead actor didn't portray his character well.",
                      "It is the most visually stunning movie in the series. The acting is outstanding too.",
                      "I really like that everyone in this movie makes it crystal clear that they don't care the quality at all.",
                      "There is nothing about the movie that I don't like. I wish everyone else just stop making movies since no moive can be better than this one."])
model(my_review).numpy()

array([[-1.247422  ],
       [ 0.79790074],
       [ 2.7052898 ],
       [ 0.13088328],
       [-3.0741627 ]], dtype=float32)
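These numbers are logits, not probabilities: a positive value means the model leans toward "positive review" and a negative value toward "negative review". A minimal sketch of converting them to probabilities with the sigmoid function, using the five values printed above:

```python
import math

# Logits copied from the output above, one per review
logits = [-1.247422, 0.79790074, 2.7052898, 0.13088328, -3.0741627]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


probabilities = [sigmoid(z) for z in logits]
for z, p in zip(logits, probabilities):
    label = "positive" if p > 0.5 else "negative"
    print(f"logit {z:+.3f} -> P(positive) = {p:.3f} ({label})")
```

Note that the fifth review is phrased as praise ("nothing ... that I don't like") yet receives the most negative score, and the sarcastic fourth review lands slightly on the positive side. A model that combines word vectors without regard to word order has little chance of handling negation or sarcasm.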

In [31]:
# Extract 20 reviews from the test set
reviews, labels = next(iter(test_data.batch(20)))
predictions = model(reviews).numpy()

In [32]:
labels.numpy()

array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

In [33]:
(predictions > 0).astype(int).reshape(-1)

array([0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
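Comparing the two arrays by eye is error-prone, so here is a short sketch that lists the misclassified positions and the accuracy on this batch. The two lists are copied from the outputs above.

```python
# True labels and thresholded predictions, copied from the two outputs above
labels = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predicted = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0]

# Positions where the prediction disagrees with the label
wrong = [i for i, (y, y_hat) in enumerate(zip(labels, predicted)) if y != y_hat]
accuracy = 1 - len(wrong) / len(labels)

print("misclassified indices:", wrong)
print("accuracy on this batch:", accuracy)
```

Twenty reviews is a small sample, so the accuracy on this batch can differ noticeably from the accuracy on the full test set.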

In [36]:
reviews[8]

<tf.Tensor: shape=(), dtype=string, numpy=b'As long as you keep in mind that the production of this movie was a copyright ploy, and not intended as a serious release, it is actually surprising how not absolutely horrible it is. I even liked the theme music.<br /><br />And if ever a flick cried out for a treatment by Joel (or Mike) and the MST3K Bots, this is it! Watch this with a bunch of smart-ass wise-crackers, and you\'re in for a good time. Have a brew, butter up some large pretzels, and enjoy.<br /><br />Of course, obtaining a copy requires buying a bootleg or downloading it as shareware, but if you\'re here on the IMDb, then you\'re most likely savvy enough to do so. Good luck.<br /><br />And look for my favorite part....where Dr. Doom informs the FF that they have 12 hours to comply with his wishes....and he actually gestures the number "12" with his finger while doing so....it\'s like "Evil Sesame Street"....hoo boy.<br /><br />...and of course Mrs. Storm declaring "Just look a

In [35]:
predictions

array([[-3.6375815e-01],
       [ 1.6878791e+00],
       [-2.2796133e+00],
       [-2.1393821e+00],
       [ 4.1805129e+00],
       [ 4.0146170e+00],
       [ 7.1837163e+00],
       [ 5.3082333e+00],
       [ 1.8619432e+00],
       [-4.4788378e-01],
       [-5.8251934e+00],
       [-3.5016735e+00],
       [-4.4582561e-03],
       [-3.6215645e-01],
       [ 4.0439005e+00],
       [ 4.2393667e-01],
       [ 2.6834457e+00],
       [-3.7960603e+00],
       [ 3.0841858e+00],
       [-5.4220591e+00]], dtype=float32)
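The magnitude of a logit indicates how far a review sits from the decision boundary at 0. As a rough sketch, the reviews can be ranked by the absolute value of their logit to find the ones the model is least sure about; the values are copied from the output above.

```python
import math

# Logits for the 20 test reviews, copied from the output above
logits = [
    -3.6375815e-01, 1.6878791e+00, -2.2796133e+00, -2.1393821e+00,
    4.1805129e+00, 4.0146170e+00, 7.1837163e+00, 5.3082333e+00,
    1.8619432e+00, -4.4788378e-01, -5.8251934e+00, -3.5016735e+00,
    -4.4582561e-03, -3.6215645e-01, 4.0439005e+00, 4.2393667e-01,
    2.6834457e+00, -3.7960603e+00, 3.0841858e+00, -5.4220591e+00,
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Sort review indices from least to most confident (smallest |logit| first)
ranked = sorted(range(len(logits)), key=lambda i: abs(logits[i]))

for i in ranked[:5]:
    print(f"review {i:2d}: logit {logits[i]:+.4f}, P(positive) = {sigmoid(logits[i]):.3f}")
```

Several of the least confident reviews are also among those the model gets wrong, which is what one would hope for: errors concentrated near the boundary rather than among confident predictions. Review 8 is an exception, with a fairly confident positive score for a review labeled negative.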