<a href="https://colab.research.google.com/github/ch00226855/CMP414765Fall2022/blob/main/Week12_AnalyzingTexts_Completed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 12
# Analyzing Texts

This notebook classifies movie reviews as positive or negative using the text of the review.

We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the Internet Movie Database. These reviews are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

**Please turn on GPU computing from the menu.**

In [1]:
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Hub version: ", hub.__version__)
print(tf.config.experimental.list_physical_devices("GPU"))

Version:  2.9.2
Hub version:  0.12.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Download the dataset

In [2]:
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


## Explore the Data

In [3]:
?train_data

In [4]:
# Turn the training set into an iterator
iterator = iter(train_data.batch(10))

In [5]:
# Extract the first batch of 10 reviews
train_examples_batch, train_labels_batch = next(iterator) # The next() function returns the next item of an iterator

In [9]:
# Print a review
ind = 0
print(train_examples_batch[ind].numpy())

b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."


In [11]:
# Display the label of the same review (0 = negative, 1 = positive)
train_labels_batch[ind].numpy()

0

## Building the Model
Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing we must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. 

### Attempt 1: One-Hot Encoding

The **one-hot encoding** method converts each string to a vector that contains a single non-zero component (thus it is called "one-hot"). It is a commonly used method to convert a categorical variable to a vector.

<img src="https://i.imgur.com/mtimFxh.png" width="400">

However, one-hot encoding is a poor fit for vectorizing text because:
- The length of each vector equals the size of the vocabulary, which is very large.
- It does not preserve relationships between words. For example, "no" and "not" have similar meanings, but their one-hot vectors are entirely different.
- It is not clear how to combine the words of a sentence into a single vector.

<img src="https://miro.medium.com/max/1400/0*QMGjp-fPYpPaE3eK" width="400">
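
As a minimal sketch of the first two drawbacks, the snippet below one-hot encodes a tiny made-up vocabulary (the four words are an assumption chosen for illustration, not taken from the dataset). Every vector is as long as the vocabulary, and any two distinct words have a dot product of zero, so "no" and "not" look as unrelated as any other pair.

```python
# Toy vocabulary (hypothetical; a real one has tens of thousands of words).
vocab = ["movie", "no", "not", "today"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot("no"))                       # [0, 1, 0, 0]
print(dot(one_hot("no"), one_hot("not")))  # 0
```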





### Attempt 2: The TF-IDF Approach

- **Term Frequency (TF)**: How many times does the word appear in the document?

- **Document Frequency (DF)**: How many documents contain the word?

- **TF-IDF metric**: 

$$
TF \cdot [\log(n / DF) + 1].
$$
Here $n$ is the total number of documents and $\log$ is the natural logarithm.

In [13]:
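from sklearn.feature_extraction.text import CountVectorizer
# Three toy "emails" to vectorize
email1 = "how are you are you you"
email2 = "hello how you you are hello"
email3 = "hello you won a prize prize"
temp_data = np.array([email1, email2, email3])
# Learn the vocabulary and count how often each word occurs in each document
count = CountVectorizer()
vectors = count.fit_transform(temp_data)
# Sparse output: (document index, word index) followed by the count
print(vectors)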

  (0, 2)	1
  (0, 0)	2
  (0, 5)	3
  (1, 2)	1
  (1, 0)	1
  (1, 5)	2
  (1, 1)	2
  (2, 5)	1
  (2, 1)	1
  (2, 4)	1
  (2, 3)	2
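
Each line of the sparse output reads `(document index, vocabulary index)` followed by the count. The sketch below rebuilds the same counts in plain Python, assuming scikit-learn's default settings (lowercasing, tokens of at least two word characters, vocabulary sorted alphabetically). Under those defaults the one-letter word "a" gets no column. The names `tokenize`, `vocab`, and `counts` are introduced here for illustration.

```python
import re
from collections import Counter

docs = ["how are you are you you",
        "hello how you you are hello",
        "hello you won a prize prize"]

def tokenize(text):
    # Same pattern as CountVectorizer's default token_pattern:
    # runs of at least two word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Column order: alphabetically sorted vocabulary
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row of counts per document, one column per vocabulary word
counts = []
for doc in docs:
    freq = Counter(tokenize(doc))
    counts.append([freq[word] for word in vocab])

print(vocab)
for row in counts:
    print(row)
```

Reading the rows against `vocab` shows, for example, that "you" (last column) appears three times in the first document.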


In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
# Reweight the raw counts by TF-IDF; by default each row is then scaled to unit length
tfidf = TfidfTransformer()
vectors = tfidf.fit_transform(vectors)
print(vectors)

  (0, 5)	0.721466053195566
  (0, 2)	0.30967296752749107
  (0, 0)	0.6193459350549821
  (1, 5)	0.5355035314457559
  (1, 2)	0.34477914858865916
  (1, 1)	0.6895582971773183
  (1, 0)	0.34477914858865916
  (2, 5)	0.24259369753845733
  (2, 4)	0.4107468350088512
  (2, 3)	0.8214936700177023
  (2, 1)	0.31238355521006117


In [26]:
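# Calculate the tf-idf vector for email1 by hand with the formula above.
# Note: TfidfTransformer smooths the IDF by default (smooth_idf=True), i.e. it uses
# log((1 + n) / (1 + DF)) + 1, so its output differs slightly from these values.
tfidf_are = 2 * (np.log(3/2) + 1)  # TF = 2, DF = 2
tfidf_how = 1 * (np.log(3/2) + 1)  # TF = 1, DF = 2
tfidf_you = 3 * (np.log(3/3) + 1)  # TF = 3, DF = 3
# Divide by the Euclidean (L2) norm so the vector has unit length
norm = np.sqrt(tfidf_are ** 2 + tfidf_how ** 2 + tfidf_you ** 2)
print(tfidf_are / norm, tfidf_how / norm, tfidf_you / norm)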

0.6469749675187931 0.32348748375939657 0.6904920269308479
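
These hand-computed values do not match the first row printed by `TfidfTransformer`. The reason is that scikit-learn smooths the IDF by default (`smooth_idf=True`), using $\log((1 + n) / (1 + DF)) + 1$. Here is a sketch of the same calculation with the smoothed formula, using only the standard library; the helper name `smooth_idf` is ours:

```python
import math

n = 3  # number of documents

def smooth_idf(df):
    # scikit-learn's default: add one to both numerator and denominator
    return math.log((1 + n) / (1 + df)) + 1

# email1 = "how are you are you you": term frequency times smoothed IDF
weights = {
    "are": 2 * smooth_idf(2),
    "how": 1 * smooth_idf(2),
    "you": 3 * smooth_idf(3),
}

# L2-normalize so the vector has unit length
norm = math.sqrt(sum(w ** 2 for w in weights.values()))
normalized = {word: w / norm for word, w in weights.items()}
print(normalized)
```

The result should agree with the `(0, 0)`, `(0, 2)`, and `(0, 5)` entries of the `TfidfTransformer` output up to floating-point rounding.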


### Attempt 3: Use An Existing Word Embedding

For this example we will use a pre-trained text embedding model from TensorFlow Hub called `gnews-swivel-20dim`, which represents a piece of text as a vector of length 20. The embedding was trained with the Swivel matrix decomposition method on 130 GB of text from Google News.

In [28]:
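embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
# Wrap the pre-trained embedding as a Keras layer that takes raw strings;
# trainable=True lets its weights be fine-tuned along with the rest of the model
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
# Embed the first three reviews: one 20-dimensional vector per review
hub_layer(train_examples_batch[:3])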

<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 1.7657859 , -3.882232  ,  3.913424  , -1.5557289 , -3.3362343 ,
        -1.7357956 , -1.9954445 ,  1.298955  ,  5.081597  , -1.1041285 ,
        -2.0503852 , -0.7267516 , -0.6567596 ,  0.24436145, -3.7208388 ,
         2.0954835 ,  2.2969332 , -2.0689783 , -2.9489715 , -1.1315986 ],
       [ 1.8804485 , -2.5852385 ,  3.4066994 ,  1.0982676 , -4.056685  ,
        -4.891284  , -2.7855542 ,  1.3874227 ,  3.8476458 , -0.9256539 ,
        -1.896706  ,  1.2113281 ,  0.11474716,  0.76209456, -4.8791065 ,
         2.906149  ,  4.7087674 , -2.3652055 , -3.5015903 , -1.6390051 ],
       [ 0.71152216, -0.63532174,  1.7385626 , -1.1168287 , -0.54515934,
        -1.1808155 ,  0.09504453,  1.4653089 ,  0.66059506,  0.79308075,
        -2.2268343 ,  0.07446616, -1.4075902 , -0.706454  , -1.907037  ,
         1.4419788 ,  1.9551864 , -0.42660046, -2.8022065 ,  0.43727067]],
      dtype=float32)>

In [29]:
# Embed a single word
vec1 = hub_layer(["no"]).numpy()
print(vec1)

[[ 0.10134147 -0.7071663   0.6704395   0.7686424  -0.4218064  -1.1730928
  -0.15512493  0.44104302  0.52672297 -0.26403418 -0.00403396 -0.23549394
  -0.09064813  0.22202587 -0.31055978  0.94597495  0.5804941  -0.30393267
  -0.4768906  -0.68173426]]


In [32]:
vec2 = hub_layer(["today"]).numpy()
print(vec2)

[[-0.72950435 -0.03503934  1.1545151   0.05922344 -0.32318655 -0.3155245
  -0.5626829  -0.5286604   0.20992158  0.27145267 -0.23958004 -0.19611275
  -0.4069648  -0.03226112 -0.45999515  0.27652228  0.01596249 -0.14209424
   0.40389413  0.54259264]]


In [33]:
# Calculate the cosine similarity of these vectors.
cosine = vec1.dot(vec2.T) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(cosine)

[[0.24056473]]
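
Cosine similarity measures the angle between two vectors and ignores their lengths: 1 means the same direction, 0 means orthogonal, and -1 means opposite directions. To make the scale of the value above easier to read, here is a small sketch with hand-picked 2-D vectors (not real embeddings). The helper `cosine_similarity` is defined here and is not part of the notebook.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```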


In [34]:
train_examples_batch[:3]

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell a

In [35]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])>

In [36]:
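# Stack the layers: text embedding -> hidden layer -> single output
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
# No activation on the last layer, so the model outputs a logit, not a probability
model.add(tf.keras.layers.Dense(1))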

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_1 (KerasLayer)  (None, 20)                400020    
                                                                 
 dense (Dense)               (None, 16)                336       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
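
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per unit. The embedding count of 400,020 is consistent with a table of 20,001 rows of 20 values each. That row count is inferred from the total here, not read from the module's documentation.

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units

embedding_dim = 20
embedding_params = 400_020                 # taken from model.summary()
hidden = dense_params(embedding_dim, 16)   # Dense(16) on a 20-dim input
output = dense_params(16, 1)               # Dense(1) on a 16-dim input
total = embedding_params + hidden + output

print(hidden, output, total)
print(embedding_params // embedding_dim)   # implied rows in the embedding table
```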


In [37]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
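
Because the last `Dense(1)` layer has no activation, the model outputs raw scores (logits), and `from_logits=True` tells the loss to apply the logistic function itself. Below is a sketch of what binary cross-entropy computes for a single logit, written in plain Python and not in TensorFlow. The function name and the sample logits are illustrative, and Keras uses a numerically more stable but mathematically equivalent form.

```python
import math

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, given a raw logit and a 0/1 label."""
    p = 1.0 / (1.0 + math.exp(-logit))   # logistic function: logit -> probability
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

undecided = bce_from_logit(0.0, 1)   # probability 0.5, loss equals log(2)
correct = bce_from_logit(3.0, 1)     # confident and correct: small loss
wrong = bce_from_logit(3.0, 0)       # confident and wrong: large loss
print(undecided, correct, wrong)
```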

In [38]:
# fit() returns a History object; its .history dictionary records the loss and
# metrics for each epoch, which is useful for evaluating the model
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
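
`fit()` returns a `History` object whose `history` attribute is a dictionary mapping each metric name to a list with one value per epoch. Here is a sketch of picking the epoch with the lowest validation loss. The numbers are made-up placeholders, not results from this run; in the notebook one would pass `history.history` instead. The helper `best_epoch` is hypothetical.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return (1-based epoch number, value) where the given metric is smallest."""
    values = history_dict[key]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Placeholder values for illustration only
example_history = {
    "loss":     [0.70, 0.55, 0.40, 0.30],
    "val_loss": [0.68, 0.52, 0.45, 0.47],
}
epoch, value = best_epoch(example_history)
print(epoch, value)
```

A validation loss that starts rising while the training loss keeps falling, as in the placeholder lists, is the usual sign of overfitting.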


## Evaluate the model

In [39]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 2s - loss: 0.3124 - accuracy: 0.8639 - 2s/epoch - 36ms/step
loss: 0.312
accuracy: 0.864


In [43]:
# How about my own reviews?
my_review = np.array(["This movie is the worst action movie I have ever watched in my entire life.",
                      "I really enjoyed the plot, but the lead actor didn't portray his character well.",
                      "It is the most visually stunning movie in the series. The acting is outstanding too.",
                      "I really like that everyone in this movie makes it crystal clear that they don't care the quality at all.",
                      "There is nothing about the movie that I don't like. I wish everyone else just stop making movies since no moive can be better than this one."])
# Convert the model outputs to probabilities using the logistic function
def logistic(t):
    return 1 / (1 + np.exp(-t))

logits = model(my_review).numpy()
probs = logistic(logits)
print(probs)

[[0.14062406]
 [0.6058104 ]
 [0.8502525 ]
 [0.6880946 ]
 [0.04366769]]


In [49]:
logits = model(train_examples_batch)
probs = logistic(logits)
print(probs.round(3).reshape([10]))

[0.    0.015 0.852 0.914 0.573 0.997 0.03  0.229 0.063 0.082]


In [48]:
train_labels_batch.numpy()

array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

In [None]:
# Extract 20 reviews from the test set
reviews, labels = next(iter(test_data.batch(20)))
predictions = model(reviews).numpy()

In [None]:
labels.numpy()

In [None]:
# A logit above 0 corresponds to a probability above 0.5, i.e. a positive prediction
(predictions > 0).astype(int).reshape(-1)
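
Thresholding the logits at 0 gives the same predictions as thresholding the probabilities at 0.5, because the logistic function is increasing and maps 0 to exactly 0.5. The sketch below checks this on a few hand-picked logits and computes the accuracy against made-up labels. Both lists are placeholders, not model outputs.

```python
import math

def logistic(t):
    return 1 / (1 + math.exp(-t))

logits = [-2.3, 0.4, 1.7, -0.1]   # placeholder scores
labels = [0, 1, 0, 0]             # placeholder ground truth

pred_from_logits = [int(z > 0) for z in logits]
pred_from_probs = [int(logistic(z) > 0.5) for z in logits]

# Fraction of predictions that match the labels
accuracy = sum(p == y for p, y in zip(pred_from_logits, labels)) / len(labels)
print(pred_from_logits, pred_from_probs, accuracy)
```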

### Create Our Own Word Embedding

<a href="https://www.tensorflow.org/text/guide/word_embeddings">TensorFlow tutorial on word embeddings</a>