# Week 11
# Movie Reviews Classification
This notebook classifies movie reviews as positive or negative using the text of the review.

We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the Internet Movie Database. These reviews are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

In [1]:
# Install TensorFlow Hub, which contains reusable machine learning models
!pip install --upgrade tensorflow-hub

Requirement already up-to-date: tensorflow-hub in c:\users\ch002\anaconda3\envs\tensorflow\lib\site-packages (0.8.0)


In [2]:
# Install TensorFlow Datasets
!pip install tensorflow-datasets



In [3]:
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.1.0
Eager mode:  True
Hub version:  0.8.0
GPU is NOT AVAILABLE


## Download the dataset

In [4]:
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
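The split sizes quoted in the comment above follow from simple arithmetic on the 25,000-example training set. A minimal sketch, assuming the percentage slices divide the set exactly (which they do for these numbers):

```python
# Sizes implied by the split strings passed to tfds.load above.
TRAIN_TOTAL = 25_000  # IMDB training split, as stated in the introduction
TEST_TOTAL = 25_000   # IMDB test split

n_train = TRAIN_TOTAL * 60 // 100     # 'train[:60%]'
n_validation = TRAIN_TOTAL - n_train  # 'train[60%:]'

print("train:", n_train)
print("validation:", n_validation)
print("test:", TEST_TOTAL)
```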

## Explore the Data

In [5]:
?train_data

In [6]:
# Extract the first batch of 10 reviews
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10))) # The next() function returns the next item of an iterator
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b'This is a big step down after the surprisingly enjoyable original. This sequel isn\'t nearly as fun as part one, and it instead spends too much time on plot development. Tim Thomerson is still the best thing about this series, but his wisecracking is toned down in this entry. The performances are all adequate, but this time the script lets us down. The action is merely routine and the plot is only mildly interesting, so I need lots of silly laughs in order to stay entertained during a "Trancers" movie. Unfortunately, the laughs are few and far between, and so, this film is watchable at best.',
       b"Perhaps because I was so young, innocent and BRAINWASHED when I saw it, this movie was the cause of many sleepless nights for me. I haven't seen it since I was in seventh grade at a Presbyterian school, so I am not sure what effect it would have on me now. However, I will say that it left an impression on me... and most of my friends

In [7]:
# Display the labels of the first 10 reviews
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0], dtype=int64)>

## Building the Model
- Represent words as vectors using a pre-trained encoder
- Decide the number of hidden layers
- Decide the number of hidden units for each layer

For this example we will use a pre-trained text embedding model from TensorFlow Hub called `gnews-swivel-20dim`. It maps each input string (here, a whole review) to a single vector of length 20, so a batch of three reviews produces an output of shape `(3, 20)`.
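To build intuition for how one fixed-length vector can come out of a review of any length, here is a rough sketch in plain Python. It assumes the module splits text on spaces, looks up a vector for each token, and combines them by summing and dividing by the square root of the token count; that is our reading of the module's description, not something this notebook verifies. The vocabulary `toy_vectors` and its 3-dimensional values are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, chosen by hand for illustration.
# The real module uses 20 dimensions and a learned vocabulary.
toy_vectors = {
    "great": [1.0, 0.5, -0.2],
    "movie": [0.2, -0.1, 0.4],
    "boring": [-0.9, 0.3, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: unknown tokens share one fallback vector


def embed_sentence(sentence):
    """Split on spaces, look up each token, and combine with sum / sqrt(n)."""
    tokens = sentence.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [toy_vectors.get(token, UNKNOWN) for token in tokens]
    scale = math.sqrt(len(vectors))
    return [sum(column) / scale for column in zip(*vectors)]


print(embed_sentence("great movie"))
print(embed_sentence("boring movie"))
```

Whatever the sentence length, the result has the same number of components as a single word vector, which is what lets a dense layer sit directly on top of the embedding.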

In [8]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 2.209591  , -2.7093675 ,  3.6802928 , -1.0291991 , -4.1671185 ,
        -2.4566064 , -2.2519937 , -0.36589956,  1.9485804 , -3.1104462 ,
        -2.4610963 ,  1.3139242 , -0.9161584 , -0.16625322, -3.723651  ,
         1.8498232 ,  3.499562  , -1.2373022 , -2.8403084 , -1.213074  ],
       [ 1.9055302 , -4.11395   ,  3.6038654 ,  0.28555924, -4.658998  ,
        -5.5433393 , -3.2735848 ,  1.9235417 ,  3.8461034 ,  1.5882455 ,
        -2.64167   ,  0.76057523, -0.14820506,  0.9115291 , -6.45758   ,
         2.3990374 ,  5.0985413 , -3.2776263 , -3.2652326 , -1.2345369 ],
       [ 3.6510668 , -4.7066135 ,  4.71003   , -1.7002777 , -3.7708545 ,
        -3.709126  , -4.222776  ,  1.946586  ,  6.1182513 , -2.7392752 ,
        -5.4384456 ,  2.7078724 , -2.1263676 , -0.7084146 , -5.893995  ,
         3.1602864 ,  3.8389287 , -3.318196  , -5.1542974 , -2.4051712 ]],
      dtype=float32)>

In [9]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
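The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below recomputes the two dense layers and takes the embedding count directly from the summary; `dense_params` is a helper introduced here for illustration.

```python
def dense_params(n_inputs, n_units):
    """Number of weights plus one bias per unit in a dense layer."""
    return n_inputs * n_units + n_units


embedding_params = 400_020            # reported by model.summary() for the hub layer
hidden_params = dense_params(20, 16)  # 20-dim embedding -> 16 hidden units
output_params = dense_params(16, 1)   # 16 hidden units -> 1 logit
total_params = embedding_params + hidden_params + output_params

print(hidden_params, output_params, total_params)
```

Since 400,020 = 20,001 x 20, the embedding count is consistent with a lookup table of 20,001 rows of 20 values each. All of these parameters are trainable because the hub layer was created with `trainable=True`.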


In [11]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
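Because the last layer has no activation, the model outputs raw scores (logits), and `from_logits=True` tells the loss to apply the sigmoid itself. The following plain-Python sketch shows the idea for a single example; it uses a standard numerically stable form of the formula and is not TensorFlow's actual implementation.

```python
import math


def sigmoid(x):
    """Map a logit to a probability (fine for moderate values of x)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_logit(logit, label):
    """Binary cross-entropy computed directly from a logit.

    Uses max(x, 0) - x * y + log(1 + exp(-|x|)), which avoids overflow.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(p, label):
    """Textbook binary cross-entropy on a probability."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# Both routes agree for a moderate logit.
print(bce_from_logit(2.0, 1), bce_from_probability(sigmoid(2.0), 1))
```

Working from logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.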

In [12]:
# The fit() method returns a History object whose `history` attribute records the
# loss and metrics for each epoch, which is useful for evaluating the model
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

## Evaluate the model

In [13]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

loss: 0.318
accuracy: 0.854


In [14]:
# How about my own reviews?
my_review = np.array(["This movie is the worst action movie I have ever watched in my entire life.",
                      "I really enjoyed the plot, but the lead actor didn't portray his character well.",
                      "It is the most visually stunning movie in the series. The acting is outstanding too.",
                      "I really like that everyone in this movie makes it crystal clear that they don't care the quality at all.",
                      "There is nothing about the movie that I don't like. I wish everyone else just stop making movies since no moive can be better than this one."])
model(my_review).numpy()

array([[-1.1002342 ],
       [ 0.64738315],
       [ 3.419432  ],
       [-0.47325426],
       [-2.4934766 ]], dtype=float32)
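The numbers above are logits, not probabilities. A minimal sketch of how to read them: pass each logit through a sigmoid to get the estimated probability of a positive review, and note that a logit above 0 corresponds to a probability above 0.5. The logits are copied from the output above.

```python
import math

# Logits copied from the model output for the five hand-written reviews.
logits = [-1.1002342, 0.64738315, 3.419432, -0.47325426, -2.4934766]


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


probabilities = [sigmoid(x) for x in logits]

for logit, p in zip(logits, probabilities):
    verdict = "positive" if logit > 0 else "negative"
    print(f"logit={logit:+.3f}  P(positive)={p:.3f}  -> {verdict}")
```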

In [16]:
# Extract 20 reviews from the test set
reviews, labels = next(iter(test_data.batch(20)))
predictions = model(reviews).numpy()

In [17]:
labels.numpy()

array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1],
      dtype=int64)

In [18]:
(predictions > 0).astype(int).reshape(-1)

array([0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])
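Comparing the two arrays by eye is error-prone, so here is a small sketch that finds the mismatches and the accuracy on this batch of 20. The values are copied from the two outputs above.

```python
# True labels and thresholded predictions, copied from the outputs above.
labels = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]
predicted = [0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]

# Indices where the prediction disagrees with the label.
mismatches = [i for i, (y, y_hat) in enumerate(zip(labels, predicted)) if y != y_hat]
batch_accuracy = (len(labels) - len(mismatches)) / len(labels)

print("mismatched indices:", mismatches)
print("batch accuracy:", batch_accuracy)
```

The only disagreement is the first review, `reviews[0]`, which is a positive review that the model scored as negative.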

In [19]:
reviews[0]

<tf.Tensor: shape=(), dtype=string, numpy=b'It opens with your cliche overly long ship flying through space. All I could think at this point was "Spaceballs" and hoping there\'d be a sticker on back that said "We break for Nobody." The movie then shows some cryogenic freezers with Vin Diesel\'s narration. I\'ve always thought his voice sounded cool ever since I saw Fast and the Furious. From when I found out he was as criminal, I thought the movie was going to be cliche. It was. It was very cliche and fate seemed to be against them at every turn. Black out every 22 years. Lucky them, they land on that day. Aliens can only be in the darkness, hey it\'s a solar eclipse. As much as I thought it was too easy and just a cliche, the movie pulled through and kicked major @ss. I even went out and bought a copy of Pitch Black after seeing it. I really can\'t wait for Chronicles of Riddick.'>

# Word Embedding

## Why transform words into vectors?

## Challenges for word embedding
- curse of dimensionality
- performance metrics
- training algorithm
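One way to see why dense vectors help is to compare them with one-hot vectors. With one-hot encoding, the vector length equals the vocabulary size and every pair of distinct words looks equally unrelated; a dense embedding can place related words close together. The sketch below uses a three-word vocabulary and hand-picked 2-dimensional vectors, both hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


vocab = ["good", "great", "terrible"]

# One-hot: one dimension per word in the vocabulary.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors, chosen by hand for illustration only.
dense = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.3],
    "terrible": [-0.7, 0.2],
}

print("one-hot good vs great:   ", cosine(one_hot["good"], one_hot["great"]))
print("dense   good vs great:   ", cosine(dense["good"], dense["great"]))
print("dense   good vs terrible:", cosine(dense["good"], dense["terrible"]))
```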

## Popular embedding models
- Word2Vec
- BERT
- Train your own embedding

# Homework: FashionMNIST

For this homework assignment, you are asked to build a neural network classifier on the FashionMNIST dataset. The FashionMNIST dataset has a lot in common with the MNIST dataset:
- The dataset contains 70,000 grayscale images, split into training set (60,000 images) and test set (10,000 images).
- The resolution of images is 28 by 28 pixels.
- There are a total of 10 target labels.

<img src="https://tensorflow.org/images/fashion-mnist-sprite.png" width="600">

In [None]:
# Import the dataset
fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

In [None]:
# Here is the list of class names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

Please complete the following tasks:
1. Scale the values to [0, 1] by dividing every value by 255.0.
2. Use `plt.imshow()` to display one image from each class.
3. Build a neural network with three layers:
    - The first layer is a flatten layer that turns each 28 * 28 image into a vector of 784 values.
    - The second layer is a dense layer with 128 nodes, with ReLU as activation function.
    - The last layer is a dense layer with 10 nodes without activation.
4. Compile the model, using `adam` as the optimizer and `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` as the loss function. Use `accuracy` as the performance metric.
5. Train the model using `train_images` and `train_labels` for 10 epochs.
6. Evaluate the accuracy on the test set.
7. (optional for undergraduate students) Compute the confusion matrix over the test set. Which type of prediction mistake occurs most frequently?
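As a reminder of what a confusion matrix is (task 7), here is a small plain-Python sketch on made-up labels for three classes. The helper `confusion_matrix` and the toy data are hypothetical and are not FashionMNIST results; rows are true classes and columns are predicted classes.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Hypothetical toy labels for 3 classes.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
```

Correct predictions sit on the diagonal, so the largest off-diagonal entry points to the most frequent kind of mistake. TensorFlow also provides `tf.math.confusion_matrix` for the same purpose.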