##### aimldl >python3 > packages > tensorflow > tutorials > beginners > 1_3_1-text_classification_with_TF_Hub-IMDB.ipynb

* This notebook is a replica of [Text classification with preprocessed text: Movie reviews](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub) with some comments.
  * [TensorFlow](https://www.tensorflow.org/) > [Learn](https://www.tensorflow.org/learn) > [TensorFlow Core](https://www.tensorflow.org/overview) > [Tutorials](https://www.tensorflow.org/tutorials) > ML basics with Keras > [Text classification with preprocessed text](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub)
* Installing TensorFlow 2.0 and Keras is a prerequisite.

# Text classification with preprocessed text: Movie reviews

This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

The tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and Keras.

We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

### Prerequisite: Install tensorflow-hub & tensorflow-datasets

In [24]:
!pip install -q tensorflow-hub
!pip install -q tensorflow-datasets

### Import Modules

In [25]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.0.0
Eager mode:  True
Hub version:  0.7.0
GPU is NOT AVAILABLE


### Download the IMDB dataset
#### TensorFlow Dataset
[Module: tfds](https://www.tensorflow.org/datasets/api_docs/python/tfds), imported as tensorflow_datasets, is the TensorFlow Datasets API. It provides a collection of ready-to-use datasets for TensorFlow.

#### tfds.load
[tfds.load](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) loads the named dataset into a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). In this example, "name" is "imdb_reviews", "as_supervised" is set to True, and "split" specifies how the IMDB dataset is divided.

##### name
"name" is a string of the registered name. It can be either "dataset_name" or "dataset_name/config_name".

##### as_supervised
The default value is False. If "as_supervised" is True, the returned [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) has a 2-tuple structure of (input, label).

##### split
The default value is None, which returns a dict containing all splits of the dataset. Typically it is set to [tfds.Split.TRAIN](https://www.tensorflow.org/datasets/api_docs/python/tfds/Split#TRAIN) or [tfds.Split.TEST](https://www.tensorflow.org/datasets/api_docs/python/tfds/Split#TEST). The class members are:
* ALL
* TEST
* TRAIN
* VALIDATION

In [26]:
print( tfds.Split.ALL, tfds.Split.TEST, tfds.Split.TRAIN, tfds.Split.VALIDATION )

all test train validation


In [27]:
tfds.Split.ALL

NamedSplitAll()

In [6]:
tfds.Split.TEST

NamedSplit('test')

In [7]:
tfds.Split.TRAIN

NamedSplit('train')

In [8]:
tfds.Split.VALIDATION

NamedSplit('validation')

In [9]:
type(tfds.Split.ALL)

tensorflow_datasets.core.splits.NamedSplitAll

In [10]:
type(tfds.Split.TEST)

tensorflow_datasets.core.splits.NamedSplit

In [11]:
type(tfds.Split.TRAIN)

tensorflow_datasets.core.splits.NamedSplit

In [12]:
type(tfds.Split.VALIDATION)

tensorflow_datasets.core.splits.NamedSplit

In [28]:
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_validation_split = tfds.Split.TRAIN.subsplit( [6, 4] )

(train_data, validation_data), test_data = tfds.load( 
    name="imdb_reviews", 
    split=(train_validation_split, tfds.Split.TEST),
    as_supervised=True)

The last message specifies the location of the downloaded data:
> Dataset imdb_reviews downloaded and prepared to /home/aimldl/tensorflow_datasets/imdb_reviews/plain_text/0.1.0.


Let's take a closer look at 
> split=(train_validation_split, tfds.Split.TEST)

In [29]:
train_validation_split

(NamedSplit('train')(tfds.percent[0:60]),
 NamedSplit('train')(tfds.percent[60:100]))

In [30]:
type( train_validation_split )

tuple

The IMDB data is split into training, validation, and test data. The original training split is divided 60/40 into training data (15,000 examples) and validation data (10,000 examples), while the 25,000-example test split is kept whole.
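
As a rough check of these numbers, the percent ranges shown in train_validation_split can be turned into example counts by hand. The sketch below is plain Python and does not call tensorflow_datasets. `percent_slice_bounds` is a hypothetical helper, and the rounding that tfds applies at slice boundaries is an assumption here.

```python
def percent_slice_bounds(num_examples, start_pct, end_pct):
    """Map a percent range such as [0:60] to (start, end) example indices."""
    start = num_examples * start_pct // 100
    end = num_examples * end_pct // 100
    return start, end

NUM_TRAIN = 25000  # size of the IMDB training split

train_bounds = percent_slice_bounds(NUM_TRAIN, 0, 60)
validation_bounds = percent_slice_bounds(NUM_TRAIN, 60, 100)

print("training:  ", train_bounds, train_bounds[1] - train_bounds[0])
print("validation:", validation_bounds, validation_bounds[1] - validation_bounds[0])
```

Newer tensorflow_datasets releases express the same division with split strings such as 'train[:60%]' and 'train[60%:]' rather than subsplit, so the cell above may need adapting on a recent installation.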

In [31]:
train_data

<_OptionsDataset shapes: ((), ()), types: (tf.string, tf.int64)>

In [37]:
type( train_data )

tensorflow.python.data.ops.dataset_ops._OptionsDataset

In [32]:
validation_data

<_OptionsDataset shapes: ((), ()), types: (tf.string, tf.int64)>

In [38]:
type( validation_data )

tensorflow.python.data.ops.dataset_ops._OptionsDataset

### Explore the data
#### 0 -> Negative & 1 -> Positive
The label is either 0 or 1 where 0 is a negative review and 1 is a positive review. 

#### First 10 Examples
train_data.batch(10) groups the dataset into batches of 10 examples, and next(iter(...)) fetches the first batch, that is, the first 10 examples.

In [34]:
train_examples_batch, train_labels_batch = next( iter(train_data.batch(10)) )
train_labels_batch

<tf.Tensor: id=640, shape=(10,), dtype=int64, numpy=array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0])>

The label and text of the first of these ten examples are printed below. train_labels_batch[0] shows that the first label is 1.

In [13]:
train_labels_batch[0]

<tf.Tensor: id=288, shape=(), dtype=int64, numpy=1>

The review text of the first example is train_examples_batch[0].

In [10]:
train_examples_batch[0]

<tf.Tensor: id=274, shape=(), dtype=string, numpy=b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a 

Read the review and judge whether it is positive. It opens with disappointment but ends by awarding 10 points, so it is positive, which matches its label of 1.

## Understanding the Loaded Data
This section doesn't exist in the TensorFlow tutorial, but I think it's necessary.

In [39]:
type( train_data )

tensorflow.python.data.ops.dataset_ops._OptionsDataset

In [35]:
type( train_examples_batch )

tensorflow.python.framework.ops.EagerTensor

In [36]:
type( train_labels_batch )

tensorflow.python.framework.ops.EagerTensor

In [None]:
type( train_data.batch(10) )

In [None]:
next( iter(train_data.batch(10)) )
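
To make the batching step concrete, the sketch below mimics what next(iter(train_data.batch(10))) does, using plain Python instead of tf.data. The `batch` generator and the toy `examples` list are hypothetical stand-ins and are not part of TensorFlow.

```python
from itertools import islice

def batch(iterable, batch_size):
    """Yield consecutive lists of up to batch_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk

# Toy (text, label) pairs standing in for the IMDB examples.
examples = [("review %d" % i, i % 2) for i in range(25)]

first_batch = next(iter(batch(examples, 10)))
texts, labels = zip(*first_batch)  # split the pairs into two tuples

print(len(first_batch))
print(labels)
```

The real tf.data pipeline returns a pair of EagerTensor objects of shape (10,) rather than Python tuples, as the type() calls above show, but the grouping logic is the same idea.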

### Text Representation

Create a Keras layer that uses a TensorFlow Hub model to embed the sentences.

In [14]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer( embedding, input_shape=[], dtype=tf.string, trainable=True )
hub_layer( train_examples_batch[:3] )

<tf.Tensor: id=468, shape=(3, 20), dtype=float32, numpy=
array([[ 3.9819887 , -4.4838037 ,  5.177359  , -2.3643482 , -3.2938678 ,
        -3.5364532 , -2.4786978 ,  2.5525482 ,  6.688532  , -2.3076782 ,
        -1.9807833 ,  1.1315885 , -3.0339816 , -0.7604128 , -5.743445  ,
         3.4242578 ,  4.790099  , -4.03061   , -5.992149  , -1.7297493 ],
       [ 3.4232912 , -4.230874  ,  4.1488533 , -0.29553518, -6.802391  ,
        -2.5163853 , -4.4002395 ,  1.905792  ,  4.7512794 , -0.40538004,
        -4.3401685 ,  1.0361497 ,  0.9744097 ,  0.71507156, -6.2657013 ,
         0.16533905,  4.560262  , -1.3106939 , -3.1121316 , -2.1338716 ],
       [ 3.8508697 , -5.003031  ,  4.8700504 , -0.04324996, -5.893603  ,
        -5.2983093 , -4.004676  ,  4.1236343 ,  6.267754  ,  0.11632943,
        -3.5934832 ,  0.8023905 ,  0.56146765,  0.9192484 , -7.3066816 ,
         2.8202746 ,  6.2000837 , -3.5709393 , -4.564525  , -2.305622  ]],
      dtype=float32)>

Note the output shape of the embeddings is fixed to (num_examples, embedding_dimension) regardless of the input text length.
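
A toy illustration of why the output length does not depend on the input length is sketched below. It is plain Python with a hypothetical 3-dimensional vocabulary. Averaging the token vectors is an assumption made for illustration and is not necessarily the rule the gnews-swivel module applies.

```python
# Hypothetical 3-dimensional token embeddings; the real module uses 20 dimensions.
TOY_EMBEDDINGS = {
    "great": [1.0, 0.5, 0.0],
    "movie": [0.0, 1.0, 0.5],
    "boring": [-1.0, 0.0, 0.5],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary tokens

def embed_sentence(sentence):
    """Split on whitespace, look up each token, and average the vectors."""
    tokens = sentence.lower().split()
    vectors = [TOY_EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
    if not vectors:
        return list(UNKNOWN)
    dim = len(UNKNOWN)
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

sentences = ["great movie", "a long and boring movie that never ends"]
embedded = [embed_sentence(s) for s in sentences]

# The shape is (num_examples, embedding_dimension) whatever the sentence length.
print(len(embedded), [len(vec) for vec in embedded])
```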

### Build the model
TODO: Rewrite the following excerpt of the tutorial.
The layers are stacked sequentially to build the classifier:

* The first layer is a TensorFlow Hub layer.
  * This layer uses a pre-trained SavedModel to map a sentence to its embedding vector. The pre-trained text embedding model used here (google/tf2-preview/gnews-swivel-20dim/1) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are (num_examples, embedding_dimension).
* This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
* The last layer is densely connected with a single output node.
  * Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability, or confidence level.
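
The two Dense layers described above can be sketched as a forward pass in plain Python. The weights here are random placeholders rather than trained values, and `dense`, `random_matrix`, and `embedding_vector` are hypothetical names used only for this illustration.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(inputs . weights + biases)."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(activation(total))
    return outputs

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-in for one 20-dimensional output of the hub layer.
embedding_vector = [random.uniform(-1.0, 1.0) for _ in range(20)]
w1, b1 = random_matrix(20, 16), [0.0] * 16
w2, b2 = random_matrix(16, 1), [0.0]

hidden = dense(embedding_vector, w1, b1, relu)    # 16 hidden units
probability = dense(hidden, w2, b2, sigmoid)[0]   # single output node

print(len(hidden), probability)
```

Whatever the weights are, the sigmoid keeps the final value strictly between 0 and 1, which is why it can be read as a confidence that the review is positive.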

In [16]:
model = tf.keras.Sequential()
model.add( hub_layer )
model.add( tf.keras.layers.Dense(16, activation='relu') )
model.add( tf.keras.layers.Dense(1, activation='sigmoid') )

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense_2 (Dense)              (None, 16)                336       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
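
The parameter counts in the summary can be reproduced by hand. The sketch below recomputes the two Dense layers and takes the hub layer's count directly from the summary. `dense_params` is a hypothetical helper.

```python
def dense_params(num_inputs, num_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return num_inputs * num_units + num_units

hidden_params = dense_params(20, 16)  # 20-dim embedding -> 16 hidden units
output_params = dense_params(16, 1)   # 16 hidden units -> 1 output node
hub_params = 400020                   # copied from the summary above

total = hub_params + hidden_params + output_params
print(hidden_params, output_params, total)
```

All 400,373 parameters are trainable because the hub layer was created with trainable=True. The hub layer's 400,020 parameters equal 20,001 x 20, which would be consistent with an embedding table of 20,001 rows of 20 dimensions. That row count is an inference from the arithmetic, not something the summary states.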


### Compile the model

In [17]:
model.compile( optimizer='adam',
               loss='binary_crossentropy',
               metrics=['accuracy'] )
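
For reference, the loss and metric named in compile can be written out by hand. This is a minimal sketch in plain Python, not the Keras implementation: the labels and probabilities are made up, and the clipping constant and the 0.5 threshold are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(labels, probabilities, epsilon=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, epsilon), 1.0 - epsilon)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, probabilities)
                  if (p > threshold) == bool(y))
    return correct / len(labels)

labels = [1, 0, 1, 0]                 # made-up labels
probabilities = [0.9, 0.2, 0.4, 0.6]  # made-up sigmoid outputs

print(binary_crossentropy(labels, probabilities))
print(accuracy(labels, probabilities))
```

The loss penalizes confident wrong answers more heavily than hesitant ones, whereas accuracy only counts which side of the threshold each prediction falls on.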

### Train the model

In [18]:
history = model.fit( train_data.shuffle(10000).batch(512),
                     epochs=20,
                     validation_data=validation_data.batch(512),
                     verbose=1 )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Evaluate the model

In [19]:
results = model.evaluate( test_data.batch(512), verbose=2 )

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 1s - loss: 0.3130 - accuracy: 0.8674
loss: 0.313
accuracy: 0.867
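
The "49/49" in the evaluation output is the number of batches in the test set. A small sketch of that arithmetic follows, assuming the final, smaller batch is kept rather than dropped. `num_batches` is a hypothetical helper.

```python
import math

BATCH_SIZE = 512

def num_batches(num_examples, batch_size=BATCH_SIZE):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

print("training steps per epoch:", num_batches(15000))
print("validation batches:      ", num_batches(10000))
print("test batches:            ", num_batches(25000))
```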


> This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.
