## NN with Embedding:

Coursera Colab link: https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/TensorFlow%20In%20Practice/Course%203%20-%20NLP/Course%203%20-%20Week%202%20-%20Lesson%201.ipynb

IMDB Review dataset: http://ai.stanford.edu/~amaas/data/sentiment/

Converting tensors to NumPy arrays: https://www.tensorflow.org/tutorials/customization/basics

Embedding in TF/Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

To visualise the embeddings in the Embedding Projector and inspect how words cluster by sentiment: http://projector.tensorflow.org/

In [0]:
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

TensorFlow 2.x selected.
2.1.0


In [0]:
# Import TF dataset

# if we have to download the TF datasets:
# !pip install -q tensorflow-datasets

import tensorflow_datasets as tfds

# Load imdb review ds available in tf:
imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteKRQLTM/imdb_reviews-train.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteKRQLTM/imdb_reviews-test.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteKRQLTM/imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [0]:
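# imdb is a dict keyed by split name, so iterating over it yields the split names.
type(imdb)  # shows nothing here: only the last expression of a cell is echoed
print(list(imdb)[:10])
print(list(imdb))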

['test', 'train', 'unsupervised']
['test', 'train', 'unsupervised']


## Converting the dataset to a list of strings (sentences) and an array (labels) for training

In [0]:
import numpy as np

train_data, test_data = imdb['train'], imdb['test']

print(train_data)  # a tf.data.Dataset of (text, label) pairs: text is a tf.string scalar, label a tf.int64 scalar
print(test_data)

# tf.print(train_data, output_stream=sys.stderr)
# print(str(train_data.eval()))

# Each element of the dataset is a pair of tensors (text, label).
# Thus, we convert them to a list of strings and a list of labels:
training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

# .numpy() converts a tensor to a NumPy value; for a tf.string scalar this is a Python bytes object.
# str() on bytes keeps the b'...' wrapper (visible in the printed sentences below);
# s.numpy().decode('utf-8') would give the plain text instead.
for s, l in train_data:  # each element is a (text, label) pair of scalar tensors
  training_sentences.append(str(s.numpy()))  # tensor -> bytes -> str, appended to the list
  training_labels.append(l.numpy())  # labels stay as integers, no need to convert to str

for s, l in test_data:
  testing_sentences.append(str(s.numpy()))
  testing_labels.append(l.numpy())
 
# the NN model needs a numpy array as labels, thus we convert the list of labels into np array (using np.array()):
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)


print(training_sentences[0])
print(len(training_sentences))

print(training_labels_final[0])
print(len(training_labels_final))

print(testing_sentences[0])
print(len(testing_sentences))

print(testing_labels_final[0])
print(len(testing_labels_final))

<DatasetV1Adapter shapes: ((), ()), types: (tf.string, tf.int64)>
<DatasetV1Adapter shapes: ((), ()), types: (tf.string, tf.int64)>
b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
25000
0
25000
b"There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MA
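
The `b"..."` wrapper in the printed reviews comes from calling `str()` on a bytes object. A minimal sketch of the difference, using a made-up byte string (not taken from the dataset) in place of the value returned by `s.numpy()`:

```python
# Hypothetical review text as raw bytes, standing in for s.numpy().
raw = b"Great film. Don't miss it."

as_str = str(raw)              # keeps the bytes-literal wrapper: b"..."
decoded = raw.decode("utf-8")  # plain text without the wrapper

print(as_str)
print(decoded)
```

With `str()`, the leading `b"` and the closing quote become part of the text that is later tokenised, which is why tokens such as `b'i` show up when a review is decoded.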

## Vocabulary, Tokenizing and Padding the texts:

In [0]:
vocab_size = 10000
oov_tok = '<OOV>'

max_length = 120
trunc_type = 'post'

embedding_dim = 16


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Building the Vocabulary
tokeniser = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokeniser.fit_on_texts(training_sentences)
word_index = tokeniser.word_index

# Building Sequences and Padding for the training strings/sentences
train_sequences = tokeniser.texts_to_sequences(training_sentences)
train_padded = pad_sequences(train_sequences, maxlen=max_length, truncating=trunc_type)

# Building Sequences and Padding for the testing strings/sentences
# the test sequences are tokenised using the word_index built from the training data
test_sequences = tokeniser.texts_to_sequences(testing_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length, truncating=trunc_type)
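
To make the tokenising and padding steps concrete, here is a simplified pure-Python sketch of the same idea. The helper names (`build_word_index`, `to_sequence`, `pad`) and the toy sentences are assumptions for illustration; the real Keras `Tokenizer` also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids: OOV gets 1, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}
    for word, _ in counts.most_common():
        index[word] = len(index) + 1
    return index

def to_sequence(text, index, num_words, oov_token="<OOV>"):
    """Replace each word by its id; unseen words and ids >= num_words become OOV."""
    oov = index[oov_token]
    ids = []
    for word in text.lower().split():
        i = index.get(word, oov)
        ids.append(i if i < num_words else oov)
    return ids

def pad(seq, maxlen, truncating="post"):
    """Cut to maxlen, then left-pad with zeros (pre-padding)."""
    seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["the film was good", "the film was bad bad"]  # toy training texts
index = build_word_index(texts)

unseen = pad(to_sequence("the film was awful", index, num_words=100), maxlen=6)
cut = pad(to_sequence(texts[1], index, num_words=100), maxlen=3)
print(unseen)  # "awful" was never seen, so it maps to the OOV id
print(cut)     # longer than maxlen, so the end is cut off
```

Zeros are added at the front because `pad_sequences` pads with `padding='pre'` by default, while `truncating='post'` drops tokens from the end of long reviews.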

In [0]:
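sentence = "I really think this is amazing. honest."
# texts_to_sequences expects a list of texts. A bare string is iterated
# character by character, which produces the per-character output below.
# To tokenise whole words, pass a list instead: tokeniser.texts_to_sequences([sentence])
sequence = tokeniser.texts_to_sequences(sentence)
print(sequence)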

[[11], [], [1431], [966], [4], [1537], [1537], [4715], [], [790], [2019], [11], [2929], [2184], [], [790], [2019], [11], [579], [], [11], [579], [], [4], [1782], [4], [4517], [11], [2929], [1275], [], [], [2019], [1003], [2929], [966], [579], [790], []]
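
The output above has one entry per character because a string is itself an iterable of characters. A minimal sketch with a toy stand-in for `texts_to_sequences` (the function `texts_to_ids` and the three-word vocabulary are hypothetical):

```python
def texts_to_ids(texts, index):
    """Toy stand-in for texts_to_sequences: one id list per item in `texts`."""
    return [[index[w] for w in item.lower().split() if w in index] for item in texts]

index = {"i": 1, "think": 2, "this": 3}  # hypothetical toy vocabulary
sentence = "I think this"

per_char = texts_to_ids(sentence, index)    # iterates over the characters
per_word = texts_to_ids([sentence], index)  # iterates over one sentence

print(len(per_char), per_char[0])
print(per_word)
```

Wrapping the sentence in a list yields a single sequence of word ids, which is the shape `pad_sequences` and the model expect.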


## NN with Embedding and Flatten

In [0]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
                             tf.keras.layers.Flatten(),  # in this case, after Embedding we can also use GlobalAveragePooling1D
                             tf.keras.layers.Dense(6, activation='relu'),
                             tf.keras.layers.Dense(1, activation='sigmoid')  # it's a binary classification problem
])

model.summary()


# The optimizer is passed by name, so no optimizer class needs to be imported.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 11526     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
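
The parameter counts in the summary can be reproduced by hand. A short sketch of the arithmetic, using the hyperparameters defined earlier in the notebook:

```python
vocab_size, embedding_dim, max_length = 10000, 16, 120

embedding_params = vocab_size * embedding_dim  # one 16-dim vector per token id
flatten_units = max_length * embedding_dim     # Flatten has no weights, only reshapes
dense_params = flatten_units * 6 + 6           # weights plus one bias per unit
output_params = 6 * 1 + 1                      # weights plus bias of the sigmoid unit

total = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total)
```

This prints `160000 1920 11526 7 171533`, matching the summary. Most of the weights after the embedding sit in the first Dense layer because Flatten feeds it all 1920 values.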


In [0]:
# Training the NN:

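class myCallback(tf.keras.callbacks.Callback):
  # Stops training as soon as the training accuracy exceeds the target.
  def on_epoch_end(self, epoch, logs=None):
    targ = 0.99
    logs = logs or {}  # avoid a mutable default argument
    if logs.get('accuracy', 0) > targ:
      print('Reached %0.1f%% training accuracy. Training converged and stopping!' %(targ*100))
      self.model.stop_training = True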

callback=myCallback()


model.fit(train_padded, training_labels_final, epochs=10, callbacks=[callback], validation_data=(test_padded, testing_labels_final), verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<tensorflow.python.keras.callbacks.History at 0x7f16bc649e48>

## Visualising the Embedding into a Projector to look at the Sentiment

In [0]:
# Layer weights of the Embedding layer (it's layer 0):

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

(10000, 16)


In [0]:
# Reverse the word index (index -> word) and use it to decode a padded sequence back into words:

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(train_padded[1]))
print(training_sentences[1])

? ? ? ? ? ? ? b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <OOV> and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own <OOV> without any real concern for anything else i cant recommend this film at all '
b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of 
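
The leading `?` marks in the decoded review are padding: id 0 is never assigned to a word, so the lookup falls back to `'?'`. A minimal sketch with a hypothetical three-entry vocabulary:

```python
word_index = {"<OOV>": 1, "the": 2, "film": 3}  # hypothetical toy vocabulary
reverse_word_index = {i: w for w, i in word_index.items()}

def decode_review(ids):
    """Map ids back to words; ids without a word (such as padding 0) become '?'."""
    return " ".join(reverse_word_index.get(i, "?") for i in ids)

padded = [0, 0, 2, 3, 1]
with_padding = decode_review(padded)
without_padding = decode_review([i for i in padded if i != 0])

print(with_padding)
print(without_padding)
```

Filtering out zeros before decoding removes the padding markers while leaving `<OOV>` tokens visible.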

In [0]:
# To create the vectors to visualise in the embedding projector:

import io

# vecs.tsv holds one embedding vector per line, meta.tsv the matching word.
# Index 0 is skipped because it is the padding id and has no word.
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
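
The projector matches vectors to labels by row, so the two files need the same number of lines in the same order. A small self-contained sketch that writes two toy files to a temporary directory and checks this (the weights and words here are made up):

```python
import os
import tempfile

weights = [[0.1, -0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical; row 0 is the padding id
reverse_word_index = {1: "<OOV>", 2: "film"}     # hypothetical toy vocabulary

tmp = tempfile.mkdtemp()
vec_path = os.path.join(tmp, "vecs.tsv")
meta_path = os.path.join(tmp, "meta.tsv")

with open(vec_path, "w", encoding="utf-8") as out_v, \
     open(meta_path, "w", encoding="utf-8") as out_m:
    for word_num in range(1, len(weights)):  # skip the padding row
        out_m.write(reverse_word_index[word_num] + "\n")
        out_v.write("\t".join(str(x) for x in weights[word_num]) + "\n")

# Read both files back and confirm they are row-aligned.
with open(vec_path, encoding="utf-8") as f:
    vec_rows = [line.rstrip("\n").split("\t") for line in f]
with open(meta_path, encoding="utf-8") as f:
    meta_rows = [line.rstrip("\n") for line in f]

print(len(vec_rows), len(meta_rows))
```

The same line-count check can be run on the real `vecs.tsv` and `meta.tsv` before uploading them.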

In [0]:
# Download the vector and metadata files (only works inside Colab)

try:
  from google.colab import files
except ImportError:
  pass
else:
  # download the files from Colab to the local machine:
  files.download('vecs.tsv')
  files.download('meta.tsv')

## NN with Embedding and GlobalAveragePooling1D

In [0]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
                             tf.keras.layers.GlobalAveragePooling1D(),  # in this case, after Embedding we can use Flatten too
                             tf.keras.layers.Dense(6, activation='relu'),
                             tf.keras.layers.Dense(1, activation='sigmoid')  # it's a binary classification problem
])

model.summary()


# The optimizer is passed by name, so no optimizer class needs to be imported.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 102       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 7         
Total params: 160,109
Trainable params: 160,109
Non-trainable params: 0
_________________________________________________________________
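
Global average pooling replaces the 120 x 16 grid of embeddings with a single 16-value vector by averaging over the sequence axis, which is why the first Dense layer shrinks from 11,526 to 102 parameters. A pure-Python sketch with a hypothetical 5-token, 4-dimensional embedding table:

```python
toy_dim = 4
table = [[float(i)] * toy_dim for i in range(5)]  # hypothetical embedding table
token_ids = [0, 0, 2, 4]                          # a padded toy sequence

looked_up = [table[i] for i in token_ids]         # one row per token: 4 x 4
pooled = [sum(col) / len(looked_up) for col in zip(*looked_up)]  # mean over tokens
print(pooled)

# Parameter counts for the model above (vocab 10000, embedding_dim 16):
dense_params = 16 * 6 + 6
total = 10000 * 16 + dense_params + (6 * 1 + 1)
print(dense_params, total)
```

In this sketch the padded positions are included in the mean, which is also what happens in the model above, since the Embedding layer is not configured to mask zeros.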


In [0]:
# Training the NN:

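class myCallback(tf.keras.callbacks.Callback):
  # Stops training as soon as the training accuracy exceeds the target.
  def on_epoch_end(self, epoch, logs=None):
    targ = 0.99
    logs = logs or {}  # avoid a mutable default argument
    if logs.get('accuracy', 0) > targ:
      print('Reached %0.1f%% training accuracy. Training converged and stopping!' %(targ*100))
      self.model.stop_training = True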

callback=myCallback()


history = model.fit(train_padded, training_labels_final, epochs=10, callbacks=[callback], validation_data=(test_padded, testing_labels_final), verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f4a870b8978>