# Loading Text in TF 
We will use two ways to load and preprocess text data - Keras and and low level utility tf.data.TextLineDataset

In [62]:
! pip install -q -U tf-nightly


In [63]:
# we have to make sure that we have tf and tf text installed

! pip install -q -U tensorflow-text-nightly

In [64]:
import collections
import pathlib
import re 
import string 

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses 
from tensorflow.keras import preprocessing
from tensorflow.keras import utils 
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds 
import tensorflow_text as tf_text

## Problem 1 : Preditct the tag for a Stack Overflow question 

Our task is to develope a ML model that predicts the tag for a question. This is an example of multi class classification.

In [65]:
# download and explore the data

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar = True,
    cache_dir = 'stack_overflow',
    cache_subdir = ''
)

dataset_dir = pathlib.Path(dataset).parent

In [66]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz'),
 PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/README.md')]

In [67]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python')]

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question. Print a file and inspect the data.


In [68]:
sample_file = train_dir/'python/1755.txt'
with open(sample_file) as f :
  print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x



## Load the dataset 

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you're new to tf.data, it's a powerful collection of tools for building input pipelines.


In [69]:
# lets make a validation split 

batch_size = 32
seed = 42 

raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size = batch_size,
    validation_split = 0.2,
    subset = 'training',
    seed = seed
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [70]:
# looking at the data 

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question:", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Question: b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default cons

In [71]:
# lets check which labels correspond to 0,1,2,3

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python


Next, you will create a validation and test dataset. You will use the remaining 1,600 reviews from the training set for validation.

In [72]:
raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size = batch_size,
    validation_split = 0.2,
    subset = 'validation',
    seed = seed 
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.


In [73]:
test_dir = dataset_dir/'test'

raw_test_ds = preprocessing.text_dataset_from_directory(
    test_dir, batch_size = batch_size
)

Found 8000 files belonging to 4 classes.


## Vectorization 

We will use a module to standardize, tokenize and vectorize the text

In [74]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens = VOCAB_SIZE,
    output_mode = 'binary'
)

For int mode, in addition to maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly sequence_length values.

In [75]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens = VOCAB_SIZE,
    output_mode = 'int',
    output_sequence_length = MAX_SEQUENCE_LENGTH
)

Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers.

In [76]:
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

See the result of using these layers to preprocess data:



In [77]:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

def int_vectorize_text(text,label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [78]:
# retrive a batch of 32 reviews and labels from the dataset 

text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)


Now we will see how these texts look like in binary and int vectorization

In [79]:
print("'binary' vectorized question:", 
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)


In [80]:
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[ 55   6   2 410 211 229 121 895   4 124  32 245  43   5   1   1   5   1
    1   6   2 410 211 191 318  14   2  98  71 188   8   2 199  71 178   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


binary mode returns an array denoting which tokens exist at least once in the input, while int mode replaces each token by an integer, thus preserving their order. You can lookup the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.

In [81]:
print("1289 --->",int_vectorize_layer.get_vocabulary()[1289])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 ---> roman
Vocabulary size: 10000


In [82]:
# finally applying the textvect layer 

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

## Configure the dataset for performance 
These are two important methods you should use when loading data to make sure that I/O does not become blocking.

.cache and .prefetch


In [83]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [84]:
# now configure the dataset . 

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

In [85]:
# training 
# we will use both binary and int model and comare their performance

binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
    loss = losses.SparseCategoricalCrossentropy(from_logits= True),
    optimizer = 'adam',
    metrics = ['accuracy']
)

history = binary_model.fit(
    binary_train_ds, validation_data = binary_val_ds, epochs = 10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Next, you will use the int vectorized layer to build 1D ConvNet

In [86]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
                               layers.Embedding(vocab_size, 64, mask_zero= True),
                               layers.Conv1D(64, 5, padding = "valid", activation = "relu", strides=2),
                               layers.GlobalMaxPooling1D(),
                               layers.Dense(num_labels)
  ])

  return model

In [87]:
# vocab_size is VOCAB_SIZE + 1 since 0 is used additionally for padding.


int_model = create_model(vocab_size= VOCAB_SIZE +1, num_labels= 4)
int_model.compile(
    loss = losses.SparseCategoricalCrossentropy(from_logits= True),
    optimizer = 'adam',
    metrics = ['accuracy']
)

history = int_model.fit(int_train_ds, validation_data = int_val_ds, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [88]:
# compare the two models 

print("Linear model on binary vectorized data:")
print(binary_model.summary())

Linear model on binary vectorized data:
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 4)                 40004     
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None


In [89]:
print("ConvNet model on int vectorized data:")
print(int_model.summary())

ConvNet model on int vectorized data:
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 64)          640064    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 260       
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None


In [90]:
# Evaluate both models 

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))


Binary model accuracy: 81.45%
Int model accuracy: 80.69%


## Export the model 



In [91]:
export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')]
)

export_model.compile(
    loss = losses.SparseCategoricalCrossentropy(from_logits= False),
    optimizer = 'adam',
    metrics = ['accuracy']
)

#testing testing 

loss,accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(binary_accuracy))

Accuracy: 81.45%


## Finally, prediction on new data 



In [92]:
def get_string_labels(predicted_score_batch):
  predicted_int_labels = tf.argmax(predicted_score_batch, axis =1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels 

In [93]:
# Run new data 

inputs = [
    "how do I extract keys from a dict into a list?",  # python
    "debug public static void main(string[] args) {...}",  # java
]

predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)

for input, label in zip(inputs, predicted_labels):
  print("Question:", input)
  print("Predicted label:", label.numpy())

Question: how do I extract keys from a dict into a list?
Predicted label: b'python'
Question: debug public static void main(string[] args) {...}
Predicted label: b'java'


## Problem 2 : Predict the author of Illiad translations 

The following provides an example of using tf.data.TextLineDataset to load examples from text files, and tf.text to preprocess the data. In this example, you will use three different English translations of the same work, Homer's Illiad, and train a model to identify the translator given a single line of text.

The three authors we want to predict are William Cowper, Edward Earl and Samuel Butler

In [94]:
# their translated works, downloading

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES: 
  text_dir = utils.get_file(name, origin = DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

[PosixPath('/root/.keras/datasets/derby.txt'),
 PosixPath('/root/.keras/datasets/butler.txt'),
 PosixPath('/root/.keras/datasets/cowper.txt')]

## Load the data 

In previous example we used test_dataset_from_directory which treats all contents of a file as a single example. In this problem, we will use TextLineDataset, which is designed to create a tf.data.Dataset from a text file in which each example is a line of text from the original file. 

In [95]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler (ex, i))
  labeled_data_sets.append(labeled_dataset)

In [96]:
# Now combine the labeled dataset and shuffle 

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000



In [97]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration = False
)

Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in all_labeled_data corresponds to one data point:

In [98]:
for text, label in all_labeled_data.take(10):
  print("Sentence:", text.numpy())
  print("Label:", label.numpy())

Sentence: b'"Hear me, ye Trojans, Dardans, and Allies!'
Label: 1
Sentence: b"The rapid runner's meed. First, he produced"
Label: 0
Sentence: b'fallen--he who was at once the right and might of Lycia; Mars has laid'
Label: 2
Sentence: b"These all obey'd four Chiefs, and galleys ten"
Label: 0
Sentence: b'And, diving deep into his host, escaped.'
Label: 0
Sentence: b'Without my aid; hath built a lofty wall,'
Label: 1
Sentence: b"To battle, and in accents wing'd began."
Label: 0
Sentence: b'rather by the hand of that brave man who was my husband. You used to'
Label: 2
Sentence: b"My spirit like thine is stirr'd; I feel my feet"
Label: 1
Sentence: b'son of Peleus--but he was a man of no substance, and had but a small'
Label: 2


## Prepare the dataset for training

Instead of using the Keras TextVectorization layer to preprocess our text dataset, you will now use the tf.text API to standardize and tokenize the data, build a vocabulary and use StaticVocabularyTable to map tokens to integers to feed to the model.

While tf.text provides various tokenizers, you will use the UnicodeScriptTokenizer to tokenize our dataset. Define a function to convert the text to lower-case and tokenize it. You will use tf.data.Dataset.map to apply the tokenization to the dataset.

In [99]:
tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
  lower_case  = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)

In [100]:
for text_batch in tokenized_ds.take(5):
  print("Tokens:", text_batch.numpy())

Tokens: [b'"' b'hear' b'me' b',' b'ye' b'trojans' b',' b'dardans' b',' b'and'
 b'allies' b'!']
Tokens: [b'the' b'rapid' b'runner' b"'" b's' b'meed' b'.' b'first' b',' b'he'
 b'produced']
Tokens: [b'fallen' b'--' b'he' b'who' b'was' b'at' b'once' b'the' b'right' b'and'
 b'might' b'of' b'lycia' b';' b'mars' b'has' b'laid']
Tokens: [b'these' b'all' b'obey' b"'" b'd' b'four' b'chiefs' b',' b'and'
 b'galleys' b'ten']
Tokens: [b'and' b',' b'diving' b'deep' b'into' b'his' b'host' b',' b'escaped' b'.']


In [101]:
# build the vocabulary 

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda:0)

for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok]+= 1

vocab = sorted(vocab_dict.items(), key = lambda x: x[1], reverse= True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size:", vocab_size)
print("First five vocab entries:", vocab[:5])

Vocab size: 10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']


To convert the tokens into integers, use the vocab set to create a StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2]. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.

In [102]:
keys = vocab
values = range(2, len(vocab) +2)

init  = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype = tf.string, value_dtype = tf.int64
)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

In [103]:
vocab_table

<tensorflow.python.ops.lookup_ops.StaticVocabularyTable at 0x7f082c47b550>

Finally, define a fuction to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

In [104]:
def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized  = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label 

## Check how it looks like

In [105]:
example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence:  b'"Hear me, ye Trojans, Dardans, and Allies!'
Vectorized sentence:  [  93  274   40    2  130   62    2 2618    2    4  749   59]


Now run the preprocess function on the dataset using tf.data.Dataset.map.



In [106]:
all_encoded_data = all_labeled_data.map(preprocess_text)

In [107]:
# splitting 

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)



Validation_data and train_data are not collections of examples but are collections of batches. Each batch has several examples. 

In [108]:
sample_text, sample_labels = next(iter(validation_data))

print("Text batch shape:" , sample_text.shape)
print("Label batch shape:", sample_labels.shape)

print("First text example:", sample_text[0])
print("First label example:", sample_labels[0])

Text batch shape: (64, 18)
Label batch shape: (64,)
First text example: tf.Tensor(
[  93  274   40    2  130   62    2 2618    2    4  749   59    0    0
    0    0    0    0], shape=(18,), dtype=int64)
First label example: tf.Tensor(1, shape=(), dtype=int64)


Since we use 0 for padding and 1 for out of vocabulary, now the vocab size has been increased by two 

In [109]:
vocab_size += 2


In [110]:
# configuring the dataset, with cache and prefetch 

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

In [111]:
# model training

model = create_model(vocab_size=vocab_size, num_labels= 3)
model.compile(
    optimizer = 'adam',
    loss = losses.SparseCategoricalCrossentropy(from_logits=  True),
    metrics = ['accuracy']

)

history = model.fit(train_data, validation_data = validation_data, epochs =3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [112]:
loss, accuracy = model.evaluate(validation_data)

print("Loss:", loss)
print("Accuracy:", accuracy)

Loss: 0.39295727014541626
Accuracy: 0.843999981880188


## Exporting the model

Now before saving the model, we have to mention the vocabulary. Since we have already made a vocabulary. We dont have to use the adapt method. Instead, we will mention our vocab set to the 'set_vocabulary' arguement.

In [113]:
preprocess_layer = TextVectorization(
    max_tokens = vocab_size,
    standardize = tf_text.case_fold_utf8,
    split = tokenizer.tokenize,
    output_mode = 'int',
    output_sequence_length = MAX_SEQUENCE_LENGTH
)

preprocess_layer.set_vocabulary(vocab)

# finally exporting the model 

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')]
)

export_model.compile(
    loss = losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer = 'adam',
    metrics = ['accuracy']
)

In [114]:
# checking the model performance on validation

test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss:", loss)
print("Accuracy:{:2.2%}".format(accuracy))

Loss: 0.522333025932312
Accuracy:79.68%


In [115]:
# model performance on new input text 

inputs = [
          "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
          
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question:", input)
  print("Predicted label:", label.numpy())

Question: Join'd to th' Ionians with their flowing robes,
Predicted label: 1
Question: the allies, and his armour flashed about him so that he seemed to all
Predicted label: 2
Question: And with loud clangor of his arms he fell.
Predicted label: 0


## Prediction of the data in a different dataset 

Lets check the model against a different dataset downloadabe from tf.

In [116]:
# defining train and validation set 

train_ds = tfds.load(
    'imdb_reviews',
    split = 'train',
    batch_size = BATCH_SIZE,
    shuffle_files = True,
    as_supervised = True
)


val_ds = tfds.load(
    'imdb_reviews',
    split = 'train',
    batch_size = BATCH_SIZE,
    shuffle_files = True,
    as_supervised = True
)


# print some

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print('Review:', review_batch[i].numpy())
    print('Label', label_batch[i].numpy())

Review: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label 0
Review: b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. 

In [117]:
# prepare the dataset for training 

vectorize_layer = TextVectorization(
    max_tokens = BATCH_SIZE,
    output_mode = 'int',
    output_sequence_length = MAX_SEQUENCE_LENGTH
)


train_text = train_ds.map(lambda text, labels : text)
vectorize_layer.adapt(train_text)

In [118]:
# there is another process of vectorization called vectorize_text

def vectorized_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

train_ds = train_ds.map(vectorized_text)
val_ds = val_ds.map(vectorized_text)

# configure dataset 

train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

# model training 

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels = 1)
model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 64)          640064    
_________________________________________________________________
conv1d_4 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________


In [119]:
model.compile(
    loss = losses.BinaryCrossentropy(from_logits= True),
    optimizer =  'adam',
    metrics = ['accuracy']
)

history = model.fit(
    train_ds , validation_data = val_ds, epochs = 3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [121]:
loss , accuracy = model.evaluate(val_ds)

print("Loss:", loss)

print("Accuracy:{:2.2%}".format(accuracy))

Loss: 0.549254834651947
Accuracy:70.47%


In [122]:
# export the model 

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')]
)

export_model.compile(
    loss = losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer = 'adam',
    metrics = ['accuracy']
)

## Check on the new data 

In [123]:
inputs = [
           "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie."
          
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round((x)[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Question:", input)
  print("Predicted label:", label)





Question: This is a fantastic movie.
Predicted label: 1
Question: This is a bad movie.
Predicted label: 1
Question: This movie was so bad that it was good.
Predicted label: 0
Question: I will never say yes to watching this movie.
Predicted label: 0
