Dataset source: Large Movie Review Dataset (IMDB), https://ai.stanford.edu/~amaas/data/sentiment/

In [64]:
import tensorflow as tf
import numpy as np

Downloading data

In [65]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  9065k      0  0:00:09  0:00:09 --:--:-- 15.7M


In [66]:
# The unsup folder holds unlabeled reviews; remove it so only the neg and pos class folders remain
!rm -r aclImdb/train/unsup

In [67]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [68]:
!ls aclImdb/train

labeledBow.feat  pos		urls_neg.txt  urls_unsup.txt
neg		 unsupBow.feat	urls_pos.txt


In [69]:
!cat aclImdb/test/pos/1181_10.txt

Movies aren't always suppose to be about deep, provolking thoughts. Sometimes they're simply meant to be escapes from reality. Out To Sea fits the bill perfectly. <br /><br />A light hearted "golden years" romantic comedy, Out To Sea may not be big budget, you might be able to easily tell when they were acting in front of a green screen, but it's still very much a movie worth watching. A sweet movie that needs to be given a break. <br /><br />This is just good, light hearted fun. It's not meant to be a deep movie. It's something worth watching. If for nothing else, you must see it for Brent Spiner's humorously stiff and uptight rendition of Oye Como Va. Gil is a character you love to hate and Mr. Spiner pulls off the perfect evil comic foil to two beloved comedy movie gods.

Loading the datasets

In [95]:
ds_train_raw = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    validation_split = 0.3,
    subset = 'training',
    shuffle = False, # keeps files in sorted order (all neg, then pos), so the
                     # split is deterministic and train/val do not overlap, but
                     # the two subsets end up with skewed class proportions
)

Found 25000 files belonging to 2 classes.
Using 17500 files for training.


In [71]:
print("Class Names:", ds_train_raw.class_names)

Class Names: ['neg', 'pos']


In [72]:
ds_val_raw = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    validation_split = 0.3,
    subset = 'validation',
    shuffle = False,
)

Found 25000 files belonging to 2 classes.
Using 7500 files for validation.
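
With shuffle = False the file list stays sorted, so a 70/30 cut does not mix the classes. The sketch below illustrates this in plain Python. It assumes that files are ordered class by class and that the validation subset is taken from the end of the list; the helper names and the seed value 1337 are hypothetical. It also shows that shuffling with the same seed in both calls yields mixed, non-overlapping subsets.

```python
import random
from collections import Counter

N_PER_CLASS = 12500
indices = list(range(2 * N_PER_CLASS))


def label_of(index):
    """Sorted order: the first 12,500 files are 'neg' (0), the rest 'pos' (1)."""
    return 0 if index < N_PER_CLASS else 1


def split(items, validation_split, subset, seed=None):
    """Return the 'training' or 'validation' part, optionally after a seeded shuffle."""
    items = list(items)
    if seed is not None:
        random.Random(seed).shuffle(items)
    n_val = int(validation_split * len(items))
    return items[:-n_val] if subset == "training" else items[-n_val:]


# No shuffling: the cut falls inside the 'pos' block.
train_sorted = split(indices, 0.3, "training")
val_sorted = split(indices, 0.3, "validation")
print(Counter(label_of(i) for i in train_sorted))  # Counter({0: 12500, 1: 5000})
print(Counter(label_of(i) for i in val_sorted))    # Counter({1: 7500})

# Same seed in both calls: both subsets contain both classes and share no file.
train_mixed = split(indices, 0.3, "training", seed=1337)
val_mixed = split(indices, 0.3, "validation", seed=1337)
print(Counter(label_of(i) for i in val_mixed))
print(set(train_mixed).isdisjoint(val_mixed))  # True
```

In text_dataset_from_directory the corresponding change would be to drop shuffle = False and pass the same seed value to both the training and the validation call.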


In [73]:
print("Class Names:", ds_val_raw.class_names)

Class Names: ['neg', 'pos']


In [74]:
ds_test_raw = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/test"
)

Found 25000 files belonging to 2 classes.


In [75]:
print("Class Names:", ds_test_raw.class_names)

Class Names: ['neg', 'pos']


In [76]:
print(f"Number of batches in ds_train_raw: {ds_train_raw.cardinality()}")
print(f"Number of batches in ds_val_raw: {ds_val_raw.cardinality()}")
print(f"Number of batches in ds_test_raw: {ds_test_raw.cardinality()}")

Number of batches in ds_train_raw: 547
Number of batches in ds_val_raw: 235
Number of batches in ds_test_raw: 782
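
The batch counts follow from the file counts reported by the loaders. A quick check, assuming the Keras default batch size of 32 (the loaders above pass no batch_size) and that the last, partial batch is kept:

```python
import math

BATCH_SIZE = 32  # assumption: default batch size of text_dataset_from_directory

file_counts = {"ds_train_raw": 17500, "ds_val_raw": 7500, "ds_test_raw": 25000}

# A partial final batch still counts as a batch, hence the ceiling.
batch_counts = {name: math.ceil(n / BATCH_SIZE) for name, n in file_counts.items()}

for name, n_batches in batch_counts.items():
    print(f"{name}: {file_counts[name]} files -> {n_batches} batches")
```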


Quick preview:

In [77]:
for text_batch, label_batch in ds_train_raw.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

b"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
0
b"Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son

Data Preparation

In [78]:
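# TextVectorization left the experimental namespace in newer TensorFlow releases
try:
    from tensorflow.keras.layers import TextVectorization
except ImportError:
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization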
import string
import re

In [79]:
def standardize(data):
    # Lowercase, replace the HTML line breaks with spaces, then strip punctuation
    lowercase_data = tf.strings.lower(data)
    without_br = tf.strings.regex_replace(lowercase_data, "<br />", " ")
    return tf.strings.regex_replace(
        without_br, f"[{re.escape(string.punctuation)}]", ""
    )
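
To see what the standardization does to a single review, here is a plain-Python counterpart that needs no TensorFlow. The function name standardize_py and the sample string are hypothetical; the three steps mirror the ones in standardize.

```python
import re
import string


def standardize_py(text):
    """Plain-Python counterpart of standardize() for one string."""
    lowercase = text.lower()
    without_br = lowercase.replace("<br />", " ")
    # Remove every ASCII punctuation character
    return re.sub(f"[{re.escape(string.punctuation)}]", "", without_br)


cleaned = standardize_py("Great movie!<br />It's FUN.")
print(cleaned)  # great movie its fun
```

Note that the apostrophe is removed too, so "it's" becomes "its".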

In [80]:
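vectorize_layer = TextVectorization(
    standardize=standardize,
    max_tokens=20000,            # vocabulary size, including the padding and unknown-word ids
    output_mode="int",           # one integer id per word
    output_sequence_length=500,  # pad or truncate every review to 500 ids
)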

In [81]:
# Drop the labels: adapt() only needs the text to build the vocabulary
text_ds = ds_train_raw.map(lambda x, y: x)

In [82]:
vectorize_layer.adapt(text_ds)
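
Conceptually, adapt() builds a frequency-ranked vocabulary, and the layer then maps each review to a fixed-length list of word ids. The sketch below is a simplified plain-Python illustration, not the layer's actual implementation. It assumes id 0 is reserved for padding and id 1 for unknown words; build_vocab, encode, and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent words to ids, reserving 0 (padding) and 1 (unknown)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(most_common, start=2)}


def encode(text, vocab, sequence_length):
    """Turn a string into a fixed-length list of ids, truncating or zero-padding."""
    ids = [vocab.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = ["the movie was good", "the movie was bad", "the movie ended", "the end"]
vocab = build_vocab(corpus, max_tokens=5)
print(vocab)  # {'the': 2, 'movie': 3, 'was': 4}

encoded = encode("the film was good", vocab, sequence_length=6)
print(encoded)  # [2, 1, 4, 1, 0, 0]
```

Here "film" and "good" fall outside the five-token vocabulary and map to the unknown id, and the two trailing zeros are padding.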

Data vectorization

In [83]:
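def vectorize_text(text, label):
    # Add a trailing axis so each review is passed as a one-element row
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

# Convert every batch of raw strings into a batch of integer sequences
train_ds = ds_train_raw.map(vectorize_text)
val_ds = ds_val_raw.map(vectorize_text)
test_ds = ds_test_raw.map(vectorize_text)

# Cache the vectorized batches and overlap data preparation with training
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)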

Build the model

In [93]:
from tensorflow.keras import layers

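# Variable-length sequences of integer word ids
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# Map each of the 20000 vocabulary ids to a 128-dimensional vector
x = layers.Embedding(20000, 128)(inputs)
x = layers.Dropout(0.5)(x)

# Two strided 1D convolutions, then max pooling over the sequence axis
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# Hidden dense layer with dropout
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# Single sigmoid unit: probability that the review is positive
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = tf.keras.Model(inputs, predictions)

# Binary cross-entropy matches the 0/1 labels (neg = 0, pos = 1)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])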
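
As a sanity check on the architecture, the sequence lengths after each convolution and the number of trainable parameters can be worked out by hand. The sketch below assumes inputs of length 500 (the output_sequence_length set earlier) and uses the standard output-length formula for "valid" padding; the helper name is hypothetical.

```python
def conv1d_output_length(length, kernel_size, strides):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // strides + 1


after_conv1 = conv1d_output_length(500, kernel_size=7, strides=3)
after_conv2 = conv1d_output_length(after_conv1, kernel_size=7, strides=3)
print(after_conv1, after_conv2)  # 165 53

# Parameter counts: weights plus biases per layer
embedding_params = 20000 * 128
conv_params = 7 * 128 * 128 + 128      # kernel * in_channels * filters + biases
dense_params = 128 * 128 + 128
output_params = 128 * 1 + 1
total_params = embedding_params + 2 * conv_params + dense_params + output_params
print(total_params)  # 2806273
```

The embedding table accounts for most of the parameters, and GlobalMaxPooling1D reduces the 53 remaining positions to a single 128-dimensional vector.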

Train the model

In [94]:
model.fit(train_ds, validation_data=val_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3c9ef4c0a0>

Model Evaluation

In [97]:
loss, accuracy = model.evaluate(test_ds)



In [98]:
print(f"Loss is {loss} accuracy is {accuracy}")

Loss is 0.8944777250289917 accuracy is 0.5
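
An accuracy of exactly 0.5 on a balanced test set is what a classifier scores when it predicts the same class for every review. The outputs above do not confirm that this is what happened here, so the following is only a reference point: a plain-Python sketch of the scores of a constant predictor, assuming the test folder holds 12,500 neg and 12,500 pos reviews. The function names are hypothetical.

```python
import math

# Assumption: balanced test set, 12,500 'neg' (0) and 12,500 'pos' (1)
labels = [0] * 12500 + [1] * 12500


def constant_accuracy(labels, predicted_class):
    """Accuracy when every review is assigned the same class."""
    return sum(label == predicted_class for label in labels) / len(labels)


def constant_bce(labels, p):
    """Mean binary cross-entropy when every review gets probability p of 'pos'."""
    losses = [-math.log(p) if label == 1 else -math.log(1 - p) for label in labels]
    return sum(losses) / len(losses)


print(constant_accuracy(labels, 0))  # 0.5
print(constant_accuracy(labels, 1))  # 0.5
print(constant_bce(labels, 0.5))     # about 0.693, i.e. ln 2
```

The reported loss of about 0.894 is higher than ln 2, the loss of always predicting 0.5. Checking the spread of model.predict(test_ds) and the class balance and ordering of the unshuffled training split would be reasonable next steps.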
