<a href="https://colab.research.google.com/github/fboldt/aulasann/blob/main/aula10e_embedding_bidir_lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drin

In [2]:
import os, pathlib, shutil, random

base_dir = pathlib.Path('aclImdb')
train_dir = base_dir / 'train'
val_dir = base_dir / 'val'
# Move 20% of the training files of each class into a validation directory.
for category in ('neg', 'pos'):
  new_dir = val_dir / category
  # Skip the split if it was already done (e.g. when the cell is re-run).
  if not new_dir.exists():
    os.makedirs(new_dir)
    # Sort before shuffling: os.listdir order is arbitrary, so sorting keeps
    # the seeded shuffle reproducible across machines.
    files = sorted(os.listdir(train_dir / category))
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
      shutil.move(train_dir / category / fname,
                  val_dir / category / fname)
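
The split above can be tried out safely on a throwaway directory before touching the real dataset. The following is a minimal sketch, not the notebook's own code: the helper name `split_validation`, its parameters, and the fake file names are all hypothetical. It also guards against the case where a category is so small that `int(fraction * len(files))` is 0, in which case the slice `files[-0:]` would otherwise select every file.

```python
import pathlib
import random
import shutil
import tempfile

def split_validation(train_dir, val_dir, categories=("neg", "pos"),
                     fraction=0.2, seed=1337):
    """Move `fraction` of each category's files from train_dir to val_dir.

    Hypothetical helper for illustration; returns the number of files moved
    per category.
    """
    moved = {}
    for category in categories:
        src = pathlib.Path(train_dir) / category
        dst = pathlib.Path(val_dir) / category
        dst.mkdir(parents=True, exist_ok=True)
        # Sort first: directory listing order is not guaranteed, so sorting
        # makes the seeded shuffle select the same files on every machine.
        files = sorted(p.name for p in src.iterdir())
        random.Random(seed).shuffle(files)
        num_val = int(fraction * len(files))
        # files[-0:] would be the whole list, so handle num_val == 0 explicitly.
        val_files = files[-num_val:] if num_val else []
        for fname in val_files:
            shutil.move(str(src / fname), str(dst / fname))
        moved[category] = len(val_files)
    return moved

# Demo on a temporary tree with 10 fake reviews per class.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for category in ("neg", "pos"):
        (root / "train" / category).mkdir(parents=True)
        for i in range(10):
            (root / "train" / category / f"{i}_0.txt").write_text("review")
    demo_moved = split_validation(root / "train", root / "val")
    demo_left = len(list((root / "train" / "pos").iterdir()))

print(demo_moved, demo_left)  # {'neg': 2, 'pos': 2} 8
```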


In [3]:
from tensorflow import keras
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(
    'aclImdb/val', batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(
    'aclImdb/test', batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
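
As a quick sanity check on the file counts reported above, the number of batches per epoch follows directly from the batch size. This small sketch only redoes that arithmetic (the dictionary name `num_files` is just for illustration); the 625 training steps it gives match the `625/625` shown in the training log.

```python
import math

batch_size = 32
# File counts as reported by text_dataset_from_directory above.
num_files = {"train": 20000, "val": 5000, "test": 25000}

# The last batch may be smaller than batch_size, hence the ceiling.
steps = {name: math.ceil(n / batch_size) for name, n in num_files.items()}
print(steps)  # {'train': 625, 'val': 157, 'test': 782}
```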


In [15]:
from tensorflow.keras.layers import TextVectorization

max_length = 600    # truncate or pad every review to 600 tokens
max_tokens = 20000  # vocabulary size, including the padding and OOV entries
text_vectorization = TextVectorization(
    max_tokens=max_tokens,
    output_mode='int',
    output_sequence_length=max_length)
# Build the vocabulary from the training texts only (labels dropped).
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
# Turn each batch of raw strings into a batch of integer token indices.
vect_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
vect_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
vect_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
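
To make the integer encoding less of a black box, here is a rough pure-Python approximation of what an `'int'`-mode vectorizer does: lowercase, strip punctuation, split on whitespace, rank tokens by frequency, reserve index 0 for padding and index 1 for out-of-vocabulary tokens, then pad or truncate to a fixed length. It is a sketch under stated assumptions, not the Keras implementation: the function names are hypothetical, and the alphabetical tie-breaking between equally frequent tokens is an assumption.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (approximates the default)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocabulary(texts, max_tokens):
    """Return a vocabulary list whose position is the token's integer index."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent tokens first. Ties are broken alphabetically here;
    # that detail is an assumption made for this sketch.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Index 0 is reserved for padding and index 1 for unknown tokens, so only
    # max_tokens - 2 real tokens fit.
    return ["", "[UNK]"] + ranked[:max_tokens - 2]

def vectorize(text, vocabulary, sequence_length):
    """Encode text as exactly sequence_length integer indices."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]                      # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))  # pad short inputs

corpus = ["The movie was great, great fun!", "The movie was boring."]
vocab = build_vocabulary(corpus, max_tokens=6)
encoded = vectorize("The movie was truly great", vocab, sequence_length=8)
print(vocab)    # ['', '[UNK]', 'great', 'movie', 'the', 'was']
print(encoded)  # [4, 3, 5, 1, 2, 0, 0, 0]
```

The same two reserved indices are visible in the batch printed by the next cell: the occasional `1` marks a word outside the 20000-token vocabulary, and the trailing `0`s are padding up to `max_length`.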

In [16]:
for inputs, targets in vect_train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32, 600)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(
[ 4371 16654    77  2453    11   338   352   738    47     5     2   112
   346     5  4371    41   396    65   150    30    39   147     3    59
     1   223   395  9342    13     2  4371   346   190   167    34  5413
  1655     8     2   411     5  2238 13693    36  3089     3   338   298
     5    21     2  5926   523     5     2   879    10    66   110   540
    40  1997   157 13693   904  4239    39 11034     1   176     3  7070
     1    13    65 11608   165     7  6528    15    74    17     2   324
   450     5    30    62  4974     3   171     5   455  2050    11    20
     7     6    28   284     4 12705    59    86   365  1208    41     3
   171     5    50   223     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0 

In [18]:
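from tensorflow import keras
from tensorflow.keras import layers

# Variable-length sequences of integer token indices.
inputs = keras.Input(shape=(None,), dtype='int64')
# Learn a 64-dimensional vector per token. mask_zero keeps its default
# (False), so the padding zeros are fed to the LSTM like any other token.
embedded = layers.Embedding(input_dim=max_tokens,
                            output_dim=64)(inputs)
# One LSTM reads the sequence forward and another reads it backward; their
# final 32-dimensional states are concatenated into a 64-dimensional vector.
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
# Single sigmoid unit for binary (positive/negative) classification.
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()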
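
The output of `model.summary()` is not shown here, but the trainable parameter counts can be worked out by hand from the layer sizes. The sketch below does that arithmetic with hypothetical helper functions, assuming the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and one bias vector, and the default concatenation of the two directions.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary entry.
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four gates (input, forget, cell, output), each with an input kernel,
    # a recurrent kernel, and a bias vector.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

max_tokens, embed_dim, lstm_units = 20000, 64, 32

emb = embedding_params(max_tokens, embed_dim)
# Bidirectional wraps two independent LSTMs (forward and backward).
bilstm = 2 * lstm_params(embed_dim, lstm_units)
# The two 32-dimensional outputs are concatenated, so Dense sees 64 inputs.
dense = dense_params(2 * lstm_units, 1)
total = emb + bilstm + dense

print(emb, bilstm, dense, total)  # 1280000 24832 65 1304897
```

Almost all of the parameters sit in the embedding table; the recurrent part of the model is comparatively tiny. Dropout adds no parameters.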

In [19]:
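callbacks = [
    # Save the model only when val_loss (the default monitored metric) improves.
    keras.callbacks.ModelCheckpoint("vect.keras",
                                    save_best_only=True)
]
# cache() keeps the vectorized batches in memory after the first epoch.
history = model.fit(vect_train_ds.cache(),
                    validation_data=vect_val_ds.cache(),
                    epochs=10,
                    callbacks=callbacks)
# Reload the best checkpoint before evaluating on the test set.
model = keras.models.load_model("vect.keras")
test_loss, test_acc = model.evaluate(vect_test_ds)
print(f"Test Loss: {test_loss:.3f}, Test Accuracy: {test_acc:.3f}")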

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 42ms/step - accuracy: 0.6135 - loss: 0.6335 - val_accuracy: 0.8146 - val_loss: 0.4295
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 25s 39ms/step - accuracy: 0.8384 - loss: 0.4010 - val_accuracy: 0.8494 - val_loss: 0.3530
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 25s 40ms/step - accuracy: 0.8811 - loss: 0.3209 - val_accuracy: 0.8626 - val_loss: 0.3516
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 39ms/step - accuracy: 0.9011 - loss: 0.2748 - val_accuracy: 0.8698 - val_loss: 0.3098
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 39ms/step - accuracy: 0.9185 - loss: 0.2313 - val_accuracy: 0.8884 - val_loss: 0.3140
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 39ms/step - accuracy: 0.9309 - loss: 0.2064 - val_accuracy: 0.8868 - val_loss: 0.3336
Epoch 7/10
[1m6