# Natural Language Processing - IMDB

## Initialization

In [32]:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

## Loading the dataset

This notebook uses the [imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews?hl=en) dataset, which contains 50,000 movie reviews labeled as positive (1) or negative (0).

The following cell downloads the data directly from the TensorFlow Datasets repository.


In [6]:
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

2024-04-03 16:38:39.175567: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /Users/damoib/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Dataset imdb_reviews downloaded and prepared to /Users/damoib/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


2024-04-03 16:38:57.802323: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Pro
2024-04-03 16:38:57.802344: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 18.00 GB
2024-04-03 16:38:57.802350: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 6.00 GB
2024-04-03 16:38:57.802378: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-04-03 16:38:57.802387: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Next, the predefined training and test splits are selected, each with 25,000 records. The reviews are collected into Python lists and the labels are converted to NumPy arrays for processing.

In [7]:
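# Select the predefined train and test splits.
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []

# Each element is a (text, label) pair of tensors; .numpy() extracts the values.
# Note: str() on the bytes value keeps the b'...' wrapper in the resulting string.
for s, l in train_data:
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())

for s, l in test_data:
  testing_sentences.append(str(s.numpy()))
  testing_labels.append(l.numpy())

# Convert the label lists to NumPy arrays for model.fit.
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)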

2024-04-03 16:39:07.174534: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-04-03 16:39:08.557210: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Below is an example from the training set: one review (sentence) and its classification (label).

Running the cell shows that the first record is a negative review, so its label is 0.

Changing the value of *i* displays other examples.

In [16]:
i = 0
print(f"Review: {training_sentences[i]}")
print(f"Classification: {training_labels[i]}")

Review: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Classification: 0
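
The `b"..."` wrapper in the printed review comes from calling `str()` on a `bytes` object, which returns the literal representation of the bytes rather than the decoded text. The following is a minimal sketch of the difference, using a short sample string instead of the dataset; decoding with UTF-8 is an assumption about the encoding, not something the notebook does.

```python
# A bytes value similar to what s.numpy() returns for a review.
raw = b"Don't be lured in."

# str() keeps the bytes literal wrapper (prefix and quotes).
as_str = str(raw)

# decode() returns the plain text instead.
decoded = raw.decode("utf-8")

print(as_str)
print(decoded)
```

With `str()`, the prefix and quotes become part of the text passed to the tokenizer, so switching to `decode()` would change the resulting word index and all later outputs.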


## Data preprocessing

Next, the text reviews are converted to numeric values. The Tokenizer assigns an integer to each word in the reviews, so the model works with numeric sequences instead of words.

The vocabulary is limited to 10,000 words (the most frequent ones) and every sequence is padded or truncated to 100 tokens.

In [23]:
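vocab_size = 10000   # keep only the most frequent words
max_length = 100     # length of every padded sequence
trunc_type = 'post'  # drop tokens from the end of long reviews
oov_tok = "<OOV>"    # placeholder for out-of-vocabulary words

# Build the word index from the training reviews only.
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Training set: convert to integer sequences, then pad/truncate to max_length.
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

# Test set: no truncating argument is passed here, so pad_sequences uses its
# default ('pre') and long test reviews keep their last max_length tokens.
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)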
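
The cell above pads the training set with `truncating='post'` but leaves the test set on the `pad_sequences` defaults, which pad and truncate at the front (`'pre'`). The sketch below imitates that behaviour in plain Python to show what each option keeps; `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, truncating="pre", padding="pre", value=0):
    """Illustrative stand-in for pad_sequences on a single sequence."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding before the tokens, 'post' after them.
    return pad + seq if padding == "pre" else seq + pad

long_seq = [1, 2, 3, 4, 5, 6]
short_seq = [7, 8]

post_truncated = pad_or_truncate(long_seq, 4, truncating="post")
pre_truncated = pad_or_truncate(long_seq, 4)
pre_padded = pad_or_truncate(short_seq, 4)

print(post_truncated)  # [1, 2, 3, 4]
print(pre_truncated)   # [3, 4, 5, 6]
print(pre_padded)      # [0, 0, 7, 8]
```

In this notebook, that means long training reviews keep their first 100 tokens while long test reviews keep their last 100.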

Inspecting **word_index** shows the integer the Tokenizer assigned to each word. The first 10 entries are shown below.

In [24]:
list(word_index.items())[:10]

[('<OOV>', 1),
 ('the', 2),
 ('and', 3),
 ('a', 4),
 ('of', 5),
 ('to', 6),
 ('is', 7),
 ('br', 8),
 ('in', 9),
 ('it', 10)]
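
The ordering in this output reflects how the index is built: the OOV token takes index 1 and the remaining words are ranked by frequency. The following is a simplified sketch of that idea in plain Python, assuming whitespace splitting only (the real Tokenizer also filters punctuation); `build_word_index` and `to_sequence` are hypothetical names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # Count lower-cased words across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent words first; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    # The OOV token gets index 1, real words start at 2.
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    oov = word_index[oov_token]
    result = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # Unknown words and words ranked at or beyond num_words map to OOV.
        result.append(idx if idx is not None and idx < num_words else oov)
    return result

texts = ["the movie was great", "the movie was boring", "the plot"]
toy_index = build_word_index(texts)
encoded = to_sequence("the plot was dull", toy_index, num_words=5)

print(toy_index)
print(encoded)  # [2, 1, 4, 1]
```

Here "plot" is in the index but ranked too low for `num_words=5`, and "dull" was never seen, so both are encoded as 1.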

An example from *testing_padded* shows the numeric sequence of a review that will be used to validate the model.

In [26]:
testing_padded[0]

array([  12,  251,   37,    6, 1144,    1,  682,    7, 4452,    1,    4,
          1,  334,    7,   37, 8367,  377,    5, 1420,    1,   13,   30,
         64,   28,    6,  874,  181,   17,    4, 1050,    5,   12,  224,
          3,   83,    4,  353,   33,  353, 5229,    5,   10,    6, 1340,
       1160,    2, 5738,    1,    3,    1,    5,   10,  175,  328,    7,
       1319, 3989,    4,  798, 1946,    5,    4,  250, 2710,  158,    3,
          2,  361,   31,  187,   25, 1170,  499,  610,    5,    2,  122,
          2,  356, 1398, 7725,   30,    1,  881,   38,    4,   20,   39,
         12,    1,    4,    1,  334,    7,    4,   20,  634,   60,   48,
        214], dtype=int32)
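
To read such a sequence back as text, the word index can be inverted. The sketch below uses a small hand-written index containing the first entries shown earlier; in the notebook one would presumably invert the full `word_index` instead, and the `decode` helper is hypothetical.

```python
# A few entries copied from the word_index output above.
sample_index = {"<OOV>": 1, "the": 2, "and": 3, "a": 4, "of": 5}

# Invert the mapping: integer -> word.
reverse_index = {idx: word for word, idx in sample_index.items()}

def decode(sequence):
    # Index 0 is padding, so it is skipped; unknown indices become "?".
    return " ".join(reverse_index.get(i, "?") for i in sequence if i != 0)

decoded_text = decode([0, 0, 2, 1, 5, 4, 1])
print(decoded_text)  # the <OOV> of a <OOV>
```

The many 1s in the array above are therefore words that fell outside the 10,000-word vocabulary.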

## NLP
A sequential model is built from the following layers:
1. An Embedding layer whose input size is the vocabulary size defined above and whose output dimension is 16.
2. A Flatten layer that flattens the resulting matrix into a vector.
3. A Dense layer with 32 units and a relu activation function.
4. A final Dense layer with a sigmoid activation function, which yields a probability between 0 and 1.
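
As a rough check of the layer sizes listed above, the trainable parameters can be counted by hand. The sketch below assumes the values used elsewhere in this notebook (a 10,000-word vocabulary, sequences of 100 tokens, 32 hidden units); the variable names are illustrative.

```python
vocab_size = 10000
embedding_dim = 16
max_length = 100
dense_units = 32

# Embedding: one 16-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Flatten has no parameters; it turns (100, 16) into a vector of 1600 values.
flatten_size = max_length * embedding_dim
# Dense layers: weights plus one bias per unit.
dense_params = flatten_size * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params)  # 160000 51232 33
print(total_params)  # 211265
```

Most of the parameters sit in the embedding table, which is part of why such a small network can memorize the training set quickly.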


In [21]:
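model = Sequential()
# Map each token index to a 16-dimensional vector.
model.add(Embedding(input_dim=vocab_size, output_dim=16))
# Flatten the (max_length, 16) matrix into a single vector.
model.add(Flatten())
model.add(Dense(units=32, activation='relu'))
# Single sigmoid unit: probability that the review is positive.
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# The test set is used as validation data during training.
training = model.fit(padded, training_labels_final, epochs=10, validation_data=(testing_padded, testing_labels_final))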

Epoch 1/10


2024-04-03 16:41:21.419705: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - accuracy: 0.6575 - loss: 0.5843 - val_accuracy: 0.8200 - val_loss: 0.3941
Epoch 2/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 9ms/step - accuracy: 0.9090 - loss: 0.2395 - val_accuracy: 0.7934 - val_loss: 0.4672
Epoch 3/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 9ms/step - accuracy: 0.9842 - loss: 0.0621 - val_accuracy: 0.7985 - val_loss: 0.6001
Epoch 4/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9970 - loss: 0.0154 - val_accuracy: 0.8002 - val_loss: 0.7180
Epoch 5/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.9993 - loss: 0.0039 - val_accuracy: 0.8022 - val_loss: 0.7925
Epoch 6/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 1.0000 - loss: 7.8129e-04 - val_accuracy: 0.7931 - val_loss: 0.8855
Epoch 7/10
782/782 ━

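In the training log, training accuracy approaches 1.0 while validation loss rises after the first epoch, which is the usual sign of overfitting. The sketch below takes the validation values copied from the six complete epochs in that log and picks the epoch with the lowest validation loss; it is only a way of reading the log, not a step the notebook performs.

```python
# Validation metrics copied from epochs 1-6 of the training log.
val_loss = [0.3941, 0.4672, 0.6001, 0.7180, 0.7925, 0.8855]
val_accuracy = [0.8200, 0.7934, 0.7985, 0.8002, 0.8022, 0.7931]

# Epochs are 1-based in the log, list positions are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print(best_epoch)                    # 1
print(val_accuracy[best_epoch - 1])  # 0.82
```
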
## Model evaluation

The NLP model is used to classify new reviews as positive or negative.

The model outputs a value between 0 and 1: a value greater than or equal to 0.5 can be treated as a positive review, and a value below 0.5 as a negative one.

In [30]:
new_sentences = [
    'I loved this movie. Awesome experience!',
    'This film is so boring.',
    'This movie is so hilarious. I had a really great time!',
    'I hate this movie. I fell asleep.'
    ]
new_sequences = tokenizer.texts_to_sequences(new_sentences)
padded_out = pad_sequences(new_sequences, maxlen=max_length, truncating=trunc_type)
output = model.predict(padded_out)
for sentence, score in zip(new_sentences, output):
    print(f"Review: {sentence}")
    print(f"Classification: {score[0]}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Review: I loved this movie. Awesome experience!
Classification: 0.9998024106025696
Review: This film is so boring.
Classification: 0.006478977855294943
Review: This movie is so hilarious. I had a really great time!
Classification: 0.995993971824646
Review: I hate this movie. I fell asleep.
Classification: 0.022402040660381317
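
Applying the 0.5 threshold described above to the four scores printed by the previous cell turns them into labels. This is a small sketch with the scores copied from that output; the `threshold` name and the label strings are illustrative choices.

```python
# Scores copied from the prediction output above.
scores = [
    0.9998024106025696,
    0.006478977855294943,
    0.995993971824646,
    0.022402040660381317,
]

threshold = 0.5
# Scores at or above the threshold count as positive reviews.
labels = ["positive" if s >= threshold else "negative" for s in scores]

print(labels)  # ['positive', 'negative', 'positive', 'negative']
```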
