File Name = txtProc_BagOTrigrams.ipynb

Text processing with Keras on TensorFlow: sentiment analysis using the
bag-of-words approach, here with n-grams of up to three words (trigrams)

Last tested           4/12/2024
OS                    Ubuntu 22.04.4  LTS
Python                3.10.6
TF Version            2.16.1
Keras version         2.16.1
numpy version         1.26.4
Number of GPUs        1
nVidia Driver         550.54.15
CUDA Version:          12.4  

IMDB data is used, downloaded as follows:


cd ~/Data
mkdir IMDB
cd IMDB
# in path ~/Data/IMDB:
curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz

The commands above create the following directory structure 
aclImdb/
...train/
......pos/
......neg/
...test/
......pos/
......neg/
    
The data was used as part of the paper "Learning Word Vectors for Sentiment Analysis" by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and  Christopher Potts, all from Stanford, 2011.

The "readme.txt" file in the main directory describes the organization of the data
files and the meaning of the names in the files.

There are 25K text files for training and 25K for testing. The train/pos/ directory contains 12.5K text files, each of which contains one positive-sentiment movie review. Similarly, the train/neg/ directory contains 12.5K negative-sentiment reviews.

The test/pos and test/neg directories follow the same idea. 

Before creating and running the model, a validation set was created (once) by moving 5K files out of the training set, which reduces the training set to 25K - 5K = 20K files.

The unsup directory contains "unsupervised" data, which is not used.
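
A quick way to check the counts described above is to count the review files per directory. The following is a minimal sketch, not part of the original notebook: the helper name count_reviews and the ~/Data/IMDB/aclImdb location are assumptions, and the "val" folder only shows up after the validation split has been made.

```python
import pathlib

def count_reviews(base_dir):
    """Return {(split, label): number of .txt files} for the folders that exist."""
    base = pathlib.Path(base_dir)
    counts = {}
    for split in ("train", "val", "test"):
        for label in ("pos", "neg"):
            folder = base / split / label
            if folder.is_dir():
                counts[(split, label)] = sum(1 for _ in folder.glob("*.txt"))
    return counts

if __name__ == "__main__":
    # Assumed location; adjust to wherever the archive was extracted.
    base = pathlib.Path.home() / "Data" / "IMDB" / "aclImdb"
    for (split, label), n in sorted(count_reviews(base).items()):
        print(f"{split}/{label}: {n}")
```

Folders that do not exist are skipped rather than reported as zero, so a missing "val" entry is a hint that the validation split has not been created yet.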

The vectorization here differs from the code in bag_of_tokens: TextVectorization receives the additional argument ngrams=3, and the resulting data sets are named accordingly (binary_3gram_...). With an integer value, Keras generates all n-grams up to that size, so the vocabulary mixes single words, bigrams and trigrams.
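
To make the idea concrete, here is a pure-Python sketch of n-gram extraction followed by multi-hot encoding. It only illustrates the principle and is not the Keras implementation: the helper names are hypothetical, tokens are split on whitespace only, and details of the real layer (punctuation stripping, the reserved out-of-vocabulary slot) are left out.

```python
from collections import Counter

def ngrams_up_to(tokens, n):
    """Return all contiguous 1..n-grams of a token list, joined by spaces."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

def build_vocabulary(documents, n, max_tokens):
    """Keep the max_tokens most frequent n-grams across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams_up_to(doc.lower().split(), n))
    return [gram for gram, _ in counts.most_common(max_tokens)]

def multi_hot(document, vocabulary, n):
    """Return a 0/1 vector marking which vocabulary entries occur in the document."""
    present = set(ngrams_up_to(document.lower().split(), n))
    return [1 if gram in present else 0 for gram in vocabulary]

docs = ["the movie was great", "the movie was boring"]
vocab = build_vocabulary(docs, n=3, max_tokens=20)
print(ngrams_up_to("the movie was great".split(), 3))
print(multi_hot("the movie was great", vocab, n=3))
```

Each review becomes a fixed-length vector with one position per vocabulary entry, which is why the model input size equals max_tokens regardless of how long the review is.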

Once this is done, I think the code flows in the same way as in bag_of_tokens.
The model has one hidden Dense layer with 16 units and "relu" activation, followed by dropout and a single sigmoid output unit. Its input is a vector of size max_tokens=20000.
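
As a sanity check on the model size, the number of trainable parameters can be worked out by hand. This is a small sketch under the assumption that the model is exactly the one built in get_model_fn (one 16-unit Dense layer, dropout, one sigmoid unit); the helper dense_params is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

max_tokens = 20000
hidden_dim = 16

hidden = dense_params(max_tokens, hidden_dim)   # 20000*16 + 16
output = dense_params(hidden_dim, 1)            # 16*1 + 1
# Dropout has no trainable parameters, so it does not appear in the total.
print(hidden, output, hidden + output)
```

Almost all parameters sit in the first layer, because every one of the 20000 vocabulary positions connects to each of the 16 hidden units.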

Once these steps are done the model is created and trained. In the run logged below the validation accuracy is about 91%, so we went from 88% using a bag of tokens to about 91% using n-grams of up to three words.

In [1]:

import re
import string

import os, pathlib, shutil, random

import numpy as np
import matplotlib.pyplot as plt 
import tensorflow as tf
import keras
from keras import layers
from keras import models
from keras import optimizers
from keras.layers import Dropout
from keras.layers import TextVectorization


print("TF Version   ", tf.__version__)
print("TF Path      ", tf.__path__[0])
print("Keras version ", keras.__version__)
print("numpy version ", np.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))



2025-03-07 19:29:48.876531: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741393788.889446   53104 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741393788.892970   53104 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-07 19:29:48.907989: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


TF Version    2.18.0
TF Path       /drv3/hm3/code/python/tf2.18/tf2.18/lib/python3.12/site-packages/keras/api/_v2
Keras version  3.7.0
numpy version  1.26.4
Num GPUs Available:  1


In [None]:
""" 
create_validation_set_fn()

Take 20% of the training set for validation. The training set gives away 5k files,
therefore after this function executes, the training set will contain 25k-5k=20k.
Before running this function it is a good idea to remove the directory (if present).

Last time I used this code was with  /drv3/hm3/Data/IMDB/aclImdb/val
and its content. 

DO NOT RUN THIS CODE if the val data is already created
""" 
def create_validation_set_fn() :
  base_dir = pathlib.Path("/drv3/hm3/Data/IMDB/aclImdb")
  val_dir = base_dir / "val"
  train_dir = base_dir / "train"

  for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1377).shuffle(files)
    num_val_samples = int(0.2*len(files))
    val_files = files[ -num_val_samples: ]
    for fname in val_files :
      shutil.move( train_dir / category /fname, val_dir / category / fname)
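
Since running the split twice would shrink the training set again, one option is to guard the function so that it does nothing when the val directory already exists. The following is a sketch of that idea, not the notebook's own code: the name split_validation_once and its parameters are hypothetical, and it sorts the file names before shuffling on the assumption that a reproducible selection is wanted.

```python
import pathlib
import random
import shutil

def split_validation_once(base_dir, fraction=0.2, seed=1377):
    """Move a fraction of train files into val, unless val already exists.

    Returns the number of files moved (0 when the split was skipped).
    """
    base = pathlib.Path(base_dir)
    train_dir, val_dir = base / "train", base / "val"
    if val_dir.exists():
        return 0  # split was done earlier; leave the data untouched

    moved = 0
    for category in ("neg", "pos"):
        (val_dir / category).mkdir(parents=True)
        # Sort first: directory listing order is not guaranteed, so sorting
        # makes the seeded shuffle pick the same files on every machine.
        files = sorted(p.name for p in (train_dir / category).iterdir())
        random.Random(seed).shuffle(files)
        num_val = int(fraction * len(files))
        # Guard against num_val == 0, where files[-0:] would select everything.
        for fname in (files[-num_val:] if num_val else []):
            shutil.move(str(train_dir / category / fname),
                        str(val_dir / category / fname))
            moved += 1
    return moved
```

With this guard a second call returns 0 and leaves both directories unchanged, so the call could stay enabled in the main function instead of being commented out.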


In [2]:
"""
create_data_sets_fn()

trn_ds, tst_ds, val_ds = create_data_sets_fn( strBaseDir )
  
Receive a string with the base directory where the raw data is stored.
Return three data sets in the order training, testing, validation,
  i.e., (trn, tst, val)
"""
def create_data_sets_fn( strBaseDir ) :
    batch_size = 32
    base_dir = pathlib.Path( strBaseDir )
    train_ds =  keras.utils.text_dataset_from_directory( base_dir / "train", batch_size=batch_size )
    val_ds = keras.utils.text_dataset_from_directory( base_dir / "val", batch_size=batch_size )
    test_ds = keras.utils.text_dataset_from_directory( base_dir / "test", batch_size=batch_size )

    # verify by printing that the 3 data sets were created correctly
    for inputs, targets in train_ds:
         print("inputs.shape:", inputs.shape)
         print("inputs.dtype:", inputs.dtype)
         print("targets.shape:", targets.shape)
         print("targets.dtype:", targets.dtype)
         print("inputs[0]:", inputs[0])
         print("targets[0]:", targets[0])
         break
    
    return train_ds, test_ds, val_ds



In [3]:
"""

textvectorization_preprocessing_fn( train_ds, test_ds, val_ds )

Receive the three data sets and vectorize them, limiting the vocabulary to the
20K most frequent n-grams in the training set.

Note that the argument ngrams=3 makes TextVectorization produce n-grams of up
to three words (unigrams, bigrams and trigrams).

At the end, the function returns the items as shown below

     return binary_3gram_train_ds, binary_3gram_test_ds, binary_3gram_val_ds
"""
def textvectorization_preprocessing_fn( train_ds, test_ds, val_ds ) :
  text_vectorization = TextVectorization( ngrams=3, 
                                         max_tokens=20000, output_mode="multi_hot",)
    
  # prepare a data set that yields only the raw text inputs (no labels)
  text_only_train_ds = train_ds.map(lambda x, y: x)

  text_vectorization.adapt(text_only_train_ds)

  # prepare processed versions for training, validation, testing. Use 8 cores
  binary_3gram_train_ds = train_ds.map ( lambda x, y: (text_vectorization(x), y), num_parallel_calls=8)
  binary_3gram_test_ds  = test_ds.map ( lambda x, y: (text_vectorization(x), y), num_parallel_calls=8)
  binary_3gram_val_ds   = val_ds.map ( lambda x, y: (text_vectorization(x), y), num_parallel_calls=8)

  for inputs, targets in  binary_3gram_train_ds :
    print("inputs.shape:",  inputs.shape)
    print("inputs.dtype:",  inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:",     inputs[0])
    print("targets[0]:",    targets[0])
    break

  return binary_3gram_train_ds, binary_3gram_test_ds, binary_3gram_val_ds


In [4]:
"""
get_model_fn()
get_model_fn( max_tokens=20000, hidden_dim=16)

Receive the max number of tokens (the input size) and the number of units in
the hidden layer. Use those parameters to create a dense model with one hidden
layer, dropout, and a single sigmoid output. The hidden layer's activation is relu
      
"""
def get_model_fn( max_tokens=20000, hidden_dim=16) :
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid") (x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
  return model



In [5]:
"""
      jev_main_fn()

Start the execution of the code in this file

"""
def jev_main_fn() :
  strBaseDir  = "/drv3/hm3/Data/IMDB/aclImdb"

  # call create_validation_set_fn() once to create 5K files in two directories (pos, neg)
  # each with 2.5 k files. The 5K files are taken away from the train set, therefore the
  # size of the training set will be 25k - 5k = 20K 
  #
  #  create_validation_set_fn() 
  #

  train_ds, test_ds, val_ds = create_data_sets_fn( strBaseDir )
  binary_3gram_train_ds, binary_3gram_test_ds, binary_3gram_val_ds = textvectorization_preprocessing_fn( train_ds, test_ds, val_ds )

  model = get_model_fn()
  model.summary()
  callbacks = [ keras.callbacks.ModelCheckpoint("binary_3gram.keras", save_best_only=True) ]

  model.fit(  binary_3gram_train_ds.cache(), 
              validation_data= binary_3gram_val_ds.cache(),
              epochs=10, callbacks=callbacks)
  model=keras.models.load_model("binary_3gram.keras")
  print(f"Test acc: {model.evaluate(binary_3gram_test_ds)[1]:.3f}")

  print("Done")

#### LET'S GO ! ! ! !
jev_main_fn()



Found 20000 files belonging to 2 classes.


I0000 00:00:1741393817.142522   53104 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9411 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:08:00.0, compute capability: 8.6


Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"I remember this film, exhibit in Barcelona (Spain) in 1970, for the time of a week. Although it could seems incredible, and I can't offer any explanation for it, this movie was exhibit in a theater dedicated to... movies of art and big quality (that, is, Bergman, Resnais, Malle, Bu\xc3\xb1uel, and... The Projected Man). Few people saw it (luckly people, no doubt) and no reference about this very boring SF movie can be found in the Peter Nichols Science Fiction Encyclopidie, or about the author of the original novel. Very indicative. I remember of it, after all this years, a no-story, a lot of special effects that seems ridiculous effects in fact, and no more. It seems that in some countries the running time is 90 mm. and in anothers 77 min. Well, it means only a little more of p

2025-03-07 19:30:29.215741: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.dtype: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


Epoch 1/10


I0000 00:00:1741393840.116030   53270 service.cc:148] XLA service 0x72adb00519d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1741393840.116067   53270 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2025-03-07 19:30:40.132655: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1741393840.193392   53270 cuda_dnn.cc:529] Loaded cuDNN version 90501


 30/625 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.5887 - loss: 0.6675

I0000 00:00:1741393840.561832   53270 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.7848 - loss: 0.4660 - val_accuracy: 0.9086 - val_loss: 0.2552
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9129 - loss: 0.2468 - val_accuracy: 0.9114 - val_loss: 0.2463
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9308 - loss: 0.2111 - val_accuracy: 0.9124 - val_loss: 0.2554
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9427 - loss: 0.1877 - val_accuracy: 0.9046 - val_loss: 0.2762
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9495 - loss: 0.1725 - val_accuracy: 0.9078 - val_loss: 0.2870
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9494 - loss: 0.1786 - val_accuracy: 0.9038 - val_loss: 0.3112
Epoch 7/10
[1m625/625[0m [32m━━━━━━━