File Name = txtProc_BagOfBigrams.ipynb

+ Text processing with Keras on TensorFlow: sentiment analysis using the bag-of-words approach, here with bigrams (2-grams) added to the single words.

+ The IMDB data is used. It is downloaded and unpacked under ~/Data/IMDB as follows:

```
cd ~/Data
mkdir IMDB
cd IMDB
curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
```

The commands above create the directory structure

```
aclImdb/
...train/
......pos/
......neg/
...test/
......pos/
......neg/
```
    
+ The data was originally used as part of the paper "Learning Word Vectors for Sentiment Analysis" by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and  Christopher Potts, all from Stanford, 2011.

+ The "readme.txt" file in the main directory describes the organization of the data files and the meaning of the file names.

+ There are 25K text files for training and 25K for testing. The train/pos/ directory contains 12.5K text files, each holding one positive-sentiment movie review. Similarly, the train/neg/ directory contains 12.5K negative-sentiment reviews.

+ The test/pos and test/neg directories follow the same idea. 

+ Before creating and running the model, a validation set was created (once) by moving 5K files out of the training set, which reduces the training set to 25K - 5K = 20K files.

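+ The train/unsup directory contains unlabeled ("unsupervised") reviews, which are not used. It must not be present under train/ when the data sets are loaded, because text_dataset_from_directory treats every subdirectory as a class. The "Found 20000 files belonging to 2 classes" line in the output further down is consistent with it having been removed.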

+ The vectorization here differs from the code in bag_of_tokens: TextVectorization receives the additional argument ngrams=2, so the vocabulary contains bigrams as well as single words, and the resulting data sets are named binary_2gram_* accordingly.

+ After that, the code follows the same flow as in bag_of_tokens.
The model has one hidden dense layer with 16 units and "relu" activation, followed by dropout and a single sigmoid output unit. Its input has size max_tokens=20000.

+ Once these steps are done, the model is created and trained. The validation accuracy is about 90%, so we went from 88% using a bag of tokens to about 90% using 2-grams.
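
To make the ngrams=2 setting concrete, here is a minimal pure-Python sketch of what a unigram-plus-bigram tokenizer produces. It only approximates the default TextVectorization standardization (lowercasing, punctuation stripping, whitespace splitting). The helper name `unigrams_and_bigrams` is hypothetical and is not part of Keras.

```python
import re
import string

def unigrams_and_bigrams(text):
    """Return the word tokens of `text` followed by its adjacent word pairs.

    Hypothetical helper (an assumption, not a Keras API): it lowercases,
    strips punctuation, splits on whitespace, then appends the 2-grams.
    """
    lowered = text.lower()
    stripped = re.sub(f"[{re.escape(string.punctuation)}]", "", lowered)
    words = stripped.split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

tokens = unigrams_and_bigrams("The movie was not good.")
print(tokens)
# ['the', 'movie', 'was', 'not', 'good',
#  'the movie', 'movie was', 'was not', 'not good']
```

The bigram "not good" keeps the negation attached to the adjective, which a bag of single words cannot express. That is one plausible reason for the accuracy gain reported above.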

In [2]:

import re
import string

import os, pathlib, shutil, random

import numpy as np
import matplotlib.pyplot as plt 
import tensorflow as tf
import keras
from keras import layers
from keras import models
from keras import optimizers
from keras.layers import Dropout
from keras.layers import TextVectorization


print("TF Version   ", tf.__version__)
print("TF Path      ", tf.__path__[0])
print("Keras version ", keras.__version__)
print("numpy version ", np.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))



TF Version    2.18.0
TF Path       /drv3/hm3/code/python/tf2.18/tf2.18/lib/python3.12/site-packages/keras/api/_v2
Keras version  3.7.0
numpy version  1.26.4
Num GPUs Available:  1


In [None]:
""" 
create_validation_set_fn()

Take 20% of the training set for validation. The training set gives away 5k files,
therefore after this function executes, the training set will contain 25k-5k=20k.
Before running this function it is a good idea to remove the directory (if present).

Last time I used this code was with  /drv3/hm3/Data/IMDB/aclImdb/val
and its content. 

DO NOT RUN THIS CODE if the val data is already created
""" 
def create_validation_set_fn() :
  base_dir = pathlib.Path("/drv3/hm3/Data/IMDB/aclImdb")
  val_dir = base_dir / "val"
  train_dir = base_dir / "train"

  for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1377).shuffle(files)
    num_val_samples = int(0.2*len(files))
    val_files = files[ -num_val_samples: ]
    for fname in val_files :
      shutil.move( train_dir / category /fname, val_dir / category / fname)
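
The split arithmetic of the function above can be checked without touching the disk. The sketch below applies the same seeded shuffle and 20% slice to a list of placeholder file names. The names and the helper `split_names` are assumptions for illustration, not the real IMDB file names or part of the notebook.

```python
import random

def split_names(names, val_fraction=0.2, seed=1377):
    """Shuffle a copy of `names` with a fixed seed and split off the tail."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    num_val = int(val_fraction * len(shuffled))
    cut = len(shuffled) - num_val
    return shuffled[:cut], shuffled[cut:]

# Placeholder names standing in for the 12,500 files of one category.
names = [f"review_{i}.txt" for i in range(12500)]
train_names, val_names = split_names(names)

print(len(train_names), len(val_names))        # 10000 2500
print(set(train_names).isdisjoint(val_names))  # True
```

Across both categories this gives 20,000 training files and 5,000 validation files. Note that os.listdir returns names in arbitrary order, so sorting the listing before the shuffle would be needed if the exact same split had to be reproduced on another machine.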


In [3]:
"""
create_data_sets_fn()

trn_ds, tst_ds, val_ds = create_data_sets_fn( strBaseDir )
  
Receive a string that contains the base directory where the raw data is stored.
Return three data sets, for training, testing and validation
  i.e., (trn, tst, val)
"""
def create_data_sets_fn( strBaseDir ) :
    batch_size = 32
    base_dir = pathlib.Path( strBaseDir )
    train_ds =  keras.utils.text_dataset_from_directory( base_dir / "train", batch_size=batch_size )
    val_ds = keras.utils.text_dataset_from_directory( base_dir / "val", batch_size=batch_size )
    test_ds = keras.utils.text_dataset_from_directory( base_dir / "test", batch_size=batch_size )

    # verify by printing that the 3 data sets were created correctly
    for inputs, targets in train_ds:
         print("inputs.shape:", inputs.shape)
         print("inputs.dtype:", inputs.dtype)
         print("targets.shape:", targets.shape)
         print("targets.dtype:", targets.dtype)
         print("inputs[0]:", inputs[0])
         print("targets[0]:", targets[0])
         break
    
    return train_ds, test_ds, val_ds
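
Two details of the loading step can be checked with plain arithmetic. This sketch assumes the default behaviour of text_dataset_from_directory, in which labels are inferred from the subdirectory names in alphanumeric order. The file counts come from the "Found ... files" lines in the output of this notebook.

```python
import math

# Inferred labels follow the alphanumeric order of the class directories.
class_names = sorted(["pos", "neg"])
label_of = {name: index for index, name in enumerate(class_names)}
print(label_of)  # {'neg': 0, 'pos': 1}

# Number of batches per epoch for each split at batch_size=32.
batch_size = 32
num_files = {"train": 20000, "val": 5000, "test": 25000}
batches = {split: math.ceil(n / batch_size) for split, n in num_files.items()}
print(batches)   # {'train': 625, 'val': 157, 'test': 782}
```

The 625 training batches match the "625/625" step counter in the training log, and a target of 0 corresponds to a negative review.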



In [4]:
"""

textvectorization_preprocessing_fn( train_ds, test_ds, val_ds)

Receive the three data sets and vectorize them with TextVectorization,
limiting the vocabulary to the 20K most frequent tokens in the training set.

Note that ngrams=2 makes the vocabulary contain bigrams as well as single
words, and output_mode="multi_hot" encodes each review as a binary vector
of size 20000.

The function returns the vectorized data sets in this order:

     return binary_2gram_train_ds, binary_2gram_test_ds, binary_2gram_val_ds
"""
def textvectorization_preprocessing_fn( train_ds, test_ds, val_ds ) :
  text_vectorization = TextVectorization( ngrams=2, 
                                         max_tokens=20000, output_mode="multi_hot",)
    
  # prepare a data set that yields only fields raw text input (no labels)
  text_only_train_ds = train_ds.map(lambda x, y: x)

  text_vectorization.adapt(text_only_train_ds)

  # prepare processed versions for training, validation, testing. Use 8 cores
  binary_2gram_train_ds = train_ds.map ( lambda x, y: (text_vectorization(x), y), num_parallel_calls=8)
  binary_2gram_test_ds  = test_ds.map ( lambda x, y: (text_vectorization(x), y), num_parallel_calls=8)
  binary_2gram_val_ds   = val_ds.map ( lambda x, y: (text_vectorization(x), y), num_parallel_calls=8)

  for inputs, targets in  binary_2gram_train_ds :
    print("inputs.shape:",  inputs.shape)
    print("inputs.dtype:",  inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:",     inputs[0])
    print("targets[0]:",    targets[0])
    break

  return binary_2gram_train_ds, binary_2gram_test_ds, binary_2gram_val_ds
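
The idea behind a capped vocabulary with multi-hot output can be sketched in pure Python. This is an illustration under stated assumptions, not the TextVectorization implementation: the helpers `build_vocabulary` and `multi_hot` are hypothetical, ties between equally frequent tokens are broken alphabetically here (Keras may order them differently), and index 0 is reserved for out-of-vocabulary tokens.

```python
from collections import Counter

def build_vocabulary(token_lists, max_tokens):
    """Keep the most frequent tokens; index 0 is reserved for unknown tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending count, then alphabetically, to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, _ in ranked[: max_tokens - 1]]
    return {tok: idx for idx, tok in enumerate(["[UNK]"] + kept)}

def multi_hot(tokens, vocabulary):
    """Return a 0/1 vector with a 1 at the index of every token present."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        vector[vocabulary.get(tok, 0)] = 1
    return vector

# Toy "reviews", already split into unigrams and bigrams.
docs = [
    ["good", "movie", "good movie"],
    ["good", "film", "good film"],
    ["bad", "movie", "bad movie"],
]
vocab = build_vocabulary(docs, max_tokens=3)
print(vocab)                      # {'[UNK]': 0, 'good': 1, 'movie': 2}
print(multi_hot(docs[0], vocab))  # [1, 1, 1]
print(multi_hot(docs[2], vocab))  # [1, 0, 1]
```

Each vector records only whether a token occurs, not how often or where. Tokens that fall outside the cap all land in slot 0.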


In [5]:
"""
get_model_fn()
get_model_fn( max_tokens=20000, hidden_dim=16)

Receive the maximum number of tokens (the input size) and the number of units
in the hidden layer, and build a dense model with one hidden layer of that size.
The hidden layer's activation is relu; the output is a single sigmoid unit.
      
"""
def get_model_fn( max_tokens=20000, hidden_dim=16) :
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid") (x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
  return model
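
As a sanity check on the size of this model, the parameter count can be worked out by hand. The helper `dense_params` is hypothetical and only does the arithmetic for a fully connected layer with a bias. Dropout adds no trainable parameters.

```python
def dense_params(num_inputs, num_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return num_inputs * num_units + num_units

max_tokens, hidden_dim = 20000, 16
hidden = dense_params(max_tokens, hidden_dim)  # 20000*16 + 16 = 320016
output = dense_params(hidden_dim, 1)           # 16*1 + 1 = 17
total = hidden + output
print(hidden, output, total)                   # 320016 17 320033
```

Almost all of the parameters sit in the first layer, one weight per (token, hidden unit) pair, so the vocabulary cap max_tokens largely determines the model size.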



In [6]:
"""
      jev_main_fn()

Start the execution of the code in this file

"""
def jev_main_fn() :
  strBaseDir  = "/drv3/hm3/Data/IMDB/aclImdb"

  # call create_validation_set_fn() once to create 5K files in two directories (pos, neg)
  # each with 2.5 k files. The 5K files are taken away from the train set, therefore the
  # size of the training set will be 25k - 5k = 20K 
  #
  #  create_validation_set_fn() 
  #

  train_ds, test_ds, val_ds = create_data_sets_fn( strBaseDir )
  binary_2gram_train_ds, binary_2gram_test_ds, binary_2gram_val_ds = textvectorization_preprocessing_fn( train_ds, test_ds, val_ds )

  model = get_model_fn()
  model.summary()
  callbacks = [ keras.callbacks.ModelCheckpoint("binary_2gram.keras", save_best_only=True) ]

  model.fit(  binary_2gram_train_ds.cache(), 
              validation_data= binary_2gram_val_ds.cache(),
              epochs=10, callbacks=callbacks)
  model=keras.models.load_model("binary_2gram.keras")
  print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

  print("Done")

#### LET"S GO ! ! ! ! 
jev_main_fn()



Found 20000 files belonging to 2 classes.


I0000 00:00:1741393068.335800   44679 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9353 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:08:00.0, compute capability: 8.6


Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"The first one was the best. The second one sucked because the dialog was terrible. Although, the storyline wasn't so bad (in fact, all story lines are good and bad). Throughout the movie, I dosed off a few times. I know that Jackie Chan is a great martial arts expertise, but not a good actor in Rush Hour 2. Chris Tucker, too, wasn't good. And Zhang Ziyi, what can I say, a few lines, terrible acting (But that's based on her script). All the characters there were not that good. But, some of the things I like in Rush Hour 2 is always the action and less sex scenes. I know that Jackie Chan doesn't do those things which is good for him.", shape=(), dtype=string)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


2025-03-07 19:17:55.232280: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.dtype: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


Epoch 1/10


I0000 00:00:1741393078.519227   46116 service.cc:148] XLA service 0x70c8940513b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1741393078.519292   46116 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2025-03-07 19:17:58.540619: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1741393078.600573   46116 cuda_dnn.cc:529] Loaded cuDNN version 90501


 28/625 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.5750 - loss: 0.6779

I0000 00:00:1741393078.969495   46116 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.7819 - loss: 0.4747 - val_accuracy: 0.9068 - val_loss: 0.2609
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9091 - loss: 0.2598 - val_accuracy: 0.9060 - val_loss: 0.2504
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9259 - loss: 0.2273 - val_accuracy: 0.9098 - val_loss: 0.2541
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9386 - loss: 0.1887 - val_accuracy: 0.9048 - val_loss: 0.2721
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9410 - loss: 0.1927 - val_accuracy: 0.9062 - val_loss: 0.2807
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9489 - loss: 0.1778 - val_accuracy: 0.9002 - val_loss: 0.3029
Epoch 7/10
625/625 ━━━━━━━