# Title: deep learning part 5, text classification

## Aim:

+ learning text classification
+ practising jupyter lab in vs code

Ref: https://www.tensorflow.org/text/guide/word_embeddings



## Text embedding

 + One-hot encoding: inefficient, because it produces a sparse matrix.

 + A unique integer for each word: efficient and densely populated, but it captures no relationship between words.

 + Word embedding: captures relationships between words, with no need to specify the encoding manually. The embedding is an extra layer in the model, and its weights are learned as part of training.

We only need to specify the dimension, i.e. the length of each word vector (e.g. 8 for a small dataset, up to 1024 for a large dataset).

e.g., 4-dimensional word embedding

    cat: [2.1, 1.5, -0.5, 4]
    mat: [1.8, -0.5, 0, -1]
    .....

The values are floating-point numbers.
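The three encodings listed above can be compared side by side. This is a minimal NumPy sketch, not part of the referenced tutorial: the toy vocabulary, the sentence, and the random embedding table are assumptions made only for illustration.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only).
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = ["the", "cat", "sat", "on", "the", "mat"]

# 1) Integer encoding: one unique integer per word.
int_encoded = np.array([word_to_index[w] for w in sentence])

# 2) One-hot encoding: one row per word, one column per vocabulary entry.
one_hot = np.eye(len(vocab), dtype=int)[int_encoded]

# 3) Embedding: each integer selects a row of a (vocab_size, dim) table.
#    The values are random here; in a real model they are learned in training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))
embedded = embedding_table[int_encoded]

print(int_encoded)     # integer ids, shape (6,)
print(one_hot.shape)   # (6, 5): mostly zeros
print(embedded.shape)  # (6, 4): dense floats
```

The one-hot matrix grows with the vocabulary size, while the embedding width stays at whatever dimension we choose.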

Next, let's see some examples.

In [None]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

print(f"tensorflow version:{tf.__version__}")
print(f"gpu:{tf.config.list_logical_devices('GPU')}")

## Get the dataset

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb_v1_extracted/aclImdb')
os.listdir(dataset_dir)

In [None]:
batch_size = 64
seed = 123
print(dataset_dir)
train_dir=os.path.join(dataset_dir,'train')

print(train_dir)

print(os.listdir(train_dir))
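assert os.path.isdir(train_dir)

# Remove the unlabeled 'unsup' folder from the training folder.
# The check keeps this cell re-runnable after the folder is already gone.
remove_dir = os.path.join(train_dir, 'unsup')
if os.path.isdir(remove_dir):
  shutil.rmtree(remove_dir)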



train_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir,'train'), batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir,'train'), batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

print("type of train_ds", type(train_ds))

In [None]:
# Show a few reviews and their labels from the first batch
for text_batch, label_batch in train_ds.take(1):
  print(type(text_batch))
  print(text_batch.shape)
  for i in range(3):
    print(label_batch[i].numpy(), text_batch.numpy()[i])



# Configure the datasets for performance (cache and prefetch)
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Embedding layer


    The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

    Let's see an example with a model that consists of a single embedding layer.

In [None]:
import numpy as np
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(1000, 64))
# The model will take as input an integer matrix of size (batch,
# input_length), and the largest integer (i.e. word index) in the input
# should be no larger than 999 (vocabulary size).
# Now model.output_shape is (None, 10, 64), where `None` is the batch
# dimension.
input_array = np.random.randint(1000, size=(32, 10))
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print(output_array.shape)

print(output_array)

In [None]:
# Show some inputs next to their outputs

print(input_array[0:2])

print(output_array[0:2])

In [None]:
# Can we feed words directly into the Embedding layer?
# No: Embedding only accepts integer indices as input.
# Turning words into integers is the job of the TextVectorization layer.
input_words = [['I', 'like', 'you'],
               ['he', 'does', 'hate']]

tf_array = tf.constant(input_words)

print(tf_array)

# Commented out because it fails: the model expects integers, not strings.
#output_words = model.predict(tf_array)

print("WE NEED TO VECTORIZE TEXT INTO INTEGERS")

Next, we want to play more with the embedding layer

In [None]:
# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation.

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table. This is how to retrieve the embeddings.

In [None]:
result = embedding_layer(tf.constant([1, 2, 3]))

print(result.numpy())

result2 = embedding_layer(tf.constant([1,2,2]))

print(result2) #check for duplicated entries


Notes: **Only input tensors may be passed as positional arguments. The following argument value should be passed as a keyword argument:**

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N).

Note: in this case, the first axis is the batch dimension.

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). 
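The lookup-table behaviour described above can be mimicked with NumPy integer indexing. This is a sketch under assumptions: `table` is a random stand-in for the layer's weight matrix, not the weights of an actual Keras layer.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 5

# Stand-in for the layer's weight matrix (randomly initialized).
rng = np.random.default_rng(42)
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each with 3 word indices.
batch = np.array([[0, 1, 2], [2, 4, 5]])

# Integer-array indexing performs the lookup for every index at once.
embedded = table[batch]

print(batch.shape)     # (2, 3)
print(embedded.shape)  # (2, 3, 5)

# The same index always returns the same vector: index 2 appears twice.
print(np.array_equal(embedded[0, 2], embedded[1, 0]))
```

The output gains one trailing axis of length `embedding_dim`, which is the (samples, sequence_length, embedding_dimensionality) shape mentioned above.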

In [None]:
result = embedding_layer(tf.constant([[0, 1, 2], [2, 4, 5]]))
print(result.shape)

print(result)

## Text preprocessing

We do two things here: standardize (normalize) the texts, and then vectorize them.

Standardize: lowercase all the text, strip the HTML tags, and remove punctuation. That is all for now.

Vectorize: turn the text into integers.
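The two steps can be sketched in plain Python to see what happens to a single review. This is only an illustration, not how `TextVectorization` is implemented: the helper names `standardize`, `build_vocabulary`, and `vectorize` are hypothetical, and reserving index 0 for padding and index 1 for unknown words is an assumption modelled on the layer's integer output mode.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(texts, max_tokens):
    """Most frequent words first; index 0 = padding, index 1 = unknown."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, sequence_length):
    """Map words to integers, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(w, 1) for w in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

reviews = ["Great movie!<br />Great cast.", "Bad movie, bad plot."]
vocab = build_vocabulary(reviews, max_tokens=10)

print(standardize(reviews[0]))
print(vectorize("Great plot, terrible ending", vocab, sequence_length=6))
```

Words that were never seen while building the vocabulary all share the "unknown" index, and short reviews are padded so that every row has the same length.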

In [None]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')


# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because the samples are not all the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)

# Compare the text-only dataset with the original (text, label) dataset.
print(type(text_ds), " and ", type(train_ds.take(-1)))

# Print the first (text, label) batch only.
for element in train_ds.take(1):
  print("--", element)
print("=---------===")
print(text_ds.element_spec, " and ", train_ds.element_spec)
# A tf.data.Dataset has no .shape attribute; element_spec describes its elements.
vectorize_layer.adapt(text_ds)

## Build the model

The TextVectorization layer transforms strings into vocabulary indices. You have already initialized vectorize_layer as a TextVectorization layer and built its vocabulary by calling adapt on text_ds. Now vectorize_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.
The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).

The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

The fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.

The last layer is densely connected with a single output node.
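To follow the shapes through the layers described above, here is a NumPy-only sketch of the forward pass. It is not the Keras implementation: all weights are random stand-ins, the batch size of 4 is arbitrary, and the vectorization step is replaced by random integer indices.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, sequence_length = 4, 100
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

# Output of the vectorization step: integer word indices.
token_ids = rng.integers(0, vocab_size, size=(batch, sequence_length))

# Embedding: look up one vector per index -> (batch, sequence, embedding).
embedding_table = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))
embedded = embedding_table[token_ids]

# GlobalAveragePooling1D: mean over the sequence axis -> (batch, embedding).
pooled = embedded.mean(axis=1)

# Dense(16, relu): affine map followed by ReLU -> (batch, 16).
w1 = rng.normal(size=(embedding_dim, hidden_units))
b1 = np.zeros(hidden_units)
hidden = np.maximum(pooled @ w1 + b1, 0.0)

# Dense(1): one raw score (logit) per review -> (batch, 1).
w2 = rng.normal(size=(hidden_units, 1))
b2 = np.zeros(1)
logits = hidden @ w2 + b2

for name, array in [("token_ids", token_ids), ("embedded", embedded),
                    ("pooled", pooled), ("hidden", hidden), ("logits", logits)]:
    print(f"{name:10s} {array.shape}")
```

The final layer has no activation, so it produces logits; that is why the loss in the model cell is created with `from_logits=True`.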

In [None]:
embedding_dim=16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15
    #callbacks=[tensorboard_callback]
    )

## Things to learn

### tf.data.Dataset

        Each element in the tf.data.Dataset object returned by tf.keras.utils.text_dataset_from_directory is a batch of data, the structure of which depends on the label_mode parameter.

        Dataset Element Structure

            If label_mode is None: Each element in the dataset is a single string tensor of shape (batch_size,), containing the raw contents of a batch of text files.
            If label_mode is set (e.g., 'int', 'categorical', or 'binary'): Each element is a tuple (texts, labels), where:
                texts is a string tensor of shape (batch_size,) containing the raw text content from the files.
                labels is a tensor of shape (batch_size,) (or (batch_size, num_classes) for 'categorical' mode) containing the corresponding labels for each text in the batch. The labels are typically inferred from the subdirectory names. 

        Key Parameters Affecting Output

            batch_size: The dataset yields batches of data, not individual samples. The batch_size argument (defaulting to 32) determines the number of samples in each batch.
            label_mode: This parameter determines how labels are encoded (e.g., as integers, binary, or categorical vectors) or if they are included at all.
            labels: If set to 'inferred' (default), the directory structure is used to automatically generate labels. For example, files in main_directory/class_a/ get label 0, and files in main_directory/class_b/ get label 1. 

        The tf.data.Dataset is designed for building efficient input pipelines in TensorFlow, handling large amounts of data by processing it in batches and allowing for various transformations like shuffling, batching, and caching.

#### tf.data.Dataset.take


        The tf.data.Dataset.take(count) function in TensorFlow is used to create a new dataset that contains a specified number of elements from the beginning of the original dataset.
        Functionality:

            Subset Creation:
            It extracts the first count elements from the tf.data.Dataset it is called upon.
            New Dataset:
            It returns a new tf.data.Dataset object containing only these count elements. The original dataset remains unchanged.
            Handling count = -1 or count > dataset_size:
            If count is set to -1, or if the specified count is greater than the total number of elements in the dataset, the take() function will effectively return a new dataset containing all elements of the original dataset. 
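The `take` semantics described above can be imitated without TensorFlow. The function below is a plain-Python stand-in written for illustration (it is not the TensorFlow implementation), and the list of batch names is made up.

```python
from itertools import islice

def take(elements, count):
    """Plain-Python stand-in for Dataset.take: the first `count` elements,
    or every element when count is -1."""
    stop = None if count == -1 else count
    return list(islice(elements, stop))

batches = ["batch0", "batch1", "batch2"]

print(take(batches, 1))    # ['batch0']
print(take(batches, -1))   # ['batch0', 'batch1', 'batch2']
print(take(batches, 10))   # ['batch0', 'batch1', 'batch2']
print(batches)             # the source list is unchanged
```

This is why `train_ds.take(1)` is a convenient way to peek at a single batch, while `train_ds.take(-1)` simply yields the whole dataset again.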


### regex

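TensorFlow has its own regex ops, e.g. tf.strings.regex_replace and tf.strings.regex_full_match.

Plain Python regular expressions come from the standard re module (import re).

Python regular expressions are similar to R regular expressions.

A pattern can also be built with %-style string formatting, as in custom_standardization above.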


        TensorFlow's tf.strings.regex_replace and tf.strings.regex_full_match functions expect a regular expression pattern as a string. While Python's re module allows for various ways to construct and format regular expressions, when passing them to TensorFlow functions, the pattern must be a plain string.
        The % style format, also known as old-style string formatting, is a Python feature for creating strings with embedded values. While you can use it to construct a regular expression string, it's generally recommended to use f-strings (formatted string literals) for better readability and modern Python practices when creating dynamic regex patterns.
        Here's how you can use % style formatting to create a regex pattern string for use with TensorFlow:

        import tensorflow as tf

        # Define variables for the regex pattern
        num_digits = 3
        start_char = "A"

        # Create the regex pattern using % style formatting
        # This pattern matches a string starting with 'A' followed by exactly 3 digits
        regex_pattern = r"%s\d{%d}" % (start_char, num_digits)

        # Example usage with tf.strings.regex_full_match
        input_string = tf.constant(["A123", "B456", "A78"])
        matches = tf.strings.regex_full_match(input_string, regex_pattern)

        print(matches)

        In this example:

            regex_pattern = r"%s\d{%d}" % (start_char, num_digits) constructs the regex string.
            %s is a placeholder for a string (e.g., start_char).
            %d is a placeholder for an integer (e.g., num_digits).
            The r prefix before the string literal ensures it's treated as a raw string, preventing backslashes from being interpreted as escape sequences by Python before the regex engine sees them. 

        While this works, consider using f-strings for more modern and readable code when constructing dynamic regex patterns:

        import tensorflow as tf

        num_digits = 3
        start_char = "A"

        # Create the regex pattern using an f-string
        regex_pattern_fstring = rf"{start_char}\d{{{num_digits}}}"

        input_string = tf.constant(["A123", "B456", "A78"])
        matches_fstring = tf.strings.regex_full_match(input_string, regex_pattern_fstring)

        print(matches_fstring)

        In the f-string example, {{ and }} are used to escape the literal curly braces within the f-string, as they are part of the regex syntax for repetition.




In [1]:
import tensorflow as tf

strings = tf.constant(["apple", "banana", "apricot"])
pattern = r"a.*e"
matches = tf.strings.regex_full_match(strings, pattern)
print(matches)



tf.Tensor([ True False False], shape=(3,), dtype=bool)



## GlobalAveragePooling1D layer

You can think of pooling layers as a way to downsample (reduce the size of) the incoming feature vectors.

Max pooling takes the maximum value in the pool for each feature dimension, while average pooling takes the average. Max pooling seems to be more commonly used, as it highlights large values.

Global max/average pooling takes the maximum/average over the whole sequence, whereas ordinary (local) pooling requires you to define a pool size. Keras provides both kinds as layers that can be added to a Sequential model, e.g. GlobalAveragePooling1D and GlobalMaxPooling1D.
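The difference between local and global pooling can be shown with a small NumPy sketch. The input values are made up, and the local pooling here assumes non-overlapping windows (stride equal to the pool size).

```python
import numpy as np

# One sequence of 6 steps with 2 features per step: shape (6, 2).
x = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [5.0, 4.0],
              [7.0, 6.0],
              [9.0, 8.0],
              [11.0, 10.0]])

pool_size = 2  # needed only for local pooling

# Local pooling: split the sequence into windows of pool_size steps.
windows = x.reshape(-1, pool_size, x.shape[1])  # (3, 2, 2)
local_max = windows.max(axis=1)                 # (3, 2)
local_avg = windows.mean(axis=1)                # (3, 2)

# Global pooling: a single window covering the whole sequence.
global_max = x.max(axis=0)                      # (2,)
global_avg = x.mean(axis=0)                     # (2,)

print(local_max)
print(local_avg)
print(global_max, global_avg)
```

Local pooling shortens the sequence (6 steps become 3), while global pooling removes the sequence axis entirely and keeps one value per feature.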


https://stackoverflow.com/questions/75067335/what-does-globalaveragepooling1d-do-in-keras

        After passing the input sequences through an embedding layer, we get a 3D floating-point tensor with shape (samples, sequence_length, embedding_dimensionality). For example, a shape of (2, 3, 5) indicates 2 samples (sequences), each with a sequence length of 3 and an embedding dimensionality of 5.

        However, dense layers in neural networks require fixed-length input vectors. GlobalAveragePooling1D effectively reduces the sequence dimension, creating a single vector that summarizes the information from the entire sequence.

        Example:

        Embeddings:
        word1: [0.2, 0.4, -0.1, 0.3]
        word2: [0.1, -0.3, 0.2, 0.5]
        word3: [-0.2, 0.1, -0.5, 0.4]
        word4: [0.3, -0.2, 0.3, -0.1]
        word5: [-0.4, 0.3, 0.1, -0.2]

        Global Average Pooled Vector: [0, 0.06, 0, 0.18]





        In the example, the input of the tf.keras.layers.GlobalAveragePooling1D layer is a tensor of shape batch x sequence x embedding_size.

        It returns a matrix of shape batch x embedding_size by averaging over the sequence dimension. The average is taken over only one dimension, hence the 1D.

        The averaging can handle different sequence sizes, for example sentences of any length.
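The pooled vector for the five example embeddings can be recomputed with NumPy. This sketch uses a plain mean over the sequence axis as a stand-in for the Keras layer; the shorter three-word sequence is an extra assumption added to show that the output size does not depend on the sequence length.

```python
import numpy as np

# The five 4-dimensional word embeddings from the example above.
embeddings = np.array([
    [0.2, 0.4, -0.1, 0.3],
    [0.1, -0.3, 0.2, 0.5],
    [-0.2, 0.1, -0.5, 0.4],
    [0.3, -0.2, 0.3, -0.1],
    [-0.4, 0.3, 0.1, -0.2],
])

# Add a batch axis: (batch=1, sequence=5, embedding=4).
batch = embeddings[np.newaxis, ...]

# Average over the sequence axis (axis 1), as global average pooling does.
pooled = batch.mean(axis=1)

print(pooled.shape)         # (1, 4)
print(np.round(pooled, 2))

# A shorter sequence (first 3 words only) still gives a length-4 vector.
short_pooled = embeddings[:3][np.newaxis, ...].mean(axis=1)
print(short_pooled.shape)   # (1, 4)
```

Each output component is the column-wise mean; for the fourth column the sum is 0.9, so the mean over five words is 0.18.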


https://www.sapien.io/glossary/definition/global-pooling#:~:text=Global%20pooling%20layers%20are%20used,of%20the%20input%20image%20size.

        In global average pooling, the average value of each feature map is computed, resulting in a single scalar per map. This technique helps in maintaining spatial information and is less prone to overfitting compared to fully connected layers, which have more parameters.

        In global max pooling, the maximum value in each feature map is selected, capturing the most prominent feature detected by the convolutional filters. This method is useful for identifying the most significant feature in each map, which might be critical for certain classification tasks.

        Global pooling is particularly beneficial in deep learning architectures where it simplifies the model, reduces the number of parameters, and enhances generalization by preventing overfitting. It also makes the model invariant to the input size, which is useful for handling images of different sizes.


### The feature concept in a neural network

What do "feature" and "feature map" mean for text embeddings?

It is like a sequencing count matrix, which is sample x gene (samples in rows and genes in columns). Here each gene is a feature.

In this case, the feature map is sample x sequence x embedding, and each embedding dimension is a feature. After GlobalAveragePooling1D (or GlobalMaxPooling1D), we have sample x embedding (averaged or maximized over the sequence).

The variable-length sequence dimension is removed, so every sample ends up with a fixed-length vector.
