# Deep learning part 5: text classification

## Aim:

+ learning text classification
+ practising jupyter lab in vs code

Ref: https://www.tensorflow.org/text/guide/word_embeddings



## Text embedding

 + One-hot encoding: inefficient, because it produces a sparse matrix.

 + Integer encoding: represent each word with a unique integer. This is efficient and gives a dense matrix, but it captures no relationship between words.

 + Word embedding: captures relationships between words without specifying them manually. The embedding is an extra layer in the model, and its values are learned as part of training.

We only need to choose the embedding dimension, i.e. the length of each word vector (for example, 8 for a small dataset and up to 1024 for a large one).

e.g., a 4-dimensional word embedding:

    cat: [2.1, 1.5, -0.5, 4]
    mat: [1.8, -0.5, 0, -1]
    .....

The values are floating-point numbers.
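To make the three encodings concrete, here is a minimal NumPy sketch on a toy vocabulary. The vocabulary, the sentence, and the random embedding table are all made up for illustration; a real embedding table would be learned during training rather than drawn at random.

```python
import numpy as np

# Toy vocabulary and sentence (assumptions, not part of the IMDb data).
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}
sentence = ["the", "cat", "sat", "on", "the", "mat"]

# Integer encoding: one integer per word.
integer_encoded = np.array([word_to_index[w] for w in sentence])

# One-hot encoding: one row per word, one column per vocabulary entry.
one_hot = np.eye(len(vocab))[integer_encoded]

# Embedding: one dense row of floats per word, looked up by its integer.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))
embedded = embedding_table[integer_encoded]

print(integer_encoded.shape)  # (6,)
print(one_hot.shape)          # (6, 5)
print(embedded.shape)         # (6, 4)
```

The one-hot width grows with the vocabulary size, while the embedding width stays at whatever dimension we choose.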

Next, let's see some examples.

In [None]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

print(f"tensorflow version:{tf.__version__}")
print(f"gpu:{tf.config.list_logical_devices('GPU')}")

## Get the dataset

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb_v1_extracted/aclImdb')
os.listdir(dataset_dir)

In [None]:
batch_size = 64
seed = 123
print(dataset_dir)
train_dir=os.path.join(dataset_dir,'train')

print(train_dir)

print(os.listdir(train_dir))
assert(os.path.isdir(train_dir))

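# Remove the 'unsup' folder from the training folder so that only the
# labeled classes remain. Guarded so the cell can be re-run safely.
remove_dir = os.path.join(train_dir, 'unsup')
if os.path.isdir(remove_dir):
    shutil.rmtree(remove_dir)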



train_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir,'train'), batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir,'train'), batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

print("type of train_ds", type(train_ds))

In [None]:
# Show one batch of reviews together with their labels.
for text_batch, label_batch in train_ds.take(1):
  print(type(text_batch))
  print(text_batch.shape)
  print(label_batch.numpy(), text_batch.numpy())



#configure
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Embedding layer


The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.
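The lookup-table view can be sketched in plain NumPy. This is only an illustration of the idea, not how Keras implements the layer: `table` is a hypothetical stand-in for the layer's weight matrix, filled with arbitrary small random values.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 5
rng = np.random.default_rng(0)

# Stand-in for the layer's weights: one row of floats per word index.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of 2 sequences with 3 word indices each.
batch = np.array([[0, 1, 2], [2, 4, 5]])   # shape (2, 3)

# Indexing the table with the batch replaces every integer by its row.
vectors = table[batch]                      # shape (2, 3, 5)

print(vectors.shape)
# The same index always maps to the same vector (index 2 appears twice).
print(np.array_equal(vectors[0, 2], vectors[1, 0]))
```

The output has one more axis than the input, because each integer has been replaced by a vector of length `embedding_dim`.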

Let's see an example with a model that consists of a single embedding layer.

In [None]:
import numpy as np
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(1000, 64))
# The model takes an integer matrix of shape (batch, input_length) as input.
# Every integer (i.e. word index) in the input must lie in [0, 999],
# because the vocabulary size is 1000.
# For the (32, 10) input below, the output has shape (32, 10, 64).
input_array = np.random.randint(1000, size=(32, 10))
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print(output_array.shape)

print(output_array)

In [None]:
#play to show some input output

print(input_array[0:2])

print(output_array[0:2])

In [None]:
# Can we feed words directly? No: an Embedding layer only accepts
# integer indices as input. Turning words into integers is the job
# of the TextVectorization layer.
input_words = [['I', 'like', 'you'],
               ['he', 'does', 'hate']]

# Building a string tensor works fine...
tf_array = tf.constant(input_words)

print(tf_array)

# ...but passing it to the model does not, so the call stays commented out.
#output_words=model.predict(tf_array)

print("We need to vectorize text into integers first.")

Next, we want to play more with the embedding layer

In [None]:
# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation.

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table. This is how to retrieve the embeddings.

In [None]:
result = embedding_layer(tf.constant([1, 2, 3]))

print(result.numpy())

result2 = embedding_layer(tf.constant([1,2,2]))

print(result2) #check for duplicated entries


Notes: **Only input tensors may be passed as positional arguments. The following argument value should be passed as a keyword argument:**

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N), where N is the embedding dimension (5 for the layer above).

Note: in this case, the first axis is the batch dimension.

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). 

In [None]:
result = embedding_layer(tf.constant([[0, 1, 2], [2, 4, 5]]))
print(result.shape)

print(result)

## Text preprocessing

We do two things here: standardize (normalize) the texts, and then vectorize them.

Standardizing means lowercasing all the text, stripping HTML tags, and removing punctuation. So far that is all.

Vectorizing means turning the texts into integers.
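As a rough pure-Python sketch of these two steps (no TensorFlow involved), the helpers below are hypothetical names written for illustration. The index conventions, 0 for padding and 1 for unknown words, are meant to mirror what `TextVectorization` does in `'int'` mode, but the details of the real layer may differ.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop ASCII punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(texts, max_tokens):
    """Index the most frequent words; indices 0 and 1 are reserved."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocabulary, sequence_length):
    """Map words to integers, then truncate or zero-pad to a fixed length."""
    ids = [vocabulary.get(w, 1) for w in standardize(text).split()]  # 1 = unknown
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))                  # 0 = padding

# Made-up reviews, only for demonstration.
reviews = ["Great movie!<br />Great cast.", "Terrible movie..."]
vocabulary = build_vocabulary(reviews, max_tokens=10)

print(standardize(reviews[0]))  # great movie great cast
print(vectorize("Great plot, terrible cast", vocabulary, sequence_length=6))
```

The word "plot" never appears in the made-up reviews, so it maps to the unknown index 1, and the short input is padded with zeros up to `sequence_length`.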

In [None]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')


# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because the samples do not all have the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)

# Inspect the dataset types, one labeled element, and the element specs.
print(type(text_ds), " and ", type(train_ds.take(-1)))

for element in train_ds.take(1):
  print("--", element)
print("=---------===")
print(text_ds.element_spec, " and ", train_ds.element_spec)
vectorize_layer.adapt(text_ds)

## Build the model

The TextVectorization layer transforms strings into vocabulary indices. You have already initialized vectorize_layer as a TextVectorization layer and built its vocabulary by calling adapt on text_ds. Now vectorize_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.

The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).

The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

The fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.

The last layer is densely connected with a single output node.
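The shapes described above can be traced with a small NumPy sketch. All weights here are random stand-ins, chosen only so the shapes can be followed; it assumes the text has already been vectorized into integer ids, and it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embedding_dim = 10000, 100, 16
batch = 4

# Random stand-ins for trained weights (assumption: shapes only matter here).
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
w1, b1 = rng.normal(size=(embedding_dim, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# Pretend output of the vectorization step: integer ids per review.
token_ids = rng.integers(0, vocab_size, size=(batch, sequence_length))

embedded = embedding_table[token_ids]        # (batch, sequence, embedding)
pooled = embedded.mean(axis=1)               # average over the sequence axis
hidden = np.maximum(pooled @ w1 + b1, 0.0)   # Dense(16) with ReLU
logits = hidden @ w2 + b2                    # Dense(1), one value per review

for name, value in [("embedded", embedded), ("pooled", pooled),
                    ("hidden", hidden), ("logits", logits)]:
    print(name, value.shape)
```

Averaging over the sequence axis is what removes the dependence on the review length: whatever the sequence length, `pooled` has one vector of size `embedding_dim` per review.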

In [None]:
embedding_dim=16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15
    #callbacks=[tensorboard_callback]
    )
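Because the last layer is `Dense(1)` without an activation and the loss is created with `from_logits=True`, the model outputs raw logits rather than probabilities. The sketch below shows, in plain Python, how a logit relates to a probability and to the binary cross-entropy loss. The function names are made up for illustration, and the loss formula is a standard numerically stable form, not code taken from Keras.

```python
import math

def sigmoid(logit):
    """Turn a raw logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for a single raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Example logits (made up): negative, neutral, positive.
for logit in (-2.0, 0.0, 3.0):
    prob = sigmoid(logit)
    predicted = int(logit > 0)  # same decision as prob > 0.5
    print(f"logit={logit:+.1f} prob={prob:.3f} predicted={predicted} "
          f"loss_if_positive={bce_from_logit(logit, 1.0):.3f}")
```

A logit of 0 corresponds to a probability of 0.5, so thresholding the logit at 0 gives the same class decision as thresholding the probability at 0.5.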