
In [23]:
import os
import tensorflow as tf
import re
import numpy as np
import shutil
import string
from tensorflow.keras import losses
from tensorflow.keras import layers

In [9]:
# Download and extract the IMDB review dataset
dataset_link = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file(fname = "aclImdb_v1", origin = dataset_link, untar = True, cache_dir = ".", cache_subdir = "")
dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [10]:
os.listdir(dataset_dir)

['train', 'README', 'imdbEr.txt', 'imdb.vocab', 'test']

In [12]:
# Locate the training directory and list its contents
train_dir = os.path.join(dataset_dir, "train")
os.listdir(train_dir)

['urls_unsup.txt',
 'pos',
 'unsup',
 'neg',
 'unsupBow.feat',
 'urls_pos.txt',
 'urls_neg.txt',
 'labeledBow.feat']

In [13]:
test_dir = os.path.join(dataset_dir, "test")
os.listdir(test_dir)

['pos', 'neg', 'urls_pos.txt', 'urls_neg.txt', 'labeledBow.feat']

In [15]:
#Let us fetch a sample file in the train set
sample_file = os.path.join(train_dir, "pos/1181_9.txt")
with open(sample_file) as f:
  print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


In [16]:
os.listdir(train_dir)

['urls_unsup.txt',
 'pos',
 'unsup',
 'neg',
 'unsupBow.feat',
 'urls_pos.txt',
 'urls_neg.txt',
 'labeledBow.feat']

In [17]:
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

In [18]:
os.listdir(train_dir)

['urls_unsup.txt',
 'pos',
 'neg',
 'unsupBow.feat',
 'urls_pos.txt',
 'urls_neg.txt',
 'labeledBow.feat']

In [20]:
# Create a validation set by holding out 20% of the training files
batch_size = 32
seed = 42
raw_training_dataset = tf.keras.utils.text_dataset_from_directory(train_dir, batch_size = batch_size, seed = seed,
                                                                  shuffle = True, validation_split = 0.2, subset = "training")

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [21]:
raw_validation_dataset = tf.keras.utils.text_dataset_from_directory(train_dir,
                                                                    batch_size = batch_size,
                                                                    seed = seed,
                                                                    shuffle = True,
                                                                    validation_split = 0.2,
                                                                    subset  = "validation")

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
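The two calls above pass the same `seed` and `validation_split`, which is what keeps the training and validation subsets from overlapping. The sketch below illustrates that idea in plain Python. It is an assumption-laden analogue, not Keras's actual implementation: `split_files` and the file names are hypothetical, and the real shuffling order will differ.

```python
import random

def split_files(file_paths, validation_split, seed):
    """Shuffle once with a fixed seed, then cut the list into two disjoint parts."""
    shuffled = list(file_paths)
    random.Random(seed).shuffle(shuffled)
    num_val = int(validation_split * len(shuffled))
    training = shuffled[:-num_val] if num_val else shuffled
    validation = shuffled[-num_val:] if num_val else []
    return training, validation

# Hypothetical file names standing in for the 25000 labeled reviews.
files = [f"review_{i}.txt" for i in range(25000)]
train_files, val_files = split_files(files, validation_split=0.2, seed=42)

print(len(train_files), len(val_files))
print(set(train_files).isdisjoint(val_files))
```

Because the shuffle is deterministic for a given seed, asking for the "training" part and the "validation" part in two separate calls still yields complementary subsets. With different seeds the two subsets could share files.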


In [27]:
# Print a few reviews and their labels from the first training batch
for sample_train_batch, label_batch in raw_training_dataset.take(1):
  for i in range(5):
    print(f"Review: {sample_train_batch.numpy()[i]}")
    print(f"Label: {label_batch.numpy()[i]}")

Review: b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have finally found that special someone.'
Label: 0
Review: b"I am shocked. Shocked and dismayed that the 428 of you IMDB users who voted before me have not given this film a rating of higher than 7. 7?!?? - that's a C!. If I could give FOBH a 20, I'd gladly do it. This film ranks high atop the pantheon of modern comedy, alongside Half Baked and Mallrats, as one of the most hilarious films of all time. If you know _anything_ about rap music - YOU MUST SEE THIS!! If you know nothing about rap music - learn something!, and then see this! Comparisons to 'Spinal Tap' fail to appreciate the inspired genius of this unique film. If you liked Bob Roberts, you'll love this. Watch it and vote it a 10!"
Label: 1
Review: b'What a lovely heart warming television movie. The story tells of a little five year old girl who has lost her daddy and finds it impossible to cope. Her mother is also very distres

In [30]:
raw_training_dataset.class_names[1]

'pos'

In [31]:
raw_testing_dataset = tf.keras.utils.text_dataset_from_directory(test_dir)

Found 25000 files belonging to 2 classes.


Preparing the data for training

In [32]:
""" This will largely involve standardize, tokenize, and vectorize the data using the helpful tf.keras.layers.TextVectorization layer. """
# Create a custom standardizer
def custom_standardizer(input_data):
  # Convert to lower case
  lower_case = tf.strings.lower(input_data)
  # Replace the HTML line-break tags with a space
  stripped_html = tf.strings.regex_replace(lower_case, "<br />", " ")
  # Replace every punctuation character with a space
  return tf.strings.regex_replace(stripped_html, "[%s]" % re.escape(string.punctuation), " ")
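To see what the standardizer does to a single review, here is a rough pure-Python analogue that applies the same three steps to an ordinary string. `standardize_py` and the sample sentence are made up for illustration; the notebook itself uses the `tf.strings` version so that it can run inside the `TextVectorization` layer.

```python
import re
import string

def standardize_py(text):
    """Pure-Python analogue of custom_standardizer for one string."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    # Every punctuation character becomes a space, as in the notebook's regex.
    return re.sub("[%s]" % re.escape(string.punctuation), " ", no_breaks)

sample = "Great movie!<br /><br />Don't miss it."  # made-up example text
print(standardize_py(sample).split())
```

One side effect worth noticing: because punctuation is replaced by a space rather than removed, a contraction such as "don't" is split into the two tokens "don" and "t".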

In [34]:
# Create the vectorization layer
maximum_features = 10000
sequence_length = 250
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens = maximum_features,
                                                    standardize = custom_standardizer,
                                                    output_mode = "int",
                                                    output_sequence_length=sequence_length)

# Next, make a text-only dataset (without labels), then call adapt to build the vocabulary

In [35]:
training_text = raw_training_dataset.map(lambda x, y:x)
vectorize_layer.adapt(training_text)
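The following is a minimal sketch of what `adapt` followed by integer vectorization amounts to conceptually: count tokens, keep the most frequent ones, and map each review to a fixed-length list of ids. It assumes the usual convention of index 0 for padding and index 1 for out-of-vocabulary tokens. `build_vocabulary`, `vectorize`, and the tiny corpus are hypothetical, and the real layer may order tied tokens differently.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, reserving two slots for padding and unknowns."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to ids, then truncate or zero-pad to sequence_length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["a great great movie", "a dull movie"]  # toy corpus, not the IMDB data
vocab = build_vocabulary(corpus, max_tokens=5)
print(vocab)
print(vectorize("a great unseen movie", vocab, sequence_length=6))
```

In the notebook the same roles are played by `max_tokens = maximum_features` and `output_sequence_length = sequence_length`, so every review becomes a row of 250 integers regardless of its original length.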

In [38]:
# Helper that vectorizes the review text and passes the label through unchanged
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [42]:
text_batch, label_batch = next(iter(raw_training_dataset))
first_review, first_label = text_batch[0], label_batch[0]
print(f"Review: {first_review}")
print(f"Label: {raw_training_dataset.class_names[first_label]}")
print(f"Vectorized_text: {vectorize_text(first_review, first_label)}")

Review: b'"A young woman suffers from the delusion that she is a werewolf, based upon a family legend of an ancestor accused of and killed for allegedly being one. Due to her past treatment by men, she travels the countryside seducing and killing the men she meets. Falling in love with a kind man, her life appears to take a turn for the better when she is raped and her lover is killed by a band of thugs. Traumatized again by these latest events, the woman returns to her violent ways and seeks revenge on the thugs," according to the DVD sleeve\'s synopsis.<br /><br />Rino Di Silvestro\'s "La lupa mannara" begins with full frontal, writhing, moaning dance by shapely blonde Annik Borel, who (as Daniella Neseri) mistakenly believes she is a werewolf. The hottest part is when the camera catches background fire between her legs. The opening "flashback" reveals her hairy ancestor was (probably) a lycanthropic creature. Ms. Borel is, unfortunately, not a werewolf; she is merely a very strong l

In [48]:
# We can look up the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.
print(f"1420: {vectorize_layer.get_vocabulary()[1420]}")

1420: falling


#We are nearly ready to train our model. As a final preprocessing step, we will apply the TextVectorization layer we created earlier to the training, validation, and test datasets.

In [49]:
training_set = raw_training_dataset.map(vectorize_text)
validation_set = raw_validation_dataset.map(vectorize_text)
testing_set = raw_testing_dataset.map(vectorize_text)

In [50]:
AUTOTUNE = tf.data.AUTOTUNE
train_ds = training_set.cache().prefetch(buffer_size =AUTOTUNE )
validation_ds = validation_set.cache().prefetch(buffer_size = AUTOTUNE)
test_ds = testing_set.cache().prefetch(buffer_size = AUTOTUNE)

#The Juicy Part -- Creating our Neural Network

In [53]:
embedding_dim = 16

In [54]:
model = tf.keras.Sequential([
    layers.Embedding(maximum_features, embedding_dim),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation = "sigmoid")
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          160000    
                                                                 
 dropout (Dropout)           (None, None, 16)          0         
                                                                 
 global_average_pooling1d (  (None, 16)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense (Dense)               (None, 1)                 17        
                                                                 
Total params: 160017 (625.07 KB)
Trainable params: 160017 (625.07 KB)
Non-trainable params: 0 (0.00 Byte)
________________
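The parameter counts in the summary can be checked by hand. The arithmetic below reproduces them from `maximum_features` and `embedding_dim`. The size figure assumes 32-bit float weights, which is the usual default but is not stated in the notebook.

```python
maximum_features = 10000
embedding_dim = 16

# Embedding: one embedding_dim-sized vector per token id.
embedding_params = maximum_features * embedding_dim
# Dense(1): one weight per pooled feature plus a single bias.
dense_params = embedding_dim * 1 + 1
# Dropout and GlobalAveragePooling1D have no trainable parameters.
total_params = embedding_params + dense_params

# Assumption: each parameter is stored as a 4-byte float32.
size_kb = total_params * 4 / 1024
print(embedding_params, dense_params, total_params, round(size_kb, 2))
```

Almost all of the model's capacity sits in the embedding table, while the classifier on top contributes only 17 parameters.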

#Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), you'll use the losses.BinaryCrossentropy loss function.
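As a rough illustration of what binary cross-entropy measures, the sketch below computes the mean of -(y log p + (1 - y) log(1 - p)) over a few label/probability pairs in plain Python. `binary_crossentropy` here is a hand-written stand-in rather than the Keras class, and the `eps` clipping value is an assumption added only to avoid taking log(0).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss.
print(binary_crossentropy([1, 0], [0.9, 0.2]))
# An uninformative prediction of 0.5 gives a loss of ln(2).
print(binary_crossentropy([1], [0.5]))
```

The loss grows without bound as a prediction approaches the wrong extreme, which is why a confidently wrong answer is penalized far more heavily than an uncertain one.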

In [55]:
model.compile(loss = losses.BinaryCrossentropy(),
              optimizer = "adam",
              metrics = [tf.metrics.BinaryAccuracy(threshold = 0.5)])
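The `threshold = 0.5` argument decides how a predicted probability is turned into a class before it is compared with the label. A minimal plain-Python sketch of that idea follows. `binary_accuracy` is a hypothetical helper, and the sample labels and probabilities are made up.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions whose thresholded class matches the label."""
    predicted = [1 if p > threshold else 0 for p in y_pred]
    correct = sum(int(y == c) for y, c in zip(y_true, predicted))
    return correct / len(y_true)

labels = [1, 0, 1, 0]
probabilities = [0.8, 0.3, 0.4, 0.6]  # made-up model outputs
print(binary_accuracy(labels, probabilities))
print(binary_accuracy(labels, probabilities, threshold=0.35))
```

Moving the threshold trades one kind of error for the other, so accuracy at 0.5 is only one view of how well the probabilities separate the classes.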

#Let us now train the model

In [63]:
epochs = 10
history = model.fit(
    train_ds,
    validation_data=validation_ds,
    epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
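The object returned by `fit` keeps per-epoch metrics in its `history` attribute, a dictionary of lists keyed by metric name. The sketch below shows one way such a dictionary could be inspected to find the epoch with the lowest validation loss. `best_epoch` is a hypothetical helper, and the numbers are placeholders for illustration only, not results from this notebook.

```python
def best_epoch(history_dict, monitor="val_loss"):
    """Return (1-based epoch, value) where the monitored metric is lowest."""
    values = history_dict[monitor]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Placeholder numbers for illustration only; they are NOT from this training run.
example_history = {
    "loss": [0.66, 0.55, 0.47],
    "val_loss": [0.61, 0.50, 0.52],
}
epoch, value = best_epoch(example_history)
print(f"Lowest val_loss {value} at epoch {epoch}")
```

In the notebook the equivalent call would be `best_epoch(history.history)`. A validation loss that starts rising while the training loss keeps falling is a common sign of overfitting.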


In [64]:
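# Each fit() call resumes from the current weights, so this loop trains for 10 more epochs, one per call.
# Printing the return value shows only the History object; its .history attribute holds the metrics.
for epoch in range(epochs):
  print(epoch)
  print(model.fit(train_ds, validation_data=validation_ds))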

0
<keras.src.callbacks.History object at 0x7d5071755ba0>
1
<keras.src.callbacks.History object at 0x7d506e1713c0>
2
<keras.src.callbacks.History object at 0x7d506ff255d0>
3
<keras.src.callbacks.History object at 0x7d506ff27a00>
4
<keras.src.callbacks.History object at 0x7d507172f310>
5
<keras.src.callbacks.History object at 0x7d507154e170>
6
<keras.src.callbacks.History object at 0x7d507172f3a0>
7
<keras.src.callbacks.History object at 0x7d506ff24430>
8
<keras.src.callbacks.History object at 0x7d507154d8d0>
9
<keras.src.callbacks.History object at 0x7d506ff27310>


In [65]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_ds)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

Loss: 0.3388782739639282
Accuracy: 0.8648800253868103
