# Sentiment Analysis on the IMDB Dataset

### Core Machine Learning
*   **`tensorflow as tf`**: This is the fundamental library for the entire project. TensorFlow provides the tools to build, train, and deploy machine learning models, especially neural networks.
*   **`tensorflow.keras.layers`**: Keras is TensorFlow's user-friendly API for building neural networks. The `layers` module provides the building blocks for your model, such as `Embedding` (to convert words into numerical vectors), `Dense` (standard neural network layers), and `Dropout` (to prevent overfitting).
*   **`tensorflow.keras.losses`**: This provides the loss functions, which are used to measure how inaccurate the model's predictions are. The goal of training is to minimize this value. The tutorial uses `BinaryCrossentropy` for a two-class problem (positive/negative).
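
To make the loss concrete, here is a minimal pure-Python sketch of binary cross-entropy. It is not the Keras implementation: the function name `binary_crossentropy`, the `eps` clipping value, and the sample probabilities are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over a list of labels and probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip so that log(0) can never occur (eps is an illustrative choice).
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical predictions for one positive and one negative review.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print("Loss when mostly right:", round(confident_right, 3))
print("Loss when mostly wrong:", round(confident_wrong, 3))
```

The loss is small when the predicted probability sits close to the true label and grows quickly as the prediction moves toward the wrong class, which is why minimizing it pushes the model toward correct, confident predictions.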

### Data Handling and Preparation
*   **`numpy as np`**: A core library for numerical operations. TensorFlow is tightly integrated with NumPy, and NumPy arrays are a common way to hold the numerical data (vectors and matrices) that the model processes.
*   **`pandas as pd`**: Imported in the first cell but not used elsewhere in this notebook. Pandas is the usual choice for reading, writing, and manipulating structured data, such as a CSV file loaded into a DataFrame.
*   **`os`**: This module helps interact with the operating system, primarily for handling file paths and directories. The tutorial uses it to manage the dataset files after they are downloaded.
*   **`shutil`**: Provides high-level file operations. The tutorial uses it to remove the downloaded dataset directory to clean up the workspace.

### Text Processing
*   **`re`**: The regular expression module. It's crucial for cleaning text data. In the tutorial, it's used to remove HTML tags like `<br />` from the movie reviews.
*   **`string`**: This module contains common string constants. The tutorial uses `string.punctuation` to easily access a list of all punctuation marks that should be stripped from the text.
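
As a rough illustration of how `re` and `string.punctuation` combine, the sketch below applies the same three cleaning steps (lowercase, replace `<br />`, strip punctuation) to an ordinary Python string. The function name `standardize` and the sample sentence are made up for this example; the notebook itself performs these steps with TensorFlow string ops so they can run inside the model.

```python
import re
import string

def standardize(text):
    """Lowercase, replace <br /> tags with a space, and strip punctuation."""
    lowered = text.lower()
    no_html = lowered.replace("<br />", " ")
    # re.escape makes every punctuation character literal inside the [...] class.
    return re.sub("[%s]" % re.escape(string.punctuation), "", no_html)

cleaned = standardize("Great film!<br />Loved it.")
print(cleaned)  # great film loved it
```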

### Data Visualization
*   **`matplotlib.pyplot as plt`**: The primary library for creating plots and graphs. It's used at the end of the tutorial to visualize the model's training and validation accuracy and loss over time. This helps you understand if the model is learning effectively or overfitting.
*   **`seaborn as sns`**: Built on top of Matplotlib, Seaborn provides more advanced and aesthetically pleasing statistical plots. While not strictly required by the tutorial, it's often used for tasks like creating confusion matrices or visualizing data distributions.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

print(tf.__version__)


2.20.0


In [2]:
# Download and explore the IMDB dataset

# 1. Get the data from the web
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')



In [3]:
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb_v1')

In [4]:
os.listdir(dataset_dir)

['aclImdb']

In [5]:
train_dir = os.path.join(dataset_dir, 'aclImdb', 'train')
test_dir = os.path.join(dataset_dir, 'aclImdb', 'test')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [6]:
sample_file = os.path.join(train_dir, 'pos', '1181_9.txt')
# Passing an explicit encoding avoids depending on the platform default.
with open(sample_file, encoding='utf-8') as f:
  print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


In [8]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '.\\aclImdb_v1\\aclImdb\\train\\unsup\\14837_0.txt'
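
On Windows this error often means that another process (for example an antivirus scanner or a search indexer) still holds one of the files open. The removal matters: if `unsup` stays in `train_dir`, `text_dataset_from_directory` will infer a third class from it. One possible workaround is to retry the deletion a few times. The sketch below is an assumption-laden helper, not part of the tutorial: the name `remove_tree_with_retry` and the `attempts` and `delay` defaults are hypothetical, and it is demonstrated on a temporary directory rather than the real dataset.

```python
import os
import shutil
import tempfile
import time

def remove_tree_with_retry(path, attempts=5, delay=1.0):
    """Try to delete `path`, pausing between attempts if a file is locked."""
    for _ in range(attempts):
        try:
            shutil.rmtree(path)
        except FileNotFoundError:
            break  # Already gone, nothing left to do.
        except PermissionError:
            time.sleep(delay)  # Give the other process time to release the file.
        else:
            break  # Deleted successfully.
    return not os.path.exists(path)

# Demonstration on a throwaway directory instead of the real dataset.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "unsup"))
removed = remove_tree_with_retry(os.path.join(demo_dir, "unsup"), delay=0.1)
print("Removed:", removed)
shutil.rmtree(demo_dir)
```

In the notebook, the equivalent call would be `remove_tree_with_retry(remove_dir)`. If it still returns `False`, closing other programs that may be scanning the folder and rerunning the cell is the next thing to try.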

In [None]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

In [None]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

In [None]:
print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

In [None]:
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
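
The training and validation calls share the same `seed` and `validation_split`, which is what keeps the two subsets from overlapping. The sketch below illustrates that idea with plain Python lists. It is not how Keras implements the split internally; `split_indices` is a hypothetical helper, and 25,000 is the number of labeled training reviews in this dataset.

```python
import random

def split_indices(n_samples, validation_split, seed):
    """Shuffle indices deterministically, then hold out the final fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # Same seed gives the same order.
    split_at = n_samples - int(n_samples * validation_split)
    return indices[:split_at], indices[split_at:]

train_idx, val_idx = split_indices(25000, validation_split=0.2, seed=42)
train_again, val_again = split_indices(25000, validation_split=0.2, seed=42)

print(len(train_idx), len(val_idx))          # 20000 5000
print(set(train_idx).isdisjoint(val_idx))    # True
print(val_idx == val_again)                  # True
```

With a different seed in the second call, the two shuffles would disagree and some reviews could land in both subsets.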

In [None]:
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

In [None]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  # Replace the HTML line breaks that appear in the raw reviews with a space.
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

In [None]:
max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

In [None]:
# Make a text-only dataset (without labels), then call adapt
print("Starting to adapt the vectorize_layer...")

# adapt() only needs the review text, so drop the labels first.
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

print("Adaptation complete.")

In [None]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label
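
To show what integer vectorization produces, here is a simplified pure-Python sketch of the idea: build a vocabulary ordered by frequency, reserve index 0 for padding and index 1 for unknown words, then pad or truncate each review to a fixed length. It is a stand-in for illustration, not the `TextVectorization` implementation; `build_vocab`, `vectorize`, and the tiny corpus are hypothetical.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent words, after reserving two special entries."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common  # index 0 = padding, index 1 = unknown

def vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or pad to sequence_length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["the movie was great", "the movie was terrible", "the movie", "the"]
vocab = build_vocab(corpus, max_tokens=5)
encoded = vectorize("the movie was awful", vocab, sequence_length=6)

print(vocab)    # ['', '[UNK]', 'the', 'movie', 'was']
print(encoded)  # [2, 3, 4, 1, 0, 0]
```

The word "awful" never appeared in the corpus, so it maps to index 1, and the two trailing zeros are padding up to the requested length.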

In [None]:
# retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

In [None]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

In [None]:
vocab = vectorize_layer.get_vocabulary()

max_index = len(vocab) - 1
print(f"Total vocabulary size: {len(vocab)}")
print(f"Maximum index: {max_index}")

In [None]:
#show the first 20 words in the vocabulary
print(f"First 20 words in the vocabulary:\n {vocab[:20]}")

In [None]:
#show the last 20 words in the vocabulary
print(f"Last 20 words in the vocabulary:\n {vocab[-20:]}")

In [None]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [None]:
embedding_dim = 16

In [None]:
model = tf.keras.Sequential([
  layers.Embedding(max_features, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1, activation='sigmoid')])

model.summary()
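
The parameter totals reported by `model.summary()` can be checked by hand. The arithmetic below assumes the values set earlier in the notebook (`max_features = 10000`, `embedding_dim = 16`). Only the `Embedding` and `Dense` layers have trainable weights; `Dropout` and `GlobalAveragePooling1D` have none.

```python
max_features = 10000
embedding_dim = 16

# Embedding: one vector of length embedding_dim per vocabulary index.
embedding_params = max_features * embedding_dim

# Dense(1): one weight per input feature plus a single bias.
dense_params = embedding_dim * 1 + 1

total_params = embedding_params + dense_params
print("Embedding:", embedding_params)  # 160000
print("Dense:", dense_params)          # 17
print("Total:", total_params)          # 160017
```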

In [None]:
model.compile(loss=losses.BinaryCrossentropy(),
              optimizer='adam',
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.5)])

In [None]:
epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

In [None]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
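
A common way to read this plot is to look for the epoch where validation loss stops falling. After that point, further training mostly fits the training set. The sketch below picks that epoch from a list of values. The numbers are invented for illustration and are not results from this notebook, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Made-up validation losses: they fall, flatten, then rise again.
example_val_loss = [0.62, 0.48, 0.40, 0.36, 0.35, 0.36, 0.38]
epoch = best_epoch(example_val_loss)
print("Lowest validation loss at epoch", epoch)  # 5
```

With a real run, the same function could be applied to `history_dict['val_loss']`.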

In [None]:
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

plt.show()

In [None]:
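# `model` already ends in a sigmoid, so no extra activation layer is added here;
# applying sigmoid twice would squeeze every output into the range (0.5, 0.73).
export_model = tf.keras.Sequential([
  vectorize_layer,
  model
])

export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer="adam",
    metrics=['accuracy'])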

# Test it with `raw_test_ds`, which yields raw strings
metrics = export_model.evaluate(raw_test_ds, return_dict=True)
print(metrics)

In [None]:
examples = tf.constant([
  "The movie was great!",
  "The movie was okay.",
  "The movie was terrible..."
])

export_model.predict(examples)
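
`predict` returns one score per input string. A minimal sketch of turning such scores into readable labels follows, assuming the scores are probabilities that have been flattened into a plain list (for example with `export_model.predict(examples).ravel().tolist()`). The scores shown are invented for illustration, `to_labels` is a hypothetical helper, and the class order matches the notebook's `class_names`, where label 0 is `neg` and label 1 is `pos`.

```python
class_names = ["neg", "pos"]  # label 0 -> neg, label 1 -> pos

def to_labels(probabilities, threshold=0.5):
    """Map each probability to a class name using a fixed threshold."""
    return [class_names[int(p >= threshold)] for p in probabilities]

# Made-up scores, not actual model output.
example_scores = [0.91, 0.52, 0.08]
labels = to_labels(example_scores)
print(labels)  # ['pos', 'pos', 'neg']
```

The 0.5 threshold mirrors the one passed to `BinaryAccuracy` during training; a score close to it, such as the second one here, signals a prediction the model is unsure about.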