# IMDB Sentiment Analysis
This notebook trains a model to classify movie reviews as positive or negative. The model takes the raw text of each review as input.

### Importing necessary tools

In [40]:
import tensorflow as tf
from tensorflow import version

import os
import shutil  # shell utilities

In [41]:
print("TensorFlow version: ", version.VERSION)

TensorFlow version:  2.11.0


### Downloading the dataset

If the dataset has not been downloaded yet, download and extract it:

In [42]:
if not os.path.isdir("data/aclImdb"):
    url = r"https://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz"
    dataset = tf.keras.utils.get_file(
        fname="aclImdb_v1", origin=url,
        cache_dir="data", cache_subdir='',
        extract=True,
    )
    dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
else:
    dataset_dir = "data/aclImdb"

os.listdir(dataset_dir)

['train', 'test', 'imdbEr.txt', 'README', '.ipynb_checkpoints', 'imdb.vocab']

Listing the contents of the 'train' directory and printing one sample to inspect the input format:

**Note**: The sample review contains HTML `<br />` tags, which must be removed during preprocessing.

In [43]:
train_dir = os.path.join(dataset_dir, "train")
print(train_dir, ":\n\t" , os.listdir(train_dir), sep='', end="\n\n")

pos_dir = os.path.join(train_dir, "pos")
print(pos_dir, ":\n\t" , os.listdir(pos_dir)[:5], "...", sep='', end="\n\n")

sample_file = os.path.join(pos_dir, "7142_8.txt")
with open(sample_file, 'r', encoding="utf-8") as f:
    print("Sample positive review", ":\n\t", f.read())

data/aclImdb/train:
	['pos', 'neg', 'urls_neg.txt', 'labeledBow.feat', 'unsupBow.feat', 'urls_pos.txt', 'urls_unsup.txt']

data/aclImdb/train/pos:
	['6797_8.txt', '736_10.txt', '12370_8.txt', '7142_8.txt', '1946_9.txt']...

Sample positive review :
	 Holes, originally a novel by Louis Sachar, was successfully transformed into an entertaining and well-made film. Starring Sigourney Weaver as the warden, Shia Labeouf as Stanley, and Khleo Thomas as Zero, the roles were very well casted, and the actors portrayed their roles well.<br /><br />The film had inter-weaving storylines that all led up to the end. The main storyline is about Stanley Yelnats and his punishment of spending a year and a half at Camp Greenlake. The second storyline is about Sam and Kate Barlow. This plot deals with racism and it is the more deep storyline to the movie. The third is about Elya Yelnats and Madame Zeroni, which explains the 100-year curse on the Yelnats family. In my opinion, these storylines were weaved 
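
The `<br />` tags visible in the sample above could be removed with a regular expression. The following is a minimal sketch in plain Python, not the notebook's actual preprocessing step. The helper name `strip_br_tags` is hypothetical, and replacing each tag with a space (rather than deleting it) is an assumption made here so that neighbouring words do not fuse.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br_tags(text: str) -> str:
    """Replace every HTML line-break tag with a single space (hypothetical helper)."""
    return BR_TAG.sub(" ", text)


# Shortened excerpt of the sample review printed above.
sample = "the actors portrayed their roles well.<br /><br />The film had inter-weaving storylines"
cleaned = strip_br_tags(sample)
print(cleaned)
```

Inside a TensorFlow input pipeline, the same pattern could likely be applied with `tf.strings.regex_replace` instead of the `re` module.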

### Loading the dataset into memory

**Note**: The `text_dataset_from_directory` utility treats every subfolder as a class. To prepare a dataset for binary classification, the directory must therefore contain exactly two folders on disk, corresponding to class_a and class_b. Any other folders must be removed before using the utility.

Removing the extra directory "unsup" from the "train" directory if it exists:

In [44]:
unsup_dir = os.path.join(train_dir, "unsup")
if os.path.isdir(unsup_dir):
    # Delete the unlabeled reviews so that only "pos" and "neg" remain.
    shutil.rmtree(unsup_dir)
    print("Deleted \"unsup\" directory.")
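
Because every subfolder is treated as a class, it can help to check which subfolders are present before loading the data. The following is a minimal sketch, assuming a hypothetical helper `list_class_dirs`; it is demonstrated on a temporary directory rather than on the real dataset.

```python
import os
import tempfile


def list_class_dirs(root: str) -> list:
    """Return the sorted names of the immediate subdirectories of `root` (hypothetical helper)."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


# Build a throwaway layout that mimics the original "train" directory.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("pos", "neg", "unsup"):
        os.mkdir(os.path.join(demo_root, name))
    # A plain file such as "urls_pos.txt" is not a subdirectory and is ignored.
    open(os.path.join(demo_root, "urls_pos.txt"), "w").close()
    found = list_class_dirs(demo_root)

print(found)  # ['neg', 'pos', 'unsup']
```

On the real data, `list_class_dirs(train_dir)` should return only `['neg', 'pos']` once the "unsup" directory has been removed.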

Dividing the raw dataset into training and validation sets using the `validation_split` argument (80:20):

In [45]:
batch_size = 32
seed = 42

raw_train_ds, raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    directory=train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset="both",
    seed=seed)

# Alternatively, build each subset with a separate call
# (the same seed and validation_split must be used in both calls):
#
# raw_train_ds = tf.keras.utils.text_dataset_from_directory(
#     directory=train_dir,
#     batch_size=batch_size,
#     validation_split=0.2,
#     subset="training",
#     seed=seed)
#
# raw_val_ds = tf.keras.utils.text_dataset_from_directory(
#     directory=train_dir,
#     batch_size=batch_size,
#     validation_split=0.2,
#     subset="validation",
#     seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Using 5000 files for validation.
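
To see where the 20000/5000 figures come from, and why calling the utility once per subset requires the same `seed` in both calls, the split can be imitated in plain Python. This is only an illustrative sketch under the assumption of a seeded shuffle followed by a cut; it is not Keras's actual implementation, and `split_indices` is a hypothetical name.

```python
import random


def split_indices(n_files: int, validation_split: float, seed: int):
    """Shuffle file indices with a fixed seed, then reserve the tail for validation."""
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every call
    n_val = int(n_files * validation_split)
    return indices[:n_files - n_val], indices[n_files - n_val:]


train_idx, val_idx = split_indices(25000, 0.2, seed=42)
print(len(train_idx), len(val_idx))        # 20000 5000
print(set(train_idx).isdisjoint(val_idx))  # True

# A second call with the same seed reproduces the same partition,
# which is what keeps separately built subsets from overlapping.
train_again, val_again = split_indices(25000, 0.2, seed=42)
print(train_idx == train_again and val_idx == val_again)  # True
```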


Note the difference in dataset type: calling `take` on the `BatchDataset` returns a `TakeDataset` with the same element spec:

In [46]:
raw_train_ds, raw_train_ds.take(count=1)

(<BatchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>,
 <TakeDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>)

Iterating over one batch of the `BatchDataset` and printing its first 3 examples:

In [47]:
for text_batch, label_batch in raw_train_ds.take(count=1):  # count is the number of batches to take
    for i in range(3):
        print("Review", text_batch.numpy()[i])
        print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 
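
The reviews above print with a `b` prefix because string tensors come back as Python `bytes` when converted with `.numpy()`. The following is a minimal sketch of turning such a value into a regular string, assuming the text is UTF-8 encoded; the byte string below is a shortened excerpt of the first review, used as a stand-in for `text_batch.numpy()[i]`.

```python
# Stand-in for one element of text_batch.numpy() (shortened excerpt).
raw_review = b"There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more"

# Decode the raw bytes into a str, assuming UTF-8 encoding.
review = raw_review.decode("utf-8")

print(type(raw_review).__name__, "->", type(review).__name__)  # bytes -> str
print(review)
```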

Printing class names for each class label:

**Note**: The `text_dataset_from_directory` utility infers these names from the folder names. Its `labels` and `class_names` arguments offer other ways to supply labels and to control the class order. The `class_names` attribute is added by the utility and does not exist in the default `Dataset` class.

In [48]:
print("Label 0:", raw_train_ds.class_names[0])
print("Label 1:", raw_train_ds.class_names[1])

Label 0: neg
Label 1: pos
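
With the class names known, integer labels can be translated back into readable names. The following is a minimal sketch, assuming the `['neg', 'pos']` ordering printed above; the `labels` list is made-up illustrative data, not taken from the dataset.

```python
class_names = ["neg", "pos"]  # index 0 -> "neg", index 1 -> "pos", as printed above

labels = [0, 1, 1, 0]  # made-up labels for illustration
readable = [class_names[label] for label in labels]

print(readable)  # ['neg', 'pos', 'pos', 'neg']
```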
