# Text Classification

In this notebook I download movie review data and classify each review as "positive" or "negative". In other words, I do sentiment analysis on text data.

In [1]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [2]:
# Download the data
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [3]:
os.listdir(dataset_dir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [13]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [11]:
sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_file) as f:
    print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


In [12]:
"""
Loading the dataset uses the text_dataset_from_directory utility. It expects one subdirectory
per class, which is exactly the layout shown above - how convenient.
"""
# This dir does not matter to us - get rid of it
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
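
To make the layout requirement concrete, here is a minimal sketch of how class labels can be read off a directory tree: each subdirectory is treated as one class, and the names are numbered in sorted order. This is a plain-Python illustration, not the Keras implementation; `infer_labels` and the toy files are hypothetical names made up for the example. It also suggests why an extra subdirectory such as `unsup` has to go: it would be picked up as another class.

```python
import os
import tempfile


def infer_labels(root):
    """Map each class subdirectory of `root` to an integer label, in sorted order.

    Hypothetical helper for illustration only; not part of Keras.
    """
    class_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    return {name: index for index, name in enumerate(class_names)}


with tempfile.TemporaryDirectory() as root:
    # Build a toy tree with one subdirectory per class.
    for class_name, text in [("pos", "Loved it."), ("neg", "Hated it.")]:
        class_dir = os.path.join(root, class_name)
        os.makedirs(class_dir)
        with open(os.path.join(class_dir, "review_0.txt"), "w") as f:
            f.write(text)
    # A plain file at the top level is not a directory, so it is not a class.
    with open(os.path.join(root, "urls_pos.txt"), "w") as f:
        f.write("")
    label_map = infer_labels(root)

print(label_map)  # {'neg': 0, 'pos': 1}
```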

In [14]:
# Load the data in
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, # 80/20 training/validation data
    subset='training', 
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [19]:
# Showing some data samples
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print("Review:", text_batch.numpy()[i][:100])
        print("Label:", label_batch.numpy()[i])

Review: b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have final'
Label: 0
Review: b'I am shocked. Shocked and dismayed that the 428 of you IMDB users who voted before me have not given'
Label: 1
Review: b'What a lovely heart warming television movie. The story tells of a little five year old girl who has'
Label: 1


In [21]:
# But what do 0 and 1 mean?
print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos


In [23]:
# Now get the validation data set - use the same seed (or shuffle=False) so the two subsets do not overlap
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='validation', 
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
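
Why the seed matters: the split is taken from a shuffled list of files, so both calls have to shuffle the same way to cut that list at the same place. The sketch below shows the idea with plain Python. `split_indices` is a hypothetical helper and Python's `random` module stands in for whatever shuffling Keras does internally, so treat it as an illustration of the principle rather than the library's actual code.

```python
import random


def split_indices(n_items, validation_split, subset, seed):
    """Return the item indices for one subset of a seeded shuffle split.

    Hypothetical helper for illustration only; not the Keras implementation.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed -> same order every call
    n_val = int(n_items * validation_split)
    if subset == "training":
        return indices[:n_items - n_val]
    if subset == "validation":
        return indices[n_items - n_val:]
    raise ValueError("subset must be 'training' or 'validation'")


train_idx = split_indices(25000, 0.2, "training", seed=42)
val_idx = split_indices(25000, 0.2, "validation", seed=42)

print(len(train_idx), len(val_idx))           # 20000 5000
print(set(train_idx).isdisjoint(val_idx))     # True
```

With two different seeds the two calls would shuffle differently, and some files could end up in both subsets.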


In [25]:
# Now get the test data
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=batch_size)

Found 25000 files belonging to 2 classes.


Quick note for those who don't know: training data is what we train the model on. Validation data is
what we use to check how the model is doing during training.
Finally, test data is what we use to actually test the model once training is done.
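
The three roles can be shown with a toy example that has nothing to do with the IMDB model. Everything in it is made up for illustration: the (score, label) pairs, the `fit_threshold` helper, and the idea of "training" as picking a single threshold. It is only a sketch of which data set is touched at which stage.

```python
# Toy (score, label) pairs. All values are made up for illustration.
train_data = [(0.9, 1), (0.8, 1), (0.6, 1), (0.4, 0), (0.3, 0), (0.2, 0)]
val_data = [(0.7, 1), (0.55, 1), (0.45, 0), (0.35, 0)]
test_data = [(0.85, 1), (0.5, 1), (0.4, 0), (0.1, 0)]


def fit_threshold(data):
    """'Train' by placing the threshold midway between the two class means."""
    pos = [score for score, label in data if label == 1]
    neg = [score for score, label in data if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(data, threshold):
    """Fraction of items whose thresholded score matches the label."""
    correct = sum((score >= threshold) == (label == 1) for score, label in data)
    return correct / len(data)


# 1. Training data: the only data the "model" learns from.
threshold = fit_threshold(train_data)

# 2. Validation data: checked while developing, to see how the model is doing.
val_accuracy = accuracy(val_data, threshold)

# 3. Test data: looked at once, at the end, for the final score.
test_accuracy = accuracy(test_data, threshold)

print(val_accuracy, test_accuracy)  # 1.0 0.75
```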