## The Problem: Large Movie Review Dataset
### Classify movie reviews from IMDB into positive or negative sentiment.
### Download the dataset [here](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)

In [1]:
# imports

from gensim.models import KeyedVectors
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, Dense, Input, GlobalAveragePooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

import utils

## Exploring the data

In [2]:
# Importing & preprocessing the dataset

# Raw strings keep the Windows backslashes literal and avoid the invalid '\p' escape
train_ds = text_dataset_from_directory(r'C:\technology\pythonlearning\semantic_processing\Distributional Semantics\data\aclImdb\train')
test_ds = text_dataset_from_directory(r'C:\technology\pythonlearning\semantic_processing\Distributional Semantics\data\aclImdb\test')

# Flatten the batched datasets into one (text, label) row per review
dfTrain = pd.DataFrame(train_ds.unbatch().as_numpy_iterator(), columns=['text', 'label'])
dfTest = pd.DataFrame(test_ds.unbatch().as_numpy_iterator(), columns=['text', 'label'])
# Keep a stratified 25% of the test set for validation
_, xts = train_test_split(dfTest, stratify=dfTest['label'], test_size=0.25)

# Reviews arrive as bytes; decode them to str
dfTrain['text'] = dfTrain['text'].map(lambda x: x.decode())
xts['text'] = xts['text'].map(lambda x: x.decode())


Found 75000 files belonging to 3 classes.
Found 25000 files belonging to 2 classes.
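
The first count above (75000 files, 3 classes) suggests that the `unsup` folder of unlabeled reviews inside `train` was picked up as a third class next to `neg` and `pos`. A minimal sketch of dropping those rows before training a binary classifier follows. It assumes labels are assigned in alphabetical folder order (`neg`=0, `pos`=1, `unsup`=2); the toy frame and the `UNSUP_LABEL` name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for dfTrain; labels assume alphabetical folder order
# (neg=0, pos=1, unsup=2), which is an assumption about the directory layout.
toy_train = pd.DataFrame({
    'text': ['great film', 'dull plot', 'no rating given', 'loved it'],
    'label': [1, 0, 2, 1],
})

UNSUP_LABEL = 2  # hypothetical constant naming the unlabeled class

# Keep only rows with a real sentiment label so the targets are 0 or 1
labeled_train = toy_train[toy_train['label'] != UNSUP_LABEL].reset_index(drop=True)

print(labeled_train['label'].value_counts().sort_index())
```

Applying the same filter to `dfTrain` right after it is built would keep the targets compatible with a sigmoid output and `binary_crossentropy`.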


In [3]:
pd.options.display.max_colwidth = 200
dfTrain.sample(n=5)

|  | text | label |
|---|---|---|
| 2413 | I was only 9 years old in 1968, but I was an avid television watcher, and I loved this TV show.<br /><br />My parents got me a Julia "Barbie" doll, even though I did not have any regular Barbie do... | 2 |
| 32520 | Ah yes, the VS series, MVC2 being the pinnacle. It's been said before, this is what you get when half of the crew fell asleep on the job, unfortunately the gameplay half did. Don't get me wrong, t... | 0 |
| 57457 | I'm not a John Cleese completist (although I thought "Fawlty Towers was brilliant), but I am a fan, and when I saw this sitting, neglected, on a shelf at my local Blockbuster, I decided to give it... | 1 |
| 73325 | Saying this movie is extremely hard to follow and just as frustrating to sit through is putting it very mildly. Also saying that the current available print is dark, dreary, scratchy, abysmally ed... | 0 |
| 34852 | One of the major flaws in this film is that while the mocking of pretentious yuppies is satisfying, it fails to realize that the movie makers themselves are guilty of being one of those that deser... | 0 |


In [4]:
print(dfTrain.loc[0, 'text'])

I disagree with other commenters. Though not the best movie, the movie had a point similar to that of "To Kill A Mockingbird." An attorney took on a case and a client he really didn't want. If he loses, his client could die. The stakes are high. In "Knock On Any Door," the attorney had abandoned the father, believing that it was no big deal. His associate didn't take the consequences seriously either. In the long run, not only did the family's father and breadwinner die in prison for a crime he didn't do, but the son felt abandoned and became a hood, eventually dying in the electric chair. If the lawyer had treated this innocent and poor family just as he treated his rich clients, and had given them his best as he did his rich clients, the father and the son may have been saved. The moral is, all people have value, whether poor or old, and when one is entrusted with the care or safety of another, one should treat that life as if it were his own in all cases. Not the best movie, but, wh

## Tokenize the text

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dfTrain['text'].tolist())
train_sequences = tokenizer.texts_to_sequences(dfTrain['text'].tolist())
test_sequences = tokenizer.texts_to_sequences(xts['text'].tolist())


word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 153845 unique tokens.


In [6]:
print(train_sequences[0])

[10, 3136, 15, 84, 18935, 150, 21, 1, 117, 17, 1, 17, 67, 3, 220, 725, 5, 12, 4, 5, 503, 3, 28331, 32, 4560, 561, 20, 3, 411, 2, 3, 8193, 28, 64, 155, 180, 45, 28, 2028, 24, 8193, 97, 714, 1, 10201, 23, 301, 8, 3145, 20, 100, 1223, 1, 4560, 67, 2509, 1, 359, 3604, 12, 9, 13, 54, 200, 848, 24, 6523, 155, 189, 1, 3467, 628, 349, 8, 1, 194, 493, 21, 62, 118, 1, 5174, 359, 2, 43723, 714, 8, 1055, 16, 3, 758, 28, 155, 79, 18, 1, 464, 440, 2509, 2, 891, 3, 2664, 861, 1640, 8, 1, 5067, 3296, 45, 1, 2728, 67, 1947, 11, 1342, 2, 338, 237, 40, 14, 28, 1947, 24, 965, 10429, 2, 67, 358, 90, 24, 117, 14, 28, 118, 24, 965, 10429, 1, 359, 2, 1, 464, 203, 25, 75, 1999, 1, 1555, 6, 29, 81, 25, 1065, 770, 338, 39, 153, 2, 51, 27, 6, 22703, 15, 1, 455, 39, 4357, 4, 154, 27, 141, 1737, 12, 115, 14, 45, 9, 68, 24, 197, 8, 29, 3013, 21, 1, 117, 17, 18, 51, 583, 30, 35, 50, 71, 32, 716, 1980, 9, 124, 92, 298, 69, 7, 7, 263, 9, 39, 21, 892, 14, 552, 8, 11, 17, 593, 8, 143, 115, 29, 1, 55]


In [7]:
print([tokenizer.index_word[k] for k in train_sequences[0]])

['i', 'disagree', 'with', 'other', 'commenters', 'though', 'not', 'the', 'best', 'movie', 'the', 'movie', 'had', 'a', 'point', 'similar', 'to', 'that', 'of', 'to', 'kill', 'a', 'mockingbird', 'an', 'attorney', 'took', 'on', 'a', 'case', 'and', 'a', 'client', 'he', 'really', "didn't", 'want', 'if', 'he', 'loses', 'his', 'client', 'could', 'die', 'the', 'stakes', 'are', 'high', 'in', 'knock', 'on', 'any', 'door', 'the', 'attorney', 'had', 'abandoned', 'the', 'father', 'believing', 'that', 'it', 'was', 'no', 'big', 'deal', 'his', 'associate', "didn't", 'take', 'the', 'consequences', 'seriously', 'either', 'in', 'the', 'long', 'run', 'not', 'only', 'did', 'the', "family's", 'father', 'and', 'breadwinner', 'die', 'in', 'prison', 'for', 'a', 'crime', 'he', "didn't", 'do', 'but', 'the', 'son', 'felt', 'abandoned', 'and', 'became', 'a', 'hood', 'eventually', 'dying', 'in', 'the', 'electric', 'chair', 'if', 'the', 'lawyer', 'had', 'treated', 'this', 'innocent', 'and', 'poor', 'family', 'just', 

In [8]:
MAX_SEQUENCE_LENGTH = max([max(map(len, train_sequences)), max(map(len, test_sequences))])

In [9]:
MAX_SEQUENCE_LENGTH

2493
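
Padding every review to the longest one (2493 tokens) means most rows consist largely of padding. One common alternative is to cap the length at a high percentile of the observed lengths. The sketch below uses made-up token counts rather than the notebook's data, and the 95th percentile is an arbitrary choice, not something the notebook uses.

```python
import numpy as np

# Made-up token counts standing in for the lengths of train_sequences
lengths = np.array([120, 230, 95, 310, 180, 2493, 150, 400, 275, 205])

# Cap at the 95th percentile instead of the maximum (arbitrary threshold)
capped_length = int(np.percentile(lengths, 95))

# Share of reviews that would be truncated at this length
truncated_share = float(np.mean(lengths > capped_length))

print(capped_length, truncated_share)
```

A value chosen this way could be passed as `maxlen` to `pad_sequences`, which truncates longer sequences and pads shorter ones.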

In [10]:
train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [11]:
print([tokenizer.index_word.get(k, '<PAD>') for k in train_data[0]])

['<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 

# Train a classifier with Word Embeddings

In [18]:
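# load() restores whatever object was saved; the printed repr shows this file holds a full Word2Vec model
countries_wiki = KeyedVectors.load('wiki-countries.w2v')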

print(countries_wiki)

Word2Vec<vocab=20944, vector_size=128, alpha=0.025>


In [15]:
embedding_layer = utils.make_embedding_layer(countries_wiki, tokenizer, MAX_SEQUENCE_LENGTH=MAX_SEQUENCE_LENGTH)
countries_wiki_model = Sequential([
    Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32'),
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
countries_wiki_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

ValueError: Unrecognized keyword arguments passed to Embedding: {'weights': [array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.81471908,  0.58569628, -0.01761852, ...,  0.10394527,
         0.88644546, -1.08423793],
       [ 0.02136188,  1.02093649,  0.22952487, ..., -0.54903054,
        -0.34340549,  0.89367366],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])], 'input_length': 2493}
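
The error indicates that the installed Keras version's `Embedding` layer rejects the `weights` and `input_length` arguments that `utils.make_embedding_layer` passes. The helper's source is not shown here, so the following is only a sketch of how such a helper might be adapted: build the matrix with NumPy, then hand it to the layer through `embeddings_initializer`. The function names, the `trainable=False` setting, and the zero rows for out-of-vocabulary words are all assumptions.

```python
import numpy as np

def build_embedding_matrix(vectors, word_index, dim):
    """Row i holds the vector for the word with index i; row 0 is padding."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        if word in vectors:  # works for a dict or gensim KeyedVectors
            matrix[i] = vectors[word]
    return matrix

def make_embedding_layer_sketch(matrix):
    """Hypothetical replacement that avoids `weights=` and `input_length=`."""
    # Imported lazily so the matrix helper above runs without TensorFlow
    from tensorflow.keras.initializers import Constant
    from tensorflow.keras.layers import Embedding
    return Embedding(
        input_dim=matrix.shape[0],
        output_dim=matrix.shape[1],
        embeddings_initializer=Constant(matrix),
        trainable=False,  # assumption: keep the pretrained vectors frozen
    )

# Toy vocabulary and vectors, made up for illustration
toy_vectors = {'movie': np.array([1.0, 2.0]), 'good': np.array([0.5, -0.5])}
toy_word_index = {'movie': 1, 'good': 2, 'unseenword': 3}
toy_matrix = build_embedding_matrix(toy_vectors, toy_word_index, dim=2)
print(toy_matrix)
```

For a full `Word2Vec` model such as `countries_wiki`, the word vectors live under its `.wv` attribute, so that attribute is what a helper like `build_embedding_matrix` would need to receive.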

In [None]:
countries_wiki_history = countries_wiki_model.fit(
    train_data, dfTrain['label'].values,
    validation_data=(test_data, xts['label'].values),
    batch_size=64, epochs=30
)

# Train with a different set of word embeddings

## GloVe: Global Vectors for Word Representation
### Download [here](http://nlp.stanford.edu/data/glove.6B.zip)

In [15]:
glove_wiki = KeyedVectors.load_word2vec_format('data/glove.6B/glove.6B.300d.txt', binary=False, no_header=True)

In [None]:
embedding_layer = utils.make_embedding_layer(glove_wiki, tokenizer, MAX_SEQUENCE_LENGTH)

glove_model = Sequential([
    Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32'),
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
glove_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
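
`GlobalAveragePooling1D` averages the embedding vectors across the sequence axis. Because the sequences are padded at the front up to 2493 positions, a short review's average includes many padding rows unless they are masked. A NumPy sketch of the difference, with made-up vectors and assuming padding positions embed to zero vectors:

```python
import numpy as np

# Toy embedded sequence: 3 padding positions (zeros) followed by 2 real tokens
embedded = np.array([
    [0.0, 0.0],
    [0.0, 0.0],
    [0.0, 0.0],
    [2.0, 4.0],
    [4.0, 8.0],
])
token_ids = np.array([0, 0, 0, 17, 42])  # 0 marks padding

# Plain average over every position, padding included
plain_mean = embedded.mean(axis=0)

# Masked average over real tokens only
mask = token_ids != 0
masked_mean = embedded[mask].mean(axis=0)

print(plain_mean, masked_mean)
```

In Keras, setting `mask_zero=True` on the `Embedding` layer is one way to have the pooling layer skip padded positions; whether `utils.make_embedding_layer` does so is not shown here.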

In [None]:
glove_history = glove_model.fit(
    train_data, dfTrain['label'].values,
    validation_data=(test_data, xts['label'].values),
    batch_size=32, epochs=30
)

In [None]:
plt.plot(countries_wiki_history.history['val_accuracy'], label='Countries Wiki')
plt.plot(glove_history.history['val_accuracy'], label='GloVe 6B (300d)')
plt.xlabel('Epoch')
plt.ylabel('Validation accuracy')
plt.legend()
plt.show()