<a href="https://colab.research.google.com/github/coffepowered/fun-with-nlp/blob/text-vectorization/embedding_with_textvectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook uses basic NLP techniques to classify the sentiment of IMDB reviews.

The objective is to demonstrate that we can get quite good performance on a real task with minimal preprocessing, a simple model and just a few lines of code =).

Let's read an excerpt of the original README to know more about it:

```
Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating
```


Note that the IMDB dataset is also available through `keras.datasets.imdb` (already tokenized!), but here we start from the raw text for didactic purposes. First, download the dataset:


In [2]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xvzf aclImdb_v1.tar.gz

Streaming output truncated to the last 5000 lines.
aclImdb/train/unsup/44982_0.txt
aclImdb/train/unsup/44981_0.txt
aclImdb/train/unsup/44980_0.txt
aclImdb/train/unsup/44979_0.txt
aclImdb/train/unsup/44978_0.txt
aclImdb/train/unsup/44977_0.txt
aclImdb/train/unsup/44976_0.txt
aclImdb/train/unsup/44975_0.txt
aclImdb/train/unsup/44974_0.txt
aclImdb/train/unsup/44973_0.txt
aclImdb/train/unsup/44972_0.txt
aclImdb/train/unsup/44971_0.txt
aclImdb/train/unsup/44970_0.txt
aclImdb/train/unsup/44969_0.txt
aclImdb/train/unsup/44968_0.txt
aclImdb/train/unsup/44967_0.txt
aclImdb/train/unsup/44966_0.txt
aclImdb/train/unsup/44965_0.txt
aclImdb/train/unsup/44964_0.txt
aclImdb/train/unsup/44963_0.txt
aclImdb/train/unsup/44962_0.txt
aclImdb/train/unsup/44961_0.txt
aclImdb/train/unsup/44960_0.txt
aclImdb/train/unsup/44959_0.txt
aclImdb/train/unsup/44958_0.txt
aclImdb/train/unsup/44957_0.txt
aclImdb/train/unsup/44956_0.txt
aclImdb/train/unsup/44955_0.txt
aclImdb/train/unsup/44954_0.txt
aclImdb

# Load and view data

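Set TOT_SAMPLES to the number of reviews to load. Only the labeled training split is read here (12,500 positive and 12,500 negative reviews), so the effective maximum is 25,000.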

In [3]:
import glob, os
import numpy as np
print(glob.glob("aclImdb/train/pos/*.txt"))

TOT_SAMPLES = 15000
positives = glob.glob("aclImdb/train/pos/*.txt")[:TOT_SAMPLES//2]
negatives = glob.glob("aclImdb/train/neg/*.txt")[:TOT_SAMPLES//2]

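content = []
# read the positive reviews first...
for f in positives:
  if os.path.isfile(f):
    with open(f, 'r', encoding='utf-8') as reader:
      content.append(reader.read())
n_pos = len(content)

# ...then append the negative ones
for f in negatives:
  if os.path.isfile(f):
    with open(f, 'r', encoding='utf-8') as reader:
      content.append(reader.read())

# labels: 1 for the first n_pos (positive) reviews, 0 for the rest
labels = np.zeros(len(content))
labels[:n_pos] = 1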


assert len(labels) == len(content)

['aclImdb/train/pos/3463_7.txt', 'aclImdb/train/pos/6958_7.txt', 'aclImdb/train/pos/11188_10.txt', 'aclImdb/train/pos/5038_10.txt', 'aclImdb/train/pos/8880_8.txt', 'aclImdb/train/pos/3154_10.txt', 'aclImdb/train/pos/504_8.txt', 'aclImdb/train/pos/12494_8.txt', 'aclImdb/train/pos/10650_8.txt', 'aclImdb/train/pos/11743_7.txt', 'aclImdb/train/pos/3472_10.txt', 'aclImdb/train/pos/11283_8.txt', 'aclImdb/train/pos/1561_8.txt', 'aclImdb/train/pos/7843_9.txt', 'aclImdb/train/pos/6720_9.txt', 'aclImdb/train/pos/1784_10.txt', 'aclImdb/train/pos/11050_8.txt', 'aclImdb/train/pos/5427_10.txt', 'aclImdb/train/pos/1947_8.txt', 'aclImdb/train/pos/7155_8.txt', 'aclImdb/train/pos/8944_8.txt', 'aclImdb/train/pos/8430_9.txt', 'aclImdb/train/pos/12024_7.txt', 'aclImdb/train/pos/5338_10.txt', 'aclImdb/train/pos/6560_7.txt', 'aclImdb/train/pos/10490_7.txt', 'aclImdb/train/pos/7068_8.txt', 'aclImdb/train/pos/11891_9.txt', 'aclImdb/train/pos/8317_8.txt', 'aclImdb/train/pos/3649_9.txt', 'aclImdb/train/pos/381_1
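
The file names in the listing above appear to follow an `<id>_<rating>.txt` pattern. As a minimal sketch, assuming the number after the underscore is the review's score out of 10, the README's labeling rule can be reproduced like this (`rating_from_path` and `label_from_rating` are hypothetical helpers, not part of the notebook):

```python
import os

def rating_from_path(path):
    """Extract the rating from a path like 'aclImdb/train/pos/3463_7.txt'.

    Assumes the '<id>_<rating>.txt' naming seen in the listing above.
    """
    stem = os.path.splitext(os.path.basename(path))[0]  # e.g. '3463_7'
    _review_id, rating = stem.split("_")
    return int(rating)

def label_from_rating(rating):
    """README rule: <= 4 is negative (0), >= 7 is positive (1).

    Ratings in between are not part of the labeled sets, so return None.
    """
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None

paths = ["aclImdb/train/pos/3463_7.txt", "aclImdb/train/pos/11188_10.txt"]
ratings = [rating_from_path(p) for p in paths]
print(ratings)                                  # [7, 10]
print([label_from_rating(r) for r in ratings])  # [1, 1]
```

The notebook does not need this, since it takes the label from the `pos`/`neg` folder, but it can serve as a quick consistency check on the loaded files.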

In [6]:
# view a random review
np.random.choice(content)

"Just got back from a free screening and I'm very glad I didn't pay to see this very sub-par film. The theater was full and the crowd was a mix of kids and adults. It seemed like it was just the kids who were laughing at all the slap-stick and fart jokes though (good god they loved to hit these poor mice in the crotch a lot!). The movie is pretty juvenile, unintelligent, predictable, and mostly annoying. The characters just seem to be thrown together to fill in empty space and the relationships between them all seemed very forced with no charm at all.<br /><br />Visually, the film is about average with nothing that really stands out. They did a decent job of mimicking the clay look from Wallace and Gromit, but other than that it's very forgettable imagery.<br /><br />Although I was really bored throughout the whole film, I chuckled a couple times. It's not an absolute failure, but I most definitely would not want to watch it again. If you're a parent with kids (and you don't care that 

## Exploratory Analysis (TODO)

Do something here.

## Preprocessing
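In this notebook we deliberately skip text preprocessing and only shuffle the data. Shuffling matters: `content` holds all positive reviews followed by all negative ones, and the `validation_split` argument used later in `model.fit` takes the last samples of the arrays as validation data. Without shuffling, the validation set would contain negative reviews only.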

In [7]:
from sklearn.utils import shuffle
content, labels = shuffle(np.array(content), np.array(labels))
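
To see why the shuffle is needed before using a trailing validation split, here is a small pure-Python sketch. It uses a toy label list rather than the real data, and `last_fraction` is a hypothetical helper that only approximates how Keras slices off the held-out samples:

```python
import random

# Toy stand-in for the notebook's labels: positives first, then negatives.
labels_demo = [1] * 50 + [0] * 50

def last_fraction(items, fraction):
    """Return the trailing `fraction` of `items` (the part held out for validation)."""
    n_val = int(len(items) * fraction)
    return items[len(items) - n_val:]

# Without shuffling, the held-out slice contains no positive example at all.
print(sum(last_fraction(labels_demo, 0.1)))  # 0

# After shuffling, the held-out slice mixes both classes (the exact mix depends on the seed).
shuffled = labels_demo[:]
random.Random(0).shuffle(shuffled)
val = last_fraction(shuffled, 0.1)
print(len(val), sum(val))
```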

## Build the simplest model you can think of

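Tune the model config as you like, but do not expect great improvements :).

The point here is to use Keras' TextVectorization layer, which turns raw strings into fixed-length sequences of integer token ids inside the model itself.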
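
To build some intuition for what such a layer does, here is a rough pure-Python sketch. It is not the Keras implementation: it only imitates the default behavior as I understand it (lowercasing, stripping punctuation, splitting on whitespace, reserving id 0 for padding and id 1 for out-of-vocabulary tokens). The names `tokenize`, `build_vocab` and `vectorize` are made up for this illustration.

```python
import string
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: 0 for padding, 1 for out-of-vocabulary tokens

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def build_vocab(texts, max_tokens):
    """Roughly the 'adapt' step: map the most frequent tokens to ids 2, 3, ..."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, seq_len):
    """Turn a text into exactly seq_len ids, truncating or right-padding with PAD."""
    ids = [vocab.get(tok, OOV) for tok in tokenize(text)][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))

corpus = ["Great movie, great plot, great cast!", "A bad movie."]
vocab = build_vocab(corpus, max_tokens=4)  # 2 reserved ids + the 2 most frequent tokens
print(vocab)  # {'great': 2, 'movie': 3}
print(vectorize("Great movie, terrible acting", vocab, seq_len=6))  # [2, 3, 1, 1, 0, 0]
```

Unknown words ("terrible", "acting") collapse to the out-of-vocabulary id, and short texts are padded with zeros, which is why every review ends up with the same length.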

In [18]:
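# model configs
max_tokens = 5000  # vocabulary size, including the padding and out-of-vocabulary tokens
seq_len = 350      # every review is padded or truncated to this many tokens

# imports (all from tensorflow.keras, to avoid mixing it with standalone keras)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten
try:
  from tensorflow.keras.layers import TextVectorization  # newer TensorFlow releases
except ImportError:
  from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# create the vectorizer and "adapt" it, i.e. build the vocabulary from the raw reviews
vectorizer = TextVectorization(max_tokens=max_tokens,
                               output_sequence_length=seq_len)

vectorizer.adapt(tf.data.Dataset.from_tensor_slices(np.array(content)))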

In [19]:
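model = Sequential()
# the model takes raw strings as input: one review per sample
model.add(Input(shape=(1,), dtype=tf.string))
# strings -> fixed-length sequences of integer token ids
model.add(vectorizer)
# token ids -> 32-dimensional learned vectors
model.add(Embedding(max_tokens, 32))
# concatenate the seq_len embeddings into one long feature vector
model.add(Flatten())
model.add(Dense(150, activation='relu'))
# single sigmoid unit: estimated probability that the review is positive
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()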


Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_5 (TextVe (None, 350)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 350, 32)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 11200)             0         
_________________________________________________________________
dense (Dense)                (None, 150)               1680150   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 151       
Total params: 1,840,301
Trainable params: 1,840,301
Non-trainable params: 0
_________________________________________________________________
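
The parameter counts in the summary above can be checked by hand. The short sketch below redoes the arithmetic from the configuration values (the variable names are only for illustration):

```python
max_tokens, embed_dim, seq_len, hidden = 5000, 32, 350, 150

embedding_params = max_tokens * embed_dim   # one 32-dim vector per vocabulary id
flat_size = seq_len * embed_dim             # width of the Flatten output
dense_params = flat_size * hidden + hidden  # weights + biases of the hidden layer
output_params = hidden * 1 + 1              # weights + bias of the sigmoid unit

print(embedding_params, flat_size, dense_params, output_params)
# 160000 11200 1680150 151
print(embedding_params + dense_params + output_params)  # 1840301
```

Most of the parameters sit in the first Dense layer, because Flatten produces one feature per token position and embedding dimension.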


In [20]:
model.fit(content, labels,
          batch_size=128,
          epochs=6,
          validation_split=0.1
          )

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<tensorflow.python.keras.callbacks.History at 0x7f394477add8>

# Evaluation

Notice that the model overfits, and pretty easily. That is not bad per se, since we still get pretty good performance on the validation set.

Let's sanity-check the result on a few very, very simple reviews I've written:

In [21]:
model.predict(["That was simple a great movie, highly suggested!",
               "WTF worse movie EVER!",
               "I have somewhat mixed feelings about it. On one side, this is great, on the other hand seems like the actors put not enough effort into it",
               "Somewhat a bad movie but I liked the plot",
               "Just a waste of time, avoid"])

array([[0.9896177 ],
       [0.21192223],
       [0.63738286],
       [0.47483245],
       [0.00326097]], dtype=float32)
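
The sigmoid outputs above can be read as the estimated probability that a review is positive. As a small sketch, assuming a cut-off of 0.5 (the notebook itself does not fix one), they map to labels as follows:

```python
# Scores copied from the output above, in the same order as the input reviews.
scores = [0.9896177, 0.21192223, 0.63738286, 0.47483245, 0.00326097]

THRESHOLD = 0.5  # assumed cut-off, chosen only for illustration

def to_label(score, threshold=THRESHOLD):
    """Map a sigmoid score to a sentiment label."""
    return "positive" if score >= threshold else "negative"

for score in scores:
    print(f"{score:.3f} -> {to_label(score)}")
```

The two clear-cut reviews land far from the threshold, while the two mixed ones (0.637 and 0.475) sit close to it, which matches their ambiguous wording.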