## IMDb: Internet Movie Database reviews

The IMDb task is a binary sentiment classification task on movie reviews collected from IMDb. The training and test sets each contain 25,000 labeled reviews, and an additional 50,000 unlabeled reviews are provided.

See the [website](http://ai.stanford.edu/~amaas/data/sentiment/) and the [paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) for more information.

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import csv
import re
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.utils import shuffle

sys.path.append("../") 
from bert_sklearn import BertClassifier
from bert_sklearn import load_model

DATADIR = "./aclImdb"

In [None]:
%%bash
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
rm aclImdb_v1.tar.gz

In [2]:
"""
IMDB train data size: 25000 
IMDB unsup data size: 50000 
IMDB test data size: 25000 
"""

def clean(text):
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r"\"", "", text)       
    return text

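def slurp(filename):
    # Read one review file and strip HTML tags and double quotes.
    # The encoding is set explicitly so reads do not depend on the platform default.
    with open(filename, encoding="utf-8") as f:
        data = clean(f.read())
    return data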

def get_imdb_df(datadir,val=None):
    data = [(slurp(datadir + filename),val) for filename in os.listdir(datadir)]
    return pd.DataFrame(data,columns=['text','label'])

def get_imdb_data(train_dir = DATADIR + "/train",test_dir = DATADIR + "/test",random_state=42 ):

    label_list = [0,1]
    pos = get_imdb_df(train_dir + "/pos/",1)
    neg = get_imdb_df(train_dir + "/neg/",0)
    train = shuffle(pd.concat([pos, neg]),random_state=random_state)
    print("IMDB train data size: %d "%(len(train)))
    
    unsup = get_imdb_df(train_dir + "/unsup/")
    print("IMDB unsup data size: %d "%(len(unsup)))

    pos = get_imdb_df(test_dir + "/pos/",1)
    neg = get_imdb_df(test_dir + "/neg/",0)
    test = shuffle(pd.concat([pos, neg]),random_state=random_state)
    print("IMDB test data size: %d "%(len(test)))
    
    return train, test, label_list, unsup

train, test, label_list, unsup = get_imdb_data()

IMDB train data size: 25000 
IMDB unsup data size: 50000 
IMDB test data size: 25000 
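To make the effect of `clean` concrete, here is a small sketch on a made-up review string (the sample text is an assumption, not taken from the dataset). Because tags are replaced with an empty string, the sentences on either side of a `<br /><br />` pair end up joined without a space. `clean_spaced` is a hypothetical variant, not part of the original notebook, that substitutes a space instead and then collapses repeated whitespace.

```python
import re

def clean(text):
    # Same logic as the notebook's clean(): drop HTML tags, then double quotes.
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r"\"", "", text)
    return text

def clean_spaced(text):
    # Hypothetical variant: replace each tag with a space so that the
    # surrounding sentences stay separated, then collapse whitespace.
    text = re.sub(r'<.*?>', ' ', text)
    text = text.replace('"', '')
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample string for illustration only.
sample = 'Great start.<br /><br />The "ending" drags.'

print(clean(sample))         # Great start.The ending drags.
print(clean_spaced(sample))  # Great start. The ending drags.
```

Whether the missing space matters for a WordPiece-based model is an open question here; the results below were obtained with the original `clean`.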


In [4]:
train.head()    

|  | text | label |
|---|---|---|
| 6868 | when you add up all the aspects from the movie... | 1 |
| 11516 | Lord Alan Cunningham(Antonio De Teffè)is a nut... | 0 |
| 9668 | I thought it was an extremely clever film. I w... | 1 |
| 1140 | Granted, I'm not the connoisseur d'horror my p... | 0 |
| 1518 | I thought it would at least be aesthetically b... | 0 |


In [5]:
train[:1].values

array([["when you add up all the aspects from the movie---the dancing, singing, acting---the only one who stands out as the best in the cast is Vanessa Williams...her dedication, energy and timeless beauty make Rosie the perfect role for her. Never have i ever seen someone portray Rose with such vibrancy! Vanessa's singing talent shows beautifully with all the songs she performs as Rose and her acting skills never cease to amaze me! Her dancing is so incredible, even if as some people say the choreography was bad---her dancing skills were displayed better than ever before! I'd recommend this version over the '63 just because i find that although lengthy the acting by Vanessa is superb-----not to mention the fact that Jason Alexander and the rest of the cast are very impressive as well (with the exception of Chynna Philips...what in hell were they thinking when they cast her?)All in all I'd say this version is wonderful and I recommend that everyone see this version!",
        1]], dtyp

As you can see, a review can be much longer than a sentence or two. The pretrained BERT models released by Google AI accept sequences of at most 512 tokens. Let's compare performance for `max_seq_length` values of 128, 256, and 512.
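One rough way to judge how much text each setting would cut off is to count how many reviews exceed each limit. The sketch below uses whitespace-separated words as a stand-in for tokens, which is an assumption: BERT's WordPiece tokenizer usually produces more tokens than a whitespace split, and two positions are reserved for the `[CLS]` and `[SEP]` markers, so the real fractions would be somewhat higher. `fraction_over` is a hypothetical helper and the sample data is synthetic.

```python
def fraction_over(texts, limits=(128, 256, 512)):
    """Return {limit: fraction of texts with more than `limit` whitespace tokens}."""
    lengths = [len(text.split()) for text in texts]
    if not lengths:
        return {limit: 0.0 for limit in limits}
    return {
        limit: sum(length > limit for length in lengths) / len(lengths)
        for limit in limits
    }

# Synthetic stand-in for train['text']: four "reviews" of 100, 200, 300 and 600 words.
sample = ["word " * n for n in (100, 200, 300, 600)]

print(fraction_over(sample))  # {128: 0.75, 256: 0.5, 512: 0.25}
```

With the real data, the call would be `fraction_over(train['text'])`.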

### max_seq_length = 128

In [4]:
%%time

train, test, label_list, unsup = get_imdb_data()

X_train = train['text']
y_train = train['label']

X_test = test['text']
y_test = test['label']

model = BertClassifier()
model.max_seq_length = 128
model.learning_rate = 2e-05
model.epochs = 4

print(model)

model.fit(X_train, y_train)

accy = model.score(X_test, y_test)

IMDB train data size: 25000 
IMDB unsup data size: 50000 
IMDB test data size: 25000 
Building sklearn classifier...
BertClassifier(bert_model='bert-base-uncased', epochs=4, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=128, num_mlp_hiddens=500,
        num_mlp_layers=0, random_state=42, restore_file=None,
        train_batch_size=32, use_cuda=True, validation_fraction=0.1,
        warmup_proportion=0.1)
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 22500, validation data size: 2500


Training: 100%|██████████| 704/704 [09:10<00:00,  1.28it/s, loss=0.396]
                                                             

Epoch 1, Train loss : 0.3956, Val loss: 0.2588, Val accy = 89.48%


Training: 100%|██████████| 704/704 [09:12<00:00,  1.46it/s, loss=0.193]
                                                             

Epoch 2, Train loss : 0.1930, Val loss: 0.2531, Val accy = 90.28%


Training: 100%|██████████| 704/704 [09:11<00:00,  1.45it/s, loss=0.102] 
                                                             

Epoch 3, Train loss : 0.1019, Val loss: 0.2917, Val accy = 89.92%


Training: 100%|██████████| 704/704 [09:11<00:00,  1.46it/s, loss=0.0698]
                                                             

Epoch 4, Train loss : 0.0698, Val loss: 0.3091, Val accy = 89.88%


                                                            


Test loss: 0.3400, Test accuracy = 89.16%
CPU times: user 41min 13s, sys: 17min 47s, total: 59min 1s
Wall time: 46min 44s
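The step count shown in the progress bars follows from the data split and batch size. A minimal sketch of the arithmetic, assuming the final partial batch is kept (which is consistent with the logged counts); `steps_per_epoch` is a hypothetical helper, not a `bert_sklearn` function:

```python
import math

def steps_per_epoch(n_examples, validation_fraction, batch_size):
    # Hold out the validation fraction, then count batches,
    # rounding up so the last partial batch is included.
    n_val = round(n_examples * validation_fraction)
    n_train = n_examples - n_val
    return math.ceil(n_train / batch_size)

# 25,000 reviews with validation_fraction=0.1 leave 22,500 for training.
print(steps_per_epoch(25000, 0.1, 32))  # 704
print(steps_per_epoch(25000, 0.1, 16))  # 1407
```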




### max_seq_length = 256

In [7]:
%%time

train, test, label_list, unsup = get_imdb_data()

X_train = train['text']
y_train = train['label']

X_test = test['text']
y_test = test['label']

model = BertClassifier()
model.max_seq_length = 256
model.train_batch_size = 32
model.learning_rate = 2e-05
model.epochs = 4

print(model)

model.fit(X_train, y_train)

accy = model.score(X_test, y_test)

IMDB train data size: 25000 
IMDB unsup data size: 50000 
IMDB test data size: 25000 
Building sklearn classifier...
BertClassifier(bert_model='bert-base-uncased', epochs=4, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=256, num_mlp_hiddens=500,
        num_mlp_layers=0, random_state=42, restore_file=None,
        train_batch_size=32, use_cuda=True, validation_fraction=0.1,
        warmup_proportion=0.1)
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 22500, validation data size: 2500


Training: 100%|██████████| 704/704 [14:13<00:00,  1.01s/it, loss=0.336]
                                                             

Epoch 1, Train loss : 0.3360, Val loss: 0.2038, Val accy = 92.08%


Training: 100%|██████████| 704/704 [14:14<00:00,  1.00s/it, loss=0.14] 
                                                             

Epoch 2, Train loss : 0.1403, Val loss: 0.1911, Val accy = 93.00%


Training: 100%|██████████| 704/704 [14:14<00:00,  1.00s/it, loss=0.0704]
                                                             

Epoch 3, Train loss : 0.0704, Val loss: 0.2216, Val accy = 92.84%


Training: 100%|██████████| 704/704 [14:14<00:00,  1.00s/it, loss=0.0474]
                                                             

Epoch 4, Train loss : 0.0474, Val loss: 0.2335, Val accy = 93.16%


                                                            


Test loss: 0.2541, Test accuracy = 92.36%
CPU times: user 1h 1min 7s, sys: 31min 55s, total: 1h 33min 2s
Wall time: 1h 9min 26s




### max_seq_length = 512

In [3]:
%%time

train, test, label_list, unsup = get_imdb_data()

X_train = train['text']
y_train = train['label']

X_test = test['text']
y_test = test['label']

model = BertClassifier()
model.max_seq_length = 512

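# max_seq_length=512 uses much more GPU memory, so reduce the batch size.
model.train_batch_size = 16
# Gradient accumulation is left at its default: the run logged below reports
# gradient_accumulation_steps=1. To enable it, set the attribute on the model:
# model.gradient_accumulation_steps = 4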

model.learning_rate = 2e-05
model.epochs = 4

print(model)

model.fit(X_train, y_train)

accy = model.score(X_test, y_test)

IMDB train data size: 25000 
IMDB unsup data size: 50000 
IMDB test data size: 25000 
Building sklearn classifier...
BertClassifier(bert_model='bert-base-uncased', epochs=4, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=512, num_mlp_hiddens=500,
        num_mlp_layers=0, random_state=42, restore_file=None,
        train_batch_size=16, use_cuda=True, validation_fraction=0.1,
        warmup_proportion=0.1)
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 22500, validation data size: 2500


Training: 100%|██████████| 1407/1407 [30:39<00:00,  1.10s/it, loss=0.309]
                                                             

Epoch 1, Train loss : 0.3088, Val loss: 0.1738, Val accy = 93.24%


Training: 100%|██████████| 1407/1407 [30:39<00:00,  1.10s/it, loss=0.115]
                                                             

Epoch 2, Train loss : 0.1145, Val loss: 0.1770, Val accy = 94.00%


Training: 100%|██████████| 1407/1407 [30:38<00:00,  1.12s/it, loss=0.0501]
                                                             

Epoch 3, Train loss : 0.0501, Val loss: 0.2188, Val accy = 93.72%


Training: 100%|██████████| 1407/1407 [31:02<00:00,  1.11s/it, loss=0.0304]
                                                             

Epoch 4, Train loss : 0.0304, Val loss: 0.2305, Val accy = 94.00%


                                                            


Test loss: 0.2223, Test accuracy = 94.04%
CPU times: user 2h 6min 43s, sys: 1h 6min 6s, total: 3h 12min 50s
Wall time: 2h 22min 9s
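The three runs can be put side by side with a few lines of Python. The accuracies and wall times are copied from the logs above; the dictionary layout is only an illustrative choice. Note that the 512 run also used `train_batch_size=16` instead of 32, so it is not a pure sequence-length comparison.

```python
# Test accuracy (%) and wall time (minutes) as reported in the logs above.
runs = [
    {"max_seq_length": 128, "test_accuracy": 89.16, "wall_minutes": 46 + 44 / 60},
    {"max_seq_length": 256, "test_accuracy": 92.36, "wall_minutes": 69 + 26 / 60},
    {"max_seq_length": 512, "test_accuracy": 94.04, "wall_minutes": 142 + 9 / 60},
]

# Accuracy gained at each increase of max_seq_length, in percentage points.
gains = [
    round(cur["test_accuracy"] - prev["test_accuracy"], 2)
    for prev, cur in zip(runs, runs[1:])
]

for run in runs:
    print(f'{run["max_seq_length"]:>4}  {run["test_accuracy"]:.2f}%  '
          f'{run["wall_minutes"]:.1f} min')
print(gains)  # [3.2, 1.68]
```

In these runs, each doubling of the sequence length improved test accuracy, with a smaller gain at the second step and a longer wall time each time.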


