## Build a Classifier for IMDB 

### Introduction
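In this notebook, we will build a sentiment classifier for IMDB movie reviews. This section loads the raw reviews and converts them into fixed-length integer sequences.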

In [2]:
import os 
from datetime import datetime

import numpy as np
import pandas as pd 
import keras
import matplotlib.pyplot as plt

%matplotlib inline

Using TensorFlow backend.


### Load data

In [6]:
imdb_dir = '/Users/yizhang/Documents/Playground/learn_keras/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

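for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The review files are assumed to be UTF-8; being explicit avoids
            # depending on the platform's default encoding.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            # Negative reviews are labeled 0, positive reviews 1.
            labels.append(0 if label_type == 'neg' else 1)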

In [7]:
texts[0]

"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form."

### Tokenizing the data
Break each review into tokens (here, words), the basic units from which we will build a representation of the document.

In [23]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


max_len = 100 # keep at most 100 tokens per review; shorter reviews are padded
training_samples = 200
validation_samples = 10000
max_words = 10000 # consider only the top 10k words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

In [47]:
word_index = tokenizer.word_index
reverse_word_index = {idx: w for w, idx in word_index.items()}
print('Found {:,} unique tokens.'.format(len(word_index)))


def index_to_word(index):
    return reverse_word_index[index]

Found 88,582 unique tokens.


In [13]:
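# texts_to_matrix expects a list of texts. A bare string is treated as a
# sequence of one-character texts (23 rows for this 23-character string).
sample_mtx = tokenizer.texts_to_matrix(['Working with one of the'])
sample_mtx.shape

(1, 10000)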

In [59]:
data = pad_sequences(sequences, maxlen=max_len)
labels = np.asarray(labels)

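**Summary**: from raw texts to data ready for a learning algorithm:
1) Tokenization (`tokenizer`) lowercases each document, strips punctuation, splits it into words, and maps each word to an integer index. Only the most frequent words (limited by `max_words`) are kept; rarer words are dropped.
2) `pad_sequences` pads or truncates the variable-length sequences to one common length (`max_len` = 100).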
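
The tokenization step summarized above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Keras implementation: `simple_tokenize`, `build_word_index`, and `to_sequence` are hypothetical helpers, and `FILTERS` is meant to mirror the default character filter of the Keras `Tokenizer`, which keeps apostrophes.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to match the
# Keras Tokenizer default; note that the apostrophe is not filtered).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({c: ' ' for c in FILTERS})


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_TABLE).split()


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is left for padding."""
    counts = Counter(w for t in texts for w in simple_tokenize(t))
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    ids = (word_index.get(w) for w in simple_tokenize(text))
    return [i for i in ids if i is not None and i < num_words]


toy_texts = ["The film is good.", "The film is bad, the cast is good!"]
toy_index = build_word_index(toy_texts)
toy_seq = to_sequence("The cast is bad, the film is good", toy_index, num_words=5)
print(toy_index)
print(toy_seq)
```

Dropping rare words is why the decoded review shown later in this section is shorter than the original text: words that fall outside the vocabulary limit never reach the sequence.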

#### Inspect the tokenization and sequence results

In [31]:
texts[0]

"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form."

In [30]:
', '.join([str(i) for i in sequences[0]])

'777, 16, 28, 4, 1, 115, 2278, 6887, 11, 19, 1025, 5, 27, 5, 42, 2425, 1861, 128, 2270, 5, 3, 6985, 308, 7, 7, 3383, 2373, 1, 19, 36, 463, 3169, 2, 222, 3, 1016, 174, 20, 49, 808'

In [50]:
', '.join(index_to_word(i) for i in sequences[0])

"working, with, one, of, the, best, shakespeare, sources, this, film, manages, to, be, to, it's, source, whilst, still, appealing, to, a, wider, audience, br, br, branagh, steals, the, film, from, under, nose, and, there's, a, talented, cast, on, good, form"

The following sample shows `sequences[0]` after `pad_sequences`, which turns variable-length documents into sequences of the same length by padding with zeros on the left.

In [54]:
', '.join([str(i) for i in data[0]])

'0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 777, 16, 28, 4, 1, 115, 2278, 6887, 11, 19, 1025, 5, 27, 5, 42, 2425, 1861, 128, 2270, 5, 3, 6985, 308, 7, 7, 3383, 2373, 1, 19, 36, 463, 3169, 2, 222, 3, 1016, 174, 20, 49, 808'
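
The leading zeros in `data[0]` come from left-padding. Below is a minimal NumPy sketch of that behaviour, assuming the `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`). `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np


def pad_pre(seqs, maxlen):
    """Left-pad with zeros; for long sequences keep only the last `maxlen` items."""
    out = np.zeros((len(seqs), maxlen), dtype='int32')
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all zeros
        trunc = seq[-maxlen:]           # 'pre' truncation drops the beginning
        out[row, -len(trunc):] = trunc  # 'pre' padding leaves zeros on the left
    return out


padded = pad_pre([[777, 16, 28], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(padded)
```

With these defaults, a review longer than `max_len` keeps its final tokens rather than its opening ones.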

In [56]:
set(len(doc) for doc in data)

{100}

In [58]:
print('shape of data tensor: ', data.shape)
print('shape of label tensor: ', labels.shape)

shape of data tensor:  (25000, 100)
shape of label tensor:  (25000,)
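
Because the loading loop reads all negative reviews before the positive ones, the rows of `data` are ordered by label. A possible next step, sketched here as an assumption about how `training_samples` and `validation_samples` are meant to be used, is to shuffle before splitting. The arrays below are random stand-ins with the same shapes as `data` and `labels`, and the seed is arbitrary.

```python
import numpy as np

training_samples = 200
validation_samples = 10000

# Stand-in arrays shaped like `data` and `labels` (negatives first, then positives).
rng = np.random.default_rng(0)
data = rng.integers(0, 10000, size=(25000, 100))
labels = np.concatenate([np.zeros(12500, dtype=int), np.ones(12500, dtype=int)])

# Shuffle rows and labels with the same permutation so they stay aligned.
indices = rng.permutation(len(data))
data, labels = data[indices], labels[indices]

# Take the first rows for training and the following rows for validation.
x_train, y_train = data[:training_samples], labels[:training_samples]
x_val = data[training_samples:training_samples + validation_samples]
y_val = labels[training_samples:training_samples + validation_samples]

print(x_train.shape, x_val.shape)
```

Without the shuffle, the first 200 rows would all carry label 0 and the model would see no positive reviews during training.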
