### NOTE:
This notebook is prepared for use in **Google Colab**: https://colab.research.google.com/.


Google Colaboratory provides a **free Tesla K80 GPU** and comes preconfigured for developing deep learning applications.

The first time you open this notebook, remember to enable the **Python 3** runtime and the **GPU** accelerator in the Google Colab **Notebook Settings**.


### Setup Project
Create workspace


In [1]:
PROJECT_HOME = '/content/keras-movie-reviews-classification'

import os
# Create the workspace directory if it does not exist yet, then switch to it.
os.makedirs(PROJECT_HOME, exist_ok=True)
os.chdir(PROJECT_HOME)

!pwd

/content/keras-movie-reviews-classification


### Import Project
Import the project from GitHub into the workspace created above.

In [2]:
!git init .
!git remote remove origin
!git remote add -t \* -f origin https://github.com/alex-agency/keras-movie-reviews-classification.git
!git checkout origin/master

!ls -la

Initialized empty Git repository in /content/keras-movie-reviews-classification/.git/
fatal: No such remote: origin
Updating origin
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 9 (delta 0), reused 5 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.
From https://github.com/alex-agency/keras-movie-reviews-classification
 * [new branch]      master     -> origin/master
error: The following untracked working tree files would be overwritten by checkout:
	input/reviews.tsv.bz2
Please move or remove them before you switch branches.
Aborting
total 16
drwxr-xr-x 4 root root 4096 Mar 31 04:11 .
drwxr-xr-x 1 root root 4096 Mar 31 04:08 ..
drwxr-xr-x 8 root root 4096 Mar 31 04:11 .git
drwxr-xr-x 3 root root 4096 Mar 31 04:08 input


### Load Movie Review Text Data
Create a DataFrame from the compressed TSV file.

In [3]:
import sklearn.utils
import pandas as pd

# Import Data frame from compressed tsv file
reviews_df = pd.read_csv("input/reviews.tsv.bz2", sep='\t', 
                         encoding='utf-8', compression='bz2')
# Shuffle all reviews
reviews_df = sklearn.utils.shuffle(reviews_df)

print("Review Frame:")
print(reviews_df.head())

Review Frame:
                                                  review  feedback
24970  I'll be honest- the reason I rented this movie...  negative
159    Well, What can I say, other than these people ...  positive
11996  'The second beginning' as it's title explains,...  positive
2965   This has to be the all time best computer anim...  positive
24877  The movie is plain bad. Simply awful. The stri...  negative
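
To make the file format concrete, here is a minimal sketch that reads a bz2-compressed, tab-separated file using only the standard library. The `load_reviews` helper and the sample rows are hypothetical and only stand in for `input/reviews.tsv.bz2`; the column names `review` and `feedback` are taken from the output above.

```python
import bz2
import csv
import os
import tempfile

def load_reviews(path):
    """Read a bz2-compressed TSV file into a list of dicts keyed by the header row."""
    # 'rt' opens the compressed stream in text mode so the csv module can parse it.
    with bz2.open(path, 'rt', encoding='utf-8', newline='') as handle:
        return list(csv.DictReader(handle, delimiter='\t'))

# Hypothetical sample data, written to a temporary file for illustration only.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.tsv.bz2')
with bz2.open(sample_path, 'wt', encoding='utf-8', newline='') as handle:
    writer = csv.writer(handle, delimiter='\t')
    writer.writerow(['review', 'feedback'])
    writer.writerow(['Simply awful.', 'negative'])
    writer.writerow(['A great film.', 'positive'])

rows = load_reviews(sample_path)
print(rows[0]['feedback'])
```

`pd.read_csv` with `sep='\t'` and `compression='bz2'` does the same decompression and parsing in one call, and returns a DataFrame instead of a list of dicts.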


### Load Keras with TensorFlow backend
Keras is a high-level neural networks API capable of running on top of TensorFlow.

In [4]:
# Load Keras libraries
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Using TensorFlow backend.


### Data transformation
The original data is plain text. Before it can be fed into a neural network, it must be converted into tensors.

The first step is tokenizing the reviews. The tokenizer converts each review into a sequence of integers, with each integer representing the index of a word in a dictionary. Next, the sequences are padded so that all of them have the same length.

In [5]:
reviews = reviews_df['review']

# The words are indexed such that lower indexes correspond to more frequently used words.
MAX_NB_WORDS = 5000 # only the most frequently used words will be kept

# Create the tokenizer and set the number of features we want.
# The Tokenizer stores every word in word_index during fit_on_texts.
# Then, when calling texts_to_sequences, only words with an index below num_words are kept.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)

print('Fitting tokenizer...')
# Train the tokenizer
tokenizer.fit_on_texts(reviews)
# A review sequence represents each word as a number.
# Unknown words and words that are not frequently used are ignored.
reviews_seq = tokenizer.texts_to_sequences(reviews)
# Get the vocabulary, which will be used for converting a sequence back to a review.
dictionary = tokenizer.word_index
# Cut the vocabulary down to the most frequently used words (lowest indexes),
# filtering by index rather than relying on dict ordering.
dictionary = {word: index for word, index in dictionary.items() if index <= MAX_NB_WORDS}

print('Vocabulary: {} words.'.format(len(dictionary)))

Fitting tokenizer...
Vocabulary: 5000 words.
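
The idea behind the frequency-ranked word index can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helpers `build_word_index` and `to_sequence` are hypothetical, the text is split on whitespace only, and ties are broken alphabetically as an assumption to keep the result deterministic.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indexes, dropping unknown words and indexes >= num_words."""
    indexes = (word_index.get(word) for word in text.lower().split())
    return [i for i in indexes if i is not None and i < num_words]

texts = ['the movie was bad', 'the movie was great', 'the cast was great']
word_index = build_word_index(texts)
print(word_index)
print(to_sequence('the movie was awful', word_index, num_words=4))
```

In the last call, "awful" is dropped because it was never seen during fitting, and "movie" is dropped because its index is not below `num_words`.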


In [6]:
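# Transform a sequence of word indexes back to words.
def sequence_to_text(sequence, dictionary):
  # Invert the mapping once so that each lookup is a constant-time dict access.
  index_to_word = {num: word for word, num in dictionary.items()}
  # Indexes that are not in the dictionary (e.g. padding zeros) are skipped.
  return ''.join(' ' + index_to_word[index]
                 for index in sequence if index in index_to_word)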

print('Sequence:')
print(reviews_seq[0])

print('\nReview:')
print(sequence_to_text(reviews_seq[0], dictionary))

Sequence:
[656, 25, 1168, 1, 283, 9, 1566, 10, 16, 12, 83, 9, 234, 3, 641, 325, 4, 87, 3846, 35, 403, 4422, 233, 27, 630, 296, 1, 48, 218, 9, 460, 5, 62, 86, 13, 7, 3, 278, 213, 40, 4, 3424, 1, 110, 391, 80, 2733, 67, 7, 3, 2929, 25, 1450, 3, 1976, 3, 438, 205, 503, 3, 400, 29, 1458, 158, 27, 6, 15, 1, 113, 37, 247, 1, 326, 4, 3, 2384, 34, 79, 449, 295, 7, 1, 235, 1, 96, 12, 160, 36, 9, 861, 8, 12, 1158, 9, 1788, 1, 126, 42, 21, 209, 8, 1420, 117, 133, 2, 45, 69, 33, 105, 1822, 515, 687, 8, 12, 2, 89, 119, 281, 421, 8, 12, 3, 2743, 16, 2, 421, 1, 1961, 76, 138, 4828, 17, 221, 50, 5, 3, 16, 70, 11, 10, 18, 883, 13, 42, 8, 153, 24, 1, 54, 5, 138, 81, 1553, 38, 794, 15, 8, 2, 384, 8, 1, 112, 61, 89, 52, 14, 8, 608, 12, 179, 1142, 13, 1, 2304, 962, 89, 2342, 307, 13, 827, 13, 58, 97, 75, 58, 759, 67, 1, 87, 2, 12, 1394, 1090, 13, 31, 5, 1, 235, 42, 60, 1, 226, 114, 1434, 5, 1, 149]

Review:
 i'll be honest the reason i rented this movie was because i am a huge fan of most notably from earl

The next step is to make all review sequences the same length.

In [7]:
import numpy as np
# Set the max review length to the 90th percentile of all sequence lengths.
MAX_SEQUENCE_LENGTH = int(round(np.percentile([len(i) for i in reviews_seq], 90)))
print('Max sequence length: {} words.'.format(MAX_SEQUENCE_LENGTH))

# Pad all sequences to the same size; longer sequences are truncated to the max length.
reviews_vectors = pad_sequences(reviews_seq, maxlen=MAX_SEQUENCE_LENGTH)

print('\nPadded Sequence:')
print(reviews_vectors[0])

Max sequence length: 400 words.

Padded Sequence:
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0  
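
The zeros appear at the front because `pad_sequences` pads and truncates at the beginning of a sequence by default. A minimal pure-Python sketch of that default behavior, using a hypothetical helper `pad_sequence` that works on a single list:

```python
def pad_sequence(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of integers to exactly maxlen items."""
    if len(sequence) >= maxlen:
        # Keep the last maxlen items, dropping the beginning of the sequence.
        return sequence[len(sequence) - maxlen:]
    # Fill the front with the padding value.
    return [value] * (maxlen - len(sequence)) + sequence

short = pad_sequence([5, 8, 2], maxlen=5)
long = pad_sequence([5, 8, 2, 9, 4, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long)   # [8, 2, 9, 4, 7]
```

Choosing the 90th percentile as the length means roughly one review in ten loses its opening words, while the rest are kept whole and padded.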

Transform the feedback labels into one-hot vectors. A one-hot encoding represents a categorical variable as a binary vector.

This first requires that the categorical values be mapped to integer values.

In [8]:
feedback = reviews_df['feedback']

# Convert labels to integers.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(feedback)

# Converts a class vector (integers) to binary class matrix.
feedback_vectors = to_categorical(labels)

print('Shape of reviews tensor:', reviews_vectors.shape)
print('Shape of feedbacks tensor:', feedback_vectors.shape)

Shape of reviews tensor: (50000, 400)
Shape of feedbacks tensor: (50000, 2)
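
The two steps, mapping labels to integers and then to one-hot rows, can be sketched without any libraries. The helpers `encode_labels` and `one_hot` are hypothetical stand-ins for `LabelEncoder` and `to_categorical`; like `LabelEncoder`, the sketch assigns integers to the classes in sorted order.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    classes = sorted(set(labels))
    class_to_index = {name: index for index, name in enumerate(classes)}
    return classes, [class_to_index[name] for name in labels]

def one_hot(indexes, num_classes):
    """Turn integer class indexes into rows containing a single 1.0."""
    return [[1.0 if column == index else 0.0 for column in range(num_classes)]
            for index in indexes]

classes, indexes = encode_labels(['negative', 'positive', 'positive'])
vectors = one_hot(indexes, len(classes))
print(classes)   # ['negative', 'positive']
print(vectors)   # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Decoding reverses the process: the position of the largest value is the class index.
decoded = [classes[row.index(max(row))] for row in vectors]
print(decoded)   # ['negative', 'positive', 'positive']
```

With two classes, each row has two columns, which matches the `(50000, 2)` shape of the feedback tensor.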


In [9]:
from numpy import argmax
# Transform a one-hot vector back to its label.
def onehot_to_text(onehot, encoder):
  # inverse_transform expects an array-like of labels, so wrap the index in a list.
  return encoder.inverse_transform([argmax(onehot)])[0]

print('One Hot vector: {}'.format(feedback_vectors[0]) )
print('Feedback: {}'.format(onehot_to_text(feedback_vectors[0], label_encoder) ))

One Hot vector: [1. 0.]
Feedback: negative




### Export result to dataset file
Serialize the objects and save them to a compressed NumPy archive.

In [10]:
# Export data to compressed numpy array
np.savez_compressed('input/dataset.npz', dictionary=dictionary, reviews_vectors=reviews_vectors, 
                label_encoder=label_encoder, feedback_vectors=feedback_vectors)
!ls -la input

total 34888
drwxr-xr-x 3 root root     4096 Mar 31 04:31 .
drwxr-xr-x 4 root root     4096 Mar 31 04:11 ..
drwxr-xr-x 4 7297 1000     4096 Jun 26  2011 aclImdb
-rw-r--r-- 1 root root 16338527 Mar 31 04:31 dataset.npz
-rw-r--r-- 1 root root 19372026 Mar 31 04:08 reviews.tsv.bz2
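
Reading the archive back takes one extra step for the non-array objects. The sketch below is a self-contained round trip that assumes a recent NumPy, where `np.load` requires `allow_pickle=True` for object arrays; the small `dictionary` and `reviews_vectors` values are hypothetical stand-ins for the real data, and the archive is written to a temporary directory rather than to `input/dataset.npz`.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the objects saved in the cell above.
dictionary = {'the': 1, 'movie': 2}
reviews_vectors = np.array([[0, 1, 2], [0, 0, 1]])

path = os.path.join(tempfile.mkdtemp(), 'dataset.npz')
np.savez_compressed(path, dictionary=dictionary, reviews_vectors=reviews_vectors)

# allow_pickle=True is needed because the dict is stored as a pickled object array.
# Only enable it for files you created yourself or otherwise trust.
with np.load(path, allow_pickle=True) as data:
    loaded_vectors = data['reviews_vectors']
    # .item() unwraps the zero-dimensional object array back into a dict.
    loaded_dictionary = data['dictionary'].item()

print(loaded_vectors.shape)
print(loaded_dictionary)
```

The same `.item()` call should recover the saved `label_encoder`, provided scikit-learn is installed in the environment that loads the file.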


### Downloading files to your local file system

This invokes a browser download of the file to your local computer.

In [0]:
from google.colab import files
# Download file
files.download('input/dataset.npz')