
### Binary Classification 
In machine learning, classification is a supervised learning task in which a decision function assigns each data sample to one of several predefined groups. When there are only two groups, the task is called binary classification.
The decision function is learned from a set of labeled samples called the training data, and the process of learning it is called training.
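
As a minimal illustration of these terms, the sketch below learns a decision function from a few labeled 1-D samples. The data, the helper names `train` and `decision_function`, and the midpoint rule are all assumptions made for illustration; the notebook itself trains a neural network instead.

```python
# Toy training data: (feature value, label). The values are made up.
training_data = [(1.0, "negative"), (2.0, "negative"),
                 (6.0, "positive"), (7.0, "positive")]

def train(samples):
    """Training: learn a threshold halfway between the two class means."""
    neg = [x for x, label in samples if label == "negative"]
    pos = [x for x, label in samples if label == "positive"]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2.0

def decision_function(x, threshold):
    """Assign a sample to one of the two predefined groups."""
    return "positive" if x > threshold else "negative"

threshold = train(training_data)
print(threshold)                          # 4.0
print(decision_function(5.5, threshold))  # positive
```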

### NOTE:
This notebook is intended to run in **Google Colab**: https://colab.research.google.com/.

Google Colaboratory offers a **free Tesla K80 GPU** and comes preconfigured for developing deep learning applications.

When you open this notebook for the first time, remember to select the **Python 3** runtime and the **GPU** accelerator in the Google Colab **Notebook Settings**.


### Setup Project
Create workspace and change directory.


In [1]:
PROJECT_HOME = '/content/keras-movie-reviews-classification'

import os.path
if not os.path.exists(PROJECT_HOME):
  os.makedirs(PROJECT_HOME)
os.chdir(PROJECT_HOME)

!pwd

/content/keras-movie-reviews-classification


### Import Project
Import GitHub project to workspace.

In [2]:
# Import project and override existing data.
!git init .
!git remote add -t \* -f origin https://github.com/alex-agency/keras-movie-reviews-classification.git
!git reset --hard origin/master
!git checkout

!ls -la input

Initialized empty Git repository in /content/keras-movie-reviews-classification/.git/
Updating origin
remote: Counting objects: 16, done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 16 (delta 2), reused 11 (delta 1), pack-reused 0
Unpacking objects: 100% (16/16), done.
From https://github.com/alex-agency/keras-movie-reviews-classification
 * [new branch]      master     -> origin/master
HEAD is now at 9361c0f step2-data-preparation
total 34888
drwxr-xr-x 3 root root     4096 Apr 13 15:52 .
drwxr-xr-x 4 root root     4096 Apr 13 15:52 ..
drwxr-xr-x 4 7297 1000     4096 Jun 26  2011 aclImdb
-rw-r--r-- 1 root root 16338527 Apr 13 15:52 dataset.npz
-rw-r--r-- 1 root root 19372046 Apr 13 15:52 reviews.tsv.bz2


### Movie Review text data
Create a DataFrame from the compressed TSV file.

In [3]:
import sklearn.utils
import pandas as pd

# Import Data frame from compressed tsv file
reviews_df = pd.read_csv("input/reviews.tsv.bz2", sep='\t', 
                         encoding='utf-8', compression='bz2')
# Shuffle all reviews
reviews_df = sklearn.utils.shuffle(reviews_df)

print("Review Frame:")
print(reviews_df.head())

Review Frame:
                                                  review  feedback
49935  This movie was so bad, I thought I was going t...  negative
2366   Let's start this review out on a positive note...  positive
13543  I'm glad that I saw this film after Mr.Sandler...  negative
29566  This version of ALICE IN WONDERLAND is truly o...  positive
37164  Although the concept of a 32 year old woman po...  positive


### Keras
Keras is a high-level API written in Python that can run on top of the TensorFlow, Theano, or CNTK deep learning frameworks.

Keras provides a simple, modular API for creating and training neural networks and hides most of the complicated details under the hood.
By default, Keras is configured to use TensorFlow as its backend, since it is the most popular choice.

Keras has become popular largely because of its simplicity.

In [4]:
# Load Keras libraries
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Using TensorFlow backend.


### Data transformation
The original data is plain text. Before it can be fed into a neural network, it must be converted into tensors.

The first step is tokenizing the reviews. The tokenizer converts each review into a sequence of integers with each integer representing the index of the word in a dictionary. Next the sequences are padded so all of them have the same length.
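
To make the tokenizing step concrete, here is a simplified pure-Python sketch of a frequency-ranked word index. The helpers `fit_word_index` and `texts_to_sequences` are hypothetical stand-ins written for this illustration, not the Keras implementation: they only lowercase and split on whitespace, and ties in frequency are broken alphabetically (an assumption).

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based rank; rank 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their rank, dropping unknown words and ranks >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

texts = ["the movie was good", "the movie was bad", "the plot was thin"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
print(sequences)  # [[1, 3, 2], [1, 3, 2], [1, 2]]
```

Rare words such as "good" and "thin" disappear from the sequences because their rank is not below `num_words`, which is the same effect the vocabulary limit has on the reviews.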

In [5]:
reviews = reviews_df['review']

# The words are indexed so that lower indexes correspond to more frequently used words.
MAX_NB_WORDS = 5000 # only the most frequently used words are kept

# Create the tokenizer and set the number of features we want.
# The Tokenizer stores every word in word_index during fit_on_texts.
# Then, when calling texts_to_sequences, only the top num_words are considered.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)

print('Fitting tokenizer...')
# Fit the tokenizer on the reviews.
tokenizer.fit_on_texts(reviews)
# Each review sequence represents every word as a number.
# Unknown words and words that are not frequently used are ignored.
reviews_seq = tokenizer.texts_to_sequences(reviews)
# Get the vocabulary, which is used to convert a sequence back to a review.
dictionary = tokenizer.word_index
# Cut the vocabulary down to the most frequently used words.
# This relies on word_index being ordered from most to least frequent.
dictionary = dict(list(dictionary.items())[:MAX_NB_WORDS])

print('Vocabulary: {} words.'.format(len(dictionary)))

Fitting tokenizer...
Vocabulary: 5000 words.


In [6]:
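# Transform sequence of numbers back to text.
def sequence_to_text(sequence, dictionary):
  # Invert the word -> index mapping once instead of scanning it for every index.
  index_to_word = {num: word for word, num in dictionary.items()}
  # Indexes missing from the dictionary are skipped.
  return ''.join(' ' + index_to_word[index]
                 for index in sequence if index in index_to_word)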

print('Sequence:')
print(reviews_seq[0])

print('\nReview:')
print(sequence_to_text(reviews_seq[0], dictionary))

Sequence:
[10, 16, 12, 33, 73, 9, 189, 9, 12, 165, 5, 2052, 7, 1, 650, 4, 8, 8, 12, 28, 9, 97, 76, 5, 845, 139, 8, 1, 452, 4, 1, 16, 116, 32, 22, 29, 294, 12, 2410, 60, 8, 4, 1908, 1972, 1732, 5, 67, 37, 29, 223, 31, 601, 29, 8, 60, 72, 88, 457, 14, 131, 82, 45, 12, 53, 1700, 52, 5, 1, 101, 1, 552, 11, 1487, 9, 469, 12, 1342, 5, 93, 175, 1434, 17, 14, 46, 283, 8, 38, 303, 90, 72, 24, 5, 1640, 5, 346, 99, 346, 4, 4475, 726, 11, 761, 2816, 5, 59, 4, 264, 12, 80, 7, 1, 16, 138, 843, 45, 12, 160, 211, 41, 1, 16, 29, 28, 8, 12, 36, 1240, 139, 1, 87, 4988, 515, 4, 169, 294, 16, 122, 89, 15, 3, 119, 225, 4, 459, 1333, 7, 5, 389, 21, 1399, 11, 8, 12, 165, 5, 75, 125, 1073, 8, 148, 337, 630, 304, 9, 565, 1, 768, 557, 4378, 2067, 141, 25, 318, 14, 263, 5, 1, 607, 2594, 59, 14, 43, 54, 12, 3, 48, 16, 17, 145, 9, 580, 141, 9, 816, 8, 2, 93, 246, 9, 291, 38, 1027, 52, 7, 1, 3380, 272, 9, 12, 17, 9, 405, 291, 1027, 52, 7, 1, 3380, 4, 10, 18, 9, 61, 413, 5, 1, 768, 1682, 5, 36, 10, 16, 9, 234, 3, 724

### Padding data
The next step is to make all of the review sequences the same length.
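
The idea can be sketched in plain Python on a few toy sequences: pick a maximum length from a percentile of the sequence lengths, then left-pad short sequences with zeros and cut long ones down. The helpers `percentile` and `pad_left` are hypothetical names for this illustration, and the 75th percentile is chosen only to keep the toy example small.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def pad_left(sequences, maxlen):
    """Left-pad with zeros; keep only the last maxlen items of long sequences."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # assumes maxlen >= 1
        padded.append([0] * (maxlen - len(kept)) + kept)
    return padded

sequences = [[5, 2], [7, 1, 4, 9], [3], [8, 6, 2]]
maxlen = int(round(percentile([len(s) for s in sequences], 75)))
padded = pad_left(sequences, maxlen)
print(maxlen)  # 3
print(padded)  # [[0, 5, 2], [1, 4, 9], [0, 0, 3], [8, 6, 2]]
```

Choosing a percentile rather than the longest sequence keeps the tensor small at the cost of truncating a minority of very long reviews.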

In [7]:
import numpy as np
# Set the max review length to the 90th percentile of all sequence lengths.
MAX_SEQUENCE_LENGTH = int(round(np.percentile([len(i) for i in reviews_seq], 90)))
print('Max sequence length: {} words.'.format(MAX_SEQUENCE_LENGTH))

# Pad all sequences to the same size; longer sequences are truncated to the max length.
# By default, pad_sequences pads and truncates at the start of each sequence.
reviews_vectors = pad_sequences(reviews_seq, maxlen=MAX_SEQUENCE_LENGTH)

print('\nPadded Sequence:')
print(reviews_vectors[0])

Max sequence length: 400 words.

Padded Sequence:
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0   10   16   12   33   73    9  189
    9   12  165    5 2052    7    1  650    4    8    8   12   28    9
   97   76    5  845  139    8    1  452    4    1   16  116   32   22
   29  294   12 2410   60  

### One-hot vectors
Transform the feedback labels into one-hot vectors. A one-hot encoding represents a categorical variable as a binary vector.

This first requires mapping the categorical values to integers.
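
A small pure-Python sketch of both steps on toy labels follows. The helper names `to_one_hot` and `from_one_hot` are hypothetical, and sorting the class names alphabetically to assign integers is an assumption chosen so that "negative" maps to 0 and "positive" to 1.

```python
labels = ["negative", "positive", "positive", "negative"]

# Step 1: map each categorical value to an integer (alphabetical order).
classes = sorted(set(labels))
class_to_int = {name: i for i, name in enumerate(classes)}

def to_one_hot(label):
    """Step 2: turn the integer into a binary vector with a single 1."""
    vector = [0.0] * len(classes)
    vector[class_to_int[label]] = 1.0
    return vector

def from_one_hot(vector):
    """Recover the label from the position of the largest entry."""
    return classes[max(range(len(vector)), key=vector.__getitem__)]

print(to_one_hot("negative"))    # [1.0, 0.0]
print(to_one_hot("positive"))    # [0.0, 1.0]
print(from_one_hot([0.0, 1.0]))  # positive
```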

In [8]:
feedback = reviews_df['feedback']

# Convert labels to integers.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(feedback)

# Converts a class vector (integers) to binary class matrix.
feedback_vectors = to_categorical(labels)

print('Shape of reviews tensor:', reviews_vectors.shape)
print('Shape of feedbacks tensor:', feedback_vectors.shape)

Shape of reviews tensor: (50000, 400)
Shape of feedbacks tensor: (50000, 2)


In [9]:
from numpy import argmax
# Transform one-hot vector back to label.
def onehot_to_text(onehot, encoder): 
  # inverse_transform expects an array of labels, so wrap the index in a list.
  return encoder.inverse_transform([argmax(onehot)])[0]

print('One Hot vector: {}'.format(feedback_vectors[0]) )
print('Feedback: {}'.format(onehot_to_text(feedback_vectors[0], label_encoder) ))

One Hot vector: [1. 0.]
Feedback: negative




### Split data into train and test subsets

In [10]:
from sklearn.model_selection import train_test_split

# Split arrays or matrices into random train and test subsets
x_train, x_test, y_train, y_test = train_test_split(reviews_vectors, feedback_vectors, test_size = 0.20)

print('x_train shape:', x_train.shape)
print('y_train shape:', y_train.shape)

print('x_test shape:', x_test.shape)
print('y_test shape:', y_test.shape)

x_train shape: (40000, 400)
y_train shape: (40000, 2)
x_test shape: (10000, 400)
y_test shape: (10000, 2)
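
Because the split above is random, the train and test sets differ between runs unless a seed is fixed (`train_test_split` accepts a `random_state` argument for this). The pure-Python sketch below shows the underlying idea of a seeded shuffle followed by a cut; `split_train_test` is a hypothetical helper for illustration and rounds the test size differently from scikit-learn in edge cases.

```python
import random

def split_train_test(samples, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_size))
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test

samples = list(range(10))
train_part, test_part = split_train_test(samples, test_size=0.2, seed=0)
print(len(train_part), len(test_part))  # 8 2
```

Calling the function again with the same seed returns the same split, which makes later experiments comparable.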


### Export result to dataset file
Serialize the train and test splits, the vocabulary, and the label encoder into a compressed NumPy archive.

In [11]:
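# Export data to compressed numpy array.
# Build the (2, 2) object array explicitly; recent NumPy rejects ragged nested tuples.
dataset = np.empty((2, 2), dtype=object)
dataset[0, 0], dataset[0, 1] = x_train, y_train
dataset[1, 0], dataset[1, 1] = x_test, y_test
np.savez_compressed('input/dataset.npz', dataset=dataset,
                    dictionary=dictionary, label_encoder=label_encoder)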
!ls -la input

total 35696
drwxr-xr-x 3 root root     4096 Apr 13 15:52 .
drwxr-xr-x 4 root root     4096 Apr 13 15:52 ..
drwxr-xr-x 4 7297 1000     4096 Jun 26  2011 aclImdb
-rw-r--r-- 1 root root 17163254 Apr 13 15:55 dataset.npz
-rw-r--r-- 1 root root 19372046 Apr 13 15:52 reviews.tsv.bz2
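
For reading such an archive back, the following round trip on toy data may serve as a sketch. It stores each array under its own key (`x_train`, `y_train`), which is an assumption for illustration and differs from the single `dataset` entry used above; the file name `toy_dataset.npz` is also made up. Python objects such as the dictionary are pickled inside the archive, so loading them requires `allow_pickle=True`.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the real tensors and vocabulary.
x_train = np.arange(12).reshape(4, 3)
y_train = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
dictionary = {"the": 1, "movie": 2}

path = os.path.join(tempfile.mkdtemp(), "toy_dataset.npz")
np.savez_compressed(path, x_train=x_train, y_train=y_train, dictionary=dictionary)

# allow_pickle=True is needed because the dict is stored as a pickled object.
with np.load(path, allow_pickle=True) as archive:
    x_loaded = archive["x_train"]
    y_loaded = archive["y_train"]
    # A dict is saved as a 0-d object array; .item() unwraps it.
    dictionary_loaded = archive["dictionary"].item()

print(x_loaded.shape, y_loaded.shape)  # (4, 3) (4, 2)
print(dictionary_loaded)               # {'the': 1, 'movie': 2}
```

Only load pickled archives from sources you trust, since unpickling can execute arbitrary code.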


### Downloading file to your local file system

This triggers a browser download of the file to your local computer.

In [0]:
from google.colab import files
# Download file
files.download('input/dataset.npz')