### NOTE:
This notebook is intended to run in **Google Colab**: https://colab.research.google.com/.

Google Colaboratory provides a **free Tesla K80 GPU** and comes preconfigured for developing deep learning applications.

When you first open this notebook, remember to select the **Python 3** runtime and the **GPU** accelerator in the Google Colab **Notebook Settings**.
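
To confirm that the runtime settings took effect, a quick check along these lines can help. It is only a minimal sketch: it assumes the `nvidia-smi` tool is on the `PATH` of a GPU runtime, and `runtime_summary` is a hypothetical helper name, not part of Colab.

```python
import shutil
import subprocess
import sys


def runtime_summary():
    """Return the Python major version and whether an NVIDIA GPU is visible."""
    # Assumption: GPU runtimes expose the nvidia-smi command-line tool.
    smi = shutil.which("nvidia-smi")
    gpu_visible = False
    if smi is not None:
        try:
            # "nvidia-smi -L" lists the GPUs the driver can see, one per line.
            result = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            gpu_visible = result.returncode == 0 and "GPU" in result.stdout
        except (OSError, subprocess.TimeoutExpired):
            gpu_visible = False
    return {"python_major": sys.version_info.major, "gpu_visible": gpu_visible}


print(runtime_summary())
```

If `gpu_visible` comes back `False`, the accelerator is probably still set to "None" in the notebook settings.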


### Setup Project
Create the workspace directory and make it the current working directory.

In [1]:
PROJECT_HOME = '/content/keras-movie-reviews-classification'

import os
# Create the workspace if it does not exist yet, then switch into it
os.makedirs(PROJECT_HOME, exist_ok=True)
os.chdir(PROJECT_HOME)

!pwd

/content/keras-movie-reviews-classification


### Large Movie Review Dataset
This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available. Both raw text and a preprocessed bag-of-words format are provided.

Link: http://ai.stanford.edu/~amaas/data/sentiment/

In [2]:
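# Download and extract the archive (skipped if it is already present)
import os.path
if not os.path.exists("input/aclImdb"):
    import urllib.request, tarfile
    print("Downloading...")
    url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    # Stream the response straight into tarfile so the archive is never saved to disk
    with urllib.request.urlopen(url) as response:
        with tarfile.open(fileobj=response, mode="r|*") as archive:
            archive.extractall("input")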
    
!ls -la input/aclImdb

Downloading...
total 1732
drwxr-xr-x 4 7297 1000   4096 Jun 26  2011 .
drwxr-xr-x 3 root root   4096 Apr 13 15:50 ..
-rw-r--r-- 1 7297 1000 903029 Jun 11  2011 imdbEr.txt
-rw-r--r-- 1 7297 1000 845980 Apr 12  2011 imdb.vocab
-rw-r--r-- 1 7297 1000   4037 Jun 26  2011 README
drwxr-xr-x 4 7297 1000   4096 Apr 12  2011 test
drwxr-xr-x 5 7297 1000   4096 Jun 26  2011 train
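
As a sanity check after extraction, the sketch below reports which labelled folders are missing. It assumes the `train`/`test` by `pos`/`neg` layout that the loading cell relies on; `missing_label_dirs` is a hypothetical helper name introduced only for illustration.

```python
import os


def missing_label_dirs(root="input/aclImdb"):
    """Return the expected labelled review folders that do not exist under root."""
    expected = [
        os.path.join(root, split, label)
        for split in ("train", "test")
        for label in ("pos", "neg")
    ]
    return [path for path in expected if not os.path.isdir(path)]


# An empty list means all four labelled folders were extracted.
print(missing_label_dirs())
```

A non-empty result usually points to an interrupted download, in which case deleting `input/aclImdb` and re-running the download cell is the simplest fix.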


In [3]:
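# Extract text from files
from glob import glob
def extract(files):
    """Read every file matching the glob pattern into a list of {"review": text} dicts."""
    reviews = []  # avoid naming this `list`, which would shadow the built-in
    for filename in glob(files):
        with open(filename, "rb") as f:
            reviews.append({
                # The raw reviews mark line breaks with HTML <br /> tags
                "review": f.read().decode("utf-8").replace("<br />", "\n")
            })
    return reviews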
  
# Get the Train and Test data of Positive and Negative reviews
print("Loading...")
train_pos_data = extract("input/aclImdb/train/pos/*.txt")
print("positive reviews for train:", len(train_pos_data))
train_neg_data = extract("input/aclImdb/train/neg/*.txt")
print("negative reviews for train:", len(train_neg_data))
test_pos_data = extract("input/aclImdb/test/pos/*.txt")
print("positive reviews for test: ", len(test_pos_data))
test_neg_data = extract("input/aclImdb/test/neg/*.txt")
print("negative reviews for test: ", len(test_neg_data))

Loading...
positive reviews for train: 12500
negative reviews for train: 12500
positive reviews for test:  12500
negative reviews for test:  12500


In [4]:
import pandas as pd

# Create Data frames for positive and negative reviews
train_pos_df = pd.DataFrame(train_pos_data)
train_pos_df["feedback"] = 'positive'
train_neg_df = pd.DataFrame(train_neg_data)
train_neg_df["feedback"] = 'negative'

test_pos_df = pd.DataFrame(test_pos_data)
test_pos_df["feedback"] = 'positive'
test_neg_df = pd.DataFrame(test_neg_data)
test_neg_df["feedback"] = 'negative'

# Combine all reviews together; ignore_index gives every row a unique index
reviews_df = pd.concat([train_pos_df, train_neg_df, test_pos_df, test_neg_df],
                       ignore_index=True)

print("Review Frame:")
print(reviews_df.head())

Review Frame:
                                              review  feedback
0  well "Wayne's World" is long gone and the year...  positive
1  This film is one of the best of all time, cert...  positive
2  "The Case of the Scorpion's Tail" has all the ...  positive
3  A famous orchestra conductor, Daniel Dareus, s...  positive
4  This is a beautiful, rich, and very well-execu...  positive
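
The extraction and labelling steps above can also be done in a single pass. The following is a sketch of that alternative, not the notebook's own method: `load_labelled` and `LABELS` are hypothetical names, files are read in sorted order (the original uses glob order), and the self-check runs on a throwaway directory with made-up file names and text.

```python
import os
import tempfile
from collections import Counter
from glob import glob

# Folder name in the dataset -> label used in the "feedback" column.
LABELS = {"pos": "positive", "neg": "negative"}


def load_labelled(root, splits=("train", "test")):
    """Return a list of {"review", "feedback"} dicts for every labelled review."""
    records = []
    for split in splits:
        for folder, feedback in LABELS.items():
            pattern = os.path.join(root, split, folder, "*.txt")
            for filename in sorted(glob(pattern)):
                with open(filename, encoding="utf-8") as f:
                    text = f.read().replace("<br />", "\n")
                records.append({"review": text, "feedback": feedback})
    return records


# Self-check on a tiny temporary tree that mimics the dataset layout.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "test"):
        for folder in LABELS:
            os.makedirs(os.path.join(tmp, split, folder))
            sample = os.path.join(tmp, split, folder, "0_1.txt")
            with open(sample, "w", encoding="utf-8") as f:
                f.write("line one<br />line two")
    demo_records = load_labelled(tmp)

demo_counts = Counter(record["feedback"] for record in demo_records)
print(demo_counts)
```

On the real data, `pd.DataFrame(load_labelled("input/aclImdb"))` would give the same two columns as `reviews_df`.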


### Export result
Store the data in a bz2-compressed, tab-separated (TSV) file.

In [5]:
# Export the combined DataFrame to a bz2-compressed TSV file
reviews_df.to_csv("input/reviews.tsv.bz2", index=False, sep='\t', 
                  encoding='utf-8', compression='bz2')
!ls -la input

total 18932
drwxr-xr-x 3 root root     4096 Apr 13 15:51 .
drwxr-xr-x 3 root root     4096 Apr 13 15:50 ..
drwxr-xr-x 4 7297 1000     4096 Jun 26  2011 aclImdb
-rw-r--r-- 1 root root 19371416 Apr 13 15:51 reviews.tsv.bz2
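
To spot-check an export like this without loading it all into memory, the file can be streamed with the standard library. This is a sketch under stated assumptions: `count_feedback` is a hypothetical helper, it expects a header row with a `feedback` column as written above, and the demo runs on a small made-up sample, not on the real file.

```python
import bz2
import csv
import os
import tempfile
from collections import Counter


def count_feedback(path):
    """Stream a bz2-compressed TSV and count rows per 'feedback' value."""
    counts = Counter()
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            counts[row["feedback"]] += 1
    return counts


# Demo on a tiny sample whose reviews contain a newline and a tab.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.tsv.bz2")
    with bz2.open(sample_path, mode="wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["review", "feedback"])
        writer.writerow(["Great film.\nLoved it.", "positive"])
        writer.writerow(["Dull\tand slow.", "negative"])
    sample_counts = count_feedback(sample_path)

print(sample_counts)
```

Given the counts printed by the loading cell, `count_feedback("input/reviews.tsv.bz2")` should report 25,000 rows for each label. In pandas, `pd.read_csv("input/reviews.tsv.bz2", sep="\t")` reads the same file back, with the compression inferred from the `.bz2` extension.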


### Download the file to your local file system

This triggers a browser download of the file to your local computer.

In [0]:
from google.colab import files
# Download file
files.download('input/reviews.tsv.bz2')
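
The `google.colab` module only exists inside Colab, so the cell above fails elsewhere. A guarded variant might look like the sketch below; `download_or_report` is a hypothetical helper name, and the fallback of returning the absolute path is an assumption about what is useful outside Colab.

```python
import os


def download_or_report(path):
    """Start a browser download in Colab; elsewhere return the file's absolute path."""
    try:
        from google.colab import files  # importable only inside Colab
    except ImportError:
        # Outside Colab there is nothing to download: the file is already local.
        return os.path.abspath(path)
    files.download(path)
    return None


print(download_or_report("input/reviews.tsv.bz2"))
```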