# Data preprocessing
    - Download data to the server
    - Convert text to sequences
    - Configure sequences for an RNN model

## Download data to the server

### Command line on the server
    Path to data:
        cd /home/ubuntu/data/training/keras
    Download dataset: 
        wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    Uncompress it:
        tar -zxvf aclImdb_v1.tar.gz
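
The same two steps can also be scripted from Python. The block below is a minimal sketch using only the standard library; `fetch_archive` and `extract_archive` are hypothetical helper names, not part of the original notebook, and the actual download call is left commented out because it needs network access.

```python
import os
import tarfile

try:                                    # Python 3
    from urllib.request import urlretrieve
except ImportError:                     # Python 2
    from urllib import urlretrieve


def fetch_archive(url, dest_dir):
    '''Download url into dest_dir (skipped if the file is already there).'''
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    archive = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(archive):
        urlretrieve(url, archive)
    return archive


def extract_archive(archive, dest_dir):
    '''Unpack a .tar.gz archive into dest_dir (same effect as tar -zxf).'''
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(dest_dir)


# Usage with the paths from the commands above (requires network access):
# url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
# archive = fetch_archive(url, '/home/ubuntu/data/training/keras')
# extract_archive(archive, '/home/ubuntu/data/training/keras')
```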

## Convert text to sequences
    - List of all text files
    - Read files into Python
    - Tokenize
    - Create dictionaries to recode
    - Recode tokens into ids and create sentences

In [2]:
#Imports and paths
import numpy as np

data_path='/home/ubuntu/data/training/keras/aclImdb/'

In [3]:
# Generator of list of files in a folder and subfolders
import os
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator with a recursive list of files in the toppath that match filepattern
    Inputs:
        filepattern(str): Shell-style wildcard pattern (e.g. '*.txt')
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

#Test
print(next(gen_find("*.txt", data_path+'train/pos/')))

/home/jorge/data/training/keras/aclImdb/train/pos/8938_9.txt


In [3]:
def read_sentences(path):
    '''
    Read the first line of every .txt file under path (one review per file)
    '''
    sentences = []
    sentences_list = gen_find("*.txt", path)
    for ff in sentences_list:
        with open(ff, 'r') as f:
            sentences.append(f.readline().strip())
    return sentences

#Test
print(read_sentences(data_path+'train/pos/')[0:2])

["A wonderful film version of the best-selling book and smash Broadway play about the lives of Sadie and Bessie Delany, two African-American sisters who both lived over the age of 100 and told their story of witnessing a century of American history. Ruby Dee and Diahann Carroll give very good performances as Bessie and Sadie, respectively. Amy Madigan also is good as Amy Hill Hearth, the white New York Times reporter whose article about the sisters launched the book, etc. Many of the flashback scenes and even many of the present-day ones are very powerful, if not quite as inspirational as in the book. That is the only real drawback, combined with the fact that certain aspects of the story are not presented clearly, such as the inter-racial background of the sisters' mother and why their father was so stern. But other than that, a very well-done, excellently performed, powerful movie.", "Now this is what I'd call a good horror. With occult/supernatural undertones, this nice low-budget F

In [4]:
print(read_sentences(data_path+'train/neg/')[0:2])

["Bobby is a goofy kid who smiles far too much and wants sex. So he buys a van to aid in this quest. The acting is lame, the comedy is pathetic and the script is no more than a loosely strung chain of clich\xc3\xa9s and cheap thrills. The makers of the film obviously wanted to capture some of the out there craziness of other films of the time, but fell a long way short. They even resort to Bobby slipping on a banana skin, because this will supposedly add comedic value.<br /><br />I'm struggling to find a redeeming feature of the film. If you like DeVito, this is another classic DeVito kind of role - but he's only a supporting actor and there for clich\xc3\xa9 value.", "It's the worst movie I've ever seen. The action is so unclear, work of cameras is so poor, actors are so affected ... and this lamentable 5 minutes of Arnie on the screen. My advice from the bottom of my heart - don't watch it unless you like such a low class torture."]


In [5]:
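def tokenize(sentences):
    '''
    Split each sentence into a list of tokens with NLTK
    '''
    from nltk import word_tokenize
    print 'Tokenizing...',
    tokens = []
    for sentence in sentences:
        # Python 2: the files were read as byte strings, so decode them to unicode first
        tokens.append(word_tokenize(sentence.decode('utf-8')))
    print('Done!')

    return tokens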

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))

Tokenizing... Done!
[[u'A', u'wonderful', u'film', u'version', u'of', u'the', u'best-selling', u'book', u'and', u'smash', u'Broadway', u'play', u'about', u'the', u'lives', u'of', u'Sadie', u'and', u'Bessie', u'Delany', u',', u'two', u'African-American', u'sisters', u'who', u'both', u'lived', u'over', u'the', u'age', u'of', u'100', u'and', u'told', u'their', u'story', u'of', u'witnessing', u'a', u'century', u'of', u'American', u'history', u'.', u'Ruby', u'Dee', u'and', u'Diahann', u'Carroll', u'give', u'very', u'good', u'performances', u'as', u'Bessie', u'and', u'Sadie', u',', u'respectively', u'.', u'Amy', u'Madigan', u'also', u'is', u'good', u'as', u'Amy', u'Hill', u'Hearth', u',', u'the', u'white', u'New', u'York', u'Times', u'reporter', u'whose', u'article', u'about', u'the', u'sisters', u'launched', u'the', u'book', u',', u'etc', u'.', u'Many', u'of', u'the', u'flashback', u'scenes', u'and', u'even', u'many', u'of', u'the', u'present-day', u'ones', u'are', u'very', u'powerful', u',
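
The raw reviews contain HTML line breaks such as `<br /><br />` (visible in the negative example above), which the tokenizer receives as ordinary text. The block below is a rough sketch of one way to strip them before tokenizing. `simple_tokenize` is a hypothetical regex-based stand-in that runs without NLTK; it does not reproduce the output of `word_tokenize` (for instance, it keeps contractions such as "It's" as one token).

```python
import re

BR_TAG = re.compile(r'<br\s*/?>', re.IGNORECASE)
# A word (optionally with inner hyphens or apostrophes) or a single punctuation mark
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    '''Replace <br /> tags with spaces, then split into words and punctuation.'''
    text = BR_TAG.sub(' ', text)
    return TOKEN.findall(text)


print(simple_tokenize("It's a best-selling book.<br /><br />I liked it!"))
```

If NLTK is kept as the tokenizer, only the `BR_TAG.sub(' ', text)` step would be needed in front of it.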

In [6]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg


Tokenizing... Done!
Tokenizing... Done!


In [7]:
# Create the dictionary to convert words to numbers, ordered with the most frequent words first
def build_dict(sentences):
    '''
    Build dictionary of train words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print 'Building dictionary..',
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = wordcount.values() # List of frequencies
    keys = wordcount.keys() #List of words
    
    sorted_idx = reversed(np.argsort(counts))
    
    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # ids 0 and 1 are reserved (1 = UNK, unknown word)
    print np.sum(counts), ' total words ', len(keys), ' unique words'

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])

Building dictionary.. 7056193  total words  135098  unique words
(2, 289298)
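
The counting and ranking can also be written with `collections.Counter`. The block below is a sketch of that alternative, not the notebook's method: `build_dict_counter` is a hypothetical name, and breaking ties alphabetically is an added assumption so that words with equal counts always receive the same ids.

```python
from collections import Counter


def build_dict_counter(sentences):
    '''
    Hypothetical Counter-based variant of build_dict.
    Returns the same pair: (word --> word index, word --> word count).
    '''
    wordcount = Counter(w for ss in sentences for w in ss)
    # Most frequent first; ties are broken alphabetically so ids are reproducible
    ordered = sorted(wordcount, key=lambda w: (-wordcount[w], w))
    worddict = {w: idx + 2 for idx, w in enumerate(ordered)}  # 0 and 1 stay reserved
    return worddict, dict(wordcount)


# Toy corpus, not IMDB data
toy = [['the', 'film', 'is', 'good'], ['the', 'plot', 'is', 'thin'], ['the', 'end']]
toy_dict, toy_count = build_dict_counter(toy)
print(toy_dict['the'])    # id of the most frequent word
print(toy_count['the'])   # its frequency
```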


In [8]:
# Recode tokens into integer ids
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers.
    Words missing from the dictionary are coded as 1 (UNK)
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary[w] if w in dictionary else 1 for w in ss]

    return seqs
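
Because the dictionary is built from the training reviews only, any word that appears only in other text is recoded to 1. The sketch below shows that behaviour on a toy dictionary and measures how often it happens. `unknown_rate` is a hypothetical helper, and `dict.get(w, 1)` is used here as an equivalent of the conditional expression in `generate_sequence`.

```python
def unknown_rate(seqs, unk_id=1):
    '''Share of tokens that were recoded to unk_id.'''
    total = sum(len(s) for s in seqs)
    unknown = sum(1 for s in seqs for w in s if w == unk_id)
    return float(unknown) / total if total else 0.0


# Toy dictionary and sentences, not IMDB data
toy_dictionary = {'the': 2, 'film': 3, 'good': 4}
toy_sentences = [['the', 'film'], ['the', 'good', 'sequel']]

# 'sequel' is not in the dictionary, so it becomes 1
toy_seqs = [[toy_dictionary.get(w, 1) for w in ss] for ss in toy_sentences]
print(toy_seqs)
print(unknown_rate(toy_seqs))
```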

In [9]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])

([137, 424, 26, 328, 7, 2, 17914, 302, 5, 8144, 2246, 329, 55, 2, 487, 7, 20032, 5, 23759, 40433, 3, 132, 7118, 2922, 47, 238, 1539, 151, 2, 695, 7, 1320, 5, 600, 82, 80, 7, 8073, 6, 1448, 7, 338, 541, 4, 5092, 5486, 5, 42665, 13131, 223, 65, 63, 368, 22, 23759, 5, 20032, 3, 5633, 4, 4330, 29365, 111, 9, 63, 22, 4330, 3306, 63946, 3, 2, 587, 535, 773, 4487, 2486, 625, 8569, 55, 2, 2922, 9465, 2, 302, 3, 637, 4, 1412, 7, 2, 2794, 162, 5, 83, 131, 7, 2, 21381, 674, 35, 65, 991, 3, 78, 36, 200, 22, 6972, 22, 14, 2, 302, 4, 267, 9, 2, 77, 174, 11709, 3, 2630, 23, 2, 208, 17, 817, 1378, 7, 2, 80, 35, 36, 1353, 737, 3, 163, 22, 2, 45078, 980, 7, 2, 2922, 92, 488, 5, 188, 82, 355, 20, 53, 10488, 4, 118, 102, 93, 17, 3, 6, 65, 6239, 3, 6767, 2671, 3, 991, 25, 4], 1)


In [10]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])

Tokenizing... Done!
Tokenizing... Done!
[137, 30688, 3344, 38048, 17, 299, 17, 245, 9, 228, 2, 3379, 7, 1726, 4, 2463, 27811, 18, 501, 9, 7308, 5, 2732, 3, 2, 410, 288, 57, 22, 1522, 222, 22, 16, 89, 247, 6, 2105, 618, 4, 601, 5, 5722, 238, 223, 504, 368, 5, 238, 315, 8, 12882, 2, 593, 7, 18525, 91, 16, 18, 6, 45107, 3344, 38048, 17, 83, 31358, 96, 619, 792, 5, 376, 4, 8431, 2260, 406, 238, 2, 818, 6, 603, 24, 82, 304, 347, 38, 157, 23, 112, 3, 40, 18, 53, 1092, 17, 34, 96, 285, 2, 1069, 5896, 8, 2, 820, 7, 1356, 5, 6033, 33, 2, 110, 1425, 48, 392, 8, 4, 10568, 74276, 9, 1624, 5, 6663, 22, 2269, 57603, 3, 40, 56, 274, 7698, 14, 3096, 17, 40, 697, 430, 8, 194, 55, 2, 3344, 18461, 2797, 8, 19, 231, 3, 6, 208, 17, 178, 38, 1, 257, 45, 2, 64, 1135, 4, 137, 2100, 1061, 410, 4]
1


## Configure sequences for an RNN model
    - Remove words with low frequency
    - Truncate / complete sequences to the same length

In [11]:
# Median length of the test sequences, in tokens
print np.median([len(x) for x in X_test_full])

208.0
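
The median alone does not say how many reviews a given cut-off would shorten. A small sketch of that check follows; `truncated_share` is a hypothetical helper and the lengths are made-up illustrative values, not figures from the IMDB data.

```python
def truncated_share(lengths, maxlen):
    '''Fraction of sequences longer than maxlen, i.e. that would lose tokens.'''
    return sum(1 for n in lengths if n > maxlen) / float(len(lengths))


toy_lengths = [50, 120, 180, 208, 240, 400, 900]   # made-up lengths

for candidate in (100, 200, 400):
    share = truncated_share(toy_lengths, candidate)
    print('maxlen=%d truncates %.0f%% of the sequences' % (candidate, 100 * share))
```

On the real data, `lengths` would be `[len(x) for x in X_train_full]`.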


In [12]:
max_features = 50000 # Number of most frequent words kept; less frequent words are recoded to 0
maxlen = 200  # Sequence length: longer texts are truncated and shorter ones padded to this many tokens

In [13]:
#Select the most frequent max_features, recode others using 0
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])

[61, 374, 20, 2, 110, 374, 664, 44, 6287, 13020, 6218, 5, 20, 42, 7, 484, 250, 44, 2, 1216, 801, 18258, 4032, 2583, 452, 2, 3016, 3, 107, 427, 2, 875, 4, 458, 2804, 2485, 3, 838, 5, 3, 8, 6, 3307, 2890, 3, 102, 4061, 3, 89, 3326, 427, 5269, 4769, 22, 28240, 3, 2, 2115, 7, 13020, 6218, 2092, 2562, 33, 2, 875, 4, 4056, 5479, 14246, 5, 18258, 18, 13417, 5164, 3, 19, 9, 200, 4846, 4, 61, 9, 46, 906, 374, 4, 1, 92, 1521, 3, 1700, 46, 92, 2992, 5, 1, 333, 3244, 37, 3358, 148, 4, 118, 16, 18, 161, 322, 175, 4, 4770, 4]
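
To make the recoding rule concrete, here is the same comprehension on a toy input with a small threshold. `remove_features_toy` is a hypothetical copy of `remove_features` that takes the threshold as an argument instead of reading the global `max_features`.

```python
def remove_features_toy(x, max_features):
    '''Same rule as remove_features, with the threshold passed explicitly.'''
    return [[0 if w >= max_features else w for w in sen] for sen in x]


toy = [[2, 7, 1, 12], [5, 9, 3]]
print(remove_features_toy(toy, max_features=8))   # ids 12 and 9 become 0
```

Two details follow from the rule. Since word ids start at 2, a threshold of `max_features` keeps `max_features - 2` distinct words. Rare words end up with id 0, which is also the default padding value in Keras `pad_sequences`, while words never seen in training keep id 1.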


In [14]:
from keras.preprocessing import sequence

# Truncate or pad the sentences to length = maxlen
print("Pad sequences (samples x time)")

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])

Using Theano backend.
Using gpu device 0: GeForce GTX TITAN Black (CNMeM is disabled, cuDNN 5103)


Pad sequences (samples x time)
('X_train shape:', (25000, 200))
('X_test shape:', (25000, 200))
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0   137 30688  3344 38048    17   299    17   245     9   228
     2  3379     7  1726     4  2463 27811    18   501     9  7308     5
  2732     3     2   410   288    57    22  1522   222    22    16    89
   247     6  2105   618     4   601     5  5722   238   223   504   368
     5   238   315     8 12882     2   593     7 18525    91    16    18
     6 45107  3344 38048    17    83 31358    96   619   792     5   376
     4  8431  2260   406   238     2   818     6   603    24    82   304
   347    38   157    23   112     3    40    18    53  1092    17    34
    96   285     2  1069  5896     8     2   820     7  1356     5  6033
    33     2   110  1425    
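
The leading zeros in the output come from the defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`): short sequences are padded on the left, and long sequences keep their last `maxlen` tokens. The block below is a pure-Python sketch of that behaviour for illustration only; `pad_pre` is a hypothetical name, it assumes `maxlen` is positive, and it returns lists rather than a NumPy array.

```python
def pad_pre(seqs, maxlen, value=0):
    '''Left-pad with value and keep the last maxlen tokens of longer sequences.'''
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                         # truncating='pre': drop the beginning
        out.append([value] * (maxlen - len(s)) + s)   # padding='pre': fill values go first
    return out


print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
```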

In [15]:
# Shuffle data
from sklearn.utils import shuffle
X_train, y_train = shuffle(X_train, y_train, random_state=0)
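
Shuffling matters here because the training set was built as all positive reviews followed by all negative ones. `shuffle` applies one and the same permutation to both inputs, so each label stays with its sequence. The sketch below shows the idea with the standard library only; `shuffle_together` is a hypothetical helper and is not how scikit-learn implements it.

```python
import random


def shuffle_together(X, y, seed=0):
    '''Apply the same random permutation to X and y.'''
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


# Toy data: the first element of each row matches its label
toy_X = [[1, 10], [1, 11], [0, 12], [0, 13]]
toy_y = [1, 1, 0, 0]

shuf_X, shuf_y = shuffle_together(toy_X, toy_y)
print(all(row[0] == label for row, label in zip(shuf_X, shuf_y)))
```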