# Data preprocessing
    - Download data to the server
    - Convert text to sequences
    - Configure sequences for an RNN model

## Download data to the server

### Command line on the server
    Path to data:
        cd /home/ubuntu/data/training/keras
    Download dataset: 
        wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    Uncompress it:
        tar -zxvf aclImdb_v1.tar.gz
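
The same two steps can also be scripted from Python. The following is only a sketch using the standard library: the function name `fetch_and_extract` is made up for illustration, and the call is left commented out so that nothing is downloaded by accident.

```python
import os
import tarfile

try:                                   # Python 3
    from urllib.request import urlretrieve
except ImportError:                    # Python 2
    from urllib import urlretrieve

# URL taken from the wget command above
DATA_URL = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

def fetch_and_extract(url, dest_dir):
    """Download the archive into dest_dir (if not already there) and unpack it."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    archive = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(archive):    # skip the download when the file exists
        urlretrieve(url, archive)
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(dest_dir)       # same effect as `tar -zxvf` in dest_dir
    return archive

# Uncomment to run on the server:
# fetch_and_extract(DATA_URL, '/home/ubuntu/data/training/keras')
```

Skipping the download when the archive is already present makes the cell safe to re-run.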

## Convert text to sequences
    - List of all text files
    - Read files into Python
    - Tokenize
    - Create dictionaries to recode
    - Recode tokens into ids and create sentences

In [1]:
#Imports and paths
from __future__ import print_function

import numpy as np

# Server path; the local path on the next line overrides it (keep only the one that applies)
data_path='/home/ubuntu/data/training/keras/aclImdb/'
data_path='/Users/jorge/data/training/keras/aclImdb/'

In [2]:
# Generator of list of files in a folder and subfolders
import os
import shutil
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator with a recursive list of files in the toppath that match filepattern
    Inputs:
        filepattern(str): Shell-style pattern (e.g. "*.txt")
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

# Test (the built-in next() works on both Python 2 and 3)
print(next(gen_find("*.txt", data_path+'train/pos/')))

/Users/jorge/data/training/keras/aclImdb/train/pos/0_9.txt
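
On Python 3 the same recursive search can be sketched with `pathlib`. This is an alternative to `gen_find`, not the notebook's own code; the function name `find_files` and the temporary directory tree are made up for the demonstration.

```python
import os
import tempfile
from pathlib import Path

def find_files(filepattern, toppath):
    """Return a sorted list of files under toppath matching a shell-style pattern."""
    return sorted(str(p) for p in Path(toppath).rglob(filepattern) if p.is_file())

# Hypothetical directory tree, built only for this example
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, 'pos'))
for name in [os.path.join('pos', '0_9.txt'), os.path.join('pos', '1_7.txt'), 'readme.md']:
    with open(os.path.join(demo_root, name), 'w') as f:
        f.write('demo')

found = find_files('*.txt', demo_root)
print([os.path.relpath(p, demo_root) for p in found])
```

Sorting the result makes the file order reproducible, whereas `os.walk` yields files in whatever order the file system returns them.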


In [3]:
def read_sentences(path):
    '''
    Read the first line of every .txt file found under path
    '''
    sentences = []
    sentences_list = gen_find("*.txt", path)
    for ff in sentences_list:
        with open(ff, 'r') as f:
            sentences.append(f.readline().strip())
    return sentences

#Test
print(read_sentences(data_path+'train/pos/')[0:2])

['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!', 'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from

In [4]:
print(read_sentences(data_path+'train/neg/')[0:2])

["Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.", "Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. 

In [5]:
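def tokenize(sentences):
    '''
    Split each sentence into a list of tokens with NLTK
    '''
    from nltk import word_tokenize
    print('Tokenizing...')
    tokens = []
    for sentence in sentences:
        # Python 2: decode bytes to unicode. On Python 3, open() already
        # returns str and the .decode() call must be dropped.
        tokens.append(word_tokenize(sentence.decode('utf-8')))
    print('Done!')

    return tokens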

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))

Tokenizing...
Done!
[[u'Bromwell', u'High', u'is', u'a', u'cartoon', u'comedy', u'.', u'It', u'ran', u'at', u'the', u'same', u'time', u'as', u'some', u'other', u'programs', u'about', u'school', u'life', u',', u'such', u'as', u'``', u'Teachers', u"''", u'.', u'My', u'35', u'years', u'in', u'the', u'teaching', u'profession', u'lead', u'me', u'to', u'believe', u'that', u'Bromwell', u'High', u"'s", u'satire', u'is', u'much', u'closer', u'to', u'reality', u'than', u'is', u'``', u'Teachers', u"''", u'.', u'The', u'scramble', u'to', u'survive', u'financially', u',', u'the', u'insightful', u'students', u'who', u'can', u'see', u'right', u'through', u'their', u'pathetic', u'teachers', u"'", u'pomp', u',', u'the', u'pettiness', u'of', u'the', u'whole', u'situation', u',', u'all', u'remind', u'me', u'of', u'the', u'schools', u'I', u'knew', u'and', u'their', u'students', u'.', u'When', u'I', u'saw', u'the', u'episode', u'in', u'which', u'a', u'student', u'repeatedly', u'tried', u'to', u'burn', u'do

In [6]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg


Tokenizing...
Done!
Tokenizing...
Done!


In [7]:
# Create the dictionary to convert words to numbers, most frequent words first
def build_dict(sentences):
    '''
    Build dictionary of train words
    Outputs:
     - Dictionary of word --> word index
     - Dictionary of word --> word count
    '''
    print('Building dictionary..')
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = list(wordcount.values())  # List of frequencies
    keys = list(wordcount.keys())  # List of words, in the same order as counts

    sorted_idx = reversed(np.argsort(counts))  # Indices by descending frequency

    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # ids 0 and 1 are reserved (1 = UNK)
    print(np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])

Building dictionary..
7056193  total words  135098  unique words
2 289298


In [8]:
# Recode tokens into ids
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers.
    Words missing from the dictionary are recoded as 1 (UNK).
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary[w] if w in dictionary else 1 for w in ss]

    return seqs
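
A toy run makes the recoding rule concrete. The sketch below uses `dict.get` with a default, which is equivalent to the conditional expression in `generate_sequence`; the name `recode` and the three-word vocabulary are hypothetical.

```python
UNK_ID = 1  # id the notebook assigns to out-of-vocabulary words

def recode(sentences, dictionary):
    """Replace each token by its id, falling back to UNK_ID."""
    return [[dictionary.get(w, UNK_ID) for w in ss] for ss in sentences]

toy_dictionary = {'the': 2, 'film': 3, 'is': 4}  # hypothetical vocabulary
toy_sentences = [['the', 'film', 'is', 'great'],
                 ['is', 'the', 'sequel', 'good']]
toy_seqs = recode(toy_sentences, toy_dictionary)
print(toy_seqs)
```

Words that were never seen when the dictionary was built ('great', 'sequel', 'good' here) all collapse to the same id, which is what happens to unseen test-set words in the cells that follow.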

In [9]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])

[25771, 2010, 9, 6, 1154, 252, 4, 51, 2178, 43, 2, 185, 74, 22, 62, 102, 6180, 55, 457, 144, 3, 163, 22, 32, 30392, 31, 4, 331, 5910, 176, 14, 2, 5314, 6415, 505, 87, 8, 285, 17, 25771, 2010, 18, 2009, 9, 94, 2504, 8, 685, 93, 9, 32, 30392, 31, 4, 21, 31367, 8, 2136, 12271, 3, 2, 6460, 1527, 47, 71, 84, 231, 165, 82, 1286, 5864, 92, 23687, 3, 2, 55076, 7, 2, 236, 919, 3, 45, 3054, 87, 7, 2, 6585, 15, 697, 5, 82, 1527, 4, 283, 15, 234, 2, 410, 14, 72, 6, 1530, 3872, 802, 8, 3892, 211, 2, 457, 3, 15, 1257, 15934, 69, 69, 69, 43, 69, 69, 69, 4, 2010, 4, 137, 378, 402, 90, 62700, 90, 15, 167, 164, 8, 11052, 42, 7, 150, 5864, 4, 44635, 90, 9161, 8, 25771, 2010, 4, 15, 550, 17, 131, 1507, 7, 86, 695, 121, 17, 25771, 2010, 9, 262, 9587, 4, 218, 6, 2576, 17, 16, 9, 30, 41] 1


In [10]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])

Tokenizing...
Done!
Tokenizing...
Done!
[15, 448, 5, 234, 19, 25, 268, 419, 138, 129, 33387, 8, 44, 6, 184, 390, 7, 2014, 4, 15, 254, 942, 17, 15, 20, 6106, 8, 84, 16, 103, 49, 67, 15, 697, 7, 14659, 17791, 40, 20, 77, 517, 8, 60, 252, 4, 15, 20, 394, 4, 17791, 275, 2, 123, 7, 3086, 16280, 65, 108, 3, 5, 1832, 10676, 275, 997, 7851, 23, 163, 14025, 4, 21, 1943, 7, 6, 63, 25, 9, 17, 16, 71, 4335, 23, 293, 1442, 4, 61, 42, 89, 636, 17, 4, 21, 446, 843, 28, 72, 20, 3048, 58, 27, 20, 3099, 44, 2197, 347, 2, 110, 385, 7, 2, 25, 3, 5, 81, 1668, 8, 1758, 347, 2, 381, 385, 4, 458, 16245, 2, 843, 15, 36, 77, 234, 131, 380, 14, 1758, 3, 29, 131, 447, 2346, 400, 22, 108, 3, 284, 2834, 36, 8, 369, 289, 84, 112, 2692, 4, 61, 25, 20, 106, 3, 5, 15, 1432, 17, 34, 169, 84, 16, 187, 34, 2323, 4]
1


## Configure sequences for an RNN model
    - Remove words with low frequency
    - Truncate / pad sequences to the same length

In [11]:
# Median length (in tokens) of the test sequences
print('Median length: ', np.median([len(x) for x in X_test_full]))

Median length:  208.0
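
The median is only one point of the length distribution. When choosing a cut-off it can help to also check how many sequences would be truncated. A minimal sketch on made-up lengths follows; the helper `truncated_fraction` is hypothetical and not part of the notebook.

```python
import numpy as np

def truncated_fraction(sequences, maxlen):
    """Fraction of sequences that are longer than maxlen."""
    lengths = np.array([len(s) for s in sequences])
    return float(np.mean(lengths > maxlen))

# Toy sequences with invented lengths, for illustration only
toy_sequences = [[1] * n for n in (50, 150, 208, 400, 900)]
toy_lengths = [len(s) for s in toy_sequences]

print('Median:', np.median(toy_lengths))
print('90th percentile:', np.percentile(toy_lengths, 90))
print('Fraction longer than 200:', truncated_fraction(toy_sequences, 200))
```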


In [12]:
max_features = 50000  # Word ids below this value are kept; less frequent words are recoded to 0
maxlen = 200  # Sequences are cut or padded to this number of tokens

In [13]:
#Select the most frequent max_features, recode others using 0
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])

[4759, 689, 204, 1054, 5694, 1144, 70, 38, 2505, 2032, 3, 2, 1, 32, 20986, 31, 3, 23, 19, 248, 2990, 2422, 507, 55, 2, 25169, 2522, 12691, 141, 6, 220, 338, 0, 5801, 49, 38, 4673, 980, 8, 329, 452, 38, 1, 11650, 14, 67, 20, 2416, 22, 32, 21, 9263, 5756, 4338, 12475, 4, 31, 15, 167, 85, 359, 7, 10289, 3, 5, 168, 23397, 12674, 2422, 1569, 35, 6, 11870, 6, 2758, 28, 107, 1043, 250, 8, 2800, 976, 23, 32, 16197, 31, 5, 32, 2216, 702, 31, 27, 3, 29, 62, 114, 19, 26, 20, 9869, 45, 2, 14646, 12, 13, 10, 11, 12, 13, 10, 11, 21, 26, 532, 23, 62, 1544, 657, 924, 28, 945, 6, 0, 328, 7, 2, 1226, 657, 924, 7, 4278, 18, 32, 0, 31, 5, 32, 5944, 31, 27, 3, 29, 47269, 403, 1378, 24, 116, 110, 11964, 605, 4, 1358, 2, 245, 1100, 8, 2, 2522, 12691, 209, 1357, 70, 65, 108, 4, 5694, 88, 6, 356, 313, 5, 299, 6, 10530, 24, 1150, 3723, 17632, 28, 15, 489, 2, 0, 4387, 7, 2, 245, 33, 316, 132, 7, 2, 1007, 27, 17, 20959, 2, 133, 1091, 54, 887, 62, 2083, 1985, 1183, 8, 2, 4148, 4, 135, 18, 62, 1461, 123, 970, 73, 2
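
The effect of the threshold is easier to see on a tiny example. The sketch below repeats the logic of `remove_features` with the threshold passed as an argument; `cap_vocabulary` and the toy ids are made up.

```python
def cap_vocabulary(sequences, max_features, rare_id=0):
    """Recode every id >= max_features as rare_id."""
    return [[rare_id if w >= max_features else w for w in seq]
            for seq in sequences]

toy = [[2, 7, 1, 4], [5, 3, 9]]  # invented ids, 1 = UNK
capped = cap_vocabulary(toy, max_features=5)
print(capped)

# Word ids that survive the cap (0 and 1 are reserved codes)
kept_ids = sorted({w for seq in capped for w in seq if w >= 2})
print(kept_ids)
```

Because word ids start at 2, a threshold of `max_features` keeps `max_features - 2` distinct words. The code 0 is also the default padding value of `pad_sequences`, so rare words and padding share the same id.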

In [14]:
from keras.preprocessing import sequence

# Cut or complete the sentences to length = maxlen
print("Pad sequences (samples x time)")

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])

ImportError: No module named keras.preprocessing
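
Since the Keras import failed in this environment, the padding step can be approximated with NumPy alone. The sketch below assumes the default behaviour of `pad_sequences` (pad on the left with 0, truncate from the front); the function `pad_or_truncate` is a made-up stand-in, not the Keras implementation.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad with `value` and keep the last `maxlen` items of long sequences."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                    # an empty sequence stays all padding
        kept = seq[-maxlen:]            # drop items from the front when too long
        out[row, -len(kept):] = kept    # right-align, so padding sits on the left
    return out

toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6], []]  # invented ids
padded = pad_or_truncate(toy, maxlen=4)
print(padded)
```

With these defaults a long review keeps its last `maxlen` tokens rather than its first.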

In [None]:
# Shuffle data
from sklearn.utils import shuffle
X_train, y_train = shuffle(X_train, y_train, random_state=0)
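
If scikit-learn is not available, the same joint shuffle can be sketched with a NumPy permutation. `shuffle_together` is a hypothetical helper; with the same seed it is reproducible, but it is not guaranteed to produce the same order as `sklearn.utils.shuffle`.

```python
import numpy as np

def shuffle_together(X, y, seed=0):
    """Apply one random permutation to both X and y so that rows stay aligned."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    return np.asarray(X)[order], np.asarray(y)[order]

toy_X = np.arange(10).reshape(5, 2)  # row i is [2*i, 2*i + 1]
toy_y = np.array([0, 1, 2, 3, 4])    # label i belongs to row i
shuf_X, shuf_y = shuffle_together(toy_X, toy_y)
print(shuf_X[:, 0] // 2, shuf_y)     # the two arrays should match
```

Shuffling matters here because the training lists were built as all positive reviews followed by all negative ones.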