# Data preprocessing
    - Download data to the server
    - Convert text to sequences
    - Configure sequences for an RNN model

## Download data to the server

### Command line in the server
    Path to data:
        cd /home/ubuntu/data/training/text/sentiment
    Download dataset: 
        wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    Uncompress it:
        tar -zxvf aclImdb_v1.tar.gz
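
Once the archive is unpacked, it can help to confirm that the four folders read later in this notebook (`train/pos`, `train/neg`, `test/pos`, `test/neg`) exist and contain text files. The following is a minimal sketch: `count_reviews` is a hypothetical helper, not part of the original notebook, and the demo runs on a throwaway directory with a made-up file name rather than on the real dataset.

```python
import os
import tempfile

def count_reviews(root):
    """Count the .txt files in each split/label folder under root."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(root, split, label)
            if os.path.isdir(folder):
                n = sum(1 for name in os.listdir(folder) if name.endswith(".txt"))
            else:
                n = 0  # a missing folder is reported as empty
            counts[(split, label)] = n
    return counts

# Demo on a temporary directory tree (hypothetical file name).
with tempfile.TemporaryDirectory() as demo_root:
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            os.makedirs(os.path.join(demo_root, split, label))
    with open(os.path.join(demo_root, "train", "pos", "0_9.txt"), "w") as f:
        f.write("A fine film.")
    demo_counts = count_reviews(demo_root)

print(demo_counts)
```

On the server, the same function could be pointed at the extracted `aclImdb` folder.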

## Convert text to sequences
    - List of all text files
    - Read files into Python
    - Tokenize
    - Create dictionaries to recode
    - Recode tokens into ids and create sentences

In [1]:
#Imports and paths
from __future__ import print_function

import numpy as np

# GPU path
#data_path='/home/ubuntu/data/training/text/sentiment/aclImdb/'

data_path='../../data/aclImdb/'


In [2]:
# Generator of list of files in a folder and subfolders
import os
import shutil
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator that walks toppath recursively and yields the files matching filepattern
    Inputs:
        filepattern(str): Shell-style wildcard pattern, e.g. "*.txt"
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

#Test
#print(next(gen_find("*.txt", data_path+'train/pos/')))
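
To see the recursive matching in isolation, here is a small sketch that repeats `gen_find` and runs it on a temporary folder. The file names are invented for the demo. Note that `os.walk` does not guarantee any particular file order, so the results are sorted here to make the output reproducible.

```python
import fnmatch
import os
import tempfile

def gen_find(filepattern, toppath):
    # Same generator as above: walk the tree and yield matching file paths.
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))
    # Two .txt files (one nested) and one file that must not match.
    for rel in ("a.txt", "b.csv", os.path.join("sub", "c.txt")):
        with open(os.path.join(top, rel), "w") as f:
            f.write("x")
    found = sorted(os.path.relpath(p, top) for p in gen_find("*.txt", top))

print(found)
```

Because the order of files is not fixed, which review appears first in the outputs of this notebook may differ between machines.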

In [4]:
def read_sentences(path):
    # Read the first line of every .txt file under path
    sentences = []
    sentences_list = gen_find("*.txt", path)
    for ff in sentences_list:
        with open(ff, 'r', encoding='utf-8') as f:
            sentences.append(f.readline().strip())
    return sentences

#Test
print(read_sentences(data_path+'train/pos/')[0:2])

['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!', 'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from

In [5]:
print(read_sentences(data_path+'train/neg/')[0:2])

["Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.", "Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. 

In [6]:
def tokenize(sentences):
    # Split each text into a list of word and punctuation tokens with NLTK
    from nltk import word_tokenize
    print('Tokenizing...')
    tokens = []
    for sentence in sentences:
        tokens.append(word_tokenize(sentence))
    print('Done!')

    return tokens

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))

Tokenizing...
Done!
[['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '``', 'Teachers', "''", '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '``', 'Teachers', "''", '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '...', '...', '...', 'at', '...', '...', '...', 

In [7]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg


Tokenizing...
Done!
Tokenizing...
Done!


In [8]:
# Create the dictionary to convert words to numbers. Order it with the most frequent words first
def build_dict(sentences):
#    from collections import OrderedDict

    '''
    Build dictionary of train words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print('Building dictionary..')
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = list(wordcount.values()) # List of frequencies
    keys = list(wordcount) #List of words
    
    sorted_idx = reversed(np.argsort(counts))
    
    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # ids start at 2: 0 and 1 are reserved (1 = unknown word, UNK)
    print( np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])

Building dictionary..
7056532  total words  134957  unique words
2 289300
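
The same word-to-id mapping can be sketched more compactly with `collections.Counter`. This is an alternative formulation, not the notebook's code: `build_dict_counter` and the toy corpus are made up for illustration, and words with equal counts may receive different ids than with the `argsort` version above, since the two approaches break ties differently.

```python
from collections import Counter

def build_dict_counter(sentences):
    """Return (word -> id, word -> count); ids start at 2, most frequent first."""
    wordcount = Counter(w for ss in sentences for w in ss)
    # most_common() lists words by decreasing count; ids 0 and 1 stay reserved.
    worddict = {w: idx + 2 for idx, (w, _) in enumerate(wordcount.most_common())}
    return worddict, wordcount

# Toy tokenized corpus (hypothetical data).
toy = [["the", "film", "is", "good"], ["the", "plot", "is", "thin"], ["the", "end"]]
toy_dict, toy_count = build_dict_counter(toy)

print(toy_dict["the"], toy_count["the"])  # 2 3
print(toy_dict["is"], toy_count["is"])    # 3 2
```

As in the output above, the most frequent token gets id 2.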


In [9]:
# Recode tokens into integer ids; words missing from the dictionary get id 1 (UNK)
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary[w] if w in dictionary else 1 for w in ss]

    return seqs
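
A toy run makes the handling of unknown words visible. This sketch restates the function with `dict.get`, which behaves the same as the conditional expression above; the small dictionary and sentence are invented for the example.

```python
def generate_sequence(sentences, dictionary):
    # Words absent from the dictionary are mapped to 1 (UNK).
    return [[dictionary.get(w, 1) for w in ss] for ss in sentences]

# Hypothetical dictionary; "superb" is deliberately missing.
toy_dict = {"the": 2, "film": 3, "is": 4, "good": 5}
toy_seqs = generate_sequence([["the", "film", "is", "superb"]], toy_dict)

print(toy_seqs)  # [[2, 3, 4, 1]]
```

This matters for the test set, whose words are looked up in a dictionary built from the training reviews only.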

In [10]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])

[25754, 2020, 9, 6, 1157, 252, 4, 51, 2182, 43, 2, 185, 74, 22, 62, 102, 6174, 55, 457, 144, 3, 163, 22, 32, 30767, 31, 4, 331, 5911, 176, 14, 2, 5267, 6405, 505, 87, 8, 285, 17, 25754, 2020, 18, 2016, 9, 94, 2499, 8, 685, 93, 9, 32, 30767, 31, 4, 21, 30770, 8, 2133, 12211, 3, 2, 6480, 1527, 47, 71, 84, 231, 165, 82, 1287, 5832, 92, 23854, 3, 2, 50096, 7, 2, 236, 920, 3, 45, 3061, 87, 7, 2, 6638, 15, 697, 5, 82, 1527, 4, 283, 15, 234, 2, 410, 14, 72, 6, 1528, 3880, 802, 8, 3891, 211, 2, 457, 3, 15, 1259, 16443, 69, 69, 69, 43, 69, 69, 69, 4, 2020, 4, 137, 378, 402, 90, 50176, 90, 15, 167, 164, 8, 11020, 42, 7, 150, 5832, 4, 44287, 90, 9096, 8, 25754, 2020, 4, 15, 550, 17, 131, 1508, 7, 86, 695, 121, 17, 25754, 2020, 9, 262, 9688, 4, 217, 6, 2576, 17, 16, 9, 30, 41] 1


In [11]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])

Tokenizing...
Done!
Tokenizing...
Done!
[15, 448, 5, 234, 19, 25, 268, 419, 138, 129, 34510, 8, 44, 6, 184, 391, 7, 2009, 4, 15, 254, 942, 17, 15, 20, 6115, 8, 84, 16, 103, 49, 67, 15, 697, 7, 14439, 17665, 40, 20, 77, 517, 8, 60, 252, 4, 15, 20, 394, 4, 17665, 275, 2, 123, 7, 3082, 16416, 65, 108, 3, 5, 1829, 10670, 275, 996, 7840, 23, 163, 14216, 4, 21, 1944, 7, 6, 63, 25, 9, 17, 16, 71, 4345, 23, 293, 1442, 4, 61, 42, 89, 635, 17, 4, 21, 446, 843, 28, 72, 20, 3048, 58, 27, 20, 3097, 44, 2209, 347, 2, 110, 385, 7, 2, 25, 3, 5, 81, 1674, 8, 1759, 347, 2, 381, 385, 4, 458, 16479, 2, 843, 15, 36, 77, 234, 131, 380, 14, 1759, 3, 29, 131, 447, 2346, 400, 22, 108, 3, 284, 2831, 36, 8, 369, 289, 84, 112, 2697, 4, 61, 25, 20, 106, 3, 5, 15, 1431, 17, 34, 169, 84, 16, 187, 34, 2323, 4]
1


## Configure sequences for an RNN model
    - Remove words with low frequency
    - Truncate / complete sequences to the same length

In [12]:
# Median length (in tokens) of the test reviews
print('Median length: ', np.median([len(x) for x in X_test_full]))

Median length:  208.0
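
The median alone does not show how many reviews a given cut-off would truncate. A possible complement, sketched here on toy data, is to report an upper percentile and the share of sequences longer than the cut-off. `length_summary` and its `cutoff` argument are hypothetical names introduced for this example.

```python
import numpy as np

def length_summary(sequences, cutoff):
    """Summarize sequence lengths and the share that exceeds cutoff."""
    lengths = np.array([len(s) for s in sequences])
    return {
        "median": float(np.median(lengths)),
        "p90": float(np.percentile(lengths, 90)),
        "share_longer": float(np.mean(lengths > cutoff)),
    }

# Toy sequences with lengths 2, 4, 6, 8 and 10 (hypothetical data).
toy_sequences = [[1] * n for n in (2, 4, 6, 8, 10)]
summary = length_summary(toy_sequences, cutoff=6)

print(summary)
```

With a cut-off close to the median, roughly half of the reviews would be expected to lose some tokens.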


In [13]:
max_features = 50000  # Vocabulary cut-off: word ids >= max_features (the rarer words) are recoded to 0
maxlen = 200  # Sequence length: longer texts are cut to maxlen tokens, shorter ones are padded

In [14]:
#Select the most frequent max_features, recode others using 0
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])

[4761, 689, 204, 1054, 5663, 1145, 70, 38, 2502, 2032, 3, 2, 1, 32, 21897, 31, 3, 23, 19, 248, 2984, 2426, 507, 55, 2, 26539, 2521, 12645, 141, 6, 220, 338, 0, 5781, 49, 38, 4674, 980, 8, 328, 452, 38, 1, 11544, 14, 67, 20, 2408, 22, 32, 21, 9229, 5793, 4358, 12317, 4, 31, 15, 167, 85, 359, 7, 10212, 3, 5, 168, 24137, 12870, 2426, 1570, 35, 6, 11922, 6, 2751, 28, 107, 1042, 250, 8, 2805, 977, 23, 32, 16165, 31, 5, 32, 2215, 702, 31, 27, 3, 29, 62, 114, 19, 26, 20, 9974, 45, 2, 14732, 12, 13, 10, 11, 12, 13, 10, 11, 21, 26, 532, 23, 62, 1544, 656, 924, 28, 944, 6, 0, 329, 7, 2, 1225, 656, 924, 7, 4282, 18, 32, 0, 31, 5, 32, 5952, 31, 27, 3, 29, 41760, 403, 1379, 24, 116, 110, 11803, 605, 4, 1358, 2, 245, 1099, 8, 2, 2521, 12645, 209, 1357, 70, 65, 108, 4, 5663, 88, 6, 356, 313, 5, 299, 6, 10450, 24, 1150, 3735, 17308, 28, 15, 490, 2, 0, 4375, 7, 2, 245, 33, 317, 132, 7, 2, 1007, 27, 17, 21420, 2, 133, 1089, 54, 887, 62, 2086, 1985, 1182, 8, 2, 4161, 4, 135, 18, 62, 1457, 123, 969, 73, 2
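
The effect of the cut-off is easier to follow on a tiny example. In this sketch the threshold is passed as an argument instead of being read from a global, which is a small deviation from the cell above; the ids are invented.

```python
def remove_features(x, max_features):
    # Ids at or above the threshold belong to rare words and become 0.
    return [[0 if w >= max_features else w for w in sen] for sen in x]

# Toy id sequences (hypothetical): with a threshold of 10, ids 10 and 12 are dropped.
toy = [[2, 7, 1, 12], [5, 9, 10]]
trimmed = remove_features(toy, max_features=10)

print(trimmed)  # [[2, 7, 1, 0], [5, 9, 0]]
```

Since word ids start at 2, a threshold of `max_features` keeps ids 2 to `max_features - 1`, plus 1 for unknown words. The value 0 is also what fills the padded positions in the next step, so rare words and padding share the same id.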

In [15]:
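# Note: tensorflow.contrib was removed in TensorFlow 2.x; there the usual
# replacement is `from tensorflow.keras import preprocessing`.
from tensorflow.contrib.keras import preprocessing

# Truncate or pad the sentences to length = maxlen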
print("Pad sequences (samples x time)")

X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])

Pad sequences (samples x time)
X_train shape: (25000, 200)
X_test shape: (25000, 200)
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0    15   448     5   234    19
    25   268   419   138   129 34510     8    44     6   184   391     7
  2009     4    15   254   942    17    15    20  6115     8    84    16
   103    49    67    15   697     7 14439 17665    40    20    77   517
     8    60   252     4    15    20   394     4 17665   275     2   123
     7  3082 16416    65   108     3     5  1829 10670   275   996  7840
    23   163 14216     4    21  1944     7     6    63    25     9    17
    16    71  4345    23   293  1442     4    61    42    89   635    17
     4    21   446   843    28    72    20  3048    58    27    20  3097
    44  2209   347     2   110   385  
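
The leading zeros in the output reflect the defaults of `pad_sequences`: zeros are added at the start of short sequences, and long sequences keep only their last `maxlen` tokens. The following NumPy sketch imitates that default behavior to make it explicit. `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with value; keep the last maxlen items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]          # drop items from the front if too long
        out[i, -len(trunc):] = trunc   # right-align, padding sits on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)

print(padded)
```

Keeping the end of a long review is a design choice; whether the start or the end carries more sentiment is something that would need to be checked on the data.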

In [16]:
# Shuffle data
from sklearn.utils import shuffle

X_train, y_train = shuffle(X_train, y_train, random_state=0)
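
Shuffling matters here because the training set was built by concatenating all positive reviews followed by all negative ones. What the call must guarantee is that rows and labels are permuted together. A minimal NumPy sketch of that idea on toy arrays (an illustration of the principle, not of how scikit-learn implements it):

```python
import numpy as np

# Toy data (hypothetical): row i starts with 2*i and its label is i.
X_toy = np.arange(10).reshape(5, 2)
y_toy = np.array([0, 1, 2, 3, 4])

rng = np.random.RandomState(0)       # fixed seed for a reproducible order
perm = rng.permutation(len(X_toy))   # one permutation shared by X and y
X_shuf, y_shuf = X_toy[perm], y_toy[perm]

# Every row still sits next to its own label.
print(bool(np.all(X_shuf[:, 0] == 2 * y_shuf)))
```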

In [17]:
# Export train and test data
np.save(data_path + 'X_train', X_train)
np.save(data_path + 'y_train', y_train)
np.save(data_path + 'X_test',  X_test)
np.save(data_path + 'y_test',  y_test)


In [18]:
# Export worddict
import pickle

with open(data_path + 'worddict.pickle', 'wb') as pfile:
    pickle.dump(worddict, pfile)
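
To read the exported files back in a later notebook, keep in mind that `np.save` appends the `.npy` extension when the file name lacks it. The sketch below shows a full round trip in a temporary folder; the file names `X_demo` and `worddict_demo.pickle` and the tiny arrays are placeholders, not the files written above.

```python
import os
import pickle
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as out_dir:
    prefix = out_dir + os.sep

    # Write: np.save adds ".npy" to the name, pickle needs binary mode.
    np.save(prefix + "X_demo", np.array([[0, 2, 3], [4, 5, 1]]))
    with open(prefix + "worddict_demo.pickle", "wb") as pfile:
        pickle.dump({"the": 2}, pfile)

    # Read back: note the explicit ".npy" extension.
    X_back = np.load(prefix + "X_demo.npy")
    with open(prefix + "worddict_demo.pickle", "rb") as pfile:
        dict_back = pickle.load(pfile)

print(X_back.shape, dict_back)
```

The saved dictionary is what allows new text to be recoded with the same ids the model was trained on.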
