# Text processing
This notebook is adapted from the following website:
* https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/

In [1]:
import keras

Using TensorFlow backend.


In [2]:
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import hashing_trick

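Split a sentence into a list of words. By default, text_to_word_sequence lowercases the text, filters out punctuation, and splits on whitespace.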

In [3]:
sentence = 'The quick brown fox jumps over the lazy dog'
words = text_to_word_sequence(sentence)
words

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
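
For intuition, the default behaviour of `text_to_word_sequence` can be approximated in plain Python. This is only a sketch: the helper name `simple_word_sequence` is hypothetical, and using `string.punctuation` as the filter set is an assumption (the Keras default filter list is slightly different, for example it keeps apostrophes).

```python
import string

def simple_word_sequence(text):
    """Rough stand-in for text_to_word_sequence (illustrative, not the Keras code)."""
    # Map every punctuation character to a space so it acts as a separator.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # Lowercase, replace punctuation, then split on runs of whitespace.
    return text.lower().translate(table).split()

sentence = 'The quick brown fox jumps over the lazy dog'
words = simple_word_sequence(sentence)
print(words)
```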

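A more convenient representation of a document is a sequence of integers in which each word is represented by an integer. Despite its name, one_hot does not return one-hot vectors: it hashes each word into a fixed range of integers, so two different words can collide and share the same value.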

In [4]:
# get unique words
words = set(words)
vocab_size = len(words)

In [5]:
# use a hashing space about 1.3 times the vocabulary size to reduce hash collisions
one_hot(sentence, round(vocab_size * 1.3))

[2, 7, 3, 1, 1, 6, 2, 8, 4]

Choose the md5 hash function because it gives the same result on every run, unlike the default (Python's built-in hash), which is not stable across runs.

In [6]:
hashing_trick(sentence, round(vocab_size * 1.3), hash_function='md5')

[6, 4, 1, 2, 7, 5, 6, 2, 6]
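
The idea behind a stable hashing trick can be sketched with the standard library alone. The function name `md5_hashing_trick` is hypothetical, and the reduction step (digest modulo `n - 1`, plus 1, so that 0 is never produced) is an assumption about how such an encoder is typically written; the sketch is not guaranteed to reproduce the Keras output above.

```python
import hashlib

def md5_hashing_trick(words, n):
    """Map each word to an integer in [1, n - 1] using a stable md5 hash."""
    indices = []
    for w in words:
        digest = hashlib.md5(w.encode("utf-8")).hexdigest()
        # Interpret the hex digest as a big integer, then fold it into 1..n-1.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

words = 'the quick brown fox jumps over the lazy dog'.split()
codes = md5_hashing_trick(words, 10)
print(codes)
```

Because md5 depends only on the bytes of the word, repeated words always receive the same code, and rerunning the cell in a fresh process gives the same list.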

## Tokenizer API

The Tokenizer API is better suited to working with multiple text documents: it is fitted once on a set of documents and can then be reused to encode them.

In [7]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

In [8]:
# number of times each word occurred across all the documents
t.word_counts

OrderedDict([('well', 1),
             ('done', 1),
             ('good', 1),
             ('work', 2),
             ('great', 1),
             ('effort', 1),
             ('nice', 1),
             ('excellent', 1)])

In [9]:
# number of documents in which each word occurs
t.word_docs

{'done': 1,
 'effort': 1,
 'excellent': 1,
 'good': 1,
 'great': 1,
 'nice': 1,
 'well': 1,
 'work': 2}
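
In this corpus the two dictionaries happen to agree because no word repeats within a single document. The difference shows up as soon as a word is repeated, which the following plain-Python sketch illustrates. The documents in `example_docs` are hypothetical, and whitespace splitting stands in for the real tokenization.

```python
from collections import Counter

# Hypothetical documents in which 'good' is repeated within one document
example_docs = ['good good work', 'nice work']

word_counts = Counter()  # total occurrences across all documents
word_docs = Counter()    # number of documents containing the word
for doc in example_docs:
    tokens = doc.lower().split()
    word_counts.update(tokens)
    word_docs.update(set(tokens))  # count each word at most once per document

print(dict(word_counts))
print(dict(word_docs))
```

Here 'good' is counted twice in `word_counts` but only once in `word_docs`, while 'work' is counted twice in both.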

In [10]:
# display the number of documents
t.document_count

5

In [11]:
# mapping of word to the index
t.word_index

{'done': 3,
 'effort': 6,
 'excellent': 8,
 'good': 4,
 'great': 5,
 'nice': 7,
 'well': 2,
 'work': 1}
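
The index above is consistent with ranking words by frequency: the most frequent word gets index 1, and 0 is left unassigned. A minimal sketch of that ranking, assuming a simplified regex tokenization and that ties keep first-seen order:

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

counts = Counter()
for doc in docs:
    # Simplified tokenization (assumption): lowercase alphabetic runs only
    counts.update(re.findall(r"[a-z]+", doc.lower()))

# Sort by descending count; the stable sort keeps first-seen order for ties
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
# Indices start at 1, so 0 is never assigned to a word
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
print(word_index)
```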

In [13]:
# mode = binary - whether each word occurs in the document
# mode = count  - the number of times the word occurs in the document
# mode = tfidf  - the Term Frequency - Inverse Document Frequency scoring
# mode = freq   - the frequency of each word as a ratio of words in the document
t.texts_to_matrix(docs, mode='binary')

array([[ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])
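
Each row of the matrix is one document and each column is a word index, with column 0 unused. The binary mode can be rebuilt by hand from the word index shown earlier; this is a sketch that hard-codes that index and uses a simplified tokenization, not the library's implementation.

```python
# Word index copied from the Tokenizer output above
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

num_columns = len(word_index) + 1  # column 0 stays unused
matrix = []
for doc in docs:
    row = [0.0] * num_columns
    # Simplified tokenization (assumption): lowercase and drop '!'
    for token in doc.lower().replace('!', ' ').split():
        row[word_index[token]] = 1.0  # mark the word as present
    matrix.append(row)

for row in matrix:
    print(row)
```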