# Overview

The Keras deep learning library provides basic tools to help you prepare your text data. Its Tokenizer API can be fit on training data and then used to encode training, validation, and test documents.

## Split Words with text_to_word_sequence

The text_to_word_sequence() function automatically does 3 things:
    - Splits words on spaces
    - Filters out punctuation
    - Converts text to lowercase

In [1]:
from keras.preprocessing.text import text_to_word_sequence

text = "The quick brown fox jumped over the lazy dog."
result = text_to_word_sequence(text)
print(result)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
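
The three steps can be approximated in plain Python. This is a minimal sketch rather than the Keras implementation: simple_word_sequence is a hypothetical helper, and it strips every character in string.punctuation, whereas the default Keras filter list differs slightly (it keeps apostrophes, for example).

```python
import string

def simple_word_sequence(text):
    """Hypothetical stand-in for text_to_word_sequence (simplified)."""
    # Map every punctuation character to a space so it cannot glue words together
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # Lowercase, drop punctuation, then split on whitespace
    return text.lower().translate(table).split()

text = "The quick brown fox jumped over the lazy dog."
result = simple_word_sequence(text)
print(result)
```

For this sentence the sketch yields the same word list as the Keras call above.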


## Integer Encoding with one_hot

Keras provides the one_hot() function, which tokenizes and integer encodes a text document in one step. The name suggests that it creates a one-hot encoding of the document, but it does not. Instead, the function is a wrapper for the hashing_trick() function and returns an integer-encoded version of the document. Because it uses a hash function, collisions are possible and not every word is guaranteed a unique integer value. Like the text_to_word_sequence() function in the previous section, one_hot() converts the text to lowercase, filters out punctuation, and splits words on white space.

**In addition to the text, the vocabulary size (total words) must be specified. The size can be larger than the existing vocabulary if additional documents will be processed.** The size of the vocabulary defines the hashing space into which words are hashed.

In [3]:
from keras.preprocessing.text import one_hot, text_to_word_sequence

text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)

print(vocab_size)
# integer encode the document
result = one_hot(text, round(vocab_size*1.3)) 
print(result)

8
[1, 7, 9, 4, 3, 2, 1, 7, 3]
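
The output above already contains collisions, even though the hashing space (10) is larger than the 8 distinct words. The sketch below simply pairs the word sequence from the first section with the integers printed above and groups words that share a code. It works only with the values shown in this notebook. Because the default hash function is Python's hash(), which is salted per process for strings, a fresh run may print different integers.

```python
from collections import defaultdict

# Word sequence and integer codes copied from the outputs shown above
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = [1, 7, 9, 4, 3, 2, 1, 7, 3]

# Group the distinct words that were assigned to each integer
buckets = defaultdict(set)
for word, code in zip(words, codes):
    buckets[code].add(word)

# A collision is any integer shared by more than one distinct word
collisions = {code: sorted(ws) for code, ws in buckets.items() if len(ws) > 1}
print(collisions)
# {7: ['lazy', 'quick'], 3: ['dog', 'jumped']}
```

Enlarging the hashing space makes collisions less likely but does not rule them out.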


## Hash Encoding with hashing_trick

Keras provides the hashing_trick() function, which tokenizes and then integer encodes the document, just like the one_hot() function. It provides more flexibility, allowing you to specify the hash function as either hash (the default), the built-in md5 function, or a function of your own.

In [5]:
from keras.preprocessing.text import hashing_trick, text_to_word_sequence 

text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)

print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
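
To make the mechanism concrete, here is a minimal sketch of an md5-based hashing trick in plain Python. It is not the Keras source: md5_hash_encode is a hypothetical helper, the tokenization is simplified, and the mapping into the range 1 to n-1 (leaving 0 unused) is an assumption about how the hash is folded into the hashing space, so the integers it prints may not match the output above.

```python
import hashlib
import string

def md5_hash_encode(text, n):
    """Hypothetical sketch of an md5-based hashing trick (not the Keras source)."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    words = text.lower().translate(table).split()
    codes = []
    for word in words:
        # md5 yields the same digest on every run, unlike Python's salted hash()
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # Assumed mapping: fold the digest into 1..n-1 so that 0 stays unused
        codes.append(digest % (n - 1) + 1)
    return codes

codes = md5_hash_encode("The quick brown fox jumped over the lazy dog.", 10)
print(codes)
```

A stable hash such as md5 is useful when encodings must be reproducible across runs or machines, for example when a saved model is reloaded later.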


## Tokenizer API

Keras provides a more sophisticated API for preparing text: the Tokenizer class, which can be fit once and then reused to prepare multiple text documents for deep learning. This may be the preferred approach for large projects.

Once fit, the Tokenizer provides 4 attributes that you can use to query what has been
learned about your documents:
    - word_counts: A dictionary mapping each word to its occurrence count when the
      Tokenizer was fit.
    - word_docs: A dictionary mapping each word to the number of documents in which
      it appears.
    - word_index: A dictionary of words and their uniquely assigned integers.
    - document_count: An integer count of the total number of documents that were
      used to fit the Tokenizer.

In [6]:
from keras.preprocessing.text import Tokenizer

# define 5 documents
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

In [11]:
# summarize what was learned
display(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('well', 1),
             ('done', 1),
             ('good', 1),
             ('work', 2),
             ('great', 1),
             ('effort', 1),
             ('nice', 1),
             ('excellent', 1)])

5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
{'well': 1, 'done': 1, 'good': 1, 'work': 2, 'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1}
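
The four attributes can be reproduced for these five documents with a few lines of plain Python. This is a sketch under simplifying assumptions, not the Tokenizer's actual code: tokenize is a hypothetical helper, and the ranking rule (descending count, ties kept in first-seen order) is inferred from the word_index output above.

```python
from collections import Counter
import string

def tokenize(text):
    """Hypothetical simplified tokenizer: lowercase, drop punctuation, split."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
document_count = 0       # number of documents seen during fitting

for doc in docs:
    tokens = tokenize(doc)
    document_count += 1
    word_counts.update(tokens)
    word_docs.update(set(tokens))  # count each word once per document

# Rank words by descending count; the stable sort keeps first-seen order for ties
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(document_count)
print(word_index)
```

Numbering starts at 1, which leaves index 0 free; 'work' receives index 1 because it is the only word that occurs twice.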


Once the Tokenizer has been fit on training data, it can be used to encode documents in the train or test datasets. The texts_to_matrix() function on the Tokenizer creates one vector per input document. The length of each vector is the size of the vocabulary plus one, because index 0 is reserved and never assigned to a word. This function provides a suite of standard bag-of-words text encoding schemes, selected via its mode argument. The available modes include:
    - binary: Whether or not each word is present in the document. This is the default.
    - count: The count of each word in the document.
    - tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
    - freq: The frequency of each word as a ratio of words within each document.

In [12]:
# encode documents as bag-of-words count vectors
encoded_docs = t.texts_to_matrix(docs, mode='count')
display(encoded_docs)

array([[0., 0., 1., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.]])
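
Each row above has 9 columns for an 8-word vocabulary: column 0 is never used, and column i holds the value for the word whose word_index is i. The sketch below rebuilds the binary, count, and freq modes in plain Python using the word_index printed earlier. The helpers tokenize and to_matrix are hypothetical, and tfidf is left out because its exact weighting is not described here.

```python
import string

def tokenize(text):
    """Hypothetical simplified tokenizer: lowercase, drop punctuation, split."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

# word_index copied from the Tokenizer output shown earlier
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

def to_matrix(docs, word_index, mode="count"):
    """Hypothetical bag-of-words encoder covering binary, count, and freq."""
    if mode not in ("binary", "count", "freq"):
        raise ValueError(f"unsupported mode: {mode}")
    width = len(word_index) + 1  # column 0 is never assigned to a word
    matrix = []
    for doc in docs:
        # Words outside the fitted vocabulary are ignored
        tokens = [t for t in tokenize(doc) if t in word_index]
        row = [0.0] * width
        for token in tokens:
            row[word_index[token]] += 1.0
        if mode == "binary":
            row = [1.0 if v > 0 else 0.0 for v in row]
        elif mode == "freq" and tokens:
            row = [v / len(tokens) for v in row]
        matrix.append(row)
    return matrix

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']
count_rows = to_matrix(docs, word_index, mode="count")
freq_rows = to_matrix(docs, word_index, mode="freq")
print(count_rows[0])  # [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(freq_rows[0])   # [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
```

For these documents binary and count give identical rows, since no word repeats within a single document.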