In [1]:
import tensorflow as tf

# Corpus vocabulary
----

A text corpus's unique words make up its **vocabulary**. This vocabulary forms the foundation for all NLP text processing. 

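A vocabulary can be built at one of two granularities:

- **Character-based**: each unique character is a vocabulary entry.
- **Word-based**: each unique word is a vocabulary entry.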
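
As a rough illustration of the two granularities, the sketch below builds both kinds of vocabulary from a toy corpus using only the standard library. The corpus and variable names are invented for this example, and the lowercasing and whitespace splitting are simplifying assumptions.

```python
# Toy corpus, invented for illustration
corpus = ["I love machine learning", "Deep learning is amazing"]

# Word-based vocabulary: the unique lowercased words
word_vocab = sorted({word for text in corpus for word in text.lower().split()})

# Character-based vocabulary: the unique lowercased characters, ignoring spaces
char_vocab = sorted({ch for text in corpus for ch in text.lower() if ch != " "})

print(word_vocab)
print(char_vocab)
```

A character vocabulary stays small no matter how large the corpus grows, while a word vocabulary grows with every new word.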

# Tokenization
-----
The vocabulary's main use is that it lets us represent each piece of text by the specific words that appear in it.

Rather than one long string, a piece of text becomes a list (vector) of its vocabulary words. This process is called **tokenization**, and each individual word in the text is a **token**.

## Tokenizer Object

`fit_on_texts` updates the internal vocabulary of the `Tokenizer` based on the word frequency in the input texts.

- It creates a word index: a dictionary mapping words to integers, where more frequent words get lower integers.
- It does **not** transform the texts into sequences; that is done by `texts_to_sequences`.

In [None]:
# Tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# texts
texts = [
    "I love machine learning",
    "Deep learning is amazing",
    "I love deep learning"
]

# Build the word index based on the frequency of words in the input texts
tokenizer.fit_on_texts(texts)

# Print the dictionary that maps words to their integer index
print(tokenizer.word_index)

{'learning': 1, 'i': 2, 'love': 3, 'deep': 4, 'machine': 5, 'is': 6, 'amazing': 7}


In [None]:
# Convert each text in the list into a sequence of integers based on the word index
print(tokenizer.texts_to_sequences(texts))

[[2, 3, 5, 1], [4, 1, 6, 7], [2, 3, 4, 1]]
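
To make the mapping concrete, here is a pure-Python sketch of roughly what `fit_on_texts` and `texts_to_sequences` compute for this corpus. It is not the Keras implementation: it assumes lowercasing, whitespace splitting, and that ties in frequency keep first-seen order, which is consistent with the output above.

```python
from collections import Counter

texts = [
    "I love machine learning",
    "Deep learning is amazing",
    "I love deep learning",
]

# Count how often each lowercased word appears across all texts
counts = Counter(word for text in texts for word in text.lower().split())

# Rank words by descending frequency; the stable sort keeps first-seen order on ties.
# Indices start at 1 because index 0 is never assigned to a word.
ranked = sorted(counts, key=lambda w: -counts[w])
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# Replace each word with its integer index
sequences = [[word_index[w] for w in text.lower().split()] for text in texts]

print(word_index)
print(sequences)
```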


## Tokenizer parameters
The `Tokenizer` object can be initialized with a number of optional parameters. By default, it lowercases the text, filters out most punctuation, and splits on whitespace.
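
The default preprocessing can be approximated in plain Python. The sketch below is an assumption-laden stand-in, not the library code: `simple_tokenize` is a hypothetical helper, and `DEFAULT_FILTERS` is assumed to match the default value of the Tokenizer's `filters` parameter.

```python
# Assumed copy of the Tokenizer's default `filters` string
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=DEFAULT_FILTERS, lower=True):
    """Hypothetical helper: lowercase, turn filtered characters into spaces, split."""
    if lower:
        text = text.lower()
    # Map every filtered character to a space so it acts as a separator
    text = text.translate(str.maketrans({ch: " " for ch in filters}))
    return text.split()

print(simple_tokenize("Fred ate apples!"))
print(simple_tokenize("Bob ate apples, and pears"))
print(simple_tokenize("Don't stop"))
```

Note that the apostrophe is not in the filter string, so a contraction such as `don't` stays a single token.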

### Out-Of-Vocabulary (OOV)
When a new text contains words not in the corpus vocabulary, those words are known as out-of-vocabulary (OOV).

By default, `texts_to_sequences` silently drops all OOV words. If we instead want to represent each OOV word with a special vocabulary token (e.g. `OOV`), we can initialize the `Tokenizer` with the `oov_token` parameter. The OOV token is then assigned index `1`, and every other word's index shifts up by one.

In [None]:
# Create a Tokenizer and specify a special token for out-of-vocabulary (OOV) words
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='OOV')

# texts
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)

# Convert a new sentence into a sequence of integers
# Words not seen in the corpus ('bacon', 'orange') are mapped to the OOV token (which is `1`)
print(tokenizer.texts_to_sequences(['bob ate bacon orange']))

[[4, 2, 1, 1]]


In [3]:
# Show the word index dictionary
# Includes all words from the corpus and the 'OOV' token
print(tokenizer.word_index)

{'OOV': 1, 'ate': 2, 'apples': 3, 'bob': 4, 'and': 5, 'pears': 6, 'fred': 7}
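
The lookup logic behind this behavior can be sketched without TensorFlow. The helper name `to_sequence_with_oov` is hypothetical, and the word index is copied from the output above.

```python
# Word index copied from the output above; 'OOV' holds index 1
word_index = {'OOV': 1, 'ate': 2, 'apples': 3, 'bob': 4, 'and': 5, 'pears': 6, 'fred': 7}

def to_sequence_with_oov(text, word_index, oov_token=None):
    """Hypothetical helper: map words to indices, handling unseen words."""
    sequence = []
    for word in text.lower().split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            # Unseen word: substitute the OOV token's index
            sequence.append(word_index[oov_token])
        # With no OOV token, unseen words are silently dropped
    return sequence

# With an OOV token, 'bacon' and 'orange' both map to index 1
print(to_sequence_with_oov('bob ate bacon orange', word_index, oov_token='OOV'))

# Without one, they disappear and the sequence gets shorter
print(to_sequence_with_oov('bob ate bacon orange', word_index))
```

Keeping a placeholder for unseen words preserves sequence length and word positions, which matters for models that depend on word order.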


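### num_words
The `num_words` parameter lets us cap the number of vocabulary words used. Only the `num_words - 1` most frequent words are kept, because index 0 is reserved and only words with an index below `num_words` are used. For example, if we set `num_words=100` when initializing the `Tokenizer`, `texts_to_sequences` will use only the 99 most frequent words (indices 1 through 99) and filter out the remaining vocabulary words.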

*This can be useful when the text corpus is large and you need to limit the vocabulary size to increase training speed or prevent overfitting on infrequent words.*

In [6]:
# Create a tokenizer that only keeps the top (num_words - 1) = 1 most frequent word
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=2)

# Fit tokenizer on the text corpus to build the word index
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)

# Convert new sentence to sequences
# Only 'ate' is kept because it has index 1
# ('apples' is equally frequent but was seen later, so it gets index 2)
# All other words are filtered out since their indices are >= 2
print(tokenizer.texts_to_sequences(['bob ate pears apples']))  # Output: [[1]]

[[1]]
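
The index cutoff can be sketched in plain Python. The helper `to_sequence_limited` is hypothetical, and the word index is inferred from the earlier OOV output with the `OOV` entry removed.

```python
# Full word index for the corpus above (inferred; no OOV token)
word_index = {'ate': 1, 'apples': 2, 'bob': 3, 'and': 4, 'pears': 5, 'fred': 6}

def to_sequence_limited(text, word_index, num_words=None):
    """Hypothetical helper: keep only words whose index is below num_words."""
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None:
            continue  # out-of-vocabulary word
        if num_words is not None and index >= num_words:
            continue  # word is too infrequent to keep
        sequence.append(index)
    return sequence

# num_words=2 keeps only index 1 ('ate')
print(to_sequence_limited('bob ate pears apples', word_index, num_words=2))

# num_words=3 keeps indices 1 and 2 ('ate' and 'apples')
print(to_sequence_limited('bob ate pears apples', word_index, num_words=3))
```

In this sketch, as in the `Tokenizer`, the word index itself is not truncated. The limit is applied only when texts are converted to sequences, so `tokenizer.word_index` still lists every word.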
