## Tokenizer

In [14]:
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

import tensorflow as tf

In [15]:
train_sentences = [
    "I love my rabbit",
    "I love my dog",
    "They live happily",
    "Do you think my dog is cute?"
]

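**OOV token**: a placeholder token (here `<OOV>`) whose index replaces any word that does not exist in the vocab list. When `oov_token` is set, it takes index 1.

**num_words**: Maximum number of words to keep in the **vocab list**, based on word frequency. Only the most common ***num_words-1*** words are kept, and the OOV token at index 1 counts toward that limit.

Note: word indices are assigned by frequency, so the most frequent word gets the lowest index.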
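To make the frequency ordering concrete, here is a minimal pure-Python sketch of how such a word index could be built. It is an illustration, not the Keras implementation: `build_word_index` is a hypothetical helper, and stripping `string.punctuation` is an assumption that is slightly broader than the Tokenizer's default filter.

```python
from collections import Counter
import string


def build_word_index(sentences, oov_token="<OOV>"):
    """Hypothetical helper: map words to indices by descending frequency."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().translate(strip_punct).split())

    # Index 1 is reserved for the OOV token; index 0 is left free for padding.
    word_index = {oov_token: 1}
    # most_common() sorts by count; words with equal counts keep first-seen order.
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


sentences = [
    "I love my rabbit",
    "I love my dog",
    "They live happily",
    "Do you think my dog is cute?",
]
word_index = build_word_index(sentences)
print(word_index)
```

Under these assumptions, `my` (3 occurrences) gets index 2, followed by `i`, `love`, and `dog` (2 occurrences each) in the order they were first seen.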

In [18]:
# word_index still records every word seen during fitting.
# When encoding, only words with index <= num_words - 1 are kept; any other word is encoded as OOV (index 1).

tokenizer = Tokenizer(num_words=3, oov_token="<OOV>") # >>> encoding keeps only index 1 (<OOV>) and index 2
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
word_count = tokenizer.word_counts
print(word_index)
print(word_count)

{'<OOV>': 1, 'my': 2, 'i': 3, 'love': 4, 'dog': 5, 'rabbit': 6, 'they': 7, 'live': 8, 'happily': 9, 'do': 10, 'you': 11, 'think': 12, 'is': 13, 'cute': 14}
OrderedDict([('i', 2), ('love', 2), ('my', 3), ('rabbit', 1), ('dog', 2), ('they', 1), ('live', 1), ('happily', 1), ('do', 1), ('you', 1), ('think', 1), ('is', 1), ('cute', 1)])


In [19]:
test_sentences = [
    "I love my lovely dog",
    "I love my adorable cat"
]

encoded_test_seq = tokenizer.texts_to_sequences(test_sentences)
print(encoded_test_seq)

[[1, 1, 2, 1, 1], [1, 1, 2, 1, 1]]


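> Because `num_words` is 3, only index 1 (`<OOV>`) and index 2 (`my`) are used during encoding, even though `word_index` still lists every training word. Every other word, whether it is less frequent in the training data (`i`, `love`, `dog`) or never seen at all (`lovely`, `adorable`, `cat`), is encoded as `<OOV>`.  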
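The effect of the `num_words` cap can be sketched in plain Python. This is a simplified illustration, not the Keras code path: `encode` is a hypothetical helper, it splits on whitespace only, and the word index is copied from the `fit_on_texts` output shown earlier.

```python
def encode(sentence, word_index, num_words=None, oov_index=1):
    """Hypothetical helper: encode one sentence, capping indices at num_words."""
    encoded = []
    for word in sentence.lower().split():
        index = word_index.get(word)
        # Unknown words, and known words at or beyond the cap, become OOV.
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        encoded.append(index)
    return encoded


# Word index copied from the fit_on_texts output.
word_index = {
    "<OOV>": 1, "my": 2, "i": 3, "love": 4, "dog": 5, "rabbit": 6, "they": 7,
    "live": 8, "happily": 9, "do": 10, "you": 11, "think": 12, "is": 13, "cute": 14,
}

capped = encode("I love my lovely dog", word_index, num_words=3)
uncapped = encode("I love my lovely dog", word_index)
print(capped)    # [1, 1, 2, 1, 1]
print(uncapped)  # [3, 4, 2, 1, 5]
```

Without the cap, only the unseen word `lovely` maps to index 1; with `num_words=3`, every word except `my` does.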

> Limiting `num_words` is useful on large datasets: a smaller vocabulary reduces the load on the model.
It also helps to preprocess the text first, for example with lemmatization/stemming and stopword removal.

## Padding and Truncating


Padding makes every encoded sequence the same length. <br>
Truncating cuts sequences that are longer than a predetermined length.

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [7]:
encoded_train_seq = tokenizer.texts_to_sequences(train_sentences)

padded = pad_sequences(
    encoded_train_seq, 
    padding='post', # default: 'pre'
    truncating='post',
    maxlen=5)

padded

array([[1, 1, 2, 1, 0],
       [1, 1, 2, 1, 0],
       [1, 1, 1, 0, 0],
       [1, 1, 1, 2, 1]], dtype=int32)

Padding **'post'**: pad with `0` after the sequence. <br>
Truncating **'post'**: for sequences longer than `maxlen`, drop values from the end.

> Explanation: maxlen=5, **padding='post'**, ***truncating='post'*** <br>
train_sentences = [ <br>
    "I love my rabbit", -> 4 words: [1, 1, 2, 1, **0**] <br>
    "I love my dog", -> 4 words: [1, 1, 2, 1, **0**] <br>
    "They live happily", -> 3 words: [1, 1, 1, **0, 0**] <br>
    "Do you think my dog is cute?" -> 7 words: [1, 1, 1, 2, 1 | ***1***, ***1***] (the last two values are dropped) <br>
]
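The padding and truncating options can be sketched in a few lines of plain Python. This is a simplified stand-in for `pad_sequences`, not its implementation: `pad` is a hypothetical helper that works on lists, returns lists rather than a NumPy array, and assumes `maxlen` is at least 1.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper mirroring the padding/truncating options."""
    padded = []
    for seq in sequences:
        # Truncate: 'post' drops values from the end, 'pre' from the start.
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
        fill = [value] * (maxlen - len(seq))
        # Pad: 'post' appends the fill, 'pre' prepends it.
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


encoded = [[1, 1, 2, 1], [1, 1, 2, 1], [1, 1, 1], [1, 1, 1, 2, 1, 1, 1]]

post = pad(encoded, maxlen=5, padding="post", truncating="post")
pre = pad(encoded, maxlen=5)  # 'pre' for both, like the pad_sequences defaults

print(post)  # [[1, 1, 2, 1, 0], [1, 1, 2, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 2, 1]]
print(pre)   # [[0, 1, 1, 2, 1], [0, 1, 1, 2, 1], [0, 0, 1, 1, 1], [1, 2, 1, 1, 1]]
```

Comparing the last rows shows the difference: `'post'` truncation keeps the first five values of the 7-word sentence, while `'pre'` keeps the last five.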

## Full Code

In [8]:
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import tensorflow as tf

train_sentences = [
    "I love my rabbit",
    "My rabbit is beautiful as always",
    "And the dog is very lovely",
    "Having them in your side is such a blessing ."
]

test_sentences = [
    "Where did you find your dog ?",
    "Why do you love them ?",
    "Your dog is indeed very lovely!"
]

tokenizer = Tokenizer(
    num_words=100, 
    filters=".", # only '.' is stripped; other punctuation such as '?' and '!' stays in the tokens
    char_level=False, # tokenize by word, not by character
    oov_token='<OOV>'
)
tokenizer.fit_on_texts(train_sentences)

# Check on training data
encoded_train_seq = tokenizer.texts_to_sequences(train_sentences)
print("encoded_train_seq : ", encoded_train_seq)

# tokenizer.fit_on_sequences(encoded_train_seq)

decoded_train_seq = tokenizer.sequences_to_texts(encoded_train_seq)
print("decoded_train_seq : ", decoded_train_seq)

# APPLY PADDING AND TRUNCATING ON ENCODED SEQ
maxlen = max(len(i) for i in encoded_train_seq) # >>> 9
train_padded_seq = pad_sequences(
    encoded_train_seq,
    maxlen=maxlen,
    padding='post',
    truncating='post',
    value=0
)
print(train_padded_seq)

decoded_train_padded_seq = tokenizer.sequences_to_texts(train_padded_seq)
print("decoded_train_padded_seq : ", decoded_train_padded_seq)

# TESTING
# Do not refit on the test data; the vocabulary comes from the training data only.
# tokenizer.fit_on_texts(test_sentences)
encoded_test_seq = tokenizer.texts_to_sequences(test_sentences)
print("encoded_test_seq : ", encoded_test_seq)

maxlen = max(len(i) for i in encoded_test_seq) # >>> 7; a trained model would expect the training maxlen (9) instead
test_padded_seq = pad_sequences(
    encoded_test_seq,
    maxlen=maxlen,
    padding='post',
    truncating='post',
    value=0
)
print(test_padded_seq)

# Check on testing data
decoded_test_padded_seq = tokenizer.sequences_to_texts(test_padded_seq)
print("decoded_test_padded_seq : ", decoded_test_padded_seq)

encoded_train_seq :  [[5, 6, 3, 4], [3, 4, 2, 7, 8, 9], [10, 11, 12, 2, 13, 14], [15, 16, 17, 18, 19, 2, 20, 21, 22]]
decoded_train_seq :  ['i love my rabbit', 'my rabbit is beautiful as always', 'and the dog is very lovely', 'having them in your side is such a blessing']
[[ 5  6  3  4  0  0  0  0  0]
 [ 3  4  2  7  8  9  0  0  0]
 [10 11 12  2 13 14  0  0  0]
 [15 16 17 18 19  2 20 21 22]]
decoded_train_padded_seq :  ['i love my rabbit <OOV> <OOV> <OOV> <OOV> <OOV>', 'my rabbit is beautiful as always <OOV> <OOV> <OOV>', 'and the dog is very lovely <OOV> <OOV> <OOV>', 'having them in your side is such a blessing']
encoded_test_seq :  [[1, 1, 1, 1, 18, 12, 1], [1, 1, 1, 6, 16, 1], [18, 12, 2, 1, 13, 1]]
[[ 1  1  1  1 18 12  1]
 [ 1  1  1  6 16  1  0]
 [18 12  2  1 13  1  0]]
decoded_test_padded_seq :  ['<OOV> <OOV> <OOV> <OOV> your dog <OOV>', '<OOV> <OOV> <OOV> love them <OOV> <OOV>', 'your dog is <OOV> very <OOV> <OOV>']


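> The tokenizer only recognizes words that are in its vocabulary. Anything else decodes as `<OOV>`, including the padding value 0, which has no entry in the vocabulary.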
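If the padding zeros should not show up as `<OOV>` in decoded text, one option is to drop them before looking up words. The sketch below is plain Python and an assumption about how one might do this, not a Tokenizer feature: `decode` is a hypothetical helper, and the index-to-word mapping is rebuilt from the encoded and decoded training output printed above.

```python
# Encoded/decoded training pairs copied from the printed output.
encoded = [[5, 6, 3, 4], [3, 4, 2, 7, 8, 9], [10, 11, 12, 2, 13, 14],
           [15, 16, 17, 18, 19, 2, 20, 21, 22]]
decoded = ["i love my rabbit", "my rabbit is beautiful as always",
           "and the dog is very lovely",
           "having them in your side is such a blessing"]

# Rebuild the index -> word mapping by pairing each index with its word.
index_word = {1: "<OOV>"}
for seq, text in zip(encoded, decoded):
    index_word.update(zip(seq, text.split()))


def decode(seq, index_word, pad_value=0, oov_token="<OOV>"):
    """Hypothetical helper: decode one sequence, skipping padding values."""
    words = [index_word.get(i, oov_token) for i in seq if i != pad_value]
    return " ".join(words)


print(decode([5, 6, 3, 4, 0, 0, 0, 0, 0], index_word))  # i love my rabbit
print(decode([1, 1, 1, 6, 16, 1, 0], index_word))       # <OOV> <OOV> <OOV> love them <OOV>
```

Real out-of-vocabulary words (index 1) are still shown as `<OOV>`; only the padding is removed.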