### **1. Text to Sequences**
+ In the previous lab, you saw how to build a word_index dictionary that assigns an integer token to each word in your corpus.
+ You can then use that dictionary to convert each input sentence into a sequence of tokens. This is done with the texts_to_sequences() method, as shown below.
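
As a rough illustration of what happens under the hood, the sketch below builds a frequency-ordered word index and converts sentences to token sequences in plain Python. It is not the Keras implementation: the helper names (`split_words`, `build_word_index`, `to_sequences`) are hypothetical, and the punctuation handling only approximates the Tokenizer's default filters.

```python
import string
from collections import Counter

# Hypothetical helpers for illustration only; not part of Keras.
# Punctuation (except the apostrophe) is replaced by a space before splitting.
_PUNCT_TO_SPACE = str.maketrans(
    {ch: " " for ch in string.punctuation if ch != "'"}
)

def split_words(text):
    """Lowercase the text, blank out punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()

def build_word_index(texts, oov_token="<OOV>"):
    """Assign indices by descending frequency; index 1 is reserved for OOV."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    word_index = {oov_token: 1}  # index 0 stays free for padding
    # most_common() sorts by count; ties keep first-seen order
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in split_words(t)] for t in texts]

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you love me',
    'Do you think my dog is amayzing?',
]

word_index = build_word_index(sentences)
sequences = to_sequences(sentences, word_index)
print(word_index)
print(sequences)
```

The most frequent words receive the smallest indices, which is why `love` and `my` end up at 2 and 3.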
### **2. Padding**
+ You will usually need to pad the sequences to a uniform length because that is what your model expects. You can use pad_sequences() for that.
+ By default, it pads every sequence to the length of the longest one, adding zeros at the front. You can override the target length with the **maxlen** argument; sequences longer than maxlen are truncated. Feel free to play with the other arguments shown in class, such as padding and truncating, and compare the results.
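
To make the effect of these arguments concrete, here is a minimal plain-Python sketch of the padding and truncation logic. The function `pad` is a hypothetical stand-in written for this example, not the Keras function; it returns nested lists rather than a NumPy array and assumes the same defaults as pad_sequences (`padding='pre'`, `truncating='pre'`).

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical stand-in for pad_sequences, for illustration only."""
    if maxlen is None:
        # Default target length: the longest sequence in the batch
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the front, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values first, 'post' puts them last
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result

example = [[6, 2, 3, 4], [7, 5, 10, 3, 4, 11, 12]]

default_padded = pad(example)
post_padded = pad(example, padding="post")
pre_truncated = pad(example, maxlen=5)
post_truncated = pad(example, maxlen=5, truncating="post")

print(default_padded)   # [[0, 0, 0, 6, 2, 3, 4], [7, 5, 10, 3, 4, 11, 12]]
print(post_padded)      # [[6, 2, 3, 4, 0, 0, 0], [7, 5, 10, 3, 4, 11, 12]]
print(pre_truncated)    # [[0, 6, 2, 3, 4], [10, 3, 4, 11, 12]]
print(post_truncated)   # [[0, 6, 2, 3, 4], [7, 5, 10, 3, 4]]
```

Note that truncating from the front discards the beginning of a long sentence, so the choice between `'pre'` and `'post'` decides which words the model never sees.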


In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [ 
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you love me',
    'Do you think my dog is amayzing?'
]

In [9]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences, padding='post',
                       truncating='post', maxlen=10)  # maxlen: maximum sentence length

print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'you': 5, 'i': 6, 'do': 7, 'cat': 8, 'me': 9, 'think': 10, 'is': 11, 'amayzing': 12}
[[6, 2, 3, 4], [6, 2, 3, 8], [5, 2, 3, 4], [7, 5, 2, 9], [7, 5, 10, 3, 4, 11, 12]]
[[ 6  2  3  4  0  0  0  0  0  0]
 [ 6  2  3  8  0  0  0  0  0  0]
 [ 5  2  3  4  0  0  0  0  0  0]
 [ 7  5  2  9  0  0  0  0  0  0]
 [ 7  5 10  3  4 11 12  0  0  0]]


### **3. Out-of-vocabulary tokens**
+ Notice that you defined an oov_token when the Tokenizer was initialized earlier.
+ This token is used for input words that are not found in the word_index dictionary. For example, you may collect more text after your initial training and decide not to regenerate the word_index.
+ You will see this in action in the cell below. Notice that token 1 is inserted for words that are not found in the dictionary.

In [6]:
# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Generate the sequences
test_seq = tokenizer.texts_to_sequences(test_data)

# Print the word index dictionary
print("\nWord Index = " , word_index)

# Print the sequences with OOV
print("\nTest Sequence = ", test_seq)

# Print the padded result
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Word Index =  {'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'you': 5, 'i': 6, 'do': 7, 'cat': 8, 'me': 9, 'think': 10, 'is': 11, 'amayzing': 12}

Test Sequence =  [[6, 1, 2, 3, 4], [3, 4, 1, 3, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 6 1 2 3 4]
 [0 0 0 0 0 3 4 1 3 1]]
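
To see which words produced the 1s in the test sequences, the lookup can be approximated with a plain dictionary and a fallback value. This is a sketch under stated assumptions, not the Tokenizer's code: the word index is copied by hand from the printed output, and `encode_with_report` is a hypothetical helper that splits on whitespace only.

```python
# Word index copied from the printed output above
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'you': 5, 'i': 6,
              'do': 7, 'cat': 8, 'me': 9, 'think': 10, 'is': 11, 'amayzing': 12}
OOV_ID = word_index['<OOV>']

def encode_with_report(sentence):
    """Hypothetical helper: return token ids and the words mapped to OOV."""
    words = sentence.lower().split()
    ids = [word_index.get(w, OOV_ID) for w in words]   # unknown words -> 1
    unknown = [w for w in words if w not in word_index]
    return ids, unknown

test_data = [
    'i really love my dog',
    'my dog loves my manatee',
]

results = [encode_with_report(s) for s in test_data]
for sentence, (ids, unknown) in zip(test_data, results):
    print(sentence, '->', ids, '| OOV words:', unknown)
```

The word "loves" maps to the OOV token even though "love" is in the vocabulary, because the lookup matches exact words and does no stemming.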
