<a href="https://colab.research.google.com/github/bogdanlalu/tensorflow/blob/master/01_Tokenizer_oov_sequences_padding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



1. Instantiate a `Tokenizer` object with a limit on the top N words by frequency of appearance (`num_words`) and a custom token for out-of-vocabulary words (`oov_token`). Out-of-vocabulary words are either those that fall below the `num_words` threshold or, when working with new text, words that were not present when the tokenizer was fitted. Special characters listed in the `filters` parameter are ignored.
2. We fit the tokenizer to a list of sentences (the corpus) using the `fit_on_texts` method. The resulting tokenizer contains a `word_index`, a dictionary whose keys are the unique words in the corpus and whose values are the numeric IDs assigned to those words. The `word_index` lists every word regardless of `num_words`; the limit is applied only when encoding.
3. To encode the words as numbers we create a *sequences* object using the *tokenizer's* `texts_to_sequences` method, passing in the *entire corpus* as in step 2. Words that fall below the `num_words` threshold are encoded with the ID assigned to the `oov_token`. Only IDs strictly below `num_words` are kept: with `num_words=8`, the word with ID 8 is already encoded as out of vocabulary.
4. To convert the resulting encoding back to text, use the *tokenizer's* `sequences_to_texts` method and pass in the sequences object created in the previous step. Words that fell below the threshold are replaced with the `oov_token` text.
5. To account for differences in sentence length and create vectors of equal length, we use *padding*. The `pad_sequences` function takes a sequences object and returns a 2D array in which shorter sentences are filled with 0 so that all vectors have the same length. An optional `maxlen` parameter restricts all vectors (sentences) to a maximum length, truncating longer ones. Padding can be added *before* the words of each sentence with `padding='pre'` (think of right-aligning rows of text in Word and filling the blanks with 0), or *after* them with `padding='post'` (think of left-aligning rows of text in Word). The `truncating` parameter likewise controls whether words are dropped from the start (`pre`) or from the end (`post`) of sentences longer than `maxlen`. A given sentence is either padded or truncated, never both, so truncation always removes words rather than 0s. Using the same value for both parameters (pre with pre, post with post) keeps the filling and the cutting on the same side of every sentence.
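Steps 1 to 4 can be sketched in plain Python to show what happens conceptually. This is a simplified illustration, not the Keras implementation: the helper names `fit_word_index`, `encode` and `decode` are hypothetical, punctuation filtering is left out, and the only behaviour assumed is what the steps above describe (IDs ranked by frequency, the OOV token at ID 1, IDs at or above `num_words` replaced by the OOV ID).

```python
from collections import Counter

def fit_word_index(corpus, oov_token="<OOV>"):
    """Rank words by frequency; the OOV token takes ID 1 and ID 0 stays unused."""
    counts = Counter(word for sentence in corpus for word in sentence.lower().split())
    # most_common() keeps first-seen order among words with equal counts
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

def encode(sentence, word_index, num_words, oov_token="<OOV>"):
    """Map words to IDs; unseen words and IDs >= num_words become the OOV ID."""
    oov_id = word_index[oov_token]
    ids = []
    for word in sentence.lower().split():
        i = word_index.get(word)
        ids.append(i if i is not None and i < num_words else oov_id)
    return ids

def decode(ids, word_index, oov_token="<OOV>"):
    """Map IDs back to words; IDs without a word fall back to the OOV token."""
    index_word = {i: w for w, i in word_index.items()}
    return " ".join(index_word.get(i, oov_token) for i in ids)

corpus = ["I love my dog", "I love my cat", "You love my cat and dog"]
word_index = fit_word_index(corpus)
encoded = encode(corpus[2], word_index, num_words=8)
new_encoded = encode("My dog hates your cat", word_index, num_words=8)

print(word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'and': 8}
print(encoded)                      # [7, 2, 3, 6, 1, 5]
print(decode(encoded, word_index))  # you love my cat <OOV> dog
print(new_encoded)                  # [3, 5, 1, 1, 6]
```

The word `and` has ID 8, which is not below `num_words=8`, so it is encoded as 1 even though it appears in the word index. The words `hates` and `your` were never seen during fitting and are encoded as 1 as well.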


In [0]:
from typing import List
from tensorflow.keras.preprocessing.text import Tokenizer

In [0]:
sentences1 = [
    'I love my (dog)',
    'I love my cat']
sentences2 = [
    'I love my dog',
    'I love my cat',
    'You love my cat and dog']
sentences3 = [
    'I love my dog',
    'I love my cat',
    'You love my cat',
    "I don't love you, and you don't love me, and my cat loves mice but doesn't love you!"]
all_sentences = [sentences1, sentences2, sentences3]
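The parentheses in `'I love my (dog)'` and the punctuation in the last sentence are there to exercise the `filters` parameter. A rough plain-Python sketch of that normalisation step follows. The helper `normalize` is hypothetical, and `DEFAULT_FILTERS` is assumed to mirror the Tokenizer's default `filters` string, which does not include the apostrophe.

```python
# Assumed copy of the Tokenizer's default `filters` value (no apostrophe in it)
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(sentence, filters=DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return sentence.lower().translate(table).split()

print(normalize('I love my (dog)'))
# ['i', 'love', 'my', 'dog']
print(normalize("I don't love you, and you don't love me!"))
# ['i', "don't", 'love', 'you', 'and', 'you', "don't", 'love', 'me']
```

Under this assumption `(dog)` and `dog` end up as the same word, while `don't` keeps its apostrophe and is counted as a single word.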

# Tokenize, create sequences and convert them back to text, test on new sentences

In [0]:
def show_tokenizer_proc_pipeline(text: List[str], n: int, new_text: List[str]):
  """Fit a Tokenizer on `text`, then encode and decode both `text` and `new_text`."""
  # Words outside the `n` limit and unseen words are encoded as '<OOV>'
  tok = Tokenizer(num_words=n, oov_token='<OOV>')
  tok.fit_on_texts(text)

  # Encode the training corpus and decode it back to text
  sequences = tok.texts_to_sequences(text)
  sequences_text = tok.sequences_to_texts(sequences)
  print('-'*50)
  print("word index: ", tok.word_index)
  print("num words: ", tok.num_words)

  print("sequences:", sequences)
  print("seq2texts:", sequences_text)
  print('-'*50)

  # Repeat with sentences the tokenizer was not fitted on
  new_sequences = tok.texts_to_sequences(new_text)
  new_sequences_text = tok.sequences_to_texts(new_sequences)
  print("new_sequences:", new_sequences)
  print("new_seq2texts:", new_sequences_text)

  print("="*50)

In [0]:
test_sentences = ['My dog hates your cat.', 
                  'My cat likes fish and milk.']

In [8]:
for sentences in all_sentences:
  show_tokenizer_proc_pipeline(sentences, 8, test_sentences)

--------------------------------------------------
word index:  {'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5, 'cat': 6}
num words:  8
sequences: [[2, 3, 4, 5], [2, 3, 4, 6]]
padded_sequences [[2 3 4 5]
 [2 3 4 6]]
padded_sequences2text: ['i love my dog', 'i love my cat']
--------------------------------------------------
new_sequences: [[4, 5, 1, 1, 6], [4, 6, 1, 1, 1, 1]]
new_padded_sequences: [[4 5 1 1 6 0]
 [4 6 1 1 1 1]]
new_p_seq2texts: ['my dog <OOV> <OOV> cat <OOV>', 'my cat <OOV> <OOV> <OOV> <OOV>']



--------------------------------------------------
word index:  {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'and': 8}
num words:  8
sequences: [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 6, 1, 5]]
padded_sequences [[4 2 3 5 0 0]
 [4 2 3 6 0 0]
 [7 2 3 6 1 5]]
padded_sequences2text: ['i love my dog <OOV> <OOV>', 'i love my cat <OOV> <OOV>', 'you love my cat <OOV> dog']
--------------------------------------------------
new_sequences: [[3, 5, 1, 1, 6], [3

# Add sequence padding to account for different lengths in sentences
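Before switching to the library function, the padding and truncating options can be sketched in plain Python. This is a simplified illustration rather than the Keras code: the helper `pad` is hypothetical, it works on lists instead of arrays, and it assumes `maxlen` is positive when given. Its defaults are set to `'pre'` for both options, which is assumed to match the defaults of `pad_sequences`.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Bring every sequence to the same length by truncating or filling."""
    if maxlen is None:
        # Without a limit, the longest sequence sets the target length
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' keeps the last items, 'post' keeps the first
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # Then fill whatever is missing; this is empty for truncated sequences
        fill = [value] * (maxlen - len(kept))
        padded.append(fill + kept if padding == "pre" else kept + fill)
    return padded

sequences = [[4, 2, 3, 5], [7, 2, 3, 6, 1, 5]]

post = pad(sequences, maxlen=5, padding="post", truncating="post")
pre = pad(sequences, maxlen=5, padding="pre", truncating="pre")

print(post)  # [[4, 2, 3, 5, 0], [7, 2, 3, 6, 1]]
print(pre)   # [[0, 4, 2, 3, 5], [2, 3, 6, 1, 5]]
```

The short sentence only receives padding and the long one only loses words, so the two options never act on the same sentence.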

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
def show_tokenizer_proc_pipeline(text: List[str],
                                 n: int,
                                 new_text: List[str],
                                 length: int = None):
  """Tokenize, pad and decode `text` and `new_text`; `length` sets `maxlen`."""
  tok = Tokenizer(num_words=n, oov_token='<OOV>')
  tok.fit_on_texts(text)

  sequences = tok.texts_to_sequences(text)

  # Pad (and, if `length` is set, truncate) at the end of each sentence
  padded_seq = pad_sequences(sequences,
                             padding='post',
                             truncating='post',
                             maxlen=length)

  # Decode the padded array back to text
  pad_sequences_text = tok.sequences_to_texts(padded_seq)

  print('-'*50)
  print("word index: ", tok.word_index)
  print("num words: ", tok.num_words)

  print("sequences:", sequences)
  print("padded_sequences", padded_seq)
  print("padded_sequences2text:", pad_sequences_text)

  print('-'*50)

  # Repeat with sentences the tokenizer was not fitted on
  new_sequences = tok.texts_to_sequences(new_text)

  new_padded_seq = pad_sequences(new_sequences,
                                 padding='post',
                                 truncating='post',
                                 maxlen=length)

  pad_new_sequences_text = tok.sequences_to_texts(new_padded_seq)

  print("new_sequences:", new_sequences)
  print("new_padded_sequences:", new_padded_seq)
  print("new_p_seq2texts:", pad_new_sequences_text)
  print("="*50)
  print('\n'*2)

In [0]:
test_sentences = ['My dog hates your cat.', 
                  'My cat likes fish and milk.']

In [12]:
for sentences in all_sentences:
  show_tokenizer_proc_pipeline(sentences, 8, test_sentences, length=4)

--------------------------------------------------
word index:  {'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5, 'cat': 6}
num words:  8
sequences: [[2, 3, 4, 5], [2, 3, 4, 6]]
padded_sequences [[2 3 4 5]
 [2 3 4 6]]
padded_sequences2text: ['i love my dog', 'i love my cat']
--------------------------------------------------
new_sequences: [[4, 5, 1, 1, 6], [4, 6, 1, 1, 1, 1]]
new_padded_sequences: [[4 5 1 1]
 [4 6 1 1]]
new_p_seq2texts: ['my dog <OOV> <OOV>', 'my cat <OOV> <OOV>']



--------------------------------------------------
word index:  {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'and': 8}
num words:  8
sequences: [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 6, 1, 5]]
padded_sequences [[4 2 3 5]
 [4 2 3 6]
 [7 2 3 6]]
padded_sequences2text: ['i love my dog', 'i love my cat', 'you love my cat']
--------------------------------------------------
new_sequences: [[3, 5, 1, 1, 6], [3, 6, 1, 1, 1, 1]]
new_padded_sequences: [[3 5 1 1]
 [3 6 1 1]]
new_p_seq2tex