In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
sentences = [
    "What companies Elon founded",
    "Elon's company: Tesla, SpaceX, The Boring, OpenAI",
    "We are the fehle of Elon Musk",
    "Do you think Elon is the amazing?"
]

# Tokenizer

The `num_words` parameter caps the vocabulary: in the body of text that it's tokenizing,
the tokenizer keeps only the most common words, here up to 100 or whatever value you put in (strictly, Keras keeps the word indices below `num_words`).
I have far fewer than 100 unique words here, so it has no effect in this example.

`fit_on_texts` then goes through the entire body of text and
creates a dictionary with the word as the key and the token for that word as the value.
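
To make the dictionary idea concrete, here is a minimal pure-Python sketch of a frequency-ranked word index. It is a simplified stand-in, not the Keras implementation: `build_word_index` and `toy_texts` are hypothetical names, and the sketch only lowercases and splits on whitespace.

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Map each word to a 1-based integer, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        vocab.insert(0, oov_token)  # the special token takes index 1
    return {word: i for i, word in enumerate(vocab, start=1)}

toy_texts = ["the cat sat", "the cat ran", "the dog"]  # illustrative data
word_index = build_word_index(toy_texts, oov_token="<OOV>")
print(word_index)
```

The more often a word appears, the smaller its index, which is why a cap such as `num_words` can keep the common words and drop the rare ones.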

# OOV Token

I'm also going to use a parameter called `oov_token` (OOV stands for out-of-vocabulary). The idea here is to create
a special token that is used for words that aren't recognized, that is, words that aren't in the word index itself.
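
The fallback itself is just a dictionary lookup with a default. A rough sketch, assuming a small hand-written word index (`texts_to_ids` is a hypothetical helper, not a Keras function):

```python
def texts_to_ids(texts, word_index, oov_token="<OOV>"):
    """Encode each text as a list of ids; unknown words get the OOV id."""
    oov_id = word_index[oov_token]
    return [
        [word_index.get(word, oov_id) for word in text.lower().split()]
        for text in texts
    ]

# A hand-picked subset of a word index, for illustration only.
small_index = {"<OOV>": 1, "elon": 2, "the": 3, "what": 4, "do": 18}
encoded = texts_to_ids(["what can I do", "the Elon rocket"], small_index)
print(encoded)
```

Without the special token, unknown words would simply be dropped, and the encoded sentence would come out shorter than the original.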

In [3]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# Word Indexing and Sequences

First, punctuation such as the comma and the colon has been removed (the apostrophe in "elon's" is kept).
So the tokenizer cleans up my text for me and pulls out just the words.

Second, the Tokenizer in TensorFlow lowercases the text by default, so "Elon" and "elon" are treated as the same word.
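
The cleanup described above can be approximated in a few lines of plain Python. This is a sketch under assumptions, not the library's code: `normalize` is a hypothetical helper, and the filter string is copied from the default `filters` value documented for the Keras Tokenizer, which does not include the apostrophe.

```python
# Default `filters` string documented for the Keras Tokenizer (no apostrophe).
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(text, filters=KERAS_DEFAULT_FILTERS, lower=True):
    """Lowercase the text, blank out filtered characters, split into words."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()

words = normalize("Elon's company: Tesla, SpaceX, The Boring, OpenAI")
print(words)
```

Passing `lower=False` in this sketch keeps the original casing, so "The" and "the" would count as two different words.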

One really handy thing that you'll use later is that the `texts_to_sequences`
call can take any set of sentences, so it can encode them based on the word index that it
learned from the text that was passed into `fit_on_texts`.

In [4]:
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'elon': 2, 'the': 3, 'what': 4, 'companies': 5, 'founded': 6, "elon's": 7, 'company': 8, 'tesla': 9, 'spacex': 10, 'boring': 11, 'openai': 12, 'we': 13, 'are': 14, 'fehle': 15, 'of': 16, 'musk': 17, 'do': 18, 'you': 19, 'think': 20, 'is': 21, 'amazing': 22}


In [5]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[4, 5, 2, 6], [7, 8, 9, 10, 3, 11, 12], [13, 14, 3, 15, 16, 2, 17], [18, 19, 20, 2, 21, 3, 22]]


# Test Data

Now let's take a look at words that the tokenizer wasn't fit on. For example, my test data is "what can I do with this notebook" and "my lover really likes work of the Elon". If I tokenize these and create sequences out of them, we'll see [4, 1, 1, 18, 1, 1, 1] for the first sentence: only "what" (4) and "do" (18) are in the word index, and every other word maps to the OOV token, 1.


In [6]:
test_data = [
    'what can I do with this notebook',
    'my lover really likes work of the Elon'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)  # 1 is the "<OOV>" token, assigned to words that are not in the word index

[[4, 1, 1, 18, 1, 1, 1], [1, 1, 1, 1, 1, 16, 3, 2]]


# Padding

In the output of the cell below, you can see that the list of sentences has been padded out into a matrix and that
each row in the matrix has the same length. With `padding="post"`, it achieved this by putting the appropriate number of zeros after the sentence. So in the case of the sentence 4, 5, 2, 6, it added three zeros. In the case of the longer sentences,
it didn't need to add any. By default (`padding="pre"`), the zeros go before the sentence instead, though often you'll see examples where the padding is after the sentence, as it is here.


By default, the matrix is as wide as the longest sentence, but you can override that with the `maxlen` parameter. So for example, if you only want your sentences to have a maximum of five words, you can say `maxlen=5` (the cell below uses `maxlen=7`, the length of the longest sentence). This of course leads to a question:
if I have sentences longer than `maxlen`, then I'll lose information, but from where? As with padding, the default is `pre`, which means that you will lose words from the beginning of the sentence. If you want to override this so that you lose them from the end instead, you can do so with `truncating="post"`.
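
Here is a small pure-Python sketch of how the `padding`, `truncating` and `maxlen` options interact. It assumes the behavior described above and is not the library's implementation: `pad` is a hypothetical helper that returns plain lists, whereas `pad_sequences` returns a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (maxlen >= 1)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    rows = []
    for seq in sequences:
        # "pre" drops items from the front, "post" drops them from the back.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        rows.append(fill + seq if padding == "pre" else seq + fill)
    return rows

seqs = [[4, 5, 2, 6], [7, 8, 9, 10, 3, 11, 12]]
pre_default = pad(seqs, maxlen=5)
post_both = pad(seqs, maxlen=5, padding="post", truncating="post")
print(pre_default)
print(post_both)
```

With the defaults, the short sentence gets one leading zero and the long one loses its first two tokens; with `"post"` for both options, the zero goes at the end and the last two tokens are dropped instead.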

In [7]:
padded = pad_sequences(sequences, padding="post", maxlen=7)
print(padded)

[[ 4  5  2  6  0  0  0]
 [ 7  8  9 10  3 11 12]
 [13 14  3 15 16  2 17]
 [18 19 20  2 21  3 22]]
