<a href="https://colab.research.google.com/github/dswh/lil_nlp_with_tensorflow/blob/main/01_03_begin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization of words using TensorFlow

This notebook shows how to turn the words in a sentence into sequences of integer tokens using the Keras `Tokenizer`.

In [20]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [21]:
## define the list of sentences to tokenize
train_sentences = [
    'It is a sunny day',
    'It is a cloudy day',
    'It is a super hot day!',
    'Will it rain today?',
]

In [22]:
##set up the tokenizer
##num_words=100 limits encoding to the 99 most frequent words (indices 1 to 99)
tokenizer = Tokenizer(num_words=100)

##train the tokenizer on training sentences
tokenizer.fit_on_texts(train_sentences)

##store the word-to-index mapping learned from the training sentences
word_index = tokenizer.word_index

In [23]:
##convert the training sentences into sequences of word indices
sequences = tokenizer.texts_to_sequences(train_sentences)


In [24]:
print(f"Word index -->{word_index}")
print(f"Sequences of words -->{sequences}")

Word index -->{'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6, 'super': 7, 'hot': 8, 'will': 9, 'rain': 10, 'today': 11}
Sequences of words -->[[1, 2, 3, 5, 4], [1, 2, 3, 6, 4], [1, 2, 3, 7, 8, 4], [9, 1, 10, 11]]


In [25]:
print(train_sentences[0])
print(sequences[0])

It is a sunny day
[1, 2, 3, 5, 4]
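
The word index above ranks words by how often they appear in the training sentences, with ties broken by first appearance. The cell below is a rough pure-Python sketch of that idea, not the Keras implementation. The helper names `simple_tokenize`, `build_word_index`, and `to_sequences` are hypothetical, and the punctuation handling is a simplification (it strips every character in `string.punctuation`, which may differ from the tokenizer's default filters).

```python
import string

def simple_tokenize(text):
    """Lowercase, turn punctuation into spaces, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = {}
    for text in texts:
        for word in simple_tokenize(text):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace every known word with its index; skip unknown words."""
    return [
        [word_index[w] for w in simple_tokenize(text) if w in word_index]
        for text in texts
    ]

train_sentences = [
    'It is a sunny day',
    'It is a cloudy day',
    'It is a super hot day!',
    'Will it rain today?',
]

sketch_index = build_word_index(train_sentences)
sketch_sequences = to_sequences(train_sentences, sketch_index)
print(sketch_index)
print(sketch_sequences)
```

For these four sentences the sketch should reproduce the word index and sequences printed by the tokenizer above.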


## Tokenizing new data using the same tokenizer

In [26]:
new_sentences = [
    'Will it be raining today?',
    'It is a pleasant day.',
]

In [27]:
new_sequences = tokenizer.texts_to_sequences(new_sentences)

In [28]:
print(new_sentences)
print(new_sequences)

['Will it be raining today?', 'It is a pleasant day.']
[[9, 1, 11], [1, 2, 3, 4]]
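
Note that the new sequences are shorter than the sentences: "be", "raining", and "pleasant" were never seen during fitting, so they have no index and are silently dropped. The cell below is a small sketch that makes the dropped words visible. It reuses the word index printed earlier as a plain dictionary, and `encode_and_report` is a hypothetical helper, not part of the Keras API.

```python
import string

# Word index copied from the output printed earlier in this notebook.
known_words = {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6,
               'super': 7, 'hot': 8, 'will': 9, 'rain': 10, 'today': 11}

def encode_and_report(sentence, word_index):
    """Return (indices of known words, list of words that were dropped)."""
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    kept = [word_index[w] for w in words if w in word_index]
    dropped = [w for w in words if w not in word_index]
    return kept, dropped

for sentence in ['Will it be raining today?', 'It is a pleasant day.']:
    kept, dropped = encode_and_report(sentence, known_words)
    print(f"{sentence!r} --> {kept} (dropped: {dropped})")
```

Losing words this way changes the sentence length and hides the fact that something was missing, which is the motivation for the out-of-vocabulary token.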


## Replacing newly encountered words with special values

In [29]:
##set up the tokenizer again, this time with an out-of-vocabulary (OOV) token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

##train the tokenizer on training sentences
tokenizer.fit_on_texts(train_sentences)

##store the word-to-index mapping, which now includes the OOV token
word_index = tokenizer.word_index

In [32]:
new_sequences = tokenizer.texts_to_sequences(new_sentences)
print(word_index)
print(new_sequences)

{'<OOV>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6, 'cloudy': 7, 'super': 8, 'hot': 9, 'will': 10, 'rain': 11, 'today': 12}
[[10, 2, 1, 1, 12], [2, 3, 4, 1, 5]]
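
Two things change in the output above: the OOV token takes index 1, so every other word shifts up by one, and unseen words are encoded as 1 instead of being dropped. The cell below sketches both effects in plain Python under the same simplifying assumptions as before. `add_oov_slot` and `encode_with_oov` are hypothetical helpers, not Keras functions.

```python
import string

# Word index without an OOV token, copied from the first part of the notebook.
plain_index = {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6,
               'super': 7, 'hot': 8, 'will': 9, 'rain': 10, 'today': 11}

def add_oov_slot(word_index, oov_token='<OOV>'):
    """Reserve index 1 for the OOV token and shift every other word up by one."""
    shifted = {oov_token: 1}
    for word, idx in word_index.items():
        shifted[word] = idx + 1
    return shifted

def encode_with_oov(sentence, word_index, oov_token='<OOV>'):
    """Encode a sentence, mapping unknown words to the OOV index."""
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    return [word_index.get(w, word_index[oov_token]) for w in words]

oov_index = add_oov_slot(plain_index)
oov_sequences = [
    encode_with_oov(s, oov_index)
    for s in ['Will it be raining today?', 'It is a pleasant day.']
]
print(oov_index)
print(oov_sequences)
```

With the OOV token in place, each encoded sequence has the same number of entries as the sentence has words, so the position of every unknown word is preserved.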
