<a href="https://colab.research.google.com/github/devbabbar7/DeepLearning.AI-TensorFlow/blob/main/Natural%20Language%20Processing%20Tensorflow/Tokenizer_Padding_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer Basics
In most NLP tasks, the first step in preparing your data is to extract a vocabulary of words from your corpus (i.e. the input texts). You then need to decide how to convert the texts into numerical representations that can be used to train a neural network. These representations are called tokens, and TensorFlow and Keras make it easy to generate them through their APIs. You will see how to do that in the next cells.

## Generating the vocabulary
In this notebook, you will first look at how to build a lookup dictionary that maps each word to an integer. The code below takes a list of sentences and assigns an integer index to each distinct word in them. This is done with the fit_on_texts() method, and you can inspect the result through the word_index property. More frequent words get a lower index. By default the tokenizer also lowercases the text and strips punctuation, so 'I,' and 'i' map to the same index.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
    ]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word index and print it
word_index = tokenizer.word_index
print(word_index)

# Convert each sentence into a sequence of token indices
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
[[1, 2, 3, 4], [1, 2, 3, 5]]


In [2]:
# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word index and print it
word_index = tokenizer.word_index
print(word_index)

# Convert each sentence into a sequence of token indices
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
[[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
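The change in ordering between the two outputs (with the third sentence added, 'love' and 'my' move ahead of 'i') follows from the frequency ranking. The sketch below is a plain-Python approximation of that ranking, not the Keras implementation. The helper name build_word_index is hypothetical, and stripping ASCII punctuation via string.punctuation is an assumption that only roughly matches the Keras default filters.

```python
import string
from collections import Counter

# Hypothetical helper, not part of Keras: approximates the
# frequency-ranked index that fit_on_texts() produces.
def build_word_index(texts):
    # Assumption: lowercase and replace ASCII punctuation with spaces,
    # which is close to (but not identical to) the Keras defaults.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # Sort by descending count; sorted() is stable, so ties keep
    # first-seen order. Indices start at 1, leaving 0 free for padding.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
print(build_word_index(sentences))
```

Here 'love' and 'my' each occur three times and 'i' only twice, which is why 'i' drops from index 1 to index 3.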


In [7]:
# Padding: by default, zeros are added at the front of each sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences, maxlen=5)

# Print the result
print("\nPadded Sequences:")
print(padded)

padded = pad_sequences(sequences, maxlen=10)

# Print the result
print("\nPadded Sequences 2:")
print(padded)


Padded Sequences:
[[0 3 1 2 4]
 [0 3 1 2 5]
 [0 6 1 2 4]]

Padded Sequences 2:
[[0 0 0 0 0 0 3 1 2 4]
 [0 0 0 0 0 0 3 1 2 5]
 [0 0 0 0 0 0 6 1 2 4]]
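The outputs above show the default behavior, where zeros are placed in front of each sequence. pad_sequences also accepts padding and truncating arguments (each 'pre' or 'post') to control where zeros are added and which end is cut when a sequence is longer than maxlen. The following is a list-based sketch of that logic under those assumptions. The helper pad is hypothetical and returns plain lists rather than the NumPy array that Keras returns.

```python
# Hypothetical helper, not the Keras implementation: a list-based sketch
# of what pad_sequences() does with its padding/truncating options.
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: "pre" drops tokens from the start, "post" from the end.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: "pre" puts the fill values in front, "post" after.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

sequences = [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
print(pad(sequences, maxlen=5))                  # zeros in front
print(pad(sequences, maxlen=5, padding="post"))  # zeros at the end
print(pad(sequences, maxlen=3))                  # first token of each sequence dropped
```

Note that a maxlen shorter than a sequence discards tokens, so it is worth checking sequence lengths before choosing it.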


In [5]:
# Out-of-vocabulary words
# Try sentences containing words the tokenizer was not fit on
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Generate the sequences
test_seq = tokenizer.texts_to_sequences(test_data)

# Print the word index dictionary
print("\nWord Index = " , word_index)

# Print the sequences with OOV
print("\nTest Sequence = ", test_seq)


Word Index =  {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Test Sequence =  [[3, 1, 2, 4], [2, 4, 2]]


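We can see that words the tokenizer never saw during fitting ('really', 'loves' and 'manatee') are silently dropped, so the test sequences come out shorter than the original sentences.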

# OOV_TOKEN
In natural language processing, OOV stands for "out-of-vocabulary". An OOV token is a special token that a tokenizer uses to represent any word or character that is not present in its vocabulary. With the Keras Tokenizer it is set through the oov_token argument, and it is assigned index 1 in the word index.

In [8]:
# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100, oov_token = '<OOV>')

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word index and print it
word_index = tokenizer.word_index
print(word_index)

# Convert each sentence into a sequence of token indices
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)


# Out-of-vocabulary words
# Try sentences containing words the tokenizer was not fit on
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Generate the sequences
test_seq = tokenizer.texts_to_sequences(test_data)

# Print the word index dictionary
print("\nWord Index = " , word_index)

# Print the sequences with OOV
print("\nTest Sequence = ", test_seq)

{'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}
[[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5]]

Word Index =  {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}

Test Sequence =  [[4, 1, 2, 3, 5], [3, 5, 1, 3, 1]]
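The difference between the two test outputs comes down to how an unknown word is looked up. As a minimal sketch, assuming a plain dictionary lookup (the helper encode is hypothetical and is not how Keras is implemented), the two behaviors can be reproduced like this:

```python
import string

# Word index copied from the output above; '<OOV>' holds index 1.
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}

# Hypothetical helper, not part of Keras: encode one sentence by dictionary lookup.
def encode(text, word_index, oov_token=None):
    # Assumption: lowercase and replace ASCII punctuation with spaces.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    words = text.lower().translate(table).split()
    if oov_token is None:
        # No OOV token: unknown words are skipped entirely.
        return [word_index[w] for w in words if w in word_index]
    # With an OOV token: unknown words fall back to its index.
    oov_index = word_index[oov_token]
    return [word_index.get(w, oov_index) for w in words]

print(encode('my dog loves my manatee', word_index, oov_token='<OOV>'))
print(encode('my dog loves my manatee', word_index))
```

With the OOV token, the encoded sequence keeps the same length as the input sentence, which preserves word positions even when some words are unknown.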
