# Preparing the text for the TensorFlow model

## Tokenizing text and creating sequences for sentences

Neural networks and other machine learning models take numbers as input, so we cannot feed them raw words. In this notebook we will see how to convert words into numbers.

## Import the Tokenizer

In [1]:
# Import the Tokenizer and the pad_sequences helper
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Write some sentences

In [2]:
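Sentence=["My Name is Amrendra",
         "I love Mango",
         "Mango is my  favourite fruit",
         # Note: there is no comma after the next string, so Python joins it
         # with the following string into one sentence (see the outputs below).
         "do you like choclate?"
         "what is your favourite fruit",
         "your cat",
         "your favourite place",
         "where you spend your time alone"]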

## Tokenize the words

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
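
To make the idea concrete, here is a minimal pure-Python sketch of word-level tokenization. It assumes that tokens are separated by whitespace, lowercases the text, and replaces ASCII punctuation with spaces, which is only roughly similar to the default behaviour of the Keras `Tokenizer`. The `simple_tokenize` helper is hypothetical and is not part of TensorFlow.

```python
import string

def simple_tokenize(text):
    """Split text into lowercase word tokens (hypothetical helper, not a Keras API)."""
    # Replace every ASCII punctuation character with a space so that
    # "choclate?what" becomes two separate tokens.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    cleaned = text.lower().translate(table)
    # str.split() with no arguments also collapses repeated spaces.
    return cleaned.split()

print(simple_tokenize("Mango is my  favourite fruit"))
# ['mango', 'is', 'my', 'favourite', 'fruit']
print(simple_tokenize("do you like choclate?what is your favourite fruit"))
# ['do', 'you', 'like', 'choclate', 'what', 'is', 'your', 'favourite', 'fruit']
```

Note that the double space and the question mark do not produce empty or extra tokens.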

In [3]:
# num_words sets the maximum number of words to keep, based on word frequency.
# The out-of-vocabulary (OOV) token represents words that are not in the index.
tokenizer=Tokenizer(num_words=100,oov_token="<OOV>")
# Call fit_on_texts() on the tokenizer to assign a unique number to each word.
tokenizer.fit_on_texts(Sentence)

To read more about the `Tokenizer` class, see the [TensorFlow documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

## View the word index
* After tokenizing the text, the tokenizer has a word index that contains key-value pairs for all the words.
* The word is the key, and the number is the value.
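
As a rough illustration of how such a word index can be built, the sketch below counts the words in a small corpus and numbers them from most to least frequent, with the OOV token placed first. The `build_word_index` helper is hypothetical and only approximates what `fit_on_texts()` does; it assumes the text is already free of punctuation.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # The OOV token takes number 1; 0 is left unused so it can serve as padding.
    word_index = {oov_token: 1}
    # most_common() sorts by count; words with equal counts keep first-seen order.
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

index = build_word_index(["I love Mango", "Mango is my favourite fruit"])
print(index)
# {'<OOV>': 1, 'mango': 2, 'i': 3, 'love': 4, 'is': 5, 'my': 6, 'favourite': 7, 'fruit': 8}
```

The word "mango" appears twice, so it receives the lowest number after the OOV token.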

In [4]:
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'your': 2, 'is': 3, 'favourite': 4, 'my': 5, 'mango': 6, 'fruit': 7, 'you': 8, 'name': 9, 'amrendra': 10, 'i': 11, 'love': 12, 'do': 13, 'like': 14, 'choclate': 15, 'what': 16, 'cat': 17, 'place': 18, 'where': 19, 'spend': 20, 'time': 21, 'alone': 22}


In [5]:
# Get the number for a given word
print(word_index["choclate"])

15


# Create sequences for the sentences

After you tokenize the words, the word index contains a unique number for each word. However, the word index only maps each word to a number; it says nothing about the order in which the words appear in a sentence. So after tokenizing the words, the next step is to generate a sequence of numbers for each sentence.
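
The idea can be sketched in plain Python: look up each word of a sentence in the word index, keeping the order of the words. The dictionary below is a hand-copied subset of the word index printed earlier in this notebook, and `to_sequence` is a hypothetical helper, not the Keras implementation.

```python
# Subset of the word index printed earlier in this notebook.
word_index = {"is": 3, "my": 5, "name": 9, "amrendra": 10}

def to_sequence(sentence, word_index):
    """Replace each word with its number, keeping sentence order (hypothetical helper)."""
    return [word_index[word] for word in sentence.lower().split()]

print(to_sequence("My Name is Amrendra", word_index))  # [5, 9, 3, 10]
print(to_sequence("is my name Amrendra", word_index))  # [3, 5, 9, 10]
```

The two sentences use the same words, but their sequences differ because the word order differs.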

In [6]:
# Call texts_to_sequences() to convert each sentence into a sequence of numbers
Sequence=tokenizer.texts_to_sequences(Sentence)
print(Sentence)
print(Sequence)

['My Name is Amrendra', 'I love Mango', 'Mango is my  favourite fruit', 'do you like choclate?what is your favourite fruit', 'your cat', 'your favourite place', 'where you spend your time alone']
[[5, 9, 3, 10], [11, 12, 6], [6, 3, 5, 4, 7], [13, 8, 14, 15, 16, 3, 2, 4, 7], [2, 17], [2, 4, 18], [19, 8, 20, 2, 21, 22]]


## Check for the words which are not in word index

Let's take a look at what happens if the sentence being sequenced contains words that are not in the word index.

* When a word is not in `word_index`, it is replaced by the number of the out-of-vocabulary token `<OOV>`, which is 1 here.
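
A minimal sketch of this fallback, assuming a word index that already contains the `<OOV>` entry: `dict.get` returns the OOV number whenever a word is missing. The `to_sequence_with_oov` helper is hypothetical and only mimics the behaviour described above.

```python
# Subset of the word index printed earlier in this notebook.
word_index = {"<OOV>": 1, "i": 11, "like": 14}

def to_sequence_with_oov(sentence, word_index, oov_token="<OOV>"):
    """Look up each word, falling back to the OOV number (hypothetical helper)."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in sentence.lower().split()]

print(to_sequence_with_oov("I Like Apple", word_index))  # [11, 14, 1]
```

The word "apple" was never seen during fitting, so it is mapped to 1.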

In [7]:
Sentences1=["I Like Apple"]

In [8]:
Sequences1 = tokenizer.texts_to_sequences(Sentences1)
print(Sequences1)

[[11, 14, 1]]


## Make the sequences all the same length

We can make the sequences the same length by padding or by truncating them.

* Padding: adding zeros to a sequence so that all sequences have equal length. We use `tf.keras.preprocessing.sequence.pad_sequences` for this. There are two types of padding:

    * Post: the zeros are added at the end of the sequence.

        [1 2 5 16 0 0 0 0]

    * Pre (the default): the zeros are added at the start of the sequence.

        [0 0 0 0 1 2 5 16]

* Truncating: removing words from a sequence so that all sequences have equal length.

We can optionally specify `maxlen` in `pad_sequences`. Any sequence longer than `maxlen` is truncated to `maxlen`. By default, values are removed from the beginning of the sequence, but we can also truncate from the end by passing `truncating="post"`.

If we do not provide `maxlen`, the sequences are padded to the length of the longest sequence.

To know more about padding and truncating, see the [`pad_sequences` documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences).
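
The padding and truncating options described above can be sketched for a single sequence in plain Python. The `pad_or_truncate` helper is hypothetical; it borrows the parameter names `maxlen`, `padding`, and `truncating` from `pad_sequences` but is not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return seq forced to length maxlen (hypothetical helper, assumes maxlen >= 1)."""
    if len(seq) > maxlen:
        # "pre" drops items from the start, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # "pre" puts the zeros in front, "post" puts them at the end.
    return fill + seq if padding == "pre" else seq + fill

short_seq = [2, 17]
long_seq = [13, 8, 14, 15, 16, 3, 2, 4, 7]

print(pad_or_truncate(short_seq, 5))                     # [0, 0, 0, 2, 17]
print(pad_or_truncate(short_seq, 5, padding="post"))     # [2, 17, 0, 0, 0]
print(pad_or_truncate(long_seq, 5))                      # [16, 3, 2, 4, 7]
print(pad_or_truncate(long_seq, 5, truncating="post"))   # [13, 8, 14, 15, 16]
```

With Keras, the last case corresponds to calling `pad_sequences` with `maxlen=5` and `truncating="post"`.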





In [9]:
Padded=pad_sequences(Sequence)
print(Padded)

[[ 0  0  0  0  0  5  9  3 10]
 [ 0  0  0  0  0  0 11 12  6]
 [ 0  0  0  0  6  3  5  4  7]
 [13  8 14 15 16  3  2  4  7]
 [ 0  0  0  0  0  0  0  2 17]
 [ 0  0  0  0  0  0  2  4 18]
 [ 0  0  0 19  8 20  2 21 22]]


Here we can see that the sequences are padded to the length of the longest sequence, which is 9.

### Padding sequence with max length

In [10]:
padded1=pad_sequences(Sequence,maxlen=13)
print(padded1)

[[ 0  0  0  0  0  0  0  0  0  5  9  3 10]
 [ 0  0  0  0  0  0  0  0  0  0 11 12  6]
 [ 0  0  0  0  0  0  0  0  6  3  5  4  7]
 [ 0  0  0  0 13  8 14 15 16  3  2  4  7]
 [ 0  0  0  0  0  0  0  0  0  0  0  2 17]
 [ 0  0  0  0  0  0  0  0  0  0  2  4 18]
 [ 0  0  0  0  0  0  0 19  8 20  2 21 22]]


Here we can see that zeros have been added to every sequence so that each one has length 13.

### Padding sequence with max length and padding

In [11]:
Pad2=pad_sequences(Sequence,maxlen=13,padding="post")
print(Pad2)

[[ 5  9  3 10  0  0  0  0  0  0  0  0  0]
 [11 12  6  0  0  0  0  0  0  0  0  0  0]
 [ 6  3  5  4  7  0  0  0  0  0  0  0  0]
 [13  8 14 15 16  3  2  4  7  0  0  0  0]
 [ 2 17  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  4 18  0  0  0  0  0  0  0  0  0  0]
 [19  8 20  2 21 22  0  0  0  0  0  0  0]]


Here we can see that, because we specified `padding="post"`, the zeros are now added at the end.

### Truncating

In [12]:
pad3=pad_sequences(Sequence,maxlen=5)
print(pad3)

[[ 0  5  9  3 10]
 [ 0  0 11 12  6]
 [ 6  3  5  4  7]
 [16  3  2  4  7]
 [ 0  0  0  2 17]
 [ 0  0  2  4 18]
 [ 8 20  2 21 22]]


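The longest sequence has length 9, so when we constrain the maximum length to 5, some sequences get truncated. Because truncation happens from the beginning by default, only the last five values of those sequences are kept.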

$$ End $$