Build models that can process text

Word-based encodings
ASCII values only give a number for each letter in a word, so they say nothing about the word itself. For example, SILENT and LISTEN are made of the same letter codes, just in a different order.
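
As a quick check of the point above, the following minimal sketch compares the letter codes of the two words using Python's built-in ord. The variable names are only illustrative.

```python
# ASCII codes of each letter in two words that are anagrams of each other.
silent_codes = [ord(ch) for ch in "SILENT"]
listen_codes = [ord(ch) for ch in "LISTEN"]

print(silent_codes)  # [83, 73, 76, 69, 78, 84]
print(listen_codes)  # [76, 73, 83, 84, 69, 78]

# The two words use exactly the same codes, only in a different order,
# so the codes alone do not tell the words apart in any meaningful way.
print(sorted(silent_codes) == sorted(listen_codes))  # True
```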

That is why we give each word its own code instead of encoding individual letters.
For example, "I love my dog" is encoded as 01 02 03 04.
"I love my cat" reuses the same codes for "I love my", and "cat" gets the new code 05.
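
The idea can be sketched in plain Python before turning to the Keras API. The helper encode_words below is hypothetical (it is not part of any library) and simply gives each new word the next free integer in order of first appearance.

```python
def encode_words(sentences):
    """Give each new word the next free integer, in order of first appearance."""
    word_to_id = {}
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1
            ids.append(word_to_id[word])
        encoded.append(ids)
    return word_to_id, encoded


word_to_id, encoded = encode_words(["I love my dog", "I love my cat"])
print(word_to_id)  # {'I': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)     # [[1, 2, 3, 4], [1, 2, 3, 5]]
```

The two sentences share the codes 1 2 3, and only the last word differs, which is the kind of similarity a model can pick up on.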

In [5]:
#Using APIs
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog','I love my cat']

tokenizer = Tokenizer(num_words = 100) # num_words limits the encoding to the most frequent words, which is useful when we have a huge amount of data
tokenizer.fit_on_texts(sentences) # builds the vocabulary from the sentences
word_index = tokenizer.word_index # dictionary of word -> index pairs (it lists every word seen, regardless of num_words)
print(word_index)
# fit_on_texts is similar to fitting an ML model: it learns the vocabulary from the data

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


There are 5 unique words across the two sentences above. Any word that has not been seen before gets a new index.
The Tokenizer strips punctuation and lowercases the text, so "I" is shown as "i".
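
To make the lowercasing and punctuation stripping concrete, here is a rough pure-Python sketch. The normalize function is a hypothetical helper, and FILTERS is assumed to match the default value of the Tokenizer's filters argument. It is an approximation of the behaviour, not the library's actual code.

```python
# Assumption: this mirrors the default `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text):
    """Lowercase the text, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


print(normalize("Do you think my dog is amazing?"))
# ['do', 'you', 'think', 'my', 'dog', 'is', 'amazing']
```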

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
#Text to sequence
# New set of sentences
sentences = ['I love my dog','I love my cat','you love my dog','Do you think my dog is amazing?']
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
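
Note that the indices are no longer in order of first appearance: "my" occurs most often and gets index 1. The sketch below reproduces this ordering with collections.Counter, on the assumption that the Tokenizer ranks words by frequency and keeps first-seen order for ties. It is an illustration, not the Tokenizer's implementation.

```python
from collections import Counter

sentences = ['I love my dog', 'I love my cat', 'you love my dog',
             'Do you think my dog is amazing?']

# Count lowercased words, dropping the trailing question mark.
counts = Counter(
    word.strip("?").lower()
    for sentence in sentences
    for word in sentence.split()
)

# most_common() sorts by count (highest first); ties keep first-seen order.
ranked_index = {word: rank
                for rank, (word, _) in enumerate(counts.most_common(), start=1)}
print(ranked_index)
# {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
```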


In [10]:
test_data = ['I love my name','My name is Akshata']
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq) # Words that were not in the training sentences ("name", "Akshata") are skipped because they are not in the word index

[[4, 2, 1], [1, 9]]


In [11]:
# To handle unknown words, it is good practice to set the oov_token parameter so that words not seen during fitting are replaced by a placeholder
from tensorflow.keras.preprocessing.text import Tokenizer
#Text to sequence
# New set of sentences
sentences = ['I love my dog','I love my cat','you love my dog','Do you think my dog is amazing?']
tokenizer = Tokenizer(num_words=100,oov_token ="<OOV>") # oov_token means Out of vocabulary
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


We can see that <OOV> takes index 1. Now let us try it on the test set.

In [12]:
test_data = ['I love my name','My name is Akshata']
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq) # Words that were not seen during fitting are replaced with the <OOV> index (1)

[[5, 3, 2, 1], [2, 1, 10, 1]]
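
The difference between skipping unknown words and replacing them can be sketched with a plain dictionary lookup. The to_sequence function below is a hypothetical helper written for illustration. It reuses the word index printed above and only approximates what texts_to_sequences does.

```python
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def to_sequence(sentence, word_index, oov_token=None):
    """Map words to indices; unknown words become the OOV index, or are skipped."""
    oov_id = word_index.get(oov_token)  # None when no OOV token is configured
    ids = []
    for word in sentence.lower().split():
        word_id = word_index.get(word, oov_id)
        if word_id is not None:
            ids.append(word_id)
    return ids


print(to_sequence("My name is Akshata", word_index, oov_token="<OOV>"))  # [2, 1, 10, 1]
print(to_sequence("My name is Akshata", word_index))                     # [2, 10]
```

With the OOV token the sequence keeps its original length, so the position of each known word in the sentence is preserved.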


In [15]:
# Padding: sequences are padded so that they all have the same length
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#Text to sequence
# New set of sentences
sentences = ['I love my dog','I love my cat','you love my dog','Do you think my dog is amazing?']
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)
print(word_index)
print(sequences)
print(padded)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
[[ 0  0  0  4  2  1  3]
 [ 0  0  0  4  2  1  6]
 [ 0  0  0  5  2  1  3]
 [ 7  5  8  1  3  9 10]]


pad_sequences builds a matrix from the sequences so that the text data can be used by our neural network.
Rows for sentences with fewer words than the longest sentence are filled with 0s at the start.
Use padding="post" if we want the 0s at the end instead.
Use maxlen=n if we want every sequence to have n words.
Sentences longer than maxlen are truncated. Use truncating to choose where words are dropped: "pre" (the default) drops them from the start, "post" drops them from the end.
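
To see how these three options interact, here is a small pure-Python sketch. The pad function is a hypothetical helper that imitates the padding, truncating and maxlen behaviour described above. It returns plain lists rather than a NumPy array and is not the Keras implementation.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (assumes maxlen >= 1)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops words from the start, "post" drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the filler before the words, "post" puts it after.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[4, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
print(pad(sequences, maxlen=5))
# [[0, 4, 2, 1, 3], [8, 1, 3, 9, 10]]
print(pad(sequences, maxlen=5, padding="post", truncating="post"))
# [[4, 2, 1, 3, 0], [7, 5, 8, 1, 3]]
```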

In [18]:
test_data = ['I love my name','My name is Akshata']
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq) # Unknown words are skipped here because this tokenizer was created without oov_token
padded = pad_sequences(test_seq , maxlen = 7, padding = "post")
print("\npost padding")
print(padded)

[[4, 2, 1], [1, 9]]

post padding
[[4 2 1 0 0 0 0]
 [1 9 0 0 0 0 0]]
