<a href="https://colab.research.google.com/github/fatjan/learn-tensorflow/blob/master/NLP_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tokenizer

In [None]:
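import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# num_words caps how many of the most frequent words are used when encoding
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary from the training sentences
tokenizer.fit_on_texts(sentences)
# word_index maps each lowercased word to an integer, most frequent first
word_index = tokenizer.word_index
print(word_index)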

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
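To see why 'dog' and 'dog!' end up as the same entry and why 'my' gets index 1, here is a rough pure-Python sketch of the indexing step. It is a simplified imitation, not the Keras implementation: the names simple_tokens and build_word_index are made up for this example, and the filter characters are assumed to match the Tokenizer defaults.

```python
from collections import Counter

# Punctuation assumed to be filtered by default (the apostrophe is kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    """Lowercase the text, turn filtered characters into spaces, then split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Rank words by frequency (ties keep first-seen order); indices start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # sorted() is stable, so equally frequent words stay in first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

word_index = build_word_index(sentences)
print(word_index)
```

Index 0 is never assigned to a word, which is what later lets 0 serve as the padding value.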


Creating Sequences

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)


[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


Test data: what happens to words that the tokenizer has never seen before?

In [None]:
test_data = [
             'I really love my dog.',
             'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


The word "really" is not in the word index, so the first sentence is encoded as if it were "I love my dog". The same thing happens in the second sentence, which is encoded as "my dog my" because the words "loves" and "manatee" are not in word_index.
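The dropping behaviour can be imitated with a plain dictionary lookup. This is only a sketch: encode_dropping_unknown is a hypothetical helper, the word index is copied from the output printed earlier, and the punctuation cleanup is much cruder than the tokenizer's.

```python
# Word index printed earlier in this notebook (no OOV token).
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

def encode_dropping_unknown(text, index):
    """Map each known word to its index and silently skip unknown words."""
    # Crude cleanup that is enough for these two sentences
    words = text.lower().replace('.', ' ').split()
    return [index[w] for w in words if w in index]

test_data = [
    'I really love my dog.',
    'my dog loves my manatee'
]

test_seq = [encode_dropping_unknown(t, word_index) for t in test_data]
print(test_seq)  # [[4, 2, 1, 3], [1, 3, 1]]
```

The encoded sequences are shorter than the sentences they came from, so the position of each missing word is lost.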

Let's use an OOV (out-of-vocabulary) token to solve this.

In [None]:
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
sequences = tokenizer.texts_to_sequences(sentences)

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


Now that the OOV token is in our word index, every time the test data contains a word that word_index does not have, the tokenizer replaces it with the OOV token, whose index is 1.
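In plain Python, the OOV fallback amounts to a dictionary lookup with a default value. The sketch below uses a hypothetical helper, encode_with_oov, and the word index printed above; it is an illustration, not how Keras does it internally.

```python
OOV = '<OOV>'

# Word index printed above, with the OOV token at index 1.
word_index = {OOV: 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode_with_oov(text, index, oov_token=OOV):
    """Map known words to their index and unknown words to the OOV index."""
    # Crude cleanup that is enough for these two sentences
    words = text.lower().replace('.', ' ').split()
    oov_id = index[oov_token]
    return [index.get(w, oov_id) for w in words]

test_data = [
    'I really love my dog.',
    'my dog loves my manatee'
]

test_seq = [encode_with_oov(t, word_index) for t in test_data]
print(test_seq)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

Each encoded sequence now has one entry per word, so sentence length and word positions are preserved even when some words are unknown.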

How do we handle sentences of different lengths? We can use padding.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)
print(padded)

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


With padding, every sequence becomes as long as the longest sentence. Shorter sequences are padded with 0s, at the front by default.
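The default behaviour can be reproduced in a few lines of plain Python. This is a sketch with a made-up helper, pad_pre, that returns nested lists rather than the NumPy array that pad_sequences produces.

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[value] * (width - len(seq)) + list(seq) for seq in sequences]

# Sequences produced by the OOV tokenizer above
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

padded = pad_pre(sequences)
for row in padded:
    print(row)
```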

We can also put the padding after the sentence by passing padding='post'.

In [None]:
padded = pad_sequences(sequences, padding='post')
print(padded)


[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


If we don't want the length to equal that of the longest sentence, we can set a maximum length with the maxlen parameter.

In [None]:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded)


[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]


We can also use the truncating parameter to choose which end of an over-long sequence is cut off: 'post' truncates the end, 'pre' truncates the beginning.
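The difference between the two truncating modes is easiest to see on a single sequence. The helper below, pad_or_truncate, is a hypothetical pure-Python stand-in that handles one sequence at a time; its parameter names mirror those of pad_sequences, but it is only a sketch of the idea.

```python
def pad_or_truncate(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Cut or fill one sequence so that it has exactly `maxlen` entries."""
    if len(seq) > maxlen:
        # 'pre' drops entries from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the fill before the sequence, 'post' puts it after
    return fill + list(seq) if padding == 'pre' else list(seq) + fill

longest = [8, 6, 9, 2, 4, 10, 11]
short = [5, 3, 2, 4]

print(pad_or_truncate(longest, 5, truncating='post'))  # [8, 6, 9, 2, 4]
print(pad_or_truncate(longest, 5, truncating='pre'))   # [9, 2, 4, 10, 11]
print(pad_or_truncate(short, 5, padding='post'))       # [5, 3, 2, 4, 0]
```

Which end to keep depends on the data: truncating='post' keeps the start of each sentence, truncating='pre' keeps the end.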

# Training a model to recognize sentiment in text.

In [None]:
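import json

# sarcasm.json must already be in the working directory;
# otherwise open() raises FileNotFoundError.
with open('sarcasm.json', 'r') as f:
  datastore = json.load(f)

sentences = []
labels = []
urls = []

# Each record holds a headline, its is_sarcastic label, and the article link
for item in datastore:
  sentences.append(item['headline'])
  labels.append(item['is_sarcastic'])
  urls.append(item['article_link'])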


FileNotFoundError: ignored
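The error means that sarcasm.json was not in the working directory when the cell ran. As a sketch of the loading step that can be tried without the real dataset, the code below wraps it in a hypothetical helper, load_headlines, and runs it on a made-up two-record sample written to a temporary file. The sample records and the example.com links are placeholders; only the three keys come from the cell above.

```python
import json
import os
import tempfile

def load_headlines(path):
    """Return (sentences, labels, urls) from a JSON list of headline records."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} not found; put the dataset file there first")
    with open(path, 'r') as f:
        datastore = json.load(f)
    sentences = [item['headline'] for item in datastore]
    labels = [item['is_sarcastic'] for item in datastore]
    urls = [item['article_link'] for item in datastore]
    return sentences, labels, urls

# Hypothetical sample records, used only to exercise the loader
sample = [
    {'headline': 'example headline one', 'is_sarcastic': 0,
     'article_link': 'https://example.com/1'},
    {'headline': 'example headline two', 'is_sarcastic': 1,
     'article_link': 'https://example.com/2'},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.json')
    with open(sample_path, 'w') as f:
        json.dump(sample, f)
    sentences, labels, urls = load_headlines(sample_path)

print(sentences)
print(labels)
```

Once the real file is in place, load_headlines('sarcasm.json') would return the same three lists that the loop above builds.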