In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Week 1

## Intro to NLP

- The first step is to represent text (characters, words, and sentences) as numbers that can be fed into a neural network.
- Character-level encodings such as ASCII lose the semantic value of words. For example, LISTEN and SILENT contain exactly the same letters, and therefore the same character codes, yet mean different things.
- `Word-based encodings` map whole words to numbers, which helps retain the semantic information of words.
- For example:
<p style="padding: 10px; border: 2px solid black;">
I love my dog = 1 2 3 4 <br>
I love my cat = 1 2 3 5 <br><br>
=> I=1, love=2, my=3, dog=4, cat=5 <br><br>
[1, 2, 3, 4] and [1, 2, 3, 5] differ only in the last number, so the encodings alone suggest that the original sentences are similar.
</p>
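The contrast between the two encodings can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Keras implements it; the helper names `char_codes`, `build_vocab`, and `encode` are made up for this example, and words are numbered in order of first appearance to match the box above.

```python
# Character-level view: anagrams use exactly the same character codes,
# so the codes alone cannot tell LISTEN and SILENT apart by meaning.
def char_codes(word):
    return [ord(ch) for ch in word]

listen = char_codes("LISTEN")
silent = char_codes("SILENT")
print(sorted(listen) == sorted(silent))  # same multiset of codes

# Word-level view: give each distinct word its own number,
# in order of first appearance (hypothetical helper, not a Keras API).
def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(sentence, vocab):
    return [vocab[word] for word in sentence.split()]

sentences = ["I love my dog", "I love my cat"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]

print(vocab)
print(encoded)
```

The two encoded sentences share their first three numbers, which is the similarity the word-based encoding is meant to preserve.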

## Word Encodings

**TensorFlow APIs for word encodings**

`Tokenizer`: builds a dictionary of word encodings and converts sentences into sequences of those encodings.

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

# num_words: maximum number of words to keep, based on word frequency.
# Only the num_words - 1 most common words are used when generating sequences.
tokenizer = Tokenizer(num_words = 100)

tokenizer.fit_on_texts(sentences) # builds the vocabulary from the data (list of sentences)
word_index = tokenizer.word_index # word_index property returns a dict of word-index pairs

print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


More frequent words have a lower index.

As the output shows, `Tokenizer` lowercases the text and strips punctuation by default. The `lower` and `filters` arguments of `Tokenizer` override this default behaviour.

In [8]:
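# filters='' keeps punctuation and lower=False keeps case.
# num_words does not change word_index, but a value of 1 would leave no words when generating sequences.
tokenizer = Tokenizer(num_words = 100, filters='', lower=False)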
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'I': 3, 'dog': 4, 'cat': 5, 'You': 6, 'dog!': 7}
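
To make the ranking and normalization rules concrete, here is a rough pure-Python sketch that reproduces both dictionaries printed above. It is a simplified stand-in, not the Keras implementation: `build_word_index` and `DEFAULT_FILTERS` are names invented for this example, the filter string is modeled on the Keras default, and ties in frequency are assumed to keep first-appearance order.

```python
from collections import Counter

# Punctuation set modeled on the Keras default (an assumption of this sketch).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, lower=True, filters=DEFAULT_FILTERS):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        if lower:
            text = text.lower()
        # Replace every filtered character with a space before splitting.
        for ch in filters:
            text = text.replace(ch, " ")
        counts.update(text.split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

default_index = build_word_index(sentences)
raw_index = build_word_index(sentences, lower=False, filters='')

print(default_index)
print(raw_index)
```

With the defaults, `dog` and `dog!` collapse into one entry; with `lower=False` and `filters=''` they stay separate, which is why the second dictionary has seven entries instead of six.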


## Text to sequence

In [9]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


The `texts_to_sequences()` method can encode any set of sentences, not only the ones the tokenizer was fitted on. It uses the word_index built during the call to `fit_on_texts()`.

In [11]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq) # words not in the word_index will be lost

[[4, 2, 1, 3], [1, 3, 1]]


The `oov_token` argument to `Tokenizer` ensures that words outside the vocabulary are replaced by a placeholder token instead of being ignored.

In [15]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
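
The lookup behind these results can be approximated in plain Python. The sketch below is an illustration under simplifying assumptions, not the Keras code: `texts_to_sequences_sketch` is a hypothetical helper, and it assumes the input is already free of punctuation, so it only lowercases and splits on whitespace.

```python
def texts_to_sequences_sketch(texts, word_index, oov_token=None):
    """Map each word to its index; unknown words are dropped, or replaced
    by the OOV index when an oov_token is present in word_index."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None:
                seq.append(idx)
            elif oov_id is not None:
                seq.append(oov_id)
            # otherwise the unknown word is silently skipped
        sequences.append(seq)
    return sequences

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Vocabularies copied from the outputs printed earlier in this section.
plain_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6,
               'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
oov_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
             'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

without_oov = texts_to_sequences_sketch(test_data, plain_index)
with_oov = texts_to_sequences_sketch(test_data, oov_index, oov_token='<OOV>')

print(without_oov)
print(with_oov)
```

Using the OOV index keeps each encoded sentence the same length as the original sentence, at the cost of mapping every unknown word to the same number.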


**Padding**

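The sequences have different lengths, so they need padding to give the neural network inputs of the same shape. By default, `pad_sequences` adds zeros at the front of each sequence until it matches the longest one. `padding='post'` adds the zeros at the end instead, `maxlen` caps the length, and `truncating='pre'` drops tokens from the start of sequences that are too long.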

In [21]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)
padded_post = pad_sequences(sequences, padding='post', maxlen=5, truncating='pre')

print(word_index)
print(sequences)
print(padded)
print(padded_post)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]
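
The padding and truncation rules can be written out as a short pure-Python sketch. This is a simplified stand-in that returns nested lists rather than a NumPy array; `pad_sequences_sketch` is a hypothetical name, and the sketch assumes a non-empty list of sequences and a positive `maxlen`.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default target length: the longest sequence in the batch.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' puts them at the end.
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

padded = pad_sequences_sketch(sequences)
padded_post = pad_sequences_sketch(sequences, padding="post", maxlen=5,
                                   truncating="pre")

for row in padded:
    print(row)
for row in padded_post:
    print(row)
```

In the second call the last sentence loses its first two tokens (8 and 6), because it is longer than `maxlen` and truncation happens at the front.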
