# Understand Embeddings Using TensorFlow

We will create some simple embeddings to help illustrate how to change text to numbers so we can process it.

To learn more about this topic, watch https://www.youtube.com/watch?v=fNxaJsNG3-s.

# Tokenization - Text to Numbers

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [8]:
sentences = [
    "Please use a spoon",
    "Please use a napkin!",
    "Let's go to the store and buy some groceries",
    "The park is closed at night."
]

In [23]:
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

The word index lists every word in the sentences and assigns each one a number, but that alone is not enough for processing. Notice that the tokenizer lowercased the text and automatically removed punctuation such as `!`. Not all tools do this, so sometimes you need to prepare your data more carefully.

You may also notice that `Let's` was kept as a single word: it was lowercased to `let's`, but the apostrophe remained part of the "word".

In [24]:
print(word_index)

{'please': 1, 'use': 2, 'a': 3, 'the': 4, 'spoon': 5, 'napkin': 6, "let's": 7, 'go': 8, 'to': 9, 'store': 10, 'and': 11, 'buy': 12, 'some': 13, 'groceries': 14, 'park': 15, 'is': 16, 'closed': 17, 'at': 18, 'night': 19}
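
To see where these numbers come from, here is a minimal pure-Python sketch of the same idea: lowercase the text, strip punctuation, count the words, and rank them by frequency. It is an illustration only, not the Keras implementation. `FILTERS`, `split_words`, and `build_word_index` are hypothetical names, and the punctuation set is an assumption chosen to match the behavior seen above (the apostrophe is left out on purpose).

```python
from collections import Counter

# Assumed punctuation set for this sketch; the apostrophe is intentionally absent.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Number words from 1, most frequent first; ties keep first-seen order."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = [
    "Please use a spoon",
    "Please use a napkin!",
    "Let's go to the store and buy some groceries",
    "The park is closed at night.",
]

word_index = build_word_index(sentences)
print(word_index)
```

With the four sentences above, this sketch produces the same mapping as the printed word index: the four words that occur twice (`please`, `use`, `a`, `the`) come first, followed by the rest in order of first appearance.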


# Sequences (Sequences of Words) and What's `<OOV>`?

In [25]:
sequences = tokenizer.texts_to_sequences(sentences)

Each sentence is now a sequence of numbers based on the word index.

In [15]:
print(sequences)

[[1, 2, 3, 5], [1, 2, 3, 6], [7, 8, 9, 4, 10, 11, 12, 13, 14], [4, 15, 16, 17, 18, 19]]


If we now try to sequence a sentence containing words that have not been seen before, those words are simply dropped from the output sequence. So you need a large index to capture all the possible words, but no dictionary can ever be large enough. Let's create a new tokenizer that has a placeholder for missing words.

In [30]:
tokenizer2 = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer2.fit_on_texts(sentences)

In [32]:
test_data = ["Soon, the moon shall rise"]
test_seq = tokenizer2.texts_to_sequences(test_data)
print(tokenizer2.word_index)
print(test_seq)

{'<OOV>': 1, 'please': 2, 'use': 3, 'a': 4, 'the': 5, 'spoon': 6, 'napkin': 7, "let's": 8, 'go': 9, 'to': 10, 'store': 11, 'and': 12, 'buy': 13, 'some': 14, 'groceries': 15, 'park': 16, 'is': 17, 'closed': 18, 'at': 19, 'night': 20}
[[1, 5, 1, 1, 1]]
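
The difference between the two tokenizers comes down to what happens when a word is looked up and not found. The following is a minimal sketch of that lookup, assuming a hypothetical helper `to_sequence`, a simplified punctuation cleanup, and hand-copied subsets of the two word indexes printed above. It is not the Keras code.

```python
def to_sequence(text, word_index, oov_token=None):
    """Map each word to its index; unknown words are dropped unless oov_token is set."""
    # Simplified cleanup for this sketch: lowercase, then strip a few punctuation marks.
    words = [w.strip(",.!?") for w in text.lower().split()]
    sequence = []
    for word in words:
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            sequence.append(word_index[oov_token])
        # otherwise the unseen word is silently skipped
    return sequence

# Subsets of the two word indexes printed above (only the entries needed here).
index_without_oov = {"please": 1, "use": 2, "a": 3, "the": 4}
index_with_oov = {"<OOV>": 1, "please": 2, "use": 3, "a": 4, "the": 5}

text = "Soon, the moon shall rise"
print(to_sequence(text, index_without_oov))                  # [4]
print(to_sequence(text, index_with_oov, oov_token="<OOV>"))  # [1, 5, 1, 1, 1]
```

Without the placeholder, the five-word sentence collapses to a single index, so the sentence length is lost. With `<OOV>`, the length and word positions are preserved.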


# Real World Issues

Sentences have different lengths and lots of complexity. We need a way to map any sentence into a vector of fixed length. One option is to pad the input sequences to the same length, so that every input has the same size. Padding support is provided directly in Keras/TensorFlow. In addition, real data often consists of large, complex paragraphs rather than individual words or sentences, so the text needs further manipulation before it can be used.

In [34]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
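
As a rough picture of what padding does, here is a minimal pure-Python sketch. The helper `pad_to_length` is hypothetical, and padding with `0` at the end and cutting off extra items at the end are assumptions made for this illustration. `pad_sequences` exposes options for these choices.

```python
def pad_to_length(sequences, maxlen=None, value=0):
    """Return copies of the sequences, all with exactly maxlen items."""
    if maxlen is None:
        # Default to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])  # cut sequences that are too long
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

# The sequences printed earlier in this notebook.
sequences = [
    [1, 2, 3, 5],
    [1, 2, 3, 6],
    [7, 8, 9, 4, 10, 11, 12, 13, 14],
    [4, 15, 16, 17, 18, 19],
]

for row in pad_to_length(sequences):
    print(row)
```

Index 0 is free to use as the padding value because the word index starts numbering at 1.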

Working through all these issues would be a lot of work during our workshop. Fortunately, modular building blocks for creating sentence embeddings are already available.

We also want to create sentence embeddings that have semantic meaning. Simply converting sentences into numbers directly feels devoid of the richness in language. Also, we need to find a way to capture more "context" in a finite sized vector.

# Advanced Word Embeddings

Word embeddings capture semantic meaning by creating a numerical description of a word that takes into account all of the places the word is observed in various large documents, such as Wikipedia.
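
One common way to compare such numerical descriptions is cosine similarity. The sketch below uses three made-up 3-dimensional vectors purely for illustration. They are not taken from any trained model, and real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (NOT real embeddings), chosen by hand for this example.
toy_vectors = {
    "spoon":  [0.9, 0.1, 0.0],
    "napkin": [0.8, 0.3, 0.1],
    "park":   [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["spoon"], toy_vectors["napkin"]))
print(cosine_similarity(toy_vectors["spoon"], toy_vectors["park"]))
```

In this toy setup the two tableware words score much closer to 1 than the unrelated pair, which is the kind of relationship a trained embedding is meant to capture.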

In [None]:
# TODO: load a BERT embedding and show the embedding concept for a word using BERT.