# Building a vocabulary and word embeddings
The requirement is to build two models: one that uses its own embeddings and another that uses word embeddings built with Gensim. The first one will use `Tokenizer` to build a vocabulary and generate encoded sequences from the input text. The second model will use the vocabulary learned by Gensim, along with some helper functions for encoding and decoding sequences.

In [1]:
DATASETS_DIR = "../datasets"
MODELS_DIR = "../models"

## Load the sequences

In [2]:
# load doc into memory
def load_doc(filename):
  # open the file as read only; the with-block closes it on exit
  with open(filename, 'r') as file:
    # read all the text
    text = file.read()
  return text

in_filename = f"{DATASETS_DIR}/content_sequences.txt"
doc = load_doc(in_filename)
lines = doc.split('\n')
lines[:4]

['premier soccer league psl iri kudya magaka mambishi zvichitevera mhirizhonga yakaitika',
 'soccer league psl iri kudya magaka mambishi zvichitevera mhirizhonga yakaitika kubabourfields',
 'league psl iri kudya magaka mambishi zvichitevera mhirizhonga yakaitika kubabourfields kubulawayo',
 'psl iri kudya magaka mambishi zvichitevera mhirizhonga yakaitika kubabourfields kubulawayo nezuro']

## Encoding sequences using `Tokenizer`
The word embedding layer expects input sequences made up of integers. Each word in the doc can be mapped to a unique integer, and that mapping is then used to encode the text sequences. Later, when making predictions, the predicted integers can be converted back to words by looking them up in the same mapping. The `Tokenizer` from Keras will be used for this. It first needs to be fit on the entire training dataset so that it finds all the unique words in the data and assigns each a unique integer.
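
To make the mapping concrete, here is a simplified pure-Python sketch of the idea: rank words by frequency, number them from 1, then encode and decode with that mapping. It is not Keras's implementation (the real `Tokenizer` also strips punctuation, among other things), and the names `build_word_index`, `encode`, `decode` and `toy_lines` are hypothetical, chosen only for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; index 0 is deliberately left unused
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Replace every known word with its integer; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def decode(sequence, word_index):
    """Look the integers up in the inverted mapping to recover the words."""
    index_word = {i: w for w, i in word_index.items()}
    return [index_word[i] for i in sequence]

# Toy data (assumed for illustration, not the real dataset)
toy_lines = ["psl iri kudya", "iri kudya magaka", "kudya magaka mambishi"]
word_index = build_word_index(toy_lines)
encoded = encode(toy_lines, word_index)
print(word_index)
print(encoded)
print(decode(encoded[0], word_index))
```

The most frequent word gets the smallest index, which matches the ordering visible in the learned word index printed further down.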

In [3]:
# integer encode the sequences of words
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [4]:
# vocabulary size (+1 because Tokenizer indices start at 1; index 0 is reserved)
vocab_size = len(tokenizer.word_index) + 1
vocab_size

2766
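
The `+ 1` matters because the word indices run from 1 up to `len(tokenizer.word_index)`, so any table indexed by them needs one extra row. A minimal sketch with a plain list standing in for an embedding table (the toy `word_index` and `embedding_dim` values are assumptions for illustration):

```python
# Toy word index in the style of Tokenizer.word_index: indices start at 1.
word_index = {"kuti": 1, "iyi": 2, "uye": 3}

vocab_size = len(word_index) + 1   # one extra row for the unused index 0
embedding_dim = 4                  # illustrative value only

# Stand-in for an embedding table: one row of floats per possible index.
table = [[0.0] * embedding_dim for _ in range(vocab_size)]

largest_index = max(word_index.values())
row = table[largest_index]         # in range only because of the +1
print(vocab_size, largest_index, len(row))
```

Without the extra row, looking up the word with the largest index would fall outside the table.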

In [5]:
# print a couple of word:index pairs from the learned word index
count = 0
print("-------------")
print("Word -> Index")
print("-------------")
for word, index in tokenizer.word_index.items():
    print(f"{word} -> {index}")
    count += 1
    
    if count > 4:
        break

-------------
Word -> Index
-------------
kuti -> 1
iyi -> 2
uye -> 3
vanoti -> 4
vanodaro -> 5


In [6]:
# Save the tokenizer for later usage (checkpoint)
from pickle import dump

# save the tokenizer; the with-block closes the file handle after writing
with open(f'{MODELS_DIR}/tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)
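
Restoring the checkpoint later is the mirror image of the save. The sketch below round-trips a plain dictionary through a temporary file so that it runs without Keras; the `stand_in` object and the temporary path are assumptions, and in the notebook the same `load` call would be pointed at `tokenizer.pkl` instead.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer; any picklable object round-trips the same way.
stand_in = {"kuti": 1, "iyi": 2, "uye": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pkl")

    # write the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)

    # read it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that come from a trusted source.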

## Encoding sequences using `Word2Vec`
I want to achieve the same goal as with `Tokenizer`, but using the vocabulary and word index learned by Gensim's `Word2Vec` instead, since those differ from the ones `Tokenizer` comes up with.

In [7]:
# create the word embeddings (sg=0 selects CBOW; min_count=1 keeps every word)
from gensim.models import Word2Vec

sentences = [line.split() for line in lines]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

In [8]:
# vocabulary size (number of words Word2Vec learned; its indices start at 0)
vocab_size = len(w2v_model.wv.key_to_index)
vocab_size

2765

In [9]:
# print a couple of word:index pairs from the learned word index
count = 0
print("-------------")
print("Word -> Index")
print("-------------")
for word, index in w2v_model.wv.key_to_index.items():
    print(f"{word} -> {index}")
    count += 1
    
    if count > 4:
        break

-------------
Word -> Index
-------------
kuti -> 0
iyi -> 1
uye -> 2
vanoti -> 3
vanodaro -> 4
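
The encoding and decoding helpers for this vocabulary could look roughly like the sketch below. The function names `encode_words` and `decode_indices` are hypothetical, and the toy dictionary and list stand in for `w2v_model.wv.key_to_index` and `w2v_model.wv.index_to_key` so that the example runs without a trained model. Skipping unknown words is also an assumption; raising an error would be an equally reasonable choice.

```python
def encode_words(words, key_to_index):
    """Turn a list of words into integer indices, skipping unknown words."""
    return [key_to_index[w] for w in words if w in key_to_index]

def decode_indices(indices, index_to_key):
    """Turn integer indices back into words."""
    return [index_to_key[i] for i in indices]

# Toy stand-ins shaped like w2v_model.wv.key_to_index and w2v_model.wv.index_to_key
key_to_index = {"kuti": 0, "iyi": 1, "uye": 2, "vanoti": 3, "vanodaro": 4}
index_to_key = ["kuti", "iyi", "uye", "vanoti", "vanodaro"]

encoded = encode_words("uye vanoti kuti".split(), key_to_index)
decoded = decode_indices(encoded, index_to_key)
print(encoded)
print(decoded)
```

Unlike the `Tokenizer` index, index 0 here belongs to a real word, so padding sequences with 0 would be indistinguishable from that word. Shifting every index up by one before padding is one possible way around this.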


In [10]:
# save the word embedding model
w2v_model.save(f"{MODELS_DIR}/word2vec.model")