# Continuous Bag of Words (CBOW)

© Data Trainers LLC. GPL v 3.0.

**Author:** Axel Sirota


In this notebook we will train a CBOW word embedding model from scratch on a well-known dataset: the Yelp reviews dataset. The dataset is hosted on Dropbox, and the cell that downloads the files is already written for you.

Take your time and pay attention to how easy the model is to define, and to the nuances of iterating over the dataset when generating training batches.

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook in this lab.

Follow the instructions. Good luck!



In [None]:
!pip install textblob 'keras-nlp' 'keras-preprocessing' 'gensim==4.2.0' np_utils

In [None]:
import multiprocessing
import tensorflow as tf
import sys
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
import np_utils
from tensorflow.keras.utils import to_categorical
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from textblob import TextBlob, Word
from keras_preprocessing.sequence import pad_sequences
import numpy as np
import random
import os
import pandas as pd
import gensim
import warnings
import nltk

TRACE = False  # Setting to true is useful when debugging to know which device is being used
embedding_dim = 50
epochs=100
batch_size = 500
BATCH = True

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto( device_count = {'GPU': gpus  , 'CPU': cores} , intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config)
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words

In [None]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv 'https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0'
fi

In [None]:
!bash get_data.sh

In [None]:
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})

In [None]:
# Create a corpus that keeps only the sentences with more than 3 words
corpus = [None]

At this point we have a list (any iterable will do) of sentences that are longer than 3 words. Filtering out very short texts like this is common practice. Now we must `fit` the `Tokenizer` object on the corpus so that each word is mapped to an ID, and then use it to convert the corpus from lists of words into lists of identifiers.
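
To make the word-to-ID step concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras `Tokenizer` itself: the helper names `fit_word_index` and `texts_to_ids` and the toy corpus are hypothetical, and the sketch only lowercases and splits on whitespace. Like Keras, it gives the most frequent words the smallest IDs and starts at 1, which leaves 0 free for padding.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer ID, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties keep first-seen order because sorted() is stable
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word by its ID, skipping words never seen while fitting."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical toy corpus, only for illustration
toy_corpus = ["the food was great", "the service was slow", "great food"]
toy_index = fit_word_index(toy_corpus)
toy_ids = texts_to_ids(toy_corpus, toy_index)
print(toy_index)
print(toy_ids)
```

Because ID 0 is never assigned to a word, the vocabulary size used later in the notebook is the number of known words plus one.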


In [None]:
tokenizer = Tokenizer()
# Use the fit_on_texts method to fit the tokenizer
None # Fill

print(f'Before the tokenizer: {corpus[:1]}')

# Now use the same "trained" tokenizer to convert the corpus from words to IDs with the texts_to_sequences method
tokenized_corpus = None

print(f'After the tokenizer: {tokenized_corpus[:1]}')

In [None]:
nb_samples = sum(len(s) for s in tokenized_corpus)
vocab_size = len(tokenizer.word_index) + 1


In [None]:
print(f'First 5 corpus items are {tokenized_corpus[:5]}')
print(f'Length of corpus is {len(tokenized_corpus)}')



In [None]:
type(tokenized_corpus)

In [None]:
# This is the algorithmic part: batch the dataset and yield, as a generator, the window of context words and the expected middle word for each batch.
def generate_data(corpus, vocab_size, window_size=2, sentence_batch_size=15,  batch_size=250):
    # Shuffle a copy so the caller's list keeps its order
    corpus = list(corpus)
    np.random.shuffle(corpus)
    number_of_sentence_batches = (len(corpus) // sentence_batch_size) + 1
    maxlen = window_size*2
    for batch in range(number_of_sentence_batches):
        # Slice sentences by sentence_batch_size, the same unit used to count the sentence batches
        lower_end = batch*sentence_batch_size
        upper_end = (batch+1)*sentence_batch_size if batch+1 < number_of_sentence_batches else len(corpus)
        X = []
        Y = []
        for words in corpus[lower_end:upper_end]:
            L = len(words)
            for index, word in enumerate(words):
                contexts = []
                labels   = []
                s = index - window_size
                e = index + window_size + 1

                # Context = the words around position index; label = the word at index
                contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
                labels.append(word)

                x = pad_sequences(contexts, maxlen=maxlen)
                y = to_categorical(labels, vocab_size)
                X.append(x)
                Y.append(y)
        X = tf.constant(X)
        Y = tf.constant(Y)
        number_of_batches = len(X) // batch_size
        for real_batch in range(number_of_batches):
            # Index with real_batch so that each yielded batch is a different slice
            batch_lower = real_batch*batch_size
            batch_upper = (real_batch+1)*batch_size
            # X and Y have shape (n, 1, ...); drop only that middle axis
            batch_X = tf.squeeze(X[batch_lower:batch_upper], axis=1)
            batch_Y = tf.squeeze(Y[batch_lower:batch_upper], axis=1)
            yield (batch_X, batch_Y)

Let's draw one sample batch to see how X (the context windows) and y (the one-hot encoded middle words) are constructed.

In [None]:
iterable = generate_data(corpus=tokenized_corpus, vocab_size=vocab_size, batch_size=10)
sample_x, sample_y = next(iterable)

In [None]:
sample_y_numpy = sample_y.numpy()

sample_x

In [None]:

np.where(sample_y_numpy == 1)
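
The pairing of context windows with middle words can also be traced by hand on a tiny input. The following is a simplified sketch, not the generator above: `context_target_pairs` and the toy sentence of IDs are hypothetical, labels are kept as plain integers instead of one-hot vectors, and short contexts are left-padded with 0 on the assumption that `pad_sequences` is used with its default settings.

```python
def context_target_pairs(word_ids, window_size=2):
    """Return (left-padded context, target) pairs for one tokenized sentence."""
    maxlen = 2 * window_size
    pairs = []
    for index, target in enumerate(word_ids):
        start, end = index - window_size, index + window_size + 1
        context = [word_ids[i] for i in range(start, end)
                   if 0 <= i < len(word_ids) and i != index]
        # Left-pad with 0 so every context has exactly maxlen entries
        padded = [0] * (maxlen - len(context)) + context
        pairs.append((padded, target))
    return pairs

# Hypothetical sentence of word IDs, only for illustration
toy_sentence = [4, 7, 2, 9, 5]
for context, target in context_target_pairs(toy_sentence):
    print(context, "->", target)
```

Words near the start or end of a sentence have fewer neighbours, which is why their contexts contain padding zeros.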

Now comes the core part: defining the model. Keras provides a convenient `Sequential` model class to which you simply `add` layers of any type. Let's add an `Embedding` layer (which maps each word ID to a vector of size `embedding_dim`, 50 in this notebook), a `Lambda` layer to average the embeddings of the words in the context window, and a `Dense` layer with a softmax to select the most likely middle word. This is classic CBOW.
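
Before filling in the Keras layers, it may help to see the same three steps as plain arithmetic. This is a minimal sketch of one forward pass under stated assumptions, not the Keras model: `cbow_forward`, the toy sizes, and the random weights are all hypothetical, and no training happens here.

```python
import math
import random

def cbow_forward(context_ids, embeddings, dense_weights, dense_bias):
    """Embed the context words, average them, and return a softmax over the vocabulary."""
    dim = len(embeddings[0])
    vectors = [embeddings[i] for i in context_ids]
    # Average over the context window (the role of the Lambda layer)
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Dense layer: one logit per vocabulary word
    logits = [sum(mean[d] * dense_weights[d][w] for d in range(dim)) + dense_bias[w]
              for w in range(len(dense_bias))]
    # Numerically stable softmax
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy sizes and random weights, only for illustration
rng = random.Random(0)
toy_vocab_size, toy_dim = 6, 3
toy_embeddings = [[rng.uniform(-1, 1) for _ in range(toy_dim)] for _ in range(toy_vocab_size)]
toy_weights = [[rng.uniform(-1, 1) for _ in range(toy_vocab_size)] for _ in range(toy_dim)]
toy_bias = [0.0] * toy_vocab_size

probs = cbow_forward([1, 2, 4, 5], toy_embeddings, toy_weights, toy_bias)
print(len(probs), round(sum(probs), 6))
```

The output is one probability per vocabulary entry. During training, the loss compares this distribution with the one-hot vector of the true middle word.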


In [None]:
cbow = Sequential()
cbow.add()  # Add an Embedding layer with input_dim vocab_size, output_dim to be embedding_dim, and the input_length to be twice our window
cbow.add()  # Add a Lambda that takes a lambda function using the K.mean method to average the word embeddings over the window axis. The output_shape should be (embedding_dim,).
cbow.add()  # Add a classic Dense layer to just select with a softmax the best word
# Compile the model with a loss and optimizer of your liking.
cbow.compile()

In [None]:
cbow.summary()

In [None]:
def fit_model():
    if not BATCH:
        # If we are not batching, Fill how to get X AND Y
        X, Y = None # Fill
        print(f'Size of X is {X.shape} and Y is {Y.shape}')
        cbow.fit(X, Y, epochs = epochs)
    else:
        # Implement the batching logic to train the model (Hint: use the train_on_batch method of Keras models)
        pass

In [None]:
fit_model()

In [None]:
with open('./cbow_scratch_synonims.txt' ,'w') as f:
    f.write('{} {}\n'.format(vocab_size-1, embedding_dim))
    vectors = cbow.get_weights()[0]
    for word, i in tokenizer.word_index.items():
        str_vec = ' '.join(map(str, list(vectors[i, :])))
        f.write('{} {}\n'.format(word, str_vec))

In [None]:
w2v = gensim.models.KeyedVectors.load_word2vec_format('./cbow_scratch_synonims.txt', binary=False)



In [None]:
w2v.most_similar(positive=['gasoline'])

In [None]:
w2v.most_similar(negative=['apple'])