# Assignment

The goal of this assignment is to implement a skip-gram model with negative sampling by following the skeleton code provided below. Please refer to 05-Neural-Language-Models for the theory behind the skip-gram model and negative sampling. In the code, look for the `TODO` comments and `???` placeholders to see where you need to add your own code.

Email me with any questions you may have. We can also schedule an online meeting to discuss the assignment.


There are a few things you need to do:
- Complete the implementation of the `tokenize_doc` function to tokenize the input text and return a list of sentences where each sentence is a list of words (represented by unique integers). You also need to return two dictionaries, `word2idx` and `idx2word`, that map each word to a unique integer and vice versa.
- Complete the implementation of the `create_word_frequencies` function to create a numpy array of shape (vocabulary size,) containing the frequency of each word in the vocabulary.
- Complete the implementation of the `get_context` function to extract a list of context words for a given position in a sentence.
- Complete the implementation of the `nsd` function to calculate the negative sampling distribution from the word frequencies.
- Complete the implementation of the `sgd` function to perform a single gradient descent step and return the loss for the batch.
- Complete the implementation of the `train` function to train the skip-gram model using negative sampling. The function should return the trained model parameters and a list of losses, one per epoch.
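
To make the `get_context` and `nsd` steps more concrete, here is a minimal sketch on toy data. The names `context_window` and `negative_sampling_probs` are hypothetical and not part of the skeleton, and the choice to simply truncate the window at the sentence boundaries is an assumption of this sketch.

```python
import numpy as np


def context_window(pos: int, sentence: list, window_size: int) -> list:
    """Hypothetical helper: words within window_size of pos, excluding pos itself."""
    # Clamp the left edge at 0 so that a negative start does not wrap around
    start = max(0, pos - window_size)
    # Slicing past the end of a list is safe in Python, so the right edge needs no clamp
    end = pos + window_size + 1
    return sentence[start:pos] + sentence[pos + 1:end]


def negative_sampling_probs(freq: np.ndarray) -> np.ndarray:
    """Hypothetical helper: frequencies raised to the 3/4 power, then normalized."""
    adjusted = freq ** 0.75
    return adjusted / adjusted.sum()


# The sentence from the get_context docstring; 99 stands in for the center word
toy_sentence = [0, 12, 4, 2, 2, 8, 1, 99, 0, 55, 2, 2, 98, 2, 3]
window = context_window(pos=7, sentence=toy_sentence, window_size=3)

# Toy frequencies (assumed values) for a vocabulary of three words
toy_freq = np.array([16.0, 1.0, 81.0])
probs = negative_sampling_probs(toy_freq)
```

In this toy example the most frequent word has a raw share of 81/98 of all occurrences, but only 27/36 of the probability mass after the 3/4 power adjustment, so rarer words are drawn as negative samples somewhat more often than their raw frequency would suggest.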

You can mail your (group) solution to [amarov@feb.uni-sofia.bg](mailto:amarov@feb.uni-sofia.bg).


In [None]:
import nltk
import numpy as np
import spacy

from nltk.corpus import gutenberg

nltk.download('gutenberg')

alice = gutenberg.raw(fileids="carroll-alice.txt")

nlp = spacy.load("en_core_web_sm")

In [0]:

def tokenize_doc(text: str) -> (list, dict, dict):
    """
    :param text: The text input to be tokenized
    :return: Returns a tuple of (sentences, word2idx, idx2word)
        The sentences list contains lists of words (represented by unique integers) in each sentence. The
        pre-processing steps are:
            - Removal of punctuation
            - Removal of whitespace
            - Lowercase
        The word2idx dictionary maps each word to a unique integer.
        The idx2word dictionary maps each integer to a unique word.
    """
    sentences = []

    text_doc = nlp(text)

    # TODO
    # TODO

    for sentence in text_doc.sents:
        tokens = []

        for token in sentence:
            # TODO 
                continue
            token_normalized = # TODO
            tokens.append(token_normalized)

            if token_normalized not in word2idx:
                # TODO
                word2idx[token_normalized] = idx
                idx2word[idx] = token_normalized

        sentences.append([word2idx[token] for token in tokens])

    return # TODO

In [None]:

def create_word_frequencies(sentences: list, V: int) -> np.ndarray:
    """
    :param sentences: A list of sentences where each sentence is a list of words (represented by unique integers)
    :param V: The size of the vocabulary
    :return: A numpy array of shape (V,) containing the frequencies of each word in the vocabulary
    """

    # TODO
    
    return freq
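
One possible way to count word occurrences is sketched below on toy data. It assumes that "frequencies" means raw counts, which the skeleton does not state explicitly, and all variable names are hypothetical.

```python
import numpy as np

# Toy data: three sentences over a vocabulary of V = 5 word indices
toy_sentences = [[0, 1, 2], [1, 1, 3], [2, 1]]
V = 5

# Flatten the sentences into one sequence of word indices
flat = [word for sentence in toy_sentences for word in sentence]

# np.bincount counts how often each non-negative integer occurs;
# minlength=V keeps a slot for indices that never appear (here: index 4)
toy_freq = np.bincount(flat, minlength=V)
```

Passing `minlength` matters because the result must have shape `(V,)` even when the highest word indices do not occur in the data.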


In [None]:

def train(
        sentences_num: list,
        word_freqs: np.ndarray,
        window_size: int = 5,
        learning_rate: float = 0.025,
        num_negatives: int = 5,
        drop_threshold: float = 1e-5,
        epochs: int = 10,
        D: int = 50
):
    """
    Trains a skip-gram model using a list of sentences where each word is encoded by a unique integer.
    :param sentences_num: A list of sentences
    :param word_freqs: A numpy array holding the word frequencies
    :param window_size: The size of the window to use for context words
    :param learning_rate: The learning rate to use for gradient descent
    :param num_negatives: The number of negative samples to draw per input word
    :param drop_threshold: The threshold for subsampling frequent words
    :param epochs: The number of epochs to train for
    :param D: The dimensionality of the word vectors (word embeddings)
    :return:
        A tuple of trained model parameters and a list of losses at each epoch: W1, W2, losses
    """
    # The vocabulary size (number of unique words)
    V = ???

    # The negative sampling distribution (see the nsd function)
    neg_sampling_dist = ???

    # Print basic information about the training data
    print(f"Training with vocabulary size {V:5}")

    # Print the parameters
    print(
        f"""
            window_size: {window_size},
            embedding size D={D},
            number of negative samples: {num_negatives},
            epochs: {epochs}, 
            learning rate: {learning_rate}"""
    )

    # First we initialize the model parameters
    # which are the two weight matrices

    W1 = # TODO: initialize the input to hidden weights matrix
    W2 = # TODO: initialize the hidden to output weights matrix

    # The probability distribution for drawing negative samples
    prob_negative = ???

    # We will store the loss of each epoch in a list, so we can plot the losses later
    losses = []

    # for subsampling each sentence
    p_drop = 1 - np.sqrt(drop_threshold / prob_negative)

    # Start training the model

    for epoch in range(epochs):
        # randomly order sentences, so we don't always see
        # sentences in the same order
        np.random.shuffle(sentences_num)

        # Initialize the loss for this epoch
        loss = ???

        # A variable to keep track of the number of processed sentences
        #  in this epoch
        counter = ???

        for sent in sentences_num:
            # Drop words from the sentence with probability p_drop
            
            # TODO
            
            if len(sent) < 2:
                continue

            # Start iterating over the words in the sentence
            for pos, word in enumerate(sent):
                # get the positive context words/negative samples
                context_words = ???
                context_words_array = ???

                neg_words = # TODO

                # Now we have the input (the center word) and the targets (the positive context words and negative samples)
                # We can call the SGD function for the positive words (that are actually in the context)
                c = ???
                loss += c

                # And then for the negative samples
                c = ???
                
                loss += c

            counter += 1

            # Print some information about the training progress
            if counter % 100 == 0:
                print(f"processed {counter} / {len(sentences_num)}", end="\r")

        # Print the number of the epoch and the loss
        print("epoch complete:", epoch, "loss:", loss)

        # save the loss
        losses.append(loss)

    # return the model
    return W1, W2, losses
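
The subsampling step inside the training loop can be sketched as follows. This is a minimal illustration, not the required solution: the helper `subsample`, the random generator `rng`, and the toy probabilities are hypothetical, while the formula for `p_drop` is the one used in the skeleton above.

```python
import numpy as np


def subsample(sentence: list, p_drop: np.ndarray, rng: np.random.Generator) -> list:
    """Hypothetical helper: keep each word w with probability 1 - p_drop[w]."""
    # A uniform draw in [0, 1) falls below p_drop[w] with probability p_drop[w];
    # a negative p_drop[w] (a rare word) therefore never triggers a drop.
    return [w for w in sentence if rng.random() >= p_drop[w]]


# Toy word probabilities for a vocabulary of three words (assumed values)
toy_probs = np.array([0.9, 0.0999, 0.0001])
drop_threshold = 1e-3  # larger than the skeleton's default, to suit the toy data

# Same formula as in the skeleton
p_drop = 1 - np.sqrt(drop_threshold / toy_probs)

rng = np.random.default_rng(seed=0)
toy_sentence = [0, 1, 2, 0, 0, 1, 2, 0]
kept = subsample(toy_sentence, p_drop, rng)
```

With these toy values the two frequent words are dropped most of the time, while the rare word (index 2) has a negative drop probability and is always kept.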


In [None]:
def nsd(freq: np.ndarray) -> np.ndarray:
    """
    Calculate the negative sampling distribution from the word frequencies.
    :param freq: A numpy array holding the word frequencies
    :return:
    A numpy array of the same shape as freq containing the probabilities of drawing each word as a negative sample.
    """

    # Deemphasize frequent words by raising their frequencies to the 3/4 power
    # TODO

    # Now we sum all the adjusted frequencies and divide each adjusted frequency by the sum
    # so that the adjusted frequencies now form a probability distribution.
    # TODO

    return p

def get_context(pos: int, sentence: list, window_size: int) -> list:
    """
    Return the context words for a given position in a sentence.
    :param pos: The index of the center word in the sentence
    :param sentence: A list of words (encoded as integers)
    :param window_size: The size of the window to use for context words
    :return:
    A list of context words (encoded as integers).
    For example, with window size 3 and the sentence 0 12 4 2 [2 8 1 pos 0 55 2] 2 98 2 3,
    where pos marks the center word, the function returns the list [2, 8, 1, 0, 55, 2].
    """

    # TODO
    
    return context


def sigmoid(x: np.ndarray) -> np.ndarray:
    """
    The sigmoid activation function, applied element-wise
    :param x: np.ndarray
    :return: np.ndarray of the same shape as x
    """
    return 1 / (1 + np.exp(-x))


def sgd(
        center_word_idx: int,
        target_words_indices: list,
        is_context_word: int,
        lr: float,
        W1: np.ndarray,
        W2: np.ndarray
) -> np.float64:
    """
    Performs a single step of stochastic gradient descent. It updates the weights in the
    W1 and W2 matrices.
    :param center_word_idx: The index of the center word
    :param target_words_indices: The indices of the target words
    :param is_context_word: A 0/1 flag indicating whether the target word is a context word (1) or a negative sample (0)
    :param lr: The learning rate to use for gradient descent
    :param W1: The matrix of hidden weights (input to hidden)
    :param W2: The matrix of output weights (hidden to output)
    :return: The loss for the batch
    """

    # W1[center_word_idx] shape: D
    # W2[:, target_words_indices] shape: D x N (N is the number of target words)
    # activation shape: N

    # Because the input is a one-hot-encoded vector
    # with one at the index position of the input word and zero otherwise,
    # multiplying the input with the hidden weights is equivalent
    # to selecting the hidden weights corresponding to the input word.
    # This is why we don't need to perform the full matrix multiplication here,
    # and we can just pull the corresponding row from W1 by using np indexing.

    # Furthermore, we only need the output weights corresponding to the target words
    # as these are the only ones relevant for the calculation of the probability
    # of the target words given the input word.

    # First we calculate the net value of the output neuron
    a = ???

    # Now we pass the net value through the sigmoid activation function
    # to get the probability of the target words given the input word
    p = ???

    # The is_context_word is the target used in the calculation of the loss
    gW2 = ???
    gW1 = ???

    W2[:, target_words_indices] -= lr * gW2  # D x N
    W1[center_word_idx] -= lr * gW1  # D

    # Calculate the loss (binary cross entropy) for each sample
    # We add a small constant to the probabilities to avoid taking the log of zero
    # for very small probabilities that may occur due to finite precision arithmetic.

    loss = ???
    
    # Finally, return the total loss for the batch
    return loss.sum()
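
The gradients needed in `sgd` can be derived as follows. This is a sketch under assumptions: it uses the binary cross-entropy loss named in the comments above, and the notation ($c$ for `center_word_idx`, $h$ for the row `W1[c]`, $v_j$ for the column `W2[:, j]` of target word $j$, and $t \in \{0, 1\}$ for `is_context_word`) is introduced here for illustration only.

For each target word $j$, the net value and the predicted probability are

$$a_j = h^\top v_j, \qquad p_j = \sigma(a_j) = \frac{1}{1 + e^{-a_j}}.$$

The loss over the $N$ target words is

$$L = -\sum_{j=1}^{N} \left[ t \log p_j + (1 - t) \log (1 - p_j) \right].$$

Because $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$, the derivative with respect to the net value simplifies to

$$\frac{\partial L}{\partial a_j} = p_j - t,$$

and the chain rule gives

$$\frac{\partial L}{\partial v_j} = (p_j - t)\, h, \qquad \frac{\partial L}{\partial h} = \sum_{j=1}^{N} (p_j - t)\, v_j.$$

Stacking the $N$ vectors $\partial L / \partial v_j$ as columns yields a $D \times N$ matrix, which matches the shape of `W2[:, target_words_indices]`, while $\partial L / \partial h$ has length $D$, which matches `W1[center_word_idx]`. Both gradients should be computed from the weights as they were before either update is applied.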


In [None]:
def tokenize_doc(text: str) -> (list, dict, dict):
    # Create an empty list to store the sentences
    sentences = []

    # Pass the text through spacy's pipeline
    text_doc = nlp(text)
    
    # Create a dictionary to store the word to index mapping
    word2idx = {}

    # Create a dictionary to store the index to word mapping
    idx2word = {}
    
    # Iterate over the sentences in the text
    for i, sentence in enumerate(text_doc.sents):
        # For each sentence, create a list to store its tokens
        tokens = []
        
        # Iterate over the tokens in the sentence
        for token in sentence:
            # Omit spaces and punctuation
            if ???:
                continue

            # Lowercase the token
            token_normalized = ??? 

            # Append the lowercased token to the list of tokens
            tokens.append(token_normalized)
            
            # If the token is not in the word2idx dictionary, add it
            if token_normalized not in word2idx:
                # The indices of the tokens must be unique, 
                # so taking the number of entries in the word2idx dictionary will give us the next index            
                idx = len(word2idx)

                # Add the token to the word2idx and idx2word dictionaries
                word2idx[token_normalized] = idx
                idx2word[idx] = token_normalized
        
        # Encode the tokens as integers and append them to the list of sentences
        sentences.append([word2idx[token] for token in tokens])

    return sentences, word2idx, idx2word

In [0]:
sentences, word2idx, idx2word = tokenize_doc(alice)
word_frequencies = create_word_frequencies(sentences, len(word2idx))

W1, W2, costs = train(sentences, word_frequencies)
