### 1. Learning Checkpoint: Word Embeddings
- Reference: https://sites.cs.ucsb.edu/~alexmei/documents/presentations/word_embeddings.pdf

#### I. Preliminaries
* Word Embeddings: translations of text into alternative (numeric) representations
* Why? Numeric representations are easier to manipulate and quantify
- Assume the corpus $C$ is the set of unique words in the vocabulary under consideration
- **Bag of Words**: vector of length $|C|$ in which element $i$ is the count of word $i$ in the text
- **One Hot Vector**: instead of the count of word $i$, use 1 if word $i$ appears in the text, else 0
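
As a minimal sketch of the two representations above, the snippet below builds both vectors for a toy sentence. The vocabulary `vocab`, the helper names `bag_of_words` and `one_hot`, and the lowercase whitespace tokenizer are illustrative assumptions rather than part of the reference slides.

```python
import numpy as np

# Hypothetical toy vocabulary: maps each unique word to its vector index.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}

def bag_of_words(text: str, vocab: dict) -> np.ndarray:
    """Count how many times each vocabulary word occurs in the text."""
    v = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        if word in vocab:  # words outside the vocabulary are ignored
            v[vocab[word]] += 1
    return v

def one_hot(text: str, vocab: dict) -> np.ndarray:
    """Mark each vocabulary word as present (1) or absent (0)."""
    return (bag_of_words(text, vocab) > 0).astype(int)

sentence = "the cat sat on the mat"
bow = bag_of_words(sentence, vocab)
oh = one_hot(sentence, vocab)
print("Bag of Words:", bow)    # [2 1 1 1 1 0], since "the" occurs twice
print("One Hot Vector:", oh)   # [1 1 1 1 1 0]
```

The two vectors differ only where a word repeats, which is the distinction the checkpoint below relies on.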

Checkpoint:
- Let `d = {0: "Alex", 1: "William", 2: "machine(s)", 3: "learning", 4: "student", 5: "professor"}`, where `v[i]` is the vector entry for the word `d[i]`.
- (i) What is the bag of words representation for "Alex is a machine learning student that likes machines"? Denote this as $v_1$.
- (ii) What is the one hot vector representation for "William is a machine learning professor that likes machines"? Denote this as $v_2$.
- (iii) What is a limitation of the bag of words / one hot vector representation?

In [None]:
import numpy as np

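# Bag of words: "machine" and "machines" both count toward index 2.
v1 = np.array([1, 0, 2, 1, 1, 0])
# One hot vector: 1 if the word appears at all, else 0.
v2 = np.array([0, 1, 1, 1, 0, 1])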

#### II. Similarity
* Objective: given two vectors, $v_i, v_j$, compare their similarity.
* Example Task: plagiarism checker for text similarity
* **Euclidean Distance**: $$||v_i - v_j||_2$$
* **Dot Product**: $$v_i^T v_j$$
* **Cosine Similarity**: $$\frac{v_i^T v_j}{||v_i||_2 ||v_j||_2}$$
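
To get a feel for how the three metrics behave, the sketch below compares a vector with a scaled copy of itself. The vectors `a` and `b` and the helper function names are made-up examples chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """L2 norm of the difference; 0 means the vectors are identical."""
    return float(np.linalg.norm(u - v))

def dot_product(u: np.ndarray, v: np.ndarray) -> float:
    """Sum of element-wise products; grows with the vectors' lengths."""
    return float(u @ v)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Dot product of the two vectors after normalising each to unit length."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 2.0])  # has L2 norm 3
b = 3 * a                      # same direction as a, three times as long

print("Euclidean Distance:", euclidean_distance(a, b))  # 6.0
print("Dot Product:", dot_product(a, b))                # 27.0
print("Cosine Similarity:", cosine_similarity(a, b))    # 1.0
```

Both vectors point in the same direction, yet only cosine similarity reports them as maximally similar; the other two metrics also react to the difference in length.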

Checkpoint:
* (i) What is the similarity of $v_1, v_2$ using Euclidean Distance?
* (ii) What is the similarity of $v_1, v_2$ using Dot Product?
* (iii) What is the similarity of $v_1, v_2$ using Cosine Similarity?
* (iv) What is a disadvantage of the Euclidean distance metric? (Hint: think about dimensionality.)
* (v) What is a disadvantage of the dot product metric? (Hint: think about range.)
* (vi) What is a limitation of the cosine similarity metric? (Hint: think about domain.)

In [None]:
import numpy.linalg as npla

# v1 and v2 are 1-D arrays, so no transpose is needed before @.
print("Euclidean Distance:", npla.norm(v1 - v2, 2))
print("Dot Product:", v1 @ v2)
print("Cosine Similarity:", v1 @ v2 / (npla.norm(v1, 2) * npla.norm(v2, 2)))

#### III. Context
* Objective: make better representations by adding context to the word representation.
* Example: consider words with multiple meanings, such as 'rob a rich bank' vs 'sleep by the river bank'.
* **N-Gram**: instead of using a single word as a token in the corpus, use a window of n words as the token.
* **Window-Based**: define a $|C| \times |C|$ matrix $A$ with `A[i][j] = 1` if `d[j]` appears within the context window of size $n$ around `d[i]`, else 0.

Checkpoint:
* Consider the sentence "Ryan likes to play Genshin Impact". 
* (i) What are the 2-grams for this sentence?
* (ii) How many possible 3-grams are in this sentence?
* (iii) What is a limitation of this N-Gram model? (Hint: think about what happens as n increases.)
* Now, consider the sentence "Vaishnavi loves all things sugar" with `d = {0: "Vaishnavi", 1: "loves", 2: "all", 3: "things", 4: "sugar"}`.
* (iv) What is the window-based representation of this sentence with context window size 2?
* (v) What is a limitation of this Window-Based model?
* (vi) What is a potential solution to the limitation of the Window-Based model? 

In [None]:
def ngram_extractor(word_list: list, n: int) -> list:
  return [" ".join(word_list[i: i+n]) for i in range(len(word_list) - n + 1)]

print("2-Grams:", ngram_extractor("Ryan likes to play Genshin Impact".split(" "), 2))
print("Number of 3-Grams:", len(ngram_extractor("Ryan likes to play Genshin Impact".split(" "), 3)))

In [None]:
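def window_extractor(word_list: list, corpus: dict, n: int) -> list:
  # A is |C| x |C|, indexed by corpus ids rather than sentence positions.
  size = len(corpus)
  A = [[0] * size for _ in range(size)]

  for i in range(len(word_list)):
    for j in range(len(word_list)):
      # j is in the window of i when the two positions are fewer than n apart.
      if i - n < j < i + n and i != j:
        A[corpus[word_list[i]]][corpus[word_list[j]]] = 1

  return A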

print("Size 2 Window:", window_extractor("Vaishnavi loves all things sugar".split(" "), {"Vaishnavi": 0, "loves": 1, "all": 2, "things": 3, "sugar": 4}, 2))

#### IV. Learned Representations
* Machine learning does well on a variety of tasks. What if we just train a NN to learn a dense representation?
* **Continuous Bag of Words**: predict the target word given its context word(s).
  * Input: $1 \times |C|$ representation of the context word (i.e., one-hot).
  * Hidden layer of size $M$; its activation is the learned dense representation.
  * Output: $1 \times |C|$ vector of scores over the vocabulary, used to predict the target word.
* **Skip gram**: predict the context words given the target word.
  * Input: $1 \times |C|$ representation of the target word (i.e., one-hot).
  * Hidden layer of size $M$; its activation is the learned dense representation.
  * Output: $1 \times |C|$ vector of scores over the vocabulary, used to predict a context word.
* **Softmax Function** (paired with a cross-entropy loss): for class $i$ out of $K$ classes, converts the scores $v_1, \dots, v_K$ into a probability,
$$\sigma(v)_i = \frac{\exp(v_i)}{\sum_{k=1}^{K} \exp(v_k)}$$
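
The forward pass shared by both architectures can be sketched in a few lines of NumPy. This is an illustrative sketch, not the referenced implementation: the sizes `C = 6` and `M = 3`, the random weights `W1` and `W2`, the zero biases `b1` and `b2`, and the function names are all assumptions, and no training is performed.

```python
import numpy as np

def softmax(v: np.ndarray) -> np.ndarray:
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = v - np.max(v)  # subtracting the max avoids overflow in exp
    exp_v = np.exp(shifted)
    return exp_v / exp_v.sum()

rng = np.random.default_rng(0)
C, M = 6, 3  # assumed vocabulary size and hidden-layer size

# Randomly initialised (untrained) parameters, for illustration only.
W1, b1 = rng.normal(size=(C, M)), np.zeros(M)
W2, b2 = rng.normal(size=(M, C)), np.zeros(C)

def forward(x: np.ndarray):
    """Map a one-hot input to its hidden vector and output probabilities."""
    hidden = x @ W1 + b1       # dense 1 x M representation of the input word
    scores = hidden @ W2 + b2  # one score per vocabulary word
    return hidden, softmax(scores)

x = np.zeros(C)
x[2] = 1.0  # one-hot input for the word at index 2
hidden, probs = forward(x)
print("Hidden representation:", hidden)
print("Sum of output probabilities:", probs.sum())
```

Because the input is one-hot and the biases are zero here, `hidden` is simply row 2 of `W1`, which is why the rows of the first weight matrix can be read off as word vectors once the network is trained.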

For the implementation, refer to https://towardsdatascience.com/a-word2vec-implementation-using-numpy-and-python-d256cf0e5f28 
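
Before any training, a Skip gram model needs (target, context) pairs extracted from text. A minimal sketch of that step follows; the function name `skipgram_pairs` is hypothetical, and `window` is assumed here to count words on each side of the target.

```python
def skipgram_pairs(tokens: list, window: int) -> list:
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window so it never runs past either end of the sentence.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("Vaishnavi loves all things sugar".split(), window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair supplies one training example: the one-hot vector of the target is the input and the index of the context word is the label. For CBOW the roles are swapped.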

Checkpoint:
* (i) What type of learning problem is CBOW/Skipgram? (e.g., Supervised, Unsupervised, Self-Supervised, Weakly Supervised)
* (ii) Draw the neural network that represents the CBOW/Skipgram architecture.
* (iii) How does the softmax function relate to the sigmoid function?
* (iv) How many learnable parameters does the CBOW/Skipgram architecture have? (Don't forget to include bias.)
* (v) What is a potential limitation of the CBOW/Skipgram models? 
* (vi) What is a potential solution to this limitation?