##### **Importing Libraries.**

In [1]:
import numpy as np 
import pandas as pd 

from collections import Counter 
from sklearn.neighbors import NearestNeighbors

##### **Loading the Corpus.**

In [2]:
corpus = [
    "knowing the name of something is different from knowing something".split(),
    "knowing something about everything is alright".split(),
]

##### Q1 : **Suppose that we construct a vocabulary that contains the unique words in the corpus. What is the size of the vocabulary, V?**

In [3]:
V = Counter(corpus[0])
V.update(corpus[1])

In [4]:
V.most_common()

[('knowing', 3),
 ('something', 3),
 ('is', 2),
 ('the', 1),
 ('name', 1),
 ('of', 1),
 ('different', 1),
 ('from', 1),
 ('about', 1),
 ('everything', 1),
 ('alright', 1)]

In [5]:
len(V)

11

##### Q2 : **Suppose that we use "one hot encoding" to convert the words in the sentences to vectors. Then each word in the sentences is converted to a vector of size?**

With one-hot encoding, each word vector has one entry per vocabulary word, so its size equals the size of the vocabulary, i.e. **"11"**.
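As a minimal sketch of why the vector size equals the vocabulary size, the snippet below builds one-hot vectors with plain Python lists. The helper name `one_hot` and the alphabetical index order are assumptions made for illustration; they are not part of the original notebook.

```python
# Illustrative only: `one_hot` is a hypothetical helper, not from the notebook.
sentences = [
    "knowing the name of something is different from knowing something".split(),
    "knowing something about everything is alright".split(),
]

# Assumption: index the words alphabetically; any fixed order would work.
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


vec = one_hot("knowing")
print(len(vec), sum(vec))  # vector length and number of non-zero entries
```

Every vector has exactly one non-zero entry, and its length is fixed by the vocabulary, not by the sentence the word appears in.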

##### Q3 : **Construct a co-occurrence matrix using the vocabulary developed in question 1.** 

**However, drop the following words from the vocabulary (and hence from the sentences): of, the, alright, about, from.**

**Arrange the words in the vocabulary in alphabetical order. Use a window of size 1, k=1.** 

**How many non-zero entries are there in the matrix?**

In [6]:
dropped_words = "of the alright about from".split()

for i in dropped_words:
    del V[i]

In [7]:
V

Counter({'knowing': 3,
         'name': 1,
         'something': 3,
         'is': 2,
         'different': 1,
         'everything': 1})

Sort "V" into alphabetical order.

In [8]:
V = {k: V[k] for k in sorted(V)}
V

{'different': 1,
 'everything': 1,
 'is': 2,
 'knowing': 3,
 'name': 1,
 'something': 3}

In [9]:
# Keep only the words that were not dropped from the vocabulary.
corpus = [[w for w in sentence if w not in dropped_words] for sentence in corpus]

In [10]:
corpus

[['knowing', 'name', 'something', 'is', 'different', 'knowing', 'something'],
 ['knowing', 'something', 'everything', 'is']]

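Define a function that quantifies the association between two words: it counts how often a context word appears within a window around each occurrence of the target word.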

In [11]:
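def co_occurence(word, context, window_size, corpus):
    """Count how often `context` appears within `window_size` positions of `word`."""
    n_occur = 0

    for sentence in corpus:
        # Positions at which `word` occurs in this sentence.
        indices = [i for i, w in enumerate(sentence) if w == word]

        for index in indices:
            # Slicing clips at the end of the list, so only the start needs a guard.
            window = sentence[max(0, index - window_size) : index + window_size + 1]

            # The window includes `word` itself, so callers should skip word == context.
            n_occur += window.count(context)
    return n_occur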

In [12]:
X = pd.DataFrame(
    np.zeros((len(V), len(V)), dtype=int), index=V.keys(), columns=V.keys()
)

In [13]:
X

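| | different | everything | is | knowing | name | something |
|---|---|---|---|---|---|---|
| different | 0 | 0 | 0 | 0 | 0 | 0 |
| everything | 0 | 0 | 0 | 0 | 0 | 0 |
| is | 0 | 0 | 0 | 0 | 0 | 0 |
| knowing | 0 | 0 | 0 | 0 | 0 | 0 |
| name | 0 | 0 | 0 | 0 | 0 | 0 |
| something | 0 | 0 | 0 | 0 | 0 | 0 |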


In [14]:
for word in V:
    for context in V:

        if word != context:
            X.loc[word, context] = co_occurence(word, context, 1, corpus)

In [15]:
X

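| | different | everything | is | knowing | name | something |
|---|---|---|---|---|---|---|
| different | 0 | 0 | 1 | 1 | 0 | 0 |
| everything | 0 | 0 | 1 | 0 | 0 | 1 |
| is | 1 | 1 | 0 | 0 | 0 | 1 |
| knowing | 1 | 0 | 0 | 0 | 1 | 2 |
| name | 0 | 0 | 0 | 1 | 0 | 1 |
| something | 0 | 1 | 1 | 2 | 1 | 0 |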


In [16]:
(X.values != 0).sum()

16
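As a cross-check on the count above, the same entries can be sketched by counting adjacent word pairs directly, since a window of size 1 only ever pairs a word with its immediate neighbours. This is an alternative sketch, not the notebook's method; the names `filtered` and `pair_counts` are hypothetical.

```python
from collections import Counter

# Sentences after dropping: of, the, alright, about, from.
filtered = [
    ["knowing", "name", "something", "is", "different", "knowing", "something"],
    ["knowing", "something", "everything", "is"],
]

pair_counts = Counter()
for sentence in filtered:
    # zip pairs each word with its right-hand neighbour.
    for left, right in zip(sentence, sentence[1:]):
        if left != right:
            # Record both directions so the matrix is symmetric.
            pair_counts[(left, right)] += 1
            pair_counts[(right, left)] += 1

# Each key is one non-zero cell of the co-occurrence matrix.
non_zero = len(pair_counts)
print(non_zero)
```

Each key of `pair_counts` corresponds to one non-zero cell, so its length should agree with the count obtained from `X`.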

##### Q4 : **By using the co-occurrence matrix created in Q3, compute the cosine similarity between the words in the vocabulary. While computing cosine similarity, ensure that each word vector is normalized to have a unit magnitude. The word "knowing" is closest to which of the following words?**

In [17]:
X_norm = X / np.linalg.norm(X, axis=1).reshape(-1, 1)

In [18]:
pd.DataFrame(X_norm @ X_norm.T, index=V.keys(), columns=V.keys())

| | different | everything | is | knowing | name | something |
|---|---|---|---|---|---|---|
| different | 1.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.801784 |
| everything | 0.5 | 1.0 | 0.408248 | 0.57735 | 0.5 | 0.267261 |
| is | 0.0 | 0.408248 | 1.0 | 0.707107 | 0.408248 | 0.218218 |
| knowing | 0.0 | 0.57735 | 0.707107 | 1.0 | 0.57735 | 0.154303 |
| name | 0.5 | 0.5 | 0.408248 | 0.57735 | 1.0 | 0.534522 |
| something | 0.801784 | 0.267261 | 0.218218 | 0.154303 | 0.534522 | 1.0 |


Verify that each normalized word vector has unit magnitude.

In [19]:
np.linalg.norm(X_norm, axis=1)

array([1., 1., 1., 1., 1., 1.])

"Knowing" is closest to **"is"** (0.707107 is the highest value among all the other words).
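To make the comparison explicit, here is a small sketch that recomputes the cosine similarities for "knowing" using only the standard library. The rows are copied from the co-occurrence matrix in Q3; the helper `cosine` is a hypothetical name introduced for illustration.

```python
import math

words = ["different", "everything", "is", "knowing", "name", "something"]
# Rows of the co-occurrence matrix, columns in the same alphabetical order.
rows = {
    "different":  [0, 0, 1, 1, 0, 0],
    "everything": [0, 0, 1, 0, 0, 1],
    "is":         [1, 1, 0, 0, 0, 1],
    "knowing":    [1, 0, 0, 0, 1, 2],
    "name":       [0, 0, 0, 1, 0, 1],
    "something":  [0, 1, 1, 2, 1, 0],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Compare "knowing" with every other word and pick the largest score.
scores = {w: cosine(rows["knowing"], rows[w]) for w in words if w != "knowing"}
closest = max(scores, key=scores.get)
print(closest, round(scores[closest], 6))
```

Dividing by both norms inside `cosine` is equivalent to normalizing each row to unit magnitude first and then taking a plain dot product.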

##### Q5 : **Compute the Pointwise Mutual Information (PMI) for the pair (knowing, something). Take N=9 and log to the base 2.** (Enter the answer up to 3 decimal places)

Define the PMI function, with N fixed at 9 as given in the question.

In [20]:
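def pmi(word, context, corpus, window_size=1, n=9):
    """PMI = log2(N * count(word, context) / (count(word) * count(context)))."""
    count = co_occurence(word, context, window_size, corpus)

    # Total occurrences of each word across all sentences.
    count_context = sum(k.count(context) for k in corpus)
    count_word = sum(k.count(word) for k in corpus)

    # N defaults to 9, the value given in the question.
    return np.log2(count * n / (count_context * count_word))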

In [21]:
pmi("knowing", "something", corpus)

1.0
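For reference, the result can be checked by hand. The co-occurrence count for (knowing, something) is 2 in the matrix from Q3, and each of the two words occurs 3 times in the filtered corpus, so the definition used by `pmi` gives:

$$
\mathrm{PMI}(w, c) = \log_2 \frac{N \cdot \mathrm{count}(w, c)}{\mathrm{count}(w)\,\mathrm{count}(c)} = \log_2 \frac{9 \cdot 2}{3 \cdot 3} = \log_2 2 = 1
$$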

##### Q6 : **Calculate the PPMI for Q5 and enter the value up to 3 decimal places**


PPMI is max(PMI, 0). The PMI from Q5 is positive, so it is unchanged and the answer is **"1.000"**.
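A minimal sketch of the PPMI computation, assuming the usual definition PPMI = max(PMI, 0) and reusing the counts from Q5. The function name `ppmi` and its argument names are hypothetical.

```python
import math


def ppmi(pair_count, word_count, context_count, n):
    """Positive PMI: max(PMI, 0); returns 0 when the pair never co-occurs."""
    if pair_count == 0:
        return 0.0  # log2(0) is undefined, so this case is treated as 0
    pmi_value = math.log2(pair_count * n / (word_count * context_count))
    return max(pmi_value, 0.0)


# Counts for (knowing, something): co-occurrence 2, each word occurs 3 times, N = 9.
value = ppmi(2, 3, 3, 9)
print(f"{value:.3f}")
```

The clipping only matters for pairs that co-occur less often than chance would predict, which is not the case here.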

##### Q7 : **Compute the SVD of the (normalized) co-occurrence matrix and take the rank-1 approximation (round the values in the matrix to 3 decimal points). Which of the following words are closer to the word knowing? We say the word is closer to the word "knowing" if its similarity score is greater than 0.5.**

Compute the **"SVD"** of the normalized co-occurrence matrix.

In [22]:
u, sig, v = np.linalg.svd(X_norm)

Build the **rank-1 approximation** from the largest singular value and its singular vectors.

In [23]:
rank1 = sig[0] * np.outer(u[:, 0], v[0, :])
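In equation form, the line above keeps only the largest singular value and its pair of singular vectors. Writing the SVD as $A = U \Sigma V^\top$, and noting that `np.linalg.svd` returns $V^\top$ (so the rows of `v` in the code are the right singular vectors), the rank-1 approximation is:

$$
A_1 = \sigma_1 \, u_1 v_1^\top
$$

Here $\sigma_1$ corresponds to `sig[0]`, $u_1$ to `u[:, 0]`, and $v_1^\top$ to `v[0, :]`.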

Show the rank-1 approximation as a DataFrame, rounded to 3 decimal places.

In [24]:
rank1_df = pd.DataFrame(rank1.round(3), index=V.keys(), columns=V.keys())

In [25]:
rank1_df

| | different | everything | is | knowing | name | something |
|---|---|---|---|---|---|---|
| different | 0.138 | 0.131 | 0.266 | 0.331 | 0.116 | 0.438 |
| everything | 0.163 | 0.155 | 0.314 | 0.391 | 0.137 | 0.517 |
| is | 0.133 | 0.127 | 0.257 | 0.32 | 0.112 | 0.424 |
| knowing | 0.151 | 0.143 | 0.291 | 0.362 | 0.127 | 0.479 |
| name | 0.178 | 0.169 | 0.343 | 0.427 | 0.149 | 0.565 |
| something | 0.145 | 0.138 | 0.28 | 0.348 | 0.122 | 0.461 |


Reading the "knowing" column, the words **"name"** (0.427) and **"everything"** (0.391) are the two closest words to **"knowing"**, although neither value exceeds the 0.5 threshold stated in the question.

##### Q8 : **Suppose that we use the continuous bag of words (CBOW) model to find vector representations of words. Suppose further that we use a context window of size 3 (that is, given the 3 context words, predict the target word: P(w<sub>t</sub> | w<sub>i</sub>, w<sub>j</sub>, w<sub>k</sub>)). The size of the word vectors (vector representation of words) is chosen to be 100 and the vocabulary contains 10000 words. The input to the network is the one-hot encoding (also called 1-of-V encoding) of word(s). How many parameters (weights), excluding bias, are there in W<sub>word</sub>?**

Enter the answer in thousands. For example, if your answer is 50000, then just enter 50.
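A hedged sketch of the count, assuming W<sub>word</sub> denotes the input projection matrix that maps a one-hot vector of size V to a word vector of size d, and that the same matrix is shared by all three context words, as in the standard CBOW formulation. Under that assumption the window size does not affect the count.

```python
# Assumption: W_word is the V x d input embedding matrix, shared across context positions.
vocab_size = 10_000   # V, from the question
embedding_dim = 100   # d, from the question

# One d-dimensional row per vocabulary word, no bias terms.
n_params = vocab_size * embedding_dim
answer_in_thousands = n_params // 1000

print(n_params, answer_in_thousands)
```

If W<sub>word</sub> were instead meant to include a separate matrix per context position, the count would scale with the window size, so the interpretation above should be checked against the course material.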