Submission Deadline: __October 20, 2023; 11:59 PM__

A penalty will be applied for late submission. Please refer to the course policy for more detail.  

## Instructions

Please read the instructions carefully before you start working on the homework.

- Please follow the instructions and print out the results as required. Keep the printed results and your implementation for grading purposes.
    - The TAs will not run your code for grading unless it is necessary. This means you may lose points if the printed results are not in the submitted file.
- Submission should be via Canvas.
    - If you use Google Colab for running the code, please download the file and submit it via Canvas once it's done.
    - Submission via a Google Colab link will be considered as an invalid submission.
- Please double check the submitted file once you upload it to Canvas.
    - Students should be responsible for checking whether they submit the right files.
    - Re-submission is not allowed once the deadline has passed.

Also, if you missed the class lectures, please study the course materials first before working on the homework. It may save you some time.

# Homework 02 Word Embeddings

### Goal

The **goal** of this homework is to provide an opportunity to build an end-to-end system.

Specifically, we are going to build a word embedding system that can

1. Read and preprocess raw data
2. Use two different ways (latent semantic analysis and skip-gram) to learn word embeddings
3. Evaluate the quality of word embeddings using some intrinsic evaluation methods

### Submission

Your submission should only include this notebook file. Please keep **all the outputs** in your submission for grading. We will run the code only if we are not sure it is correct.

### Dependency

You will need the following packages to finish this homework assignment

- [spaCy](https://pypi.org/project/spacy/)
- [fasttext](https://pypi.org/project/fasttext/)

### Hint

Search for the keyword `TODO` to find out which parts need your input.

In [1]:
# Download the data from course webpage
import urllib.request
from os.path import isfile
if not isfile("embeddings/imdb-small.txt"):
    url = "https://yangfengji.net/uva-nlp-grad/data/embeddings.zip"
    print("Downloading ...")
    filename, headers = urllib.request.urlretrieve(url, filename="embeddings.zip")

    print("Decompressing the file ...")
    !unzip embeddings.zip

sents = open("embeddings/imdb-small.txt").read().split("\n")
print("Read {} sentences".format(len(sents)))

Downloading ...
Decompressing the file ...
Archive:  embeddings.zip
   creating: embeddings/
  inflating: embeddings/imdb-small.txt  
  inflating: embeddings/word-pairs.txt  
Read 10000 sentences


## 1. Data Processing (5 points)

Data processing is an **essential** skill for NLP researchers. Unlike in machine learning, where researchers may sometimes use synthetic data to demonstrate the potential of their algorithms, NLP researchers deal with real-world data all the time. Unfortunately, such data are noisy and often contain irregular patterns. Reasonable data processing can therefore alleviate some of the difficulty of building NLP systems and may also help boost the performance of machine learning models.

Data processing for learning word embeddings includes two basic modules

- Tokenizing texts and replacing some special tokens
- Filtering low-frequency words and building a vocabulary

### 1.1 Tokenization (2 points)

The following function *tokenize()* should include the following components

1. Load the raw text from the file named **imdb-small.txt**
2. Convert all characters to lowercase
3. Tokenize the raw text using `nltk.tokenize`
4. Remove all punctuation (as single tokens) and replace all numbers with a special token `<num>`
5. Write the preprocessed text to the file named **imdb-small.txt.tokenized** and maintain the same format (one paragraph per line)

(The file names are pre-defined, please do not change them.)
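
As a minimal sketch of steps 2 to 4 on a single line of text: the helper name `tokenize_line` and the regular-expression splitter are assumptions made here so the example runs without extra downloads; the splitter only stands in for `nltk.tokenize`, which the assignment asks you to use.

```python
import re
import string

def tokenize_line(line):
    """Lowercase one line, split it into tokens, drop punctuation, map numbers to <num>."""
    # Runs of word characters become one token; any other non-space character is its own token.
    raw_tokens = re.findall(r"\w+|[^\w\s]", line.lower())
    cleaned = []
    for tok in raw_tokens:
        if all(ch in string.punctuation for ch in tok):
            continue  # a token made only of punctuation is removed
        if tok.isdigit():
            cleaned.append("<num>")  # a purely numeric token is replaced
        else:
            cleaned.append(tok)
    return " ".join(cleaned)

print(tokenize_line("I watched it 3 times, and it's GREAT!"))
# i watched it <num> times and it s great
```

Applying such a helper line by line keeps the one-paragraph-per-line format of the input file.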

In [5]:
# TODO: add necessary packages here
import re
import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')  # sentence-splitting model used by nltk.sent_tokenize
def tokenize(infname="embeddings/imdb-small.txt"):
    outfname = open(infname + ".tokenized", "w")

    with open(infname, 'r', encoding='utf-8') as file:
        raw_text = file.read()
    raw_text = raw_text.lower()

    sentences = nltk.sent_tokenize(raw_text)

    def replace_numbers(text):
        return re.sub(r'\d+', '<num>', text)

    tokenized_sentences = []
    tokenizer = RegexpTokenizer(r'\w+')
    for sentence in sentences:
        tokens=tokenizer.tokenize(sentence)
        cleaned_tokens = [replace_numbers(token) for token in tokens]
        tokenized_sentences.append(' '.join(cleaned_tokens))

    # Write the preprocessed text to the output file

    outfname.write('\n'.join(tokenized_sentences))


    # ----------------------------------------
    # TODO: add your code here

    # ----------------------------------------
    outfname.close()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### 1.2 Filtering (2 points)

The following function *token_filter()* should include the following components

1. Remove the words that appear in the data fewer than 5 times (word_frequency < 5)
2. Write the filtered data to the file named **imdb-small.txt.filtered** and maintain the same format (one sentence per line)
3. Return a Python list that contains all the words
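
The steps above can be sketched on a toy corpus as follows. The function name `filter_rare`, the toy sentences, and the threshold of 2 are illustrative assumptions, and the sketch works on in-memory lists rather than on the files named above.

```python
from collections import Counter

def filter_rare(lines, thresh=5):
    """Drop tokens seen fewer than `thresh` times across all lines."""
    # Count every whitespace-separated token over the whole corpus.
    counts = Counter(tok for line in lines for tok in line.split())
    keep = {tok for tok, c in counts.items() if c >= thresh}
    # Rebuild each line from the surviving tokens only.
    filtered = [" ".join(t for t in line.split() if t in keep) for line in lines]
    return filtered, sorted(keep)

toy_lines = ["good movie", "good plot", "bad movie"]
toy_filtered, toy_vocab = filter_rare(toy_lines, thresh=2)
print(toy_filtered)  # ['good movie', 'good', 'movie']
print(toy_vocab)     # ['good', 'movie']
```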

In [3]:
# TODO: add necessary packages here
import re
import nltk

def token_filter(infname="embeddings/imdb-small.txt.tokenized", thresh=5):
    outfname = open(infname.replace(".tokenized", ".filtered"), 'w')
    vocab = []
    with open(infname, 'r', encoding='utf-8') as file:
        sentences = file.readlines()

    #sentences = nltk.sent_tokenize(raw_text)
    freq_dist = nltk.FreqDist()

    def custom_word_tokenize(text):
        # Use a regular expression to split into word tokens, keeping <num> intact
        return re.findall(r'\b\w+\b|<num>', text)

    for s in sentences:
        tokens = custom_word_tokenize(s)
        freq_dist.update(tokens)

    filtered_words = [word for word in freq_dist if freq_dist[word] >= thresh]

    filtered_words_set = set(filtered_words)
    tokenized_sentences = []
    for sentence in sentences:
        tokens = custom_word_tokenize(sentence)
        # Filter tokens based on the frequency
        filtered_tokens = [token for token in tokens if token in filtered_words_set]
        if filtered_tokens:
            #outfname.write(' '.join(filtered_tokens) + '\n')
            tokenized_sentences.append(' '.join(filtered_tokens))


    vocab=list(filtered_words_set)
    outfname.write('\n'.join(tokenized_sentences))

    # ----------------------------------------
    # TODO: remove "pass" and add your code here

    # ----------------------------------------
    outfname.close()
    return vocab

### 1.3 Put all together (1 point)

The following code block will call the previous two functions to do data preprocessing.

This code block should include the following steps

- tokenization
- build the vocabulary with the variable name `vocab`
- print out the size of the vocabulary

In [6]:
tokenize()
vocab = token_filter()
print("The vocab size = {}".format(len(vocab)))

The vocab size = 18289


## 2. Word Embeddings (5 points)

In this section, you need to implement two different ways of constructing word embeddings: latent semantic analysis and skip-gram.

### 2.1 Latent semantic analysis (3 points)

The function of LSA should include the following components

- Construct the word-doc matrix using `CountVectorizer` with `tokenizer=lambda x : x.split()`. Make sure that each row of this matrix represents one word and each column represents one document (in this case, one sentence).
- Use the `TruncatedSVD` from `sklearn.decomposition` to factorize the word-doc matrix
- Construct the word embedding matrix with dimensionality as $v \times k$, where $v$ is the vocab size and $k$ is the word embedding dimension

The LSA() function should return

- **embeddings**: A matrix with size $v\times k$ that contains all the word embeddings
- **vocab**: A Python dict with size $v$ that maps a word to the corresponding word index. Please pay attention to the mapping relation in vocab, which will be needed in the evaluation section.
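
The factorization behind these steps can be written out as follows. This is a sketch in notation introduced here, not taken from the assignment: $X$ is the $v \times d$ word-doc count matrix and $d$ is the number of documents.

$$
X \approx U_k \Sigma_k V_k^\top, \qquad U_k \in \mathbb{R}^{v \times k},\; \Sigma_k \in \mathbb{R}^{k \times k},\; V_k \in \mathbb{R}^{d \times k}
$$

$$
E = X V_k = U_k \Sigma_k \in \mathbb{R}^{v \times k}
$$

Here $E$ is what `TruncatedSVD.fit_transform` returns when it is fitted on $X$ (up to the approximation error of its solver), so row $i$ of $E$ can serve as the embedding of the word with index $i$.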

In [7]:
# TODO: add necessary packages here
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
def LSA(fname = "embeddings/imdb-small.txt.filtered", dim=50):
    sents = open(fname).read().split("\n")

    # Create a CountVectorizer and transform the sentences into a doc-word count matrix
    vectorizer = CountVectorizer(tokenizer=lambda x: x.split())
    word_doc_matrix = vectorizer.fit_transform(sents)
    # Transpose so that each row is a word and each column is a document
    word_doc_matrix = word_doc_matrix.T

    # Factorize; each row of the result is a dim-dimensional word embedding
    svd = TruncatedSVD(n_components=dim)
    embeddings = svd.fit_transform(word_doc_matrix)
    # Map each word to its row index in the embedding matrix
    vocab = {word: idx for idx, word in enumerate(vectorizer.get_feature_names_out())}

    # -------------------------------------
    # TODO: add your code here

    # -------------------------------------
    return embeddings, vocab

In [8]:
em,vocab=LSA()
print(em.shape)
print(len(vocab))



(18289, 50)
18289


### 2.2 Skip-gram model (2 points)

In this section, you do not have to implement the skip-gram model yourself. An implementation of skip-gram is available in the Python package [fasttext](https://pypi.org/project/fasttext/), which you can install on your local machine with the following command, or load directly if you are using Google Colab.
```bash
pip install fasttext
```

In the following code, please use the `fasttext.train_unsupervised` function to implement `Skipgram()`, with the following configuration

- `model='skipgram'`
- Context window size: `ws = 3`
- Word embedding dimension: `dim = 50`
- Number of negative examples: `neg = 5`

For all other parameters, use their default values.
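
To make the role of the context window `ws` concrete, here is a simplified sketch of how (center, context) training pairs arise from one sentence. The function `skipgram_pairs` is hypothetical and uses a fixed window on each side; it illustrates the idea only and does not reproduce the training loop inside `fasttext`.

```python
def skipgram_pairs(tokens, ws=3):
    """List (center, context) pairs using a fixed window of `ws` tokens on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - ws)                # leftmost context position
        hi = min(len(tokens), i + ws + 1)  # one past the rightmost context position
        for j in range(lo, hi):
            if j != i:                     # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

toy_tokens = "the movie was really good".split()
print(len(skipgram_pairs(toy_tokens, ws=1)))  # 8
print(len(skipgram_pairs(toy_tokens, ws=3)))  # 18
```

A larger window yields more pairs per sentence and lets each word see more distant neighbours.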

Similar to the previous LSA(), Skipgram() should return

- **embeddings**: A matrix with size $v\times k$ that contains all the word embeddings
- **vocab**: A Python dict with size $v$ that maps a word to the corresponding word index

To get the word embeddings and vocab from fasttext, you need to understand [some functions](https://pypi.org/project/fasttext/#api) provided by the `model` object in fasttext.

In [9]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199769 sha256=d690c8ba8c429f5420fe76a1442cb53def9552ccb7f868601fc32e5d25272333
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [10]:
# TODO: add necessary packages here
import fasttext
import numpy as np
def Skipgram(fname = "embeddings/imdb-small.txt.filtered", ws=3, dim=50):
    # ------------------------------------------
    # TODO: add your code here
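    model = fasttext.train_unsupervised(fname, model='skipgram', ws=ws, dim=dim, neg=5)

    # Stack one vector per vocabulary word; row order follows model.words
    embeddings = np.array([model.get_word_vector(word) for word in model.words])
    # Map each word to its row index in the embedding matrix
    vocab = {word: idx for idx, word in enumerate(model.words)}
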
    # ------------------------------------------
    return embeddings, vocab

In [11]:
embeddings_sg, vocab_sg = Skipgram()
print(embeddings_sg.shape)
print(len(vocab_sg))

(18290, 50)
18290


### 2.3 Put all together

Run the following code blocks to get word embeddings from two different methods. It may take a couple of minutes to compute both embeddings.

In [12]:
embeddings_lsa, vocab_lsa = LSA()
embeddings_sg, vocab_sg = Skipgram()

The following code serves as a sanity check that `vocab_lsa` and `vocab_sg` contain the same words.

In [13]:
lsa_word_set = set([item[0] for item in vocab_lsa.items()])
sg_word_set = set([item[0] for item in vocab_sg.items()])
sym_diff = lsa_word_set.symmetric_difference(sg_word_set)

if len(sym_diff) == 0:
    print("vocab_lsa and vocab_sg contain the same words!")
else:
    print("The word that only appear in one vocab: {}".format(sym_diff))

The word that only appear in one vocab: {'</s>'}


If the only word from the `symmetric_difference()` function is `</s>`, then your implementation should be fine. (`</s>` was added by `fasttext` automatically to the end of each text.)
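
For reference, a toy illustration of what `symmetric_difference()` reports. The two small dicts are made-up stand-ins for `vocab_lsa` and `vocab_sg`.

```python
# Hypothetical toy vocabularies (word -> index); only the keys matter here.
toy_vocab_a = {"good": 0, "movie": 1}
toy_vocab_b = {"</s>": 0, "good": 1, "movie": 2}

# Words that appear in exactly one of the two vocabularies.
only_in_one = set(toy_vocab_a).symmetric_difference(set(toy_vocab_b))
print(only_in_one)  # {'</s>'}
```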

## 3. Evaluation (5 points)

In this homework, we will only use intrinsic evaluation. Specifically, for a list of predefined word pairs with their similarity scores, the evaluation is to calculate the correlation between the predefined similarity scores and the cosine similarity scores based on word embeddings. The higher the correlation, the better the quality of word embeddings.
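
As a minimal sketch of the correlation step, the following computes the Pearson coefficient by hand. The function `pearson` and both score lists are illustrative assumptions; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, all without the 1/n factor (it cancels in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

gold = [1.0, 2.0, 3.0, 4.0]       # hypothetical predefined similarity scores
predicted = [0.1, 0.2, 0.3, 0.4]  # hypothetical cosine similarities
print(pearson(gold, predicted))   # close to 1: the two lists rise together
```

A value near 1 means the embedding-based scores rank the word pairs much like the predefined scores do; a value near 0 means they are unrelated.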

In [14]:
def load_wordpairs(fname = "embeddings/word-pairs.txt", vocab=vocab):
    records = {}
    with open(fname) as fin:
        for line in fin:
            items = line.strip().split(",")
            if (items[1] in vocab) and (items[2] in vocab): # make sure both words in the vocab
                records[(items[1],items[2])] = float(items[3])
    print("Load {} pairs of words for evaluation".format(len(records)))
    return records

### 3.1 Word similarity correlation (2 points)

The purpose of this section is to implement the correlation function that compares the predefined scores and the scores computed by cosine similarity. The code of the correlation function is almost done, and the only thing left is the code for computing cosine similarity.
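
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells this out in plain Python; the function name `cosine_similarity` and the toy vectors are assumptions for illustration. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, that is, one minus this value.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # close to 1 (same direction)
```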

In [15]:
from scipy.stats import pearsonr
# TODO: Add necessary packages here
from scipy.spatial.distance import cosine
def correlation(records, embeddings, vocab):
    predefined_scores = []
    cossim_scores = []
    for (words, sim_score) in records.items():
        # Score a pair only when both words have embeddings, so the two lists stay aligned
        if words[0] in vocab and words[1] in vocab:
            vec1 = embeddings[vocab[words[0]]]
            vec2 = embeddings[vocab[words[1]]]
            # scipy's cosine() is a distance, so similarity = 1 - distance
            score = 1 - cosine(vec1, vec2)
            predefined_scores.append(sim_score)
            cossim_scores.append(score)
        # ---------------------------------
        # TODO: add your code here for computing the cosine similarity
        #       between words[0] and words[1], and assign the value to variable "score"

        # ---------------------------------

    corr = pearsonr(predefined_scores, cossim_scores)
    return corr

Run the following code block to calculate the correlations between the predefined similarity scores and the cosine similarity scores based on word embeddings.

In [17]:
records=load_wordpairs(vocab=vocab_lsa)

corr_lsa = correlation(records, embeddings_lsa, vocab_lsa)
print("The correlation with the LSA embeddings = {} with p-value {}".format(corr_lsa[0], corr_lsa[1]))
records=load_wordpairs(vocab=vocab_sg)
corr_sg = correlation(records, embeddings_sg, vocab_sg)
print("The correlation with the skipgram embeddings = {} with p-value {}".format(corr_sg[0], corr_sg[1]))

if corr_lsa[0] > corr_sg[0]:
    print("LSA is better than Skip-gram")
elif corr_lsa[0] < corr_sg[0]:
    print("Skipgram is better than LSA")

Load 151 pairs of words for evaluation
The correlation with the LSA embeddings = 0.31924259155765383 with p-value 6.45906927376854e-05
Load 151 pairs of words for evaluation
The correlation with the skipgram embeddings = 0.34222529117452544 with p-value 1.7003902933571214e-05
Skipgram is better than LSA


### 3.2 Analysis of context window size in Skipgram (3 points)

With the correlation function, we can analyze the effect of different context window sizes in the Skipgram model. Specifically, please call the previous implementation

- `Skipgram(fname, ws, dim=50)` with the context window size `ws` as 3, 6, 9, 12, 15
- For each context window size, calculate the correlation using the function `correlation(records, embeddings, vocab)`
- **Print out** the five correlation scores in your final submission: one score per line with the following format
<center> ws\t correlation</center>

In [18]:
# TODO: add your code here

# Train and evaluate one skip-gram model per context window size
for ws in [3, 6, 9, 12, 15]:
    embeddings_sg, vocab_sg = Skipgram(ws=ws)
    records = load_wordpairs(vocab=vocab_sg)
    corr_sg = correlation(records, embeddings_sg, vocab_sg)
    print("ws= {} \t The correlation with the skipgram embeddings = {} with p-value {}".format(ws, corr_sg[0], corr_sg[1]))



Load 151 pairs of words for evaluation
ws= 3 	 The correlation with the skipgram embeddings = 0.34222529117452544 with p-value 1.7003902933571214e-05
Load 151 pairs of words for evaluation
ws= 6 	 The correlation with the skipgram embeddings = 0.4120620007773998 with p-value 1.4651248219952812e-07
Load 151 pairs of words for evaluation
ws= 9 	 The correlation with the skipgram embeddings = 0.4398232975955795 with p-value 1.6007465907200303e-08
Load 151 pairs of words for evaluation
ws= 12 	 The correlation with the skipgram embeddings = 0.4343807165651287 with p-value 2.5101634437980525e-08
Load 151 pairs of words for evaluation
ws= 15 	 The correlation with the skipgram embeddings = 0.44382442544907136 with p-value 1.1441542431685874e-08


A similar experiment can also be conducted on the number of negative examples `neg`, but it is not included in this homework.