## Module 4: Working with Textual Data with NLTK Part II (10 Points).


In this tutorial, we will:
1. Explore the development of word2vec embeddings of word tokens.
2. Get a first taste of the PyTorch framework for deep learning.


**Note: There is an associated submission for this exercise.**

Let's start again with the `examp_doc` we used in the last tutorial. Alternatively, you can replace it with another paragraph.

In [1]:
import nltk

In [2]:
# Trick: use triple quotes (""" ...... """) to define multi-line texts in Python.

examp_doc = """Growing use of the Internet and social media in the past decade has led to an explosion in the amount of
social and behavioral data available to researchers. This in turn has created huge opportunities for social scientists to
study human behavior and social interaction in unprecedented detail. Leveraging these opportunities requires collaborative,
interdisciplinary efforts involving computer and information scientists, physicists, and mathematicians who know how to
build the telescope and economists, political scientists, and sociologists who know where to aim it. Computational social
science exists at the intersection of these varied disciplines."""


print(examp_doc)

Growing use of the Internet and social media in the past decade has led to an explosion in the amount of 
social and behavioral data available to researchers. This in turn has created huge opportunities for social scientists to 
study human behavior and social interaction in unprecedented detail. Leveraging these opportunities requires collaborative, 
interdisciplinary efforts involving computer and information scientists, physicists, and mathematicians who know how to 
build the telescope and economists, political scientists, and sociologists who know where to aim it. Computational social 
science exists at the intersection of these varied disciplines.


In [7]:
import string


nltk.download('punkt')
nltk.download('stopwords')
stemmer = nltk.stem.PorterStemmer()
stopword_lst = nltk.corpus.stopwords.words('english')
punct_lst = list('''!()-[]{};:'"\\,<.>/?@#$%^&*_~''') # Converted the string to a list of characters

def clean_and_tokenize(examp_doc):
    # write down the corresponding procedures.
    # the six lines correspond to six steps.

    out = nltk.sent_tokenize(examp_doc)
    out = [nltk.word_tokenize(sent) for sent in out]
    out = [[word.lower() for word in sent] for sent in out]
    out = [[word for word in sent if word not in stopword_lst] for sent in out]
    out = [[word for word in sent if word not in punct_lst] for sent in out]
    out = [[stemmer.stem(word) for word in sent] for sent in out]

    return out


token_lsts = clean_and_tokenize(examp_doc)
print(token_lsts)

[['grow', 'use', 'internet', 'social', 'media', 'past', 'decad', 'led', 'explos', 'amount', 'social', 'behavior', 'data', 'avail', 'research'], ['turn', 'creat', 'huge', 'opportun', 'social', 'scientist', 'studi', 'human', 'behavior', 'social', 'interact', 'unpreced', 'detail'], ['leverag', 'opportun', 'requir', 'collabor', 'interdisciplinari', 'effort', 'involv', 'comput', 'inform', 'scientist', 'physicist', 'mathematician', 'know', 'build', 'telescop', 'economist', 'polit', 'scientist', 'sociologist', 'know', 'aim'], ['comput', 'social', 'scienc', 'exist', 'intersect', 'vari', 'disciplin']]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Tokenize the document.

**Exercise 1** Use the `clean_and_tokenize()` code cell to convert the document into lists of tokens. The function should:
- split the paragraph into sentences.
- split each sentence into individual words.
- convert all words to lower case (i.e., "a" instead of "A").
- remove stop words.
- remove punctuation.
- stem each word to its root.

You can add a cell below and test the code line by line before writing the function.

### 1. Bag-of-Words (BoW) Encoding

This section experiments with creating a BoW encoding from the provided `examp_doc`.

   1. The *first* step is to clean and prepare the data, which we have already done with the function `clean_and_tokenize()`; the output is `token_lsts`.

   2. The *second* step is to create a vocabulary consisting of all unique words in the document. Please check and run the code below to complete this step.

In [None]:
vocab = sorted(list(set([token for token_lst in token_lsts for token in token_lst])))

<span style='background-color: #FFFF00;'> **Question:** what is the dimension for the one-hot word encoding derived from this vocabulary? (**1 Point**) </span>


**Answer:** (double click and type your answers here)

3. The *third* step is to convert each document (i.e., each sentence in `examp_doc`, or each list in `token_lsts`) into its BoW encoding. Write the function in the code cell below to achieve this.

In [None]:
## Finish the code in this step.
import numpy as np

def tokenLst_to_BOW(token_lsts, vocab):
    # Try to finish this function by yourself.

    out = None  # TODO: replace None with the BoW encoding (one vector per sentence)

    return out


bow_encoding = tokenLst_to_BOW(token_lsts, vocab)

print(bow_encoding[0])

Bag of Words (BOW) encoding can be utilized to analyze the similarity among texts by comparing the frequency distribution of various words. Therefore, BOW encoding can serve as input for NLP tasks that inherently rely on similarity scores, such as document classification (e.g., determining the category of news articles), topic modeling, and latent semantic analysis.
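
To make the idea concrete, here is a minimal sketch of what a count-based BoW vector looks like on a toy corpus. The names `toy_token_lsts`, `toy_vocab`, and `count_vector` are made up for this illustration, and this is only one possible way to build the encoding, not a reference solution.

```python
from collections import Counter

# Toy corpus: two already-tokenized "sentences" (made-up data).
toy_token_lsts = [["social", "data", "social"], ["data", "scienc"]]

# Vocabulary: all unique tokens, sorted so that positions are stable.
toy_vocab = sorted({tok for sent in toy_token_lsts for tok in sent})

def count_vector(tokens, vocab):
    """Return how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

toy_bow = [count_vector(sent, toy_vocab) for sent in toy_token_lsts]

print(toy_vocab)
print(toy_bow)
```

Each position in a vector corresponds to one vocabulary word, so every sentence gets a vector of the same length regardless of how many tokens it contains.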

Below is a simple example that compares sentences in `examp_doc` using the [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function in sklearn. The code compares the first and second sentences, which are the most alike of all sentence pairs, with a cosine similarity score of 0.31.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

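# Stack the two BoW vectors into one 2 x vocab_size input; the result is the 2 x 2 pairwise similarity matrix.
cosine_similarity([bow_encoding[0], bow_encoding[1]])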

array([[1.        , 0.31311215],
       [0.31311215, 1.        ]])
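
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with the standard library only; `cosine` and the two toy vectors are assumptions for illustration and are not part of sklearn.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1, 0, 2]  # made-up count vectors over a 3-word vocabulary
vec_b = [1, 1, 0]

sim = cosine(vec_a, vec_b)  # 1 / (sqrt(5) * sqrt(2))
print(round(sim, 4))
```

The same arithmetic reproduces the score above. With count-based vectors, the first two sentences in `token_lsts` share `social` (2 × 2) and `behavior` (1 × 1), giving a dot product of 5, and their squared lengths are 17 and 15, so the similarity is 5 / sqrt(17 × 15) ≈ 0.313.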

<span style='background-color: #FFFF00;'> **Exercise:** In the code cell below, write a function that converts the BoW encoding (a list of lists) into a TF-IDF encoding. (**3 Points**) </span>
- The input is BOW encoding.
- The output is tf-idf encoding.

<div>
<img src="attachment:image-2.png" align="center" width="400">
<img src="attachment:image-3.png" align="center" width="300">
<img src="attachment:image.png" align="center" width="300">
</div>

In [None]:
def bow_to_tfidf(bow_encoding):
    # write your code below




    return tfidf_encoding


bow_to_tfidf(bow_encoding)

### 2. Word2Vec Embedding.

[`PyTorch`](https://pytorch.org/), developed by Facebook's Artificial Intelligence Research group (FAIR), is a powerful library for building deep learning networks. It provides implementations of many commonly used network layers. The process of building a deep learning network is akin to building with blocks. When building with blocks, you arrange each block to create an overall appealing structure. Similarly, when building a deep learning model, you arrange the different layers to optimize its efficiency.

![image-3.png](attachment:image-3.png)

We will use PyTorch to implement the Continuous Bag-of-Words (CBOW) model we learnt in the lecture session. To start, we need to turn the tokenized `examp_doc` into (context, target) training pairs. You can run the code cell below to prepare the data.

Please also note that the CBOW embedding and the BoW encoding (or one-hot encoding) are two different things.
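
The difference can be illustrated on a toy vocabulary. A one-hot vector has one position per vocabulary word and carries no notion of similarity, while an embedding is a short dense vector stored as a row of a learned matrix. The sketch below uses made-up names (`toy_vocab`, `toy_embeddings`, `one_hot`, `lookup`) and made-up numbers in place of trained weights. It shows that multiplying a one-hot vector by the embedding matrix simply selects one row, which is why an embedding layer can take word indices as input.

```python
toy_vocab = ["data", "scienc", "social"]
toy_word_to_ix = {word: i for i, word in enumerate(toy_vocab)}

# Made-up 2-dimensional embeddings, one row per vocabulary word.
toy_embeddings = [
    [0.1, -0.3],   # data
    [0.7, 0.2],    # scienc
    [-0.5, 0.4],   # social
]

def one_hot(word):
    """One-hot vector: as long as the vocabulary, with a single 1."""
    vec = [0] * len(toy_vocab)
    vec[toy_word_to_ix[word]] = 1
    return vec

def lookup(one_hot_vec):
    """Multiply a one-hot row vector by the embedding matrix."""
    n_dims = len(toy_embeddings[0])
    return [sum(h * row[j] for h, row in zip(one_hot_vec, toy_embeddings))
            for j in range(n_dims)]

social_one_hot = one_hot("social")
social_embedding = lookup(social_one_hot)

print(social_one_hot)
print(social_embedding)
```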

In [None]:
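context_size = 2  # number of context words taken on each side of the target

data = []

for sent in token_lsts:
    # Only positions with a full window on both sides can serve as targets.
    for i in range(context_size, len(sent) - context_size):
        context = tuple(sent[i - context_size:i] + sent[i + 1:i + context_size + 1])
        target = sent[i]
        data.append((context, target))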


print(data[0])

(('grow', 'use', 'social', 'media'), 'internet')


CBOW predicts a target word from its surrounding context words. For example, it predicts "*internet*" based on the list of words: ["*grow*", "*use*", "*social*", "*media*"].


**A shallow taste of deep learning**

In our lecture session, we explained the principle of Word2Vec embeddings. The code cell below presents a simple deep learning model, written with PyTorch, for Continuous Bag-of-Words (CBOW). The corresponding network structure is shown in Figure 1.


<div>
<img src="attachment:image-2.png" align="center" width="700">
</div>

**Figure 1**. The network structure of CBOW.

Some PyTorch documentation pages you may find helpful:
- [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
- [`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
- [`F.log_softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.log_softmax.html)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F   # load necessary packages & functions from PyTorch


class CBOW(nn.Module): # define the class "CBOW", which is a child class derived from the parent class: "nn.Module"

    def __init__(self, vocab_size, embedding_dim=5, hidd_dim = 16, context_size = 2):
        super(CBOW, self).__init__() # the "CBOW" class inherits all properties and methods from its parent class.

        # initialize the embedding layer.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Define the first fully connected layer (linear transformation) which takes the concatenated embeddings as input
        self.linear1 = nn.Linear(context_size * embedding_dim, hidd_dim)

        # Define the second fully connected layer which outputs scores for each word in the vocabulary
        self.linear2 = nn.Linear(hidd_dim, vocab_size)

    def forward(self, inputs): # Define the forward pass of the network
        embeds = self.embeddings(inputs) # Get the embeddings for the input words

        # Flatten the embeddings (from context_size x embedding_dim to a single vector)
        out = embeds.view((1, -1))

        # Pass the flattened embeddings through the first linear layer
        out = self.linear1(out)

        # Pass the output of the first linear layer through the second linear layer
        out = self.linear2(out)

        # Apply log softmax to get log probabilities for each word in the vocabulary
        log_probs = F.log_softmax(out, dim=1)

        return(log_probs) # Return the log probabilities

<span style='background-color: #FFFF00;'> Please carefully review Figure 1 and the 13-line PyTorch code. Analyze the code and answer the following questions (**4 Points**).</span>

---
**Q1:** Which line of the code corresponds to the "Embedding Layer" in the figure, and what are the dimensions of the inputs and outputs for that layer?

**A1:**

---

**Q2:** Which line of the code corresponds to the "1st Linear Layer" in the figure, and what are the dimensions of the inputs and outputs for that layer?

**A2:**

---
**Q3:** Which line of the code corresponds to the "2nd Linear Layer" in the figure, and what are the dimensions of the inputs and outputs for that layer?

**A3:**

---
**Q4:** what is the purpose of the softmax?

**A4:**

---

Run the code cell below to train the model and check the loss curve.

In [None]:
import matplotlib.pyplot as plt

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# Create a dictionary mapping each word to its index in the vocabulary, as the nn.Embedding layer only takes indices.
word_to_ix = {word: i for i, word in enumerate(vocab)}


losses = []
loss_function = nn.NLLLoss() # Define the loss function to be used (Negative Log Likelihood Loss)
# Initialize the CBOW model. Here context_size=4 is the total number of context words (2 on each side of the target).
model = CBOW(len(vocab), embedding_dim=5, context_size=4)
optimizer = optim.SGD(model.parameters(), lr=0.05)  # Define the optimizer and learning rate to update the model's parameters.


for epoch in range(20):
    total_loss = 0 # Initialize the total loss for the current epoch
    for context, target in data: # Iterate over each context-target pair in the dataset
        context_ids = make_context_vector(context, word_to_ix) # Convert the context words to their corresponding indices and create a tensor
        model.zero_grad() # Zero out the gradients from previous steps
        log_probs = model(context_ids) # Forward pass: get the log probabilities for the target word given the context
        label = torch.tensor([word_to_ix[target]], dtype=torch.long) # Create a tensor for the target word's index
        loss = loss_function(log_probs, label) # Calculate the loss between the predicted log probabilities and the actual target
        loss.backward() # Backward pass: compute gradients of the loss with respect to the model's parameters
        optimizer.step() # Update the model's parameters using the optimizer
        total_loss += loss.item() # Accumulate the loss for the current epoch
    losses.append(total_loss) # Append the total loss for the current epoch to the losses list


# Plot the loss values over the epochs using matplotlib

plt.plot(range(20), losses)

In [None]:
# Compare the learned embeddings of the first two vocabulary words (indices 0 and 1).
cosine_similarity(model.embeddings(torch.tensor([0])).detach().numpy(),
                  model.embeddings(torch.tensor([1])).detach().numpy())

array([[-0.10768455]], dtype=float32)

In [None]:
# Walk through the forward pass step by step for `context_ids`,
# which still holds the last context from the training loop.

# 1. embedding
out = model.embeddings(context_ids)

# 2. flatten
out = out.view((1, -1))

# 3. first linear layer
out = model.linear1(out)

# 4. second linear layer
out = model.linear2(out)

# 5. softmax
prob = F.softmax(out, dim=1).detach().numpy()[0] # predicted probability of every word in the vocabulary

plt.figure(figsize = (10, 5))
plt.bar(vocab, prob)
plt.xticks(rotation = 90)
plt.show()
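
The model's `F.log_softmax` output and `nn.NLLLoss` work as a pair: the first turns raw scores into log probabilities, and the second takes the negative log probability of the correct word. Below is a pure-Python sketch for a single example with three made-up scores. The helpers `log_softmax` and `nll_loss` here are plain functions written for illustration, not the PyTorch ones.

```python
import math

def log_softmax(scores):
    """Log of the softmax, computed stably by subtracting the max score."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll_loss(log_probs, target_ix):
    """Negative log likelihood of the correct class."""
    return -log_probs[target_ix]

scores = [2.0, 0.5, -1.0]                    # made-up raw scores for a 3-word vocabulary
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]   # probabilities that sum to 1

loss_if_first = nll_loss(log_probs, 0)       # correct word has the highest score
loss_if_last = nll_loss(log_probs, 2)        # correct word has the lowest score

print(probs)
print(loss_if_first, loss_if_last)
```

The loss is small when the model assigns a high probability to the correct word and grows as that probability shrinks, which is the quantity the loss curve tracks during training.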

### 3. Sentence Embedding with SentenceBERT

#### Want to Know More about word embedding?

In this exercise, we used a very small corpus that includes only one paragraph. We also set the embedding dimension to a very small value (5 in the code above). As a result, the learned embeddings are not very accurate.

Some institutions and scholars have trained word embeddings on billions of documents. These pre-trained word embeddings can be downloaded online. They also come with different options for the embedding dimension, e.g., 100 or 200. You can check the following links for more information:

- [Google Word2Vec](https://code.google.com/archive/p/word2vec/)
- [Stanford GloVe](https://nlp.stanford.edu/projects/glove/)
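
As a rough sketch of how such pre-trained vectors can be read, the snippet below assumes the plain-text layout used by the GloVe downloads: one word per line, followed by its vector components separated by spaces. To stay self-contained it parses a tiny in-memory sample with made-up words and numbers instead of a downloaded file; `load_text_vectors` is a hypothetical helper name.

```python
import io

# Made-up sample in the "word v1 v2 ... vn" text layout (not real GloVe values).
sample_file = io.StringIO(
    "king 0.5 0.1 -0.2\n"
    "queen 0.4 0.2 -0.1\n"
)

def load_text_vectors(handle):
    """Read 'word v1 v2 ... vn' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

pretrained = load_text_vectors(sample_file)
print(pretrained["king"])
```

With a real file, the same function could be called on an opened file handle, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded text file.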

### 4. Other Applications of NLP Approaches

We have introduced different NLP methods in this session, some of which can be applied to tasks beyond text processing. Below is a hypothetical scenario.

The classification of occupations normally follows a top-down approach. For instance, China has defined 8 general occupation types, each further subdivided into more specific categories [[ref]](https://oss.baigongbao.com/2021/07/18/BkANmPhfNr.pdf). However, due to the rapid evolution of society, new occupation types often emerge that do not fit into existing categories. As a research assistant, you have been tasked with classifying these new occupations into the existing categories. To accomplish this, you recall the concept of "embedding" from your CSS 5220 class. You decide to create embeddings for each occupation type and use similarity measures to match new occupations to existing categories.

There are various methods to achieve this. An intuitive one is to create embeddings based on job advertisement texts. Unfortunately, your advisor does not have access to this type of data. Instead, he provides you with user-based data, consisting of 1,000 job applicants and the jobs each applicant has applied for. Two example records are shown below:

    1. {'name': 'Zhang San', 'jobs_applied': ['data analyst', 'machine learning engineer', 'product manager', 'data operations']}
    2. {'name': 'Li Si', 'jobs_applied': ['waiter', 'front desk receptionist', 'sales assistant', 'office clerk', 'retail clerk']}

Therefore, the question becomes: how can you create embeddings for occupations based on this available data?

<span style='background-color: #FFFF00;'> Please read the above question and document the steps that you will take to solve the question (**2 Points**).</span>

(**Hint**: I expect you to use Word2Vec embeddings.)

---



**Answer:** (double click to start typing your answers)


----