# 1. Generating semantic embeddings


**A Note on how to use Notebook Environments**

* To run a cell, press Shift + Enter. The notebook executes the code in the cell and moves to the next cell. If the cell is a markdown (text-only) cell, it renders the markdown and moves to the next cell.
* Since cells can be executed in any order and variables can be overwritten, you may at some point feel that you have lost track of the state of your notebook. If this is the case, you can always restart the kernel by clicking Runtime in the menu bar (if you're using Colab) and selecting Restart runtime. This will clear all variables and outputs.
* The value of the last expression in a cell is displayed below the cell. If you want to display several values, use the print() function.
* Notebook environments support code cells and markdown (text) cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with #).

***Preparation for Exercise 2*** (Hugging Face and Meta Llama License)
* Make sure you have a Hugging Face account (https://huggingface.co/join).
* Go to the meta-llama/Llama-3.2-1B-Instruct model page and fill in the 'COMMUNITY LICENSE AGREEMENT' form at the top of the page to get access to the model (this may take a few minutes). We will not need this model in the first exercise, but requesting access now saves time later.

## Environment Setup

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Install requisite packages
    !pip install sentence_transformers pacmap &> /dev/null

    # Change working directory
    %cd /content/drive/MyDrive/LLM_SIBR

We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the `import <name> as <nickname>` syntax (e.g. `import pandas as pd`). These nicknames are usually standardized. `sentence_transformers` is a package for extracting embeddings/features from text data using transformer-based models. The other packages are used for computations, clustering, and plotting.

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import pacmap
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler
from datetime import datetime

### Embedding Extraction
The code begins by loading the data as `pandas.DataFrame` objects. Because we want to evaluate semantic information from articles' titles and abstracts, we first concatenate them into a single column (`'text'`). Run the code below.

In [None]:
# Load the data with only the desired columns
data = pd.read_csv('data_cleaned_filtered_clustered.csv', usecols=['Title', 'Abstract_cleaned', 'Year'])

# Concatenate titles and abstracts
data['text'] = data['Title'] + '.\n\n' + data['Abstract_cleaned']
data

In [None]:
# Print an example of a text
print(data['text'][30])

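One thing to be aware of when concatenating text columns in pandas: if either value in a row is missing, the `+` operator returns a missing value for that row. Whether the workshop file contains missing abstracts is not stated here, so the sketch below uses a made-up two-row table (`toy` is a hypothetical stand-in, not the workshop data) to show the behaviour and one possible guard using `fillna('')`.

```python
import pandas as pd

# Made-up stand-in for the workshop data; the second abstract is missing
toy = pd.DataFrame({
    'Title': ['First title', 'Second title'],
    'Abstract_cleaned': ['Some abstract text', None],
})

# Plain concatenation: a missing abstract makes the whole result missing
plain = toy['Title'] + '.\n\n' + toy['Abstract_cleaned']

# Guarded concatenation: replace missing abstracts with an empty string first
safe = toy['Title'] + '.\n\n' + toy['Abstract_cleaned'].fillna('')

print(plain.isna().tolist())
print(safe.tolist())
```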
Now, we will extract embeddings from the text data using the sentence-transformers package. Embeddings are numerical representations of the meaning of a text. Because there are more than 5k articles in the file, we will work with a subset of around 1,000 articles (those published after 2021) to speed up the process. Typically, you would want the embeddings for all articles.

In [None]:
# subset articles based on year
data_sample = data[data['Year'] > 2021].reset_index(drop=True)

#How many articles do we end up with?
print(len(data_sample))

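To see how a year cutoff translates into a sample size, it can help to count articles per year and accumulate the counts from the newest year backwards. The sketch below does this on a made-up `Year` column (`toy` and its values are hypothetical, not the workshop data); the same two lines could be applied to `data['Year']`.

```python
import pandas as pd

# Made-up publication years standing in for the workshop file
toy = pd.DataFrame({'Year': [2019, 2020, 2020, 2021, 2022, 2022, 2022, 2023]})

# Number of articles per year, newest year first
per_year = toy['Year'].value_counts().sort_index(ascending=False)

# Running total: articles kept if we keep this year and all later years
kept_if_cutoff = per_year.cumsum()
print(kept_if_cutoff)

# Cross-check against the filter style used in the cell above
n_after_2021 = int((toy['Year'] > 2021).sum())
print(n_after_2021)
```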
The code uses the 'all-MiniLM-L6-v2' model, a small and efficient embedding model, to extract one embedding per text. The model encodes each text into a 384-dimensional vector. The cell then displays the embeddings, which are returned as a NumPy array with one row per article.


In [None]:
# Initialize embedding extraction pipeline
model_ckpt = 'all-MiniLM-L6-v2'
#model_ckpt = 'all-mpnet-base-v2' #this is a somewhat larger model that may take more time
model = SentenceTransformer(model_ckpt)

# Extract features (pass the texts as a plain list of strings)
embeddings = model.encode(data_sample['text'].tolist(), show_progress_bar=True)
embeddings

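A common way to compare two embeddings is cosine similarity: texts with similar meaning should yield vectors that point in similar directions. The sketch below illustrates the computation on made-up 4-dimensional vectors (`doc_a`, `doc_b`, `doc_c` are hypothetical; the real embeddings have 384 dimensions). The helper `cosine_similarity` is written here for illustration and is not part of the workshop code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors standing in for embeddings of three texts
doc_a = np.array([0.9, 0.1, 0.0, 0.2])
doc_b = np.array([0.8, 0.2, 0.1, 0.3])  # similar direction to doc_a
doc_c = np.array([0.0, 0.9, 0.8, 0.0])  # different direction from doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```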
To visualize the article embeddings (and reduce their dimensionality for clustering downstream), the code initializes a `PaCMAP` object and uses it to project the embeddings into 2 components (because we want to create a 2-dimensional map).

In [None]:
# PaCMAP dimensionality reduction
pacmap_model = pacmap.PaCMAP(n_components=2, n_neighbors=50, MN_ratio=3, FP_ratio=10.0, distance="angular")
lyt = pacmap_model.fit_transform(embeddings)

### Clustering
Next, we perform hierarchical clustering on the 2-dimensional map using Euclidean distance. We first compute pairwise distances with `pdist`, then build a linkage matrix using the complete linkage method (which defines the distance between two clusters as the largest pairwise distance between their points). Finally, we form clusters by cutting the hierarchy to produce a specified number of clusters (set via `t = ...`).

In [None]:
#Hierarchical clustering
distance_matrix = pdist(lyt, metric='euclidean')
Z = linkage(distance_matrix, method='complete')
clustering = fcluster(Z, t=3, criterion='maxclust')  # t = ... indicates the number of clusters

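To get a feel for how the `t` argument changes the result, the same three calls can be tried on synthetic data. This is a minimal sketch assuming three well-separated, made-up point clouds (`centers` and `points` are hypothetical and unrelated to the article map); it prints the cluster sizes for a few values of `t`.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Three made-up 2D point clouds, 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Same steps as above: pairwise distances, complete linkage, cut the tree
Z = linkage(pdist(points, metric='euclidean'), method='complete')

for t in (2, 3, 4):
    labels = fcluster(Z, t=t, criterion='maxclust')  # at most t clusters
    sizes = np.bincount(labels)[1:]  # cluster labels start at 1
    print(f"t={t}: cluster sizes {sizes.tolist()}")

labels_3 = fcluster(Z, t=3, criterion='maxclust')
```

With clearly separated groups like these, `t=3` recovers the three point clouds; on the article map the groups overlap more, so the choice of `t` is a judgment call.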
### Plotting

These clusters can then be visualized:


In [None]:
#set random seed so that the jitter added to the points is reproducible
np.random.seed(42)
jitter = np.random.normal(scale=.3, size=lyt.shape)
plt.figure(figsize=(8, 6))
plt.scatter(lyt[:, 0] + jitter[:, 0], lyt[:, 1] + jitter[:, 1], c=clustering, cmap="Set1", s=20)

# Label cluster centers
for cluster_id in np.unique(clustering):
    idx = np.where(clustering == cluster_id)[0]
    cx, cy = lyt[idx, 0].mean(), lyt[idx, 1].mean()
    plt.text(cx, cy, str(cluster_id), color='black', fontsize=10, ha='center')

plt.title("PaCMAP + Hierarchical Clustering on Sentence Embeddings")
plt.axis('off')

#save the plot to a file (add # in front of the plt.savefig line to skip saving)
#generate current timestamp
timestamp = datetime.now().strftime("%H-%M-%S")

#create dynamic filename including time
#feel free to choose a more meaningful name
filename = f"clusters_pacmap_{timestamp}.png"

#save plot
plt.savefig(filename, dpi=300, bbox_inches='tight')

#show plot
plt.show()

### Tasks

**TASK 1:** Try out different numbers of clusters and see how the map changes (you will need to change the `t = ...` argument in the `clustering = fcluster(...)` line). What number do you think works well for the visualization?

**TASK 2:** Sample about 1,000 articles from the beginning of the database and re-run the analysis. How does the map change?

You will have to modify this line:

```
data_sample = data[data['Year'] > 2021].reset_index(drop=True)
```



**TASK 3:** Some article clusters appear more isolated than others. What might be the reason for this?