# 2. Extracting topics from research articles
In this exercise we will try to extract taxonomic tags from articles to evaluate the topics and methods used in different article clusters. This time, we will use a text generation approach.

### **Setting up environment**
If you have not done this yet during exercise 1:
* Make sure you have a Hugging Face account (https://huggingface.co/join).
* Go to the meta-llama/Llama-3.2-1B-Instruct model page and fill in the 'COMMUNITY LICENSE AGREEMENT' form at the top of the page to get access to the model (this may take a few minutes).

Make sure to set your runtime to use a GPU by going to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`

In [3]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Install requisite packages
    !pip install transformers accelerate &> /dev/null

    # Change working directory
    %cd /content/drive/MyDrive/LLM_SIBR

Mounted at /content/drive
/content/drive/MyDrive/LLM_SIBR


In [2]:
import pandas as pd
import numpy as np
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
import json
import re
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from math import acos, pi
from collections import Counter, defaultdict

### Generating taxonomic tags
We will first load the data and concatenate the semantic content of titles and abstracts.

In [None]:
# Load data
data = pd.read_csv('data_cleaned_filtered_clustered.csv')

For time reasons, the code next samples up to 10 articles from each continent and stores their concatenated titles and abstracts in the `text` column of `sampled_data` (you can also sample more if you have time to spare in the exercise):

In [None]:
# Sample 10 articles from each continent
sampled_data = data.groupby('continent').apply(lambda x: x.sample(n=min(len(x), 10), random_state=42)).reset_index(drop=True)

# Concatenate title and abstract
sampled_data['text'] = sampled_data['Title'] + ' ' + sampled_data['Abstract_cleaned']

# Print the first 5 rows of the new DataFrame to verify the formatting
display(sampled_data.head())
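
One thing to watch for in the concatenation step: if a title or abstract is missing, `+` on pandas string columns yields `NaN` for the whole row. The following is a minimal sketch on made-up toy data (the rows are invented for illustration and are not from the exercise dataset) showing one way to guard against that with `fillna`:

```python
import pandas as pd

# Toy data standing in for the real articles (values are made up for illustration)
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "Title": ["Title A", "Title B", "Title C"],
    "Abstract_cleaned": ["abstract a", None, "abstract c"],
})

# Replace missing values with empty strings before concatenating,
# then strip the stray space left behind when one part is empty
toy["text"] = (
    toy["Title"].fillna("") + " " + toy["Abstract_cleaned"].fillna("")
).str.strip()

print(toy["text"].tolist())
```

If your data has no missing titles or abstracts, the plain concatenation in the cell above is sufficient.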

The code next loads the LLM and its corresponding tokenizer. We will use "meta-llama/Llama-3.2-1B-Instruct", a recent model that shows impressive performance given its relatively small size. The main advantage of the smaller size is that the model can be run on the freely available GPUs on Google Colab.

The code begins by setting the random seed. This helps ensure the reproducibility of the often stochastic processes involved in training and running LLMs. It next asks you to provide your Hugging Face access token. Please generate a token by clicking '+ Create new token' > 'Read' > 'Create token' on https://huggingface.co/settings/tokens. You will then need to copy-paste the token into 'your_access_token_here' in the code below in order to download the model. The code then loads the model and tokenizer. The model is loaded onto the GPU via `device_map="cuda"` and set to half precision via `torch_dtype=torch.float16` to save GPU memory. The `trust_remote_code=True` argument allows custom code from the model repository to be executed if the model requires it, and `attn_implementation='eager'` selects the standard attention implementation, which the notebook uses for inference on the T4 GPUs available on Google Colab.

Troubleshooting: If you receive an error about using a GPU when running the code below, you might still have to click `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`

In [None]:
torch.random.manual_seed(42) # For reproducibility

access_token = 'your_access_token_here'

model_ckpt = 'meta-llama/Llama-3.2-1B-Instruct'
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    device_map="cuda", # Use GPU
    torch_dtype=torch.float16, # Use half-precision
    trust_remote_code=True,
    attn_implementation='eager', # For faster inference on T4 GPUs
    token=access_token
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
    token=access_token,
)

The code next initializes a `transformers` pipeline for text generation. This is a high-level API that allows for easy text generation using the pre-trained models. We will use this pipeline to characterize the clusters. The pipeline takes two arguments in addition to the task (`"text-generation"`):

1. `model`: The model to use for text generation.
2. `tokenizer`: The tokenizer to use for text generation.

In [None]:
# Initialize pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

We next set our text generation hyperparameters in `generation_args`, which is later fed into the text generation `pipe`. We will be asking the model to produce 3-5 main tags characterizing each article.
Since we only want the model to output a few tags, the code caps the length of the generation with `"max_new_tokens": 150`. It also sets `"do_sample": False`, so the model always picks the most likely next token (greedy decoding) and the output is deterministic. `"temperature"` and `"top_p"` are set to `None`, since these parameters do not apply when `do_sample=False`. `"return_full_text": False` returns only the newly generated text rather than the prompt plus the completion. The `pad_token_id` is set to the end-of-sequence token ID, which is recommended when using the Hugging Face pipeline for text generation:

In [None]:
# Text generation arguments
generation_args = {
    "max_new_tokens": 150,
    "return_full_text": False,
    "do_sample": False,
    "temperature": None,
    "top_p": None,
    "pad_token_id": pipe.tokenizer.eos_token_id
}

The code next initializes our model prompts. Conventionally, the `system_prompt` instructs the model with a general "vibe"/persona that it should take on, whilst the `user_prompt_template` provides the task-specific instructions. However, there are no strict rules for what you should put in the system versus the user prompt: what works and what doesn't depends on the specific assistant-oriented fine-tuning regime the model went through, which varies from model to model. Curly brackets {} act as a placeholder in the prompt that can be filled with an article's title and abstract.

## Task 1: Develop and implement an effective system and user prompt
### System prompt
* The system prompt is typically rather short. For example, if you describe the task and character of a very helpful research assistant to another person in one or two sentences, this could already be close to a good system prompt.

### User prompt
* First, you should describe the task as precisely as possible (i.e., analyze the provided article and characterize its core subject and methodology; extract 3-5 non-redundant taxonomic tags).
* Next, you can provide further rules for the tag generation, for instance how specific and concise the tags should be, how strictly the model should stick to the text provided, or whether it should focus more strongly on the topic or the methodology.
* Remember to add curly brackets {} where you want to feed in the text from the article.
* Lastly, tell the model what output structure you want, which will make it much easier to process the tags afterwards. In some cases it can be helpful to implement a chain-of-thought process. This means that you ask the model to provide a brief reasoning before generating the tags. Then, tell the model what format you would like to receive your tags in. For example:
"Exclusively return the list of the 3-5 generated taxonomic tags. Format the output as a JSON object with the key 'tags'. You are not allowed to add any other text."
* If you don't want to start from scratch, have a look at the

To test how different prompts behave, consider using a subset of the data to reduce the run time, e.g.,

```
test_data = sampled_data[0:5]
```
and replace `sampled_data` with `test_data` in the line

```
for index, row in sampled_data.iterrows():
```
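
One practical detail when writing the template: the code below fills the `{}` placeholder with Python's `str.format`, so any literal curly brackets in your prompt (for example in a JSON output example) have to be doubled as `{{` and `}}`, otherwise `format` raises an error. A minimal sketch, using a hypothetical template that is only meant to show the escaping and is not a recommended prompt:

```python
# Hypothetical template for illustration only; write your own for the exercise
template = """Article: {}

Return a JSON object of the form {{"tags": ["tag one", "tag two"]}}."""

# The single {} is replaced by the article text; {{ and }} become literal braces
prompt = template.format("Some title and abstract text.")
print(prompt)
```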




In [None]:
# Initialise system prompt and user prompt template
system_prompt = "Add your system prompt here."
user_prompt_template = """
Add your user prompt here. Read the instructions above for some hints. Or copy paste the prompt from the cheatsheet to edit.
"""



# Iterate through the sampled articles from each continent (their titles and abstracts), generate the model's labels, save to cluster_labels
cluster_labels = {}
for index, row in sampled_data.iterrows():
    continent = row['continent']
    text = row['text']

    # Insert the article's title and abstract into the user prompt template
    user_prompt = user_prompt_template.format(text)

    # JSON format the system and user prompt for feeding into the model
    message = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt}
    ]

    # Generate text and access output at index 0 at key 'generated_text'
    output = pipe(message, **generation_args)[0]['generated_text']

    # Store the output together with the index number and continent information
    cluster_labels[index] = {'index': index, 'continent': continent, 'output': output}

cluster_labels = pd.DataFrame(cluster_labels).T

display(cluster_labels.head())

Because we limited the number of new tokens, the closing brackets of the JSON object may be missing when the model produces overly long or nonsensical output (which may happen once in a while).

In [4]:
# If the LLM tag generation didn't work out for you for some reason, you can open tags from a file
# remove the hashtag and run the code below
# cluster_labels =  pd.read_csv('cluster_labels.csv')

In [None]:
#let's make sure the json output is properly closed
cluster_labels['output_clean'] = cluster_labels['output'].apply(
    lambda x: x if re.search(r'"\s*\]\s*\}$', x.strip()) else x.strip() + '" ] }'
)

display(cluster_labels.head())

In [None]:
# try to clean up the output, you will need to end up with a column that contains the tags as a list
cluster_labels['tags'] = cluster_labels['output_clean'].apply(lambda x: json.loads(x)['tags'])

# check what it looks like
display(cluster_labels.head())

# flatten the tag list and calculate counts
tag_counts = Counter(tag for tags in cluster_labels['tags'] for tag in tags)
tags_unique = list(tag_counts.keys())
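
If `json.loads` fails on some rows (for example because the model wrapped the JSON in extra text or a code fence), a more forgiving parser can help. The sketch below is one possible approach, not part of the original exercise; `extract_tags` is a hypothetical helper name, and it silently returns an empty list whenever no valid `tags` list can be recovered:

```python
import json
import re

def extract_tags(raw):
    """Return the list under the 'tags' key, or an empty list if parsing fails."""
    # Grab the span from the first '{' to the last '}' so surrounding text is ignored
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, dict):
        return []
    tags = parsed.get("tags", [])
    if not isinstance(tags, list):
        return []
    # Keep only non-empty strings
    return [t.strip() for t in tags if isinstance(t, str) and t.strip()]

print(extract_tags('Sure! {"tags": ["model-based RL", "fMRI"]}'))
print(extract_tags("no json here"))
```

In the notebook this could be applied as `cluster_labels['output_clean'].apply(extract_tags)`; rows with an empty list are worth inspecting by hand.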


## Clustering tags
Because we may find considerable redundancy and semantic variation among the tags, we will cluster semantically similar tags together. For this purpose, we will first use an embedding model to extract the semantic embeddings of the tags. We will then use hierarchical clustering methods to group similar tags together based on the cosine similarity of the tags and assign the most frequent tag as the group label.

We'll start by embedding the tags (for better performance in an environment that is not as RAM constrained, you can also use other models like 'Salesforce/SFR-Embedding-Mistral').

In [None]:
# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create instruct-style prompts for embedding
prompts = [f"Instruct: Embed the distinct meaning of this behavioral reinforcement learning concept.\nQuery: {tag}"
           for tag in tags_unique]

# Encode tags using the model
tag_emb = model.encode(prompts, convert_to_numpy=True)
tag_emb_df = pd.DataFrame(tag_emb, index=tags_unique)


Next, we will compute the cosine similarity of the embeddings and rescale it to the range [0, 1] by converting each value to an angle. Finally, we set a high similarity cutoff at the 80th percentile of the pairwise similarities, lowered by a tiny epsilon for numerical precision.

In [None]:
# Cosine similarity
tag_cos = cosine_similarity(tag_emb_df)
np.fill_diagonal(tag_cos, 1.0)

# Normalize cosine similarity to range [0,1] using acos
tag_cos[tag_cos > 1] = 1
tag_cos[tag_cos < -1] = -1
tag_cos = 1 - np.arccos(tag_cos) / pi

# Set a high similarity cutoff
eps = 1 / (20 ** 10)
triu_vals = tag_cos[np.triu_indices_from(tag_cos, k=1)]
cutoff = np.quantile(triu_vals, 0.8) - eps
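
To see what the `arccos` rescaling does, it can help to apply it to a few hand-picked cosine values. This is a small illustrative sketch with made-up inputs; it uses `np.clip` as a compact equivalent of the two clamping lines in the cell above:

```python
import numpy as np

# Example cosine similarities: opposite, orthogonal, 60 degrees apart, identical
cos_vals = np.array([-1.0, 0.0, 0.5, 1.0])

# Clip to the valid domain of arccos, then rescale the angle to [0, 1]
angular_sim = 1 - np.arccos(np.clip(cos_vals, -1, 1)) / np.pi
print(angular_sim)
```

Opposite vectors map to 0, orthogonal vectors to 0.5, and identical vectors to 1, so the result is a similarity that is linear in the angle between two embeddings.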

In the next step, and similar to exercise 1, we will apply hierarchical clustering to identify which tags should be grouped together because they may have a similar meaning.

In [None]:
# Convert similarity to distance
distance_matrix = 1 - tag_cos
condensed_dist = squareform(distance_matrix, checks=False)
linkage_matrix = linkage(condensed_dist, method='complete')

# Cut tree at dynamic level
num_steps = sum(linkage_matrix[:, 2] < (1 - cutoff))
num_clusters = len(tags_unique) - num_steps
cluster_labels_array = fcluster(linkage_matrix, t=num_clusters, criterion='maxclust')

# Map cluster numbers to tags
tag_to_cluster = dict(zip(tags_unique, cluster_labels_array))

# Group tags into cliques (clusters)
tag_cliques = defaultdict(list)
for tag, cluster_id in tag_to_cluster.items():
    tag_cliques[cluster_id].append(tag)
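
The tree-cutting logic is easier to follow on a tiny example. The sketch below mirrors the steps above on a made-up similarity matrix for four hypothetical tags, with an assumed fixed cutoff of 0.8 instead of the quantile-based one:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Made-up similarity matrix for four hypothetical tags
toy_tags = ["q-learning", "Q learning", "dopamine", "dopamine signaling"]
toy_sim = np.array([
    [1.00, 0.90, 0.20, 0.30],
    [0.90, 1.00, 0.25, 0.20],
    [0.20, 0.25, 1.00, 0.85],
    [0.30, 0.20, 0.85, 1.00],
])
toy_cutoff = 0.8  # assumed similarity cutoff for this toy example

# Complete linkage merges two groups at the distance of their least similar pair
Z = linkage(squareform(1 - toy_sim, checks=False), method="complete")

# Count the merges that happen below the distance threshold, then cut the tree there
n_merges = int(np.sum(Z[:, 2] < 1 - toy_cutoff))
labels = fcluster(Z, t=len(toy_tags) - n_merges, criterion="maxclust")
print(dict(zip(toy_tags, labels)))
```

Here the two spelling variants of Q-learning end up in one group and the two dopamine tags in another, because only those pairs are more similar than the cutoff.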

What we still need to do is create a new dictionary that assigns the most frequent tag in each group as the group label (which, ultimately, is the tag or keyword that we will use to characterize the article content), and apply it back to the dataframe to find out which of the new tag groups appear in which continent.

In [None]:
tag_dict = {}
for clique_tags in tag_cliques.values():
    counts = [tag_counts[t] for t in clique_tags]
    label = clique_tags[np.argmax(counts)]
    for t in clique_tags:
        tag_dict[t] = label

# Clean each tag list using the dictionary
cluster_labels['tags_clean'] = cluster_labels['tags'].apply(
    lambda tags: list(dict.fromkeys(tag_dict.get(tag, tag) for tag in tags))
)

In the next step, we want to evaluate at the level of continents which keywords appear most frequently.

In [None]:
# Explode cleaned tags so each tag is a separate row (linked to continent)
exploded = cluster_labels.explode('tags_clean')

# Group by continent and tag, then count occurrences
freq_per_continent = exploded.groupby(['continent', 'tags_clean']).size().reset_index(name='frequency')

# sort by continent and frequency descending
freq_per_continent = freq_per_continent.sort_values(['continent', 'frequency'], ascending=[True, False])

#filter to only keep the tags with a frequency higher than 1 and remove the tag "reinforcement learning" (which is somewhat meaningless in this dataset)
freq_per_continent = freq_per_continent[
    (freq_per_continent['frequency'] > 1) &
    (~freq_per_continent['tags_clean'].str.lower().eq('reinforcement learning'))
]

#look at the outcome
freq_per_continent
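
The explode-then-count pattern used above can be checked on a toy table. This sketch uses invented rows and the same column names as the notebook:

```python
import pandas as pd

# Invented rows: each article has a continent and a list of cleaned tags
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "tags_clean": [["bandits", "exploration"], ["bandits"], ["exploration"]],
})

# One row per (article, tag) pair, then count pairs per continent and tag
counts = (
    toy.explode("tags_clean")
       .groupby(["continent", "tags_clean"])
       .size()
       .reset_index(name="frequency")
)
print(counts)
```

Each list element becomes its own row after `explode`, so a tag that appears in two European articles is counted twice for Europe.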

## Task 2
How do the resulting clustered keywords change if you adapt the cutoff criterion to a higher or lower level?
This is the line that you will have to modify:

```
cutoff = np.quantile(triu_vals, 0.8) - eps
```
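
One way to explore this systematically is to wrap the cutoff and tree-cutting steps in a small function and call it for several quantile levels. The sketch below is an assumption-laden helper, not part of the original notebook: `n_clusters_at` is a hypothetical name, and the similarity matrix is made up so that the block runs on its own.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def n_clusters_at(sim, q, eps=1 / (20 ** 10)):
    """Number of tag groups when the cutoff is the q-th quantile of pairwise similarities."""
    pairwise = sim[np.triu_indices_from(sim, k=1)]
    cutoff = np.quantile(pairwise, q) - eps
    Z = linkage(squareform(1 - sim, checks=False), method="complete")
    # Every merge below the distance threshold reduces the number of groups by one
    return len(sim) - int(np.sum(Z[:, 2] < 1 - cutoff))

# Made-up similarity matrix for four hypothetical tags
toy_sim = np.array([
    [1.00, 0.90, 0.20, 0.30],
    [0.90, 1.00, 0.25, 0.20],
    [0.20, 0.25, 1.00, 0.85],
    [0.30, 0.20, 0.85, 1.00],
])

for q in (0.5, 0.8, 0.95):
    print(q, n_clusters_at(toy_sim, q))
```

In the notebook, the same function could be called as `n_clusters_at(tag_cos, q)` to compare how many tag groups remain at each level before rerunning the full analysis.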



## Task 3

It would be cool to look at the variation in keywords that your analysis produced. Navigate to this survey and enter your most frequently occurring keyword as the label for each continent: https://forms.gle/wADcXjtvgChuWkTX8
