# 3. Labeling and Retrieval
In this exercise, we will try to give interpretable labels to the clusters generated in [2_clustering.ipynb](https://github.com/Zak-Hussain/LLM4SciSci/blob/main/2_clustering.ipynb), this time using a text generation approach instead of keyword occurrence statistics. We will also try to extract methodological information from the article PDFs.

By the end of this notebook, you will be able to:
- Load a pre-trained causal LLM for text generation and run it on a GPU.
- Implement (zero-shot) labelling and information extraction with a text generation model.

## Environment Setup
Make sure to set your runtime to use a GPU by going to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Installing requisite packages
    !pip install transformers accelerate pymupdf &> /dev/null

    # Change working directory
    %cd /content/drive/MyDrive/LLM4SciSci

In [1]:
import pandas as pd
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
import os
import fitz
from tqdm.notebook import tqdm_notebook as tqdm
from huggingface_hub import InferenceClient

## Assigning interpretable labels to clusters

We begin by loading the dataset as a `pandas.DataFrame`:

In [2]:
# Load data
data = pd.read_csv('science_of_science_clusters.csv')

# Convert cluster labels to string type so they don't get misinterpreted as ordinal
data['cluster'] = data['cluster'].astype(str)
data

Unnamed: 0,title,abstract,keywords,year,citations,text,cluster
0,Machine learning misclassification networks re...,Given a large enough volume of data and precis...,Interdisciplinary research; Machine learning; ...,2024,3,Machine learning misclassification networks re...,4
1,Dynamic patterns of the disruptive and consoli...,Scientific breakthroughs possess the transform...,citation network; disruption; Nobel Prize; sci...,2024,2,Dynamic patterns of the disruptive and consoli...,4
2,Automating the practice of science: Opportunit...,Automation transformed various aspects of our ...,AI for science; automation ; computational sci...,2025,1,Automating the practice of science: Opportunit...,0
3,Asian American Representation Within Psycholog...,"As a racial group, Asians are incredibly diver...",Asian/Asian American; diversity; intersectiona...,2024,1,Asian American Representation Within Psycholog...,2
4,Bibliometric analysis of publications on trabe...,"Purpose: Trabecular bone score (TBS), as a tex...",Bone mineral density; Fracture risk; Knowledge...,2024,0,Bibliometric analysis of publications on trabe...,1
...,...,...,...,...,...,...,...
1119,The science of science foundation,[No abstract available],,1965,1,The science of science foundation./n/n[No abst...,3
1120,Bibliographic coupling: A review,The theory and practical applications of bibli...,,1974,224,Bibliographic coupling: A review./n/nThe theor...,1
1121,Behavioristisk kritik av psykoanalysen,"As fas as metascience is concerned, Schioldbor...",,1971,3,Behavioristisk kritik av psykoanalysen./n/nAs ...,2
1122,The R & D information gap or the social scienc...,[No abstract available],,1967,0,The R & D information gap or the social scienc...,0


The code next samples 20 articles from each cluster and saves their concatenated titles to `cluster_titles`:

In [None]:
# Sample 20 titles from each cluster
cluster_titles = {}
for cluster in data['cluster'].unique():
    sampled_titles = (
        data.query('cluster == @cluster')['title']
        .sample(n=20, random_state=42)
    )
    sampled_titles = sampled_titles.str.cat(sep='.\n\n') + '.'
    cluster_titles[cluster] = sampled_titles

# Print first cluster of titles to make sure formatting looks right
print(cluster_titles['0'])
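
Note that `.sample(n=20)` raises a `ValueError` when a cluster contains fewer than 20 articles. A minimal sketch of one way to guard against this is to cap the sample size at the cluster size. The toy DataFrame and the helper name `sample_cluster_titles` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def sample_cluster_titles(df, n=20, seed=42):
    """Return {cluster: concatenated titles}, sampling at most n titles per cluster."""
    cluster_titles = {}
    for cluster, group in df.groupby('cluster'):
        # Cap the sample size so that small clusters do not raise an error
        titles = group['title'].sample(n=min(n, len(group)), random_state=seed)
        cluster_titles[cluster] = titles.str.cat(sep='.\n\n') + '.'
    return cluster_titles

# Toy data standing in for science_of_science_clusters.csv
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'cluster': ['0', '0', '0', '1'],
})
toy_titles = sample_cluster_titles(toy, n=2)
print(toy_titles['1'])
```

With the real data, `sample_cluster_titles(data)` should produce the same kind of dictionary as the loop above.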

The code next loads the LLM and its corresponding tokenizer. We will use [`"meta-llama/Llama-3.2-1B-Instruct"`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), a recent model that shows impressive performance given its relatively small size. The main advantage of the smaller size is that the model can be run on the freely available GPUs on Google Colab.

The code begins by setting the random seed. This helps ensure the reproducibility of the often stochastic processes involved in training and running LLMs. **It next asks you to provide your [Hugging Face access token](https://huggingface.co/settings/tokens). Please generate a token by clicking '+ Create new token' > 'Read' > 'Create token'. You will then need to copy-paste the token into `'your_access_token_here'` in the code below in order to download the model.**

The code then loads the model and tokenizer. The model is loaded onto the GPU via `device_map="cuda"` and is set to use half-precision via `torch_dtype=torch.float16` to save GPU memory. The `trust_remote_code=True` argument allows custom model code hosted in the model repository to be run, and `attn_implementation='eager'` selects the plain PyTorch attention implementation, which the notebook uses for faster inference on the T4 GPUs available on Google Colab.

In [None]:
torch.random.manual_seed(42) # For reproducibility

access_token = 'your_access_token_here'

model_ckpt = 'meta-llama/Llama-3.2-1B-Instruct'
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    device_map="cuda", # Use GPU
    torch_dtype=torch.float16, # Use half-precision
    trust_remote_code=True,
    attn_implementation='eager', # For faster inference on T4 GPUs
    token=access_token
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
    token=access_token,
)
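
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative follows, assuming the token has been stored beforehand in an environment variable named `HF_TOKEN`. The helper `get_access_token` is a hypothetical name introduced here:

```python
import os

def get_access_token(env_var='HF_TOKEN', fallback=None):
    """Read the Hugging Face token from an environment variable.

    Returns `fallback` if the variable is unset or empty.
    """
    token = os.environ.get(env_var, '').strip()
    return token or fallback

# Falls back to the placeholder when HF_TOKEN is not set
access_token = get_access_token(fallback='your_access_token_here')
```

The resulting `access_token` can then be passed to `from_pretrained` via `token=access_token`, exactly as in the cell above.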


The code next initializes a `transformers` pipeline for text generation. This is a high-level API that allows for easy text generation using the pre-trained models. We will use this pipeline to characterize the clusters. The pipeline takes two arguments in addition to the task (`"text-generation"`):

1. `model`: The model to use for text generation.
2. `tokenizer`: The tokenizer to use for text generation.

In [None]:
# Initialize pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

We next set our text generation hyperparameters in `generation_args`, which is later fed into the text generation `pipe`. We will be asking the model to produce 3-5 main themes characterizing each cluster. Since we only want the model to output a few themes, the code places a hard limit on the output length by setting `"max_new_tokens": 50`. Setting `"return_full_text": False` returns only the newly generated text rather than the prompt plus the completion. Setting `"do_sample": False` makes the model always pick the most likely next token (greedy decoding), so the output is deterministic. The code also sets `"temperature"` and `"top_p"` to `None`, since these parameters do not apply when `do_sample=False`. The `pad_token_id` is set to the end-of-sequence token ID, which is recommended when using the Hugging Face pipeline for text generation:

In [None]:
# Text generation arguments
generation_args = {
    "max_new_tokens": 50,
    "return_full_text": False,
    "do_sample": False,
    "temperature": None,
    "top_p": None,
    "pad_token_id": pipe.tokenizer.eos_token_id
}
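
To make the role of `do_sample` concrete, the following toy sketch contrasts greedy decoding (always pick the highest-scoring token) with temperature sampling for a single decoding step. The vocabulary and scores are made up for illustration, and the real `transformers` implementation is considerably more involved:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_step(vocab, logits):
    """do_sample=False: deterministically return the highest-scoring token."""
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

def sampling_step(vocab, logits, temperature=1.0, rng=random):
    """do_sample=True: draw a token at random according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ['citations', 'peer review', 'funding']  # Made-up tokens
logits = [2.0, 1.0, 0.5]                          # Made-up scores

print(greedy_step(vocab, logits))  # Same token on every call
print(sampling_step(vocab, logits, temperature=0.7, rng=random.Random(42)))
```

Greedy decoding ignores the temperature entirely, which is why `"temperature"` and `"top_p"` are set to `None` in `generation_args`.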

The code next initializes our model prompts. The `system_prompt` instructs the model with a general "vibe"/persona that it should take on. The `user_prompt_template` provides the task-specific instructions. The curly brackets `{}` act as a placeholder that can be filled with the article titles in each cluster using Python's string `.format()` method.

In [None]:
# Initialise system prompt and user prompt template
system_prompt = "You are a metascience expert who specializes in assigning faithful and informative labels to clusters of article titles."
user_prompt_template = """
Which themes describe the following cluster of articles?
--------------------------
{}
--------------------------
Provide 3-5 main themes (no more, no less), each separated by a comma. Do NOT provide any other output.
"""
# Iterate through each cluster of titles, generate the model's labels, save to cluster_labels
cluster_labels = {}
for cluster, titles in cluster_titles.items():

    # Insert article titles into the prompt_template
    user_prompt = user_prompt_template.format(titles)

    # Format the system and user prompts as a list of chat messages for the model
    message = [
        {'role': 'system', 'content': system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Generate text and access output at index 0 at key 'generated_text'
    output = pipe(message, **generation_args)[0]['generated_text']

    # Save cluster labels
    cluster_labels[cluster] = output

cluster_labels

The model outputs are relatively informative and not vastly different from the PMI-based cluster labels generated in the last exercise (which provides a nice sanity check). However, the model does not appear to be obeying instructions properly, as it often provides more themes than the 3-5 it was instructed to provide.
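
One way to check instruction-following systematically, rather than by eye, is to split each output on commas and count the themes. The following is a minimal sketch. The helper `parse_themes` and the example output are made up for illustration:

```python
def parse_themes(output, min_themes=3, max_themes=5):
    """Split a comma-separated model output into themes and check the count."""
    themes = [t.strip().rstrip('.') for t in output.split(',')]
    themes = [t for t in themes if t]  # Drop empty fragments
    follows_instructions = min_themes <= len(themes) <= max_themes
    return themes, follows_instructions

# Made-up output for illustration (not an actual model response)
example_output = "Bibliometrics, Citation analysis, Peer review, Open science, Funding, Gender bias"
themes, ok = parse_themes(example_output)
print(len(themes), ok)  # → 6 False
```

Applied to the notebook's results, `{c: parse_themes(o)[1] for c, o in cluster_labels.items()}` would flag which clusters received an out-of-range number of themes.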

**TASK 1**: Try increasing the model size from 1B to 3B parameters by switching `model_ckpt` from `'meta-llama/Llama-3.2-1B-Instruct'` to `'meta-llama/Llama-3.2-3B-Instruct'` (you will need to re-run the subsequent cells). Is the larger model better at following instructions?


## Information extraction from article PDFs

In this section, we will use text generation to extract methodological information from metascience article PDFs. Specifically, we will ask the model to identify whether each article in our `LLM4SciSci/SciSci_articles` directory used knowledge graphs.

Unfortunately, feeding long texts into LLMs can require a lot of RAM (by default, the RAM footprint grows quadratically with the text length). In order to ensure we have enough RAM, as well as a sufficiently large/powerful model for the task, we will switch to using the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/en/getting-started). Specifically, we will make use of [`'mistralai/Mistral-7B-Instruct-v0.3'`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3).
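
As a rough back-of-the-envelope illustration of the quadratic growth: a naive attention implementation materialises one score per pair of tokens, per attention head. The sketch below estimates the size of that score matrix for a single layer. The head count (32) and the 2 bytes per value (half precision) are assumptions chosen for illustration, not measurements of any specific model:

```python
def attention_scores_bytes(n_tokens, n_heads=32, bytes_per_value=2):
    """Memory for one layer's n_tokens x n_tokens attention scores across all heads."""
    return n_heads * n_tokens * n_tokens * bytes_per_value

for n_tokens in (1_000, 2_000, 4_000, 8_000):
    gib = attention_scores_bytes(n_tokens) / 1024**3
    print(f'{n_tokens:>5} tokens: {gib:.2f} GiB per layer')
```

Under these assumptions, doubling the text length quadruples this term, which is why full-length articles quickly become impractical on a small GPU.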

We begin by initialising the HF inference client with our access token:

In [None]:
client = InferenceClient(token=access_token)

The code next assigns the `system_prompt` and `user_prompt_template` for the task:

In [None]:
# Initialise system and user prompt template
system_prompt = "You are a metascience expert who specializes in identifying whether an article uses knowledge graphs or not."
user_prompt_template = """
Does the following article make explicit use of knowledge graphs? First provide your judgement in all caps (TRUE/FALSE/UNCLEAR),
then justify your judgement with some text snippets extracted from the article.
Do not provide any further reasoning. Here is the article:

{}
"""

We then iterate through each PDF article, extract the text, and generate the model's judgement of whether the article uses knowledge graphs.

In [None]:
# Get all file names in SciSci_articles directory (skipping the macOS '.DS_Store' file)
file_names = [f for f in os.listdir('SciSci_articles') if f != '.DS_Store']

# Iterate through each file, extract the text, generate the model's response, save to outputs
outputs = {}
for file_name in tqdm(file_names):

    # Read PDF and extract text from selected pages
    pdf = fitz.open(f'SciSci_articles/{file_name}')
    texts = [page.get_text() for page in pdf[:-2]]  # Drop last two pages (mostly references)

    # Concatenate the page texts together with page breaks indicated
    text = '\n\n---page break---\n\n'.join(texts)

    # Format user_prompt_template with PDF text
    user_prompt = user_prompt_template.format(text)
    message = [
        {'role': 'system', 'content': system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Generate text and access the message content of the first returned choice
    output = client.chat_completion(
        messages=message,
        model="mistralai/Mistral-7B-Instruct-v0.3",
        max_tokens=50,
        temperature=0.0
    ).choices[0].message.content

    # Save output
    outputs[file_name] = output
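
One edge case worth keeping in mind: dropping the last two pages from a PDF with only one or two pages leaves no text at all. A minimal sketch of a guard follows, using plain lists of strings to stand in for the extracted page texts. The helper `join_pages` and its `n_drop` parameter are hypothetical names introduced here:

```python
PAGE_BREAK = '\n\n---page break---\n\n'

def join_pages(page_texts, n_drop=2):
    """Join page texts, dropping up to n_drop trailing pages but always keeping at least one."""
    n_keep = max(1, len(page_texts) - n_drop)
    return PAGE_BREAK.join(page_texts[:n_keep])

# Plain strings standing in for the output of page.get_text()
print(join_pages(['intro', 'methods', 'results', 'refs 1', 'refs 2']))
print(join_pages(['short note', 'refs']))
```

For documents with more than two pages, this keeps the same pages as the slice in the loop above.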

Predictions can then be extracted with some simple if-elif-else statements:

In [None]:
predictions = {}
for f_name, output in outputs.items():

    if 'TRUE' in output and 'FALSE' not in output:
        prediction = 'TRUE'
    elif 'FALSE' in output and 'TRUE' not in output:
        prediction = 'FALSE'
    elif 'UNCLEAR' in output:
        prediction = 'UNCLEAR'
    else:
        prediction = 'NaN'

    predictions[f_name] = prediction

predictions = pd.Series(list(predictions.values()), dtype=str)
predictions.hist()
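
Because the prompt asks for the judgement first, an alternative to the substring checks above is to take the first label that appears as a whole word. The following is a sketch of that variant (the helper name `extract_judgement` is an assumption). Note that it behaves differently from the cell above when both `TRUE` and `FALSE` occur in the output, since it trusts whichever comes first:

```python
import re

LABEL_PATTERN = re.compile(r'\b(TRUE|FALSE|UNCLEAR)\b')

def extract_judgement(output):
    """Return the first all-caps label found as a whole word, or 'NaN' if none is present."""
    match = LABEL_PATTERN.search(output)
    return match.group(1) if match else 'NaN'

# Made-up outputs for illustration
print(extract_judgement('TRUE. "We construct a knowledge graph of ..."'))
print(extract_judgement('FALSE. The authors state it is TRUE that citations matter.'))
print(extract_judgement('The article is about bibliometrics.'))
```

Matching on word boundaries also avoids counting a label that is embedded in a longer word.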

Although the distribution of outputs looks reasonable, it could probably be improved. For instance, when compared to the number of times 'knowledge graph' appears in the keywords of all the articles (which, of course, is also not a perfect measure), it seems to potentially identify a few too many positive cases.

In [None]:
# Plotting the global distribution of 'knowledge graph' keyword occurrences
data['keywords'].dropna().str.lower().str.contains('knowledge graph').astype(str).hist()

Inspecting the model's justifications in its outputs also reveals that it may be too flexible with its interpretation of what constitutes a knowledge graph:

In [None]:
outputs

**Final challenge**: Inspect the outputs to get an idea of the situations in which the model may be making mistakes. Then try playing around with the prompt to see if you can improve the quality of the outputs. Some ideas might be:

- Providing a strict definition of knowledge graphs in the prompt.
- Forcing the model to use chain-of-thought (or "chain-of-evidence") by asking it to provide the evidence for its judgement *before* it provides its judgement (tip: you will have to increase `max_tokens` in the `chat_completion` call to ensure that the judgement still makes it into the model output).
- Moving the instructions in `user_prompt_template` to the system prompt (leaving only the curly brackets, `'{}'`, in the `user_prompt_template`).
- Moving the article text to the system prompt and all task-specific instructions to the user prompt.

Tip: Processing all 30 articles for each idea can take a while. To speed up your efforts, consider trying out different ideas in parallel with the people sitting next to you.
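
As a starting point for the first two ideas, here is one possible sketch of a prompt that states a definition and asks for evidence before the verdict, together with a parser that reads the last verdict in the output. The definition wording, the `VERDICT:` convention, and the helper names are all illustrative assumptions rather than a tested recipe:

```python
import re

# Illustrative definition; adjust to the notion of knowledge graph you care about
KG_DEFINITION = (
    "A knowledge graph is a graph-structured representation in which nodes are "
    "entities and edges are explicitly typed relations between those entities."
)

evidence_first_template = """
{definition}

Read the article below. First quote up to three short snippets that are relevant to
whether the article makes explicit use of knowledge graphs. Then, on a new line,
write 'VERDICT:' followed by TRUE, FALSE, or UNCLEAR.

{article}
"""

def build_user_prompt(article_text):
    """Fill the template with the definition and the article text."""
    return evidence_first_template.format(definition=KG_DEFINITION, article=article_text)

def extract_verdict(output):
    """Return the label after the last 'VERDICT:' marker, or 'NaN' if it is missing."""
    matches = re.findall(r'VERDICT:\s*(TRUE|FALSE|UNCLEAR)\b', output)
    return matches[-1] if matches else 'NaN'

prompt = build_user_prompt('Example article text.')
print(extract_verdict('"We build a citation network." VERDICT: FALSE'))
```

Because the evidence now comes first, an output that is cut off by the token limit will contain no verdict and is reported as `'NaN'`, which makes truncation easy to spot.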