# Flowmetrics – LLM Classification

This notebook defines and implements the core classification task of the Flowmetrics framework. It formalises research impact modelling as a two-part problem: (i) predicting impact stages for research topic pairs, and (ii) generating narrative summaries that explain these impact trajectories based on structured evidence.

### Objective

To use large language models (LLMs) in a zero-shot setting to predict the stage-wise impact of research topic pairs and generate structured explanations of how research unfolds across societal dimensions.

### Structure

#### 1. Impact Stage Prediction  
Given a topic pair $(t_A, t_B)$, the model assigns one or more impact stages from the Flowmetrics trajectory:
- **Reach** – Twitter, Facebook, Wikipedia  
- **Engagement** – Blogs, Reddit, YouTube, Mendeley  
- **Feedback** – *Defined in the framework, but not associated with publicly available signals*  
- **Influence** – CrossRef citations, News mentions  
- **Outcome** – Policy mentions, Patent citations

Although the **Feedback** stage is part of the framework — representing expert critique and scholarly evaluation — it is not currently operationalised due to the lack of observable public signals. Incorporating this dimension remains a future direction, especially with the emergence of open peer review datasets.
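
The stage-to-platform association listed above can be written down as a small lookup table. The sketch below is illustrative only: `STAGE_PLATFORMS` and `stage_for_platform` are hypothetical names and are not taken from `config.py`, whose contents are not shown in this notebook.

```python
from typing import Optional

# Hypothetical lookup table mirroring the stage list above.
# Feedback has no platforms because it is not operationalised yet.
STAGE_PLATFORMS = {
    "reach": ["Twitter", "Facebook", "Wikipedia"],
    "engagement": ["Blogs", "Reddit", "YouTube", "Mendeley"],
    "feedback": [],
    "influence": ["CrossRef citations", "News mentions"],
    "outcome": ["Policy mentions", "Patent citations"],
}

def stage_for_platform(platform: str) -> Optional[str]:
    """Return the stage a platform belongs to, or None if it is unknown."""
    for stage, platforms in STAGE_PLATFORMS.items():
        if platform in platforms:
            return stage
    return None

# Stages without any observable signal in this mapping
unobserved = [stage for stage, platforms in STAGE_PLATFORMS.items() if not platforms]

print(stage_for_platform("Reddit"))  # engagement
print(unobserved)                    # ['feedback']
```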

#### 2. Narrative Impact Generation  
Beyond stage classification, the model generates a concise narrative for each topic pair based on the predicted stages and structured metric evidence. This defines the task as a form of **structured data-to-text generation**, where the input graph encodes impact-stage associations and the output provides a coherent account of societal impact.

### Prompting Strategy

We employ a standardised zero-shot prompt template to ensure consistent model behaviour. The prompt encourages inclusive reasoning across weak or cumulative signals. When the cumulative platform evidence is strictly zero, the prompt instructs the model to assign only **Reach**, signalling minimal public accessibility.

# Table of Contents
- [1. Classification pipeline: automating Flowmetrics labeling](#section-1)
    - [1.1 Model Selection and Zero-Shot Setup](#subsection-11)
        - [1.1.1 Load model](#subsection-111)
        - [1.1.2 Prompt templates for zero-shot learning](#subsection-112)
    - [1.2 Prediction](#subsection-12)

In [141]:
import ast  # needed by apply_prompt_template to parse stringified lists
import os
import re
import pandas as pd
import rdflib
from itertools import combinations
from langchain_anthropic import ChatAnthropic
from langchain.chat_models import ChatOpenAI
from langchain_mistralai import ChatMistralAI
from langchain.llms import HuggingFacePipeline
from langchain import LLMChain, PromptTemplate
from rdflib import Graph, Namespace, URIRef, RDFS, XSD, RDF
from transformers import pipeline
from collections import defaultdict
from pathlib import Path
from config import models, impact_stages, stage_order

In [142]:
# project directory
project_dir = Path(".").resolve().parent

## 1. Classification Pipeline: Automating Flowmetrics Labeling

This section defines the zero-shot classification pipeline used to assign Flowmetrics impact stages to research topic pairs. The pipeline leverages large language models (LLMs) to predict one or more stages based on structured metric inputs derived from Altmetric and CrossRef.

Each input instance represents a topic pair, accompanied by evidence of platform-level engagement. The LLM is prompted to reason over these signals and return the most appropriate stages from the predefined typology: **Reach**, **Engagement**, **Feedback**, **Influence**, and **Outcome**.

The classification is performed in a zero-shot setting using a standardised prompt template, which:
- Encourages cumulative and inclusive reasoning across weak or sparse evidence
- Defaults to **Reach** when no platform signals are present
- Supports multi-label output to reflect the multidimensional nature of societal impact
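
For comparison with the model's behaviour, the Reach default can also be expressed as a deterministic rule. The following is a minimal sketch and not part of the pipeline: `stages_with_signal` is a hypothetical helper, the "any non-zero summed score" criterion is an assumption made for illustration, and the scores are made up. The input mirrors the `(platform, score)` lists stored in the DataFrame columns.

```python
STAGES = ["reach", "engagement", "feedback", "influence", "outcome"]

def stages_with_signal(evidence: dict) -> list:
    """Return the stages whose summed platform scores are non-zero.

    Falls back to ["reach"] when no stage has any signal, mirroring the
    default described above. `evidence` maps a stage name to a list of
    (platform, score) tuples.
    """
    supported = [
        stage for stage in STAGES
        if sum(score for _, score in evidence.get(stage, [])) > 0
    ]
    return supported or ["reach"]

# Illustrative scores only
example = {
    "reach": [("Wikipedia", 0.28), ("Facebook", 0.0)],
    "engagement": [("Blogs", 0.0), ("Reddit", 0.15)],
    "feedback": [("Expert", 0.0), ("Peers", 0.0)],
    "influence": [("News", 0.0)],
    "outcome": [("Policy", 0.0)],
}

print(stages_with_signal(example))  # ['reach', 'engagement']
print(stages_with_signal({}))       # ['reach']
```

A rule like this is stricter than the prompt, which asks the model to weigh "meaningful or emerging support" rather than to test for non-zero values, so the two are not expected to agree on every pair.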

In [143]:
GRAPH_FILE = project_dir / "data" / "impact_augmented_kg.ttl"

In [144]:
g = Graph()
g.parse(GRAPH_FILE, format="turtle")

<Graph identifier=Nb58774477a044f9aa5ab958f400800c8 (<class 'rdflib.graph.Graph'>)>

In [145]:
# Define namespaces (RDF and RDFS are already imported from rdflib above)
FLOW = Namespace("http://example.org/flowmetrics#")

In [146]:
def get_label(node):
    label = g.value(node, RDFS.label)
    return str(label) if label else str(node)

def extract_scores(node):
    # Look up the node's score once and pair it with each of its platforms
    score = g.value(node, FLOW.score)
    if score is None:
        return []
    return [(str(platform), float(score)) for platform in g.objects(node, FLOW.platform)]

records = []

for pair in g.subjects(RDF.type, FLOW.TopicPair):
    topics = list(g.objects(pair, FLOW.hasTopic))
    if len(topics) != 2:
        continue  # Skip if not exactly two topics
    t1, t2 = topics
    t1_label, t2_label = get_label(t1), get_label(t2)

    shared_concepts = [str(c) for c in g.objects(pair, FLOW.hasSharedConcept)]

    def get_impact(prop):
        return [x for node in g.objects(pair, prop) for x in extract_scores(node)]

    records.append({
        "topic_1_id": str(t1),
        "topic_1_name": t1_label,
        "topic_2_id": str(t2),
        "topic_2_name": t2_label,
        "shared_concepts": shared_concepts,
        "reach": get_impact(FLOW.hasReachImpact),
        "engagement": get_impact(FLOW.hasEngagementImpact),
        "feedback": get_impact(FLOW.hasFeedbackImpact),
        "influence": get_impact(FLOW.hasInfluenceImpact),
        "outcome": get_impact(FLOW.hasOutcomeImpact)
    })

df = pd.DataFrame(records)
df

Unnamed: 0,topic_1_id,topic_1_name,topic_2_id,topic_2_name,shared_concepts,reach,engagement,feedback,influence,outcome
0,http://example.org/flowmetrics#topic_0,optimization,http://example.org/flowmetrics#topic_1,combinatorial problems,"[combinatorial optimization, combinatorial pro...","[(Wikipedia, 0.2784810126582279), (Facebook, 0...","[(Blogs, 0.0), (Reddit, 0.14545454545454545), ...","[(Expert, 0.0), (Peers, 0.0)]","[(Citation_crossref, 0.14096521945183116), (Ne...","[(Patents, 0.07159904534606205), (Policy, 0.0)]"
1,http://example.org/flowmetrics#topic_0,optimization,http://example.org/flowmetrics#topic_10,adaptive algorithms,"[combinatorial optimization, combinatorial pro...","[(Wikipedia, 0.3417721518987342), (Twitter, 0....","[(Videos, 0.16666666666666666), (Mendeley, 0.0...","[(Peers, 0.0), (Expert, 0.0)]","[(Citation_crossref, 0.1498249124562281), (New...","[(Patents, 0.07995226730310262), (Policy, 0.0)]"
2,http://example.org/flowmetrics#topic_0,optimization,http://example.org/flowmetrics#topic_113,optimization problems,"[combinatorial optimization, combinatorial pro...","[(Wikipedia, 0.2468354430379747), (Facebook, 0...","[(Mendeley, 0.0), (Reddit, 0.09090909090909091...","[(Peers, 0.0), (Expert, 0.0)]","[(Citation_crossref, 0.12760327532187146), (Ne...","[(Policy, 0.0), (Patents, 0.06205250596658711)]"
3,http://example.org/flowmetrics#topic_0,optimization,http://example.org/flowmetrics#topic_138,convolutional neural networks,[computational efficiency],"[(Facebook, 0.1388888888888889), (Wikipedia, 0...","[(Reddit, 0.1090909090909091), (Videos, 0.1666...","[(Expert, 0.0), (Peers, 0.0)]","[(News, 0.0), (Citation_crossref, 0.1301703483...","[(Patents, 0.06563245823389022), (Policy, 0.0)]"
4,http://example.org/flowmetrics#topic_0,optimization,http://example.org/flowmetrics#topic_143,evolutionary algorithms,"[evolutionary algorithms, multi-objective opti...","[(Twitter, 0.10295224554654178), (Facebook, 0....","[(Blogs, 0.0), (Reddit, 0.09090909090909091), ...","[(Expert, 0.0), (Peers, 0.0)]","[(News, 0.0), (Citation_crossref, 0.1256549327...","[(Patents, 0.06205250596658711), (Policy, 0.0)]"
...,...,...,...,...,...,...,...,...,...,...
766,http://example.org/flowmetrics#topic_31,computational efficiency,http://example.org/flowmetrics#topic_9,detection algorithm,"[correlation analysis, encoder-decoder]","[(Facebook, 0.14583333333333331), (Wikipedia, ...","[(Reddit, 0.07272727272727272), (Videos, 0.0),...","[(Expert, 0.0), (Peers, 0.0)]","[(Citation_crossref, 0.11778257549827545), (Ne...","[(Policy, 0.0), (Patents, 0.08591885441527446)]"
767,http://example.org/flowmetrics#topic_66,reference image,http://example.org/flowmetrics#topic_9,detection algorithm,[reference image],"[(Facebook, 0.08333333333333333), (Wikipedia, ...","[(Videos, 0.0), (Blogs, 0.0), (Reddit, 0.07272...","[(Peers, 0.0), (Expert, 0.0)]","[(Citation_crossref, 0.10234064400621364), (Ne...","[(Patents, 0.06921241050119331), (Policy, 0.0)]"
768,http://example.org/flowmetrics#topic_67,classification models,http://example.org/flowmetrics#topic_9,detection algorithm,"[color images, reference image]","[(Facebook, 0.09722222222222221), (Twitter, 0....","[(Reddit, 0.10909090909090909), (Mendeley, 0.0...","[(Expert, 0.0), (Peers, 0.0)]","[(Citation_crossref, 0.12229799110081356), (Ne...","[(Policy, 0.0), (Patents, 0.07875894988066826)]"
769,http://example.org/flowmetrics#topic_74,computer vision,http://example.org/flowmetrics#topic_9,detection algorithm,"[color images, computer vision, reference image]","[(Twitter, 0.0884837333779376), (Wikipedia, 0....","[(Reddit, 0.09090909090909091), (Blogs, 0.0), ...","[(Expert, 0.0), (Peers, 0.0)]","[(Citation_crossref, 0.12074458281772465), (Ne...","[(Policy, 0.0), (Patents, 0.09069212410501193)]"


### 1.1 Model Selection and Zero-Shot Setup

This project adopts a training-free evaluation protocol, leveraging state-of-the-art large language models (LLMs) to perform two tasks in a zero-shot setting:  
(1) predicting impact stages from structured inputs, and  
(2) generating natural language summaries of societal research impact.

Traditional machine learning approaches are not suitable for this setting, as they require supervised training data and lack generative capabilities. Instead, we rely on pre-trained LLMs capable of performing classification and data-to-text generation directly from structured prompts.

We evaluate seven LLMs spanning both open-source and proprietary model families. These models were selected based on their widespread use, architectural diversity, and ability to handle structured input within a unified inference pipeline. Each model is queried using the same prompt template to ensure consistency in reasoning and output formatting.

| Model                | Type         | Version                        | Context Window |
|---------------------|--------------|--------------------------------|----------------|
| DeepSeek-V3         | Open         | deepseek-chat (v3-base)        | 128k           |
| Mixtral 8x22B       | Open         | open-mixtral-8x22b             | 64k            |
| GPT-3.5-Turbo       | Proprietary  | gpt-3.5-turbo-0125             | 16k            |
| GPT-4o              | Proprietary  | gpt-4o-2024-05-13              | 128k           |
| GPT-4o-mini         | Proprietary  | gpt-4o-mini-2024-07-18         | 128k           |
| Claude 3.7 Sonnet   | Proprietary  | claude-3-7-sonnet-20250219     | 200k           |
| Claude 3.5 Haiku    | Proprietary  | claude-3-5-haiku-20241022      | 200k           |

All models are accessed through a unified interface and evaluated under the same pipeline conditions.
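
One way to picture the unified interface is a lookup from model identifier to provider, after which the same prompt is sent regardless of backend. The sketch below is a simplified illustration: `PROVIDER_PREFIXES` and `provider_for` are hypothetical names, and the prefix rules are an assumption based on the API identifiers in the table above.

```python
# Hypothetical prefix-to-provider table for the API identifiers listed above
PROVIDER_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "deepseek-": "deepseek",
    "open-mixtral-": "mistral",
}

def provider_for(model_name: str) -> str:
    """Return the provider key for a model identifier, or raise if unknown."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model_name.startswith(prefix):
            return provider
    raise ValueError(f"Unsupported model: {model_name}")

for name in ["deepseek-chat", "open-mixtral-8x22b", "gpt-4o", "claude-3-5-haiku-20241022"]:
    print(name, "->", provider_for(name))
```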

#### 1.1.1 Load Model

In [148]:
MODEL_PATH = Path("")

In [149]:
def load_model(model_name: str, temperature: float = 0.1, max_tokens: int = 4096):
    """
    Loads a local or API-based model through a LangChain-compatible interface.
    """
    # Local models
    local_model_map = {
        "deepseek": MODEL_PATH / "DeepSeek-R1-Distill-Qwen-1.5B",
        "llama": MODEL_PATH / "Llama-3.2-1B",
        "mistral": MODEL_PATH / "Mistral-7B-Instruct-v0.1"
    }

    if model_name in local_model_map:
        model_path = local_model_map[model_name]
        if not model_path.exists():
            raise FileNotFoundError(f"Model path does not exist: {model_path}")
        generator = pipeline(
            "text-generation",
            model=str(model_path),
            device="cuda",
            temperature=temperature,
            max_new_tokens=max_tokens
        )
        return HuggingFacePipeline(pipeline=generator)

    # OpenAI chat models (completion-only models such as gpt-3.5-turbo-instruct
    # are not served by the chat endpoint that ChatOpenAI uses)
    elif model_name in ["gpt-3.5-turbo", "gpt-4", "gpt-4o", "gpt-4o-mini"]:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise EnvironmentError("OPENAI_API_KEY not set in environment variables.")
        return ChatOpenAI(
            model_name=model_name,
            temperature=temperature,
            max_tokens=max_tokens,
            openai_api_key=api_key
        )

    # Anthropic models (Claude)
    elif model_name in ["claude-3-7-sonnet-20250219", "claude-3-5-haiku-20241022"]:
        api_key = os.getenv("ANTHROPIC_API_KEY")
        if not api_key:
            raise EnvironmentError("ANTHROPIC_API_KEY not set in environment variables.")
        return ChatAnthropic(
            model_name=model_name,
            temperature=temperature,
            max_tokens=max_tokens,
            anthropic_api_key=api_key
        )
    # DeepSeek API via OpenAI-compatible interface
    elif model_name in ["deepseek-reasoner", "deepseek-chat"]:
        api_key = os.getenv("DEEPSEEK_API_KEY")
        if not api_key:
            raise EnvironmentError("DEEPSEEK_API_KEY not set in environment variables.")
        return ChatOpenAI(
            model_name=model_name,
            temperature=temperature,
            max_tokens=max_tokens,
            openai_api_key=api_key,
            openai_api_base="https://api.deepseek.com/v1"
        )
    # Mistral API (Mixtral models)
    elif model_name in ["open-mixtral-8x22b", "open-mixtral-8x7b"]:
        api_key = os.getenv("MISTRAL_API_KEY")
        if not api_key:
            raise EnvironmentError("MISTRAL_API_KEY not set in environment variables.")
        return ChatMistralAI(
            model_name=model_name,
            temperature=temperature,
            max_tokens=max_tokens,
            mistral_api_key=api_key
        )

    else:
        raise ValueError(f"Unsupported model: {model_name}")

#### 1.1.2 Prompt Templates for Zero-Shot Learning

In [150]:
def generate_response(model_name: str, prompt: str):
    """
    Generates a response from a local or API-based model via LLMChain.
    """
    model = load_model(model_name)

    # The same pass-through template is used for local and API-based models
    template = PromptTemplate(template="{prompt}", input_variables=["prompt"])

    llm_chain = LLMChain(llm=model, prompt=template)

    return llm_chain.run({"prompt": prompt})

In [151]:
def apply_prompt_template(row):
    """
    Generate a prompt to classify the joint impact stage of a research topic pair,
    using structured platform co-mention evidence and shared concepts.
    The platform values are normalised per platform within each dimension.
    """
    topic_1 = row["topic_1_name"]
    topic_2 = row["topic_2_name"]
    shared_concepts = ', '.join(row["shared_concepts"]) or "None"

    platform_lines = []
    for dim in impact_stages:
        entries = row.get(dim, [])
    
        if isinstance(entries, str):
            try:
                entries = ast.literal_eval(entries)
            except Exception:
                entries = []
    
        if isinstance(entries, list):
            dim_totals = defaultdict(float)
            for item in entries:
                if isinstance(item, (list, tuple)) and len(item) == 2:
                    platform, value = item
                    try:
                        dim_totals[platform] += float(value)
                    except Exception:
                        continue  # Skip if value cannot be converted to float
            platform_info = ', '.join(
                f"{platform}: {round(value, 3)}" for platform, value in dim_totals.items()
            )
            platform_lines.append(f"{platform_info if platform_info else 'None'}")
    
    full_platform = '\n'.join(platform_lines)
    pairs = [p.strip() for line in full_platform.splitlines() for p in line.split(",")]
    full_platform_lines = ", ".join(pairs)

    prompt = f"""
            You are an expert in research impact analysis. Your task is to assess the impact of a pair of research topics based on structured evidence of platform co-mentions and shared concepts.
            
            This is a multi-label classification problem. Your goal is to classify all impact stages as either supported or not, based on the strength and relevance of the evidence for each stage.
            
            ---
            
            Impact Stages:
            
            - Reach: Broad dissemination of research to general audiences via mass communication platforms (e.g., Twitter, Facebook, Wikipedia).
            - Engagement: Active interaction, discussion, or interpretation of research in community-driven forums (e.g., Blogs, Reddit, YouTube, Mendeley).
            - Feedback: Scholarly reactions or critical appraisals, often indicating academic interest (e.g., Peer Review).
            - Influence: Contribution to discourse in authoritative contexts (e.g., citations via CrossRef, media coverage via News).
            - Outcome: Tangible societal or technological effects arising from research (e.g., Policy documents, Patents).
            
            ---
            
            Important Notes:
            
            - Platform values are normalised per platform and impact dimension at the topic level (0.0 = no evidence, 1.0 = maximum evidence). For topic pairs, scores are summed (range: 0.0–2.0) to reflect combined impact.
            - Classify all impact stages with meaningful or emerging support, considering the total cumulative evidence across platforms, even if individual platform scores are low.
            - Err slightly on the side of inclusiveness: if multiple signals exist across platforms, prefer assigning the stage rather than omitting it.
            - If cumulative evidence across all platforms for a stage is strictly zero, assign only the "Reach" stage.
            
            ---
            
            Now classify this pair:
            
            - Topic 1: {topic_1}  
            - Topic 2: {topic_2}  
            - Shared Concepts: {shared_concepts}
            
            Platform Co-mention Evidence (normalised values):  
            {full_platform_lines}
            
            ---
            
            You must begin your output exactly with:
            
            Impact Stages with Sufficient Support: [list of stages]
            
            Followed by:
            
            Impact Summary:
            Write a concise (2–4 sentence) summary explaining the evidence for each assigned impact stage. Mention the key platforms and shared concepts that support each stage. Highlight how cumulative evidence across platforms contributed to classification, even if individual signals are small.
            """

    return prompt.strip()

In [152]:
def iterate_and_generate_responses(topic_pairs_df, model_name):
    print(f"Running for model: {model_name}")
    
    impact_stages_set = {"reach", "engagement", "feedback", "influence", "outcome"}

    predicted_stages = []
    impact_summaries = []
    raw_responses = []

    for i in range(len(topic_pairs_df)):
        row = topic_pairs_df.iloc[i]

        # Generate prompt
        prompt = apply_prompt_template(row)

        # Generate response
        response = generate_response(model_name, prompt).strip()
        raw_responses.append(response)

        # Extract supported stages
        stages_match = re.search(r"Impact Stages with Sufficient Support:\s*\[([^\]]+)\]", response)
        if stages_match:
            supported_stages = stages_match.group(1).split(",")
            supported_stages = [stage.strip().lower() for stage in supported_stages if stage.strip()]
        else:
            supported_stages = []

        supported_stages = [stage for stage in supported_stages if stage in impact_stages_set]
        predicted_stages.append(supported_stages)

        # Extract impact summary
        summary_match = re.search(r"Impact Summary:\s*(.+)", response, re.DOTALL)
        summary = summary_match.group(1).strip() if summary_match else ""
        impact_summaries.append(summary)

    # Assign new columns
    topic_pairs_df[f"predicted_stages_{model_name}"] = predicted_stages
    topic_pairs_df[f"impact_summary_{model_name}"] = impact_summaries
    topic_pairs_df[f"raw_response_{model_name}"] = raw_responses

    return topic_pairs_df
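
The parsing step can be checked in isolation on a hand-written response. The sketch below reuses the two regular expressions from `iterate_and_generate_responses`; `parse_response` and the sample strings are hypothetical and only illustrate the expected output format.

```python
import re

VALID_STAGES = {"reach", "engagement", "feedback", "influence", "outcome"}

def parse_response(response: str):
    """Extract the bracketed stage list and the free-text summary."""
    stages_match = re.search(
        r"Impact Stages with Sufficient Support:\s*\[([^\]]+)\]", response
    )
    stages = []
    if stages_match:
        stages = [s.strip().lower() for s in stages_match.group(1).split(",") if s.strip()]
    # Drop anything that is not one of the five known stages
    stages = [s for s in stages if s in VALID_STAGES]

    summary_match = re.search(r"Impact Summary:\s*(.+)", response, re.DOTALL)
    summary = summary_match.group(1).strip() if summary_match else ""
    return stages, summary

sample = (
    "Impact Stages with Sufficient Support: [Reach, Engagement, Influence]\n\n"
    "Impact Summary:\n"
    "Wikipedia and Reddit activity supports Reach and Engagement."
)
stages, summary = parse_response(sample)
print(stages)   # ['reach', 'engagement', 'influence']
print(summary)  # Wikipedia and Reddit activity supports Reach and Engagement.

# A response that omits the square brackets yields no stages
no_brackets, _ = parse_response("Impact Stages with Sufficient Support: Reach, Engagement")
print(no_brackets)  # []
```

Because the pattern requires square brackets, a model that lists stages without them is recorded with an empty prediction; the raw responses stored alongside the parsed columns make such cases recoverable.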

### 1.2 Prediction

In [153]:
for m in models:
    topic_pairs_df = iterate_and_generate_responses(df, model_name=m)

Running for model: open-mixtral-8x22b
Running for model: deepseek-chat
Running for model: gpt-3.5-turbo
Running for model: gpt-4o
Running for model: gpt-4o-mini
Running for model: claude-3-7-sonnet-20250219
Running for model: claude-3-5-haiku-20241022


In [154]:
# Save the predictions, summaries, and raw responses for all models
DATASET_PATH = project_dir / "data" / "flowmetrics-predictions.csv"
topic_pairs_df.to_csv(DATASET_PATH, index=False)
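
One detail to keep in mind when reloading the saved file: CSV has no list type, so list-valued columns such as the predicted stages come back as strings. The sketch below shows the round trip with the standard-library `csv` module to stay self-contained; pandas writes the same string representation for list-valued cells. The column names and values are illustrative.

```python
import ast
import csv
import io

# Write one row whose second cell is a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["topic_1_name", "predicted_stages"])
writer.writerow(["optimization", ["reach", "engagement"]])

# Read it back: the list is now its string representation
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["predicted_stages"]
print(type(raw).__name__, raw)  # str ['reach', 'engagement']

# ast.literal_eval safely turns the string back into a list
restored = ast.literal_eval(raw)
print(restored)  # ['reach', 'engagement']
```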