# Flowmetrics – Impact-Augmented Knowledge Graph Construction

This notebook constructs the Impact-Augmented Knowledge Graph, the central data structure underpinning the Flowmetrics framework. It documents the end-to-end pipeline for generating a structured dataset of societal research impact trajectories by integrating heterogeneous data sources into an RDF graph suitable for AI-driven impact modelling.

### Objective

To automate the extraction, semantic alignment, and structuring of research topic pairs and their associated impact signals — enabling scalable modelling of how research impact unfolds across time, platforms, and audiences.

### Structure

#### 1. Data Collection Pipeline  
Harvests impact evidence through API integration with three key platforms:  
- **arXiv** – source of metadata for topic extraction and co-occurrence modelling  
- **Altmetric** – provider of online attention signals (e.g., news, social media, blogs)  
- **CrossRef** – supplier of citation-based and policy-linked influence metrics

The pipeline operates on a curated corpus of 12,350 computer science preprints (2000–2024), spanning major subfields such as machine learning, natural language processing, computer vision, and artificial intelligence. Papers were selected for topical diversity, metadata completeness, and coverage across at least one impact platform.

# Table of Contents
- [1. Data collection pipeline: automating data extraction](#section-1)
  - [1.1 API Integration](#subsection-11)
     - [1.1.1 Altmetric Data](#subsection-111)
     - [1.1.2 CrossRef Data](#subsection-112)
     - [1.1.3 arXiv Data](#subsection-113)
  - [1.2 Impact Trajectory Matching](#subsection-12)
     - [1.2.1 Nodes of topics](#subsection-121)
     - [1.2.2 Edges of impact](#subsection-122)
     - [1.2.3 Impact-augmented knowledge graph](#subsection-123)

In [1]:
import json
import nltk
import requests
import numpy as np
import pandas as pd
import networkx as nx
import seaborn as sns
import matplotlib.pyplot as plt
from pyvis.network import Network
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from cso_classifier import CSOClassifier
from rdflib import Graph as RDFGraph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import RDF, RDFS, XSD
from config import models, impact_stages, stage_order
from pathlib import Path

In [2]:
# project directory
project_dir = Path(".").resolve().parent

## 1.1 API Integration  
### 1.1.1 Altmetric Data

Altmetric data captures the online attention and engagement surrounding scholarly publications, providing insight into the broader impact of research beyond traditional citation metrics. It reflects how a paper is being discussed, shared, and interacted with across a range of platforms including news outlets, blogs, Twitter, Facebook, Reddit, Wikipedia, and policy documents.

This data helps construct a multidimensional view of research visibility and societal relevance — spanning public discourse, academic engagement, and policy uptake. As such, Altmetric indicators are increasingly valuable for researchers, institutions, and funders aiming to understand and quantify how research resonates across different audiences and sectors.

In [3]:
def get_altmetric_data(doi):
    url = f"https://api.altmetric.com/v1/doi/{doi}"
    response = requests.get(url)
    
    if response.status_code == 200:
        data = response.json()
        altmetric_score = data.get("score", 0)
        all_mentions_counts = data.get("cited_by_posts_count", 0)
        twitter_counts = data.get("cited_by_tweeters_count", 0)
        news_counts = data.get("cited_by_news_outlets_count", 0)
        blogs_counts = data.get("cited_by_blogs_count", 0)
        reddit_counts = data.get("cited_by_rdts_count", 0)
        facebook_counts = data.get("cited_by_fbwalls_count", 0)
        patents_counts = data.get("cited_by_patents_count", 0)
        wiki_counts = data.get("cited_by_wikipedia_count", 0)
        policy_counts = data.get("cited_by_policy_count", 0)
        mendeley_counts = data.get("cited_by_mendeley_count", 0)
        video_counts = data.get("cited_by_videos_count", 0)
        return altmetric_score, all_mentions_counts, twitter_counts, news_counts, blogs_counts, reddit_counts, facebook_counts, patents_counts, wiki_counts, policy_counts, mendeley_counts, video_counts
    elif response.status_code == 401:
        print("Unauthorized: Check your API key or permissions.")
    elif response.status_code == 429:
        print("Rate limit exceeded. Please try again later.")
    else:
        pass
    return None, None, None, None, None, None, None, None, None, None, None, None
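A twelve-element positional tuple is easy to unpack in the wrong order. As a hedged alternative, the sketch below maps each response key used in `get_altmetric_data` to a readable name and returns a dictionary instead. `ALTMETRIC_FIELDS` and `parse_altmetric_payload` are illustrative names that do not appear elsewhere in the notebook; only the response keys are taken from the function above, and the payload is made up.

```python
# Response keys copied from get_altmetric_data; the names on the left are
# illustrative column labels, not part of the Altmetric API.
ALTMETRIC_FIELDS = {
    "Altmetric_score": "score",
    "All_mentions": "cited_by_posts_count",
    "Twitter": "cited_by_tweeters_count",
    "News": "cited_by_news_outlets_count",
    "Blogs": "cited_by_blogs_count",
    "Reddit": "cited_by_rdts_count",
    "Facebook": "cited_by_fbwalls_count",
    "Patents": "cited_by_patents_count",
    "Wikipedia": "cited_by_wikipedia_count",
    "Policy": "cited_by_policy_count",
    "Mendeley": "cited_by_mendeley_count",
    "Videos": "cited_by_videos_count",
}


def parse_altmetric_payload(payload):
    """Return one named value per signal, defaulting to 0 when a key is absent."""
    return {name: payload.get(key, 0) for name, key in ALTMETRIC_FIELDS.items()}


# Hypothetical payload, for illustration only.
example = parse_altmetric_payload({"score": 4.0, "cited_by_tweeters_count": 1})
print(example["Twitter"], example["News"])
```

A dictionary keyed by column name can be appended to a list of records and passed to `pd.DataFrame` directly, which avoids maintaining twelve parallel lists.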

### 1.1.2 CrossRef Data

CrossRef data provides extensive metadata and citation information for scholarly works, including journal articles, books, conference proceedings, datasets, and other research outputs. It plays a critical role in enhancing the discoverability and traceability of academic content through the use of persistent identifiers such as Digital Object Identifiers (DOIs).

Within the Flowmetrics framework, CrossRef serves as a key source of citation-based impact signals. These signals help trace the academic and institutional reach of research over time, offering a complementary view to socially-driven metrics and enabling a more comprehensive understanding of scholarly influence.

In [4]:
def get_citation_count_from_crossref(doi):
    url = f"https://api.crossref.org/works/{doi}"
    response = requests.get(url)
    
    if response.status_code == 200:
        data = response.json()
        citation_count = data.get('message', {}).get('is-referenced-by-count', 0)
        return citation_count
    elif response.status_code == 404:
        pass
    elif response.status_code == 429:
        print("Rate limit exceeded. Please try again later.")
    else:
        pass
    return None
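Both request helpers only print a message when the server answers with HTTP 429. A minimal sketch of one way to retry instead, using exponential backoff, is shown here. `call_with_backoff`, its parameters, and the simulated responses are assumptions made for illustration; the notebook itself does not retry, and real APIs may advertise their own retry intervals.

```python
import time


def call_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429 or retries run out.

    fetch is any zero-argument callable returning (status_code, payload).
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_retries:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return status, payload


# Simulated endpoint: rate-limited twice, then successful.
responses = iter([
    (429, None),
    (429, None),
    (200, {"message": {"is-referenced-by-count": 7}}),
])
delays = []  # record the waits instead of actually sleeping
status, payload = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(status, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a harvest run the default `time.sleep` would apply.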

### 1.1.3 arXiv Data

arXiv is a widely used preprint repository that offers open access to scholarly articles across a broad range of disciplines, including computer science, physics, mathematics, statistics, electrical engineering, quantitative biology, and economics.

In the Flowmetrics framework, arXiv serves as the primary source for research metadata, enabling the extraction of topics and co-occurrence patterns. This metadata provides the structural backbone for identifying topic pairs and aligning them with downstream impact signals collected from Altmetric and CrossRef.

Download the Dataset: The arXiv snapshot utilised in this project is available for download from Kaggle (https://www.kaggle.com/datasets/Cornell-University/arxiv), provided by Cornell University.

In [5]:
DATASET_FILE = project_dir / "data" / "arxiv-metadata-oai-snapshot.json" # Download here: https://www.kaggle.com/datasets/Cornell-University/arxiv

In [6]:
category_map = {
    'cs.AI': 'Artificial Intelligence',
    'cs.CL': 'Computation and Language',
    'cs.CV': 'Computer Vision and Pattern Recognition',
    'cs.DS': 'Data Structures and Algorithms',
    'cs.ET': 'Emerging Technologies',
    'cs.HC': 'Human-Computer Interaction',
    'cs.IR': 'Information Retrieval',
    'cs.NE': 'Neural and Evolutionary Computing',
    'cs.LG': 'Machine Learning'
}

In [7]:
# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

def remove_stop_words(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)
    # Matching is case-sensitive, so capitalised stop words such as "The" are kept
    # (they remain visible in the cleaned titles and abstracts shown in later outputs).
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return ' '.join(filtered_sentence).replace(',', '')

def lemmatizer_word(word):
    return wnl.lemmatize(word)

def lemmatizer(sentence):
    tokens = nltk.word_tokenize(sentence)
    lemmatized_tokens = [lemmatizer_word(token) for token in tokens]
    return " ".join(lemmatized_tokens)

def get_metadata():
    with open(DATASET_FILE, 'r') as f:
        for line in f:
            yield line

In [8]:
dois = []
titles = []
abstracts = []
years = []
categories = []
citations_crossref = []
citations_wos = []
altmetric_score = []
all_mentions_counts = []
twitter_counts = []
news_counts = []
blogs_counts = []
reddit_counts = []
facebook_counts = []
patents_counts = []
wiki_counts = []
policy_counts = []
mendeley_counts = []
video_counts = []
metadata = get_metadata()
count = 0
for paper in metadata:
    paper_dict = json.loads(paper)
    ref = paper_dict.get('journal-ref')
    doi = paper_dict.get('doi')
    try:
        year = int(ref[-4:]) 
        if 2000 < year <= 2024:
            count+=1
            categories.append(category_map[paper_dict.get('categories').split(" ")[0]])
            years.append(year)
            titles.append(lemmatizer(remove_stop_words(paper_dict.get('title'))))
            abstracts.append(lemmatizer(remove_stop_words(paper_dict.get('abstract'))))
            dois.append(doi)
            citations_crossref.append(get_citation_count_from_crossref(doi))
            a, b, c, d, e, f, g, h, i, j, k, l = get_altmetric_data(doi)
            altmetric_score.append(a)
            all_mentions_counts.append(b)
            twitter_counts.append(c)
            news_counts.append(d)
            blogs_counts.append(e)
            reddit_counts.append(f)
            facebook_counts.append(g)
            patents_counts.append(h)
            wiki_counts.append(i)
            policy_counts.append(j)
            mendeley_counts.append(k)
            video_counts.append(l)
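    except (TypeError, ValueError, KeyError):
        # Skip records without a journal-ref, without a four-digit year suffix,
        # or whose primary category is not in category_map.
        continue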

len(dois), len(titles), len(abstracts), len(years), len(categories), len(citations_crossref), len(altmetric_score)

(12536, 12536, 12536, 12536, 12536, 12536, 12536)
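The harvesting loop reads the publication year from the last four characters of `journal-ref` and keeps it only when `2000 < year <= 2024`. The sketch below isolates that rule in a small function so its edge cases are visible. `extract_year` is a hypothetical helper, and the reference strings are made up for illustration.

```python
def extract_year(journal_ref, first=2001, last=2024):
    """Return the year encoded in the last four characters, or None.

    Mirrors int(ref[-4:]) and the 2000 < year <= 2024 filter used in the loop.
    """
    if not journal_ref:
        return None
    try:
        year = int(journal_ref[-4:])
    except ValueError:
        return None
    return year if first <= year <= last else None


# Made-up reference strings covering the main cases.
samples = ["J. Example 12, 2008", "J. Example 12 (2008)", "J. Example 2000", None]
years = [extract_year(ref) for ref in samples]
print(years)
```

Under this rule a reference that ends with a closing parenthesis, or that carries the year in the middle of the string, is dropped, and so is any paper dated exactly 2000.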

In [9]:
# Create a DataFrame
df = pd.DataFrame({
    'DOI': dois,
    'Title': titles,
    'Abstract': abstracts,
    'Year': years,
    'Category': categories,
    'Citation_crossref': citations_crossref,
    'Altmetric_score': altmetric_score,
    'All_mentions': all_mentions_counts,
    'Twitter': twitter_counts,
    'News': news_counts,
    'Blogs': blogs_counts,
    'Reddit': reddit_counts,
    'Facebook': facebook_counts,
    'Patents': patents_counts,
    'Policy': policy_counts,
    'Mendeley': mendeley_counts,
    'Wikipedia': wiki_counts,
    'Videos': video_counts,
})

## 1.2 Impact Trajectory Matching

In [10]:
df = df[df['DOI'].notna()]
df

Unnamed: 0,DOI,Title,Abstract,Year,Category,Citation_crossref,Altmetric_score,All_mentions,Twitter,News,Blogs,Reddit,Facebook,Patents,Policy,Mendeley,Wikipedia,Videos
3,10.1016/j.comgeo.2008.05.005,Edges Switches Tunnels Bridges,Edge casing well-known method improve readabil...,2009,Data Structures and Algorithms,7.0,5.488,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,10.1016/j.tcs.2008.03.029,Soft constraint abstraction based semiring hom...,The semiring-based constraint satisfaction pro...,2008,Artificial Intelligence,6.0,,,,,,,,,,,,
10,10.1093/comjnl/bxm084,On Ultrametric Algorithmic Information,How best quantify information object whether n...,2010,Artificial Intelligence,9.0,,,,,,,,,,,,
14,10.1007/978-3-540-78568-2,Efficient Algorithms Node Disjoint Subgraph Ho...,Recently great effort dedicated research manag...,2008,Data Structures and Algorithms,0.0,,,,,,,,,,,,
15,10.2478/s13230-010-0014-0,Toward Psycho-robots,We try perform geometrization psychology repre...,2010,Artificial Intelligence,1.0,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12505,10.1109/TED.2007.893191,Retraction Generalized Extension Computing Words,Fuzzy automaton whose input alphabet set numbe...,2007,Artificial Intelligence,4.0,,,,,,,,,,,,
12509,10.1007/s00357-007-0007-9,The Haar Wavelet Transform Dendrogram,We describe new wavelet transform use hierarch...,2007,Information Retrieval,48.0,,,,,,,,,,,,
12527,10.1093/comjnl/bxl065,Hedging prediction machine learning,Recent advance machine learning make possible ...,2007,Machine Learning,87.0,4.000,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
12528,10.1142/S1793005708001100,A Neutrosophic Description Logic,Description Logics ( DLs ) appropriate widely ...,2006,Artificial Intelligence,0.0,,,,,,,,,,,,


In [11]:
df["Category"].value_counts()

Category
Computer Vision and Pattern Recognition    1976
Machine Learning                           1453
Artificial Intelligence                     809
Computation and Language                    618
Human-Computer Interaction                  404
Data Structures and Algorithms              296
Neural and Evolutionary Computing           273
Information Retrieval                       269
Emerging Technologies                        88
Name: count, dtype: int64

### 1.2.1 Nodes of Topics

The nodes of the knowledge graph represent high-quality research topics extracted from arXiv metadata. To identify these topics, we used the **CSO Classifier** — an unsupervised tool that assigns concepts from the **Computer Science Ontology (CSO)** based on paper titles, abstracts, and keywords.

The CSO Classifier integrates two components:  
- A **syntactic module**, which detects explicitly mentioned concepts  
- A **semantic module**, which leverages part-of-speech tagging and word embeddings to infer related concepts

Outputs from both modules are merged to generate a candidate topic list for each paper. To consolidate fine-grained variations, we applied a **frequency-based clustering strategy**, assigning each paper to its most frequently associated concept across the corpus. This allowed for the grouping of semantically similar papers under unified topic identifiers without manual filtering.

The resulting pipeline generated **156 coherent, domain-relevant topic nodes**, covering diverse areas of computer science.
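The frequency-based assignment can be illustrated on a toy corpus. The sketch below uses only the standard library and made-up concept lists; it follows the same rule as the notebook (pick, for each paper, the concept with the highest corpus-wide count) but is not the notebook's own code.

```python
from collections import Counter

# Toy concept lists standing in for the classifier's "union" output.
papers = {
    "paper_a": ["optimization", "edge crossing"],
    "paper_b": ["optimization", "machine learning"],
    "paper_c": ["machine learning", "neural networks"],
    "paper_d": ["neural networks", "machine learning"],
}

# Corpus-wide frequency of each concept.
concept_counts = Counter(c for concepts in papers.values() for c in concepts)

# Each paper is assigned the concept that is most frequent across the corpus.
assignment = {
    paper: max(concepts, key=lambda c: concept_counts[c])
    for paper, concepts in papers.items()
}
print(assignment)
```

When two concepts are equally frequent, `max` returns the one that appears first in the paper's list, so the order of the classifier output breaks ties.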

In [12]:
cc = CSOClassifier(modules = "both", enhancement = "all", explanation = False)
results = list()

In [13]:
def create_paper_dict(row):
    paper = {
        "title": row["Title"],
        "abstract": row["Abstract"]
    }
    return cc.run(paper)

In [14]:
# Run the classifier on every row and store its output in the 'CSO_classifier' column
df["CSO_classifier"] = df.apply(create_paper_dict, axis=1)

Computer Science Ontology loaded.
Model loaded.


In [15]:
for _, row in df.head(3).iterrows():
    print(f"Category: {row['Category']}")
    print(f"Title: {row['Title']}")
    print(f"Abstract: {row['Abstract']}")
    print(f"CSO Topics: \n{json.dumps(row['CSO_classifier'], indent=4)}\n")
    print("="*80)

Category: Data Structures and Algorithms
Title: Edges Switches Tunnels Bridges
Abstract: Edge casing well-known method improve readability drawing non-planar graph . A cased drawing order edge edge crossing interrupt lower edge appropriate neighborhood crossing . Certain order lead readable drawing others . We formulate several optimization criterion try capture concept `` good `` cased drawing . Further address algorithmic question turn given drawing optimal cased drawing . For many resulting optimization problem either find polynomial time algorithm NP-hardness result .
CSO Topics: 
{
    "syntactic": [
        "optimization problems",
        "optimization",
        "polynomial-time algorithms",
        "edge crossing"
    ],
    "semantic": [
        "optimization problems",
        "optimization",
        "polynomial-time algorithms",
        "edge point",
        "planar graph"
    ],
    "union": [
        "optimization problems",
        "optimization",
        "edge crossing",

In [16]:
# Extract the "union" list from each dictionary
df["Research_concepts"] = df["CSO_classifier"].apply(lambda x: x.get("union", []))

In [17]:
df["Super_topics"] = df["CSO_classifier"].apply(lambda x: x.get("enhanced", []))

In [18]:
#df["Research_concepts"] = df["Research_concepts"].apply(ast.literal_eval)
all_concepts = df["Research_concepts"].explode().tolist()

In [19]:
# Step 1: Count the frequency of each concept across all papers
concept_counts = pd.Series(all_concepts).value_counts()

In [20]:
# Step 2: Assign the most frequent concept as the cluster for each paper
df["Name"] = df["Research_concepts"].apply(
    lambda concepts: max(concepts, key=lambda concept: concept_counts.get(concept, 0))
)
# Step 3: Generate unique Topic IDs starting from 0
df["Topic"] = df["Name"].map({name: idx for idx, name in enumerate(df["Name"].unique())})
df

Unnamed: 0,DOI,Title,Abstract,Year,Category,Citation_crossref,Altmetric_score,All_mentions,Twitter,News,...,Patents,Policy,Mendeley,Wikipedia,Videos,CSO_classifier,Research_concepts,Super_topics,Name,Topic
3,10.1016/j.comgeo.2008.05.005,Edges Switches Tunnels Bridges,Edge casing well-known method improve readabil...,2009,Data Structures and Algorithms,7.0,5.488,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,"{'syntactic': ['optimization problems', 'optim...","[optimization problems, optimization, edge cro...","[correlation analysis, mathematics, graph draw...",optimization,0
4,10.1016/j.tcs.2008.03.029,Soft constraint abstraction based semiring hom...,The semiring-based constraint satisfaction pro...,2008,Artificial Intelligence,6.0,,,,,...,,,,,,"{'syntactic': ['optimal solutions'], 'semantic...","[constraint satisfaction problems (csp), combi...","[artificial intelligence, combinatorial mathem...",combinatorial problems,1
10,10.1093/comjnl/bxm084,On Ultrametric Algorithmic Information,How best quantify information object whether n...,2010,Artificial Intelligence,9.0,,,,,...,,,,,,"{'syntactic': ['computability'], 'semantic': [...","[combinatorial optimization, combinatorial pro...","[optimization, combinatorial mathematics, comp...",combinatorial problems,1
14,10.1007/978-3-540-78568-2,Efficient Algorithms Node Disjoint Subgraph Ho...,Recently great effort dedicated research manag...,2008,Data Structures and Algorithms,0.0,,,,,...,,,,,,"{'syntactic': ['graph-based', 'synthetic data'...","[graph-based, synthetic data, state space, mat...","[graphic methods, machine learning, state spac...",matching algorithm,2
15,10.2478/s13230-010-0014-0,Toward Psycho-robots,We try perform geometrization psychology repre...,2010,Artificial Intelligence,1.0,,,,,...,,,,,,"{'syntactic': ['cognitive systems', 'robots', ...","[mobile robots, cognitive systems, cognitive s...","[robotics, computer science, topology]",artificial intelligence,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12505,10.1109/TED.2007.893191,Retraction Generalized Extension Computing Words,Fuzzy automaton whose input alphabet set numbe...,2007,Artificial Intelligence,4.0,,,,,...,,,,,,"{'syntactic': ['automation', 'feature models',...","[automation, feature models, nondeterministic ...","[engineering, software product line, translati...",part of speech,67
12509,10.1007/s00357-007-0007-9,The Haar Wavelet Transform Dendrogram,We describe new wavelet transform use hierarch...,2007,Information Retrieval,48.0,,,,,...,,,,,,"{'syntactic': ['wavelet decomposition', 'wavel...","[dendrogram, computability, wavelet decomposit...","[cluster analysis, genetic diversity, computab...",wavelet,289
12527,10.1093/comjnl/bxl065,Hedging prediction machine learning,Recent advance machine learning make possible ...,2007,Machine Learning,87.0,4.000,2.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,"{'syntactic': ['learning objects', 'k-nearest ...","[learning objects, correlation analysis, suppo...","[learning environments, mathematics, clusterin...",machine learning,38
12528,10.1142/S1793005708001100,A Neutrosophic Description Logic,Description Logics ( DLs ) appropriate widely ...,2006,Artificial Intelligence,0.0,,,,,...,,,,,,"{'syntactic': ['reasoning', 'semantics', 'desc...","[dynamic logic, description logic, multivalued...","[modal logic, formal logic, knowledge represen...",semantics,16


In [21]:
# Ensure feedback columns exist in df, even if filled with 0.0
for col in ["Peers", "Expert"]:
    if col not in df.columns:
        df[col] = 0.0

### 1.2.2 Edges of Impact

Edges in the Impact-Augmented Knowledge Graph represent impact-bearing relationships between research topics. An edge is established when two topics co-occur in at least one paper and share a common CSO concept. These relationships are encoded as RDF triples of the form *(TopicPair, hasStageImpact, BlankNode)*.

Each `hasStageImpact` predicate is instantiated with a stage-specific property (e.g., `flow:hasReachImpact`, `flow:hasInfluenceImpact`). The associated `BlankNode` stores metadata about the impact source, including:  
- `flow:platform`: the platform generating the signal (e.g., Twitter, Facebook)  
- `flow:score`: the aggregated and normalised impact score for that platform and stage

#### Aggregation Strategy

To produce these edge-level scores, we applied a two-step aggregation process:  
1. **Normalisation:** Raw scores were independently normalised per platform and dimension to account for scale differences (e.g., Altmetric vs. CrossRef).  
2. **Summation:** For each topic pair and stage, we summed the normalised scores across all platforms associated with that dimension.
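
The two steps can be sketched on made-up counts. The text does not state which normalisation formula was used, so min-max scaling to [0, 1] is an assumption here, as are the helper name `min_max` and the toy numbers.

```python
def min_max(values):
    """Scale a list of raw counts to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy raw Reach counts for three topic pairs (made-up numbers).
raw = {
    "Twitter": [0, 50, 200],
    "Facebook": [2, 2, 10],
}

# Step 1: normalise each platform independently.
normalised = {platform: min_max(counts) for platform, counts in raw.items()}

# Step 2: sum the normalised scores across platforms for each pair.
reach = [sum(scores) for scores in zip(*normalised.values())]
print(reach)
```

Normalising before summing keeps a high-volume platform such as Twitter from dominating a low-volume one purely because of scale.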

#### Example

Consider the topic pair `Semantics` and `Language Model`, which co-occur and share the CSO concept `language model`. Their Reach impact is encoded as the following RDF triples:  
*(flow:pair_15_131, flow:hasReachImpact, _:reach1)*,  
*(_:reach1, flow:platform, "Twitter")*,  
*(_:reach1, flow:score, 0.1321)*,  
*(flow:pair_15_131, flow:hasReachImpact, _:reach2)*,  
*(_:reach2, flow:platform, "Facebook")*,  
*(_:reach2, flow:score, 0.0972)*
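
The shape of these triples can be reproduced with plain Python tuples. This is a stand-in for illustration only: it uses strings rather than rdflib terms, and `stage_triples` is a hypothetical helper that does not appear in the notebook.

```python
from itertools import count


def stage_triples(pair_id, stage, platform_scores):
    """Build (subject, predicate, object) tuples for one pair and one stage."""
    triples = []
    labels = count(1)
    for platform, score in platform_scores.items():
        # One blank node per platform, labelled reach1, reach2, ...
        node = f"_:{stage.lower()}{next(labels)}"
        triples.append((pair_id, f"flow:has{stage.capitalize()}Impact", node))
        triples.append((node, "flow:platform", platform))
        triples.append((node, "flow:score", score))
    return triples


triples = stage_triples(
    "flow:pair_15_131", "reach", {"Twitter": 0.1321, "Facebook": 0.0972}
)
for triple in triples:
    print(triple)
```

Each platform contributes three triples: one linking the pair to a blank node, and two describing that node.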

In [22]:
def normalize(series, range_min, range_max):
    min_val = series.min()
    max_val = series.max()
    return ((series - min_val) / (max_val - min_val) * (range_max - range_min) + range_min).astype(int)
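As a worked illustration of this rescaling, the sketch below applies the same formula to a plain list. `rescale` is a hypothetical list-based counterpart of `normalize`; the guard for a constant input is an addition that the notebook function does not have, and the sizes are made up.

```python
def rescale(values, range_min, range_max):
    """Min-max rescale to [range_min, range_max], truncating to int like astype(int)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Added guard: without it a constant input would divide by zero.
        return [range_min for _ in values]
    span = range_max - range_min
    return [int((v - lo) / (hi - lo) * span + range_min) for v in values]


# Made-up cluster sizes mapped to node sizes between 40 and 100.
sizes = rescale([1, 2, 51, 101], 40, 100)
print(sizes)
```

The conversion to `int` truncates rather than rounds, so small clusters collapse onto the lower bound of the range.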

In [23]:
def get_top_10_concepts(concepts):
    concepts_count = pd.Series(concepts).value_counts()
    top_10_concepts = concepts_count.nlargest(10).index.tolist()
    top_10_counts = concepts_count.nlargest(10).values.tolist()
    top_10_concepts_with_counts = list(zip(top_10_concepts, top_10_counts))
    return top_10_concepts_with_counts

In [24]:
# Flowmetrics labeling
reach_columns = ["Twitter", "Facebook", "Wikipedia"]
engagement_columns = ["Blogs", "Reddit", "Videos", "Mendeley"]
feedback_columns = ["Peers", "Expert"]
influence_columns = ["News", "Citation_crossref"]
outcomes_columns = ["Policy", "Patents"]

In [25]:
agg_scores = df.groupby(["Topic", "Name"]).agg({
    "Twitter": "sum",
    "Facebook": "sum",
    "Reddit": "sum",
    "Wikipedia": "sum",
    "Blogs": "sum",
    "Videos": "sum",
    "Mendeley": "sum",
    "Peers": "sum",
    "Expert": "sum",
    "News": "sum",
    "Citation_crossref": "sum",
    "Policy": "sum",
    "Patents": "sum",
    "Research_concepts": lambda x: [concept for sublist in x for concept in sublist]
}).reset_index()

agg_scores['Count'] = df.groupby(["Topic", "Name"]).size().values
agg_scores['Count_norm'] = normalize(df.groupby(["Topic", "Name"]).size().values, 40, 100)
agg_scores["Reach"] = agg_scores[reach_columns].apply(lambda row: list(zip(reach_columns, row)), axis=1)
agg_scores["Engagement"] = agg_scores[engagement_columns].apply(lambda row: list(zip(engagement_columns, row)), axis=1)
agg_scores["Feedback"] = agg_scores[feedback_columns].apply(lambda row: list(zip(feedback_columns, row)), axis=1)
agg_scores["Influence"] = agg_scores[influence_columns].apply(lambda row: list(zip(influence_columns, row)), axis=1)
agg_scores["Outcome"] = agg_scores[outcomes_columns].apply(lambda row: list(zip(outcomes_columns, row)), axis=1)
agg_scores["Representation"] = agg_scores["Research_concepts"].apply(get_top_10_concepts)

In [26]:
# Remove topics where the 'Representation' list has fewer than 10 concepts
agg_scores = agg_scores[agg_scores["Representation"].apply(lambda x: len(x) == 10)].reset_index(drop=True)
agg_scores

Unnamed: 0,Topic,Name,Twitter,Facebook,Reddit,Wikipedia,Blogs,Videos,Mendeley,Peers,...,Patents,Research_concepts,Count,Count_norm,Reach,Engagement,Feedback,Influence,Outcome,Representation
0,0,optimization,1268.0,21.0,6.0,39.0,0.0,1.0,0.0,0.0,...,55.0,"[optimization problems, optimization, edge cro...",356,57,"[(Twitter, 1268.0), (Facebook, 21.0), (Wikiped...","[(Blogs, 0.0), (Reddit, 6.0), (Videos, 1.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 10812.0)]","[(Policy, 0.0), (Patents, 55.0)]","[(optimization, 356), (optimization problems, ..."
1,1,combinatorial problems,103.0,2.0,3.0,5.0,0.0,0.0,0.0,0.0,...,8.0,"[constraint satisfaction problems (csp), combi...",38,41,"[(Twitter, 103.0), (Facebook, 2.0), (Wikipedia...","[(Blogs, 0.0), (Reddit, 3.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 1294.0)]","[(Policy, 0.0), (Patents, 8.0)]","[(combinatorial problems, 38), (combinatorial ..."
2,2,matching algorithm,31.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,"[graph-based, synthetic data, state space, mat...",6,40,"[(Twitter, 31.0), (Facebook, 2.0), (Wikipedia,...","[(Blogs, 0.0), (Reddit, 0.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 169.0)]","[(Policy, 0.0), (Patents, 0.0)]","[(matching algorithm, 6), (matching methods, 5..."
3,3,artificial intelligence,2849.0,11.0,15.0,24.0,0.0,2.0,0.0,0.0,...,35.0,"[mobile robots, cognitive systems, cognitive s...",157,47,"[(Twitter, 2849.0), (Facebook, 11.0), (Wikiped...","[(Blogs, 0.0), (Reddit, 15.0), (Videos, 2.0), ...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 4861.0)]","[(Policy, 0.0), (Patents, 35.0)]","[(artificial intelligence, 157), (expert syste..."
4,4,probability,266.0,9.0,6.0,3.0,0.0,0.0,0.0,0.0,...,15.0,"[memetic, estimation of distribution algorithm...",59,42,"[(Twitter, 266.0), (Facebook, 9.0), (Wikipedia...","[(Blogs, 0.0), (Reddit, 6.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 1338.0)]","[(Policy, 0.0), (Patents, 15.0)]","[(probability, 59), (probability distributions..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,244,medical images,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,"[medical images, clutter (information theory),...",2,40,"[(Twitter, 7.0), (Facebook, 0.0), (Wikipedia, ...","[(Blogs, 0.0), (Reddit, 0.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 53.0)]","[(Policy, 0.0), (Patents, 0.0)]","[(medical images, 2), (clutter (information th..."
153,247,virtual reality,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,"[virtual reality, virtual worlds, knowledge tr...",6,40,"[(Twitter, 22.0), (Facebook, 0.0), (Wikipedia,...","[(Blogs, 0.0), (Reddit, 0.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 92.0)]","[(Policy, 0.0), (Patents, 0.0)]","[(virtual reality, 6), (virtual environments, ..."
154,264,autonomous driving,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,"[scene understanding, different frequency, fre...",2,40,"[(Twitter, 7.0), (Facebook, 0.0), (Wikipedia, ...","[(Blogs, 0.0), (Reddit, 0.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 23.0)]","[(Policy, 0.0), (Patents, 0.0)]","[(autonomous driving, 2), (scene understanding..."
155,272,embedded systems,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,"[grammatical evolution, l2 cache, memory syste...",1,40,"[(Twitter, 3.0), (Facebook, 0.0), (Wikipedia, ...","[(Blogs, 0.0), (Reddit, 0.0), (Videos, 0.0), (...","[(Peers, 0.0), (Expert, 0.0)]","[(News, 0.0), (Citation_crossref, 4.0)]","[(Policy, 0.0), (Patents, 0.0)]","[(grammatical evolution, 1), (l2 cache, 1), (m..."


In [27]:
topics = {}

for idx, row in agg_scores.iterrows():
    topics[idx] = {
        'topic': row['Name'],
        'cluster_size': row['Count_norm'],
        'reach': row['Reach'],
        'engagement': row['Engagement'],
        'feedback': row['Feedback'],
        'influence': row['Influence'],
        'outcome': row['Outcome'],
        'words': row['Representation']  # List of tuples (concept, count)
    }

### 1.2.3 Impact-Augmented Knowledge Graph

In [28]:
GRAPH_FILE = project_dir / "data" / "impact_augmented_kg.ttl"

In [29]:
def normalize_edge_weights(G, weight_attr="width", new_range=(0, 1)):
    a, b = new_range
    weights = [G[u][v][weight_attr] for u, v in G.edges]
    min_weight = min(weights)
    max_weight = max(weights)
    for u, v in G.edges:
        original_weight = G[u][v][weight_attr]
        normalized_weight = a + ((original_weight - min_weight) * (b - a)) / (max_weight - min_weight)
        G[u][v][weight_attr] = round(normalized_weight, 2)

In [30]:
G = nx.Graph()

for topic, value in topics.items():
    topic_node = f"Topic {topic}"
    G.add_node(topic_node, size=value['cluster_size'], label=value['topic'], type='topic')
    
    for word, _ in value.get("words", [])[:10]: #[:100]
        G.add_node(word, size=10, label=word, type='leaf')
        G.add_edge(topic_node, word, weight=1)

overlap_leafs = []
for topic1, value1 in topics.items():
    for topic2, value2 in topics.items():
        if topic1 != topic2:
            words1 = set([word[0] for word in value1.get("words", [])])
            words2 = set([word[0] for word in value2.get("words", [])])
            common_words = words1.intersection(words2)
            if common_words:
                overlap_leafs.append(next(iter(common_words)))
                G.add_edge(f"Topic {topic1}", f"Topic {topic2}", color="#4caf50", weight=len(common_words) + 
                           float(sum(value for _, value in value1.get("reach", []))) + 
                           float(sum(value for _, value in value2.get("reach", []))) + 
                           float(sum(value for _, value in value1.get("engagement", []))) + 
                           float(sum(value for _, value in value2.get("engagement", []))) +
                           float(sum(value for _, value in value1.get("feedback", []))) + 
                           float(sum(value for _, value in value2.get("feedback", []))) +
                           float(sum(value for _, value in value1.get("influence", []))) + 
                           float(sum(value for _, value in value2.get("influence", []))) +
                           float(sum(value for _, value in value1.get("outcome", []))) + 
                           float(sum(value for _, value in value2.get("outcome", []))))

# Normalize weights
normalize_edge_weights(G, weight_attr="weight", new_range=(1, 20))
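The weight assigned to a topic-to-topic edge above is the number of shared top concepts plus every stage total of both topics. The sketch below restates that rule for two toy topics; `stage_total` and `pair_weight` are hypothetical helpers, and the counts are made up.

```python
def stage_total(topic, stage):
    """Sum the scores stored as (platform, score) tuples for one stage."""
    return float(sum(score for _, score in topic.get(stage, [])))


def pair_weight(topic_a, topic_b, stages):
    """Shared-concept count plus all stage totals of both topics, or None if nothing is shared."""
    words_a = {word for word, _ in topic_a.get("words", [])}
    words_b = {word for word, _ in topic_b.get("words", [])}
    shared = words_a & words_b
    if not shared:
        return None  # no edge is created for topics without a common concept
    totals = sum(stage_total(t, s) for t in (topic_a, topic_b) for s in stages)
    return len(shared) + totals


# Two toy topics with made-up counts.
t1 = {"words": [("optimization", 5), ("graph", 2)],
      "reach": [("Twitter", 10.0)], "influence": [("News", 1.0)]}
t2 = {"words": [("optimization", 3), ("robots", 4)],
      "reach": [("Twitter", 4.0)], "influence": []}
t3 = {"words": [("semantics", 2)], "reach": [("Twitter", 9.0)]}

weight = pair_weight(t1, t2, ["reach", "influence"])
print(weight, pair_weight(t1, t3, ["reach", "influence"]))
```

Because raw platform totals are usually far larger than the number of shared concepts, the weight is driven mostly by the summed counts before it is rescaled to the 1 to 20 range.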

In [31]:
net = Network(height="800px", width="100%", notebook=True, bgcolor="white", font_color="black")  # alternative background: "#f0f0f0"
net.from_nx(G)

for node in net.nodes:
    if node['type'] == "topic":
        node['borderWidth'] = 4
        node['shadow'] = True
        node['color'] = "#9e9e9e"
    else:
        if node['id'] in overlap_leafs:
            node['borderWidth'] = 3
            node['shadow'] = True
            node['color'] = "#ff9800"
        else:
            node['borderWidth'] = 3
            node['shadow'] = True
            node['color'] = "#616161"

for edge in net.edges:
    weight = edge.get('width', 1)
    edge['label'] = str(weight)
    edge['font'] = {'size': 8}
    
net.force_atlas_2based()
net.show("G_topic_knowledge_graph.html")

G_topic_knowledge_graph.html


In [32]:
# Set up RDF graph and namespaces
rdf_graph = RDFGraph()
FLOW = Namespace("http://example.org/flowmetrics#")
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
CSO = Namespace("http://cso.kmi.open.ac.uk/schema/cso#")

rdf_graph.bind("flow", FLOW)
rdf_graph.bind("skos", SKOS)
rdf_graph.bind("cso", CSO)
rdf_graph.bind("rdfs", RDFS)
rdf_graph.bind("xsd", XSD)

def extract_topic_id(label):
    return label.replace("Topic ", "").strip()  # returns string like "42"

for node, data in G.nodes(data=True):
    if data["type"] == "topic":
        topic_id = extract_topic_id(node)
        topic_uri = URIRef(FLOW + f"topic_{topic_id}")
        rdf_graph.add((topic_uri, RDF.type, FLOW.ResearchTopic))
        rdf_graph.add((topic_uri, RDF.type, SKOS.Concept))
        rdf_graph.add((topic_uri, RDFS.label, Literal(data["label"], datatype=XSD.string)))

for u, v, data in G.edges(data=True):
    if G.nodes[u]["type"] == "topic" and G.nodes[v]["type"] == "topic":
        id1, id2 = extract_topic_id(u), extract_topic_id(v)
        topic_uri1 = URIRef(FLOW + f"topic_{id1}")
        topic_uri2 = URIRef(FLOW + f"topic_{id2}")
        sorted_ids = sorted([int(id1), int(id2)])
        pair_uri = URIRef(FLOW + f"pair_{sorted_ids[0]}_{sorted_ids[1]}")

        rdf_graph.add((pair_uri, RDF.type, FLOW.TopicPair))
        rdf_graph.add((pair_uri, FLOW.hasTopic, topic_uri1))
        rdf_graph.add((pair_uri, FLOW.hasTopic, topic_uri2))

        words1 = set(w[0] for w in topics[int(id1)].get("words", []))
        words2 = set(w[0] for w in topics[int(id2)].get("words", []))
        shared = sorted(words1 & words2)
        for concept in shared[:3]:
            rdf_graph.add((pair_uri, FLOW.hasSharedConcept, Literal(concept, datatype=XSD.string)))

        for stage in impact_stages:
            platform_scores = {}
            for topic_id in [int(id1), int(id2)]:
                for platform, score in topics[topic_id].get(stage, []):
                    platform_scores[platform] = platform_scores.get(platform, 0.0) + score

            for platform, score in platform_scores.items():
                bnode = BNode()
                rdf_graph.add((pair_uri, FLOW[f"has{stage.capitalize()}Impact"], bnode))
                rdf_graph.add((bnode, FLOW.score, Literal(score, datatype=XSD.float)))
                rdf_graph.add((bnode, FLOW.platform, Literal(platform, datatype=XSD.string)))

# Save TTL
rdf_graph.serialize(destination=str(GRAPH_FILE), format="turtle")
print("RDF TTL file saved.")

RDF TTL file saved.
