# Citation Sentiment Pipeline (With Transformers)

This notebook extends the **Basic** pipeline by adding two ML models:

- **MultiCite (AllenAI)** `allenai/multicite-multilabel-scibert` → *multi-label citation intent*
  - labels: `motivation`, `background`, `uses`, `extends`, `similarities`, `differences`, `future_work`
  - we report the **top-1 label** and probability, and all labels above a threshold (default 0.5)
- **SiEBERT (RoBERTa-large)** `siebert/sentiment-roberta-large-english` → *binary sentiment*
  - we report the **positive probability** (0–1) as `hf_sent_score`

The rest of the pipeline is the same:
1) `Cited by` → citing PMIDs, 2) citing abstracts → sentences, 3) compute indicators, 4) CSV.
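
The MultiCite reporting rule described above (top-1 label plus every label at or above the threshold) can be illustrated without downloading the model. The following is a minimal pure-Python sketch: the logits are made-up numbers rather than model output, `summarize_multilabel` is a hypothetical helper name, and the label order is copied from the list above, so it may differ from the order stored in the model's config. Each logit goes through its own sigmoid, so several labels, or none, can clear the threshold.

```python
import math

# Label names copied from the list above; the real order comes from the model config.
LABELS = ["motivation", "background", "uses", "extends",
          "similarities", "differences", "future_work"]

def sigmoid(x: float) -> float:
    """Logistic function; stands in for torch.sigmoid on a single logit."""
    return 1.0 / (1.0 + math.exp(-x))

def summarize_multilabel(logits, labels=LABELS, thresh=0.5):
    """Return (top-1 label, top-1 probability, labels with probability >= thresh)."""
    probs = [sigmoid(z) for z in logits]
    top_idx = max(range(len(probs)), key=probs.__getitem__)
    multi = [lbl for lbl, p in zip(labels, probs) if p >= thresh]
    return labels[top_idx], probs[top_idx], multi

# Invented logits for illustration only
toy_logits = [-2.0, 1.5, 0.3, -1.0, -3.0, -0.2, -4.0]
top_label, top_prob, multi = summarize_multilabel(toy_logits)
print(top_label, round(top_prob, 3), multi)
# background 0.818 ['background', 'uses']
```

Because the probabilities are independent, the multi-label list can be empty even though a top-1 label is always reported; checking `mc_top1_prob` alongside `mc_top1_label` helps spot such low-confidence sentences.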


## Parameters (edit here and Run All)

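- `TARGET_PMID`: PubMed ID of the paper to analyze
- `OUT_DIR`: directory where the CSV is written
- `MAX_CITING`: (optional) maximum number of citing papers to process (`None` for all)
- `NCBI_API_KEY`: (optional) your NCBI API key; raises the rate limit from 3 to 10 req/s
- `PAUSE`: pause between API calls, in seconds
- `MULTICITE_THRESH`: probability threshold for including a label in `mc_labels_multi`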

> Tip: Get a free NCBI API key at <https://www.ncbi.nlm.nih.gov/account/> and set it here or as an env var.
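
The parameter cell below only reads the key from the `NCBI_API_KEY` variable. One possible way to support the env-var option mentioned in the tip is sketched here; `resolve_api_key` is a hypothetical helper, and using `NCBI_API_KEY` as the environment variable name is an assumption, not something the notebook defines.

```python
import os

def resolve_api_key(explicit: str = "", env=None, var_name: str = "NCBI_API_KEY") -> str:
    """Prefer the key typed into the notebook; otherwise fall back to the environment.

    `var_name` is an assumed environment variable name.
    """
    env = os.environ if env is None else env
    return explicit.strip() or env.get(var_name, "").strip()

# Demonstration with explicit dictionaries instead of the real environment
print(repr(resolve_api_key("", env={"NCBI_API_KEY": "abc123"})))       # 'abc123'
print(repr(resolve_api_key("mykey", env={"NCBI_API_KEY": "abc123"})))  # 'mykey'
print(repr(resolve_api_key("", env={})))                               # ''
```

In the notebook this could be used as `NCBI_API_KEY = resolve_api_key(NCBI_API_KEY)` right after the parameter cell; an empty result means requests go out without a key.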


In [1]:
# >>> Edit here <<<
TARGET_PMID = "10519872"  # Example: Re-emergent tremor of Parkinson's disease (JNNP, 1999)
OUT_DIR = "results_csv"
MAX_CITING = 50          # set to None for all
NCBI_API_KEY = ""        # optional API key
PAUSE = 0.25             # polite pause between API calls (seconds)

# Transformers
MULTICITE_MODEL = "allenai/multicite-multilabel-scibert"
SIEBERT_MODEL   = "siebert/sentiment-roberta-large-english"
MULTICITE_THRESH = 0.50  # labels above this probability will be included in mc_labels_multi


## Install & Data (one-time)

Run this cell once per environment. This installs both the **basic** deps and **Transformers**.


In [2]:
# installs (skip if already installed)
%pip -q install requests pandas tqdm nltk textblob lxml transformers torch sentencepiece

Note: you may need to restart the kernel to use updated packages.


In [3]:
# one-time downloads for NLTK resources
import nltk
nltk.download('punkt')       # fine if it is already installed
nltk.download('punkt_tab')   # now also required
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yutaashihara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/yutaashihara/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yutaashihara/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## About API limits

- PubMed E-utilities rate limit: **3 req/sec** (or **10 req/sec** with an API key).
- This notebook sleeps between calls to be polite. You can tune `PAUSE` if needed.
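
As a rough guide for tuning `PAUSE`, the limits above translate into a minimum spacing between requests. The sketch below is one way to enforce that spacing per request; `min_interval` and `Throttle` are hypothetical names that do not appear in the notebook, which uses a fixed `time.sleep(PAUSE)` instead.

```python
import time

def min_interval(has_api_key: bool) -> float:
    """Smallest gap in seconds between requests that stays within the stated limit."""
    return 1.0 / (10 if has_api_key else 3)

class Throttle:
    """Sleeps just long enough to keep consecutive calls `interval` seconds apart."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self._last = None       # time of the previous call, None before the first one

    def wait(self) -> float:
        """Block if needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.interval - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
                slept = remaining
        self._last = self.clock()
        return slept

print(round(min_interval(False), 3), round(min_interval(True), 3))  # 0.333 0.1

throttle = Throttle(min_interval(has_api_key=False))
throttle.wait()  # first call returns immediately; later calls are spaced out
```

Calling `throttle.wait()` before each request would space requests individually. Note that 1/3 s is longer than the default `PAUSE = 0.25`, and the main loop issues two `efetch` calls per citing paper, so without a key the fixed pause relies on network latency to stay under the limit.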


## Load libraries & define helpers

In [4]:
import os, re, time

import pandas as pd
import requests
import torch
from lxml import etree
from tqdm import tqdm
from nltk import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EU = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def _get(url, **params):
    if NCBI_API_KEY:
        params.setdefault("api_key", NCBI_API_KEY)
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()
    return r

def get_pubmed_meta(pmid: str):
    xml = _get(EU+"efetch.fcgi", db="pubmed", id=pmid, retmode="xml").text
    root = etree.fromstring(xml.encode())
    title = root.xpath('string(.//ArticleTitle)') or ""
    journal = root.xpath('string(.//Journal/Title)') or ""
    pmcid = root.xpath('string(.//ArticleIdList/ArticleId[@IdType="pmc"])') or ""
    doi   = root.xpath('string(.//ArticleIdList/ArticleId[@IdType="doi"])') or ""
    return dict(title=title.strip(), journal=journal.strip(), pmcid=pmcid.strip(), doi=doi.strip())

def get_citing_pmids(pmid: str):
    j = _get(EU+"elink.fcgi", dbfrom="pubmed", linkname="pubmed_pubmed_citedin", id=pmid, retmode="json").json()
    dbs = j["linksets"][0].get("linksetdbs", [])
    pmids = dbs[0]["links"] if dbs else []
    return pmids

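def get_abstract_sentences(pmid: str):
    """Return the abstract of `pmid` split into sentences ([] if there is no abstract)."""
    xml = _get(EU+"efetch.fcgi", db="pubmed", id=pmid, retmode="xml").text
    root = etree.fromstring(xml.encode())
    # itertext() also keeps text nested in inline tags (<i>, <sub>, ...),
    # which './/AbstractText/text()' silently drops
    parts = (''.join(el.itertext()).strip() for el in root.xpath('.//AbstractText'))
    abst = ' '.join(p for p in parts if p)
    return sent_tokenize(abst) if abst else []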

# ---- Scorers: VADER / TextBlob / Custom ----
sid = SentimentIntensityAnalyzer()
TOKEN = re.compile(r"[A-Za-z']+")
CUSTOM_POS = {"support","increase","enhance","robust","effective","novel","improve","key","helps","correlated","important","useful"}
CUSTOM_NEG = {"reduce","decrease","inhibit","fail","negative","contradict","weak","poor","limited"}

def vader01(text: str) -> float:
    return (sid.polarity_scores(text)["compound"] + 1) / 2

def textblob01(text: str) -> float:
    return (TextBlob(text).sentiment.polarity + 1) / 2

def ratios_and_custom(text: str):
    toks = [t.lower() for t in TOKEN.findall(text)]
    if not toks:
        return 0.5, 0.0, 0.0
    S = set(toks)
    pos = len(S & CUSTOM_POS)
    neg = len(S & CUSTOM_NEG)
    total = len(S)
    pos_ratio = pos / total
    neg_ratio = neg / total
    score = (pos - neg) / (pos + neg + 1e-6)
    score01 = (score + 1) / 2
    return score01, pos_ratio, neg_ratio

# ---- Transformers: MultiCite (multi-label) & SiEBERT (binary sentiment) ----
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

mc_tok = AutoTokenizer.from_pretrained(MULTICITE_MODEL)
mc_mod = AutoModelForSequenceClassification.from_pretrained(MULTICITE_MODEL).to(device).eval()

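# Build the label list in logit order from config.id2label.
# After loading, id2label is keyed by int, so looking up str(i) raises KeyError: '0';
# the str fallback only covers configs that keep string keys.
mc_id2label = mc_mod.config.id2label
mc_label_order = [mc_id2label[i] if i in mc_id2label else mc_id2label[str(i)]
                  for i in range(mc_mod.config.num_labels)]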

def multicite_labels(text: str, thresh: float = 0.5):
    with torch.no_grad():
        out = mc_mod(**mc_tok(text, return_tensors='pt', truncation=True, max_length=512).to(device))
        # Multi-label → sigmoid
        probs = torch.sigmoid(out.logits)[0].detach().cpu().numpy()
    # top-1
    top_idx = int(probs.argmax())
    top_label = mc_label_order[top_idx]
    top_prob = float(probs[top_idx])
    # all labels above threshold
    multi = [lbl for lbl, p in zip(mc_label_order, probs) if p >= thresh]
    return top_label, top_prob, multi, probs.tolist()

# SiEBERT (pos/neg)
sb_tok = AutoTokenizer.from_pretrained(SIEBERT_MODEL)
sb_mod = AutoModelForSequenceClassification.from_pretrained(SIEBERT_MODEL).to(device).eval()

def hf_sentiment_posprob(text: str) -> float:
    with torch.no_grad():
        out = sb_mod(**sb_tok(text, return_tensors='pt', truncation=True, max_length=512).to(device))
        probs = torch.softmax(out.logits, dim=-1)[0]
        pos_prob = float(probs[1].detach().cpu())
    return pos_prob
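
To make the custom lexicon score defined above more concrete, the sketch below repeats the logic of `ratios_and_custom` under a different, hypothetical name (`custom_score_demo`) and applies it to three invented sentences. The word lists are copied from the cell above; the sentences are toy examples, not citing abstracts.

```python
import re

TOKEN = re.compile(r"[A-Za-z']+")
CUSTOM_POS = {"support", "increase", "enhance", "robust", "effective", "novel",
              "improve", "key", "helps", "correlated", "important", "useful"}
CUSTOM_NEG = {"reduce", "decrease", "inhibit", "fail", "negative", "contradict",
              "weak", "poor", "limited"}

def custom_score_demo(text: str):
    """Same computation as ratios_and_custom: (score in [0, 1], pos_ratio, neg_ratio)."""
    toks = [t.lower() for t in TOKEN.findall(text)]
    if not toks:
        return 0.5, 0.0, 0.0
    unique = set(toks)                    # repeated words count once
    pos = len(unique & CUSTOM_POS)
    neg = len(unique & CUSTOM_NEG)
    score = (pos - neg) / (pos + neg + 1e-6)   # in (-1, 1); 0 when nothing matches
    return (score + 1) / 2, pos / len(unique), neg / len(unique)

for sentence in [
    "These results support a robust and effective approach.",
    "The method improves accuracy.",
    "Limited and weak evidence, but a useful idea.",
]:
    print(tuple(round(v, 3) for v in custom_score_demo(sentence)))
# (1.0, 0.375, 0.0)
# (0.5, 0.0, 0.0)
# (0.333, 0.125, 0.25)
```

The second sentence scores a neutral 0.5 because matching is exact: `improves` is not in the list, only `improve`. Adding inflected forms to the word sets or lemmatizing tokens first would be one way to reduce such misses.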


## Run pipeline

In [None]:
import os
os.makedirs(OUT_DIR, exist_ok=True)

meta = get_pubmed_meta(TARGET_PMID)
src_title, src_journal = meta["title"], meta["journal"]

citing = get_citing_pmids(TARGET_PMID)
if MAX_CITING is not None:
    citing = citing[:MAX_CITING]

rows = []
for cpmid in tqdm(citing, desc=f"Cited-by for PMID {TARGET_PMID}"):
    try:
        cmeta = get_pubmed_meta(cpmid)
        ct_title = cmeta["title"]
        sentences = get_abstract_sentences(cpmid)
        if not sentences:
            continue
        for sent in sentences:
            # basic
            v_vader = vader01(sent)
            v_tb = textblob01(sent)
            v_custom, r_pos, r_neg = ratios_and_custom(sent)
            # transformers
            mc_label, mc_prob, mc_multi, mc_all = multicite_labels(sent, thresh=MULTICITE_THRESH)
            sb_pos = hf_sentiment_posprob(sent)
            rows.append({
                "source_title": src_title,
                "source_journal": src_journal,
                "citing_title": ct_title,
                "citing_pmid": cpmid,
                "citation_sentence": sent,
                "vader_score": v_vader,
                "textblob_score": v_tb,
                "custom_score": v_custom,
                "pos_ratio": r_pos,
                "neg_ratio": r_neg,
                "mc_top1_label": mc_label,
                "mc_top1_prob": mc_prob,
                "mc_labels_multi": ';'.join(mc_multi),
                "hf_sent_score": sb_pos
            })
        time.sleep(PAUSE)
    except Exception as e:
        print("WARN:", cpmid, e)

import pandas as pd
df = pd.DataFrame(rows, columns=[
    "source_title","source_journal","citing_title","citing_pmid","citation_sentence",
    "vader_score","textblob_score","custom_score","pos_ratio","neg_ratio",
    "mc_top1_label","mc_top1_prob","mc_labels_multi","hf_sent_score"
])
out_csv = os.path.join(OUT_DIR, f"{TARGET_PMID}_citations_transformers.csv")
df.to_csv(out_csv, index=False)
out_csv
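
Once the CSV is written, `mc_labels_multi` holds the labels joined by `;` and is empty when no label cleared the threshold. The following sketch shows one possible way to summarize such a file with the standard library only. The CSV text is a tiny invented sample with three of the real columns, and `summarize` is a hypothetical helper.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented rows for illustration; the real file has more columns
TOY_CSV = """citing_pmid,mc_labels_multi,hf_sent_score
111,background;uses,0.90
111,,0.40
222,background,0.20
"""

def summarize(csv_file):
    """Count labels across sentences and average hf_sent_score per citing paper."""
    label_counts = Counter()
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # split(';') on an empty cell yields [''], so drop empty strings
        labels = [lbl for lbl in row["mc_labels_multi"].split(";") if lbl]
        label_counts.update(labels)
        scores[row["citing_pmid"]].append(float(row["hf_sent_score"]))
    mean_sent = {pmid: sum(vals) / len(vals) for pmid, vals in scores.items()}
    return label_counts, mean_sent

label_counts, mean_sent = summarize(io.StringIO(TOY_CSV))
print(dict(label_counts))                               # {'background': 2, 'uses': 1}
print({k: round(v, 2) for k, v in mean_sent.items()})   # {'111': 0.65, '222': 0.2}
```

For the real output, `open(out_csv, newline="", encoding="utf-8")` can be passed in place of the `StringIO` object. When reading the file with pandas instead, empty `mc_labels_multi` cells load as NaN by default, so they would need to be filled before splitting.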