# Reference‑aware citation sentence extraction (PMC JATS)

このノートブックは **PMC の本文 XML（JATS）** を使って、
- 参照リスト（`<ref-list>`）で **対象論文（PMID/DOI）に対応する `<ref id>`** を特定し、
- 本文中の **`<xref ref-type="bibr" rid="...">`** の位置から、**引用が実際に出てくる「文」** を抽出します。

> OA でない論文や PMC に本文が無い場合は、**抄録**を文分割して近似します（オプション）。


## Parameters (edit then Run All)

- `TARGET_PMID`: 解析対象の PMID
- `OUT_DIR`: 出力ディレクトリ
- `MAX_CITING`: 取得する被引用論文数の上限（None で全件）
- `FALLBACK_TO_ABSTRACT`: PMC で引用文が特定できない時に抄録で代替するか
- `NCBI_API_KEY`: 任意。指定するとレート制限が緩和 (3 → 10 req/s)


In [1]:
# >>> Edit here <<<
TARGET_PMID = "10519872"   # Example: Re-emergent tremor of Parkinson's disease
OUT_DIR = "results_csv"
MAX_CITING = 50            # None for all
FALLBACK_TO_ABSTRACT = True
NCBI_API_KEY = ""
PAUSE = 0.25               # polite sleep between API calls (sec)


## Install & one‑time data

初回のみ実行してください。`punkt_tab` は NLTK 3.8+ で必要です。


In [2]:
%pip -q install requests pandas tqdm nltk textblob lxml

Note: you may need to restart the kernel to use updated packages.


In [3]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yutaashihara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/yutaashihara/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yutaashihara/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## API notes

- NCBI E-utilities は 3 req/sec（API key で 10 req/sec）。`PAUSE` を調整してください。
- PMC JATS が無い論文は本文からの厳密抽出ができません（抄録近似にフォールバック）。


## Pipeline

1. PubMed `efetch` で **対象論文**のメタ（title/journal/DOI）を取得

2. `elink (pubmed_pubmed_citedin)` で **被引用 PMIDs** を列挙

3. 各 citing PMID のメタから **PMCID** を取得 → あれば `efetch db=pmc` で **JATS** を取得

4. `<ref-list>` の `<ref>` を走査し、`<pub-id pub-id-type="pmid/doi">` が **対象**に一致するものの **`id` (R#)** を取得

5. 本文 `.//body//xref[@ref-type="bibr" and @rid="R#"]` をすべて探索し、親段落を文字列化 → **[CIT]** を挿入 → 文分割 → [CIT] を含む文だけを抽出

6. 文ごとにスコア（VADER/TextBlob/Custom + pos/neg ratios）を計算し、CSV へ


## Scoring (Basic)

- `vader_score`: NLTK VADER の compound（−1〜1）→ (x+1)/2 で 0〜1 に変換
- `textblob_score`: TextBlob polarity（−1〜1）→ 0〜1
- `custom_score`: 簡易ポジ/ネガ辞書で (pos−neg)/(pos+neg) を 0〜1 へ
- `pos_ratio` / `neg_ratio`: 語彙比率（参考指標）


In [5]:
import os, re, time, math, requests, pandas as pd
from lxml import etree
from tqdm import tqdm
from nltk import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

EU = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def _get(url, **params):
    if NCBI_API_KEY:
        params.setdefault("api_key", NCBI_API_KEY)
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()
    return r

def get_pubmed_meta(pmid: str):
    xml = _get(EU+"efetch.fcgi", db="pubmed", id=pmid, retmode="xml").text
    root = etree.fromstring(xml.encode())
    def xstr(xpath): 
        return (root.xpath(f'string({xpath})') or "").strip()
    meta = {
        "title":   xstr(".//ArticleTitle"),
        "journal": xstr(".//Journal/Title"),
        "pmcid":   xstr(".//ArticleIdList/ArticleId[@IdType=\\\"pmc\\\"]"),
        "doi":     xstr(".//ArticleIdList/ArticleId[@IdType=\\\"doi\\\"]").lower(),
    }
    return meta

def get_citing_pmids(pmid: str):
    j = _get(EU+"elink.fcgi", dbfrom="pubmed", linkname="pubmed_pubmed_citedin", id=pmid, retmode="json").json()
    dbs = j["linksets"][0].get("linksetdbs", [])
    pmids = dbs[0]["links"] if dbs else []
    return pmids

def get_pmc_jats(pmcid: str):
    if not pmcid: 
        return None
    xml = _get(EU+"efetch.fcgi", db="pmc", id=pmcid).text
    try:
        return etree.fromstring(xml.encode())
    except Exception:
        return None

def find_ref_ids_for_target(jats_root, target_pmid: str, target_doi: str):
    if jats_root is None:
        return []
    ref_ids = []
    for ref in jats_root.xpath('.//ref-list//ref'):
        rid = ref.get('id')
        if not rid: 
            continue
        pmid = (ref.xpath('string(.//pub-id[@pub-id-type="pmid"])') or "").strip()
        doi  = (ref.xpath('string(.//pub-id[@pub-id-type="doi"])') or "").strip().lower()
        ext_doi = (ref.xpath('string(.//ext-link[@ext-link-type="doi"])') or "").strip().lower()
        if (pmid and pmid == target_pmid) or            (target_doi and (doi == target_doi or ext_doi == target_doi)):
            ref_ids.append(rid)
    return list(dict.fromkeys(ref_ids))

def paragraph_text_with_marker(node_xml: str, rid: str) -> str:
    marked = re.sub(rf'<xref[^>]*\brid\s*=\s*"{re.escape(rid)}"[^>]*>.*?</xref>', ' [CIT] ', node_xml, flags=re.I|re.S)
    text = re.sub(r'<[^>]+>', '', marked)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

BLOCK_TAGS = {'p','td','th','caption','sec','list-item'}

def extract_citation_sentences(jats_root, rid: str):
    sents = []
    if jats_root is None:
        return sents
    xrefs = jats_root.xpath(f".//body//xref[@ref-type='bibr' and @rid='{rid}']")
    for x in xrefs:
        node = x
        while node is not None and node.tag not in BLOCK_TAGS:
            node = node.getparent()
        if node is None:
            node = x.getparent()
        xml_str = etree.tostring(node, encoding='unicode')
        text = paragraph_text_with_marker(xml_str, rid)
        for sent in sent_tokenize(text):
            if 'CIT' in sent:
                sents.append(sent.strip())
    return list(dict.fromkeys(sents))

# ---- Scorers ----
sid = SentimentIntensityAnalyzer()
TOKEN = re.compile(r"[A-Za-z']+")
CUSTOM_POS = {"support","increase","enhance","robust","effective","novel","improve","key","helps","correlated","important","useful"}
CUSTOM_NEG = {"reduce","decrease","inhibit","fail","negative","contradict","weak","poor","limited"}

def vader01(text: str) -> float:
    return (sid.polarity_scores(text)["compound"] + 1) / 2

def textblob01(text: str) -> float:
    return (TextBlob(text).sentiment.polarity + 1) / 2

def ratios_and_custom(text: str):
    toks = [t.lower() for t in TOKEN.findall(text)]
    if not toks:
        return 0.5, 0.0, 0.0
    S = set(toks)
    pos = len(S & CUSTOM_POS)
    neg = len(S & CUSTOM_NEG)
    total = len(S)
    pos_ratio = pos / total
    neg_ratio = neg / total
    score = (pos - neg) / (pos + neg + 1e-6)
    score01 = (score + 1) / 2
    return score01, pos_ratio, neg_ratio




## Run pipeline

In [6]:
import os
os.makedirs(OUT_DIR, exist_ok=True)

tmeta = get_pubmed_meta(TARGET_PMID)
t_title, t_journal, t_doi = tmeta["title"], tmeta["journal"], tmeta["doi"]

citing = get_citing_pmids(TARGET_PMID)
if MAX_CITING is not None:
    citing = citing[:MAX_CITING]

rows = []
for cpmid in tqdm(citing, desc=f"Cited-by for PMID {TARGET_PMID}"):
    try:
        cmeta = get_pubmed_meta(cpmid)
        ct_title, c_pmcid = cmeta["title"], cmeta["pmcid"]
        jats = get_pmc_jats(c_pmcid) if c_pmcid else None

        sentences = []
        ref_ids = []
        if jats is not None:
            ref_ids = find_ref_ids_for_target(jats, TARGET_PMID, t_doi)
            for rid in ref_ids:
                sentences.extend(extract_citation_sentences(jats, rid))

        if not sentences and FALLBACK_TO_ABSTRACT:
            xml = _get(EU+"efetch.fcgi", db="pubmed", id=cpmid, retmode="xml").text
            root = etree.fromstring(xml.encode())
            abst = ' '.join(root.xpath('.//AbstractText/text()')).strip()
            sentences = sent_tokenize(abst) if abst else []

        for sent in sentences:
            v_vader = vader01(sent)
            v_tb = textblob01(sent)
            v_custom, r_pos, r_neg = ratios_and_custom(sent)
            rows.append({
                "source_title": t_title,
                "source_journal": t_journal,
                "citing_title": ct_title,
                "citing_pmid": cpmid,
                "pmcid": c_pmcid,
                "ref_ids": ';'.join(ref_ids) if ref_ids else "",
                "citation_sentence": sent,
                "vader_score": v_vader,
                "textblob_score": v_tb,
                "custom_score": v_custom,
                "pos_ratio": r_pos,
                "neg_ratio": r_neg,
            })
        time.sleep(PAUSE)
    except Exception as e:
        print("WARN:", cpmid, e)

import pandas as pd
cols = ["source_title","source_journal","citing_title","citing_pmid","pmcid","ref_ids",
        "citation_sentence","vader_score","textblob_score","custom_score","pos_ratio","neg_ratio"]
df = pd.DataFrame(rows, columns=cols)
out_csv = os.path.join(OUT_DIR, f"{TARGET_PMID}_citations_refaware_basic.csv")
df.to_csv(out_csv, index=False)
out_csv

XPathEvalError: Invalid expression