Original Notebook: 

https://www.kaggle.com/code/verracodeguacas/xml-inside-recall-boosted

# 📄🧾 XML Inside, Recall Boosted! 🚀

This notebook builds a complete extraction pipeline for dataset citations in scientific articles. It includes XML files so that no information is left on the table. It is designed to be run **cell-by-cell**, which in my opinion makes it much easier to understand, modify, and debug than pipeline-style scripts that wrap everything into `.py` files.

There are two new main contributions here:

- **XML support**: Some articles in this dataset include useful metadata only in the XML files — not the PDF. Including the XML gives a small but measurable improvement in score. If you’re aiming to squeeze out every last bit of performance, you’ll need to include it. 
  
- **Transparent execution**: All code runs directly in the notebook, with intermediate outputs shown in place. This structure helps with experimentation and troubleshooting, especially during development or error analysis.

Throughout the notebook, you’ll also see **diagnostic printouts** that helped surface edge cases and bugs during development, especially around ID cleaning and filtering. That part still needs more tuning than I have given it so far.

> Inspiration for this approach came from [this excellent notebook](https://www.kaggle.com/code/mccocoful/notebook2d0b45c244). I’ve adopted most of its best practices while adding flexibility and XML handling.


In [1]:
# Install packages
!uv pip install -q --system --no-index --find-links='/kaggle/input/latest-mdc-whls/whls' pymupdf

## 🔧 Environment Setup and Constants

We begin by installing the required dependencies and importing all key libraries. The notebook uses `pymupdf` to parse PDFs, `lxml` to handle XML, and prefers `polars` over `pandas` for performance — although we'll switch back to pandas in a few places where polars gave me trouble.

We also define a few helper functions for things like scoring, formatting DOIs, and resolving paths depending on whether we’re running locally or in the Kaggle environment.

Verbose mode for Polars is turned on at this stage as well. This helps surface type coercions or lazy evaluation issues when chaining operations.


In [2]:
# Imports and Constants
import os, re, pathlib
import polars as pl
from lxml import etree
import pymupdf
from typing import Tuple

DOI_URL = 'https://doi.org/'

# Polars verbosity for debugging
pl.Config.set_verbose(True)

polars.config.Config

In [3]:
# Utilities and Helpers

def is_submission():
    return bool(os.getenv('KAGGLE_IS_COMPETITION_RERUN'))

def is_kaggle_env():
    return (len([k for k in os.environ.keys() if 'KAGGLE' in k]) > 0) or is_submission()

def get_prefix_path(prefix: str) -> pathlib.Path:
    # Use correct directory based on environment
    return pathlib.Path(f'/kaggle/{prefix}' if is_kaggle_env() else f'.{prefix}').expanduser().resolve()

def is_doi(name: str) -> pl.Expr:
    return pl.col(name).str.starts_with(DOI_URL)

def doi_link_to_id(name: str) -> pl.Expr:
    return pl.when(is_doi(name)).then(pl.col(name).str.split(DOI_URL).list.last()).otherwise(name).alias(name)

def doi_id_to_link(name: str, substring: str, url: str = DOI_URL) -> pl.Expr:
    return pl.when(pl.col(name).str.starts_with(substring)).then(url + pl.col(name).str.to_lowercase()).otherwise(name).alias(name)

def score(preds: pl.DataFrame, gt: pl.DataFrame, on: list = ['article_id', 'dataset_id'], verbose: bool = True) -> Tuple[float, float, float]:
    if 'id' in preds.columns and 'dataset_id' not in preds.columns:
        preds = preds.rename({'id': 'dataset_id'})
    hits = gt.join(preds, on=on)
    tp = hits.height
    fp = preds.height - tp
    fn = gt.height - tp

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    if verbose:
        print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
        print(f"True Positives: {tp}, False Positives: {fp}, False Negatives: {fn}")

    return precision, recall, f1
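
As a quick sanity check on the metric, here is a minimal sketch of the same precision/recall/F1 arithmetic on plain Python sets. `score_pairs` and the sample pairs are made up for illustration. Unlike the Polars join in `score`, a set collapses duplicate predictions, so the two can disagree when `preds` contains repeated rows.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (article_id, dataset_id)

def score_pairs(preds: Iterable[Pair], gt: Iterable[Pair]) -> Tuple[float, float, float]:
    """Hypothetical helper: the same arithmetic as `score`, on sets of pairs."""
    pred_set, gt_set = set(preds), set(gt)
    tp = len(pred_set & gt_set)   # predicted and in the ground truth
    fp = len(pred_set - gt_set)   # predicted but not in the ground truth
    fn = len(gt_set - pred_set)   # in the ground truth but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative data only: 2 hits, 1 false positive, 1 miss.
gt = [("a1", "d1"), ("a1", "d2"), ("a2", "d3")]
preds = [("a1", "d1"), ("a2", "d3"), ("a2", "d9")]
precision, recall, f1 = score_pairs(preds, gt)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

With two hits, one false positive, and one miss, all three values come out to 2/3.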

## 📄 XML and PDF Parsing Functions

This section defines the core logic to convert PDFs and XMLs into plain text.

- `pdf2text` reads each PDF using `pymupdf`, extracting text page by page.
- `xml2text` uses `lxml` and applies different parsing logic depending on the XML schema. It handles common formats like TEI, JATS, Wiley, and BioC, and falls back to a generic method for unknown structures. I learned about these formats from this discussion: https://www.kaggle.com/competitions/make-data-count-finding-data-references/discussion/584638
- A small normalization trick removes whitespace that sometimes shows up right after a DOI prefix in the XML text (e.g. `10.5061/` followed by a line break), so the prefix and suffix end up on one line again.

The goal here is to get everything into a consistent, readable text format so we can scan for dataset IDs later. If both XML and PDF exist for an article, we append the XML content after the PDF — no deduplication, just a simple merge.
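
To make the DOI repair concrete, here is a minimal sketch, assuming the goal is to rejoin a DOI prefix with its suffix when whitespace lands right after the slash. `DOI_BREAK` and `rejoin_doi` are illustrative names. The capture group matters: the replacement has to put the matched prefix back, otherwise the registrant code (`5061` here) would be lost.

```python
import re

# Assumed pattern: "10." + 4 to 9 digits + "/" followed by stray whitespace.
DOI_BREAK = re.compile(r"(10\.\d{4,9}/)\s+")

def rejoin_doi(text: str) -> str:
    """Hypothetical helper: drop whitespace after a DOI prefix, keeping the prefix."""
    return DOI_BREAK.sub(r"\1", text)

sample = "Data: https://doi.org/10.5061/\ndryad.b8gtht7h3 (Dryad)."
fixed = rejoin_doi(sample)
print(fixed)
```

Text without a broken prefix passes through unchanged, so the substitution is safe to apply to every document.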


In [4]:
# XML & PDF Parsing

def xml_kind(path: pathlib.Path) -> str:
    # Sniff only the first 2 KB: enough to see the DOCTYPE or root namespace
    with path.open('rb') as fh:
        head = fh.read(2048).decode('utf8', 'ignore')
    if 'www.tei-c.org/ns' in head:
        return 'tei'
    if re.search(r'(NLM|TaxonX)//DTD', head):
        return 'jats'
    if 'www.wiley.com/namespaces' in head:
        return 'wiley'
    if 'BioC.dtd' in head:
        return 'bioc'
    return 'unknown'

def xml2text(path: pathlib.Path) -> str:
    kind = xml_kind(path)
    root = etree.parse(str(path)).getroot()
    if kind == 'jats':
        elems = root.xpath('//body//sec|//ref-list')
        txt = ' '.join(' '.join(e.itertext()) for e in elems)
    elif kind == 'wiley':
        elems = root.xpath('//*[local-name()="body"]|//*[local-name()="refList"]')
        txt = ' '.join(' '.join(e.itertext()) for e in elems)
    else:
        # TEI, BioC, and unknown schemas: concatenate every text node
        txt = ' '.join(root.itertext())
    # Rejoin DOIs broken after the prefix; \1 keeps the prefix itself intact
    txt = re.sub(r'(10\.\d{4,9}/)\s+', r'\1', txt)
    return txt

def pdf2text(path: pathlib.Path, out_dir: pathlib.Path) -> None:
    doc = pymupdf.open(str(path))
    out = out_dir / f"{path.stem}.txt"
    with open(out, "wb") as f:
        for page in doc:
            f.write(page.get_text().encode("utf8"))
            f.write(b"\n")
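
For a feel of what the flattening produces, here is a small sketch that runs without any files. It uses the standard library's `xml.etree.ElementTree` in place of `lxml` purely to stay self-contained, and the TEI-like snippet is made up. `sniff_kind` repeats the header checks from `xml_kind` on an in-memory string.

```python
import re
import xml.etree.ElementTree as ET

def sniff_kind(head: str) -> str:
    """Same header checks as `xml_kind`, applied to a string instead of a file."""
    if "www.tei-c.org/ns" in head:
        return "tei"
    if re.search(r"(NLM|TaxonX)//DTD", head):
        return "jats"
    if "www.wiley.com/namespaces" in head:
        return "wiley"
    if "BioC.dtd" in head:
        return "bioc"
    return "unknown"

# Made-up TEI-like document, for illustration only.
tei = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<text><body><p>Data are on Dryad: <ref>dryad.abc123</ref>.</p></body></text>"
    "</TEI>"
)
root = ET.fromstring(tei)
kind = sniff_kind(tei[:2048])
flat = " ".join(root.itertext())  # every text node, joined with spaces
print(kind, "|", flat)
```

Joining text nodes with a space means inline elements get extra whitespace around them. That is harmless for the ID regex, but it is one reason the cleaning step later strips whitespace.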

## 🛠️ Bulk Parsing: PDFs and XMLs to Plain Text

Now we bring the parsers together to process the full test set.

The function `parse_all_pdfs_xmls` scans the test directory, finds all `.pdf` and `.xml` files, and converts each one to a `.txt` file in a target `parsed` directory. 

- PDFs are processed first.
- XMLs (if present) are appended to the same `.txt` file using binary append mode (`"ab"`), so the result includes both sources back-to-back. One possible improvement is to save PDFs and XMLs into separate folders and analyze them separately.

Any parsing errors are caught and logged with the article ID. This helps avoid crashing the entire run if a single file is corrupted.

At the end of this step, every article has a `.txt` file with its text content, pulled from PDF, XML, or both. (Note that I didn't find any article available ONLY in PDF format in the train/test set.)
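
The write-then-append behaviour can be sketched on a temporary directory. The file name and the two text snippets below are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "10.1234_example.txt"  # made-up article ID

    # First pass: the PDF text creates (or overwrites) the file.
    with open(out, "wb") as f:
        f.write("text extracted from the PDF".encode("utf8"))
        f.write(b"\n")

    # Second pass: the XML text is appended to the same file.
    with open(out, "ab") as f:
        f.write("text extracted from the XML".encode("utf8"))
        f.write(b"\n")

    merged = out.read_text(encoding="utf8")

print(merged)
```

One consequence of append mode: a file that is only ever written by the XML pass is never truncated, so it keeps growing if the cell is rerun without clearing the output folder first.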


In [5]:
# Parse All PDFs & XMLs to TXT
from tqdm.auto import tqdm

def parse_all_pdfs_xmls(pdf_dir, xml_dir, parsed_dir):
    pdf_files = list(pdf_dir.glob('*.pdf'))
    if not pdf_files and not xml_dir.exists():
        raise ValueError("No PDF or XML files found.")

    parsed_dir.mkdir(parents=True, exist_ok=True)

    # PDF → TXT
    for pdf in tqdm(pdf_files, desc="PDF→TXT"):
        try:
            pdf2text(pdf, parsed_dir)
        except Exception as e:
            print(f"PDF error {pdf.stem}: {e}")

    # XML → TXT (append mode)
    if xml_dir.exists():
        for xml in tqdm(xml_dir.glob('*.xml'), desc="XML→TXT"):
            try:
                txt = xml2text(xml).encode("utf8")
                out = parsed_dir / f"{xml.stem}.txt"
                with open(out, "ab") as f:  # 'ab' = append binary
                    f.write(txt)
                    f.write(b"\n")
            except Exception as e:
                print(f"XML error {xml.stem}: {e}")
    print("Done parsing to text.")


## 🔍 Regex and Reference-Aware Text Preparation

Before we try to extract dataset IDs, we need to load all the parsed text files into a single DataFrame. This happens in the `get_text_df` function.

Here’s what it does:

- Reads all `.txt` files from the parsed directory.
- Applies some Unicode normalization and removes non-ASCII characters.
- Collapses multiple newlines and cleans up formatting.
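- Splits each article into two parts, the **body** and the **references**, at the last occurrence of a marker such as "references", "literature cited", or "acknowledgments". The search runs on the reversed text (hence the backwards keywords in the regex) and ignores the first quarter of the article, which is always treated as body.
  
The idea is to isolate references at the end of the paper, since that’s often where dataset citations appear. We store the result as a Polars DataFrame with the columns `article_id`, `text`, `ref_idx`, `refs`, and `body`.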

This layout will be helpful later when we want to compare where an ID appears — in the body or in the references section.
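
The reversed-search trick is easier to follow on a single string. Below is a minimal plain-Python sketch of the same split. `split_refs` is an illustrative name and the sample text is made up.

```python
import re
from typing import Tuple

# Keywords are spelled backwards because the search runs on reversed text.
REV_MARKERS = re.compile(r"(?i)\b(secnerefer|erutaretil detic|stnemegdelwonkca)\b")

def split_refs(text: str) -> Tuple[str, str]:
    """Hypothetical single-string version of the body/refs split."""
    cut = len(text) // 4
    head = text[:cut]               # first quarter: always body
    tail_rev = text[cut:][::-1]     # remaining text, reversed
    m = REV_MARKERS.search(tail_rev)
    ref_idx = m.start() if m else 0  # no marker -> empty refs
    refs = tail_rev[:ref_idx][::-1]
    body = head + tail_rev[ref_idx:][::-1]
    return body, refs

text = "Intro text. Methods text. Results text. References Smith J. 2020. Dryad."
body, refs = split_refs(text)
print(repr(body))
print(repr(refs))
```

Two details fall out of this: the marker word itself stays at the end of `body`, and when no marker is found `refs` is empty and the whole article counts as body.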


In [6]:
# Extraction Helpers
# This cell defines a regex for extracting dataset IDs from text,
# and a helper function to read in all parsed .txt files as a DataFrame.

import matplotlib.pyplot as plt
import polars as pl
from pathlib import Path

REGEX_IDS = (
    r"(?i)\b(?:"
    r"CHEMBL\d+|"
    r"E-GEOD-\d+|E-PROT-\d+|EMPIAR-\d+|"
    r"ENSBTAG\d+|ENSOARG\d+|"
    r"EPI_ISL_\d{5,}|EPI\d{6,7}|"
    r"HPA\d+|CP\d{6}|IPR\d{6}|PF\d{5}|KX\d{6}|K0\d{4}|"
    r"PRJNA\d+|PXD\d+|SAMN\d+|"
    r"dryad\.[^\s\"<>]+|pasta\/[^\s\"<>]+"
    r")"
)


def get_text_df(parsed_dir: Path) -> pl.DataFrame:
    paths = list(parsed_dir.rglob('*.txt'))
    records = [{'article_id': p.stem, 'text': p.read_text(encoding='utf8')} for p in paths]
    return (
        pl.DataFrame(records)
        .with_columns(
            pl.col("text")
              .str.normalize("NFKC")
              .str.replace_all(r"[^\p{Ascii}]", "")
        )
        .with_columns(
            pl.col("text")
              .str.split(r'\n{2,}')
              .list.eval(pl.col("").str.replace_all('\n', ' '))
              .list.join('\n')
              .alias('text')
        )
        .with_columns([
            pl.col("text")
              .str.slice(pl.col("text").str.len_chars() // 4)
              .str.reverse()
              .alias('rtext'),
            pl.col("text")
              .str.slice(0, pl.col("text").str.len_chars() // 4)
              .alias('ltext'),
        ])
        .with_columns(
            pl.col("rtext")
              .str.find(r'(?i)\b(secnerefer|erutaretil detic|stnemegdelwonkca)\b')
              .alias('ref_idx')
        )
        .with_columns(
            pl.when(pl.col("ref_idx").is_null()).then(0).otherwise(pl.col("ref_idx")).alias("ref_idx")
        )
        .with_columns([
            pl.col("rtext")
              .str.slice(0, pl.col("ref_idx"))
              .str.reverse()
              .alias("refs"),
            (pl.col("ltext") + pl.col("rtext").str.slice(pl.col("ref_idx")).str.reverse()).alias("body")
        ])
        .drop("rtext", "ltext")
    )

## 🧬 Candidate Dataset ID Extraction

This is the main engine of the notebook — the `extract_candidates` function runs the full ID extraction pipeline.

Here’s the step-by-step breakdown:

- **[A] Extract candidate IDs** using a long regex that targets common dataset patterns (e.g., Dryad, PASTA, PRJNA, SAMN, etc.). This step is done with Polars, then converted to Pandas for flexibility.
  
- **[B] Explode**: Articles often contain multiple matches, so we explode the list of matches into separate rows — one per candidate ID.

- **[C] Clean**: IDs can be messy (extra punctuation, spacing). We strip whitespace and trailing punctuation. Casing is only standardized in the next step, where Dryad and PASTA IDs are lowercased.

- **[D] Normalize DOIs**: Dryad and PASTA references are turned into full DOI links using their respective prefixes.

- **[E] Choose the best version**: Prefer a canonical DOI if one is available; otherwise, fall back to the cleaned version.

- **[F] Filter**: This step does the heavy lifting:
  - Drop nulls and self-citations.
  - Remove false positives like `figshare`.
  - Enforce minimum DOI suffix length.
  - Drop known stub DOIs that appear in boilerplate text.
  - Ensure parentheses and brackets are balanced — a common issue in malformed references.

⚠️ Note: At this point, I ran into some compatibility issues in Polars, especially with filters that needed values from several columns at once. For that reason, I moved this logic to Pandas, which was more stable and easier to debug. I'll keep trying to make this notebook Pandas-free in the future, but for now I decided to publish it as is.

- **[G] Extract context window**: For each surviving candidate, we grab a chunk of text around the ID. This will later be used by the classifier (or for manual inspection).

The final DataFrame contains three columns: `article_id`, `dataset_id`, and `window`.
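
Steps C to F can be tried on one candidate at a time. This is a minimal sketch of a subset of the filters, not the full cell. `clean_candidate` is an illustrative name, and the prefix check here is case-insensitive, which is an assumption about the intent.

```python
import re
from typing import Optional

DOI_URL = "https://doi.org/"
TRAILING_PUNCT = re.compile(r"[-.,;:!?/)\]\(\[]+$")

def clean_candidate(raw: str) -> Optional[str]:
    """Hypothetical single-ID version of steps C to F (subset of the filters)."""
    # [C] strip whitespace and trailing punctuation
    cid = TRAILING_PUNCT.sub("", re.sub(r"\s+", "", raw))
    # [D]/[E] expand Dryad and PASTA IDs to DOI links (case-insensitive: assumption)
    low = cid.lower()
    if low.startswith("dryad."):
        cid = DOI_URL + "10.5061/" + low
    elif low.startswith("pasta/"):
        cid = DOI_URL + "10.6073/" + low
    # [F] filters: figshare, short DOI suffix, unbalanced brackets
    if "figshare" in cid:
        return None
    if cid.startswith(DOI_URL) and len(cid.rsplit("/", 1)[-1]) < 4:
        return None
    if cid.count("(") != cid.count(")") or cid.count("[") != cid.count("]"):
        return None
    return cid

for raw in ["dryad.b8gtht7h3).", "PRJNA123456,", "dryad.ab(c", "pasta/abc"]:
    print(repr(raw), "->", clean_candidate(raw))
```

The first two candidates survive (one expanded to a DOI link, one with the comma removed). The third is dropped for an unbalanced parenthesis, and the fourth because the part after the last slash is shorter than four characters.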


In [7]:
import pandas as pd
from collections import Counter

def extract_candidates(args):
    parsed_in = get_prefix_path("working") / args['i']
    print(f"🔵 Step 2: Begin ID Extraction Pipeline")
    print(f"   → Will process parsed text files from: {parsed_in}")
    
    # Start from polars then convert to pandas for further steps
    text_df = get_text_df(parsed_in)
    print(f"🟢 Step 1: Loaded text DataFrame")
    print(f"   → Rows: {text_df.height}, Columns: {list(text_df.columns)}")
    print(text_df.with_columns(pl.col("text").str.slice(0, 100).alias("text_snippet")).head(2).to_pandas())

    # Step A: Extract candidate IDs (regex)
    df = text_df.with_columns(pl.col("text").str.extract_all(REGEX_IDS).alias("id")).to_pandas()
    print(f"🟦 [A] Extract candidate IDs")
    print(df[["article_id", "id"]].head(2))

    # Step B: Explode for one row per candidate
    df = df.explode("id").rename(columns={"id": "match_id"})
    print(f"🟦 [B] Exploded IDs")
    print(df[["article_id", "match_id"]].head(2))

    # Step C: Clean IDs
    df["id"] = df["match_id"]
    df["id_nospace"] = df["id"].str.replace(r"\s+", "", regex=True)
    df["id_cleaned"] = df["id_nospace"].str.replace(r"[-.,;:!?/)\]\(\[]+$", "", regex=True)
    print(f"🟦 [C] Cleaned IDs")
    print(df[["article_id", "id", "id_cleaned"]].head(2))

    # Step D: Expand DOIs
    def norm_dryad(x):
        return f"https://doi.org/10.5061/{x.lower()}" if isinstance(x, str) and x.startswith("dryad.") else None
    def norm_pasta(x):
        return f"https://doi.org/10.6073/{x.lower()}" if isinstance(x, str) and x.startswith("pasta/") else None

    df["id_final_dryad"] = df["id_cleaned"].map(norm_dryad)
    df["id_final_pasta"] = df["id_cleaned"].map(norm_pasta)
    print(f"🟦 [D] Normalized DOIs (dryad/pasta)")
    print(df[["article_id", "id_final_dryad", "id_final_pasta"]].head(2))

    # Step E: Prioritize full DOI URL, fallback to cleaned
    df["id_use"] = df["id_final_dryad"].combine_first(df["id_final_pasta"]).combine_first(df["id_cleaned"])
    print(f"🟦 [E] Chose ID to use")
    print(df[["article_id", "id_use"]].head(2))

    # Step F: Filter false positives (Enhanced)
    # -- Drop nulls
    df = df[df["id_use"].notnull()]
    # -- Remove IDs that include the article's own ID
    df = df[~df.apply(lambda row: str(row["article_id"]).replace("_", "/").lower() in str(row["id_use"]).lower(), axis=1)]
    # -- Remove 'figshare'
    df = df[~df["id_use"].str.contains("figshare", na=False)]
    # -- Remove DOIs with short suffixes
    def valid_doi(x):
        if isinstance(x, str) and x.startswith(DOI_URL):
            return len(x.rsplit("/", 1)[-1]) >= 4
        return True
    df = df[df["id_use"].apply(valid_doi)]
    # -- Remove stub DOIs
    STUBS = ["https://doi.org/10.5061/dryad", "https://doi.org/10.6073/pasta", "https://doi.org/10.5281/zenodo"]
    df = df[~df["id_use"].isin(STUBS)]
    # -- Paren/bracket matching
    df = df[df["id_use"].str.count(r"\(") == df["id_use"].str.count(r"\)")]
    df = df[df["id_use"].str.count(r"\[") == df["id_use"].str.count(r"\]")]
    print(f"🟦 [F] Filtered false positives (showing a few):")
    print(df[["article_id", "id_use"]].head(5))

    # Step G: Extract window context and rename
    def get_window(row):
        # Search for the raw match: the expanded DOI URL in id_use may not occur verbatim in the text
        needle = str(row["match_id"])
        idx = row["text"].find(needle)
        if idx == -1:
            return ""
        start = max(idx - args['ws'] - len(needle), 0)
        end = idx + args['ws'] + len(needle)
        return row["text"][start:end]
    df["window"] = df.apply(get_window, axis=1)
    df = df[["article_id", "id_use", "window"]].drop_duplicates().rename(columns={"id_use": "dataset_id"})
    print(f"\n✅ Completed extraction: {len(df)} unique (article_id, dataset_id) pairs")
    return df
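
The context-window step (G) can be checked in isolation. This sketch keeps the same start/end arithmetic as `get_window`; `context_window` and the sample strings are illustrative.

```python
def context_window(text: str, needle: str, ws: int = 100) -> str:
    """Hypothetical helper with the same slicing arithmetic as `get_window`."""
    idx = text.find(needle)
    if idx == -1:
        return ""  # needle does not occur verbatim
    start = max(idx - ws - len(needle), 0)
    end = idx + ws + len(needle)
    return text[start:end]

text = "0123456789ID42abcdefghij"
win = context_window(text, "ID42", ws=2)
print(win)

# A short-form ID in the text is not found when searching for the expanded DOI link.
article = "Data are available from Dryad (dryad.abc123)."
print(repr(context_window(article, "https://doi.org/10.5061/dryad.abc123")))
print(repr(context_window(article, "dryad.abc123", ws=5)))
```

The window is not symmetric: the left side reaches back `ws` plus the length of the ID, while the right side extends `ws` characters past the end of the ID. The second half of the example shows why the string being searched for matters: an expanded DOI link returns an empty window whenever the article only contains the short form.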

## 🚀 Main Pipeline: Run It All

This final function ties everything together.

- **Step 1**: We parse all PDFs and XMLs using `parse_all_pdfs_xmls`, converting them into `.txt` files inside the `parsed` folder.
- **Step 2**: We call `extract_candidates`, which loads those text files and runs the full ID extraction and filtering pipeline.
- The extracted candidates are saved as `.parquet`, and the submission is saved as `.csv`, ready to submit.

We also tag each `dataset_id` as either `"Primary"` or `"Secondary"`:
- A `"Primary"` ID is a proper DOI or something like a `SAMN` identifier.
- Everything else is treated as `"Secondary"` — probably useful, but not canonical.

Finally, the code builds a Kaggle submission file with the expected format: `row_id`, `article_id`, `dataset_id`, and `type`.
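
Here is a minimal sketch of the tagging and submission layout using only the standard library's `csv` module instead of Pandas. The two IDs are taken from the sample output of this notebook, and the duplicate row is added on purpose to show the de-duplication.

```python
import csv
import io

DOI_URL = "https://doi.org/"

def assign_type(dataset_id: str) -> str:
    """Same heuristic as the pipeline: DOI links and SAMN IDs are 'Primary'."""
    return "Primary" if dataset_id.startswith((DOI_URL, "SAMN")) else "Secondary"

pairs = [
    ("10.1002_ece3.9627", "https://doi.org/10.5061/dryad.b8gtht7h3"),
    ("10.1002_chem.202001668", "K03946"),
    ("10.1002_chem.202001668", "K03946"),  # duplicate, dropped below
]

seen, rows = set(), []
for article_id, dataset_id in pairs:
    if (article_id, dataset_id) in seen:
        continue
    seen.add((article_id, dataset_id))
    rows.append({
        "row_id": len(rows),
        "article_id": article_id,
        "dataset_id": dataset_id,
        "type": assign_type(dataset_id),
    })

# Write to an in-memory buffer; the real pipeline writes submission.csv to disk.
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["row_id", "article_id", "dataset_id", "type"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Note that the type is decided from the shape of the ID alone, so it is a heuristic rather than a reading of how the article actually uses the dataset.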

Everything prints as it runs — no surprises, no hidden state.

✅ When you run this cell, your entire pipeline runs end-to-end, and the final submission is saved to disk.


In [8]:
# Cell 8: Main Pipeline, concise output
def main_pipeline():
    args = {'i': 'parsed', 'o': 'extracted_ids.parquet', 'gt': 'make-data-count-finding-data-references/train_labels.csv', 'ws': 100}

    print("🌟 STEP 1: Parse all PDFs and XMLs to text files")
    pdf_dir = pathlib.Path('/kaggle/input/make-data-count-finding-data-references/test/PDF')
    xml_dir = pathlib.Path('/kaggle/input/make-data-count-finding-data-references/test/XML')
    parsed_dir = get_prefix_path('working') / args['i']
    print(parsed_dir)
    parse_all_pdfs_xmls(pdf_dir, xml_dir, parsed_dir)

    print("\n🌟 STEP 2: Extract candidate dataset IDs from text")
    df = extract_candidates(args)
    out_parq = get_prefix_path("working") / args['o']
    df.to_parquet(out_parq)
    print(f"✔ Saved extracted IDs to: {out_parq} — {len(df)} rows")

    # Build submission DataFrame with 'type'
    def assign_type(x):
        if isinstance(x, str) and (x.startswith(DOI_URL) or x.startswith("SAMN")):
            return "Primary"
        else:
            return "Secondary"
    sub = df.copy()
    sub["type"] = sub["dataset_id"].apply(assign_type)
    sub = sub.drop_duplicates(subset=["article_id", "dataset_id"]).reset_index(drop=True)
    sub["row_id"] = range(len(sub))
    sub = sub[["row_id", "article_id", "dataset_id", "type"]]
    print("\n[main_pipeline] Submission DataFrame (first rows):")
    print(sub.head())

    sub.to_csv(get_prefix_path("working") / "submission.csv", index=False)
    print(f"✔ Submission saved — {len(sub)} rows")

    # Optionally: add scoring and validation if running on train split

    print("\n✅ Pipeline finished!")

main_pipeline()


🌟 STEP 1: Parse all PDFs and XMLs to text files
/kaggle/working/parsed


PDF→TXT:   0%|          | 0/30 [00:00<?, ?it/s]

XML→TXT: 0it [00:00, ?it/s]

Done parsing to text.

🌟 STEP 2: Extract candidate dataset IDs from text
🔵 Step 2: Begin ID Extraction Pipeline
   → Will process parsed text files from: /kaggle/working/parsed
🟢 Step 1: Loaded text DataFrame
   → Rows: 30, Columns: ['article_id', 'text', 'ref_idx', 'refs', 'body']
          article_id                                               text  \
0  10.1002_ece3.9627  Ecology and Evolution. 2022;12:e9627.      | 1...   
1   10.1002_mp.14424  PleThora: Pleural effusion and thoracic cavity...   

   ref_idx                                               refs  \
0    18389       Amir ,  Z.   ,    Moore ,  J. H.   ,    N...   
1    14620   1     Kumar   V   ,    Gu   Y   ,    Basu   S...   

                                                body  \
0  Ecology and Evolution. 2022;12:e9627.      | 1...   
1  PleThora: Pleural effusion and thoracic cavity...   

                                        text_snippet  
0  Ecology and Evolution. 2022;12:e9627.      | 1...  
1  PleThora: Ple

In [9]:
def show_submission(sub_csv='/kaggle/working/submission.csv'):
    df = pd.read_csv(sub_csv)
    df = df.reset_index(drop=True)
    df['row_id'] = df.index
    print(df[['row_id', 'article_id', 'dataset_id', 'type']].to_string(index=False))

show_submission()

 row_id             article_id                              dataset_id      type
      0      10.1002_ece3.9627 https://doi.org/10.5061/dryad.b8gtht7h3   Primary
      1      10.1002_ecs2.4619  PASTA/D835832D7FD00D9E4466E44EEA87FAB3 Secondary
      2      10.1002_ece3.6303 https://doi.org/10.5061/dryad.37pvmcvgb   Primary
      3 10.1002_chem.202001668                                  K03946 Secondary
      4 10.1002_chem.202001668                                  K01438 Secondary
      5      10.1002_ece3.4466   https://doi.org/10.5061/dryad.r6nq870   Primary
      6      10.1002_ece3.5260   https://doi.org/10.5061/dryad.2f62927   Primary
      7      10.1002_ece3.6144 https://doi.org/10.5061/dryad.zw3r22854   Primary
      8      10.1002_ecs2.1280     https://doi.org/10.5061/dryad.p3fg9   Primary


In [10]:
! rm -rf parsed
! rm -rf src
! rm -rf extracted_ids.parquet

Result:

Score: 0.521

Rank: 74 (2025-07-21-12:31, JST)

Runtime: 2min (kaggle editor), 15min (Scoring)


Make Data Count, Maggie Demkin, and Walter Reade. Make Data Count - Finding Data References. https://kaggle.com/competitions/make-data-count-finding-data-references, 2025. Kaggle.