# Student submission reference checker

This code cycles through all PDF files in a folder (presumably student assignment submissions), identifies the references in each file and checks whether those references are real or hallucinated. Running it requires some confidence installing software and some familiarity with the terminal.

## Setup

The software used to extract references is called [GROBID](https://grobid.readthedocs.io/en/latest/), which stands for "GeneRation Of BIbliographic Data".

GROBID is delivered as a [Docker](https://www.docker.com/) image that runs on your computer.

### Installing and Testing Docker

Go to [Docker](https://www.docker.com/) and download the version of Docker Desktop that fits your operating system. In my case that was the Windows AMD64 version. Once Docker is installed, open your terminal and run the following two commands.

``` bash
docker --version
docker run hello-world
```

The first command should print the version of Docker you have installed; the second should successfully run the short `hello-world` container.

### Installing and Testing GROBID

GROBID uses pre-trained machine-learning models to extract, from a piece of text, the bibliographic information for the references it contains.

Now install and run the [GROBID](https://grobid.readthedocs.io/en/latest/) application. The full guidance is available in the [getting started guide](https://grobid.readthedocs.io/en/latest/getting_started/).

In your Terminal run

``` bash
docker pull grobid/grobid:0.8.2.1-crf
```

This downloads the GROBID image. You can then start it with the following command:

``` bash
docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2.1-crf
```

If you now open Docker Desktop, you can see an instance (a container) with the GROBID software running.

![Docker Desktop](Docker_image1.png)

If you click on the port, a browser opens with the following page:

![GROBID](Grobit_image1.png)

Here you can go to the **PDF** tab and upload a PDF document. It highlights the references and can create hyperlinks to the detected references.

You can stop the application by clicking the stop button in Docker Desktop.


## Workflow

Before you run the code, make sure the GROBID container is running as described above (start Docker Desktop and then run the two `docker` commands above). `GROBID_URL` refers to that running container.

Parameters to set:

1. `PDF_FOLDER`: the path to the folder in which the PDF submissions are saved.
2. `GROBID_URL`: the address of the GROBID endpoint that processes references. Copy the URL you get when you click on the relevant port in Docker Desktop (see above), for example `http://localhost:8070`, and append `/api/processReferences`.

The output is saved as a JSON report in the current working directory, which is normally the folder containing this file.
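
Before starting a long run it can help to confirm that the container is actually answering. The following is a minimal sketch and not part of the original code: it assumes GROBID's `/api/isalive` health-check endpoint is reachable on the same host and port as `GROBID_URL`, and the function names `grobid_service_root` and `grobid_is_alive` are made up for illustration. It uses only the standard library.

```python
from urllib.error import URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def grobid_service_root(processing_url: str) -> str:
    """Drop the path from an endpoint URL, keeping scheme, host and port."""
    parts = urlsplit(processing_url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


def grobid_is_alive(processing_url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers its health check with HTTP 200.

    Assumption: the health check lives at <root>/api/isalive.
    """
    health_url = grobid_service_root(processing_url) + "/api/isalive"
    try:
        with urlopen(health_url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure and similar problems.
        return False
```

Calling `grobid_is_alive(GROBID_URL)` before the main loop would let you stop early with a clear message instead of collecting one connection error per PDF.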

In [1]:
import os
import re
import json
import time
import sys
from dataclasses import dataclass, asdict
from typing import List, Optional, Dict, Any, Tuple

import requests
from lxml import etree
from rapidfuzz import fuzz
from tqdm import tqdm

from bs4 import BeautifulSoup
from urllib.parse import urlparse

PDF_FOLDER = "C:/Temp/YAWYR2/test_file"
GROBID_URL = "http://localhost:8070/api/processReferences"

# Verification APIs
CROSSREF_WORKS = "https://api.crossref.org/works"
OPENALEX_WORKS = "https://api.openalex.org/works"

# Heuristics / thresholds
TITLE_MATCH_THRESHOLD = 85      # fuzzy title match
MIN_EVIDENCE_SCORE = 3          # how many checks must pass to call it "verified"
REQUIRE_AUTHOR_MATCH = True     # no verification without Author Match
REQUIRE_TITLE_MATCH = True      # no verification without title reaching TITLE_MATCH_THRESHOLD
REQUEST_TIMEOUT = 60
SLEEP_BETWEEN_CALLS = 0.2       # be kind to APIs

@dataclass
class ParsedRef:
    title: Optional[str]
    year: Optional[int]
    authors: List[str]
    venue: Optional[str]
    doi: Optional[str]
    url: Optional[str] = None
    raw: Optional[str] = None

@dataclass
class VerificationResult:
    verified: bool
    score: int
    reason: str
    matched_source: Optional[str] = None
    matched_id: Optional[str] = None
    matched_title: Optional[str] = None
    matched_doi: Optional[str] = None
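
The extraction function in the next cell walks the TEI XML that GROBID returns and reads each `biblStruct` element. As a rough illustration of that structure, the sketch below parses a small hand-written TEI fragment with the standard library's `xml.etree.ElementTree` (the notebook itself uses `lxml`). The sample XML, the names in it and the function `parse_sample_tei` are invented for illustration and are not real GROBID output.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written fragment shaped like a GROBID reference list (hypothetical data).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div><listBibl>
    <biblStruct>
      <analytic>
        <title level="a">An example article title</title>
        <author><persName><forename>J</forename><surname>Doe</surname></persName></author>
        <author><persName><surname>Roe</surname></persName></author>
      </analytic>
      <monogr>
        <title level="j">Journal of Examples</title>
        <imprint><date when="2020-05"/></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def parse_sample_tei(tei_xml: str) -> List[Dict[str, Any]]:
    """Pull title, venue, date string and authors out of each biblStruct."""
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS):
        authors = []
        for author in bibl.findall(".//tei:author", TEI_NS):
            surname = author.findtext(".//tei:surname", default="", namespaces=TEI_NS).strip()
            forename = author.findtext(".//tei:forename", default="", namespaces=TEI_NS).strip()
            if surname and forename:
                authors.append(f"{surname}, {forename}")
            elif surname:
                authors.append(surname)
        date_el = bibl.find(".//tei:date", TEI_NS)
        refs.append({
            # level='a' marks the article title; the monogr title is the venue
            "title": bibl.findtext(".//tei:title[@level='a']", default=None, namespaces=TEI_NS),
            "venue": bibl.findtext(".//tei:monogr/tei:title", default=None, namespaces=TEI_NS),
            "date": date_el.get("when") if date_el is not None else None,
            "authors": authors,
        })
    return refs
```

The fields collected here mirror those of `ParsedRef`; the real function additionally looks for a DOI, a URL and a raw reference string.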



In [2]:

def grobid_extract_references(pdf_path: str) -> List[ParsedRef]:
    with open(pdf_path, "rb") as f:
        r = requests.post(
            GROBID_URL,
            files={"input": (os.path.basename(pdf_path), f, "application/pdf")},
            timeout=REQUEST_TIMEOUT,
        )
    r.raise_for_status()
    tei_xml = r.text.encode("utf-8", errors="ignore")

    root = etree.fromstring(tei_xml)
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}

    refs = []
    for bibl in root.xpath(".//tei:listBibl/tei:biblStruct", namespaces=ns):
        title = _first_text(bibl.xpath(".//tei:title[@level='a']/text()", namespaces=ns)) \
                or _first_text(bibl.xpath(".//tei:title/text()", namespaces=ns))

        year_txt = _first_text(bibl.xpath(".//tei:date/@when", namespaces=ns)) \
                   or _first_text(bibl.xpath(".//tei:date/text()", namespaces=ns))
        year = _parse_year(year_txt)

        doi = _first_text(bibl.xpath(".//tei:idno[@type='DOI']/text()", namespaces=ns))
        if doi:
            doi = doi.strip().lower()
            doi = doi.replace("https://doi.org/", "").replace("http://doi.org/", "")

        authors = []
        for a in bibl.xpath(".//tei:author", namespaces=ns):
            surname = _first_text(a.xpath(".//tei:surname/text()", namespaces=ns))
            forename = _first_text(a.xpath(".//tei:forename/text()", namespaces=ns))
            if surname and forename:
                authors.append(f"{surname}, {forename}")
            elif surname:
                authors.append(surname)

        venue = _first_text(bibl.xpath(".//tei:monogr//tei:title/text()", namespaces=ns))
        
        url = (_first_text(bibl.xpath(".//tei:idno[@type='URI']/text()", namespaces=ns)) or \
                _first_text(bibl.xpath(".//tei:idno[@type='url']/text()", namespaces=ns)) or \
                _first_text(bibl.xpath(".//tei:ptr/@target", namespaces=ns)))
        if url:
            url = url.strip()

        # Sometimes Grobid provides an unstructured ref string:
        raw = _first_text(bibl.xpath(".//tei:note[@type='raw_reference']/text()", namespaces=ns))

        refs.append(ParsedRef(title=title, year=year, authors=authors, venue=venue, doi=doi, url=url, raw=raw))

    return refs


def verify_by_url(ref: ParsedRef) -> Optional[VerificationResult]:
    if not ref.url or not ref.title:
        return None

    url = ref.url.strip()
    if not url.lower().startswith(("http://", "https://")):
        return None

    try:
        # Prefer GET (HEAD often blocked or lies)
        r = requests.get(
            url,
            timeout=30,
            allow_redirects=True,
            headers={
                "User-Agent": "ref-checker/0.1 (+https://example.org; contact: you@example.com)"
            },
        )
        if r.status_code >= 400:
            return VerificationResult(False, 0, f"URL returned HTTP {r.status_code}", matched_source="url", matched_id=url)

        ctype = (r.headers.get("Content-Type") or "").lower()

        # If the link points to a PDF, you could optionally treat “reachable PDF” as weak evidence,
        # or even run Grobid on it later.
        if "application/pdf" in ctype or url.lower().endswith(".pdf"):
            return VerificationResult(
                verified=True,
                score=2,
                reason="URL reachable and points to a PDF",
                matched_source="url",
                matched_id=r.url,
                matched_title=None,
            )

        # Parse HTML title / meta
        soup = BeautifulSoup(r.text, "html.parser")

        candidates = []
        if soup.title and soup.title.get_text(strip=True):
            candidates.append(soup.title.get_text(strip=True))

        og = soup.find("meta", attrs={"property": "og:title"})
        if og and og.get("content"):
            candidates.append(og["content"].strip())

        tw = soup.find("meta", attrs={"name": "twitter:title"})
        if tw and tw.get("content"):
            candidates.append(tw["content"].strip())

        h1 = soup.find("h1")
        if h1 and h1.get_text(strip=True):
            candidates.append(h1.get_text(strip=True))

        candidates = [c for c in candidates if c]
        if not candidates:
            return VerificationResult(False, 0, "URL reachable but no title candidates found", matched_source="url", matched_id=r.url)

        best = max(fuzz.token_set_ratio(ref.title, c) for c in candidates)
        if best >= 80:
            return VerificationResult(
                verified=True,
                score=3,
                reason=f"URL title match (score={best})",
                matched_source="url",
                matched_id=r.url,
                matched_title=max(candidates, key=lambda c: fuzz.token_set_ratio(ref.title, c)),
            )

        return VerificationResult(
            verified=False,
            score=0,
            reason=f"URL reachable but title mismatch (best score={best})",
            matched_source="url",
            matched_id=r.url,
            matched_title=max(candidates, key=lambda c: fuzz.token_set_ratio(ref.title, c)),
        )

    except requests.RequestException as e:
        return VerificationResult(False, 0, f"URL fetch failed: {e.__class__.__name__}", matched_source="url", matched_id=url)


def verify_reference(ref: ParsedRef) -> VerificationResult:
    # 1) DOI check (strongest)
    if ref.doi:
        ok, meta = crossref_lookup_doi(ref.doi)
        if ok:
            return VerificationResult(
                verified=True,
                score=10,
                reason="DOI verified via Crossref",
                matched_source="crossref",
                matched_id=meta.get("DOI"),
                matched_title=_safe_title(meta),
                matched_doi=meta.get("DOI"),
            )
        # DOI present but not found is suspicious
        # continue with title search as a fallback

    # 2) Title search (OpenAlex + Crossref)
    evidence_score = 0
    reasons = []

    best_match = None  # (source, id, title, doi, title_score, year_ok, author_ok)
    if ref.title:
        oa = openalex_search(ref)
        if oa:
            evidence_score += oa["evidence"]
            reasons.append(oa["reason"])
            best_match = oa["best_match"]

        cr = crossref_search(ref)
        if cr:
            evidence_score += cr["evidence"]
            reasons.append(cr["reason"])
            # Keep whichever match has better title similarity
            if (best_match is None) or (cr["best_match"][4] > best_match[4]):
                best_match = cr["best_match"]

    if evidence_score >= MIN_EVIDENCE_SCORE and best_match:
        source, mid, mtitle, mdoi, tscore, year_ok, author_ok = best_match
        return VerificationResult(
            verified=True,
            score=evidence_score,
            reason="; ".join(reasons),
            matched_source=source,
            matched_id=mid,
            matched_title=mtitle,
            matched_doi=mdoi,
        )

    # 3) URL check: fallback when neither the DOI lookup nor the database search verified the reference
    url_result = verify_by_url(ref)
    if url_result and url_result.verified:
        return url_result

    # 4) Not verified
    why = "No confident match found"
    if ref.title is None and ref.doi is None:
        why = "No DOI and title missing (cannot verify reliably)"
    elif ref.doi and not ref.title:
        why = "DOI not found and title missing"

    return VerificationResult(
        verified=False,
        score=evidence_score,
        reason=why + ((" | " + "; ".join(reasons)) if reasons else ""),
    )


def crossref_lookup_doi(doi: str) -> Tuple[bool, Dict[str, Any]]:
    # Crossref works endpoint supports /works/{doi}
    url = f"{CROSSREF_WORKS}/{doi}"
    try:
        r = requests.get(url, timeout=REQUEST_TIMEOUT, headers={"User-Agent": "ref-checker/0.1 (mailto:you@example.com)"})
        if r.status_code == 200:
            data = r.json()
            return True, data.get("message", {})
        return False, {}
    except requests.RequestException:
        return False, {}


def crossref_search(ref: ParsedRef) -> Optional[Dict[str, Any]]:
    q = ref.title.strip()
    params = {"query.bibliographic": q, "rows": 5}
    try:
        r = requests.get(CROSSREF_WORKS, params=params, timeout=REQUEST_TIMEOUT,
                         headers={"User-Agent": "ref-checker/0.1 (mailto:you@example.com)"})
        r.raise_for_status()
        items = r.json().get("message", {}).get("items", [])
    except requests.RequestException:
        return None
    finally:
        time.sleep(SLEEP_BETWEEN_CALLS)

    best = None
    for it in items:
        mtitle = _safe_title(it)
        if not mtitle:
            continue
        tscore = fuzz.token_set_ratio(ref.title, mtitle)
        year_ok = _year_matches(ref.year, it.get("issued", {}))
        author_ok = _author_overlap(ref.authors, it.get("author", []))
        # Hard gates (policy)
        if REQUIRE_TITLE_MATCH and tscore < TITLE_MATCH_THRESHOLD:
            continue
        if REQUIRE_AUTHOR_MATCH and not author_ok:
            continue
        evidence = 0
        if tscore >= TITLE_MATCH_THRESHOLD:
            evidence += 1
        if year_ok:
            evidence += 1
        if author_ok:
            evidence += 1

        cand = ("crossref", it.get("DOI"), mtitle, it.get("DOI"), tscore, year_ok, author_ok)
        if best is None or cand[4] > best[4]:
            best = cand

    if not best:
        return None

    evidence = (1 if best[4] >= TITLE_MATCH_THRESHOLD else 0) + (1 if best[5] else 0) + (1 if best[6] else 0)
    return {
        "evidence": evidence,
        "reason": f"Crossref best title score={best[4]}, year_ok={best[5]}, author_ok={best[6]}",
        "best_match": best,
    }


def openalex_search(ref: ParsedRef) -> Optional[Dict[str, Any]]:
    # OpenAlex: query with the general `search` parameter; the year is compared afterwards, not used as a filter
    params = {
        "search": ref.title,
        "per-page": 5,
    }
    try:
        r = requests.get(OPENALEX_WORKS, params=params, timeout=REQUEST_TIMEOUT)
        r.raise_for_status()
        results = r.json().get("results", [])
    except requests.RequestException:
        return None
    finally:
        time.sleep(SLEEP_BETWEEN_CALLS)

    best = None
    for it in results:
        mtitle = it.get("title")
        if not mtitle:
            continue
        tscore = fuzz.token_set_ratio(ref.title, mtitle)
        year_ok = (ref.year is None) or (it.get("publication_year") == ref.year)
        author_ok = _openalex_author_overlap(ref.authors, it.get("authorships", []))
        # Hard gates (policy)
        if REQUIRE_TITLE_MATCH and tscore < TITLE_MATCH_THRESHOLD:
            continue
        if REQUIRE_AUTHOR_MATCH and not author_ok:
            continue
        evidence = 0
        if tscore >= TITLE_MATCH_THRESHOLD:
            evidence += 1
        if year_ok:
            evidence += 1
        if author_ok:
            evidence += 1

        mid = it.get("id")
        doi = it.get("doi")
        if doi:
            doi = doi.replace("https://doi.org/", "").lower()

        cand = ("openalex", mid, mtitle, doi, tscore, year_ok, author_ok)
        if best is None or cand[4] > best[4]:
            best = cand

    if not best:
        return None

    evidence = (1 if best[4] >= TITLE_MATCH_THRESHOLD else 0) + (1 if best[5] else 0) + (1 if best[6] else 0)
    return {
        "evidence": evidence,
        "reason": f"OpenAlex best title score={best[4]}, year_ok={best[5]}, author_ok={best[6]}",
        "best_match": best,
    }


def analyze_file(pdf_dir: str, out_json: str = "report.json") -> Dict[str, Any]:
    report = {"files": {}}

    pdfs = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.lower().endswith(".pdf")]
    for pdf_path in tqdm(pdfs, desc="Processing PDFs"):
        filename = os.path.basename(pdf_path)
        try:
            refs = grobid_extract_references(pdf_path)
        except Exception as e:
            report["files"][filename] = {"error": str(e), "refs": []}
            continue

        flagged = []
        verified = []       
        ref_results = []

        for ref in refs:
            vr = verify_reference(ref)
            ref_results.append({"ref": asdict(ref), "verification": asdict(vr)})
            if not vr.verified:
                flagged.append({"ref": asdict(ref), "why": asdict(vr)})
            else:
                verified.append({"ref": asdict(ref), "why": asdict(vr)})
                
        report["files"][filename] = {
            "num_refs": len(refs),
            "num_unverified": len(flagged),
            "unverified": flagged[:],  # you can truncate if you want
            "verified": verified[:],
        }

    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, ensure_ascii=False)

    return report
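
With the default settings, the way `crossref_search`, `openalex_search` and `verify_reference` combine evidence can be condensed into a small decision rule. The sketch below restates that rule under the assumption that both `REQUIRE_TITLE_MATCH` and `REQUIRE_AUTHOR_MATCH` are `True`; `source_evidence` and `title_search_verifies` are illustrative names, not functions from the notebook.

```python
from typing import List, Optional

TITLE_MATCH_THRESHOLD = 85  # same default as in the configuration cell
MIN_EVIDENCE_SCORE = 3      # same default as in the configuration cell


def source_evidence(title_score: float, year_ok: bool, author_ok: bool) -> Optional[int]:
    """Evidence points from one database, or None if the hard gates reject the match.

    Assumes both hard gates are switched on: the title must reach the
    threshold and at least one author surname must overlap.
    """
    if title_score < TITLE_MATCH_THRESHOLD or not author_ok:
        return None
    # One point for the title, one for the author, one more if the year agrees.
    return 2 + (1 if year_ok else 0)


def title_search_verifies(per_source: List[Optional[int]]) -> bool:
    """Sum the points from all databases and compare with the minimum."""
    total = sum(points for points in per_source if points is not None)
    return total >= MIN_EVIDENCE_SCORE


# One database matches, year agrees: 3 points, verified.
one_source_with_year = title_search_verifies([source_evidence(92, True, True), None])
# One database matches, year disagrees: 2 points, not verified.
one_source_no_year = title_search_verifies([source_evidence(92, False, True), None])
# Both databases match, year disagrees in both: 2 + 2 = 4 points, verified.
two_sources_no_year = title_search_verifies(
    [source_evidence(92, False, True), source_evidence(90, False, True)]
)
```

One consequence worth noticing: a year mismatch blocks verification when only one database returns a match, but not when both do, because the points from OpenAlex and Crossref are added together.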



Here are the helper functions used by the code above.

In [3]:

# ---------- helpers ----------

def _first_text(items):
    return items[0].strip() if items else None

def _parse_year(s: Optional[str]) -> Optional[int]:
    if not s:
        return None
    m = re.search(r"(19|20)\d{2}", s)
    return int(m.group(0)) if m else None

def _safe_title(crossref_item: Dict[str, Any]) -> Optional[str]:
    t = crossref_item.get("title")
    if isinstance(t, list) and t:
        return t[0]
    if isinstance(t, str):
        return t
    return None

def _year_matches(ref_year: Optional[int], issued: Dict[str, Any]) -> bool:
    if ref_year is None:
        return True
    parts = issued.get("date-parts")
    if not parts or not parts[0]:
        return False
    return parts[0][0] == ref_year

def _author_overlap(ref_authors: List[str], crossref_authors: List[Dict[str, Any]]) -> bool:
    if not ref_authors or not crossref_authors:
        return False
    ref_surnames = {a.split(",")[0].strip().lower() for a in ref_authors if a}
    cr_surnames = {a.get("family", "").strip().lower() for a in crossref_authors if a.get("family")}
    return len(ref_surnames & cr_surnames) >= 1

def _openalex_author_overlap(ref_authors: List[str], authorships: List[Dict[str, Any]]) -> bool:
    if not ref_authors or not authorships:
        return False
    ref_surnames = {a.split(",")[0].strip().lower() for a in ref_authors if a}
    oa_surnames = set()
    for au in authorships:
        name = (au.get("author") or {}).get("display_name") or ""
        # last token as surname-ish heuristic
        parts = name.split()
        if parts:
            oa_surnames.add(parts[-1].strip().lower())
    return len(ref_surnames & oa_surnames) >= 1
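
The surname comparison in the two overlap helpers is an exact match on lower-cased strings, and for OpenAlex the surname is taken to be the last word of the display name. That can miss accented names and multi-word surnames. The sketch below shows one possible, more tolerant comparison; `normalise_name` and `surname_matches` are hypothetical helpers that the notebook does not contain, and whether they suit your submissions is an assumption to test.

```python
import unicodedata


def normalise_name(name: str) -> str:
    """Strip diacritics and case so that 'Müller' and 'Muller' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold().strip()


def surname_matches(ref_surname: str, display_name: str) -> bool:
    """True if the display name ends with the (possibly multi-word) surname.

    Comparison is done on whole words, so 'Black' does not match 'Blackwell'.
    """
    surname_tokens = normalise_name(ref_surname).split()
    name_tokens = normalise_name(display_name).split()
    if not surname_tokens:
        return False
    return name_tokens[-len(surname_tokens):] == surname_tokens
```

Like the original heuristic, this still assumes that the surname comes last in the display name, so names written surname-first would need separate handling.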



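Now we run the code. This cell assumes that there is only one file in `PDF_FOLDER`: `refs` and `pdf_path` are overwritten on every pass through the loop, so the cells below show only the last PDF processed.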

In [4]:
pdf_dir = PDF_FOLDER
report = {"files": {}}
pdfs = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.lower().endswith(".pdf")]
for pdf_path in tqdm(pdfs, desc="Processing PDFs"):
    filename = os.path.basename(pdf_path)
    try:
        refs = grobid_extract_references(pdf_path)
    except Exception as e:
        report["files"][filename] = {"error": str(e), "refs": []}
        continue

    flagged = []
    verified = []       
    ref_results = []

    for ref in refs:
        vr = verify_reference(ref)
        ref_results.append({"ref": asdict(ref), "verification": asdict(vr)})
        if not vr.verified:
            flagged.append({"ref": asdict(ref), "why": asdict(vr)})
        else:
            verified.append({"ref": asdict(ref), "why": asdict(vr)})
            
    report["files"][filename] = {
        "num_refs": len(refs),
        "num_unverified": len(flagged),
        "unverified": flagged[:],  # you can truncate if you want
        "verified": verified[:],
    }

Processing PDFs:   0%|          | 0/1 [00:00<?, ?it/s]

Processing PDFs: 100%|██████████| 1/1 [00:52<00:00, 52.59s/it]


These are the references identified:

In [6]:
refs

[ParsedRef(title='Learning to teach', year=1991, authors=['Arends, R', 'Castle, S'], venue='Learning to teach', doi=None, url=None, raw=None),
 ParsedRef(title='Teaching for Quality Learning at University', year=2011, authors=['Biggs, J', 'Tang, C'], venue='Teaching for Quality Learning at University', doi=None, url=None, raw=None),
 ParsedRef(title='Assessment and classroom learning', year=1998, authors=['Black, P', 'Wiliam, D'], venue='Assessment and classroom learning', doi=None, url=None, raw=None),
 ParsedRef(title='Taxonomy of Educational Objectives, Handbook I: Cognitive Domain', year=1956, authors=['Bloom, B'], venue='Taxonomy of Educational Objectives, Handbook I: Cognitive Domain', doi=None, url=None, raw=None),
 ParsedRef(title='Universal Design for Learning Guidelines', year=2018, authors=[], venue='Universal Design for Learning Guidelines', doi=None, url=None, raw=None),
 ParsedRef(title='Authentic learning environments. Handbook of research on educational communications a

This is the report listing the unverified and verified references:

In [5]:
report


{'files': {'Writing documents_YU Joyce_finalversion.pdf': {'num_refs': 13,
   'num_unverified': 5,
   'unverified': [{'ref': {'title': 'Learning to teach',
      'year': 1991,
      'authors': ['Arends, R', 'Castle, S'],
      'venue': 'Learning to teach',
      'doi': None,
      'url': None,
      'raw': None},
     'why': {'verified': False,
      'score': 0,
      'reason': 'No confident match found',
      'matched_source': None,
      'matched_id': None,
      'matched_title': None,
      'matched_doi': None}},
    {'ref': {'title': 'Teaching for Quality Learning at University',
      'year': 2011,
      'authors': ['Biggs, J', 'Tang, C'],
      'venue': 'Teaching for Quality Learning at University',
      'doi': None,
      'url': None,
      'raw': None},
     'why': {'verified': False,
      'score': 0,
      'reason': 'No confident match found',
      'matched_source': None,
      'matched_id': None,
      'matched_title': None,
      'matched_doi': None}},
    {'ref': {'titl

In [8]:
grobid_extract_references(pdf_path)


[ParsedRef(title='How immigration is changing the economies of rich countries', year=2024, authors=[], venue='The Economist', doi=None, url=None, raw=None),
 ParsedRef(title=None, year=2025, authors=[], venue=None, doi=None, url='https://www.economist.com/', raw=None),
 ParsedRef(title='The Economic Impact of Immigration', year=2023, authors=['Peri, G'], venue='Journal of Economic Perspectives', doi=None, url=None, raw=None),
 ParsedRef(title='Ageing Japan needs a drastic shift in migration policy', year=2025, authors=[], venue='Ageing Japan needs a drastic shift in migration policy', doi=None, url='https://www.oxfordeconomics.com/resource/ageing-japan-needs-a-drastic-shift-in-migration-policy/', raw=None),
 ParsedRef(title='The Fiscal Impact of Immigration in the UK -Migration Observatory, Migration Observatory', year=2024, authors=['Vargas-Silva, C', 'Sumption, M', 'Walsh, P'], venue='The Fiscal Impact of Immigration in the UK -Migration Observatory, Migration Observatory', doi=None,

In [3]:
analyze_file(PDF_FOLDER, out_json="report_file.json")
print("Wrote report_file.json")

Processing PDFs: 100%|██████████| 1/1 [00:01<00:00,  1.15s/it]

Wrote report_file.json
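
Once the JSON report exists, a short per-file overview is often easier to read than the full nested dictionary. The following sketch assumes the report has the structure shown in the output above (a `files` mapping whose entries contain either `num_refs` and `num_unverified` or an `error` message); `load_report` and `summarise_report` are illustrative names, not part of the notebook.

```python
import json
from typing import Any, Dict, List


def load_report(path: str) -> Dict[str, Any]:
    """Read a report written by analyze_file back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarise_report(report: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per PDF, files with the most unverified references first."""
    rows = []
    for filename, entry in report.get("files", {}).items():
        rows.append({
            "file": filename,
            "error": entry.get("error"),  # None unless extraction failed
            "num_refs": entry.get("num_refs", 0),
            "num_unverified": entry.get("num_unverified", 0),
        })
    rows.sort(key=lambda row: row["num_unverified"], reverse=True)
    return rows


def print_summary(rows: List[Dict[str, Any]]) -> None:
    """Print a compact line per file."""
    for row in rows:
        if row["error"]:
            print(f"{row['file']}: extraction failed ({row['error']})")
        else:
            print(f"{row['file']}: {row['num_unverified']} of {row['num_refs']} references unverified")
```

For example, `print_summary(summarise_report(load_report("report_file.json")))` would list each submission with its count of unverified references. An unverified reference is a prompt for a manual check, not proof that the reference is invented.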



