# LLM-only Reference Cleaner (DOCX -> DOCX) – User Manual

## What this tool does

This notebook takes a messy References section from a Word document and converts it into a clean, consistent reference list using an LLM.

It can:
- split glued references into individual entries  
- reformat each reference into a target style  
- optionally sort references alphabetically by first author  
- apply numbering styles (1., [1], A., or none)  
- export a clean References section as a new .docx file  

---

## What you need before running

- A Google Colab notebook  
- A DeepSeek or OpenAI API key set as an environment variable  
- Your messy references file in .docx format  
- (Optional) A clean references .docx file to infer style from  

---

## Step 1: Set your API key

In Colab, run this in a cell:

import os
os.environ["DEEPSEEK_API_KEY"] = "your_api_key_here"

or for OpenAI:

import os
os.environ["OPENAI_API_KEY"] = "your_api_key_here"
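
Before running the pipeline it can help to confirm that the key is actually visible to Python. A minimal sketch, assuming a hypothetical helper named `require_api_key` (it is not part of the notebook) and a placeholder key value:

```python
import os

def require_api_key(name: str) -> str:
    """Return the stripped key stored in os.environ[name], or raise if it is missing or blank."""
    key = (os.environ.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"Missing {name}. Set it before running the pipeline.")
    return key

# Placeholder value for illustration only; use your real key here.
os.environ["DEEPSEEK_API_KEY"] = "your_api_key_here"

# Print only the first characters so the full key never lands in the cell output.
print(require_api_key("DEEPSEEK_API_KEY")[:4] + "...")
```

Failing early like this avoids starting the uploads only to hit an authentication error on the first LLM call.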

---

## Step 2: Run the main script cell

Just run the full Python cell that contains the pipeline code.

When prompted in the cell output:

do you want your target numbering sorting be alphabetical(a) or keep(k) ?
- a = sort references alphabetically by first author
- k = keep original order

do you want your target numbering style be dot(d) | brackets(b) | alpha(a) | none(n) ?
- d = 1. Reference
- b = [1] Reference
- a = A. Reference
- n = Reference (no numbering)
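
The four numbering codes can be sketched as a small function. This is an illustrative version, not the notebook's exact code: it takes the style as a parameter named `style` (an assumption made here for testability), whereas the pipeline reads it from a global setting.

```python
def alpha_label(n: int) -> str:
    """1 -> A, 2 -> B, ..., 26 -> Z, 27 -> AA (spreadsheet-style labels)."""
    label = ""
    while n > 0:
        n -= 1
        label = chr(ord("A") + n % 26) + label
        n //= 26
    return label

def add_prefix(position: int, text: str, style: str) -> str:
    """Prefix `text` according to the one-letter style code: d, b, a, or n."""
    style = style.strip().lower()
    if style == "n":
        return text
    if style == "b":
        return f"[{position}] {text}"
    if style == "a":
        return f"{alpha_label(position)}. {text}"
    return f"{position}. {text}"  # "d", and also the fallback for unrecognized input

# Show the first reference under each style code.
for code in "dban":
    print(add_prefix(1, "Reference", code))
```

Because numbering is added in Python after formatting, the LLM never has to keep a running count across batches.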

---

## Step 3: Upload your files

You will be prompted to upload:

1) SOURCE .docx  
   This is your messy reference file.

2) TARGET .docx (optional)  
   This is a clean reference list.  
   If you upload this, the model will infer the formatting style.  
   If you cancel, a default formatting style is used.
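
The fallback between an inferred style and the default style can be sketched as follows. The key names match the spec keys used by the pipeline; the default values are shortened placeholders, and the function name `choose_style` is hypothetical. Unlike the notebook, which uses the inferred spec as returned, this sketch also fills any missing key from the defaults.

```python
from typing import Any, Dict, Optional

REQUIRED_KEYS = [
    "authors_rule",
    "year_rule",
    "journal_vol_issue_pages_rule",
    "title_rule",
    "doi_rule",
    "one_line_rule",
]

# Shortened placeholder defaults, for illustration only.
DEFAULTS: Dict[str, Any] = {key: f"default {key}" for key in REQUIRED_KEYS}

def choose_style(inferred: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Start from the defaults, then overlay whatever non-empty rules were inferred."""
    spec = dict(DEFAULTS)
    if isinstance(inferred, dict):
        spec.update({k: v for k, v in inferred.items() if v})
    # The one-line rule is always forced, whatever the target file suggested.
    spec["one_line_rule"] = "One reference per line. No internal newlines."
    return spec

print(choose_style(None)["authors_rule"])                      # no TARGET uploaded
print(choose_style({"authors_rule": "Family, Initials"})["authors_rule"])
```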

---

## Step 4: What happens automatically

The pipeline runs these steps:

1) Extracts only the References section from your SOURCE docx  
2) Uses the LLM to split glued references into a list  
3) (Optional) Sorts references alphabetically by first author  
4) Uses the LLM to reformat each reference into the target style  
5) Adds numbering in Python  
6) Writes a new References section into a clean .docx file  
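
Steps 4 and 5 rely on batching plus an index mapping, so a reference the model skips is never lost. A minimal sketch of that idea, assuming hypothetical helper names `batches` and `merge_by_index` and a model reply shaped like `{"index": ..., "formatted": ...}` items:

```python
from typing import Any, Dict, List

def batches(refs: List[str], size: int) -> List[List[str]]:
    """Cut the reference list into consecutive batches of at most `size` items."""
    return [refs[i:i + size] for i in range(0, len(refs), size)]

def merge_by_index(batch: List[str], items: List[Dict[str, Any]]) -> List[str]:
    """Place each formatted item at its index; keep the original text where none came back."""
    out = list(batch)  # start from the originals so nothing is ever dropped
    for item in items:
        if not isinstance(item, dict):
            continue
        idx = item.get("index")
        text = item.get("formatted")
        valid_idx = isinstance(idx, int) and not isinstance(idx, bool) and 0 <= idx < len(batch)
        if valid_idx and isinstance(text, str) and text.strip():
            out[idx] = " ".join(text.split())  # collapse newlines and extra spaces into one line
    return out

batch = ["ref a", "ref b", "ref c"]
reply = [{"index": 2, "formatted": "C\nformatted"}, {"index": 0, "formatted": "A formatted"}]
print(merge_by_index(batch, reply))
```

In this sketch the unformatted original is the fallback for index 1, which mirrors the pipeline's last-resort behavior of keeping the raw reference.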

---

## Step 5: Download result

After the run finishes, Colab will automatically download:

references_llm_fixed.docx

This file contains:
- cleanly formatted references  
- consistent style  
- correct numbering  
- hanging indent  

---

## Common issues

If it crashes with “Failed to get valid JSON”:
- Reduce BATCH_SIZE (e.g. from 6 to 4)
- Try running again (LLMs are stochastic)
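
The error above means the model's reply could not be parsed even after retries. One way to tolerate extra text around the JSON is the standard library's `json.JSONDecoder.raw_decode`, sketched below. This is an alternative technique shown for illustration, not the brace-balancing helper the notebook itself uses, and the function name `parse_first_json` is hypothetical.

```python
import json
from typing import Any

def parse_first_json(text: str) -> Any:
    """Return the first JSON object or array embedded in `text`, ignoring surrounding chatter."""
    decoder = json.JSONDecoder()
    for pos, ch in enumerate(text):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at `pos` and ignores what follows.
                value, _end = decoder.raw_decode(text, pos)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next candidate start
    raise ValueError("No valid JSON value found")

print(parse_first_json('Sure, here it is: {"refs": ["a", "b"]} Hope that helps.'))
```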

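If References heading is not found:
- Make sure your Word doc has a heading on its own line, such as:
  References
- Reference, Bibliography, and Literature Cited are also recognized; matching ignores case and extra spaces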
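
The heading check can be sketched on plain strings, without opening a Word file. This simplified version covers only the exact-match pass; the function name `find_heading` and the list-of-strings input are assumptions made for illustration.

```python
import re
from typing import List

HEADINGS = {"references", "reference", "bibliography", "literature cited"}

def find_heading(paragraphs: List[str]) -> int:
    """Return the index of the first paragraph that looks like a References heading, or -1."""
    for i, text in enumerate(paragraphs):
        # Normalize: trim, collapse runs of whitespace, lowercase.
        t = re.sub(r"\s+", " ", text.strip()).lower()
        if t in HEADINGS or re.fullmatch(r"references\s*:?", t):
            return i
    return -1

print(find_heading(["Introduction", "  REFERENCES: ", "[1] First entry"]))
print(find_heading(["A paragraph with no heading at all"]))
```

Running a check like this on your own paragraph text is a quick way to see why a heading such as "Works Cited" would not be found.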

If sorting looks wrong:
- Alphabetical sorting is based on first author surname extracted by LLM  
- Rare edge cases may mis-order unusual names  
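
The ordering logic itself is a stable sort on the extracted key, with entries that have no usable key pushed to the end. A minimal sketch, assuming the surname keys have already been extracted and using a hypothetical function name `sort_by_keys`:

```python
from typing import List

def sort_by_keys(refs: List[str], keys: List[str]) -> List[str]:
    """Sort refs by surname key; ties keep their original order, empty keys sort as 'zzz'."""
    order = sorted(range(len(refs)), key=lambda i: (keys[i] or "zzz", i))
    return [refs[i] for i in order]

refs = ["Smith 2020", "unknown entry", "Adams 2019", "Smith 2018"]
keys = ["smith", "", "adams", "smith"]
print(sort_by_keys(refs, keys))
```

If an entry lands at the bottom of your list, the likely cause is that no surname could be extracted for it.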

---

## Tips

- If your target journal does NOT want DOIs, remove them from your target reference file before uploading it  
- If your references contain non-English names, alphabetical sorting may be slightly imperfect  
- To try a different numbering style, rerun the main cell and answer the prompts again; the cell asks for the file uploads on every run  
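
For non-English names, one possible improvement is to fold accented letters to their base letters before building the sort key, instead of dropping them. The sketch below uses `unicodedata.normalize`; it is a suggestion rather than what the notebook currently does, and the function name `ascii_sort_key` is hypothetical. Letters with no Unicode decomposition (for example "Ł") are still dropped.

```python
import re
import unicodedata

def ascii_sort_key(surname: str) -> str:
    """Build a lowercase a-z sort key, folding accents (e.g. u-umlaut becomes u) where possible."""
    decomposed = unicodedata.normalize("NFKD", surname)          # split letters from their accents
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")  # drop the accent marks
    key = re.sub(r"[^a-z]", "", ascii_only.lower())              # keep letters only
    return key or "zzz"                                          # empty keys sort last

for name in ["Müller", "Ångström", "O'Brien"]:
    print(name, "->", ascii_sort_key(name))
```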

---

## Output

Final file:
references_llm_fixed.docx

You can copy this directly into your paper.

In [None]:
!pip -q install openai python-docx

In [None]:
import os, json
from openai import OpenAI

ds_key = "YOUR-OWN-KEY"

os.environ["DEEPSEEK_API_KEY"] = ds_key  # or use Colab secrets

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1"
)

In [None]:
# =========================
# LLM-only Reference Cleaner (DOCX -> DOCX) for Google Colab
# Robust + stepwise + never-crash batch formatting + optional alphabetical sort
#
# What it does:
# 1) Upload SOURCE .docx (messy references)
# 2) (Optional) Upload TARGET .docx (clean reference list in the style you want)
# 3) Extract ONLY the References section text
# 4) STEP 1: LLM splits glued references into individual items
# 5) STEP 1.5: Optional sort (alphabetical by first author surname using LLM-extracted key)
# 6) STEP 2: LLM formats refs in batches using index mapping (no batch mismatch)
# 7) Python adds numbering prefixes (dot / brackets / alpha / none)
# 8) Writes a clean output DOCX with hanging indent
#
# =========================

!pip -q install python-docx openai

import os, re, json, time
from typing import List, Dict, Any, Optional, Tuple
from docx import Document
from docx.shared import Inches
from google.colab import files
from openai import OpenAI

# -------------------------
# USER CONFIG
# -------------------------
PROVIDER = "deepseek"      # "deepseek" or "openai"
MODEL = "deepseek-chat"    # deepseek: "deepseek-chat"; openai: "gpt-4o-mini" / "gpt-4.1-mini"
DEEPSEEK_BASE_URL = "https://api.deepseek.com"

DEBUG = True
DEBUG_SHOW_JSON = False   # True -> prints sample JSON returned by model (can be noisy)
BATCH_SIZE = 6            # 4-6 stable; 8-10 faster but riskier
SPLIT_CHUNK_CHARS = 6000  # chunk size for splitting step

# Sorting
SORT_MODE = input("do you want your target numbering sorting be alphabetical(a) or keep(k) ?").strip().lower()    # "a" = alphabetical, "k" = keep original order
# Numbering prefix added in Python AFTER LLM formatting (LLM is told: no numbering)
NUMBERING_STYLE = input("do you want your target numbering style be dot(d) | brackets(b) | alpha(a) | none(n) ?").strip().lower()    # "d" | "b" | "a" | "n"

# Output
OUT_PATH = "references_llm_fixed.docx"

# -------------------------
# Client
# -------------------------
def make_client() -> OpenAI:
    if PROVIDER == "deepseek":
        key = (os.environ.get("DEEPSEEK_API_KEY") or "").strip()
        if not key:
            raise RuntimeError("Missing DEEPSEEK_API_KEY. Add it to Colab Secrets or os.environ.")
        return OpenAI(api_key=key, base_url=DEEPSEEK_BASE_URL)

    if PROVIDER == "openai":
        key = (os.environ.get("OPENAI_API_KEY") or "").strip()
        if not key:
            raise RuntimeError("Missing OPENAI_API_KEY. Add it to Colab Secrets or os.environ.")
        return OpenAI(api_key=key)

    raise ValueError("PROVIDER must be 'deepseek' or 'openai'")

client = make_client()

# -------------------------
# Robust JSON extraction
# -------------------------
def extract_first_json_balanced(text: str) -> str:
    """Extract the first complete JSON object/array using brace/bracket balancing."""
    if not text:
        raise ValueError("Empty model output")

    s = text.strip()
    start = None
    for i, ch in enumerate(s):
        if ch in "{[":
            start = i
            break
    if start is None:
        raise ValueError("No JSON start found")

    opener = s[start]
    closer = "}" if opener == "{" else "]"

    depth = 0
    in_str = False
    esc = False

    for j in range(start, len(s)):
        ch = s[j]
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
            continue
        else:
            if ch == '"':
                in_str = True
                continue
            if ch == opener:
                depth += 1
            elif ch == closer:
                depth -= 1
                if depth == 0:
                    return s[start:j+1]

    raise ValueError("Unbalanced JSON (never closed)")

def parse_json_robust(text: str) -> Any:
    t = (text or "").strip()
    if not t:
        raise ValueError("Empty response")
    try:
        return json.loads(t)
    except json.JSONDecodeError:
        blob = extract_first_json_balanced(t)
        return json.loads(blob)

def llm_json(prompt: str, temperature: float = 0.0, max_retries: int = 3, max_tokens: int = 6000) -> Any:
    sys = "Return ONLY valid JSON. No markdown. No commentary. No extra text."
    last_txt = None
    cur_prompt = prompt

    for _ in range(max_retries):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": sys},
                {"role": "user", "content": cur_prompt},
            ],
            temperature=temperature,
            stop=["```"],
            max_tokens=max_tokens,
        )
        txt = resp.choices[0].message.content or ""
        last_txt = txt

        try:
            return parse_json_robust(txt)
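        except Exception:
            # Retry with the ORIGINAL task plus the bad output, so the model
            # still sees the schema and the input it has to satisfy.
            cur_prompt = f"""
{prompt}

Your previous output was invalid JSON or contained extra text. Output ONLY valid JSON for the schema requested.
No markdown. No explanations.

BAD_OUTPUT:
{txt}
"""
            time.sleep(0.25)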

    raise ValueError(f"Failed to get valid JSON after retries. Last output:\n{(last_txt or '')[:900]}")

# -------------------------
# DOCX: find References and extract only that section
# -------------------------
def norm(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip()).lower()

def find_references_start(doc: Document) -> int:
    # 1) exact-ish heading
    for i, p in enumerate(doc.paragraphs):
        t = norm(p.text)
        if t in {"references", "reference", "bibliography", "literature cited"}:
            return i
        if re.fullmatch(r"references\s*:?", t):
            return i
    # 2) fallback: short line containing "references"
    for i, p in enumerate(doc.paragraphs):
        t = norm(p.text)
        if "references" in t and len(t) <= 35:
            return i
    return -1

def extract_text_after_heading(doc: Document, heading_idx: int) -> str:
    lines = []
    for p in doc.paragraphs[heading_idx+1:]:
        txt = (p.text or "").strip()
        if not txt:
            continue
        # stop at an obvious next heading (very rough)
        if len(txt) <= 25 and txt.isupper() and "REFER" not in txt.upper():
            break
        lines.append(txt)
    return "\n".join(lines).strip()

def extract_references_section_text(doc: Document) -> str:
    idx = find_references_start(doc)
    if idx < 0:
        raise ValueError("Could not find a 'References' heading. Make sure there is a line that says: References")
    txt = extract_text_after_heading(doc, idx)
    if not txt.strip():
        raise ValueError("Found References heading but extracted no text after it.")
    return txt

# -------------------------
# Step 0: infer style from target references text (robust schema)
# -------------------------
DEFAULT_STYLE = {
    "authors_rule": "Use 'Family, Initials' and separate authors with '; ' (semicolon + space). Keep author order. Do not invent authors.",
    "year_rule": "Use 4-digit year if present; place it after authors or after journal consistent with target.",
    "journal_vol_issue_pages_rule": "Keep journal/volume/issue/pages ordering consistent with target. Do not invent missing fields.",
    "title_rule": "Keep title as-is (do not invent capitalization). End title with a period.",
    "doi_rule": "If DOI exists, include EXACTLY 'https://doi.org/<doi>' at the end unless target excludes DOI.",
    "one_line_rule": "One reference per line. No internal newlines."
}

def infer_style_from_target_text(target_refs_text: str) -> Dict[str, Any]:
    prompt = f"""
Infer a reference formatting spec from a clean reference list.

Return JSON in either of these shapes:
A) {{"spec": {{...}}}}
B) {{...}}   (the spec directly)

Spec keys required:
- authors_rule
- year_rule
- journal_vol_issue_pages_rule
- title_rule
- doi_rule
- one_line_rule

TARGET_REFERENCE_LIST_TEXT:
{target_refs_text}
"""
    data = llm_json(prompt, max_tokens=2500)

    if isinstance(data, dict) and "spec" in data and isinstance(data["spec"], dict):
        spec = data["spec"]
    elif isinstance(data, dict):
        spec = data
    else:
        spec = DEFAULT_STYLE.copy()

    # force one-line rule regardless (prevents model inserting newlines)
    spec["one_line_rule"] = "One reference per line. No internal newlines."
    return spec

# -------------------------
# Step 1: chunk + LLM split glued references into list
# -------------------------
BRACKET_MARKER = re.compile(r"\[\d+\]")

def chunk_by_markers(text: str, max_chars: int = 6000) -> List[str]:
    t = (text or "").strip()
    if not t:
        return []

    positions = [m.start() for m in BRACKET_MARKER.finditer(t)]
    if len(positions) <= 1:
        # fallback: raw chunking
        return [t[i:i+max_chars] for i in range(0, len(t), max_chars)]

    chunks = []
    start = positions[0]
    last = start

    for pos in positions[1:]:
        if (pos - start) > max_chars:
            chunks.append(t[start:last].strip())
            start = last
        last = pos

    chunks.append(t[start:].strip())
    return [c for c in chunks if c]

def step1_split_refs_llm(raw_text: str) -> List[str]:
    chunks = chunk_by_markers(raw_text, max_chars=SPLIT_CHUNK_CHARS)
    if not chunks:
        raise ValueError("No text to split.")

    all_refs = []
    for ci, ch in enumerate(chunks, start=1):
        prompt = f"""
Return JSON only: {{"refs":[...]}}.

Task: Split the bibliography text into individual references.

Rules:
- Each list element must contain EXACTLY ONE reference.
- Preserve original content; do NOT rewrite.
- Split when a new reference marker begins, like "[12]" or "12."
- Keep DOI URLs as-is.
- Do not add/remove authors, title, year, journal, volume, pages.
- Do not merge two references into one list element.
- Output refs must be in the same order as they appear.

TEXT_CHUNK ({ci}/{len(chunks)}):
{ch}
"""
        data = llm_json(prompt, max_tokens=6000)

        if DEBUG and DEBUG_SHOW_JSON and ci == 1:
            print("\n[SPLIT JSON sample]")
            print(json.dumps(data, indent=2, ensure_ascii=False)[:1500])

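        # Accept either {"refs": [...]} or a bare list, and skip non-string items
        raw_refs = data.get("refs", []) if isinstance(data, dict) else data
        refs = [r.strip() for r in (raw_refs or []) if isinstance(r, str) and r.strip()]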
        all_refs.extend(refs)

    if not all_refs:
        raise ValueError("Split failed: empty refs list.")
    return all_refs

# -------------------------
# Step 1.5: Optional alphabetical sort (by first author surname)
# -------------------------
def extract_first_author_surname_llm(ref: str) -> str:
    prompt = f"""
Return JSON only: {{"key":"..."}}.

Task: Extract the first author's FAMILY NAME (surname) from the reference.
Rules:
- If you cannot confidently find a surname, return "zzz" as key.
- Output key should be lowercase ascii if possible; keep letters only.
- Do NOT rewrite the reference, just extract a sorting key.

REFERENCE:
{ref}
"""
    data = llm_json(prompt, max_tokens=500)
    key = (data.get("key") or "").strip().lower()
    key = re.sub(r"[^a-z]", "", key)
    return key if key else "zzz"

def alphabetical_sort_refs(refs: List[str]) -> List[str]:
    # one LLM call per reference to extract its sort key (keys are not cached across runs)
    keys = []
    for r in refs:
        k = extract_first_author_surname_llm(r)
        keys.append(k)
    # stable sort: key then original order
    idxs = list(range(len(refs)))
    idxs.sort(key=lambda i: (keys[i], i))
    return [refs[i] for i in idxs]

# -------------------------
# Step 2: robust batch formatting (index-based) + retry + per-ref fallback
# LLM is told: DO NOT add numbering. Python adds numbering afterwards.
# -------------------------
def reformat_refs_batch_llm(refs: List[str], style_spec: Dict[str, Any]) -> List[str]:
    def _call(batch_refs: List[str]) -> List[Optional[str]]:
        prompt = f"""
Return JSON only.

Reformat each reference into the target style.

STYLE_SPEC:
{json.dumps(style_spec, ensure_ascii=False)}

Hard rules:
- Do NOT invent missing info.
- One line per reference (no '\\n' inside).
- Do NOT merge references.
- Do NOT drop any reference.
- Do NOT add any numbering prefix (no '1.' no '[1]' no 'A.'). I will add numbering in Python.
- Output MUST contain EXACTLY {len(batch_refs)} unique indices.

Return JSON EXACT schema:
{{
  "items": [
    {{"index": 0, "formatted": "..." }},
    {{"index": 1, "formatted": "..." }}
  ]
}}

Indices must be 0..{len(batch_refs)-1}, no repeats.

Input references:
{json.dumps(batch_refs, ensure_ascii=False)}
"""
        data = llm_json(prompt, max_tokens=6000)

        if DEBUG and DEBUG_SHOW_JSON:
            print("\n[FORMAT JSON sample]")
            print(json.dumps(data, indent=2, ensure_ascii=False)[:1500])

        items = data.get("items", [])
        out = [None] * len(batch_refs)

        if isinstance(items, list):
            for it in items:
                if not isinstance(it, dict):
                    continue
                idx = it.get("index")
                fmt = (it.get("formatted") or "").strip()
                if isinstance(idx, int) and 0 <= idx < len(batch_refs) and fmt:
                    # force one-line
                    fmt = re.sub(r"\s*\n\s*", " ", fmt).strip()
                    out[idx] = fmt
        return out

    out = _call(refs)
    missing = [i for i, v in enumerate(out) if not v]

    if missing:
        out2 = _call(refs)
        for i in missing:
            if out2[i]:
                out[i] = out2[i]
        missing = [i for i, v in enumerate(out) if not v]

    if missing:
        # per-ref fallback
        for i in missing:
            single = refs[i]
            prompt_single = f"""
Return JSON only: {{"formatted":"..."}}.

Format ONE reference into the target style.

STYLE_SPEC:
{json.dumps(style_spec, ensure_ascii=False)}

Rules:
- One line only.
- Do not invent missing info.
- Do NOT add numbering prefix.

REFERENCE:
{single}
"""
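            try:
                data = llm_json(prompt_single, max_tokens=2500)
                fmt = (data.get("formatted") or "").strip() if isinstance(data, dict) else ""
            except Exception:
                fmt = ""  # fall through to the raw-reference fallback below
            fmt = re.sub(r"\s*\n\s*", " ", fmt).strip()
            out[i] = fmt if fmt else None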

    final = []
    for i, v in enumerate(out):
        if not v:
            # last resort (never crash)
            final.append(refs[i].strip())
        else:
            final.append(v)
    return final

# -------------------------
# Numbering in Python
# -------------------------
def alpha_label(n: int) -> str:
    # 1->A, 2->B ... 26->Z, 27->AA ...
    s = ""
    while n > 0:
        n -= 1
        s = chr(ord("A") + (n % 26)) + s
        n //= 26
    return s

def add_prefix(i_global: int, text: str) -> str:
    if NUMBERING_STYLE.lower() == "n":
        return text
    if NUMBERING_STYLE.lower() == "d":
        return f"{i_global}. {text}"
    if NUMBERING_STYLE.lower() == "b":
        return f"[{i_global}] {text}"
    if NUMBERING_STYLE.lower() == "a":
        return f"{alpha_label(i_global)}. {text}"
    return f"{i_global}. {text}"

# -------------------------
# Pipeline
# -------------------------
def reformat_refs_llm_pipeline(raw_text: str, style_spec: Dict[str, Any], batch_size: int = 6) -> List[str]:
    refs = step1_split_refs_llm(raw_text)

    if DEBUG:
        print("\n=== STEP 1 (split) ===")
        print("refs:", len(refs))
        print("first ref preview:\n", refs[0][:320], "\n")

    if SORT_MODE.lower() == "a":
        refs = alphabetical_sort_refs(refs)
        if DEBUG:
            print("=== STEP 1.5 (alphabetical sort) ===")
            print("first sorted ref preview:\n", refs[0][:320], "\n")

    formatted_all = []
    cur = 1
    for i in range(0, len(refs), batch_size):
        chunk = refs[i:i+batch_size]
        formatted_chunk = reformat_refs_batch_llm(chunk, style_spec)

        # Add numbering prefix in Python globally
        for j, s in enumerate(formatted_chunk):
            formatted_all.append(add_prefix(cur + j, s))

        cur += len(chunk)

    if DEBUG:
        print("=== STEP 2 (formatted) ===")
        print("formatted:", len(formatted_all))
        print("first formatted preview:\n", formatted_all[0][:320], "\n")

    return formatted_all

# -------------------------
# Write DOCX output with hanging indent
# -------------------------
def set_hanging_indent(paragraph, left_inch=0.35, hanging_inch=0.25):
    paragraph.paragraph_format.left_indent = Inches(left_inch)
    paragraph.paragraph_format.first_line_indent = Inches(-hanging_inch)

def save_formatted_docx(formatted_lines: List[str], out_path: str):
    out = Document()
    out.add_heading("References", level=1)
    for line in formatted_lines:
        p = out.add_paragraph(line)
        set_hanging_indent(p)
    out.save(out_path)

# -------------------------
# RUN
# -------------------------
print("Upload your SOURCE .docx (messy references).")
up1 = files.upload()
if not up1:
    raise ValueError("No source file uploaded.")
source_path = next(iter(up1.keys()))
print("Source:", source_path)

print("\nOptional: upload TARGET .docx (clean refs style you want).")
print("If you don't have it, click Cancel in the upload dialog.")
try:
    up2 = files.upload()
    target_path = next(iter(up2.keys())) if up2 else None
except Exception:
    target_path = None

# Extract source refs text
src_doc = Document(source_path)
raw_text = extract_references_section_text(src_doc)

# Build style spec
style_spec = DEFAULT_STYLE.copy()
if target_path:
    tgt_doc = Document(target_path)
    target_refs_text = extract_references_section_text(tgt_doc)  # References section only
    if DEBUG:
        print("\nInferring style from target doc (References section only)...")
    style_spec = infer_style_from_target_text(target_refs_text)

# Force one-line rule no matter what
style_spec["one_line_rule"] = "One reference per line. No internal newlines."

print("\n=== STYLE SPEC IN USE ===")
print(json.dumps(style_spec, indent=2, ensure_ascii=False))

# Run pipeline
formatted_lines = reformat_refs_llm_pipeline(raw_text, style_spec, batch_size=BATCH_SIZE)

# Save + download
save_formatted_docx(formatted_lines, OUT_PATH)
print("Saved:", OUT_PATH)
files.download(OUT_PATH)

do you want your target numbering sorting be alphabetical(a) or keep(k) ?a
do you want your target numbering style be dot(d) | brackets(b) | alpha(a) | none(n) ?n
Upload your SOURCE .docx (messy references).


Saving ReferencesQ.docx to ReferencesQ (2).docx
Source: ReferencesQ (2).docx

Optional: upload TARGET .docx (clean refs style you want).
If you don't have it, click Cancel in the upload dialog.


Saving oriRef.docx to oriRef (2).docx

Inferring style from target doc (References section only)...

=== STYLE SPEC IN USE ===
{
  "authors_rule": {
    "format": "last_name, first_initial., last_name, first_initial., ...",
    "separator": ", ",
    "final_separator": ", ",
    "max_authors": null,
    "et_al_usage": "not used in provided examples",
    "initials": "with period, no space",
    "ordering": "as listed"
  },
  "year_rule": {
    "format": "year",
    "parentheses": "none",
    "position": "after authors, followed by period and space",
    "separator": ". "
  },
  "journal_vol_issue_pages_rule": {
    "format": "Journal. Vol.(Issue), Pages.",
    "journal_abbreviation": "standard abbreviated, periods after abbreviated words",
    "volume_style": "Vol.",
    "issue_style": "(Issue),",
    "pages_style": "Pages.",
    "page_range_separator": "–",
    "no_issue_format": "Vol., Pages."
  },
  "title_rule": {
    "format": "sentence case",
    "italic": false,
    "quotes": fa
