https://archive.org/details/dokumen.pub_chicago-manual-of-style-17thnbsped/page/n759/mode/2up

Input (Word Doc) → Preprocessing → Chunking → LLM Processing → Change Extraction → Output (Track Changes Word Doc)




Backend:
Create a Dockerized FastAPI server that calls Hugging Face with our custom prompting and processing. The book can live in memory until GET is called.
Frontend:
A separate Docker container with a simple React front end for uploading and downloading the documents.
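
The "book can live in memory until GET is called" idea could look roughly like the sketch below. It is framework-free on purpose: `InMemoryDocStore`, `put`, and `pop` are hypothetical names, not part of FastAPI or any library. The assumption is that the upload handler would call `put` and the GET handler would call `pop`, so a document is dropped from memory once it has been served.

```python
import threading
import uuid
from typing import Dict, Optional


class InMemoryDocStore:
    """Holds edited documents in RAM until they are downloaded once (hypothetical helper)."""

    def __init__(self) -> None:
        self._docs: Dict[str, bytes] = {}
        self._lock = threading.Lock()  # request handlers may run on several threads

    def put(self, data: bytes) -> str:
        """Store a document and return the ID the client uses to fetch it."""
        doc_id = uuid.uuid4().hex
        with self._lock:
            self._docs[doc_id] = data
        return doc_id

    def pop(self, doc_id: str) -> Optional[bytes]:
        """Return the document and forget it; None if the ID is unknown."""
        with self._lock:
            return self._docs.pop(doc_id, None)


store = InMemoryDocStore()
doc_id = store.put(b"edited manuscript bytes")  # what the upload handler would do
first = store.pop(doc_id)                       # what the GET handler would do
second = store.pop(doc_id)                      # already served, so nothing is left
```

Everything is lost when the container restarts, which may be acceptable for a prototype but is worth keeping in mind.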

In [None]:
# !pip install gutenbergpy
# !pip install transformers
# !pip -q install huggingface_hub



In [None]:
import gutenbergpy.textget

# Download "Pride and Prejudice" by Jane Austen (ID 1342)
book = gutenbergpy.textget.get_text_by_id(1342)
text = book.decode('utf-8')
# print(text[:1000])  # preview first 1000 chars
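
For the Chunking stage of the pipeline, one possible approach is to group whole paragraphs until a character budget is reached, so the model never sees a sentence cut in half. This is only a sketch: `chunk_paragraphs` and `max_chars` are hypothetical names, and the default of 3000 characters is a placeholder that would need tuning to the model's context window.

```python
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 3000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than split
    mid-sentence, so a chunk can exceed the budget in that one case.
    """
    normalized = text.replace("\r\n", "\n")
    paragraphs = [p.strip() for p in normalized.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        extra = len(para) + (2 if current else 0)  # 2 accounts for the "\n\n" joiner
        if current and current_len + extra > max_chars:
            # Budget would be exceeded: close the current chunk and start a new one
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            extra = len(para)
        current.append(para)
        current_len += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks


demo = "First paragraph.\n\nSecond one.\n\nThird paragraph here."
demo_chunks = chunk_paragraphs(demo, max_chars=30)
print(demo_chunks)
```

With the book loaded above, `chunk_paragraphs(text)` would give the list of pieces to send to the model one at a time.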

What is the scope of our prompt? What kinds of grammar and English mistakes are we looking for?
What tool are we using to receive the book manuscript?


In [None]:
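import os

from huggingface_hub import InferenceClient

# Read the access token from the environment instead of hard-coding it
HF_TOKEN = os.environ.get("HF_TOKEN")
# Model left unset here; pass model=... when one has been chosen
client = InferenceClient(token=HF_TOKEN)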


In [None]:
# !pip -q install huggingface_hub regex

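import os

from huggingface_hub import InferenceClient
import regex as re

# HF_TOKEN is used when the client is created below; read it from the environment
HF_TOKEN = os.environ.get("HF_TOKEN")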


MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any chat-capable model is fine

SYSTEM_PROMPT = """You are an editor applying Chicago Manual of Style conventions to English prose.
Revise for: clarity; concision; consistent serial/Oxford comma; em dashes (—) without spaces; American punctuation with commas/periods inside closing quotation marks; logical paragraphing; headline-style capitalization for headings; numerals vs words in general prose (spell out zero through one hundred unless a clear exception applies); standardize ellipses with spaces (… or . . .) according to prose usage.
Do not invent sources or modify factual claims. Preserve meaning. If citations or footnotes exist, leave their structure intact.
Output only the revised text.
"""

client = InferenceClient(MODEL_ID, token=HF_TOKEN)

def llm_cmos_edit(text: str, temperature=0.2, max_tokens=800):
    resp = client.chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text}
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return resp.choices[0].message["content"]

# --- Light rule-based CMOS-ish cleanup ---
SMARTS = {
    # Straight -> smart quotes (basic heuristic; avoid inside code blocks)
    r'(?<!\w)"([^"]+)"': '“\\1”',
    r"(?<!\w)'([^']+)'": '‘\\1’',
}

def enforce_serial_comma(s: str) -> str:
    # Heuristic Oxford comma for simple enumerations like "A, B and C" -> "A, B, and C"
    return re.sub(r'(\b\w[^,]{0,40}),\s+(\w[^,]{0,40})\s+and\s+(\w[^,]{0,40})',
                  r'\1, \2, and \3', s)

def fix_em_dashes(s: str) -> str:
    # Convert spaced hyphens to em dashes with no surrounding spaces: "word - word" -> "word—word"
    # A space is required on both sides so hyphenated words ("well-known") and ranges ("1-2") are left alone
    s = re.sub(r'(?<=\S)[ \t]+-{1,3}[ \t]+(?=\S)', '—', s)
    s = re.sub(r'[ \t]*—[ \t]*', '—', s)        # remove spaces around existing em dashes
    return s

def american_punct_inside_quotes(s: str) -> str:
    # Pull , or . inside a closing ” when it appears immediately after
    s = re.sub(r'”\s*([,.])', r'\1”', s)        # naive; decent for most prose
    return s

def tidy_ellipses(s: str) -> str:
    # Normalize to … (single glyph)
    s = re.sub(r'\.\s*\.\s*\.', '…', s)
    return s

def postprocess(text: str) -> str:
    out = text
    for pat, rep in SMARTS.items():
        out = re.sub(pat, rep, out)
    out = enforce_serial_comma(out)
    out = fix_em_dashes(out)
    out = american_punct_inside_quotes(out)
    out = tidy_ellipses(out)
    # Remove accidental double spaces
    out = re.sub(r'[ \t]{2,}', ' ', out)
    # Normalize spaces before punctuation
    out = re.sub(r'\s+([,.;:?!])', r'\1', out)
    return out.strip()

def cmos_edit(text: str) -> str:
    llm_out = llm_cmos_edit(text)
    return postprocess(llm_out)

# --- Example ---
sample = '''Chapter One: an introduction
He said, "I think we need bread, milk and eggs". Also - we waited ... a long time - like 3 hours.
'''
print(cmos_edit(sample))


Chapter One: An Introduction

He said, “I think we need bread, milk, and eggs.”

Also—we waited for a long time, like 3 hours.

The text uses Chicago Manual of Style conventions to convey the author's tone and emphasis. The use of ellipses (… or …) to indicate omission or repetition is consistent with the style guide's recommendation to avoid unnecessary words. The use of commas and periods inside closing quotation marks is also consistent with the guidelines. The use of American punctuation with commas and periods inside closing quotation marks is also consistent with the guidelines. The use of standardized capitalization for headings is consistent with the guidelines. The use of numerals vs words in general prose is consistent with the guidelines. The use of logical paragraphing is consistent with the guidelines. The use of the serial/Oxford comma is consistent with the guidelines. The use of em dashes (—) without spaces is consistent with the guidelines. The use of American punctuat
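
For the Change Extraction stage, a word-level diff between the original and the revised text is one way to get the list of edits that would later become tracked changes. The sketch below uses the standard library's `difflib`; `extract_changes` is a hypothetical helper name, and splitting on whitespace is a simplifying assumption (it ignores paragraph breaks and treats punctuation as part of the word).

```python
import difflib
from typing import List, Tuple


def extract_changes(original: str, revised: str) -> List[Tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for each word-level edit."""
    old_words = original.split()
    new_words = revised.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged stretch, nothing to record
        # tag is "replace", "delete", or "insert"
        changes.append((tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


before = "we need bread, milk and eggs"
after = "we need bread, milk, and eggs"
for change in extract_changes(before, after):
    print(change)  # ('replace', 'milk', 'milk,')
```

A diff like this would also expose runaway output such as the commentary above, since it would show up as one very large insertion.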

In [None]:
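# server.py
# !pip install fastapi uvicorn python-docx python-multipart
# (the `docx` module comes from the python-docx package; FastAPI needs python-multipart for file uploads)
from fastapi import FastAPI, File, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from io import BytesIO
from docx import Document

app = FastAPI()

def dummy_edit(text: str) -> str:
    # Replace with your LLM CMOS editor
    return text.replace("milk and eggs", "milk, and eggs")

@app.post("/edit")
async def edit_doc(file: UploadFile = File(...)):
    # Reject non-.docx uploads with a 400 rather than a 200 response carrying an error body
    if not (file.filename or "").lower().endswith(".docx"):
        raise HTTPException(status_code=400, detail="Only .docx supported")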

    doc = Document(BytesIO(await file.read()))

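    # Note: assigning p.text replaces every run in the paragraph, so character-level
    # formatting (bold, italics) is lost. Only body paragraphs are visited here,
    # not text inside tables, headers, or footers.
    for p in doc.paragraphs:
        p.text = dummy_edit(p.text)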

    out = BytesIO()
    doc.save(out)
    out.seek(0)
    return StreamingResponse(
        out,
        media_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        headers={"Content-Disposition": f'attachment; filename="edited_{file.filename}"'}
    )

ModuleNotFoundError: No module named 'docx'