# 🗂️ Multi-Modal Document Studio
**End-to-End Open-Source Document Analysis: No Paid API Keys Required**

Built an interactive document analysis pipeline that accepts any PDF, TXT, or pasted text (contracts, emails, articles, policies) and runs a full automated review using only open-source models from HuggingFace.

**What it does:**
- Extracts and previews token structure and the chat-template formatted prompt
- Identifies named entities (people, organisations, locations) across the full document
- Scores the first ten sentences as clauses against risk categories (liability, termination, indemnification, etc.) using zero-shot classification, so no labelled training data is needed
- Detects overall document tone / sentiment
- Generates a structured analyst brief (summary, key parties, obligations, risks, next action) via a locally-running quantized LLM with token-by-token streaming
- Synthesises the brief as audio using a text-to-speech model
- Exposes everything through a Gradio Blocks UI with file upload and tabbed outputs

**Concepts applied:**
- `pipeline()` API: NER, sentiment, zero-shot classification, text-to-speech
- `AutoTokenizer` + `apply_chat_template`: understanding token IDs and chat prompt formatting
- `AutoModelForCausalLM`: loading and running a local instruction-tuned LLM
- Quantization to fit the model in memory: INT8 via `optimum-quanto` (Apple MPS) and 4-bit via `bitsandbytes` (CUDA)
- `TextIteratorStreamer` + background thread: real-time streaming token generation
- Device-aware loading and memory management (`gc.collect()`, `empty_cache()`)

---
**Hardware:** Runs on Apple Silicon (`mps`), NVIDIA GPU (`cuda`), or CPU. Cell 3 detects the device automatically and sets `DEVICE`.
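
To see why quantization helps the model fit in memory, the weight footprint can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes a round, hypothetical one-billion-parameter model and counts weights only; activations, the KV cache, and quantization metadata come on top. `weight_memory_gb` is an illustrative helper, not part of any library.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 1_000_000_000  # hypothetical round figure, not an exact model size

for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```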

## Cell 1: Install Dependencies

In [None]:
# Run once, then restart the kernel
# `datasets` is needed by the text-to-speech cell (Cell 9)
# On CUDA machines, also install `bitsandbytes` for the 4-bit loading path in Cell 8
!uv pip install -q transformers torch accelerate pdfplumber gradio sentencepiece optimum-quanto huggingface_hub datasets

print("✅ All packages installed. Restart the kernel, then run from Cell 2 onwards.")

## Cell 2: HuggingFace Authentication

Models are downloaded from the [HuggingFace Hub](https://huggingface.co). You need a free account and an access token.

1. Go to https://huggingface.co/settings/tokens
2. Create a token with **Read** permissions
3. Set it as an environment variable before launching Jupyter: `export HF_TOKEN=hf_...`  
   Alternatively, paste it directly into the cell below (avoid committing it to git)

In [None]:
import os
from huggingface_hub import login

hf_token = os.environ.get("HF_TOKEN")
if hf_token and hf_token.startswith("hf_"):
    # Only attempt the login when a plausible token is present
    login(hf_token, add_to_git_credential=True)
    print("✅ Logged in to HuggingFace Hub.")
else:
    print("HF_TOKEN is missing or malformed. Export it before launching Jupyter, or set hf_token in this cell.")

## Cell 3: Constants & Device Detection

In [None]:
import torch

# ── Device ──────────────────────────────────────────────
if torch.backends.mps.is_available():   # Apple Silicon
    DEVICE = "mps"
    QUANT_BACKEND = "optimum-quanto"   # MPS path
elif torch.cuda.is_available():         # NVIDIA GPU
    DEVICE = "cuda"
    QUANT_BACKEND = "bitsandbytes"     # CUDA path
else:
    DEVICE = "cpu"
    QUANT_BACKEND = "none"

print(f"Device       : {DEVICE}")
print(f"Quant backend: {QUANT_BACKEND}")

# ── Model IDs ───────────────────────────────────────────

NER_MODEL       = "dslim/bert-base-NER"
ZERO_SHOT_MODEL = "facebook/bart-large-mnli"      # Zero-shot classification
SENTIMENT_MODEL = "nlptown/bert-base-multilingual-uncased-sentiment"
MODEL           = "meta-llama/Llama-3.2-1B-Instruct"

# ── Risk labels for zero-shot ───────────────────────────
RISK_LABELS = [
    "low risk", "medium risk", "high risk",
    "financial obligation", "liability", "termination clause",
    "data privacy", "intellectual property", "indemnification"
]

print("\n✅ Constants set.")

## Cell 3b: Text Extraction Utility

In [None]:
import pathlib
import pdfplumber

def extract_text(source) -> str:
    """Extract text from a PDF path, TXT path, or raw string."""
    if source is None:
        return ""
    if hasattr(source, "name"):          # Gradio file object
        source = source.name
    source = str(source)
    # Multi-line or long strings cannot be file paths, so treat them as raw text
    if "\n" in source or len(source) > 260:
        return source
    p = pathlib.Path(source)
    try:
        if p.is_file():
            if p.suffix.lower() == ".pdf":
                with pdfplumber.open(p) as pdf:
                    return "\n".join(page.extract_text() or "" for page in pdf.pages)
            return p.read_text(errors="ignore")
    except OSError:
        pass
    # Not an existing file: a short single-line string is raw text too
    return source

# Quick smoke-test
# sample = "This Agreement is entered into between Acme Corp and John Smith on 1 Jan 2025."
# print(extract_text(sample)[:200])
print("✅ Text extractor ready.")
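
The raw-text check in `extract_text` can be exercised in isolation. The sketch below pulls that heuristic into a hypothetical helper, `looks_like_raw_text`, and tries it on a multi-line string, a short file name, and a real temporary file; it uses only the standard library, so it runs without `pdfplumber`.

```python
import pathlib
import tempfile

def looks_like_raw_text(source: str, max_path_len: int = 260) -> bool:
    """Multi-line or very long strings are treated as document text, not as a path."""
    return "\n" in source or len(source) > max_path_len

# A real file on disk: its path is short and single-line, so it is read from disk
with tempfile.TemporaryDirectory() as tmp:
    txt_path = pathlib.Path(tmp) / "note.txt"
    txt_path.write_text("Payment is due within 30 days.")
    path_is_raw_text = looks_like_raw_text(str(txt_path))
    file_contents = txt_path.read_text(errors="ignore")

print(looks_like_raw_text("Clause 1.\nClause 2."))  # multi-line input
print(looks_like_raw_text("contract.pdf"))          # short single-line input
print(path_is_raw_text, repr(file_contents))
```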

## Cell 4: NER Pipeline

In [None]:
from transformers import pipeline

print("Loading NER model …")

ner_pipeline = pipeline(
    "ner",
    model=NER_MODEL,
    aggregation_strategy="simple",
    device=0 if DEVICE == "cuda" else -1  # int device index: GPU 0 on CUDA, otherwise CPU
)

# Truncate with the NER model's own tokenizer, not the LLM tokenizer
ner_tokenizer = ner_pipeline.tokenizer

def run_ner(text: str) -> str:
    """Return a formatted string of named entities."""
    if not text.strip():
        return "No text provided."
    chunk_size = 400
    words = text.split()
    chunks = [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
    all_entities = []
    for chunk in chunks:
        # Cap at 510 tokens to leave room for the [CLS] and [SEP] tokens the pipeline adds
        ids = ner_tokenizer.encode(chunk, truncation=True, max_length=510, add_special_tokens=False)
        safe_chunk = ner_tokenizer.decode(ids)
        all_entities.extend(ner_pipeline(safe_chunk))

    if not all_entities:
        return "No named entities found."

    header = ["| Entity | Group | Score |", "|--------|-------|-------|"]
    rows = [f"| {e['word']} | {e['entity_group']} | {e['score']:.2f} |" for e in all_entities]
    return "\n".join(header + rows)


print("\n✅ NER pipeline ready.")
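
The word-based chunking used by `run_ner` is easy to check on its own. This is a minimal sketch with a hypothetical helper name, `chunk_words`, and a synthetic 1,000-word document. Because chunks do not overlap, an entity that straddles a chunk boundary may be split or missed.

```python
def chunk_words(text: str, chunk_size: int = 400) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

doc = " ".join(f"w{i}" for i in range(1000))   # synthetic 1,000-word document
chunks = chunk_words(doc)

print(len(chunks))                        # 3
print([len(c.split()) for c in chunks])   # [400, 400, 200]
```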

## Cell 5: Sentiment Pipeline

In [None]:
print("Loading sentiment model …")
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=SENTIMENT_MODEL,               # without this, pipeline() falls back to its default model
    device=0 if DEVICE == "cuda" else -1
)

def run_sentiment(text: str) -> str:
    """Return overall document sentiment."""
    if not text.strip():
        return "No text provided."
    # Sentiment models also cap at 512 tokens, so use the first 400 words as a proxy
    snippet = " ".join(text.split()[:400])
    result = sentiment_pipeline(snippet, truncation=True, max_length=512)[0]

    # SENTIMENT_MODEL emits "1 star" … "5 stars"; other models may emit POSITIVE/NEGATIVE
    label = result["label"]
    if label[:1].isdigit():
        stars = int(label[0])
        emoji = "🟢" if stars >= 4 else "🟡" if stars == 3 else "🔴"
    else:
        emoji = "🟢" if label.upper() == "POSITIVE" else "🔴"
    return f"{emoji} {label}  (confidence: {result['score']:.2f})"


print("✅ Sentiment pipeline ready.")

## Cell 6: Zero-Shot Risk Scorer

In [None]:
import re, textwrap

print("Loading zero-shot classification model …")
zsc_pipeline = pipeline(
    "zero-shot-classification",
    model=ZERO_SHOT_MODEL,
    device=0 if DEVICE == "cuda" else -1
)

SUPPORTED_DOC_TYPES = ["legal contract", "email", "privacy policy", "research article", "general document"]

DOC_TYPE = "general document"   # updated by run_risk_scoring, read by the brief prompt in Cell 8

def run_risk_scoring(text: str) -> str:
    """Classify the document type, then score the first ten sentences against RISK_LABELS."""
    global DOC_TYPE   # without this, the assignment below would only create a local variable
    if not text.strip():
        return "No text provided."

    head = text[:512]
    DOC_TYPE = zsc_pipeline(head, candidate_labels=SUPPORTED_DOC_TYPES, truncation=True, max_length=512)["labels"][0]

    sentences = re.split(r'(?<=[.!?])\s+', text.strip())[:10]
    lines = ["| # | Risk | Label | Score | Clause |", "|---|------|-------|-------|--------|"]
    for i, clause in enumerate(sentences, 1):
        result = zsc_pipeline(clause, candidate_labels=RISK_LABELS, truncation=True, max_length=512)
        top_label = result["labels"][0]
        top_score = result["scores"][0]
        risk_icon = (
            "🔴" if "high" in top_label or top_label in ("liability", "indemnification") else
            "🟡" if "medium" in top_label or top_label in ("termination clause", "financial obligation") else
            "🟢"
        )
        # Escape pipes so clause text cannot break the Markdown table
        preview = textwrap.shorten(clause, width=90, placeholder="…").replace("|", "\\|")
        lines.append(f"| {i:02d} | {risk_icon} | {top_label} | {top_score:.2f} | {preview} |")

    print(f"Document type: {DOC_TYPE}")
    return "\n".join(lines)


print("\n✅ Risk scorer ready.")
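
The two model-free parts of the risk scorer, sentence splitting and the mapping from top label to risk tier, can be tried without loading anything. In this sketch, `split_clauses` and `risk_tier` are hypothetical helper names, and the tiers are returned as plain words rather than coloured icons. The last sentence of the sample shows a limitation of the regex: it also splits after abbreviations such as "Inc.".

```python
import re

HIGH_LABELS = ("liability", "indemnification")
MEDIUM_LABELS = ("termination clause", "financial obligation")

def split_clauses(text: str, limit: int = 10) -> list[str]:
    """Split after ., ! or ? followed by whitespace and keep the first `limit` pieces."""
    return re.split(r"(?<=[.!?])\s+", text.strip())[:limit]

def risk_tier(top_label: str) -> str:
    """Map a zero-shot label to a coarse tier, mirroring the icon rule in the cell above."""
    if "high" in top_label or top_label in HIGH_LABELS:
        return "high"
    if "medium" in top_label or top_label in MEDIUM_LABELS:
        return "medium"
    return "low"

text = "The Supplier shall indemnify the Buyer. Either party may terminate on notice! Acme Inc. agrees to pay."
clauses = split_clauses(text)
print(clauses)
print([risk_tier(label) for label in ("indemnification", "termination clause", "data privacy")])
```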

## Cell 7: Tokenizer + Chat Template Preview

In [None]:
from transformers import AutoTokenizer

print("Loading tokenizer …")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def run_token_preview(text: str) -> str:
    """Show token count and the chat-template-formatted prompt."""
    if not text.strip():
        return "No text provided."
    
    # Raw tokenization
    tokens = tokenizer.encode(text)
    token_count = len(tokens)
    first_10 = tokens[:10]
    
    # Apply chat template to format the prompt
    messages = [
        {"role": "system", "content": "You are a professional document analyst."},
        {"role": "user",   "content": f"Briefly summarise this document:\n\n{' '.join(text.split()[:300])}"}
    ]
    chat_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    chat_tokens = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )
    
    return (
        f"📊 Token Stats\n"
        f"  Raw token count    : {token_count}\n"
        f"  First 10 token IDs : {first_10}\n"
        f"  Chat prompt tokens : {len(chat_tokens)}\n\n"
        f"📝 Chat-Template Prompt (first 600 chars):\n"
        f"{'-'*60}\n"
        f"{chat_prompt[:600]}"
    )


print("\n✅ Tokenizer ready.")
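
What a chat template does can be sketched in plain Python: it flattens the list of role-tagged messages into one string, and `add_generation_prompt=True` appends the opening of an assistant turn so the model continues from there. The `<|role|>` and `<|end|>` tags below are made up for illustration and are not the template of the model in `MODEL`; the real format comes from the tokenizer.

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Toy chat template with hypothetical tags; real templates ship with the tokenizer."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")   # open the assistant turn for the model to complete
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a professional document analyst."},
    {"role": "user", "content": "Briefly summarise this document: ..."},
]
prompt = render_chat(messages)
print(prompt)
```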

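## Cell 8: LLM Brief Generator (Streaming)

> **Note:** The first run downloads the weights for `MODEL` (`meta-llama/Llama-3.2-1B-Instruct`, roughly 2.5 GB). This is a gated model, so accept its licence on the Hub page first. Subsequent runs use the local cache.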

In [None]:
import gc, threading
from transformers import AutoModelForCausalLM, TextIteratorStreamer

print(f"Loading LLM on {DEVICE} …")

# Quantization: CUDA uses bitsandbytes at load time; MPS uses optimum-quanto after loading
load_kwargs = dict(device_map="auto" if DEVICE == "cuda" else None)

if QUANT_BACKEND == "bitsandbytes":
    from transformers import BitsAndBytesConfig
    load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)

# Pass load_kwargs so the device map and quantization config actually take effect
llm_model = AutoModelForCausalLM.from_pretrained(MODEL, **load_kwargs)

if QUANT_BACKEND == "optimum-quanto":
    from optimum.quanto import quantize, freeze, qint8
    quantize(llm_model, weights=qint8)   # mark the weights for INT8 quantization
    freeze(llm_model)                    # swap the float weights for their quantized versions

if DEVICE == "mps":
    llm_model = llm_model.to("mps")

llm_model.eval()
print("✅ LLM loaded.")


def build_brief_prompt(text: str) -> list[dict]:
    
    SYSTEM_PROMPT = (
        f"You are a professional document analyst. The following is a {DOC_TYPE}.\n"
        "Provide a structured brief with:\n"
        "1. One-sentence summary\n"
        "2. Key parties or stakeholders\n"
        "3. Main obligations or key points (up to 5 bullet points)\n"
        "4. Notable risks or red flags\n"
        "5. Recommended next action\n"
        "Be concise. Use bullet points."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": text[:1500]}  # cap context for speed
    ]

def run_llm_brief(text: str, max_new_tokens: int = 400):
    """Generate a structured LLM brief. Yields accumulated text as each token arrives."""
    if not text.strip():
        yield "No text provided."
        return

    messages = build_brief_prompt(text)
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(DEVICE)
    attention_mask = torch.ones_like(input_ids).to(DEVICE)

    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    gen_kwargs = dict(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        streamer=streamer,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
    )

    thread = threading.Thread(target=llm_model.generate, kwargs=gen_kwargs)
    thread.start()

    accumulated = ""
    for token_text in streamer:
        accumulated += token_text
        yield accumulated

    thread.join()

    del input_ids
    gc.collect()
    if DEVICE == "cuda":
        torch.cuda.empty_cache()
    elif DEVICE == "mps":
        torch.mps.empty_cache()


print("✅ LLM brief generator ready.")
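
The background-thread pattern in `run_llm_brief` can be seen without a model. In this sketch a plain `queue.Queue` stands in for the role `TextIteratorStreamer` plays, and `fake_generate` is a hypothetical stand-in for `llm_model.generate`: the worker thread produces text pieces while the generator consumes them and yields the accumulated string.

```python
import queue
import threading

_DONE = object()   # sentinel marking the end of generation

def fake_generate(pieces, out_queue):
    """Stand-in for model.generate: push text pieces, then the sentinel."""
    for piece in pieces:
        out_queue.put(piece)
    out_queue.put(_DONE)

def stream_accumulated(pieces):
    """Yield the text accumulated so far each time the worker produces a new piece."""
    q = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(pieces, q))
    worker.start()
    accumulated = ""
    while True:
        piece = q.get()        # blocks until the worker has produced something
        if piece is _DONE:
            break
        accumulated += piece
        yield accumulated
    worker.join()

partials = list(stream_accumulated(["Summary", ":", " two", " parties"]))
print(partials)
```

Yielding the accumulated text rather than the individual piece is what lets a UI component simply replace its contents on every update.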

## Cell 9: Text-to-Speech

In [None]:
from datasets import load_dataset

# Load the synthesiser and speaker embedding once rather than on every call
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts", device=DEVICE)
embeddings_dataset = load_dataset("matthijs/cmu-arctic-xvectors", split="validation", trust_remote_code=True)
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def run_speech(text: str):
    """Synthesise speech and return (sampling_rate, waveform), a form gr.Audio accepts."""
    speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})
    return speech["sampling_rate"], speech["audio"]

print("Text to speech ready.")


## Cell 10: Gradio UI (Optional)

Launches a single-page web UI with file upload and streaming LLM output.

> Run this cell to start the app. A local URL (e.g. `http://127.0.0.1:7860`) will appear below.

In [None]:
import gradio as gr

def gradio_analyse(file_obj, raw_text: str):
    """Gradio handler: streams results stage by stage. Every yield supplies all 7 outputs."""
    source = file_obj if file_obj is not None else raw_text
    if not source:
        yield "", "Please upload a file or paste text.", "", "", "", "", None
        return

    text = extract_text(source)
    yield text, "[1/6] Running token preview…", "", "", "", "", None

    tok = run_token_preview(text)
    yield text, tok, "[2/6] Running NER…", "", "", "", None

    ents = run_ner(text)
    yield text, tok, ents, "[3/6] Running risk scoring…", "", "", None

    risk = run_risk_scoring(text)
    yield text, tok, ents, risk, "[4/6] Running sentiment…", "", None

    sent = run_sentiment(text)
    yield text, tok, ents, risk, sent, "[5/6] Generating brief…", None

    partial_brief = ""   # stays defined even if the generator yields nothing
    for partial_brief in run_llm_brief(text):
        yield text, tok, ents, risk, sent, partial_brief, None

    speech = run_speech(partial_brief)
    yield text, tok, ents, risk, sent, partial_brief, speech

with gr.Blocks(title="Multi-Modal Document Studio") as demo:
    gr.Markdown("# 🗂️ Multi-Modal Document Studio\nUpload a PDF/TXT or paste text below.")

    with gr.Row():
        file_input = gr.File(label="Upload PDF or TXT", file_types=[".pdf", ".txt"])
        text_input = gr.Textbox(label="Or paste text here", lines=8, placeholder="Paste document text…")

    run_btn = gr.Button("🔍 Analyse Document", variant="primary")

    with gr.Tabs():
        with gr.Tab("📊 Token Preview"):  tok_out  = gr.Markdown()
        with gr.Tab("🏷️ Named Entities"): ent_out  = gr.Markdown()
        with gr.Tab("⚠️ Risk Scores"):    risk_out = gr.Markdown()
        with gr.Tab("💬 Final Brief"):
            sent_out = gr.Markdown(label="😐 Sentiment")
            llm_out  = gr.Markdown(label="💬 Brief")

        with gr.Tab("🎤 Speech"):
            speech_out = gr.Audio(label="🎤 Speech")

    run_btn.click(
        gradio_analyse,
        inputs=[file_input, text_input],
        outputs=[text_input, tok_out, ent_out, risk_out, sent_out, llm_out, speech_out]
    )

demo.launch(inbrowser=True)
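
A generator handler wired to `outputs=[...]` has to yield one value per output component on every step, seven in the UI above. One way to make that harder to get wrong is a small padding helper. `stage_outputs` below is a hypothetical name and is not part of Gradio; it only builds the tuple.

```python
N_OUTPUTS = 7  # text_input, tok_out, ent_out, risk_out, sent_out, llm_out, speech_out

def stage_outputs(*done, audio=None, n_outputs: int = N_OUTPUTS) -> tuple:
    """Pad finished stage results with empty strings; the last slot is reserved for audio."""
    n_text = n_outputs - 1
    if len(done) > n_text:
        raise ValueError(f"expected at most {n_text} text values, got {len(done)}")
    return (*done, *([""] * (n_text - len(done))), audio)

early_exit = stage_outputs("", "Please upload a file or paste text.")
after_tokens = stage_outputs("doc text", "token stats", "[2/6] Running NER...")

print(early_exit)
print(len(after_tokens))
```

Inside the handler this would be used as `yield stage_outputs(text, tok, "[2/6] Running NER...")`, keeping each step's arity fixed by construction.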

---

## Concepts Demonstrated

| Cell | Pipeline / API | What it shows |
|------|---------------|---------------|
| 4 | `pipeline('ner')` | Extract people, organisations, locations |
| 5 | `pipeline('sentiment-analysis')` | Overall document tone |
| 6 | `pipeline('zero-shot-classification')` | Per-clause risk without labelled data |
| 7 | `AutoTokenizer` + `apply_chat_template` | Token IDs & prompt format |
| 8 | `AutoModelForCausalLM` + `TextIteratorStreamer` + quantization | Local LLM + streaming |
| 8 | `gc.collect()` + `empty_cache()` | MPS/CUDA memory management |
| 9 | `pipeline('text-to-speech')` | Spoken version of the brief |
| 10 | Gradio Blocks UI | File upload, tabbed outputs, streaming, end-to-end chaining of all components |