## RAG Privacy Policy Simplifier


**Project type:** Proof-of-Concept (PoC)  
**Input:** URL of a privacy policy, fetched online  
**Output:** Simplified rewrite in clear English  
**Main idea:** Use **RAG** to reduce hallucination when summarizing and highlighting a policy


## 1) Problem Statement

Privacy policies are too long and written in legal language. Most users don't read them fully, or they misunderstand what data is collected, how it is used, and who it is shared with. Ordinary summarizers can also **hallucinate** or **drop important details**.

## 2) Goal

Build a lightweight end-to-end **RAG pipeline** that rewrites privacy policy text into **clear, simple English** without inventing anything (unclear passages are omitted rather than guessed), while preserving meaning and reducing hallucination risk. The aim is to **prove the idea works** before heavy optimization, so the PoC focuses **only on TikTok's** policy as the use case.


## 3) Scope

**Included:**
- Fetch policy text from official URL
- Clean text (fix weird encoding chars)
- Chunking + embeddings + retrieval (Top-K)
- Rewrite with strict rules (no guessing)
- Evaluation

**Not included (for now):**
- Legal verification / lawyer review
- Perfect section-by-section formatting
- Multi-doc evaluation system (advanced scoring)
- Guarantee of 100% coverage


## **Model Development and Coding**

## Importing Libraries



In [None]:
!pip -q install -U transformers accelerate bitsandbytes sentence-transformers beautifulsoup4 lxml textstat matplotlib
!pip -q install pandas==2.2.2 requests==2.32.4

import os
import re
import math
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import textstat

from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)
if device == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))

## Secure HF Token + Project Configuration

In [None]:
from getpass import getpass

if not os.getenv("HF_TOKEN"):
    print("\nHF_TOKEN not found in environment.")
    print("Paste your HuggingFace token:")
    os.environ["HF_TOKEN"] = getpass("HF_TOKEN: ")

hf_token = os.getenv("HF_TOKEN", "")

print("HF_TOKEN check passed.")

POLICY_URLS = [
    "https://www.tiktok.com/legal/privacy-policy?lang=en"
]

RETRIEVAL_QUERY = "data collection, sharing, retention, rights, security, advertising, tracking"

# Chunking knobs
CHUNK_WORDS = 500  # after many trials, this was the best threshold for this use case
CHUNK_OVERLAP = 80
MIN_CHUNK_WORDS = 120

# Retrieval knobs
TOP_K = 12  # chosen after many trials; I started with 6 and raised it for more coverage

# Models
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_NAME   = "mistralai/Mistral-7B-Instruct-v0.2"

# Generator knobs
GEN_MAX_NEW_TOKENS = 1400 # to make it stable


## Fetching Policies + Generic Cleaning

In [None]:
def _normalize_whitespace(s: str) -> str:
    s = s.replace("\u00a0", " ")
    s = re.sub(r"[ \t]+", " ", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip()

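# Repairs the unreadable symbols (mojibake) that show up while reading the policies, which I observed after reviewing a number of them.
# The patterns are written as escapes and reconstructed from the usual UTF-8-read-as-Windows-1252 mix-up.
def fix_mojibake(s: str) -> str:
    return (s.replace("\u00e2\u20ac\u2122", "'")   # mis-decoded right single quote (U+2019)
             .replace("\u00e2\u20ac\u0153", '"')   # mis-decoded left double quote (U+201C)
             .replace("\u00e2\u20ac\u009d", '"')   # mis-decoded right double quote (U+201D)
             .replace("\u00e2\u20ac\u201c", "-")   # mis-decoded en dash (U+2013)
             .replace("\u00c2", " ")               # stray byte left over from a non-breaking space
             .strip())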

def fetch_policy(url: str, timeout: int = 30) -> str:
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9",
    }
    r = requests.get(url, headers=headers, timeout=timeout)
    if r.status_code != 200:
        preview = (r.text or "")[:200]
        raise RuntimeError(f"Fetch failed: {r.status_code}\nPreview:\n{preview}")

    soup = BeautifulSoup(r.text, "lxml")

    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    main = soup.find("main") or soup.find("article")
    text = main.get_text("\n") if main else soup.get_text("\n")

    text = _normalize_whitespace(text)
    text = fix_mojibake(text)
    return text

def clean_text(raw: str) -> str:
    before = len(raw)
    lines = [ln.strip() for ln in raw.splitlines()]
    lines = [ln for ln in lines if len(ln) > 2]
    cleaned = "\n".join(lines)
    cleaned = _normalize_whitespace(cleaned)
    cleaned = fix_mojibake(cleaned)
    after = len(cleaned)
    print(f"Cleaned chars: {after} (removed ~{max(0, before-after)} chars)")
    return cleaned

# Used to display the result in the console for faster viewing
def preview_head_tail(text: str, n: int = 1200) -> None:
    print("\n" + "="*70)
    print(f"TEXT STATS | chars: {len(text)} | words: {len(text.split())}")
    print("="*70)
    print("\n[HEAD PREVIEW]\n")
    print(text[:n])
    print("\n" + "-"*70)
    print("\n[TAIL PREVIEW]\n")
    print(text[-n:])
    print("\n" + "="*70 + "\n")

policies = []
for url in POLICY_URLS:
    print("\nFetching:", url)
    raw = fetch_policy(url)
    text = clean_text(raw)

    if len(text) < 20000:
        print("WARNING: Policy text looks too short (<20k chars). It may be incomplete")

    preview_head_tail(text, n=1200)
    policies.append({"url": url, "text": text})

# Optional manual override: after many trials, I found that many policy pages are protected against fetching
MANUAL_POLICY_TEXT = ""
if MANUAL_POLICY_TEXT.strip():
    policies = [{"url": "MANUAL_INPUT", "text": clean_text(MANUAL_POLICY_TEXT)}]
    preview_head_tail(policies[0]["text"], n=1200)


I pulled the TikTok policy text from the link and cleaned the messy lines.
I checked the **head** and **tail** to make sure the page is not empty or cut off.
It looks full length (large **word count**), so I can move on to chunking.
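
The cleaning step above replaces a handful of known broken sequences by hand. As a hedged alternative, the sketch below undoes the same kind of damage generically by round-tripping the text through Windows-1252 and UTF-8. `repair_mojibake` is a hypothetical helper, not part of the pipeline, and it assumes the whole string was mis-decoded in the same way; anything else is returned unchanged.

```python
def repair_mojibake(s: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252 (hypothetical helper)."""
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not pure cp1252 mojibake, so leave it as it is.
        return s

# A right single quote (U+2019) after the wrong decode, written with escapes
broken = "don\u00e2\u20ac\u2122t share"
print(repair_mojibake(broken))
```

Clean input such as plain ASCII or correctly decoded accented text passes through untouched, which makes this safe to try before falling back to manual replacements.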


## **RAG Pipeline**

## Chunking
Split the policy into overlapping chunks.


In [None]:
def chunk_text(text: str, chunk_words: int = 500, overlap_words: int = 80, min_words: int = 120) -> list[str]:
    words = text.split()
    chunks = []
    i = 0
    step = max(1, chunk_words - overlap_words)

    while i < len(words):
        chunk_words_list = words[i:i + chunk_words]
        chunk = " ".join(chunk_words_list).strip()
        if chunk and len(chunk_words_list) >= min_words:
            chunks.append(chunk)
        i += step

    return chunks

def chunk_stats(chunks: list[str]) -> None:
    sizes = [len(c.split()) for c in chunks]
    print(f"Chunks: {len(chunks)} | words min/avg/max: {min(sizes)}/{sum(sizes)//len(sizes)}/{max(sizes)}")

def preview_chunks(chunks: list[str], take: int = 2, chars: int = 800) -> None:
    take = min(take, len(chunks))
    for idx in range(take):
        print("\n" + "="*70)
        print(f"CHUNK #{idx} | words: {len(chunks[idx].split())}")
        print("="*70)
        print(chunks[idx][:chars])
        print("\n" + "="*70)

for p in policies:
    chunks = chunk_text(p["text"], CHUNK_WORDS, CHUNK_OVERLAP, MIN_CHUNK_WORDS)

    if len(chunks) < 6:
        print("WARNING: Few chunks (<6). The policy text may be incomplete.")

    chunk_stats(chunks)
    preview_chunks(chunks, take=2, chars=800)
    p["chunks"] = chunks


I chunked the policy into ~500-word parts with a small **overlap** so that chunk edges don't cut a thought in half.
This gave **20 chunks**, most of them the same size, so retrieval should be stable.
The smaller last chunk is expected.
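
To see where the 20 chunks come from, here is a small sketch that replays the sliding-window arithmetic on a word count alone, using the same defaults as the chunking cell. `chunk_plan` is a hypothetical helper, and 8237 is the word count reported for this policy in the evaluation further down.

```python
def chunk_plan(n_words, chunk_words=500, overlap_words=80, min_words=120):
    """Return (start, size) for every chunk the sliding window would keep."""
    step = max(1, chunk_words - overlap_words)
    plan = []
    for start in range(0, n_words, step):
        size = min(chunk_words, n_words - start)
        if size >= min_words:  # same rule as chunk_text: drop very short tails
            plan.append((start, size))
    return plan

plan = chunk_plan(8237)
sizes = [size for _, size in plan]
print("chunks:", len(plan))
print("last chunk words:", sizes[-1])
print("words across all chunks:", sum(sizes))
```

With a step of 420 words, the window starts 20 times and the last chunk holds the remaining 257 words. The chunks together contain more words than the policy because each overlap region is counted twice.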

## Embeddings + Retrieval

In [None]:
embedder = SentenceTransformer(EMBED_MODEL_NAME)

def embed_chunks(chunks: list[str]) -> tuple[np.ndarray, list[str]]:
    chunks_clean = [c.strip() for c in chunks if c and c.strip()]
    emb = embedder.encode(chunks_clean, normalize_embeddings=True, show_progress_bar=False)
    return np.array(emb), chunks_clean

def retrieve_topk(chunks: list[str], chunk_emb: np.ndarray, query: str, top_k: int):
    q_emb = embedder.encode([query], normalize_embeddings=True, show_progress_bar=False)[0]
    scores = chunk_emb @ q_emb
    k = min(top_k, len(scores))
    top_idx = np.argsort(-scores)[:k].tolist()
    return top_idx, [chunks[i] for i in top_idx], [float(scores[i]) for i in top_idx]

def preview_retrieved(indices, retrieved, scores, chars: int = 800):
    print("\nRetrieved chunk indices:", indices)
    for j, (idx, txt, sc) in enumerate(zip(indices, retrieved, scores)):
        print("\n" + "=" * 70)
        print(f"RETRIEVED #{j} | chunk={idx} | score={sc:.4f} | words={len(txt.split())}")
        print("=" * 70)
        print(txt[:chars])
        print("\n" + "=" * 70)

for p in policies:
    p["chunk_emb"], p["chunks_clean"] = embed_chunks(p["chunks"])

    print(f'\n[RAG] url={p.get("url","")} | chunks={len(p["chunks_clean"])} | TOP_K={TOP_K}')
    idxs, retrieved, scores = retrieve_topk(p["chunks_clean"], p["chunk_emb"], RETRIEVAL_QUERY, TOP_K)

    preview_retrieved(idxs, retrieved, scores, chars=800)

    p["retrieved_idxs"] = idxs
    p["retrieved_texts"] = retrieved
    p["retrieved_scores"] = scores


I ran embeddings on the **20 chunks** and pulled **TOP_K=12**.
The top results look relevant: sharing, ads/analytics, location, security, transfers, contacts, purchases, and rights.

Scores are mostly around 0.50, which is acceptable here, and the few lower ones are country-specific add-on sections rather than random noise.

Now I can feed these retrieved chunks into the generator.
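
The scores above are cosine similarities: because the embeddings are normalized, a plain dot product between chunk vectors and the query vector gives the cosine. A minimal sketch with toy 2-D vectors (made-up numbers, not real embeddings) shows the ranking step in isolation; `top_k_by_cosine` is a hypothetical name.

```python
import numpy as np

def top_k_by_cosine(chunk_vecs, query_vec, k):
    # L2-normalize so that the dot product equals cosine similarity
    chunks_n = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_n = query_vec / np.linalg.norm(query_vec)
    scores = chunks_n @ query_n
    order = np.argsort(-scores)[:k]  # highest score first
    return order.tolist(), scores[order].tolist()

# Toy vectors standing in for chunk embeddings
chunk_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([2.0, 0.0])

idx, scores = top_k_by_cosine(chunk_vecs, query_vec, k=2)
print(idx, [round(s, 3) for s in scores])
```

The vector pointing the same way as the query ranks first and the diagonal one second, regardless of vector length.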


## Load Generator Model (Mistral 7B Instruct, 4-bit)

Load tokenizer + model

4-bit quantization to fit GPU
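
A rough back-of-the-envelope check of why 4-bit loading helps. The parameter count (about 7.2 billion) is an approximation I am assuming here, and the estimate covers the weights only, not activations, the KV cache, or quantization overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.2e9  # assumption: approximate size of a 7B-class model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"fp16 weights: ~{fp16_gb:.1f} GB | 4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the fp16 footprint, which is the difference between fitting and not fitting on a small GPU.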

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL_NAME, token=hf_token, use_fast=True)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    GEN_MODEL_NAME,
    token=hf_token,
    quantization_config=bnb_config,
    device_map="auto",
)
model.eval()

print("Loaded generator:", GEN_MODEL_NAME)
print("pad_token_id:", tokenizer.pad_token_id)


I loaded the **generator model** (Mistral 7B Instruct) and it finished without errors,
so the model is now ready for the rewrite step.

## Rewrite/Simplify

In [None]:
def build_context_from_idxs(all_chunks: list[str], idxs: list[int], max_chunks: int = 11, per_chunk_chars: int = 1200) -> str:
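    # idxs arrive in score order: keep the best-ranked chunks first, then restore document order
    uniq = sorted(list(dict.fromkeys(idxs))[:max_chunks])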
    chosen = []
    for i in uniq:
        c = (all_chunks[i] or "").strip()
        c = fix_mojibake(c)
        if c:
            chosen.append(c[:per_chunk_chars])
    return "\n\n".join(chosen)

def cleanup_output(text: str) -> str:
    t = fix_mojibake(text.replace("\r", "")).strip()
    t = re.sub(r"\n{3,}", "\n\n", t)
    t = re.sub(r"\.{2,}", ".", t)      # collapse runs of dots ("..." -> ".")
    t = t.replace("<", "").replace(">", "")
    return t.strip()

def generate_rewrite(context: str, max_new_tokens: int = 1400) -> str:
    user_prompt = (
        "Task: Rewrite this privacy policy in clear, simple English for an average user.\n\n"
        "Strict rules:\n"
        "- Preserve the meaning. Do not add, guess, or invent anything.\n"
        "- Do not merge different parts in a way that changes meaning.\n"
        "- If something is unclear or not explicitly stated, omit it.\n"
        "- Use short paragraphs only. No bullet lists. No numbering.\n"
        "- Avoid unusual symbols. Use normal punctuation only.\n"
        "- Do not repeat the same idea again and again. If you start repeating, stop.\n\n"
        "Text:\n"
        f"{context}\n"
    )

    messages = [{"role": "user", "content": user_prompt}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) if hasattr(tokenizer, "apply_chat_template") else user_prompt

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=5500)
    input_len = inputs["input_ids"].shape[-1]
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            repetition_penalty=1.12, # After trying different penalty values, this worked the best for the use case
            no_repeat_ngram_size=8,
            pad_token_id=tokenizer.eos_token_id,
        )

    gen_tokens = out[0][input_len:]
    return cleanup_output(tokenizer.decode(gen_tokens, skip_special_tokens=True))

TOP_K_FOR_GEN = min(11, TOP_K)

for p in policies:
    context = build_context_from_idxs(
        all_chunks=p["chunks_clean"],  # retrieved_idxs index into chunks_clean
        idxs=p["retrieved_idxs"],
        max_chunks=TOP_K_FOR_GEN,
        per_chunk_chars=1200
    )

    print(f"\n[GEN] url={p['url']} | context_chunks={min(TOP_K_FOR_GEN, len(set(p['retrieved_idxs'])))}")
    simplified = generate_rewrite(context, max_new_tokens=GEN_MAX_NEW_TOKENS).strip()

    simplified += (
        "\n\n---\n"
        "Note (experimental): This summary is for learning/testing purposes only and may contain mistakes. "
        "The official and legally binding text is the original privacy policy.\n"
        "Ahmed Wadee Moustafa"
    )

    p["simplified"] = simplified

    print("\n" + "="*90)
    print("SIMPLIFIED OUTPUT")
    print("="*90)
    print(simplified)
    print("="*90 + "\n")


I ran the **trial summary**, and as a normal user I'd rate it as a proof of concept, not a perfect result.
The **good part** is that it matched the real chunks: **location (approx + GPS)**, **image/audio analysis**, **sharing with ads/analytics/payment**, and **data transfer** (Singapore/Malaysia/Ireland/US).
The **bad part** is some **over-talk / semi-hallucination**: it used **bullet points** even though the prompt said “short paragraphs only”, and it wrote “you **consent**”, which is risky if the text didn't say that clearly.
Also, some lines are **too general**, like “business partners for social networking”, which feels broad and not pinned to the exact wording.
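
The two problems spotted by eye (bullet lines and a risky word like "consent") can also be flagged automatically. This is only a sketch: `check_rewrite_rules` and the `risky_terms` list are hypothetical, and a plain substring test is a crude proxy for "the source did not say this".

```python
import re

def check_rewrite_rules(output: str, source: str, risky_terms=("consent",)) -> dict:
    """Flag bullet/numbered lines and risky terms that never appear in the source."""
    bullet_pattern = re.compile(r"^\s*([-*•]|\d+[.)])\s+")
    bullets = [ln for ln in output.splitlines() if bullet_pattern.match(ln)]
    src, out = source.lower(), output.lower()
    unsupported = [t for t in risky_terms if t in out and t not in src]
    return {"bullet_lines": bullets, "unsupported_terms": unsupported}

# Toy example, not real policy text
source = "We share information with advertisers."
output = "- You consent to sharing.\nWe share data with advertisers."

report = check_rewrite_rules(output, source)
print(report)
```

A non-empty report does not prove a hallucination, but it points to the lines worth checking against the original policy.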


## Evaluation

In [None]:
def compute_scores(text: str) -> dict:
    return {
        "FKGL": float(textstat.flesch_kincaid_grade(text)),
        "GunningFog": float(textstat.gunning_fog(text)),
        "SMOG": float(textstat.smog_index(text)),
        "Words": int(len(text.split())),
        "Chars": int(len(text)),
    }

rows = []
for p in policies:
    before = p["text"]
    after = p["simplified"]

    s_before = compute_scores(before)
    s_after = compute_scores(after)

    rows.append({"Policy": p["url"], "Version": "Before", **s_before})
    rows.append({"Policy": p["url"], "Version": "After", **s_after})

df = pd.DataFrame(rows)
display(df)

metrics = ["FKGL", "GunningFog", "SMOG"]

for policy_url in df["Policy"].unique():
    sub = df[df["Policy"] == policy_url].set_index("Version")
    before_vals = [sub.loc["Before", m] for m in metrics]
    after_vals = [sub.loc["After", m] for m in metrics]

    x = np.arange(len(metrics))
    w = 0.35

    plt.figure(figsize=(8, 4))
    plt.bar(x - w/2, before_vals, width=w, label="Before")
    plt.bar(x + w/2, after_vals, width=w, label="After")
    plt.xticks(x, metrics)
    plt.ylabel("Score")
    plt.title(f"Readability Metrics (Before vs After)\n{policy_url}")
    plt.legend()
    plt.show()

I checked the **scores table**, and it’s clear the output is far shorter than the real policy.
The original is **8237 words**, but the rewrite is only **374 words**, so it’s more of a tiny summary than a full simplified policy.
Readability got a bit better (**FKGL / Fog / SMOG** drop), but the scores are still fairly high, so the language is still “legal-ish”.
So for now it’s OK as a **trial proof**, but if I want more length later I need **more tokens**, a **chunk-by-chunk rewrite**, or possibly paid models and more GPU power.
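
For intuition about what FKGL rewards, here is the standard Flesch-Kincaid Grade Level formula as a plain function of counts. The counts below are invented for illustration, and `textstat` uses its own sentence and syllable counting, so its numbers will not match hand counts exactly.

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Same 40 words, first as one long sentence with long words,
# then as four short sentences with shorter words (made-up counts)
legal_style = fkgl(words=40, sentences=1, syllables=72)
plain_style = fkgl(words=40, sentences=4, syllables=56)
print(f"legal-style: {legal_style:.2f} | plain-style: {plain_style:.2f}")
```

Shorter sentences and fewer syllables per word both pull the grade down, which is why splitting legal sentences helps the score even when the word count grows.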


In [None]:
# =========================
# Method 2 (OpenAI) — Full Policy Simplification (rewrite, NOT a short summary)
# Goal: keep ALL meaning + details, but rewrite in very simple English.
#
# Why chunk-by-chunk?
# - One-shot on the full policy can hit context limits or silently truncate.
# - Chunking keeps coverage stable.
# =========================

!pip -q install -U openai
from openai import OpenAI
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY: ")

client = OpenAI()

avail = {m.id for m in client.models.list().data}
PREF  = ["gpt-5.2-pro","gpt-5.2","gpt-5","gpt-4.1","gpt-4o","gpt-4o-mini"]
MODEL = next((m for m in PREF if m in avail), None)
if not MODEL:
    raise RuntimeError("No preferred model available (billing/access).")

MAX_OUT = 1400

PROMPT = (
    "You are a detailed and precise legal language simplifier.\n"
    "Rewrite the policy text below in VERY simple English (A1 level max).\n"
    "Preserve ALL legal meaning and ALL details. Do NOT turn it into a short summary.\n"
    "Strict rules:\n"
    "- Do not add, guess, or invent anything.\n"
    "- Do not omit details.\n"
    "- Do not mix different parts in a way that changes meaning.\n"
    "- Keep the original order as much as possible.\n"
    "- No bullet points. No numbering.\n"
    "- Avoid weird symbols or anything that looks like code.\n"
    "- Use short sentences and short paragraphs.\n"
    "- If a legal term is needed to keep meaning, keep it and explain it simply.\n\n"
    "TEXT:\n{chunk}"
)

def _rw(chunk: str) -> str:
    r = client.responses.create(
        model=MODEL,
        max_output_tokens=MAX_OUT,
        input=[{"role":"user","content":PROMPT.format(chunk=chunk)}],
    )
    return (r.output_text or "").strip()

def _print_head_mid_tail(title: str, txt: str, head=2000, mid=2000, tail=2000):
    print("\n" + "="*90)
    print(f"{title} (head)")
    print("="*90)
    print(txt[:head])

    m = len(txt)//2
    print("\n" + "-"*90)
    print(f"{title} (middle)")
    print("-"*90)
    print(txt[max(0, m-mid//2): m+mid//2])

    print("\n" + "-"*90)
    print(f"{title} (tail)")
    print("-"*90)
    print(txt[-tail:])

for p in policies:
    chunks = p.get("chunks_clean") or p.get("chunks") or chunk_text(
        p["text"], CHUNK_WORDS, CHUNK_OVERLAP, MIN_CHUNK_WORDS
    )

    p["simplified_openai_full"] = "\n\n".join(
        _rw(c) for c in chunks if c and c.strip()
    ).strip()

    # BEFORE vs AFTER (head + middle + tail)
    _print_head_mid_tail("BEFORE", p["text"])
    _print_head_mid_tail("AFTER",  p["simplified_openai_full"])

    # metrics table (kept as-is)
    df = pd.DataFrame([
        {"Policy": p["url"], "Version": "Before", **compute_scores(p["text"])},
        {"Policy": p["url"], "Version": "After_RAG_Small", **compute_scores(p.get("simplified",""))},
        {"Policy": p["url"], "Version": "After_OpenAI_Full", **compute_scores(p["simplified_openai_full"])}
    ])
    display(df)

    print("MODEL:", MODEL)
    print("words before:", len(p["text"].split()))
    print("words rag_small:", len(p.get("simplified","").split()))
    print("words openai_full:", len(p["simplified_openai_full"].split()))


In [None]:

# One chart only (same style: 3 metrics, 3 versions).
# Uses `df` and `p` left over from the last policy processed in the cell above.
os.makedirs("outputs", exist_ok=True)  # make sure the target folder exists before saving
i = len(policies)  # 1-based index of that last policy, matching the saved .txt file names

metrics = ["FKGL", "GunningFog", "SMOG"]
sub = df.set_index("Version")

before_vals = [sub.loc["Before", m] for m in metrics]
rag_vals    = [sub.loc["After_RAG_Small", m] for m in metrics]
full_vals   = [sub.loc["After_OpenAI_Full", m] for m in metrics]

x = np.arange(len(metrics))
w = 0.28

plt.figure(figsize=(9, 4))
plt.bar(x - w, before_vals, width=w, label="Before")
plt.bar(x,     rag_vals,    width=w, label="After_RAG_Small")
plt.bar(x + w, full_vals,   width=w, label="After_OpenAI_Full")
plt.xticks(x, metrics)
plt.ylabel("Score (lower is easier)")
plt.title(f"Readability Scores - {p['url']}")
plt.legend()  # add the legend before saving so it appears in the PNG
plt.savefig(f"outputs/readability_scores_{i}.png", dpi=200, bbox_inches="tight")
plt.show()

## Conclusion (what the results really mean)

I can now say this PoC worked in the way I actually needed, not just “some output”. The original TikTok policy is about **8237 words**, and the reading level was very high (**FKGL ~14.82**, **Fog ~17.66**). So even if a normal user tries to read it, they will miss things or lose interest quickly, because the language is legal and dense.

When I used the Top-K RAG path, the output was only **374 words**. That's not real “simplification”; it's more like a **small summary** of a few important areas. It can be useful for a quick overview (data sharing, ads, location, transfers), but it does not cover the full policy, so I can't claim it solves the real problem on its own.

The real shift happened with the OpenAI chunk-by-chunk rewrite. It kept **full coverage** and still made the policy easier to read. The output became **13045 words**, and that's expected rather than a bug: legal sentences got split into shorter sentences, and hard terms got **explained** inside the text, so the meaning stays while the reading becomes easier (the 80-word chunk overlap also adds some repeated passages). The scores support that too: **FKGL dropped to ~8.07**, **Fog to ~10.31**, and **SMOG to ~10.67**. So I got a big readability win without shrinking the policy into a tiny “marketing” summary.

I also did the eye check (head / middle / tail) and the flow looks consistent. The rewrite kept the same order of ideas and didn't look like randomly invented paragraphs. So for this PoC, the best approach is: I use the chunk rewrite as the main method for the **full simplified policy**, and I keep Top-K retrieval as a second method only for **quick risk highlights** or “where to look first”, not as a replacement for full simplification.
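
To put the reported numbers side by side, this small sketch computes length ratios and score drops from the figures quoted in this conclusion. Nothing is re-measured here; the values are copied from the text, and SMOG is left out because only its "after" value is quoted.

```python
# Figures copied from the conclusion above
reported = {
    "Before": {"Words": 8237, "FKGL": 14.82, "GunningFog": 17.66},
    "After_RAG_Small": {"Words": 374},
    "After_OpenAI_Full": {"Words": 13045, "FKGL": 8.07, "GunningFog": 10.31},
}
before = reported["Before"]
full = reported["After_OpenAI_Full"]

length_ratio = {
    name: reported[name]["Words"] / before["Words"]
    for name in ("After_RAG_Small", "After_OpenAI_Full")
}
score_drop = {m: before[m] - full[m] for m in ("FKGL", "GunningFog")}

for name, ratio in length_ratio.items():
    print(f"{name}: {ratio:.2f}x the original word count")
for metric, drop in score_drop.items():
    print(f"{metric}: {drop:.2f} points lower than the original")
```

The RAG output keeps under a twentieth of the original length, while the full rewrite is roughly one and a half times longer with a clearly lower grade level.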


In [None]:
os.makedirs("outputs", exist_ok=True)

for i, p in enumerate(policies, 1):
    # Map each output file suffix to the text it should contain
    outputs = {
        "before": p["text"],
        "rag_small": p.get("simplified", ""),
        "openai_full": p.get("simplified_openai_full", ""),
    }
    for suffix, content in outputs.items():
        # The context manager closes (and flushes) each file
        with open(f"outputs/policy_{i}_{suffix}.txt", "w", encoding="utf-8") as f:
            f.write(content)

print("saved txt files -> outputs/")


## Wrap-up

I finished the main run and the results look stable.

- **Model used (OpenAI):** gpt-5.2-pro  
- **Policy size (original):** ~8237 words  
- **Runtime (this run):** ~24 minutes for 1 policy (end-to-end, chunk rewrite)  
- **Cost note:** this method makes many model calls (one per chunk), so scaling to more policies will **cost more** and will likely need a bigger budget (and maybe stronger compute / batching) to stay fast.

### What I saved
- The full **Before** text
- The full **After (OpenAI full rewrite)** text
- The **RAG small** output (for quick risk highlights)
- A small JSON summary with **word counts + readability scores**
- The one readability chart image (Before vs RAG vs OpenAI)

### Next step if I scale to more policies
If I run this on 5 policies, I should expect:
- higher total cost (more chunks = more calls)
- longer runtime (unless I parallelize or reduce per-chunk output)
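
One possible way to parallelize the per-chunk calls is a thread pool, since the work is mostly waiting on network responses. This is a sketch under assumptions: `rewrite_chunk` is a hypothetical stand-in for the real model call, and real usage would also need to respect API rate limits and handle failed calls.

```python
from concurrent.futures import ThreadPoolExecutor

def rewrite_chunk(chunk: str) -> str:
    # Hypothetical stand-in for the real per-chunk model call
    return chunk.upper()

def rewrite_all(chunks: list[str], max_workers: int = 4) -> list[str]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the policy order is preserved
        return list(pool.map(rewrite_chunk, chunks))

chunks = ["first part", "second part", "third part"]
print("\n\n".join(rewrite_all(chunks)))
```

Parallel calls shorten the wall-clock time but not the bill: the number of calls, and so the cost, stays the same.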


