<a href="https://colab.research.google.com/github/bbanzai88/Data-Science-Repository/blob/main/BillDiff_MVP_v3_selfcontained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🧾 Legislative Impact Analyzer — **v3 Self‑Contained** (Colab)

This single notebook includes:
- MVP viewer (v1/v2/v3) with redlines, labels, risks, **always-visible 'Likely affected businesses'** table
- Timeline compare (v1→v2, v2→v3, v1→v3)
- **AI/NLP meaning extraction** (transformers)
- **Entity linking** (agencies, companies, affected populations)
- **Interactive timeline slider**
- **Full Report v3** (HTML + CSV) with plain-English summaries + business effects
- **PDF export** of the HTML report
- **Mini dashboards** (Plotly)
- Optional **LLM deep reasoning** (OpenAI / Anthropic) — only if API key is set

> **How to use:** In Colab, go to `Runtime → Run all`.
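
The optional LLM step in the list above runs only when an API key is present. A minimal sketch of such a gate follows. The environment variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` and the helper `available_llm_providers` are assumptions for illustration, not names taken from this notebook.

```python
import os

# Assumed environment variable names; adjust them to match your own setup.
PROVIDER_ENV_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}

def available_llm_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var, "").strip()]

if available_llm_providers():
    print("LLM deep reasoning enabled for:", ", ".join(available_llm_providers()))
else:
    print("No API key found; skipping the optional LLM step.")
```

Passing a plain dict as `env` keeps the check easy to test without touching the real environment.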


In [2]:
# Clean any broken leftovers
!pip -q uninstall -y en-core-web-sm en_core_web_sm || true

# Make sure spaCy is a compatible version
!pip -q install "spacy==3.7.4"

# Install the official small English model wheel (3.7.1 matches spaCy 3.7.x)
!pip -q install \
  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl

# Quick validation
!python -m spacy validate




✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.12/dist-packages/spacy

NAME             SPACY            VERSION
en_core_web_sm   >=3.7.2,<3.8.0   3.7.1   ✔



In [3]:

# --- Sample bill versions (auto-created) ---
import json, os

DATA_DIR = "mvp_data"; os.makedirs(DATA_DIR, exist_ok=True)

v1 = {
  "bill_id": "H.R. 9999", "version": "Introduced", "date": "2024-01-12",
  "sections": [
    {"path": "Sec. 101", "title": "Short Title",
     "text": "This Act may be cited as the Consumer Data Protection Act of 2024."},
    {"path": "Sec. 102(a)", "title": "Definitions",
     "text": ("(1) COVERED ENTITY.—The term 'covered entity' means any person that collects, processes, or transfers personal data "
              "of more than 10,000 individuals in a calendar year. "
              "(2) PERSONAL DATA.—The term 'personal data' means information that identifies, relates to, or could reasonably be linked to an individual.")},
    {"path": "Sec. 201", "title": "Data Security",
     "text": ("A covered entity shall implement reasonable administrative, technical, and physical safeguards to protect the confidentiality, "
              "integrity, and availability of personal data. A civil penalty of $500 per violation applies for willful neglect.")},
    {"path": "Sec. 301", "title": "Small Business Exemption",
     "text": ("This Act shall not apply to entities with fewer than 25 employees or annual gross revenues under $2,000,000.")}
  ]
}
v2 = {
  "bill_id": "H.R. 9999", "version": "Engrossed in House", "date": "2024-03-05",
  "sections": [
    {"path": "Sec. 101", "title": "Short Title",
     "text": "This Act may be cited as the Consumer Data Protection and Innovation Act of 2024."},
    {"path": "Sec. 102(a)", "title": "Definitions",
     "text": ("(1) COVERED ENTITY.—The term 'covered entity' means any person that collects, processes, or transfers personal data "
              "of more than 5,000 residents in a calendar year, excluding de-identified data handled by processors acting under contract. "
              "(2) PERSONAL DATA.—The term 'personal data' means information that identifies or relates to a natural person, "
              "but does not include publicly available information.")},
    {"path": "Sec. 201", "title": "Data Security",
     "text": ("A covered entity may implement reasonable administrative, technical, and physical safeguards, including encryption at rest and in transit, "
              "to protect the confidentiality, integrity, and availability of personal data. A civil penalty of $50 per violation applies; "
              "entities self-attesting to compliance are not subject to audit more than once every 5 years.")},
    {"path": "Sec. 301", "title": "Small Business Exception",
     "text": ("This Act shall not apply to entities with fewer than 50 employees or annual gross revenues under $5,000,000, except data brokers.")},
    {"path": "Sec. 401", "title": "Research Exemption",
     "text": ("Nothing in this Act shall restrict research uses of de-identified data by qualified institutions for the public interest.")}
  ]
}
v3 = {
  "bill_id": "H.R. 9999", "version": "Passed Senate", "date": "2024-05-18",
  "sections": [
    {"path": "Sec. 101", "title": "Short Title",
     "text": "This Act may be cited as the Consumer Data Integrity and Innovation Act of 2024."},
    {"path": "Sec. 102(a)", "title": "Definitions",
     "text": ("(1) COVERED ENTITY.—The term 'covered entity' means any person that collects, processes, or transfers personal data "
              "of more than 8,000 individuals in a calendar year, including residents and citizens, and excluding de-identified data handled by processors "
              "pursuant to a written contract that prohibits re-identification. "
              "(2) PERSONAL DATA.—The term 'personal data' means information that identifies or relates to a natural person, and includes "
              "persistent identifiers linked to a device or household; publicly available information is not personal data.")},
    {"path": "Sec. 201", "title": "Data Security",
     "text": ("A covered entity shall implement reasonable administrative, technical, and physical safeguards, including encryption at rest and in transit, "
              "multi-factor authentication for privileged access, and annual workforce training, to protect the confidentiality, integrity, and availability of personal data. "
              "A civil penalty of $250 per violation applies; entities claiming compliance may be audited at least once every 2 years.")},
    {"path": "Sec. 301", "title": "Small Business Exception",
     "text": ("This Act shall not apply to entities with fewer than 40 employees or annual gross revenues under $4,000,000; "
              "this exception does not apply to data brokers or entities primarily engaged in targeted advertising.")},
    {"path": "Sec. 401", "title": "Research Safe Harbor",
     "text": ("Qualified institutions conducting research in the public interest may use de-identified data, provided that such institutions maintain "
              "appropriate technical and organizational measures and publicly post a research summary within 180 days of project completion.")}
  ]
}

for name, obj in [("bill_v1.json", v1), ("bill_v2.json", v2), ("bill_v3.json", v3)]:
  with open(f"{DATA_DIR}/{name}", "w") as f: json.dump(obj, f, indent=2)

print("Sample data written to", DATA_DIR)


Sample data written to mvp_data
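
Each bill version is a dict with a `sections` list keyed by `path`, so sections that appear or disappear between versions can be found with set arithmetic. The sketch below uses toy data in the same shape; `section_path_changes` is a hypothetical helper, not part of the notebook.

```python
def section_path_changes(old_bill, new_bill):
    """Compare two bill versions by section path and report added, removed, and kept paths."""
    old_paths = {s["path"] for s in old_bill["sections"]}
    new_paths = {s["path"] for s in new_bill["sections"]}
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "kept": sorted(old_paths & new_paths),
    }

# Toy versions mirroring the schema above (bill_id, version, sections[path, title, text]).
toy_old = {"bill_id": "H.R. 9999", "version": "Introduced",
           "sections": [{"path": "Sec. 101", "title": "Short Title", "text": "..."},
                        {"path": "Sec. 301", "title": "Small Business Exemption", "text": "..."}]}
toy_new = {"bill_id": "H.R. 9999", "version": "Engrossed in House",
           "sections": [{"path": "Sec. 101", "title": "Short Title", "text": "..."},
                        {"path": "Sec. 301", "title": "Small Business Exception", "text": "..."},
                        {"path": "Sec. 401", "title": "Research Exemption", "text": "..."}]}

changes = section_path_changes(toy_old, toy_new)
print(changes)
```

In the sample data, this kind of check would surface `Sec. 401`, which first appears in the second version.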


In [4]:

# --- Core heuristics, diff, and mapping ---
import re
import ipywidgets as W
import pandas as pd
from IPython.display import HTML, display
from rapidfuzz import fuzz

def split_sentences(text):
  parts = re.split(r'(?<=[.!?;])\s+', (text or '').strip())
  return [p.strip() for p in parts if p.strip()]

def inline_redline(a, b):
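  # Tokenize into words and single non-word characters so whitespace and punctuation survive.
  a_tokens = re.findall(r"\w+|\W", a or "")
  b_tokens = re.findall(r"\w+|\W", b or "")
  i=j=0; out=[]
  while i < len(a_tokens) and j < len(b_tokens):
    if a_tokens[i] == b_tokens[j]:
      out.append(a_tokens[i]); i+=1; j+=1
    else:
      # One-token lookahead: if a[i] resembles b[j+1], b[j] is treated as an insertion;
      # if a[i+1] resembles b[j], a[i] is treated as a deletion. Ties favor insertion.
      ins_score = fuzz.ratio(a_tokens[i], b_tokens[j+1]) if j+1 < len(b_tokens) else -1
      del_score = fuzz.ratio(a_tokens[i+1], b_tokens[j]) if i+1 < len(a_tokens) else -1
      if ins_score >= del_score:
        out.append(f"<ins>{b_tokens[j]}</ins>"); j += 1
      else:
        out.append(f"<del>{a_tokens[i]}</del>"); i += 1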
  while i < len(a_tokens): out.append(f"<del>{a_tokens[i]}</del>"); i+=1
  while j < len(b_tokens): out.append(f"<ins>{b_tokens[j]}</ins>"); j+=1
  return ''.join(out)

def make_redline_html(old_text, new_text):
  old_sents = split_sentences(old_text); new_sents = split_sentences(new_text)
  pairs, used = [], set()
  for i, s in enumerate(old_sents):
    best_j, best = None, -1
    for j, t in enumerate(new_sents):
      if j in used: continue
      score = fuzz.token_set_ratio(s, t)
      if score > best: best, best_j = score, j
    if best_j is not None:
      used.add(best_j); pairs.append((s, new_sents[best_j], best))
  for j, t in enumerate(new_sents):
    if j not in used: pairs.append(("", t, 0))
  rows = []
  for old_s, new_s, score in pairs:
    rd = inline_redline(old_s, new_s) if old_s else f"<ins>{new_s}</ins>"
    rows.append(f"<tr><td class='old'>{old_s}</td><td class='new'>{rd}</td><td class='score'>{score}</td></tr>")
  style = '''
  <style>
  table.diff {width:100%; border-collapse:collapse; font-family:ui-monospace, monospace; font-size:14px}
  table.diff td, table.diff th {border:1px solid #ddd; vertical-align:top; padding:6px}
  del {background:#ffe0e0; text-decoration:line-through}
  ins {background:#e0ffe0; text-decoration:none}
  .badge{display:inline-block; padding:2px 6px; border-radius:8px; background:#eee; margin-right:6px; font-size:12px}
  table.smol {border-collapse:collapse; margin-top:6px; font-family:ui-sans-serif, system-ui; font-size:13px}
  table.smol th, table.smol td {border:1px solid #ddd; padding:4px 6px}
  table.smol th {background:#f6f6f6; text-align:left}
  .muted{color:#666; font-style:italic}
  </style>
  '''
  html = style + "<table class='diff'><tr><th>Old</th><th>New (redline)</th><th>Match</th></tr>" + ''.join(rows) + "</table>"
  return HTML(html)

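def extract_numbers(text):
  # Drop paragraph enumerators such as "(1)" so they are not mistaken for thresholds.
  cleaned = re.sub(r"\(\d+\)", " ", text or "")
  nums = []
  for x in re.findall(r"\$?\b\d[\d,]*(?:\.\d+)?\b", cleaned):
    x = x.replace('$','').replace(',','')
    try: nums.append(float(x))
    except ValueError: pass
  return nums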

def label_semantic_change(old_text, new_text):
  labels = set(); rationales = []
  old_l, new_l = (old_text or "").lower(), (new_text or "").lower()
  old_nums, new_nums = extract_numbers(old_text), extract_numbers(new_text)

  if re.search(r'\bshall\b', old_l) and re.search(r'\bmay\b', new_l):
    labels.add("Obligation softened"); rationales.append("Changed 'shall' → 'may'.")
  if re.search(r'\bmay\b', old_l) and re.search(r'\bshall\b', new_l):
    labels.add("Obligation strengthened"); rationales.append("Changed 'may' → 'shall'.")

  if old_nums and new_nums and (min(new_nums) < min(old_nums)):
    labels.add("Broadening (lower threshold)"); rationales.append(f"Lowered threshold ({min(old_nums)} → {min(new_nums)}).")
  if old_nums and new_nums and (max(new_nums) > max(old_nums)):
    labels.add("Broadening (higher cap)"); rationales.append(f"Increased cap ({max(old_nums)} → {max(new_nums)}).")

  if ("residents" in new_l and "individuals" in old_l) or ("citizens" in new_l and "residents" in old_l):
    labels.add("Narrowing"); rationales.append("Eligibility wording narrowed.")
  if ("individuals" in new_l and "residents" in old_l):
    labels.add("Broadening"); rationales.append("Eligibility wording broadened.")

  if "means" in old_l and "means" in new_l and fuzz.token_set_ratio(old_text or "", new_text or "") < 90:
    labels.add("Definition change"); rationales.append("Definition text altered.")

  if ("except" in new_l or "excluding" in new_l) and not ("except" in old_l or "excluding" in old_l):
    labels.add("New exception"); rationales.append("Introduced an exception/exclusion.")
  if ("except" not in new_l and "excluding" not in new_l) and ("except" in old_l or "excluding" in old_l):
    labels.add("Exception removed"); rationales.append("Removed a prior exception/exclusion.")

  if any(t in new_l for t in ["reasonable","good cause","appropriate"]) and not any(t in old_l for t in ["reasonable","good cause","appropriate"]):
    labels.add("Ambiguity risk"); rationales.append("Introduced vague standard.")

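  # Compare dollar amounts only, so unrelated numbers (e.g. "5 years") cannot skew the penalty check.
  if "penalty" in old_l and "penalty" in new_l:
    usd = lambda s: [float(x.replace(',', '')) for x in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", s or "")]
    old_usd, new_usd = usd(old_text), usd(new_text)
    if old_usd and new_usd and min(new_usd) < min(old_usd):
      labels.add("Penalty reduced"); rationales.append("Lower per-violation penalty.")
    if old_usd and new_usd and min(new_usd) > min(old_usd):
      labels.add("Penalty increased"); rationales.append("Higher per-violation penalty.")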

  if "self-attest" in new_l: labels.add("Oversight risk"); rationales.append("Self-attestation introduced.")
  if ("audit" in new_l and "not" in new_l and "more than" in new_l):
    labels.add("Oversight risk"); rationales.append("Audit frequency limited.")
  if ("audit" in new_l and "at least" in new_l):
    labels.add("Oversight strengthened"); rationales.append("Minimum audit cadence established.")

  return sorted(labels), rationales

def detect_risks(new_text):
  risks = []
  t = (new_text or "").lower()
  def add(kind, why): risks.append({"risk":kind, "rationale":why})
  if "self-attest" in t: add("Fraud/abuse", "Self-attestation weakens verification.")
  if "not subject to audit" in t or ("audit" in t and "not" in t and "more than" in t): add("Oversight gap", "Audit frequency limited.")
  if "at least" in t and "audit" in t: add("Compliance cost", "More frequent audits may increase costs.")
  if "except" in t and ("employees" in t or "revenue" in t): add("Loophole", "Size-based exception may be gamed.")
  if "citizens" in t and "residents" not in t: add("Bias risk", "Citizens-only eligibility can exclude residents.")
  return risks

NAICS_MAP = {
  "data broker": "NAICS 514199 - All Other Information Services",
  "data brokers": "NAICS 514199 - All Other Information Services",
  "encryption": "NAICS 541512 - Computer Systems Design Services",
  "processor": "NAICS 518210 - Data Processing, Hosting, and Related Services",
  "processors": "NAICS 518210 - Data Processing, Hosting, and Related Services",
  "research": "NAICS 541715 - R&D in Physical, Engineering, and Life Sciences",
  "targeted advertising": "NAICS 541810 - Advertising Agencies",
  "advertising": "NAICS 541810 - Advertising Agencies",
  "multi-factor authentication": "NAICS 541512 - Computer Systems Design Services",
  "data security": "NAICS 541513 - Computer Facilities Management Services"
}

def map_affected_entities(text):
  hits = []
  t = (text or "").lower()
  def variants(s):
    base = s.lower(); out = {base}
    if base.endswith("s"): out.add(base[:-1])
    else: out.add(base + "s")
    return out
  for k, v in NAICS_MAP.items():
    keys = {k.lower()} | variants(k)
    if any(sub in t for sub in keys):
      hits.append({"keyword": k, "naics": v, "impact": "Compliance & opportunity"})
  unique = {}
  for h in hits:
    unique.setdefault(h["naics"], h)
  return list(unique.values())

def index_sections(meta): return {s["path"]: s for s in meta["sections"]}
VERSIONS = [{"key":"v1","meta":v1},{"key":"v2","meta":v2},{"key":"v3","meta":v3}]
IDX = {v["key"]: index_sections(v["meta"]) for v in VERSIONS}
ALL_PATHS = sorted(set().union(*[set(IDX[k].keys()) for k in IDX]))
VERSION_LABELS = {v["key"]: f"{v['meta']['version']} ({v['meta']['date']})" for v in VERSIONS}

def compare_section(path, a_key, b_key):
  old = IDX[a_key].get(path, {"title":"(new)", "text":""})
  new = IDX[b_key].get(path, {"title":"(removed)", "text":""})
  labels, rationales = label_semantic_change(old.get("text",""), new.get("text",""))
  risks = detect_risks(new.get("text",""))
  impacts = map_affected_entities(new.get("text",""))
  return old, new, labels, rationales, risks, impacts
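
The `inline_redline` heuristic above looks only one token ahead. As a point of comparison, the same token-level redline can be sketched with the standard library's `difflib.SequenceMatcher`. This is a swapped-in technique, not the notebook's method, and `redline_difflib` is a hypothetical name.

```python
import difflib
import html
import re

def redline_difflib(old, new):
    """Token-level redline built from SequenceMatcher opcodes, with HTML-escaped text."""
    a = re.findall(r"\w+|\W", old or "")
    b = re.findall(r"\w+|\W", new or "")
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_part = html.escape("".join(a[i1:i2]))
        new_part = html.escape("".join(b[j1:j2]))
        if tag == "equal":
            out.append(old_part)
        else:
            # "replace", "delete", and "insert" all reduce to an optional <del> plus an optional <ins>.
            if old_part:
                out.append(f"<del>{old_part}</del>")
            if new_part:
                out.append(f"<ins>{new_part}</ins>")
    return "".join(out)

print(redline_difflib("A covered entity shall implement safeguards.",
                      "A covered entity may implement safeguards."))
```

Because `SequenceMatcher` aligns the whole token sequence, a single substituted word does not cascade into later tokens, and escaping the text guards against stray `<` or `&` characters in bill text.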


In [5]:

# --- Entity linking (agencies, companies, affected populations) ---
import spacy, re
nlp = spacy.load("en_core_web_sm")

GAZ_AGENCIES = {
  "FTC": "Federal Trade Commission",
  "FCC": "Federal Communications Commission",
  "HHS": "Department of Health and Human Services",
  "CMS": "Centers for Medicare & Medicaid Services",
  "FDA": "Food and Drug Administration",
  "DHS": "Department of Homeland Security"
}
COMPANY_HINTS = re.compile(r"\b(Inc\.|LLC|Ltd\.|Corp\.|Corporation|Company)\b", re.I)
POPULATION_TERMS = ["consumers","residents","citizens","children","seniors","patients","students","workers"]

def entity_linking(text: str):
  doc = nlp(text or "")
  agencies, companies, populations = set(), set(), set()

  for k, v in GAZ_AGENCIES.items():
    if re.search(rf"\b{k}\b", text or "", re.I) or re.search(rf"\b{re.escape(v)}\b", text or "", re.I):
      agencies.add(v)

  for ent in doc.ents:
    if ent.label_ == "ORG":
      if COMPANY_HINTS.search(ent.text):
        companies.add(ent.text.strip())
      if re.search(r"\b(Department|Agency|Commission|Administration|Authority|Bureau)\b", ent.text, re.I):
        agencies.add(ent.text.strip())
  # Population terms are matched against the full text, independent of which entities spaCy finds.
  for term in POPULATION_TERMS:
    if re.search(rf"\b{term}\b", text or "", re.I):
      populations.add(term)

  return {
    "agencies": sorted(agencies),
    "companies": sorted(companies),
    "populations": sorted(populations)
}
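
The gazetteer lookup above matches acronyms case-insensitively, so a lowercase word such as "cms" would also count as a hit. A possible refinement, sketched here with only the standard library, matches acronyms case-sensitively and full names case-insensitively. `link_agencies` and the two-entry `AGENCIES` dict are illustrative assumptions.

```python
import re

# Two entries copied from GAZ_AGENCIES, enough for a demonstration.
AGENCIES = {
    "FTC": "Federal Trade Commission",
    "CMS": "Centers for Medicare & Medicaid Services",
}

def link_agencies(text):
    """Return full agency names mentioned in text, by exact-case acronym or by full name."""
    text = text or ""
    found = set()
    for acronym, full_name in AGENCIES.items():
        acronym_hit = re.search(rf"\b{re.escape(acronym)}\b", text)   # case-sensitive
        name_hit = re.search(re.escape(full_name), text, re.I)        # case-insensitive
        if acronym_hit or name_hit:
            found.add(full_name)
    return sorted(found)

print(link_agencies("The FTC shall enforce this Act."))
```

Escaping both the acronym and the full name keeps characters such as `&` from being read as regex syntax.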


In [6]:

# --- AI/NLP meaning extraction (transformers) ---
import torch
from transformers import pipeline

_ZS_MODEL = "facebook/bart-large-mnli"
_SUM_MODEL = "facebook/bart-large-cnn"

_zs = None
_sum = None
def _get_zs():
  global _zs
  if _zs is None:
    _zs = pipeline("zero-shot-classification", model=_ZS_MODEL, device=0 if torch.cuda.is_available() else -1)
  return _zs

def _get_sum():
  global _sum
  if _sum is None:
    _sum = pipeline("summarization", model=_SUM_MODEL, device=0 if torch.cuda.is_available() else -1)
  return _sum

_ZS_LABELS = [
  "obligation strengthened", "obligation softened",
  "scope broadened", "scope narrowed",
  "new exception", "exception removed",
  "definition changed", "penalties increased", "penalties reduced",
  "oversight strengthened", "oversight weakened", "ambiguous language added"
]

def transformer_meaning(old_text: str, new_text: str):
  try:
    zs = _get_zs()
    premise = (old_text or "").strip()
    hypothesis = (new_text or "").strip()
    joined = f"PREVIOUS: {premise}\nNEW: {hypothesis}"
    zres = zs(joined, _ZS_LABELS, multi_label=True)
    labels_ranked = [lbl for _, lbl in sorted(zip(zres['scores'], zres['labels']), reverse=True)]
    sm = _get_sum()
    to_sum = joined[:3000]
    sres = sm(to_sum, max_length=110, min_length=40, do_sample=False)
    summary = sres[0]['summary_text'].strip()
    return labels_ranked[:5], summary
  except Exception as e:
    labels, ration = label_semantic_change(old_text or "", new_text or "")
    simple = summarize_simple(old_text or "", new_text or "", labels, ration)
    return labels, " ".join(simple) if simple else "No material change detected."
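
Short sections can be shorter than the fixed `max_length=110`, in which case the summarization pipeline emits a warning suggesting a smaller value. One way to avoid that, sketched below, is to scale the bounds to the input length. `summary_bounds` is a hypothetical helper, and the halving ratio and the floor of 8 tokens are assumptions.

```python
def summary_bounds(n_input_tokens, max_cap=110, min_floor=40):
    """Pick (min_length, max_length) so the summary target stays below the input length."""
    # Aim for roughly half the input, but never above the cap and never below 8 tokens.
    max_len = max(8, min(max_cap, n_input_tokens // 2))
    # Keep min_length no larger than half of max_length so the two never conflict.
    min_len = min(min_floor, max(1, max_len // 2))
    return min_len, max_len

print(summary_bounds(39))
print(summary_bounds(1000))
```

In the notebook, the token count could come from the pipeline's own tokenizer, for example `len(sm.tokenizer(to_sum)["input_ids"])`, and the two values would then be passed as `min_length` and `max_length`.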


In [7]:

# --- Plain-English summarizer + business effects inference ---
def summarize_simple(old_text, new_text, labels, rationales):
  old_l = (old_text or "").lower(); new_l = (new_text or "").lower()
  bits = []
  if "Obligation softened" in labels: bits.append("Requirements are weaker (some 'musts' became 'may').")
  if "Obligation strengthened" in labels: bits.append("Requirements are stricter (some 'may' became 'must').")
  if "Broadening (lower threshold)" in labels: bits.append("More entities now fall under the rule (threshold was lowered).")
  if "Broadening (higher cap)" in labels: bits.append("Larger limits are allowed (higher caps).")
  if "Narrowing" in labels: bits.append("Fewer people/entities are eligible (narrower wording).")
  if "Broadening" in labels: bits.append("Eligibility is broader (more people/entities included).")
  if "New exception" in labels: bits.append("Adds a new exception that reduces when the rule applies.")
  if "Exception removed" in labels: bits.append("Removes an exception so the rule applies more often.")
  if "Definition change" in labels: bits.append("Changes an official definition, which can shift scope.")
  if "self-attest" in new_l: bits.append("Lets organizations self-attest, easing compliance but reducing oversight.")
  if "audit" in new_l and "at least" in new_l: bits.append("Requires minimum audit cadence.")
  if "audit" in new_l and "not" in new_l and "more than" in new_l: bits.append("Limits audits to a maximum frequency.")
  if not bits and (new_text or "").strip() != (old_text or "").strip(): bits.append("Text changed in ways that may affect interpretation.")
  if not bits: bits.append("No material change detected.")
  out=[]; seen=set()
  for b in bits:
    if b not in seen: out.append(b); seen.add(b)
  return out[:4]

def infer_business_effects(new_text, labels, hits):
  t = (new_text or "").lower()
  effects = []
  if "Obligation strengthened" in labels: effects.append("Higher compliance workload and potential tooling costs.")
  if "Obligation softened" in labels: effects.append("Lower compliance workload; risk shifts to consumers/regulators.")
  if "New exception" in labels: effects.append("Some firms newly exempt; competitors may gain a cost advantage.")
  if "Exception removed" in labels: effects.append("Previously exempt firms now face compliance costs.")
  if "Broadening (lower threshold)" in labels: effects.append("Smaller firms pulled into scope; onboarding compliance programs.")
  if "Broadening (higher cap)" in labels: effects.append("Operational limits expand; could enable larger programs.")
  if "Oversight risk" in labels: effects.append("Fraud/abuse exposure increases; reputational/regulatory risk.")
  if "Oversight strengthened" in labels: effects.append("More audits; ongoing compliance/assurance expenses.")
  if "multi-factor authentication" in t or "mfa" in t: effects.append("Security implementation spend (MFA rollout, IAM upgrades).")
  if "encryption" in t: effects.append("Data-at-rest/in-transit encryption; key management spend.")
  if "penalty" in t or "$" in t: effects.append("Financial exposure changes tied to per-violation penalties.")
  if "targeted advertising" in t or "advertising" in t: effects.append("Ad-tech pipelines may require reconfiguration or consent changes.")
  if "research" in t: effects.append("R&D access to data may be eased but with safeguards.")
  naics_list = [h["naics"] for h in (hits or [])]
  if any("Advertising Agencies" in n for n in naics_list): effects.append("Audience targeting and attribution methods may need updates.")
  if any("Data Processing" in n for n in naics_list): effects.append("Processors face contract updates and audit-readiness tasks.")
  if any("Information Services" in n for n in naics_list): effects.append("Data brokers may need opt-outs, registries, or limits on resale.")
  if any("Computer Systems Design" in n for n in naics_list): effects.append("Service providers see demand for security/identity integrations.")
  if any("Computer Facilities Management" in n for n in naics_list): effects.append("Ongoing infra controls and monitoring requirements likely.")
  if any("R&D" in n for n in naics_list): effects.append("Research programs gain flexibility but must manage de-identification.")
  dedup=[]; seen=set()
  for e in effects:
    if e not in seen: dedup.append(e); seen.add(e)
  return dedup[:5] if dedup else ["No clear business effect detected."]


In [8]:

# --- Interactive viewer + timeline slider ---
sec_dd   = W.Dropdown(options=ALL_PATHS, description="Section:")
ver_from = W.Dropdown(options=[("v1","v1"),("v2","v2"),("v3","v3")], value="v1", description="From:")
ver_to   = W.Dropdown(options=[("v1","v1"),("v2","v2"),("v3","v3")], value="v2", description="To:")
out = W.Output()

def make_business_table(hits):
  rows = []
  if not hits:
    rows.append("<tr><td colspan='3' class='muted'>(none detected)</td></tr>")
  else:
    for h in hits:
      rows.append(f"<tr><td>{h['naics']}</td><td>{h['keyword']}</td><td>{h['impact']}</td></tr>")
  html = ('''
  <table class="smol">
    <thead><tr><th>NAICS category</th><th>Trigger</th><th>Impact</th></tr></thead>
    <tbody>''' + "".join(rows) + "</tbody></table>")
  return HTML(html)

def billwide_table(b_key):
  agg = {}
  for path, sec in IDX[b_key].items():
    for hit in map_affected_entities(sec.get("text","")):
      agg.setdefault(hit["naics"], set()).add(path)
  rows = []
  if not agg:
    rows.append("<tr><td colspan='2' class='muted'>(none detected)</td></tr>")
  else:
    for naics, paths in sorted(agg.items(), key=lambda x: (-len(x[1]), x[0])):
      rows.append(f"<tr><td>{naics}</td><td>{len(paths)} section(s)</td></tr>")
  html = ('''
  <table class="smol">
    <thead><tr><th>NAICS category</th><th>Mentions</th></tr></thead>
    <tbody>''' + "".join(rows) + "</tbody></table>")
  return HTML(html)

def render():
  out.clear_output()
  path, a_key, b_key = sec_dd.value, ver_from.value, ver_to.value
  old, new, labels, rationales, risks, impacts = compare_section(path, a_key, b_key)
  links = entity_linking(new.get("text",""))
  t_labels, t_summary = transformer_meaning(old.get('text',''), new.get('text',''))

  with out:
    print(f"{path} — {old.get('title','')}  →  {new.get('title','')}")
    print(f"Compare: {VERSION_LABELS[a_key]}  →  {VERSION_LABELS[b_key]}")
    if labels:
      badges = " ".join([f"<span class='badge'>{l}</span>" for l in labels])
      display(HTML(badges))

    display(HTML("<p><strong>Transformer meaning (AI):</strong><br/>" +
                 "; ".join(t_labels[:3]) + "<br/><em>" + t_summary + "</em></p>"))

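    simple = summarize_simple(old.get('text',''), new.get('text',''), labels, rationales)
    effects = infer_business_effects(new.get('text',''), labels, impacts)
    display(HTML("<p><strong>What this means (plain-English):</strong><br/>" + "; ".join(simple) + "</p>"))
    # Show the inferred effects; previously they were computed but never displayed.
    display(HTML("<p><strong>Potential business effects:</strong><br/>" + "; ".join(effects) + "</p>"))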

    display(make_redline_html(old.get('text',''), new.get('text','')))

    if rationales:
      print("\nWhy we think meaning changed:")
      for r in rationales: print(" •", r)
    if risks:
      print("\nRisk flags:")
      for r in risks: print(f" • {r['risk']}: {r['rationale']}")

    print("\nLikely affected businesses:"); display(make_business_table(impacts))

    print("\nLinked entities:")
    print(" • Agencies:", ", ".join(links["agencies"]) or "(none)")
    print(" • Companies:", ", ".join(links["companies"]) or "(none)")
    print(" • Populations:", ", ".join(links["populations"]) or "(none)")

    print("\nBill-wide affected business categories (for selected 'To' version):")
    display(billwide_table(b_key))

    print("\nChange history for this section (v1→v2 and v2→v3):")
    for (x,y) in [("v1","v2"),("v2","v3")]:
      o,n,L,R,_,_ = compare_section(path, x, y)
      print(" •", f"{VERSION_LABELS[x]} → {VERSION_LABELS[y]} :: {', '.join(L) if L else 'No material change'}")

def _on_change(_): render()
for w in (sec_dd, ver_from, ver_to): w.observe(_on_change, names="value")

display(W.HBox([sec_dd]))
display(W.HBox([ver_from, ver_to]))
render()
display(out)

PAIR_OPTS = [("v1 → v2", ("v1","v2")), ("v2 → v3", ("v2","v3")), ("v1 → v3", ("v1","v3"))]
pair_slider = W.SelectionSlider(options=PAIR_OPTS, value=PAIR_OPTS[0][1], description='Timeline:', continuous_update=False)
play = W.Play(interval=1000, value=0, min=0, max=len(PAIR_OPTS)-1, step=1)
link = W.jslink((play, 'value'), (pair_slider, 'index'))
out_slider = W.Output()

def render_slider_view(_=None):
  with out_slider:
    out_slider.clear_output()
    a_key, b_key = pair_slider.value
    path = sec_dd.value
    old, new, labels, rationales, risks, impacts = compare_section(path, a_key, b_key)
    print(f"{path} :: {VERSION_LABELS[a_key]} → {VERSION_LABELS[b_key]}")
    display(make_redline_html(old.get('text',''), new.get('text','')))
    if labels: print("Labels:", ", ".join(labels))
    if risks: print("Risk flags:", ", ".join([r['risk'] for r in risks]))

pair_slider.observe(render_slider_view, names="value")
sec_dd.observe(render_slider_view, names="value")

display(W.HBox([pair_slider, play]))
render_slider_view()
display(out_slider)


HBox(children=(Dropdown(description='Section:', options=('Sec. 101', 'Sec. 102(a)', 'Sec. 201', 'Sec. 301', 'S…

HBox(children=(Dropdown(description='From:', options=(('v1', 'v1'), ('v2', 'v2'), ('v3', 'v3')), value='v1'), …


Your max_length is set to 110, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


Output()

HBox(children=(SelectionSlider(continuous_update=False, description='Timeline:', options=(('v1 → v2', ('v1', '…

Output()

In [9]:

# --- Full Report v3 (HTML + CSV) ---
from IPython.display import HTML as _IPHTML

def _as_html_str(x):
  return x if isinstance(x, str) else getattr(x, "data", str(x))

STYLE = """
<style>
body { font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Arial; line-height: 1.5; padding: 24px; color: #222; }
h1, h2, h3 { margin: 0.6em 0 0.3em; }
small.meta { color:#555 }
section.block { border:1px solid #e0e0e0; border-radius:10px; padding:14px; margin:16px 0; background:#fff; }
.badge { display:inline-block; padding:2px 8px; border-radius:999px; background:#eef; margin-right:6px; font-size:12px; border:1px solid #dde; }
table { border-collapse:collapse; }
table.diff { width:100%; border-collapse:collapse; font-family:ui-monospace, monospace; font-size:13px; }
table.diff th, table.diff td { border:1px solid #ddd; vertical-align:top; padding:6px }
del { background:#ffe0e0; text-decoration:line-through }
ins { background:#e0ffe0; text-decoration:none }
table.smol { border-collapse:collapse; font-size:13px; margin-top:6px }
table.smol th, table.smol td { border:1px solid #ddd; padding:4px 6px }
table.smol th { background:#f6f6f6; text-align:left }
.muted { color:#666; font-style:italic }
hr.sep { border:none; height:1px; background:#eee; margin:16px 0; }
</style>
"""

pairs = [('v1','v2'), ('v2','v3'), ('v1','v3')]
header_line = " — ".join([VERSION_LABELS.get(k, k) for k in ['v1','v2','v3'] if k in VERSION_LABELS])

def build_report():
  html = [STYLE, f"<h1>Comprehensive Bill Diff Report (v3)</h1>", f"<small class='meta'>{header_line}</small>"]

  def _billwide(b_key):
    agg = {}
    for path, sec in IDX[b_key].items():
      for hit in map_affected_entities(sec.get('text','')):
        agg.setdefault(hit['naics'], set()).add(path)
    rows = []
    if not agg:
      rows.append("<tr><td colspan='2' class='muted'>(none detected)</td></tr>")
    else:
      for naics, paths in sorted(agg.items(), key=lambda x:(-len(x[1]), x[0])):
        rows.append(f"<tr><td>{naics}</td><td>{len(paths)} section(s)</td></tr>")
    return "<table class='smol'><thead><tr><th>NAICS category</th><th>Mentions</th></tr></thead><tbody>" + "".join(rows) + "</tbody></table>"

  html.append("<section class='block'><h2>Bill-wide affected business categories</h2>")
  for k in ['v1','v2','v3']:
    if k in IDX:
      html.append(f"<h3>{VERSION_LABELS[k]}</h3>")
      html.append(_billwide(k))
  html.append("</section>")

  pair_records = []
  html.append("<section class='block'><h2>Pairwise summary</h2>")
  for (a,b) in pairs:
    if a not in IDX or b not in IDX: continue
    html.append(f"<h3>{VERSION_LABELS[a]} &rarr; {VERSION_LABELS[b]}</h3>")
    rows = []
    for p in ALL_PATHS:
      old, new, labels, rationales, risks, impacts = compare_section(p, a, b)
      simple = summarize_simple(old.get('text',''), new.get('text',''), labels, rationales)
      effects = infer_business_effects(new.get('text',''), labels, impacts)
      t_labels, t_summary = transformer_meaning(old.get('text',''), new.get('text',''))
      links = entity_linking(new.get('text',''))
      pair_records.append({
        "pair": f"{VERSION_LABELS[a]} -> {VERSION_LABELS[b]}",
        "path": p,
        "labels": ", ".join(labels),
        "risk_count": len(risks),
        "impact_count": len(impacts),
        "simple_summary": " ".join(simple),
        "business_effects": " | ".join(effects),
        "ai_top_labels": ", ".join(t_labels[:3]),
        "ai_summary": t_summary,
        "entities_agencies": ", ".join(links["agencies"]),
        "entities_companies": ", ".join(links["companies"]),
        "entities_populations": ", ".join(links["populations"])
      })
      # Placeholder built outside the f-strings: a backslash inside an f-string
      # expression is a SyntaxError before Python 3.12.
      dash = "<span class='muted'>—</span>"
      lbl = ", ".join(labels) if labels else dash
      rows.append(
        f"<tr><td>{p}</td><td>{lbl}</td><td>{len(risks)}</td><td>{len(impacts)}</td>"
        f"<td>{'; '.join(simple) if simple else dash}</td>"
        f"<td>{'; '.join(effects) if effects else dash}</td>"
        "</tr>"
      )
    html.append("<table class='smol'><thead><tr>"
                "<th>Section</th><th>Labels</th><th>Risk flags</th><th>Affected business entries</th>"
                "<th>Plain-English summary</th><th>Potential business effects</th>"
                "</tr></thead><tbody>" + "".join(rows) + "</tbody></table>")
  html.append("</section>")

  for p in ALL_PATHS:
    html.append(f"<section class='block'><h2>{p}</h2>")
    for (a,b) in pairs:
      if a not in IDX or b not in IDX: continue
      old, new, labels, rationales, risks, impacts = compare_section(p, a, b)
      simple = summarize_simple(old.get('text',''), new.get('text',''), labels, rationales)
      effects = infer_business_effects(new.get('text',''), labels, impacts)
      t_labels, t_summary = transformer_meaning(old.get('text',''), new.get('text',''))
      links = entity_linking(new.get('text',''))

      html.append(f"<h3>{VERSION_LABELS[a]} &rarr; {VERSION_LABELS[b]}</h3>")
      if labels: html.append("<div>" + " ".join([f"<span class='badge'>{l}</span>" for l in labels]) + "</div>")
      html.append("<p><strong>Transformer meaning (AI):</strong><br/>" + "; ".join(t_labels[:3]) + "<br/><em>" + t_summary + "</em></p>")
      html.append("<p><strong>What this means (plain-English):</strong><br/>" + ("; ".join(simple) if simple else "<span class='muted'>No material change detected.</span>") + "</p>")
      red = make_redline_html(old.get('text',''), new.get('text',''))
      html.append(_as_html_str(red))
      if rationales:
        html.append("<strong>Why we think meaning changed:</strong><ul>" + "".join([f"<li>{r}</li>" for r in rationales]) + "</ul>")
      if risks:
        html.append("<strong>Risk flags:</strong><ul>" + "".join([f"<li>{r['risk']}: {r['rationale']}</li>" for r in risks]) + "</ul>")

      rows = []
      if not impacts:
        rows.append("<tr><td colspan='3' class='muted'>(none detected)</td></tr>")
      else:
        for h in impacts:
          rows.append(f"<tr><td>{h['naics']}</td><td>{h['keyword']}</td><td>{h['impact']}</td></tr>")
      html.append("<div><strong>Likely affected businesses</strong>"
                  "<table class='smol'><thead><tr><th>NAICS category</th><th>Trigger</th><th>Impact</th></tr></thead><tbody>"
                  + "".join(rows) + "</tbody></table></div>")

      html.append("<p><strong>Linked entities:</strong><br/>" +
                  "Agencies: " + (", ".join(links["agencies"]) or "(none)") + "<br/>" +
                  "Companies: " + (", ".join(links["companies"]) or "(none)") + "<br/>" +
                  "Populations: " + (", ".join(links["populations"]) or "(none)") + "</p>")

      html.append("<p><strong>Potential effects to affected businesses:</strong><br/>" +
                  ("; ".join(effects) if effects else "<span class='muted'>No clear business effect detected.</span>") + "</p>")
      html.append("<hr class='sep'/>")
    html.append("</section>")

  return html, pair_records

import pandas as pd
html, pair_records = build_report()
REPORT_HTML = "bill_diff_full_report_v3.html"
CSV_SUMMARY = "bill_diff_pairs_summary_v3.csv"

html = [_as_html_str(x) for x in html]
with open(REPORT_HTML, "w", encoding="utf-8") as f:
  f.write("".join(html))

# Reuse the rows already collected inside build_report() rather than re-running
# compare_section / transformer_meaning for every pair and section.
pd.DataFrame(pair_records).to_csv(CSV_SUMMARY, index=False)

print("Wrote:", REPORT_HTML, "and", CSV_SUMMARY)
_IPHTML(f"<p><b>Download:</b> <a href='{REPORT_HTML}' target='_blank'>{REPORT_HTML}</a> &nbsp;|&nbsp; <a href='{CSV_SUMMARY}' target='_blank'>{CSV_SUMMARY}</a></p>")


Your max_length is set to 110, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 110, but your input_length is only 61. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=30)
Your max_length is set to 110, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 110, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)

Wrote: bill_diff_full_report_v3.html and bill_diff_pairs_summary_v3.csv
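
The `max_length` warnings in the output above appear when the summarizer is asked for up to 110 tokens on inputs that are much shorter. A minimal sketch of one way to avoid them, assuming the notebook's summarizer is a standard transformers summarization pipeline: scale the bounds to the input, using the same halving the warning itself suggests (39 input tokens gives 19). The helper name `summary_length_bounds` and its `cap` and `floor` parameters are hypothetical and not part of the notebook.

```python
def summary_length_bounds(n_input_tokens, cap=110, floor=8):
    """Return (min_length, max_length) scaled to the input size.

    max_length is half the input length, clamped to [floor, cap];
    min_length stays strictly below max_length.
    """
    max_len = min(cap, max(floor, n_input_tokens // 2))
    min_len = max(1, min(floor, max_len - 1))
    return min_len, max_len


# Input lengths taken from the warnings printed above.
for n in (39, 61, 31, 41):
    print(n, summary_length_bounds(n))
```

The resulting pair would be passed as `min_length` and `max_length` when calling the summarizer, with the token count taken from the pipeline's tokenizer. Inputs of only a few tokens would still trip the warning at the `floor` value, so skipping summarization for them may be the simpler choice.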


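`build_report()` interpolates labels, rationales, and section paths directly into the HTML. With the sample bill this is harmless, but text containing `<`, `>`, or `&` could break the markup. A minimal sketch of escaping such values with the standard library's `html.escape`; `badge` and `bullet_list` are hypothetical helpers, not functions defined in the notebook.

```python
from html import escape


def badge(label):
    """Render one label as a badge span, escaping HTML metacharacters."""
    return f"<span class='badge'>{escape(label)}</span>"


def bullet_list(items):
    """Render a sequence of strings as a <ul>, escaping each item."""
    return "<ul>" + "".join(f"<li>{escape(str(i))}</li>" for i in items) + "</ul>"


print(badge("Scope & definitions"))
print(bullet_list(["threshold < 100,000 records", "adds 'sale' of data"]))
```

Escaping should only be applied to plain-text values. Output that is already HTML, such as what `make_redline_html` presumably returns, should be passed through unchanged.
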
In [10]:

# --- PDF export (from HTML report) ---
import os, asyncio, nest_asyncio
nest_asyncio.apply()
from pyppeteer import launch

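async def html_to_pdf(html_path, pdf_path):
  # Render a local HTML file to PDF with headless Chromium (pyppeteer).
  browser = await launch(args=['--no-sandbox'])
  try:
    page = await browser.newPage()
    await page.goto(f'file://{os.path.abspath(html_path)}', waitUntil='networkidle2')
    await page.pdf({'path': pdf_path, 'format': 'Letter', 'printBackground': True})
  finally:
    # Always shut Chromium down, even if navigation or rendering fails.
    await browser.close()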

HTML_REPORT = "bill_diff_full_report_v3.html"
PDF_OUT = "bill_diff_full_report_v3.pdf"

if os.path.exists(HTML_REPORT):
  try:
    await html_to_pdf(HTML_REPORT, PDF_OUT)
    print("Wrote:", PDF_OUT)
  except Exception as e:
    print("PDF export error:", e, "- You can rerun this cell after the report is generated.")
else:
  print("Missing HTML report:", HTML_REPORT)


[INFO] Starting Chromium download.
INFO:pyppeteer.chromium_downloader:Starting Chromium download.
100%|██████████| 183M/183M [00:01<00:00, 128Mb/s]
[INFO] Beginning extraction
INFO:pyppeteer.chromium_downloader:Beginning extraction
[INFO] Chromium extracted to: /root/.local/share/pyppeteer/local-chromium/1181205
INFO:pyppeteer.chromium_downloader:Chromium extracted to: /root/.local/share/pyppeteer/local-chromium/1181205


Wrote: bill_diff_full_report_v3.pdf
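
The PDF cell builds its `file://` URL by string concatenation, which works for the report filename used here. If the notebook's working directory ever contains spaces or other special characters, the URL would need percent-encoding. A small sketch of that using `pathlib`; the helper name `file_url` is hypothetical.

```python
from pathlib import Path


def file_url(path):
    """Return an absolute, percent-encoded file:// URL for a local path."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


print(file_url("bill_diff_full_report_v3.html"))
print(file_url("my report.html"))
```

The result could be passed to `page.goto(...)` in place of the hand-built string.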


In [11]:

# --- Mini dashboards (Plotly) ---
import plotly.express as px
import pandas as pd

def collect_pair_records():
  rows=[]
  for (a,b) in [("v1","v2"), ("v2","v3"), ("v1","v3")]:
    for p in ALL_PATHS:
      old, new, labels, rationales, risks, impacts = compare_section(p, a, b)
      rows.append({
        "pair": f"{VERSION_LABELS[a]} -> {VERSION_LABELS[b]}",
        "section": p,
        "risk_count": len(risks),
        "impact_count": len(impacts),
        "labels": labels,
        "naics": [h["naics"] for h in impacts]
      })
  return pd.DataFrame(rows)

df = collect_pair_records()

label_rows = []
for _, r in df.iterrows():
  for l in (r["labels"] or []):
    label_rows.append({"pair": r["pair"], "label": l})
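dfL = pd.DataFrame(label_rows)
if not dfL.empty:
  # Count occurrences explicitly so each grouped bar shows a frequency
  # instead of relying on one implicit bar per row.
  label_counts = dfL.groupby(["pair", "label"]).size().reset_index(name="count")
  fig1 = px.bar(label_counts, x="label", y="count", color="pair", barmode="group", title="Label frequency by pair")
  fig1.show()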

naics_rows = []
for _, r in df.iterrows():
  for n in (r["naics"] or []):
    naics_rows.append({"pair": r["pair"], "naics": n})
dfN = pd.DataFrame(naics_rows)
if not dfN.empty:
  fig2 = px.bar(dfN, x="naics", color="pair", title="NAICS mentions by pair")
  fig2.update_layout(xaxis_tickangle=35)
  fig2.show()

fig3 = px.scatter(df, x="impact_count", y="risk_count", color="pair", hover_data=["section"], title="Risk vs Impact counts")
fig3.show()
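

The label and NAICS charts are built by flattening list-valued columns into one row per occurrence. The same tally can be sketched without pandas or Plotly, which is handy for checking a chart against raw counts. The function name `label_frequency` and the sample label names below are made up for illustration; the record shape mirrors the dicts built in `collect_pair_records()`.

```python
from collections import Counter


def label_frequency(records):
    """Count (pair, label) occurrences across records with list-valued 'labels'."""
    counts = Counter()
    for rec in records:
        for label in rec.get("labels") or []:
            counts[(rec["pair"], label)] += 1
    return counts


# Illustrative records only; real ones come from compare_section().
sample = [
    {"pair": "A -> B", "labels": ["scope", "deadline"]},
    {"pair": "A -> B", "labels": ["scope"]},
    {"pair": "B -> C", "labels": []},
]
for (pair, label), n in sorted(label_frequency(sample).items()):
    print(pair, label, n)
```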


In [12]:

# --- Optional LLM deep reasoning (OpenAI / Anthropic) ---
import os, textwrap

USE_LLM = False  # set True to enable

def llm_deep_reason(old_text, new_text, provider="openai", model=None, max_tokens=300):
  prompt = textwrap.dedent(f"""
  You are a senior legislative analyst. Compare the PREVIOUS and NEW text.
  1) What changed in meaning (plain English, 3–5 bullets)?
  2) Who is newly in/out of scope?
  3) Compliance/oversight implications (2–4 bullets)?
  4) Potential effects for businesses (2–4 bullets)?

  PREVIOUS:
  {old_text or ''}

  NEW:
  {new_text or ''}
  """)

  if not USE_LLM:
    return "LLM disabled.", []

  if provider == "openai":
    key = os.getenv("OPENAI_API_KEY")
    if not key: return "(OpenAI key missing — set OPENAI_API_KEY)", []
    try:
      from openai import OpenAI
      client = OpenAI(api_key=key)
      mdl = model or "gpt-4o-mini"
      resp = client.chat.completions.create(
        model=mdl,
        messages=[{"role":"user","content":prompt}],
        temperature=0.2,
        max_tokens=max_tokens
      )
      text = resp.choices[0].message.content.strip()
      bullets = [b.strip("-• ").strip() for b in text.split("\n") if b.strip()]
      return text, bullets[:8]
    except Exception as e:
      return f"(OpenAI error: {e})", []

  elif provider == "anthropic":
    key = os.getenv("ANTHROPIC_API_KEY")
    if not key: return "(Anthropic key missing — set ANTHROPIC_API_KEY)", []
    try:
      import anthropic
      client = anthropic.Anthropic(api_key=key)
      mdl = model or "claude-3-5-sonnet-20241022"
      resp = client.messages.create(
        model=mdl,
        max_tokens=max_tokens,
        temperature=0.2,
        messages=[{"role":"user","content":prompt}]
      )
      text = resp.content[0].text.strip()
      bullets = [b.strip("-• ").strip() for b in text.split("\n") if b.strip()]
      return text, bullets[:8]
    except Exception as e:
      return f"(Anthropic error: {e})", []

  else:
    return "(Unknown provider)", []

print("LLM integration available. Toggle USE_LLM=True and set API keys to enable.")


LLM integration available. Toggle USE_LLM=True and set API keys to enable.
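
`llm_deep_reason` turns the model's reply into bullets by stripping `-`, `•`, and spaces from each non-empty line. Because the prompt asks numbered questions, replies may also carry markers such as `1)` or `2.`, which that approach leaves in place. A hedged sketch of a slightly more tolerant parser follows; `extract_bullets` is a hypothetical helper, not part of the notebook.

```python
import re

# A list marker is "-", "*", "•", or digits followed by "." or ")",
# and must be followed by whitespace to count as a marker.
_BULLET_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def extract_bullets(text, limit=8):
    """Split LLM output into cleaned bullet strings, dropping list markers."""
    bullets = []
    for line in text.splitlines():
        cleaned = _BULLET_PREFIX.sub("", line).strip()
        if cleaned:
            bullets.append(cleaned)
    return bullets[:limit]


reply = "1) Scope widened\n- Fines added\n\n• New audits\n2. Later effective date"
print(extract_bullets(reply))
```

Requiring whitespace after the marker keeps text such as `**Heading**` intact, at the cost of not stripping markers written without a space.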


In [13]:

# --- Download helpers (Colab) ---
from IPython.display import HTML
import os
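def offer_downloads():
  # Offer each generated artifact as a browser download; only works inside Colab.
  try:
    from google.colab import files
    for f in ["bill_diff_full_report_v3.html", "bill_diff_pairs_summary_v3.csv", "bill_diff_full_report_v3.pdf"]:
      if os.path.exists(f):
        files.download(f)
  except Exception as e:
    print("Download skipped:", e)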

print("Use offer_downloads() after running the report cell to download files.")


Use offer_downloads() after running the report cell to download files.
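
`offer_downloads()` triggers one browser download per file. If a single download is more convenient, the outputs can be bundled first. A minimal sketch using the standard library's `zipfile`; the function name `bundle_outputs` and the archive name `bill_diff_outputs.zip` are assumptions made for illustration.

```python
import os
import zipfile


def bundle_outputs(paths, zip_path="bill_diff_outputs.zip"):
    """Zip whichever of the given files exist; return the archived names."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            if os.path.exists(path):
                name = os.path.basename(path)
                zf.write(path, arcname=name)
                added.append(name)
    return added


print(bundle_outputs([
    "bill_diff_full_report_v3.html",
    "bill_diff_pairs_summary_v3.csv",
    "bill_diff_full_report_v3.pdf",
]))
```

In Colab the resulting archive could then be handed to `files.download(...)` in the same way as the individual files.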
