#### Notebook Structure

**Part I: Compliance Copilot Report**  
This section is a concise strategic report covering the proposal’s justification, solution overview, key results discussion, identified challenges, conclusion and next steps.

**Part II: Compliance Copilot Demo**  
This section walks through the RAG pipeline steps, prints example outputs, and finally launches the terminal-based Copilot for live testing with a set of curated Q&A examples.

**README.md**  
Contains full technical details on installing dependencies, running the scalable pipeline code, and other execution instructions.


## Part I: Compliance Copilot Report



### 1. Introduction

This section presents the **business rationale**, **strategic justification**, and **solution proposal** for the **Dow Jones AI Compliance Copilot**, a tool designed to **add an intelligence layer** to compliance data that Dow Jones already licenses, processes, and delivers to B2B clients. By embedding these datasets into a RAG-powered Q&A interface, clients can:

- Receive instant, source-cited answers in natural language
- Save dozens of weekly hours on due diligence
- Reduce the risk of missing critical insights
- Make faster, more accurate decisions

Financially, this intelligence layer can **significantly increase the perceived value** of licensed data, unlocking new upsell and recurring revenue opportunities.

> **Note:** In this POC, some answers may seem “obvious” or easily found with a web search, which can give the impression of limited value. In a production scenario, however, the Copilot would be powered by Dow Jones’ **exclusive, proprietary data**—meaning every conversational response delivered is unique to the customer’s licensed content and cannot be replicated by general-purpose search engines. 

### 2. Strategic Justification & Market Opportunity

#### 2.1 Why B2B Compliance First?
- The **Governance, Risk & Compliance (GRC)** market is valued at **$62.9 billion** in 2024 with a **13.2% CAGR** through 2030 [1].
- The **RegTech** subsector is projected to grow from **$4.7 billion** in 2024 to **$29 billion** by 2034 at a **20% CAGR** [2].
- B2B compliance solutions command **ARPU of $50K–$500K/year**, far above B2C offerings [3].

Focusing initially on corporate compliance clients maximizes ROI, tapping a fast-growing market with buyers willing to invest in high-value solutions.

#### 2.2 Existing Dow Jones Offerings
- **RiskFeeds**: Structured feeds (sanctions, PEPs, adverse media) in **XML, CSV, JSON** formats for screening and reporting [4].
- **Integrity Check (with Xapien)**: Entity due diligence reports based on entity name and country, cutting analysis from **days to minutes** [5].

**How the Copilot Differs**:
- **Complementary** to RiskFeeds and Integrity Check, adding open-ended Q&A across all licensed data.
- Returns **source-cited answers** instead of fixed reports or raw feeds.
- Provides **ad hoc insights** to support immediate decision-making, beyond standard outputs.

### 3. Solution Proposal

#### 3.1 Data Acquisition
- **Production**: Clients continue to license **proprietary Dow Jones data** (sanctions, PEPs, filings, adverse media) via XML/CSV/JSON and APIs.
- **POC**: For this proof of concept, **publicly sourced documents** were extracted and processed from the internet to simulate the same data structure.

#### 3.2 Pipeline
```plaintext
1. Licensed Dow Jones datasets
2. Ingestion & Parsing (PDF, CSV/XML, HTML)
3. Vector Indexing (embeddings)
4. Agent AI + LLM (semantic search + generation)
5. Source-cited answers in Q&A interface
```

#### 3.3 Key POC Components
- **RAG**: Combines vector search with LLM for grounded responses.
- **Prompt Engineering**: Domain-specific prompt tuning.
- **Model Flexibility**: Support for GPT-4, Claude, and on-premise Ollama models.

#### 3.4 README & Usage Instructions
The **README** in the repository covers:
- **Installation & Setup**: Virtual environment, dependencies, local models.
- **Pipeline Execution**: Document download (`docs/fetch_data.py`), semantic index build (`src/build_index.py`), CLI Q&A (`main.py`, `backend.py`).
- **Notebook Demo**: `compliance_demo_report.ipynb` walks through interactive tests with example questions and source-cited answers.

#### 3.5 Test Questionnaire
Because local indexing was resource-intensive and only a small document set was indexed, the file **`compliance_suggestion_questions.md`** includes curated test questions referencing the indexed context. This accelerated testing by covering scenarios such as:
- Entity mention lookups in sanctions and reports
- Queries on recent regulatory recommendations
- Verifications of company-risk alerts 

### 4. Next Steps
1. **Infrastructure**: Migrate local storage to AWS S3/Azure Blob/GCP Buckets.
2. **OCR Integration**: Add Tesseract or Amazon Textract for scanned PDFs.
3. **Structured Validation**: Define precision/recall metrics and gather expert feedback.
4. **Deployments**: REST API (FastAPI), front-end (React/Streamlit), cloud deployment (Vercel, Hugging Face).
5. **Multi-tenant & Customization**: Client-specific histories and permissions.
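
For the structured validation step, one possible starting point is precision and recall over the top-k retrieved documents. The sketch below is illustrative only: the function name `precision_recall_at_k` and the relevance labels in the example are hypothetical, and real labels would come from compliance experts.

```python
from typing import Iterable, Set, Tuple


def precision_recall_at_k(
    retrieved: Iterable[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one query."""
    top_k = list(retrieved)[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and expert labels for a single ransomware question
retrieved = [
    "fincen_ransomware_advisory.pdf",
    "apple_10k_2022.txt",
    "fincen_alert_russian_elites_2022.pdf",
]
relevant = {"fincen_ransomware_advisory.pdf", "ofac_sdn_list.csv"}

p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={p:.2f} recall@3={r:.2f}")  # precision@3=0.33 recall@3=0.50
```

Averaging these two numbers over a labeled question set would give a simple baseline to track as prompts, chunking, or models change.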

### 5. Conclusion
The **Dow Jones AI Compliance Copilot** introduces a conversational intelligence layer atop already licensed data, delivering rapid, cited, and contextual insights. This evolution drives **new recurring revenue** and **higher margins**, and **strengthens Dow Jones' leadership** in the RegTech space.

### References
1. Grand View Research, “eGRC Market Size, Share & Trends Analysis Report,” 2024.
2. Mordor Intelligence, “RegTech Market Forecast to 2034.”
3. Gartner, “Enterprise GRC Buyer Insights,” 2023.
4. The Wealth Mosaic, “Dow Jones RiskFeeds Overview.”
5. Xapien & Dow Jones Partnership Announcement, 2022.

----------------------------------------------------------------------

## Part II: Compliance Copilot Demo

### 1. Best Practices for Demo Usage

- **Ask objective, specific questions** related to compliance.  
- **Avoid vague or overly broad questions.**  
- **Verify any uncertain or ambiguous answers.** The system is restricted to indexed content and configured to minimize hallucinations, but it can still extrapolate beyond the sources.  
- **Consult `compliance_suggestion_questions.md`** for a sample of supported questions.  

_In this demo notebook, you will first see a concise overview of the pipeline steps and outputs. After that, you’ll test the Copilot interactively via the terminal interface._ 

### 2. Load and Preview Raw Documents

The `docs/fetch_data.py` script simulates how compliance documents would be ingested for the POC by:

1. **Downloading** various public filings, advisories, and sanction lists (SEC, OFAC, FATF, FinCEN, EU) into `data/raw`.  
2. **Handling failures** (403/404) by adding files manually when needed.  
3. **Note**: In a real deployment, documents (PDF, CSV, TXT, etc.) would be loaded directly from authenticated internal storage or cloud buckets, not scraped from public websites.
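
The download-and-fallback behavior described above can be sketched roughly as follows. This is not the actual contents of `docs/fetch_data.py`: the names `http_get` and `fetch_documents`, the `User-Agent` string, and the injectable `downloader` argument are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, List
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_get(url: str) -> bytes:
    """Download a URL and return its body (raises HTTPError on 403/404)."""
    request = Request(url, headers={"User-Agent": "compliance-poc/0.1"})
    with urlopen(request, timeout=30) as response:
        return response.read()


def fetch_documents(
    sources: Dict[str, str],
    dest_dir: str = "data/raw",
    downloader: Callable[[str], bytes] = http_get,
) -> List[str]:
    """Save each source into dest_dir; return filenames that must be added manually."""
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)
    failed = []
    for filename, url in sources.items():
        try:
            (target / filename).write_bytes(downloader(url))
        except HTTPError as err:
            # Blocked or missing files (e.g. 403/404) are reported, not fatal.
            print(f"[skip] {filename}: HTTP {err.code}")
            failed.append(filename)
    return failed
```

Returning the list of failed filenames makes the manual step explicit: anything in that list has to be placed in `data/raw` by hand before indexing. Network errors other than HTTP status errors are deliberately left unhandled in this sketch.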

In [1]:
```python
import os

RAW_PATH = "data/raw"
print(" === Raw files: ===")
# Sort the listing so the output order is stable across platforms
for fname in sorted(os.listdir(RAW_PATH)):
    print("-", fname)
```

 === Raw files: ===
- apple_10k_2022.txt
- eu_annex2_sanctions.pdf
- fatf_annual_report_2022-2023.pdf
- fatf_assessment_methodology_2022.pdf
- fatf_effectiveness_compliance_report_2022.pdf
- fatf_procedures_mutual_evaluations_2022.pdf
- fatf_universal_procedures.pdf
- fincen_advisory_corruption_2022.pdf
- fincen_advisory_elder_exploitation_2022.pdf
- fincen_alert_pig_butchering_2023.pdf
- fincen_alert_russian_elites_2022.pdf
- fincen_ransomware_advisory.pdf
- microsoft_10k_2022.txt
- ofac_sdn_list.csv


### 3. Embedding and FAISS Index

The `build_vector_store` function (in lieu of rerunning the full `build_index.py` pipeline) allows you to update the FAISS index with any new or changed documents:

1. **Load documents** from `data/raw` using `load_documents`.  
2. **Embed and store** them in `data/index` via `embed_and_store` from `embedder.py`.  



The `embed_and_store` function handles document chunking, embedding, and FAISS index creation:

1. **Chunking**: Splits each document into 512-char chunks (no overlap) using `RecursiveCharacterTextSplitter`.  
2. **Embedding Model**: Uses `sentence-transformers/paraphrase-albert-small-v2` (normalized embeddings via `HuggingFaceEmbeddings`) for efficient, local inference.  
3. **Index Creation**: Builds a FAISS index from the chunked texts and associated metadata.  
4. **Persistence**: Saves the FAISS index to `data/index` for fast retrieval.
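
To make the chunking step concrete, here is a simplified, dependency-free sketch of fixed-size chunking without overlap. It is not `RecursiveCharacterTextSplitter` itself: that splitter prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than 512 characters. The function name `chunk_text` and the example path are hypothetical.

```python
from typing import Dict, List


def chunk_text(text: str, source: str, chunk_size: int = 512) -> List[Dict[str, str]]:
    """Split text into consecutive, non-overlapping chunks of at most chunk_size chars."""
    chunks = []
    for start in range(0, len(text), chunk_size):
        # Each chunk keeps its source so answers can cite the original file.
        chunks.append({"text": text[start:start + chunk_size], "source": source})
    return chunks


sample = "x" * 1200
pieces = chunk_text(sample, source="data/raw/example.txt")
print([len(p["text"]) for p in pieces])  # [512, 512, 176]
```

Carrying the `source` field alongside each chunk is what later allows the index inspection and the Q&A chain to report where a passage came from.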

> **Attention:** This process recreates the index in `data/index`.  
> Do **not** rerun unless you have added or modified documents in `data/raw`; otherwise you may overwrite your existing index.
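
One way to decide whether a rebuild is warranted is to compare file modification times. The helper below is a sketch under that assumption; `latest_mtime` and `index_is_stale` are hypothetical names and are not part of the repository.

```python
from pathlib import Path


def latest_mtime(directory: str) -> float:
    """Return the newest modification time among files under directory (0.0 if none)."""
    root = Path(directory)
    if not root.is_dir():
        return 0.0
    times = [p.stat().st_mtime for p in root.rglob("*") if p.is_file()]
    return max(times, default=0.0)


def index_is_stale(raw_dir: str = "data/raw", index_dir: str = "data/index") -> bool:
    """True when the index is missing or empty, or any raw file is newer than it."""
    index_time = latest_mtime(index_dir)
    if index_time == 0.0:
        return True
    return latest_mtime(raw_dir) > index_time
```

A notebook cell could call `index_is_stale()` first and rebuild only when it returns `True`. Note that this check does not detect deleted raw files, since a deletion leaves no newer timestamp behind.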

In [None]:
```python
# from src.build_index import main as build_index

# build_index() recreates the index in the data/index directory.
# Rerun it only after adding or changing documents in data/raw.
# build_index()
```

### 4. Inspect FAISS Index Contents

The `inspect_index` function lets you peek “under the hood” of your FAISS vector store by:
1. Loading the same embedding model used for indexing (`paraphrase-albert-small-v2` with normalized embeddings).
2. Deserializing the FAISS index from `data/index`.
3. Iterating over each document in the internal store to build a summary:
   - **doc_id**: 1-based counter  
   - **source**: original filename or URL metadata (or `"unknown"`)  
   - **preview**: first ~300 characters of the document text
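
The summary-building step can be illustrated without FAISS at all. In the sketch below, the stored entries are passed in as plain `(text, metadata)` pairs, which is an assumption about the shape of the data; `summarize_chunks` is a hypothetical name and not the actual `inspect_index` implementation.

```python
from typing import Dict, Iterable, List, Tuple


def summarize_chunks(
    chunks: Iterable[Tuple[str, Dict[str, str]]], preview_chars: int = 300
) -> List[Dict[str, object]]:
    """Build one summary per stored entry: 1-based id, source, and text preview."""
    summaries = []
    for doc_id, (text, metadata) in enumerate(chunks, start=1):
        summaries.append(
            {
                "doc_id": doc_id,
                # Fall back to "unknown" when no source metadata was stored.
                "source": metadata.get("source", "unknown"),
                "preview": text[:preview_chars],
            }
        )
    return summaries


stored = [
    ("Annex II - Sanctions-related commitments", {"source": "eu_annex2_sanctions.pdf"}),
    ("A chunk that was stored without metadata", {}),
]
for row in summarize_chunks(stored, preview_chars=20):
    print(row)
```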


In [3]:
```python
from inspect_index import inspect_index

# Load all summaries but display only the first three to keep the output concise;
# the index contains many more entries.
summaries = inspect_index("data/index")
summaries[:3]
```

[{'doc_id': 1,
  'source': 'data\\raw\\eu_annex2_sanctions.pdf',
  'preview': '1 \nAnnex II – Sanctions-related commitments   \n \nThe sequence of implementation of the commitments d etailed in this Annex is \nspecified in Annex V (Implementation Plan) to this Joint Comprehensive Plan of \nAction (JCPOA). \n  \n \nA.  European Union 1 \n \n1.  The EU and EU Member States commit to termi'},
 {'doc_id': 2,
  'source': 'data\\raw\\eu_annex2_sanctions.pdf',
  'preview': 'specified in Sections 1.1-1.10 below, to terminate all provisions of \nCouncil Decision 2010/413/CFSP (as subsequently ame nded), as \nspecified in Sections 1.1-1.10 below, and to termin ate or amend \nnational implementing legislation as required, in a ccordance with \nAnnex V:  \n \n1.1. \n Financial, ba'},
 {'doc_id': 3,
  'source': 'data\\raw\\eu_annex2_sanctions.pdf',
  'preview': '30a, 30b and 31 of Council Regulation (EU) No 267/2 012); \n \n1.1.2. \n Sanctions on banking activities (Article 11 of Coun cil Decision 

### 5. Run a Sample QA Interaction

The `backend.py` module defines how to construct and run the RAG-powered Q&A chain:

1. **QA_PROMPT**  
   - Custom prompt template that instructs the model to answer using only the provided context and to return a polite fallback if no answer is found.

2. **build_qa_chain(index_path: str)**  
   - **Loads** the FAISS index from `index_path` with the same embedding model (`paraphrase-albert-small-v2`).  
   - **Initializes** the local Ollama LLM (`tinyllama`) with controlled temperature, top-p, and repeat penalty.  
   - **Creates** a `RetrievalQA` chain that retrieves relevant passages and generates answers via the prompt.

3. **run_qa_app(index_path: str)**  
   - Starts a **terminal-based loop** that prompts the user for questions, invokes the QA chain, prints answers, and exits on “exit”/“quit”.
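
The retrieve-then-generate flow can be sketched in plain Python. This is a toy stand-in, not the `RetrievalQA` chain in `backend.py`: the embeddings are hand-written two-dimensional vectors, `llm` is any callable that takes a prompt string, and the template wording is an assumption (only the fallback sentence is taken from the demo output below).

```python
import math
from typing import Callable, List, Sequence, Tuple

FALLBACK = "I'm sorry, I couldn't find information about that based on the current documents."

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    'If the context does not contain the answer, reply: "{fallback}"\n\n'
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def retrieve(
    query_vec: Sequence[float], indexed: List[Tuple[Sequence[float], str]], k: int = 2
) -> List[str]:
    """Return the k chunk texts whose embeddings are most similar to the query."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, normalize(v))), text) for v, text in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def answer(
    question: str,
    query_vec: Sequence[float],
    indexed: List[Tuple[Sequence[float], str]],
    llm: Callable[[str], str],
    k: int = 2,
) -> str:
    """Fill the prompt with the retrieved context and pass it to the model."""
    context = "\n---\n".join(retrieve(query_vec, indexed, k))
    prompt = PROMPT_TEMPLATE.format(fallback=FALLBACK, context=context, question=question)
    return llm(prompt)


toy_index = [([1.0, 0.0], "chunk A"), ([0.0, 1.0], "chunk B"), ([0.9, 0.1], "chunk C")]
print(answer("What does chunk A say?", [1.0, 0.0], toy_index, llm=lambda p: p))
```

Passing the identity function as `llm` simply prints the assembled prompt, which is a convenient way to check what context the model would actually see.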

### 6. Using the Copilot: CLI Terminal & Notebook

**1. Terminal Q&A Session**  
Launch the interactive CLI in your terminal (not in a notebook cell); the README gives the exact command.

**Ask:** Type your question at the `>` prompt and press **Enter**.  
**Answer:** You’ll see a source-cited response in the terminal.  
**Exit:** Press **Esc** or type `exit`/`quit`.  
**Tip:** Use the curated questions in `compliance_suggestion_questions.md` or feed questions from `test.txt` as shown below.
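
A minimal question loop along these lines could look like the sketch below. It is not the actual `run_qa_app`: the name `run_cli` and the injectable `read`/`write` callables are assumptions, added so the loop can be exercised without a real terminal. Handling of the **Esc** key is not covered here, since `input()` cannot detect it.

```python
from typing import Callable


def run_cli(
    ask: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> int:
    """Prompt for questions until exit/quit is typed; return how many were answered."""
    answered = 0
    while True:
        try:
            question = read("> ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl-D or Ctrl-C also ends the session
        if question.lower() in {"exit", "quit"}:
            break
        if not question:
            continue  # ignore empty lines
        write(ask(question))
        answered += 1
    return answered
```

With a real chain, `ask` could be something like `lambda q: qa_chain.invoke({"query": q})["result"]`, matching the call pattern used in the notebook cell in this section.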


In [4]:
```python
# Run each question through the QA chain inside the notebook
from backend import build_qa_chain

# Build the question-answering chain from the index in the data/index directory.
qa_chain = build_qa_chain("data/index")

# Read questions from TEST.txt (one per line, blank lines skipped)
with open("TEST.txt", "r", encoding="utf-8") as f:
    questions = [line.strip() for line in f if line.strip()]

for i, q in enumerate(questions, 1):
    print(f"--- Question {i} ---\n{q}\n")
    result = qa_chain.invoke({"query": q})
    print(f"{result['result']}\n")
```

Loading FAISS index from: data/index


=== Starting local LLM via Ollama ===
--- Question 1 ---
1. OLÁ



No, based on the provided context, "I'm sorry, I couldn't find information about that based on the current documents."

--- Question 2 ---
2. How are cryptocurrency mixers used in ransomware schemes?

Answer: Cryptocurrency mixers are used in ransomware schemes to protect victims' anonymity by mixing their payment requests with other transactions, making it difficult for cybercriminals to trace the origin of the funds. Mixers can also be used to conceal the true identity of the victim or to create a false impression that the payment request is legitimate.

--- Question 3 ---
3. What are the risks associated with cryptocurrency mixers?

The provided context mentions that the DOJ has built relationships with regulatory and enforcement partners both within the US government and around the world, and outlines their response strategies to the use of cryptocurrency as a payment method by bad actors to facilitate ransom and blackmail. The publication produced by the Attorney General’s Cyber-D


**Discussion & Tuning**  
- The **fallback response** to “OLÁ” (Portuguese for “hello”) shows that the configured prompt and chain guard against hallucinations on off-topic input.  
- **Iterative refinement:** repeating Question 2 as Question 4 produces a slightly different (and often clearer) formulation.  
- **Model parameters** (temperature, top_p, repeat_penalty, etc.) directly influence answer style:
  - **Higher temperature** → more varied, creative responses (but possibly less precise).  
  - **Lower temperature** → more deterministic, focused answers.  
  - **top_p** controls nucleus sampling; **repeat_penalty** discourages verbatim repetition.  

> Tweak these settings in `Ollama(...)` or your chosen LLM config to dial in the desired balance between creativity and accuracy.
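
To build intuition for temperature and top_p, the toy example below applies both to a hand-made three-token distribution. It illustrates the general sampling technique only and says nothing about how Ollama implements it internally; the token names, logits, and function names are invented for the example, and `repeat_penalty` is not covered.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


logits = {"sanction": 2.0, "penalty": 1.0, "banana": -1.0}
for t in (0.2, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
print(top_p_filter(apply_temperature(logits, 1.0), top_p=0.9))
```

At low temperature almost all probability lands on the top token, while the nucleus filter drops the unlikely tail entirely, which is why the two settings together tend to make answers more focused.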

**Real-World Front-End Behavior**  
- In a production UI, users would **enter one question at a time**, not batch via a `.txt`.  
- The system would respond **in real time**, rendering answers as:
  - **Free-text paragraphs**  
  - **Structured tables** or lists  
  - **Citations** linking back to original documents  
- A front end would also **store conversation history**, allowing follow-up queries and context-aware interactions.  
- Users could provide **feedback** on individual answers, enabling continuous improvement and re-tuning of prompts or model parameters.

---

**Conclusion**  
Batch testing with `test.txt` is a convenient POC shortcut. In a full deployment, the Copilot would run interactively in a web app or chat interface—one question, one answer at a time—complete with real-time rendering, history tracking, and dynamic parameter tuning.
