 ### PolicyPulse

PolicyPulse:

PolicyPulse is an AI-driven system designed to streamline the privacy policy compliance process. The project leverages a Retrieval-Augmented Generation (RAG) pipeline to analyze Financial company privacy policies (OPP Corpus)  and determine their compliance with regulations such as GDPR and CCPA. The system allows users to upload privacy policies, which are then parsed into segments. Using a combination of large language models (LLMs) and a knowledge base of regulatory documents, the system can answer compliance-related questions and flag areas where a policy may not fully adhere to the required standards.

Key features:

1. Privacy Policy Fragmentation: Uploaded policies are broken down into manageable segments that can be individually analyzed.

2. Retrieval-Augmented Generation (RAG): The system retrieves relevant GDPR/CCPA guidelines and compares them with the uploaded policy segments to assess compliance.

3. Dynamic Q&A: Users can ask specific compliance questions, and the system will provide responses based on both the policy content and regulatory requirements.

4. Compliance Reporting: PolicyPulse generates a summary of compliance status, highlighting any gaps or missing information that need to be addressed.

5. Targeted Feedback: The system offers recommendations on how to make the policy compliant with GDPR/CCPA regulations.


*  To implement a RAG system using LangChain
*  Be able to formulate metric(s) that you may want to choose as your evaluation to what degree your system replicates gold answers (labeled data) that we will provide.
* Try out various hyper-parameters and settings to see which configuration works the best (given your chosen metric)  
* Write a comprehensive evaluation, which also includes risks and limitations (and a lot more)



Base RAG components

  2.1 Text Embeddings    
  2.2 Text Chunking   
  2.3 The Vector DB & Semantic Search  
  2.4 The Language Model   
  2.5 Testing the LLM in a LangChain Chain   
  2.6. Setting up a simple RAG Chain     

In [1]:
import logging
import openai
import os
logging.basicConfig(level=logging.INFO)
from google.colab import userdata

openai_api_key = userdata.get('OPENAI_API_KEY')

os.environ["OPENAI_API_KEY"] = openai_api_key
!pip install openai==0.28.0
from google.colab import userdata

openai_api_key = userdata.get('OPENAI_API_KEY')
openai.api_key = openai_api_key



In [2]:
!pip install qdrant-client


Collecting qdrant-client
  Downloading qdrant_client-1.12.1-py3-none-any.whl.metadata (10 kB)
Collecting grpcio-tools>=1.41.0 (from qdrant-client)
  Downloading grpcio_tools-1.67.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-tools>=1.41.0->qdrant-client)
  Downloading protobuf-5.28.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Collecting grpcio>=1.41.0 (from qdrant-client)
  Downloading grpcio-1.67.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting h2<5,>=3 (from httpx[http2]>=0.20.0->qdrant-client)
  Downloading h2-4.1.0-py3-none-any.whl.metadata (3.6 kB)
Collecting hyperframe<7,>=6.0 (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant-client)
  Downloading hyperframe-6.0.1-py3-none-any.whl.metadata (2.7 kB)
Collecting hpack<5,>=4.0 (fro

In [3]:
%%capture
!pip -q install git+https://github.com/huggingface/transformers
!pip install -q datasets loralib sentencepiece
!pip -q install bitsandbytes accelerate
!pip -q install langchain
!pip install einops
!pip install faiss-gpu
!pip install --upgrade --quiet  langchain-community chromadb bs4 qdrant-client
!pip install langchainhub

!pip install --upgrade --quiet  wikipedia
!pip install --upgrade --quiet  arxiv
!pip install --upgrade --quiet  pymupdf

!pip install xmltodict

!pip install cohere

!pip install unstructured
!pip install PyMuPDF
!pip install pdfminer.six


In [4]:
import torch
import os
import bs4
import json
import numpy as np
import time
import pandas as pd

from pprint import pprint

import locale

from transformers import AutoTokenizer , AutoModelForCausalLM
from transformers import pipeline, BitsAndBytesConfig

from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import Qdrant
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.utils.math import cosine_similarity

from langchain_community.document_loaders import ArxivLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import WikipediaLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import PubMedLoader

from langchain_community.chat_models import ChatCohere

from langchain.document_loaders import PyMuPDFLoader


from google.colab import userdata
from google.colab import drive
import re
drive.mount('/content/drive')




Mounted at /content/drive


In [5]:
locale.getpreferredencoding = lambda: "UTF-8"
from google.colab import userdata

In [6]:
%%capture
!pip install sentence_transformers
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

In [7]:
locale.getpreferredencoding = lambda: "UTF-8"

In [8]:
%%capture
!pip install sentence_transformers

Add your keys from the secret store (do **NOT** print them out or leave them exposed as plaintext in your notebook!):

In [9]:
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

## 2. Building the Components of our RAG System

Let us introduce and test the base components of our RAG system. We will largely use the Hugging Face and LangChan libraries.



### 2.1 The Embedding Model

We will need to represent text (pieces) as vectors. For this, we will use the [sentence_transformer]() architecture.



**NOTE:** The models you can use are: 'all-mpnet-base-v2', 'all-MiniLM-L6-v2', 'multi-qa-mpnet-base-dot-v1', 'all-distilroberta-v1', and 'avsolatorio/GIST-Embedding-v0'



In [10]:
%%capture
base_embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")

In [None]:
text = "This is a test document."
query_result = base_embeddings.embed_query(text)
print(f'Embedding dimension: {len(query_result)}')

doc_result = base_embeddings.embed_documents(["Germany won the World Cup 4 times.", "This is not a test document."])
len(doc_result)

Embedding dimension: 768


2

Do those dimensions look correct?

Now lets see if the embedding model is working as we want.  Ideally our embeddings go beyond shared words and capture the underlying meaning.

In [None]:
#Let's see how well our embeddng model works
similarity = cosine_similarity([query_result], doc_result)[0]

similarity

array([0.21410193, 0.94317828])

That's how you should define your embedding models.

Next, we turn to text chunks.

### 2.2. Loading and Chunking Texts

We first need to load the documents. Here is an example:

In [None]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)

documents = loader.load()

We will need to split the  text in chunks that are 'suitable' as retrieval units. Let's for starters define a chunk size of 128 and have no overlap between the chunks:  


In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)
splits = text_splitter.split_documents(documents)
print('Number of splits/chunks: ', str(len(splits)))

Number of splits/chunks:  444


Ok, so it looks like we have now many splits (chunks) from one document. Here is how you can get the content:

In [None]:
splits[39].page_content

'correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.'

Perfect. Now we have the splits and embeddings. Next, the embeddings need to be stored in a vector db.

### 2.3 Storing the Embeddings of Chunks in Vectorstores

After loading and chunking the data, we need to save the vector representations of the chunks in a vectorstore. We will use Qdrant here for simplicity. We load the splits (structured chunks) and the embeddings:

In [None]:
# vectorstore = Qdrant.from_documents(splits,
#     base_embeddings,
#     location=":memory:",  # Local mode with in-memory storage only
#     collection_name="test",
# )
# retriever = vectorstore.as_retriever()

The nice thing is that the vector store also does the similarity searches for us:

In [None]:
query = "What is Chain of Thought doing?"
docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits

In [None]:
docs

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': 'a57f9ab6c79a4d41b7ed725844609340', '_collection_name': 'test'}, page_content='the model’s thinking process.'),
 Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': 'd02203ea2b8546b2a961eb90257c3d98', '_collection_name': 'test'}, page_content='[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022'),
 Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': 'd66e8e75ab6d4a00b69d3aa8c2011324', '_collection_name': 'test'}, page_content='the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process'),
 Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', '_id': 'f2638e51e2744c158f68d47bdd353dea', '_collection_name': 'test'}, page_content='Chain of thought (CoT; Wei et al. 2022) has 

Looks good! We have an ordered list of documents that seem to relate to the question. That is what we need.

The last major component is the actual LLM.

### 2.4. The LLM

We will use one Open Source Model ("mistralai/Mistral-7B-Instruct-v0.1") and one Proprietery Model (Cohere) for our tests. Let's first set up the OS model:

In [None]:
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in yo

In [None]:
%%capture

quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         )
llm_mistral_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float32,
    device_map='auto',
    quantization_config=quantization_config
)
llm_mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

We use the model first to generate a Hugging Face pipeline. A pipeline simplifies the process of actually generating responses.

In [None]:
mistral_pipe = pipeline(
    "text-generation",
    model=llm_mistral_model,
    tokenizer=llm_mistral_tokenizer,
    max_new_tokens=1000,
    temperature=0.55,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.3
)
mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

    # wrapping the Hugging Face pipeline into a LangChain object

Does it work?

In [None]:
mistral_pipe("[INST]Give me a two-sentence story about an apple![/INST]")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': '[INST]Give me a two-sentence story about an apple![/INST]Once upon a time, in the heart of an enchanted orchard, hung an ethereal golden apple. Whispers claimed it granted eternal youth to those who tasted its forbidden sweetness.'}]

Reasonable!

We will also use a Cohere model, but will create this below as part of the LangChain framework.

### 2.5 Testing the LLM in a LangChain Chain

Chains will be defined and discussed in Week 11. In short, they are convenient programmatic ways to deal with 'chains' of actions that involve LLMs. For example, a list of events like 'here is a city name. Plug that city name into prompt template, then generate a story about that city. Lastly, format the model output as a string' can be easily handled by LangChain's Chain framework. In this case, the Chain would consist of the prompt template, the LLM, and the String Formatter. The parameter (the city in this case) will be provided at run time by invocation of the Chain. Let's test that.

To use a Hugging Face model in a LangChain environment, we need to wrap the model into a LangChain pipeline object:

In [None]:
mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

  mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)


Next, we need to define a template and create a corresponding prompt template that can take any questiion

In [None]:
test_llm_template = """[INST] Give me a two-sentence story about an {object}! [/INST]"""
test_llm_prompt_template = PromptTemplate(template=test_llm_template, input_variables=["object"])



Let's define a Chain, a static flow of actions that (usually) involve at least a definition of the variables used in the chain, one or more templates, LLM step(s) and potentially other actions. This would be a chain that declares the variable 'object' to be expected when the chain is invoked, then inserts it into the template, and passes this to our mistral model pipeline (wrapped as a LangChain object):    

In [None]:
test_llm_chain_short = (
    {"object": RunnablePassthrough()}
    | test_llm_prompt_template
    | mistral_llm_lc
)

In [None]:
test_llm_chain_short.invoke('apple')

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


"[INST] Give me a two-sentence story about an apple! [/INST] In the heart of an old orchard, an apple hung ripe and red on the tree's branches. A young boy climbed up its sturdy trunk, his eyes shining with delight as he plucked the juicy fruit from its home."

Works too. We will use this notation moving forward.

Next, how would we do this with a Cohere Chat Model instead of Mistral?

In [None]:
cohere_chat_model = ChatCohere(cohere_api_key=COHERE_API_KEY)

  cohere_chat_model = ChatCohere(cohere_api_key=COHERE_API_KEY)


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


This can be plugged straight into the Chain:

In [None]:
test_cohere_llm_chain_short = (
    {"object": RunnablePassthrough()}
    | test_llm_prompt_template
    | cohere_chat_model
)

In [None]:
# #test_cohere_llm_chain_short.invoke('apple')
# response = test_cohere_llm_chain_short.invoke('apple')
# print(dir(response))


Works! (Note: you may want to review the format of the template. The one we used here is the one from Mistral, and the format may or may not be optimal for Cohere.)

How can we get the output formatting under control? We can add a String Formatter to the chain:


In [None]:
# output_parser = StrOutputParser()

# test_cohere_llm_chain_short_formatted = (
#     {"object": RunnablePassthrough()}
#     | test_llm_prompt_template
#     | cohere_chat_model
#     | output_parser
# )

# test_cohere_llm_chain_short_formatted.invoke('apple')

### 2.6 Setting Up a Simple RAG Chain

For RAG, we will follow the same approach. Except... you will **later** need to change the chain to include the retrieval step.

We first do a simple test: create a RAG template that takes a question and a pre-defined context as input, and generates the answer based on the provided context:

In [None]:
output_parser = StrOutputParser()

rag_template = """[INST] Answer the question based only on the following context:
{context}

Question: {question}
[/INST]
"""
rag_prompt_template = ChatPromptTemplate.from_template(rag_template)

base_rag_chain =(
    {"context": RunnablePassthrough(),
     "question": RunnablePassthrough()}
    | rag_prompt_template
    | mistral_llm_lc
    | output_parser
)

predefined_context = "Germany has won the World Cup 4 times."
question = "How many times did Germany win the world cup?"

resp = base_rag_chain.invoke({'context': predefined_context,
                           'question': question})
print(resp)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Human: [INST] Answer the question based only on the following context:
{'context': 'Germany has won the World Cup 4 times.', 'question': 'How many times did Germany win the world cup?'}

Question: {'context': 'Germany has won the World Cup 4 times.', 'question': 'How many times did Germany win the world cup?'}
[/INST]
Answer: The answer to your query is consistent with the provided context. In this case, the context states that "Germany has won the World Cup 4 times," so my response would be: Germany has won the World Cup a total of 4 times according to the given information.


That's great. But of course, the context needs to be created in an earlier retrieval step. More precisely, the documents will be first retrieved as a list, and then they will need to be formatted into one string to pass to the LLM in the context window.

Here is a simple formatting function that can be hooked into the chain, which combines a list of chunks into one string:



In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

So how could we build a simple chain? Let's first just get the retrieval done and the formatted retrieved data and the question inserted into the prompt template:

In [None]:
rag_template = """Here is a context:\n{context} \n\nand here is a question: \n{question}"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)

rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough()}
    | rag_prompt


)

In [None]:
output = rag_chain.invoke('What is Chain of Thought?')

Ok... with some formatting... this looks good:

In [None]:
print(output.messages[0].content)

Here is a context:
the model’s thinking process.

[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022

the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes 

and here is a question: 
What is Chain of Thought?


Let's complete the RAG Chain:

In [None]:
output_parser = StrOutputParser()

rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
rag_prompt = ChatPromptTemplate.from_template(rag_template)

rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough()}
    | rag_prompt
    | mistral_llm_lc

)



In [None]:
rag_chain.invoke('What is Chain of Thought?')

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


'Human: [INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\nthe model’s thinking process.\n\n[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022\n\nthe problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process\n\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes \n\nHere is a question: \nWhat is Chain of Thought?.[/INST] According to the context from the paper "Chain of thought prompting elicits reasoning in large language models" by Wei et al., Chain of Thought refers to a method used to understand the thinking process of large language models. This approach breaks down complex problems into smaller thought steps, generating multiple thoughts for each step, resulting in a tree-like structure that represents the sequence of these thoughts.'

### 2.What about the Cohere models?

In [None]:
cohere_rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough()}
    | rag_prompt
    | cohere_chat_model
    | output_parser
)

In [None]:
cohere_rag_chain.invoke('What is Chain of Thought?')

'Chain of Thought (CoT) is a prompting technique used to elicit reasoning and logical thinking in large language models. It involves decomposing a complex problem into multiple thought steps, generating a tree-like structure of thoughts and reasoning possibilities. CoT helps language models provide more transparent and explainable responses by laying out the steps taken to arrive at a conclusion.'

Works too! Time to build the real thing and do experimentation.

## 3. The RAG Model & Experimentation for fintech policies

With this we can get started. First, we need to acquire the data, chunk it, vectorize it, and store the embeddings (and in this simple case also the docs) in our Qdrant vector db.


### 3.1 The Vector Database

We will start by creating our datastore, Qdrant. Usually, you would deploy the vector db as a server, but in this case let's simply put everything in memory. Also, in this case we will store not only the embeddings but the whole document in the vector store.

We will also create the retriever, which defines the way the documents are being retrieved. The retriever parameters define for example which method is used, how many docs are retrieved, etc. See [this LangChain link ](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore)for more information.

## 1. Start by extracting the company name and relevant metadata from an uploaded PDF policy document.


In [10]:
!pip install pi_heif
!pip install unstructured_inference

Collecting pi_heif
  Downloading pi_heif-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.5 kB)
Downloading pi_heif-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (989 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/989.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━[0m [32m399.4/989.0 kB[0m [31m12.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m989.0/989.0 kB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pi_heif
Successfully installed pi_heif-0.20.0
Collecting unstructured_inference
  Downloading unstructured_inference-0.8.1-py3-none-any.whl.metadata (5.9 kB)
Collecting layoutparser (from unstructured_inference)
  Downloading layoutparser-0.3.4-py3-none-any.whl.metadata (7.7 kB)
Collecting python-multipart (from unstructured_inference)
  Downloading python_multipart-0

## 1. GDPR Documents Structure

GDPR dcoument is split into two:
1. Recitals : Recitals are introductory statements found at the beginning of the document. They provide context, rationale, and intentions behind the law but do not contain binding rules. Recitals in the GDPR are numbered sequentially (1, 2, 3, etc.) and cover various topics, but they don’t establish rules by themselves.

2. Articles : Articles are the enforceable sections of the law. Each article contains specific provisions and requirements that must be followed. In the GDPR, articles outline the rights, obligations, and penalties related to personal data protection.

In a RAG system, chunking recitals and articles separately helps because:

Recitals: Recitals provide context and can help answer questions about why certain rules exist.

Articles: Articles offer direct answers to compliance-related questions, as they contain enforceable requirements.



## 1. Load privacy policy and GDPR documents and get Hugging Face Embeddings

In [45]:
%%capture

policy_paths = [
    "/content/drive/MyDrive/compliance/policies/bitpay.pdf",
    "/content/drive/MyDrive/compliance/policies/moonpay.pdf",
    "/content/drive/MyDrive/compliance/policies/klarna.pdf",
    "/content/drive/MyDrive/compliance/policies/bilt.pdf",
    "/content/drive/MyDrive/compliance/policies/stripe.pdf"
]
gdpr_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
base_embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")

In [46]:
import fitz
import requests
from langchain.embeddings import HuggingFaceEmbeddings
import logging
from typing import Tuple, List, Optional
from pathlib import Path
import tempfile

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def download_pdf_from_url(url: str) -> Optional[Path]:
    """
    Download a PDF from a URL and save it temporarily

    Args:
        url: URL of the PDF file

    Returns:
        Path: Path to the downloaded PDF file, or None if download fails
    """
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf')
        temp_path = Path(temp_file.name)

        with open(temp_path, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    pdf_file.write(chunk)

        logger.info(f"Successfully downloaded PDF to {temp_path}")
        return temp_path

    except Exception as e:
        logger.error(f"Failed to download PDF: {e}")
        return None

def load_pdf_content_with_page_count(pdf_path: str) -> Tuple[List[str], int]:
    """
    Load the content of a PDF and return the text from each page along with the page count.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        tuple: (List of page contents as strings, Total number of pages)
    """
    try:
        document = fitz.open(pdf_path)
        contents = [page.get_text() for page in document]
        num_pages = len(contents)
        document.close()
        return contents, num_pages
    except Exception as e:
        logger.error(f"Failed to load PDF content from {pdf_path}: {e}")
        return [], 0

def initialize_embeddings() -> Optional[HuggingFaceEmbeddings]:
    """
    Initialize HuggingFace embeddings for document representation using the specific model.

    Returns:
        HuggingFaceEmbeddings: Initialized embeddings object, or None if failed
    """
    try:
        embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")
        logger.info("Successfully initialized embeddings with model: multi-qa-mpnet-base-dot-v1")
        return embeddings
    except Exception as e:
        logger.error(f"Failed to initialize embeddings: {e}")
        return None

def load_policy_documents(policy_paths: List[str], gdpr_url: str) -> Tuple[List[List[str]], List[str]]:
    """
    Load multiple policy documents and the GDPR document with proper error handling.

    Parameters:
    - policy_paths (List[str]): List of paths to company policy PDFs
    - gdpr_url (str): URL to the GDPR reference PDF

    Returns:
    - tuple: (list of policy contents for each document, gdpr contents)
    """
    base_embeddings = initialize_embeddings()
    if base_embeddings is None:
        logger.error("Failed to initialize embeddings")
        return [], []

    all_policy_contents = []
    for policy_path in policy_paths:
        policy_contents, policy_pages = load_pdf_content_with_page_count(policy_path)
        logger.info(f"Loaded policy document ({Path(policy_path).name}): {policy_pages} pages")
        all_policy_contents.append(policy_contents)

    gdpr_contents = []
    gdpr_temp_path = download_pdf_from_url(gdpr_url)

    if gdpr_temp_path:
        try:
            gdpr_contents, gdpr_pages = load_pdf_content_with_page_count(str(gdpr_temp_path))
            logger.info(f"Loaded GDPR document: {gdpr_pages} pages")
            gdpr_temp_path.unlink()

        except Exception as e:
            logger.error(f"Error loading GDPR content: {e}")
    else:
        logger.error("Failed to download GDPR document")

    return all_policy_contents, gdpr_contents



all_policy_contents, gdpr_contents = load_policy_documents(policy_paths, gdpr_url)

for i, policy_content in enumerate(all_policy_contents, 1):
    print(f"Total pages in Policy Document {i}: {len(policy_content)}")

print(f"Total pages in GDPR PDF: {len(gdpr_contents)}")


Total pages in Policy Document 1: 10
Total pages in Policy Document 2: 14
Total pages in Policy Document 3: 69
Total pages in Policy Document 4: 22
Total pages in Policy Document 5: 33
Total pages in GDPR PDF: 88


In [47]:
all_policy_contents[0]

['Scope \nThis EU Privacy Notice (“Notice”) describes the ways BitPay Inc. \n(“BitPay”, “we”, “us”, “our”) collects, stores, uses and protects \npersonal data. BitPay’s services include Merchant processing \nservices, websites, mobile applications, products, and other \nfeatures, technologies or functionalities, including customer \nsupport (the “Services”) offered by us. This Notice applies to all \nprocessing of personal data on the website www.bitpay.com and \nall other BitPay sites on which it appears (the “Sites”). \nThe purpose of this Notice is to inform you about the processing \nof personal data by us and to inform you of your rights. \nResponsibility \nCompliant processing. BitPay complies with applicable United \nStates federal and state regulations and European Economic \nArea (European Union + Norway, Iceland, Liechtenstein) \n(hereinafter “EEA”) data protection laws, such as the General \nData Protection Regulation and its implementation acts \n(hereinafter “GDPR”). \nBit

In [48]:
gdpr_contents[0]

'I \n(Legislative acts) \nREGULATIONS \nREGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL \nof 27 April 2016 \non the protection of natural persons with regard to the processing of personal data and on the free \nmovement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) \n(Text with EEA relevance) \nTHE EUROPEAN PARLIAMENT AND THE COUNCIL OF THE EUROPEAN UNION, \nHaving regard to the Treaty on the Functioning of the European Union, and in particular Article 16 thereof, \nHaving regard to the proposal from the European Commission, \nAfter transmission of the draft legislative act to the national parliaments, \nHaving regard to the opinion of the European Economic and Social Committee (1), \nHaving regard to the opinion of the Committee of the Regions (2), \nActing in accordance with the ordinary legislative procedure (3), \nWhereas: \n(1)  \nThe protection of natural persons in relation to the processing of personal data is a funda

## 2. Extract Company name from PDF document

In [49]:
import openai
import fitz
import re
from typing import Optional, List, Dict

def load_pdf_first_page_content(pdf_path: str) -> Optional[str]:
    """Load the content of the first page of a PDF."""
    try:
        pdf_document = fitz.open(pdf_path)
        first_page = pdf_document[0]
        first_page_content = first_page.get_text()
        pdf_document.close()
        return first_page_content
    except Exception as e:
        print(f"Error loading PDF: {e}")
        return None

def normalize_company_name(company_name: str, preserve_case: bool = True) -> str:
    """
    Normalize the company name by removing common suffixes like 'Inc.', 'LLC', 'Ltd.'
    and trailing punctuation.

    Args:
        company_name (str): The company name to normalize
        preserve_case (bool): If True, preserves the original casing, if False converts to lowercase
    """
    # Define common suffixes to remove
    suffixes = r'\b(Inc|LLC|Ltd|Corporation|Corp|Limited|Technologies|Bank AB|AB)\b\.?'

    # Remove suffixes
    normalized_name = re.sub(suffixes, '', company_name, flags=re.IGNORECASE).strip()

    # Remove any trailing punctuation, like commas or periods
    normalized_name = re.sub(r'[,\.]+$', '', normalized_name).strip()

    if not preserve_case:
        normalized_name = normalized_name.lower()

    return normalized_name

def process_policies(pdf_paths: List[str]) -> List[Dict]:
    """
    Process multiple PDF policies and extract company names.

    Args:
        pdf_paths: List of paths to PDF files

    Returns:
        List of dictionaries containing processing results
    """
    results = []

    for pdf_path in pdf_paths:
        result = {
            "pdf_path": pdf_path,
            "status": "error",
            "company_name": None,
            "message": ""
        }

        try:
            company_name = extract_company_name_from_first_page(pdf_path)

            if company_name:
                result.update({
                    "status": "success",
                    "company_name": company_name,
                    "message": "Successfully processed"
                })
            else:
                result["message"] = "Failed to extract company name"

        except Exception as e:
            result["message"] = f"Error processing PDF: {str(e)}"

        results.append(result)

    return results

def extract_company_name_from_first_page(pdf_path: str) -> Optional[str]:
    """Extract and normalize the company name from the first page of a PDF using GPT."""
    first_page_content = load_pdf_first_page_content(pdf_path)

    if not first_page_content:
        print("No content found on the first page.")
        return None

    prompt = (
        "Identify the exact name of the company from the following privacy policy text. "
        "Maintain the original capitalization of the company name. "
        "The company name usually appears at the start, often in the first paragraph or introduction.\n\n"
        f"Privacy Policy Text:\n\n{first_page_content}\n\nCompany Name:"
    )

    try:
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Return only the company name with proper capitalization."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=20,
            temperature=0.0
        )

        company_name = response['choices'][0]['message']['content'].strip()
        normalized_company_name = normalize_company_name(company_name, preserve_case=True)
        print(f"Company Name Extracted from {pdf_path}: {normalized_company_name}")
        return normalized_company_name

    except Exception as e:
        print(f"Error in API call: {e}")
        return None

if __name__ == "__main__":

    results = process_policies(policy_paths)

    for result in results:
        if result["status"] == "success":
            print("\nProcessed Result:")
            print(f"Company Name: {result['company_name']}")
            print(f"PDF Path: {result['pdf_path']}")
        else:
            print(f"Error processing {result['pdf_path']}: {result['message']}")


Company Name Extracted from /content/drive/MyDrive/compliance/policies/bitpay.pdf: BitPay
Company Name Extracted from /content/drive/MyDrive/compliance/policies/moonpay.pdf: MoonPay
Company Name Extracted from /content/drive/MyDrive/compliance/policies/klarna.pdf: Klarna
Company Name Extracted from /content/drive/MyDrive/compliance/policies/bilt.pdf: Bilt
Company Name Extracted from /content/drive/MyDrive/compliance/policies/stripe.pdf: Stripe

Processed Result:
Company Name: BitPay
PDF Path: /content/drive/MyDrive/compliance/policies/bitpay.pdf

Processed Result:
Company Name: MoonPay
PDF Path: /content/drive/MyDrive/compliance/policies/moonpay.pdf

Processed Result:
Company Name: Klarna
PDF Path: /content/drive/MyDrive/compliance/policies/klarna.pdf

Processed Result:
Company Name: Bilt
PDF Path: /content/drive/MyDrive/compliance/policies/bilt.pdf

Processed Result:
Company Name: Stripe
PDF Path: /content/drive/MyDrive/compliance/policies/stripe.pdf


## 4. Classify the policy segment use the policy category classifier model along with explainability

In [None]:
import tensorflow as tf
from transformers import AutoTokenizer
from sklearn.preprocessing import LabelEncoder
import numpy as np
from typing import Tuple, Dict, List

model_checkpoint = 'nlpaueb/legal-bert-base-uncased'
model_path = '/content/drive/MyDrive/compliance/OPP-115/legalbert_finetuned_model'
legalbert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loaded_model = tf.saved_model.load(model_path)

target_names = [
    "Data Retention",
    "Data Security",
    "Do Not Track",
    "First Party Collection/Use",
    "International and Specific Audiences",
    "Other",
    "Policy Change",
    "Third Party Sharing/Collection",
    "User Access, Edit and Deletion",
    "User Choice/Control"
]

label_encoder = LabelEncoder()
label_encoder.fit(target_names)

def get_prediction_confidence(predictions: np.ndarray) -> float:
    softmax_probs = tf.nn.softmax(predictions, axis=1).numpy()[0]
    return float(np.max(softmax_probs))

def get_top_n_predictions(predictions: np.ndarray, n: int = 3) -> List[Tuple[str, float]]:
    softmax_probs = tf.nn.softmax(predictions, axis=1).numpy()[0]
    top_n_indices = np.argsort(softmax_probs)[-n:][::-1]
    return [(label_encoder.inverse_transform([idx])[0], float(softmax_probs[idx])) for idx in top_n_indices]

def analyze_policy_keywords(text: str, category: str) -> List[str]:
    category_keywords = {
        "Data Retention": ["retain", "store", "keep", "period", "duration", "delete"],
        "Data Security": ["secure", "protect", "encrypt", "safeguard", "confidential"],
        "Do Not Track": ["track", "cookie", "browser", "signal", "tracking"],
        "First Party Collection/Use": ["collect", "gather", "use", "process", "information"],
        "International and Specific Audiences": ["country", "region", "child", "minor", "jurisdiction"],
        "Third Party Sharing/Collection": ["share", "disclose", "transfer", "third party", "partner"],
        "User Access, Edit and Deletion": ["access", "edit", "delete", "modify", "remove"],
        "User Choice/Control": ["opt", "consent", "choose", "control", "preference"]
    }

    keywords = category_keywords.get(category, [])
    found_keywords = [word for word in keywords if word.lower() in text.lower()]
    return found_keywords

def predict_policy_category(text: str, model: tf.Module, tokenizer: AutoTokenizer,
                          label_encoder: LabelEncoder) -> Dict:
    encoding = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=300,
        return_tensors='tf'
    )

    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    predictions = model.signatures['serving_default'](
        input_ids=tf.constant(input_ids),
        attention_mask=tf.constant(attention_mask)
    )

    predicted_logits = predictions['dense']
    confidence = get_prediction_confidence(predicted_logits)
    top_predictions = get_top_n_predictions(predicted_logits)
    main_category = top_predictions[0][0]

    relevant_keywords = analyze_policy_keywords(text, main_category)

    return {
        "category": main_category,
        "confidence": confidence,
        "alternative_categories": top_predictions[1:],
        "relevant_keywords": relevant_keywords,
        "text_length": len(text.split()),
        "prediction_explanation": {
            "main_category_confidence": f"{confidence:.2%}",
            "keywords_found": len(relevant_keywords),
            "text_analysis": f"Found {len(relevant_keywords)} keywords related to {main_category}"
        }
    }

def analyze_policy_segment(text: str) -> Dict:
    try:
        results = predict_policy_category(text, loaded_model, legalbert_tokenizer, label_encoder)

        summary = {
            "status": "success",
            "category": results["category"],
            "confidence": f"{results['confidence']:.2%}",
            "keywords": results["relevant_keywords"],
            "alternatives": [(cat, f"{conf:.2%}") for cat, conf in results["alternative_categories"]],
            "analysis": results["prediction_explanation"]
        }

    except Exception as e:
        summary = {
            "status": "error",
            "error_message": str(e)
        }

    return summary

if __name__ == "__main__":
    example_text = """We may share your personal information with third-party service providers
    to facilitate our services, including payment processing, data analysis, email delivery,
    hosting services, customer service, and marketing. These third parties are authorized to
    use your personal information only as necessary to provide these services to us and are
    required to maintain the confidentiality and security of your information. We do not sell
    your personal information to any third parties."""

    results = analyze_policy_segment(example_text)

    if results["status"] == "success":
        print("\nPolicy Analysis Results:")
        print(f"Category: {results['category']}")
        print(f"Confidence: {results['confidence']}")
        print("\nRelevant Keywords:", ', '.join(results['keywords']))
        print("\nAlternative Categories:")
        for cat, conf in results['alternatives']:
            print(f"- {cat}: {conf}")
        print("\nAnalysis Details:")
        for key, value in results['analysis'].items():
            print(f"- {key}: {value}")
    else:
        print(f"Error in analysis: {results['error_message']}")


Policy Analysis Results:
Category: Data Security
Confidence: 23.18%

Relevant Keywords: confidential

Alternative Categories:
- Other: 8.54%
- User Access, Edit and Deletion: 8.53%

Analysis Details:
- main_category_confidence: 23.18%
- keywords_found: 1
- text_analysis: Found 1 keywords related to Data Security


## 5. Classify the policy segment use the policy category classifier model along without explainability

In [None]:
import tensorflow as tf
from transformers import AutoTokenizer
from sklearn.preprocessing import LabelEncoder

model_checkpoint = 'nlpaueb/legal-bert-base-uncased'
model_path = '/content/drive/MyDrive/compliance/OPP-115/legalbert_finetuned_model'
legalbert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loaded_model = tf.saved_model.load(model_path)

target_names = [
    "Data Retention",
    "Data Security",
    "Do Not Track",
    "First Party Collection/Use",
    "International and Specific Audiences",
    "Other",
    "Policy Change",
    "Third Party Sharing/Collection",
    "User Access, Edit and Deletion",
    "User Choice/Control"
]

label_encoder = LabelEncoder()
label_encoder.fit(target_names)

def predict_policy_category(text, model, tokenizer, label_encoder):
    encoding = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=300,
        return_tensors='tf'
    )

    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    predictions = model.signatures['serving_default'](
        input_ids=tf.constant(input_ids),
        attention_mask=tf.constant(attention_mask)
    )

    predicted_class = tf.argmax(predictions['dense'], axis=1).numpy()[0]
    predicted_category = label_encoder.inverse_transform([predicted_class])[0]

    return predicted_category

if __name__ == "__main__":
    example_text = """We may share your personal information with third-party service providers
    to facilitate our services, including payment processing, data analysis, email delivery,
    hosting services, customer service, and marketing. These third parties are authorized to
    use your personal information only as necessary to provide these services to us and are
    required to maintain the confidentiality and security of your information. We do not sell
    your personal information to any third parties."""

    predicted_category = predict_policy_category(example_text, loaded_model, legalbert_tokenizer, label_encoder)
    print(f"Predicted Policy Category: {predicted_category}")

Predicted Policy Category: Data Security


## 5. Extract section header page by page and chunk by section headers

1. processes multiple PDF privacy policies

2. extracting formatted text

3. analyzing formatting to identify potential section headers

4. confirming them with GPT

5. chunking the policy text based on confirmed headers for structured output.

In [51]:
import openai
import fitz
from typing import List, Dict
from collections import defaultdict

def extract_formatted_text(pdf_path: str) -> List[Dict]:
    """
    Extract text with its formatting information from PDF.
    Returns list of dictionaries containing text and its formatting properties.
    """
    formatted_blocks = []
    try:
        doc = fitz.open(pdf_path)
        position = 0

        for page_num, page in enumerate(doc):
            blocks = page.get_text("dict")["blocks"]
            prev_y1 = None

            for block in blocks:
                if "lines" in block:
                    for line in block["lines"]:
                        y0 = line["bbox"][1]
                        line_spacing = y0 - prev_y1 if prev_y1 is not None else 0
                        prev_y1 = line["bbox"][3]

                        for span in line["spans"]:
                            text = span["text"].strip()
                            if text:
                                formatted_blocks.append({
                                    "text": text,
                                    "font_name": span["font"],
                                    "font_size": span["size"],
                                    "is_bold": "bold" in span["font"].lower() or span["flags"] & 2**4 != 0,
                                    "line_spacing": line_spacing,
                                    "position": position,
                                    "page_num": page_num + 1
                                })
                            position += len(text) + 1

        return formatted_blocks
    except Exception as e:
        print(f"Error extracting formatted text: {e}")
        return []

def analyze_document_formatting(blocks: List[Dict]) -> Dict:
    """
    Analyze document formatting to establish baseline metrics.
    """
    stats = defaultdict(list)

    for block in blocks:
        stats["font_sizes"].append(block["font_size"])
        stats["line_spacings"].append(block["line_spacing"])

    return {
        "avg_font_size": sum(stats["font_sizes"]) / len(stats["font_sizes"]),
        "max_font_size": max(stats["font_sizes"]),
        "avg_line_spacing": sum(stats["line_spacings"]) / len(stats["line_spacings"]) if stats["line_spacings"] else 0
    }

def identify_potential_headers(blocks: List[Dict], format_stats: Dict) -> List[Dict]:
    """
    Identify potential headers based on formatting characteristics.
    """
    potential_headers = []

    for block in blocks:
        formatting_score = 0
        characteristics = []

        # Check font size
        if block["font_size"] > format_stats["avg_font_size"]:
            formatting_score += 2
            characteristics.append("larger_font")

        # Check if bold
        if block["is_bold"]:
            formatting_score += 2
            characteristics.append("bold")

        # Check line spacing
        if block["line_spacing"] > format_stats["avg_line_spacing"] * 1.5:
            formatting_score += 1
            characteristics.append("increased_spacing")

        # Check text length
        word_count = len(block["text"].split())
        if word_count <= 10:
            formatting_score += 1
            characteristics.append("short_text")

        # Check for title case or all caps
        if block["text"].istitle() or block["text"].isupper():
            formatting_score += 1
            characteristics.append("title_case_or_caps")

        if formatting_score >= 3:
            potential_headers.append({
                "text": block["text"],
                "position": block["position"],
                "characteristics": characteristics,
                "formatting_score": formatting_score
            })

    return potential_headers

def confirm_headers_with_gpt(potential_headers: List[Dict]) -> List[Dict]:
    """
    Use GPT to confirm which potential headers are actual privacy policy section headers.
    """
    confirmed_headers = []

    for header in potential_headers:
        prompt = f"""Analyze if this text represents a privacy policy section header.

Text: "{header['text']}"

Formatting characteristics detected:
{', '.join(header['characteristics'])}

Consider:
1. Is this a typical privacy policy topic? (e.g., data collection, processing, sharing, security, rights)
2. Does it look like a header based on its formatting? (detected: {header['formatting_score']}/7 header characteristics)
3. Is it concise and descriptive?
4. Does it introduce a new section rather than being part of content?

Is this a main section header? Answer only 'yes' or 'no'."""

        try:
            response = openai.ChatCompletion.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "You are an expert at identifying section headers in privacy policies."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=10,
                temperature=0.0
            )

            if response['choices'][0]['message']['content'].strip().lower() == 'yes':
                confirmed_headers.append({
                    "header": header["text"],
                    "position": header["position"]
                })

        except Exception as e:
            print(f"Error in GPT confirmation: {e}")
            continue

    return confirmed_headers

def chunk_policy_by_headers(headers: List[Dict], blocks: List[Dict]) -> List[Dict]:
    """
    Chunk the policy text based on section headers using the original formatted blocks.
    """
    chunks = []
    full_text = ""
    current_position = 0

    for block in blocks:
        full_text += block["text"] + "\n"

    for i, header_data in enumerate(headers):
        start_pos = header_data["position"]
        end_pos = headers[i + 1]["position"] if i + 1 < len(headers) else len(full_text)
        chunk_text = full_text[start_pos:end_pos].strip()

        chunks.append({
            "header": header_data["header"],
            "content": chunk_text
        })

    return chunks

def process_multiple_policies(pdf_paths: List[str]) -> List[Dict]:
    """
    Process multiple PDF policy documents and extract sections for each.
    """
    results = []
    for pdf_path in pdf_paths:
        try:
            formatted_blocks = extract_formatted_text(pdf_path)
            if not formatted_blocks:
                results.append({"status": "error", "pdf_path": pdf_path, "message": "Failed to extract formatted text"})
                continue

            format_stats = analyze_document_formatting(formatted_blocks)

            potential_headers = identify_potential_headers(formatted_blocks, format_stats)

            confirmed_headers = confirm_headers_with_gpt(potential_headers)

            if not confirmed_headers:
                results.append({"status": "error", "pdf_path": pdf_path, "message": "No headers found in the document"})
                continue

            chunks = chunk_policy_by_headers(confirmed_headers, formatted_blocks)

            results.append({
                "status": "success",
                "pdf_path": pdf_path,
                "total_chunks": len(chunks),
                "chunks": chunks
            })

        except Exception as e:
            results.append({"status": "error", "pdf_path": pdf_path, "message": str(e)})

    return results

if __name__ == "__main__":

    results = process_multiple_policies(policy_paths)

    for result in results:
        if result["status"] == "success":
            print(f"\nProcessed Result for {result['pdf_path']}:")
            print(f"Total number of chunks created: {result['total_chunks']}\n")
            for chunk in result["chunks"]:
                print(f"Header: {chunk['header']}")
                print(f"Content Snippet: {chunk['content'][:150]}...\n")
        else:
            print(f"Error processing {result['pdf_path']}: {result['message']}")


Processed Result for /content/drive/MyDrive/compliance/policies/bitpay.pdf:
Total number of chunks created: 34

Header: Scope
Content Snippet: Scope
This EU Privacy Notice (“Notice”) describes the ways BitPay Inc.
(“BitPay”, “we”, “us”, “our”) collects, stores, uses and protects
personal data...

Header: Responsibility
Content Snippet: Responsibility
Compliant processing.
BitPay complies with applicable United
States federal and state regulations and European Economic
Area (European ...

Header: What is “personal data”?
Content Snippet: What is “personal data”?
Personal data is any information relating to an identified natural
person or that can be used to identify a natural person di...

Header: How we obtain personal data
Content Snippet: How we obtain personal data
Means of collection.
We obtain your personal data in various
ways:
•
Provided by you.
We obtain information provided by yo...

Header: Who are the data subjects?
Content Snippet: are the data subjects?
There are several 

## 6. chunk by section header

In [52]:
import fitz
from typing import List, Dict
from collections import defaultdict

def further_chunk_multiple_policies(policies: List[Dict], chunk_size: int = 1000, chunk_overlap: int = 50) -> List[Dict]:
    """Further split each chunk's content for multiple companies using a Recursive Text Splitter."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False
    )
    refined_chunks = []

    for policy in policies:
        pdf_path = policy["pdf_path"]
        company_name = pdf_path.split('/')[-1].replace('.pdf', '').capitalize()

        for chunk in policy["chunks"]:
            try:
                sub_chunks = text_splitter.split_text(chunk["content"])

                for i, sub_chunk in enumerate(sub_chunks, 1):
                    refined_chunks.append({
                        "page_content": sub_chunk,
                        "metadata": {
                            "company_name": company_name,
                            "pdf_path": policy["pdf_path"],
                            "header": chunk["header"],
                            "chunk_index": i,
                            "total_sub_chunks": len(sub_chunks),
                            "document_type": "Policy"
                        }
                    })
            except Exception as e:
                print(f"Error processing chunk for company {company_name} with header '{chunk['header']}': {str(e)}")
                continue

    return refined_chunks

if __name__ == "__main__":

    refined_chunks = further_chunk_multiple_policies(results)

    print(f"Total number of sub-chunks created: {len(refined_chunks)}\n")

    for i, final_chunk in enumerate(refined_chunks, 1):
        print(f"Chunk {i}/{len(refined_chunks)}")
        print(f"PDF Path: {final_chunk['metadata']['pdf_path']}")
        print(f"Header: {final_chunk['metadata']['header']}")
        print(f"Sub-chunk {final_chunk['metadata']['chunk_index']}/{final_chunk['metadata']['total_sub_chunks']}")
        print(f"Content Snippet: {final_chunk['page_content'][:150]}...\n")


Total number of sub-chunks created: 719

Chunk 1/719
PDF Path: /content/drive/MyDrive/compliance/policies/bitpay.pdf
Header: Scope
Sub-chunk 1/1
Content Snippet: Scope
This EU Privacy Notice (“Notice”) describes the ways BitPay Inc.
(“BitPay”, “we”, “us”, “our”) collects, stores, uses and protects
personal data...

Chunk 2/719
PDF Path: /content/drive/MyDrive/compliance/policies/bitpay.pdf
Header: Responsibility
Sub-chunk 1/2
Content Snippet: Responsibility
Compliant processing.
BitPay complies with applicable United
States federal and state regulations and European Economic
Area (European ...

Chunk 3/719
PDF Path: /content/drive/MyDrive/compliance/policies/bitpay.pdf
Header: Responsibility
Sub-chunk 2/2
Content Snippet: The Sites may include links to websites of
third parties (for example, hyperlinks, banners or buttons). We are
not responsible for the content of thos...

Chunk 4/719
PDF Path: /content/drive/MyDrive/compliance/policies/bitpay.pdf
Header: What is “personal data”?
Sub-

## 7. Generate sumary of total chunks and sub chunks

In [53]:
from typing import List, Dict
from collections import defaultdict

def organize_chunks_by_company(chunks: List[Dict]) -> Dict[str, List[Dict]]:
    """
    Organize Document chunks by company.

    Parameters:
    - chunks (List[Dict]): List of chunk dictionaries containing 'page_content' and metadata with 'company_name'.

    Returns:
    - dict: Dictionary with company names as keys and lists of chunks as values.
    """
    organized_chunks = defaultdict(list)
    for chunk in chunks:
        company_name = chunk['metadata'].get('company_name', 'Unknown Company')
        organized_chunks[company_name].append(chunk)
    return organized_chunks

def generate_summary_by_company(organized_chunks: Dict[str, List[Dict]]) -> str:
    """
    Generates a summary of the number of chunks for each company.

    Parameters:
    - organized_chunks (dict): Dictionary with company names as keys and lists of chunks as values.

    Returns:
    - str: Summary string.
    """
    summary = "\nPolicy Document Summary by Company:\n" + "=" * 50 + "\n"
    grand_total_chunks = 0

    for company_name, chunks in organized_chunks.items():
        num_chunks = len(chunks)
        summary += f"\nCompany: {company_name}\n" + "-" * 35 + f"\nTotal Chunks: {num_chunks}\n"
        grand_total_chunks += num_chunks

    summary += "=" * 50 + f"\nGrand Total Chunks: {grand_total_chunks}"
    return summary

if __name__ == "__main__":
    refined_chunks = further_chunk_multiple_policies(results)

    organized_chunks = organize_chunks_by_company(refined_chunks)

    summary = generate_summary_by_company(organized_chunks)
    print(summary)

    print(f"\nTotal number of sub-chunks created: {len(refined_chunks)}\n")

    for i, final_chunk in enumerate(refined_chunks, 1):
        print(f"Chunk {i}/{len(refined_chunks)}")
        print(f"PDF Path: {final_chunk['metadata']['pdf_path']}")
        print(f"Header: {final_chunk['metadata']['header']}")
        print(f"Sub-chunk {final_chunk['metadata']['chunk_index']}/{final_chunk['metadata']['total_sub_chunks']}")
        print(f"Content Snippet: {final_chunk['page_content'][:150]}...\n")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Device...

Chunk 181/719
PDF Path: /content/drive/MyDrive/compliance/policies/klarna.pdf
Header: Technical
Sub-chunk 2/2
Content Snippet: data (information
about political
opinion, religious
beliefs and/or health
information
contained in the
PEP lists, as well as
biometric
information fr...

Chunk 182/719
PDF Path: /content/drive/MyDrive/compliance/policies/klarna.pdf
Header: section 3
Sub-chunk 1/1
Content Snippet: section 3
for
information about
Klarna’s savings and
payment accounts).
Contact and
identification data....

Chunk 183/719
PDF Path: /content/drive/MyDrive/compliance/policies/klarna.pdf
Header: Payment
Sub-chunk 1/1
Content Snippet: Payment
information.
Information about
goods/services. (The
store)
Information about
your use of Klarna’s
services. (Klarna)...

Chunk 184/719
PDF Path: /content/drive/MyDrive/compliance/policies/klarna.pdf
Header: Technical
Sub-chunk 1/2
Content Snippet: Technical
information
gen

## 10. Chunk GDPR document by each recital and Article

In [54]:
from langchain.schema import Document

def chunk_gdpr_by_section(gdpr_contents: List[str]) -> List[Document]:
    """
    Chunk GDPR text into sections based on articles and recitals, without mapping to policy categories.

    Parameters:
    - gdpr_contents (List[str]): List containing GDPR text.

    Returns:
    - List[Document]: List of Document objects, each representing a chunk of a GDPR section.
    """
    try:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)
        documents = []
        full_text = "\n".join(gdpr_contents)

        recital_pattern = re.compile(r"\((\d+)\)\s", re.MULTILINE)
        article_pattern = re.compile(r"(Article\s(\d+))\b", re.IGNORECASE)

        all_matches = [(match.start(), match.group(1), 'recital') for match in re.finditer(recital_pattern, full_text)]
        all_matches += [(match.start(), match.group(1), 'article', match.group(2)) for match in re.finditer(article_pattern, full_text)]
        all_matches.sort()

        for i, (start, header, section_type, *article_number) in enumerate(all_matches):
            end = all_matches[i + 1][0] if i + 1 < len(all_matches) else len(full_text)
            section_text = full_text[start:end].strip()

            section_header = None

            if section_type == 'article' and article_number:
                article_number = article_number[0].strip()
                section_header = f"Article {article_number}"

            elif section_type == 'recital':
                section_header = f"Recital {header}"

            chunks = text_splitter.split_text(section_text)

            for chunk in chunks:
                documents.append(Document(
                    page_content=chunk,
                    metadata={
                        "header": section_header
                    }
                ))

        print(f"\nTotal documents created: {len(documents)}")
        return documents

    except Exception as e:
        print(f"Error processing GDPR contents: {e}")
        return []

if __name__ == "__main__":
    try:

        gdpr_sentence_chunks = chunk_gdpr_by_section(gdpr_contents)

        print("\nSample chunks:")
        for doc in gdpr_sentence_chunks[:5]:
            print(f"\n header: {doc.metadata['header']}")
            print(f"Content Snippet: {doc.page_content[:150]}...")

    except Exception as e:
        print(f"Error in main execution: {e}")



Total documents created: 762

Sample chunks:

 header: Article 16
Content Snippet: Article 16 thereof, 
Having regard to the proposal from the European Commission, 
After transmission of the draft legislative act to the national parl...

 header: Recital 1
Content Snippet: (1)  
The protection of natural persons in relation to the processing of personal data is a fundamental right....

 header: Article 8
Content Snippet: Article 8...

 header: Recital 1
Content Snippet: (1) of the Charter of Fundamental Rights of the European Union (the ‘Charter’) and...

 header: Article 16
Content Snippet: Article 16...


In [55]:
def further_chunk_gdpr_content(company_name: str, initial_documents: List[Document]) -> List[Document]:
    """Further split GDPR content into smaller chunks while preserving metadata."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=50)
    gdpr_chunks = []

    for document in initial_documents:
        sub_chunks = text_splitter.split_text(document.page_content)
        for sub_chunk in sub_chunks:
            gdpr_chunks.append(Document(
                page_content=sub_chunk,
                metadata={
                    "header": document.metadata.get("header"),
                    "document_type": "GDPR"
                }
            ))
    print(f"[DEBUG] Further chunked GDPR content for company {company_name}")
    return gdpr_chunks


## 11. Prepare chunks with metadata for vector store

In [None]:
from langchain.schema import Document
from langchain.vectorstores import Qdrant

all_documents = [
    Document(
        page_content=chunk["chunk_text"],
        metadata={
            "category": chunk["predicted_category"],
            "chunk_number": chunk["chunk_number"],
            "company_name": company_name,
            "document_type": "Policy"
        }
    )
    for chunk in classified_paragraph_chunks
] + [
    Document(
        page_content=chunk.page_content,
        metadata={
            "section_type": chunk.metadata["section_type"],
            "chunk_number": idx + 1,
            "document_type": "GDPR"
        }
    )
    for idx, chunk in enumerate(gdpr_sentence_chunks)
]

qdrant_vectorstore = Qdrant.from_documents(
    documents=all_documents,
    embedding=base_embeddings,
    location=":memory:",
    collection_name="policy_pulse_db",
    force_recreate=True
)

gdpr_retriever = qdrant_vectorstore.as_retriever(filter={"document_type": "GDPR"})
policy_retriever = qdrant_vectorstore.as_retriever(filter={"document_type": "Policy"})


## Identify if it is GDPR related or if it is policy related

In [None]:
def route_question_to_retriever(question, gdpr_retriever, policy_retriever):
    """
    Routes the question to the appropriate retriever based on 'GDPR' in the question,
    then filters the retrieved documents to ensure correct document type.

    Parameters:
    - question (str): The user's question.
    - gdpr_retriever: Retriever for GDPR content.
    - policy_retriever: Retriever for Policy content.

    Returns:
    - list of documents that match the document_type filter.
    """
    if "GDPR" in question.upper():
        print("Question identified as GDPR-related.")
        results = gdpr_retriever.invoke(question)
        return [result for result in results if result.metadata.get("document_type") == "GDPR"]
    else:
        print("Question identified as Policy-related.")
        results = policy_retriever.invoke(question)
        return [result for result in results if result.metadata.get("document_type") == "Policy"]

gdpr_question = "What are my rights under GDPR?"
gdpr_results = route_question_to_retriever(gdpr_question, gdpr_retriever, policy_retriever)

print("\nResults for GDPR-related question:")
for i, result in enumerate(gdpr_results[:5], 1):
    print(f"Result {i}: Content: {result.page_content[:100]} - Metadata: {result.metadata}")

policy_question = "What is the data retention policy?"
policy_results = route_question_to_retriever(policy_question, gdpr_retriever, policy_retriever)

print("\nResults for Policy-related question:")
for i, result in enumerate(policy_results[:5], 1):
    print(f"Result {i}: Content: {result.page_content[:100]} - Metadata: {result.metadata}")


Question identified as GDPR-related.

Results for GDPR-related question:
Result 1: Content: 1.
Member States shall by law reconcile the right to the protection of personal data pursuant to thi - Metadata: {'section_type': None, 'chunk_number': 816, 'document_type': 'GDPR', '_id': '58299f555fc846eb8a7e8cadfb7d219e', '_collection_name': 'policy_pulse_db'}
Result 2: Content: Whereas: 
(1)  
The protection of natural persons in relation to the processing of personal data is  - Metadata: {'section_type': None, 'chunk_number': 3, 'document_type': 'GDPR', '_id': '23c56dfba23a43d2ab4c0147369a4878', '_collection_name': 'policy_pulse_db'}
Result 3: Content: 2.
Where processing has been restricted under paragraph 1, such personal data shall, with the except - Metadata: {'section_type': None, 'chunk_number': 485, 'document_type': 'GDPR', '_id': '4c2e324b771e4d4bba5a8a805073b249', '_collection_name': 'policy_pulse_db'}
Result 4: Content: processing of personal data by private bodies falls within th

## 12. Mistral Model

In [56]:
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGr

Enabled CPU offloading with llm_int8_enable_fp32_cpu_offload=True
Changed to float16 precision for better memory efficiency
Added more quantization options for better performance
Added offload folder specification
Added padding and truncation settings for tokenizer

In [57]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True
)

llm_mistral_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    offload_folder="offload_folder"
)

llm_mistral_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    padding_side="right",
    truncation=True
)

mistral_pipe = pipeline(
    "text-generation",
    model=llm_mistral_model,
    tokenizer=llm_mistral_tokenizer,
    max_new_tokens=1000,
    temperature=0.55,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.3,
    torch_dtype=torch.float16
)

mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0


In [58]:
mistral_pipe("[INST]Give me  two-sentence about GDPR regulations ![/INST]")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': '[INST]Give me  two-sentence about GDPR regulations ![/INST] The General Data Protection Regulation (GDPR) is a European Union law that came into effect in May 2018, aimed at protecting the privacy and personal data of EU citizens. It requires organizations to obtain explicit consent from individuals before collecting, processing, or sharing their personal information, and imposes significant fines for noncompliance.'}]

In [59]:
def get_context(context):
    global mistral_context
    mistral_context = context
    return context

company_context = "This is a company privacy policy review focused on GDPR compliance. The company handles sensitive user data."

get_context(company_context)

question = "[INST]" + mistral_context + " Give me two sentences about GDPR data retention rules. [/INST]"
response = mistral_pipe(question)

print(response[0]["generated_text"])


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[INST]This is a company privacy policy review focused on GDPR compliance. The company handles sensitive user data. Give me two sentences about GDPR data retention rules. [/INST] Under the General Data Protection Regulation (GDPR), companies are required to limit the collection, processing, and retention of personal data only for specified purposes and in accordance with specific time limits or legal requirements. Therefore, your organization should have clear policies regarding how long you retain different types of personal data based on their purpose and any applicable laws.


## 16. Mistral RAG chain

In [60]:
mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)


## 17 Evaluation

In [63]:
import json

json_file_path = '/content/drive/MyDrive/compliance/policies/validation_data.json'

with open(json_file_path, 'r') as json_file:
    loaded_validation = json.load(json_file)

print("Loaded Validation Data:")
for key, value in loaded_validation.items():
    print(f"Question ID: {key}")
    print(f"Question: {value['question']}")
    print(f"Gold Answer: {value['gold_answer'][:100]}...")


Loaded Validation Data:
Question ID: 1
Question: When and where does BitPay collect personal data?
Gold Answer: Main Answer:
BitPay collects personal data during key operational moments across their service platf...
Question ID: 5
Question: When does MoonPay require service consent?
Gold Answer: Main Answer:
MoonPay initiates consent collection at key service engagement points and before specif...
Question ID: 6
Question: What are the requirements for handling sensitive data in research?
Gold Answer: Main Answer:
Organizations conducting research with sensitive data must implement specific safeguard...
Question ID: 7
Question: Why does Klarna process different categories of shopping data?
Gold Answer: Main Answer:
Klarna processes distinct categories of shopping data to support payment services and r...
Question ID: 9
Question: What types of personal data does Bilt collect for rewards services?
Gold Answer: Main Answer:
Bilt collects multiple categories of personal data to facilitate t

In [64]:
import openai
import fitz
import re
import pandas as pd
from typing import Optional, List, Dict
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.vectorstores import Qdrant
from langchain.schema import Document

from langchain.schema import Document

def prepare_documents_for_vectorstore(chunks: List[Document]) -> List[Document]:
    """Prepare documents for vector store with consistent metadata handling."""
    documents = []

    for idx, chunk in enumerate(chunks):
        if isinstance(chunk, dict):
            chunk = Document(page_content=chunk['page_content'], metadata=chunk['metadata'])

        doc_type = chunk.metadata.get("document_type", "Unknown")

        metadata_company_name = chunk.metadata.get("company_name", "").lower()

        metadata = {
            "chunk_number": idx + 1,
            "document_type": doc_type,
            "company_name": metadata_company_name,
            "header": chunk.metadata.get("header", ""),
            "doc_link": (
                "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
                if doc_type == "GDPR"
                else f"/content/drive/MyDrive/compliance/policies/{metadata_company_name}.pdf"
            )
        }

        documents.append(Document(
            page_content=chunk.page_content,
            metadata=metadata
        ))
        print(f"[DEBUG] Prepared document {idx + 1} with metadata: {metadata}")

    return documents


def setup_vectorstore(documents: List[Document], embeddings) -> Qdrant:
    """Initialize and populate the vector store."""
    try:
        vectorstore = Qdrant.from_documents(
            documents=documents,
            embedding=embeddings,
            location=":memory:",
            collection_name="policy_pulse_db",
            force_recreate=True
        )
        print("[DEBUG] Vector store setup complete")
        return vectorstore
    except Exception as e:
        print(f"Error setting up vector store: {e}")
        return None

def prepare_validation_data_with_company(validation_data: Dict, header: Optional[str] = None) -> Dict:
    """Prepare validation data by adding the company name and section header for each question."""
    updated_validation = {}
    for question_id, question_data in validation_data.items():
        updated_data = question_data.copy()
        updated_data['header'] = header if header else "N/A"
        updated_validation[question_id] = updated_data
    return updated_validation


def display_validation_results(validation_data: Dict) -> None:
    """Display validation results in a formatted manner."""
    for question_id, question_data in validation_data.items():
        question_text = question_data['question']
        company_name = question_data.get('company_name', 'N/A')
        section_header = question_data.get('header', 'N/A')

        print(f"Question ID: {question_id}")
        print(f"Question: {question_text}")
        print(f"Company Name: {company_name}")
        print(f"Section Header: {section_header}")
        print(f"Ground Truth: {question_data.get('gold_answer', 'N/A')}")
        print('-' * 50)


if __name__ == "__main__":

    for pdf_path in policy_paths:
        company_name = extract_company_name_from_first_page(pdf_path)
        if company_name:
            gdpr_final_chunks = further_chunk_gdpr_content(company_name, gdpr_sentence_chunks)

            all_chunks = gdpr_final_chunks + refined_chunks
            all_documents = prepare_documents_for_vectorstore(all_chunks)
            print("[DEBUG] Prepared Documents for Vector Store:")
            for idx, document in enumerate(all_documents):
              print(f"Document {idx + 1}:")
              print(f"Page Content: {document.page_content[:100]}...")
              print(f"Metadata: {document.metadata}")
              print("-" * 50)


            qdrant_vectorstore = setup_vectorstore(all_documents, base_embeddings)
            #gdpr_retriever = qdrant_vectorstore.as_retriever(filter={"document_type": "GDPR"})
            combined_filter = {"document_type": "GDPR"}
            gdpr_retriever = qdrant_vectorstore.as_retriever(search_kwargs={'filter': combined_filter})

            print(f"[TEST] Applying retrieval filter with company_name: '{company_name}'")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
--------------------------------------------------
Document 793:
Page Content: (2) or failure to provide access in violation of...
Metadata: {'chunk_number': 793, 'document_type': 'GDPR', 'company_name': '', 'header': 'Recital 2', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679'}
--------------------------------------------------
Document 794:
Page Content: Article 58(1). 
6.
Non-compliance with an order by the supervisory authority as referred to in...
Metadata: {'chunk_number': 794, 'document_type': 'GDPR', 'company_name': '', 'header': 'Article 58', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679'}
--------------------------------------------------
Document 795:
Page Content: Article 58...
Metadata: {'chunk_number': 795, 'document_type': 'GDPR', 'company_name': '', 'header': 'Article 58', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/

In [65]:
print(f"[TEST] Applying retrieval filter for GDPR documents with company_name: '{company_name}'")

test_query = "What is the purpose of GDPR?"
retrieved_docs = gdpr_retriever.get_relevant_documents(test_query)

print("\n[DEBUG] Retrieved Documents for GDPR Retriever Test:")
for idx, doc in enumerate(retrieved_docs):
  if doc.metadata.get("document_type") != "GDPR":
    print(f"[ERROR] Non-GDPR document retrieved: {doc.metadata}")
  else:
    print(f"[DEBUG] GDPR document {idx + 1} retrieved with metadata: {doc.metadata}")


[TEST] Applying retrieval filter for GDPR documents with company_name: 'Stripe'

[DEBUG] Retrieved Documents for GDPR Retriever Test:
[DEBUG] GDPR document 1 retrieved with metadata: {'chunk_number': 285, 'document_type': 'GDPR', 'company_name': '', 'header': 'Recital 166', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679', '_id': 'd77d606022244445b0d74abc24380644', '_collection_name': 'policy_pulse_db'}
[DEBUG] GDPR document 2 retrieved with metadata: {'chunk_number': 509, 'document_type': 'GDPR', 'company_name': '', 'header': 'Article 3', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679', '_id': 'be51948f27d7432a979103ddae0d41dc', '_collection_name': 'policy_pulse_db'}
[DEBUG] GDPR document 3 retrieved with metadata: {'chunk_number': 131, 'document_type': 'GDPR', 'company_name': '', 'header': 'Recital 73', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679', '_id': '7de1615f29

  retrieved_docs = gdpr_retriever.get_relevant_documents(test_query)


## First test on Random sample data

In [84]:
test_questions = {
    1: {
        'question': 'How does MoonPay protect sensitive financial information?',
        'company_name': 'MoonPay',
        'document_type': 'Policy'
    },
    2: {
        'question': 'How does BitPay obtain personal data',
        'company_name': 'BitPay',
        'document_type': 'Policy'
    },
    3: {
        'question': 'What are the data subject rights under GDPR?',
        'company_name': None,
        'document_type': 'GDPR'
    }
}



policy_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="Given the following context from policy documents, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"
)

rag_chain_mistral = (
    {
        "context": RunnablePassthrough(),
        "question": RunnablePassthrough()
    }
    | policy_prompt
    | mistral_llm_lc
    | RunnablePassthrough()
)

def test_retrieval_for_policy_questions(test_questions, policy_retriever, gdpr_retriever, rag_chain_mistral):
    """Test retrieval function with specific questions, company names, and policy categories."""
    results = {}
    for question_id, question_data in test_questions.items():
        print("question_data", question_data)
        question_text = question_data['question']
        company_name = question_data.get('company_name')
        print("company name: ",company_name )
        print(f"[TEST] Processing question ID {question_id} - {question_text}")
        company_name = question_data['company_name']
        question_type = question_data['document_type']

        if question_type == 'GDPR':
            print(f"[TEST] Retrieving GDPR content for question ID {question_id} - {question_text}")
            gdpr_results = gdpr_retriever.get_relevant_documents(question_text, metadata_filters={"document_type": "GDPR"})
            context_snippets = " ".join([doc.page_content[:500] for doc in gdpr_results])

        elif question_type == 'Policy':
            print(f"[TEST] Retrieving Policy content for question ID {question_id} ({company_name})")
            policy_results = policy_retriever.get_relevant_documents(
                question_text,
                metadata_filters={
                    "company_name": company_name,
                    "document_type": "Policy"
                }
            )
            gdpr_results = gdpr_retriever.get_relevant_documents(question_text, metadata_filters={"document_type": "GDPR"})
            context_snippets = " ".join([doc.page_content[:500] for doc in policy_results + gdpr_results])

        print(f"\n--- Debug: Retrieved Documents for Question {question_id} ---")
        for doc in (policy_results if question_type == 'Policy' else gdpr_results):
            print(f"Retrieved document metadata: {doc.metadata}")

        rag_input = {
            "context": context_snippets,
            "question": question_text
        }

        response = rag_chain_mistral.invoke(rag_input)

        results[question_id] = response
        print(f"\n--- Testing RAG Chain with Mistral for Question {question_id} ---")
        print(f"Question: '{question_text}'")
        print("Response:", response)
        print("-" * 50)

    return results

test_retrieval_results = test_retrieval_for_policy_questions(test_questions, policy_retriever, gdpr_retriever, rag_chain_mistral)


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


question_data {'question': 'How does MoonPay protect sensitive financial information?', 'company_name': 'MoonPay', 'document_type': 'Policy'}
company name:  MoonPay
[TEST] Processing question ID 1 - How does MoonPay protect sensitive financial information?
[TEST] Retrieving Policy content for question ID 1 (MoonPay)

--- Debug: Retrieved Documents for Question 1 ---
Retrieved document metadata: {'chunk_number': 921, 'document_type': 'Policy', 'company_name': 'Moonpay', 'header': 'How does MoonPay protect my personal information?', 'doc_link': '/content/drive/MyDrive/compliance/policies/Moonpay.pdf', '_id': '9dc3a3dbf28a4cf390f20850bc364cd5', '_collection_name': 'policy_pulse_db'}
Retrieved document metadata: {'chunk_number': 903, 'document_type': 'Policy', 'company_name': 'Moonpay', 'header': 'How we protect and store personal information', 'doc_link': '/content/drive/MyDrive/compliance/policies/Moonpay.pdf', '_id': '686c844c1b864924a692dc286250c23c', '_collection_name': 'policy_pulse_

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



--- Testing RAG Chain with Mistral for Question 1 ---
Question: 'How does MoonPay protect sensitive financial information?'
Response: Given the following context from policy documents, answer the question.

Context:
{'context': 's MoonPay protect my personal information?\nTo protect your personal information from unauthorized access and use, MoonPay uses\nsecurity measures that comply with federal law. These measures include computer\nsafeguards and secured files and buildings.\nHow doe we protect and store personal information\nMoonPay implements and maintains reasonable measures to protect your information.\nCustomer files are protected with safeguards according to the sensitivity of the relevant\ninformation. Reasonable controls (such as restricted access) are placed on our\ncomputer systems. Physical access to areas where personal information is gathered,\nprocessed or stored is limited to authorized employees.\nWe may store and process all or part of your personal and transactio 

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



--- Testing RAG Chain with Mistral for Question 2 ---
Question: 'How does BitPay obtain personal data'
Response: Given the following context from policy documents, answer the question.

Context:
{'context': 'How we obtain personal data\nMeans of collection.\nWe obtain your personal data in various\nways:\n•\nProvided by you.\nWe obtain information provided by you, for\nexample, if you contact us, sign up for our newsletter, or\nprovide information to us in the course of using our Services.\nWhen you provide personal data to BitPay, please do not\nprovide information that is irrelevant, inaccurate, and/or\nunnecessary for the provision of Services.\n•\nAutomatically collected.\nWe obtain some information\nautomatic s MoonPay collect my personal information?\nMoonPay collects your personal information when you:\n• Create an account with MoonPay.\n• Place or complete an order with MoonPay.\n• Use your debit or credit card or any other permissible payment method to purchase or\nsell digit

## Testing on Actual Validation data

In [66]:
from typing import Optional

def extract_company_name_from_question(question_text: str) -> Optional[str]:
    """Extract the company name from the question text using GPT-4."""

    prompt = (
        "Identify the exact name of the company mentioned in the following question. "
        "If there is no specific company name, return 'None'.\n\n"
        f"Question Text:\n\n{question_text}\n\nCompany Name:"
    )

    try:
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Return only the company name if found; otherwise, return 'None'."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=10,
            temperature=0.0
        )

        company_name = response['choices'][0]['message']['content'].strip()

        if company_name.lower() == "none":
            return None

        return company_name

    except Exception as e:
        print(f"Error in API call: {e}")
        return None


In [67]:

gdpr_df = pd.read_csv('/content/drive/MyDrive/compliance/policies/gdpr_pc.csv')
gdpr_df.head()

Unnamed: 0,Article Number,Article Header,OPP Category
0,1,Subject matter and objectives,First Party Collection/Use
1,2,Material scope,First Party Collection/Use
2,3,Territorial scope,International & Specific Audiences
3,4,Definitions,Other
4,5,Processing principles,First Party Collection/Use


In [68]:
from langchain.prompts import ChatPromptTemplate

prompt_role = "You are a GDPR compliance expert tasked with providing accurate, structured answers."
prompt_task = "Using only the provided context, generate a comprehensive response that includes all required sections."
prompt_context = "Context:\n{context}"
prompt_question = "Question:\n{question}"
prompt_instruction = """Answer the question by referring specifically to the context.
Structure your response exactly as follows:

Main Answer:
[Provide a concise answer using the context information]

Key Points:
• [List key points as bullet points]
• [Each point should start with a bullet point]
• [Extract at least 3-4 key points]

GDPR References:
[Must include specific GDPR articles mentioned or most relevant to the context. Use format: 'Article X - Title']"""

policy_pulse_template = f"""<s>[INST] {prompt_role}
{prompt_task}

{prompt_context}

{prompt_question}

{prompt_instruction} [/INST]"""

policy_prompt = ChatPromptTemplate.from_template(policy_pulse_template)


In [70]:
def classify_question_type(question: str) -> tuple:
    """
    Classify if a question is about a specific company policy or general GDPR.
    Returns "Policy" if a company name is identified in the question,
    otherwise returns "GDPR".

    Args:
        question: The question text

    Returns:
        Tuple: ("Policy" or "GDPR", matched company name or None)
    """
    company_name = extract_company_name_from_question(question)
    if company_name:
      company_name = company_name.lower()
    return ("Policy", company_name) if company_name else ("GDPR", None)

def clean_generated_response(response: str) -> str:
    """
    This function cleans the generated response to remove any unwanted content
    such as initial instructions or repeated context.
    """
    cleaned_response = re.sub(r'Human: <s>\[INST\].*?\[/INST\]', '', response, flags=re.DOTALL)

    cleaned_response = re.sub(r'Context:\s*{.*?}\n', '', cleaned_response, flags=re.DOTALL)

    cleaned_response = re.sub(r'Question:\s*{.*?}\n', '', cleaned_response, flags=re.DOTALL)

    cleaned_response = cleaned_response.strip()

    return cleaned_response


# policy_prompt = PromptTemplate(
#     input_variables=["context", "question"],
#     template="Given the following context from policy documents, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"
# )

rag_chain_mistral = (
    {
        "context": RunnablePassthrough(),
        "question": RunnablePassthrough()
    }
    | policy_prompt
    | mistral_llm_lc
    | RunnablePassthrough()
)

def test_retrieval_for_policy_questions(test_questions, qdrant_vectorstore, gdpr_retriever, rag_chain_mistral):
    """Test retrieval function with specific questions, company names, and policy categories."""
    results = []
    for question_id, question_data in test_questions.items():
        question_text = question_data['question']
        ground_truth = question_data['gold_answer']
        print(f"[TEST] Processing question ID {question_id} - {question_text}")

        question_type, dynamic_company_name = classify_question_type(question_text)
        print(f"Dynamically extracted company name: {dynamic_company_name}")
        print(f"Dynamically extracted question_type: {question_type}")

        gdpr_results = []
        policy_results = []

        if question_type == 'GDPR':
            print(f"[TEST] Retrieving GDPR content for question ID {question_id} - {question_text}")
            gdpr_results = gdpr_retriever.get_relevant_documents(question_text, metadata_filters={"document_type": question_type})
            print("gdpr_results", gdpr_results)

        elif question_type == 'Policy' and dynamic_company_name:
            print(f"[TEST] Retrieving Policy content for question ID {question_id} ({dynamic_company_name})")
            combined_filter = {
                "company_name": dynamic_company_name
            }

            policy_retriever = qdrant_vectorstore.as_retriever(search_kwargs={
                'filter': combined_filter
            })
            policy_results = policy_retriever.get_relevant_documents(question_text)

            print("policy_results", policy_results)

            if not policy_results:
                print(f"[WARNING] No policy documents found for '{dynamic_company_name}'. Using broader retrieval.")
                policy_results = qdrant_vectorstore.as_retriever().get_relevant_documents(question_text)

            print(f"[INFO] Retrieving GDPR articles for the question using semantic similarity")
            gdpr_results = gdpr_retriever.get_relevant_documents(question_text)
        else:
            print(f"[ERROR] Missing company name for question ID {question_id}. Skipping retrieval.")
            continue

        context_snippets = [doc.page_content[:500] for doc in policy_results + gdpr_results if doc.page_content]
        context_snippets_str = " ".join(context_snippets)

        document_sources = [doc.metadata.get('doc_link', 'N/A') for doc in policy_results + gdpr_results]
        document_sources_str = ", ".join(document_sources)

        if not context_snippets_str:
            print(f"[ERROR] No context retrieved for Question ID {question_id}.")
            results.append({
                "question": question_text,
                "ground_truth": ground_truth,
                "contexts": "No relevant context found for answering the question.",
                "answer": "No answer generated.",
                "doc_source": "N/A"
            })
            continue

        rag_input = {
            "context": context_snippets_str,
            "question": question_text
        }
        response = rag_chain_mistral.invoke(rag_input)

        generated_answer = clean_generated_response(response)

        print(f"\n--- Testing RAG Chain with Mistral for Question {question_id} ---")
        print(f"Question: '{question_text}'")
        print("Response:", response)
        print("-" * 50)

        results.append({
            "question": question_text,
            "ground_truth": ground_truth,
            "contexts": context_snippets_str,
            "answer": generated_answer,
            "doc_source": document_sources_str
        })

    return results

test_retrieval_results = test_retrieval_for_policy_questions(loaded_validation, qdrant_vectorstore, gdpr_retriever, rag_chain_mistral)

df_results = pd.DataFrame(test_retrieval_results)
csv_file_path = '/content/drive/MyDrive/compliance/validation_results.csv'
df_results.to_csv(csv_file_path, index=False)

print(f"Results saved to {csv_file_path}")

[TEST] Processing question ID 1 - When and where does BitPay collect personal data?
Dynamically extracted company name: bitpay
Dynamically extracted question_type: Policy
[TEST] Retrieving Policy content for question ID 1 (bitpay)
policy_results [Document(metadata={'chunk_number': 846, 'document_type': 'Policy', 'company_name': 'bitpay', 'header': 'How we obtain personal data', 'doc_link': '/content/drive/MyDrive/compliance/policies/bitpay.pdf', '_id': '2266551299e3465d9fb0473378d1ce85', '_collection_name': 'policy_pulse_db'}, page_content='How we obtain personal data\nMeans of collection.\nWe obtain your personal data in various\nways:\n•\nProvided by you.\nWe obtain information provided by you, for\nexample, if you contact us, sign up for our newsletter, or\nprovide information to us in the course of using our Services.\nWhen you provide personal data to BitPay, please do not\nprovide information that is irrelevant, inaccurate, and/or\nunnecessary for the provision of Services.\n•\nA

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



--- Testing RAG Chain with Mistral for Question 1 ---
Question: 'When and where does BitPay collect personal data?'
Response: Human: <s>[INST] You are a GDPR compliance expert tasked with providing accurate, structured answers.
Using only the provided context, generate a comprehensive response that includes all required sections.

Context:
{'context': 'How we obtain personal data\nMeans of collection.\nWe obtain your personal data in various\nways:\n•\nProvided by you.\nWe obtain information provided by you, for\nexample, if you contact us, sign up for our newsletter, or\nprovide information to us in the course of using our Services.\nWhen you provide personal data to BitPay, please do not\nprovide information that is irrelevant, inaccurate, and/or\nunnecessary for the provision of Services.\n•\nAutomatically collected.\nWe obtain some information\nautomatic disclosure of data is necessary to report suspected illegal or\nfraudulent activity or to investigate violations of our Terms of

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



--- Testing RAG Chain with Mistral for Question 5 ---
Question: 'When does MoonPay require service consent?'
Response: Human: <s>[INST] You are a GDPR compliance expert tasked with providing accurate, structured answers.
Using only the provided context, generate a comprehensive response that includes all required sections.

Context:
{'context': 'Global Privacy Policy\nEffective Date: 3rd May 2024\nMoonPay (defined below under “Our relationship to you”) is committed to protecting the\nprivacy of visitors to our websites and our customers. This Privacy Policy describes how\nwe handle your personal data when you access our services, which include our content\non the websites located at\nmoonpay.com\n,\nbuy.moonpay.com\n,\nsell.moonpay.com\nor any\nother websites, pages, features, or content we own or operate (collectively, the\n"Site(s)"), or any MoonP them from sharing your information with anyone else.\n• Companies or other third parties in connection with business transfers or bankrup

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Dynamically extracted company name: None
Dynamically extracted question_type: GDPR
[TEST] Retrieving GDPR content for question ID 6 - What are the requirements for handling sensitive data in research?
gdpr_results [Document(metadata={'chunk_number': 267, 'document_type': 'GDPR', 'company_name': '', 'header': 'Recital 156', 'doc_link': 'https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679', '_id': 'f2cb9eb2890a447abb2b23275dccce22', '_collection_name': 'policy_pulse_db'}, page_content='under specific conditions and subject to appropriate safeguards for data subjects, specifications and derogations \nwith regard to the information requirements and rights to rectification, to erasure, to be forgotten, to restriction \nof processing, to data portability, and to object when processing personal data for archiving purposes in the'), Document(metadata={'chunk_number': 268, 'document_type': 'GDPR', 'company_name': '', 'header': 'Recital 156', 'doc_link': 'https://eur-lex.eur

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Dynamically extracted company name: klarna
Dynamically extracted question_type: Policy
[TEST] Retrieving Policy content for question ID 7 (klarna)
policy_results [Document(metadata={'chunk_number': 1015, 'document_type': 'Policy', 'company_name': 'klarna', 'header': 'Technical', 'doc_link': '/content/drive/MyDrive/compliance/policies/klarna.pdf', '_id': 'f29607da7e4c4f1196dff08cad2a77a7', '_collection_name': 'policy_pulse_db'}, page_content='Technical\ninformation\ngenerated through\nyour use of Klarna’s\nservices. (Klarna)\n19\n[19]\nWhen you shop in\na store that offers\nKlarna as a payment\nmethod or has Klarna\ncheckout, we will\nassess the order in\nwhich different\npayment methods\nshould be presented\nto you at the store\ncheckout. This\nprocessing does not\naffect which of\nKlarna’s payment\nmethods are available\nto you.\nThis processing\nconstitutes profiling.\nSee section 6 for more\ninformation about\nprofiling.\nFrom you:\nFrom other sources:\nIf you have accepted\nand use

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Dynamically extracted company name: bilt
Dynamically extracted question_type: Policy
[TEST] Retrieving Policy content for question ID 9 (bilt)
policy_results [Document(metadata={'chunk_number': 1233, 'document_type': 'Policy', 'company_name': 'bilt', 'header': 'Information We Receive from Third Parties:', 'doc_link': '/content/drive/MyDrive/compliance/policies/bilt.pdf', '_id': 'a4a57745ea664bd7a7f34ee1a4356458', '_collection_name': 'policy_pulse_db'}, page_content='e Receive from Third Parties:\nWe may collect information about you from third parties that perform services and analytics for us.\nSuch companies may include credit bureaus, data providers, fraud detection services and data\nanalytics providers, as well as certain of our Third Party Service Providers and their partners. We\nmay also collect engagement and transaction information from our merchant partners about you as\na member of the Bilt Rewards program in order to provide notice and issuance of rewards and\nbenefits in 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Dynamically extracted company name: stripe
Dynamically extracted question_type: Policy
[TEST] Retrieving Policy content for question ID 10 (stripe)
policy_results [Document(metadata={'chunk_number': 1461, 'document_type': 'Policy', 'company_name': 'stripe', 'header': 'a. Personal Data we collect about End Users', 'doc_link': '/content/drive/MyDrive/compliance/policies/stripe.pdf', '_id': 'f55a84b6be6442f794a3ef375265b330', '_collection_name': 'policy_pulse_db'}, page_content='a. Personal Data we collect about End Users\nUsing Link or Connecting your bank account.\nStripe offers a service called "Link," which\nallows you to store your payment methods with Stripe to conveniently use them across our\nBusiness Users. When you sign up for Link, you agree to store your Personal Data (such as\nname, contact information, payment method details) with Stripe. This will allow for a more\nstreamlined purchasing experience when using Link in the future. If you choose to pay with\nLink, we will also

In [71]:
pd.set_option('display.max_colwidth', None)
df_results.head()

Unnamed: 0,question,ground_truth,contexts,answer,doc_source
0,When and where does BitPay collect personal data?,"Main Answer:\nBitPay collects personal data during key operational moments across their service platforms and jurisdictions. Data collection occurs during initial account registration, throughout active transaction processing, during customer support interactions, and in real-time platform usage sessions. Collection takes place across their web platform, mobile applications, merchant integration points, and verification partner networks, with specific protocols for each location. The system employs technical tools to gather data through automated collection at these touchpoints, implementing specific measures for data minimization and jurisdictional compliance. Each collection timing and location is documented in their data mapping system, with clear records of processing purposes, retention periods, and legal bases.\n\nKey Points:\n• Collection timing aligned with key user interactions\n• Multiple collection locations across service platforms\n• Real-time data gathering during active sessions\n• Location-specific protocols for each jurisdiction\n\nGDPR Reference: Article 13 - Information to be provided where personal data are collected","How we obtain personal data\nMeans of collection.\nWe obtain your personal data in various\nways:\n•\nProvided by you.\nWe obtain information provided by you, for\nexample, if you contact us, sign up for our newsletter, or\nprovide information to us in the course of using our Services.\nWhen you provide personal data to BitPay, please do not\nprovide information that is irrelevant, inaccurate, and/or\nunnecessary for the provision of Services.\n•\nAutomatically collected.\nWe obtain some information\nautomatic disclosure of data is necessary to report suspected illegal or\nfraudulent activity or to investigate violations of our Terms of\nUse; and\n•\nWe may also share certain Shopper data with our Merchants,\nfor example, in cases of suspected fraud or in connection\nwith an ongoing investigation.\nTransfers outside the EEA.\nBitPay is primarily located in the\nUnited States, and some of the third parties mentioned above\nmay be located in the United States or other countries outside the\nEEA. We employ the foll are the data subjects?\nThere are several types of data subjects whose personal data we\nprocess:\n•\nVisitors of our Sites;\n•\nMerchants who sign up for our Services (“Merchants” means\nany business that uses our Services to process payments,\nincluding non-profits that use our Services to accept\ndonations and “Aggregators” that use our Services as a\npayment services platform for their own Merchants);\n•\nShoppers of Merchants (“Shoppers” means individuals who\nindirectly interface with BitPay when payin ies.\nWe may share your personal data with the following\nthird parties:\n•\nOther BitPay entities, including to help detect and prevent\npotentially illegal acts and violations of our policies, and to\nguide decisions about our products, services and\ncommunications;\n•\nService providers who help with our business operations and to\ndeliver our Services, such as:\n◦\nCloud service providers providing cloud infrastructure;\n◦\nProviders of ID verification solutions and other due\ndiligence solutions (such as which the personal data are initially collected, the controller, after having met all the requirements for the \nlawfulness of the original processing, should take into account, inter alia: any link between those purposes and \nthe purposes of the intended further processing; the context in which the personal data have been collected, in (1) or personal data relating to criminal convictions and offences referred to in mechanism and then adopted by the Commission. After the completion of the processing on behalf of the \ncontroller, the processor should, at the choice of the controller, return or delete the personal data, unless there is \na requirement to store the personal data under Union or Member State law to which the processor is subject. should be explicit and legitimate and determined at the time of the collection of the personal data. The personal \ndata should be adequate, relevant and limited to what is necessary for the purposes for which they are processed. \nThis requires, in particular, ensuring that the period for which the personal data are stored is limited to a strict","Main Answer:\nBitPay obtains personal data from two main sources: directly from individuals (""You"") through interactions like signing up for newsletters or using its Services, and automatically through technologies like cookies or IP addresses for investigative purposes. Some Shopper data might also be shared with merchants during suspicious activity investigations. BitPay processes personal data for visitors of its sites, merchants utilizing its Services, and shoppers making transactions. Data transfers occur mainly to US-based partners but could extend to locations beyond the European Economic Area (EEA).\n\nKey Points:\n• Personal data can come from users interacting with BitPay or generated automatically.\n• Automatically obtained data helps combat potential illegal activities and breaches of terms.\n• Three distinct groups of data subjects involve site visitors, merchants, and shoppers.\n• Transactions may result in sharing data with associated merchants.\n• Data flows mostly within the U.S., though it occasionally goes overseas.\n\nGDPR References:\n• Article 6 – Lawfulness of Processing ('The six legal bases')\n• Article 5 – Principles Relating to Processing of Personal Data\n• Article 28 – Processing by the Controller with the Help of a Processor","/content/drive/MyDrive/compliance/policies/bitpay.pdf, /content/drive/MyDrive/compliance/policies/bitpay.pdf, /content/drive/MyDrive/compliance/policies/bitpay.pdf, /content/drive/MyDrive/compliance/policies/bitpay.pdf, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
1,When does MoonPay require service consent?,"Main Answer:\nMoonPay initiates consent collection at key service engagement points and before specific processing activities commence. Their framework activates consent requirements during new service adoption, marketing engagement, and third-party integration moments. The system implements separate consent protocols for different processing activities, particularly focusing on financial services requirements timing. Their approach maintains regular validation schedules for consent status and implements clear withdrawal options throughout the service relationship.\n\nKey Points:\n• Service initiation consent timing\n• Processing-specific consent intervals\n• Regular validation schedules\n• Withdrawal opportunity timing\n\nGDPR Reference: Article 7 - Conditions for consent","Global Privacy Policy\nEffective Date: 3rd May 2024\nMoonPay (defined below under “Our relationship to you”) is committed to protecting the\nprivacy of visitors to our websites and our customers. This Privacy Policy describes how\nwe handle your personal data when you access our services, which include our content\non the websites located at\nmoonpay.com\n,\nbuy.moonpay.com\n,\nsell.moonpay.com\nor any\nother websites, pages, features, or content we own or operate (collectively, the\n""Site(s)""), or any MoonP them from sharing your information with anyone else.\n• Companies or other third parties in connection with business transfers or bankruptcy\nproceedings.\n• Companies or other entities that purchase MoonPay assets.\n• Law enforcement, regulators, or any other third parties when we are compelled to do\nso by applicable law or if we have a good faith belief that such use is reasonably\nnecessary, including to: protect the rights, property, or safety of MoonPay,\nMoonPay customers, third party, or the pu ird-party sites and services\nIf you authorize one or more third-party applications to access your MoonPay Services,\nthen information you have provided to MoonPay may be shared with those third parties.\nA connection you authorize or enable between your MoonPay account and a\nnon-MoonPay account, payment instrument, or platform is considered an “account\nconnection.” Unless you provide further permissions, MoonPay will not authorize these\nthird parties to use this information for any purpose other t ow we use your personal information\nWe may use your information in the following ways and for the following purposes:\n1) To maintain legal and regulatory compliance\nMoonPay needs to process your personal information in order to comply with\nanti-money laundering and security laws. In addition, when you seek to link a bank\naccount to your MoonPay account, we may require you to provide additional information\nwhich we may use in collaboration with service providers acting on our behalf to verify\nyou (43) \nIn order to ensure that consent is freely given, consent should not provide a valid legal ground for the processing \nof personal data in a specific case where there is a clear imbalance between the data subject and the controller, in \nparticular where the controller is a public authority and it is therefore unlikely that consent was freely given in \nall the circumstances of that specific situation. Consent is presumed not to be freely given if it does not allow \nseparate consent to be giv Article 7 \nConditions for consent \n1.\nWhere processing is based on consent, the controller shall be able to demonstrate that the data subject has \nconsented to processing of his or her personal data. \n2.\nIf the data subject's consent is given in the context of a written declaration which also concerns other matters, the \nrequest for consent shall be presented in a manner which is clearly distinguishable from the other matters, in an \nintelligible and easily accessible form, using clear and plain (44) \nProcessing should be lawful where it is necessary in the context of a contract or the intention to enter into a \ncontract. (32) \nConsent should be given by a clear affirmative act establishing a freely given, specific, informed and \nunambiguous indication of the data subject's agreement to the processing of personal data relating to him or her, \nsuch as by a written statement, including by electronic means, or an oral statement. This could include ticking a \nbox when visiting an internet website, choosing technical settings for information society services or another \nstatement or conduct which clearly indicates in","Main Answer:\nMoonPay requires service consent when users give a clear affirmative act to establish their agreement to the processing of their personal data, particularly when linking a non-MoonPay account, payment instrument, or platform to their MoonPay account.\n\nKey Points:\n- When creating an account connection, users must explicitly agree to share their personal information with MoonPay.\n- MoonPay uses this information solely for its intended purpose within its services.\n- The request for user consent must be presented separately, in an intelligible and easily accessible form, allowing for separate consents.\n\nGDPR References:\n- Article 6 - Lawfulness of Processing: ""processing shall be lawful if and to the extent that at least one of the following conditions is met.""\n * Condition b): ""the data subject has given consent to the processing of his or her personal data for one or more specific purposes"".\n- Article 7 - Conditions for consent: ""where processing is based on consent, the controller shall be able to demonstrate that the data subject has consented to processing of his or her personal data"".","/content/drive/MyDrive/compliance/policies/moonpay.pdf, /content/drive/MyDrive/compliance/policies/moonpay.pdf, /content/drive/MyDrive/compliance/policies/moonpay.pdf, /content/drive/MyDrive/compliance/policies/moonpay.pdf, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
2,What are the requirements for handling sensitive data in research?,"Main Answer:\nOrganizations conducting research with sensitive data must implement specific safeguards while balancing scientific needs. As outlined in Recital 156, while research enjoys certain exemptions, it requires appropriate safeguards for data subject rights. The framework must include specific measures for data minimization, pseudonymization, and security. Organizations need to document necessity assessments and implement strict access controls while ensuring research validity. The approach should include clear protocols for handling consent and withdrawal rights in research contexts.\n\nKey Points:\n• Specific safeguards for research data processing\n• Documented necessity and proportionality assessments\n• Clear protocols for handling consent withdrawal\n• Enhanced security measures for sensitive data\n\nGDPR Reference: Article 89 - Safeguards and derogations for research","under specific conditions and subject to appropriate safeguards for data subjects, specifications and derogations \nwith regard to the information requirements and rights to rectification, to erasure, to be forgotten, to restriction \nof processing, to data portability, and to object when processing personal data for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes. The conditions and safeguards in \nquestion may entail specific procedures for data subjects to exercise those rights if this is appropriate in the light \nof the purposes sought by the specific processing along with technical and organisational measures aimed at \nminimising the processing of personal data in pursuance of the proportionality and necessity principles. The \nprocessing of personal data for scientific such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern \nthe processing of such anonymous information, including for statistical or research purposes. (33) \nIt is often not possible to fully identify the purpose of personal data processing for scientific research purposes at \nthe time of data collection. Therefore, data subjects should be allowed to give their consent to certain areas of \nscientific research when in keeping with recognised ethical standards for scientific research. Data subjects should \nhave the opportunity to give their consent only to certain areas of research or parts of research projects to the \nextent allowed by the inte","Main Answer:\nUnder GDPR regulations, handling sensitive data in research requires adherence to specific conditions and safeguards for data subjects. These conditions include provisions regarding the information requirements, data subjects' rights, and technical and organizational measures.\n\nKey Points:\n1. **Data Subjects’ Rights**: Data subjects have the right to request rectification, erasure, restriction of processing, data portability, and objection under certain circumstances (Articles 15–20). For example, they can ask to restrict processing if the data being processed is incorrect or unnecessary.\n2. **Appropriate Safeguards**: Appropriate safeguards must be implemented based on the nature of the processing activities. Such safeguards could involve pseudonymization techniques to ensure anonymized data protection.\n3. **Identifiability Principle**: Processing of personal data for scientific research aims to make the data non-identifiable so it falls outside the scope of GDPR regulation (Recital 33). However, identifying data subjects might still be necessary during the course of research; thus, data controllers need to follow recognized ethical guidelines while obtaining informed consent from these individuals (Recitals 38 & 63).\n\nIn summary, researchers dealing with sensitive data within the purview of GDPR must comply with various conditions related to informing data subjects about their rights, implementing adequate safeguards, maintaining data minimally identifiable where feasible, and ensuring ethical treatment of participants through transparent communication and proper consents.","https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
3,Why does Klarna process different categories of shopping data?,"Main Answer:\nKlarna processes distinct categories of shopping data to support payment services and risk assessment. Their framework addresses various purposes including fraud prevention, service personalization, and payment processing. The system processes enhanced data categories for different payment methods and shopping services. Their approach maintains clear documentation of processing purposes, with specific justifications for different data categories across their range of services.\n\nKey Points:\n• Purpose-specific processing requirements\n• Clear necessity documentation\n• Risk-based processing protocols\n• Service-specific justifications\n\nGDPR Reference: Article 6(1)(b) - Performance of contract","Technical\ninformation\ngenerated through\nyour use of Klarna’s\nservices. (Klarna)\n19\n[19]\nWhen you shop in\na store that offers\nKlarna as a payment\nmethod or has Klarna\ncheckout, we will\nassess the order in\nwhich different\npayment methods\nshould be presented\nto you at the store\ncheckout. This\nprocessing does not\naffect which of\nKlarna’s payment\nmethods are available\nto you.\nThis processing\nconstitutes profiling.\nSee section 6 for more\ninformation about\nprofiling.\nFrom you:\nFrom other sources:\nIf yo Klarna will process\nyour data to deliver\nthe service, which\nincludes processing\ninformation about\nwhich websites/web\ndomains you visit:\nKlarna processes\ninformation about the\necommerce\nwebsites/web domains\nFrom you:\nThe processing is\nnecessary for Klarna\nto perform a contract\n(terms and conditions\nfor the Klarna\nShopping Service) with\nyou (Article 6(1)(b) UK\nGDPR).\nIf the service\nprocesses information\nthat constitutes\nsensitive personal data\n(i.e. such data about\nWhen the contract\nbetween you an Shopping Service.\nThis processing\nconstitutes profiling\nand automated\ndecision-making. We\nFrom you:\nFrom other sources:\nThe processing is\nnecessary for Klarna\nto perform a contract\n(terms and conditions\nfor the shopping\nservice) with you\n(Article 6(1)(b) UK\nGDPR).\nWhen the fraud\nassessment is\nperformed.\nSensitive personal\ndata.\nService-specific\npersonal data (see\nsection 3 for\ninformation about\nthe Klarna mobile\napplication).\nInformation about\ngoods/services.\n(Klarna or store you\nshop from)\nCont according to law.\n[13]\nTo decide what\nkind of marketing we\nwill provide to you.\nIf you do not want us\nto perform this\nprocessing of your\ndata, please contact\nus. We will then cease\nto use your data for\nmarketing. Contact\ninformation is available\nin section 12.\nThe processing may\nconstitute profiling.\nSee section 6 for more\ninformation about\nprofiling.\nFrom you:\nFrom other sources:\nThe processing is\nbased on a balancing\nof interests (Article 6(1)\n(f) UK GDPR). When\nbalancing interests,\nKlarna has processing. Derogations from the general prohibition for processing such special categories of personal data \nshould be explicitly provided, inter alia, where the data subject gives his or her explicit consent or in respect of \nspecific needs in particular where the processing is carried out in the course of legitimate activities by certain \nassociations or foundations the purpose of which is to permit the exercise of fundamental freedoms. or because it is in the vital interest of the data subject. apply only to processing for the purposes referred to in those paragraphs. specific needs, in particular where the processing of such data is carried out for certain health-related purposes \n4.5.2016 \nL 119/10 \nOfficial Journal of the European Union \nEN","Main Answer:\nKlarna processes various categories of shopping data to offer its services effectively, fulfill contracts, assess fraud risks, and tailor marketing efforts based on user behavior.\n\nKey Points:\n- Contract Performance: Processing customer's data is necessary for performing a contract between Klarna and the individual (article 6(1)(b)). It enables offering customized payment solutions and managing transactions.\n- Fraud Assessment: Analyzing users' data helps prevent potential financial losses due to fraudulent activity during online purchases.\n- Marketing Personalization: Profiles created based on browsing history contribute towards delivering targeted advertisements and promotions.\n\nGDPR References:\n- Article 6(1)(b): Necessary for entering into a contract.\n- Article 6(1)(c): Compliance with legal obligations.\n- Article 6(1)(f): Balance of interests pursued by Klarna.","/content/drive/MyDrive/compliance/policies/klarna.pdf, /content/drive/MyDrive/compliance/policies/klarna.pdf, /content/drive/MyDrive/compliance/policies/klarna.pdf, /content/drive/MyDrive/compliance/policies/klarna.pdf, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
4,What types of personal data does Bilt collect for rewards services?,"Main Answer:\nBilt collects multiple categories of personal data to facilitate their rewards and payment services. Their collection includes transaction data, payment card information, and merchant interaction details. The system captures specific data points including purchase amounts, merchant identifiers, and transaction timestamps through partners like Fidel Ltd and payment networks. Their framework maintains clear documentation of collection purposes, with specific protocols for rewards tracking. The approach includes enhanced data collection for different payment methods and reward categories.\n\nKey Points:\n• Structured collection of transaction and payment data \n• Purpose-specific data gathering\n• Clear documentation of rewards tracking\n• Enhanced collection for payment networks\n\nGDPR Reference: Article 5(1)(c) - Data minimization","e Receive from Third Parties:\nWe may collect information about you from third parties that perform services and analytics for us.\nSuch companies may include credit bureaus, data providers, fraud detection services and data\nanalytics providers, as well as certain of our Third Party Service Providers and their partners. We\nmay also collect engagement and transaction information from our merchant partners about you as\na member of the Bilt Rewards program in order to provide notice and issuance of r to\ncredit bureaus\nDoes Bilt share?\nYes\nCan you limit this sharing?\nNo\nReasons we can share your personal\ninformation\nFor our marketing purposes—\nto offer our product ed for Bilt Services:\nWe may provide transaction data and other information to the payment card networks, merchants\nand other partners for use in connection with the program. This information may be Personal\nInformation and NPI, such as your primary account number.\nInformation Disclose Agreement that is provided to you.\nIf you have any questions about our General Privacy Policy or how it applies to specific data, please\ncontact us at support@biltrewards.com. We will make every effort to resolve your concerns. (1) or personal data relating to criminal convictions and offences referred to in which the personal data are initially collected, the controller, after having met all the requirements for the \nlawfulness of the original processing, should take into account, inter alia: any link between those purposes and \nthe purposes of the intended further processing; the context in which the personal data have been collected, in should be explicit and legitimate and determined at the time of the collection of the personal data. The personal \ndata should be adequate, relevant and limited to what is necessary for the purposes for which they are processed. \nThis requires, in particular, ensuring that the period for which the personal data are stored is limited to a strict (64) \nThe controller should use all reasonable measures to verify the identity of a data subject who requests access, in \nparticular in the context of online services and online identifiers. A controller should not retain personal data for \nthe sole purpose of being able to react to potential requests.","Main Answer:\nBilt collects various types of personal data for its rewards services primarily from third parties that perform services and analytics for them. These third parties could be credit bureaus, data providers, fraud detection services, data analytics providers, and some of their partners. Additionally, Bilt gathers engagement and transaction information from its merchant partners regarding members of the Bilt Rewards program.\n\nKey Points:\n- Data Collection from Third Parties: Information obtained from external sources including but not limited to credit bureaus, data providers, fraud detection services, data analytics providers, and merchant partners.\n- Marketing Purposes: Transaction data and other related details shared with payment card networks, merchants, and other business associates to enhance reward programs and offers.\n- Limited Control Over Sharing: Users cannot restrict the sharing of their personal data within the given framework since it's an essential part of the service provision.\n\nGDPR References:\n- Article 5(a)-Lawfulness, fairness and transparency: ""Personal data shall be...collected for specified, explicit and legitimate purposes...""\n- Article 5(c)-Processing Limitation Principle: ""Personal data shall not be kept in a form that permits identification of individuals for longer than is necessary""\n- Recital 27: Special categories of personal data: ""Processing...shall be carried out under the control of professional bodies or association carrying on activities in the public interest.""","/content/drive/MyDrive/compliance/policies/bilt.pdf, /content/drive/MyDrive/compliance/policies/bilt.pdf, /content/drive/MyDrive/compliance/policies/bilt.pdf, /content/drive/MyDrive/compliance/policies/bilt.pdf, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679"
