
I've been looking at the paper `https://arxiv.org/pdf/2405.20900v1`. Since I've naively been using the poorly performing NLP methods in my tests, I'd like to run a quick experiment.

Since the most effective prompt design for privacy policy analysis was feeding in binary questions, I'd like to take it a step further: use an LLM to produce a tree of questions from a privacy policy. These question sets would be affirmations, i.e. questions that always evaluate to true against the original source. So, given a policy, let's call it Policy A, we'd produce Question Tree A; then for Policy B, Question Tree B.

Then we would evaluate, either manually or with a model:
- Policy A vs Question Tree B
- Policy B vs Question Tree A

Using this method we can compare not only updates/changes to a single policy but also, directly and somewhat quantifiably, the policies of two separate companies; it would be the exact same process. Additionally, if we produce an accurate question set, we have implicitly computed a digestible summary of both the changes **and** the entire content.
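The cross-evaluation above could be scored roughly as follows. This is only a sketch: it assumes each evaluation comes back as a dict of question ID to "Yes"/"No", and the function names and the simple averaged `similarity` score are my own illustration, not something taken from the paper.

```python
def is_yes(answer):
    """True when an answer string is a (case-insensitive) "Yes"."""
    return answer.strip().lower() == "yes"


def coverage(answers):
    """Fraction of questions answered "Yes"."""
    if not answers:
        return 0.0
    return sum(1 for a in answers.values() if is_yes(a)) / len(answers)


def compare_policies(a_vs_tree_b, b_vs_tree_a):
    """Summarise a two-way comparison between Policy A and Policy B.

    a_vs_tree_b: answers from evaluating Policy A against Question Tree B.
    b_vs_tree_a: answers from evaluating Policy B against Question Tree A.
    """
    cov_a = coverage(a_vs_tree_b)  # how much of B's content A also affirms
    cov_b = coverage(b_vs_tree_a)  # how much of A's content B also affirms
    return {
        "a_affirms_b": cov_a,
        "b_affirms_a": cov_b,
        "similarity": (cov_a + cov_b) / 2,
        # Questions answered "No" are the differences worth reading.
        "missing_from_a": sorted(q for q, a in a_vs_tree_b.items() if not is_yes(a)),
        "missing_from_b": sorted(q for q, a in b_vs_tree_a.items() if not is_yes(a)),
    }
```

The same function would work for two versions of one policy or for two different companies, since only the answer dicts differ.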

It could also be cool to write a short white paper analysing how well this performs. It seems quite apt for "AI Compatable" and would make a great outreach resource for your website.

If we did conduct the experiment and write the white paper, then rather than relying only on the original paper's prompt metrics, I would also want to run some separate implementation experiments.


**Experiment 1: Single-Stage Combined Approach**
-   **Description:** This experiment employs a single LLM call with a single prompt that encompasses all instructions: analyzing the policy, identifying affirmations, formulating questions (with unique IDs), and logically structuring them into the specified JSON tree according to the dependency rule.
-   **Prompt Design:** The prompt combines all steps, including explicit JSON schema instructions and examples for nesting.
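For concreteness, the JSON tree that each experiment has to produce could look like the sketch below. The node shape (`id`, `question`, `children`) is hypothetical, since the schema is not fixed yet, and I am assuming the dependency rule means a child question only makes sense when its parent is answered "Yes". The two questions are taken from the OpenAI run further down; nesting Q5 under Q4 is my guess at a sensible ordering.

```python
# Hypothetical tree: Q5 (what Account Information includes) depends on
# Q4 (whether Account Information is collected at all).
tree = {
    "id": "Q4",
    "question": "Does the privacy policy affirm that Account Information is collected when an account is created?",
    "children": [
        {
            "id": "Q5",
            "question": "Does the privacy policy affirm that Account Information includes name, contact information, account credentials, date of birth, payment information, and transaction history?",
            "children": [],
        }
    ],
}


def evaluate_tree(node, ask):
    """Walk a question tree, skipping children whose parent was answered "No".

    ask: callable taking a question string and returning True/False,
         e.g. a wrapper around an LLM call or a manual check.
    """
    answer = ask(node["question"])
    results = {node["id"]: answer}
    if answer:
        for child in node.get("children", []):
            results.update(evaluate_tree(child, ask))
    return results
```

Pruning like this would also cut the number of questions sent to the model when a whole branch does not apply.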


**Experiment 2: Two-Stage Sequential (In-Chat)**
-   **Description:** This approach involves two sequential LLM turns within the same conversational context.
    -   **Turn 1:** The LLM receives the policy excerpt and instructions for generating the "True" questions, each with a unique ID, as a simple bulleted list (order does not matter here).
    -   **Turn 2:** The LLM is then prompted (in the same chat session, retaining memory of Turn 1's input and output) to take the questions it just generated and reorder them into the JSON tree structure according to the logical dependency rule.
-   **Prompt Design:**
    -   **Prompt for Turn 1 (Question Generation):** Similar to the "Prompt 1: Question Generation" detailed below, including the requirement for a unique ID for each question.
    -   **Prompt for Turn 2 (JSON Ordering):** Instructs the LLM to take "the questions you just generated" and structure them into the JSON tree, explicitly stating the nesting logic and the **logical contingency rule**.


**Experiment 3: Divorced Two-Stage (No Policy Context for Re-ordering)**
-   **Description:** This experiment separates the two tasks entirely, employing two distinct LLM calls (potentially to different instances or fresh contexts). The crucial difference here is that the LLM performing the ordering task *does not* receive the original privacy policy text.
    -   **Model A Call:** Receives the policy excerpt and generates the questions with IDs as a bulleted list (similar to Turn 1 of Experiment 2).
    -   **Model B Call:** Receives *only the list of generated questions* and instructions to reorder them into the JSON tree structure based on logical dependencies.
-   **Prompt Design:**
    -   **Prompt for Model A:** Identical to "Prompt 1: Question Generation" below.
    -   **Prompt for Model B:** Similar to "Prompt 2: JSON Tree Ordering," but the `Original Privacy Policy Excerpt` is **omitted**. The ordering model must infer dependencies purely from the question text and IDs.

**Experiment 4: Divorced Two-Stage (With Policy Context for Re-ordering)**
-   **Description:** This is a two-stage approach with distinct LLM calls, where the LLM performing the ordering task is *re-provisioned* with the original privacy policy context.
    -   **Model A Call:** Receives the policy excerpt and generates the questions with IDs as a bulleted list (identical to Model A in Experiment 3).
    -   **Model B Call:** Receives *both* the original privacy policy excerpt *and* the list of generated questions, along with instructions to reorder them into the JSON tree structure.
-   **Prompt Design:**
    -   **Prompt for Model A:** Identical to "Prompt 1: Question Generation" below.
    -   **Prompt for Model B:** Identical to "Prompt 2: JSON Tree Ordering" below, where the `Original Privacy Policy Excerpt` is explicitly included.
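The only difference between Experiments 3 and 4 is whether the ordering call sees the policy text, which can be captured in one helper. This is a sketch: `build_ordering_prompt` and the instruction wording are placeholders of mine, not the final "Prompt 2".

```python
import json

# Placeholder wording, standing in for the real ordering instructions.
ORDERING_INSTRUCTIONS = (
    "Reorder the questions below into a JSON tree, nesting each question "
    "under the question it logically depends on."
)


def build_ordering_prompt(questions, policy_excerpt=None):
    """Build the Model B prompt for Experiment 3 (no policy) or 4 (with policy)."""
    parts = [ORDERING_INSTRUCTIONS]
    if policy_excerpt is not None:
        # Experiment 4: re-provide the original policy text.
        parts.append("Original Privacy Policy Excerpt:\n" + policy_excerpt)
    parts.append("Questions:\n" + json.dumps(questions, indent=2))
    return "\n\n".join(parts)
```

Keeping everything else identical between the two calls should make any difference in tree quality attributable to the policy context alone.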

So while I'm testing the feasibility, you get a free bit of extra content regardless of whether it is successful: the actual experiment and its results.



**Role:** You are a Data Protection Officer (DPO) specializing in analyzing and tracking the evolution of Privacy Policies. Your expertise lies in identifying actionable statements and core privacy principles within legal texts.

**Objective:** To thoroughly analyze an excerpt of a privacy policy and extract all concrete statements (affirmations) made within it, then rephrase these affirmations as a set of concise, "True" questions, each with a unique identifier.

**Task Breakdown:**

1.  **Policy Excerpt Analysis:**
    *   Carefully read the provided `Privacy Policy Excerpt`.
    *   Identify all explicit statements, declarations, or affirmations made by the policy regarding the collection, processing, storage, sharing, security, transfer, or user rights concerning personal data.
    *   Focus on what the policy *states as a fact* or *affirms as its practice or intention*.

2.  **Identification of Key Affirmations:**
    *   For each identified statement, determine if it represents a significant action, condition, or commitment related to personal data handling.
    *   Assign an identifier (e.g., "Q1", "Q2", "Q3") to each distinct affirmation.

3.  **Question Formulation:**
    *   For each key affirmation identified in Step 2, formulate a question that:
        *   **Starts with the precise prefix:** `Does the privacy policy affirm that...`
        *   **Is followed by a statement that is *verifiably true* based *only* on the provided excerpt.** The statement must directly reflect an affirmation from the text.
        *   **Ends with a question mark (`?`).**
        *   Is concise and directly to the point, avoiding speculation or information not explicitly stated in the excerpt.
    *   Present each question with its assigned ID.

4.  **Output Generation:**
    *   Present the formulated questions as a JSON, where each item includes the question ID and the question itself. The order of questions in this list does not matter for this step; simply list them as they are identified.

**Important Considerations:**
*   **Truthfulness:** Every statement embedded within your question *must* be directly supported and affirmed by the exact wording of the provided `Privacy Policy Excerpt`. Do not infer or speculate.
*   **Completeness (within excerpt):** Aim to capture all significant affirmations related to personal data within the given text.
*   **Conciseness:** Keep each question focused on a single, clear point.

**Example Input**
```Privacy Policy Excerpt
We may store and process personal information collected on our site in the United States or any other country in which Corporation Inc. or its agents maintain facilities. By using our services, you consent to the transfer of your information among these facilities, including those located outside your country.
```
**Example Response**
```
{
"Q1": "Does the privacy policy affirm that personal data can be transferred outside of the user's country of origin?",
"Q2": "Does the privacy policy affirm that personal data transfers are automatically consented to by using the service?"
}

```
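The formatting rules in Step 3 are mechanical enough to check in code before trusting a response. A minimal sketch, assuming the model returns a flat JSON object of ID to question; `validate_questions` is a hypothetical helper name.

```python
import json

PREFIX = "Does the privacy policy affirm that"


def validate_questions(raw_json):
    """Return a list of problems found in a question-generation response."""
    problems = []
    questions = json.loads(raw_json)
    seen = {}  # question text -> first ID that used it
    for qid, text in questions.items():
        text = text.strip()
        if not text.startswith(PREFIX):
            problems.append(f"{qid}: missing required prefix")
        if not text.endswith("?"):
            problems.append(f"{qid}: does not end with '?'")
        if text in seen:
            problems.append(f"{qid}: duplicates {seen[text]}")
        else:
            seen[text] = qid
    return problems
```

It cannot check truthfulness against the excerpt, only shape and duplicates, so it complements rather than replaces the evaluation step.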


2. Personal Data we collect
We collect personal data relating to you (“Personal Data”) as follows:

Personal Data You Provide: We collect Personal Data if you create an account to use our Services or communicate with us as follows:

Account Information: When you create an account with us, we will collect information associated with your account, including your name, contact information, account credentials, date of birth, payment information, and transaction history, (collectively, “Account Information”).
User Content: We collect Personal Data that you provide in the input to our Services (“Content”), including your prompts and other content you upload, such as files, images, and audio, depending on the features you use.
Communication Information: If you communicate with us, such as via email or our pages on social media sites, we may collect Personal Data like your name, contact information, and the contents of the messages you send (“Communication Information”).
Other Information You Provide: We collect other information that you may provide to us, such as when you participate in our events or surveys or provide us with information to establish your identity or age (collectively, “Other Information You Provide”).

{
"Q1": "Does the privacy policy affirm that it collects personal data relating to the user?",
"Q2": "Does the privacy policy affirm that Personal Data is collected if a user creates an account to use its Services?",
"Q3": "Does the privacy policy affirm that Personal Data is collected if a user communicates with the service?",
"Q4": "Does the privacy policy affirm that Account Information is collected when an account is created?",
"Q5": "Does the privacy policy affirm that Account Information includes name, contact information, account credentials, date of birth, payment information, and transaction history?",
"Q6": "Does the privacy policy affirm that it collects Personal Data provided as input to its Services, referred to as Content?",
"Q7": "Does the privacy policy affirm that Content includes user prompts and other uploaded content such as files, images, and audio?",
"Q8": "Does the privacy policy affirm that Communication Information may be collected if a user communicates via email or social media sites?",
"Q9": "Does the privacy policy affirm that Communication Information includes the user's name, contact information, and the contents of messages sent?",
"Q10": "Does the privacy policy affirm that Other Information You Provide is collected when a user participates in its events or surveys?",
"Q11": "Does the privacy policy affirm that Other Information You Provide is collected when a user provides information to establish their identity or age?",
"Q12": "Does the privacy policy affirm that it receives 'Technical Information' when a user visits, uses, or interacts with the Services?",
"Q13": "Does the privacy policy affirm that it collects 'Log Data' that a user's browser or device automatically sends?",
"Q14": "Does the privacy policy affirm that Log Data includes the user's IP address, browser type and settings, and the date and time of the request?",
"Q15": "Does the privacy policy affirm that it collects 'Usage Data' about the types of content a user views or engages with, the features they use, and the actions they take?",
"Q16": "Does the privacy policy affirm that Usage Data includes the user's time zone, country, dates and times of access, and user agent?",
"Q17": "Does the privacy policy affirm that it collects 'Device Information' such as the device name, operating system, device identifiers, and browser?",
"Q18": "Does the privacy policy affirm that the specific Device Information collected may depend on the type of device used and its settings?",
"Q19": "Does the privacy policy affirm that it may determine a user's general location area based on their IP address?",
"Q20": "Does the privacy policy affirm that general location is determined for security reasons, like detecting unusual login activity, and to improve the product experience?",
"Q21": "Does the privacy policy affirm that some services allow a user to choose to provide more precise location information from their device's GPS?",
"Q22": "Does the privacy policy affirm that it uses cookies and similar technologies to operate, administer, and improve its Services?",
"Q23": "Does the privacy policy affirm that for users without an account, it may store information in cookies to maintain preferences across browsing sessions?",
"Q25": "Does the privacy policy affirm that it receives information from security partners to protect against fraud and abuse?",
"Q26": "Does the privacy policy affirm that it receives information about potential business customers from marketing vendors?",
"Q27": "Does the privacy policy affirm that it collects information that is publicly available on the internet?",
"Q28": "Does the privacy policy affirm that information collected from public sources is used to develop the models that power its Services?",
"Q29": "Does the privacy policy affirm that more information on the sources used for model development is available in a separate article?"
}

1. Personal Information We Collect
Personal information typically means information that identifies or is reasonably capable of identifying an individual, directly or indirectly, and information that relates to, describes, is reasonably capable of being associated with or could reasonably be linked to an identified or reasonably identifiable individual. For the purposes of this Privacy Policy, only the definition of personal information from the applicable law of your legal residence will apply to you and be deemed your “Personal Information.”
A. Personal Information we collect from you
We may collect the following categories of Personal Information directly from you:
Identification Information, such as name, email, date of birth, phone number, postal address, and/or government-issued identity documents;
Commercial Information, such as trading activity, order activity, deposits, withdrawals, account balances;
Financial Information, such as bank account information, routing number, or other financial account information;
Correspondence, such as information that you provide to us in correspondence, including account opening and customer support;
Audio, Electronic, Visual, Thermal, Olfactory, or Similar Information, such as images and video collected for identity verification, audio recordings left on answering machines;
Biometric Information, such as scans of your face geometry extracted from identity documents;
Professional or Employment-related Information, such as job title, source of wealth;
Institutional Information, such as for institutional customers, we may collect additional information, including: institution’s legal name, Employer Identification Number (“EIN”) or any comparable identification number issued by a government, and proof of legal existence (which may include articles of incorporation, certificate of formation, business license, trust instrument, or other comparable legal document); and
Sensitive Personal Information, such as government-issued identification numbers (which may include Social Security Number or equivalent, driver’s license number, passport number) and financial account information.
Preferences, such as settings and preferences you select in the Gemini app.
Communications, such as survey responses, information provided to our Customer Support team, including communications with interfaces such as chatbots.
Referral information, such as your contacts’ phone or email addresses if you choose to invite those contacts to Gemini.

B. Personal Information we collect automatically
We may collect the following categories of Personal Information automatically through your use of our services:
Online Identifiers, such as IP address; domain name, geographic location;
Device Information, such as hardware, operating system, browser, screen size; and
Usage Data, such as system activity, internal and external information related to Gemini pages that you visit, clickstream information, keystrokes, mouse movements, form field entries, recordings of chat sessions or your use of and inputs to other AI-supported tools, and other use and overall engagement with our Services.
Our automatic collection of Personal Information may involve the use of Cookies and other tracking technologies, described in greater detail below.

C. Personal Information we collect from third parties
We may collect and/or verify the following categories of Personal Information about you from Third Parties:
Identification Information, such as name, email, phone number, postal address;  
Financial Information, such as bank account information, routing number. When you use third party services (for example, when you connect your Gemini account to your bank account) or websites that are linked through our Services, the providers of those services or products may receive information that Gemini, you, or others share with them. Those third party services are not governed by this Privacy Policy, and their own terms and privacy policies will apply to those products and services;
Transaction Information, such as public blockchain data (bitcoin, ether, and other Digital Assets are not truly anonymous). We, and any others who can match your public Digital Asset address to other Personal Information about you, may be able to identify you from a blockchain transaction because, in some circumstances, Personal Information published on a blockchain (such as your Digital Asset address and IP address) can be correlated with Personal Information that we and others may have. Furthermore, by using data analysis techniques on a given blockchain, it may be possible to identify other Personal Information about you; 
Credit and Fraud Information, such as credit investigation, credit eligibility, identity or account verification, fraud detection, or as may otherwise be required by applicable law;
Sensitive Personal Information, such as government identification numbers (which may include Social Security Number or equivalent, driver’s license number, passport number) and financial account information and
Additional Information, as permitted by law or required  to comply with legal obligations, which may include criminal records or alleged criminal activity, or information about any person or corporation with whom you have had, currently have, or may have a financial relationship. 
Personal Information you provide during the registration process may be retained, even if your registration is left incomplete or abandoned.

D. Combination of Personal information
Please note that we may combine Personal Information that we receive from various sources. For example, we may combine Personal Information that we receive from third parties with Personal Information we already have about you. We use, disclose, and protect combined Personal Information as described in this Privacy Policy.
Please also note that we may de-identify or aggregate Personal Information so that it will no longer be considered Personal Information and disclose such information to other parties for purposes consistent with those described in this Privacy Policy.

"Q1": "Does the privacy policy affirm that it collects personal data relating to the user?",
"Q2": "Does the privacy policy affirm that Personal Data is collected if a user creates an account to use its Services?",
"Q3": "Does the privacy policy affirm that Personal Data is collected if a user communicates with the service?",
"Q4": "Does the privacy policy affirm that Account Information is collected when an account is created?",
"Q5": "Does the privacy policy affirm that Account Information includes name, contact information, account credentials, date of birth, payment information, and transaction history?",
"Q6": "Does the privacy policy affirm that it collects Personal Data provided as input to its Services, referred to as Content?",
"Q7": "Does the privacy policy affirm that Content includes user prompts and other uploaded content such as files, images, and audio?",
"Q8": "Does the privacy policy affirm that Communication Information may be collected if a user communicates via email or social media sites?",
"Q9": "Does the privacy policy affirm that Communication Information includes the user's name, contact information, and the contents of messages sent?",

**Task:**

Does the privacy policy affirm any of the following questions:
{

}
**Output Format:**
Please answer **yes** or **no** to the questions
Please adapt your answer to the following format:
{
<Question_number_a>:<Yes>|<No>,
<Question_number_b>:<Yes>|<No>,
....
}


**Task:**

Does the privacy policy affirm any of the following questions:
{
"Q1": "Does the privacy policy affirm that it collects personal data relating to the user?",
"Q2": "Does the privacy policy affirm that Personal Data is collected if a user creates an account to use its Services?",
"Q3": "Does the privacy policy affirm that Personal Data is collected if a user communicates with the service?",
"Q4": "Does the privacy policy affirm that Account Information is collected when an account is created?",
"Q5": "Does the privacy policy affirm that Account Information includes name, contact information, account credentials, date of birth, payment information, and transaction history?",
"Q6": "Does the privacy policy affirm that it collects Personal Data provided as input to its Services, referred to as Content?",
"Q7": "Does the privacy policy affirm that Content includes user prompts and other uploaded content such as files, images, and audio?",
"Q8": "Does the privacy policy affirm that Communication Information may be collected if a user communicates via email or social media sites?",
"Q9": "Does the privacy policy affirm that Communication Information includes the user's name, contact information, and the contents of messages sent?",
}
**Output Format:**
Please answer **yes** or **no** to the questions
Please adapt your answer to the following format:
{
<Question_number_a>:<Yes>|<No>,
<Question_number_b>:<Yes>|<No>,
....
}
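Filling this evaluation template by hand gets tedious once there are several policies and question sets, so it could be generated. A sketch, assuming the policy excerpt is pasted above the task block (as in these notes); `EVAL_TEMPLATE` and `build_eval_prompt` are hypothetical names.

```python
import json

# Doubled braces are literal braces after str.format().
EVAL_TEMPLATE = """{policy}

**Task:**

Does the privacy policy affirm any of the following questions:
{questions}
**Output Format:**
Please answer **yes** or **no** to the questions
Please adapt your answer to the following format:
{{
<Question_number_a>:<Yes>|<No>,
<Question_number_b>:<Yes>|<No>,
....
}}
"""


def build_eval_prompt(policy_text, questions):
    """Fill the evaluation template with a policy excerpt and a question set."""
    return EVAL_TEMPLATE.format(
        policy=policy_text.strip(),
        # indent=0 puts each question on its own line without indentation.
        questions=json.dumps(questions, indent=0),
    )
```

Serialising the questions with `json.dumps` also guarantees the question block is valid JSON, which hand-editing does not.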


I briefly tested it with an excerpt of OpenAI's policy and a similar excerpt from Gemini. I didn't include the tree structure just yet, but here are the questions it produced:
{
"Q1": "Does the privacy policy affirm that it collects personal data relating to the user?",
"Q2": "Does the privacy policy affirm that Personal Data is collected if a user creates an account to use its Services?",
"Q3": "Does the privacy policy affirm that Personal Data is collected if a user communicates with the service?",
"Q4": "Does the privacy policy affirm that Account Information is collected when an account is created?",
"Q5": "Does the privacy policy affirm that Account Information includes name, contact information, account credentials, date of birth, payment information, and transaction history?",
"Q6": "Does the privacy policy affirm that it collects Personal Data provided as input to its Services, referred to as Content?",
"Q7": "Does the privacy policy affirm that Content includes user prompts and other uploaded content such as files, images, and audio?",
"Q8": "Does the privacy policy affirm that Communication Information may be collected if a user communicates via email or social media sites?",
"Q9": "Does the privacy policy affirm that Communication Information includes the user's name, contact information, and the contents of messages sent?",
"Q10": "Does the privacy policy affirm that Other Information You Provide is collected when a user participates in its events or surveys?",
"Q11": "Does the privacy policy affirm that Other Information You Provide is collected when a user provides information to establish their identity or age?",
"Q12": "Does the privacy policy affirm that it receives 'Technical Information' when a user visits, uses, or interacts with the Services?",
"Q13": "Does the privacy policy affirm that it collects 'Log Data' that a user's browser or device automatically sends?",
"Q14": "Does the privacy policy affirm that Log Data includes the user's IP address, browser type and settings, and the date and time of the request?",
"Q15": "Does the privacy policy affirm that it collects 'Usage Data' about the types of content a user views or engages with, the features they use, and the actions they take?",
"Q16": "Does the privacy policy affirm that Usage Data includes the user's time zone, country, dates and times of access, and user agent?",
"Q17": "Does the privacy policy affirm that it collects 'Device Information' such as the device name, operating system, device identifiers, and browser?",
"Q18": "Does the privacy policy affirm that the specific Device Information collected may depend on the type of device used and its settings?",
"Q19": "Does the privacy policy affirm that it may determine a user's general location area based on their IP address?",
"Q20": "Does the privacy policy affirm that general location is determined for security reasons, like detecting unusual login activity, and to improve the product experience?",
"Q21": "Does the privacy policy affirm that some services allow a user to choose to provide more precise location information from their device's GPS?",
"Q22": "Does the privacy policy affirm that it uses cookies and similar technologies to operate, administer, and improve its Services?",
"Q23": "Does the privacy policy affirm that for users without an account, it may store information in cookies to maintain preferences across browsing sessions?",
"Q25": "Does the privacy policy affirm that it receives information from security partners to protect against fraud and abuse?",
"Q26": "Does the privacy policy affirm that it receives information about potential business customers from marketing vendors?",
"Q27": "Does the privacy policy affirm that it collects information that is publicly available on the internet?",
"Q28": "Does the privacy policy affirm that information collected from public sources is used to develop the models that power its Services?",
"Q29": "Does the privacy policy affirm that more information on the sources used for model development is available in a separate article?"
}

and here are the answers it produced:
{
"Q1":"Yes",
"Q2":"Yes",
"Q3":"Yes",
"Q4":"Yes",
"Q5":"No",
"Q6":"No",
"Q7":"No",
"Q8":"No",
"Q9":"Yes",
"Q10":"Yes",
"Q11":"Yes",
"Q12":"Yes",
"Q13":"Yes",
"Q14":"No",
"Q15":"Yes",
"Q16":"No",
"Q17":"No",
"Q18":"No",
"Q19":"Yes",
"Q20":No,
"Q21":No,
"Q22":No,
"Q23":No,
"Q25":Yes,
"Q26":No,
"Q27":Yes,
"Q28":No,
"Q29":No
}
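The answers above come back with inconsistent quoting (`"Q19":"Yes"` but `"Q20":No`), so `json.loads` would reject them as they stand. A lenient regex-based parser is one way around that; this is a sketch, and `parse_answers` / `missing_ids` are hypothetical helper names.

```python
import re

# Matches "Q19":"Yes" as well as "Q20":No (quotes optional on both sides).
ANSWER_PATTERN = re.compile(r'"?(Q\d+)"?\s*:\s*"?(yes|no)"?', re.IGNORECASE)


def parse_answers(raw):
    """Extract {question_id: bool} from a Yes/No response."""
    return {qid: ans.lower() == "yes" for qid, ans in ANSWER_PATTERN.findall(raw)}


def missing_ids(answers, expected_ids):
    """Question IDs that were asked but never answered."""
    return [qid for qid in expected_ids if qid not in answers]
```

Checking `missing_ids` against the question set is a cheap guard against the model silently skipping a question.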


In [12]:
# import requests
# from bs4 import BeautifulSoup
# from markdownify import markdownify as md

# # url = "https://openai.com/policies/privacy-policy/"
# url = "https://www.gemini.com/en-SG/legal/privacy-policy"
# headers = {
# 	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
# }

# output_filename = "gemini.md"

# try:
# 	response = requests.get(url, headers=headers, timeout=10)
# 	response.raise_for_status()
# 	html_content = response.text

# 	soup = BeautifulSoup(html_content, "lxml")

# 	main_content_element = soup.find("main")

# 	if not main_content_element:
# 		raise TypeError(
# 			"Could not find the main content container with id='main'. The page structure may have changed."
# 		)

# 	markdown_content = md(str(main_content_element), heading_style="ATX")

# 	with open(output_filename, "w", encoding="utf-8") as file:
# 		file.write(markdown_content)


# except requests.exceptions.RequestException as e:
# 	print(f"An error occurred during the request: {e}")
# except Exception as e:
# 	print(f"An error occurred: {e}")

In [13]:
prompt2 = """

**Role:** You are a Data Protection Officer (DPO) specializing in analyzing and tracking the evolution of Privacy Policies. Your expertise lies in identifying actionable statements and core privacy principles within legal texts.

**Objective:** To thoroughly analyze an excerpt of a privacy policy and extract all concrete statements (affirmations) made within it, then rephrase these affirmations as a set of concise, "True" questions, each with a unique identifier.

**Task Breakdown:**

1.  **Policy Excerpt Analysis:**
    *   Carefully read the provided `Privacy Policy Excerpt`.
    *   Identify all explicit statements, declarations, or affirmations made by the policy regarding the collection, processing, storage, sharing, security, transfer, or user rights concerning personal data.
    *   Focus on what the policy *states as a fact* or *affirms as its practice or intention*.

2.  **Identification of Key Affirmations:**
    *   For each identified statement, determine if it represents a significant action, condition, or commitment related to personal data handling.
    *   Assign an identifier (e.g., "Q1", "Q2", "Q3") to each distinct affirmation.

3.  **Question Formulation:**
    *   For each key affirmation identified in Step 2, formulate a question that:
        *   **Starts with the precise prefix:** `Does the privacy policy affirm that...`
        *   **Is followed by a statement that is *verifiably true* based *only* on the provided excerpt.** The statement must directly reflect an affirmation from the text.
        *   **Ends with a question mark (`?`).**
        *   Is concise and directly to the point, avoiding speculation or information not explicitly stated in the excerpt.
    *   Present each question with its assigned ID.

4.  **Output Generation:**
    *   Present the formulated questions as a JSON, where each item includes the question ID and the question itself. The order of questions in this list does not matter for this step; simply list them as they are identified.

**Important Considerations:**
*   **Truthfulness:** Every statement embedded within your question *must* be directly supported and affirmed by the exact wording of the provided `Privacy Policy Excerpt`. Do not infer or speculate.
*   **Completeness (within excerpt):** Aim to capture all significant affirmations related to personal data within the given text.
*   **Conciseness:** Keep each question focused on a single, clear point.

**Example Input**
```Privacy Policy Excerpt
We may store and process personal information collected on our site in the United States or any other country in which Corporation Inc. or its agents maintain facilities. By using our services, you consent to the transfer of your information among these facilities, including those located outside your country.
```
**Example Response**
```
{
"Q1": "Does the privacy policy affirm that personal data can be transferred outside of the user's country of origin?",
"Q2": "Does the privacy policy affirm that personal data transfers are automatically consented to by using the service?"
}

```
"""


# Earlier test URLs; only the last assignment takes effect.
# url = "https://openai.com/policies/privacy-policy/"
# url = "https://www.gemini.com/en-SG/legal/privacy-policy"
url = "https://www.anthropic.com/legal/privacy"

In [14]:
data_source = {
	"gemini": "https://www.gemini.com/en-SG/legal/privacy-policy",
	"openai": "https://openai.com/policies/privacy-policy/",
	"anthropic": "https://www.anthropic.com/legal/privacy",
}

In [None]:
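import hashlib
import json
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md


# Assumed location for the saved hashes; change as needed.
HASH_FILE = Path("policy_hashes.json")


def saveHash(key, content):
	"""Store a stable hash of `content` under `key`; return True if it is new or changed."""
	# hashlib gives a digest that is stable across runs, unlike the built-in
	# hash(), which is salted per process for strings.
	sig = hashlib.sha256(content.encode("utf-8")).hexdigest()
	hashes = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
	if hashes.get(key) == sig:
		return False
	hashes[key] = sig
	HASH_FILE.write_text(json.dumps(hashes, indent=2))
	return True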


def splitMarkdown(markdown_text):
	heading_pattern = r"^#{1,6}\s+.*"
	parts = re.split(heading_pattern, markdown_text, flags=re.MULTILINE)
	content_list = [part.strip() for part in parts[1:] if part.strip()]
	return content_list


def removePreamble(markdown_text):
	pattern = r"\A.*?(?=^#\s)"
	cleaned_text = re.sub(pattern, "", markdown_text, flags=re.DOTALL | re.MULTILINE)
	return cleaned_text


def extractContent(url, headers=None):
	if headers is None:
		headers = {
			"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
		}

	# Fetch outside the `if` so that caller-supplied headers are honoured too.
	response = requests.get(url, headers=headers, timeout=10)
	response.raise_for_status()
	html_content = response.text

	soup = BeautifulSoup(html_content, "lxml")

	# Fall back to the whole document when the page has no <main> element.
	main_content_element = soup.find("main")
	if main_content_element is None:
		main_content_element = soup

	_markdown_content = md(str(main_content_element), heading_style="ATX")
	markdown_content = removePreamble(_markdown_content)
	return markdown_content


def saveMdFile(content, name):
	if not name.endswith(".md"):
		name = name + ".md"

	with open(name, "w", encoding="utf-8") as file:
		file.write(content)


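def collatePolicy(data_source):
	"""Fetch and save every policy; return the sections of the last one processed."""
	markdown_content = ""
	for k, v in data_source.items():
		markdown_content = extractContent(v)
		if saveHash(k, markdown_content):
			# TODO when the policy is new or changed:
			# - run the LLM question analysis on the content
			# - add the data and the new version's questions to the question JSON
			# - re-analyse all saved privacy policies against those questions (do at end)
			# - update the GUI
			pass

		saveMdFile(markdown_content, k)
	# Only the last policy in data_source is split and returned.
	return splitMarkdown(markdown_content)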
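
`splitMarkdown` discards the heading text, so the resulting sections can no longer be traced back to the part of the policy they came from. If the questions are later grouped by section, keeping the titles may help. A rough sketch, assuming ATX-style headings as produced by `markdownify` with `heading_style="ATX"`; `splitMarkdownWithHeadings` is a hypothetical helper, not part of the notebook.

```python
import re

# ATX heading: one to six '#' characters, spaces or tabs, then the title on the same line.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*)$", flags=re.MULTILINE)


def splitMarkdownWithHeadings(markdown_text):
    """Return (heading, body) pairs; headings with an empty body are skipped."""
    matches = list(HEADING.finditer(markdown_text))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs from the end of its heading to the start of the next one.
        body_start = match.end()
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        body = markdown_text[body_start:body_end].strip()
        if body:
            sections.append((match.group(2).strip(), body))
    return sections
```

As with `splitMarkdown`, any text before the first heading is dropped.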

In [None]:
a = collatePolicy(data_source)

In [27]:
a[0]

'Effective May 1, 2025[Previous Version](/legal/archive/a2eecf43-807a-4a53-89dd-04c44c351138)\n\nAnthropic is an AI safety and research company working to build reliable, interpretable, and steerable AI systems.\n\nThis Privacy Policy explains how we collect, use, disclose, and process your personal data when you use our website and other places where Anthropic acts as a\xa0*data controller*—for example, when you interact with Claude.ai or other products as a consumer for personal use ("**Services**") or when Anthropic operates and provides our commercial customers and their end users with access to our commercial products, such as the Claude Team plan (“**Commercial Services**”).\n\nThis Privacy Policy does not apply where Anthropic acts as a\xa0*data processor*and processes personal data on behalf of commercial customers using Anthropic’s Commercial Services – for example, your employer has provisioned you a Claude for Work account, or you\'re using an app that is powered on the back

In [28]:
# Test String
markdown_with_preamble = """
This is some introductory text that we want to remove.
It can span multiple lines.

It might even have some other markdown like ## Subtitles or *italics*.

# My First Real Title
This is the content we want to keep.
It continues on another line.

# Another Title
This should not be affected.
"""

# Example of how to use it:
# cleaned_text = removePreamble(markdown_with_preamble)
# print(cleaned_text)
hash(markdown_with_preamble)

7300946887645284755
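
Note that Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is fixed, so the integer above will generally differ after a kernel restart and cannot be compared with a value saved earlier. One possible alternative is a content digest that also ignores cosmetic whitespace differences, so that a re-scrape with different line endings does not register as a policy change. This is only a sketch; `contentDigest` is a hypothetical name, and the normalisation rules are an assumption about which differences should be ignored.

```python
import hashlib


def contentDigest(text):
    """Stable SHA-256 hex digest of text, ignoring line-ending and trailing-space differences."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalised = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

For example, `contentDigest("a \r\nb\n")` and `contentDigest("a\nb")` give the same digest, while any change to the wording gives a different one.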

['This is the content we want to keep.\nIt continues on another line.',
 'Content for some heading',
 'This should not be affected.',
 'Some other content',
 'And the final content.']
