<a href="https://colab.research.google.com/github/arvyz/privacynlu/blob/main/qa_data_aug.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers --quiet
!pip install torch --quiet
!pip install accelerate --quiet
!pip install datasets --quiet
!pip install huggingface_hub --quiet


In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from transformers import AutoTokenizer, pipeline
import transformers
import torch
from huggingface_hub import HfApi

In [4]:
import pandas as pd
from datasets import Dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Load Train & Test Data For Sentence Simplification

Training data from the privacy policies is loaded, and the privacy sentences are "rewritten" for simplification. Because each question in the QA set is mapped against every sentence, the same sentences repeat many times across the dataset. We therefore "rewrite" only the unique sentences and then map the newly generated sentences back to the QA set.
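
The deduplicate-then-map-back idea described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the `Query` column name, the toy rows, and the `fake_rewrite` stand-in for the LLM are all assumptions made for the example. In pandas terms, the final step would roughly correspond to mapping the `Segment` column through a dictionary of rewrites.

```python
# Toy QA rows: every question is paired with every policy sentence,
# so each sentence appears once per question.
# "Query" is a hypothetical column name; "Segment" matches the notebook.
qa_rows = [
    {"Query": "q1", "Segment": "We collect data."},
    {"Query": "q1", "Segment": "We share data."},
    {"Query": "q2", "Segment": "We collect data."},
    {"Query": "q2", "Segment": "We share data."},
]

# Step 1: keep each sentence once, preserving first-seen order.
unique_segments = list(dict.fromkeys(row["Segment"] for row in qa_rows))


def fake_rewrite(sentence):
    """Placeholder for the LLM rewrite (upper-casing is NOT a simplification)."""
    return sentence.upper()


# Step 2: rewrite each unique sentence exactly once.
rewrites = {segment: fake_rewrite(segment) for segment in unique_segments}

# Step 3: map the rewrites back onto every QA row.
augmented_rows = [
    {**row, "Generated_Text": rewrites[row["Segment"]]} for row in qa_rows
]

for row in augmented_rows:
    print(row)
```

The model is called twice here instead of four times; on the real data the saving is proportional to how often each sentence repeats across questions.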

## Size
The full training corpus contains 3411 unique sentences (some sentences repeat across policies).

The full testing corpus is 1231 unique sentences (some sentences repeat across policies).

In [20]:
file_path = '/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_data.csv'

# Load the CSV file
df = pd.read_csv(file_path, sep='\t')  # Adjust the path and separator based on your file

# Extract unique segments and create a Dataset
# We will also only look at unique sentences since they are repeated for each question.
unique_segments_df_full = pd.DataFrame({'Segment': df['Segment'].unique()})

In [50]:
file_path = '/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_test_data.csv'

# Load the CSV file
df_test = pd.read_csv(file_path, sep='\t')  # Adjust the path and separator based on your file

# Extract unique segments and create a Dataset
# We will also only look at unique sentences since they are repeated for each question.
unique_segments_df_test = pd.DataFrame({'Segment': df_test['Segment'].unique()})

In [21]:
display(unique_segments_df_full.shape)
display(unique_segments_df_full.head())
display(unique_segments_df_test.shape)
display(unique_segments_df_test.head())

(3411, 1)

Unnamed: 0,Segment
0,"This privacy policy, with our Terms of Servic..."
1,We encourage you to read this privacy policy c...
2,By using our application or other online servi...
3,"When we post changes to this privacy policy, ..."
4,We encourage you to review this privacy policy...


(200, 1)

Unnamed: 0,Segment
1834,You may opt out at any time from allowing furt...
1094,Social Media Features.
2109,"This enables us, subject to your choices and p..."
37,We use advertising IDs such as Apples Adverti...
1264,13/14 of Regulation (EU) 2016/679 (General Dat...


## Sampling Process for getting heuristic parameters
We will look at a smaller "sample" of sentences (n=200) to define the heuristic weights for the "rewriting" process. The sample is drawn across all the policies to get a wide variety of sentences, so that we can validate the data prep process and the prompts used for the final "rewriting".

In [None]:
# Let's first experiment on a few 100 sentences
sample = 200
unique_segments_df_sample = unique_segments_df_full.sample(n=sample)
display(unique_segments_df_sample.shape)
display(unique_segments_df_sample.head())

(200, 1)

Unnamed: 0,Segment
480,Any information that you post in these locatio...
2912,Provide personalized experiences
2833,When you provide your information to Epic as p...
604,Our name is an acronym for the Latin p...
88,"If you do not agree to the changes, you should..."


# Prompt Engineering

To simplify each sentence of the privacy policy, we will use a Large Language Model (LLM) to generate the output. The model is Meta's Llama 2 7B, fine-tuned for "Chat" contexts so that it can take a chat prompt and produce a response.

## Prompt Attempts
LLMs are sensitive to prompt wording, so it is important to test several prompts to drive the right type of response from the LLM.

The following prompts were tested. The final prompt chosen is in bold:

1.   rephrase and simplify this sentence for clarity for an average user:
2.   Rephrase and simplify the following sentence for clarity for an average user and do not add any qualifiying statements to the response.
3.   **Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE:**
4.  Rephrase and simplify for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE:
5. Rephrase, simplify, and clarify for an average user; do not add any qualifiying statements to the response.  SENTENCE:

A few other prompts were also tested, but they did not work at all, so I have excluded them from the data.
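
To compare prompts side by side, it can help to keep the candidate templates in one place and render all of them for the same sentence. The sketch below is an illustration under assumptions: the `{segment}` placeholder, the way the sentence is appended to prompts 1 and 2, and the helper name `build_prompts` are not from the notebook. The prompt wording (including its original spelling) is copied from the list above so that it matches what was actually tested.

```python
# Candidate prompt templates, numbered as in the list above.
# Wording is kept verbatim; "{segment}" is a hypothetical placeholder.
PROMPT_TEMPLATES = {
    1: "rephrase and simplify this sentence for clarity for an average user: {segment}\n",
    2: ("Rephrase and simplify the following sentence for clarity for an average user "
        "and do not add any qualifiying statements to the response. {segment}\n"),
    3: ("Rephrase and simplify the 'SENTENCE' for clarity for an average user "
        "and do not add any qualifiying statements to the response.  SENTENCE: {segment}\n"),
    4: ("Rephrase and simplify for clarity for an average user "
        "and do not add any qualifiying statements to the response.  SENTENCE: {segment}\n"),
    5: ("Rephrase, simplify, and clarify for an average user; "
        "do not add any qualifiying statements to the response.  SENTENCE: {segment}\n"),
}


def build_prompts(segment):
    """Render every candidate template for one policy sentence."""
    return {
        number: template.format(segment=segment)
        for number, template in PROMPT_TEMPLATES.items()
    }


prompts = build_prompts("Web Behavior Data")
for number, prompt in prompts.items():
    print(number, repr(prompt))
```

Feeding the same sentence through each rendered prompt makes differences in the responses attributable to the prompt rather than to the input.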

## Prompt Selection Process

Sample "rewritten" data was generated for all of the above prompts, and the final prompt for the QA experiment was selected by visual inspection of the results, to ensure the generated text reasonably and faithfully represented the underlying intent of the sentence. ***NOTE:*** This was not done by a privacy expert but by the research principal in consultation with another layperson. I had to balance some really great simplifications against some really awkward hallucinations. See below for two different sets of such responses.



|                         | Sample Prompt & Responses 1 | Sample Prompt & Responses 2 |
|-------------------------|-----------------------------|-----------------------------|
| **Original Sentence:**  | When You Give Consent We may share information about you with other companies if you give us permission or direct us to share the information. | Web Behavior Data |
| **Prompt 1 Response:**  | Answer: When you give permission, we may share information about you with other companies. | *There was no response* |
| **Prompt 3 Response:**  | REPHRASED SENTENCE: We may share information about you with other companies if you give us permission or direct us to do so. | Rephrased and simplified sentence: Data on how people use the web. |
| **Prompt 4 Response:**  | REPHRASED SENTENCE: When you give permission, we may share information about you with other companies. | Web Behavior Data refers to the information collected about a user's online activities, including their browsing history, search queries, and interactions with websites and apps. This data can be used to create a detailed profile of the user's online behavior, which can be used to personalize their online experience, improve the effectiveness of online advertising, and enhance the overall user experience. |
| **Prompt 5 Response:**  | REPHRASED SENTENCE: If you agree, we can share information about you with other companies. | Web Behavior Data refers to the information collected about a user's online activities, including their browsing history, search queries, and interactions with websites and applications. This data can be used to create a detailed profile of the user's online behavior, which can be used to personalize their online experience, improve the effectiveness of online advertising, and enhance the overall user experience. |


In [7]:
# # Function to add prompts to each segment
# def add_prompts(row):
#     prompt = f"rephrase and simplify this sentence for clarity for an average user: {row['Segment']}\n"
#     return {"Prompt": prompt}

# Let's use a common variable name to help with code replication
# dataset = Dataset.from_pandas(unique_segments_df)

# # Apply function to create prompts
# dataset = dataset.map(add_prompts)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [8]:
# # Function to add prompts to each segment
# def add_prompts_2(row):
#     prompt = f"Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}\n"
#     return {"Prompt": prompt}

# Let's use a common variable name to help with code replication
# dataset_2 = Dataset.from_pandas(unique_segments_df)

# # Apply function to create prompts
# dataset_2 = dataset_2.map(add_prompts_2)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [9]:
# # Function to add prompts to each segment
# def add_prompts_3(row):
#     prompt = f"Rephrase and simplify for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}\n"
#     return {"Prompt": prompt}

# Let's use a common variable name to help with code replication
# dataset_3 = Dataset.from_pandas(unique_segments_df)

# # Apply function to create prompts
# dataset_3 = dataset_3.map(add_prompts_3)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [10]:
# # Function to add prompts to each segment
# def add_prompts_4(row):
#     prompt = f"Rephrase, simplify, and clarify for an average user; do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}\n"
#     return {"Prompt": prompt}

# Let's use a common variable name to help with code replication
# dataset_4 = Dataset.from_pandas(unique_segments_df)

# # Apply function to create prompts
# dataset_4 = dataset_4.map(add_prompts_4)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [11]:
# # Function to add prompts to each segment
# def add_prompts_5(row):
#     prompt = f"Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}\n"
#     return {"Prompt": prompt}

# # same as 2 but; just a small difference in beam search
# dataset_5 = Dataset.from_pandas(unique_segments_df)

# # Apply function to create prompts
# dataset_5 = dataset_5.map(add_prompts_5)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [16]:
# # Function to add prompts to each segment
# def add_prompts_6(row):
#     prompt = f"Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}\n"
#     return {"Prompt": prompt}

# Let's use a common variable name to help with code replication
# dataset_6 = Dataset.from_pandas(unique_segments_df)

# # Apply function to create prompts
# dataset_6 = dataset_6.map(add_prompts_6)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [22]:
# Based on Trial 2; simple prompt; low temperature

# Function to add prompts to each segment
def add_prompts_full(row):
    prompt = f"Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}\n"
    return {"Prompt": prompt}

# Let's use a common variable name to help with code replication
dataset_full = Dataset.from_pandas(unique_segments_df_full)

# Apply function to create prompts
dataset_full = dataset_full.map(add_prompts_full)

Map:   0%|          | 0/3411 [00:00<?, ? examples/s]

In [23]:
# print("Dataset 1: ", dataset[0]['Prompt'])
# print("Dataset 2: ", dataset_2[0]['Prompt'])
# print("Dataset 3: ", dataset_3[0]['Prompt'])
# print("Dataset 4: ", dataset_4[0]['Prompt'])
# print("Dataset 5: ", dataset_5[0]['Prompt'])
# print("Dataset 6: ", dataset_6[0]['Prompt'])
print("Dataset full: ", dataset_full[0]['Prompt'])

Dataset full:  Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE:  This privacy policy, with our Terms of Service & License Agreement, is the agreement between you and us that explains what data we collect, how we store and transfer it, and how we use it.



# Sentence Rewriting for Dataset Creation

## LLM Initialization - LLama-2-7b-chat

We are going to use the open-source Meta Llama 2 model with 7B parameters to do our sentence rewriting. This model was chosen because it has performed well in LLM tasks, is easy to load and work with in resource-constrained environments, and is freely available to researchers to use in their work.

We will specifically use the Hugging Face implementation that has been fine-tuned for chat so that we can provide prompts for text generation.

https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

## PyTorch

For this project we will be using PyTorch, as it is the framework used to train and test the baseline QA and privacy-dataset models and the fine-tuned models from PLUE and PrivaSeer.

https://github.com/JFChi/PLUE
https://huggingface.co/mukund/privbert


In [24]:
#!pip install accelerate

# Initialize the tokenizer and the model pipeline
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
gen_pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
    framework="pt"
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Hyperparameter Tuning

To decide the best hyperparameters for sentence generation (*simplifying, clarifying, and rephrasing the original sentence succinctly*), I tried multiple variations of the hyperparameters that control the language generation process. These include:

*   temperature - Default, 0.25, 0.35, 0.4, 0.5, 0.6
*   top_p - Default, 0.8, 0.6, 0.7
*   top_k - 5, 7, 10
*   num_beams - Default, 2, 3
*   early_stopping - True, False

Various combinations of these values were tried with a sample of the training sentence set and manually evaluated for characteristics that would be interesting for the experiment / finetuning process.
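
To see why only a hand-picked subset of combinations was evaluated by eye, it helps to count the full grid. The sketch below enumerates every combination of the values listed above; `None` is used here as an assumed stand-in for "Default" (the key is simply omitted so the library default applies), and the names `SEARCH_SPACE` and `iter_configs` are illustrative only.

```python
from itertools import product

# Values listed above; None stands for "leave the library default".
SEARCH_SPACE = {
    "temperature": [None, 0.25, 0.35, 0.4, 0.5, 0.6],
    "top_p": [None, 0.8, 0.6, 0.7],
    "top_k": [5, 7, 10],
    "num_beams": [None, 2, 3],
    "early_stopping": [True, False],
}


def iter_configs(space):
    """Yield one keyword-argument dict per combination, dropping 'Default' entries."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield {name: value for name, value in zip(names, values) if value is not None}


all_configs = list(iter_configs(SEARCH_SPACE))
print(len(all_configs))  # 6 * 4 * 3 * 3 * 2 = 432
print(all_configs[0])
```

With 432 possible settings and manual inspection of each output, an exhaustive search is impractical, which is consistent with trying a small number of combinations on a sample.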

## Multiple Rewritten Sentence Datasets

I decided to create multiple rewritten sentence datasets to test the hypothesis further. The first set of rewriting parameters took a conservative approach to the rewriting rules. The second set of parameters was much more liberal. The goal was to see whether the "creativity" of the LLM's vernacular would improve QA answerability.

### Chosen Values for Sentence rewriting (Version 1 - Conservative)

 * temperature=0.25,
 * top_p=0.8,
 * top_k=5,
 * num_beams=2,
 * early_stopping=True,

### Chosen Values for Sentence rewriting (Version 2 - Liberal) - Please see other notebook

 * temperature=0.6,
 * top_p=0.7,
 * top_k=10,
 * num_beams=3,
 * early_stopping=False,
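
One way to keep the two parameter sets from drifting apart between notebooks is to store them as named presets and unpack them into the generation call. This is a sketch under assumptions: the names `CONSERVATIVE`, `LIBERAL`, `SHARED`, and `generation_kwargs` are hypothetical, and the shared values are the ones that appear in the generation cells of this notebook.

```python
# Version 1 and Version 2 values, copied from the lists above.
CONSERVATIVE = {"temperature": 0.25, "top_p": 0.8, "top_k": 5,
                "num_beams": 2, "early_stopping": True}
LIBERAL = {"temperature": 0.6, "top_p": 0.7, "top_k": 10,
           "num_beams": 3, "early_stopping": False}

# Arguments that both versions share in this notebook's generation cells.
SHARED = {"do_sample": True, "num_return_sequences": 1,
          "max_length": 300, "truncation": True}


def generation_kwargs(version):
    """Return the keyword arguments for one named rewriting version."""
    presets = {"conservative": CONSERVATIVE, "liberal": LIBERAL}
    if version not in presets:
        raise ValueError(f"Unknown version: {version!r}")
    return {**SHARED, **presets[version]}


print(generation_kwargs("conservative"))
```

Assuming the `gen_pipeline` and `tokenizer` objects created earlier, a call would then look like `gen_pipeline(prompts, eos_token_id=tokenizer.eos_token_id, **generation_kwargs("conservative"))`.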




In [14]:
# # Function to generate text using the pipeline
# def generate_text(batch):
#     # Process batch of prompts through the pipeline
#     outputs = gen_pipeline(batch['Prompt'],
#                            do_sample=True,
#                            top_k=10,
#                            num_return_sequences=1,
#                            eos_token_id=tokenizer.eos_token_id,
#                            max_length=200,
#                            truncation=True)
#     # Extract the generated text and clean it up by removing the prompt text
#     return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs]}

# # Apply the function batch-wise to the dataset
# result = dataset.map(generate_text, batched=True, batch_size=20)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [15]:
# # Function to generate text using the pipeline
# def generate_text2(batch):
#     # Process batch of prompts through the pipeline
#     outputs_2 = gen_pipeline(batch['Prompt'],
#                            do_sample=True,
#                            temperature=0.25,
#                            top_p=0.8,
#                            top_k=5,
#                            num_beams=2,
#                            num_return_sequences=1,
#                            eos_token_id=tokenizer.eos_token_id,
#                            max_length=200,
#                            early_stopping=True,
#                            truncation=True)
#     # Extract the generated text and clean it up by removing the prompt text
#     return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_2]}

# # Apply the function batch-wise to the dataset
# result_2 = dataset_2.map(generate_text2, batched=True, batch_size=20)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]



In [16]:
# # Function to generate text using the pipeline
# def generate_text3(batch):
#     # Process batch of prompts through the pipeline
#     outputs_3 = gen_pipeline(batch['Prompt'],
#                            do_sample=True,
#                            temperature=0.4,
#                            top_p=0.6,
#                            top_k=10,
#                            num_beams=3,
#                            num_return_sequences=1,
#                            eos_token_id=tokenizer.eos_token_id,
#                            max_length=200,
#                            early_stopping=False,
#                            truncation=True)
#     # Extract the generated text and clean it up by removing the prompt text
#     return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_3]}

# # Apply the function batch-wise to the dataset
# result_3 = dataset_3.map(generate_text3, batched=True, batch_size=20)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [17]:
# # Function to generate text using the pipeline
# def generate_text4(batch):
#     # Process batch of prompts through the pipeline
#     outputs_4 = gen_pipeline(batch['Prompt'],
#                            do_sample=True,
#                            temperature=0.5,
#                            top_p=0.7,
#                            top_k=10,
#                            num_beams=3,
#                            num_return_sequences=1,
#                            eos_token_id=tokenizer.eos_token_id,
#                            max_length=200,
#                            early_stopping=False,
#                            truncation=True)
#     # Extract the generated text and clean it up by removing the prompt text
#     return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_4]}

# # Apply the function batch-wise to the dataset
# result_4 = dataset_4.map(generate_text4, batched=True, batch_size=20)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [19]:
# # Function to generate text using the pipeline
# # Same as #2 except with num_beam set to 3 and early_stopping = False
# def generate_text5(batch):
#     # Process batch of prompts through the pipeline
#     outputs_5 = gen_pipeline(batch['Prompt'],
#                            do_sample=True,
#                            temperature=0.25,
#                            top_p=0.8,
#                            top_k=5,
#                            num_beams=3,
#                            num_return_sequences=1,
#                            eos_token_id=tokenizer.eos_token_id,
#                            max_length=200,
#                            early_stopping=False,
#                            truncation=True)
#     # Extract the generated text and clean it up by removing the prompt text
#     return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_5]}

# # Apply the function batch-wise to the dataset
# result_5 = dataset_5.map(generate_text5, batched=True, batch_size=20)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]



In [18]:
# # Function to generate text using the pipeline
# def generate_text6(batch):
#     # Process batch of prompts through the pipeline
#     outputs_6 = gen_pipeline(batch['Prompt'],
#                            do_sample=True,
#                            temperature=0.35,
#                            top_p=0.7,
#                            top_k=7,
#                            num_beams=3,
#                            num_return_sequences=1,
#                            eos_token_id=tokenizer.eos_token_id,
#                            max_length=200,
#                            early_stopping=True,
#                            truncation=True)
#     # Extract the generated text and clean it up by removing the prompt text
#     return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_6]}

# # Apply the function batch-wise to the dataset
# result_6 = dataset_6.map(generate_text6, batched=True, batch_size=20)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]



## Sentence Rewriting of the Full Dataset

There were multiple iterations of the function that creates rewritten sentences for the full dataset. The process was initially designed to run in batches using the Hugging Face Datasets library.

However, some parameters of the sentence generation process (for example `max_length`) sometimes interfered with the rewriting when a sentence was too long. This would raise an exception and kill the whole batch. So a sequential process was added to the exception handler to deal with odd cases automatically and let the rewriting continue.

**NOTE:** If this exception is triggered, the offending sentence is not rewritten and the original sentence is used instead.
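
Prompts that are too long can also be spotted before generation starts, by comparing each prompt's token count with `max_length`. The sketch below is not part of the notebook's process: `split_by_length` and `whitespace_count` are hypothetical helpers, and whitespace splitting is only a rough stand-in for the model tokenizer so that the example runs on its own.

```python
def split_by_length(prompts, count_tokens, max_length=300, min_new_tokens=1):
    """Separate prompts that leave room for generation from those that do not.

    count_tokens is any callable returning the token count of a string.
    With a Hugging Face tokenizer this could be, as an assumption,
    lambda p: len(tokenizer(p)["input_ids"]).
    """
    fits, too_long = [], []
    for prompt in prompts:
        if count_tokens(prompt) + min_new_tokens <= max_length:
            fits.append(prompt)
        else:
            too_long.append(prompt)
    return fits, too_long


def whitespace_count(text):
    """Crude token count used only to keep this example self-contained."""
    return len(text.split())


prompts = ["SENTENCE: short one", "SENTENCE: " + "word " * 10]
fits, too_long = split_by_length(prompts, whitespace_count, max_length=5)
print(len(fits), len(too_long))
```

Prompts in `too_long` could be routed straight to the "keep the original sentence" path, so one oversized sentence would not abort a whole batch.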

### Chosen Values for Sentence rewriting (Version 1 - Conservative)

The full process uses the chosen hyperparameters along with the specifically selected prompt to rewrite the sentences. The Version 2 (Liberal) run is in the other notebook.

**Prompt**

>"Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: {row['Segment']}"

**Hyperparameters**
 * temperature=0.25,
 * top_p=0.8,
 * top_k=5,
 * num_beams=2,
 * early_stopping=True,


In [33]:
# import json

# # Based on Trial #2; Simple prompt with SENTENCE Marker
# # Low temperature; Try to flex on Top_p and Beams.
# # Function to generate text using the pipeline
# def generate_text_full(batch):

#     # Initialize the batch counter attribute if it doesn't exist
#     if not hasattr(generate_text_full, 'batch_counter'):
#         generate_text_full.batch_counter = 0  # Static variable equivalent
#     generate_text_full.batch_counter += 1  # Increment the counter
#     print(f"Processing Batch #{generate_text_full.batch_counter}, Batch Size: {len(batch['Prompt'])}")

#     try:
#         # Process batch of prompts through the pipeline with torch.no_grad() for memory efficiency
#         with torch.no_grad():
#             outputs_full = gen_pipeline(batch['Prompt'],
#                                         do_sample=True,
#                                         temperature=0.25,
#                                         top_p=0.8,
#                                         top_k=5,
#                                         num_beams=2,
#                                         num_return_sequences=1,
#                                         eos_token_id=tokenizer.eos_token_id,
#                                         max_length=300,
#                                         early_stopping=True,
#                                         truncation=True)
#         # Extract the generated text and clean it up by removing the prompt text
#         return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_full]}
#     except Exception as e:
#         print(f"An error occurred: {e}")
#         # Return a dictionary with the same structure but indicate failure
#         return {"Generated_Text": ["ERROR_PROCESSING_TEXT" for _ in batch['Prompt']]}

# # Apply the function batch-wise to the dataset
# result_full = dataset_full.map(generate_text_full, batched=True, batch_size=100)

Map:   0%|          | 0/3411 [00:00<?, ? examples/s]

Processing Batch #1, Batch Size: 100




Processing Batch #2, Batch Size: 100
Processing Batch #3, Batch Size: 100
Processing Batch #4, Batch Size: 100
Processing Batch #5, Batch Size: 100
Processing Batch #6, Batch Size: 100
Processing Batch #7, Batch Size: 100
Processing Batch #8, Batch Size: 100
Processing Batch #9, Batch Size: 100
Processing Batch #10, Batch Size: 100
Processing Batch #11, Batch Size: 100
Processing Batch #12, Batch Size: 100
An error occurred: Input length of input_ids is 300, but `max_length` is set to 300. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
Processing Batch #13, Batch Size: 100
Processing Batch #14, Batch Size: 100
Processing Batch #15, Batch Size: 100
Processing Batch #16, Batch Size: 100
Processing Batch #17, Batch Size: 100
Processing Batch #18, Batch Size: 100
Processing Batch #19, Batch Size: 100
Processing Batch #20, Batch Size: 100
Processing Batch #21, Batch Size: 100
Processing Batch #22, Batch Size: 100
P

## Helper function for Sequential Exception process

When the batch process fails due to the "Sentence too long" exception, this helper function is used to pull out the original sentence for the associated *PROMPT:SENTENCE*.

**NOTE:** If this exception is triggered, the offending sentence is not rewritten and the original sentence is used instead.

In [40]:
def get_segment(dataset, prompt):
    """
    Retrieve the segment associated with a specific prompt from a Hugging Face Dataset.

    Parameters:
    - dataset (Dataset): A Hugging Face Dataset object with columns ['Prompt', 'Segment', 'Generated_Text'].
    - prompt (str): The prompt string to search for within the dataset.

    Returns:
    - str: The segment corresponding to the given prompt. Returns 'No segment found' if the prompt does not exist.
    """
    # Filter the dataset for the target prompt
    filtered_dataset = dataset.filter(lambda example: example['Prompt'] == prompt)

    # Extract the segment for the specific prompt
    if len(filtered_dataset) > 0:
        return filtered_dataset['Segment'][0]
    else:
        return "No segment found"
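
`get_segment` filters the whole dataset every time it is called, which is fine for a handful of failures. If many prompts ever needed the fallback, a dictionary built once would give constant-time lookups. The sketch below is an alternative under assumptions, not the notebook's method: `build_segment_lookup` and `get_segment_fast` are hypothetical names, and plain lists stand in for the dataset's `Prompt` and `Segment` columns.

```python
def build_segment_lookup(prompts, segments):
    """Map each prompt string to its source segment (built once, reused often)."""
    return dict(zip(prompts, segments))


def get_segment_fast(lookup, prompt):
    """Return the segment for a prompt, mirroring get_segment's fallback string."""
    return lookup.get(prompt, "No segment found")


# Plain lists stand in for the dataset columns in this example.
prompts = ["SENTENCE: We collect data.\n", "SENTENCE: We share data.\n"]
segments = ["We collect data.", "We share data."]

lookup = build_segment_lookup(prompts, segments)
print(get_segment_fast(lookup, prompts[1]))
print(get_segment_fast(lookup, "unknown prompt"))
```

Both versions return the same values; the difference is only in how much work each lookup does.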


## 2nd / Final iteration of Sentence Rewriting Function

There were multiple iterations of the function that creates rewritten sentences for the full dataset. This iteration fully incorporates the sequential exception process.

In [55]:
import json
from datasets import concatenate_datasets, load_dataset

# Based on Trial #2; Simple prompt with SENTENCE Marker
# Low temperature; Try to flex on Top_p and Beams.
# Function to generate text using the pipeline
def generate_text_full(batch):

    # Initialize the batch counter attribute if it doesn't exist
    if not hasattr(generate_text_full, 'batch_counter'):
        generate_text_full.batch_counter = 0  # Static variable equivalent
    generate_text_full.batch_counter += 1  # Increment the counter
    print(f"Processing Batch #{generate_text_full.batch_counter}, Batch Size: {len(batch['Prompt'])}")

    try:
        # Process batch of prompts through the pipeline with torch.no_grad() for memory efficiency
        with torch.no_grad():
            outputs_full = gen_pipeline(batch['Prompt'], do_sample=True,
                                        temperature=0.25, top_p=0.8, top_k=5,num_beams=2, early_stopping=True,
                                        num_return_sequences=1, eos_token_id=tokenizer.eos_token_id,
                                        max_length=300, truncation=True)
        # Extract the generated text and clean it up by removing the prompt text
        return {"Generated_Text": [output[0]['generated_text'].split('\n')[-1].strip() for output in outputs_full]}
    except Exception as e:
        print(f"An error occurred: {e}")
        print("Reverting to sequential processing for this batch")
        generated_texts = []
        for prompt in batch['Prompt']:
            try:
                output = gen_pipeline(prompt, do_sample=True,
                                      temperature=0.25, top_p=0.8, top_k=5, num_beams=2, early_stopping=True,
                                      num_return_sequences=1, eos_token_id=tokenizer.eos_token_id,
                                      max_length=300, truncation=True)
                # Extract the generated text and clean it up by removing the prompt text
                generated_text = output[0]['generated_text'].split('\n')[-1].strip()

                # Append the cleaned generated text to the list
                generated_texts.append(generated_text)

            except Exception as ex:
                print(f"Error processing prompt: {prompt}.")
                print("Using source segment.")
                print(f"Error: {ex}")
                # NOTE: dataset_full_test (the test-set counterpart of dataset_full)
                # must already be defined before this fallback path runs.
                combined_dataset = concatenate_datasets([dataset_full, dataset_full_test])
                generated_texts.append(get_segment(combined_dataset, prompt))  # Use the original segment in case of an error

        # Return the list of generated texts
        return {"Generated_Text": generated_texts}

## Sequential Exception Patch

Instead of re-running the sentence rewriting process over the full dataset, I pulled out the specific rows that raised exceptions and re-ran only those, so that the sequential fallback path handles them. The resulting subset is then merged back into the full dataset.
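The batch-then-sequential fallback used by `generate_text_full` can be looked at in isolation from the model. The following is a minimal sketch of that pattern, not the notebook's implementation: `run_with_fallback`, `fake_batch`, `fake_single`, and `fake_fallback` are hypothetical stand-ins for the pipeline call and the source-segment lookup.

```python
from typing import Callable, List


def run_with_fallback(
    prompts: List[str],
    batch_fn: Callable[[List[str]], List[str]],
    single_fn: Callable[[str], str],
    fallback_fn: Callable[[str], str],
) -> List[str]:
    """Try the whole batch first; on failure, retry prompt by prompt."""
    try:
        return batch_fn(prompts)
    except Exception as batch_error:
        print(f"Batch failed ({batch_error}); retrying sequentially")

    results = []
    for prompt in prompts:
        try:
            results.append(single_fn(prompt))
        except Exception as prompt_error:
            # Last resort: keep something for this row so lengths still match.
            print(f"Prompt failed ({prompt_error}); using fallback")
            results.append(fallback_fn(prompt))
    return results


# Hypothetical stand-ins for the real generation call and segment lookup.
def fake_single(prompt: str) -> str:
    if len(prompt) > 20:
        raise ValueError("prompt too long")
    return prompt.upper()


def fake_batch(prompts: List[str]) -> List[str]:
    # One bad prompt makes the whole batch fail, as in the notebook run.
    return [fake_single(p) for p in prompts]


def fake_fallback(prompt: str) -> str:
    return prompt  # keep the source text unchanged


outputs = run_with_fallback(
    ["short prompt", "a prompt that is much too long"],
    fake_batch, fake_single, fake_fallback,
)
print(outputs)
```

The key property is that the function always returns one output per input prompt, which is what a batched `map` call needs.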

In [42]:
# Select the subset of the dataset (entries 1100 to 1200)
subset_dataset = dataset_full.select(range(1100, 1200))

# Apply the function batch-wise to the selected dataset subset
result_1100 = subset_dataset.map(generate_text_full, batched=True, batch_size=20)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Processing Batch #1, Batch Size: 20




An error occurred: Input length of input_ids is 200, but `max_length` is set to 200. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
Reverting to sequential processing for this batch
Error processing prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE:   US: https://security.intuit.com/index.php/privacy Canada (EN): http://www.intuit.ca/about-intuit-canada/info/privacy-policy.jsp Canada (FR): http://www.intuit.ca/a-propos-de-intuit/info/vie-privee.jsp France: http://quickbooks.intuit.fr/a-propos/confidentialite.jsp UK: https://www.quickbooks.co.uk/privacy-policy/ Australia: http://www.intuit.com.au/about/privacy.jsp India: http://www.quickbooks.in/privacy.jsp Singapore: http://quickbooks.intuit.sg/about/privacy.jsp Malaysia: http://www.intuit.com.my/about/privacy.jsp Hong Kong (traditional Chinese): http://www.intuit.h

Filter:   0%|          | 0/3411 [00:00<?, ? examples/s]

Processing Batch #2, Batch Size: 20
Processing Batch #3, Batch Size: 20
Processing Batch #4, Batch Size: 20
Processing Batch #5, Batch Size: 20


In [None]:
# Apply the function batch-wise to the dataset
result_full = dataset_full.map(generate_text_full, batched=True, batch_size=100)

In [46]:
# Apply the Sequential Exception Patch
# Convert both datasets to pandas DataFrames
df_full = result_full.to_pandas()
df_1100 = result_1100.to_pandas()

# Replace entries from 1100 to 1200 in df_full with entries from 0 to 100 in df_1100
df_full.iloc[1100:1200] = df_1100.iloc[0:100].values

# If you need to use the modified data as a Dataset again
from datasets import Dataset  # harmless if already imported in an earlier cell
md_full = Dataset.from_pandas(df_full)
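
The positional overwrite in this cell assumes that rows 1100 to 1199 of `df_full` and rows 0 to 99 of `df_1100` describe the same segments, in the same order, with the same column layout. A minimal sketch of that check on toy data follows; the frames `df_all` and `df_patch` and their contents are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins for the full result and the re-generated subset.
df_all = pd.DataFrame({
    "Segment": [f"s{i}" for i in range(6)],
    "Generated_Text": ["old"] * 6,
})
df_patch = pd.DataFrame({
    "Segment": ["s2", "s3"],
    "Generated_Text": ["new", "new"],
})

start, stop = 2, 4

# Guard: the patch must cover the same source rows, in the same order,
# and use the same columns, before overwriting by position.
assert list(df_all["Segment"].iloc[start:stop]) == list(df_patch["Segment"])
assert list(df_all.columns) == list(df_patch.columns)

# Same positional overwrite as in the notebook cell, on the toy frames.
df_all.iloc[start:stop] = df_patch.values
print(df_all["Generated_Text"].tolist())
```

If either guard fails, the slice bounds or the subset selection are probably out of sync with the full dataset.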

In [None]:
sample = 50
ver = "full"
md_full.shape
for i in range(1100,1200):
  print("Sentence:",  md_full[i]['Segment'])
  print("Prompt:", md_full[i]['Prompt'])
  print("New Sentence:", md_full[i]['Generated_Text'])
  print()

In [48]:
result_full = md_full

## Save Unique Values to File


### Sample Values -> File

Print out a small sample for visual inspection to support prompt engineering, hyperparameter tuning, and the final choice of prompt and parameters.
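
The inspection cells in this notebook all print the same three fields per row. A small helper could cover them; this is only a sketch, and `format_samples` and `toy_rows` are hypothetical names. It works on any sequence of dict-like rows, so it should also accept a `datasets.Dataset`, which supports `len()` and integer indexing.

```python
from typing import Mapping, Sequence


def format_samples(
    rows: Sequence[Mapping[str, str]], start: int = 0, stop: int = 50
) -> str:
    """Format rows[start:stop] as Sentence / Prompt / New Sentence blocks."""
    stop = min(stop, len(rows))  # avoid indexing past the end
    blocks = []
    for i in range(start, stop):
        row = rows[i]
        blocks.append(
            f"Sentence: {row['Segment']}\n"
            f"Prompt: {row['Prompt']}\n"
            f"New Sentence: {row['Generated_Text']}\n"
        )
    return "\n".join(blocks)


# Hypothetical rows for illustration only.
toy_rows = [
    {"Segment": "a", "Prompt": "p-a", "Generated_Text": "A"},
    {"Segment": "b", "Prompt": "p-b", "Generated_Text": "B"},
]
text = format_samples(toy_rows, 0, 5)
print(text)
```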

In [None]:
sample = 50
ver = "1100"
result_1100.shape
for i in range(result_1100.num_rows):
  print("Sentence:",  result_1100[i]['Segment'])
  print("Prompt:", result_1100[i]['Prompt'])
  print("New Sentence:", result_1100[i]['Generated_Text'])
  print()

In [None]:
sample = 50
ver = "full"
result_full.shape
for i in range(1100,1200):
  print("Sentence:",  result_full[i]['Segment'])
  print("Prompt:", result_full[i]['Prompt'])
  print("New Sentence:", result_full[i]['Generated_Text'])
  print()

### Full Values -> File (train)
Write the full set out to file.

**NOTE:** These are just the unique sentences that were rewritten.  These still need to be combined with the original dataset.
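
As a rough illustration of that recombination, the sketch below maps rewritten unique segments back onto a table in which segments repeat, then counts rows that received no rewrite. The toy frames and the `Query` column are assumptions made for the example, not the project's actual schema.

```python
import pandas as pd

# Hypothetical toy data: the original table repeats a segment across questions.
original = pd.DataFrame({
    "Query": ["q1", "q2", "q3"],
    "Segment": [
        "We care about privacy.",
        "We care about privacy.",
        "Unseen text.",
    ],
})
rewritten = pd.DataFrame({
    "Segment": ["We care about privacy."],
    "Generated_Text": ["We value your privacy."],
})

# Build a lookup from each unique segment to its rewrite, then apply it
# to every row of the original table.
lookup = dict(zip(rewritten["Segment"], rewritten["Generated_Text"]))
original["Simplified_Segment"] = original["Segment"].map(lookup)

# Rows whose segment has no rewrite come back as NaN; count them.
missing = original["Simplified_Segment"].isna()
print(f"{missing.sum()} of {len(original)} rows have no rewrite")
```

A non-zero count would suggest that the unique-segment set and the original table were built from different text (for example, different whitespace handling).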

In [49]:
import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_full.csv'

# Save to CSV
df_full = result_full.to_pandas()
df_full.to_csv(output_file_path, index=False)

# Test Set "rewriting" Simplification

Let's rewrite the test QA sentences using the same LLM text generation process that was used for the training sentences.

In [56]:
# Based on Trial 2: simple prompt, low temperature
# Apply the same treatment to the test sentences as we did for the training sentences
dataset_full_test = Dataset.from_pandas(unique_segments_df_test)

# Apply function to create prompts
dataset_full_test = dataset_full_test.map(add_prompts_full)
# Apply the function batch-wise to the dataset
result_full_test = dataset_full_test.map(generate_text_full, batched=True, batch_size=100)

Map:   0%|          | 0/1231 [00:00<?, ? examples/s]

Map:   0%|          | 0/1231 [00:00<?, ? examples/s]

Processing Batch #1, Batch Size: 100




Processing Batch #2, Batch Size: 100
Processing Batch #3, Batch Size: 100
Processing Batch #4, Batch Size: 100
Processing Batch #5, Batch Size: 100
Processing Batch #6, Batch Size: 100
Processing Batch #7, Batch Size: 100
Processing Batch #8, Batch Size: 100
Processing Batch #9, Batch Size: 100
Processing Batch #10, Batch Size: 100
Processing Batch #11, Batch Size: 100
Processing Batch #12, Batch Size: 100
Processing Batch #13, Batch Size: 31


## Save Unique Values to File

### Review Sample Values by Inspection
Print out a small sample for visual inspection.

In [57]:
sample = 50
ver = "full_test"
result_full_test.shape
for i in range(sample):
  print("Sentence:",  result_full_test[i]['Segment'])
  print("Prompt:", result_full_test[i]['Prompt'])
  print("New Sentence:", result_full_test[i]['Generated_Text'])
  print()

Sentence:   At Fiverr we care about your privacy.
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE:   At Fiverr we care about your privacy.

New Sentence: REPHRASED SENTENCE:  Fiverr prioritizes your privacy.

Sentence: We do not sell or rent your personal information to third parties for their direct marketing purposes without your explicit consent.
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: We do not sell or rent your personal information to third parties for their direct marketing purposes without your explicit consent.

New Sentence: REPHRASED SENTENCE: We don't share your personal information with third parties for their marketing purposes without your consent.

Sentence: We do not disclose it to others except as disclosed in this Policy or required to provide you with the services o

### Full Values -> File (test)
Write the full set out to file.

**NOTE:** These are just the unique sentences that were rewritten.  These still need to be combined with the original dataset.

In [None]:
import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_test_dataset_full.csv'

# Save to CSV
df_full_test = result_full_test.to_pandas()
df_full_test.to_csv(output_file_path, index=False)

# Save Unique Values to File for Trial Attempts

Results from each of the trial attempts, 1 through 6.
For each trial, print out a small sample for visual inspection, then write the full set out to file.

In [19]:
sample = 200
ver = 6
result_6.shape
for i in range(result_6.num_rows):
  print("Sentence:",  result_6[i]['Segment'])
  print("Prompt:", result_6[i]['Prompt'])
  print("New Sentence:", result_6[i]['Generated_Text'])
  print()

import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_small_{sample}_{ver}.csv'

# Save to CSV
df_6 = result_6.to_pandas()
df_6.to_csv(output_file_path, index=False)

Sentence: We may limit your individual rights in the following ways:
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: We may limit your individual rights in the following ways:

New Sentence: REPHRASED SENTENCE: We may limit your individual rights in certain ways.

Sentence: You can also change your in-app notification preferences through your device settings.
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: You can also change your in-app notification preferences through your device settings.

New Sentence: REPHRASED SENTENCE: You can change your in-app notification preferences through your device settings.

Sentence: How Long Do We Retain Your Personal Data?
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.

In [21]:
sample = 200
ver = 5
result_5.shape
for i in range(result_5.num_rows):
  print("Sentence:",  result_5[i]['Segment'])
  print("Prompt:", result_5[i]['Prompt'])
  print("New Sentence:", result_5[i]['Generated_Text'])
  print()

import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_small_{sample}_{ver}.csv'

# Save to CSV
df_5 = result_5.to_pandas()
df_5.to_csv(output_file_path, index=False)

Sentence: We may limit your individual rights in the following ways:
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: We may limit your individual rights in the following ways:

New Sentence: REPHRASED SENTENCE: We may limit your individual rights in certain ways.

Sentence: You can also change your in-app notification preferences through your device settings.
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: You can also change your in-app notification preferences through your device settings.

New Sentence: REPHRASED SENTENCE: You can change your in-app notification preferences through your device settings.

Sentence: How Long Do We Retain Your Personal Data?
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.

In [22]:
ver = 4
result_4.shape
for i in range(result_4.num_rows):
  print("Sentence:",  result_4[i]['Segment'])
  print("Prompt:", result_4[i]['Prompt'])
  print("New Sentence:", result_4[i]['Generated_Text'])
  print()

import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_small_{sample}_{ver}.csv'

# Save to CSV
df_4 = result_4.to_pandas()
df_4.to_csv(output_file_path, index=False)

Sentence: We may limit your individual rights in the following ways:
Prompt: Rephrase, simplify, and clarify for an average user; do not add any qualifiying statements to the response.  SENTENCE: We may limit your individual rights in the following ways:

New Sentence: REPHRASED SENTENCE: We may restrict your personal rights in certain ways.

Sentence: You can also change your in-app notification preferences through your device settings.
Prompt: Rephrase, simplify, and clarify for an average user; do not add any qualifiying statements to the response.  SENTENCE: You can also change your in-app notification preferences through your device settings.

New Sentence: CLEARANCE SENTENCE: You can customize your app notification preferences on your device.

Sentence: How Long Do We Retain Your Personal Data?
Prompt: Rephrase, simplify, and clarify for an average user; do not add any qualifiying statements to the response.  SENTENCE: How Long Do We Retain Your Personal Data?

New Sentence: Rete

In [26]:
ver = 3
result_3.shape
for i in range(result_3.num_rows):
  print("Sentence:",  result_3[i]['Segment'])
  print("Prompt:", result_3[i]['Prompt'])
  print("New Sentence:", result_3[i]['Generated_Text'])
  print()

import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_small_{sample}_{ver}.csv'

# Save to CSV
df_3 = result_3.to_pandas()
df_3.to_csv(output_file_path, index=False)

Sentence: We may limit your individual rights in the following ways:
Prompt: Rephrase and simplify for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: We may limit your individual rights in the following ways:

New Sentence: REPHRASED SENTENCE: We may restrict your personal rights in certain ways.

Sentence: You can also change your in-app notification preferences through your device settings.
Prompt: Rephrase and simplify for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: You can also change your in-app notification preferences through your device settings.

New Sentence: SIMPLIFIED SENTENCE: You can change your app notification settings on your device.

Sentence: How Long Do We Retain Your Personal Data?
Prompt: Rephrase and simplify for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: How Long Do We Retain Your Personal Data?

New Sent

In [27]:
ver = 2
result_2.shape
for i in range(result_2.num_rows):
  print("Sentence:",  result_2[i]['Segment'])
  print("Prompt:", result_2[i]['Prompt'])
  print("New Sentence:", result_2[i]['Generated_Text'])
  print()

import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_small_{sample}_{ver}.csv'

# Save to CSV
df_2 = result_2.to_pandas()
df_2.to_csv(output_file_path, index=False)

Sentence: We may limit your individual rights in the following ways:
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: We may limit your individual rights in the following ways:

New Sentence: REPHRASED SENTENCE: We may limit your individual rights in certain ways.

Sentence: You can also change your in-app notification preferences through your device settings.
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.  SENTENCE: You can also change your in-app notification preferences through your device settings.

New Sentence: REPHRASED SENTENCE: You can change your in-app notification preferences through your device settings.

Sentence: How Long Do We Retain Your Personal Data?
Prompt: Rephrase and simplify the 'SENTENCE' for clarity for an average user and do not add any qualifiying statements to the response.

In [25]:
ver = 1
result.shape
for i in range(result.num_rows):
  print("Sentence:",  result[i]['Segment'])
  print("Prompt:", result[i]['Prompt'])
  print("New Sentence:", result[i]['Generated_Text'])
  print()

import pandas as pd

# write out to file
output_file_path = f'/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_dataset_small_{sample}_{ver}.csv'

# Save to CSV
df_1 = result.to_pandas()
df_1.to_csv(output_file_path, index=False)

Sentence: We may limit your individual rights in the following ways:
Prompt: rephrase and simplify this sentence for clarity for an average user: We may limit your individual rights in the following ways:

New Sentence: We may restrict your individual rights in the following ways:

Sentence: You can also change your in-app notification preferences through your device settings.
Prompt: rephrase and simplify this sentence for clarity for an average user: You can also change your in-app notification preferences through your device settings.

New Sentence: You can adjust your in-app notification settings through your device's settings.

Sentence: How Long Do We Retain Your Personal Data?
Prompt: rephrase and simplify this sentence for clarity for an average user: How Long Do We Retain Your Personal Data?

New Sentence: In simpler terms, we keep your personal data for as long as we need it to provide you with our services and follow the law.

Sentence: We use third-party service providers t

# Clean up and format output for our modeling files
On inspection, the generated output often begins with a label that precedes the rewritten sentence, such as:

* "REPHRASED SENTENCE:",
* "ANALYSIS:",
* "CLEAR AND CLEAR SENTENCE:",
* etc.

We want to extract the sentences without these labels. We'll do this with regular expressions (regex).
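
An explicit prefix list only removes the labels it names. As a possible complement, the sketch below strips any short ALL-CAPS label followed by `:` or `=` at the start of the text. The pattern and the name `strip_label` are assumptions for illustration; mixed-case labels such as "Answer:" would still need an explicit list, and a legitimate sentence that opens with a capitalized acronym of three or more letters and a colon would be trimmed too.

```python
import re

# Assumption: a label is 3+ uppercase letters/spaces at the very start of
# the text, followed by ':' or '=' and optional whitespace.
LABEL_PATTERN = re.compile(r"^\s*[A-Z][A-Z ]{2,40}\s*[:=]\s*")


def strip_label(text: str) -> str:
    """Remove one leading ALL-CAPS label, if present, and trim whitespace."""
    return LABEL_PATTERN.sub("", text, count=1).strip()


examples = [
    "REPHRASED SENTENCE:  Fiverr prioritizes your privacy.",
    "CLEAR AND CLEAR SENTENCE: Done.",
    "ANONYMIZED = text",
    "We may limit your rights:",  # no label, left unchanged
]
for example in examples:
    print(repr(strip_label(example)))
```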

In [63]:
def clean_generated_text(example):
    # Define the list of specific prefixes to remove
    prefixes = [
        "REPHRASED SENTENCE",
        "ANONYMIZED =",
        "SIMPLIFIED SENTENCE",
        "Answer",
        "CORRECTED SENTENCE",
        "Rephrased and simplified sentence",
        "Simplified Rephrased Sentence",
        "Rephrased and simplified",
    ]

    # Build a regex that matches any of these prefixes at the start of the string,
    # followed by an optional ':' or '=' and any surrounding whitespace (case-sensitive)
    import re
    pattern = re.compile(r'^(?:' + '|'.join(re.escape(prefix) for prefix in prefixes) + r')\s*[:=]?\s*')

    # Function to clean each text in the list
    def clean_text(text):
        # Apply regex to remove the prefix and strip whitespace
        return pattern.sub('', text).strip()

    # Apply the cleaning function to each text in the list 'Generated_Text'
    cleaned_texts = [clean_text(text) for text in example['Generated_Text']]

    return {'Clean_Gen_Segment': cleaned_texts}

In [64]:
# 'result_full' is the Dataset object from the previous text generation step
cleaned_results_train = result_full.map(clean_generated_text, batched=True)
cleaned_results_test = result_full_test.map(clean_generated_text, batched=True)

Map:   0%|          | 0/3411 [00:00<?, ? examples/s]

Map:   0%|          | 0/1231 [00:00<?, ? examples/s]

In [66]:
cleaned_results_train.to_pandas().head()

| Unnamed: 0 | Segment | Prompt | Generated_Text | Clean_Gen_Segment |
|---|---|---|---|---|
| 0 | This privacy policy, with our Terms of Servic... | Rephrase and simplify the 'SENTENCE' for clari... | REPHRASED SENTENCE: This privacy policy, alon... | This privacy policy, along with our Terms of S... |
| 1 | We encourage you to read this privacy policy c... | Rephrase and simplify the 'SENTENCE' for clari... | REPHRASED SENTENCE: Please read our privacy po... | Please read our privacy policy carefully when ... |
| 2 | By using our application or other online servi... | Rephrase and simplify the 'SENTENCE' for clari... | REPHRASED SENTENCE: When you use our applicati... | When you use our application or other online s... |
| 3 | When we post changes to this privacy policy, ... | Rephrase and simplify the 'SENTENCE' for clari... | REPHRASED SENTENCE: When we make changes to t... | When we make changes to this privacy policy, w... |
| 4 | We encourage you to review this privacy policy... | Rephrase and simplify the 'SENTENCE' for clari... | REPHRASED SENTENCE: Please review our privacy ... | Please review our privacy policy regularly. |


In [67]:
# Convert results to DataFrame
results_df_train = cleaned_results_train.to_pandas()

# Create a dictionary to map original segments to their simplified versions
simplified_map = dict(zip(results_df_train['Segment'], results_df_train['Clean_Gen_Segment']))

# Map the simplified texts back to the original dataframe
df['Simplified_Segment'] = df['Segment'].map(simplified_map)

# Define the path to save the modified file
output_path_train = '/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_train_data_modified.csv'

# Save the DataFrame back to a CSV file
df.to_csv(output_path_train, sep='\t', index=False)

In [68]:
# Convert results to DataFrame
results_df_test = cleaned_results_test.to_pandas()

# Create a dictionary to map original segments to their simplified versions
simplified_map_test = dict(zip(results_df_test['Segment'], results_df_test['Clean_Gen_Segment']))

# Map the simplified texts back to the original dataframe
df_test['Simplified_Segment'] = df_test['Segment'].map(simplified_map_test)

# Define the path to save the modified file
output_path_test = '/content/drive/My Drive/w266-personal/Final Project/PLUE/data/privacyqa/policy_test_data_modified.csv'

# Save the DataFrame back to a CSV file
df_test.to_csv(output_path_test, sep='\t', index=False)