<br>
<center><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></center>
<br>


## 👋 Welcome to the Gretel Synthetic PII Finance Multilingual Notebook!
In this notebook, we explore the Gretel Synthetic PII Finance Multilingual dataset and show how to visualize and process the generated text together with its PII spans.
<br>
<img src="https://cdn-uploads.huggingface.co/production/uploads/632ca8dcdbea00ca213d101a/nxKiabD9puCKhJCDciMto.webp" alt="Header" width="600"/>


## ✅ Set up your environment
To get started, we will install the necessary dependencies and load the dataset from Hugging Face.
* https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual
<br>

#### Ready? Let's go 🚀

## 💾 Install dependencies

In [None]:
!pip install -Uqq spacy datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.

## 📊 Load and explore the dataset
- Load the Gretel Synthetic PII Finance Multilingual dataset from Hugging Face as a DataFrame.
- Explore the loaded dataset.

In [None]:
import ast
import random
from typing import Dict

import pandas as pd
from datasets import load_dataset
from spacy import displacy

# Load the dataset from Hugging Face as a DataFrame

df = pd.DataFrame(load_dataset("gretelai/synthetic_pii_finance_multilingual")['train'])
df

Unnamed: 0,level_0,index,document_type,document_description,expanded_type,expanded_description,language,language_description,domain,generated_text,pii_spans,conformance_score,quality_score,toxicity_score,bias_score,groundedness_score
0,40012,40012,Supply Chain Management Agreement,A legal contract outlining the terms and condi...,Vendor Management Contract,This subtype involves the contractual agreemen...,English,English language as spoken in the United State...,finance,SUPPLY CHAIN MANAGEMENT AGREEMENT\n\nThis Supp...,"[{""start"": 119, ""end"": 141, ""label"": ""date""}, ...",85,90,5,15,95
1,46425,46425,Supply Chain Management Agreement,A legal contract outlining the terms and condi...,Supply Chain Resilience Framework,This subtype details the framework for buildin...,English,English language as spoken in the United State...,finance,SUPPLY CHAIN RESILIENCE FRAMEWORK\n\nThis Supp...,"[{""start"": 119, ""end"": 142, ""label"": ""date""}, ...",92,87,5,12,95
2,4689,4689,Real Estate Loan Agreement,A legal contract outlining terms and condition...,International Real Estate Investment Loan Agre...,This subtype encompasses loans for internation...,Spanish,Spanish language as spoken in Spain or Mexico,finance,CONTRATO DE PRÉSTAMO PARA INVERSIÓN INMOBILIAR...,"[{""start"": 182, ""end"": 209, ""label"": ""street_a...",85,90,5,15,95
3,3002,3002,Real Estate Loan Agreement,A legal contract outlining terms and condition...,Commercial Property Loan Contract,This subtype focuses on loans for commercial r...,Italian,Italian language as spoken in Italy,finance,REPUBBLICA ITALIANA\n\nCONTRATTO DI PRESTITO I...,"[{""start"": 85, ""end"": 103, ""label"": ""name""}, {...",85,90,5,10,95
4,16187,16187,Email,A communication sent electronically containing...,Invitation,Create an email inviting recipients to an even...,France,French language as spoken in France,finance,Subject: Invitation à notre soirée de lancemen...,"[{""start"": 202, ""end"": 207, ""label"": ""time""}, ...",85,90,5,15,95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50341,44732,44732,Investment Prospectus,"A formal document detailing the objectives, ri...",Fashion and Apparel,Prepare a detailed overview of investment pros...,English,English language as spoken in the United State...,finance,INVESTMENT PROSPECTUS\n\nINTRODUCTION\n\nWe ar...,"[{""start"": 98, ""end"": 122, ""label"": ""company""}...",90,85,5,10,95
50342,54343,54343,FIX Protocol,A messaging standard used for the electronic c...,MarketDataIncrementalRefresh,Develop a MarketDataIncrementalRefresh message...,English,English language as spoken in the United State...,finance,MDReqID:123456\nSubscriptionRequestType:0\nMar...,"[{""start"": 384, ""end"": 396, ""label"": ""name""}, ...",92,85,5,12,88
50343,38158,38158,Investment Prospectus,"A formal document detailing the objectives, ri...",Travel and Hospitality,Construct a comprehensive overview of investme...,English,English language as spoken in the United State...,finance,**Patrizio Gaito-Mascagni's Travel and Hospita...,"[{""start"": 2, ""end"": 25, ""label"": ""name""}, {""s...",85,90,5,15,95
50344,860,860,Loan Application,"A form capturing personal, financial, and empl...",Debt Consolidation Plan,Evaluate the potential for debt consolidation ...,German,German language as spoken in Germany,finance,**Kreditantrag: Debt Consolidation Plan**\n\nS...,"[{""start"": 346, ""end"": 363, ""label"": ""name""}, ...",85,90,5,10,95
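
A quick way to explore the loaded data is to count records per language and summarize a score column per document type. The following is a minimal sketch on a small hand-made DataFrame: the column names (`language`, `document_type`, `quality_score`) come from the output above, but `toy_df` and its values are invented for illustration. The same calls should apply to `df` once the dataset is loaded.

```python
import pandas as pd

# Toy stand-in for the real dataset; only the column names come from the notebook.
toy_df = pd.DataFrame(
    {
        "language": ["English", "English", "Spanish", "German"],
        "document_type": ["Email", "Loan Application", "Email", "Loan Application"],
        "quality_score": [90, 85, 90, 87],
    }
)

# Number of records per language, most frequent first.
language_counts = toy_df["language"].value_counts()
print(language_counts)

# Mean quality score per document type.
mean_quality = toy_df.groupby("document_type")["quality_score"].mean()
print(mean_quality)
```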


## 🔍 Visualize PII entities
- Define a helper function that uses displaCy to highlight the PII entities in the generated text.
- Select a random English record from the dataset and visualize its PII entities.

In [None]:
# Helper function to highlight PII entities with displaCy

def visualize_entities(record: Dict) -> None:
    """
    Render a record's PII spans as highlighted entities with displaCy.
    Args:
        record: A mapping with "generated_text" and "pii_spans" (a list of
            span dicts, or the string representation of such a list).
    """
    colors = [
        "#FFB3BA", "#FFDFBA", "#FFFFBA", "#BAFFC9", "#BAE1FF",
        "#D7BDE2", "#FAD7A0", "#D5F5E3", "#AED6F1", "#F9E79F",
        "#F5CBA7", "#D2B4DE", "#A9CCE3", "#FADBD8", "#D4E6F1",
        "#E6B0AA", "#F9E79F", "#A3E4D7", "#D7DBDD", "#F7DC6F",
        "#F1948A", "#C39BD3", "#7FB3D5", "#76D7C4", "#F0B27A",
        "#E59866", "#AF7AC5", "#5499C7", "#48C9B0"
    ]
    text = record["generated_text"]
    spans = record["pii_spans"]
    if isinstance(spans, str):
        spans = ast.literal_eval(spans)
    # One color per label; sorting the labels keeps the mapping stable across runs
    labels = sorted({span["label"] for span in spans})
    label_colors = {label: colors[i % len(colors)] for i, label in enumerate(labels)}
    ents = [{"start": s["start"], "end": s["end"], "label": s["label"]} for s in spans]
    doc = {"text": text, "ents": ents, "title": None}
    options = {"ents": labels, "colors": label_colors, "compact": True}
    displacy.render(doc, style="ent", manual=True, jupyter=True, options=options)

# Filter to English records once, then pick a random row position
english_df = df.query("language == 'English'")
random_index = random.randint(0, len(english_df) - 1)

# Select the record at the random index
record = english_df.iloc[random_index]

# Visualize the entities
visualize_entities(record)
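
Outside a Jupyter environment, the same spans can be inspected as plain text. The sketch below lists each `(label, value)` pair for a record and tallies the labels. `toy_record` and `extract_entities` are hypothetical names with invented values, and parsing with `json.loads` assumes the `pii_spans` strings are valid JSON, as the rows shown above suggest; the notebook itself uses `ast.literal_eval`.

```python
import json
from collections import Counter

# Hypothetical record shaped like one dataset row (values invented for illustration).
toy_record = {
    "generated_text": "Contact Jane Doe at Acme Bank on 2024-01-15.",
    "pii_spans": (
        '[{"start": 8, "end": 16, "label": "name"}, '
        '{"start": 20, "end": 29, "label": "company"}, '
        '{"start": 33, "end": 43, "label": "date"}]'
    ),
}


def extract_entities(record):
    """Return (label, value) pairs by slicing the text at each span's offsets."""
    text = record["generated_text"]
    spans = record["pii_spans"]
    if isinstance(spans, str):
        # Assumption: the span string is valid JSON.
        spans = json.loads(spans)
    return [(span["label"], text[span["start"]:span["end"]]) for span in spans]


entities = extract_entities(toy_record)
for label, value in entities:
    print(f"{label}: {value}")

# Tally how often each label occurs in this record.
label_counts = Counter(label for label, _ in entities)
print(label_counts)
```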

## 📝 Process and mark up PII
- Define functions to process a single record and output marked-up text with labeled PII.
- Show a random record with marked-up PII.
- Apply processing to all records and create a new 'marked_up_text' column in the DataFrame.

In [None]:
# Function to process a single record and output marked-up text
def process_record(record: Dict) -> str:
    text = record['generated_text']
    spans = ast.literal_eval(record['pii_spans'])
    # Sort spans by start index in descending order
    spans = sorted(spans, key=lambda x: x['start'], reverse=True)
    for span in spans:
        start, end, label = span['start'], span['end'], span['label']
        value = text[start:end]
        # Insert the marked-up PII value into the text
        text = text[:start] + f"{{[{label}]{value}}}" + text[end:]
    return text

# Function to show a random record
def show_random_record(language: str = 'English') -> None:
    # Filter by language once, then pick a random row position
    filtered_df = df.query(f"language == '{language}'")
    random_index = random.randint(0, len(filtered_df) - 1)
    # Select the record at the random index
    record = filtered_df.iloc[random_index]
    # Process the record and print the marked-up text
    marked_up_text = process_record(record)
    print(marked_up_text)

# Function to apply processing to all records and create a new column
def process_all_records(language: str = 'English') -> pd.DataFrame:
    # Filter the DataFrame by language; copy so the new column is not assigned on a slice of df
    filtered_df = df.query(f"language == '{language}'").copy()
    # Apply the process_record function to each record and create a new column
    filtered_df['marked_up_text'] = filtered_df.apply(process_record, axis=1)
    return filtered_df

# Mark up text using the format {[PII Label]PII Value}
show_random_record()

# Add a 'marked_up_text' column to the DataFrame containing the labeled PII for each row
labeled_dataset = process_all_records()
labeled_dataset


**{[company]Technology and Innovation Fund}**

**Investment Prospectus**

**Introduction**

The {[company]Technology and Innovation Fund} is a specialized investment vehicle focused on identifying and capitalizing on disruptive technologies and innovative companies. Our objective is to provide investors with long-term capital appreciation by investing in a diversified portfolio of technology and innovation-driven companies.

**Fund Objectives**

Our primary objective is to achieve capital growth by investing in a diversified portfolio of technology and innovation-driven companies. We aim to identify opportunities in sectors such as artificial intelligence, biotechnology, clean energy, and digital transformation.

**Investment Strategy**

Our investment strategy is based on rigorous research and analysis. We seek to identify companies with a sustainable competitive advantage, a strong management team, and a clear path to profitability. We employ a disciplined approach to portfolio const



Unnamed: 0,level_0,index,document_type,document_description,expanded_type,expanded_description,language,language_description,domain,generated_text,pii_spans,conformance_score,quality_score,toxicity_score,bias_score,groundedness_score,marked_up_text
0,40012,40012,Supply Chain Management Agreement,A legal contract outlining the terms and condi...,Vendor Management Contract,This subtype involves the contractual agreemen...,English,English language as spoken in the United State...,finance,SUPPLY CHAIN MANAGEMENT AGREEMENT\n\nThis Supp...,"[{""start"": 119, ""end"": 141, ""label"": ""date""}, ...",85,90,5,15,95,SUPPLY CHAIN MANAGEMENT AGREEMENT\n\nThis Supp...
1,46425,46425,Supply Chain Management Agreement,A legal contract outlining the terms and condi...,Supply Chain Resilience Framework,This subtype details the framework for buildin...,English,English language as spoken in the United State...,finance,SUPPLY CHAIN RESILIENCE FRAMEWORK\n\nThis Supp...,"[{""start"": 119, ""end"": 142, ""label"": ""date""}, ...",92,87,5,12,95,SUPPLY CHAIN RESILIENCE FRAMEWORK\n\nThis Supp...
5,32489,32489,SWIFT Message,A message format used by banks and financial i...,MT202,"Generate a synthetic MT202 message, including ...",English,English language as spoken in the United State...,finance,":20:OOFFF**222/987\n:25:USA\n:20C:USD150000,00...","[{""start"": 98, ""end"": 109, ""label"": ""name""}, {...",85,90,5,10,80,":20:OOFFF**222/987\n:25:USA\n:20C:USD150000,00..."
6,51960,51960,Health Insurance Claim Form,A form capturing details of a health insurance...,Dental Care Claim,Produce a synthetic data point for a dental ca...,English,English language as spoken in the United State...,finance,Patient Information\n\n* Full Name: Eufemia Go...,"[{""start"": 34, ""end"": 56, ""label"": ""name""}, {""...",85,90,5,15,95,Patient Information\n\n* Full Name: {[name]Euf...
8,44856,44856,Customer support conversational log,A log file capturing customer support interact...,Warranty Claims,Outline the process for assisting customers wi...,English,English language as spoken in the United State...,finance,----------------------------------------------...,"[{""start"": 323, ""end"": 333, ""label"": ""customer...",85,90,5,10,95,----------------------------------------------...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50338,6265,6265,Financial Aid Application,A form submitted by individuals to request fin...,Education Grant Application,"Fill out the education grant application form,...",English,English language as spoken in the United Kingdom,finance,Financial Aid Application - Education Grant\n\...,"[{""start"": 56, ""end"": 64, ""label"": ""name""}, {""...",85,90,5,10,95,Financial Aid Application - Education Grant\n\...
50339,54886,54886,Product Disclosure Statement,A document providing details about a financial...,Retail Investment Disclosure,"This subtype covers retail sector investments,...",English,English language as spoken in the United State...,finance,PRODUCT DISCLOSURE STATEMENT\n\nInvestment in ...,"[{""start"": 49, ""end"": 74, ""label"": ""company""},...",95,85,10,15,90,PRODUCT DISCLOSURE STATEMENT\n\nInvestment in ...
50341,44732,44732,Investment Prospectus,"A formal document detailing the objectives, ri...",Fashion and Apparel,Prepare a detailed overview of investment pros...,English,English language as spoken in the United State...,finance,INVESTMENT PROSPECTUS\n\nINTRODUCTION\n\nWe ar...,"[{""start"": 98, ""end"": 122, ""label"": ""company""}...",90,85,5,10,95,INVESTMENT PROSPECTUS\n\nINTRODUCTION\n\nWe ar...
50342,54343,54343,FIX Protocol,A messaging standard used for the electronic c...,MarketDataIncrementalRefresh,Develop a MarketDataIncrementalRefresh message...,English,English language as spoken in the United State...,finance,MDReqID:123456\nSubscriptionRequestType:0\nMar...,"[{""start"": 384, ""end"": 396, ""label"": ""name""}, ...",92,85,5,12,88,MDReqID:123456\nSubscriptionRequestType:0\nMar...
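
The descending sort in `process_record` matters because every inserted `{[label]...}` wrapper lengthens the text, which would shift the offsets of any span that starts later. The sketch below illustrates this on an invented two-span example; `mark_up`, `toy_text`, and `toy_spans` are hypothetical names, and the function only mirrors the notebook's insertion logic with a switchable sort order.

```python
def mark_up(text, spans, reverse):
    """Wrap each span as {[label]value}, visiting spans in the requested order."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=reverse):
        start, end, label = span["start"], span["end"], span["label"]
        text = text[:start] + f"{{[{label}]{text[start:end]}}}" + text[end:]
    return text


# Invented example text with two non-overlapping spans.
toy_text = "Jane Doe works at Acme Bank."
toy_spans = [
    {"start": 0, "end": 8, "label": "name"},
    {"start": 18, "end": 27, "label": "company"},
]

# Right-to-left: earlier offsets stay valid while later spans are rewritten.
correct = mark_up(toy_text, toy_spans, reverse=True)
print(correct)

# Left-to-right: the first insertion shifts the text, so the second span
# is applied at stale offsets and wraps the wrong characters.
shifted = mark_up(toy_text, toy_spans, reverse=False)
print(shifted)
```

Processing spans from the end of the text backwards avoids having to track a running offset, at the cost of assuming the spans do not overlap.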
