# **Setup**
Run these cells first.

They will install the required dependencies and generate the dataset we will be using during the demo.

In [None]:
!pip install faker
!pip install faker_music
!pip install presidio_analyzer
!pip install presidio_anonymizer
!python -m spacy download en_core_web_lg


In [None]:
#@title Generate example dataset
import faker
from faker import Faker
from faker_music import MusicProvider
import pandas as pd
import random

fake = Faker()
fake.add_provider(faker.providers.job)
fake.add_provider(faker.providers.passport)

def generate_dummy_data(records):
  data={}
  domains = ['hotmail.com', 'gmail.com', 'yahoo.com', 'scarlet.com', 'dataroots.io', 'company.eu', 'kul.be']
  print(f"Creating a dataset containing {records} records...")
  for i in range(0, records):
      data[i]={}
      data[i]['id'] = fake.unique.random_number(8)
      name = fake.name()
      data[i]['name'] = name
      data[i]['email_address'] = f"{name.replace(' ', '').lower()}@{random.choice(domains)}"
      data[i]['address'] = fake.address()
      data[i]['country_of_birth'] = fake.country()
      data[i]['last_login'] = fake.date_time_this_year().strftime("%Y%m%d")
      data[i]['job'] = fake.job()
      data[i]['passport_number'] = fake.passport_number()
      data[i]['passport_gender'] = fake.passport_gender()
      data[i]['weight_kg'] = fake.random_int(min=50, max=150)
      data[i]['had_flu_last_3_months'] = fake.boolean()
      data[i]['number_of_romantic_dates'] = fake.random_int(0, 20)
      data[i]['in_relationship'] = fake.boolean()
      data[i]['shoesize_eu'] = fake.random_int(34, 49)
      data[i]['height_cm'] = fake.random_int(130, 210)
      data[i]['last_sent_email'] = fake.paragraph(nb_sentences=5)
      data[i]['consent_for_storage'] = fake.boolean(chance_of_getting_true=85)
      data[i]['consent_for_marketing'] = fake.boolean(chance_of_getting_true=50)
      data[i]['consent_for_internal_analytics'] = fake.boolean(chance_of_getting_true=70)
      data[i]['consent_for_usage_in_ai_model'] = fake.boolean(chance_of_getting_true=60)
  return data

fake_data = generate_dummy_data(50)

fake_data = pd.DataFrame(fake_data)
fake_data = fake_data.T
fake_data.head(5)

Creating a dataset containing 50 records...


Unnamed: 0,id,name,email_address,address,country_of_birth,last_login,job,passport_number,passport_gender,weight_kg,had_flu_last_3_months,number_of_romantic_dates,in_relationship,shoesize_eu,height_cm,last_sent_email,consent_for_storage,consent_for_marketing,consent_for_internal_analytics,consent_for_usage_in_ai_model
0,22752603,Destiny Walker,destinywalker@hotmail.com,"99451 Morris Grove Apt. 798\nSparksmouth, VT 9...",Israel,20230401,"Engineer, building services",925732089,M,87,True,18,False,38,206,Keep recently behind purpose choice. Budget se...,True,True,True,True
1,71418450,Brandon Berg,brandonberg@hotmail.com,"795 Nguyen Road Suite 344\nPort Erika, KS 34141",Iceland,20230806,Tree surgeon,297835329,M,106,True,16,True,35,144,Audience floor over forward reason fine third....,True,True,True,True
2,82505514,Lisa Erickson,lisaerickson@hotmail.com,"PSC 1028, Box 8310\nAPO AP 60119",Timor-Leste,20230216,"Engineer, civil (contracting)",134296601,M,92,True,12,False,43,205,Skin east much. Near imagine company film brin...,False,True,True,False
3,17119511,Annette Bryant,annettebryant@scarlet.com,"PSC 8371, Box 2699\nAPO AA 78465",Belgium,20230324,Air cabin crew,368388627,F,63,True,1,True,37,176,Human couple state pick various life. Five vis...,True,False,True,True
4,19195144,Cheryl Lowe,cheryllowe@dataroots.io,"289 Theresa Flats Suite 480\nEast Lisafort, WA...",Dominica,20230705,Therapeutic radiographer,K08927961,M,68,False,11,True,43,176,Service current everyone discussion future int...,True,True,True,True


# **Logging & Traceability**

This is often achieved with version control tools such as Git, where clear and correct commit messages are of key importance.

In addition, data lineage can be used to identify where and what data is used. It creates a form of accountability for the business to ensure privacy and traceability.

Challenge: Don't forget to log every time you apply a change to the dataset or write a new code snippet :)

In [None]:
import logging
import getpass

# Log file will be created in /content
logging.basicConfig(
    filename='data_transformations.log',
    filemode='a',
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    force=True
)

def log_transformation(operation):
  # Assuming you will act from a user account when performing ETL jobs
  user = getpass.getuser()
  logging.info(f'User {user} performed operation: {operation}')
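
As a possible extension of the challenge above, logging can be attached to the transformation functions themselves so that it cannot be forgotten. The sketch below is only an illustration: the logger name `data_transformations`, the decorator `logged_transformation`, and the example function `drop_missing_names` are hypothetical names that do not appear elsewhere in this notebook.

```python
import functools
import logging

import pandas as pd

# Hypothetical logger name, chosen for this sketch only
logger = logging.getLogger("data_transformations")


def logged_transformation(func):
    """Log the function name and the row counts before and after it runs."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_before = len(df)
        result = func(df, *args, **kwargs)
        logger.info("%s: rows %d -> %d", func.__name__, rows_before, len(result))
        return result
    return wrapper


@logged_transformation
def drop_missing_names(df):
    # Example transformation: remove rows without a name
    return df.dropna(subset=["name"])


example = pd.DataFrame({"name": ["Tonya Lewis", None, "Roger Kemp"]})
cleaned = drop_missing_names(example)
```

Recording the row counts next to the operation name makes it easier to trace later which step removed which records.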


# **Data Aggregation**
We can use data aggregation to anonymize numerical data by grouping values into ranges.

Weight and height are not sensitive on their own, but they can become a risk when combined with other columns that could identify a person.
For numerical values such as weight or height, we suggest using data aggregation where possible.

Let's first apply this to the weight column.

In [None]:
# Define our range thresholds
bins = [50, 70, 90, 110, 130, 150]
# Define the labels attached to each range to replace the exact values with
labels = ['50-70 kg', '71-90 kg', '91-110 kg', '111-130 kg', '130+ kg']

try:
  # Use the cut method provided by pandas to apply our ranges to the weight_kg column.
  # include_lowest=True keeps a weight of exactly 50 in the first range instead of turning it into NaN.
  fake_data['weight_kg'] = pd.cut(fake_data['weight_kg'], bins=bins, labels=labels, include_lowest=True)
  log_transformation("Aggregated weight_kg")
except TypeError:
  print("Already converted weights into ranges, unable to do it a second time.")
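
One detail worth checking with `pd.cut` is how the bin edges are treated. By default the intervals are open on the left, so a value equal to the lowest edge falls outside every bin. The small sketch below uses made-up weights (not rows from the generated dataset) to compare the default behaviour with `include_lowest=True`.

```python
import pandas as pd

# Made-up weights that sit exactly on or next to the bin edges
weights = pd.Series([50, 70, 71, 150])
bins = [50, 70, 90, 110, 130, 150]
labels = ['50-70 kg', '71-90 kg', '91-110 kg', '111-130 kg', '130+ kg']

# Default: the first interval is (50, 70], so 50 itself is not assigned a bin
default_cut = pd.cut(weights, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [50, 70]
inclusive_cut = pd.cut(weights, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"weight": weights, "default": default_cut, "inclusive": inclusive_cut}))
```

Since the generated weights can be exactly 50, the inclusive variant avoids silently losing those values.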


Now that you have seen one way to apply these ranges using pandas, replicate it for the height_cm column.

If you know any other methods to achieve similar results, feel free to use them of course!

In [None]:
# Write your code here


In [None]:
#@title Solution
bins = [130, 150, 170, 190, 210]
labels = ['130 - 150 cm', '151 - 170 cm', '171 - 190 cm', '190+ cm']

try:
  # include_lowest=True keeps a height of exactly 130 in the first range
  fake_data['height_cm'] = pd.cut(fake_data['height_cm'], bins=bins, labels=labels, include_lowest=True)
  log_transformation("Aggregated height_cm")
except TypeError:
  print("Already converted height_cm into ranges, unable to do it a second time.")

# Display the result (an expression inside the try block would not be rendered)
fake_data.head()

# **Data retention**
Another important thing to think about is data retention.
With this we try to limit how long we keep a person's data.
For this demo, let's choose a retention period of 60 days.
This means we do not want to store or use data older than 60 days. If your last login was 70 days ago, your data should no longer be stored or used.

Given the column 'last_login' with dates in string format '%Y%m%d', write a snippet to filter out rows older than 60 days.


In [None]:
from datetime import datetime, timedelta

retention_period_in_days = 60

# Write your code snippet here

In [None]:
#@title Solution
from datetime import datetime, timedelta

retention_period_in_days = 60

fake_data['last_login'] = pd.to_datetime(fake_data['last_login'], format='%Y%m%d')
fake_data = fake_data.loc[fake_data['last_login'] > datetime.now() - timedelta(days=retention_period_in_days)]
log_transformation(f"Applied retention period of {retention_period_in_days} days")
fake_data.head(10)

Unnamed: 0,id,name,email_address,address,country_of_birth,last_login,job,passport_number,passport_gender,weight_kg,had_flu_last_3_months,number_of_romantic_dates,in_relationship,shoesize_eu,height_cm,last_sent_email,consent_for_storage,consent_for_marketing,consent_for_internal_analytics,consent_for_usage_in_ai_model
12,83740781,Tonya Lewis,tonyalewis@dataroots.io,"60840 Leah Summit Apt. 915\nParkerberg, WA 99482",Philippines,2023-09-09,General practice doctor,S02005420,F,91-110 kg,True,4,False,41,171 - 190 cm,How ever food only almost tough. Knowledge els...,True,True,True,True
21,54129705,Priscilla Diaz,priscilladiaz@dataroots.io,"975 Cameron Alley Apt. 791\nThomashaven, NJ 72764",Ghana,2023-08-07,Civil Service fast streamer,820496199,M,130+ kg,True,15,False,39,130 - 150 cm,Set cause agent another anything need specific...,True,True,False,False
25,15814954,Cynthia Watson,cynthiawatson@kul.be,9051 Torres Trail Apt. 547\nWest Marissaboroug...,Uganda,2023-08-22,"Conservation officer, historic buildings",918597124,M,91-110 kg,True,7,False,39,171 - 190 cm,When weight us simply market. Wonder hospital ...,True,False,True,True
26,41839385,Emily Garcia,emilygarcia@kul.be,Unit 6464 Box 7325\nDPO AE 29180,Sierra Leone,2023-08-07,Insurance account manager,B27726603,F,130+ kg,True,4,False,46,190+ cm,Political laugh single. Time administration mu...,True,False,False,False
31,41706731,Kathryn Anderson,kathrynanderson@dataroots.io,"972 Shawn Common\nBrownhaven, OH 84292",Sudan,2023-09-02,Geographical information systems officer,R86091394,F,130+ kg,False,9,False,47,190+ cm,Meeting a account inside expert well the. Page...,True,True,True,True
41,44992477,Roger Kemp,rogerkemp@kul.be,"6598 Carol Village Suite 315\nEast Tyler, AR 4...",Wallis and Futuna,2023-09-04,Senior tax professional/tax inspector,222822457,F,71-90 kg,True,16,False,37,151 - 170 cm,Man positive single threat. Impact call teach ...,True,False,False,True
48,62532565,Diane Hayden,dianehayden@company.eu,"791 Jesus Points\nJeffreymouth, SC 93827",Bulgaria,2023-08-18,"Nurse, mental health",318240974,F,71-90 kg,True,12,True,42,190+ cm,Top likely newspaper quality describe leg size...,True,True,False,True
49,71433787,Alexis Ortiz,alexisortiz@gmail.com,"7835 Chambers Course Suite 720\nJoycebury, AS ...",Fiji,2023-09-04,Comptroller,767984439,F,50-70 kg,True,19,False,36,130 - 150 cm,Through majority career management however it ...,True,True,True,False
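
The solution above depends on the current date, so its result changes every day it is run. A variant that may be easier to test is sketched below: it takes the reference date as a parameter. The function name `apply_retention` and the three sample rows are assumptions made for this illustration, and the sketch keeps rows that fall exactly on the cutoff date.

```python
from datetime import datetime, timedelta

import pandas as pd


def apply_retention(df, date_column, retention_days, reference_date):
    """Keep only the rows whose date is within the retention period."""
    cutoff = reference_date - timedelta(days=retention_days)
    dates = pd.to_datetime(df[date_column], format="%Y%m%d")
    return df.loc[dates >= cutoff]


# Made-up sample rows in the same '%Y%m%d' string format as last_login
logins = pd.DataFrame({
    "id": [1, 2, 3],
    "last_login": ["20230910", "20230701", "20230801"],
})

# A fixed reference date keeps the example reproducible
reference = datetime(2023, 9, 15)
kept = apply_retention(logins, "last_login", 60, reference)
print(kept)
```

With a 60-day period and this reference date the cutoff is 17 July 2023, so only the row with the login on 1 July is dropped.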


# **Data Protegrity**

Now we want to make sure all remaining PII is anonymized, so that we cannot link the data to an actual person.

One way of achieving this is through Protegrity. In practice it requires a lot of configuration and algorithmic complexity, but it remains one of the most secure ways of ensuring confidentiality and privacy in our datasets.

Protegrity makes use of tokenizing and detokenizing.
Below we will create a simplified example to showcase what it does.

In the example we use a dictionary to store the token mapping. Of course, in a real-world scenario this would be stored in a separate, secured external database.

In [None]:
"""
In this example we use a dictionary to store the token mapping.
Of course, in a real-world scenario this would be stored in a separate, secured external database.
This would also remove the need to pass the mapping along when detokenizing.

The idea is that the tokenization is irreversible for someone with no detokenization rights.

So it is important to note that a Role-Based Access Control approach is highly advised when working with Protegrity.
This way only selected roles within a company will be able to detokenize the data.

A great advantage of this approach is that if the same policies are applied to multiple or all datasets,
data manipulation such as joins is still possible.

In the example here we do not retain the characteristics of the original data.
This is quite a complex task but possible to achieve.
We might want the email addresses to still look like email addresses but with different names,
or phone numbers to still be formatted as phone numbers.
"""
def tokenize(df, column):
    unique_values = df[column].unique()
    tokens = {value: 'token_'+str(i) for i, value in enumerate(unique_values)}
    df[column] = df[column].map(tokens)
    log_transformation(f"Tokenized column '{column}'")
    return df, tokens

def detokenize(df, column, token_mapping):
    inv_mapping = {v: k for k, v in token_mapping.items()}
    df[column] = df[column].map(inv_mapping)
    log_transformation(f"Detokenized column '{column}'")
    return df

In [None]:
df, token_mapping_name = tokenize(fake_data, 'name')
df.head(3)

# Tokenize the remaining fields that you think are PII sensitive.

Unnamed: 0,id,name,email_address,address,country_of_birth,last_login,job,passport_number,passport_gender,weight_kg,had_flu_last_3_months,number_of_romantic_dates,in_relationship,shoesize_eu,height_cm,last_sent_email,consent_for_storage,consent_for_marketing,consent_for_internal_analytics,consent_for_usage_in_ai_model
12,83740781,token_0,tonyalewis@dataroots.io,"60840 Leah Summit Apt. 915\nParkerberg, WA 99482",Philippines,2023-09-09,General practice doctor,S02005420,F,91-110 kg,True,4,False,41,171 - 190 cm,How ever food only almost tough. Knowledge els...,True,True,True,True
21,54129705,token_1,priscilladiaz@dataroots.io,"975 Cameron Alley Apt. 791\nThomashaven, NJ 72764",Ghana,2023-08-07,Civil Service fast streamer,820496199,M,130+ kg,True,15,False,39,130 - 150 cm,Set cause agent another anything need specific...,True,True,False,False
25,15814954,token_2,cynthiawatson@kul.be,9051 Torres Trail Apt. 547\nWest Marissaboroug...,Uganda,2023-08-22,"Conservation officer, historic buildings",918597124,M,91-110 kg,True,7,False,39,171 - 190 cm,When weight us simply market. Wonder hospital ...,True,False,True,True


In [None]:
df = detokenize(fake_data, 'name', token_mapping_name)
df.head(3)

Unnamed: 0,id,name,email_address,address,country_of_birth,last_login,job,passport_number,passport_gender,weight_kg,had_flu_last_3_months,number_of_romantic_dates,in_relationship,shoesize_eu,height_cm,last_sent_email,consent_for_storage,consent_for_marketing,consent_for_internal_analytics,consent_for_usage_in_ai_model
12,83740781,Tonya Lewis,tonyalewis@dataroots.io,"60840 Leah Summit Apt. 915\nParkerberg, WA 99482",Philippines,2023-09-09,General practice doctor,S02005420,F,91-110 kg,True,4,False,41,171 - 190 cm,How ever food only almost tough. Knowledge els...,True,True,True,True
21,54129705,Priscilla Diaz,priscilladiaz@dataroots.io,"975 Cameron Alley Apt. 791\nThomashaven, NJ 72764",Ghana,2023-08-07,Civil Service fast streamer,820496199,M,130+ kg,True,15,False,39,130 - 150 cm,Set cause agent another anything need specific...,True,True,False,False
25,15814954,Cynthia Watson,cynthiawatson@kul.be,9051 Torres Trail Apt. 547\nWest Marissaboroug...,Uganda,2023-08-22,"Conservation officer, historic buildings",918597124,M,91-110 kg,True,7,False,39,171 - 190 cm,When weight us simply market. Wonder hospital ...,True,False,True,True
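
Two points from the docstring above, joins across datasets and tokens that keep the shape of the original value, can be illustrated with a small sketch. It uses a keyed hash (HMAC-SHA256 from the Python standard library) as a stand-in; this is an assumption made for the example and not how Protegrity works internally. The key, the function name `tokenize_email`, and the two small tables are hypothetical.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical demo key; in practice it would live in a secrets manager
SECRET_KEY = b"demo-only-key"


def tokenize_email(email, key=SECRET_KEY):
    """Replace the local part of an email with a keyed hash, keeping the email shape."""
    local_part, _, domain = email.partition("@")
    digest = hmac.new(key, local_part.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@{domain}"


customers = pd.DataFrame({
    "email_address": ["tonyalewis@dataroots.io", "rogerkemp@kul.be"],
    "job": ["General practice doctor", "Senior tax professional/tax inspector"],
})
logins = pd.DataFrame({
    "email_address": ["rogerkemp@kul.be"],
    "last_login": ["2023-09-04"],
})

# The same key and function are applied to both datasets
customers["email_address"] = customers["email_address"].map(tokenize_email)
logins["email_address"] = logins["email_address"].map(tokenize_email)

# Identical inputs produce identical tokens, so the join still works
joined = customers.merge(logins, on="email_address", how="inner")
print(joined)
```

Because a keyed hash cannot be reversed on its own, detokenization would still need a stored mapping like the dictionary used earlier. Keeping the domain visible is a trade-off that can itself leak information.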


# **Purpose limitation**

One way to implement purpose limitation is to add consent columns for the different usage purposes.
In our example we use four:

  - Consent for storage
  - Consent for marketing use
  - Consent for internal analytics
  - Consent for usage in an AI model (training of a chatbot, for example)

The values in these columns are either True (consent given) or False (no consent given).

For each use case we should filter out the clients who do not wish their data to be used for the matching consent types. This exercise should be repeated for every use case to determine which consents apply to it.

For example: data that will be used to send out targeted ads by email through an ML system requires consent for marketing use and consent for usage in an AI model (and consent for storage in case the data will be persisted).

Many more consent types can be defined of course, but let's first focus on these.

Now let's try this ourselves: given a use case, identify which consent filters need to be taken into account and apply the appropriate filtering.

**Use Case:**

In [None]:
# Let's filter out individuals who did not give consent for their data to be persisted.
fake_data = fake_data[fake_data.consent_for_storage]

# Continue with the other consent types that you identified as important for your use case.


fake_data

Unnamed: 0,id,name,email_address,address,country_of_birth,last_login,job,passport_number,passport_gender,weight_kg,had_flu_last_3_months,number_of_romantic_dates,in_relationship,shoesize_eu,height_cm,last_sent_email,consent_for_storage,consent_for_marketing,consent_for_internal_analytics,consent_for_usage_in_ai_model
12,83740781,Tonya Lewis,tonyalewis@dataroots.io,"60840 Leah Summit Apt. 915\nParkerberg, WA 99482",Philippines,2023-09-09,General practice doctor,S02005420,F,91-110 kg,True,4,False,41,171 - 190 cm,How ever food only almost tough. Knowledge els...,True,True,True,True
21,54129705,Priscilla Diaz,priscilladiaz@dataroots.io,"975 Cameron Alley Apt. 791\nThomashaven, NJ 72764",Ghana,2023-08-07,Civil Service fast streamer,820496199,M,130+ kg,True,15,False,39,130 - 150 cm,Set cause agent another anything need specific...,True,True,False,False
25,15814954,Cynthia Watson,cynthiawatson@kul.be,9051 Torres Trail Apt. 547\nWest Marissaboroug...,Uganda,2023-08-22,"Conservation officer, historic buildings",918597124,M,91-110 kg,True,7,False,39,171 - 190 cm,When weight us simply market. Wonder hospital ...,True,False,True,True
26,41839385,Emily Garcia,emilygarcia@kul.be,Unit 6464 Box 7325\nDPO AE 29180,Sierra Leone,2023-08-07,Insurance account manager,B27726603,F,130+ kg,True,4,False,46,190+ cm,Political laugh single. Time administration mu...,True,False,False,False
31,41706731,Kathryn Anderson,kathrynanderson@dataroots.io,"972 Shawn Common\nBrownhaven, OH 84292",Sudan,2023-09-02,Geographical information systems officer,R86091394,F,130+ kg,False,9,False,47,190+ cm,Meeting a account inside expert well the. Page...,True,True,True,True
41,44992477,Roger Kemp,rogerkemp@kul.be,"6598 Carol Village Suite 315\nEast Tyler, AR 4...",Wallis and Futuna,2023-09-04,Senior tax professional/tax inspector,222822457,F,71-90 kg,True,16,False,37,151 - 170 cm,Man positive single threat. Impact call teach ...,True,False,False,True
48,62532565,Diane Hayden,dianehayden@company.eu,"791 Jesus Points\nJeffreymouth, SC 93827",Bulgaria,2023-08-18,"Nurse, mental health",318240974,F,71-90 kg,True,12,True,42,190+ cm,Top likely newspaper quality describe leg size...,True,True,False,True
49,71433787,Alexis Ortiz,alexisortiz@gmail.com,"7835 Chambers Course Suite 720\nJoycebury, AS ...",Fiji,2023-09-04,Comptroller,767984439,F,50-70 kg,True,19,False,36,130 - 150 cm,Through majority career management however it ...,True,True,True,False
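
The targeted-ads example described earlier in this section can be generalised by keeping an explicit mapping from each use case to the consents it requires. The sketch below is one possible way to do this. The use case names, the mapping `REQUIRED_CONSENTS`, the helper `filter_by_consent`, and the three sample rows are all assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical mapping from use case to the consent columns it requires
REQUIRED_CONSENTS = {
    "targeted_email_ads": [
        "consent_for_storage",
        "consent_for_marketing",
        "consent_for_usage_in_ai_model",
    ],
    "internal_dashboard": [
        "consent_for_storage",
        "consent_for_internal_analytics",
    ],
}


def filter_by_consent(df, use_case):
    """Keep only the rows where every consent required by the use case is True."""
    required = REQUIRED_CONSENTS[use_case]
    mask = df[required].astype(bool).all(axis=1)
    return df.loc[mask]


# Made-up sample rows
people = pd.DataFrame({
    "name": ["person_a", "person_b", "person_c"],
    "consent_for_storage": [True, True, False],
    "consent_for_marketing": [True, False, True],
    "consent_for_internal_analytics": [False, True, True],
    "consent_for_usage_in_ai_model": [True, True, True],
})

ads_audience = filter_by_consent(people, "targeted_email_ads")
dashboard_rows = filter_by_consent(people, "internal_dashboard")
print(ads_audience["name"].tolist(), dashboard_rows["name"].tolist())
```

Keeping the mapping in one place makes the consent requirements of each use case easy to review and to log.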


# **Data Minimisation**
Given the same use case, try to identify which columns are needed to reach its goal.

In [None]:
# Add the columns you wish to use for the mentioned use case.
# The empty strings are placeholders: running this cell unchanged raises a KeyError.
fake_data = fake_data[["", ""]]

Now that you have selected the minimally required columns for your use case, let's compare your choice to your neighbour's selection.
Did you select the same columns? Why? Why not?
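
Selecting columns by name fails with a `KeyError` when a name is misspelled or missing. A small helper can make that failure more explicit, as sketched below. The helper name `select_columns`, the sample rows, and the chosen columns are assumptions for this illustration and are not the answer to the exercise.

```python
import pandas as pd


def select_columns(df, columns):
    """Return a copy of df restricted to the given columns, with a clear error if any are unknown."""
    missing = [column for column in columns if column not in df.columns]
    if missing:
        raise ValueError(f"Unknown columns: {missing}")
    return df[columns].copy()


# Made-up sample rows using column names from the generated dataset
people = pd.DataFrame({
    "name": ["Tonya Lewis", "Roger Kemp"],
    "country_of_birth": ["Philippines", "Wallis and Futuna"],
    "shoesize_eu": [41, 37],
})

# Hypothetical choice: only these two columns are needed, so the name is dropped
minimal = select_columns(people, ["country_of_birth", "shoesize_eu"])
print(minimal)
```

Returning a copy leaves the original frame untouched, which is convenient while you are still comparing different column selections.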

# *Cool tool to help with Data Privacy*
## **Identifying PII Data using presidio from Microsoft**
An example of a tool that we can use to identify PII in a dataset is **presidio**.
It uses NLP to identify which words in a text or column are considered PII.

On top of identification, we can use it to apply multiple anonymization techniques such as masking, encrypting, and hashing.

There is quite some room for customization, but for now let's just use it to identify what information we should look at when anonymizing our dataset, and apply encryption only to the EMAIL_ADDRESS entity.



In [None]:
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine, RecognizerResult, DictAnalyzerResult
from presidio_anonymizer import AnonymizerEngine, BatchAnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import EngineResult, OperatorConfig

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
data_dict = fake_data.to_dict(orient="list")
analyzer = AnalyzerEngine()
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)
batch_anonymizer = BatchAnonymizerEngine()

# Analyse the data set using the engine to determine the PII Data
analyzer_results = batch_analyzer.analyze_dict(data_dict, language="en")

# Use the analyzer results to apply an action to the identified PII data, e.g. encryption of email addresses
anonymizer_results = batch_anonymizer.anonymize_dict(
    analyzer_results,
    operators={
        "EMAIL_ADDRESS": OperatorConfig("encrypt", {"key": "ThisIsAnExampleEncryptionKey128*"})
    },
)

# Convert our results back into a dataset
result = pd.DataFrame(anonymizer_results)
result.head(10)

Unnamed: 0,id,name,email_address,address,country_of_birth,last_login,job,passport_number,passport_gender,weight_kg,had_flu_last_3_months,number_of_romantic_dates,in_relationship,shoesize_eu,height_cm,last_sent_email,consent_for_storage,consent_for_marketing,consent_for_internal_analytics,consent_for_usage_in_ai_model
0,<US_BANK_NUMBER>,Destiny Walker,DWLSny9+69eNDudxFj1fJNPJvGGBhDNeqe0eRoc3DRpjR8...,"99451 Morris Grove Apt. 798\nSparksmouth, VT 9...",<LOCATION>,<DATE_TIME>,"Engineer, building services",<US_PASSPORT>,M,87,True,18,False,38,206,Keep recently behind purpose choice. Budget se...,True,True,True,True
1,<DATE_TIME>,<PERSON>,vHNh/xvEuKp9Lp0giNXnNaE/ztIkKKrCbOKcICDOrzIewZ...,"795 <PERSON> 344\nPort Erika, KS 34141",<LOCATION>,<DATE_TIME>,Tree surgeon,<US_PASSPORT>,M,106,True,16,True,35,144,Audience floor over forward reason fine third....,True,True,True,True
2,<US_BANK_NUMBER>,<PERSON>,RnS5lHit1YLDBxDxjv0O/GuMJHS1F7LyNm90nBvJ6oILeb...,"PSC 1028, Box 8310\nAPO AP 60119",<LOCATION>-Leste,<US_BANK_NUMBER>,"Engineer, civil (contracting)",<AU_ACN>,M,92,True,12,False,43,205,Skin east much. Near imagine company film brin...,False,True,True,False
3,<US_BANK_NUMBER>,<PERSON>,g0y3OyytaprSWg5iVZwIjwAsq8KHAlIUVPGBuJqu8J7cRf...,"PSC 8371, Box 2699\nAPO AA 78465",<LOCATION>,<US_BANK_NUMBER>,Air cabin crew,<US_PASSPORT>,F,63,True,1,True,37,176,Human couple state pick various life. Five vis...,True,False,True,True
4,<US_BANK_NUMBER>,<PERSON>,ULewvp0Zwxy07DlmzOK97DwCTyv0GSVsQrOkn7q3GjdUI2...,"289 Theresa Flats Suite 480\n<NRP>, <LOCATION>...",<LOCATION>,<US_BANK_NUMBER>,Therapeutic radiographer,<US_PASSPORT>,M,68,False,11,True,43,176,Service current everyone discussion future int...,True,True,True,True
5,<US_DRIVER_LICENSE>,<PERSON>,vcbNlmIKOwA71skTyx9pjWknIQJi+OlFgzetr21SDn7Z3D...,"<DATE_TIME> Lee Flat Suite <DATE_TIME>, <LOCAT...",<LOCATION>,<US_BANK_NUMBER>,"Restaurant manager, fast food",<US_PASSPORT>,F,145,True,17,True,48,182,Religious fall early art still budget school. ...,False,False,True,True
6,<US_BANK_NUMBER>,<PERSON>,gHrWdIKpK5e39ZIInvoZFRzmoSwXZhBm+NVh4nqWsmHdPX...,"4122 Joshua Port Suite 285\nSouth <PERSON>, MA...",<LOCATION>,<US_BANK_NUMBER>,Neurosurgeon,<US_PASSPORT>,M,60,True,20,True,49,187,Professional data some report bag also certain...,True,True,True,True
7,<US_BANK_NUMBER>,<PERSON>,6NVlgVSg6p5m5TCWG6khssoEThIf2uSO2Vysic/3JqOwLu...,"444 Richardson Place Apt. 827\nLake Cody, CT 7...",<LOCATION>,<US_BANK_NUMBER>,"Doctor, hospital",<US_PASSPORT>,F,131,False,14,False,47,168,Meet art shake least laugh whose. Coach per kn...,True,False,True,False
8,<US_DRIVER_LICENSE>,<PERSON>,7i40KMfM2Roxm8O9c+ckL6HQyC+HqDAMXk3g0xidd0dqAS...,"978 <PERSON>. 226\nNancyland, MS 57499",<PERSON> and Principe,<US_BANK_NUMBER>,"Exhibitions officer, museum/gallery",<DATE_TIME>,M,88,True,6,True,47,131,Avoid everybody field pull response. Wrong nat...,True,True,True,False
9,<US_BANK_NUMBER>,<PERSON>,lS6aS8QKnDLV7hUpdVDwTk0SrlkU6772SgBvpwwAOmV+fw...,"626 <PERSON><LOCATION>, OR 41696",<LOCATION>,<US_BANK_NUMBER>,Oceanographer,<PHONE_NUMBER>,M,125,True,17,False,37,186,Seat teach by director prevent seem chair. Gir...,True,False,False,False


As you can see, the result does contain flaws: some entries are misinterpreted.
This is because the library is mainly designed to be applied to larger blocks of text, in which the context helps the algorithm to identify PII.

On the other hand, the emails were easily identified and encrypted by the algorithm. The nice thing about encryption is that we can decrypt the email addresses when we want to use them later in various use cases. For this you will need the encryption key, of course.
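
Unlike encryption, masking and hashing (two of the other techniques mentioned earlier in this section) cannot be undone with a key. The plain-Python sketch below shows what they could look like for an email address. It uses only the standard library and is not presidio's implementation; the function names `mask_email` and `hash_value` and the salt value are assumptions made for this example.

```python
import hashlib


def mask_email(email, visible=1, mask_char="*"):
    """Hide the local part of an email except for the first `visible` characters."""
    local, sep, domain = email.partition("@")
    hidden = max(len(local) - visible, 0)
    return f"{local[:visible]}{mask_char * hidden}{sep}{domain}"


def hash_value(value, salt="demo-salt"):
    """Return a salted SHA-256 hex digest; the salt here is a hypothetical demo value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


email = "tonyalewis@dataroots.io"
masked = mask_email(email)
hashed = hash_value(email)
print(masked)
print(hashed)
```

Masking keeps the value readable for humans, while hashing keeps it usable as a stable identifier. Which of the techniques fits depends on whether the original value ever needs to be recovered.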

# **More tools to possibly discuss?**

- **Protegrity**: Protegrity is a data security platform that provides advanced tokenization, encryption, and anonymization methods. It is built for enterprise use and supports a wide variety of data sources and formats.

- **IBM Guardium Data Protection**: This platform helps ensure the integrity of information in data centers and automates compliance controls. It provides real-time data protection and monitors data access.

- **Informatica Data Privacy Management**: Informatica's solution offers a wide range of data privacy features, including data risk analysis, data discovery, and classification. It also supports data masking and encryption.

- **Oracle Data Masking and Subsetting**: Oracle's solution helps manage and secure sensitive data in non-production environments. Data Masking helps ensure sensitive information is replaced with realistic values, allowing developers to work with real data, but not exposing sensitive information.

- **Imperva Data Masking**: Imperva's data masking solution ensures sensitive data is replaced with realistic, but not real, data – maintaining both operational and business value.

- **Privitar**: Privitar provides data privacy software focusing on customer-centric privacy preservation. It enables safe data use for analytics and machine learning.

- **OneTrust**: OneTrust is a privacy, security, and governance tool. It has modules for GDPR, CCPA, and other regulations. OneTrust assists with privacy impact assessments, data mapping, consent management, and more.

- **Varonis Data Security Platform**: Varonis is a pioneer in data security and analytics. It specializes in software for risk detection and response, user behavior analytics, data archiving, and more.

- **Microsoft Compliance Manager**: This solution, part of the larger Microsoft 365 compliance center, helps you manage your organization's compliance requirements with greater ease and convenience.

- **Google Cloud's Data Loss Prevention (DLP)**: This service helps you manage, secure, and prevent the loss of sensitive data, using techniques such as data redaction, data masking, and tokenization.

- **Amazon Macie**: Amazon Macie is an AWS security service that uses machine learning to automatically discover, classify, and protect sensitive data like PII.