# 🧪 Generative AI for Data Anonymization

This notebook demonstrates how to apply various data anonymization techniques using Python and pandas. Techniques covered:
- Pseudonymization
- Redaction
- Generalization
- Noise addition

We will use a synthetic dataset generated using the Faker library.

In [1]:
# 📥 Import required libraries
import pandas as pd
import random

In [2]:
# 🔗 Load the synthetic dataset
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
df.head()

Unnamed: 0,Name,Email,Age,Contact Number
0,Brenda Richards,michelle76@example.org,79,9898586166
1,Antonio Perez,psingleton@example.net,19,9876282758
2,Terry Monroe,edwardross@example.net,30,9782846470
3,Heather Floyd,cookbrooke@example.net,65,9739572462
4,Allen Shelton,craigcollins@example.net,63,9676063153


## 🔐 Redact `Name`: Keep only vowels, replace every other character (including spaces) with `#`

In [3]:
def redact_name_vowels(name):
    vowels = 'aeiouAEIOU'
    return ''.join([char if char in vowels else '#' for char in name])

df['Name'] = df['Name'].apply(redact_name_vowels)
df.head()

Unnamed: 0,Name,Email,Age,Contact Number
0,##e##a##i##a###,michelle76@example.org,79,9898586166
1,A##o#io##e#e#,psingleton@example.net,19,9876282758
2,#e#####o##oe,edwardross@example.net,30,9782846470
3,#ea##e####o##,cookbrooke@example.net,65,9739572462
4,A##e####e##o#,craigcollins@example.net,63,9676063153
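
Because `redact_name_vowels` also replaces the space between first and last name, the redacted value hides word boundaries. If keeping those boundaries is acceptable for a given use case, a variant might look like the sketch below. The name `redact_name_keep_spaces` is hypothetical and not part of the original notebook.

```python
VOWELS = set("aeiouAEIOU")

def redact_name_keep_spaces(name: str) -> str:
    """Keep vowels and whitespace; replace every other character with '#'."""
    return "".join(
        char if char in VOWELS or char.isspace() else "#"
        for char in name
    )

print(redact_name_keep_spaces("Brenda Richards"))  # ##e##a #i##a###
```

It could be applied the same way as the original function, for example with `df['Name'].apply(redact_name_keep_spaces)`. Note that preserving spaces reveals the length of each name part, so it is a weaker redaction than the version above.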


## 📧 Pseudonymize `Email`: Convert to `user_i@pseudo.com`

In [4]:
df['Email'] = [f'user_{i}@pseudo.com' for i in range(1, len(df) + 1)]
df.head()

Unnamed: 0,Name,Email,Age,Contact Number
0,##e##a##i##a###,user_1@pseudo.com,79,9898586166
1,A##o#io##e#e#,user_2@pseudo.com,19,9876282758
2,#e#####o##oe,user_3@pseudo.com,30,9782846470
3,#ea##e####o##,user_4@pseudo.com,65,9739572462
4,A##e####e##o#,user_5@pseudo.com,63,9676063153
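
The cell above numbers rows by position, so an email that appeared in two rows would receive two different pseudonyms. If consistent pseudonyms matter (for example, to keep records of the same person linkable), a mapping-based approach is one option. The sketch below is an assumption about how that could look; `pseudonymize_emails` is a hypothetical helper, not part of the original notebook.

```python
def pseudonymize_emails(emails):
    """Map each distinct email to a stable 'user_<n>@pseudo.com' pseudonym."""
    mapping = {}      # original email -> pseudonym
    pseudonyms = []
    for email in emails:
        if email not in mapping:
            # Number pseudonyms in order of first appearance
            mapping[email] = f"user_{len(mapping) + 1}@pseudo.com"
        pseudonyms.append(mapping[email])
    return pseudonyms, mapping

pseudonyms, mapping = pseudonymize_emails(
    ["michelle76@example.org", "psingleton@example.net", "michelle76@example.org"]
)
print(pseudonyms)
# ['user_1@pseudo.com', 'user_2@pseudo.com', 'user_1@pseudo.com']
```

The returned `mapping` allows re-identification, so it would need to be stored securely or discarded, depending on whether the pseudonymization should be reversible.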


## 📊 Generalize `Age`: Group into ranges like `30s`, `40s`, etc.

In [5]:
def generalize_age(age):
    # Floor to the decade so ages below 10 or above 99 land in the right group
    return f"{(int(age) // 10) * 10}s"

df['Age'] = df['Age'].apply(generalize_age)
df.head()

Unnamed: 0,Name,Email,Age,Contact Number
0,##e##a##i##a###,user_1@pseudo.com,70s,9898586166
1,A##o#io##e#e#,user_2@pseudo.com,10s,9876282758
2,#e#####o##oe,user_3@pseudo.com,30s,9782846470
3,#ea##e####o##,user_4@pseudo.com,60s,9739572462
4,A##e####e##o#,user_5@pseudo.com,60s,9676063153
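
Decade labels are one way to bucket ages. As a sketch of the same idea with an adjustable bucket size, integer division gives the lower bound of the bucket directly. The function name `generalize_age_range` and the `width` parameter are hypothetical choices for illustration.

```python
def generalize_age_range(age, width=10):
    """Return the bucket containing `age` as a label such as '30-39'."""
    age = int(age)
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width  # floor to the start of the bucket
    return f"{lower}-{lower + width - 1}"

for sample in (79, 19, 30):
    print(sample, generalize_age_range(sample))
# 79 70-79
# 19 10-19
# 30 30-39
```

Wider buckets (for example `width=20`) hide more detail at the cost of analytical precision, so the width is a trade-off to choose per dataset.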


## 📱 Add Noise to `Contact Number`: Replace the first 5 digits with random digits

In [6]:
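def noise_first_five(contact_number):
    contact_number = str(contact_number)
    # Random five-digit number that replaces the first five digits
    noise = str(random.randint(10000, 99999))
    # Keep the original last five digits (assumes a 10-digit number)
    return noise + contact_number[-5:]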

df['Contact Number'] = df['Contact Number'].apply(noise_first_five)
df.head()

Unnamed: 0,Name,Email,Age,Contact Number
0,##e##a##i##a###,user_1@pseudo.com,70s,6632586166
1,A##o#io##e#e#,user_2@pseudo.com,10s,6349182758
2,#e#####o##oe,user_3@pseudo.com,30s,8587346470
3,#ea##e####o##,user_4@pseudo.com,60s,1769272462
4,A##e####e##o#,user_5@pseudo.com,60s,9581163153
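
Since the noise comes from `random`, rerunning the cell will produce different numbers from the ones shown above. If reproducible output is wanted, one option is to pass in a seeded generator. The sketch below assumes 10-digit numbers, as in the dataset; `replace_first_five` and the seed value are hypothetical.

```python
import random

def replace_first_five(contact_number, rng):
    """Replace the first five digits with random ones; keep the last five."""
    digits = str(contact_number)
    noise = rng.randint(10000, 99999)  # always a five-digit number
    return f"{noise}{digits[-5:]}"

rng = random.Random(42)  # seed chosen arbitrarily for this sketch
masked = replace_first_five(9898586166, rng)
print(masked[-5:])  # 86166
```

With pandas, this could be applied as `df['Contact Number'].apply(lambda n: replace_first_five(n, rng))`, reusing one generator so each row gets different noise.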
