# Data anonymization 

In this example, we define a function `anonymize_data()` that anonymizes personal information in a dataset. It replaces names with sequential pseudonyms (e.g., "Patient1", "Patient2"), masks every digit of each Social Security Number except the last four, and partially masks addresses by keeping only the street number.

Depending on the applicable privacy requirements and regulations, organizations may need additional anonymization techniques, such as encryption or tokenization, to further protect the data.
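As a rough illustration of the tokenization idea mentioned above, the sketch below replaces each Social Security Number with a keyed hash (HMAC-SHA256) from Python's standard library. The names `tokenize`, `SECRET_KEY`, and `records`, along with the 16-character token length, are assumptions made for this example and are not part of the original notebook. Keyed hashing is only one possible way to tokenize a value.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would be stored securely.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(value, key=SECRET_KEY, length=16):
    """Return a deterministic keyed token for a sensitive string."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate the hex digest to keep tokens short (an arbitrary choice here)
    return digest[:length]


# Hypothetical sample records; the first and third share the same SSN
records = [
    {"Name": "John Smith", "Social Security Number": "123-45-6789"},
    {"Name": "Jane Doe", "Social Security Number": "987-65-4321"},
    {"Name": "John Smith", "Social Security Number": "123-45-6789"},
]

# Build new records with the SSN replaced by its token; originals are untouched
tokenized_records = [
    {**record, "Social Security Number": tokenize(record["Social Security Number"])}
    for record in records
]

for record in tokenized_records:
    print(record)
```

Because the token is deterministic for a given key, the same Social Security Number always maps to the same token, so records can still be linked without exposing the original value. The same function could be applied to a pandas column with `Series.map(tokenize)`.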

In [None]:
import pandas as pd


In [1]:

# Sample dataset with personal information
data = {
    'Name': ['John Smith', 'Jane Doe', 'Alice Johnson'],
    'Address': ['123 Main St', '456 Elm St', '789 Oak St'],
    'Social Security Number': ['123-45-6789', '987-65-4321', '456-78-9123']
}

# Load dataset into a pandas DataFrame
df = pd.DataFrame(data)

# Define a function to anonymize personal information
def anonymize_data(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()

    # Anonymize names by replacing them with sequential pseudonyms
    df['Name'] = ['Patient' + str(i + 1) for i in range(len(df))]

    # Mask Social Security Numbers, keeping only the last four digits
    df['Social Security Number'] = df['Social Security Number'].apply(lambda x: '***-**-' + x[-4:])

    # Mask addresses, keeping only the street number
    df['Address'] = df['Address'].apply(lambda x: x.split()[0] + ' ***')

    return df

# Anonymize personal information
anonymized_df = anonymize_data(df)

# Display anonymized dataset
print("Anonymized Dataset:")
print(anonymized_df)


Anonymized Dataset:
       Name  Address Social Security Number
0  Patient1  123 ***            ***-**-6789
1  Patient2  456 ***            ***-**-4321
2  Patient3  789 ***            ***-**-9123
