# Photography Business Analysis: PII Anonymization
## Author: Oliverius, Miranda

## Table of Contents

* [Overview](#overview)
* [Data Source](#source)
* [Preliminaries](#preliminaries)
* [Client ID Mapping](#mapping)
* [Data Anonymization](#anonymization)

## Overview <a class='anchor' id='overview'></a>

Before this project can be published, Personally Identifiable Information (PII) must be anonymized. The original source data will not be published on public platforms, but the code below shows the process used for anonymization. Once this process is complete, the anonymized data will be included with the other dataset files.

## Data Source <a class='anchor' id='source'></a>

Two data sources are anonymized, using Python for data manipulation:
1. The first data source is the business' project site report, which contains site IDs, address information, geographical coordinates, and client names for each property that has been photographed.
    - This data will be anonymized by using the pandas Series.map() method to replace client names with their ID numbers.
2. The second data source is the business' client list, which contains client ID numbers, names, phone numbers, email addresses, and the dates clients signed up to use the client portal.
    - This data will be anonymized using the Faker package to generate fake names, phone numbers, and email addresses.

## Preliminaries <a class='anchor' id='preliminaries'></a>

Before the functions for anonymization and client ID mapping are created, the Python libraries are loaded and the source data is imported.

In [1]:
# load libraries
## *** DATA MANIPULATION ***
import numpy as np
import pandas as pd
## *** DATA ANONYMIZATION ***
from faker import Faker

In [2]:
# import data from csv files
clients = pd.read_csv('clients.csv')
project_sites = pd.read_csv('project_sites.csv')

## Client ID Mapping <a class='anchor' id='mapping'></a>

The pandas Series.map() method is used to replace client names with their client ID numbers. In the project site report, the client names are stored in the Agent Name column.

In [3]:
# create dictionary for mapping
client_id_dict = clients[['ID', 'Client Name']].set_index('Client Name')['ID'].to_dict()
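
One caveat with a name-keyed dictionary is that two clients who share a name collapse into a single key. The following is a minimal sketch of how that could be checked, using a hypothetical toy client list (`toy_clients` and its values are invented for illustration and are not taken from the source data):

```python
import pandas as pd

# Hypothetical toy client list; the column names mirror the ones used above.
toy_clients = pd.DataFrame({
    'ID': [101, 102, 103],
    'Client Name': ['Ada Park', 'Ben Ortiz', 'Ada Park'],
})

# Names that appear more than once would collide as dictionary keys.
is_duplicate = toy_clients['Client Name'].duplicated(keep=False)
duplicate_names = sorted(toy_clients.loc[is_duplicate, 'Client Name'].unique())

# Same construction as above: the later row (ID 103) overwrites the earlier one (ID 101).
toy_dict = toy_clients.set_index('Client Name')['ID'].to_dict()
```

If `duplicate_names` is empty for the real client list, the dictionary is a one-to-one lookup and no IDs are lost.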

In [4]:
# create function to map client IDs
def map_ids(df, str_original_col_name):
    # modifies df in place using client_id_dict; names missing from the dictionary become NaN
    df[str_original_col_name] = df[str_original_col_name].map(client_id_dict)

In [5]:
# apply the mapping function
map_ids(project_sites, 'Agent Name')
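
Because Series.map() returns NaN for any value that is not a key in the dictionary, a site whose client name is misspelled or absent from the client list silently loses its link to a client. The sketch below shows one way this might be detected, again on hypothetical toy data (`toy_dict`, `toy_sites`, and the Site ID values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical lookup and site report; 'Cy Vance' is deliberately missing from the lookup.
toy_dict = {'Ada Park': 101, 'Ben Ortiz': 102}
toy_sites = pd.DataFrame({
    'Site ID': [1, 2, 3],
    'Agent Name': ['Ben Ortiz', 'Ada Park', 'Cy Vance'],
})

# Map into a separate Series first so the original names are still available for inspection.
mapped = toy_sites['Agent Name'].map(toy_dict)
unmatched = toy_sites.loc[mapped.isna(), 'Agent Name'].unique().tolist()

# The nullable integer dtype keeps matched IDs as whole numbers despite the missing value.
toy_sites['Client ID'] = mapped.astype('Int64')
```

Without the cast, a single unmatched name turns the whole ID column into floats (for example 102.0), which then carries over into the saved csv file.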

In [6]:
# change column name
project_sites.rename(columns={'Agent Name': 'Client ID'}, inplace=True)

In [7]:
# save the mapped data to a new csv file
project_sites.to_csv('anonymized_project_sites.csv', index=False)

## Data Anonymization <a class='anchor' id='anonymization'></a>

Faker is used below to generate fake information to replace clients' PII.

In [8]:
# initialize Faker
fake = Faker()

In [9]:
# create function to anonymize PII
def anonymize_data(df):
    df = df.copy() # work on a copy so the DataFrame passed in is not overwritten
    # each row receives an independently generated fake value
    df['Client Name'] = df['Client Name'].apply(lambda x: fake.name())
    df['Phone'] = df['Phone'].apply(lambda x: fake.numerify('###-###-####')) # custom phone number format
    df['Email'] = df['Email'].apply(lambda x: fake.email())
    return df

In [10]:
# apply the anonymization function
clients_anonymized = anonymize_data(clients)

In [11]:
# save the anonymized data to a new csv file
clients_anonymized.to_csv('anonymized_clients.csv', index=False)
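
As a final sanity check, the anonymized columns can be compared against the real ones to confirm that no original value survives. The following is a minimal sketch on hypothetical toy data; `count_leaks` is an illustrative helper, not a pandas or Faker function, and using it on the real data requires an untouched copy of the original client list:

```python
import pandas as pd

def count_leaks(original, anonymized, columns):
    """Return, per column, how many anonymized values also occur in the original."""
    return {
        col: int(anonymized[col].isin(original[col]).sum())
        for col in columns
    }

# Hypothetical data: one name was (deliberately) left unchanged to show a leak.
original = pd.DataFrame({
    'Client Name': ['Ada Park', 'Ben Ortiz'],
    'Email': ['ada@example.com', 'ben@example.com'],
})
anonymized = pd.DataFrame({
    'Client Name': ['Jo Reyes', 'Ben Ortiz'],
    'Email': ['x1@example.org', 'x2@example.org'],
})

leaks = count_leaks(original, anonymized, ['Client Name', 'Email'])
```

A nonzero count does not always mean a real leak, since a random generator can occasionally produce a value that matches a real one, but it flags rows worth regenerating.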