# APE: a cookbook

This notebook is a playground for Anonymisation, Pseudonymisation and Encryption (APE) techniques.

It demonstrates the basics of several techniques, relying on software available in Python libraries.

## TOC:
* [The source file](#source-file)
* [Removing a column](#removing-a-column)
* [Tokenization](#tokenization)


date: 2021-03

inspiration: 
* https://korniichuk.medium.com/gdpr-guide-2-7c399b44ba3#adce
* https://developer.ibm.com/solutions/security/articles/s-gdpr3/
* https://www.anonos.com/enisa-chapter-1-introduction

## The source file<a class="anchor" id="source-file"></a>

A CSV file (`ape_0.csv`) is taken as the starting point. The mock data in this file were generated with https://www.mockaroo.com/

This file contains the columns:
* id	
* first_name	
* last_name	
* email	
* gender	
* ip_address	
* city	
* country	
* age
* income
* score_1	
* score_2

The data are imported as a pandas dataframe.

In [None]:
import pandas as pd

working_source = 'https://raw.githubusercontent.com/franklbvp/didactic-dollop/main/ape/ape_0.csv'

# reading CSV input
df = pd.read_csv(working_source)
df

In [None]:
type(df)

In [None]:
type(df.email)

### Removing a column<a class="anchor" id="removing-a-column"></a>

If data are not necessary for the research purpose, just remove them. Stick to the data minimisation rule!

Drop the unnecessary column(s). In this example the column `email` is dropped.

In [None]:
df.drop(columns=["email"], inplace=True)
df
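A column that is never needed does not have to be loaded in the first place. The sketch below assumes a small in-memory sample (hypothetical rows, with a subset of the columns of `ape_0.csv`) and uses the `usecols` parameter of `pd.read_csv` to parse only the columns that are required.

```python
import io
import pandas as pd

# Hypothetical two-row sample with a subset of the columns of ape_0.csv.
sample_csv = io.StringIO(
    "id,first_name,email,age\n"
    "1,Ann,ann@example.com,34\n"
    "2,Bob,bob@example.com,17\n"
)

# Only the listed columns are parsed; `email` never enters the dataframe.
needed_columns = ["id", "first_name", "age"]
df_minimal = pd.read_csv(sample_csv, usecols=needed_columns)
print(df_minimal.columns.tolist())
```

The same call should work with the URL in `working_source` instead of the in-memory sample.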

#### Saving a dataframe

* The dataframe containing only the necessary information can be saved back into a .csv file.  
* Delete the original file and continue with the file containing the minimal info.

In [None]:
df.to_csv('ape_0_drop.csv', index=False)  # index=False: do not write the row index as an extra column

### Tokenization<a class="anchor" id="tokenization"></a>

A token is a pseudonym that is used instead of the original data.

The `uuid` module is used in this case. It is part of the Python standard library and generates Universally Unique Identifiers (UUIDs); `uuid.uuid4()` returns a random one.
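As a quick illustration of what such a token looks like, the following minimal sketch creates one random UUID and converts it to text and back. The variable names are only illustrative.

```python
import uuid

token = uuid.uuid4()              # random (version 4) UUID
print(type(token))                # a uuid.UUID object, not a string
print(token.version)              # 4

token_text = str(token)           # 36-character text form, as it ends up in a CSV file
restored = uuid.UUID(token_text)  # parse the text form back into a UUID object
print(restored == token)          # True
```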

In [None]:
import pandas as pd
import uuid

def email_tokenization(email):
    if email not in key:
        token = uuid.uuid4()
        while token in key.values():  # the token must be unique
            token = uuid.uuid4()
        key[email] = token
        return token
    else:
        return key[email]

# reading CSV input
df = pd.read_csv(working_source)

key = {}
df.email = df.email.map(email_tokenization)  # original email address is overwritten
df

In [None]:
type(key)

In [None]:
df.to_csv('ape_0_token.csv', index=False)  # index=False: do not write the row index as an extra column

The code can be tweaked a little to create a lookup table.
This lookup table will enable returning back to the original data. Keep this lookup table safe and separately stored from the tokenized data.

The original data are imported

In [None]:
import pandas as pd
import uuid

# reading CSV input
df = pd.read_csv(working_source)

def email_tokenization(email):
    if email not in key:
        token = uuid.uuid4()
        while token in key.values():  # the token must be unique
            token = uuid.uuid4()
        key[email] = token
        return token
    else:
        return key[email]



key = {}
df['emailT'] = df.email.map(email_tokenization)  # a new column is created with the tokens

# create the lookup table and save it
df_lookup = df[['email', 'emailT']]
df_lookup.to_csv('lookup_0_email_token.csv', index=False)

# remove the original email column
df.drop(columns=['email'], inplace=True)

df_lookup
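Returning to the original data with the lookup table can be sketched as a merge on the token column. The two small dataframes below are hypothetical stand-ins for the tokenized data and the lookup table, and the token values are shortened placeholders rather than real UUIDs.

```python
import pandas as pd

# Hypothetical miniature versions of the tokenized data and the lookup table.
df_tokenized = pd.DataFrame(
    {"id": [1, 2], "emailT": ["tok-a", "tok-b"], "age": [34, 17]}
)
df_lookup_demo = pd.DataFrame(
    {"email": ["ann@example.com", "bob@example.com"], "emailT": ["tok-a", "tok-b"]}
)

# Drop duplicate pairs so that the merge cannot multiply rows.
df_lookup_demo = df_lookup_demo.drop_duplicates()

# validate="many_to_one" raises an error if one token maps to several emails.
df_restored = df_tokenized.merge(
    df_lookup_demo, on="emailT", how="left", validate="many_to_one"
)
print(df_restored)
```

Tokens read back from a CSV file are strings, while the tokens in memory are `uuid.UUID` objects. Both tables should therefore be in the same form (for example both loaded from file) before merging.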

In [None]:
whos

### Generalization

A simple example of generalizing numerical data: each value is compared with a threshold and replaced by a category label.

In [None]:
import pandas as pd

def age_generalization(age):
    # the labels match the comparison: an age of exactly 18 falls in 'Age >= 18'
    if age < 18:
        return 'Age < 18'
    else:
        return 'Age >= 18'
    
def income_generalization(income):
    if income < 24000:
        return 'Below Average'
    else:
        return 'Above Average'

# reading CSV input
df = pd.read_csv(working_source)


df.age = df.age.map(age_generalization)
df.income = df.income.map(income_generalization)
df
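When more than two categories are wanted, `pd.cut` can assign the bands in one call. This is a minimal sketch; the band edges and labels are assumptions chosen for illustration and are not taken from the data set.

```python
import pandas as pd

ages = pd.Series([17, 18, 34, 35, 70])

# Hypothetical age bands. right=False makes each interval closed on the left,
# so 18 falls in "18-34" and not in "<18".
bins = [0, 18, 35, 50, 65, 150]
labels = ["<18", "18-34", "35-49", "50-64", "65+"]

age_bands = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_bands.tolist())
```

Wider bands hide more detail but make individuals harder to single out; the right trade-off depends on the research purpose.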

### Fake names

In [None]:
import pandas as pd
from faker import Faker

def email_pseudonymization(email):
    if email not in key:
        pseudonym = fake.email()
        while (pseudonym in key.values()) or (pseudonym in key):
            pseudonym = fake.email()
        key[email] = pseudonym
        return pseudonym
    else:
        return key[email]

# reading CSV input
df = pd.read_csv(working_source)

key = {}
fake = Faker()
df.email = df.email.map(email_pseudonymization)
df

### Hashing


In [None]:
import pandas as pd
import hashlib

def email_hashing(email):
    if email not in key:
        sha3 = hashlib.sha3_512()
        data = salt + email
        sha3.update(data.encode('utf-8'))
        hexdigest = sha3.hexdigest()
        key[email] = hexdigest
        return hexdigest
    else:
        return key[email]

# reading CSV input
df = pd.read_csv(working_source)

salt = 'medium'
key = {}
df.email = df.email.map(email_hashing)
df
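The role of the salt can be made visible with a small experiment. The sketch below assumes a randomly generated salt (via the standard-library `secrets` module) instead of the fixed string used above. Anyone who knows or guesses the salt can hash candidate email addresses and compare the results, so the salt is best treated as a secret. The helper name `hash_email` and the address are hypothetical.

```python
import hashlib
import secrets

def hash_email(email: str, salt: str) -> str:
    """Return the SHA3-512 hex digest of salt + email."""
    return hashlib.sha3_512((salt + email).encode("utf-8")).hexdigest()

# Assumption: random salts replace the fixed string used above.
salt_a = secrets.token_hex(16)
salt_b = secrets.token_hex(16)

email = "ann@example.com"  # hypothetical address

same_salt_matches = hash_email(email, salt_a) == hash_email(email, salt_a)
other_salt_matches = hash_email(email, salt_a) == hash_email(email, salt_b)

print(same_salt_matches)                # True: same salt, same digest
print(other_salt_matches)               # False: a different salt changes the digest
print(len(hash_email(email, salt_a)))   # 128 hexadecimal characters
```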

## Encryption

source: https://www.geeksforgeeks.org/encrypt-and-decrypt-files-using-python/

The `cryptography` library is used to encrypt a file. Its `fernet` module implements symmetric encryption, which means that the same key is used to encrypt and to decrypt. The module has built-in functions to generate a key, and its `encrypt()` and `decrypt()` methods turn plain text into cipher text and back. Fernet guarantees that data encrypted with it cannot be read, or modified without detection, by anyone who does not have the key.


### Generating a key
A key is used to encrypt text. To obtain a strong key, let the software library generate it.

In [None]:
# import required module 
from cryptography.fernet import Fernet

In [None]:
# key generation 
key = Fernet.generate_key() 
  
# storing the key in a file
with open('filekey.key', 'wb') as filekey: 
   filekey.write(key)
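To see what a generated key looks like, the short sketch below inspects a separate, throwaway key (named `demo_key` here so that it does not overwrite `key`).

```python
import base64
from cryptography.fernet import Fernet

demo_key = Fernet.generate_key()

print(type(demo_key))                            # bytes
print(len(demo_key))                             # 44 characters of URL-safe base64
print(len(base64.urlsafe_b64decode(demo_key)))   # 32 bytes of key material
```

Anyone who obtains the key file can decrypt the data, so it should be stored separately from the encrypted files.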

### Encrypting a string

The methods work with bytes, so a string has to be encoded before encryption and decoded after decryption.

In [None]:
message = "My secret message".encode() # bytes
# initialize the Fernet class
f = Fernet(key)
# encrypt the message
encrypted = f.encrypt(message)
# print how it looks
print(encrypted)

In [None]:
decrypted_encrypted = f.decrypt(encrypted)
print(decrypted_encrypted)
original_message = decrypted_encrypted.decode() # bytes to string
print(original_message)
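The statement that encrypted data cannot be manipulated or read without the key can be checked with a small experiment. The sketch below uses a throwaway key, flips one bit of the encrypted token and also tries a wrong key. In both cases `decrypt()` raises `InvalidToken`. The helper `try_decrypt` and the `demo_` names are hypothetical.

```python
import base64
from cryptography.fernet import Fernet, InvalidToken

demo_f = Fernet(Fernet.generate_key())
demo_token = demo_f.encrypt(b"My secret message")

# Flip one bit in the last byte of the raw token (part of its signature).
raw = bytearray(base64.urlsafe_b64decode(demo_token))
raw[-1] ^= 0x01
tampered = base64.urlsafe_b64encode(bytes(raw))

def try_decrypt(fernet, data):
    """Return the plaintext, or None if the token is rejected."""
    try:
        return fernet.decrypt(data)
    except InvalidToken:
        return None

wrong_key_f = Fernet(Fernet.generate_key())

print(try_decrypt(demo_f, demo_token))       # b'My secret message'
print(try_decrypt(demo_f, tampered))         # None: the token was modified
print(try_decrypt(wrong_key_f, demo_token))  # None: wrong key
```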

### Encrypting a file

Encrypt the file using the generated key. The code assumes that a local copy of `ape_0.csv` is present in the working directory.

* Open the file that contains the key.
* Initialize the Fernet object and store it in the `fernet` variable.
* Read the original file.
* Encrypt the file contents and store the result in an object.
* Then write the encrypted data to a new file, `ape_0_encrypted.csv`.

In [None]:
# opening the key 
with open('filekey.key', 'rb') as filekey: 
    key = filekey.read() 
  
# using the generated key 
fernet = Fernet(key) 
  
# opening the original file to encrypt 
with open('ape_0.csv', 'rb') as file: 
    original = file.read() 
      
# encrypting the file 
encrypted = fernet.encrypt(original) 
  
# opening the file in write mode and  
# writing the encrypted data 
with open('ape_0_encrypted.csv', 'wb') as encrypted_file: 
    encrypted_file.write(encrypted) 

### Decrypting an encrypted file


In [None]:
# using the key 
fernet = Fernet(key) 
  
# opening the encrypted file 
with open('ape_0_encrypted.csv', 'rb') as enc_file: 
    encrypted = enc_file.read() 
  
# decrypting the file 
decrypted = fernet.decrypt(encrypted) 
  
# opening the file in write mode and 
# writing the decrypted data 
with open('ape_0_decrypt.csv', 'wb') as dec_file: 
    dec_file.write(decrypted) 
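A round trip is worth verifying before the original file is deleted. The following sketch wraps the two steps in hypothetical helper functions (`encrypt_file`, `decrypt_file`) and runs them on a small sample file in a temporary directory, then compares the decrypted bytes with the original.

```python
import tempfile
from pathlib import Path
from cryptography.fernet import Fernet

def encrypt_file(src: Path, dst: Path, key: bytes) -> None:
    """Encrypt the contents of src and write them to dst."""
    dst.write_bytes(Fernet(key).encrypt(src.read_bytes()))

def decrypt_file(src: Path, dst: Path, key: bytes) -> None:
    """Decrypt the contents of src and write them to dst."""
    dst.write_bytes(Fernet(key).decrypt(src.read_bytes()))

demo_key = Fernet.generate_key()

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    original_path = tmp_dir / "sample.csv"
    encrypted_path = tmp_dir / "sample_encrypted.csv"
    decrypted_path = tmp_dir / "sample_decrypted.csv"

    # Hypothetical sample content standing in for ape_0.csv.
    original_path.write_bytes(b"id,age\n1,34\n2,17\n")

    encrypt_file(original_path, encrypted_path, demo_key)
    decrypt_file(encrypted_path, decrypted_path, demo_key)

    ciphertext_differs = encrypted_path.read_bytes() != original_path.read_bytes()
    round_trip_ok = decrypted_path.read_bytes() == original_path.read_bytes()

print(ciphertext_differs)  # True
print(round_trip_ok)       # True
```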