# MSDS 7349 Anonymizing Data
Authors: Alex Frye, Michael Smith, Lindsay Vitovsky

## Introduction
The Z-Virus has broken out in remote parts of the US. Because of its incubation period, the virus was able to spread quickly before it was caught. The CDC, in conjunction with the WHO, has decided to release sensitive healthcare data in an effort to crowdsource a solution: determining which attributes identify people who are immune to the disease and people who are carriers, and how those attributes correlate with infection. With this data it may be possible to save the world by containing the disease before it spreads any further.

## Creating a Dataset - Michael
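The dataset is generated by `constructDataSet.py`, which is run below if `dataset.csv` is not already present. The following references were used for the demographic distributions: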

Demographic Data of Ethnicity and Age of USA: https://en.wikipedia.org/wiki/Demography_of_the_United_States  
Hair and Eye Color data: http://www.gnxp.com/blog/2008/12/nlsy-blogging-eye-and-hair-color-of.php

In [None]:
import os
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.mode.chained_assignment = None

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
%%time

if os.path.isfile("dataset.csv"):
    print("Found the File!")
else:
    %run constructDataSet.py


In [None]:
%%time
############################################################
# Load the Compiled Data from CSV
############################################################

# Create CSV Reader Function and assign column headers
def reader(f, columns):
    d = pd.read_csv(f)
    d.columns = columns
    return d


# Identify All CSV FileNames needing to be loaded
path = r''
all_files = glob.glob(os.path.join(path, "dataset.csv"))

# Define File Columns
columns = ["ID","LastName", "FirstName", "MiddleName", "Sex", "Age", "Ethnicity", "Hispanic_Latino", "BloodType", "HairColor", "EyeColor", "StreetAddress", "City", "State", "Zip", "PhoneNumber", "SocialSecurityNumber", "ZVirus"]

# Load Data
Data = pd.concat([reader(f, columns) for f in all_files])

In [None]:
display(Data.head())

## Anonymizing Data

#### Personally Identifiable Information - Michael

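Last Name, First Name, Middle Name, Street Address, Phone Number and Social Security Number are all personally identifiable information (PII). Removing them is the first step in anonymizing the data. City and Zip are dropped in the same step, which leaves State as the only geographic attribute.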

In [None]:
#Output DataAnon as pd.frame

DataAnon = Data[["ID","Sex", "Age", "Ethnicity", "Hispanic_Latino", "BloodType", "HairColor", "EyeColor", "State", "ZVirus"]]
display(DataAnon.head())

#### K-Anonymization - Alex

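Age, Ethnicity, Blood Type, City, State and Zip could all be examined for K-Anonymization, removing those values that have a low repetition count. City and Zip were already dropped together with the PII, so the remaining attributes are examined one at a time below.

We set the K-Anonymization threshold to 5% of the original sample size (500 records).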
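Each attribute below goes through the same steps: count the records per class, then keep only the classes whose count exceeds the threshold. A minimal sketch of that pattern as a single helper, assuming a hypothetical function name `drop_rare_classes` and a toy frame (neither is part of this notebook):

```python
import pandas as pd


def drop_rare_classes(df, column, threshold):
    """Keep only rows whose value in `column` occurs more than `threshold` times."""
    # Map every row to the size of the class it belongs to.
    class_sizes = df[column].map(df[column].value_counts())
    return df[class_sizes > threshold]


# Toy data (hypothetical): "AB-" appears once, so it does not exceed a threshold of 1.
toy = pd.DataFrame({"BloodType": ["O+", "O+", "A+", "A+", "A+", "AB-"]})
filtered = drop_rare_classes(toy, "BloodType", threshold=1)
print(filtered["BloodType"].value_counts().to_dict())
```

This avoids the temporary `count` column that a merge-based approach needs, but it is only one way to express the same filter.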

In [None]:
    # Define Threshold Value as 5% of Original Sample Size
KAThres = round(len(Data) * .05, 0)
print(KAThres)

#### Age
Ages are grouped into classes so that no individual age can be used to single out a record.
We then identify the counts for each class.
The `>=81` class does not meet the threshold, so its records are removed from the dataset.
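The age classes described above can also be produced with `pd.cut`. This is a sketch on a toy series, assuming ages are whole numbers; the bin edges are chosen to reproduce the five classes used in this report.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical) that sit on both sides of every class boundary.
ages = pd.Series([5, 20, 21, 40, 41, 60, 61, 80, 81, 99])

# Right-closed bins: (-inf, 20], (20, 40], (40, 60], (60, 80], (80, inf)
age_class = pd.cut(
    ages,
    bins=[-np.inf, 20, 40, 60, 80, np.inf],
    labels=["<=20", "21-40", "41-60", "61-80", ">=81"],
).astype(str)

print(age_class.tolist())
```

With fractional ages the bins behave differently at the boundaries (20.5 would land in `21-40`), so the whole-number assumption matters.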

In [None]:
%%time
%matplotlib inline


# Bin ages into 20-year classes; anything not matched falls into ">=81"
age = DataAnon["Age"]
DataAnon["AgeClass"] = np.select(
    [age <= 20,
     (age >= 21) & (age <= 40),
     (age >= 41) & (age <= 60),
     (age >= 61) & (age <= 80)],
    ["<=20", "21-40", "41-60", "61-80"],
    default=">=81",
)

    #Agg Classes
AgeAgg = pd.DataFrame({'count' : DataAnon.groupby(["AgeClass"]).size()}).reset_index()

    # display class counts
display(AgeAgg)
    
    # Pie class Distribution
AgeAgg.plot.pie(y = 'count', labels = AgeAgg["AgeClass"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, AgeAgg, on = "AgeClass")

    # Keep only records whose class count exceeds the Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['Age', 'count'], axis = 1)

print("After removing records of Age Groups with a group population less than {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
AgeAgg = pd.DataFrame({'count' : DataAnon.groupby(["AgeClass"]).size()}).reset_index()

    # display class counts
display(AgeAgg)

del AgeAgg

#### Ethnicity
Identify counts for each class.

We remove the following ethnicities, which do not meet the threshold:

* American Indian or Alaskan Native
* Asian American
* Native Hawaiian or Other Pacific Islander

Open question: should these be recoded as "Other" instead, so that the records can stay?
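One way to explore the "Other" idea is to generalize rare values instead of dropping their records. The sketch below is not what this notebook does; `generalize_rare`, the group names "Group A" and "Group B", and the toy counts are all hypothetical.

```python
import pandas as pd


def generalize_rare(series, threshold, other_label="Other"):
    """Replace values that occur `threshold` times or fewer with `other_label`."""
    counts = series.map(series.value_counts())
    # Keep values from large classes, relabel everything else.
    return series.where(counts > threshold, other_label)


toy = pd.Series(
    ["Group A"] * 6
    + ["Group B"] * 4
    + ["Asian American"] * 2
    + ["American Indian or Alaskan Native"] * 2
)
generalized = generalize_rare(toy, threshold=3)
print(generalized.value_counts().to_dict())
```

The combined "Other" class still has to exceed the threshold on its own; if it does not, its records would need to be removed after all.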

In [None]:
%%time
%matplotlib inline


    #Agg Classes
EthAgg = pd.DataFrame({'count' : DataAnon.groupby(["Ethnicity"]).size()}).reset_index()

    # display class counts
display(EthAgg)
    
    # Pie class Distribution
EthAgg.plot.pie(y = 'count', labels = EthAgg["Ethnicity"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, EthAgg, on = "Ethnicity")

    # Keep only records whose class count exceeds the Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of ethnicities with a group population less than {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
EthAgg = pd.DataFrame({'count' : DataAnon.groupby(["Ethnicity"]).size()}).reset_index()

    # display class counts
display(EthAgg)

del EthAgg

#### Blood Type
Identify counts for each class.

We remove Blood Types: {AB+, AB-, B-}

In [None]:
%%time
%matplotlib inline


    #Agg Classes
BTAgg = pd.DataFrame({'count' : DataAnon.groupby(["BloodType"]).size()}).reset_index()

    # display class counts
display(BTAgg)
    
    # Pie class Distribution
BTAgg.plot.pie(y = 'count', labels = BTAgg["BloodType"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, BTAgg, on = "BloodType")

    # Keep only records whose class count exceeds the Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of Blood Types with a group population less than {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
BTAgg = pd.DataFrame({'count' : DataAnon.groupby(["BloodType"]).size()}).reset_index()

    # display class counts
display(BTAgg)

del BTAgg

#### State
Identify counts for each class.

We keep all states because our sample is distributed nearly evenly across them. Once we apply perturbation to the data, we expect this attribute to be extremely difficult to identify.

In [None]:
%%time
%matplotlib inline

    #Agg Classes
SAgg = pd.DataFrame({'count' : DataAnon.groupby(["State"]).size()}).reset_index()

    # display class counts
display(SAgg)
    
    # Pie class Distribution
SAgg.plot.pie(y = 'count', labels = SAgg["State"], autopct='%1.1f%%', figsize=(12,6))


#### Hair Color
Identify counts for each class.

We remove Hair Colors: {Grey, Light Blond, Red}

In [None]:
%%time
%matplotlib inline


    #Agg Classes
HairAgg = pd.DataFrame({'count' : DataAnon.groupby(["HairColor"]).size()}).reset_index()

    # display class counts
display(HairAgg)
    
    # Pie class Distribution
HairAgg.plot.pie(y = 'count', labels = HairAgg["HairColor"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, HairAgg, on = "HairColor")

    # Keep only records whose class count exceeds the Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of Hair Color with a group population less than {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
HairAgg = pd.DataFrame({'count' : DataAnon.groupby(["HairColor"]).size()}).reset_index()

    # display class counts
display(HairAgg)

del HairAgg

#### Eye Color
Identify counts for each class.

We remove Eye Colors: {Black, Grey, Light Blue, Light Brown, Other}

In [None]:
%%time
%matplotlib inline


    #Agg Classes
EyeAgg = pd.DataFrame({'count' : DataAnon.groupby(["EyeColor"]).size()}).reset_index()

    # display class counts
display(EyeAgg)
    
    # Pie class Distribution
EyeAgg.plot.pie(y = 'count', labels = EyeAgg["EyeColor"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, EyeAgg, on = "EyeColor")

    # Keep only records whose class count exceeds the Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of Eye Color with a group population less than {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
EyeAgg = pd.DataFrame({'count' : DataAnon.groupby(["EyeColor"]).size()}).reset_index()

    # display class counts
display(EyeAgg)

del EyeAgg

#### Hispanic/Latino Population

In [None]:
%%time
%matplotlib inline


    #Agg Classes
HisAgg = pd.DataFrame({'count' : DataAnon.groupby(["Hispanic_Latino"]).size()}).reset_index()

    # display class counts
display(HisAgg)
    
    # Pie class Distribution
HisAgg.plot.pie(y = 'count', labels = HisAgg["Hispanic_Latino"], autopct='%1.1f%%', figsize=(12,6))

#### Stratification of the Response Variable or Classifier - Alex?

Drawing an equal number of records from each class changes the class distribution, so it no longer matches what could be predicted from available population data.

We want to create a stratified sample based on ZVirus type. Counting the records remaining in each category, we find that the least frequent ZVirus type is "Infected", with a frequency of 1528. To stratify evenly, our sample size therefore cannot be larger than 1528 * 3 = 4584.

In [None]:
%%time
%matplotlib inline

ZVirusDist = pd.DataFrame({'count' : DataAnon.groupby(["ZVirus"]).size()}).reset_index()
display(ZVirusDist)

ZVirusDist.plot.pie(y = 'count', labels = ZVirusDist['ZVirus'], autopct='%1.1f%%')

del ZVirusDist

To randomly sample observations from each class and keep a fairly round sample size, we have chosen a stratified sample size of 3750. This sample is split evenly three ways (1250 records each) across the ZVirus classes.

We compute the sample size for each ZVirus type and then take a random sample within each group. Below you will see that our sampled distribution matches the chosen 33/33/33 split across ZVirus types.

*Note:* The seed value for each random sample is set equal to the sample size of that type in order to ensure reproducibility for this report.
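The size cap and the per-class sampling can be sketched in a few lines with `groupby(...).sample`, which is available in pandas 1.1 and later. The toy frame, the `per_class` value and the seed of 42 are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical) with an uneven class distribution.
toy = pd.DataFrame({
    "ZVirus": ["Carrier"] * 50 + ["Immune"] * 30 + ["Infected"] * 20,
    "value": range(100),
})

# The rarest class caps the size of an evenly stratified sample.
class_counts = toy["ZVirus"].value_counts()
max_balanced_size = int(class_counts.min() * class_counts.size)

per_class = 15  # hypothetical choice; must not exceed class_counts.min()
balanced = toy.groupby("ZVirus").sample(n=per_class, replace=False, random_state=42)

print(max_balanced_size)
print(balanced["ZVirus"].value_counts().to_dict())
```

A single `random_state` for the whole call is a simplification; separate seeds per class work just as well for reproducibility.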

In [None]:
%%time
SampleSize = 3750

CarrierSample_Seed   = int(round(SampleSize * 33.3333 / 100.0,0))
ImmuneSample_Seed    = int(round(SampleSize * 33.3333 / 100.0,0))
InfectedSample_Seed  = int(round(SampleSize * 33.3333 / 100.0,0))


CarrierDataSampled  = DataAnon[DataAnon["ZVirus"] == 'Carrier'].sample(n=CarrierSample_Seed, replace = False, random_state = CarrierSample_Seed)
ImmuneDataSampled   = DataAnon[DataAnon["ZVirus"] == 'Immune'].sample(n=ImmuneSample_Seed, replace = False, random_state = ImmuneSample_Seed)
InfectedDataSampled = DataAnon[DataAnon["ZVirus"] == 'Infected'].sample(n=InfectedSample_Seed, replace = False, random_state = InfectedSample_Seed)


DataAnon = pd.concat([CarrierDataSampled,ImmuneDataSampled,InfectedDataSampled])

print(len(DataAnon))

ZVirusDist = pd.DataFrame({'count' : DataAnon.groupby(["ZVirus"]).size()}).reset_index()
display(ZVirusDist)

ZVirusDist.plot.pie(y = 'count', labels = ZVirusDist['ZVirus'], autopct='%1.1f%%')

del ZVirusDist

#### Perturbation - Michael

Here we translate identifiable feature values into nonsense strings that maintain the original distributions.

In [None]:
#Saving off a copy of the data set after our KA (.copy() so the hashing below does not alter it)
DataKAnon = DataAnon.copy()

import hashlib

salt = "nintendo"

def hashbrowns(x):
    tohash = (salt + str(x)).encode('utf-8')
    return hashlib.sha256(tohash).hexdigest()

columns_to_hash = ["Sex","Ethnicity","Hispanic_Latino","BloodType","HairColor","EyeColor","State","AgeClass","ZVirus"]
for feature in columns_to_hash:
    DataAnon[feature] = DataAnon[feature].apply(hashbrowns)

display(DataAnon.head())

## Deanonymizing Data

#### Is Stratification and K-Anonymization after removing PII sufficient?

* Compare anonymized data with original data set
* Look at knowable demographic distributions
* Are there outliers that K-Anonymization missed?
* Do combinations of columns/features result in identifiable or recognizable observations?
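The last question in the list can be checked directly: group by several attributes at once and look at the smallest group. This is a sketch on toy data; `smallest_group` is a hypothetical helper name, and the column values are made up.

```python
import pandas as pd


def smallest_group(df, quasi_identifiers):
    """Return the size of the rarest observed combination of the given columns."""
    return int(df.groupby(quasi_identifiers).size().min())


toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "F", "M"],
    "AgeClass": ["21-40", "21-40", "21-40", "41-60", "41-60", "41-60"],
})

print(smallest_group(toy, ["Sex"]))
print(smallest_group(toy, ["AgeClass"]))
print(smallest_group(toy, ["Sex", "AgeClass"]))
```

In the toy data each column looks safe on its own, yet the combination of the two contains a group of size one, which is exactly the kind of record a per-column threshold can miss.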

#### Perturbation

* Computers don't need to know the semantic meaning of a value to understand its distribution
    * Linear vs Categorical Features
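Because a hash is deterministic, a hashed categorical column keeps exactly the same value counts as the original. A rough sketch of how that can be exploited, assuming the attacker has published frequencies for the attribute; the salt, the toy column and the frequency figures below are invented for illustration.

```python
import hashlib

import pandas as pd


def toy_hash(value, salt="hypothetical-salt"):
    """Salted SHA-256, standing in for whatever hash the publisher used."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Toy column with a skewed distribution.
plain = pd.Series(["O+"] * 5 + ["A+"] * 3 + ["B+"] * 2)
hashed = plain.apply(toy_hash)

# The attacker sees only `hashed`, plus assumed public frequencies.
known_frequencies = pd.Series({"O+": 0.50, "A+": 0.30, "B+": 0.20})

# Match the most common hash to the most common known value, and so on.
observed_rank = hashed.value_counts().index
expected_rank = known_frequencies.sort_values(ascending=False).index
guess = dict(zip(observed_rank, expected_rank))

decoded = hashed.map(guess)
print((decoded == plain).all())
```

Rank matching only works when the class frequencies are clearly separated; near-even distributions leave the ordering ambiguous.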

#### Cracking the Hash

>A note on this process: We used a weak hashing method on purpose to illustrate a possible method of attack. A cryptographically secure method would not use a dictionary based salt, nor would that salt be static and independent of both the generator of the dataset and the recipient of the downloaded dataset.
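For contrast with the deliberately weak method, here is a sketch of a keyed alternative using HMAC-SHA256 with a random secret from the standard library. The name `keyed_pseudonym` is hypothetical, and this is one possible hardening under the assumption that the publisher keeps the key private, not a complete anonymization scheme.

```python
import hashlib
import hmac
import secrets

# A random 32-byte key, generated once and kept private by the data publisher.
secret_key = secrets.token_bytes(32)


def keyed_pseudonym(value, key=secret_key):
    """Return an HMAC-SHA256 pseudonym for `value` under `key`."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


print(keyed_pseudonym("Carrier") == keyed_pseudonym("Carrier"))
print(keyed_pseudonym("Carrier") == keyed_pseudonym("Immune"))
```

A random key cannot be found in a password dictionary, but the mapping is still deterministic within one release, so the value distributions remain visible.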

Essentially, we are attempting to reproduce the published hashes by guessing their inputs, using what we know about the data. For example, both the Hispanic_Latino and the Sex columns appear to be binary, and since the same hash values appear in both columns, the underlying values in one column are likely the same as those in the other. This makes values like Y or N, or Yes and No, unlikely, and leaves us with context-neutral classifier values such as 0 or 1. If the columns did not share hash values, it would be simpler to begin with the Sex column, as its likely values fall in a much narrower set: 0 or 1, Male or Female, m or f, and other variations.
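The observation about shared hash values can be automated by intersecting the sets of unique hashes for every pair of columns. The sketch below runs on a toy frame with an invented salt and an assumed 0/1 encoding; it only illustrates the check.

```python
import hashlib
from itertools import combinations

import pandas as pd


def toy_hash(value, salt="hypothetical-salt"):
    """Salted SHA-256 of the string form of `value`."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


# Toy data (hypothetical), hashed column by column with the same salt.
toy = pd.DataFrame({
    "Sex": [0, 1, 1, 0],
    "Hispanic_Latino": [1, 0, 0, 0],
    "ZVirus": ["Carrier", "Immune", "Infected", "Immune"],
}).apply(lambda col: col.map(toy_hash))

# Columns that share hash values must share the underlying plaintext values.
shared = {
    (a, b): set(toy[a]) & set(toy[b])
    for a, b in combinations(toy.columns, 2)
}
for pair, hashes in shared.items():
    print(pair, len(hashes))
```

Any pair with a non-empty intersection narrows the search, since a single correct guess decodes both columns at once.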

The first step is to paste the given hashes into a Google search. This tells us whether the hashes have already been calculated and, if so, what the underlying value is. A quick search of "ddebf4bb08617e33fac3c0e43ea5c3f63df912887f984d80d61d4b685a036dc3" turned up zero results.

In [None]:
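#%%time
#from itertools import product
#import string
#
#chars = string.ascii_lowercase # chars to look for
#
## Brute force over lengths 5 through 9; left commented out because it is far too slow to run here
#for length in range(5, 10):
#    to_attempt = product(chars, repeat=length)
#    for attempt in to_attempt:
#        # product() yields tuples of characters, so join them before comparing
#        if ("".join(attempt) == "nintendo"):
#            print("Found It.")
#            break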

In [None]:
%%time
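# Load the candidate salts, one common password per line
with open("10k_most_common.txt", "r") as f:
    passwords = f.readlines()

possibleValues = ['0','1']
# Positional access: after filtering and sampling, index label 0 may no longer exist
valueTest = DataAnon["Hispanic_Latino"].iloc[0]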

for password in passwords:
    password = password.strip('\n')
    for value in possibleValues:
        saltBefore = (password + str(value)).encode('utf-8')
        saltAfter = (str(value) + password).encode('utf-8')
        
        if (hashlib.sha256(saltBefore).hexdigest() == valueTest or hashlib.sha256(saltAfter).hexdigest() == valueTest):
            print("Password is: " + password)
        

With the password (the salt) and hash method now known, we can begin decoding the rest of the data set. Hash functions are normally one-way, but as long as we know some basic meta information about the dataset, it becomes possible to decode other attributes, because it is much easier to guess the values of classifiers than it is to guess the salt for a hash.

In the case of ZVirus classification, we know that the CDC has been labeling their statistics and graphs with Carrier, Immune and Infected. It is likely that these three values were used in this dataset, especially as the ZVirus column contains three unique classifiers. Of course, it's possible they used non-contextual classifiers such as 0, 1, 2, or partially contextual classifiers like C, Im, In. As always, start with the obvious and work your way down.
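Once the salt is known, a whole column can be decoded in one pass by precomputing a hash-to-value lookup table from the candidate list. This is a sketch on a toy column, assuming the salt is prepended to the value; `salted_sha256` and the "Zombie" value are hypothetical.

```python
import hashlib

import pandas as pd


def salted_sha256(value, salt):
    """SHA-256 of the salt followed by the string form of `value`."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


salt = "nintendo"  # recovered by the dictionary attack
candidates = ["Carrier", "Immune", "Infected", "0", "1", "2"]

# Precompute hash -> plaintext once, then decode the column with map().
lookup = {salted_sha256(v, salt): v for v in candidates}

# Toy hashed column; "Zombie" is not among the candidates on purpose.
hashed_column = pd.Series(
    [salted_sha256(v, salt) for v in ["Immune", "Carrier", "Immune", "Zombie"]]
)
decoded = hashed_column.map(lookup).fillna("Unknown")
print(decoded.tolist())
```

Values outside the candidate list come back as "Unknown", which signals that the guess list needs to be extended.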

In [None]:
def hashdebrown(hashvalue,possibleValues):
    password = "nintendo" #taken from earlier
    for value in possibleValues:
        saltBefore = (password + str(value)).encode('utf-8')
        saltAfter = (str(value) + password).encode('utf-8')
        
        #sha256, again, taken from above
        if (hashlib.sha256(saltBefore).hexdigest() == hashvalue or hashlib.sha256(saltAfter).hexdigest() == hashvalue):
            return [hashvalue,value]
    return [hashvalue,"Unknown"]

#According to the CDC, they've been classifying people as Carrier, Immune or Infected
ZVirusValues = ["Carrier","Immune","Infected","0","1","2","C","Im","In"]
for hashvalue in DataAnon['ZVirus'].unique():
    print(str(hashdebrown(hashvalue,ZVirusValues)))