# MSDS 7349 Anonymizing Data
Authors: Alex Frye, Michael Smith, Lindsay Vitovsky

## Introduction
The Z-Virus has broken out in remote parts of the US. Because of its incubation period, the virus was able to spread quickly before it was detected. The CDC, in conjunction with the WHO, has decided to release sensitive healthcare data in an effort to crowdsource a solution: determining which attributes identify those who are immune to the disease and those who are carriers, and how those attributes correlate with infection. With this data it may be possible to save the world by containing the disease before it spreads any further.

## Creating a Dataset - Michael
The dataset is generated by `constructDataSet.py`, which the notebook runs below when `dataset.csv` is not already present. References used:

Demographic Data of Ethnicity and Age of USA: https://en.wikipedia.org/wiki/Demography_of_the_United_States  
Hair and Eye Color data: http://www.gnxp.com/blog/2008/12/nlsy-blogging-eye-and-hair-color-of.php

In [None]:
import os
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.mode.chained_assignment = None

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
%%time

if os.path.isfile("dataset.csv"):
    print("Found the File!")
else:
    %run constructDataSet.py


In [None]:
%%time
############################################################
# Load the Compiled Data from CSV
############################################################

# Create CSV Reader Function and assign column headers
def reader(f, columns):
    d = pd.read_csv(f)
    d.columns = columns
    return d


# Identify All CSV FileNames needing to be loaded
path = r''
all_files = glob.glob(os.path.join(path, "dataset.csv"))

# Define File Columns
columns = ["ID","LastName", "FirstName", "MiddleName", "Sex", "Age", "Ethnicity", "Hispanic_Latino", "BloodType", "HairColor", "EyeColor", "StreetAddress", "City", "State", "Zip", "PhoneNumber", "SocialSecurityNumber", "ZVirus"]

# Load Data
Data = pd.concat([reader(f, columns) for f in all_files])

In [None]:
display(Data.head())

## Anonymizing Data

#### Personally Identifiable Information - Michael

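Last Name, First Name, Middle Name, Address, Phone Number, and Social Security Number are all PII. Removing them is the first step in anonymizing the data. City and Zip are dropped as well, leaving State as the only location attribute.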

In [None]:
#Output DataAnon as pd.frame

DataAnon = Data[["ID","Sex", "Age", "Ethnicity", "Hispanic_Latino", "BloodType", "HairColor", "EyeColor", "State", "ZVirus"]]
display(DataAnon.head())

#### K-Anonymization
K-anonymization aims to make sure that no individual stands out: every record should share its identifying attribute values with at least K-1 other records. In this notebook we approximate that goal by removing every record that belongs to a sparsely populated attribute class, so that loners are taken out of the data altogether. Attributes that can easily identify an individual, especially categorical ones, are the key targets for K-anonymization techniques.

Attributes Age, Ethnicity, Blood Type, State, Hair Color, Eye Color, and Hispanic_Latino are all great candidates for examination with K-Anonymization, removing those elements that have a low repetition count.

The chosen "K" depends on the sample size of the original dataset, and contextual knowledge of the domain. We have chosen to set our "K" threshold for K-Anonymization to 5% of the original sample size. 
    <br> &emsp;&emsp;&emsp; $10000 \times .05 = 500$
    
This means that, for every attribute, we compute the frequency of each class in the data remaining at that step, and we remove all observations belonging to any class with 500 or fewer observations.  
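The rule above can be sketched as a small reusable helper. This is a minimal illustration on toy data, not the notebook's own implementation: the function name `drop_rare_classes` and the toy frame are assumptions made for the example, and the cells below use a merge on the class counts rather than this helper.

```python
import pandas as pd

def drop_rare_classes(df, column, threshold):
    """Keep only rows whose value in `column` occurs more than `threshold` times.

    Hypothetical helper for illustration; the notebook achieves the same
    effect by merging class counts back onto the data.
    """
    # Map every row to the size of the class it belongs to
    class_sizes = df[column].map(df[column].value_counts())
    return df[class_sizes > threshold]

# Toy data: 6 x "O+", 3 x "A+", 1 x "AB-"
toy = pd.DataFrame({"BloodType": ["O+"] * 6 + ["A+"] * 3 + ["AB-"] * 1})

# With a threshold of 2, only the single "AB-" record is removed
filtered = drop_rare_classes(toy, "BloodType", threshold=2)
print(filtered["BloodType"].value_counts())
```

Because the comparison is strict (`>`), a class with exactly `threshold` members is removed too, which matches the "less than or equal to" wording printed by the cells in this notebook.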

In [None]:
    # Define Threshold Value as 5% of Original Sample Size
KAThres = round(len(Data) * .05, 0)
print(KAThres)

#### Age
Age has too wide a range of values for K-Anonymization to be applied to each individual age: doing so would remove every observation in our dataset, while ignoring the attribute altogether would severely compromise the security of all individuals in it. To mitigate this issue, we have chosen to split our Age attribute into 5 subcategories as defined below:
* <=20
* 21-40
* 41-60
* 61-80
* \>=81

Doing this allows us to continue sharing the Age attribute, without disregarding the security of our study participants. 
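As an aside, the same five age classes can be produced with `pandas.cut` instead of nested `np.where` calls. The sketch below is an alternative shown on a handful of made-up ages, assuming integer ages as in this dataset; the variable names are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages covering every class boundary
ages = pd.Series([5, 20, 21, 40, 41, 60, 61, 80, 81, 99], name="Age")

# Right-closed intervals: (-inf, 20], (20, 40], (40, 60], (60, 80], (80, inf)
bin_edges = [-np.inf, 20, 40, 60, 80, np.inf]
bin_labels = ["<=20", "21-40", "41-60", "61-80", ">=81"]

age_class = pd.cut(ages, bins=bin_edges, labels=bin_labels)
print(pd.DataFrame({"Age": ages, "AgeClass": age_class}))
```

The result is a categorical column, so the labels keep their natural order when grouped or plotted.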

In [None]:
%%time
%matplotlib inline


DataAnon["AgeClass"] = np.where(DataAnon["Age"] <= 20,                                                         "<=20",
                                np.where((DataAnon["Age"] >= 21) &  (DataAnon["Age"] <= 40),                   "21-40",
                                         np.where((DataAnon["Age"] >= 41) &  (DataAnon["Age"] <= 60),          "41-60",
                                                  np.where((DataAnon["Age"] >= 61) &  (DataAnon["Age"] <= 80), "61-80",
                                                                                                               ">=81"
                                                          )
                                                 )
                                        )
                               )

    #Agg Classes
AgeAgg = pd.DataFrame({'count' : DataAnon.groupby(["AgeClass"]).size()}).reset_index()

    # display class counts
display(AgeAgg)
    
    # Pie class Distribution
AgeAgg.plot.pie(y = 'count', labels = AgeAgg["AgeClass"], autopct='%1.1f%%', figsize=(12,6))


With our AgeClass frequencies identified, we notice that the ">=81" demographic is very sparse, containing only 364 individuals (3.6% of the dataset). These subjects would potentially be at risk of identification simply due to the sparsity of their occurrence.

We have chosen to remove all observations for individuals identified in the >=81 demographic from the dataset. After removing records of Age Groups with a group population less than or equal to 500.0, we are left with 9635 records.

In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, AgeAgg, on = "AgeClass")

    # Remove records whose class count is <= Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['Age', 'count'], axis = 1)

print("After removing records of Age Groups with a group population less than or equal to {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
AgeAgg = pd.DataFrame({'count' : DataAnon.groupby(["AgeClass"]).size()}).reset_index()

    # display class counts
display(AgeAgg)

del AgeAgg

#### Ethnicity
With our Ethnicity frequencies identified, we notice that we have three demographic categories that are very sparse. These subjects would potentially be at risk of identification simply due to the sparsity of their occurrence.

We have chosen to remove Ethnicities: 
* "American Indian or Alaskan Native"
* "Asian American"
* "Native Hawaiian or Other Pacific Islander"

After removing records of Ethnicities with a group population less than or equal to 500.0, we are left with 8940 records.

In [None]:
%%time
%matplotlib inline


    #Agg Classes
EthAgg = pd.DataFrame({'count' : DataAnon.groupby(["Ethnicity"]).size()}).reset_index()

    # display class counts
display(EthAgg)
    
    # Pie class Distribution
EthAgg.plot.pie(y = 'count', labels = EthAgg["Ethnicity"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, EthAgg, on = "Ethnicity")

    # Remove records whose class count is <= Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of ethnicities with a group population less than or equal to {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
EthAgg = pd.DataFrame({'count' : DataAnon.groupby(["Ethnicity"]).size()}).reset_index()

    # display class counts
display(EthAgg)

del EthAgg

#### Blood Type
With our Blood Type frequencies identified, we notice that we have three categories that are very sparse. These subjects would potentially be at risk of identification simply due to the sparsity of their occurrence.

We have chosen to remove Blood Types: 
* AB+
* AB-
* B-

After removing records of Blood Types with a group population less than or equal to 500.0, we are left with 8497 records.

In [None]:
%%time
%matplotlib inline


    #Agg Classes
BTAgg = pd.DataFrame({'count' : DataAnon.groupby(["BloodType"]).size()}).reset_index()

    # display class counts
display(BTAgg)
    
    # Pie class Distribution
BTAgg.plot.pie(y = 'count', labels = BTAgg["BloodType"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, BTAgg, on = "BloodType")

    # Remove records whose class count is <= Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of Blood Types with a group population less than or equal to {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
BTAgg = pd.DataFrame({'count' : DataAnon.groupby(["BloodType"]).size()}).reset_index()

    # display class counts
display(BTAgg)

del BTAgg

#### State
When identifying frequencies across states, we notice that every state falls below our specified "K" threshold. However, the distribution is consistent: each state holds roughly 2% (+/- 0.4%) of the remaining dataset. Given this consistency, we have chosen to leave State in our anonymized dataset because 1) location may be extremely important to infection, and 2) once perturbed later in the process, we feel this data will be more secure. 

In [None]:
%%time
%matplotlib inline

    #Agg Classes
SAgg = pd.DataFrame({'count' : DataAnon.groupby(["State"]).size()}).reset_index()

    # display class counts
display(SAgg)
    
    # Pie class Distribution
SAgg.plot.pie(y = 'count', labels = SAgg["State"], autopct='%1.1f%%', figsize=(12,6))


#### Hair Color
With our Hair Color frequencies identified, we notice that we have three categories that are very sparse. These subjects would potentially be at risk of identification simply due to the sparsity of their occurrence.

We have chosen to remove Hair Colors: 
* Grey
* Light Blond
* Light Brown

After removing records of Hair Color with a group population less than or equal to 500.0, we are left with 8192 records.

In [None]:
%%time
%matplotlib inline


    #Agg Classes
HairAgg = pd.DataFrame({'count' : DataAnon.groupby(["HairColor"]).size()}).reset_index()

    # display class counts
display(HairAgg)
    
    # Pie class Distribution
HairAgg.plot.pie(y = 'count', labels = HairAgg["HairColor"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, HairAgg, on = "HairColor")

    # Remove records whose class count is <= Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of Hair Color with a group population less than or equal to {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
HairAgg = pd.DataFrame({'count' : DataAnon.groupby(["HairColor"]).size()}).reset_index()

    # display class counts
display(HairAgg)

del HairAgg

#### Eye Color
With our Eye Color frequencies identified, we notice that we have five categories that are very sparse. These subjects would potentially be at risk of identification simply due to the sparsity of their occurrence.

We have chosen to remove Eye Colors: 
* Black
* Grey
* Light Blue
* Light Brown
* Other

After removing records of Eye Color with a group population less than or equal to 500.0, we are left with 7662 records.

In [None]:
%%time
%matplotlib inline


    #Agg Classes
EyeAgg = pd.DataFrame({'count' : DataAnon.groupby(["EyeColor"]).size()}).reset_index()

    # display class counts
display(EyeAgg)
    
    # Pie class Distribution
EyeAgg.plot.pie(y = 'count', labels = EyeAgg["EyeColor"], autopct='%1.1f%%', figsize=(12,6))


In [None]:
    # MergeAgg
DataAnon = pd.merge(DataAnon, EyeAgg, on = "EyeColor")

    # Remove records whose class count is <= Threshold Value
DataAnon = DataAnon[DataAnon["count"]>KAThres].drop(['count'], axis = 1)

print("After removing records of Eye Color with a group population less than or equal to {0}, we are left with {1} records.".format(KAThres,len(DataAnon)))

    #Agg Classes
EyeAgg = pd.DataFrame({'count' : DataAnon.groupby(["EyeColor"]).size()}).reset_index()

    # display class counts
display(EyeAgg)

del EyeAgg

#### Hispanic/Latino Population
With our Hispanic/Latino frequencies identified, we identify that both classes meet our "K" threshold. No observations are removed for this class.

In [None]:
%%time
%matplotlib inline


    #Agg Classes
HisAgg = pd.DataFrame({'count' : DataAnon.groupby(["Hispanic_Latino"]).size()}).reset_index()

    # display class counts
display(HisAgg)
    
    # Pie class Distribution
HisAgg.plot.pie(y = 'count', labels = HisAgg["Hispanic_Latino"], autopct='%1.1f%%', figsize=(12,6))

#### Stratification of the Response Variable or Classifier

We are concerned about the distributions of each classifier matching publicly available data. For example, blood type distributions by ethnicity type are publicly available. The American Red Cross provides this information on their <a href="http://www.redcrossblood.org/learn-about-blood/blood-types">website</a>:

<table border="0" cellpadding="0" cellspacing="0" width="480">
<tbody>
<tr class="dates-bg-dkgray">
<td align="center" valign="top" width="40">
<div><b>&nbsp;</b></div>
</td>
<td align="center" valign="top" width="110">
<div><b>Caucasian</b></div>
</td>
<td align="center" valign="top" width="110">
<div><b>African- American</b></div>
</td>
<td align="center" valign="top" width="110">
<div><b>Latino-American</b></div>
</td>
<td align="center" valign="top" width="110">
<div><b>Asian</b></div>
</td>
</tr>
<tr class="dates-bg-gray">
<td align="center" valign="top">
<div><b>O +</b></div>
</td>
<td align="center" valign="top">
<div align="center">37%</div>
</td>
<td align="center" valign="top">
<div align="center">47%</div>
</td>
<td align="center" valign="top">
<div align="center">53%</div>
</td>
<td align="center" valign="top">
<div align="center">39%</div>
</td>
</tr>
<tr class="dates-bg-white">
<td align="center" valign="top">
<div><b>O -</b></div>
</td>
<td align="center" valign="top">
<div align="center">8%</div>
</td>
<td align="center" valign="top">
<div align="center">4%</div>
</td>
<td align="center" valign="top">
<div align="center">4%</div>
</td>
<td align="center" valign="top">
<div align="center">1%</div>
</td>
</tr>
<tr class="dates-bg-gray">
<td align="center" valign="top">
<div><b>A +</b></div>
</td>
<td align="center" valign="top">
<div align="center">33%</div>
</td>
<td align="center" valign="top">
<div align="center">24%</div>
</td>
<td align="center" valign="top">
<div align="center">29%</div>
</td>
<td align="center" valign="top">
<div align="center">27%</div>
</td>
</tr>
<tr class="dates-bg-white">
<td align="center" valign="top">
<div><b>A -</b></div>
</td>
<td align="center" valign="top">
<div align="center">7%</div>
</td>
<td align="center" valign="top">
<div align="center">2%</div>
</td>
<td align="center" valign="top">
<div align="center">2%</div>
</td>
<td align="center" valign="top">
<div align="center">0.5%</div>
</td>
</tr>
<tr class="dates-bg-gray">
<td align="center" valign="top">
<div><b>B +</b></div>
</td>
<td align="center" valign="top">
<div align="center">9%</div>
</td>
<td align="center" valign="top">
<div align="center">18%</div>
</td>
<td align="center" valign="top">
<div align="center">9%</div>
</td>
<td align="center" valign="top">
<div align="center">25%</div>
</td>
</tr>
<tr class="dates-bg-white">
<td align="center" valign="top">
<div><b>B -</b></div>
</td>
<td align="center" valign="top">
<div align="center">2%</div>
</td>
<td align="center" valign="top">
<div align="center">1%</div>
</td>
<td align="center" valign="top">
<div align="center">1%</div>
</td>
<td align="center" valign="top">
<div align="center">0.4%</div>
</td>
</tr>
<tr class="dates-bg-gray">
<td align="center" valign="top">
<div><b>AB +</b></div>
</td>
<td align="center" valign="top">
<div align="center">3%</div>
</td>
<td align="center" valign="top">
<div align="center">4%</div>
</td>
<td align="center" valign="top">
<div align="center">2%</div>
</td>
<td align="center" valign="top">
<div align="center">7%</div>
</td>
</tr>
<tr class="dates-bg-white">
<td align="center" valign="top">
<div><b>AB -</b></div>
</td>
<td align="center" valign="top">
<div align="center">1%</div>
</td>
<td align="center" valign="top">
<div align="center">0.3%</div>
</td>
<td align="center" valign="top">
<div align="center">0.2%</div>
</td>
<td align="center" valign="top">
<div align="center">0.1%</div>
</td>
</tr>
</tbody>
</table>

We fear that information like this, and similar information for other attributes, may make it possible to identify data even after perturbation (as discussed later). Pulling an equal number of records for each ZVirus type slightly skews the distributions away from those predictable from available population data.

When identifying how many records are remaining in each ZVirus category, we find that the least frequent ZVirus type is "Infected" with a frequency of 1528. In order to stratify, this means we cannot have a sample size larger than 1528 * 3 = 4584. 

In [None]:
%%time
%matplotlib inline

ZVirusDist = pd.DataFrame({'count' : DataAnon.groupby(["ZVirus"]).size()}).reset_index()
display(ZVirusDist)

ZVirusDist.plot.pie(y = 'count', labels = ZVirusDist['ZVirus'], autopct='%1.1f%%')

del ZVirusDist

In order to truly randomly sample observations from each class and keep a fairly round sample size number, we have chosen to utilize a stratified sample size of 3750. This sample size will be stratified three ways in a 33/33/33 split (1250 Observations) across the ZVirus classes. 

We are able to compute the sample size for each ZVirus type, and then take a random sample within each group. Below you will see that our sampled distribution matches the chosen 33/33/33 split across ZVirus types. 

*Note:* A seed value equal to the sample size of each type is used in order to ensure reproducibility for this report.
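For reference, an equal-sized draw per class can also be written in one step with `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later. The sketch below runs on a made-up frame; the column `RecordID`, the value of `per_class`, and the single shared seed are assumptions for the example, whereas the cell below samples each ZVirus class separately with its own seed.

```python
import pandas as pd

# Made-up, deliberately unbalanced classes
toy = pd.DataFrame({
    "ZVirus": ["Carrier"] * 10 + ["Immune"] * 8 + ["Infected"] * 5,
    "RecordID": range(23),
})

# Draw the same number of records from every class, without replacement.
# per_class must not exceed the size of the smallest class (5 here).
per_class = 4
sampled = toy.groupby("ZVirus").sample(n=per_class, replace=False, random_state=0)

print(sampled["ZVirus"].value_counts())
```

As in the notebook, the per-class sample size is capped by the least frequent class, which is why the smallest ZVirus group determines the largest possible stratified sample.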

In [None]:
%%time
SampleSize = 3750

CarrierSample_Seed   = int(round(SampleSize * 33.3333 / 100.0,0))
ImmuneSample_Seed    = int(round(SampleSize * 33.3333 / 100.0,0))
InfectedSample_Seed  = int(round(SampleSize * 33.3333 / 100.0,0))


CarrierDataSampled  = DataAnon[DataAnon["ZVirus"] == 'Carrier'].sample(n=CarrierSample_Seed, replace = False, random_state = CarrierSample_Seed)
ImmuneDataSampled   = DataAnon[DataAnon["ZVirus"] == 'Immune'].sample(n=ImmuneSample_Seed, replace = False, random_state = ImmuneSample_Seed)
InfectedDataSampled = DataAnon[DataAnon["ZVirus"] == 'Infected'].sample(n=InfectedSample_Seed, replace = False, random_state = InfectedSample_Seed)


DataAnon = pd.concat([CarrierDataSampled,ImmuneDataSampled,InfectedDataSampled])

print(len(DataAnon))

ZVirusDist = pd.DataFrame({'count' : DataAnon.groupby(["ZVirus"]).size()}).reset_index()
display(ZVirusDist)

ZVirusDist.plot.pie(y = 'count', labels = ZVirusDist['ZVirus'], autopct='%1.1f%%')

del ZVirusDist

#### Perturbation - Michael

Perturbation removes identifiable feature values and translates them into nonsense that maintains the original distributions. Here, each value is replaced by a salted hash.

In [None]:
#Saving off a copy of the data set after our KA
DataKAnon = DataAnon.copy()

In [None]:
import hashlib

salt = "nintendo"

def hashbrowns(x):
    tohash = (salt + str(x)).encode('utf-8')
    return hashlib.sha256(tohash).hexdigest()

columns_to_hash = ["Sex","Ethnicity","Hispanic_Latino","BloodType","HairColor","EyeColor","State","AgeClass","ZVirus"]

for feature in columns_to_hash:
    DataAnon[feature] = DataAnon[feature].apply(hashbrowns)

display(DataAnon.head())

## Deanonymizing Data

An attacker, Bob, wants to identify individuals who participated in this study in order to learn their Blood Type and ZVirus Type. Bob was informed of the date, time, and location at which the Texas study took place for data capture. Bob sat on a bench outside the facility, taking pictures of individuals as they walked in and out. He logged the timestamp of each photo to record when individuals entered and left the facility, so he could exclude those he perceived to be employees because of their long stays. 

Since Texas is not the first state in which this study has taken place, Bob is familiar with the data shared by the CDC and their descriptions of the dataset values. Using his pictures, he visually logged his perceived record value for the following attributes:
* Sex {0, 1} (i.e., Female, Male)
* Ethnicity {Black American, White American, Other Race}
* Hispanic_Latino {0, 1} (i.e., No, Yes)
* HairColor (e.g., Black, Blond, Brown, Light Brown)
* State

To produce an example dataset of this visual attack, I have randomly selected 50 records from Texas in our original Dataset - storing only these attributes. In practice, there would be an error margin reducing the effectiveness of this data gathering technique, but for the purpose of this exercise we will select straight from the original data.

In [None]:
%%time

BobData = Data[Data["State"] == "Texas"][["Sex", "Ethnicity", "Hispanic_Latino", "HairColor","State"]].sample(n=50, replace = False, random_state = 50)
display(BobData)
print(len(BobData))

#### Is Stratification and K-Anonymization after removing PII sufficient?

Perturbation creates a number of problems when it comes to interpreting and using the data, so a publisher may often be inclined to skip it. Suppose our dataset is shared after only removing PII, dropping the K-Anon records, and stratifying the result set. Bob can then attempt to detect a given individual among his 50 observations. 

The first step in this process is to aggregate the shared dataset on the attributes Bob was able to capture visually from his photographs ("Sex", "Ethnicity", "Hispanic_Latino", "HairColor", and "State"), computing the count for each combination. This lets us exploit the uniqueness of each observation's combined values rather than looking at any one attribute in isolation. 

In [None]:
%%time

    #Agg Classes
Agg = pd.DataFrame({'count' : DataKAnon.groupby(["Sex","Ethnicity", "Hispanic_Latino", "HairColor", "State"]).size()}).reset_index()

    # display class counts
display(Agg.head())
    


With our frequencies identified, we merge Bob's Data with the aggregated shared data, to find only those records from the shared dataset with matches on the demographic information captured from the photographs. Furthermore, for the purpose of identifying the highest likelihood match, we limit this merged data to those with a count of 1.

In [None]:
    # MergeAgg
BobData = pd.merge(BobData, Agg, on = ["Sex", "Ethnicity", "Hispanic_Latino", "HairColor","State"])

display(BobData.sort_values(by=["count"]))

    # Keep only combinations that occur exactly once in the shared data
BobData = BobData[BobData["count"]==1].drop(['count'], axis = 1)

display(BobData)

The final result set below contains those individuals from the shared dataset whom Bob observed and photographed, and for whom we have the highest confidence in a match. Further insights may be drawn with less confidence, but for the purpose of this report only the highest-confidence matches are printed here. Notice that I do not claim 100% confidence, even though these observations were stand-alones in the shared dataset. Because we do not know whether, or which, observations were removed from the original study, it is quite possible that another matching participant was removed from the shared sample without our knowledge. 

In [None]:
    # Merge BobData with DataKAnon to identify specific individuals you observed
DataKAnonIdentified = pd.merge(BobData, DataKAnon, on = ["Sex", "Ethnicity", "Hispanic_Latino", "HairColor", "State"])
display(DataKAnonIdentified)

#### Perturbation

* Computers don't need to know the semantic meaning of a value to understand its distribution
    * Linear vs Categorical Features
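
The point about distributions can be made concrete with a small sketch: a salted hash is deterministic, so equal inputs always produce equal digests and the class frequencies survive unchanged. The function name `salted_sha256`, the salt `"example-salt"`, and the toy values are assumptions for this illustration only.

```python
import hashlib
from collections import Counter

def salted_sha256(value, salt="example-salt"):
    """Hash a value with a fixed, illustrative salt prepended."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()

# Toy categorical column
blood_types = ["O+", "O+", "A+", "O+", "A+", "B+"]
hashed = [salted_sha256(v) for v in blood_types]

# The labels become unreadable, but the frequency profile is identical
original_counts = sorted(Counter(blood_types).values())
hashed_counts = sorted(Counter(hashed).values())
print(original_counts, hashed_counts)
```

An attacker who knows roughly how often each class occurs in the population can therefore line hashed classes up with real ones by frequency alone, without ever reversing the hash.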

#### Cracking the Hash

>A note on this process: We used a weak hashing method on purpose to illustrate a possible method of attack. A cryptographically secure method would not use a dictionary based salt, nor would that salt be static and independent of both the generator of the dataset and the recipient of the downloaded dataset.
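
As a hedged illustration of the note above, one sturdier option is a keyed hash (HMAC) with a randomly generated secret instead of a dictionary word. This is a sketch of one possible alternative, not the method used in this notebook; `secret_key` and `keyed_digest` are hypothetical names introduced for the example.

```python
import hashlib
import hmac
import secrets

# 32 random bytes from the OS; a secret like this cannot be found in a word list
secret_key = secrets.token_bytes(32)

def keyed_digest(value, key=secret_key):
    """Return the HMAC-SHA256 hex digest of a value under the given key."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()

# Equal values still map to equal tokens, so joins and counts keep working
print(keyed_digest("Immune") == keyed_digest("Immune"))
print(keyed_digest("Immune") == keyed_digest("Carrier"))
```

Note that this only defeats the dictionary attack on the salt. The tokens remain deterministic, so class frequencies are still visible to anyone holding the data.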

Essentially, what we're attempting to do is create a forced hash collision by using what we know about the data. For example, both the Hispanic_Latino and the Sex columns appear to be binary and since they have the same hash values in each, it's likely that whatever value is in one column should be equivalent to the value in the other. This means values like Y or N, and Yes and No are unlikely. This leaves us with context neutral classifier values such as 0 or 1. If they didn't have the same value, it would be simpler to begin with the Sex column as its likely possible values are much narrower: 0 or 1, Male or Female, m or f, and other variations.

The first step is to copy and paste our given hashes into a Google search. This tells us whether the hashes have been calculated before and, if so, what the original value is. A quick search for "ddebf4bb08617e33fac3c0e43ea5c3f63df912887f984d80d61d4b685a036dc3" turned up zero results.

In [None]:
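#%%time
## Brute-force search for the salt, left commented out:
## length 8 alone means 26**8 (about 2e11) candidates.
#from itertools import product
#import string
#
#chars = string.ascii_lowercase # chars to look for
#
#for length in range(5, 10): # try lengths 5 through 9
#    to_attempt = product(chars, repeat=length)
#    for attempt in to_attempt:
#        # product() yields tuples of characters, so join them before comparing
#        if "".join(attempt) == "nintendo":
#            print("Found It.")
#            break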

In [None]:
%%time
# Load the candidate salts, one per line
with open("10k_most_common.txt", "r") as f:
    passwords = f.read().splitlines()

possibleValues = ['0','1']
# Take the first row by position; index label 0 may not have survived sampling
valueTest = DataAnon["Hispanic_Latino"].iloc[0]

for password in passwords:
    for value in possibleValues:
        # Try the candidate salt both before and after the value
        saltBefore = (password + str(value)).encode('utf-8')
        saltAfter = (str(value) + password).encode('utf-8')
        
        if (hashlib.sha256(saltBefore).hexdigest() == valueTest or hashlib.sha256(saltAfter).hexdigest() == valueTest):
            print("Password is: " + password)
        

With the Password and Hash method now known, we can begin our attempts to guess the remaining data set. While hash functions are normally one way, as long as we know some basic meta information about the dataset it becomes possible to decode other attributes as it's much easier to guess the values of classifiers than it is to guess the salt for a hash.

In the case of ZVirus classification, we know that the CDC has been labeling their statistics and graphs with Carrier, Immune, and Infected. It is therefore likely that these three values were used in this dataset, especially as the ZVirus column contains three unique classifiers. Of course, it's possible they used non-contextual classifiers such as 0, 1, 2, etc., or even partially contextual classifiers like C, Im, In. As always, start with the obvious and work your way down.

In [None]:
def hashdebrown(hashvalue,possibleValues):
    password = "nintendo" #taken from earlier
    for value in possibleValues:
        saltBefore = (password + str(value)).encode('utf-8')
        saltAfter = (str(value) + password).encode('utf-8')
        
        #sha256, again, taken from above
        if (hashlib.sha256(saltBefore).hexdigest() == hashvalue or hashlib.sha256(saltAfter).hexdigest() == hashvalue):
            return [hashvalue,value]
    return [hashvalue,"Unknown"]

#According to the CDC, they've been classifying people as Carrier, Immune or Infected
ZVirusValues = ["Carrier","Immune","Infected","0","1","2","C","Im","In"]
for hashvalue in DataAnon['ZVirus'].unique():
    print(str(hashdebrown(hashvalue,ZVirusValues)))

## Conclusion

**K-Anonymization**

* Removing so many observations, and entire demographic categories, reduces the usefulness of the results provided back by open source contributors. Any algorithm produced to predict Carriers vs. Infected, etc. would apply ONLY to the demographics provided and should not be used on any demographic removed from this dataset. This means that the outlier demographics in the data would not benefit from this study in the same way others would.
* Sample size is important! Given our relatively small sample size after our anonymization techniques, it became relatively easy to identify an individual. With a larger sample size, the number of duplicates with the same demographic characteristics would increase (making outliers less of a target), fewer classes would have needed to be removed, and ultimately the results provided back to us would be more meaningful. One way we could have accomplished this would be to bootstrap our original dataset with replacement to produce a much larger dataset to work with. 
* Combinations of attributes are just as important as, if not more important than, any single attribute. Although we eliminated outlier classes through our "K" threshold, we did not analyze the frequency of combinations of attributes. This left a huge security gap in our dataset that was very easy for our attacker, Bob, to exploit. Leaving in State was also a mistake during K-Anonymization, as the number of observations remaining once narrowed down to a single state was VERY small. This was a key component of Bob's attack.
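
The last point can be checked directly by measuring the smallest group size over a combination of attributes rather than one attribute at a time. The sketch below uses a made-up six-row frame, and `smallest_group` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def smallest_group(df, quasi_identifiers):
    """Size of the rarest observed combination of the given columns."""
    return int(df.groupby(quasi_identifiers).size().min())

# Made-up records: every single attribute looks safe on its own
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "HairColor": ["Black", "Black", "Brown", "Brown", "Brown", "Brown"],
    "State": ["Texas"] * 6,
})

for column in ["Sex", "HairColor", "State"]:
    print(column, smallest_group(toy, [column]))

# The combination exposes a record that is unique in the data
print("combined", smallest_group(toy, ["Sex", "HairColor", "State"]))
```

Here every individual attribute has at least two records per class, yet one combination occurs only once, which is exactly the kind of record Bob's merge singles out.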