# k-anonymity
In this exercise you will practice exploring and linking various fake datasets, and try to de-identify and re-identify the owners of records. Think like an attacker who wants to gain as much information as possible. The attacker may want to demand money based on the value of the information found about each person.

## Datasets
There are four datasets:
1. income.csv: the dataset that an imaginary tax-related organization holds about its clients.
2. ip.csv: a simple example of the records kept by an internet provider company (e.g. Shaw).
3. hospital.csv: the dataset of an insurance company that provides insurance for travellers.
4. creditcard.csv: the dataset of a third-party organization that performs credit checks.

### Load the datasets
Load each dataset as a separate dataframe and explore the data.

In [293]:
import pandas as pd
credit_card = pd.read_csv("creditcard.csv")
hospital = pd.read_csv("hospital.csv")
income = pd.read_csv("income.csv")
ip = pd.read_csv("ip.csv")

### De-identification
For each dataset, classify each column as one of the following and justify your answer:
1. explicit identifier
2. quasi-identifier
3. sensitive data
4. other

### credit_card
* name -> explicit identifier, because it directly identifies a person.
* lastname -> explicit identifier, because it directly identifies a person.
* DOB -> quasi-identifier, because it cannot determine an individual's identity on its own, but it can when combined with other attributes.
* postal_code -> quasi-identifier, because it cannot determine an individual's identity on its own, but it can when combined with other attributes.
* credit_number -> sensitive data, because it is the attribute the credit_card dataset exists to record, so researchers are likely to need it.
* credit_provider -> other, because it seems irrelevant to identifying a person.
* credit_security_code -> sensitive data, because it is tied to the credit card number, so researchers may need it.

### hospital
* name -> explicit identifier, because it directly identifies a person.
* lastname -> explicit identifier, because it directly identifies a person.
* DOB -> quasi-identifier, because it cannot determine an individual's identity on its own, but it can when combined with other attributes.
* last_food -> other, because it is extra information.
* medical reason -> sensitive data, because it is the attribute the hospital dataset exists to record, so researchers need it.



### income
* name -> explicit identifier, because it directly identifies a person.
* lastname -> explicit identifier, because it directly identifies a person.
* ID -> explicit identifier, because some people may use their full name as a user ID and we want to consider the worst case, in which it directly identifies a person.
* DOB -> quasi-identifier, because it cannot determine an individual's identity on its own, but it can when combined with other attributes.
* postal_code -> quasi-identifier, because it cannot determine an individual's identity on its own, but it can when combined with other attributes.
* color -> other, because it is not sensitive information.
* companies -> sensitive data, because it is an attribute the income dataset exists to record, so researchers may need it.
* income -> sensitive data, because it is an attribute the income dataset exists to record, so researchers may need it as well.

### ip
* name -> explicit identifier, because it directly identifies a person.
* lastname -> explicit identifier, because it directly identifies a person.
* DOB -> quasi-identifier, because it cannot determine an individual's identity on its own, but it can when combined with other attributes.
* ip_address -> sensitive data, because it is the attribute the ip dataset exists to record, so researchers need it.
* location -> sensitive data, because it is an attribute the ip dataset exists to record, so it needs to be studied. Furthermore, this attribute is considered non-unique, because people living in the same apartment building have the same address.
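One way to sanity-check a quasi-identifier classification is to measure how many records a column, or a combination of columns, singles out. The sketch below uses a made-up toy frame; `uniqueness_ratio` is a hypothetical helper and not part of the exercise.

```python
import pandas as pd

def uniqueness_ratio(df, columns):
    """Fraction of rows whose combination of `columns` appears exactly once."""
    # duplicated(keep=False) marks every member of a repeated combination.
    repeated = df.duplicated(subset=columns, keep=False)
    return float((~repeated).mean())

# Toy data (made up for illustration, not the real CSV files).
toy = pd.DataFrame({
    "DOB": ["1990-10-13", "1990-10-13", "2000-03-21", "1945-04-02"],
    "color": ["DarkBlue", "Bisque", "Bisque", "Bisque"],
})

print(uniqueness_ratio(toy, ["DOB"]))           # 0.5
print(uniqueness_ratio(toy, ["color"]))         # 0.25
print(uniqueness_ratio(toy, ["DOB", "color"]))  # 1.0
```

A column whose ratio is well below 1 on its own but that pushes the ratio towards 1 when combined with others behaves like a quasi-identifier.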

#### anonymize data by removing explicit identifiers for each dataset

In [294]:
credit_card1 = credit_card.drop(["name","lastname"], axis=1)
hospital1 = hospital.drop(["name","lastname"],axis=1)
income1 = income.drop(["name","lastname","ID"],axis=1)
ip1 = ip.drop(["name","lastname"],axis=1)

### Re-identification by linking
Try to link the records from the datasets and re-identify the records. Note that you might only obtain matching information about a record rather than identify a specific individual.


In [295]:
df = pd.merge(credit_card1,hospital1, how="inner", on = "DOB")
df = pd.merge(df, income1, how="inner", on = "DOB")
df = pd.merge(df, ip1, how="inner", on = "DOB")
df

Unnamed: 0,DOB,postal_code_x,credit_number,credit_provider,credit_security_code,last_food,medical reason,postal_code_y,color,companies,income,ip_address,location
0,1990-10-13,92310,4760000000000000.0,VISA 13 digit,8,banana,back pain,92310,DarkBlue,Joseph-Burns,120000,192.0.8.93,"('53.7446', '-0.33525', 'Kingston upon Hull', ..."
1,2000-03-21,73196,2220000000000000.0,Discover,644,apple,flue,73196,SeaShell,Nguyen PLC,70000,203.48.10.235,"('48.73218', '11.18709', 'Neuburg an der Donau..."
2,1992-03-19,86372,373000000000000.0,JCB 16 digit,542,steak,vomiting,86372,Bisque,Byrd-Walton,223546,198.51.98.53,"('35.06544', '1.04945', 'Frenda', 'DZ', 'Afric..."
3,1945-04-02,19557,4.42e+18,Maestro,454,coffee,fever,19557,LightYellow,Pena Group,62345,192.160.182.167,"('35.85', '117.7', 'Dongdu', 'CN', 'Asia/Shang..."
4,1983-11-25,94306,4.85e+18,JCB 16 digit,297,mocha,cancer,94306,DarkSlateBlue,Schneider Inc,146098,213.43.91.75,"('32.05971', '34.8732', 'Ganei Tikva', 'IL', '..."
5,1951-02-14,29648,4.66e+18,American Express,188,strawberry,cold,29648,LightBlue,Ferguson Group,56000,192.29.160.209,"('-20.87306', '-48.29694', 'Viradouro', 'BR', ..."
6,1949-02-24,10124,4450000000000000.0,VISA 16 digit,565,apple,knee problem,10124,Fuchsia,"Martin, Alvarez and Young",231456,198.51.2.188,"('22.37066', '114.10479', 'Tsuen Wan', 'HK', '..."
7,1947-01-31,78788,36300000000000.0,American Express,76,gala,accident,78788,OrangeRed,"Burns, Michael and Collins",210900,198.58.178.92,"('48.52961', '12.16179', 'Landshut', 'DE', 'Eu..."
8,1958-10-26,77075,4010000000000.0,JCB 16 digit,445,chicken,flue,77075,MediumAquaMarine,"Miller, Hanson and Roberts",93567,203.3.238.205,"('48.07667', '8.64409', 'Trossingen', 'DE', 'E..."
9,1983-12-17,82698,3510000000000000.0,VISA 19 digit,368,chickenpie,injury,82698,IndianRed,Freeman-Perry,90000,192.52.207.100,"('38.37255', '34.02537', 'Aksaray', 'TR', 'Eur..."


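Since every row in the merged result is unique, DOB alone is enough to link each individual's records across all four datasets, so I can re-identify every individual.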
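Whether a link like this is unambiguous depends on the join key being unique in each dataset. The sketch below checks that on a two-row excerpt built from the merged output above; `link_on` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def link_on(left, right, key):
    """Inner-join two frames and report whether `key` is unique on both sides."""
    one_to_one = bool(left[key].is_unique and right[key].is_unique)
    merged = pd.merge(left, right, how="inner", on=key)
    return merged, one_to_one

# Two-row excerpt taken from the merged output above.
cards = pd.DataFrame({
    "DOB": ["1990-10-13", "2000-03-21"],
    "credit_provider": ["VISA 13 digit", "Discover"],
})
visits = pd.DataFrame({
    "DOB": ["1990-10-13", "2000-03-21"],
    "medical reason": ["back pain", "flue"],
})

linked, one_to_one = link_on(cards, visits, "DOB")
print(one_to_one)  # True: each DOB matches exactly one record per side
print(linked)
```

When `one_to_one` is true, every merged row combines attributes that belong to a single person, which is what makes the linking attack succeed.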

### Anonymize 
Anonymize the income and credit card datasets. Use generalization or suppression methods on postal code.

In [296]:
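# Generalize a row's postal code into one of two ranges.
def anonymize_postal(row):
    # The lower bound is inclusive so that 0 matches the "[0,50000]" label.
    if 0 <= row["postal_code"] <= 50000:
        return "[0,50000]"
    else:
        return "(50000,99999]"

# Generalize a row's birth year (an int) into one of two ranges.
def anonymize_DOB(row):
    if 1900 <= row["DOB"] < 1980:
        return "[1900, 1980)"
    else:
        # Every other year falls here; the data is assumed to stop at 2000.
        return "[1980, 2000]"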

In [297]:
credit_card1['postal_code'] = credit_card1.apply(anonymize_postal, axis=1)
income1["postal_code"] = income1.apply(anonymize_postal, axis=1)

In [298]:
# Keep only the birth year (as an int) before generalizing it.
credit_card1["DOB"] = pd.to_datetime(credit_card1["DOB"]).dt.year
income1["DOB"] = pd.to_datetime(income1["DOB"]).dt.year


In [299]:
credit_card1["DOB"] = credit_card1.apply(anonymize_DOB, axis=1)
income1["DOB"] = income1.apply(anonymize_DOB, axis=1)
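As an alternative to the row-wise functions above, the same postal code generalization can be sketched with `pandas.cut`, which bins a whole column at once. This swaps in a different technique from the one the notebook uses; `generalize` is a hypothetical helper, and the sample values are two postal codes from the output above plus the two boundary values.

```python
import pandas as pd

def generalize(series, edges, labels):
    """Bin numeric values into labelled ranges; the lowest edge is inclusive."""
    binned = pd.cut(series, bins=edges, labels=labels, include_lowest=True)
    return binned.astype(str)

# Two postal codes from the data, plus the boundary values 50000 and 0.
postal = pd.Series([92310, 19557, 50000, 0])
ranges = generalize(postal, [0, 50000, 99999], ["[0,50000]", "(50000,99999]"])
print(ranges.tolist())
# ['(50000,99999]', '[0,50000]', '[0,50000]', '[0,50000]']
```

Choosing the bin edges is the real design decision: wider bins give larger equivalence classes but less useful data.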

#### Question: Is it k-anonymized? 
What is the maximum k for which you can make each of the credit card and income datasets k-anonymized?



In [301]:
credit_card1[["DOB","postal_code"]].value_counts()

DOB           postal_code  
[1900, 1980)  [0,50000]        7
[1980, 2000]  (50000,99999]    5
              [0,50000]        5
[1900, 1980)  (50000,99999]    3
dtype: int64

In [302]:
income1[["DOB","postal_code"]].value_counts()

DOB           postal_code  
[1900, 1980)  [0,50000]        7
[1980, 2000]  (50000,99999]    5
              [0,50000]        5
[1900, 1980)  (50000,99999]    3
dtype: int64

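With QI = {DOB, postal_code}, the smallest equivalence class in each dataset contains 3 records, so both datasets are 3-anonymous. k = 3 is the maximum k that this generalization achieves.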
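The value of k can be read off programmatically as the size of the smallest equivalence class. The sketch below rebuilds only the quasi-identifier columns from the group counts shown above (7, 5, 5, 3); `k_anonymity` is a hypothetical helper name.

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same QI values."""
    return int(df.groupby(quasi_identifiers).size().min())

# Rebuild the QI columns from the value_counts output above.
groups = (
    [("[1900, 1980)", "[0,50000]")] * 7
    + [("[1980, 2000]", "(50000,99999]")] * 5
    + [("[1980, 2000]", "[0,50000]")] * 5
    + [("[1900, 1980)", "(50000,99999]")] * 3
)
qi_only = pd.DataFrame(groups, columns=["DOB", "postal_code"])

print(k_anonymity(qi_only, ["DOB", "postal_code"]))  # 3
```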

#### Question: Does it need l-diversity?

In [303]:
credit_card1

Unnamed: 0,DOB,postal_code,credit_number,credit_provider,credit_security_code
0,"[1980, 2000]","(50000,99999]",4760000000000000.0,VISA 13 digit,8
1,"[1980, 2000]","(50000,99999]",2220000000000000.0,Discover,644
2,"[1980, 2000]","(50000,99999]",373000000000000.0,JCB 16 digit,542
3,"[1900, 1980)","[0,50000]",4.42e+18,Maestro,454
4,"[1980, 2000]","(50000,99999]",4.85e+18,JCB 16 digit,297
5,"[1900, 1980)","[0,50000]",4.66e+18,American Express,188
6,"[1900, 1980)","[0,50000]",4450000000000000.0,VISA 16 digit,565
7,"[1900, 1980)","(50000,99999]",36300000000000.0,American Express,76
8,"[1900, 1980)","(50000,99999]",4010000000000.0,JCB 16 digit,445
9,"[1980, 2000]","(50000,99999]",3510000000000000.0,VISA 19 digit,368


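The credit_card dataset does not need further work to reach l-diversity: in the rows shown, credit_number and credit_security_code differ for every record, so each equivalence class already contains as many distinct sensitive values as it has records.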

In [304]:
income1.sort_values(["DOB","postal_code"])

Unnamed: 0,DOB,postal_code,color,companies,income
7,"[1900, 1980)","(50000,99999]",OrangeRed,"Burns, Michael and Collins",210900
8,"[1900, 1980)","(50000,99999]",MediumAquaMarine,"Miller, Hanson and Roberts",93567
15,"[1900, 1980)","(50000,99999]",DarkOrchid,"Murphy, Martinez and Jones",2000000
3,"[1900, 1980)","[0,50000]",LightYellow,Pena Group,62345
5,"[1900, 1980)","[0,50000]",LightBlue,Ferguson Group,56000
6,"[1900, 1980)","[0,50000]",Fuchsia,"Martin, Alvarez and Young",231456
11,"[1900, 1980)","[0,50000]",LightSalmon,"Garcia, Barker and Kim",43000
12,"[1900, 1980)","[0,50000]",Yellow,"Weber, Brown and Brooks",125000
16,"[1900, 1980)","[0,50000]",LightSteelBlue,"Smith, Graham and Smith",122000
17,"[1900, 1980)","[0,50000]",LightSalmon,Evans LLC,56000


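The income dataset does not need further work to reach l-diversity either: in the rows shown, every record has a different companies value, so each equivalence class already contains as many distinct sensitive values as it has records.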
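Distinct l-diversity can be checked by counting the distinct sensitive values inside each equivalence class and taking the minimum. The sketch below uses only the ten income rows displayed above, so it is an excerpt rather than the full dataset; `distinct_l_diversity` is a hypothetical helper name.

```python
import pandas as pd

def distinct_l_diversity(df, quasi_identifiers, sensitive):
    """Return l: the fewest distinct sensitive values found in any QI group."""
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# The ten income rows displayed above (excerpt only).
old = "[1900, 1980)"
excerpt = pd.DataFrame({
    "DOB": [old] * 10,
    "postal_code": ["(50000,99999]"] * 3 + ["[0,50000]"] * 7,
    "income": [210900, 93567, 2000000,
               62345, 56000, 231456, 43000, 125000, 122000, 56000],
})

print(distinct_l_diversity(excerpt, ["DOB", "postal_code"], "income"))  # 3
```

In this excerpt the income value 56000 appears twice in the larger class, so that class holds 6 distinct incomes across 7 records.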

### Try relocating the credit cards
Try finding out the location of the credit card holders by linking the dataset to the ip dataset. What do you find?

In [305]:
ip1["DOB"] = pd.to_datetime(ip1["DOB"]).dt.year
ip1["DOB"] = ip1.apply(anonymize_DOB, axis=1)
cc_ip_merge = pd.merge(credit_card1,ip1, how="inner", on = "DOB")

In [311]:
cc_ip_merge

Unnamed: 0,DOB,postal_code,credit_number,credit_provider,credit_security_code,ip_address,location
0,"[1980, 2000]","(50000,99999]",4.760000e+15,VISA 13 digit,8,192.0.8.93,"('53.7446', '-0.33525', 'Kingston upon Hull', ..."
1,"[1980, 2000]","(50000,99999]",4.760000e+15,VISA 13 digit,8,203.48.10.235,"('48.73218', '11.18709', 'Neuburg an der Donau..."
2,"[1980, 2000]","(50000,99999]",4.760000e+15,VISA 13 digit,8,198.51.98.53,"('35.06544', '1.04945', 'Frenda', 'DZ', 'Afric..."
3,"[1980, 2000]","(50000,99999]",4.760000e+15,VISA 13 digit,8,213.43.91.75,"('32.05971', '34.8732', 'Ganei Tikva', 'IL', '..."
4,"[1980, 2000]","(50000,99999]",4.760000e+15,VISA 13 digit,8,192.52.207.100,"('38.37255', '34.02537', 'Aksaray', 'TR', 'Eur..."
...,...,...,...,...,...,...,...
195,"[1900, 1980)","[0,50000]",4.270000e+15,VISA 19 digit,845,100.38.177.193,"('7.6', '4.18333', 'Olupona', 'NG', 'Africa/La..."
196,"[1900, 1980)","[0,50000]",4.270000e+15,VISA 19 digit,845,203.16.148.93,"('34.75856', '136.13108', 'Ueno-ebisumachi', '..."
197,"[1900, 1980)","[0,50000]",4.270000e+15,VISA 19 digit,845,192.31.67.82,"('0.46005', '34.11169', 'Busia', 'KE', 'Africa..."
198,"[1900, 1980)","[0,50000]",4.270000e+15,VISA 19 digit,845,192.58.175.42,"('34.06635', '-84.67837', 'Acworth', 'US', 'Am..."


I cannot find the location of the credit card holders by linking the dataset to the ip dataset. Once DOB is generalized into two ranges, the merge becomes a many-to-many join: every credit card record matches every ip record in the same DOB range. Each credit_number therefore appears next to many candidate locations, and this noise prevents re-linking the data. The credit_number is no longer unique in the merged result!
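The effect can be quantified by counting how many candidate IP addresses each card ends up with after the join. The sketch below is a small illustration under stated assumptions: `card_id` is a made-up column, and the IP addresses are a five-row excerpt from the outputs above, grouped by their generalized DOB.

```python
import pandas as pd

young, old = "[1980, 2000]", "[1900, 1980)"

# `card_id` is a hypothetical label standing in for credit_number.
cards = pd.DataFrame({
    "DOB": [young, young, old],
    "card_id": ["card_a", "card_b", "card_c"],
})
ips = pd.DataFrame({
    "DOB": [young, young, young, old, old],
    "ip_address": ["192.0.8.93", "203.48.10.235", "198.51.98.53",
                   "192.160.182.167", "192.29.160.209"],
})

# Many-to-many join: each card pairs with every IP in the same DOB range.
merged = pd.merge(cards, ips, how="inner", on="DOB")
candidates = merged.groupby("card_id")["ip_address"].nunique()

print(len(merged))  # 8 rows: 2 * 3 + 1 * 2
print(candidates)
```

A card with a single candidate would be re-located; here every card has at least two candidates, so the attacker cannot tell which location belongs to which holder.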