### ANALYSIS OF BEST T PARAMETER GIVEN A K-ANONYMIZED DATASET

To establish a sufficiently restrictive value of t when applying t-closeness, we use the pycanon library to obtain the value of t that a k-anonymized dataset already satisfies:

In [1]:
import pandas as pd
import pycanon
from pycanon import report, anonymity

### K=10

In [2]:
file_name = "/home/carmen/Escritorio/TFM/ml_anonymization/datasets/bank_dataset/csv/bank_k_10-anonymized.csv"
df = pd.read_csv(file_name, sep=";")

In [3]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
y                 0
dtype: int64

In [4]:
print(df.head())
print(df.info())
print(df.describe())
print(df.shape)

        age         job   marital            education  default housing loan  \
0  [20, 40[    Employed    single  Professional Course       no     yes   no   
1  [60, 80[     Retired   married                Basic       no     yes   no   
2  [40, 60[    Employed  divorced                Basic  unknown     yes  yes   
3  [40, 60[  unemployed   married                Basic  unknown     yes   no   
4  [40, 60[    Employed   married          HIgh School       no     yes   no   

    contact month day_of_week  emp.var.rate  cons.price.idx  cons.conf.idx  \
0  cellular   may         wed          -1.8          92.893          -46.2   
1  cellular   aug         fri          -2.9          92.201          -31.4   
2  cellular   jul         mon           1.4          93.918          -42.7   
3  cellular   jul         mon           1.4          93.918          -42.7   
4  cellular   apr         wed          -1.8          93.075          -47.1   

   euribor3m    y  
0      1.281   no  
1      0.8

In [5]:
q_i=["age","job","marital","education","default","housing","loan"]
a_s = ["y"]

def delete_rows(file_name, quasi_ident, fillna=True):
    """Delete the rows of the given file in which all QIs are set to *.

    The fillna argument is currently unused; it is kept so existing calls still work.
    """
    df = pd.read_csv(file_name, sep=";")
    # Flag the rows in which every quasi-identifier was suppressed to "*".
    suppressed = (df[quasi_ident] == "*").all(axis=1)
    # reset_index() keeps the original row labels in a new "index" column.
    df_new = df[~suppressed].reset_index()
    print(df_new)
    return df_new

In [6]:
df = delete_rows(file_name,q_i)

       index       age         job   marital            education  default  \
0          0  [20, 40[    Employed    single  Professional Course       no   
1          1  [60, 80[     Retired   married                Basic       no   
2          2  [40, 60[    Employed  divorced                Basic  unknown   
3          3  [40, 60[  unemployed   married                Basic  unknown   
4          4  [40, 60[    Employed   married          HIgh School       no   
...      ...       ...         ...       ...                  ...      ...   
39340  41182  [40, 60[    Employed  divorced    University Degree       no   
39341  41183  [20, 40[    Employed    single    University Degree       no   
39342  41184  [20, 40[    Employed   married    University Degree  unknown   
39343  41186  [40, 60[     Unknown   married              Unknown  unknown   
39344  41187  [20, 40[    Employed   married          HIgh School       no   

      housing loan    contact month day_of_week  emp.var.rate  

In [7]:
df.head()

Unnamed: 0,index,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,y
0,0,"[20, 40[",Employed,single,Professional Course,no,yes,no,cellular,may,wed,-1.8,92.893,-46.2,1.281,no
1,1,"[60, 80[",Retired,married,Basic,no,yes,no,cellular,aug,fri,-2.9,92.201,-31.4,0.849,yes
2,2,"[40, 60[",Employed,divorced,Basic,unknown,yes,yes,cellular,jul,mon,1.4,93.918,-42.7,4.962,yes
3,3,"[40, 60[",unemployed,married,Basic,unknown,yes,no,cellular,jul,mon,1.4,93.918,-42.7,4.962,no
4,4,"[40, 60[",Employed,married,HIgh School,no,yes,no,cellular,apr,wed,-1.8,93.075,-47.1,1.415,no


In [8]:
# Encode every categorical column as integer category codes.
categorical_cols = ["job", "marital", "education", "default", "housing",
                    "loan", "contact", "month", "day_of_week", "y"]
for col in categorical_cols:
    df[col] = df[col].astype("category").cat.codes


In [9]:
df.head()

Unnamed: 0,index,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,y
0,0,"[20, 40[",0,2,2,0,2,0,0,6,4,-1.8,92.893,-46.2,1.281,0
1,1,"[60, 80[",1,1,0,0,2,0,0,1,0,-2.9,92.201,-31.4,0.849,1
2,2,"[40, 60[",0,0,0,1,2,2,0,3,1,1.4,93.918,-42.7,4.962,1
3,3,"[40, 60[",4,1,0,1,2,0,0,3,1,1.4,93.918,-42.7,4.962,0
4,4,"[40, 60[",0,1,1,0,2,0,0,0,4,-1.8,93.075,-47.1,1.415,0


In [10]:
pycanon.report.print_report(df, q_i, a_s)

c for (c,l)-diversity cannot be calculated as l=1
The dataset verifies:
          	 - k-anonymity with k = 10
          	 - (alpha,k)-anonymity with alpha = 1.0 and k = 10
          	 - l-diversity with l = 1
          	 - entropy l-diversity with l = 1
          	 - (c,l)-diversity with c = nan and l = 1
          	 - basic beta-likeness with beta = 5.478462754396364
          	 - enhanced beta-likeness with beta = 2.204955489582264
          	 - t-closeness with t = 0.6040302815750777
          	 - delta-disclosure privacy with delta = 2.3276440035709918


The value of t that the 10-anonymous dataset already satisfies is **0.604**. Therefore, the t chosen when anonymizing the dataset has to be below this value; only then is the constraint more restrictive, so that more anonymization is applied.

We will set t = 0.2 when generating the anonymized dataset with k=10.
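
To make the comparison above concrete, the following is a minimal sketch of what the reported t measures: the largest distance, over all equivalence classes, between the distribution of the sensitive attribute inside the class and its distribution in the whole table. It is not pycanon's implementation. The function name `t_closeness_sketch` and the toy rows are hypothetical, and the distance is assumed to be the variational distance (the Earth Mover's Distance with equal ground distance, as commonly used for categorical attributes). Only the standard library is used, so the block stays self-contained.

```python
from collections import Counter, defaultdict


def t_closeness_sketch(rows, quasi_ident, sens_att):
    """Return the smallest t for which the rows satisfy t-closeness.

    Assumption: the sensitive attribute is categorical and the distance is
    the variational distance (EMD with equal ground distance).
    """
    n = len(rows)
    global_counts = Counter(row[sens_att] for row in rows)

    # Group the sensitive values by equivalence class (same QI values).
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_ident)
        classes[key].append(row[sens_att])

    t = 0.0
    for values in classes.values():
        class_counts = Counter(values)
        # Half the L1 distance between class and global distributions.
        dist = 0.5 * sum(
            abs(class_counts[v] / len(values) - global_counts[v] / n)
            for v in global_counts
        )
        t = max(t, dist)
    return t


# Hypothetical toy table with two equivalence classes.
toy_rows = [
    {"age": "[20, 40[", "job": "Employed", "y": "no"},
    {"age": "[20, 40[", "job": "Employed", "y": "no"},
    {"age": "[40, 60[", "job": "Retired", "y": "yes"},
    {"age": "[40, 60[", "job": "Retired", "y": "no"},
]

existing_t = t_closeness_sketch(toy_rows, ["age", "job"], "y")
target_t = 0.2
# A target only forces extra anonymization if it is below the existing value.
is_stricter = target_t < existing_t
print(existing_t, is_stricter)
```

In this toy table both classes are at distance 0.25 from the global distribution, so a target of 0.2 is stricter than what the data already satisfies. This is the same reasoning that places t = 0.2 below the 0.604 reported for the bank dataset.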