### ANALYSIS OF BEST T PARAMETER GIVEN A K-ANONYMIZED DATASET

To establish a sufficiently restrictive value of t when applying t-closeness, we use the pycanon library to compute the value of t that a given k-anonymized dataset already satisfies:

In [1]:
import pandas as pd
import pycanon
from pycanon import report, anonymity

### K=2

In [2]:
file_name = "/home/carmen/Escritorio/TFM/ml_anonymization/datasets/bank_dataset/csv/bank_k_2-anonymized.csv"
df = pd.read_csv(file_name, sep=";")

In [3]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
y                 0
dtype: int64

In [4]:
print(df.head())
df.info()  # info() prints its own summary and returns None
print(df.describe())
print(df.shape)

        age       job  marital          education default housing loan  \
0         *         *        *                  *       *       *    *   
1  [40, 60[  Employed  married        HIgh School      no      no   no   
2  [40, 60[  Employed  married  University Degree      no      no   no   
3  [20, 40[  Employed  married              Basic      no      no   no   
4  [20, 40[  Employed  married        HIgh School      no     yes   no   

     contact month day_of_week  emp.var.rate  cons.price.idx  cons.conf.idx  \
0          *     *           *          -1.1          94.767          -50.8   
1   cellular     2         fri           1.4          93.444          -36.1   
2   cellular     3         tue          -0.1          93.200          -42.0   
3  telephone     2         mon           1.4          94.465          -41.8   
4  telephone     2         wed          -1.8          92.893          -46.2   

   euribor3m   y  
0      1.028  no  
1      4.964  no  
2      4.153  no  
3   

In [5]:
q_i=["age","job","marital","education","default","housing","loan","contact","month","day_of_week"]
a_s = ["y"]

def delete_rows(file_name, quasi_ident, fillna=True):
    """Delete the rows of the given file in which all QIs are set to *.

    Note: `fillna` is accepted for compatibility but is not used.
    """
    df = pd.read_csv(file_name, sep=";")
    # Rows where every quasi-identifier was suppressed ("*") by the anonymizer.
    suppressed = (df[quasi_ident] == "*").all(axis=1)
    # reset_index() keeps the original row labels in a new "index" column.
    df_new = df[~suppressed].reset_index()
    print(df_new)
    return df_new

In [6]:
df = delete_rows(file_name,q_i)

       index       age       job   marital            education  default  \
0          1  [40, 60[  Employed   married          HIgh School       no   
1          2  [40, 60[  Employed   married    University Degree       no   
2          3  [20, 40[  Employed   married                Basic       no   
3          4  [20, 40[  Employed   married          HIgh School       no   
4          5  [20, 40[  Employed   married                Basic       no   
...      ...       ...       ...       ...                  ...      ...   
38169  41180  [40, 60[   Retired   married  Professional Course  unknown   
38170  41181  [40, 60[  Employed  divorced          HIgh School       no   
38171  41183  [40, 60[  Employed   married                Basic  unknown   
38172  41185  [20, 40[  Employed   married    University Degree       no   
38173  41187  [40, 60[  Employed   married  Professional Course       no   

      housing loan    contact month day_of_week  emp.var.rate  cons.price.idx  \
0     

In [7]:
df.head()

| | index | age | job | marital | education | default | housing | loan | contact | month | day_of_week | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | [40, 60[ | Employed | married | HIgh School | no | no | no | cellular | 2 | fri | 1.4 | 93.444 | -36.1 | 4.964 | no |
| 1 | 2 | [40, 60[ | Employed | married | University Degree | no | no | no | cellular | 3 | tue | -0.1 | 93.2 | -42.0 | 4.153 | no |
| 2 | 3 | [20, 40[ | Employed | married | Basic | no | no | no | telephone | 2 | mon | 1.4 | 94.465 | -41.8 | 4.96 | no |
| 3 | 4 | [20, 40[ | Employed | married | HIgh School | no | yes | no | telephone | 2 | wed | -1.8 | 92.893 | -46.2 | 1.334 | no |
| 4 | 5 | [20, 40[ | Employed | married | Basic | no | yes | no | cellular | 2 | mon | 1.4 | 93.918 | -42.7 | 4.962 | no |
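Since rows were removed, it can be worth confirming that the remaining data still reaches the expected k before measuring t. The following is a minimal sketch using plain pandas rather than pycanon; the helper name `smallest_class_size` and the toy frame are hypothetical and only illustrate the idea that k is the size of the smallest equivalence class over the quasi-identifiers.

```python
import pandas as pd

def smallest_class_size(df, quasi_ident):
    """Return the size of the smallest equivalence class over the given QIs."""
    # Each distinct combination of QI values forms one equivalence class.
    return int(df.groupby(quasi_ident, observed=True).size().min())

# Hypothetical toy data: two equivalence classes, of sizes 2 and 3.
toy = pd.DataFrame({
    "age": ["[20, 40[", "[20, 40[", "[40, 60[", "[40, 60[", "[40, 60["],
    "marital": ["married", "married", "single", "single", "single"],
    "y": ["no", "yes", "no", "no", "yes"],
})

k_toy = smallest_class_size(toy, ["age", "marital"])
print(k_toy)
```

On the notebook's data the equivalent call would presumably be `smallest_class_size(df, q_i)`, which should return at least 2 for a 2-anonymous dataset.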


In [8]:
# Encode each categorical column as integer category codes.
# Note that "age" is left as its generalized interval string.
categorical_cols = ["job", "marital", "education", "default", "housing",
                    "loan", "contact", "month", "day_of_week", "y"]
for col in categorical_cols:
    df[col] = df[col].astype("category").cat.codes


In [9]:
df.head()

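| | index | age | job | marital | education | default | housing | loan | contact | month | day_of_week | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | [40, 60[ | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1.4 | 93.444 | -36.1 | 4.964 | 0 |
| 1 | 2 | [40, 60[ | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 2 | 3 | -0.1 | 93.2 | -42.0 | 4.153 | 0 |
| 2 | 3 | [20, 40[ | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1.4 | 94.465 | -41.8 | 4.96 | 0 |
| 3 | 4 | [20, 40[ | 0 | 1 | 1 | 0 | 2 | 0 | 1 | 1 | 4 | -1.8 | 92.893 | -46.2 | 1.334 | 0 |
| 4 | 5 | [20, 40[ | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 1 | 1.4 | 93.918 | -42.7 | 4.962 | 0 |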


In [10]:
t = pycanon.anonymity.t_closeness(df, q_i, a_s)
print(t)

0.8923612930266673


The value of t that the 2-anonymous dataset already satisfies is **0.892**. Therefore, the t chosen for anonymizing the dataset has to be below this value; otherwise the t-closeness constraint is already met and imposes no additional anonymization.

We will set t=0.8 when generating the anonymized dataset for k=2.
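To make the meaning of this value more concrete, the sketch below computes a t-closeness figure by hand for a categorical sensitive attribute. It is an illustration, not pycanon's implementation. It assumes the distance between each equivalence class and the whole dataset is the total variation distance (the Earth Mover's Distance when all categories are equally far apart), and the function name `t_closeness_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def t_closeness_sketch(df, quasi_ident, sens_att):
    """Largest distance between a class's SA distribution and the global one."""
    # Distribution of the sensitive attribute over the whole dataset.
    global_dist = df[sens_att].value_counts(normalize=True)
    t = 0.0
    for _, group in df.groupby(quasi_ident, observed=True):
        # Distribution within one equivalence class, aligned to the global
        # categories (categories absent from the class get probability 0).
        class_dist = group[sens_att].value_counts(normalize=True)
        class_dist = class_dist.reindex(global_dist.index, fill_value=0.0)
        # Total variation distance: half the sum of absolute differences.
        dist = 0.5 * (class_dist - global_dist).abs().sum()
        t = max(t, dist)
    return float(t)

# Hypothetical toy data: globally 75% "no" and 25% "yes".
toy = pd.DataFrame({
    "age": ["[20, 40[", "[20, 40[", "[40, 60[", "[40, 60["],
    "job": ["Employed"] * 4,
    "y": ["no", "no", "no", "yes"],
})

t_toy = t_closeness_sketch(toy, ["age", "job"], "y")
print(t_toy)
```

Under this reading, a value close to 1 means that at least one equivalence class has a distribution of `y` very different from the overall one, which is why a target t below the measured value forces further generalization or suppression.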