Testing what epsilon values are returned when attack is applied to unprotected data (i.e., the only protection is sampling variability). This file uses the CPS ASEC (IPUMS) data.

In [1]:
# first we need to import some dependencies.
import pandas as pd
from sklearn.model_selection import train_test_split

# import privacy attack
# !pip install git+https://github.com/GilianPonte/likelihood_based_privacy_attack.git
from likelihood_based_privacy_attack import attacks

# select some variables
samples = 300 # select the number of observations

Define swapping privacy function.

In [2]:
# define our data protected method. We use swapping 25% of the observations.
def swapping(percent, data):
  import random
  import numpy as np
  swap_data = data
  idx = random.randint(0,data.shape[1]) # pick a random variable
  variable = np.array(data.iloc[:,idx]) # select variable from data
  ix_size = int(percent * len(variable) * 0.5) # select proportion to shuffle
  ix_1 = np.random.choice(len(variable), size=ix_size, replace=False) # select rows to shuffle
  ix_2 = np.random.choice(len(variable), size=ix_size, replace=False) # select rows to shuffle
  b1 = variable[ix_1] # take rows from variable and create b
  b2 = variable[ix_2] # take rows from variable and create b

  variable[ix_2] = b1 # swap 1
  variable[ix_1] = b2 # swap 2

  swap_data.iloc[:,idx] = variable  # place variable back in original data
  return swap_data

Let's test whether the IPUMS data returns infinite epsilon values when no protection is applied. We'll do two cases - one with a very small number of observations, and one with the actual number of observations.

In [3]:
# here we import the external data, train data and adversary training data (unprotected)
ipums_data = pd.read_csv("../../Data/IPUMS/cleaned_ipums_data.csv")
ipums_data = pd.DataFrame.drop_duplicates(ipums_data) # drop duplicates

In [4]:
# here we create the train, adversary and outside_training set.
ipums_data, evaluation_outside_training = train_test_split(ipums_data, train_size = int(samples*2/3), test_size = int(samples*1/3)) 
train, adversary_training = train_test_split(ipums_data, train_size = int(samples*1/3))

In [5]:
attacks.privacy_attack(seed=1,
                       simulations=10,
                       train=train,
                       adversary=adversary_training,
                       outside_training=evaluation_outside_training,
                       protected_training=train,
                       protected_adversary=adversary_training)

iteration is 0
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 1
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 2
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 3
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 4
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 5
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 6
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 7
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.58
TNR is 0.63
empirical epsilon = 0.2602830982636665
iteration is 8
FPR is 0.37
FNR is 0.42000000000000004
TPR is 0.5

(array([0.2602831, 0.2602831, 0.2602831, 0.2602831, 0.2602831, 0.2602831,
        0.2602831, 0.2602831, 0.2602831, 0.2602831]),
 0.37,
 0.63,
 0.42000000000000004,
 0.58)

In this case, we do observe relatively small epsilon values.

What if we apply swapping protection to the IPUMS data with the small number of observations?

In [6]:
# apply protection to train and adversary
swap25_train = swapping(percent = 0.25, data = train) # apply swapping 25% to train
swap25_adversary_training = swapping(percent = 0.25, data = adversary_training)  # apply swapping 25% to adv

In [7]:
attacks.privacy_attack(seed=1,
                       simulations=10,
                       train=train,
                       adversary=adversary_training,
                       outside_training=evaluation_outside_training,
                       protected_training=swap25_train,
                       protected_adversary=swap25_adversary_training)

iteration is 0
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 1
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 2
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 3
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 4
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 5
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 6
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 7
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 8
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon = 0.4054651081081644
iteration is 9
FPR is 0.28
FNR is 0.48
TPR is 0.52
TNR is 0.72
empirical epsilon =

(array([0.40546511, 0.40546511, 0.40546511, 0.40546511, 0.40546511,
        0.40546511, 0.40546511, 0.40546511, 0.40546511, 0.40546511]),
 0.28,
 0.72,
 0.48,
 0.52)

The privacy actually gets worse after swapping.

What about when we many more observations from the full IPUMS data? We won't use the full data here because it is computationally intensive.

In [8]:
samples = 10000

In [9]:
# here we import the external data, train data and adversary training data (unprotected)
ipums_data = pd.read_csv("../../Data/IPUMS/cleaned_ipums_data.csv")
ipums_data = pd.DataFrame.drop_duplicates(ipums_data) # drop duplicates

In [10]:
# here we create the train, adversary and outside_training set.
ipums_data, evaluation_outside_training = train_test_split(ipums_data, train_size = int(samples*2/3), test_size = int(samples*1/3)) 
train, adversary_training = train_test_split(ipums_data, train_size = int(samples*1/3))

In [11]:
attacks.privacy_attack(seed=1,
                       simulations=10,
                       train=train,
                       adversary=adversary_training,
                       outside_training=evaluation_outside_training,
                       protected_training=train,
                       protected_adversary=adversary_training)

iteration is 0
FPR is 0.4965496549654965
FNR is 0.2076207620762076
TPR is 0.7923792379237924
TNR is 0.5034503450345035
empirical epsilon = 0.8797946273010226
iteration is 1
FPR is 0.4965496549654965
FNR is 0.2076207620762076
TPR is 0.7923792379237924
TNR is 0.5034503450345035
empirical epsilon = 0.8797946273010226
iteration is 2
FPR is 0.4965496549654965
FNR is 0.2076207620762076
TPR is 0.7923792379237924
TNR is 0.5034503450345035
empirical epsilon = 0.8797946273010226
iteration is 3
FPR is 0.4965496549654965
FNR is 0.2076207620762076
TPR is 0.7923792379237924
TNR is 0.5034503450345035
empirical epsilon = 0.8797946273010226
iteration is 4
FPR is 0.4965496549654965
FNR is 0.2076207620762076
TPR is 0.7923792379237924
TNR is 0.5034503450345035
empirical epsilon = 0.8797946273010226
iteration is 5
FPR is 0.4965496549654965
FNR is 0.2076207620762076
TPR is 0.7923792379237924
TNR is 0.5034503450345035
empirical epsilon = 0.8797946273010226
iteration is 6
FPR is 0.4965496549654965
FNR is 0.20

(array([0.87979463, 0.87979463, 0.87979463, 0.87979463, 0.87979463,
        0.87979463, 0.87979463, 0.87979463, 0.87979463, 0.87979463]),
 0.4965496549654965,
 0.5034503450345035,
 0.2076207620762076,
 0.7923792379237924)

The epsilon value is very small.

Apply swapping to the larger subset of IPUMS data.

In [12]:
# apply protection to train and adversary
swap25_train = swapping(percent = 0.25, data = train) # apply swapping 25% to train
swap25_adversary_training = swapping(percent = 0.25, data = adversary_training)  # apply swapping 25% to adv

In [13]:
attacks.privacy_attack(seed=1,
                       simulations=10,
                       train=train,
                       adversary=adversary_training,
                       outside_training=evaluation_outside_training,
                       protected_training=swap25_train,
                       protected_adversary=swap25_adversary_training)

iteration is 0
FPR is 0.5079507950795079
FNR is 0.19741974197419743
TPR is 0.8025802580258026
TNR is 0.4920492049204921
empirical epsilon = 0.907130362475491
iteration is 1
FPR is 0.5079507950795079
FNR is 0.19741974197419743
TPR is 0.8025802580258026
TNR is 0.4920492049204921
empirical epsilon = 0.907130362475491
iteration is 2
FPR is 0.5079507950795079
FNR is 0.19741974197419743
TPR is 0.8025802580258026
TNR is 0.4920492049204921
empirical epsilon = 0.907130362475491
iteration is 3
FPR is 0.5079507950795079
FNR is 0.19741974197419743
TPR is 0.8025802580258026
TNR is 0.4920492049204921
empirical epsilon = 0.907130362475491
iteration is 4
FPR is 0.5079507950795079
FNR is 0.19741974197419743
TPR is 0.8025802580258026
TNR is 0.4920492049204921
empirical epsilon = 0.907130362475491
iteration is 5
FPR is 0.5079507950795079
FNR is 0.19741974197419743
TPR is 0.8025802580258026
TNR is 0.4920492049204921
empirical epsilon = 0.907130362475491
iteration is 6
FPR is 0.5079507950795079
FNR is 0.19

(array([0.90713036, 0.90713036, 0.90713036, 0.90713036, 0.90713036,
        0.90713036, 0.90713036, 0.90713036, 0.90713036, 0.90713036]),
 0.5079507950795079,
 0.4920492049204921,
 0.19741974197419743,
 0.8025802580258026)

Swapping marginally increased the privacy risk.