## SynDiffix Salt Usage Tutorial

### Setup

The `syndiffix` package requires Python 3.10 or later. Let's install it and other packages we'll need for the notebook.

In [1]:
%pip install -q syndiffix requests pandas

Note: you may need to restart the kernel to use updated packages.


### Loading the dataset

We'll use the `loan` table from the Czech banking dataset. A cleaned-up version is available at open-diffix.org.

In [2]:
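import bz2
import pickle

import requests


def download_and_load(url):
    """Download a bz2-compressed pickle and return the unpickled object."""
    response = requests.get(url)
    response.raise_for_status()  # fail early on an HTTP error
    data = bz2.decompress(response.content)
    # Only unpickle data from a source you trust.
    return pickle.loads(data)


# Download the loan table and show the first rows
df_loan = download_and_load('http://open-diffix.org/datasets/loan.pbz2')
print(df_loan.head())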

  loan_id account_id  loan_date  amount  duration  payments status  defaulted
0    5314       1787 1993-07-05   96396        12    8033.0      B       True
1    5316       1801 1993-07-11  165960        36    4610.0      A      False
2    6863       9188 1993-07-28  127080        60    2118.0      A      False
3    5325       1843 1993-08-03  105804        36    2939.0      A      False
4    7240      11013 1993-09-06  274740        60    4579.0      A      False


### Inspect the salt

SynDiffix uses a secret salt to ensure "stickiness" in the synthetic data. The salt is stored in a user-specific application file, so it is always the same when the same user runs SynDiffix on the same machine.
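
To make the idea concrete, here is a minimal sketch of salt-driven "sticky" noise: noise that is random-looking but fully determined by the salt and the input. This is an illustration only and is not SynDiffix's actual implementation; the function `sticky_noise`, the hashing scheme, and the example keys are all assumptions made for this sketch.

```python
import hashlib
import random


def sticky_noise(salt: bytes, bucket_key: str, scale: float = 1.0) -> float:
    """Return Gaussian noise determined entirely by the salt and the key.

    Hypothetical helper for illustration; not part of the syndiffix API.
    """
    # Hash the salt together with the key to obtain a reproducible seed.
    digest = hashlib.sha256(salt + bucket_key.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    # A private Random instance leaves the global random state untouched.
    return random.Random(seed).gauss(0.0, scale)


salt_a = b"\x01" * 8
salt_b = b"\x02" * 8

# Same salt and same key: the noise repeats exactly.
assert sticky_noise(salt_a, "defaulted=True") == sticky_noise(salt_a, "defaulted=True")
# A different salt gives different noise for the same key.
assert sticky_noise(salt_a, "defaulted=True") != sticky_noise(salt_b, "defaulted=True")
```

Because the noise cannot be re-drawn by simply running again, repeated runs give no additional information to average over, provided the salt stays secret.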

Let's create a `Synthesizer` object for the `defaulted` column of the `loan` table and view the salt (a byte string).

In [3]:
from syndiffix import Synthesizer
from syndiffix.common import AnonymizationParams

df_pid = df_loan[['account_id']]
syn = Synthesizer(df_loan[['defaulted']], pids=df_pid)
print(syn.forest.anonymization_params.salt)

b'\xb62(<\x0b\xc4\x83C'


The file holding the salt is stored in the following directory (though normally there is no reason to view it):

In [6]:
from appdirs import user_config_dir
print(user_config_dir("SynDiffix", "OpenDiffix"))

C:\Users\local_francis\AppData\Local\OpenDiffix\SynDiffix


Let's synthesize the `loan_date` column several times using the default salt. Each run produces the same result.

In [14]:
for i in range(5):
    df_syn_date = Synthesizer(df_loan[['loan_date']], pids=df_pid).sample()
    print(f"Num rows: {len(df_syn_date)}")
    print(df_syn_date.sort_values(by='loan_date').head())

Num rows: 682
            loan_date
4 1993-08-30 08:04:34
5 1993-09-01 19:11:49
1 1993-09-03 10:06:47
3 1993-09-12 20:04:39
2 1993-09-15 16:54:13
Num rows: 682
            loan_date
4 1993-08-30 08:04:34
5 1993-09-01 19:11:49
1 1993-09-03 10:06:47
3 1993-09-12 20:04:39
2 1993-09-15 16:54:13
Num rows: 682
            loan_date
4 1993-08-30 08:04:34
5 1993-09-01 19:11:49
1 1993-09-03 10:06:47
3 1993-09-12 20:04:39
2 1993-09-15 16:54:13
Num rows: 682
            loan_date
4 1993-08-30 08:04:34
5 1993-09-01 19:11:49
1 1993-09-03 10:06:47
3 1993-09-12 20:04:39
2 1993-09-15 16:54:13
Num rows: 682
            loan_date
4 1993-08-30 08:04:34
5 1993-09-01 19:11:49
1 1993-09-03 10:06:47
3 1993-09-12 20:04:39
2 1993-09-15 16:54:13


From the above, we see that the number of rows and the first five dates are the same in every run.

Now let's do it again, but this time manually setting the salt each time.

In [15]:
for i in range(5):
    df_syn_date = Synthesizer(df_loan[['loan_date']], pids=df_pid,
            anonymization_params=AnonymizationParams(salt=bytes([i]))).sample()
    print(f"Num rows: {len(df_syn_date)}")
    print(df_syn_date.sort_values(by='loan_date').head())

Num rows: 682
            loan_date
4 1993-08-30 08:04:34
5 1993-09-01 19:11:49
1 1993-09-03 10:06:47
3 1993-09-12 20:04:39
2 1993-09-15 16:54:13
Num rows: 683
            loan_date
4 1993-07-15 19:01:57
1 1993-07-23 23:06:23
3 1993-08-11 19:02:07
2 1993-08-17 12:41:15
0 1993-08-30 22:27:45
Num rows: 682
            loan_date
4 1993-07-15 19:01:57
5 1993-07-20 17:16:26
1 1993-07-23 23:06:23
3 1993-08-11 19:02:07
2 1993-08-17 12:41:15
Num rows: 681
            loan_date
4 1993-07-15 19:01:57
5 1993-07-20 17:16:26
1 1993-07-23 23:06:23
3 1993-08-11 19:02:07
2 1993-08-17 12:41:15
Num rows: 680
            loan_date
4 1993-07-15 19:01:57
7 1993-07-16 09:47:40
5 1993-07-20 17:16:26
1 1993-07-23 23:06:23
3 1993-08-11 19:02:07


This time, we see that the number of rows varies slightly from one salt to the next, and the synthesized dates differ as well.

The main reason for setting the salt explicitly is so that different users building synthetic data on different machines can still benefit from stickiness. Stickiness strengthens anonymity compared with using fresh randomness for each synthetic dataset.
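
One way a team might create and exchange such a salt is sketched below, using only the Python standard library. The helper names (`new_shared_salt`, `salt_to_text`, `salt_from_text`) are hypothetical, and the 8-byte default simply matches the length of the salt printed earlier; whether SynDiffix expects a particular length is not something this tutorial establishes. Unlike the one-byte demo salts above, a shared salt should be random and kept secret.

```python
import secrets


def new_shared_salt(num_bytes: int = 8) -> bytes:
    """Generate a random salt from the OS's cryptographic random source."""
    return secrets.token_bytes(num_bytes)


def salt_to_text(salt: bytes) -> str:
    """Encode the salt as hex so it can be stored in a secrets manager."""
    return salt.hex()


def salt_from_text(text: str) -> bytes:
    """Decode a hex-encoded salt back into bytes."""
    return bytes.fromhex(text.strip())


salt = new_shared_salt()
shared_text = salt_to_text(salt)

# The round trip must reproduce the exact same bytes on every machine.
assert salt_from_text(shared_text) == salt
```

Each user would then decode the shared text and pass the result as `AnonymizationParams(salt=...)`, in the same way as the loop above.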