## SynDiffix Usage Tutorial

This notebook demonstrates how to use __SynDiffix__, an open-source library for generating statistically accurate
and strongly anonymous synthetic data from structured data.

We'll go through loading and inspecting a dataset, creating synthetic datasets that mimic the original, and comparing statistical properties computed over the original and synthetic data.

### Setup

The `syndiffix` package requires Python 3.10 or later. Let's install it and other packages we'll need for the notebook.

In [None]:
%pip install -q syndiffix requests pandas scipy scikit-learn

### Loading the dataset

We'll use the `loan` table from the Czech banking dataset. A cleaned-up version is available at open-diffix.org.

In [2]:
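import bz2
import pickle

import requests

def download_and_load(url):
    # Fetch the bz2-compressed pickle and fail early on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    data = bz2.decompress(response.content)
    # Note: only unpickle data from a source you trust
    df = pickle.loads(data)
    return df

# Usage
df_loan = download_and_load('http://open-diffix.org/datasets/loan.pbz2')
print(df_loan.head())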

  loan_id account_id       date  amount  duration  payments status
0    5314       1787 1993-08-05   96396        12    8033.0      B
1    5316       1801 1993-08-11  165960        36    4610.0      A
2    6863       9188 1993-08-28  127080        60    2118.0      A
3    5325       1843 1993-09-03  105804        36    2939.0      A
4    7240      11013 1993-10-06  274740        60    4579.0      A
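
Judging from the loader above, the `.pbz2` file is a pickled object compressed with bz2. As a minimal sketch of that format, the round trip below packs and unpacks a small dictionary entirely in memory. The helper names `pack` and `unpack` and the sample values are ours, chosen for illustration only; they are not part of SynDiffix or of the dataset's tooling.

```python
import bz2
import pickle

def pack(obj):
    """Pickle an object, then compress the bytes with bz2 (hypothetical helper)."""
    return bz2.compress(pickle.dumps(obj))

def unpack(blob):
    """Inverse of pack(). Only use on trusted data: unpickling can run arbitrary code."""
    return pickle.loads(bz2.decompress(blob))

# Small stand-in object; the real file holds a pandas DataFrame
sample = {"loan_id": [5314, 5316], "status": ["B", "A"]}
blob = pack(sample)
restored = unpack(blob)
print(restored == sample)
```

Because unpickling executes code embedded in the data, this format is only appropriate for files from a source you trust.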


### Creating synthetic datasets

Before creating synthetic datasets, we need to determine whether the data contains an entity whose privacy must be protected. We call this the *protected entity*. The `loan` dataset has an `account_id` column. Since each account is tied to individual persons, we want to ensure that the privacy of individual accounts is protected.

To do this, we prepare a dataframe consisting of only the `account_id`.

In [3]:
df_pid = df_loan[['account_id']]

Let's start by looking at the correlation between the `amount` column and each of `duration` and `loan_id` (we expect a strong correlation with `duration` and none with `loan_id`). To do this, we'll create two synthetic datasets of two columns each.

In [4]:
from syndiffix import Synthesizer

df_amt_dur = Synthesizer(df_loan[['amount','duration']], pids=df_pid).sample()
df_amt_lid = Synthesizer(df_loan[['amount','loan_id']], pids=df_pid).sample()

We'll use the Spearman rank-order correlation to measure the correlation, and compare the results for both the original and synthetic data.

In [5]:
import scipy.stats

print("amount <-> duration:")
print("Original",scipy.stats.spearmanr(df_loan['amount'], df_loan['duration']))
print("Synthetic",scipy.stats.spearmanr(df_amt_dur['amount'], df_amt_dur['duration']))
print("amount <-> loan_id:")
print("Original",scipy.stats.spearmanr(df_loan['amount'], df_loan['loan_id']))
print("Synthetic",scipy.stats.spearmanr(df_amt_lid['amount'], df_amt_lid['loan_id']))

amount <-> duration:
Original SignificanceResult(statistic=0.6276759903171304, pvalue=5.408495176711555e-76)
Synthetic SignificanceResult(statistic=0.6475724573499293, pvalue=4.374883861691819e-82)
amount <-> loan_id:
Original SignificanceResult(statistic=-0.037362151151157305, pvalue=0.32992360906471985)
Synthetic SignificanceResult(statistic=-0.0379950086356936, pvalue=0.3221482798617139)


The correlations computed from the synthetic data are very close to those of the original data. As expected, we see a strong correlation between loan amount and loan duration, and virtually no correlation between loan amount and `loan_id`.
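
For readers who want to see what the Spearman measure computes, here is a minimal pure-Python sketch: rank both columns (tied values share the average of their positions), then take the Pearson correlation of the ranks. The function names are ours, and this is an illustration of the standard definition rather than the `scipy` implementation. The sample values are the first five rows of the original table shown earlier.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# First five rows of the original loan table
amount = [96396, 165960, 127080, 105804, 274740]
duration = [12, 36, 60, 36, 60]
print(spearman(amount, duration))
```

Because only the ranks matter, the measure captures any monotonic relationship, not just a linear one.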

### A simpler (but less accurate) approach

Having to create a separate synthetic dataset for each column pair is inconvenient. It would be easier to create one synthetic dataset containing all of the columns. This is how other synthetic data products work. Let's try that and look at the resulting correlations.

In [6]:
df_loan_syn = Synthesizer(df_loan, pids=df_pid).sample()

print("amount <-> duration:")
print("Original",scipy.stats.spearmanr(df_loan['amount'], df_loan['duration']))
print("Synthetic (2-col)",scipy.stats.spearmanr(df_amt_dur['amount'], df_amt_dur['duration']))
print("Synthetic (all)",scipy.stats.spearmanr(df_loan_syn['amount'], df_loan_syn['duration']))
print("amount <-> loan_id:")
print("Original",scipy.stats.spearmanr(df_loan['amount'], df_loan['loan_id']))
print("Synthetic (2-col)",scipy.stats.spearmanr(df_amt_lid['amount'], df_amt_lid['loan_id']))
print("Synthetic (all)",scipy.stats.spearmanr(df_loan_syn['amount'], df_loan_syn['loan_id']))

amount <-> duration:
Original SignificanceResult(statistic=0.6276759903171304, pvalue=5.408495176711555e-76)
Synthetic (2-col) SignificanceResult(statistic=0.6475724573499293, pvalue=4.374883861691819e-82)
Synthetic (all) SignificanceResult(statistic=0.6524030556117626, pvalue=2.5062793995535786e-83)
amount <-> loan_id:
Original SignificanceResult(statistic=-0.037362151151157305, pvalue=0.32992360906471985)
Synthetic (2-col) SignificanceResult(statistic=-0.0379950086356936, pvalue=0.3221482798617139)
Synthetic (all) SignificanceResult(statistic=-0.0028815796809600258, pvalue=0.9403435522830113)


We see that the Spearman measures are only slightly less accurate when all columns are synthesized. This is the case here because there are relatively few columns in this dataset. As a rule, the more columns, the lower the accuracy.
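
The flip side of this trade-off is that the per-pair approach grows quadratically with the number of columns. As a rough sketch, the snippet below enumerates the column pairs for the loan table. We leave out `account_id` on the assumption that it only serves as the protected entity identifier; the commented-out line shows where a per-pair `Synthesizer` call, as used earlier in this notebook, would go.

```python
from itertools import combinations
from math import comb

# Columns of the loan table, without the protected entity identifier (our assumption)
columns = ["loan_id", "date", "amount", "duration", "payments", "status"]

pairs = list(combinations(columns, 2))
for col_a, col_b in pairs:
    # One synthetic dataset per pair would be built here, e.g.:
    # Synthesizer(df_loan[[col_a, col_b]], pids=df_pid).sample()
    pass

print(f"{len(columns)} columns -> {len(pairs)} two-column datasets")
```

With `n` columns there are `n * (n - 1) / 2` pairs, so it is usually more practical to synthesize only the column combinations a given analysis needs.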

### Machine Learning

Now we give an example of using **SynDiffix** to build an ML model. Here we want to build a model that predicts the `duration` of a loan. 

Here are the possible values:

In [7]:
print("Loan Durations (months):", df_loan['duration'].unique())

Loan Durations (months): [12 36 60 24 48]


We're going to use the `sklearn` DecisionTreeClassifier. Let's prepare the dataset for modeling.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Let's drop 'loan_id' because we know it is of no predictive value
df = df_loan.drop(columns=["loan_id"])
# Convert date to a float (seconds since the epoch) because DecisionTreeClassifier needs numeric features
df['date'] = df['date'].astype('int64') / 10**9
# Make the PID dataframe (we did this before, but put it here again for completeness)
df_pid = df[['account_id']]
# Drop the PID from the dataset because it also has no predictive value
df = df.drop(columns=["account_id"])

Build the synthetic data. Setting `target_column` improves the quality of the synthetic data with respect to building an ML model for the target. If we were to build another ML model for another target, we'd make a new synthetic dataset.

In [9]:
target = 'duration'
df_syn = Synthesizer(df, pids=df_pid, target_column=target).sample()

We are going to build models from both the original and synthetic data so that we can compare the results. We need to run all of the modeling preparation steps on both the original and synthetic data.

Note that there is a one-hot encoding step here (`pd.get_dummies()`). It is important to synthesize the data **before** one-hot encoding rather than after, especially if there are a lot of values to be encoded. This is because the quality of the synthetic data decreases as the number of columns increases.

Note also that in order to test the quality of the synthetic data model, we must take the test data from the original data, not from the synthetic data.
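
To illustrate why the order of synthesis and one-hot encoding matters, here is a small pure-Python sketch of what one-hot encoding does to the column count. The `one_hot` helper is ours and only mimics the general idea behind `pd.get_dummies()`; the rows are taken from the first three loans shown earlier.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

rows = [
    {"amount": 96396, "status": "B"},
    {"amount": 165960, "status": "A"},
    {"amount": 127080, "status": "A"},
]
encoded = one_hot(rows, "status")
print(len(rows[0]), "columns before,", len(encoded[0]), "columns after")
```

Each distinct value becomes its own column, so a column with many distinct values would add many columns to the table if it were encoded before synthesis.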

In [10]:
# Split the data into features (X) and the target variable (y)
X = df.drop(target, axis=1)
y = df[target]
X_syn = df_syn.drop(target, axis=1)
y_syn = df_syn[target]

# And we need to convert strings to one-hot encoding
# (Important to do this after synthesis, not before)
X = pd.get_dummies(X)
X_syn = pd.get_dummies(X_syn)

# Split the original dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Do the same for the synthetic data, but noting that we'll use the original test set for testing both original and synthetic
X_train_syn, _, y_train_syn, _ = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

Build and run the models, and display prediction accuracy.

In [11]:

def runModel(X_train, X_test, y_train, y_test, dataSource):
    # Create a decision tree classifier and fit it to the training data
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)

    # Use the trained classifier to make predictions on the test data
    y_pred = clf.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy {dataSource} is {accuracy}")

runModel(X_train, X_test, y_train, y_test, "Original")
runModel(X_train_syn, X_test, y_train_syn, y_test, "Synthetic Target")

Accuracy Original is 0.8536585365853658
Accuracy Synthetic Target is 0.7853658536585366


The accuracy of the model trained on synthetic data is about 7 percentage points (roughly 8% in relative terms) below that of the model trained on the original data. This is not horrible, but we'd certainly like to do better. We have some ideas for improving this, but may not get to them for a while. If you are interested in contributing, do let us know!
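
To make the size of the gap concrete, it helps to express it both in percentage points and relative to the original accuracy. The short calculation below does this for the two accuracy values printed above; the helper name `accuracy_gap` is ours.

```python
def accuracy_gap(acc_original, acc_synthetic):
    """Return (absolute gap, gap relative to the original accuracy)."""
    absolute = acc_original - acc_synthetic
    relative = absolute / acc_original
    return absolute, relative

# Accuracy values from the model run above
acc_original = 0.8536585365853658
acc_synthetic = 0.7853658536585366

absolute, relative = accuracy_gap(acc_original, acc_synthetic)
print(f"Absolute gap: {absolute * 100:.1f} percentage points")  # Absolute gap: 6.8 percentage points
print(f"Relative gap: {relative * 100:.1f}%")                   # Relative gap: 8.0%
```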