## SynDiffix Usage Tutorial

This notebook demonstrates how to use __SynDiffix__, an open-source library for generating statistically accurate
and strongly anonymous synthetic data from structured data.

We'll load and inspect a toy dataset, create a synthetic dataset that mimics the original,
compute some statistical properties of both datasets and compare them, and, finally, see how to improve accuracy when
analyzing synthetic data.

### Setup

The `syndiffix` package requires Python 3.10 or later. We can install it using `pip`:

In [None]:
%pip install syndiffix

We'll need a toy dataset to play with, so let's install the `scikit-learn` package and use one of its popular reference datasets:

In [None]:
%pip install scikit-learn

We'll want to compute some statistical properties of the original and synthetic datasets in order to compare them, so let's install the `scipy` package as well: 

In [None]:
%pip install scipy

### Loading our data

For this tutorial, we are going to use the Diabetes dataset, which is a popular reference dataset containing some attributes of patients with diabetes and the progression of their illness one year after baseline.

You can find more info about it [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).
You can see all the available toy datasets [here](https://scikit-learn.org/stable/datasets/toy_dataset.html).

First, let's load our data and display some summary information about it:

In [4]:
import sklearn.datasets

data = sklearn.datasets.load_diabetes(as_frame=True)
print(data.DESCR)
print(data.frame.info())
print(data.frame)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

Let's look at some of the attribute correlations in this data:

In [5]:
import scipy.stats

print(scipy.stats.spearmanr(data.frame['target'], data.frame['age']))
print(scipy.stats.spearmanr(data.frame['target'], data.frame['sex']))
print(scipy.stats.spearmanr(data.frame['target'], data.frame['bmi']))
print(scipy.stats.spearmanr(data.frame['target'], data.frame['bp']))

SignificanceResult(statistic=0.19782187832853038, pvalue=2.806132121751573e-05)
SignificanceResult(statistic=0.03740081502886254, pvalue=0.4328318674689041)
SignificanceResult(statistic=0.5613820101065616, pvalue=4.567023927725032e-38)
SignificanceResult(statistic=0.4162408981534322, pvalue=5.992783653793038e-20)


We can see that disease progression is moderately correlated with body mass index (about 0.56) and blood pressure (about 0.42), weakly correlated with age (about 0.20), and essentially uncorrelated with sex (about 0.04, with a p-value of 0.43).
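For intuition about the statistic reported above: Spearman's correlation is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The following is a minimal pure-Python sketch of that idea, not SciPy's actual implementation; the helper names `average_ranks`, `pearson` and `spearman` are ours, and the sketch does not compute a p-value.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy values (not taken from the Diabetes dataset): a nonlinear but
# strictly increasing relationship still has a rank correlation of 1.
print(spearman([1, 2, 3, 4, 5], [2, 4, 8, 16, 32]))
```

Because only the ordering of the values matters, the statistic is unaffected by monotonic rescaling of a column, which makes it a convenient measure for comparing an original dataset with a synthetic one.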

### Creating a synthetic dataset

Data with health information about individuals is usually privacy-sensitive and can't be shared freely with unauthorized analysts.
Fortunately, using __SynDiffix__ we can create a synthetic dataset that preserves most of the statistical properties of the data while protecting subjects' privacy.

In [6]:
from syndiffix import Synthesizer

# Build a synthesizer over the full table and sample a synthetic DataFrame from it.
syn_data = Synthesizer(data.frame).sample()

print(syn_data.info())
print(syn_data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB
None
          age       sex       bmi        bp        s1        s2        s3  \
0    0.009016 -0.044642 -0.024529 -0.026328 -0.011201 -0.018393  0.008142   
1    0.009016 -0.044642 -0.023451 -0.046985  0.010876  0.028465  0.011824   
2    0.009016 -0.044642 -0.069168 -0.037844 -0.047352 -0.015401  0.015229   
3    0.012648 -0.044642 -0.069014 -0.026328 -0.060683 -0.04

Now let's measure the same correlations over the synthetic data:

In [7]:
print(scipy.stats.spearmanr(syn_data['target'], syn_data['age']))
print(scipy.stats.spearmanr(syn_data['target'], syn_data['sex']))
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bmi']))
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bp']))

SignificanceResult(statistic=0.10105320609583315, pvalue=0.03367470392579032)
SignificanceResult(statistic=0.250068026895882, pvalue=9.964767433609745e-08)
SignificanceResult(statistic=0.5524355817173319, pvalue=1.113492287222233e-36)
SignificanceResult(statistic=0.17325645519948155, pvalue=0.00025230483799152704)


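The correlation between the `target` and `bmi` attributes is preserved well (about 0.55, versus 0.56 in the original), but the others are distorted: the `bp` correlation drops from about 0.42 to 0.17, the `age` correlation drops from about 0.20 to 0.10, and the `sex` correlation rises from about 0.04 to 0.25.
This happens because anonymization and synthesis introduce noise.
The more columns and the fewer rows the input has, the noisier the output gets.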

### Improving accuracy

We can make our analysis more accurate by not synthesizing unnecessary columns.
Each correlation involves only two attributes, so we can create a separate synthetic dataset for each computation, built from just those two columns.

In [8]:
syn_data = Synthesizer(data.frame[['target', 'age']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['age']))

syn_data = Synthesizer(data.frame[['target', 'sex']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['sex']))

syn_data = Synthesizer(data.frame[['target', 'bmi']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bmi']))

syn_data = Synthesizer(data.frame[['target', 'bp']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bp']))

SignificanceResult(statistic=0.20049336982640237, pvalue=2.1724356135790848e-05)
SignificanceResult(statistic=0.024652406428380392, pvalue=0.6048170855895556)
SignificanceResult(statistic=0.518249695364917, pvalue=8.14201967530878e-32)
SignificanceResult(statistic=0.47405933294930946, pvalue=3.8035222181486985e-26)


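All the computed correlations are now within about 0.06 of the originals, and the `age`, `sex` and `bp` correlations are no longer distorted, so the synthetic data is much more useful for this kind of analysis.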
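To make the comparison concrete, the sketch below tabulates the absolute error of each synthetic correlation against the original. The numbers are the Spearman statistics printed in the cells above, rounded to four decimals; the helper `absolute_errors` and the variable names are ours and are not part of SynDiffix.

```python
# Spearman statistics copied from the outputs above, rounded to four decimals.
original = {"age": 0.1978, "sex": 0.0374, "bmi": 0.5614, "bp": 0.4162}
full_table = {"age": 0.1011, "sex": 0.2501, "bmi": 0.5524, "bp": 0.1733}
per_pair = {"age": 0.2005, "sex": 0.0247, "bmi": 0.5182, "bp": 0.4741}


def absolute_errors(reference, estimate):
    """Absolute difference between estimate and reference, per column."""
    return {name: abs(estimate[name] - reference[name]) for name in reference}


errors_full = absolute_errors(original, full_table)
errors_pair = absolute_errors(original, per_pair)

# One row per column: error when synthesizing all columns vs. only two.
print(f"{'column':<8}{'all columns':>14}{'two columns':>14}")
for name in original:
    print(f"{name:<8}{errors_full[name]:>14.4f}{errors_pair[name]:>14.4f}")
print(f"{'max':<8}{max(errors_full.values()):>14.4f}{max(errors_pair.values()):>14.4f}")
```

On these numbers the largest error falls from roughly 0.24 to roughly 0.06. The `bmi` correlation is the one case where the two-column result is slightly further from the original (about 0.04 instead of about 0.01), so the improvement should be read as holding across the set of correlations, not for every individual one.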