## SynDiffix Usage Tutorial

This notebook demonstrates how to use __SynDiffix__, an open-source library for generating statistically-accurate
and strongly anonymous synthetic data from structured data.

We'll go through the process of loading and inspecting a toy dataset, creating a synthetic dataset that mimics the original,
computing some statistical properties over the two datasets and comparing them, training ML models to predict a target column
from a set of feature columns, and, finally, seeing how to improve accuracy when working with synthetic data.

### Setup

The `syndiffix` package requires Python 3.10 or later. We can install it using `pip`:

In [1]:
%pip install -q syndiffix

Note: you may need to restart the kernel to use updated packages.
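
If the installation fails, it can help to confirm that the running interpreter meets the version requirement stated above. The following is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and not part of `syndiffix`:

```python
import sys

# Minimum version stated in this tutorial for the syndiffix package.
MIN_VERSION = (3, 10)


def meets_minimum(version=None, minimum=MIN_VERSION):
    """Return True if `version` (default: the running interpreter) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version[:2]) >= tuple(minimum)


if not meets_minimum():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is older than 3.10")
```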





We'll need a toy dataset to play with and a way to train and evaluate ML models, so let's install the `scikit-learn` package to use one of its popular reference datasets and its ML API:

In [2]:
%pip install -q scikit-learn

Note: you may need to restart the kernel to use updated packages.





We'll want to compute some statistical properties of the original and synthetic datasets in order to compare them, so let's install the `scipy` package as well: 

In [3]:
%pip install -q scipy

Note: you may need to restart the kernel to use updated packages.





### Loading our data

For this tutorial, we are going to use the Diabetes dataset, which is a popular reference dataset containing some attributes of patients with diabetes and the progression of their illness one year after baseline.

You can find more info about it [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).
You can see all the available toy datasets [here](https://scikit-learn.org/stable/datasets/toy_dataset.html).

First, let's load our data and display some summary information about it:

In [4]:
import sklearn.datasets

data = sklearn.datasets.load_diabetes(as_frame=True)
print(data.DESCR)
print(data.frame.info())
print(data.frame)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

Let's look at some of the attribute correlations in this data:

In [5]:
import scipy.stats

print(scipy.stats.spearmanr(data.frame['target'], data.frame['age']))
print(scipy.stats.spearmanr(data.frame['target'], data.frame['sex']))
print(scipy.stats.spearmanr(data.frame['target'], data.frame['bmi']))
print(scipy.stats.spearmanr(data.frame['target'], data.frame['bp']))

SignificanceResult(statistic=0.19782187832853038, pvalue=2.806132121751573e-05)
SignificanceResult(statistic=0.03740081502886254, pvalue=0.4328318674689041)
SignificanceResult(statistic=0.5613820101065616, pvalue=4.567023927725032e-38)
SignificanceResult(statistic=0.4162408981534322, pvalue=5.992783653793038e-20)


We can see that disease progression is moderately correlated with body mass index (about 0.56) and blood pressure (about 0.42), weakly correlated with age (about 0.20), and essentially uncorrelated with sex (about 0.04, with a p-value of 0.43).
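
For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation computed on the ranks of the values, with tied values sharing their average rank. The sketch below illustrates that idea in plain Python; the helper names are hypothetical, it does not compute the p-value, and it is not how `scipy` is implemented internally:

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks matter, any strictly increasing relationship (linear or not) yields a coefficient of 1, which is why the measure suits attributes on very different scales.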

### Creating a synthetic dataset

Data with health information about individuals is usually privacy-sensitive and can't be shared freely with non-authorized analysts.
Fortunately, using __SynDiffix__ we can create a synthetic dataset that preserves most of the statistical properties of the data while, at the same time, protecting subjects' privacy.

In [6]:
from syndiffix import Synthesizer

syn_data = Synthesizer(data.frame).sample()

print(syn_data.info())
print(syn_data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     441 non-null    float64
 1   sex     441 non-null    float64
 2   bmi     441 non-null    float64
 3   bp      441 non-null    float64
 4   s1      441 non-null    float64
 5   s2      441 non-null    float64
 6   s3      441 non-null    float64
 7   s4      441 non-null    float64
 8   s5      441 non-null    float64
 9   s6      441 non-null    float64
 10  target  441 non-null    float64
dtypes: float64(11)
memory usage: 38.0 KB
None
          age       sex       bmi        bp        s1        s2        s3  \
0    0.001751 -0.044642 -0.019493 -0.026328 -0.118899 -0.081469  0.052322   
1    0.005383 -0.044642 -0.012328 -0.052483 -0.118702 -0.122760  0.052322   
2    0.005383 -0.044642 -0.009547 -0.043542 -0.078052 -0.074091  0.074412   
3    0.001751 -0.044642 -0.042026 -0.049240 -0.095613 -0.08

Now let's measure the same correlations over the synthetic data:

In [7]:
print(scipy.stats.spearmanr(syn_data['target'], syn_data['age']))
print(scipy.stats.spearmanr(syn_data['target'], syn_data['sex']))
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bmi']))
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bp']))

SignificanceResult(statistic=0.11478491924518482, pvalue=0.015881919884663025)
SignificanceResult(statistic=0.1249802923202103, pvalue=0.008602838462521191)
SignificanceResult(statistic=0.5275652229231301, pvalue=5.776345758770752e-33)
SignificanceResult(statistic=0.28985899319648795, pvalue=5.524432019731527e-10)


The correlation between `target` and `bmi` is preserved well (0.53 versus 0.56), but the others are distorted: the `bp` correlation drops from 0.42 to 0.29, the `age` correlation drops from 0.20 to 0.11, and the `sex` correlation rises from 0.04 to 0.12.
This happens because anonymization and synthesis introduce noise.
The more columns and the fewer rows the input has, the noisier the output gets.

### Improving accuracy

We can make our analysis more accurate by not synthesizing unnecessary columns.
When computing correlations, we only need 2 attributes at each step, so we can create a custom synthetic dataset for each computation separately.

In [8]:
syn_data = Synthesizer(data.frame[['target', 'age']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['age']))

syn_data = Synthesizer(data.frame[['target', 'sex']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['sex']))

syn_data = Synthesizer(data.frame[['target', 'bmi']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bmi']))

syn_data = Synthesizer(data.frame[['target', 'bp']]).sample()
print(scipy.stats.spearmanr(syn_data['target'], syn_data['bp']))

SignificanceResult(statistic=0.18868560044107685, pvalue=6.563883891333138e-05)
SignificanceResult(statistic=0.037366975495306806, pvalue=0.4342970241156707)
SignificanceResult(statistic=0.5040352106944338, pvalue=7.423589628985371e-30)
SignificanceResult(statistic=0.4979893145478151, pvalue=3.85665229429197e-29)


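The `age` and `sex` correlations are now within 0.01 of the originals, and the `bp` deviation shrinks from about 0.13 to about 0.08. The `bmi` correlation stays within 0.06 of the original, so the overall utility of such an analysis is considerably higher.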
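
To quantify how close the results are, one option is to compare each synthetic statistic with its original counterpart. The sketch below does this with the Spearman statistics printed in the cells above, rounded to four decimals; the dictionary and function names are hypothetical:

```python
# Spearman statistics copied from the cell outputs above, rounded to four decimals.
original = {"age": 0.1978, "sex": 0.0374, "bmi": 0.5614, "bp": 0.4162}
full_synthesis = {"age": 0.1148, "sex": 0.1250, "bmi": 0.5276, "bp": 0.2899}
pairwise_synthesis = {"age": 0.1887, "sex": 0.0374, "bmi": 0.5040, "bp": 0.4980}


def absolute_errors(reference, candidate):
    """Absolute difference per column between candidate and reference statistics."""
    return {col: round(abs(candidate[col] - reference[col]), 4) for col in reference}


for label, stats in [("all columns", full_synthesis), ("column pairs", pairwise_synthesis)]:
    print(label, absolute_errors(original, stats))
```

With these rounded values, the largest deviation falls from about 0.13 (`bp`, all columns) to about 0.08 (`bp`, column pairs), while the `bmi` deviation is slightly larger in the pairwise run (about 0.06 versus 0.03).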

### Training an ML model with synthetic data

In this section, we are going to use a simple linear regression model to predict the `target` column given the other attributes.
We are going to create the model twice, once on the raw data and again on the synthetic data, and see how the predictive power
of the model varies between the two approaches.

First, we need to split the raw data into a training dataset and a test dataset. Furthermore, we'll separate the attribute columns from the target column:

In [9]:
raw_data_test = data.frame.iloc[:80, :]
raw_data_test_y = raw_data_test["target"]
raw_data_test_X = raw_data_test.drop(columns=["target"])

raw_data_train = data.frame.iloc[80:, :]
raw_data_train_y = raw_data_train["target"]
raw_data_train_X = raw_data_train.drop(columns=["target"])

Then, we'll fit a model on the raw training dataset, use it to predict the `target` column of the test dataset, and score the predictions with the R² metric:

In [10]:
import sklearn.linear_model
import sklearn.metrics

model = sklearn.linear_model.LinearRegression().fit(raw_data_train_X, raw_data_train_y)
sklearn.metrics.r2_score(raw_data_test_y, model.predict(raw_data_test_X))

0.40482699584417636
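
The value above is the coefficient of determination, R² = 1 − SS_res / SS_tot: the fraction of the target's variance that the predictions explain. As a minimal sketch of that definition for a single target column (the function name `r2` is hypothetical, and this is not the `scikit-learn` implementation):

```python
def r2(y_true, y_pred):
    """Coefficient of determination for a single target: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the true values from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect model scores 1, a model that always predicts the mean of the test targets scores 0, and a model that does worse than that scores below 0.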

Let's create a synthetic dataset from the raw training data and then repeat the previous process:

In [11]:
syn_data_train = Synthesizer(raw_data_train).sample()
syn_data_train_y = syn_data_train["target"]
syn_data_train_X = syn_data_train.drop(columns=["target"])

model = sklearn.linear_model.LinearRegression().fit(syn_data_train_X, syn_data_train_y)
sklearn.metrics.r2_score(raw_data_test_y, model.predict(raw_data_test_X))

0.2579601525270222

As we can see, the predictive power of the model is significantly lower (an R² of about 0.26 versus 0.40).
Fortunately, we can pass a target column to the `Synthesizer` through the `target_column` parameter, which creates a synthetic dataset better suited for ML tasks:

In [12]:
syn_data_train = Synthesizer(raw_data_train, target_column="target").sample()
syn_data_train_y = syn_data_train["target"]
syn_data_train_X = syn_data_train.drop(columns=["target"])

model = sklearn.linear_model.LinearRegression().fit(syn_data_train_X, syn_data_train_y)
sklearn.metrics.r2_score(raw_data_test_y, model.predict(raw_data_test_X))

0.2363106536534617