# Demonstration of Synthius

This notebook demonstrates the core functionality of Synthius through its high-level programming interface.

At this stage, Synthius consists of two steps:
1. Generation of synthetic data from the original data, using the given generation models.
2. Evaluation of the synthetic data, with various metrics as output.

The highest-level API performs both steps in a single call, but the steps are loosely coupled and each can be executed individually.



### Step 1 - Install Synthius

Run the following command in the CLI: `pip install synthius`.

### Step 2 - Prepare Your Data

Ensure your original data is stored as a CSV file in its own directory.

For example:

Data directory: `/data`

Data filename: `data.csv`

Data absolute path: `/data/data.csv`
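
Before moving on, it can help to confirm that the file is where you expect and to see its column names, since the next step refers to columns by name. The following is a minimal sketch using only the Python standard library. `read_csv_header` is a hypothetical helper written for this demo, not part of Synthius.

```python
import csv
from pathlib import Path


def read_csv_header(data_dir: str, filename: str) -> list[str]:
    """Return the column names found in the first row of data_dir/filename.

    Hypothetical helper for this demo; it is not a Synthius function.
    """
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected a CSV file at {path}")
    with path.open(newline="") as handle:
        # The first row of the CSV is assumed to hold the column names.
        header = next(csv.reader(handle), None)
    if not header:
        raise ValueError(f"{path} is empty")
    return header
```

For the layout above, `read_csv_header("/data", "data.csv")` would return the column names of `/data/data.csv`.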

### Step 3 - Prepare Field Metadata

The evaluation phase of Synthius requires some field metadata. Specifically, you must provide `key_fields`, `sensitive_fields`, and `aux_cols`.

**key_fields:** Columns treated as the primary predictors for inference-attack evaluation, chosen as a meaningful subset of the dataset.

**sensitive_fields:** Columns considered sensitive for inference attacks, selected by the user or iterated automatically by treating each feature as sensitive in turn.

**aux_cols:** Two disjoint column groups used by linkage-style attacks that attempt to match or infer one partition of attributes from the other.
 

In [1]:
key_fields = [
    "Age",
    "Education",
    "Occupation",
    "Income",
    "Marital-status",
    "Native-country",
    "Relationship",
]

sensitive_fields = ["Race", "Sex"]


aux_cols = [
    ["Occupation", "Education", "Education-num", "Hours-per-week", "Capital-loss", "Capital-gain"],
    ["Race", "Sex", "Fnlwgt", "Age", "Native-country", "Workclass", "Marital-status", "Relationship"],
]
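
Typos in column names are easy to make here, so a quick consistency check can save a failed run. The sketch below is a hypothetical helper, not part of Synthius. It assumes only what the definitions above state: every listed field should be a column of the dataset, and `aux_cols` should hold two disjoint groups. Whether Synthius enforces these rules itself is not covered here.

```python
def check_field_metadata(columns, key_fields, sensitive_fields, aux_cols):
    """Return a list of problems; an empty list means the metadata looks consistent.

    Hypothetical helper for this demo; it is not a Synthius function.
    """
    problems = []
    known = set(columns)

    # Collect every field list under a readable label.
    named_groups = {"key_fields": key_fields, "sensitive_fields": sensitive_fields}
    for i, group in enumerate(aux_cols):
        named_groups[f"aux_cols[{i}]"] = group

    # Every referenced field should exist as a column in the data.
    for name, fields in named_groups.items():
        missing = [field for field in fields if field not in known]
        if missing:
            problems.append(f"{name} references unknown columns: {missing}")

    # aux_cols is described as two disjoint column groups.
    if len(aux_cols) != 2:
        problems.append(f"aux_cols should hold exactly two groups, got {len(aux_cols)}")
    else:
        overlap = sorted(set(aux_cols[0]) & set(aux_cols[1]))
        if overlap:
            problems.append(f"aux_cols groups overlap on: {overlap}")
    return problems
```

Calling `check_field_metadata(columns, key_fields, sensitive_fields, aux_cols)` with the header row of your CSV as `columns` returns an empty list when nothing looks wrong.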

### Step 4 - Run Synthius

Now call the API, passing in the appropriate caching directories.

The synthetic data directory is specified by `synth_dir`.

The resulting metrics directory is specified by `results_dir`.

When no models are specified, the default models in Synthius are used.

Note that the random seed only controls the train-test split of the original data.
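
To illustrate what a seed-controlled split means in practice, here is a minimal standard-library sketch. It is not Synthius's implementation, and `split_indices` and `test_fraction` are hypothetical names. It shows that a fixed seed yields the same partition of row indices on every call. Any other randomness in a pipeline, such as model training, would need its own control.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a seeded generator and split them into train and test.

    Illustrative only; Synthius may split the data differently.
    """
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(10, 0.2, seed=42)
train_b, test_b = split_indices(10, 0.2, seed=42)
# The same seed reproduces the same split.
print(train_a == train_b and test_a == test_b)
```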

In [None]:
from synthius import run_synthius

run_synthius(
    original_data_filename="adult_subset.csv",
    data_dir="./data",
    synth_dir="./synthetic_data",
    models_dir="./models",
    results_dir="./metrics",
    target_column="Income",
    key_fields=key_fields,
    sensitive_fields=sensitive_fields,
    aux_cols=aux_cols,
    random_seed=42,
)
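
Once the call finishes, the synthetic data and the metrics should be in `synth_dir` and `results_dir` respectively. The exact file names depend on Synthius and the models used, so the sketch below makes no assumption about them and simply lists whatever files a directory contains. `list_outputs` is a hypothetical helper, not part of Synthius.

```python
from pathlib import Path


def list_outputs(directory: str) -> list[str]:
    """Return the sorted relative paths of all files under directory.

    Returns an empty list if the directory does not exist.
    Hypothetical helper for this demo; it is not a Synthius function.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

For example, `list_outputs("./synthetic_data")` and `list_outputs("./metrics")` show what the run above produced.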

### Step 5 (Optional) - Continue The Demo

The notebook `examples/2-synthius-models.ipynb` continues the demo by showing how to use a subset of the Synthius models.