# Performance analysis of synthetic data generators

This notebook applies multiple synthetic data generators to simulate a real dataset, the Complete Journey. We provide step-by-step instructions for using each generator and analyze their performance in terms of fidelity, utility, and privacy.

In [1]:
from syncomp import metrics

## Real Data Exploration

> The Complete Journey dataset characterizes household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are captured.

For a deep dive into the dataset exploration and data cleaning, please refer to the notebook [Complete Journey Data Exploration](https://github.com/RetailMarketingAI/retailsynth/blob/main/analysis_workflow/1_complete_journey_eda/1_preprocess_analysis.ipynb).

In [None]:
# TODO: show distributions of the key categorical and numerical variables used to evaluate the models.

## Models
1. StaSy
2. TabDDPM
3. StaSy-AutoDiff
4. Tab-AutoDiff
5. RetailSynth (optional): note that RetailSynth focuses on imitating transactions only, so it does not necessarily synthesize categorical features such as customer demographics.

In [None]:
# TODO: need to finalize the list of models with Chi-Hua
# TODO: add shell command to run training jobs

## Metrics

### Fidelity
1. Wasserstein distance of probability distributions for the following numerical features:
    - Store visit probability
    - Product purchase probability
    - Product demand
    - Basket size
    - Time between purchases
    - Price elasticity
2. Jensen-Shannon divergence between distributions for the following categorical features:
    - Customer demographics (age, household size, etc.)
    - Store information (state, city, etc.)
3. Pearson correlation between the listed numerical features
4. Theil's U between the listed categorical features
5. Correlation ratio between the listed categorical and numerical features
6. Business insights
    - Customer segmentation
    - Customer retention
    - Category penetration
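
The two distributional distances listed above can be sketched with SciPy. This is a minimal illustration, not the notebook's final implementation: the helper names `numerical_fidelity` and `categorical_fidelity` and the toy columns are hypothetical, and the real features would come from the Complete Journey data and each generator's output.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance


def numerical_fidelity(real_values, synthetic_values):
    """1-D Wasserstein distance between two samples of one numerical feature."""
    return float(wasserstein_distance(real_values, synthetic_values))


def categorical_fidelity(real_labels, synthetic_labels):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between category frequencies."""
    real_counts = Counter(real_labels)
    synthetic_counts = Counter(synthetic_labels)
    # Align both frequency vectors on the union of observed categories.
    categories = sorted(set(real_counts) | set(synthetic_counts))
    p = np.array([real_counts[c] for c in categories], dtype=float)
    q = np.array([synthetic_counts[c] for c in categories], dtype=float)
    # SciPy normalizes p and q and returns the JS *distance* (the square root
    # of the divergence), so square it to obtain the divergence.
    return float(jensenshannon(p, q, base=2) ** 2)


# Hypothetical stand-ins for a real and a synthetic feature column.
rng = np.random.default_rng(0)
real_basket_size = rng.poisson(8, size=1000)
synthetic_basket_size = rng.poisson(9, size=1000)
print("Wasserstein:", numerical_fidelity(real_basket_size, synthetic_basket_size))

age_groups = ["19-24", "25-34", "35-44"]
real_age = rng.choice(age_groups, size=1000, p=[0.2, 0.5, 0.3])
synthetic_age = rng.choice(age_groups, size=1000, p=[0.3, 0.4, 0.3])
print("Jensen-Shannon:", categorical_fidelity(real_age, synthetic_age))
```

Both values are 0 when the two samples have identical distributions. The Wasserstein distance is expressed in the units of the feature, so it may be worth scaling features before comparing distances across them.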

In [None]:
# TODO: compute the metrics above for all models.
# TODO: find existing implementations of the metrics above, potentially in the lab repo
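
Until an existing implementation is located, the two association measures from the Fidelity list that pandas does not provide directly (Theil's U and the correlation ratio) could be sketched as follows. The function names are hypothetical, and the formulas are the textbook definitions rather than anything confirmed by the lab repo. Pearson correlation is already available through `pandas.DataFrame.corr`.

```python
import numpy as np
import pandas as pd


def entropy(labels: pd.Series) -> float:
    """Shannon entropy (natural log) of a categorical series."""
    p = labels.value_counts(normalize=True).to_numpy()
    p = p[p > 0]  # guard against zero-count categories
    return float(-(p * np.log(p)).sum())


def theils_u(x: pd.Series, y: pd.Series) -> float:
    """Uncertainty coefficient U(x | y): share of the entropy of x explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so it is trivially determined
    # Conditional entropy H(x | y) = sum over values v of p(y = v) * H(x | y = v).
    h_x_given_y = sum(
        len(group) / len(x) * entropy(group) for _, group in x.groupby(y)
    )
    return float((h_x - h_x_given_y) / h_x)


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) between a categorical and a numerical series."""
    overall_mean = values.mean()
    total_ss = ((values - overall_mean) ** 2).sum()
    if total_ss == 0:
        return 0.0  # no variance to explain
    stats = values.groupby(categories).agg(["mean", "count"])
    between_ss = (stats["count"] * (stats["mean"] - overall_mean) ** 2).sum()
    return float(np.sqrt(between_ss / total_ss))


# Hypothetical toy columns; both series must share the same index.
household_size = pd.Series(["1", "1", "2", "2", "3", "3"])
income_band = pd.Series(["low", "low", "mid", "mid", "high", "mid"])
basket_size = pd.Series([4.0, 5.0, 9.0, 8.0, 14.0, 12.0])
print("Theil's U:", theils_u(income_band, household_size))
print("Correlation ratio:", correlation_ratio(household_size, basket_size))
```

Theil's U is asymmetric, so `theils_u(x, y)` and `theils_u(y, x)` generally differ. A fidelity comparison would compute each measure on the real and on the synthetic data and then compare the two results.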

### Utility
We define a classification task and a regression task to assess whether statistical ML models can learn patterns from the synthetic data. The classification task is to predict whether a customer will make a purchase at a given time step. The regression task is to predict a customer's total demand or revenue at a given time step.
1. Classification task
    - Accuracy
    - F1 score
    - ROC AUC
2. Regression task
    - Mean Squared Error
    - R2 score
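
The listed scores can be computed with scikit-learn. The sketch below assumes a "train on synthetic, test on real" protocol, which is one common way to run such an evaluation and is an assumption here rather than something this notebook specifies. The models, the helper names, and the toy data generator `make_toy_data` are likewise hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


def classification_utility(X_train, y_train, X_test, y_test):
    """Fit a classifier on the training set and score it on the test set."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predicted = model.predict(X_test)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "f1": f1_score(y_test, predicted),
        "roc_auc": roc_auc_score(y_test, scores),
    }


def regression_utility(X_train, y_train, X_test, y_test):
    """Fit a regressor on the training set and score it on the test set."""
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    return {
        "mse": mean_squared_error(y_test, predicted),
        "r2": r2_score(y_test, predicted),
    }


rng = np.random.default_rng(0)


def make_toy_data(n_rows):
    """Hypothetical stand-in for customer features and per-time-step targets."""
    X = rng.normal(size=(n_rows, 3))
    noise = rng.normal(scale=0.5, size=n_rows)
    purchased = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
    demand = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=n_rows)
    return X, purchased, demand


# Assumed protocol: train on synthetic rows, evaluate on held-out real rows.
X_syn, purchased_syn, demand_syn = make_toy_data(500)
X_real, purchased_real, demand_real = make_toy_data(500)
clf_scores = classification_utility(X_syn, purchased_syn, X_real, purchased_real)
reg_scores = regression_utility(X_syn, demand_syn, X_real, demand_real)
print(clf_scores)
print(reg_scores)
```

Running the same two functions with real training data gives a baseline against which the synthetic-data scores can be compared.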

We also define an unsupervised learning task to assess whether clustering algorithms can learn patterns from the synthetic data. The clustering task is to cluster customers based on their purchase behavior.
1. Silhouette score of the clusters found on synthetic data, compared with that of the clusters found on real data
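
One way to read this metric, assumed here, is to cluster the real and the synthetic customers separately and compare the two silhouette scores. The sketch uses k-means from scikit-learn. The function name, the number of clusters, and the toy customer features are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def clustering_silhouette(features, n_clusters=3, seed=0):
    """Cluster the rows with k-means and return the mean silhouette score."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return float(silhouette_score(features, labels))


def make_toy_customers(rng, n_per_segment=50, spread=1.0):
    """Hypothetical purchase-behavior features drawn from three segments."""
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    return np.vstack(
        [c + rng.normal(scale=spread, size=(n_per_segment, 2)) for c in centers]
    )


rng = np.random.default_rng(0)
real_customers = make_toy_customers(rng)
synthetic_customers = make_toy_customers(rng, spread=1.5)

real_score = clustering_silhouette(real_customers)
synthetic_score = clustering_silhouette(synthetic_customers)
print("real:", real_score, "synthetic:", synthetic_score)
```

A synthetic score close to the real one suggests the generator preserved a similar cluster structure, although similar scores alone do not show that the same customer segments were recovered.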

### Privacy
1. Distance to Closest Record (DCR)
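
A minimal sketch of this metric, assuming Euclidean distance over numerically encoded and comparably scaled features: for each synthetic record, find the distance to its nearest real record. The function name and the toy arrays are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def distance_to_closest_record(real, synthetic):
    """Euclidean distance from each synthetic row to its nearest real row."""
    tree = cKDTree(np.asarray(real, dtype=float))
    distances, _ = tree.query(np.asarray(synthetic, dtype=float), k=1)
    return distances


# Hypothetical encoded records: the first synthetic row copies a real row exactly.
real_records = np.array([[0.0, 0.0], [3.0, 4.0]])
synthetic_records = np.array([[0.0, 0.0], [3.0, 0.0]])
dcr = distance_to_closest_record(real_records, synthetic_records)
print(dcr)
print("share of exact copies:", float(np.mean(dcr == 0)))
```

Distances at or near zero flag synthetic records that replicate real ones, so the distribution of these distances is usually examined rather than only its mean.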

In [None]:
# TODO: find existing implementations of the metrics above, potentially in the lab repo