A unified, reproducible benchmark for evaluating synthetic time series generators in finance. All results, metrics, and experiment outputs are automatically saved and organized.
- Python: 3.11+ (recommended)
- Install all dependencies:
```bash
pip install -r requirements.txt
```
The easiest way to run the complete pipeline is using Docker Compose, which orchestrates all stages from data download to evaluation and plotting.
```bash
docker-compose build base
docker-compose up
```
This command runs all services in dependency order:
- `data-download`: downloads and preprocesses the SPXUSD time series data
- `generate-data`: generates synthetic data using both parametric and non-parametric models
- `eval`: evaluates all generated data using the unified evaluator
- `plot`: generates publication-ready figures from the evaluation results
```bash
# Run only data download
docker-compose up data-download

# Run data download and generation only
docker-compose up data-download generate-data

# Run through evaluation, skip plotting
docker-compose up data-download generate-data eval
```
Set environment variables via a `.env` file or export them in your shell:
```bash
# Set the CUDA device (if using CUDA)
export CUDA_VISIBLE_DEVICES=0
```
The following local directories are mapped into containers:
- `./data` → `/data` (raw and processed data)
- `./generated_data` → `/generated_data` (synthetic data outputs)
- `./results` → `/results` (evaluation results)
- `./evaluation_plots` → `/evaluation_plots` (plots and figures)
- `./configs` → `/app/configs` (read-only configuration files)
Fetch the required dataset:
```bash
python src/data_downloader.py --index spxusd --year 2023 2024
```
This saves raw data to `data/raw/` and processed data to `data/processed/`.
Generate synthetic data using the unified script (handles both parametric and non-parametric models):
```bash
python src/generation_scripts/generate_data.py \
    --generation_length 52 \
    --num_samples 1000 \
    --seed 42 \
    --output_dir generated_data
```
The script trains models on the training set at the ACF-inferred sequence length, then generates samples by stitching log returns to reach the target generation length. Artifacts are saved under `generated_data/<ModelName>/<ModelName>_seq_<L>.pt`.
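The stitching step can be pictured as follows. This is a simplified illustration, not the repository's exact implementation; `sample_segment` is a hypothetical stand-in for a trained generator, and the segment length 13 is only an example of an ACF-inferred value:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_segment(seq_len: int, n_features: int = 1) -> np.ndarray:
    """Hypothetical stand-in for a trained generator: one (seq_len, n_features)
    block of log returns."""
    return rng.normal(0.0, 0.01, size=(seq_len, n_features))

def stitch_log_returns(generation_length: int, seq_len: int = 13) -> np.ndarray:
    """Concatenate generated blocks of log returns until the target
    generation length is reached, then truncate."""
    blocks, total = [], 0
    while total < generation_length:
        blocks.append(sample_segment(seq_len))
        total += seq_len
    return np.concatenate(blocks, axis=0)[:generation_length]

log_returns = stitch_log_returns(52)
prices = 100.0 * np.exp(np.cumsum(log_returns, axis=0))  # back to a price path
print(log_returns.shape)  # (52, 1)
```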
Evaluate all generated artifacts:
```bash
python src/unified_evaluator.py \
    --generated_dir generated_data \
    --results_dir results \
    --seq_lengths 52 60 120 180 240 300
```
Outputs are saved to:
- `results/seq_<L>/<ModelName>/metrics.json`: evaluation metrics
- `results/seq_<L>/<ModelName>/visualizations/`: visualization outputs
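Results can be post-processed directly from this layout. The sketch below builds a miniature results tree and reads it back; the metric names in the JSON are illustrative, not the evaluator's exact schema:

```python
import json
import tempfile
from pathlib import Path

# Build a miniature results tree mirroring results/seq_<L>/<ModelName>/.
# The keys below are illustrative, not the evaluator's exact JSON schema.
root = Path(tempfile.mkdtemp())
metrics_file = root / "seq_52" / "GARCH" / "metrics.json"
metrics_file.parent.mkdir(parents=True)
metrics_file.write_text(json.dumps(
    {"fidelity": {"MDD": 0.031}, "diversity": {"ICD": 1.24}}))

# Collect every metrics.json, keyed by model name.
found = {p.parent.name: json.loads(p.read_text())
         for p in root.glob("seq_*/*/metrics.json")}
print(found["GARCH"]["fidelity"]["MDD"])  # 0.031
```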
Generate comprehensive, publication-ready plots for all evaluation metrics:
```bash
python src/plot_statistics/evaluation_plotter.py
```
This automatically finds the latest evaluation results, generates publication-quality plots (300 DPI), and saves them to the `evaluation_plots/` directory.
What happens:
- Data preprocessing:
  - Non-parametric models: the data is segmented into overlapping sub-sequences of shape `(R, l, N)`, where `R` is the number of sequences, `l` is the sequence length, and `N` is the number of features.
  - Parametric models: the original time series is used without segmentation, resulting in data of shape `(l, N)`.
- Models are trained on the training set at the ACF-inferred sequence length
- Generated samples are stitched to reach target generation lengths
- All taxonomy metrics (fidelity, diversity, efficiency, and stylized facts) are computed
- Results are printed in the console and saved to detailed JSON files in the results directory
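The non-parametric segmentation step above can be sketched with NumPy's sliding windows. This illustrates the `(R, l, N)` layout only; it is not the repository's actual preprocessing code, and the sequence length used here is an arbitrary example:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T, N = 100, 2   # time steps, features
l = 52          # sequence length (illustrative; the pipeline infers it from the ACF)
series = np.random.default_rng(0).normal(size=(T, N))

# Overlapping windows over the time axis: R = T - l + 1 sub-sequences.
windows = sliding_window_view(series, window_shape=l, axis=0)  # (R, N, l)
windows = windows.transpose(0, 2, 1)                           # (R, l, N)
print(windows.shape)  # (49, 52, 2)
```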
- `configs/dataset_cfgs.yaml`: modify the preprocessing of the dataset for parametric/non-parametric models.
```bash
# Follow logs for a service
docker-compose logs -f generate-data

# Build the base image first
docker-compose build base
docker-compose up

# Run with a specific python command
docker-compose run --rm generate-data python src/generation_scripts/generate_data.py --generation_length 52

# Stop all containers
docker-compose down

# Remove volumes (WARNING: deletes data!)
docker-compose down -v

# Remove images
docker-compose down --rmi all
```

```
Unified-benchmark-for-SDGFTS-main/
├─ data/                  # Raw and preprocessed datasets
├─ notebooks/             # Validate functionality of parts of the pipeline
├─ results/               # Evaluation results (JSON files)
├─ evaluation_plots/      # Publication-ready plots (generated)
├─ src/
│  ├─ models/             # Generative model implementations
│  ├─ taxonomies/
│  │  ├─ diversity.py     # Diversity metrics (e.g., ICD, ED, DTW)
│  │  ├─ efficiency.py    # Efficiency metrics (runtime, memory)
│  │  ├─ fidelity.py      # Fidelity/feature metrics + visualization (MDD, MD, SDD, KD, ACD, t-SNE, distribution plots)
│  │  └─ stylized_facts.py  # Stylized facts metrics (tails, autocorr, volatility)
│  ├─ plot_statistics/    # Plotting functionality for evaluation results
│  │  └─ evaluation_plotter.py  # Main plotting script (executable)
│  ├─ utils/              # Configs, display, math, evaluation classes, preprocessing, etc.
│  │  └─ eval_plot_utils.py     # Utilities for evaluation plotting
│  └─ data_downloader.py  # Dataset download utility
├─ configs/               # Experiment and preprocessing config templates
├─ requirements.txt
└─ README.md
```
The benchmark supports a range of both traditional parametric models and modern deep learning approaches:
Parametric Models
- Geometric Brownian Motion (GBM)
- Ornstein-Uhlenbeck (OU) Process
- Merton Jump Diffusion (MJD)
- Double Exponential Jump Diffusion (DEJD)
- GARCH(1,1)
Non-parametric & Deep Learning Models
- TimeGAN
- QuantGAN
- TimeVAE
- Sig-WGAN
- Block Bootstrap
All models share a unified interface for training, sample generation, and comprehensive metric evaluation.
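The shared interface can be pictured roughly as follows. This is a minimal sketch assuming a `fit`/`generate` contract; the actual base classes live in `src/models/` and may differ in names and signatures, and `GaussianReturns` is a toy model invented for illustration:

```python
from abc import ABC, abstractmethod
import numpy as np

class SyntheticGenerator(ABC):
    """Hypothetical unified interface: train on an array of real data,
    then emit (num_samples, l, N) synthetic sequences."""

    @abstractmethod
    def fit(self, data: np.ndarray) -> None: ...

    @abstractmethod
    def generate(self, num_samples: int) -> np.ndarray: ...

class GaussianReturns(SyntheticGenerator):
    """Toy parametric model: i.i.d. normal log returns."""

    def fit(self, data: np.ndarray) -> None:
        self.mu, self.sigma = float(data.mean()), float(data.std())
        self.l, self.n = data.shape

    def generate(self, num_samples: int) -> np.ndarray:
        rng = np.random.default_rng(0)
        return rng.normal(self.mu, self.sigma, size=(num_samples, self.l, self.n))

model = GaussianReturns()
model.fit(np.random.default_rng(1).normal(size=(52, 1)))
samples = model.generate(100)
print(samples.shape)  # (100, 52, 1)
```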
- Feature-based Distances
- Marginal Distribution Difference (MDD)
- Mean Difference (MD)
- Standard Deviation Difference (SDD)
- Kurtosis Difference (KD)
- AutoCorrelation Difference (ACD)
- Visualization
- t-SNE Visualization
- Distribution Comparison Plots
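For intuition, the feature-based differences can be computed along these lines. This is a sketch of the definitions as plain moment differences between real and synthetic return distributions; the benchmark's exact estimators in `src/taxonomies/fidelity.py` may differ:

```python
import numpy as np

def excess_kurtosis(x: np.ndarray) -> float:
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

def feature_differences(real: np.ndarray, fake: np.ndarray) -> dict:
    """Absolute differences of simple distributional features."""
    return {
        "MD": abs(float(real.mean() - fake.mean())),              # mean difference
        "SDD": abs(float(real.std() - fake.std())),               # std-dev difference
        "KD": abs(excess_kurtosis(real) - excess_kurtosis(fake)), # kurtosis difference
    }

rng = np.random.default_rng(0)
real = rng.standard_t(df=5, size=10_000)  # heavy-tailed "real" returns
fake = rng.normal(size=10_000)            # light-tailed "synthetic" returns
diffs = feature_differences(real, fake)
```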
- Intra-Class Distance (ICD)
- Euclidean Distance (ED)
- Dynamic Time Warping (DTW)
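Intra-class distance can be sketched as the mean pairwise distance within a set of generated sequences. This is an illustrative definition only; see `src/taxonomies/diversity.py` for the one actually used:

```python
import numpy as np

def intra_class_distance(samples: np.ndarray) -> float:
    """Mean pairwise Euclidean distance between flattened sequences."""
    flat = samples.reshape(len(samples), -1)
    # Pairwise distance matrix (S, S) via broadcasting
    d = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    s = len(flat)
    return float(d.sum() / (s * (s - 1)))  # exclude the zero diagonal

rng = np.random.default_rng(0)
tight = rng.normal(0, 0.1, size=(50, 52, 1))   # low-diversity sample set
spread = rng.normal(0, 1.0, size=(50, 52, 1))  # high-diversity sample set
print(intra_class_distance(tight) < intra_class_distance(spread))  # True
```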
- Generation Time (seconds for generating 500 samples)
- Heavy Tails (Excess Kurtosis)
- Lag-1 Autocorrelation of Returns
- Volatility Clustering
- Long Memory in Volatility
- Non-Stationarity Detection
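A couple of these checks can be sketched directly with simple estimators. The implementations in `src/taxonomies/stylized_facts.py` may use different ones; here, i.i.d. normal returns serve as a baseline that exhibits neither heavy tails nor volatility clustering:

```python
import numpy as np

def lag1_autocorr(x: np.ndarray) -> float:
    """Sample lag-1 autocorrelation."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

rng = np.random.default_rng(0)
r = rng.normal(size=10_000)  # i.i.d. normal "returns" baseline

# Heavy tails: excess kurtosis of standardized returns (≈ 0 for normal data)
z = (r - r.mean()) / r.std()
kurt = float((z ** 4).mean() - 3.0)

# Volatility clustering: autocorrelation of absolute returns (≈ 0 here)
vol_cluster = lag1_autocorr(np.abs(r))
```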
Refer to `src/taxonomies/` for implementation details and to `src/utils/` for utility functions.
- Implement your model in `src/models/` and ensure it inherits from the appropriate base class (`ParametricModel` or `DeepLearningModel`).
- Register your model in `notebooks/pipeline_validation.py` by specifying it under `run_complete_evaluation`.
- Rerun the pipeline and review your results in the `results/` directory!
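As a rough sketch of the first step, a new parametric model might look like the following. The `ParametricModel` stand-in below is hypothetical; the real base class and its method signatures live in `src/models/`:

```python
import numpy as np

class ParametricModel:  # stand-in for the repository's base class
    def fit(self, data): raise NotImplementedError
    def generate(self, num_samples, seq_len): raise NotImplementedError

class MyGBM(ParametricModel):
    """Geometric Brownian Motion fitted to log returns of a price series."""

    def fit(self, data):
        log_ret = np.diff(np.log(data))
        self.mu, self.sigma = float(log_ret.mean()), float(log_ret.std())

    def generate(self, num_samples, seq_len):
        rng = np.random.default_rng(42)
        ret = rng.normal(self.mu, self.sigma, size=(num_samples, seq_len))
        return np.exp(np.cumsum(ret, axis=1))  # positive price paths

model = MyGBM()
model.fit(np.linspace(100.0, 110.0, 253))  # one year of toy daily prices
paths = model.generate(10, 52)
print(paths.shape)  # (10, 52)
```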
All results are available in:
- The console (summary tables per model)
- The `results/` directory (created automatically; JSON files contain all metrics, parameters, and evaluation outputs)
| Name | Role | Email |
|---|---|---|
| Eddison Pham | Machine Learning Researcher & Engineer | eddison.pham@mail.utoronto.ca |
| Albert Lam Ho | Quantitative Researcher | uyenlam.ho@mail.utoronto.ca |
| Yiqing Irene Huang | Research Supervisor/Professor | iy.huang@mail.utoronto.ca |
- For detailed examples and model-by-model usage, see `notebooks/`.
- To report issues or contribute, see the Contributing section below.