# Time-Series Generation using Contrastive Learning

Consider learning a generative model for time-series data.

The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories.

On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts.

On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess.

In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) *transition policy*, where the reinforcement signal is provided by a global (but stepwise-decomposable) *energy model* trained by contrastive estimation.

At **training**, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. 

At **inference**, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.

By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies *“generation by imitation”*. Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm.

Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.

## 1 Setup

### 1.1 Install libraries

Run the cell below to **install** the necessary libraries.

In [None]:
%pip install wandb
%pip install pytorch-lightning
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install scikit-learn

Replace or remove the commands below to match your machine (for example, a different CUDA version or a CPU-only build).

In [None]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
%pip install cuda-python

In [1]:
import torch
print(torch.__version__)

2.2.0+cu118


### 1.2 Import Libraries

Run the cells below to **import** the necessary libraries.

In [None]:
import random
import numpy as np
import torch

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_lightning import Trainer
# from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint

In [None]:
import wandb
from pytorch_lightning.loggers.wandb import WandbLogger

Suppress warnings to keep the notebook output readable.

In [None]:
import warnings
warnings.filterwarnings("ignore")

### 1.3 Hyper-parameters

The cell below loads *all* the hyper-parameters needed by this notebook, for easy tweaking.

In [None]:
from hyperparamets import Config
hparams = Config()
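
The `Config` class lives in `hyperparamets.py` and is not shown in this notebook. As a rough orientation, a minimal sketch of such a container might look like the following. The field names are the ones this notebook reads from `hparams`; the class name `ExampleConfig` and every default value are assumptions made for illustration, not the project's actual settings.

```python
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for hyperparamets.Config (all defaults are assumed)."""

    # Data
    dataset_name: str = "sine"         # one of: sine, wien, iid, cov, real
    dataset_folder: str = "../datasets/"
    data_dim: int = 5                  # features per time step
    num_samples: int = 10_000          # length of the generated stream
    seq_len: int = 24                  # time steps per sequence
    noise_dim: int = 5                 # dimension of the generator noise
    train_test_split: float = 0.7      # fraction kept before the test split
    train_val_split: float = 0.8       # fraction kept before the validation split

    # Training
    n_epochs: int = 10
    early_stop_patience: int = 5

    # Anomaly-detector tests (meaning and types of h and alpha are assumed)
    h: int = 10
    alpha: float = 0.05
    operating_system: str = "linux"


# Override only the fields that differ from the defaults.
example_hparams = ExampleConfig(dataset_name="wien", n_epochs=3)
print(example_hparams.dataset_name, example_hparams.n_epochs)
```

Keeping every setting in one object like this means the rest of the notebook only ever reads attributes, so changing an experiment comes down to editing a single place.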

Uncomment the line below to log in to Weights & Biases. Leave it commented if you are already logged in or do not want to log the run.

In [None]:
#!wandb login

### 1.5 Initialization

Initialize the modules needed by running the cells in this section.

#### 1.5.1 Reproducibility

In [None]:
np.random.seed(0)
random.seed(0)

torch.cuda.manual_seed(0)
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True # Can have performance impact
torch.backends.cudnn.benchmark = False

_ = pl.seed_everything(0)
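
A quick way to see what seeding buys you is to draw from a seeded generator twice and compare the results. The sketch below uses only Python's `random` module so that it runs anywhere; the helper name `draw` is hypothetical and not part of this project.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, global state untouched
    return [rng.random() for _ in range(n)]


same = draw(0) == draw(0)       # identical seed, identical sequence
different = draw(0) != draw(1)  # a different seed gives a different sequence
print(same, different)
```

The same idea applies to NumPy and PyTorch: fixing the seed fixes the sequence of random numbers, although some GPU kernels can still be non-deterministic, which is why the cuDNN flags are set above.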

#### 1.5.2 Device

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device {device}.")

#### 1.5.3 Data

Generate the dataset selected by `hparams.dataset_name`.

In [None]:
import os
from hyperparamets import Config
print(os.listdir('.'))
hparams = Config()

Path to the folder containing the datasets.

In [None]:
datasets_folder = hparams.dataset_folder

In [None]:
from data_generation import iid_sequence_generator, sine_process, wiener_process

In [None]:
if hparams.dataset_name in ['sine', 'wien', 'iid', 'cov']:
  # Generate and store the dataset as requested
  dataset_path = f"../datasets/{hparams.dataset_name}_generated_stream.csv"
  if hparams.dataset_name == 'sine':
    sine_process.save_sine_process(p=hparams.data_dim, N=hparams.num_samples, file_path=dataset_path)
  elif hparams.dataset_name == 'wien':
    wiener_process.save_wiener_process(p=hparams.data_dim, N=hparams.num_samples, file_path=dataset_path)
  elif hparams.dataset_name == 'iid':
    iid_sequence_generator.save_iid_sequence(p=hparams.data_dim, N=hparams.num_samples, file_path=dataset_path)
  elif hparams.dataset_name == 'cov':
    iid_sequence_generator.save_cov_sequence(p=hparams.data_dim, N=hparams.num_samples, file_path=dataset_path)
  else:
    raise ValueError
  print(f"The {hparams.dataset_name} dataset has been successfully created and stored into:\n\t- {dataset_path}")
elif hparams.dataset_name == 'real':
  pass
else:
  raise ValueError("Dataset not supported.")
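
The generator modules in `data_generation` are not shown in this notebook. To give an idea of what the `'sine'` and `'wien'` options could produce, here is a minimal sketch using only the standard library. The function names, the frequency range, the unit-variance increments, and the one-row-per-time-step CSV layout are all assumptions; the project's `save_sine_process` and `save_wiener_process` may differ.

```python
import csv
import io
import math
import random


def sine_stream(p, N, seed=0):
    """N time steps of p sinusoids with random frequency and phase (assumed form)."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.01, 0.1) for _ in range(p)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(p)]
    return [[math.sin(2.0 * math.pi * f * t + ph) for f, ph in zip(freqs, phases)]
            for t in range(N)]


def wiener_stream(p, N, seed=0):
    """N time steps of p independent random walks with N(0, 1) increments."""
    rng = random.Random(seed)
    state = [0.0] * p
    rows = []
    for _ in range(N):
        # Build a new list each step so earlier rows are not overwritten.
        state = [x + rng.gauss(0.0, 1.0) for x in state]
        rows.append(state)
    return rows


def to_csv_text(rows):
    """Serialize rows as comma-separated text, one time step per line."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


sine = sine_stream(p=3, N=100)
walk = wiener_stream(p=3, N=100)
print(len(sine), len(sine[0]), len(walk), len(walk[0]))
print(to_csv_text(sine[:2]))
```

Here `p` plays the role of `hparams.data_dim` and `N` the role of `hparams.num_samples`, mirroring the keyword arguments used in the cell above.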

Split the data into training, validation, and test sets.

In [None]:
from dataset_handling import train_test_split

In [None]:
if hparams.dataset_name in ['sine', 'wien', 'iid', 'cov']:
    train_dataset_path = f"{datasets_folder}{hparams.dataset_name}_training.csv"
    test_dataset_path = f"{datasets_folder}{hparams.dataset_name}_testing.csv"
    val_dataset_path  = f"{datasets_folder}{hparams.dataset_name}_validating.csv"

    # Train & Test
    train_test_split(X=np.loadtxt(dataset_path, delimiter=",", dtype=np.float32),
                    split=hparams.train_test_split,
                    train_file_name=train_dataset_path,
                    test_file_name=test_dataset_path    
                    )

    # Train & Validation
    train_test_split(X=np.loadtxt(train_dataset_path, delimiter=",", dtype=np.float32),
                    split=hparams.train_val_split,
                    train_file_name=train_dataset_path,
                    test_file_name=val_dataset_path    
                    )
    
    print(f"The {hparams.dataset_name} dataset has been split successfully into:\n\t- {train_dataset_path}\n\t- {val_dataset_path}")
elif hparams.dataset_name == 'real':
    train_dataset_path = datasets_folder + hparams.train_file_name
    test_dataset_path  = datasets_folder + hparams.test_file_name
    val_dataset_path   = datasets_folder + hparams.val_file_name
else:
  raise ValueError("Dataset not supported.")
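
The split above is applied in two stages: the full stream is first divided into training and test parts, and the training part is then divided again into training and validation parts. The sketch below illustrates how the resulting sizes compose. It assumes a chronological, unshuffled split where `split` is the fraction kept in the first part; the helper `chronological_split` is hypothetical, and the project's `train_test_split` may behave differently.

```python
def chronological_split(rows, split):
    """Split rows into (head, tail) where head holds the first `split` fraction."""
    if not 0.0 < split < 1.0:
        raise ValueError("split must lie strictly between 0 and 1")
    cut = int(len(rows) * split)
    return rows[:cut], rows[cut:]


stream = list(range(1000))                        # stand-in for 1000 time steps
train, test = chronological_split(stream, 0.7)    # stage 1: train vs. test
train, val = chronological_split(train, 0.8)      # stage 2: train vs. validation
print(len(train), len(val), len(test))
```

With these example fractions the final training set holds 0.7 × 0.8 = 56% of the stream, the validation set 14%, and the test set 30%. Keeping the order intact avoids leaking future time steps into the training data.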

### 1.6 Model

This cell loads the TimeGAN model class.

In [None]:
from timegan_model import TimeGAN

## 2 Train

This section trains the model according to the hyper-parameters defined in [Hyper-parameters](#13-hyper-parameters).

In [None]:
# Instantiate the model
timegan = TimeGAN(hparams=hparams,
                    train_file_path=train_dataset_path,
                    val_file_path=val_dataset_path
                    )

In [None]:
# Define the logger -> https://www.wandb.com/articles/pytorch-lightning-with-weights-biases.
wandb_logger = WandbLogger(project="TimeGAN PyTorch (2024)", log_model=True)
wandb_logger.experiment.watch(timegan, log='all', log_freq=100)

In [None]:
# Define the early-stopping callback and the trainer
early_stop = EarlyStopping(
    monitor="val_loss",
    mode="min",
    patience=hparams.early_stop_patience,
    strict=False,
    verbose=False
)
trainer = Trainer(logger=wandb_logger,
                max_epochs=hparams.n_epochs,
                val_check_interval=0.10,
                callbacks=[early_stop],  # without this, early stopping is never applied
                )
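
To make the role of `patience` concrete, the sketch below replays the basic rule behind early stopping in plain Python: training stops once the monitored value has failed to improve for `patience` consecutive validation checks. The helper `checks_until_stop` and the loss values are made up for illustration, and the sketch ignores options such as `min_delta`.

```python
def checks_until_stop(val_losses, patience):
    """Return the 1-based validation check at which training would stop, or None."""
    best = float("inf")
    bad_checks = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss        # improvement: remember it and reset the counter
            bad_checks = 0
        else:
            bad_checks += 1    # no improvement at this check
            if bad_checks >= patience:
                return i
    return None


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]
stop_at = checks_until_stop(losses, patience=3)
print(stop_at)
```

Note that the counter advances per validation check rather than per epoch. With `val_check_interval=0.10`, validation runs roughly ten times per epoch, so a given patience is exhausted much sooner than it would be with one check per epoch.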

In [None]:
# Start the training
trainer.fit(timegan)

In [None]:
# Log the trained model
trainer.save_checkpoint('timegan.pth')
wandb.save('timegan.pth')

## 3 Testing

### 3.1 Linear Deterministic Anomaly Detector

In these tests, the model generates sequences and a deterministic PCA-based anomaly detector (AD) scans them for irregularities with respect to the real sequences.

Import the deterministic anomaly-detection models.

In [None]:
from anomaly_detection import anomaly_detector_api as AD_API
import dataset_handling as dh

In [None]:
import pandas as pd

Get the test dataset and precompute the noise.

In [None]:
test_dataset = dh.RealDataset(
                file_path=test_dataset_path,
                seq_len=hparams.seq_len
            )
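
The test loop further down iterates over `test_dataset` and receives a real sequence `X` together with a noise sequence `Z` for each item. The `RealDataset` class is not shown here, so the following is only a sketch of how such pairs could be built from a stream. It assumes non-overlapping windows of length `seq_len` and standard Gaussian noise; the helper `make_windows` is hypothetical and the actual class may window or sample differently.

```python
import random


def make_windows(stream, seq_len, noise_dim, seed=0):
    """Cut `stream` (a list of time steps) into non-overlapping (X, Z) pairs."""
    rng = random.Random(seed)
    pairs = []
    # Step by seq_len so windows do not overlap; drop an incomplete last window.
    for start in range(0, len(stream) - seq_len + 1, seq_len):
        X = stream[start:start + seq_len]
        Z = [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
             for _ in range(seq_len)]
        pairs.append((X, Z))
    return pairs


stream = [[float(t), float(-t)] for t in range(10)]   # 10 time steps, 2 features
pairs = make_windows(stream, seq_len=4, noise_dim=3)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1][0]))
```

Precomputing `Z` once, rather than sampling it inside the test loop, keeps the generated sequences identical across repeated evaluations.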


Initialize the metrics.

In [None]:
FAR_tot = 0.0 # False Alarm Rate (on nominal data)
TAR_tot = 0.0 # True Alarm Rate (on synthetic data)

Working folder and file paths.

In [None]:
AD_folder = "./src/anomaly_detection/"
AD_offline_path = f"{datasets_folder}{hparams.dataset_name}_testing_AD_offline.csv"
AD_online_path  = f"{datasets_folder}{hparams.dataset_name}_testing_AD_online.csv"

Since the model is trained on normalized data, we must train the AD on normalized data as well.

In [None]:
if hparams.operating_system != 'windows':
    df = pd.DataFrame( np.transpose(test_dataset.get_whole_stream().numpy()) )
    df.to_csv(AD_offline_path, index=False, header=False)

    # train AD
    AD_API.pca_offline(AD_offline_path, folder=AD_folder)
else:
    print("The PCA-based Anomaly Detector related tests are not currently supported for this operating system.")

Compute the anomaly rate on the real data, which is the AD's false alarm rate.

In [None]:
if hparams.operating_system != 'windows':
    anomalies_found = AD_API.pca_online(file_path=AD_offline_path, folder=AD_folder, h=hparams.h, alpha=hparams.alpha)
    FAR_tot += anomalies_found

    # delete the temporary file
    os.remove(AD_offline_path)
else:
    print("The PCA-based Anomaly Detector related tests are not currently supported for this operating system.")

Run tests on generated sequences.

In [None]:
if hparams.operating_system != 'windows':
    for idx, (X, Z) in enumerate(test_dataset):
        # Get the synthetic sequence
        Z_seq = Z.reshape(1, hparams.seq_len, hparams.noise_dim)
        X_seq = timegan(Z_seq).detach().reshape(hparams.seq_len, hparams.data_dim)

        # save synthetic sequence to a file
        X_seq = np.transpose(X_seq.numpy())
        df = pd.DataFrame(X_seq)
        df.to_csv(AD_online_path, index=False, header=False)

        # run simulation
        anomalies_found = AD_API.pca_online(file_path=AD_online_path, folder=AD_folder, h=hparams.h, alpha=hparams.alpha)
        TAR_tot += anomalies_found
    TAR_tot /= len(test_dataset)

    # delete the temporary file
    os.remove(AD_online_path)
    AD_API.cleanup_files()

else:
    print("The PCA-based Anomaly Detector related tests are not currently supported for this operating system.")

Show the results.

In [None]:
print(f"Anomalies found on real data: {round(FAR_tot*100, 2)}%")
print(f"Anomalies found on fake data: {round(TAR_tot*100, 2)}%")

### 3.2 Other Metrics

## 4 Visualize Results