# Synthetic Multimodal Data Generation

In this notebook we will learn how to generate synthetic multimodal data.
The synthetic data is based on Markov chains. Each hypothetical "client" is represented by a chain of events sampled from a Markov process. The transformation matrices are sampled for each client individually from a sphere of a given radius.
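
The idea of a per-client chain can be sketched in plain NumPy. This is only an illustration, not the `ptls` implementation: the helper names `sample_transition_matrix` and `sample_chain` are hypothetical, and mapping a point on the sphere to a row-stochastic matrix through a row-wise softmax is an assumption made here for the example.

```python
import numpy as np

def sample_transition_matrix(n_states, radius, rng):
    """Hypothetical helper: draw a point uniformly on a sphere of the given
    radius in R^(n_states * n_states), then turn it into a row-stochastic
    matrix with a row-wise softmax (an assumed mapping, not taken from ptls)."""
    point = rng.normal(size=n_states * n_states)
    point *= radius / np.linalg.norm(point)  # project onto the sphere
    logits = point.reshape(n_states, n_states)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def sample_chain(matrix, length, rng):
    """Sample a state sequence from a first-order Markov chain."""
    n_states = matrix.shape[0]
    states = np.empty(length, dtype=np.int64)
    states[0] = rng.integers(n_states)  # uniform initial state
    for t in range(1, length):
        states[t] = rng.choice(n_states, p=matrix[states[t - 1]])
    return states

rng = np.random.default_rng(0)
matrix = sample_transition_matrix(n_states=8, radius=8.0, rng=rng)
chain = sample_chain(matrix, length=20, rng=rng)
```

Under this assumed mapping, a larger radius gives larger logits and therefore more peaked transition rows.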

The generator can also be configured so that every client has multiple chains, each representing a different modality. In this setting one chain may depend on another. When a dependency is set, the transformation matrix of the dependent chain is replaced with a transformation tensor. During sequence generation, the transformation matrix of the dependent chain is selected from this tensor based on the state of the master chain.
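
A minimal sketch of the dependency mechanism, under stated assumptions: the tensor has shape `(n_master_states, n_states, n_states)`, and the matrix for step `t` is picked by the master state at the same step `t` (the library may use a different alignment). The names `row_stochastic` and `sample_dependent_chain` are hypothetical, and the master sequence here is a random stand-in rather than a real chain.

```python
import numpy as np

def row_stochastic(logits):
    """Softmax over the last axis, so every row sums to 1."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def sample_dependent_chain(tensor, master_states, rng):
    """tensor[m] is the transition matrix used while the master chain is in state m."""
    n_states = tensor.shape[1]
    states = np.empty(len(master_states), dtype=np.int64)
    states[0] = rng.integers(n_states)
    for t in range(1, len(master_states)):
        matrix = tensor[master_states[t]]  # choose the matrix by master state
        states[t] = rng.choice(n_states, p=matrix[states[t - 1]])
    return states

rng = np.random.default_rng(0)
n_master, n_dep = 8, 8
tensor = row_stochastic(rng.normal(size=(n_master, n_dep, n_dep)))
master = rng.integers(n_master, size=20)  # stand-in for the master chain
dependent = sample_dependent_chain(tensor, master, rng)
```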

This synthetic data supports target labels based on the position of the transformation matrices on the sphere. The label may depend on one or several transformation matrices (tensors).
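
To make "labeling by position on the sphere" concrete, here is one possible rule, chosen purely for illustration: the label is the sign of a projection onto fixed directions, and only a leading fraction of each chain's coordinates contributes. The function `hemisphere_label`, the fixed `directions`, and the 64-dimensional points are all assumptions; the rule the library actually uses may differ.

```python
import numpy as np

def sample_on_sphere(dim, radius, rng):
    """Uniform point on a sphere: normalize a Gaussian vector, then scale."""
    v = rng.normal(size=dim)
    return radius * v / np.linalg.norm(v)

def hemisphere_label(points, fractions, directions):
    """Binary label from the sign of a projection (illustrative rule only)."""
    score = 0.0
    for name, point in points.items():
        k = int(round(fractions[name] * point.size))  # leading coordinates that count
        score += float(point[:k] @ directions[name][:k])
    return int(score > 0)

rng = np.random.default_rng(0)
radii = {"A": 8.0, "B": 16.0, "C": 16.0}
fractions = {"A": 1.0, "B": 0.0, "C": 0.2}  # share of coordinates used per chain
points = {name: sample_on_sphere(64, r, rng) for name, r in radii.items()}
directions = {name: rng.normal(size=64) for name in radii}
label = hemisphere_label(points, fractions, directions)
```

With a fraction of `0.0`, chain B contributes nothing to the score, so moving its point on the sphere cannot change the label.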

In [1]:
# let's import some libraries and configure our work environment

%load_ext autoreload
%autoreload 2

import multiprocessing
multiprocessing.set_start_method('fork')

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

import warnings
import numpy as np
import pandas as pd


from functools import partial
from ptls.data_load.datasets import SyntheticDataset, ParquetFiles, ParquetDataset
from ptls.frames.supervised import SeqToTargetDataset, SeqToTargetIterableDataset, SequenceToTarget
from ptls.frames import PtlsDataModule


from ptls.data_load.iterable_processing import SeqLenFilter
from ptls.frames.coles import ColesDataset
from ptls.frames.coles.split_strategy import SampleSlices
from ptls.data_load.datasets import Config, MonoTargetSyntheticDatasetWriter

warnings.filterwarnings('ignore')



In [2]:
def get_config():
    '''
    First, we must define how many Markov chains our data will contain,
    how many states each one will have, the radius of the sphere they are sampled from,
    and the noise level of every chain.
    '''
    chain_confs = {
        # n_states, sphere_R, noise
        "A": (8, 8, 0),
        "B": (8, 16, 0.2),
        "C": (8, 16, 0.5)
    }
    '''
    One might think of the noise as a second "shadow" Markov chain.
    Here, for example, we have chain "C" with noise level 0.5.
    Under the hood there exist two chains, C and C', each with its own unique transformation matrix:
    C1 C2 C3 C4 C5 ...
    C'1 C'2 C'3 C'4 C'5 ...
    Noise level p means that in the generated result each state Ci
    is replaced with state C'i with probability p.
    Chains C and C' share a common state space.
    '''

    
    # next we define which chains depend on one another
    state_from = ["A"]
    state_to = ["B"] # chain B will depend on chain A

    
    # now we configure the impact of every chain on the labeling
    labeling_conf = {
        0: {"A": 1.,  # share of the matrix coordinates that affect labeling (1. means all of them)
            "B": 0.,
            "C": 0.2} # 20% of chain C matrix will affect resulting label
    }
    

    config = Config(chain_confs, state_from, state_to, labeling_conf)
    return config
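
The "shadow chain" description of noise in the docstring above can be sketched as follows. This is a standalone illustration: `mix_with_shadow` is a hypothetical helper, and the two input sequences are random stand-ins rather than real Markov chains.

```python
import numpy as np

def mix_with_shadow(chain, shadow, p, rng):
    """Replace each state of `chain` with the state of `shadow` at the same
    position, independently with probability p."""
    chain = np.asarray(chain)
    shadow = np.asarray(shadow)
    replace = rng.random(chain.shape) < p  # True where the shadow state wins
    return np.where(replace, shadow, chain)

rng = np.random.default_rng(0)
c = rng.integers(8, size=1000)         # stand-in for chain C
c_shadow = rng.integers(8, size=1000)  # stand-in for shadow chain C'
clean = mix_with_shadow(c, c_shadow, p=0.0, rng=rng)  # no noise: identical to c
noisy = mix_with_shadow(c, c_shadow, p=0.5, rng=rng)  # about half the positions come from C'
```

With `p=0` the output is the original chain, which corresponds to chain "A" in the config above.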

In [3]:
writer = MonoTargetSyntheticDatasetWriter(config=get_config(),
                                          path="syndata/example_data", # path to save generated data
                                          seq_len=360,  # length of each sequence times number of chains
                                          n_train_files=50, # number of generated train files
                                          n_eval_files=10,   # number of generated validation files
                                          n_test_files=0,    # number of generated test files
                                          train_per_file=256*4,  # number of clients in each train file
                                          eval_per_file=256*4,   # number of clients in each validation file
                                          test_per_file=256*4,   # number of clients in each test file
                                          save_config_name="config.cfg", # name for saving cfg object
                                          n_procs=16  # number of parallel processes to use for generation
                                         )

In [4]:
writer.write_dataset()  # actual data generation

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:45<00:00,  2.11s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00,  2.13s/it]
0it [00:00, ?it/s]


In [5]:
file = pd.read_parquet("syndata/example_data/train/train_0.parquet")
file.head()

Unnamed: 0,A,B,C,event_time,class_label
0,"[21, 40, 7, 60, 34, 19, 24, 2, 23, 63, 60, 34,...","[26, 22, 51, 26, 17, 12, 32, 1, 12, 39, 60, 38...","[31, 57, 10, 21, 41, 10, 22, 48, 2, 21, 45, 42...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",0
1,"[57, 11, 24, 2, 17, 13, 45, 45, 45, 43, 24, 0,...","[18, 23, 57, 9, 12, 37, 44, 34, 17, 15, 59, 30...","[61, 43, 28, 37, 40, 2, 22, 53, 41, 9, 9, 14, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",1
2,"[37, 47, 60, 37, 44, 37, 41, 13, 44, 37, 40, 3...","[47, 61, 40, 7, 57, 14, 49, 15, 57, 12, 32, 1,...","[24, 2, 19, 29, 41, 15, 56, 2, 20, 32, 2, 20, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",0
3,"[43, 30, 50, 16, 2, 23, 60, 38, 49, 13, 44, 32...","[26, 17, 8, 7, 61, 43, 31, 61, 44, 33, 15, 61,...","[0, 5, 41, 8, 6, 53, 41, 8, 6, 53, 45, 47, 61,...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",0
4,"[38, 54, 53, 45, 41, 11, 29, 45, 41, 8, 5, 42,...","[51, 25, 11, 27, 30, 49, 9, 8, 2, 23, 63, 59, ...","[63, 59, 26, 17, 13, 44, 34, 17, 15, 63, 61, 4...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",1


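We can see that each sequence contains more distinct values than the number of states we configured. This is because the writer saves transitions from one state to another instead of raw states. With 8 states per chain there are 8 × 8 = 64 possible transitions, which matches the values between 0 and 63 in the output above. Keep this in mind when working with the data.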
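
One encoding that is consistent with the values shown above is `code = previous_state * n_states + next_state`. This formula is inferred from the displayed output, not confirmed from the library source, and the helpers below are hypothetical. Under this assumption, the raw states `2, 5, 0, 7, 4, 2` reproduce the first five values of column `A` in the first row.

```python
import numpy as np

N_STATES = 8  # number of states per chain in the config above

def encode_transitions(states, n_states=N_STATES):
    """Encode each consecutive (previous, next) pair as a single integer."""
    states = np.asarray(states)
    return states[:-1] * n_states + states[1:]

def decode_transitions(codes, n_states=N_STATES):
    """Recover the (previous, next) state pairs from transition codes."""
    prev, nxt = np.divmod(np.asarray(codes), n_states)
    return prev, nxt

states = np.array([2, 5, 0, 7, 4, 2])
codes = encode_transitions(states)     # [21, 40, 7, 60, 34]
prev, nxt = decode_transitions(codes)  # prev = [2, 5, 0, 7, 4], nxt = [5, 0, 7, 4, 2]
```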