# 1. Define the `Population`
The `Population` class in `src\dataloaders\populations\base.py` stores information about the cohort we want to use for our dataset. Based on a population, we then create a dataset that contains that specific cohort of people. You can define multiple populations and create different datasets from different user specifications.

#### How to use the `Population`?
Preferably, you would define a new `Population` object for each new cohort.

For my dummy cohort, I defined a `UserPopulation` object in `src\dataloaders\populations\users.py` that specifies what information (and which users) to keep in the dataset. Here is what happens inside `UserPopulation`:
1. it takes the **raw** input file containing some information about the users (for example, demographic information such as birthday and sex),
2. it preprocesses and filters users based on some criteria (for example, you may want to exclude people below a certain age),
3. it creates data splits (train, val, test).

The `UserPopulation` object runs all these processes and saves the outputs in the `data\processed\populations` folder.
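
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the `min_age` threshold, the split fractions and the helper names (`age_on`, `filter_users`, `split_users`) are hypothetical and are not taken from `UserPopulation`.

```python
import random
from datetime import date

# Step 1: hypothetical raw records; the real input file and its columns may differ.
raw_users = [
    {"USER_ID": 1, "BIRTHDAY": date(1957, 6, 25), "SEX": "Male"},
    {"USER_ID": 2, "BIRTHDAY": date(2010, 1, 3), "SEX": "Female"},
    {"USER_ID": 3, "BIRTHDAY": date(1985, 11, 9), "SEX": "Female"},
    {"USER_ID": 4, "BIRTHDAY": date(1990, 4, 17), "SEX": "Male"},
    {"USER_ID": 5, "BIRTHDAY": date(1975, 8, 30), "SEX": "Male"},
]


def age_on(birthday, reference):
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (birthday.month, birthday.day)
    return reference.year - birthday.year - (0 if had_birthday else 1)


def filter_users(users, reference, min_age=18):
    """Step 2: keep only users who are at least `min_age` years old."""
    return [u for u in users if age_on(u["BIRTHDAY"], reference) >= min_age]


def split_users(user_ids, seed=0, val_frac=0.25, test_frac=0.25):
    """Step 3: shuffle user ids reproducibly and cut them into train/val/test."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }


kept = filter_users(raw_users, reference=date(2020, 12, 31))
splits = split_users([u["USER_ID"] for u in kept])
```

Splitting by user id (rather than by record) keeps all records of one person in a single split, which is the usual reason for defining splits at the population level.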


In [1]:
# Define the Population object
from src.dataloaders.populations.users import UserPopulation
users = UserPopulation()
## run the preprocessing part
users.population()
## create datasplits 
users.data_split()
## You can also just run the prepare function to do both of the above steps
users.prepare() 

After you run these commands for the first time, the results are saved in `data\processed\populations`.
The next time you run the same functions, the object reads the data from the `data\processed\populations` folder instead of recomputing everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to empty the `data\processed\populations` folder. This also applies when you change the code in `src\dataloaders\populations\users.py`, because the `Population` object saves its `arguments` so that it can validate that you are calling a specific version of the `Population` object.
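
The caching behaviour described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function `cached_compute`, the file names `_arguments.json` and `result.json`, and the choice to raise an error on a mismatch are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def cached_compute(folder, arguments, compute):
    """Reuse a saved result when the saved arguments match; otherwise compute.

    The file names used here are hypothetical.
    """
    folder = Path(folder)
    args_file = folder / "_arguments.json"
    result_file = folder / "result.json"

    if args_file.exists() and result_file.exists():
        saved = json.loads(args_file.read_text())
        if saved != arguments:
            # One possible policy: refuse to mix results from different settings.
            raise ValueError("Cached arguments differ: empty the folder to recompute.")
        return json.loads(result_file.read_text())

    folder.mkdir(parents=True, exist_ok=True)
    result = compute(**arguments)
    result_file.write_text(json.dumps(result))
    args_file.write_text(json.dumps(arguments))
    return result


calls = []  # records how often the expensive function really runs


def expensive(min_age):
    calls.append(min_age)
    return {"kept_users": [1, 3, 4], "min_age": min_age}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_compute(tmp, {"min_age": 18}, expensive)   # computes and saves
    second = cached_compute(tmp, {"min_age": 18}, expensive)  # reads from disk
```

In this sketch the second call never reaches `expensive`, which is the saving that matters for very big datasets.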

# 2. Define the `TokenSource`
The `TokenSource` class in `src\dataloaders\sources\base.py` specifies how to process a specific *source* of data (for example, the *Synthetic Labor* dataset). You have to define a `TokenSource` object for each new data source. For example, I have defined a `SyntheticLaborSource` class that is tailored specifically to the *Synthetic Labor* dataset.

If any of the variables is continuous and you need to bin it, you can use the `Binned` class (see the example in `src\dataloaders\sources\synth_labor.py`).

All in all, a `TokenSource` makes it easier to process data from different data streams (or datasets) and to specify how each variable is converted to tokens.
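
To give a feel for what binning a continuous variable into tokens means, here is a small standard-library sketch. It assumes quantile-based bins and tokens of the form `PREFIX_<bin number>` starting at 1; the actual `Binned` class may compute edges and name tokens differently, and the helper names below are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_bin_edges(values, n_bins):
    """Compute the n_bins - 1 inner cut points from the observed values."""
    return quantiles(values, n=n_bins)


def to_token(prefix, value, edges):
    """Map a continuous value to a token such as 'INCOME_3' (bins start at 1)."""
    return f"{prefix}_{bisect_right(edges, value) + 1}"


# Hypothetical continuous incomes.
incomes = [12.0, 18.5, 23.0, 31.0, 40.0, 55.5, 61.0, 78.0, 90.0, 120.0]
edges = fit_bin_edges(incomes, n_bins=4)
tokens = [to_token("INCOME", x, edges) for x in incomes]
```

The edges would normally be fitted on the training data only and then reused unchanged for the validation and test splits.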

#### How to use `TokenSource`?
For example, `SyntheticLaborSource` (in `src\dataloaders\sources\synth_labor.py`) specifies how to tokenize the `data\rawdata\synth_labor.csv`.


In [2]:
from src.dataloaders.sources.synth_labor import SyntheticLaborSource

In [3]:
## First we initialize the TokenSource instance.
synth_labor = SyntheticLaborSource()
## process the raw file (and maybe do some preprocessing)
# synth_labor.parsed()
## index the files
# synth_labor.indexed()
## tokenize the files
# synth_labor.tokenized()
### Or use the prepare function to do all of the above
synth_labor.prepare()

After you run these commands for the first time, the results are saved in `data\processed\sources`.
The next time you run the same functions, the object reads the data from the `data\processed\sources` folder instead of recomputing everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to delete the *corresponding* file in the `data\processed\sources` folder. This also applies when you change the code in `src\dataloaders\sources\synth_labor.py`, because the `SyntheticLaborSource` object saves its `arguments` so that it can validate them on future runs.

# 3. Define Corpus, Vocabulary and Task
### 3.1. Let's assemble a corpus
Now we can reuse both the `Population` and `TokenSource` objects to actually create a dataset (based on the specifications in `UserPopulation` and `SyntheticLaborSource`).

The `Corpus` object in `src\dataloaders\datamodule.py` takes all the specifications and creates a dataset (i.e. creates sentences out of tabular records). It also saves data in the corresponding data splits.

In [4]:
from src.dataloaders.datamodule import Corpus

In [5]:
corpus = Corpus(population=users, sources=[synth_labor], name="synthetic")
corpus.prepare()

In [6]:
corpus.combined_sentences(split="train").head()
#corpus.combined_sentences(split="val").head()
#corpus.combined_sentences(split="test").head()

| USER_ID | RECORD_DATE | SENTENCE | BIRTHDAY | SEX | AGE | AFTER_THRESHOLD |
|---|---|---|---|---|---|---|
| 1 | 122 | INCOME_45 CITY_6 OCC_29 IND_32 | 1957-06-25 | Male | 63 | False |
| 1 | 271 | INCOME_21 CITY_18 OCC_91 IND_20 | 1957-06-25 | Male | 63 | False |
| 1 | 284 | INCOME_2 CITY_21 OCC_29 IND_18 | 1957-06-25 | Male | 63 | False |
| 1 | 292 | INCOME_31 CITY_17 OCC_93 IND_15 | 1957-06-25 | Male | 63 | False |
| 1 | 326 | INCOME_2 CITY_3 OCC_48 IND_17 | 1957-06-25 | Male | 63 | False |
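
The `SENTENCE` column in the output above suggests how a tabular record becomes a sentence: each value is prefixed with its variable name and the tokens are joined by spaces. A minimal sketch of that idea follows; the input rows, the column list and the helper `record_to_sentence` are hypothetical, and `Corpus` itself may build sentences differently.

```python
def record_to_sentence(record, columns):
    """Turn one tabular row into a space-separated 'sentence' of tokens."""
    return " ".join(f"{col}_{record[col]}" for col in columns)


# Hypothetical already-indexed rows; column names mirror the token prefixes.
rows = [
    {"USER_ID": 1, "RECORD_DATE": 122, "INCOME": 45, "CITY": 6, "OCC": 29, "IND": 32},
    {"USER_ID": 1, "RECORD_DATE": 271, "INCOME": 21, "CITY": 18, "OCC": 91, "IND": 20},
]
token_columns = ["INCOME", "CITY", "OCC", "IND"]

sentences = [
    {
        "USER_ID": row["USER_ID"],
        "RECORD_DATE": row["RECORD_DATE"],
        "SENTENCE": record_to_sentence(row, token_columns),
    }
    for row in rows
]
```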


### 3.2. Let's create the vocabulary
The `CorpusVocabulary` object takes the information about the `Corpus` (i.e. sentences that exist in your `train` dataset) and creates a vocabulary! Here you can choose to remove words that appear with low frequency.

In [7]:
from src.dataloaders.vocabulary import CorpusVocabulary

In [8]:
vocab = CorpusVocabulary(corpus, name="synthetic")
vocab.prepare()
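
Conceptually, building a vocabulary with a frequency cut-off amounts to counting tokens in the training sentences and assigning ids to those that are frequent enough. The sketch below illustrates this; the list of special tokens, the `min_count` parameter and the helper names are assumptions, not the `CorpusVocabulary` API.

```python
from collections import Counter

# Assumed set of special tokens; only [SEP] is mentioned in this notebook.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]


def build_vocabulary(sentences, min_count=1):
    """Map each token seen at least `min_count` times to an integer id."""
    counts = Counter(token for s in sentences for token in s.split())
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + kept)}


def encode(sentence, token_to_id):
    """Replace tokens by ids; unknown or dropped tokens become [UNK]."""
    return [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in sentence.split()]


train_sentences = [
    "INCOME_45 CITY_6 OCC_29 IND_32",
    "INCOME_21 CITY_18 OCC_91 IND_20",
    "INCOME_2 CITY_21 OCC_29 IND_18",
]
token_to_id = build_vocabulary(train_sentences, min_count=2)
ids = encode("INCOME_45 OCC_29", token_to_id)
```

With `min_count=2`, only `OCC_29` survives among the regular tokens here, so `INCOME_45` is encoded as `[UNK]`.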

### 3.3. Let's specify the task
The `Task` object in `src\dataloaders\tasks\base.py` specifies how to further process the data before feeding it into the model. For example, in the case of the `MLM` task (specified in `src\dataloaders\tasks\pretrain.py`), we specify the data augmentation procedures as well as how to mask tokens (and create targets for the prediction task).

In [9]:
from src.dataloaders.tasks.pretrain import MLM

In [10]:
## Specify the task we are going to use with the data
task = MLM(
    name="mlm",
    max_length=200,  # the maximum length of the sequence
    no_sep=False,  # you can decide to create data with or without the [SEP] token
    # Augmentation
    p_sequence_timecut=0.0,
    p_sequence_resample=0.01,
    p_sequence_abspos_noise=0.1,
    p_sequence_hide_background=0.01,
    p_sentence_drop_tokens=0.01,
    shuffle_within_sentences=True,
    # MLM specific options
    mask_ratio=0.1,
)
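
To make the role of `mask_ratio` more concrete, here is a simplified sketch of masking for a masked-language-modelling task. It is an illustration only: the `[MASK]` token name, the `IGNORE` marker and the function `mask_tokens` are assumptions, and the real `MLM` task may, for example, also replace some selected tokens with random ones.

```python
import random

MASK = "[MASK]"  # assumed name of the mask token
IGNORE = None    # marker for positions that are not prediction targets


def mask_tokens(tokens, mask_ratio, rng):
    """Hide roughly `mask_ratio` of the tokens and keep the originals as targets."""
    if not tokens:
        return [], []
    n_to_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = [tok if i in positions else IGNORE for i, tok in enumerate(tokens)]
    return inputs, targets


tokens = (
    "INCOME_45 CITY_6 OCC_29 IND_32 INCOME_21 "
    "CITY_18 OCC_91 IND_20 INCOME_2 CITY_21"
).split()
inputs, targets = mask_tokens(tokens, mask_ratio=0.1, rng=random.Random(0))
```

The model sees `inputs` and is trained to recover the original token at every position where `targets` is not `IGNORE`.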

# 4. DataModule
The `L2VDataModule` object takes the `Corpus`, `Task` and `CorpusVocabulary` and assembles the inputs that we then provide to the model. To learn more about datamodules, see the [PyTorch Lightning: LightningDataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html) documentation.

In [11]:
from src.dataloaders.datamodule import L2VDataModule

In [12]:
## On the first initialization of the datamodule, we do not have a vocabulary object
datamodule = L2VDataModule(corpus, batch_size=2, task=task, vocabulary=vocab, num_workers=1)

In [13]:
datamodule.prepare()

In [14]:
# MORE ANNOTATIONS TO COME

In [15]:
datamodule.batch_size

2

# 5. Add model

In [16]:
### use code from the experiments
from src.models.pretrain import TransformerEncoder
from pytorch_lightning import Trainer

In [17]:
hparams = {
    "hidden_size": 96,  # size of the hidden layers and embeddings
    "hidden_ff": 128,  # size of the position-wise feed-forward layer
    "n_encoders": 4,  # number of encoder blocks
    "n_heads": 8,  # number of attention heads in the multiheadattention module
    "n_local": 2,  # number of local attention heads
    "local_window_size": 4,  # size of the window for local attention
    "max_length": task.max_length,  # maximum length of the input sequence
    "vocab_size": vocab.size(),  # size of the vocabulary
    "num_classes": -1,  
    "cls_num_targs": 3, # number of classes for the SOP class (we have 3: original, reversed, shuffled)
    "learning_rate": 0.001,
    "batch_size": datamodule.batch_size,
    "num_epochs": 30,
    "device": 'cuda',
    "attention_type": "performer",
    "norm_type": "rezero",
    "num_random_features": 32,  # number of random features for the Attention module (Performer uses this)
    "parametrize_emb": True,  # whether to center the token embedding matrix
    "emb_dropout": 0.1,  # dropout for the embedding block
    "fw_dropout": 0.1,  # dropout for the position-wise feed-forward layer
    "att_dropout": 0.1,  # dropout for the multiheadattention module
    "dc_dropout": 0.1,  # dropout for the decoder block
    "hidden_act": "swish",  # activation function for the hidden layers (attention layers use ReLU)
    "optimizer": "adam",  # optimizer to use
    "training_task": task.name,
    "weight_tying": True,
    "norm_output_emb": True,
    "epsilon": 1e-8,
    "weight_decay": 0.01,
    "beta1": 0.9,
    "beta2": 0.999,
}


In [18]:
l2v = TransformerEncoder(hparams=hparams)

In [19]:
trainer = Trainer(max_epochs=hparams["num_epochs"])

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default


In [20]:
trainer.fit(model=l2v, datamodule=datamodule)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name        | Type                | Params
----------------------------------------------------
0 | transformer | Transformer         | 276 K 
1 | mlm_decoder | MaskedLanguageModel | 37.6 K
2 | sop_decoder | SOP_Decoder         | 9.6 K 
3 | sop_loss    | CrossEntropyLoss    | 0     
4 | mlm_loss    | CrossEntropyLoss    | 0     
----------------------------------------------------
323 K     Trainable params
0         Non-trainable params
323 K     Total params
1.294     Total estimated model params size (MB)


data/processed/datasets/synthetic/mlm/_arguments
Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


                                                                           

/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (35) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 3:   3%|▎         | 1/35 [00:00<00:04,  7.81it/s, v_num=3] 

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/carlomarx/anaconda3/envs/torch/lib/python3.11/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/carlomarx/life2vec-light/src/dataloaders/dataset.py", line 93, in __getitem__
    content = self._transform(content)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/carlomarx/life2vec-light/src/dataloaders/dataset.py", line 52, in _transform
    return self.transform(x)
           ^^^^^^^^^^^^^^^^^
  File "/home/carlomarx/life2vec-light/src/dataloaders/tasks/base.py", line 94, in preprocessor
    x = self.augment_document(x, is_train=is_train)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/carlomarx/life2vec-light/src/dataloaders/tasks/base.py", line 124, in augment_document
    document = resample_document(document)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/carlomarx/life2vec-light/src/dataloaders/augment.py", line 76, in resample_document
    num_to_remove = np.random.randint(
                    ^^^^^^^^^^^^^^^^^^
  File "numpy/random/mtrand.pyx", line 782, in numpy.random.mtrand.RandomState.randint
  File "numpy/random/_bounded_integers.pyx", line 1334, in numpy.random._bounded_integers._rand_int64
ValueError: high <= 0
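
The traceback above ends in `np.random.randint` with `ValueError: high <= 0`, which NumPy raises when the exclusive upper bound is not positive. One plausible reading is that `resample_document` occasionally receives a document that is too short to resample; since `p_sequence_resample = 0.01` triggers this path only rarely, the error can surface a few epochs into training. The sketch below reproduces the failure and shows one possible guard. The helper `draw_num_to_remove` and the assumed form of the failing call are hypothetical and may not match `augment.py`.

```python
import numpy as np


def draw_num_to_remove(n_items, rng=np.random):
    """Draw how many items to drop, returning 0 when nothing can be dropped.

    Hypothetical helper: it assumes the failing call has the form
    `np.random.randint(n_items - 1)`.
    """
    upper = n_items - 1  # exclusive upper bound; always keep at least one item
    if upper <= 0:       # short document: randint(upper) would raise
        return 0
    return int(rng.randint(upper))


# The unguarded call fails with a ValueError, as in the traceback.
try:
    np.random.randint(0)
    raised = False
except ValueError:
    raised = True

short_doc = draw_num_to_remove(1)   # guarded: returns 0 instead of raising
long_doc = draw_num_to_remove(10)   # some integer in [0, 9)
```

If the cause is indeed short documents, another workaround to try is setting `p_sequence_resample=0.0` in the task definition, which should skip this augmentation.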


In [None]:
## More annotations to come.