# 1. Define the `Population`
The `Population` class in `src\dataloaders\populations\base.py` stores information about the cohort we want to include in our dataset. Each population describes one specific cohort of people, and the dataset is later built from it. You can define multiple populations and build different datasets from different user specifications.

#### How to use the `Population`?
Preferably, you would define a new `Population` object for each new cohort.

For my dummy cohort, I defined a `UserPopulation` object in `src\dataloaders\populations\users.py` that specifies what information (and which users) to keep in the dataset. Here is what happens inside the `UserPopulation`:
1. it takes the **raw** input file containing some information about the users (for example, demographic information such as birthday and sex),
2. it preprocesses and filters users based on some criteria (for example, you may want to exclude people below a certain age),
3. it creates data splits (train, val, test).

The `UserPopulation` object runs all these processes and saves the outputs in the `data\processed\populations` folder.
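
To make the filter and split steps concrete, here is a minimal, self-contained sketch in plain Python. It is not the code of `UserPopulation`: the record layout, the helper names `filter_by_age` and `split_ids`, the 18-year threshold and the 80/10/10 split are all assumptions chosen for illustration.

```python
import random
from datetime import date

# Hypothetical raw records; the real input file and its columns may differ.
raw_users = [
    {"user_id": i, "birthday": date(1950 + i, 1, 1), "sex": "F" if i % 2 else "M"}
    for i in range(60)
]


def filter_by_age(users, min_age, reference_date):
    """Keep users who are at least `min_age` years old on `reference_date`."""
    kept = []
    for user in users:
        born = user["birthday"]
        # Subtract one year if the birthday has not happened yet this year.
        not_yet = (reference_date.month, reference_date.day) < (born.month, born.day)
        age = reference_date.year - born.year - int(not_yet)
        if age >= min_age:
            kept.append(user)
    return kept


def split_ids(user_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle ids with a fixed seed and cut them into train/val/test."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }


adults = filter_by_age(raw_users, min_age=18, reference_date=date(2020, 1, 1))
splits = split_ids([user["user_id"] for user in adults])
print({name: len(ids) for name, ids in splits.items()})
```

Splitting on user ids with a fixed seed keeps every user in exactly one split and makes the split reproducible across runs.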


In [None]:
# Define the Population object
from src.dataloaders.populations.users import UserPopulation
users = UserPopulation()
## run the preprocessing part
users.population()
## create datasplits 
users.data_split()
## You can also just run the prepare function to do both of the above steps
users.prepare() 

After you run these commands for the first time, you will see that the results are saved in `data\processed\populations`.
The next time you run the same functions, the object reads the data from the `data\processed\populations` folder instead of recalculating everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to empty the `data\processed\populations` folder. The same applies when you change the code in `src\dataloaders\populations\users.py`: the `Population` object saves its `arguments` so that it can validate that you are calling a specific version of the object.
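
The caching idea can be sketched as follows. This is only an illustration under stated assumptions, not the mechanism used in the repository: the function `cached_compute`, the JSON file layout and the choice to raise a `ValueError` on mismatching arguments are all hypothetical, and the sketch only fingerprints the arguments (it does not detect code changes).

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_compute(cache_dir, name, arguments, compute):
    """Return (result, from_cache). Reuse a saved result only if it was
    produced with the same arguments; refuse to continue otherwise."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    result_path = cache_dir / f"{name}.json"
    args_path = cache_dir / f"{name}.arguments.json"
    # Stable fingerprint of the arguments (keys sorted before hashing).
    fingerprint = hashlib.sha256(
        json.dumps(arguments, sort_keys=True).encode("utf-8")
    ).hexdigest()

    if result_path.exists() and args_path.exists():
        saved = json.loads(args_path.read_text())
        if saved["fingerprint"] != fingerprint:
            raise ValueError(
                f"Cached '{name}' was built with different arguments; "
                f"empty {cache_dir} to recompute."
            )
        return json.loads(result_path.read_text()), True

    result = compute(**arguments)
    result_path.write_text(json.dumps(result))
    args_path.write_text(json.dumps({"fingerprint": fingerprint, "arguments": arguments}))
    return result, False


calls = []


def count_adults(min_age):
    """Stand-in for an expensive preprocessing step."""
    calls.append(min_age)
    return {"n_users": 53, "min_age": min_age}


with tempfile.TemporaryDirectory() as tmp:
    first, from_cache_1 = cached_compute(tmp, "population", {"min_age": 18}, count_adults)
    second, from_cache_2 = cached_compute(tmp, "population", {"min_age": 18}, count_adults)
```

In this sketch the second call skips `count_adults` entirely and reads the saved file, which is the behaviour that matters for very big datasets.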

# 2. Define the `TokenSource`
The `TokenSource` class in `src\data\sources\base.py` specifies how to process a specific *source* of data (for example, the *Synthetic Labor* dataset). You have to define a `TokenSource` object for each new data source. For example, I have defined a `SyntheticLaborSource` class that is tailored specifically to the *Synthetic Labor Dataset*.

If any of the variables is continuous and you need to bin it, you can use the `Binned` class (see the example in `src\dataloaders\sources\synth_labor.py`).
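
As a rough illustration of what binning a continuous variable into tokens means, here is a standard-library sketch. It does not reproduce the `Binned` class: the helpers `fit_bin_edges` and `to_token`, the quantile-based edges and the `INCOME_<bin>` token format are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_bin_edges(values, n_bins):
    """Return the n_bins - 1 inner quantile cut points of `values`."""
    return quantiles(values, n=n_bins, method="inclusive")


def to_token(value, edges, prefix):
    """Map a continuous value to a categorical token such as 'INCOME_2'."""
    return f"{prefix}_{bisect_right(edges, value)}"


# Hypothetical continuous variable (for example, monthly income in thousands).
incomes = [12.0, 18.5, 25.0, 31.0, 40.0, 55.0, 72.0, 90.0]
edges = fit_bin_edges(incomes, n_bins=4)
tokens = [to_token(x, edges, "INCOME") for x in incomes]
print(edges)
print(tokens)
```

Quantile edges give bins with roughly equal numbers of observations, so no single token dominates the vocabulary. Fixed-width bins are an equally valid choice when the scale itself is meaningful.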

All in all, a `TokenSource` makes it easier to process data from different data streams (or datasets) and to specify how each variable is converted to tokens.

#### How to use `TokenSource`?
For example, `SyntheticLaborSource` (in `src\dataloaders\sources\synth_labor.py`) specifies how to tokenize the file `data\rawdata\synth_labor.csv`.


In [None]:
from src.dataloaders.sources.synth_labor import SyntheticLaborSource

In [None]:
## First we initialize the TokenSource instance.
synth_labor = SyntheticLaborSource()
## process the raw file (and maybe do some preprocessing)
# synth_labor.parsed()
## index the files
# synth_labor.indexed()
## tokenize the files
# synth_labor.tokenized()
### Or use the prepare function to do all of the above
synth_labor.prepare()

After you run these commands for the first time, you will see that the results are saved in `data\processed\sources`.
The next time you run the same functions, the object reads the data from the `data\processed\sources` folder instead of recalculating everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to delete the *corresponding* files in the `data\processed\sources` folder. The same applies when you change the code in `src\dataloaders\sources\synth_labor.py`: the `SyntheticLaborSource` object saves its `arguments` so that it can validate them on future runs.

# 3. Define Corpus, Vocabulary and Task
### 3.1. Let's assemble a corpus
Now we can reuse both the `Population` and the `TokenSource` objects to actually create a dataset (based on the specifications in `UserPopulation` and `SyntheticLaborSource`).

The `Corpus` object in `src\dataloaders\datamodule.py` takes all the specifications and creates a dataset (i.e. creates sentences out of tabular records). It also saves data in the corresponding data splits.
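
The idea of turning tabular records into sentences can be sketched in a few lines. This is a simplified illustration, not the `Corpus` implementation: the record layout, the helper `build_sentences`, and the rule "one record becomes one sentence, ordered by day" are assumptions.

```python
from collections import defaultdict

# Hypothetical tokenized records: one row per (user, day) event.
records = [
    {"user_id": 1, "day": 3, "tokens": ["JOB_teacher", "INCOME_2"]},
    {"user_id": 1, "day": 1, "tokens": ["JOB_student", "INCOME_0"]},
    {"user_id": 2, "day": 2, "tokens": ["JOB_nurse", "INCOME_1"]},
]
# Hypothetical data splits, as produced by a population object.
splits = {"train": [1], "val": [2], "test": []}


def build_sentences(records, user_ids):
    """Group records by user, order them chronologically and join the
    tokens of each record into one sentence."""
    wanted = set(user_ids)
    per_user = defaultdict(list)
    for row in sorted(records, key=lambda r: (r["user_id"], r["day"])):
        if row["user_id"] in wanted:
            per_user[row["user_id"]].append(" ".join(row["tokens"]))
    return dict(per_user)


corpus_by_split = {name: build_sentences(records, ids) for name, ids in splits.items()}
print(corpus_by_split["train"])
```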

In [None]:
from src.dataloaders.datamodule import Corpus

In [None]:
corpus = Corpus(population=users, sources=[synth_labor], name="synthetic")
corpus.prepare()

In [None]:
corpus.combined_sentences(split="train").head()
#corpus.combined_sentences(split="val").head()
#corpus.combined_sentences(split="test").head()

### 3.2. Let's create the vocabulary
The `CorpusVocabulary` object takes the information about the `Corpus` (i.e. sentences that exist in your `train` dataset) and creates a vocabulary! Here you can choose to remove words that appear with low frequency.
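
A frequency-filtered vocabulary can be sketched as below. This is an illustration only: the function `build_vocabulary`, the `min_count` parameter and the list of special tokens (other than `[SEP]`) are assumptions and may not match what `CorpusVocabulary` does.

```python
from collections import Counter

# Assumed special tokens; the actual vocabulary may use a different set.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]


def build_vocabulary(sentences, min_count=1):
    """Count tokens in the training sentences and keep those that appear
    at least `min_count` times; special tokens get the first indices."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    kept = sorted(token for token, count in counts.items() if count >= min_count)
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + kept)}


train_sentences = [
    "JOB_student INCOME_0",
    "JOB_teacher INCOME_2",
    "JOB_teacher INCOME_2",
    "JOB_nurse INCOME_2",
]
token2index = build_vocabulary(train_sentences, min_count=2)
print(token2index)
```

Building the vocabulary from the `train` split only avoids leaking information about validation and test data into the model inputs. Tokens that are filtered out are typically mapped to an unknown token at encoding time.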

In [None]:
from src.dataloaders.vocabulary import CorpusVocabulary

In [None]:
vocab = CorpusVocabulary(corpus, name="synthetic")
vocab.prepare()

### 3.3. Let's specify the task
The `Task` object in `src\dataloaders\tasks\base.py` specifies how to further process the data before feeding it into the model. For example, in the case of the `MLM` task (specified in `src\dataloaders\tasks\pretrain.py`), we specify the data augmentation procedures, as well as how to mask tokens (and create targets for the prediction task).
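
To show what masking tokens and creating targets looks like, here is a minimal sketch. It is not the `MLM` task from the repository: the helper `mask_tokens` is hypothetical, it always substitutes `[MASK]` (the actual task may also keep or randomly replace some selected tokens), and it uses `None` as the "no target" marker.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio, rng, special=("[CLS]", "[SEP]")):
    """Replace a fraction of the non-special tokens with [MASK].

    Returns (inputs, targets): targets hold the original token at masked
    positions and None everywhere else.
    """
    candidates = [i for i, token in enumerate(tokens) if token not in special]
    n_mask = max(1, round(len(candidates) * mask_ratio)) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    inputs = [MASK if i in chosen else token for i, token in enumerate(tokens)]
    targets = [token if i in chosen else None for i, token in enumerate(tokens)]
    return inputs, targets


tokens = ["[CLS]"] + [f"TOK_{i}" for i in range(20)] + ["[SEP]"]
inputs, targets = mask_tokens(tokens, mask_ratio=0.1, rng=random.Random(0))
print(inputs)
print(targets)
```

The model only receives `inputs`; the loss is computed at the positions where `targets` is not `None`.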

In [1]:
from src.dataloaders.tasks.pretrain import MLM

In [2]:
## Specify the task we are going to use with the data
task = MLM(
    name="pretrain",
    max_length=200,  # the maximum length of the sequence
    no_sep=False,  # you can decide to create data with or without the [SEP] token
    # Augmentation
    p_sequence_timecut=0.0,
    p_sequence_resample=0.01,
    p_sequence_abspos_noise=0.1,
    p_sequence_hide_background=0.01,
    p_sentence_drop_tokens=0.01,
    shuffle_within_sentences=True,
    # MLM specific options
    mask_ratio=0.1,
)

# 4. DataModule
The `L2VDataModule` object takes the `Corpus`, the `Task` and the `CorpusVocabulary` and assembles the inputs that we then feed to the model. To learn more about datamodules, see the [PyTorch Lightning: LightningDataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html) documentation.
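
Conceptually, assembling model inputs means encoding tokens as vocabulary indices, padding or truncating to a fixed length, and grouping sequences into batches. The sketch below shows that idea with plain Python lists. It is not the `L2VDataModule` code: the helpers `encode` and `batches`, the toy vocabulary and the `[PAD]`/`[UNK]` tokens are assumptions.

```python
def encode(tokens, token2index, max_length, pad="[PAD]", unk="[UNK]"):
    """Convert tokens to indices, truncate to max_length and pad the rest.

    Returns (ids, padding_mask) where the mask is 1 for real tokens.
    """
    ids = [token2index.get(token, token2index[unk]) for token in tokens[:max_length]]
    n_pad = max_length - len(ids)
    padding_mask = [1] * len(ids) + [0] * n_pad
    return ids + [token2index[pad]] * n_pad, padding_mask


def batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical vocabulary and tokenized sequences.
token2index = {"[PAD]": 0, "[UNK]": 1, "JOB_teacher": 2, "INCOME_2": 3}
sequences = [
    ["JOB_teacher", "INCOME_2"],
    ["JOB_nurse"],
    ["INCOME_2", "INCOME_2", "JOB_teacher", "JOB_teacher", "INCOME_2"],
]
encoded = [encode(seq, token2index, max_length=4) for seq in sequences]
batched = list(batches(encoded, batch_size=2))
print(batched)
```

In a real datamodule the lists would be tensors and the batching would be handled by a data loader, but the shapes and the role of the padding mask are the same.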

In [None]:
from src.dataloaders.datamodule import L2VDataModule

In [None]:
## On the first initialization of the datamodule, we do not have a vocabulary object
datamodule = L2VDataModule(corpus, batch_size=4, task=task, vocabulary=vocab, num_workers=4)

In [None]:
datamodule.prepare()

In [None]:
# MORE ANNOTATIONS TO COME

# 5. Add model

In [None]:
### use code from the experiments
from src.models.pretrain import TransformerEncoder
from pytorch_lightning import Trainer

In [None]:
hparams = {
    "hidden_size": 96,  # size of the hidden layers and embeddings
    "hidden_ff": 128,  # size of the position-wise feed-forward layer
    "n_encoders": 4,  # number of encoder blocks
    "n_heads": 8,  # number of attention heads in the multiheadattention module
    "n_local": 2,  # number of local attention heads
    "local_window_size": 4,  # size of the window for local attention
    "max_length": task.max_length,  # maximum length of the input sequence
    "vocab_size": vocab.size(),  # size of the vocabulary
    "num_classes": "",  
    "cls_num_targs": 3, # number of classes for the SOP class (we have 3: original, reversed, shuffled)
    "learning_rate": 0.001,
    "batch_size": 4,
    "num_epochs": 30,
    "device": 'cuda',
    "attention_type": "performer",
    "norm_type": "rezero",
    "num_random_features": 32,  # number of random features for the Attention module (Performer uses this)
    "parametrize_emb": True,  # whether to center the token embedding matrix
    "emb_dropout": 0.1,  # dropout for the embedding block
    "fw_dropout": 0.1,  # dropout for the position-wise feed-forward layer
    "att_dropout": 0.1,  # dropout for the multiheadattention module
    "dc_dropout": 0.1,  # dropout for the decoder block
    "hidden_act": "swish",  # activation function for the hidden layers (attention layers use ReLU)
    "optimizer": "adam",  # optimizer to use
    "training_task": "mlm",
    "weight_tying": True,
    "norm_output_emb": True,
    "epsilon": 1e-8,
    "weight_decay": 0.01,
    "beta1": 0.9,
    "beta2": 0.999,
}


In [None]:
l2v = TransformerEncoder(hparams=hparams)

In [None]:
trainer = Trainer(max_epochs=hparams["num_epochs"])

In [None]:
trainer.fit(model=l2v, datamodule=datamodule)

In [None]:
## More annotations to come.