# 1. Define the Populations
First, we use `src\dataloaders\populations\users.py` to define the Population object. Here, we
1. provide a **raw** input file containing some information about the users (for example, demographic information such as birthday),
2. preprocess and filter users based on some criteria (for example, to exclude people beyond a certain age),
3. create data splits (train, val, test).

As you create the object and call its methods, the Population object runs all these processes and saves the outputs in the `data\processed\populations` folder.
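
The three steps above can be sketched in plain Python. This is only an illustration and not the code in `users.py`: the record fields, the `max_age` cutoff, the split fractions, and the function names `filter_by_age` and `split_ids` are all hypothetical.

```python
import random
from datetime import date

# Hypothetical raw records; the real input is a file read by the Population object.
raw_users = [
    {"user_id": i, "birthday": date(1940 + (i * 7) % 60, 1, 1)}
    for i in range(100)
]

def filter_by_age(users, max_age, today):
    """Keep users whose age in whole years is at most max_age."""
    kept = []
    for u in users:
        b = u["birthday"]
        # Subtract one year if the birthday has not happened yet this year.
        age = today.year - b.year - ((today.month, today.day) < (b.month, b.day))
        if age <= max_age:
            kept.append(u)
    return kept

def split_ids(user_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle ids reproducibly and cut them into train/val/test."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # the remainder goes to test
    }

cohort = filter_by_age(raw_users, max_age=65, today=date(2020, 1, 1))
splits = split_ids(u["user_id"] for u in cohort)
```

Splitting on user ids (rather than on individual records) keeps all data of one person inside a single split.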

### Why do we need the Populations object?

It defines the cohort of users that we want to work with.
Based on the populations, we then create a dataset that contains specific cohorts of people. You can define multiple populations and create different datasets based on different specifications of users.

In [None]:
# Define the Population object
from src.dataloaders.populations.users import UserPopulation
users = UserPopulation()
## run the preprocessing part
users.population()
## create datasplits 
users.data_split()
## You can also just run the prepare function to do both of the above steps
users.prepare() 

After you run these commands for the first time, you will see that the results are saved in `data\processed\populations`. 
The next time you run the same functions, the object reads the data from the `data\processed\populations` folder instead of recalculating everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to empty the `data\processed\populations` folder. This also applies when you change the code in `src\dataloaders\populations\users.py`, because the Population object saves its `arguments` so that it can validate that you are calling a specific version of the Population object.
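
The caching idea described above can be sketched as follows. This is an assumption about the general pattern, not the repository's implementation: the helper `cached_compute`, the file layout, and the choice to raise an error on mismatching arguments are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_compute(cache_dir, name, arguments, compute):
    """Return the cached output if the saved arguments match, else compute and save."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    result_path = cache_dir / f"{name}.json"
    args_path = cache_dir / f"{name}.arguments.json"
    # Fingerprint of the arguments, independent of key order.
    fingerprint = hashlib.sha256(
        json.dumps(arguments, sort_keys=True).encode("utf-8")
    ).hexdigest()

    if result_path.exists() and args_path.exists():
        saved = json.loads(args_path.read_text())
        if saved["fingerprint"] != fingerprint:
            # Assumption: refuse to silently reuse output built with other arguments.
            raise ValueError(f"Cached '{name}' was built with different arguments")
        return json.loads(result_path.read_text())

    result = compute(**arguments)
    result_path.write_text(json.dumps(result))
    args_path.write_text(json.dumps({"fingerprint": fingerprint}))
    return result

calls = []

def expensive(max_age):
    """Stand-in for a slow preprocessing step; records every real call."""
    calls.append(max_age)
    return {"n_users": 10 * max_age}

with tempfile.TemporaryDirectory() as tmp:
    first = cached_compute(tmp, "users", {"max_age": 65}, expensive)
    second = cached_compute(tmp, "users", {"max_age": 65}, expensive)  # read from disk
```

In this sketch the second call never runs `expensive`, which mirrors why emptying the folder is needed to force a recalculation.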

# 2. Define the Source
The Source objects in `src\dataloaders\sources` specify how to process a specific *source* of data (for example, the dummy labor dataset). Inside this object we specify how to preprocess and tokenize the dataset.

#### What does a Source do?
The Source specifies how to process a specific type of input data. For example, `SyntheticLaborSource` (in `src\dataloaders\sources`) specifies how to process `data\rawdata\synth_labor.csv`.

#### Why do we need a Source?
It makes it easier to process data from different data streams (or datasets).
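
To make the idea concrete, here is a toy version of the parse, index, and tokenize stages. The column names, the token format, and the function bodies are invented for illustration and are not taken from `SyntheticLaborSource`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw CSV content; the columns are made up for this sketch.
RAW = """user_id,date,city,job
1,2020-02-01,B,nurse
1,2020-01-01,A,clerk
2,2020-01-01,A,clerk
"""

def parsed(raw_text):
    """Read the raw rows and cast the user id to an integer."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    for row in rows:
        row["user_id"] = int(row["user_id"])
    return rows

def indexed(rows):
    """Group rows by user and sort each user's rows by date."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)
    return {uid: sorted(rs, key=lambda r: r["date"]) for uid, rs in by_user.items()}

def tokenized(by_user):
    """Turn every row into a 'sentence' of string tokens."""
    return {
        uid: [[f"CITY_{r['city']}", f"JOB_{r['job']}"] for r in rs]
        for uid, rs in by_user.items()
    }

toy_sentences = tokenized(indexed(parsed(RAW)))
```

Each data stream would get its own version of these three functions, while the rest of the pipeline only sees tokenized sentences per user.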

In [None]:
from src.dataloaders.sources.synth_labor import SyntheticLaborSource

In [None]:
## How to create a SOURCE!
synth_labor = SyntheticLaborSource()
## process the raw file (and maybe do some preprocessing)
synth_labor.parsed()
## index the files
synth_labor.indexed()
## tokenize the files
synth_labor.tokenized()
### Or use the prepare function to do all of the above
synth_labor.prepare()

After you run these commands for the first time, you will see that the results are saved in `data\processed\sources`. 
The next time you run the same functions, the object reads the data from the `data\processed\sources` folder instead of recalculating everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to delete the *corresponding* file in the `data\processed\sources` folder. This also applies when you change the code in `src\dataloaders\sources\synth_labor.py`, because the `SyntheticLaborSource` object saves its `arguments` so that it can validate them on future runs.

# 3. Define Corpus
Now we can reuse both the `Populations` and `Source` objects to actually create a dataset. This happens in `src\dataloaders\datamodule.py`.

For pretraining, we can use `L2VDataModule` in `src\dataloaders\datamodule.py`.

In [None]:
from src.dataloaders.datamodule import Corpus

In [None]:
corpus = Corpus(population=users, sources=[synth_labor], name="synthetic")
corpus.prepare()

In [None]:
corpus.combined_sentences(split="train").head()

### Let's create a vocabulary
The `Vocabulary` object defines the set of tokens; for example, it removes words that appear too rarely. You can also see that, because we defined the *Income* feature as `Binned` in the `SyntheticLaborSource`, its values have become binned.
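
Both ideas, binning a numeric feature and dropping rare tokens, can be illustrated with a small sketch. The bin edges, the `min_count` threshold, the special tokens, and the helper names are assumptions and do not reflect how `CorpusVocabulary` or `Binned` are implemented.

```python
from bisect import bisect_right
from collections import Counter

def bin_token(name, value, edges):
    """Map a numeric value to a token naming the bin it falls into."""
    return f"{name}_{bisect_right(edges, value)}"

def build_vocab(sentences, min_count, specials=("[PAD]", "[UNK]")):
    """Assign ids to special tokens first, then to tokens seen at least min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(list(specials) + kept)}

edges = [1000, 2000]  # hypothetical bin edges
incomes = [900, 1200, 1500, 2500, 1100]
cities = "AABCA"
toy_corpus = [[bin_token("INCOME", v, edges), f"CITY_{c}"] for v, c in zip(incomes, cities)]
toy_vocab = build_vocab(toy_corpus, min_count=2)
```

Here only `INCOME_1` and `CITY_A` occur at least twice, so the other tokens are left out of `toy_vocab`; a model would typically map them to the unknown token.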

In [None]:
from src.dataloaders.vocabulary import CorpusVocabulary

In [None]:
vocab = CorpusVocabulary(corpus, name="synthetic")
vocab.prepare()

# 4. Datamodule

In [None]:
from src.dataloaders.datamodule import L2VDataModule
from src.dataloaders.tasks.pretrain import MLM

In [None]:
## Specify the task we are going to use with the data
task = MLM(name="pretrain", 
           max_length=200,
           no_sep = False, 
           # Augmentation
            p_sequence_timecut = 0.0,
            p_sequence_resample = 0.01,
            p_sequence_abspos_noise = 0.1,
            p_sequence_hide_background = 0.01,
            p_sentence_drop_tokens = 0.01,
            shuffle_within_sentences = True,
            # MLM specific options
            mask_ratio = 0.1)
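
As a rough picture of what `mask_ratio` controls in masked language modelling, the sketch below hides a fraction of the tokens and keeps the originals as prediction targets. It is a simplified assumption: the function `mask_tokens` and the `[MASK]` token are hypothetical, and the actual `MLM` task may choose positions and replacement tokens differently.

```python
import random

def mask_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Replace a random subset of positions with mask_token.

    Returns the masked sequence and a dict {position: original token}.
    """
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # remember what the model should predict
        masked[pos] = mask_token
    return masked, targets

toy_tokens = [f"TOKEN_{i}" for i in range(20)]
masked, targets = mask_tokens(toy_tokens, mask_ratio=0.1, rng=random.Random(0))
```

With 20 tokens and a ratio of 0.1, two positions are masked.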

In [None]:
## Initialize the datamodule with the corpus, the task and the vocabulary created above
datamodule = L2VDataModule(corpus, batch_size=32, task=task, vocabulary=vocab, num_workers=4)

In [None]:
datamodule.prepare()

In [None]:
# MORE ANNOTATIONS TO COME

# 5. Add model

In [None]:
### use code from the experiments
from src.models.pretrain import TransformerEncoder
import argparse
from pytorch_lightning import Trainer

In [None]:
class Hparams(argparse.Namespace):
  hidden_size = 96 #size of the hidden layers and embeddings
  hidden_ff = 512 #size of the position-wise feed-forward layer
  n_encoders = 4 # number of encoder blocks
  n_heads = 8 # number of attention heads in the multiheadattention module
  n_local = 2 # number of local attention heads 
  local_window_size = 4 # size of the window for local attention
  max_length = 100 # maximum length of the input sequence
  vocab_size = 100 # size of the vocabulary
  num_classes = 3 # number of classes for the SOP class (we have 3: original, reversed, shuffled)
  lr = 0.001
  batch_size = 4
  num_epochs = 30
  device = 'cuda'
  attention_type = "performer"
  norm_type = "rezero"
  num_random_features = 32 # number of random features for the Attention module (Performer uses this)
  parametrize_emb = True # whether to center the token embedding matrix
  
  emb_dropout = 0.1 #dropout for the embedding block
  fw_dropout = 0.1 #dropout for the position-wise feed-forward layer
  att_dropout = 0.1 # dropout for the multiheadattention module
  hidden_act = "swish" # activation function for the hidden layers (attention layers use ReLU)
  optimizer = "adam" # optimizer to use
hparams=Hparams()

In [None]:
hparams = {
    "hidden_size": 96,  # size of the hidden layers and embeddings
    "hidden_ff": 512,  # size of the position-wise feed-forward layer
    "n_encoders": 4,  # number of encoder blocks
    "n_heads": 8,  # number of attention heads in the multiheadattention module
    "n_local": 2,  # number of local attention heads
    "local_window_size": 4,  # size of the window for local attention
    "max_length": task.max_length,  # maximum length of the input sequence
    "vocab_size": vocab.size(),  # size of the vocabulary
    "num_classes": "",  
    "cls_num_targs": 3, # number of classes for the SOP class (we have 3: original, reversed, shuffled)
    "learning_rate": 0.001,
    "batch_size": 4,
    "num_epochs": 30,
    "device": 'cuda',
    "attention_type": "performer",
    "norm_type": "rezero",
    "num_random_features": 32,  # number of random features for the Attention module (Performer uses this)
    "parametrize_emb": True,  # whether to center the token embedding matrix
    "emb_dropout": 0.1,  # dropout for the embedding block
    "fw_dropout": 0.1,  # dropout for the position-wise feed-forward layer
    "att_dropout": 0.1,  # dropout for the multiheadattention module
    "dc_dropout": 0.1,  # dropout for the decoder block
    "hidden_act": "swish",  # activation function for the hidden layers (attention layers use ReLU)
    "optimizer": "adam",  # optimizer to use
    "training_task": "mlm",
    "weight_tying": True,
    "norm_output_emb": True,
    "epsilon": 1e-8,
    "weight_decay": 0.01,
    "beta1": 0.9,
    "beta2": 0.999,
}
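
Before building the model it can help to sanity-check a few of these values. The checks below are assumptions based on how standard multi-head attention and dropout work (for instance, that `hidden_size` is split evenly across `n_heads`); `check_hparams` is a hypothetical helper, and `TransformerEncoder` may enforce different constraints.

```python
def check_hparams(hp):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if hp["hidden_size"] % hp["n_heads"] != 0:
        problems.append("hidden_size must be divisible by n_heads")
    if not 0 <= hp["n_local"] <= hp["n_heads"]:
        problems.append("n_local must be between 0 and n_heads")
    for key in ("emb_dropout", "fw_dropout", "att_dropout", "dc_dropout"):
        if not 0.0 <= hp[key] < 1.0:
            problems.append(f"{key} must be in [0, 1)")
    return problems

# Values copied from the dictionary above.
example = {
    "hidden_size": 96, "n_heads": 8, "n_local": 2,
    "emb_dropout": 0.1, "fw_dropout": 0.1, "att_dropout": 0.1, "dc_dropout": 0.1,
}
problems = check_hparams(example)
```

In the notebook, one could call `check_hparams(hparams)` on the full dictionary in the same way.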


In [None]:
l2v = TransformerEncoder(hparams=hparams)

In [None]:
trainer = Trainer(max_epochs=hparams["num_epochs"])

In [None]:
trainer.fit(model=l2v, datamodule=datamodule)

In [None]:
## More annotations to come.