# 1. Define the Populations
First, we use `src\dataloaders\populations\users.py` to define the Population object. Here, we 
1. provide a **raw** input file containing some information about the users (for example, demographic information such as birthday),
2. preprocess and filter users based on some criteria (for example, you may want to exclude people beyond a certain age),
3. create data splits (train, val, test).

As you create the object and call its methods, the Population object runs all these steps and saves the outputs in the `data\processed\populations` folder.
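
To make the three steps above concrete, here is a minimal standard-library sketch of filtering by age and assigning splits. It is not the code in `users.py`: the record layout, the `max_age` cut-off, the split fractions, and the helper names (`age_on`, `filter_users`, `assign_split`) are all assumptions made for illustration.

```python
import hashlib
from datetime import date

# Hypothetical raw user records; the real input file and its columns may differ.
RAW_USERS = [
    {"user_id": 1, "birthday": date(1957, 6, 25), "sex": "Male"},
    {"user_id": 2, "birthday": date(1990, 1, 15), "sex": "Female"},
    {"user_id": 3, "birthday": date(1930, 3, 2), "sex": "Female"},
    {"user_id": 4, "birthday": date(1985, 11, 30), "sex": "Male"},
]


def age_on(birthday: date, reference: date) -> int:
    """Return the age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (birthday.month, birthday.day)
    return reference.year - birthday.year - (0 if had_birthday else 1)


def filter_users(users, reference: date, max_age: int):
    """Keep only users who are at most max_age years old on the reference date."""
    return [u for u in users if age_on(u["birthday"], reference) <= max_age]


def assign_split(user_id: int, val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Assign a user to a split deterministically by hashing the user id."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000 / 1000  # pseudo-uniform value in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


kept = filter_users(RAW_USERS, reference=date(2020, 12, 31), max_age=80)
splits = {u["user_id"]: assign_split(u["user_id"]) for u in kept}
```

Hashing the user id (rather than shuffling) is one way to keep a user in the same split across runs; whether the Population object does this is not stated here.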

### Why do we need the Populations object?

It defines the cohort of users that we want to work with.
Based on the populations, we then create datasets that contain specific cohorts of people. You can define multiple populations and create different datasets based on different user specifications.

In [1]:
# Define the Population object
from src.dataloaders.populations.users import UserPopulation
users = UserPopulation()
## run the preprocessing part
users.population()
## create datasplits 
users.data_split()
## You can also just run the prepare function to do both of the above steps
users.prepare() 

After you run these commands for the first time, you will see that the results are saved in `data\processed\populations`.
The next time you run the same functions, the object reads the data from the `data\processed\populations` folder instead of recalculating everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to empty the `data\processed\populations` folder. This also applies when you change the code in `src\dataloaders\populations\users.py`, because the Population object saves its `arguments` so that it can validate that you are calling a specific version of the Population object.
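
The caching behaviour described above can be sketched roughly as follows. This is only an illustration of the pattern (store the result together with a hash of the arguments, reuse it when the arguments match), not the repository's implementation. The file names, the JSON format, the `load_or_compute` helper, and the choice to raise an error on mismatched arguments are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def arguments_hash(arguments: dict) -> str:
    """Hash the arguments so that a change in configuration is detectable."""
    payload = json.dumps(arguments, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def load_or_compute(cache_dir: Path, name: str, arguments: dict, compute):
    """Return the cached result if present; otherwise compute and cache it.

    Raises ValueError when a cached result was produced with different
    arguments, in which case the cache has to be cleared by hand.
    """
    result_path = cache_dir / f"{name}.json"
    args_path = cache_dir / f"{name}.arguments.json"
    if result_path.exists() and args_path.exists():
        saved = json.loads(args_path.read_text())
        if saved["hash"] != arguments_hash(arguments):
            raise ValueError(f"Cached '{name}' was built with different arguments")
        return json.loads(result_path.read_text())
    cache_dir.mkdir(parents=True, exist_ok=True)
    result = compute()
    result_path.write_text(json.dumps(result))
    args_path.write_text(json.dumps({"hash": arguments_hash(arguments)}))
    return result


calls = []


def expensive_step():
    calls.append(1)  # track how often the computation actually runs
    return {"train": [1, 2], "val": [4], "test": []}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "populations"
    first = load_or_compute(cache, "users", {"max_age": 80}, expensive_step)
    second = load_or_compute(cache, "users", {"max_age": 80}, expensive_step)
    try:
        load_or_compute(cache, "users", {"max_age": 70}, expensive_step)
        stale_detected = False
    except ValueError:
        stale_detected = True
```

In this sketch the second call reads from disk, and the third call (with a changed argument) is rejected until the cache folder is emptied.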

# 2. Define the Source
The Source objects in `src\dataloaders\sources` specify how to process a specific *source* of data (for example, the dummy labor dataset). Inside this object, we specify how to preprocess and tokenize the dataset.

#### What does a Source do?
The Source specifies how to process a specific type of input data. For example, `SyntheticLaborSource` (in `src\dataloaders\sources`) specifies how to process `data\rawdata\synth_labor.csv`.

#### Why do we need a Source?
It makes it easier to process data from different data streams (or datasets).
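
As a rough illustration of what "processing and tokenizing" a source can mean, the sketch below turns one raw record into a sentence of prefixed tokens, with income mapped to a bin index. The field names, the bin edges, and the helpers `bin_index` and `tokenize_record` are assumptions for this example; `SyntheticLaborSource` defines its own fields and binning.

```python
from bisect import bisect_right

# Hypothetical bin edges for income; the real source derives its own bins.
INCOME_EDGES = [10_000, 20_000, 40_000, 80_000]


def bin_index(value: float, edges) -> int:
    """Return the index of the bin that value falls into (0 .. len(edges))."""
    return bisect_right(edges, value)


def tokenize_record(record: dict) -> str:
    """Turn one raw record into a space-separated sentence of tokens."""
    tokens = [
        f"INCOME_{bin_index(record['income'], INCOME_EDGES)}",
        f"CITY_{record['city']}",
        f"OCC_{record['occupation']}",
        f"IND_{record['industry']}",
    ]
    return " ".join(tokens)


record = {"income": 35_500.0, "city": 6, "occupation": 29, "industry": 32}
sentence = tokenize_record(record)
print(sentence)  # INCOME_2 CITY_6 OCC_29 IND_32
```

Keeping this logic in one object per data stream means that adding a new dataset only requires a new Source, while the rest of the pipeline stays the same.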

In [2]:
from src.dataloaders.sources.synth_labor import SyntheticLaborSource

In [3]:
## How to create a SOURCE!
synth_labor = SyntheticLaborSource()
## process the raw file (and maybe do some preprocessing)
synth_labor.parsed()
## index the files
synth_labor.indexed()
## tokenize the files
synth_labor.tokenized()
### Or use the prepare function to do all of the above
synth_labor.prepare()

After you run these commands for the first time, you will see that the results are saved in `data\processed\sources`.
The next time you run the same functions, the object reads the data from the `data\processed\sources` folder instead of recalculating everything. **This is important for very big datasets**.

If you want to redo the calculations, you need to delete the *corresponding* files in the `data\processed\sources` folder. This also applies when you change the code in `src\dataloaders\sources\synth_labor.py`, because the `SyntheticLaborSource` object saves its `arguments` so that it can validate them on future runs.

# 3. Define the Corpus
Now we can reuse both the `Population` and `Source` objects to actually create a dataset. This happens in `src\dataloaders\datamodule.py`.

For pretraining, we can use `L2VDataModule` in `src\dataloaders\datamodule.py`.

In [4]:
from src.dataloaders.datamodule import Corpus

In [5]:
corpus = Corpus(population=users, sources=[synth_labor], name="synthetic")
corpus.prepare()

In [6]:
corpus.combined_sentences(split="train").head()

| USER_ID | RECORD_DATE | SENTENCE | BIRTHDAY | SEX | AGE | AFTER_THRESHOLD |
|---|---|---|---|---|---|---|
| 1 | 122 | INCOME_45 CITY_6 OCC_29 IND_32 | 1957-06-25 | Male | 63 | False |
| 1 | 271 | INCOME_21 CITY_18 OCC_91 IND_20 | 1957-06-25 | Male | 63 | False |
| 1 | 284 | INCOME_2 CITY_21 OCC_29 IND_18 | 1957-06-25 | Male | 63 | False |
| 1 | 292 | INCOME_31 CITY_17 OCC_93 IND_15 | 1957-06-25 | Male | 63 | False |
| 1 | 326 | INCOME_2 CITY_3 OCC_48 IND_17 | 1957-06-25 | Male | 63 | False |
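
The output above suggests that the corpus joins each tokenized sentence with attributes of the user it belongs to, restricted to the users of one split. A minimal sketch of such a join, using plain dictionaries, is shown below. The data, the `combine_for_split` function, and the sorting step are assumptions for illustration and are not taken from the `Corpus` class.

```python
# Hypothetical population output: split name -> user ids, plus user attributes.
SPLITS = {"train": [1, 2], "val": [4], "test": []}
USERS = {
    1: {"SEX": "Male", "BIRTHDAY": "1957-06-25"},
    2: {"SEX": "Female", "BIRTHDAY": "1990-01-15"},
    4: {"SEX": "Male", "BIRTHDAY": "1985-11-30"},
}

# Hypothetical tokenized source output: one sentence per user and record date.
SENTENCES = [
    {"USER_ID": 1, "RECORD_DATE": 122, "SENTENCE": "INCOME_45 CITY_6 OCC_29 IND_32"},
    {"USER_ID": 4, "RECORD_DATE": 90, "SENTENCE": "INCOME_3 CITY_2 OCC_11 IND_7"},
    {"USER_ID": 1, "RECORD_DATE": 271, "SENTENCE": "INCOME_21 CITY_18 OCC_91 IND_20"},
    {"USER_ID": 9, "RECORD_DATE": 15, "SENTENCE": "INCOME_8 CITY_1 OCC_5 IND_5"},
]


def combine_for_split(split: str):
    """Return the sentences of users in the split, joined with user attributes."""
    members = set(SPLITS[split])
    rows = [
        {**s, **USERS[s["USER_ID"]]}
        for s in SENTENCES
        if s["USER_ID"] in members
    ]
    # Order chronologically within each user.
    return sorted(rows, key=lambda r: (r["USER_ID"], r["RECORD_DATE"]))


train_rows = combine_for_split("train")
```

Sentences from users outside the population (user 9 in this toy data) are dropped, which is how the population controls who ends up in the dataset.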


### Let's create a vocabulary
The `Vocabulary` object removes, for example, words that appear too rarely. You can also see that, because we defined the *Income* feature as `Binned` in `SyntheticLaborSource`, income appears as binned tokens (for example, `INCOME_45` in the sentences above).
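
A frequency filter of the kind mentioned above could look like the following sketch. The `min_count` threshold, the special tokens, and the functions `build_vocabulary` and `encode` are assumptions for illustration; `CorpusVocabulary` may use different rules and token names.

```python
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[MASK]"]  # assumed special tokens


def build_vocabulary(sentences, min_count: int = 2) -> dict:
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(token for s in sentences for token in s.split())
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + kept)}


def encode(sentence: str, vocabulary: dict) -> list:
    """Convert a sentence to ids, replacing unknown tokens with [UNK]."""
    return [vocabulary.get(token, vocabulary["[UNK]"]) for token in sentence.split()]


toy_sentences = [
    "INCOME_2 CITY_21 OCC_29 IND_18",
    "INCOME_2 CITY_3 OCC_48 IND_17",
    "INCOME_45 CITY_6 OCC_29 IND_32",
]
toy_vocab = build_vocabulary(toy_sentences, min_count=2)
ids = encode("INCOME_2 CITY_6 OCC_29 IND_18", toy_vocab)
print(ids)  # [3, 1, 4, 1]
```

Only `INCOME_2` and `OCC_29` occur at least twice in the toy data, so every other token maps to the `[UNK]` id.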

In [7]:
from src.dataloaders.vocabulary import CorpusVocabulary

In [8]:
vocab = CorpusVocabulary(corpus, name="synthetic")
vocab.prepare()

# 4. Datamodule

In [9]:
from src.dataloaders.datamodule import L2VDataModule
from src.dataloaders.tasks.pretrain import MLM

In [10]:
## Specify the task we are going to use with the data
task = MLM(name="pretrain", max_length=200)
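
For readers unfamiliar with masked language modelling, here is a simplified sketch of what such a task does to a token sequence: it truncates or pads to `max_length` and hides a random subset of tokens that the model must predict. The `mask_tokens` function, the mask probability, and the token names are assumptions, and the sketch omits refinements such as occasionally substituting random tokens. The `MLM` class in `src\dataloaders\tasks\pretrain` may behave differently.

```python
import random

MASK, PAD = "[MASK]", "[PAD]"


def mask_tokens(tokens, max_length: int, mask_prob: float = 0.15, seed: int = 0):
    """Truncate or pad to max_length and mask a random subset of real tokens.

    Returns (inputs, targets); targets hold the original token at masked
    positions and None elsewhere.
    """
    rng = random.Random(seed)
    tokens = list(tokens[:max_length])
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)
        else:
            inputs.append(token)
            targets.append(None)
    padding = max_length - len(tokens)
    return inputs + [PAD] * padding, targets + [None] * padding


tokens = "INCOME_45 CITY_6 OCC_29 IND_32 INCOME_21 CITY_18 OCC_91 IND_20".split()
inputs, targets = mask_tokens(tokens, max_length=10, mask_prob=0.5, seed=1)
```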

In [11]:
## Create the datamodule from the corpus, the task, and the vocabulary prepared above
datamodule = L2VDataModule(corpus, batch_size=32, task=task, vocabulary=vocab, num_workers=4)

In [12]:
datamodule.prepare_data()
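
The `batch_size` argument above controls how many sequences are grouped together. As a rough, framework-free sketch of that idea, the code below pads encoded sequences to a fixed length and yields them in batches. The helpers `pad_sequence` and `iter_batches` and the padding id `0` are assumptions; the actual `L2VDataModule` may batch and pad differently.

```python
def pad_sequence(ids, length: int, pad_id: int = 0):
    """Right-pad (or truncate) a list of ids to exactly `length` items."""
    return list(ids[:length]) + [pad_id] * max(0, length - len(ids))


def iter_batches(sequences, batch_size: int, max_length: int, pad_id: int = 0):
    """Yield lists of at most batch_size padded sequences."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        yield [pad_sequence(seq, max_length, pad_id) for seq in chunk]


encoded = [[3, 4], [5], [6, 7, 8, 9], [3], [4, 5, 6]]
batches = list(iter_batches(encoded, batch_size=2, max_length=3))
```

With five sequences and a batch size of two, the last batch contains a single sequence.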

In [13]:
# MORE ANNOTATIONS TO COME