![a](res/banner.jpg)

<h1 style="text-align: center;">Getting into Modalities in 15mins</h1>

<hr/>

**Let's train a dense model with Modalities involving the following steps:**

1. Data Preprocessing (Indexation, Tokenization)
2. Model Pretraining (GPT Model)
3. Monitoring (Weights&Biases)


**Folder structure:**

Throughout the tutorial, we will use the Jupyter Notebook `modalities_demo.ipynb` to guide us through the process. The notebook is located in the root directory of the tutorial, along with the `configs` and `data` directories. The `configs` directory contains configuration files for the model pretraining and tokenization, while the `data` directory contains subdirectories for storing checkpoints, preprocessed data, raw data, and tokenizer-related files.

```text
└── getting_started_15mins                 # Root directory for the tutorial
    ├── modalities_demo.ipynb              # Jupyter Notebook used for the tutorial.
    ├── configs                      
    │   ├── pretraining_config.yaml        # Config file for the model pretraining
    │   └── tokenization_config.yaml       # Config file for tokenization
    └── data                         
        ├── checkpoints                    # Dir where model and optimizer checkpoints are stored.
        │   └── <checkpoints>        
        ├── preprocessed                   # Dir containing preprocessed training and eval data.
        │   └── <files>              
        ├── raw                      
        │   └── fineweb_edu_num_docs_483606.jsonl   # JSONL file with raw data for training and eval.
        └── tokenizer                
            ├── tokenizer.json             # JSON file defining the tokenizer model.
            └── tokenizer_config.json      # Config file specifying all tokenizer settings
```

## Preparation steps

Firstly, we need to install Modalities via pip

```bash
pip install modalities
```

and download the raw training data. 
We are going to use a subset (~500k documents) of the FineWeb-Edu dataset, as it is already cleaned, filtered and deduplicated.

In [5]:
!cd data/raw && wget https://huggingface.co/datasets/ModalitiesTeam/FW_EDU_SUBSET_500k_docs/resolve/main/fineweb_edu_num_docs_483606.jsonl?download=true -O fineweb_edu_num_docs_483606.jsonl

--2025-08-07 11:41:02--  https://huggingface.co/datasets/ModalitiesTeam/FW_EDU_SUBSET_500k_docs/resolve/main/fineweb_edu_num_docs_483606.jsonl?download=true
Resolving huggingface.co (huggingface.co)... 3.160.39.15, 3.160.39.99, 3.160.39.100, ...
Connecting to huggingface.co (huggingface.co)|3.160.39.15|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cas-bridge.xethub.hf.co/xet-bridge-us/66dacd19ae7192d89df8e0e6/14e72def15ff6a8cc24e71c9919d272c710b548b68090c3de667803d69b3d95c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250807%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250807T094102Z&X-Amz-Expires=3600&X-Amz-Signature=2d16da7016f41c0b4dd4dc58846f297c5c01ea6723e1ed98a62a146a13a02cba&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27fineweb_edu_num_docs_483606.jsonl%3B+filename%3D%22fineweb_edu_num_docs_483606.jsonl%22%3B&x-id=GetObject&

**Disclaimer:**

Don't run Modalities in Jupyter notebooks!


But this time for demonstration purposes:

<img src="res/notebooks_1.png" alt="Alt text" style="width:30%;"/>

<small> credits: Joel Grus - I don't like Notebooks</small>

# Data Preprocessing


Before training the model, we will preprocess the raw data. In the first step, we will create an index of the data that stores the starting byte position and byte length of every document. In the second step, this index is used to efficiently access the individual documents in the JSONL file during tokenization.

The raw JSONL dataset has the following properties:

* Subset of FineWeb-Edu (~500k documents) encoded as JSONL file
* already cleaned, filtered and deduplicated

Each line in the JSONL is a proper JSON object containing a single document. 
```json
{
   "text":"What is the difference between 50 Ohm and 75 Ohm Coax? [...]",
   "id":"<urn:uuid:57e09efe-1c29-49f8-a086-e1bb5dd552c9>",
   "dump":"CC-MAIN-2021-39",
   "url":"http://cablesondemandblog.com/wordpress1/2014/03/",
   "file_path":"s3://commoncrawl/crawl-data/[...]20210918002307-00380.warc.gz",
   "language":"en",
   "language_score":0.9309850335121155,
   "token_count":2355,
   "score":3.625,
   "int_score":4
}
```

While the metadata is generally interesting and can be used to further filter the dataset, we are only interested in the `text` field for now, as it provides the actual training data.
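To make this concrete, here is a minimal sketch of how the `text` field can be pulled out of a single JSONL line with Python's standard `json` module. The line below is a shortened version of the example record above; it is only an illustration and not how Modalities extracts the field internally.

```python
import json

# One JSONL line, shortened from the example record above.
line = (
    '{"text": "What is the difference between 50 Ohm and 75 Ohm Coax? [...]", '
    '"language": "en", "token_count": 2355}'
)

record = json.loads(line)  # every line is a standalone JSON object
text = record["text"]      # the only field we use for training here

print(text)
```
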

## Indexation



The goal of the indexation process is to determine the starting byte position and length of each document in the raw data file.

Architecturally, as shown in the diagram below, a reader process reads the raw data file line by line and writes the starting byte position and length of each document to the queue. For each line in the queue, the processor first validates the JSON object and then writes the starting byte position and length of the document to the index file.


<img src="res/modalities_indexation_bright.svg" alt="Alt text" style="width:80%;"/>

We run the indexation with the command shown below. 

The `modalities data create_raw_index` command triggers the process of creating the index from the raw data.
The `--index_path` argument specifies the location where the generated index file will be saved. In this example, the index will be stored at `data/preprocessed/fineweb_edu_num_docs_483606.idx`.
The last part, i.e., `data/raw/fineweb_edu_num_docs_483606.jsonl` is the input file in JSONL (JSON Lines) format containing the raw data. The command will process this file to create the index.


In [6]:
!modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
                                               data/raw/fineweb_edu_num_docs_483606.jsonl

main - INFO - Reading raw data from data/raw/fineweb_edu_num_docs_483606.jsonl and writing index to data/preprocessed/fineweb_edu_num_docs_483606.idx ...
main - INFO - Index file created at data/preprocessed/fineweb_edu_num_docs_483606.idx
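The idea behind the indexation step can be sketched in a few lines of plain Python. This is a simplified, single-process illustration under our own assumptions (the helper `build_index` and the tiny demo file are hypothetical); the actual Modalities implementation uses a reader and a processor connected by a queue, and its index file format is not reproduced here.

```python
import json
import os
import tempfile


def build_index(jsonl_path):
    """Return a list of (start_byte, length_in_bytes), one entry per valid JSON line."""
    index = []
    offset = 0
    with open(jsonl_path, "rb") as f:  # binary mode, because we need byte offsets
        for raw_line in f:
            payload = raw_line.rstrip(b"\n")
            try:
                json.loads(payload)  # validate the JSON object
                index.append((offset, len(payload)))
            except json.JSONDecodeError:
                pass  # this sketch simply skips invalid lines
            offset += len(raw_line)
    return index


# Tiny demo file; the umlauts show that offsets are counted in bytes, not characters.
docs = [{"text": "first doc"}, {"text": "zweites Dokument: äöü"}, {"text": "third"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8", newline="\n"
) as tmp:
    for d in docs:
        tmp.write(json.dumps(d, ensure_ascii=False) + "\n")
    demo_path = tmp.name

index = build_index(demo_path)

# Random access: seek to the stored position and read exactly `length` bytes.
start, length = index[1]
with open(demo_path, "rb") as f:
    f.seek(start)
    second = json.loads(f.read(length))
os.remove(demo_path)

print(index)
print(second["text"])
```

With such an index, any document can be read with a single `seek` instead of scanning the file from the beginning.
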


## Throughput optimized tokenization


Now that we have the raw JSONL dataset indexed, we can proceed with the tokenization. 

In Modalities, tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. To maximize throughput, we scale up the number of processors that tokenize batches of documents in parallel, as shown in the diagram below. Typically, we use one processor per CPU core and tune the queue sizes and batch sizes for optimal throughput. 

The processors place the tokenized documents as byte streams in the queue from which the writer reads and writes the tokenized documents to the output file.

<img src="res/modalities_tokenization_bright.svg" alt="Alt text" style="width:100%;"/>
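The reader / processor / writer pattern from the diagram can be sketched as follows. This is a toy illustration under several assumptions: it uses threads instead of processes to stay self-contained, `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and the writer collects results in a dictionary instead of writing a byte stream to a file.

```python
import queue
import threading


def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one integer per word.
    return [len(word) for word in text.split()]


def processor(in_q, out_q):
    while True:
        batch = in_q.get()
        if batch is None:  # sentinel: no more work for this processor
            break
        out_q.put([(doc_id, toy_tokenize(text)) for doc_id, text in batch])


def writer(out_q, results):
    while True:
        batch = out_q.get()
        if batch is None:  # sentinel: all processors are done
            break
        for doc_id, tokens in batch:
            results[doc_id] = tokens


docs = [f"document number {i} " + "word " * i for i in range(20)]
batch_size, num_processors = 4, 3
raw_q = queue.Queue(maxsize=300)        # bounded queues provide back-pressure
processed_q = queue.Queue(maxsize=300)
results = {}

workers = [
    threading.Thread(target=processor, args=(raw_q, processed_q))
    for _ in range(num_processors)
]
writer_thread = threading.Thread(target=writer, args=(processed_q, results))
for t in workers + [writer_thread]:
    t.start()

# Reader: enqueue batches of (doc_id, text), then one sentinel per processor.
for start in range(0, len(docs), batch_size):
    stop = min(start + batch_size, len(docs))
    raw_q.put([(i, docs[i]) for i in range(start, stop)])
for _ in workers:
    raw_q.put(None)

for t in workers:
    t.join()
processed_q.put(None)  # tell the writer to stop once all processors finished
writer_thread.join()

print(len(results), results[0])
```

Carrying the document id along with the tokens lets the writer restore the original order even though the processors finish their batches in arbitrary order.
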

The tokenized dataset file is heavily optimized for efficient indexing. As laid out in the diagram below, the header specifies the size of the data segment and the size of a single token in bytes. With this information at hand, the file format is self-contained and does not need any additional information to be read. The data segment contains the concatenated byte streams of the tokenized documents.
The documents are indexed by their starting byte position and length, which are stored in the index segment. This allows for efficient random access to the tokenized documents in O(1) time.

Additionally, the data can be shuffled independently of the actual documents: only the index needs to be shuffled, which has a much lower memory footprint. Internally, we implemented a NumPy array-like view on top of the data segment. 

<img src="res/modalities_file_format_bright.svg" alt="Alt text" style="width:70%;"/>
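The following sketch mimics the three segments (header, data, index) with NumPy. The concrete byte widths of the header fields, the little-endian encoding and the way the index is serialized are assumptions made for this illustration, and `pack` / `read_doc` are hypothetical helpers; the real file layout may differ in these details.

```python
import numpy as np

# Assumed header layout: 8 bytes data segment size + 4 bytes token size.
HEADER_SIZE = 8 + 4
DTYPES = {1: np.uint8, 2: np.uint16, 4: np.uint32}


def pack(docs, token_size=2):
    """Serialize tokenized documents into header + data segment + index segment."""
    dtype = DTYPES[token_size]
    data = b"".join(np.asarray(d, dtype=dtype).tobytes() for d in docs)
    index, offset = [], 0
    for d in docs:
        length = len(d) * token_size
        index.append((offset, length))  # (start byte, length in bytes)
        offset += length
    header = len(data).to_bytes(8, "little") + token_size.to_bytes(4, "little")
    return header + data + np.asarray(index, dtype=np.int64).tobytes()


def read_doc(blob, doc_idx):
    """Random access to one document without touching the others."""
    data_len = int.from_bytes(blob[:8], "little")
    token_size = int.from_bytes(blob[8:12], "little")
    index = np.frombuffer(blob, dtype=np.int64, offset=HEADER_SIZE + data_len)
    start, length = (int(v) for v in index.reshape(-1, 2)[doc_idx])
    return np.frombuffer(
        blob,
        dtype=DTYPES[token_size],
        count=length // token_size,
        offset=HEADER_SIZE + start,
    )


docs = [[1, 2, 3], [40000, 5], [7]]
blob = pack(docs)

# Shuffling only permutes document indices; the data segment stays untouched.
order = np.random.default_rng(42).permutation(len(docs))
shuffled = [read_doc(blob, int(i)).tolist() for i in order]

print(read_doc(blob, 1).tolist())
print(shuffled)
```

Because a lookup only needs the header and one index entry, the cost of reading a document does not depend on its position in the file.
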


We define the tokenization config as printed out below. It defines the tokenizer component including all the necessary settings to make it fully reproducible. Under settings we additionally define the performance optimization settings, such as number of CPUs to use and queue sizes, as well as, the input and output file paths.  

In [7]:
from IPython.display import Markdown, display

def display_markdown(file_path):
    with open(file_path, 'r') as file:
        code = file.read()
    display(Markdown(f'```yaml\n{code}\n```'))


In [2]:
tokenization_config_path = "configs/tokenization_config.yaml"
display_markdown(tokenization_config_path)

```yaml
settings:
  src_path: data/raw/fineweb_edu_num_docs_483606.jsonl
  dst_path: data/preprocessed/fineweb_edu_num_docs_483606.pbin
  index_path: data/preprocessed/fineweb_edu_num_docs_483606.idx
  jq_pattern: .text
  num_cpus: ${node_env:num_cpus}
  eod_token: <|endoftext|>
  processing_batch_size: 10
  raw_samples_queue_size: 300
  processed_samples_queue_size: 300

tokenizer:
  component_key: tokenizer
  variant_key: pretrained_hf_tokenizer
  config:
    pretrained_model_name_or_path: data/tokenizer
    padding: false
    truncation: false
```

In [8]:
!modalities data pack_encoded_data configs/tokenization_config.yaml

Instantiated <class 'modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer'>: tokenizer
Processed batches:   0%|                             | 0/483606 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (2355 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1644 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1045 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1645 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for

# Training

In Modalities, we scale up the training via Fully Sharded Data Parallel (FSDP), as described in the paper [Zhao, Yanli, et al. "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." arXiv preprint arXiv:2304.11277 (2023).](https://arxiv.org/pdf/2304.11277)

**Goal:** Maximizing the token throughput during training by trading off communication overhead for a lower memory footprint. 

* Before training, the model is split into FSDP units, and each FSDP unit is sharded across all ranks
* Each rank is a data parallel process receiving only a subset of the data
* Each rank materializes one FSDP unit at a time during the forward pass by receiving the sharded weights from its peers

<img src="res/fsdp_bright.svg" alt="Alt text" style="width:90%;"/>


Adapted from Zhao, Yanli, et al. "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." arXiv preprint arXiv:2304.11277 (2023).
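The three bullet points above can be illustrated with a toy NumPy simulation. This is not how PyTorch FSDP is implemented; the unit names, the `all_gather` helper and the "layer" computation are all hypothetical and only mimic the idea that each rank stores a shard and materializes one unit at a time.

```python
import numpy as np

world_size = 4

# Two toy "FSDP units", each represented by a flat parameter vector.
# Sizes are chosen to be divisible by world_size to keep the sketch simple.
units = {
    "block_0": np.arange(0, 8, dtype=np.float32),
    "block_1": np.arange(8, 16, dtype=np.float32),
}

# Sharding: every rank stores only 1/world_size of each unit.
shards = {
    rank: {name: np.split(weights, world_size)[rank] for name, weights in units.items()}
    for rank in range(world_size)
}


def all_gather(name):
    """Simulate the collective: concatenate the shards of all ranks."""
    return np.concatenate([shards[rank][name] for rank in range(world_size)])


def forward(x):
    for name in units:
        full_weights = all_gather(name)  # materialize one unit at a time
        x = x + full_weights.sum()       # toy computation standing in for the layer
        del full_weights                 # drop the gathered copy before the next unit
    return x


print(shards[0]["block_0"].size, float(forward(0.0)))
```

At any point in time a rank holds its own shards plus at most one fully gathered unit, which is where the memory saving comes from; the price is the extra communication in `all_gather`.
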

While FSDP happens under the hood of Modalities, the user can still parameterize the training process via the `pretraining_config.yaml` file. In fact, the training config is specified in such a way that every component used during training (e.g., dataset, dataloader, model) is fully reproducible. On the one hand, this leads to larger, somewhat more complex config files; on the other hand, it makes the entire training process reproducible. Especially in the field of LLMs, where training is expensive, complex and involves a large number of ablations, it is crucial to keep track of the entire configuration of the system in a reproducible manner. 

The config file is shown in the printout below. 


In [3]:
pretraining_config_path = "configs/pretraining_config.yaml"
display_markdown(pretraining_config_path)

```yaml
settings:  
  experiment_id: ${modalities_env:experiment_id}
  config_file_path: ${modalities_env:config_file_path}
  referencing_keys:
    sample_key: input_ids
    target_key: target_ids
    prediction_key: logits
  cuda_env:
    local_rank: ${cuda_env:LOCAL_RANK}
    global_rank: ${cuda_env:RANK}
    world_size: ${cuda_env:WORLD_SIZE}
  paths:
    checkpoint_saving_path: data/checkpoints
    train_dataset_path: data/preprocessed/fineweb_edu_num_docs_483606.pbin
  intervals:
    training_log_interval_in_steps: 5
    checkpointing_interval_in_steps: 50
    evaluation_interval_in_steps: 50
  consistency_enforcement:
    enforce_tokens_per_step_consistency: true
    enforce_last_step_logged: false
    enforce_last_step_evaluated: false
    enforce_last_step_checkpointed: false
  step_profile: 
    gradient_accumulation_steps: 1
    local_train_micro_batch_size: 64
    sequence_length: 256
  training_target:
    num_target_tokens:
      component_key: number_conversion
      variant_key: num_tokens_from_packed_mem_map_dataset_continuous
      config:
        dataset_path: ${settings.paths.train_dataset_path}
        sequence_length: ${settings.step_profile.sequence_length}
        num_ranks: ${settings.cuda_env.world_size}
        local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
        gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
    num_target_steps:  # for the batch progress subscriber
      component_key: number_conversion
      variant_key: num_steps_from_num_tokens
      config:
        num_ranks: ${settings.cuda_env.world_size}
        local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
        global_num_tokens: ${settings.training_target.num_target_tokens}
        sequence_length: ${settings.step_profile.sequence_length}
        gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
  training_progress: 
    global_num_seen_tokens: 0
    num_seen_steps: 0
    num_seen_samples: 0
    last_step: -1

collate_fn:
  component_key: collate_fn
  variant_key: gpt_2_llm_collator
  config:
    sample_key: ${settings.referencing_keys.sample_key}
    target_key: ${settings.referencing_keys.target_key}

train_dataset:
  component_key: dataset
  variant_key: packed_mem_map_dataset_continuous
  config:
    raw_data_path: ${settings.paths.train_dataset_path}
    sequence_length: ${settings.step_profile.sequence_length}
    sample_key:  ${settings.referencing_keys.sample_key}

train_dataloader:
  component_key: data_loader
  variant_key: default
  config:
    num_workers: 2
    pin_memory: true
    dataloader_tag: train
    dataset:
      instance_key: train_dataset
      pass_type: BY_REFERENCE
    batch_sampler:
      component_key: batch_sampler
      variant_key: default
      config:
        batch_size: ${settings.step_profile.local_train_micro_batch_size}
        drop_last: true
        sampler:
          component_key: sampler
          variant_key: resumable_distributed_sampler
          config:
            dataset:
              instance_key: train_dataset
              pass_type: BY_REFERENCE
            rank: ${settings.cuda_env.global_rank}
            num_replicas: ${settings.cuda_env.world_size}
            shuffle: true
            seed: 42
            drop_last: true
            skip_num_global_samples: ${settings.training_progress.num_seen_samples}
    collate_fn:
      instance_key: collate_fn
      pass_type: BY_REFERENCE

eval_dataloaders: []

checkpoint_saving:
  component_key: checkpoint_saving
  variant_key: default
  config:
    checkpoint_saving_strategy:
      component_key: checkpoint_saving_strategy
      variant_key: save_k_most_recent_checkpoints_strategy
      config:
        k: -1   # -1 to save all checkpoints
    checkpoint_saving_execution:
      component_key: checkpoint_saving_execution
      variant_key: fsdp1
      config:
        checkpoint_path: ${settings.paths.checkpoint_saving_path}
        global_rank: ${settings.cuda_env.global_rank}
        experiment_id: ${settings.experiment_id}

loss_fn:
  component_key: loss
  variant_key: clm_cross_entropy_loss
  config:
    target_key: ${settings.referencing_keys.target_key}
    prediction_key: ${settings.referencing_keys.prediction_key}

wrapped_model:
  component_key: model
  variant_key: fsdp_wrapped
  config:
    model:
      instance_key: model
      pass_type: BY_REFERENCE
    sync_module_states: true
    mixed_precision_settings: BF_16
    sharding_strategy: FULL_SHARD
    block_names: [GPT2Block]

model:
  component_key: model
  variant_key: model_initialized
  config:
    model:
      instance_key: model_raw
      pass_type: BY_REFERENCE
    model_initializer:
      component_key: model_initialization
      variant_key: composed
      config:
        model_type: gpt2
        weight_init_type: scaled
        mean: 0.0
        std: 0.02
        num_layers: ${model_raw.config.n_layer}

model_raw:
  component_key: model
  variant_key: gpt2
  config:
    sample_key: ${settings.referencing_keys.sample_key}
    poe_type: NOPE
    sequence_length: ${settings.step_profile.sequence_length}
    prediction_key: ${loss_fn.config.prediction_key}
    vocab_size: 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: 2
    n_head_q: 8
    n_head_kv: 4
    ffn_hidden: 128
    n_embd: 128
    dropout: 0.0
    bias: false
    attention_config:
      qkv_transforms:
      - type_hint: RotaryTransform
        config:
          n_embd: ${model_raw.config.n_embd}
          n_head: ${model_raw.config.n_head_q}
          seq_length_dim: -2
          base_freq: 100000
    attention_implementation: pytorch_flash
    activation_type: swiglu
    attention_norm_config:
      norm_type: rms_norm
      config:
        ndim: ${model_raw.config.n_embd}
        bias: true
        epsilon: 1e-5
    ffn_norm_config:
      norm_type: rms_norm
      config:
        ndim: ${model_raw.config.n_embd}
        bias: true
        epsilon: 1e-5
    lm_head_norm_config:
      norm_type: rms_norm
      config:
        ndim: ${model_raw.config.n_embd}
        bias: true
        epsilon: 1e-5

scheduler:
  component_key: scheduler
  variant_key: onecycle_lr
  config:
    optimizer:
      instance_key: optimizer
      pass_type: BY_REFERENCE
    max_lr: 6e-4
    div_factor: 10
    final_div_factor: 1
    total_steps: ${settings.training_target.num_target_steps}
    pct_start: 0.01
    anneal_strategy: cos
    last_epoch: ${settings.training_progress.last_step}

optimizer:
  component_key: optimizer
  variant_key: adam_w
  config:
    lr: 0.0001
    betas: [0.9, 0.95]
    eps: 1e-8
    weight_decay: 1e-1
    weight_decay_groups_excluded: [embedding, layernorm]
    wrapped_model: 
      instance_key: wrapped_model
      pass_type: BY_REFERENCE

gradient_clipper:
  component_key: gradient_clipper
  variant_key: fsdp
  config:
    wrapped_model:
      instance_key: wrapped_model
      pass_type: BY_REFERENCE
    norm_type: P2_NORM
    max_norm: 1.0

progress_subscriber:
  component_key: progress_subscriber
  variant_key: dummy
  config: {}

evaluation_subscriber:
  component_key: results_subscriber
  variant_key: wandb
  config:
    global_rank: ${settings.cuda_env.global_rank}
    project: ai_24_demo
    mode: OFFLINE
    experiment_id: ${settings.experiment_id}
    directory: wandb_storage
    config_file_path: ${settings.config_file_path}
```
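To get a feeling for how the `step_profile` settings interact, the sketch below computes the number of tokens consumed per optimizer step and the resulting number of training steps. The formulas are our reading of the config keys, not code taken from Modalities; the world size of 4 is an assumption matching the 4-GPU launch used in this tutorial, and the dataset size is a hypothetical round number.

```python
def tokens_per_step(num_ranks, local_micro_batch_size, sequence_length,
                    gradient_accumulation_steps):
    """Tokens consumed across all ranks in one optimizer step."""
    return (num_ranks * local_micro_batch_size * sequence_length
            * gradient_accumulation_steps)


def steps_from_tokens(global_num_tokens, **step_profile):
    """Number of full optimizer steps that fit into the token budget."""
    return global_num_tokens // tokens_per_step(**step_profile)


step_profile = dict(
    num_ranks=4,                    # assumption: world size of the 4-GPU run
    local_micro_batch_size=64,      # from settings.step_profile
    sequence_length=256,            # from settings.step_profile
    gradient_accumulation_steps=1,  # from settings.step_profile
)

per_step = tokens_per_step(**step_profile)
example_num_tokens = 100_000_000    # hypothetical dataset size, not measured
num_steps = steps_from_tokens(example_num_tokens, **step_profile)

print(per_step, num_steps)  # 65536 1525
```
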

Below you find the command for running the distributed training with Modalities across 4 GPUs on a single node. Let's break it down into its components:

* `CUDA_VISIBLE_DEVICES=0,1,2,3`: This environment variable specifies which GPUs will be used for the job. In this case, GPUs with IDs 0, 1, 2, 3 are selected for training.

* `torchrun`: This is a utility from PyTorch used to launch distributed training. It automatically manages multiple processes for distributed training.

* `--rdzv-endpoint localhost:29515`: Specifies the rendezvous endpoint. Here, localhost is the machine's address, and 29515 is the port. The rendezvous endpoint coordinates the processes involved in distributed training.

* `--nnodes 1`: Specifies the number of nodes to be used in the distributed setup. Since this is a single-node setup, 1 is used.

* `--nproc_per_node 4`: This argument tells torchrun how many processes to launch on each node. In this case, 4 processes are launched per node, corresponding to the 4 GPUs (IDs 0, 1, 2, 3) specified by `CUDA_VISIBLE_DEVICES`.

* `$(which modalities) run`: This part dynamically finds the path to the modalities executable and runs it. The run command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The `--config_file_path` argument provides the path to the configuration file for the training job. In this example, the configuration is provided in `configs/pretraining_config.yaml`, which includes settings like model architecture, optimizer, dataset, dataloader and other training components.
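`torchrun` passes each process its position in the job via the environment variables `LOCAL_RANK`, `RANK` and `WORLD_SIZE`, which are the names referenced in the `cuda_env` section of the config. The sketch below shows how such values can be read in plain Python; `read_cuda_env` is a hypothetical helper for illustration and not part of Modalities.

```python
import os


def read_cuda_env(env=os.environ):
    """Read the rank information that torchrun provides to every process."""
    return {
        "local_rank": int(env.get("LOCAL_RANK", 0)),   # GPU index on this node
        "global_rank": int(env.get("RANK", 0)),        # index across all nodes
        "world_size": int(env.get("WORLD_SIZE", 1)),   # total number of processes
    }


# Simulated environment of the third process in a single-node, 4-GPU job.
simulated = {"LOCAL_RANK": "2", "RANK": "2", "WORLD_SIZE": "4"}
print(read_cuda_env(simulated))
```

On a single node, local and global rank coincide; they only differ once the job spans multiple nodes.
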


Once executed, the training process will start, and you will see the training logs in the terminal. The logs will include information about the training progress, such as the loss values, learning rate, and other metrics. Additionally, you can monitor the training process using Weights & Biases, to which Modalities automatically logs the metrics. Make sure that you are logged into your Weights & Biases account to track the training metrics.

In [9]:
! CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
                                        --nnodes 1 \
                                        --nproc_per_node 4 \
                                        $(which modalities) run --config_file_path configs/pretraining_config.yaml

W0807 11:43:40.118000 3516751 site-packages/torch/distributed/run.py:792] 
W0807 11:43:40.118000 3516751 site-packages/torch/distributed/run.py:792] *****************************************
W0807 11:43:40.118000 3516751 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0807 11:43:40.118000 3516751 site-packages/torch/distributed/run.py:792] *****************************************
Rank 0 received experiment_id: 2025-08-07__11-43-45_7d9fc15e
Rank 2 received experiment_id: 2025-08-07__11-43-45_7d9fc15e
Rank 3 received experiment_id: 2025-08-07__11-43-45_7d9fc15e
Rank 1 received experiment_id: 2025-08-07__11-43-45_7d9fc15e
[rank0]: Traceback (most recent call last):
[rank0]:   File "/raid/s3/opengptx/behzad_shomali/miniforge3/envs/modalities/bin/modalities", line 7, in <module>
[rank0]