# 1016 Final Project: Sequence Length Warmup

Coding Reference:
- Pretrain GPT-2: https://huggingface.co/learn/nlp-course/chapter7/6
- Composer Warmup: https://github.com/mosaicml/composer

## CUDA
- Get GPU ready

In [1]:
import sys
import os
import platform
import torch
import pandas as pd
import sklearn as sk
from tqdm import tqdm
import random
import wandb
from datetime import datetime
from datasets import load_dataset
from datasets import load_from_disk

has_gpu = torch.cuda.is_available()
has_mps = torch.backends.mps.is_available()
# prefer CUDA, then Apple MPS, then CPU
device = "cuda" if has_gpu else "mps" if has_mps else "cpu"

print(f"Python Platform: {platform.platform()}")
print(f"PyTorch Version: {torch.__version__}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
print("NVIDIA/CUDA GPU is", "available" if has_gpu else "NOT AVAILABLE")
print(f"Target device is {device}")
num_gpus = torch.cuda.device_count() if has_gpu else 0
print(f"Number of GPUs utilized: {num_gpus}")


Python Platform: Windows-10-10.0.22631-SP0
PyTorch Version: 2.2.1

Python 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
Pandas 2.1.4
Scikit-Learn 1.2.2
NVIDIA/CUDA GPU is available
Target device is cuda
Number of GPUs utilized: 1


In [2]:
# track the fraction of GPU memory currently allocated, to help avoid out-of-memory errors
def check_cur_memory_percentage(device):
    total_mem = torch.cuda.get_device_properties(device).total_memory
    allocated_mem = torch.cuda.memory_allocated(device)
    return allocated_mem / total_mem

# Wikitext Model

## Download Dataset
- Use 'wikitext-2-raw-v1' for now (smaller)
- If time permits, try 'wikitext-103-v1'

In [3]:
# # download wikitext data and save it locally (only do this for the first time)
# save_path = ''
# dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
# dataset.save_to_disk(save_path)

In [4]:
# reload wikitext data from the directory
load_path = 'C:\\Users\\Lucaw\\Desktop\\2024 Spring\\LLM\\wikitext-2-raw-v1'
ds = load_from_disk(load_path)

## Data Preprocessing
- Filter data by length (remove short lines and blanks)
- Tokenize sequences using `GPT2TokenizerFast`

In [5]:
num_proc = 10
def filter_dataset(dataset, min_len=1):
    """
    Filter a Dataset, keeping rows whose stripped 'text' value meets a minimum length.

    Args:
        dataset (Dataset): Input Dataset with 'text' values.
        min_len (int): Minimum length of the stripped 'text' value. Default is 1, which removes blank lines.

    Returns:
        Filtered Dataset.
    """
    filtered_dataset = dataset.filter(lambda x: len(x['text'].strip()) >= min_len, num_proc=num_proc)
    return filtered_dataset

In [6]:
# filter training, validation and testing set, with min_len = 50
filtered_train = filter_dataset(ds['train'], 50)
filtered_test = filter_dataset(ds['test'], 50)
filtered_validation = filter_dataset(ds['validation'], 50)
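
The filter keeps a row only when its stripped text has at least `min_len` characters, so blank lines and short heading lines are dropped. A minimal sketch of the same predicate on plain strings; the helper name `keep` and the sample lines are made up for illustration and are not taken from WikiText:

```python
# Same predicate as filter_dataset, applied to a plain list of strings.
# `keep` and the sample lines are hypothetical, for illustration only.
def keep(text, min_len=50):
    return len(text.strip()) >= min_len

samples = [
    "",                      # blank line
    " = Some Heading = \n",  # short heading-style line
    "x" * 60,                # long enough to keep
]
kept = [s for s in samples if keep(s)]
print(len(kept))
```

Stripping before measuring means lines made only of whitespace are removed even when `min_len` is 1.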

In [7]:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

## Only run this part for the first time
- This part tokenizes the data and stores the tokenized data locally for future use

In [8]:
# from transformers import GPT2TokenizerFast
# class TokenizerWrapper:
#     def __init__(self, tokenizer):
#         self.tokenizer = tokenizer
    
#     def tokenize_function(self, examples):
#         return self.tokenizer(
#             examples["text"],
#             padding=False,
#             truncation=True,
#         )

# tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")#, add_prefix_space=True
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer_wrapper = TokenizerWrapper(tokenizer)

In [9]:
# tokenized_path = 'C:\\Users\\Lucaw\\Desktop\\2024 Spring\\LLM\\tokenized-2'
# def tokenize_and_save(dataset, dataset_name):
#     tokenized_dataset = dataset.map(tokenizer_wrapper.tokenize_function, batched=True, num_proc = num_proc)
#     tokenized_dataset.save_to_disk(os.path.join(tokenized_path, dataset_name))

# num_proc = 4
# tokenize_and_save(filtered_train, 'train')
# tokenize_and_save(filtered_test, 'test')
# tokenize_and_save(filtered_validation, 'validation')

## Continued Data Preprocessing

In [10]:
# load tokenized data from disk
tokenized_path = 'C:\\Users\\Lucaw\\Desktop\\2024 Spring\\LLM\\tokenized-2'
tokenized_train = load_from_disk(os.path.join(tokenized_path, 'train'))
tokenized_test = load_from_disk(os.path.join(tokenized_path, 'test'))
tokenized_validation = load_from_disk(os.path.join(tokenized_path, 'validation'))

In [11]:
tokenized_train

Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 16216
})

In [12]:
# combine and truncate tokenized dataset to maximum sequence length
def truncate_dataset(dataset, max_seq_len):
    """
    Combine and truncate a tokenized dataset to a maximum sequence length.
    Sequences are concatenated and then cut into chunks of max_seq_len tokens.
    Different sequences are separated by <EOS>.

    Args:
        dataset (Dataset): Input Tokenized Dataset 
        max_seq_len(int): Max Sequence Length

    Returns:
        Processed Dataset with input_ids and attention_masks.
    """
    
    # define the combine-and-truncate function to be mapped (default batch size = 1000)
    def combine_and_truncate(batch):
        # lists that collect the concatenated tokens and attention values
        tokens = []
        attentions = []
        padding_token_id = tokenizer.pad_token_id

        for sequence in batch['input_ids']:
            piece = sequence + [padding_token_id]
            attention = [1] * len(sequence)
            attention += [0]
            tokens += piece
            attentions += attention

        length = len(tokens) // max_seq_len * max_seq_len

        input_ids = []
        attention_masks = []
        for i in range(0, length, max_seq_len):
            input_ids.append(tokens[i:i+max_seq_len])
            attention_masks.append(attentions[i:i+max_seq_len])

        return {'input_ids': input_ids, 'attention_mask': attention_masks}

    # map combine_and_truncate over batches
    truncated_dataset = dataset.map(combine_and_truncate, batched=True, remove_columns=dataset.column_names)
    return truncated_dataset
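
To make the packing step concrete, here is a toy version on plain Python lists. It is a sketch under assumptions: `pack` is a hypothetical name, and `EOS = 0` is a stand-in for the tokenizer's pad/EOS id.

```python
# Toy version of combine_and_truncate on plain Python lists.
# EOS = 0 is a stand-in for tokenizer.pad_token_id; `pack` is a hypothetical name.
EOS = 0

def pack(sequences, max_seq_len):
    tokens, mask = [], []
    for seq in sequences:
        tokens += seq + [EOS]          # separator after every sequence
        mask += [1] * len(seq) + [0]   # the separator gets attention 0
    usable = len(tokens) // max_seq_len * max_seq_len  # drop the ragged tail
    ids = [tokens[i:i + max_seq_len] for i in range(0, usable, max_seq_len)]
    masks = [mask[i:i + max_seq_len] for i in range(0, usable, max_seq_len)]
    return ids, masks

ids, masks = pack([[11, 12, 13], [21, 22], [31, 32, 33, 34]], max_seq_len=5)
print(ids)
print(masks)
```

The 12 concatenated tokens yield two chunks of 5; the last 2 tokens are discarded. Because `dataset.map(..., batched=True)` calls the function once per batch of rows, a tail like this is dropped for every batch rather than once for the whole dataset.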

In [13]:
# combine and truncate train, test and eval
max_sequence_length = 32
truncated_train = truncate_dataset(tokenized_train, max_sequence_length)
truncated_test = truncate_dataset(tokenized_test, max_sequence_length)
truncated_eval = truncate_dataset(tokenized_validation, max_sequence_length)

In [14]:
truncated_train

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 72965
})

In [15]:
# # check the nth element of truncated_train
# idx = 0
# print(truncated_train[idx]['input_ids'])
# print(truncated_train[idx]['attention_mask'])
# tokenizer.decode(truncated_train[idx]["input_ids"])

## Model Building
- Set model configuration
- Set data collator and data loader
- Convert model to a ComposerModel instance
- Try the original model, linear sequence length warmup, and non-linear sequence length warmup

In [16]:
# create data collator
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [17]:
# create data loader
from torch.utils.data import DataLoader
train_dataloader = DataLoader(truncated_train, batch_size=64, collate_fn=data_collator, drop_last = True)
test_dataloader = DataLoader(truncated_test, batch_size=64, collate_fn=data_collator, drop_last = True)
eval_dataloader = DataLoader(truncated_eval, batch_size=64, collate_fn=data_collator, drop_last = True)
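
The batch size and `drop_last=True` determine how many optimization steps one epoch contains, which matters later because the warmup duration is a fraction of the total steps. A quick check of the arithmetic, using the row count shown above and the 2-epoch duration used in the trials:

```python
# Step and token counts implied by the settings in this notebook.
num_rows = 72965      # rows in truncated_train
batch_size = 64
seq_len = 32          # max_sequence_length
epochs = 2            # max_duration used in each trial

batches_per_epoch = num_rows // batch_size   # drop_last=True discards the partial batch
total_batches = batches_per_epoch * epochs
tokens_per_batch = batch_size * seq_len

print(batches_per_epoch, total_batches, tokens_per_batch)
```

These values agree with the training logs: each epoch reports 1140 batches, and after 199 batches the logged token count is 199 × 2048 = 407552.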

In [18]:
# set model configuration
from transformers import GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=max_sequence_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_position_embeddings=max_sequence_length
)

## Trials: Run One Trial at a Time
- Trial 1: Original model with fixed sequence length
- Trial 2: Model w/ linear warmup, implemented by `composer.algorithms.SeqLengthWarmup`
- Trial 3: Model w/ non-linear warmup, implemented by a modified subclass of Composer's `SeqLengthWarmup`

For each trial, the model has a maximum sequence length of 32 and is trained for 2 epochs. The evaluation metrics are cross-entropy and perplexity. The training and evaluation losses are logged to Weights & Biases (https://wandb.ai/site).
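
The two metrics are directly related: perplexity is the exponential of the mean token-level cross-entropy (measured in nats). A small sketch that checks this against values logged in Trial 1; the helper name `perplexity` is introduced here for illustration:

```python
import math

# Perplexity is exp(mean cross-entropy in nats).
def perplexity(cross_entropy):
    return math.exp(cross_entropy)

# Cross-entropy values taken from the Trial 1 logs in this notebook.
first_batch = perplexity(10.9564)   # first training batch
final_eval = perplexity(6.97467)    # final evaluation
print(round(first_batch), round(final_eval, 1))
```

Up to rounding of the logged cross-entropy, the results agree with the logged perplexities (57322.4531 and 1069.20068), so either metric can be recovered from the other.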

### Trial 1: Original Model

In [16]:
# create model
model = GPT2LMHeadModel(config)
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(32, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [17]:
# convert model to a ComposerModel instance
from composer.metrics.nlp import LanguageCrossEntropy, LanguagePerplexity
from composer.models.huggingface import HuggingFaceModel

metrics = [
    LanguageCrossEntropy(ignore_index=-100),
    LanguagePerplexity(dist_sync_on_step=False)
]

# package as a Composer model
composer_model_original = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)

In [18]:
# set trainer and train the model
from composer.trainer import Trainer
from composer.loggers import WandBLogger

current_date = datetime.now()
formatted_date = current_date.strftime('%m%d')
max_duration = 2

wandb_logger = WandBLogger(
    project='sequence-length-warmup', 
    entity='lucawangnfls',
    name=f'T1-{formatted_date}-{max_duration}e-{max_sequence_length}l')

trainer = Trainer(
    model=composer_model_original,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    eval_interval="200ba",
    max_duration=max_duration,
    log_to_console=True,
    progress_bar=False,
    console_log_interval='200ba',
    loggers=wandb_logger
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: lucawangnfls. Use `wandb login --relogin` to force relogin


In [19]:
check_cur_memory_percentage('cuda')

0.07696021801970523

In [20]:
trainer.fit()

******************************
Config:
composer_commit_hash: None
composer_version: 0.19.1
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 520306327

******************************
[epoch=1][batch=1/1140]:
	 Train time/epoch: 0
	 Train time/batch: 0
	 Train time/sample: 0
	 Train time/batch_in_epoch: 0
	 Train time/sample_in_epoch: 0
	 Train time/token: 0
	 Train time/token_in_epoch: 0
	 Train trainer/device_train_microbatch_size: 64
	 Train loss/train/total: 10.9564
	 Train metrics/train/LanguageCrossEntropy: 10.9564
	 Train metrics/train/LanguagePerplexity: 57322.4531
[epoch=1][batch=200/1140]:
	 Train time/batch: 199
	 Train time/sample: 12736
	 Train time/batch_in_epoch: 199
	 Train time/sample_in_epoch: 12736
	 Train time/token: 407552
	 Train time/token_in_epoch: 407552
	 Train trainer/device_train_microbatch_size: 64
	 Train loss/train/total: 9.5455
	 Train metrics/train/LanguageCrossEntropy: 9.5455
	 Train metri

In [21]:
# clean-up GPU storage
import gc

print("Before Cleaning: ", check_cur_memory_percentage('cuda'))
model = composer_model_original.model

del trainer
del train_dataloader
del composer_model_original

gc.collect()
torch.cuda.empty_cache()
print("After Cleaning: ", check_cur_memory_percentage('cuda'))

Before Cleaning:  0.19004100274305838


Run history:
loss/train/total,█▇▆▇▅▆▄▄▃▃▃▃▁▂▁▃▂▂▆▅▅▄▅▅▅▄▂▃▃▃▂▂▂▂▃▂▃▂▃▃
metrics/eval/LanguageCrossEntropy,█▅▂▂▁▅▃▂▂▂▂▁
metrics/eval/LanguagePerplexity,█▃▁▁▁▃▂▁▁▁▁▁
metrics/train/LanguageCrossEntropy,█▇▆▇▅▆▄▄▃▃▃▃▁▂▁▃▂▂▆▅▅▄▅▅▅▄▂▃▃▃▂▂▂▂▃▂▃▂▃▃
metrics/train/LanguagePerplexity,█▆▄▆▃▅▂▂▂▂▂▂▁▁▁▂▁▁▄▃▃▃▃▃▃▂▁▂▂▂▁▁▁▁▂▁▂▁▂▁
time/batch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/batch_in_epoch,▁▁▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇██▁▁▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇██
time/epoch,▁▅█
time/sample,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/sample_in_epoch,▁▁▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇██▁▁▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇██

Run summary:
loss/train/total,6.86055
metrics/eval/LanguageCrossEntropy,6.97467
metrics/eval/LanguagePerplexity,1069.20068
metrics/train/LanguageCrossEntropy,6.86055
metrics/train/LanguagePerplexity,953.89563
time/batch,2280.0
time/batch_in_epoch,0.0
time/epoch,2.0
time/sample,145920.0
time/sample_in_epoch,0.0


After Cleaning:  0.1579868356098852


### Trial 2: Linear Warmup

In [15]:
# create model
model = GPT2LMHeadModel(config)
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(32, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [16]:
# convert model to a ComposerModel instance
from composer.metrics.nlp import LanguageCrossEntropy, LanguagePerplexity
from composer.models.huggingface import HuggingFaceModel

metrics = [
    LanguageCrossEntropy(ignore_index=-100),
    LanguagePerplexity(dist_sync_on_step=False)
]

# package as a Composer model
composer_model_linear = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)

In [17]:
# set trainer and train the model
from composer.trainer import Trainer
from composer.loggers import WandBLogger

current_date = datetime.now()
formatted_date = current_date.strftime('%m%d')
max_duration = 2

wandb_logger = WandBLogger(
    project='sequence-length-warmup', 
    entity='lucawangnfls',
    name=f'T2-{formatted_date}-{max_duration}e-{max_sequence_length}l')

In [19]:
from composer.algorithms import SeqLengthWarmup
seq_length_warmup = SeqLengthWarmup(duration=0.3,
                                    min_seq_length=2,
                                    max_seq_length=32,
                                    step_size=1,
                                    truncate=True,
                                    preserve_end_of_sequence=False)

trainer = Trainer(
    model=composer_model_linear,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    eval_interval="200ba",
    max_duration=max_duration,
    algorithms=[seq_length_warmup],
    log_to_console=True,
    progress_bar=False,
    console_log_interval='200ba',
    loggers=wandb_logger
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: lucawangnfls. Use `wandb login --relogin` to force relogin


In [20]:
trainer.fit()

******************************
Config:
composer_commit_hash: None
composer_version: 0.19.1
enabled_algorithms/SeqLengthWarmup: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 774115335

******************************
[epoch=1][batch=1/1140]:
	 Train time/epoch: 0
	 Train seq_length_warmup/curr_seq_len: 2
	 Train seq_length_warmup/curr_bs: 64
	 Train time/batch: 0
	 Train time/sample: 0
	 Train time/batch_in_epoch: 0
	 Train time/sample_in_epoch: 0
	 Train time/token: 0
	 Train time/token_in_epoch: 0
	 Train trainer/device_train_microbatch_size: 64
	 Train loss/train/total: 10.9504
	 Train metrics/train/LanguageCrossEntropy: 10.9504
	 Train metrics/train/LanguagePerplexity: 56979.0820
[epoch=1][batch=200/1140]:
	 Train seq_length_warmup/curr_seq_len: 11
	 Train seq_length_warmup/curr_bs: 64
	 Train time/batch: 199
	 Train time/sample: 12736
	 Train time/batch_in_epoch: 199
	 Train time/sample_in_epoch: 12736
	 Train

In [None]:
# clean-up GPU storage
import gc

print("Before Cleaning: ", check_cur_memory_percentage('cuda'))
model = composer_model_linear.model

del trainer
del train_dataloader
del composer_model_linear

gc.collect()
torch.cuda.empty_cache()
print("After Cleaning: ", check_cur_memory_percentage('cuda'))
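
The `curr_seq_len` values logged in Trial 2 can be reproduced with a few lines of arithmetic. The sketch below is an assumption chosen to match those logged values and the step computation shown in the Trial 3 subclass; it is not a copy of Composer's implementation, and `linear_seq_len` is a hypothetical name.

```python
# Sketch of a linear sequence-length schedule. The formula is an assumption
# that matches the values logged in Trial 2; it is not Composer's own code.
def linear_seq_len(batch, total_batches, duration=0.3,
                   min_seq_length=2, max_seq_length=32, step_size=1):
    num_warmup_steps = int(total_batches * duration)
    num_update_steps = (max_seq_length - min_seq_length) // step_size
    # max(1, ...) avoids a zero interval when the warmup is very short
    update_every_n_steps = max(1, num_warmup_steps // num_update_steps)
    seq_len = min_seq_length + step_size * (batch // update_every_n_steps)
    return min(max(seq_len, min_seq_length), max_seq_length)

total = 1140 * 2  # batches per epoch * epochs
for b in (0, 199, 2279):
    print(b, linear_seq_len(b, total))
```

With 2280 total batches and `duration=0.3`, the warmup spans 684 batches and the length grows by one token every 22 batches, which gives length 2 at batch 0 and length 11 at batch 199, as in the log.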

### Trial 3: Non-Linear Warmup

In [19]:
# create model
model = GPT2LMHeadModel(config)
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(32, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [20]:
# convert model to a ComposerModel instance
from composer.metrics.nlp import LanguageCrossEntropy, LanguagePerplexity
from composer.models.huggingface import HuggingFaceModel

metrics = [
    LanguageCrossEntropy(ignore_index=-100),
    LanguagePerplexity(dist_sync_on_step=False)
]

# package as a Composer model
composer_model_nonlinear = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)

In [21]:
# set trainer and train the model
from composer.trainer import Trainer
from composer.loggers import WandBLogger

current_date = datetime.now()
formatted_date = current_date.strftime('%m%d')
max_duration = 2

wandb_logger = WandBLogger(
    project='sequence-length-warmup', 
    entity='lucawangnfls',
    name=f'T3-{formatted_date}-{max_duration}e-{max_sequence_length}l')

#### Subclass Composer's SeqLengthWarmup to Enable Non-Linear Warmup

In [22]:
# user-defined subclass that overrides __init__ and apply() of composer.algorithms.SeqLengthWarmup
# the basic idea is to replace the linear warmup schedule with an arbitrary (possibly non-linear) curve
# the original min_seq_length and step_size arguments are removed; max_seq_length is kept
# a 'curve' argument is added: a sequence of sequence lengths to step through during warmup

import textwrap  # needed by the error messages in apply()
from typing import Optional, Sequence

from composer.algorithms.seq_length_warmup.seq_length_warmup import SeqLengthWarmup, set_batch_sequence_length
from composer.core import Event, State, TimeUnit
from composer.loggers import Logger

class MySeqLengthWarmup(SeqLengthWarmup):
    def __init__(
        self,
        duration: float = 0.3,
        max_seq_length: int = 32,
        curve: Sequence[int] = range(2, 32),
        truncate: bool = True,
        preserve_end_of_sequence: bool = False,
    ):
        self.duration = duration
        self.max_seq_length = max_seq_length
        self.curve = curve
        self.truncate = truncate
        self.preserve_end_of_sequence = preserve_end_of_sequence

        if self.duration < 0 or self.duration > 1:
            raise ValueError(f'Duration must be between 0 and 1, got: {self.duration}')
            
        self._activated = False
        self._original_model = None

    def apply(self, event: Event, state: State, logger: Logger) -> Optional[int]:
        if event == Event.INIT:
            if not isinstance(state.model, HuggingFaceModel):
                raise RuntimeError(
                    textwrap.dedent(
                        f"""\
                    {type(self).__name__} requires state.model to be of type {HuggingFaceModel.__name__}, not of type {type(state.model)}""",
                    ),
                )

            self._original_model = state.model
            return

        assert state.dataloader is not None, 'dataloader should be set on AFTER_DATALOADER'
        assert state.max_duration is not None, 'max_duration should be set on AFTER_DATALOADER'

        # in order to avoid OOMs, we do a forward and a backward pass on a dummy input.
        if not self._activated:
            self._activate_model(state, logger)

        if state.max_duration.unit == TimeUnit.EPOCH:
            if state.dataloader_len is None:
                raise RuntimeError('Sequential Length Warmup requires the dataloader to be sized.')
            num_optimization_steps = int(state.dataloader_len) * state.max_duration.value
        elif state.max_duration.unit == TimeUnit.BATCH:
            num_optimization_steps = state.max_duration.value
        else:
            raise NotImplementedError(
                textwrap.dedent(
                    """\
                    To use sequential length warmup, the max_duration must be in epochs or batches.
                    Specifying the `max_duration` in tokens or samples for use with sequential
                    length warmup will be supported in a future Composer release. See
                    https://github.com/mosaicml/composer/issues/226.""",
                ),
            )
        num_warmup_steps = int(num_optimization_steps * self.duration)  # in batches

        # spread the warmup steps evenly over the entries of the curve
        num_update_steps = len(self.curve)
        # guard against a zero interval when the curve has more entries than warmup steps
        update_every_n_steps = max(1, num_warmup_steps // num_update_steps)

        curve_idx = int(state.timestamp.batch) // update_every_n_steps

        if curve_idx >= num_update_steps:
            curr_seq_len = self.max_seq_length
        else:
            curr_seq_len = self.curve[curve_idx]

        state.batch = set_batch_sequence_length(state.batch, curr_seq_len, self.truncate, self.preserve_end_of_sequence)

        batch_size = state.batch['input_ids'].shape[0]
        logger.log_metrics({
            'seq_length_warmup/curr_seq_len': curr_seq_len,
            'seq_length_warmup/curr_bs': batch_size,
        })
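
The curve lookup in `apply()` can be examined without a trainer. The following sketch isolates that arithmetic in a plain function; `curve_seq_len` is a hypothetical name, and the `max(1, ...)` guard is included here so that a curve longer than the warmup cannot cause a division by zero.

```python
# Stand-alone version of the curve lookup, without any Composer state.
# `curve_seq_len` is a hypothetical helper for illustration.
def curve_seq_len(batch, total_batches, curve, max_seq_length, duration=0.3):
    num_warmup_steps = int(total_batches * duration)
    # each curve entry is held for this many batches (at least one)
    update_every_n_steps = max(1, num_warmup_steps // len(curve))
    idx = batch // update_every_n_steps
    # once the curve is exhausted, train at the full length
    return curve[idx] if idx < len(curve) else max_seq_length

total = 1140 * 2
print(curve_seq_len(199, total, range(2, 32), 32))      # the default curve
print(curve_seq_len(300, total, [2, 2, 4, 8, 16], 32))  # a made-up doubling curve
```

With the default `range(2, 32)` curve, each entry is held for 22 batches, so batch 199 maps to index 9 and a length of 11, the value logged in Trial 3.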

In [23]:
seq_length_warmup = MySeqLengthWarmup(duration=0.3,
                                      max_seq_length = 32,
                                      curve=range(2, 32),
                                      truncate=True,
                                      preserve_end_of_sequence=False)

trainer = Trainer(
    model=composer_model_nonlinear,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    eval_interval="200ba",
    max_duration=max_duration,
    algorithms=[seq_length_warmup],
    log_to_console=True,
    progress_bar=False,
    console_log_interval='200ba',
    loggers=wandb_logger
)

wandb: Currently logged in as: lucawangnfls. Use `wandb login --relogin` to force relogin


In [24]:
trainer.fit()

******************************
Config:
composer_commit_hash: None
composer_version: 0.19.1
enabled_algorithms/MySeqLengthWarmup: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 3617631979

******************************
[epoch=1][batch=1/1140]:
	 Train time/epoch: 0
	 Train seq_length_warmup/curr_seq_len: 2
	 Train seq_length_warmup/curr_bs: 64
	 Train time/batch: 0
	 Train time/sample: 0
	 Train time/batch_in_epoch: 0
	 Train time/sample_in_epoch: 0
	 Train time/token: 0
	 Train time/token_in_epoch: 0
	 Train trainer/device_train_microbatch_size: 64
	 Train loss/train/total: 10.8973
	 Train metrics/train/LanguageCrossEntropy: 10.8973
	 Train metrics/train/LanguagePerplexity: 54032.3867
[epoch=1][batch=200/1140]:
	 Train seq_length_warmup/curr_seq_len: 11
	 Train seq_length_warmup/curr_bs: 64
	 Train time/batch: 199
	 Train time/sample: 12736
	 Train time/batch_in_epoch: 199
	 Train time/sample_in_epoch: 12736
	 Tr

In [26]:
# clean-up GPU storage
import gc

print("Before Cleaning: ", check_cur_memory_percentage('cuda'))
model = composer_model_nonlinear.model

del trainer
del train_dataloader
del composer_model_nonlinear

gc.collect()
torch.cuda.empty_cache()
print("After Cleaning: ", check_cur_memory_percentage('cuda'))

Before Cleaning:  0.18996705085599708


Run history:
loss/train/total,█▅▃▃▃▃▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▂▁▂▁▂▁▂▂
metrics/eval/LanguageCrossEntropy,█▃▂▂▃▂▂▁▂▂▃▁
metrics/eval/LanguagePerplexity,█▂▂▁▂▂▁▁▁▂▂▁
metrics/train/LanguageCrossEntropy,█▅▃▃▃▃▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▂▁▂▁▂▁▂▂
metrics/train/LanguagePerplexity,█▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
seq_length_warmup/curr_bs,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
seq_length_warmup/curr_seq_len,▁▁▂▃▃▄▄▅▆▇▇█████████████████████████████
time/batch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/batch_in_epoch,▁▁▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇██▁▁▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇██
time/epoch,▁▅█

Run summary:
loss/train/total,7.35711
metrics/eval/LanguageCrossEntropy,7.50224
metrics/eval/LanguagePerplexity,1812.08936
metrics/train/LanguageCrossEntropy,7.35711
metrics/train/LanguagePerplexity,1567.2998
seq_length_warmup/curr_bs,64.0
seq_length_warmup/curr_seq_len,32.0
time/batch,2280.0
time/batch_in_epoch,0.0
time/epoch,2.0


After Cleaning:  0.18996705085599708
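
Note that the curve passed in Trial 3, `range(2, 32)`, is still a linear ramp, which is consistent with the logged `curr_seq_len` values matching Trial 2. A genuinely non-linear curve could be generated as in the sketch below. The helper `power_curve` and its parameters are hypothetical and were not part of the runs above.

```python
# Hypothetical helper: build a non-decreasing curve of `num_points` lengths
# that follows t**power between min_len and max_len (num_points must be >= 2).
def power_curve(min_len=2, max_len=32, num_points=30, power=2.0):
    curve = []
    for i in range(num_points):
        t = i / (num_points - 1)   # progress in [0, 1]
        curve.append(round(min_len + (max_len - min_len) * t ** power))
    return curve

slow_start = power_curve(power=2.0)   # grows slowly at first, then quickly
fast_start = power_curve(power=0.5)   # grows quickly at first, then levels off
print(slow_start)
print(fast_start)
```

Either list could then be passed as `curve=` to `MySeqLengthWarmup`, with `max_seq_length=32` taking over once the curve is exhausted.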
