# Task

This notebook contains a reference experiment: a short pretraining run of a Mistral-architecture model on a subset of WikiText.

## Results

I did not manage to run it to completion because of some Google Colab issues (**TODO: checkpoints and TensorBoard logs are also unavailable because of the same error**); however, I got the following training logs:

```
Step	Training Loss	Validation Loss	Memory Usage Mb
2000	5.178500	4.655300	29107
4000	4.267300	4.253386	29541
6000	3.997500	4.030161	30021
8000	4.739200	3.861828	30521
10000	3.159500	3.761141	30521
12000	4.065400	3.672445	30521
14000	3.764200	3.598749	30521
16000	3.897100	3.530349	30521
18000	3.261500	3.468710	30521
20000	2.736200	3.411213	30521
22000	2.150800	3.359339	30521
24000	2.949400	3.317924	30521
```
(Of these ~31 GB, ~7 GB were taken by the forward pass and loss computation, so the update process itself consumed ~24 GB.)
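
To make the validation losses easier to compare with other language-model results, they can be converted to perplexity. The snippet below is a minimal sketch that assumes the reported validation loss is the mean per-token cross-entropy in nats, which is what the Hugging Face causal LM loss normally returns; the dictionary name is mine and the values are copied from the table above.

```python
import math

# Validation losses copied from the log table (assumed: mean cross-entropy in nats per token)
validation_loss = {2000: 4.655300, 12000: 3.672445, 24000: 3.317924}

# Perplexity is the exponential of the mean per-token cross-entropy
perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>5}: validation perplexity ~ {ppl:.1f}")
```

These numbers are per token of the Mistral tokenizer, so they are not directly comparable with word-level WikiText perplexities.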

The model config was as follows:
```
MistralConfig(
    vocab_size=32000,
    hidden_size=4160, # Original Mistral has 4096; this is the closest multiple of both 5 and 32
    intermediate_size=14400, # Original Mistral has 14336; this is the closest multiple of both 5 and 32
    num_hidden_layers=5, # Instead of 32, to make the model roughly 1 billion params
    num_attention_heads=32,
    num_key_value_heads=8,
    hidden_act="silu",
    max_position_embeddings=32768,
    initializer_range=0.02,
    rms_norm_eps=1e-5,
    use_cache=True,
    rope_theta=10000.0,
    sliding_window=4096,
    attention_dropout=0.0,
)
```
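
The "roughly 1-billion params" comment can be checked by hand. The sketch below counts parameters under the assumption that the model follows the usual Hugging Face Mistral layout: bias-free `q/k/v/o` projections with grouped-query attention, a three-matrix gated MLP, two RMSNorm weights per layer, a final norm, and an output head that is not tied to the input embeddings. The function name and its `tie_word_embeddings` argument are illustrative, not part of the notebook.

```python
def mistral_param_count(vocab_size, hidden_size, intermediate_size,
                        num_hidden_layers, num_attention_heads,
                        num_key_value_heads, tie_word_embeddings=False):
    """Hypothetical helper: parameter count for a Mistral-style decoder."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = head_dim * num_key_value_heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate_proj, up_proj and down_proj
    mlp = 3 * hidden_size * intermediate_size
    # input and post-attention RMSNorm weights
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms
    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return num_hidden_layers * per_layer + embeddings + lm_head + final_norm


total = mistral_param_count(
    vocab_size=32000, hidden_size=4160, intermediate_size=14400,
    num_hidden_layers=5, num_attention_heads=32, num_key_value_heads=8,
)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the count comes out at about 1.38 billion, of which roughly 0.27 billion sit in the input embeddings and the output head.
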
- Schedule - linear warmup, 1000 warmup steps, 50000 decay steps
- Optimizer - AdamW
- LR = 5e-5
- Batch size = 16 (per-device batch size 4 × 4 gradient accumulation steps)
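
The hyperparameters above can be sketched as a learning-rate multiplier of the kind `LambdaLR` expects. This is only an illustration: I am assuming a linear ramp from 0 to 1 over the warmup steps followed by a linear decay to 0 over the next 50000 steps, and `linear_warmup_decay` is a hypothetical stand-in, not the actual `LinearWarmupSchedule` from `bitlinear`, whose exact shape may differ.

```python
def linear_warmup_decay(step: int, warmup_steps: int = 1000, decay_steps: int = 50000) -> float:
    """Hypothetical LR multiplier: linear warmup to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / warmup_steps
    # Assumption: the decay lasts `decay_steps` steps counted from the end of warmup
    progress = (step - warmup_steps) / decay_steps
    return max(0.0, 1.0 - progress)


MAX_LR = 5e-5
PER_DEVICE_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4

# Number of sequences contributing to each optimizer update
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
print(f"effective batch size: {effective_batch_size}")

for step in (0, 500, 1000, 26000, 51000):
    print(f"step {step:>5}: lr = {MAX_LR * linear_warmup_decay(step):.2e}")
```

If the schedule really has this shape, a run capped at 30000 steps never reaches the end of the decay, so the learning rate is still well above zero when training stops.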

## Implementation

In [1]:
"""
Install the following libraries:
- bitlinear from https://github.com/alex4321/bitlinear.git
- flash attention 2 (pip install flash-attn --no-build-isolation)
- datasets
- accelerate
- bitsandbytes
"""
!pip install bitlinear@git+https://github.com/alex4321/bitlinear.git \
    flash-attn --no-build-isolation \
    datasets \
    accelerate \
    bitsandbytes

Collecting bitlinear@ git+https://github.com/alex4321/bitlinear.git
  Cloning https://github.com/alex4321/bitlinear.git to /tmp/pip-install-vqe_764x/bitlinear_31639637fa4b4686be752936d87cffd7
  Running command git clone --filter=blob:none --quiet https://github.com/alex4321/bitlinear.git /tmp/pip-install-vqe_764x/bitlinear_31639637fa4b4686be752936d87cffd7
  Resolved https://github.com/alex4321/bitlinear.git to commit 5731f7f3c171051f8398dcbca4901dcb52654096
  Preparing metadata (setup.py) ... done


In [2]:
from google.colab import drive
from google.colab import runtime

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
CHECKPOINT_DIR = "basic--mistral-2b--lr-5e-5--checkpoint"
TENSORBOARD_DIR = "basic--mistral-2b--lr-5e-5--tensorboard"

MAX_LR = 5e-5

In [4]:
import os
import datasets
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer, \
    TrainerCallback
from transformers.models.mistral import MistralConfig, MistralForCausalLM
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
import subprocess
import torch
from bitlinear.relora import LinearWarmupSchedule

import warnings
warnings.filterwarnings("ignore")

In [None]:
config = MistralConfig(
    vocab_size=32000,
    hidden_size=4160, # Original Mistral has 4096; this is the closest multiple of both 5 and 32
    intermediate_size=14400, # Original Mistral has 14336; this is the closest multiple of both 5 and 32
    num_hidden_layers=5, # Instead of 32, to make the model roughly 1 billion params
    num_attention_heads=32,
    num_key_value_heads=8,
    hidden_act="silu",
    max_position_embeddings=32768,
    initializer_range=0.02,
    rms_norm_eps=1e-5,
    use_cache=True,
    rope_theta=10000.0,
    sliding_window=4096,
    attention_dropout=0.0,
)
model = MistralForCausalLM(
    config=config,
).to("cuda:0")

In [7]:
optimizer = AdamW(
    model.parameters(),
    lr=MAX_LR,
)
lr_scheduler = LambdaLR(
    optimizer,
    lr_lambda=LinearWarmupSchedule(1000, 50000),
)

In [8]:
dataset_text = datasets.load_dataset("wikitext", "wikitext-103-v1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id

In [9]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=1024)

# Tokenize all parts of the dataset
tokenized_datasets = dataset_text.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3760
    })
})

In [10]:
data_collator = DataCollatorForLanguageModeling(tokenizer,
                                                mlm=False,
                                                pad_to_multiple_of=8)

In [11]:
class GpuMemoryLoggingCallback(TrainerCallback):
    """A custom callback that adds GPU memory usage to the Trainer logs."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Nothing to update if no logs dict was passed in
        if logs is None:
            return
        # Check if CUDA is available to avoid errors on CPU-only environments
        if torch.cuda.is_available():
            # nvidia-smi prints one line per GPU; a single-GPU setup is assumed, so only the first line is read
            result = subprocess.run(['nvidia-smi', '--query-gpu=memory.used', '--format=csv,nounits,noheader'],
                                    capture_output=True, text=True)
            memory_usage = result.stdout.strip().splitlines()[0]

            # Convert memory usage to an integer (MiB as reported by nvidia-smi) and log it
            logs['gpu_memory_usage_mb'] = int(memory_usage)
        else:
            logs['gpu_memory_usage_mb'] = 0  # Default to 0 if not using GPU

In [12]:
def test_memory_callback():
    logs = {}
    GpuMemoryLoggingCallback().on_log(None, None, None, logs)

    !nvidia-smi
    print("")
    print(f"USED {logs['gpu_memory_usage_mb']} MB")


test_memory_callback()

Thu Apr 11 14:40:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              55W / 400W |   5931MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [13]:
%load_ext tensorboard


model.train()
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=4,
    eval_accumulation_steps=4,
    logging_dir=TENSORBOARD_DIR,
    logging_steps=1,
    save_strategy="steps",
    save_steps=2000,
    evaluation_strategy="steps",
    eval_steps=2000,
    fp16=True,
    gradient_checkpointing=True,
    report_to="tensorboard",
    max_steps=30000,
    # No need to specify data collator here, it's passed to the Trainer constructor
)

# Initialize the Trainer with the data collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # Already tokenized by tokenize_function
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    optimizers=(optimizer, lr_scheduler),
    callbacks=[GpuMemoryLoggingCallback()],
)

# Train
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss,Memory Usage Mb
2000,5.1785,4.6553,29107
4000,4.2673,4.253386,29541
6000,3.9975,4.030161,30021
8000,4.7392,3.861828,30521
10000,3.1595,3.761141,30521
12000,4.0654,3.672445,30521
14000,3.7642,3.598749,30521
16000,3.8971,3.530349,30521
18000,3.2615,3.46871,30521
20000,2.7362,3.411213,30521


SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
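
The error above says the local disk filled up while a checkpoint was being written. A rough estimate of how much space the checkpoints need is sketched below. It rests on several assumptions that the logs do not confirm: about 1.38 billion parameters (a by-hand count for the 5-layer config with untied embeddings), fp32 weights (4 bytes per parameter), two fp32 AdamW moment buffers (8 bytes per parameter), and every checkpoint being kept because `save_total_limit` was not set. The helper name is hypothetical.

```python
def checkpoint_size_gb(num_params: int, weight_bytes: int = 4, optimizer_bytes: int = 8) -> float:
    """Hypothetical estimate of one checkpoint's size in GB (weights + AdamW state)."""
    return num_params * (weight_bytes + optimizer_bytes) / 1e9


NUM_PARAMS = 1_381_165_760  # assumption: by-hand parameter count for the config used here
SAVE_STEPS = 2000           # matches save_steps in TrainingArguments

per_checkpoint = checkpoint_size_gb(NUM_PARAMS)
print(f"one checkpoint: ~{per_checkpoint:.1f} GB")

for step in (2000, 10000, 20000):
    kept = step // SAVE_STEPS  # checkpoints on disk if none are deleted
    print(f"step {step:>5}: {kept:>2} checkpoints, ~{kept * per_checkpoint:.0f} GB")
```

If the estimate is in the right ballpark, passing `save_total_limit=2` to `TrainingArguments`, or pointing `output_dir` at the mounted Drive folder, would probably have kept the run within the local disk quota.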

In [1]:
!cp -r {CHECKPOINT_DIR}  /content/drive/MyDrive/BitLinear-experiments/{CHECKPOINT_DIR}
!cp -r {TENSORBOARD_DIR} /content/drive/MyDrive/BitLinear-experiments/{TENSORBOARD_DIR}

cp: cannot stat '{CHECKPOINT_DIR}': No such file or directory
cp: cannot stat '{TENSORBOARD_DIR}': No such file or directory
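
The `cp` errors show the literal text `{CHECKPOINT_DIR}`, which most likely means the cell ran in a fresh kernel (note the `In [1]` counter) where the variables were no longer defined, so IPython left the braces unexpanded. A pure-Python copy fails more visibly in that situation. The helper below is a sketch with a hypothetical name; it uses only `shutil.copytree` from the standard library (the `dirs_exist_ok` argument needs Python 3.8+).

```python
import shutil
from pathlib import Path


def copy_dir_if_exists(src: str, dst_root: str) -> bool:
    """Copy the directory `src` into `dst_root`; return False if `src` does not exist."""
    src_path = Path(src)
    if not src_path.is_dir():
        print(f"Skipping {src!r}: directory not found")
        return False
    dst_path = Path(dst_root) / src_path.name
    # Intermediate directories are created as needed; existing files are overwritten
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return True
```

Calling it as `copy_dir_if_exists(CHECKPOINT_DIR, "/content/drive/MyDrive/BitLinear-experiments")` raises a `NameError` right away if `CHECKPOINT_DIR` is undefined, instead of silently passing the placeholder to `cp`.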


In [None]:
runtime.unassign()