# Task

This notebook contains a BitNet + ReLoRA experiment: a short pretraining run of a Mistral-architecture model on a subset of wikitext.

## Results

I did not manage to run it fully because of some Google Colab issues (**TODO: checkpoints and TensorBoard logs are also unavailable due to the same error**). However, I got the following training logs:

```
Step	Training Loss	Validation Loss	Memory Usage Mb
2000	5.049300	4.500881	8928
4000	4.113500	4.085669	8840
6000	3.948200	3.910413	8636
8000	4.600700	3.776079	10518
10000	3.124400	3.722620	9386
12000	4.122400	3.669651	8794
14000	3.781300	3.606225	8486
16000	3.858200	3.570461	8912
18000	3.319800	3.529242	8826
20000	2.763000	3.487872	9796
22000	2.137100	3.442672	9616
24000	3.097400	3.421924	8994
26000	2.706200	3.380465	9468
28000	3.897200	3.357405	10138
30000	3.217000	3.332875	9786
```
(out of these ~31Gb - ~10Gb taken by the forward pass + loss computation, so update process itself consumed ~3Gb)
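
The losses in the table can also be read as perplexities. This is a minimal sketch, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training). The values are copied from the table above and the helper name `perplexity` is only illustrative.

```python
import math

# Validation losses copied from the table above (step -> loss).
val_loss = {2000: 4.500881, 10000: 3.722620, 20000: 3.487872, 30000: 3.332875}


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


for step, loss in val_loss.items():
    print(f"step {step:>5}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```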

The run used the following config:
```
MistralConfig(
    vocab_size=32000,
    hidden_size=4160, # Original Mistral has 4096; 4160 is the closest multiple of both 5 and 32
    intermediate_size=14400, # Original Mistral has 14336; 14400 is the closest multiple of both 5 and 32
    num_hidden_layers=5, # Instead of 32, to keep the model at roughly 1 billion params
    num_attention_heads=32,
    num_key_value_heads=8,
    hidden_act="silu",
    max_position_embeddings=32768,
    initializer_range=0.02,
    rms_norm_eps=1e-5,
    use_cache=True,
    rope_theta=10000.0,
    sliding_window=4096,
    attention_dropout=0.0,
)
```
- Schedule - linear warmup, 1000 warmup steps, 50000 decay steps
- Optimizer - AdamW
- Peak LR = 2e-4 (`MAX_LR` in the code below)
- Batch size=16 (4 actual batch size, 4 accumulation steps)
- LoRA rank = 128
- ReLoRA restarts performed every 2000 steps
- ReLoRA warmup takes 100 steps
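
The schedule bullets above can be pictured as a learning-rate multiplier that follows a base warmup/decay curve and drops to zero at every ReLoRA restart before ramping back up. The sketch below is an illustration under stated assumptions, not the `bitlinear` implementation: it assumes the base schedule decays linearly to zero over the 50000 steps that follow warmup, and that each re-warmup is linear. The function names `base_factor` and `relora_factor` are hypothetical.

```python
def base_factor(step: int, warmup: int = 1000, decay: int = 50000) -> float:
    """Linear warmup to 1.0, then linear decay towards 0 (assumed shape)."""
    if step < warmup:
        return step / warmup
    return max(0.0, 1.0 - (step - warmup) / decay)


def relora_factor(step: int, reset_every: int = 2000, rewarmup: int = 100) -> float:
    """Base schedule with a short linear re-warmup after every restart."""
    since_reset = step % reset_every
    if step >= reset_every and since_reset < rewarmup:
        return base_factor(step) * since_reset / rewarmup
    return base_factor(step)


# The actual learning rate would be the peak LR times this multiplier.
for step in (0, 500, 1000, 1999, 2000, 2050, 2100, 30000):
    print(f"step {step:>5}: multiplier {relora_factor(step):.4f}")
```

The multiplier hits zero exactly at steps 2000, 4000, and so on, which is where the adapters would be merged and re-initialised.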

## Implementation

In [1]:
"""
Install the following libraries:
- bitlinear from https://github.com/alex4321/bitlinear.git
- flash attention 2 (pip install flash-attn --no-build-isolation)
- datasets
"""
!pip install bitlinear@git+https://github.com/alex4321/bitlinear.git \
    flash-attn --no-build-isolation \
    datasets \
    accelerate \
    bitsandbytes

Collecting bitlinear@ git+https://github.com/alex4321/bitlinear.git
  Cloning https://github.com/alex4321/bitlinear.git to /tmp/pip-install-hsplho9t/bitlinear_19af0540fa064d7b8e287f9adc6808cd
  Running command git clone --filter=blob:none --quiet https://github.com/alex4321/bitlinear.git /tmp/pip-install-hsplho9t/bitlinear_19af0540fa064d7b8e287f9adc6808cd
  Resolved https://github.com/alex4321/bitlinear.git to commit 5731f7f3c171051f8398dcbca4901dcb52654096
  Preparing metadata (setup.py) ... done


In [2]:
from google.colab import drive
from google.colab import runtime

drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!ls /content/drive/MyDrive/BitLinear-experiments/

bitmistral-training-lr-1e-4-rank-128.ipynb
bitmistral-training-lr-2e-4-rank-128--2000-restart.ipynb
bitmistral-training-lr-2e-4-rank-128-cosine.ipynb
bitmistral-training-lr-2e-4-rank-128.ipynb
bitmistral-training-lr-2e-4-rank-128--restarts-2000.ipynb
bitmistral-training-lr-5e-4-rank-128.ipynb
bitmistral-training-lr-5e-5-rank-128.ipynb
mistral-training.ipynb


In [4]:
import os
import datasets
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer, \
    TrainerCallback
from transformers.models.mistral import MistralConfig
from bitlinear.adapters import LoRAAdapter
from bitlinear.models.mistral import BitMistralForCausalLM
from bitlinear.relora import ReLoRAOptimizer, ReLoRASchedulerLambda, LinearWarmupSchedule
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
import subprocess
import torch

import warnings
warnings.filterwarnings("ignore")

In [5]:
CHECKPOINT_DIR = "mistral-2b--lr-2e-4--rank-128--2000-restart--checkpoint"
TENSORBOARD_DIR = "mistral-2b--lr-2e-4--rank-128--2000-restart--tensorboard"

MAX_LR = 2e-4
LORA_RANK = 128
RESET_STEPS = 2000
REWARMUP_STEPS = 100

In [6]:
STORE_DIR = "StoredWeights"

os.makedirs(STORE_DIR, exist_ok=True)

config = MistralConfig(
    vocab_size=32000,
    hidden_size=4160, # Original Mistral has 4096; 4160 is the closest multiple of both 5 and 32
    intermediate_size=14400, # Original Mistral has 14336; 14400 is the closest multiple of both 5 and 32
    num_hidden_layers=5, # Instead of 32, to keep the model at roughly 1 billion params
    num_attention_heads=32,
    num_key_value_heads=8,
    hidden_act="silu",
    max_position_embeddings=32768,
    initializer_range=0.02,
    rms_norm_eps=1e-5,
    use_cache=True,
    rope_theta=10000.0,
    sliding_window=4096,
    attention_dropout=0.0,
)
model = BitMistralForCausalLM(
    config=config,
    fname_prefix=f"{STORE_DIR}/bitmistal"
).to("cuda:0")
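
A rough parameter count for this config can be worked out by hand. The sketch below assumes the standard Mistral layout (no biases in the linear layers, untied input and output embeddings) and ignores anything `BitMistralForCausalLM` may add or store differently, so treat the result as an estimate only.

```python
hidden, inter, layers, vocab = 4160, 14400, 5, 32000
heads, kv_heads = 32, 8

head_dim = hidden // heads        # 130
kv_dim = kv_heads * head_dim      # 1040, shared by k_proj and v_proj outputs

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * inter                          # gate_proj, up_proj, down_proj
norms = 2 * hidden                                # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

embeddings = 2 * vocab * hidden                   # input embeddings + untied lm_head
total = layers * per_layer + embeddings + hidden  # + final RMSNorm

print(f"decoder layers: {layers * per_layer / 1e9:.2f}B")
print(f"total:          {total / 1e9:.2f}B")
```

Under these assumptions the decoder layers hold about 1.11B parameters and the whole model about 1.38B, which is in line with the "roughly 1 billion" target in the config comment.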

In [7]:
model.add_adapters(
    LoRAAdapter,
    {
        "lora_rank": LORA_RANK
    }
)

[LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter(),
 LoRAAdapter()]
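
The 35 adapters printed above match seven linear projections in each of the five decoder layers. Assuming that is where they are attached, and assuming the usual LoRA shapes (`A` is rank × in_features, `B` is out_features × rank), the number of adapter parameters can be estimated as follows. The projection names are the standard Mistral ones and are an assumption about what `add_adapters` wraps.

```python
hidden, inter, layers, rank = 4160, 14400, 5, 128
kv_dim = 8 * (hidden // 32)  # 8 KV heads of size 130

# (in_features, out_features) of the projections assumed to carry an adapter.
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

n_adapters = layers * len(projections)
# Each adapter holds rank * in_features (A) plus out_features * rank (B) weights.
lora_params = layers * sum(rank * (i + o) for i, o in projections.values())

print(f"{n_adapters} adapters, {lora_params / 1e6:.1f}M LoRA parameters")
```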

In [8]:
optimizer = ReLoRAOptimizer(
    model.parameters(),
    model.mergeable_layers(),
    optimizer_cls=AdamW,
    optimizer_params={},
    reset_n_steps=RESET_STEPS,
    lr=MAX_LR,
)
lr_scheduler = LambdaLR(
    optimizer,
    ReLoRASchedulerLambda(
        lr_lambda=LinearWarmupSchedule(1000, 50000),
        warmup_n_steps=REWARMUP_STEPS,
        reset_n_steps=RESET_STEPS,
    )
)
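
For intuition about what a restart every `RESET_STEPS` steps means, here is a toy merge-and-reset on a single weight matrix in plain Python. It is a sketch of the general ReLoRA idea only: it ignores BitNet quantization and optimizer-state handling, and it is not how `ReLoRAOptimizer` or `mergeable_layers()` are implemented. All helper names are hypothetical.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def effective_weight(w, lora_b, lora_a):
    """Weight seen by the forward pass: W + B @ A."""
    return add(w, matmul(lora_b, lora_a))


def merge_and_reset(w, lora_b, lora_a, rng):
    """Fold B @ A into W, then re-initialise A randomly and B with zeros."""
    merged = effective_weight(w, lora_b, lora_a)
    rank, d_in, d_out = len(lora_a), len(lora_a[0]), len(lora_b)
    new_a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    new_b = [[0.0] * rank for _ in range(d_out)]
    return merged, new_b, new_a


rng = random.Random(0)
d_out, d_in, rank = 3, 4, 2
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]
b = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(d_out)]

before = effective_weight(w, b, a)
w, b, a = merge_and_reset(w, b, a, rng)
after = effective_weight(w, b, a)

# Because the new B is all zeros, the restart does not change the layer's output.
max_diff = max(abs(x - y) for rb, ra in zip(before, after) for x, y in zip(rb, ra))
print(f"max difference after merge: {max_diff}")
```

The point of the zero-initialised `B` is that the model computes the same function right after a restart, while the next low-rank update starts from a fresh subspace.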

In [9]:
dataset_text = datasets.load_dataset("wikitext", "wikitext-103-v1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id

In [10]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=1024)

# Tokenize all parts of the dataset
tokenized_datasets = dataset_text.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3760
    })
})

In [11]:
data_collator = DataCollatorForLanguageModeling(tokenizer,
                                                mlm=False,
                                                pad_to_multiple_of=8)
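
What the collator does with `mlm=False` and `pad_to_multiple_of=8` can be sketched in plain Python: sequences are padded to a common length rounded up to a multiple of 8, and the labels are a copy of the inputs with pad positions set to -100 so the loss skips them. The function below is a simplified stand-in, not the Hugging Face implementation, and it assumes right padding. Since `pad_token_id` was set to `eos_token_id` earlier, any genuine EOS tokens in the data would be masked out of the labels as well.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def collate(batch, pad_id, multiple=8):
    """Right-pad each sequence to a shared length that is a multiple of `multiple`."""
    longest = max(len(ids) for ids in batch)
    target = -(-longest // multiple) * multiple  # round up to the next multiple
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    # Labels copy the inputs, with every pad-id position excluded from the loss.
    labels = [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate([[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]], pad_id=2)
print(len(batch["input_ids"][0]), batch["labels"][0])
```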

In [12]:
class GpuMemoryLoggingCallback(TrainerCallback):
    """A custom callback for logging GPU memory usage."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Check if CUDA is available to avoid errors on CPU-only environments
        if torch.cuda.is_available():
            # Assuming a single-GPU setup here; adjust for multi-GPU as needed
            result = subprocess.run(['nvidia-smi', '--query-gpu=memory.used', '--format=csv,nounits,noheader'],
                                    capture_output=True, text=True)
            memory_usage = result.stdout.strip()

            # Convert memory usage to an integer (MB) and log it
            logs['gpu_memory_usage_mb'] = int(memory_usage)
        else:
            logs['gpu_memory_usage_mb'] = 0  # Default to 0 if not using GPU
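
The callback above calls `int()` on the whole `nvidia-smi` output, which works because there is a single GPU. With several GPUs the same query prints one number per line, so a small parser like the following could be used instead. This is only a sketch; the helper name `parse_memory_used` is hypothetical.

```python
def parse_memory_used(stdout: str) -> list[int]:
    """Return used memory in MiB for each GPU line of the nvidia-smi CSV output."""
    return [int(line.strip()) for line in stdout.splitlines() if line.strip()]


single = parse_memory_used("2020\n")
multi = parse_memory_used("2020\n8928\n")
print(single, multi, sum(multi))
```

Inside `on_log`, one could then log `sum(...)` or the per-GPU values, depending on what is more useful for the experiment.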

In [13]:
def test_memory_callback():
    logs = {}
    GpuMemoryLoggingCallback().on_log(None, None, None, logs)

    !nvidia-smi
    print("")
    print(f"USED {logs['gpu_memory_usage_mb']} MB")


test_memory_callback()

Fri Apr 12 12:13:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              41W / 300W |   2020MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [14]:
%load_ext tensorboard


model.train()
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=4,
    eval_accumulation_steps=4,
    logging_dir=TENSORBOARD_DIR,
    logging_steps=1,
    save_strategy="steps",
    save_steps=2000,
    evaluation_strategy="steps",
    eval_steps=2000,
    fp16=True,
    gradient_checkpointing=True,
    report_to="tensorboard",
    max_steps=30000,  # 30000 steps for a fuller run
    # No need to specify data collator here, it's passed to the Trainer constructor
)

# Initialize the Trainer with the data collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],  # Already tokenized in the cell above
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    optimizers=(optimizer, lr_scheduler),
    callbacks=[GpuMemoryLoggingCallback()],
)

# Train
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss,Memory Usage Mb
2000,5.0493,4.500881,8928
4000,4.1135,4.085669,8840
6000,3.9482,3.910413,8636
8000,4.6007,3.776079,10518
10000,3.1244,3.72262,9386
12000,4.1224,3.669651,8794
14000,3.7813,3.606225,8486
16000,3.8582,3.570461,8912
18000,3.3198,3.529242,8826
20000,2.763,3.487872,9796


TrainOutput(global_step=30000, training_loss=3.874628075146675, metrics={'train_runtime': 27295.3677, 'train_samples_per_second': 17.585, 'train_steps_per_second': 1.099, 'total_flos': 2.2902183717227712e+17, 'train_loss': 3.874628075146675, 'epoch': 0.27, 'gpu_memory_usage_mb': 9786})
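
The numbers in `TrainOutput` can be cross-checked against the training arguments. The arithmetic below assumes a single GPU, so the effective batch size is simply the per-device batch size times the accumulation steps.

```python
per_device_batch, grad_accum, max_steps = 4, 4, 30000
train_rows = 1801350    # size of the train split shown in the DatasetDict output
runtime_s = 27295.3677  # 'train_runtime' from TrainOutput

effective_batch = per_device_batch * grad_accum  # single GPU assumed
samples_seen = effective_batch * max_steps

print(f"effective batch: {effective_batch}")
print(f"epochs:          {samples_seen / train_rows:.2f}")
print(f"samples/s:       {samples_seen / runtime_s:.3f}")
print(f"hours:           {runtime_s / 3600:.1f}")
```

This reproduces the reported `epoch` of about 0.27 and `train_samples_per_second` of about 17.585, so the run saw roughly a quarter of the train split.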

In [None]:
!cp -r {CHECKPOINT_DIR}  /content/drive/MyDrive/BitLinear-experiments/{CHECKPOINT_DIR}
!cp -r {TENSORBOARD_DIR} /content/drive/MyDrive/BitLinear-experiments/{TENSORBOARD_DIR}
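
Since the checkpoints were lost in this run, copying them to Drive while training is still in progress may be safer than a single copy at the end. Below is a minimal sketch of such a backup helper built on `shutil.copytree`. The function name `backup_dir` is hypothetical, and the demo uses temporary directories instead of the real Drive path.

```python
import shutil
import tempfile
from pathlib import Path


def backup_dir(src: str, dest_root: str) -> Path:
    """Copy `src` into `dest_root`, overwriting files from earlier backups."""
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"nothing to back up at {src_path}")
    dest = Path(dest_root) / src_path.name
    shutil.copytree(src_path, dest, dirs_exist_ok=True)
    return dest


# Demo on throwaway directories; on Colab the destination would be the mounted Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "run" / "checkpoint-2000"
    ckpt.mkdir(parents=True)
    (ckpt / "weights.bin").write_bytes(b"\x00")
    copied = backup_dir(str(Path(tmp) / "run"), str(Path(tmp) / "drive"))
    copied_ok = (copied / "checkpoint-2000" / "weights.bin").exists()

print(copied_ok)
```

In the notebook this could be called as `backup_dir(CHECKPOINT_DIR, "/content/drive/MyDrive/BitLinear-experiments")`, for example after each save.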

In [None]:
runtime.unassign()