# Training with LoRA

We'll fine-tune a selected model with LoRA on the LCC_CSharp dataset, which has been adapted for the in-filling, or fill-in-the-middle (FIM), task.

We'll then evaluate the model on the MultiPL-E benchmark, a multi-language translation of the HumanEval benchmark. We'll focus solely on the C# problems in the benchmark, as our goal is to create a competent C# generative LLM.
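
To make the FIM task concrete, here is a minimal sketch of how a piece of source code can be cut into a prefix, a middle, and a suffix, and how the prompt is assembled from them. The `<fim_prefix>`, `<fim_suffix>` and `<fim_middle>` markers are the ones used later in this notebook; the helper names `split_for_fim` and `build_fim_prompt` are hypothetical and exist only for illustration.

```python
def split_for_fim(code: str, start: int, end: int) -> tuple[str, str, str]:
    """Cut `code` into (prefix, middle, suffix) at character offsets start/end."""
    return code[:start], code[start:end], code[end:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt; the model is expected to generate the middle."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


snippet = "int Add(int a, int b) { return a + b; }"
# Hide the expression between "return " and ";" so the model has to fill it in.
start = snippet.index("return ") + len("return ")
end = snippet.index(";")
prefix, middle, suffix = split_for_fim(snippet, start, end)

print(build_fim_prompt(prefix, suffix))
print("expected middle:", middle)
```

The model sees everything except the middle and is asked to produce it after the `<fim_middle>` marker.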

In [1]:
%env TOKENIZERS_PARALLELISM=False

env: TOKENIZERS_PARALLELISM=False


In [2]:
%pip install transformers==4.36.2 accelerate==0.26.1 evaluate datasets peft==0.7.1 bitsandbytes trl python-dotenv wandb -qU

Note: you may need to restart the kernel to use updated packages.


In [3]:
!apt update
!apt install -y mono-devel
!ln -s /usr/bin/mono-csc /usr/bin/csc

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:2 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:3 https://deb.nodesource.com/node_16.x focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease              
Hit:5 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Hit:7 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Fetched 114 kB in 1s (145 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
169 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
mono-devel is already the newest version (6.8.0.105+dfsg-2).
0 upgraded, 0 newly installed, 0 to remove and 169 not upgraded.
ln: failed to create symbolic link '/usr/bin/csc

In [4]:
import transformers
import accelerate
import peft
import torch

print(f"Transformers version: {transformers.__version__}")
print(f"Accelerate version: {accelerate.__version__}")
print(f"PEFT version: {peft.__version__}")
print(f"PyTorch version: {torch.__version__}")

Transformers version: 4.36.2
Accelerate version: 0.26.1
PEFT version: 0.7.1
PyTorch version: 2.1.2+cu121


In [5]:
%reload_ext dotenv
%dotenv 

In [6]:
import os
os.environ["WANDB_PROJECT"] = "csharp-stable-code"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

In [7]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig
from peft import LoraConfig
import evaluate
import re
from transformers import TrainingArguments
from datetime import datetime
from random import randint
from datasets import load_dataset


model_id="stabilityai/stable-code-3b"

In [8]:
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_compute_dtype=torch.float16,
  bnb_4bit_quant_type="nf4",
  # llm_int8_enable_fp32_cpu_offload=True
)

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  trust_remote_code=True,
  # device_map=device_map,
  # offload_folder="offload",
  # offload_state_dict = True,
  # torch_dtype=torch.float16,
  quantization_config=quantization_config
  )

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

lora_config = LoraConfig(
  r=8,
  target_modules=[
    "q_proj",
    "o_proj",
    "k_proj",
    "v_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
  ],
  bias="none",
  task_type="CAUSAL_LM"
)

model.add_adapter(lora_config)
model.config.use_cache = False
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant":False})

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
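
As a rough guide to what `r=8` means in the LoRA configuration above: for each targeted linear layer with input size `d_in` and output size `d_out`, LoRA trains two small matrices of shapes `r x d_in` and `d_out x r` instead of the full weight. The sketch below counts those parameters. The layer size of 2560 is a placeholder chosen for illustration, not a value read from the model config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (matrices A and B)."""
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix of the same layer (no bias)."""
    return d_in * d_out


# Hypothetical square projection; substitute the real sizes from model.config.
d_in = d_out = 2560
r = 8

added = lora_param_count(d_in, d_out, r)
frozen = full_param_count(d_in, d_out)
print(f"LoRA adds {added} trainable parameters "
      f"({added / frozen:.4%} of the {frozen} frozen ones)")
```

Summing this over every module listed in `target_modules` and every layer gives the total number of trainable parameters for the adapter.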


In [9]:
# tokenizer.add_tokens(['<fim_prefix>','<fim_suffix>','<fim_middle>'])
# model.resize_token_embeddings(len(tokenizer))

In [10]:
shards=100
# from datasets import load_from_disk
# train_dataset=load_from_disk('./train_dataset/')
# eval_dataset=load_from_disk('./eval_dataset/')
raw_dataset=load_dataset("fasterinnerlooper/lcc_csharp")['train']
raw_dataset=raw_dataset.train_test_split()
train_dataset=raw_dataset['train'].shard(shards, randint(1, shards-1))
eval_dataset=raw_dataset['test'].shard(shards,randint(1, shards-1))
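
Note that `randint(1, shards-1)` picks a different shard on every run and never picks shard 0, so two runs of this notebook generally train on different data. If reproducibility matters, one option is to derive the shard index from a fixed seed. This is a sketch only; `pick_shard_index` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def pick_shard_index(num_shards: int, seed: int) -> int:
    """Return a shard index in [0, num_shards) that is stable for a given seed."""
    return random.Random(seed).randrange(num_shards)


shards = 100
train_index = pick_shard_index(shards, seed=42)
eval_index = pick_shard_index(shards, seed=43)
print(train_index, eval_index)

# The indices would then be used in place of the randint calls, e.g.:
# train_dataset = raw_dataset['train'].shard(shards, train_index)
# eval_dataset = raw_dataset['test'].shard(shards, eval_index)
```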

In [11]:
def formatting_func(source):
  # Written for stable-code-3b
  ret = "<fim_prefix>"+source['prefix']+"<fim_suffix>"+source['suffix']+"<fim_middle>"
  return {'input':ret}


In [12]:
import subprocess

import datasets
LANG = "cs"

def compute_metrics(eval_pred=None):
  # The Trainer passes its evaluation predictions here; they are unused because
  # the metric is computed by generating completions and compiling them with csc.
  problems = datasets.load_dataset("nuprl/MultiPL-E", f"humaneval-{LANG}", trust_remote_code=True)['test']

  mid_tok = tokenizer("<fim_middle>")['input_ids'][0]
  references = []
  predictions = []

  for x, problem in enumerate(problems):
    prompt = problem['prompt']
    tests = problem['tests']
    fim = f"<fim_prefix>{prompt}<fim_suffix>{tests}<fim_middle>"
    inputs = tokenizer(fim, return_tensors="pt").to(model.device)
    tokens = model.generate(
      **inputs,
      max_new_tokens=200,
      temperature=0.2,
      do_sample=True,
      pad_token_id=tokenizer.eos_token_id
    )
    # Keep only the tokens generated after the first <fim_middle> marker.
    mid_pos = torch.nonzero(tokens[0] == mid_tok, as_tuple=False)[0].item()
    completion = tokenizer.decode(tokens[0][mid_pos + 1:], skip_special_tokens=True)
    # Reassemble the full program: prompt, generated middle, then the tests.
    with open("program.cs", "w", encoding='utf8') as doc:
      doc.write(prompt)
      doc.write(completion)
      doc.write(tests)
    build = subprocess.run(['csc','/d:DEBUG','-r:System.Numerics.dll', 'program.cs', '/out:program.exe'], capture_output=True)
    compiled = build.returncode == 0
    # A problem counts as correct when the assembled program compiles.
    references.append(1 if compiled else 0)
    predictions.append(1)
    print(problem['name']+f'({x}): {"✔️" if compiled else "❌"}')
  accuracy = evaluate.load("accuracy")
  # Return the metrics dict so the Trainer can log it.
  return accuracy.compute(references=references, predictions=predictions)
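
The metric above only checks that the assembled program compiles. A stricter check would also execute the compiled binary, since the benchmark's tests are appended to the program. The sketch below shows one way to wrap both steps with a timeout. `run_step` and `passes` are hypothetical helpers, and the `mono program.exe` invocation is an assumption about how the binary would be run in this environment.

```python
import subprocess
import sys


def run_step(cmd: list[str], timeout: float = 30.0) -> bool:
    """Run a command and report success; timeouts and missing binaries count as failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return proc.returncode == 0


def passes(build_cmd: list[str], run_cmd: list[str]) -> bool:
    """A problem passes only if it builds and the built program exits with 0."""
    return run_step(build_cmd) and run_step(run_cmd)


# Stand-in commands so the sketch runs anywhere; in the notebook these would be
# something like ['csc', '/d:DEBUG', 'program.cs', '/out:program.exe'] and
# ['mono', 'program.exe'] (assumed, not taken from the original cell).
ok = passes([sys.executable, "-c", "pass"], [sys.executable, "-c", "pass"])
print("passed" if ok else "failed")
```

The timeout guards against generated code that loops forever, which a compile-only check never encounters.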

In [13]:
YOUR_HF_USERNAME = "fasterinnerlooper"

output_dir = re.sub(r'.*/',f'{YOUR_HF_USERNAME}/', model_id)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
optim = "paged_adamw_8bit"
save_steps = 100
logging_steps = 10
learning_rate = 3e-5
max_grad_norm = 0.3
max_steps = 50
warmup_ratio = 0.3
lr_scheduler_type = "cosine"
num_epochs=20

training_arguments = TrainingArguments(
    output_dir=output_dir,
    report_to="wandb",
    run_name=f"stable-code-training-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    # learning_rate=learning_rate,
    # max_grad_norm=max_grad_norm,
    # max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
    push_to_hub=True,
    do_train=True,
    do_eval=True,
    # resume_from_checkpoint=f'{output_dir}/checkpoint-130',
    save_total_limit=3,
    evaluation_strategy="steps",
    eval_accumulation_steps=1,
    load_best_model_at_end=True,
    # num_train_epochs=num_epochs
)

tokenizer.pad_token = tokenizer.eos_token
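
To get a feel for the schedule these arguments produce, the sketch below computes the effective batch size and the learning rate at a few steps under linear warmup followed by cosine decay. It is an approximation written from the general shape of that schedule, not a copy of the library's implementation. It assumes 50 total steps (the value of `max_steps`, which is commented out above), a base learning rate of 5e-5 (the `TrainingArguments` default, since `learning_rate` is also commented out), and a hypothetical `num_devices` of 1.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def cosine_lr(step: int, total_steps: int, warmup_ratio: float, base_lr: float) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size(2, 4))
for step in (0, 5, 15, 30, 50):
    print(step, cosine_lr(step, total_steps=50, warmup_ratio=0.3, base_lr=5e-5))
```

With `warmup_ratio = 0.3`, nearly a third of a short run is spent ramping up, so the peak learning rate is only reached at step 15 of 50 under these assumptions.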


In [14]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    packing=False,
    dataset_text_field="prediction",
    tokenizer=tokenizer,
    max_seq_length=512,
    formatting_func=formatting_func,
    compute_metrics=compute_metrics,
    peft_config=lora_config
)



Map:   0%|          | 0/750 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In [15]:
trainer.train()

wandb: Currently logged in as: shafiq-jetha. Use `wandb login --relogin` to force relogin


You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



In [None]:
import wandb
wandb.finish()




Run history:

| Metric | History |
|---|---|
| eval/loss | █▂▁ |
| eval/runtime | █▃▁ |
| eval/samples_per_second | ▁▅█ |
| eval/steps_per_second | ▁▅█ |
| train/epoch | ▁▅██ |
| train/global_step | ▁▅██ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |
| train/train_runtime | ▁ |
| train/train_samples_per_second | ▁ |

Run summary:

| Metric | Value |
|---|---|
| eval/loss | 1.20266 |
| eval/runtime | 17.5745 |
| eval/samples_per_second | 1.423 |
| eval/steps_per_second | 1.423 |
| train/epoch | 2.67 |
| train/global_step | 50.0 |
| train/total_flos | 2194587209428992.0 |
| train/train_loss | 0.97225 |
| train/train_runtime | 527.0506 |
| train/train_samples_per_second | 0.379 |
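
A loss value is easier to interpret as a perplexity. Assuming the reported `eval/loss` is the mean per-token cross-entropy in nats (the usual convention for causal language model training, but an assumption here), the conversion is a single exponential. The `perplexity` helper is hypothetical.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


eval_loss = 1.20266   # eval/loss from the run summary
train_loss = 0.97225  # train/train_loss from the run summary

print(f"eval perplexity:  {perplexity(eval_loss):.2f}")
print(f"train perplexity: {perplexity(train_loss):.2f}")
```

Lower is better in both cases. Because this run covers only 50 steps on a small shard, these figures are best read as a sanity check rather than a measure of final model quality.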
