# Fine-tuning CodeLlama for LeetCode

## Setup

Run the cells below to set up and install the required libraries. For this experiment we need `accelerate`, `peft`, `transformers`, `datasets`, and `trl`, which provides the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We also install `einops`, which is a requirement for loading Falcon models; CodeLlama should not need it, so it is optional here.

In [1]:
!pip install  -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install  datasets bitsandbytes einops wandb

Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-9s_41b_z
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-9s_41b_z
  Resolved https://github.com/huggingface/peft.git to commit a634f6a13e1b0b55a78b78b39bbc6fde425f58a5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting trl
  Downloading trl-0.7.4-py3-none-any.whl (133 kB)
     133.9/133.9 kB 4.3 MB/s eta 0:00:00
Collecting transformers
  Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
     7.9/7.9 MB 68.7 MB/s eta 0:00:00
Collecting accelerate
  Downloading accel
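Because the install cell uses `-U` without pinned versions, a later run may pull different releases than the ones in the log above (for example `trl-0.7.4` and `transformers-4.35.2`), and the `SFTTrainer` arguments used further down may differ in newer releases. The following is a minimal sketch, using only the standard library, that reports what is actually installed; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Libraries installed by the setup cell of this notebook.
PACKAGES = ["trl", "transformers", "accelerate", "peft", "datasets", "bitsandbytes"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Recording these versions next to the results makes it easier to reproduce the run later.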

## Dataset



In [2]:
import json
import pandas as pd
from datasets import Dataset, Features, Value

with open('/kaggle/input/leetcode-compressed/leetcode_compressed.json', 'r') as file:
    data = json.load(file)

df = pd.DataFrame(data)

print("Data types in the DataFrame:")
print(df.dtypes)
print("\nFirst few rows of the DataFrame:")
print(df.head())

features = Features({
    'id': Value('int32'),
    'question': Value('string'),
    'code': Value('string'),
    'solution': Value('string')
})

dataset = Dataset.from_pandas(df, features=features)

Data types in the DataFrame:
id           int64
question    object
code        object
solution    object
dtype: object

First few rows of the DataFrame:
   id                                           question  \
0   1  Can you solve this real interview question? Fi...   
1   2  Can you solve this real interview question? Tw...   
2   3  Can you solve this real interview question? Ad...   
3   4  Can you solve this real interview question? Me...   
4   5  Can you solve this real interview question? Lo...   

                                                code  \
0  class Solution {\npublic:\n    vector<int> fin...   
1  class Solution {\npublic:\n    vector<int> two...   
2  /**\n * Definition for singly-linked list.\n *...   
3  class Solution {\npublic:\n    double findMedi...   
4  class Solution {\npublic:\n    string longestP...   

                                            solution  
0  class Solution {\npublic:\n    vector<int> fin...  
1  class Solution {\n    public int[] t
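The cell above casts `id` from `int64` to `int32` and expects the three text columns to be strings. A quick check before the cast can surface records that would not fit that schema. This is a sketch under the assumption that the JSON file is a list of dictionaries; `find_bad_records` and the toy records are hypothetical and are not taken from the dataset.

```python
REQUIRED_FIELDS = ("id", "question", "code", "solution")
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def find_bad_records(records):
    """Return (index, reason) pairs for records that do not fit the schema."""
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) is None]
        if missing:
            problems.append((i, f"missing fields: {missing}"))
            continue
        if not isinstance(rec["id"], int) or not INT32_MIN <= rec["id"] <= INT32_MAX:
            problems.append((i, "id does not fit in int32"))
        non_text = [f for f in REQUIRED_FIELDS[1:] if not isinstance(rec[f], str)]
        if non_text:
            problems.append((i, f"non-string fields: {non_text}"))
    return problems


# Toy records for illustration only.
toy = [
    {"id": 1, "question": "q", "code": "c", "solution": "s"},
    {"id": 2**40, "question": "q", "code": "c", "solution": "s"},
    {"id": 3, "question": "q", "code": "c"},
]
print(find_bad_records(toy))
```

In the notebook, the same function could be called on `data` right after `json.load`.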

## Loading the model

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "codellama/CodeLlama-7b-Instruct-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
    offload_folder='offload'
)
model.config.use_cache = False
model.config.pretraining_tp = 1
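To see why 4-bit loading matters, it helps to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter count of roughly 6.74 billion is an assumption for a 7B Llama-style model, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 6.74e9  # assumption: approximate size of a 7B Llama-style model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f" 4-bit weights: ~{nf4_gb:.2f} GB")
```

Under these assumptions the 16-bit figure is about 13.5 GB and the 4-bit figure is about a quarter of that, which is what lets the model fit on a single consumer or Kaggle GPU.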

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
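The tokenizer cell reuses the end-of-sequence token as the padding token and pads on the right. The toy function below illustrates what right versus left padding does to a batch; it is a plain-Python illustration with a hypothetical name (`pad_batch`) and made-up token ids, not the tokenizer's implementation. The pad id of 2 assumes the EOS id of this tokenizer is 2.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to a common length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        real_mask, pad_mask = [1] * len(seq), [0] * len(pad)
        if padding_side == "right":
            input_ids.append(list(seq) + pad)
            attention_mask.append(real_mask + pad_mask)
        else:
            input_ids.append(pad + list(seq))
            attention_mask.append(pad_mask + real_mask)
    return input_ids, attention_mask


EOS_ID = 2  # assumption: EOS token id, reused as the pad token
ids, mask = pad_batch([[1, 10, 11, 12], [1, 20]], pad_id=EOS_ID)
print(ids)   # [[1, 10, 11, 12], [1, 20, 2, 2]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what distinguishes padding from a genuine EOS token, since both share the same id.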

In [5]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)
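A rough count of the trainable parameters that this LoRA configuration adds can be worked out by hand. For a weight matrix of shape `d_out x d_in`, LoRA adds two low-rank factors with `r * (d_in + d_out)` parameters in total, and the update is scaled by `lora_alpha / r`. The sketch below assumes a hidden size of 4096, 32 decoder layers, and two adapted square projections per layer (`q_proj` and `v_proj`); these are assumptions about the model and about PEFT's default targets, since the config above does not set `target_modules`.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumptions about the base model and the default target modules.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGETS_PER_LAYER = 2  # assumed: q_proj and v_proj

R = 64
LORA_ALPHA = 16

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
scaling = LORA_ALPHA / R

print(f"trainable LoRA parameters: {trainable:,}")  # 33,554,432
print(f"LoRA scaling factor: {scaling}")            # 0.25
```

Under these assumptions the adapters hold about 33.6 million parameters, well under 1% of a 7B model. With `peft` installed, `print_trainable_parameters()` on the wrapped model reports the actual figure.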

## Loading the trainer

In [6]:
from transformers import TrainingArguments

output_dir = "./results"
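per_device_train_batch_size = 1
gradient_accumulation_steps = 4  # effective batch size per device: 1 * 4 = 4 sequences per optimizer step
optim = "paged_adamw_32bit"
adam_beta1 = 0.9
adam_beta2 = 0.95
save_steps = 210
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
# max_steps = 200
warmup_ratio = 0.03  # note: the "constant" schedule ignores warmup; "constant_with_warmup" would apply it
lr_scheduler_type = "constant"
weight_decay = 0.001
num_train_epochs = 1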

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    # max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    weight_decay=weight_decay,
    adam_beta1=adam_beta1,
    adam_beta2=adam_beta2,
    num_train_epochs=num_train_epochs,
    report_to='tensorboard'
)
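The batch size and gradient accumulation settings above determine how many optimizer steps one epoch takes. The sketch below works through that arithmetic. The dataset size of 405 is a hypothetical value chosen for illustration, and the floor division when converting batches to optimizer steps is an assumption about how the `Trainer` counts steps; it may differ between versions.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def optimizer_steps(num_examples, per_device, grad_accum, epochs=1, num_devices=1):
    """Approximate number of optimizer steps for a run (assumes floor division)."""
    batches_per_epoch = math.ceil(num_examples / (per_device * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


print(effective_batch_size(per_device=1, grad_accum=4))         # 4
print(optimizer_steps(num_examples=405, per_device=1, grad_accum=4))  # 101
```

With `save_steps = 210`, a run of roughly 100 optimizer steps would finish before the first periodic checkpoint is written, so the explicit save after training is what preserves the adapter.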

In [7]:
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['question'])):
        text = f"""<s>[INST] <<SYS>>
You are a smart assistant. You will solve the problem based on the following source code and question.
<</SYS>>
###Question: 
{example['question'][i]}
###CODE: 
{example['code'][i]} [/INST]
###SOLUTION: 
{example['solution'][i]} 
###END</s>"""
        output_texts.append(text)
    return output_texts
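The same prompt template is needed again at inference time, only without the solution. One way to keep the two copies from drifting apart is to build both from a single helper. The following is a sketch of that idea; `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, and the string layout mirrors the template in the cell above.

```python
SYSTEM_PROMPT = (
    "You are a smart assistant. You will solve the problem based on "
    "the following source code and question."
)


def build_prompt(question, code, solution=None):
    """Build the training text, or the inference prompt when solution is None."""
    prompt = (
        "<s>[INST] <<SYS>>\n"
        f"{SYSTEM_PROMPT}\n"
        "<</SYS>>\n"
        "###Question: \n"
        f"{question}\n"
        "###CODE: \n"
        f"{code} [/INST]\n"
        "###SOLUTION: \n"
    )
    if solution is None:
        return prompt
    return prompt + f"{solution} \n###END</s>"


def format_batch(example):
    """Batched variant with the same shape as a formatting function for SFT."""
    return [
        build_prompt(q, c, s)
        for q, c, s in zip(example["question"], example["code"], example["solution"])
    ]


batch = {"question": ["Two Sum"], "code": ["class Solution {};"], "solution": ["// answer"]}
print(format_batch(batch)[0])
```

Because the inference prompt is a strict prefix of the training text, the model sees exactly the context at generation time that it was trained to continue.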

### Creating the trainer

In [8]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    formatting_func=formatting_prompts_func,
    packing=False
)

In [9]:
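# Upcast normalization layers to float32 for numerical stability during training.
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)  # nn.Module.to() converts the module in place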

## Training the model

In [10]:
trainer.train()

You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10   | 0.8872        |
| 20   | 0.6491        |
| 30   | 0.4135        |
| 40   | 0.351         |
| 50   | 0.2192        |
| 60   | 0.2811        |
| 70   | 0.2878        |
| 80   | 0.2561        |
| 90   | 0.2593        |
| 100  | 0.2595        |


TrainOutput(global_step=101, training_loss=0.38475710389637713, metrics={'train_runtime': 623.6423, 'train_samples_per_second': 0.649, 'train_steps_per_second': 0.162, 'total_flos': 8085904440606720.0, 'train_loss': 0.38475710389637713, 'epoch': 1.0})
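The numbers in the training output can be cross-checked against each other with simple arithmetic. The sketch below recomputes the mean of the ten logged losses and the step throughput from the values shown above. The mean of the logged values is only expected to be close to the reported `training_loss`, not identical, since the reported figure covers all 101 steps.

```python
# Values copied from the loss log and TrainOutput above.
logged_losses = {
    10: 0.8872, 20: 0.6491, 30: 0.4135, 40: 0.351, 50: 0.2192,
    60: 0.2811, 70: 0.2878, 80: 0.2561, 90: 0.2593, 100: 0.2595,
}
global_step = 101
train_runtime = 623.6423   # seconds
reported_loss = 0.38475710389637713

mean_logged = sum(logged_losses.values()) / len(logged_losses)
steps_per_second = global_step / train_runtime

print(f"mean of logged losses: {mean_logged:.4f}")   # 0.3864
print(f"reported training loss: {reported_loss:.4f}")  # 0.3848
print(f"steps per second: {steps_per_second:.3f}")   # 0.162
```

The loss drops quickly over the first 50 steps and then flattens around 0.26 to 0.29, which is worth keeping in mind when deciding whether a second epoch would help.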

In [11]:
trainer.model.save_pretrained("checkpoint-final")
trainer.tokenizer.save_pretrained("checkpoint-final")

('checkpoint-final/tokenizer_config.json',
 'checkpoint-final/special_tokens_map.json',
 'checkpoint-final/tokenizer.model',
 'checkpoint-final/added_tokens.json',
 'checkpoint-final/tokenizer.json')

## Inference

In [12]:
example = dataset[0] 
text = f"""<s>[INST] <<SYS>>
You are a smart assistant. You will solve the problem based on the following source code and question.
<</SYS>>
###Question: 
{example['question']}
###CODE: 
{example['code']} [/INST]
###SOLUTION: 
"""
text

'<s>[INST] <<SYS>>\nYou are a smart assistant. You will solve the problem based on the following source code and question.\n<</SYS>>\n###Question: \nCan you solve this real interview question? Find The Original Array of Prefix Xor - You are given an integer array pref of size n. Find and return the array arr of size n that satisfies:\n\n * pref[i] = arr[0] ^ arr[1] ^ ... ^ arr[i].\n\nNote that ^ denotes the bitwise-xor operation.\n\nIt can be proven that the answer is unique.\n\n \n\nExample 1:\n\n\nInput: pref = [5,2,0,3,1]\nOutput: [5,7,2,3,2]\nExplanation: From the array [5,7,2,3,2] we have the following:\n- pref[0] = 5.\n- pref[1] = 5 ^ 7 = 2.\n- pref[2] = 5 ^ 7 ^ 2 = 0.\n- pref[3] = 5 ^ 7 ^ 2 ^ 3 = 3.\n- pref[4] = 5 ^ 7 ^ 2 ^ 3 ^ 2 = 1.\n\n\nExample 2:\n\n\nInput: pref = [13]\nOutput: [13]\nExplanation: We have pref[0] = arr[0] = 13.\n\n\n \n\nConstraints:\n\n * 1 <= pref.length <= 105\n * 0 <= pref[i] <= 106\n###CODE: \nclass Solution {\npublic:\n    vector<int> findArray(vector<
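Since the training template wraps each answer between `###SOLUTION:` and `###END`, text generated from a prompt like the one above can be post-processed to keep only the answer. The function below is a minimal sketch of that step; `extract_solution` is a hypothetical helper, and it assumes the decoded output still contains the prompt, so it looks for the first `###SOLUTION:` marker.

```python
def extract_solution(generated_text, start_marker="###SOLUTION:", end_marker="###END"):
    """Return the text between the markers, or None if the start marker is absent."""
    _, found, tail = generated_text.partition(start_marker)
    if not found:
        return None
    # If the end marker is missing (e.g. generation hit the token limit),
    # partition leaves the whole tail in place, so we keep the partial answer.
    solution, _, _ = tail.partition(end_marker)
    return solution.strip()


# Toy decoded output for illustration only.
decoded = "[INST] ... [/INST]\n###SOLUTION: \nreturn 42; \n###END"
print(extract_solution(decoded))  # return 42;
```

Returning the partial text when `###END` is missing is a design choice for this sketch; a stricter variant could treat that case as a failed generation.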

In [13]:
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] <<SYS>>
You are a smart assistant. You will solve the problem based on the following source code and question.
<</SYS>>
###Question: 
Can you solve this real interview question? Find The Original Array of Prefix Xor - You are given an integer array pref of size n. Find and return the array arr of size n that satisfies:

 * pref[i] = arr[0] ^ arr[1] ^ ... ^ arr[i].

Note that ^ denotes the bitwise-xor operation.

It can be proven that the answer is unique.

 

Example 1:


Input: pref = [5,2,0,3,1]
Output: [5,7,2,3,2]
Explanation: From the array [5,7,2,3,2] we have the following:
- pref[0] = 5.
- pref[1] = 5 ^ 7 = 2.
- pref[2] = 5 ^ 7 ^ 2 = 0.
- pref[3] = 5 ^ 7 ^ 2 ^ 3 = 3.
- pref[4] = 5 ^ 7 ^ 2 ^ 3 ^ 2 = 1.


Example 2:


Input: pref = [13]
Output: [13]
Explanation: We have pref[0] = arr[0] = 13.


 

Constraints:

 * 1 <= pref.length <= 105
 * 0 <= pref[i] <= 106
###CODE: 
class Solution {
public:
    vector<int> findArray(vector<int>& pref) {
        
    }
}; [/INST]
###SOLUT