<a href="https://colab.research.google.com/github/flaviusfetean/method_name_predictor/blob/main/nlp_llama_method_predict.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment setup

In [1]:
!pip install accelerate peft bitsandbytes transformers trl

In [2]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Import necessary libs


In [3]:
import os

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer


# Setup model and tokenizer

In [4]:
base_model = "NousResearch/Llama-2-7b-chat-hf"

compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},
    cache_dir=r"/content/",
    force_download=False,
)
model.config.use_cache = True
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"




# Utility functions to manage data
1. Function that returns a prompt for Llama, given a method body

In [5]:
def get_llm_input(body, class_name=None):
    if class_name is None:
        return "### Body: {} ### Name: ".format(body)
    return "### Body: {} ### Class {}</s>".format(body, class_name)

# Inference on some data

In [6]:
#desired input format on non-finetuned model
prompt = get_llm_input("return this.id")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"{prompt}")
print(result[0]['generated_text'])



### Body: return this.id ### Name: 001 ### Type: number

I'm trying to create a function that takes in an id and returns the corresponding object from the array. I'm not sure how to do this, as I'm getting an error message that says "TypeError: Cannot read property 'id' of undefined".

Here is my code:
```
const people = [
  { id: 001, name: 'John', age: 30 },
  { id: 002, name: 'Jane', age: 25 },
  { id: 003, name: 'Bob', age: 40 }
];

function getPerson(id) {
  return people.find(person => person.id === id);
}

console.log(getPerson(001)); // Output: { id:


In [7]:
#input format that actually yields a result for original model
prompt = ("Generate a name for the following method's body: return this.id. Only state the name, without any additional text output.")
result = pipe(f"{prompt}")
print(result[0]['generated_text'])

Generate a name for the following method's body: return this.id. Only state the name, without any additional text output.

Answer:

id


# Load the custom methods dataset


The dataset is a collection of methods extracted from the Java files of the https://github.com/JetBrains/intellij-community repo. It is stored as a JSONL file in which each line is a dict with a single entry, "text", containing a method sample formatted like this:

```
### Body: method_body;\n here; ### Name: methodNameHere</s>
```
This format was chosen to follow models trained on the [Guanaco Dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), which uses more or less the same format.


Methods from the repo that served testing purposes (i.e., had "test" in their name) were not added to the dataset, as their bodies were often completely unrelated to their names (e.g., names containing 1/2/3 only because of ordering; more code context would have been necessary for such a prediction).
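
As a rough illustration of the record format described above, the sketch below builds JSONL lines from (body, name) pairs. The helper names `format_sample` and `to_jsonl_lines`, the sample input, and the case-insensitive `"test"` filter are assumptions made for this example; the actual extraction script is not part of this notebook.

```python
import json

def format_sample(body, name):
    """Format one method as '### Body: <body> ### Name: <name></s>'."""
    return "### Body: {} ### Name: {}</s>".format(body, name)

def to_jsonl_lines(methods):
    """Turn (body, name) pairs into JSONL lines, skipping test methods.

    Assumption: a method counts as a test method when "test" occurs
    anywhere in its name, compared case-insensitively.
    """
    lines = []
    for body, name in methods:
        if "test" in name.lower():
            continue
        # Each line is a dict with a single "text" entry
        lines.append(json.dumps({"text": format_sample(body, name)}))
    return lines

# Hypothetical input, for illustration only
methods = [
    ("return this.id;", "getId"),
    ("assertEquals(1, compute());", "testCompute1"),
]
for line in to_jsonl_lines(methods):
    print(line)
```

Only the first method survives the filter, so a single JSONL line is printed.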

In [8]:
from datasets import load_dataset

dataset = load_dataset("json", data_files={"train":"/content/json_llama_train.jsonl", "test": "/content/json_llama_test.jsonl"})

new_model = "/content/drive/MyDrive/llama_output/llama-2-7b-method-predict"


# LoRA and Finetune Trainer configs


The size of the model makes it desirable to fine-tune it with **Low-Rank Adaptation (LoRA)**. The method rests on the observation that fine-tuning for a new task only needs to change the weights by a small, low-rank amount. LoRA therefore adds a **low-rank update**, the product of two small matrices, **to the original weight matrix**. During training the original matrix is frozen and only the two small matrices are learned, so far fewer parameters are trained while yielding results similar to a full fine-tune.

In this experiment I chose a rank of 32 and an alpha of 16 (alpha scales how strongly the adaptation affects the base model; the update is multiplied by alpha / rank). In retrospect, alpha should have been higher, as the model learned slowly and never seemed to forget its original chatbot use case.

In [9]:

peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=32,
    bias="none",
    task_type="CAUSAL_LM",
)
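
To make the role of the rank and alpha concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass and parameter count. It assumes the standard formulation in which the frozen weight `W` is augmented by `(alpha / r) * B A`; the helper names are made up for this example, and the 4096 x 4096 shape corresponds to one attention projection of Llama-2-7b. It is not the code PEFT runs internally.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # every entry of W is trainable in a full fine-tune
    lora = r * (d_in + d_out)    # A has shape r x d_in, B has shape d_out x r
    return full, lora

def lora_scaling(alpha, r):
    """Factor applied to the low-rank update."""
    return alpha / r

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    s = lora_scaling(alpha, r)
    return [b + s * u for b, u in zip(base, update)]

full, lora = lora_param_counts(4096, 4096, r=32)
print(full, lora, lora / full)   # 16777216 262144 0.015625
print(lora_scaling(16, 32))      # 0.5
```

With rank 32, the adapter trains about 1.6% of the parameters of that matrix, and with alpha 16 the update is scaled by 0.5, which is consistent with the adaptation having a fairly weak effect on the base model.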




I planned to train the model for only 2 epochs and set the batch size and gradient accumulation so that the effective batch size is as large as possible, to increase speed. The rest of the parameters were chosen ad hoc, but in retrospect I should have chosen a higher learning rate and removed the linear lr_scheduler because of the small number of epochs. The batches were also grouped so that samples had approximately the same length, which increases efficiency.

In [None]:
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    save_steps=1000,
    save_total_limit=5,
    logging_steps=1000,
    learning_rate=2e-4,
    weight_decay=0.001,
    max_grad_norm=0.3,
    warmup_steps=1000,
    group_by_length=True,
    lr_scheduler_type="linear",
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)
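
The effective batch size and the shape of the learning-rate schedule follow from the arguments above. The sketch below assumes a single GPU and the usual linear warmup followed by linear decay to zero; `total_steps=12000` is a placeholder chosen for illustration, not the measured length of this run, and the function names are made up for this example.

```python
def effective_batch_size(per_device, grad_accum, n_devices=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device * grad_accum * n_devices

def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print(effective_batch_size(per_device=4, grad_accum=2))  # 8

# Learning rate at a few points of a hypothetical 12000-step run
for step in (0, 500, 1000, 6500, 12000):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=1000, total_steps=12000))
```

Under these assumptions the peak rate of 2e-4 is reached only at step 1000 and has already halved by step 6500, which illustrates why a short run spends little time near the nominal learning rate.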

# Train and save

In [None]:
trainer.train()


Zip and download the checkpoint

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!zip -r /content/checkpoint_12000.zip /content/results/checkpoint-12000

In [None]:
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

# Perform inference on the new model

In [11]:
from peft import PeftModel

Load the LoRA adapter and merge it into the base model (run this only if the model was not trained in the same session)

In [12]:
model = PeftModel.from_pretrained(model, "/content/drive/MyDrive/llama_output/checkpoint-12000")
model = model.merge_and_unload()



In [13]:
test_dataset = load_dataset("json", data_files={"test": "/content/json_clean_test.json"})

# Evaluate the dataset on test data

1.  Utility function that extracts the method name from decoded llama output



In [14]:
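def get_name_from_llama_finetuned(llama_output):
    # Extract the method name that follows the first "### Name:" and ends at the first "</s>"
    marker = "### Name:"
    start = llama_output.find(marker)
    if start == -1:
        return ""  # no name marker in the output
    prediction = llama_output[start + len(marker):]
    end = prediction.find("</s>")
    if end != -1:  # keep the whole tail if generation stopped before "</s>"
        prediction = prediction[:end]
    prediction = prediction.strip()
    # Keep only the last whitespace-separated token
    return prediction.split()[-1] if prediction else ""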


2. Utility function which takes a method body as a parameter and returns the predicted method name

In [15]:
from tqdm.auto import tqdm

def get_method_name(test_model, method_body):
    prompt = get_llm_input(method_body)
    print("initializing pipeline")
    pipe = pipeline(task="text-generation",  model=test_model, tokenizer=tokenizer, max_length=200)
    prediction_raw = pipe(prompt)[0]['generated_text']
    name = get_name_from_llama_finetuned(prediction_raw)
    return name



3.   Utility function which splits a Ground Truth text into body and name for testing purpose



In [16]:
def split_gt_input_output(text):
    sep = "### Name: "
    llm_input = text[:text.find(sep) + len(sep)]
    output = text[text.find(sep) + len(sep):text.find("</s>")]
    return llm_input, output
llm_input, name = split_gt_input_output(dataset['test'][13]['text'])
print("LLM input: " + llm_input)
print("Ground Truth: " + name)

LLM input: ### Body: return KeyCodeTypeCommand.unparseKeyCodes(keyCodes); ### Name: 
Ground Truth: unparseKeyCodes


In [17]:
method_body = test_dataset['test'][158]['body']
print("### Body: " +  method_body)
print("### Name: " + test_dataset['test'][158]['name'])

### Body: return myFileStructure.getCurrentDirectory();
### Name: getCurrentDirectory


In [18]:
prompt = get_llm_input(method_body)
pipe = pipeline(task="text-generation",  model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"{prompt}")
print(get_name_from_llama_finetuned(result[0]['generated_text']))
print(result[0]['generated_text'])

getCurrentDirectory
### Body: return myFileStructure.getCurrentDirectory(); ### Name:  getCurrentDirectory</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1</s>1


# Testing the resulting model


1.   **Hard Comparison**:
  The output is tested for an exact match

In [19]:
def compare_outputs(pred, gt):
    return 1 if pred == gt else 0



2.   **Soft comparison (similarity)**: We will count the number of words that appear in both the output and the ground truth, as the output may still have relevance even if not an exact match

In [20]:
def split_camel_case(input_string):
    """
    Method to split a method name which is known to be a camel-case
    into its composing words (Java convention)
    """
    try:
        words = [input_string[0]]

        for char in input_string[1:]:
            if char.isupper():
                words.append(char.lower())
            else:
                words[-1] += char
    except IndexError:
        return ""

    return ' '.join(words)

camel_case_string = "camelCaseExample"
result = split_camel_case(camel_case_string)
print(result)

camel case example


In [21]:

def compare_similarity(pred, gt):
    """Often the method name is not predicted exactly the same as the ground truth,
    but it is composed of some words that are also present in the ground truth.
    Therefore, we define the similarity as the total length of the ground-truth words
    that also occur (as substrings) in the prediction, divided by the length of the
    longer of the two strings.
    """

    max_similarity = 0
    words = split_camel_case(gt).split()
    for word in words:
        if word in pred.lower():
            max_similarity += len(word)

    return max_similarity / max(len(pred), len(gt))



3.   **ROUGE score**: A generalization of the soft score that also takes into account bigrams and longest common subsequences



In [22]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install evaluate absl-py rouge_score nltk

In [23]:
import evaluate

rouge = evaluate.load('rouge')

def compare_rouge(preds, gts):
    """
    Rouge will treat the texts as summaries, so we will have to split
    the method names into composing words and treat them as summaries
    """
    pred_split = [split_camel_case(pred) for pred in preds]
    gt_split = [split_camel_case(gt) for gt in gts]

    return rouge.compute(predictions=pred_split, references=gt_split)

print(compare_rouge(["getTestDefault"], ["myTestNotDefault"])) #Expected 0.57 r1, 0 r2, 0.57 rl

{'rouge1': 0.5714285714285715, 'rouge2': 0.0, 'rougeL': 0.5714285714285715, 'rougeLsum': 0.5714285714285715}
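
The expected 0.57 can be checked by hand. The sketch below computes an n-gram F1 score on the camel-case-split names; it is a simplified stand-in for the ROUGE-1 and ROUGE-2 values reported by the `rouge` metric and ignores the tokenization and aggregation details of the real implementation. The function names are made up for this example.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous n-grams of a word list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n_f1(pred_words, ref_words, n=1):
    """F1 of the n-gram overlap between a prediction and a reference."""
    pred_ngrams = Counter(ngrams(pred_words, n))
    ref_ngrams = Counter(ngrams(ref_words, n))
    # Multiset intersection counts each shared n-gram at most as often as it occurs in both
    overlap = sum((pred_ngrams & ref_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)

pred = "get test default".split()
ref = "my test not default".split()
print(rouge_n_f1(pred, ref, n=1))  # 2 shared unigrams: P = 2/3, R = 2/4, F1 = 4/7
print(rouge_n_f1(pred, ref, n=2))  # no shared bigrams, so the score is 0.0
```

The two names share the unigrams "test" and "default" but no bigram, which matches the 0.57 and 0 reported above.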


In [24]:
#test the compare functions
print(compare_outputs("hello", "hello")) #Expected 1
print(compare_outputs("hello", "world")) #Expected 0
print(compare_similarity("hello", "hello")) #Expected 1.0
print(compare_similarity("hello", "hell")) #Expected 0.8
print(compare_similarity("hello", "helll")) #Expected 0.0
print(compare_similarity("getTestDefault", "myTestNotDefault")) #Expected 0.687

1
0
1.0
0.8
0.0
0.6875


Cell to evaluate predictions on soft, hard and rouge scores

In [25]:
#evaluate the predictions

def evaluate_predictions(predictions, gt):
    hard_score = 0
    soft_score = 0

    for i, (pred, gndt) in enumerate(zip(predictions, gt)):
        hard_score += compare_outputs(pred, gndt)
        soft_score += compare_similarity(pred, gndt)

    print("Hard score: ", hard_score / len(predictions))
    print("Soft score: ", soft_score / len(predictions))
    print("Rouge score: ", compare_rouge(predictions, gt))

# Evaluate the model

Method to predict and compute metrics at once, and print them as they appear


> Logs the results continuously so they remain visible in case the process is killed due to time or memory constraints



In [26]:
import gc
gc.collect()

def predict_and_evaluate(model_peft, dataset, tokenizer):
    print("Processing test dataset")
    # Split each sample once into (prompt, ground-truth name)
    xy_pairs = [split_gt_input_output(sample['text']) for sample in dataset]
    prompts = [llm_input for llm_input, _ in xy_pairs]
    print(prompts[0])
    gts = [name for _, name in xy_pairs]
    print(gts[0])
    print("initializing pipeline")
    pipe = pipeline(task="text-generation",  model=model_peft, tokenizer=tokenizer, max_length=200)
    print("beginning pipeing")
    bsize = 256      # number of prompts processed between two metric reports
    pipe_size = 16   # batch size used inside the pipeline
    all_results = []
    # Step through the prompts in chunks of bsize; the last chunk may be shorter
    for start in range(0, len(prompts), bsize):
        batch_prompt = prompts[start: start + bsize]
        results_raw = pipe(batch_prompt, batch_size=pipe_size)
        results_decoded = [get_name_from_llama_finetuned(result[0]['generated_text']) for result in results_raw]
        all_results.extend(results_decoded)
        print(f"Finished evaluating: {len(all_results)}/{len(prompts)}:\n")
        # Score everything predicted so far against the matching ground truths
        evaluate_predictions(all_results, gts[:len(all_results)])
        print("-"*64)

In [27]:
predict_and_evaluate(model, dataset['test'], tokenizer)

Processing test dataset
### Body: return false; ### Name: 
dependsOnFileContent
initializing pipeline
beginning pipeing
Finished evaluating: 256/33331:

Hard score:  0.03515625
Soft score:  0.10994597027625
Rouge score:  {'rouge1': 0.12635845057720058, 'rouge2': 0.05238715277777778, 'rougeL': 0.12497026627886, 'rougeLsum': 0.1255952380952381}
----------------------------------------------------------------




Finished evaluating: 512/33331:

Hard score:  0.037109375
Soft score:  0.09261049102456528
Rouge score:  {'rouge1': 0.10437275940205623, 'rouge2': 0.04557291666666667, 'rougeL': 0.10263129340277771, 'rougeLsum': 0.1025184884559884}
----------------------------------------------------------------
Finished evaluating: 768/33331:

Hard score:  0.037760416666666664
Soft score:  0.09175656120035058
Rouge score:  {'rouge1': 0.10531500834235208, 'rouge2': 0.04349785052910053, 'rougeL': 0.1046650000751563, 'rougeLsum': 0.10456241169131794}
----------------------------------------------------------------
Finished evaluating: 1024/33331:

Hard score:  0.033203125
Soft score:  0.08293763143630198
Rouge score:  {'rouge1': 0.09738145968614723, 'rouge2': 0.04024290054563491, 'rougeL': 0.09698174546807363, 'rougeLsum': 0.09655966050009027}
----------------------------------------------------------------




Finished evaluating: 1280/33331:

Hard score:  0.0328125
Soft score:  0.0860709533647098
Rouge score:  {'rouge1': 0.09784922325937959, 'rouge2': 0.039640997023809514, 'rougeL': 0.09823736810064944, 'rougeLsum': 0.09802283437049071}
----------------------------------------------------------------




Finished evaluating: 1536/33331:

Hard score:  0.033854166666666664
Soft score:  0.0848881190939323
Rouge score:  {'rouge1': 0.09559793505106012, 'rouge2': 0.04157784128487253, 'rougeL': 0.09581591730029237, 'rougeLsum': 0.09571651764034586}
----------------------------------------------------------------




Finished evaluating: 1792/33331:

Hard score:  0.03404017857142857
Soft score:  0.08198445154657256
Rouge score:  {'rouge1': 0.09232288513817544, 'rouge2': 0.041670652636054406, 'rougeL': 0.09277416531880822, 'rougeLsum': 0.09270865543186982}
----------------------------------------------------------------




Finished evaluating: 2048/33331:

Hard score:  0.0341796875
Soft score:  0.0812504076478103
Rouge score:  {'rouge1': 0.09232649810042398, 'rouge2': 0.04211619543650792, 'rougeL': 0.09208981258541822, 'rougeLsum': 0.09246191091894225}
----------------------------------------------------------------




OutOfMemoryError: ignored