In [5]:
from datasets import load_dataset
import pandas as pd

### Data Exploration for 🤗 CommitPackFT

In [20]:
dataset = load_dataset("bigcode/commitpackft", "python")

The dataset contains the following features:

1.   Unique commit ID
2.   New and old file names
3.   New and old contents
4.   Subject
5.   Message
6.   Programming Language
7.   License
8.   Repository





The dataset consists of natural language and code pairs, where commit messages can serve as instructions and code as solutions.

Python makes up roughly 8% of the total dataset (56,025 samples), so models fine-tuned on the full dataset should remain fairly general rather than Python-specific.

The prompt and answer can be constructed from the samples in the following way:



> Old content (context) + message (instruction) = new content (answer)
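
The recipe above can be sketched as a small helper. This is only an illustration of the idea, not the exact template used for fine-tuning: the function name and the way the context and instruction are joined are assumptions, and only the three field names that are printed in this notebook (`old_contents`, `message`, `new_contents`) are taken from the dataset.

```python
from typing import Dict, Tuple


def build_instruction_pair(sample: Dict[str, str]) -> Tuple[str, str]:
    """Turn one commit sample into a (prompt, answer) pair.

    The prompt is the old file contents (context) followed by the commit
    message (instruction); the answer is the new file contents.
    The joining format is a hypothetical choice made for this sketch.
    """
    prompt = (
        f"{sample['old_contents'].rstrip()}\n\n"
        f"{sample['message'].strip()}\n"
    )
    answer = sample["new_contents"]
    return prompt, answer


# Toy sample with the same three fields as the dataset rows used in this notebook.
toy_sample = {
    "old_contents": "def add(a, b):\n    return a - b\n",
    "message": "Fix subtraction bug in add",
    "new_contents": "def add(a, b):\n    return a + b\n",
}

prompt, answer = build_instruction_pair(toy_sample)
print(prompt)
print(answer)
```

The same helper should work on a real row, for example `build_instruction_pair(dataset['train'][1231])`.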








Example of such a triple:

In [30]:
print(dataset['train'][1231]['old_contents'])

# -*- coding: utf-8 -*-

from django.db import models
from django.contrib.contenttypes.models import ContentType
from django.contrib.contenttypes import generic

class Authors(models.Model):
    author = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('author', 'object_id')

    def __unicode__(self):
        return self.content_object.name

class Announcements(models.Model):
    title = models.CharField(max_length = 500)
    pubdate = models.DateTimeField()
    creator = models.ForeignKey(Authors)
    unique = models.CharField(max_length = 255, unique = True)
    url = models.URLField()
    summary = models.TextField(null = True)
    enclosure = models.CharField("Attachment URL", max_length = 255, null = True)

    def __unicode__(self):
        return self.title



In [32]:
print(dataset['train'][1231]['message'])

Rename of the author field to content_type in the model, in order to
avoid confusion



In [29]:
print(dataset['train'][1231]['new_contents'])

# -*- coding: utf-8 -*-

from django.db import models
from django.contrib.contenttypes.models import ContentType
from django.contrib.contenttypes import generic

class Authors(models.Model):
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')

    def __unicode__(self):
        return self.content_object.name

class Announcements(models.Model):
    title = models.CharField(max_length = 500)
    pubdate = models.DateTimeField()
    creator = models.ForeignKey(Authors)
    unique = models.CharField(max_length = 255, unique = True)
    url = models.URLField()
    summary = models.TextField(null = True)
    enclosure = models.CharField("Attachment URL", max_length = 255, null = True)

    def __unicode__(self):
        return self.title



The example shows that the dataset is well suited for learning small, targeted and precise changes in code, which is especially valuable for bug fixing.

Because of very strict filtering, the instructions are typically high-quality.

## Evaluating Refact-1.6B-fim

I will be using the [Code Generation LM Evaluation Harness library](https://github.com/bigcode-project/bigcode-evaluation-harness)
for evaluation.


This library does not support Refact-1.6B-fim out of the box, so I made a few changes to it: I implemented prompt generation for Refact-1.6B-fim, which allows it to run in both code completion and chat modes.


In [3]:
!chmod 755 ./run_evaluate.sh
!./run_evaluate.sh

fatal: destination path 'bigcode-evaluation-harness' already exists and is not an empty directory.
Selected Tasks: ['humanevalfixtests-python']
Loading model in bf16
number of problems for this task is 164
100% 328/328 [26:12<00:00,  4.79s/it]
generations were saved at generations.json
Evaluating generations...
Downloading builder script: 100% 7.92k/7.92k [00:00<

The hyperparameters are chosen to be as close to the original OctoPack paper as possible.

I tried to write prompts that comply both with what Refact-1_6B-fim expects and with the prompt format of the HumanEvalFix benchmark. Here is an example of a prompt for code completion:

```
<empty_output>SYSTEM from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False


def check(has_close_elements):
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert has_close_elements([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

check(has_close_elements)
<empty_output>USER Fix bugs in has_close_elements.
<empty_output>ASSISTANT
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
```
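A prompt of this shape can be assembled with a few lines of Python. The sketch below is inferred purely from the example above and is not the code added to the harness: the function name, its parameters, and the exact whitespace between sections are assumptions.

```python
SEP = "<empty_output>"  # role separator, as it appears in the example prompt


def build_fix_prompt(buggy_code: str, tests: str, entry_point: str,
                     answer_prefix: str = "") -> str:
    """Assemble a SYSTEM / USER / ASSISTANT prompt for a bug-fixing task.

    buggy_code and tests go into the SYSTEM turn as context, the USER turn
    carries the instruction, and answer_prefix (for example the function
    signature) starts the ASSISTANT turn so the model continues from it.
    """
    context = f"{buggy_code.rstrip()}\n\n\n{tests.rstrip()}"
    return (
        f"{SEP}SYSTEM {context}\n"
        f"{SEP}USER Fix bugs in {entry_point}.\n"
        f"{SEP}ASSISTANT\n"
        f"{answer_prefix}"
    )


# Toy inputs; a real task would supply the HumanEvalFix function and its tests.
example_prompt = build_fix_prompt(
    buggy_code="def add(a, b):\n    return a - b\n",
    tests="assert add(1, 2) == 3\n",
    entry_point="add",
    answer_prefix="def add(a, b):\n",
)
print(example_prompt)
```

Ending the prompt with the function signature is what turns the chat-style prompt into a completion task: the model only has to write the body.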

In chat mode, pass@1 is **0.167** and pass@10 is **0.262**.
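
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator from the Codex paper: given `n` generated samples for a problem, of which `c` pass the tests, it estimates the probability that at least one of `k` randomly drawn samples is correct. The sketch below assumes this standard estimator; the sample counts in the usage lines are made up for illustration and are not the values from this run.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total number of generated samples
    c: number of samples that pass all tests
    k: number of samples drawn
    """
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 20 samples for one problem, 3 of them correct.
print(pass_at_k(n=20, c=3, k=1))   # equals c / n = 0.15
print(pass_at_k(n=20, c=3, k=10))
```

The benchmark score is the mean of this per-problem estimate over all problems.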

#### Drawbacks and problems of the model

1. The model is very sensitive to prompting. Slight changes in the prompt can dramatically change performance. Small models can be expected to **not** be robust in this respect.
2. As reported in the OctoPack paper, the model typically fails by reproducing the buggy code exactly. This is less likely to happen in the code completion setting.

For instance, for the `has_close_elements` function, the model reproduces the buggy solution without adding the missing `abs()` call. (However, it does produce correct solutions when sampled several times.)

```
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False
```
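For comparison, the expected fix is a one-line change: wrapping the difference in `abs()`. The sketch below is a hand-written repaired version (not model output), checked against the assertions from the prompt shown earlier.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                # The fix: compare the absolute distance, not the signed difference.
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False


def check(has_close_elements):
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert has_close_elements([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False


check(has_close_elements)
```

Without `abs()`, any pair where the first element is smaller yields a negative difference, which is always below a positive threshold, so the buggy version returns `True` for almost any input.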



### Positive observations...

1. The model is surprisingly powerful. It performs better than StarCoder, StarCoder-beta and CodeGeeX2, which are all considerably larger models (6B to 16B parameters). The reason Refact outperforms these larger models may be that fine-tuning on CommitPackFT has taught it to make small, targeted changes, which are typically what is required to fix bugs.
2. It is also worth noting that the model was not fine-tuned for Python exclusively, yet it provides performance comparable to Python-specific models. This suggests that further fine-tuning on language-specific datasets would make it a powerful instrument for that specific language.

# Git commits as a source for Code Instruction-Tuning Datasets

Git commits prove to be a valuable source of instruction data for code generation models. The paper and these experimental results show that small changes in code, coupled with clear and precise commit messages, can be used to teach models to make precise, targeted and small changes in code. This is the main capability required for HumanEvalFix, and many larger models that have not had such fine-tuning perform poorly on this benchmark.

I believe git data can be used in a variety of other ways too. For instance, the authors drop any commits that create new files, but these could be used for code synthesis tasks.

Additionally, other data from Git commits can be used for related tasks. For instance, git diffs can be used to train models to express small changes.
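
As a rough illustration of the diff idea, a unified diff can be derived from the old and new contents with Python's standard `difflib` module, so no extra data beyond the existing fields is needed. The two snippets below are shortened from the Django example earlier in this notebook; the helper name and the file name are hypothetical.

```python
import difflib


def commit_diff(old_contents: str, new_contents: str,
                filename: str = "models.py") -> str:
    """Return a unified diff between the old and new file contents."""
    diff_lines = difflib.unified_diff(
        old_contents.splitlines(keepends=True),
        new_contents.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff_lines)


old = (
    "class Authors(models.Model):\n"
    "    author = models.ForeignKey(ContentType)\n"
    "    object_id = models.PositiveIntegerField()\n"
)
new = (
    "class Authors(models.Model):\n"
    "    content_type = models.ForeignKey(ContentType)\n"
    "    object_id = models.PositiveIntegerField()\n"
)

diff_text = commit_diff(old, new)
print(diff_text)
```

A target in this form is much shorter than the full new file, which may suit the small, targeted edits discussed above.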