### Distilling DeepSeek Coder 1.3B into a student model for test case assertion generation

First, we install and import the required dependencies:

In [23]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118


In [29]:
import json
from torch.utils.data import Dataset

from Data.decompression_test import decompress_tensor_optimized

Testing decompression of logits...
Entry loaded successfully.

Teacher prediction:
assertNotNull(ret);
assertEquals(Integer.valueOf(3), ret);

Reference assertions:
assertNotNull(ret);
assertEquals(Long.class, ret.getClass());
assertEquals(123L, ret);

Successfully decompressed logits!
Shape: torch.Size([512, 32100])
Data type: torch.float32
Min value: -12.0625
Max value: 39.4688
Sample values (first 5): [11.985416412353516, 29.16250228881836, 1.6791677474975586, -1.7562494277954102, -1.7562494277954102]
Compression format: quantized_4bit
Compression ratio: 59.81x
Original size: 31.35 MB
Compressed size: 0.52 MB


Let's start by understanding the data format. The file Data/dataset_with_predictions.jsonl holds the teacher model's data: its inputs together with its outputs.

In [30]:
NUM_LINES_TO_INSPECT = 5
DATA_PATH = "Data/dataset_with_predictions.jsonl"

inspected_data = []

with open(DATA_PATH, 'r') as data_file:
    for i, line_content in enumerate(data_file):
        if i >= NUM_LINES_TO_INSPECT:
            break
        data = json.loads(line_content.strip())
        inspected_data.append(data)

Now let's take a closer look at the first parsed JSON entry:

In [31]:
print(inspected_data[0].keys())
print(inspected_data[0]["test_method_masked"])
print(inspected_data[0]["assertions"])
print(inspected_data[0]["teacher_prediction"])
print(inspected_data[0]["teacher_parsed_assertions"])
print(inspected_data[0]["teacher_metrics"])

dict_keys(['repository', 'focal_file', 'test_method_original', 'test_method_masked', 'assertions', 'method_under_test', 'teacher_prediction', 'teacher_parsed_assertions', 'teacher_metrics', 'teacher_logits'])
@Test
    public void testNaturalNumber() throws Exception {
        Object ret = reader.read("123");

    }
['assertNotNull(ret);', 'assertEquals(Long.class, ret.getClass());', 'assertEquals(123L, ret);']
assertNotNull(ret);
assertEquals(Integer.valueOf(3), ret);
['assertNotNull(ret);', 'assertEquals(Integer.valueOf(3), ret);']
{'precision': 1.0, 'recall': 0.6666666666666666, 'f1': 0.8, 'accuracy': 0.6666666666666666, 'similarity': 1.0, 'exact_matches': 2, 'generated_count': 2, 'reference_count': 3}


As we can see, each entry contains the repository the sample was taken from (repository), the file containing the class under test (focal_file), and the test method as it was originally written (test_method_original). The test method is also stored split in two: a masked version with the assertions removed (test_method_masked) and the assertions themselves (assertions). The entry further holds the method under test (method_under_test), the teacher model's prediction both as a string (teacher_prediction) and as a list of assertions (teacher_parsed_assertions), metrics describing how well that prediction matches the reference (teacher_metrics), and the teacher's compressed output logits (teacher_logits), which we will use in the student model's loss function.
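
The loss function itself is not shown in this section. A common way to use teacher logits in distillation is to blend a temperature-softened KL divergence between teacher and student distributions with the ordinary hard-label loss. The plain-Python sketch below illustrates that arithmetic for a single token position; the function names, the temperature and the alpha weight are assumptions for illustration, not values taken from this notebook.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def soft_target_kl(student_logits, teacher_logits, temperature):
    """KL(teacher || student) between the temperature-softened distributions."""
    teacher_log_probs = log_softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return sum(
        math.exp(t) * (t - s)
        for t, s in zip(teacher_log_probs, student_log_probs)
    )


def distillation_loss(student_logits, teacher_logits, hard_loss,
                      temperature=2.0, alpha=0.5):
    """Blend the soft-target term with a hard-label loss.

    temperature and alpha are hypothetical hyperparameters.
    """
    # Scaling by T^2 keeps the soft term's gradient magnitude comparable
    # when the temperature changes.
    soft_loss = soft_target_kl(student_logits, teacher_logits, temperature)
    soft_loss *= temperature ** 2
    return alpha * soft_loss + (1 - alpha) * hard_loss


teacher = [4.0, 1.0, -2.0]
identical = distillation_loss(teacher, teacher, hard_loss=0.0)
different = distillation_loss([0.0, 0.0, 0.0], teacher, hard_loss=0.0)
print(identical, different)
```

The soft term is zero when the student reproduces the teacher's distribution and grows as the two diverge. In an actual training loop the same computation would be done on tensors over all token positions at once.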

Now, for every entry in the dataset, we need to construct an input for the student model that follows the same format as the teacher model's input (as defined in DataGeneration/train_teacher_model.py), and then tokenize it. The StudentDataset class below handles both steps and also decompresses the teacher's logits for each entry:

In [32]:
class StudentDataset(Dataset):
    """Dataset for the student model"""

    def __init__(self, data, tokenizer, max_src_length=1024, max_tgt_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_src_length = max_src_length
        self.max_tgt_length = max_tgt_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Construct input: the focal file followed by the test method without assertions
        input_text = f"FOCAL CODE:\n{item['focal_file']}\n\nTEST METHOD:\n{item['test_method_masked']}"

        # Target: The assertions that need to be generated
        target_text = "\n".join(item['assertions'])

        # Tokenize inputs
        source_encoding = self.tokenizer(
            input_text,
            max_length=self.max_src_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        # Tokenize targets
        target_encoding = self.tokenizer(
            target_text,
            max_length=self.max_tgt_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        # Drop only the batch dimension added by return_tensors="pt"
        input_ids = source_encoding["input_ids"].squeeze(0)
        attention_mask = source_encoding["attention_mask"].squeeze(0)
        labels = target_encoding["input_ids"].squeeze(0)

        # Replace padding token id with -100 so it's ignored in loss computation
        labels[labels == self.tokenizer.pad_token_id] = -100
        
        decompressed_teacher_logits = decompress_tensor_optimized(item['teacher_logits'])

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
            "original_input": input_text,
            "original_target": target_text,
            "idx": idx,
            "teacher_logits": decompressed_teacher_logits,
        }
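
To see what the student model receives before tokenization, the prompt template used in __getitem__ can be applied to a single entry by hand. The sketch below does this for the first entry shown earlier; build_example is a hypothetical helper, and the focal_file value is a short placeholder because the real file content is not printed in this notebook.

```python
def build_example(item):
    """Return (input_text, target_text) using the same template as StudentDataset."""
    input_text = (
        f"FOCAL CODE:\n{item['focal_file']}\n\n"
        f"TEST METHOD:\n{item['test_method_masked']}"
    )
    target_text = "\n".join(item["assertions"])
    return input_text, target_text


sample = {
    # Placeholder: the real focal file is a full Java source file.
    "focal_file": "public class Reader { /* ... */ }",
    "test_method_masked": (
        "@Test\n"
        "    public void testNaturalNumber() throws Exception {\n"
        '        Object ret = reader.read("123");\n'
        "\n"
        "    }"
    ),
    "assertions": [
        "assertNotNull(ret);",
        "assertEquals(Long.class, ret.getClass());",
        "assertEquals(123L, ret);",
    ],
}

input_text, target_text = build_example(sample)
print(input_text)
print("-----")
print(target_text)
```

The target holds one assertion per line, which is the text the labels tensor encodes.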

Now that we have the dataset class, we also need a function to load the data:

In [28]:
def load_dataset(jsonl_path):
    """Load entries from a JSONL file, skipping blank and malformed lines"""
    data = []
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                try:
                    data.append(json.loads(line))
                except json.JSONDecodeError:
                    # Malformed lines are dropped silently
                    continue
    return data
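
Because load_dataset drops malformed lines silently, a truncated or corrupted file can shrink the training set without any warning. One possible variant, sketched here under the hypothetical name load_dataset_with_report, also returns the line numbers it skipped so the caller can decide whether the loss is acceptable. The demo file is a throwaway created only for this example.

```python
import json
import os
import tempfile


def load_dataset_with_report(jsonl_path):
    """Like load_dataset, but also return the 1-based numbers of unparsable lines."""
    data, bad_lines = [], []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are not treated as errors
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines.append(line_number)
    return data, bad_lines


# Demo file: a valid entry, a blank line, a truncated entry, another valid entry.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"assertions": ["assertNotNull(ret);"]}\n')
    tmp.write("\n")
    tmp.write('{"assertions": ["assertEquals(123L, ret);"\n')
    tmp.write('{"assertions": []}\n')
    demo_path = tmp.name

entries, skipped = load_dataset_with_report(demo_path)
os.remove(demo_path)
print(len(entries), skipped)  # prints: 2 [3]
```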

We also need a function to train the student model: