## Dataset Collection

The dataset collection code consists of two classes:

*   **GitHubRepo** sets up a session for a GitHub repository and retrieves its contents through the GitHub REST API using the `requests` library.
*   **KotlinDatasetBuilder** recursively walks the files in the repository and extracts the Kotlin source files (`.kt` and `.kts`).

P.S.: Before running the code, do not forget to insert your GitHub token.
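
Rather than pasting the token into the notebook, one option is to read it from an environment variable. The following is a minimal sketch, assuming the token has been exported under the name `GITHUB_TOKEN`; both that variable name and the helper `read_github_token` are illustrative and not part of the original code.

```python
import os


def read_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Return the token stored in `env_var`, or an empty string if it is unset.

    An empty string matches the default `token=''` of GitHubRepo, which then
    sends requests without an Authorization header.
    """
    # Hypothetical helper: the variable name is an assumption, not a convention
    # required by the notebook.
    return os.environ.get(env_var, "").strip()
```

The result could then be passed as `token=read_github_token()` when constructing `GitHubRepo`. Unauthenticated requests are subject to a much lower GitHub API rate limit, so a recursive walk of a large repository will likely need a token.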


In [None]:
import requests
import os
import re
import json
import random
from typing import Any, Dict, Optional, List

In [None]:
class GitHubRepo:

    def __init__(self, owner: str, repo: str, token: str = '') -> None:

        self.base_url = f"https://api.github.com/repos/{owner}/{repo}"
        self.session = requests.Session()
        if token:
            self.session.headers.update({'Authorization': f'token {token}'})


    def get_contents(self, path: str = '') -> Any:

        """ Retrieve the contents of a directory in a repository """

        url = f"{self.base_url}/contents/{path}"
        response = self.session.get(url)
        response.raise_for_status()
        return response.json()


    def download_file(self, file_url: str) -> str:

        """ Download a single file from GitHub """

        response = self.session.get(file_url)
        response.raise_for_status()
        return response.text

In [None]:
class KotlinDatasetBuilder:

    def __init__(self, github_repo: GitHubRepo) -> None:
        self.repo = github_repo
        self.dataset = []


    def explore_and_extract(self, path: str = '') -> None:

        """ Recursively explore given repository path and extract Kotlin files """

        contents = self.repo.get_contents(path)
        for content in contents:
            if content['type'] == 'dir':
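                try:
                    self.explore_and_extract(content['path'])
                except Exception as e:
                    # Skip directories that cannot be processed, but report them
                    print(f"Skipping {content['path']}: {e}")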
            elif content['name'].endswith('.kt') or content['name'].endswith('.kts'):
                file_content = self.repo.download_file(content['download_url'])
                self.dataset.append({'path': content['path'],
                                     'content': file_content})


    def save_dataset(self, filename: str = 'kotlin_code_dataset.json') -> None:

        """ Save the collected Kotlin code data to a JSON file """

        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.dataset, f, indent=4)

Insert your GitHub token below before running the code.

In [None]:
# Initialize the GitHub repository handler
github_repo = GitHubRepo(owner='Kotlin', repo='kotlinx.coroutines',
                         token='YOUR_GITHUB_TOKEN')

# Initialize the dataset builder
dataset_builder = KotlinDatasetBuilder(github_repo)

# Explore the repository and build the dataset
dataset_builder.explore_and_extract()

# Save the dataset to a file
dataset_builder.save_dataset('kotlin_code_dataset.json')

## Process Dataset

This step preprocesses the gathered dataset, with the procedure adapted to Kotlin, and prepares it for training and testing in the same format as the [CodeXGLUE preprocessing script](https://github.com/microsoft/CodeXGLUE/blob/main/Code-Code/CodeCompletion-token/dataset/py150/preprocess.py) for the Code Completion (line level) task. Token-level preprocessing comes first because the line-level completion task uses the same data format. I follow the authors' annotation scheme (for example, the `<STR_LIT>` and `<NUM_LIT>` placeholders) for reproducibility.
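
To make the literal normalization concrete, here is a condensed sketch of the idea applied to a single line. It is not the notebook's implementation: `toy_literals`, `normalize_literal` and `normalize_line` are hypothetical names, and the toy vocabulary only assumes that `literals.json` holds a `"str"` list and a `"num"` list, which is how the code in this section indexes it.

```python
import re
from typing import Dict, List

# Hypothetical literal vocabulary; the contents of the real literals.json
# are not shown in this notebook.
toy_literals: Dict[str, List[str]] = {"str": ["utf-8"], "num": ["0", "1"]}

# Quoted strings, decimals, or integers.
LITERAL_RE = re.compile(r'".*?"|\d+\.\d+|\d+')


def normalize_literal(match: re.Match, literals: Dict[str, List[str]]) -> str:
    """Map one matched literal to its placeholder token."""
    token = match.group(0)
    if token.startswith('"'):
        body = token[1:-1]  # strip the surrounding quotes
        return f"<STR_LIT:{body}>" if body in literals["str"] else "<STR_LIT>"
    return f"<NUM_LIT:{token}>" if token in literals["num"] else "<NUM_LIT>"


def normalize_line(line: str, literals: Dict[str, List[str]]) -> str:
    """Replace every string or number literal in `line` with a placeholder."""
    return LITERAL_RE.sub(lambda m: normalize_literal(m, literals), line)


sample = 'val name = "Alice"; val enc = "utf-8"; val n = 1; val pi = 3.14'
print(normalize_line(sample, toy_literals))
# val name = <STR_LIT>; val enc = <STR_LIT:utf-8>; val n = <NUM_LIT:1>; val pi = <NUM_LIT>
```

Literals found in the vocabulary keep their value inside the placeholder, while all others collapse to a generic token. Note that a pattern this simple also matches digits inside identifiers (for example, the `1` in `x1`).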

In [None]:
def process_string(token: str, literals: Dict[str, List[str]]) -> str:

    """ Processing string literals with predefined replacements """

    str_lit = re.sub(r'^"(.*)"$', r'\1', token)  # Remove surrounding quotes
    if str_lit in literals['str']:
        return f"<STR_LIT:{str_lit}>"
    return "<STR_LIT>"


def process_number(token: str, literals: Dict[str, List[str]]) -> str:

    """ Processing number literals with predefined markers """

    if token in literals['num']:
        return f"<NUM_LIT:{token}>"
    return "<NUM_LIT>"


def tokenize_kotlin(code: str, literals: Dict[str, List[str]]) -> List[str]:

    """ Tokenize Kotlin code, replacing string and number literals with placeholder tokens """

    tokens = []
    regex = re.compile(r'\".*?\"|\d+\.\d+|\d+')  # Simple regex to capture quoted strings and numbers

    start = 0
    for match in regex.finditer(code):
        # Split the text before the match while preserving lines
        before = code[start:match.start()]
        # Preserve new lines by splitting on them and reinserting them into the token list
        tokens.extend([x for x in re.split(r'(\n)', before) if x])
        token = match.group(0)
        if token.startswith('"'):
            tokens.append(process_string(token, literals))
        elif re.match(r'\d', token):
            tokens.append(process_number(token, literals))
        start = match.end()

    # Append remaining parts of the code that do not match the regex, preserving newlines
    remaining_text = code[start:]
    tokens.extend([x for x in re.split(r'(\n)', remaining_text) if x])
    return tokens

In [None]:
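# Load the literal vocabulary, closing the file once it has been read
with open("literals.json", 'r', encoding='utf-8') as f:
    literals = json.load(f)
json_file = "kotlin_code_dataset.json"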


with open(json_file, 'r') as file:
    data = json.load(file)
    output = []
    for entry in data:
        kotlin_code = entry['content']
        tokens = tokenize_kotlin(kotlin_code, literals)
        processed_code = " ".join(tokens)
        output.append(processed_code)

# Write the processed code to a JSON file
with open('kotlin_code_dataset_processed_tokens.json', 'w') as out_file:
    content_list = [{'content': line} for line in output]
    json.dump(content_list, out_file, indent=4)

Because we work with the code completion task at the line level, we need to split each file into lines as described in the [CodeXGLUE repository](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line). The proportions of input and output tokens used there are not preserved, because Kotlin code differs from Python code, including in length.

I remove the last two lines of each Kotlin file, because these are usually a `}` followed by an empty line, which are of little interest to predict. The last remaining line becomes the ground truth to be predicted, and everything before it becomes the input.
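
The split described above can be illustrated on a single toy file. This is a minimal sketch: the helper name `split_last_line` and its `drop_trailing` parameter are illustrative and do not appear in the notebook, and the sample string stands in for an already preprocessed Kotlin file.

```python
from typing import Dict, Optional


def split_last_line(content: str, drop_trailing: int = 2) -> Optional[Dict[str, str]]:
    """Split one file into an input context and a ground-truth line.

    The last `drop_trailing` lines are discarded, the last remaining line
    becomes "gt", and the lines before it are joined with <EOL> markers.
    Returns None when fewer than two lines remain.
    """
    lines = content.split("\n")
    kept = lines[:max(0, len(lines) - drop_trailing)]
    if len(kept) < 2:
        return None
    return {
        "input": "<s> " + " <EOL> ".join(kept[:-1]),
        "gt": kept[-1],
    }


sample = "fun main() {\n    val x = <NUM_LIT>\n    println(x)\n}\n"
print(split_last_line(sample))
print(split_last_line("}\n"))  # too short after trimming, so this is None
```

In this example the closing `}` and the trailing empty line are dropped, `println(x)` (with its indentation) becomes the ground truth, and the two preceding lines form the input. A file that has nothing left after trimming is skipped.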

In [None]:
class CodeCompletionProcessor:
    def __init__(self, input_file: str, output_file: str) -> None:

        """ Initialize the processor with file paths """

        self.input_file = input_file
        self.output_file = output_file


    @staticmethod
    def tokenize_code(code: str) -> str:

        """ Tokenize the Kotlin code by replacing new lines with a special token """

        return code.replace("\n", " <EOL> ")


    def process_data(self) -> None:

        """ Process the Kotlin code data by splitting content such that the last line is the ground truth """

        with open(self.input_file, 'r') as file:
            data = json.load(file)

        processed_data = []

        for entry in data:
            content = entry['content']
            tokens = content.split('\n')[:-2]
            if len(tokens) > 1:
                # Take all but the last line for input, and the last line for ground truth
                input_section = "<s> " + self.tokenize_code('\n'.join(tokens[:-1]))
                gt_section = self.tokenize_code(tokens[-1])

                processed_data.append({
                    "input": input_section,
                    "gt": gt_section
                })
            else:
                # Skip files with fewer than two lines left after trimming
                print(f"Skipping file with insufficient lines: {len(tokens)} lines found.")

        with open(self.output_file, 'w') as file:
            for item in processed_data:
                json.dump(item, file)
                file.write('\n')

In [None]:
processor = CodeCompletionProcessor('kotlin_code_dataset_processed_tokens.json',
                                    'kotlin_code_dataset_processed_lines.json')
processor.process_data()

Skipping file with insufficient lines: 0 lines found.


## Split dataset into train and test sets

In total, the dataset contains 1049 Kotlin samples. I split it as follows: the first 949 samples form the training set and the last 100 samples form the test set (the same amount of test data was used in the CodeXGLUE Python dataset).
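
The arithmetic of this split can be checked on synthetic data. The sketch below is illustrative only: `split_train_test` is a hypothetical helper and the samples are generated placeholders, not real Kotlin code. Unlike an in-place update, it builds copies of the held-out entries, so the original list is left untouched.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, str]


def split_train_test(
    samples: List[Sample], test_size: int = 100
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Hold out the last `test_size` samples and separate their answers."""
    if len(samples) <= test_size:
        raise ValueError(f"Need more than {test_size} samples, got {len(samples)}.")
    train = samples[:-test_size]
    held_out = samples[-test_size:]
    # Test entries keep their input but get an empty ground truth
    test = [{**s, "gt": ""} for s in held_out]
    # The removed ground truths are stored separately, in the same order
    answers = [{"gt": s["gt"]} for s in held_out]
    return train, test, answers


# Placeholder samples standing in for the 1049 processed Kotlin files
toy = [{"input": f"<s> line {i}", "gt": f"gt {i}"} for i in range(1049)]
train, test, answers = split_train_test(toy)
print(len(train), len(test), len(answers))  # 949 100 100
```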

In [None]:
def load_and_process_json(input_file: str,
                          output_train_file: str,
                          output_test_file: str,
                          output_test_answers: str) -> None:

    """ Load JSON objects from a file, split them into train and test sets,
    while keeping the answers for the test set in a separate file, save all """

    data = []
    with open(input_file, 'r') as file:
        for line in file:
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")
                continue

    # Require more than 100 entries so that the training set is not empty
    if len(data) <= 100:
        raise ValueError("The input file does not contain enough entries (more than 100 required).")

    train_data = data[:-100]

    test_data = data[-100:]
    test_answers = []
    for entry in test_data:
        test_answers.append({"gt":entry['gt']})
        entry['gt'] = ""  # Replace 'gt' values with empty strings

    with open(output_train_file, 'w') as file:
        for entry in train_data:
            json.dump(entry, file)
            file.write('\n')

    with open(output_test_file, 'w') as file:
        for entry in test_data:
            json.dump(entry, file)
            file.write('\n')

    with open(output_test_answers, 'w') as file:
        for entry in test_answers:
            json.dump(entry, file)
            file.write('\n')

In [None]:
load_and_process_json(input_file="kotlin_code_dataset_processed_lines.json",
                      output_train_file="kotlin_code_train.json",
                      output_test_file="kotlin_code_test.json",
                      output_test_answers="kotlin_code_answers.json")