#### Preparing the dataset for the Code2Test task

This .ipynb notebook converts [open source datasets](data) into the format expected by the task solution.

Import the required modules:

In [200]:
import numpy as np
import pandas as pd
import re
import ast
import flowchart
import json
import time

import os
import random

from tqdm import tqdm
tqdm.pandas()

Set the seeds:

In [201]:
SEED = 42

random.seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings("ignore")

During the [preliminary analysis](data/data_pre_research) of the available open source datasets, we hypothesized that the [Arain-unitTest](https://huggingface.co/datasets/Arain/UnitTest-Finetuning) dataset is relevant to the task.

We therefore try to extract the required data from this project.

We work with one part first, so that the whole data package can be processed the same way later.

In [5]:
data_path = '/Users/chervonikov_alexey/Desktop/projects/Technopark_Autumn_2024/NN_course_project/data'
folder_path = 'data_Arain_unitTest-FineTuning_example'
json_path = 'zero_shot_multi_unit_test.json'

input_path = os.path.join(data_path, folder_path, json_path)
# print(input_path)

Define the JSON file parser:

In [6]:
def json_parser(input_path = input_path):
	'''Reads the JSON file at the given path and returns the parsed object (here, a list of records)'''
	with open(input_path, 'r') as file:
		json_dict = json.load(file)

	return json_dict

json_example = json_parser()

Let's see how many samples we have:

In [7]:
print(f"Number of samples: {len(json_example)}")

Number of samples: 281060


What the first sample looks like:

ID

In [8]:
json_example[0]['id']

1

Instruction (the prompt itself):

In [9]:
json_example[0]['conversations'][0]['from']

'### Instruction'

In [10]:
json_example[0]['conversations'][0]['value']

"You are a professional python software engineer. You are asked to generate a complete test class for a focal method in a focal class.\nYou will be given the following information of the focal method:\n1. Source code of the focal method.\n2. Source code of the focal class(Code that is not relevant to focal method's execution is filtered).\n3. Source code of callee examples of the focal method.\nYou will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods. \nNote that no additional explanations required.\n\nHere are the information of the focal method:\n1. Source code of the focal method.\n```python\ndef get(self, key, default=None):\n        kl = key.lower()\n        return super().get(self.keymap.get(kl, kl), default)\n```\n\n2. Source code of the focal class(Codes that are may not related to focal method are filtered).\n```python\n\nclass NoCaseDict(dict):\n    

Response (the answer to the prompt):

In [11]:
json_example[0]['conversations'][1]['from']

'### Response'

In [12]:
json_example[0]['conversations'][1]['value']

"```python\nfrom microdot import Microdot, Response, abort\nfrom microdot.test_client import TestClient\n\nclass TestMicrodot(unittest.TestCase):\n    def _run(self, coro):\n        return self.loop.run_until_complete(coro)\n    def test_get_request(self):\n        app = Microdot()\n        @app.route('/')\n        def index(req):\n            return 'foo'\n        @app.route('/async')\n        async def index2(req):\n            return 'foo-async'\n        @app.route('/arg/<id>')\n        def index3(req, id):\n            return id\n        @app.route('/arg/async/<id>')\n        async def index4(req, id):\n            return f'async-{id}'\n        client = TestClient(app)\n        res = self._run(client.get('/'))\n        self.assertEqual(res.status_code, 200)\n        self.assertEqual(res.headers['Content-Type'],\n                         'text/plain; charset=UTF-8')\n        self.assertEqual(res.headers['Content-Length'], '3')\n        self.assertEqual(res.text, 'foo')\n        self

So the query contains the source function, and the response contains a unit test.

The task therefore reduces to separating one from the other and filtering the text.

#### *Important*

The HuggingFace repository ships the prompt templates that occur in the queries alongside the dataset.

This makes it easier to locate the code inside the queries.

Template:

You are a professional {language} software engineer. You are asked to generate a complete test class for a focal method in a focal class.
You will be given the following information of the focal method:
1. Source code of the focal method.
2. Source code of the focal class(Code that is not relevant to focal method's execution is filtered).
3. Source code of callee examples of the focal method.
You will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods. 
Note that no additional explanations required.

Here are the information of the focal method:
1. Source code of the focal method.
{methodCode}

2. Source code of the focal class(Codes that are may not related to focal method are filtered).
{methodTotalCode}

3. Source code of callee examples of the focal method.
{callCode_callees_string}

Please note that the test class you return should include multiple test cases covering different functionalities. There is no upper limit on the number of test cases, but you need to ensure that the test cases provide high test coverage and test extreme and special cases of the code as much as possible.

This lets us split each record into the feature space of our data.

For example, assume the following mapping:

* LANG_TOKEN (Python / Java) <-> {language}

* method code <-> 1. Source code of the focal method

* class code <-> 2. Source code of the focal class

* callee examples <-> 3. Source code of callee examples of the focal method
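The mapping above can be sketched with named regex groups. This is only a minimal illustration on a heavily shortened, hypothetical template: `TOY_PATTERN`, `toy_prompt` and the group names `method_code` / `class_code` are assumptions made for the example, not the real prompt.

```python
import re

# Hypothetical, shortened template; the real prompt is much longer.
TOY_PATTERN = re.compile(
    r"You are a professional (?P<language>\w+) software engineer\.\n"
    r"1\. Source code of the focal method\.\n"
    r"(?P<method_code>.*?)\n"
    r"2\. Source code of the focal class\.\n"
    r"(?P<class_code>.*)",
    re.DOTALL,  # let '.' span the newlines inside the code sections
)

toy_prompt = (
    "You are a professional python software engineer.\n"
    "1. Source code of the focal method.\n"
    "def f():\n    return 1\n"
    "2. Source code of the focal class.\n"
    "class A:\n    def f(self):\n        return 1"
)

match = TOY_PATTERN.match(toy_prompt)
# groupdict() maps every named group to the text it captured
fields = match.groupdict() if match else {}
print(fields["language"])
print(fields["method_code"])
```

Each named group becomes one candidate feature; the lazy `.*?` stops at the next numbered section header instead of consuming the rest of the prompt.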

Build a dataset from the JSON records:

In [14]:
code_dataset = pd.DataFrame(json_example)

In [15]:
code_dataset.head()

Unnamed: 0,id,conversations
0,1,"[{'from': '### Instruction', 'value': 'You are..."
1,2,"[{'from': '### Instruction', 'value': 'You are..."
2,3,"[{'from': '### Instruction', 'value': 'You are..."
3,4,"[{'from': '### Instruction', 'value': 'You are..."
4,5,"[{'from': '### Instruction', 'value': 'You are..."


In [16]:
code_dataset['query'] = code_dataset['conversations'].map(lambda x: x[0]['value'])
code_dataset['response'] = code_dataset['conversations'].map(lambda x: x[1]['value'])
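The cell above relies on the instruction always being the first turn and the response the second. As a hedged alternative, the turns could be selected by their `from` tag instead of by position. The helper `split_conversation` below is hypothetical, and it assumes the tags are exactly `'### Instruction'` and `'### Response'` as in the first sample.

```python
from typing import Dict, List, Optional, Tuple

def split_conversation(
    conversations: List[Dict[str, str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (instruction, response) looked up by role tag, None if a tag is missing."""
    by_role = {turn["from"]: turn["value"] for turn in conversations}
    return by_role.get("### Instruction"), by_role.get("### Response")

# Toy record with the turns deliberately out of order
toy_conversations = [
    {"from": "### Response", "value": "class TestF: ..."},
    {"from": "### Instruction", "value": "You are a professional python software engineer."},
]
instruction, response = split_conversation(toy_conversations)
print(instruction)
```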

In [17]:
code_dataset = code_dataset.drop(columns = ['conversations'])

In [18]:
code_dataset.head()

Unnamed: 0,id,query,response
0,1,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp..."
1,2,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp..."
2,3,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp..."
3,4,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp..."
4,5,You are a professional python software enginee...,```python\nfrom pyner.named_entity.corpus impo...


In [21]:
code_dataset['query'].values[0]

"You are a professional python software engineer. You are asked to generate a complete test class for a focal method in a focal class.\nYou will be given the following information of the focal method:\n1. Source code of the focal method.\n2. Source code of the focal class(Code that is not relevant to focal method's execution is filtered).\n3. Source code of callee examples of the focal method.\nYou will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods. \nNote that no additional explanations required.\n\nHere are the information of the focal method:\n1. Source code of the focal method.\n```python\ndef get(self, key, default=None):\n        kl = key.lower()\n        return super().get(self.keymap.get(kl, kl), default)\n```\n\n2. Source code of the focal class(Codes that are may not related to focal method are filtered).\n```python\n\nclass NoCaseDict(dict):\n    

In [20]:
print(code_dataset['query'].values[0])

You are a professional python software engineer. You are asked to generate a complete test class for a focal method in a focal class.
You will be given the following information of the focal method:
1. Source code of the focal method.
2. Source code of the focal class(Code that is not relevant to focal method's execution is filtered).
3. Source code of callee examples of the focal method.
You will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods. 
Note that no additional explanations required.

Here are the information of the focal method:
1. Source code of the focal method.
```python
def get(self, key, default=None):
        kl = key.lower()
        return super().get(self.keymap.get(kl, kl), default)
```

2. Source code of the focal class(Codes that are may not related to focal method are filtered).
```python

class NoCaseDict(dict):
    def get(self, key, de

We need to check that the queries match the templates from HuggingFace:

In [73]:
def check_template_match(input_text: str, get_features: bool = False):
	'''
	Checks whether a query matches the template
	
	Parameters:
	- input_text: input query
	- get_features: flag that enables feature extraction (default: False)
	
	Returns:
	True if the query matches, otherwise False;
	with get_features=True, a tuple (flag, extracted features or None)
	'''
	# Regular expression that checks the query against the template
	pattern = re.compile(
		r"You are a professional (?P<language>\w+) software engineer\. You are asked to generate a complete test class for a focal method in a focal class\.\n"
		r"You will be given the following information of the focal method:"
		r"\n1\. Source code of the focal method\."
		r"\n2\. Source code of the focal class\(Code that is not relevant to focal method's execution is filtered\)\."
		r"\n3\. Source code of callee examples of the focal method\."
		r"\nYou will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods\. "
		r"\nNote that no additional explanations required\."
		r"\n\nHere are the information of the focal method:"
		r"\n1\. Source code of the focal method\.\n"
		r"(?P<methodCode>.*?)\n"
		r"2\. Source code of the focal class\(Codes that are may not related to focal method are filtered\)\.\n"
		r"(?P<methodTotalCode>.*?)\n"
		r"3\. Source code of callee examples of the focal method\.\n"
		r"(?P<callCode_callees_string>.*?)\n"
		r"\nPlease note that the test class you return should include multiple test cases covering different functionalities\. There is no upper limit on the number of test cases, but you need to ensure that the test cases provide high test coverage and test extreme and special cases of the code as much as possible\.\n"
		,
		re.DOTALL
	)
	
	# Check the query against the template
	if not get_features:
		return bool(pattern.match(input_text))
	else:
		match = pattern.match(input_text)
		if match:
			return True, (match.group('language'),
						match.group('methodCode'),
						match.group('methodTotalCode'),
						match.group('callCode_callees_string'))
		else:
			return False, None


def process_all_samples_quieries(dataset: pd.DataFrame = code_dataset):
	'''
	Checks the queries of all samples against the template published on HuggingFace
	
	Parameters:
	-dataset: dataset with the code (default: code_dataset)
	'''
	for query in tqdm(dataset['query']): # Iterate over the passed dataset and check every query
		assert check_template_match(query), "CODE DOESN'T MATCH THE TEMPLATE"

def query_processing_map(query):
	'''Helper function to be used with pandas.map'''
	_, features = check_template_match(query, get_features=True)
	return features


Check on one data sample:

In [71]:
_, features = check_template_match(json_example[0]['conversations'][0]['value'], get_features=True)
LANG_TOKEN, focal_method, focal_cls, call = features

LANG_TOKEN:

In [60]:
LANG_TOKEN

'python'

focal_method:

In [65]:
print(focal_method)

```python
def get(self, key, default=None):
        kl = key.lower()
        return super().get(self.keymap.get(kl, kl), default)
```



focal_cls:

In [66]:
print(focal_cls)

```python

class NoCaseDict(dict):
    def get(self, key, default=None):
        kl = key.lower()
        return super().get(self.keymap.get(kl, kl), default)
```



call:

In [67]:
print(call)

No callee examples provided.


Run the template check over the whole dataset:

In [72]:
process_all_samples_quieries()

100%|██████████| 281060/281060 [00:06<00:00, 44498.77it/s]


The assert did not fire, so the data matches the declared format!
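An `assert`-based check stops at the first mismatch and is skipped entirely when Python runs with `-O`. A minimal sketch of a non-raising variant that collects the positions of all mismatching queries follows; `find_mismatches` and the simplified `toy_check` are hypothetical stand-ins for the checker used in this notebook.

```python
import re
from typing import Callable, Iterable, List

def find_mismatches(
    queries: Iterable[str],
    matches_template: Callable[[str], bool],
) -> List[int]:
    """Return the positions of the queries for which matches_template is False."""
    return [i for i, query in enumerate(queries) if not matches_template(query)]

def toy_check(text: str) -> bool:
    # Hypothetical simplified checker: only the first sentence of the template
    return bool(re.match(r"You are a professional \w+ software engineer\.", text))

toy_queries = [
    "You are a professional python software engineer. ...",
    "Write some tests please.",
    "You are a professional java software engineer. ...",
]
bad = find_mismatches(toy_queries, toy_check)
print(bad)
```

An empty list would mean that every query matches; a non-empty one points directly at the records to inspect.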

Generate the new features:

In [97]:
code_dataset['code_features'] = code_dataset['query'].progress_apply(lambda query: query_processing_map(query))
code_dataset['LANG_TOKEN'] = code_dataset['code_features'].progress_apply(lambda features: features[0])
code_dataset['focal_method'] = code_dataset['code_features'].progress_apply(lambda features: features[1])
code_dataset['focal_cls'] = code_dataset['code_features'].progress_apply(lambda features: features[2])
code_dataset['callee'] = code_dataset['code_features'].progress_apply(lambda features: features[3])

100%|██████████| 281060/281060 [00:04<00:00, 60671.51it/s]
100%|██████████| 281060/281060 [00:00<00:00, 3988695.85it/s]
100%|██████████| 281060/281060 [00:00<00:00, 3170625.07it/s]
100%|██████████| 281060/281060 [00:00<00:00, 3091046.85it/s]
100%|██████████| 281060/281060 [00:00<00:00, 3836256.99it/s]
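As a possible alternative to the intermediate tuple column, `Series.str.extract` turns every named group of a pattern into its own column in one pass. The sketch below uses a toy pattern and toy data (both hypothetical), not the full template regex.

```python
import re
import pandas as pd

# Toy queries in a made-up two-field format
toy_queries = pd.Series([
    "lang=python\ncode:\ndef f():\n    pass",
    "lang=java\ncode:\nint f() { return 1; }",
])

# Each named group becomes a column of the resulting DataFrame
extracted = toy_queries.str.extract(
    r"lang=(?P<LANG_TOKEN>\w+)\ncode:\n(?P<focal_method>.*)",
    flags=re.DOTALL,  # let '.' span the newlines inside the code
)
print(extracted.columns.tolist())
print(extracted.loc[1, "LANG_TOKEN"])
```

Rows that do not match the pattern come back as NaN, so this approach also exposes template mismatches without a separate pass.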


Extract comments from the code:

In [106]:
def get_comments_and_description(text_code: str):
	'''
	Extracts all comments and function descriptions (docstrings) from the code
	
	Parameters:
	-text_code: string with the code
	
	Returns a tuple (docstrings, comments), each joined with newlines
	'''
	
	# Patterns for finding docstrings and comments
	triple_quotes_pattern = r'"""(.*?)"""|\'\'\'(.*?)\'\'\''
	single_comment_pattern = r'#.*'

	# Extract the matches
	triple_quotes_matches = re.findall(triple_quotes_pattern, text_code, re.DOTALL)
	single_comment_matches = re.findall(single_comment_pattern, text_code)

	docstrings = [match[0] or match[1] for match in triple_quotes_matches]
	comments = single_comment_matches
	
	return ("\n".join(docstrings), "\n".join(comments))
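The `#.*` pattern also matches a `#` that sits inside a string literal. A hedged sketch of a stricter alternative for Python sources uses the standard `tokenize` module; the helper name `extract_comments` is hypothetical, and snippets that cannot be tokenized simply yield whatever was collected before the error.

```python
import io
import re
import tokenize
from typing import List

def extract_comments(source: str) -> List[str]:
    """Collect real comment tokens, ignoring '#' characters inside string literals."""
    comments: List[str] = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comments.append(tok.string)
    except (tokenize.TokenError, SyntaxError):
        # Incomplete or invalid snippet: keep the comments found so far
        pass
    return comments

toy_source = 'url = "http://host/#anchor"  # real comment\nx = 1\n'
print(extract_comments(toy_source))       # tokenizer-based result
print(re.findall(r"#.*", toy_source))     # regex result, for comparison
```

On this toy line the regex captures the tail of the URL together with the comment, while the tokenizer returns only the comment itself.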

In [107]:
docs, comments = get_comments_and_description(code_dataset[code_dataset['callee'] != 'No callee examples provided.'].iloc[2, 1])

In [110]:
print(comments)

# Calculate minor_(0,i) by deleting the first orbital and ith particle indices
# Stack on axis -3 to ensure shape (..., n) once det removes the last two axes
# Calculate x_(i, 0) by selecting orbital index 0
# TODO(ggoldsh): find a faster way to calculate these overlapping determinants.


In [109]:
print(docs)

Converts a regular array into (sign, logabs) form.

    Args:
        x (Array): input data.

    Returns:
        (SLArray): data in form (sign(x), log(abs(x)))
    
Converts a regular array into (sign, logabs) form.
    Args:
        x (Array): input data.
    Returns:
        (SLArray): data in form (sign(x), log(abs(x)))
    
Get the submatrices of x by deleting row i and col 0, for all rows of x.
    Args:
        x (Array): a tensor of orbital matrices which is square in the last two
            dimensions, thus of shape (..., n, n). The second last dimension is the
            particle dimension, and the last is the orbital dimension.
    Returns:
        (int, Array): n, submatrices of shape (..., n, n-1, n-1), obtained by
        deleting row (..., i, :) and deleted column is (..., :, 0), for 0 <= i <= n - 1.
    
Compute a cofactor-based antiequivariance, returning results in slogabs form.
    See :func:`~vmcnet.models.antiequivariance.cofactor_antieq`. This function performs

Now we need to extract the comments and descriptions for every code representation in our dataset:

In [115]:
# For focal_method
code_dataset['focal_method_docs'] = code_dataset['focal_method'].progress_apply(lambda x: get_comments_and_description(x)[0])
code_dataset['focal_method_comments'] = code_dataset['focal_method'].progress_apply(lambda x: get_comments_and_description(x)[1])

# For focal_cls
code_dataset['focal_cls_docs'] = code_dataset['focal_cls'].progress_apply(lambda x: get_comments_and_description(x)[0])
code_dataset['focal_cls_comments'] = code_dataset['focal_cls'].progress_apply(lambda x: get_comments_and_description(x)[1])

# For callee
code_dataset['callee_docs'] = code_dataset['callee'].progress_apply(lambda x: get_comments_and_description(x)[0])
code_dataset['callee_comments'] = code_dataset['callee'].progress_apply(lambda x: get_comments_and_description(x)[1])


100%|██████████| 281060/281060 [00:01<00:00, 209765.73it/s]
100%|██████████| 281060/281060 [00:01<00:00, 212556.65it/s]
100%|██████████| 281060/281060 [00:02<00:00, 112134.77it/s]
100%|██████████| 281060/281060 [00:02<00:00, 113592.29it/s]
100%|██████████| 281060/281060 [00:00<00:00, 378818.78it/s]
100%|██████████| 281060/281060 [00:00<00:00, 375592.00it/s]


Remove the markdown code fences (\`\`\`python <-> ```) from the code text:

In [134]:
def clean_text_code(text_code: str) -> str:
    '''
    Strips the markdown fence markers from the code
    
    Parameters:
    -text_code: code as a string
    
    Returns the "clean" code
    
    '''
    cleaned_code = re.sub(r"^```python\n|^```\n|```$", "", text_code, flags=re.MULTILINE)
    return cleaned_code

In [135]:
print(clean_text_code(code_dataset['focal_method'].values[0]))

def get(self, key, default=None):
        kl = key.lower()
        return super().get(self.keymap.get(kl, kl), default)
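The pattern above targets the `python` fence specifically. Since the template allows other languages (for example Java), a language-agnostic variant might be useful. The sketch below is an assumption rather than part of the original pipeline; `strip_fences` is a hypothetical helper that drops any line consisting only of a fence with an optional language tag.

```python
import re

# A whole line made of ``` plus an optional language tag, including its newline
FENCE_LINE_RE = re.compile(r"^```[\w+#-]*[ \t]*$\n?", re.MULTILINE)

def strip_fences(text_code: str) -> str:
    """Remove markdown fence lines regardless of the language tag."""
    return FENCE_LINE_RE.sub("", text_code)

toy_java = "```java\nint x = 1;\n```\n"
toy_python = "```python\ndef f():\n    pass\n```"
print(strip_fences(toy_java))
print(strip_fences(toy_python))
```

Because the pattern is anchored to a full line, backticks that appear in the middle of a code line are left untouched.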



Apply it to every code representation:

In [136]:
code_dataset['focal_method'] = code_dataset['focal_method'].progress_apply(lambda x: clean_text_code(x))
code_dataset['focal_cls'] = code_dataset['focal_cls'].progress_apply(lambda x: clean_text_code(x))
code_dataset['callee'] = code_dataset['callee'].progress_apply(lambda x: clean_text_code(x))

100%|██████████| 281060/281060 [00:03<00:00, 84797.70it/s]
100%|██████████| 281060/281060 [00:06<00:00, 41581.24it/s]
100%|██████████| 281060/281060 [00:01<00:00, 156426.54it/s]


Explore the callee column:

In [99]:
len(code_dataset['callee'].unique())

18136

So the callee column holds 18136 unique values in total (this count includes the 'No callee examples provided.' placeholder).
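The unique-value count mixes the placeholder with real callee examples. A small sketch on toy data (the DataFrame `toy` is hypothetical) shows how the two figures could be separated: the number of rows that carry a callee example, and the number of distinct examples among them.

```python
import pandas as pd

PLACEHOLDER = "No callee examples provided."

# Toy stand-in for the callee column
toy = pd.DataFrame({
    "callee": [PLACEHOLDER, "def a(): pass", "def a(): pass", "def b(): pass", PLACEHOLDER]
})

has_callee = toy["callee"] != PLACEHOLDER           # boolean mask of rows with a real example
n_rows_with_callee = int(has_callee.sum())          # rows that carry a callee example
n_unique_callees = int(toy.loc[has_callee, "callee"].nunique())  # distinct examples
print(n_rows_with_callee, n_unique_callees)
```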

In [105]:
print(code_dataset[code_dataset['callee'] != 'No callee examples provided.'].iloc[2, 1])

You are a professional python software engineer. You are asked to generate a complete test class for a focal method in a focal class.
You will be given the following information of the focal method:
1. Source code of the focal method.
2. Source code of the focal class(Code that is not relevant to focal method's execution is filtered).
3. Source code of callee examples of the focal method.
You will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods. 
Note that no additional explanations required.

Here are the information of the focal method:
1. Source code of the focal method.
```python
def array_to_slog(x: Array) -> SLArray:
    """Converts a regular array into (sign, logabs) form.

    Args:
        x (Array): input data.

    Returns:
        (SLArray): data in form (sign(x), log(abs(x)))
    """
    return (jnp.sign(x), jnp.log(jnp.abs(x)))
```

2. Source cod

Additionally, build AST trees for the code representations:

In [137]:
print(code_dataset['focal_method'].values[0])

def get(self, key, default=None):
        kl = key.lower()
        return super().get(self.keymap.get(kl, kl), default)



In [169]:
def get_ast_representation(text_code: str) -> str:
    """
    Builds the AST tree for the code. If parsing fails, returns the string 'AST_TOKEN'.
    
    Parameters:
    - text_code: code as text

    Returns the AST tree dump or 'AST_TOKEN' on error.
    """
    try:
        # Try to build the AST tree directly
        tree = ast.parse(text_code)
        # Convert the AST to a string and return it
        ast_tree = ast.dump(tree, indent=4)
        return ast_tree
    except SyntaxError:
        # On a parsing error, return 'AST_TOKEN'
        return "AST_TOKEN"

In [170]:
get_ast_representation(code_dataset['focal_method'].values[1779])

'AST_TOKEN'
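Why sample 1779 fails to parse is not shown here. One possible cause for snippets cut out of a class body is uniform leading indentation, which `ast.parse` rejects. The sketch below (a hypothetical helper `parse_snippet`, not the notebook's function) retries on a dedented copy before giving up; it would not help with non-Python code or genuinely broken syntax.

```python
import ast
import textwrap
from typing import Optional

def parse_snippet(source: str) -> Optional[ast.Module]:
    """Parse source as-is, then retry on a dedented copy; return None if both attempts fail."""
    for candidate in (source, textwrap.dedent(source)):
        try:
            return ast.parse(candidate)
        except SyntaxError:  # IndentationError is a subclass of SyntaxError
            continue
    return None

# A method body copied out of a class keeps its leading indentation
indented_method = "    def area(self):\n        return self.w * self.h\n"
tree = parse_snippet(indented_method)
print(type(tree.body[0]).__name__)
```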

In [171]:
code_dataset.head()

Unnamed: 0,id,query,response,code_features,LANG_TOKEN,focal_method,focal_cls,callee,focal_method_docs,focal_method_comments,focal_cls_docs,focal_cls_comments,callee_docs,callee_comments
0,1,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef get(self, key, default...",python,"def get(self, key, default=None):\n kl ...","\nclass NoCaseDict(dict):\n def get(self, k...",No callee examples provided.,,,,,,
1,2,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef get(self, url_pattern)...",python,"def get(self, url_pattern):\n """"""Decora...","\nclass Microdot:\n def route(self, url_pat...",No callee examples provided.,Decorator that is used to register a function ...,# ...,Decorator that is used to register a function ...,# ...,,
2,3,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef post(self, url_pattern...",python,"def post(self, url_pattern):\n """"""Decor...","\nclass Microdot:\n def route(self, url_pat...",No callee examples provided.,Decorator that is used to register a function ...,# ...,Decorator that is used to register a function ...,# ...,,
3,4,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef mount(self, subapp, ur...",python,"def mount(self, subapp, url_prefix=''):\n ...","\nclass Microdot:\n def mount(self, subapp,...",No callee examples provided.,"Mount a sub-application, optionally under the ...",,"Mount a sub-application, optionally under the ...",,,
4,5,You are a professional python software enginee...,```python\nfrom pyner.named_entity.corpus impo...,"(python, ```python\ndef iob2bio(tags):\n pr...",python,def iob2bio(tags):\n processed_tags = [] #...,"\ndef split_tag(tag: str):\n """"""\n Split...",No callee examples provided.,,# should be bio format\n# case1. I-ORG I-ORG\n...,\n Split tag into state and named entity ca...,# should be bio format\n# case1. I-ORG I-ORG\n...,,


Apply it to all the data:

In [173]:
code_dataset['focal_method_ast'] = code_dataset['focal_method'].progress_apply(lambda x: get_ast_representation(x))
code_dataset['focal_cls_ast'] = code_dataset['focal_cls'].progress_apply(lambda x: get_ast_representation(x))
code_dataset['callee_ast'] = code_dataset['callee'].progress_apply(lambda x: get_ast_representation(x))

100%|██████████| 281060/281060 [00:35<00:00, 7817.64it/s] 
100%|██████████| 281060/281060 [01:13<00:00, 3818.26it/s]
100%|██████████| 281060/281060 [00:02<00:00, 114657.80it/s]


In [174]:
code_dataset.head()

Unnamed: 0,id,query,response,code_features,LANG_TOKEN,focal_method,focal_cls,callee,focal_method_docs,focal_method_comments,focal_cls_docs,focal_cls_comments,callee_docs,callee_comments,focal_method_ast,focal_cls_ast,callee_ast
0,1,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef get(self, key, default...",python,"def get(self, key, default=None):\n kl ...","\nclass NoCaseDict(dict):\n def get(self, k...",No callee examples provided.,,,,,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
1,2,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef get(self, url_pattern)...",python,"def get(self, url_pattern):\n """"""Decora...","\nclass Microdot:\n def route(self, url_pat...",No callee examples provided.,Decorator that is used to register a function ...,# ...,Decorator that is used to register a function ...,# ...,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
2,3,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef post(self, url_pattern...",python,"def post(self, url_pattern):\n """"""Decor...","\nclass Microdot:\n def route(self, url_pat...",No callee examples provided.,Decorator that is used to register a function ...,# ...,Decorator that is used to register a function ...,# ...,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
3,4,You are a professional python software enginee...,"```python\nfrom microdot import Microdot, Resp...","(python, ```python\ndef mount(self, subapp, ur...",python,"def mount(self, subapp, url_prefix=''):\n ...","\nclass Microdot:\n def mount(self, subapp,...",No callee examples provided.,"Mount a sub-application, optionally under the ...",,"Mount a sub-application, optionally under the ...",,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
4,5,You are a professional python software enginee...,```python\nfrom pyner.named_entity.corpus impo...,"(python, ```python\ndef iob2bio(tags):\n pr...",python,def iob2bio(tags):\n processed_tags = [] #...,"\ndef split_tag(tag: str):\n """"""\n Split...",No callee examples provided.,,# should be bio format\n# case1. I-ORG I-ORG\n...,\n Split tag into state and named entity ca...,# should be bio format\n# case1. I-ORG I-ORG\n...,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n FunctionDef(\n ...,AST_TOKEN


So now we can create a general function:

In [175]:
def create_initial_code_dataset(json_dict: list = json_example) -> pd.DataFrame: 
	'''
	Converts the initial data into a dataset suitable for training a Code2Test model
	
	Parameters:
	-json_dict: parsed JSON records (default: json_example)
	
	Returns:
	pd.DataFrame()
	'''
	
	code_dataset = pd.DataFrame(json_dict) # Create the initial dataframe
	code_dataset['query'] = code_dataset['conversations'].map(lambda x: x[0]['value']) # Create the prompt column
	code_dataset['response'] = code_dataset['conversations'].map(lambda x: x[1]['value']) # Create the response column
	code_dataset = code_dataset.drop(columns = ['conversations']) # Drop the conversations column
	process_all_samples_quieries(code_dataset) # Check the queries against the published template

	# Extract the features from the text
	code_dataset['code_features'] = code_dataset['query'].progress_apply(lambda query: query_processing_map(query))
	code_dataset['LANG_TOKEN'] = code_dataset['code_features'].progress_apply(lambda features: features[0]) 
	code_dataset['focal_method'] = code_dataset['code_features'].progress_apply(lambda features: features[1])
	code_dataset['focal_cls'] = code_dataset['code_features'].progress_apply(lambda features: features[2])
	code_dataset['callee'] = code_dataset['code_features'].progress_apply(lambda features: features[3])

	# Extract the descriptions and comments

	# For focal_method
	code_dataset['focal_method_docs'] = code_dataset['focal_method'].progress_apply(lambda x: get_comments_and_description(x)[0])
	code_dataset['focal_method_comments'] = code_dataset['focal_method'].progress_apply(lambda x: get_comments_and_description(x)[1])

	# For focal_cls
	code_dataset['focal_cls_docs'] = code_dataset['focal_cls'].progress_apply(lambda x: get_comments_and_description(x)[0])
	code_dataset['focal_cls_comments'] = code_dataset['focal_cls'].progress_apply(lambda x: get_comments_and_description(x)[1])

	# For callee
	code_dataset['callee_docs'] = code_dataset['callee'].progress_apply(lambda x: get_comments_and_description(x)[0])
	code_dataset['callee_comments'] = code_dataset['callee'].progress_apply(lambda x: get_comments_and_description(x)[1])

	# Strip the markdown fences (as in the step-by-step cells) so that ast can parse the code
	for column in ['focal_method', 'focal_cls', 'callee']:
		code_dataset[column] = code_dataset[column].progress_apply(clean_text_code)

	# AST parsing
	code_dataset['focal_method_ast'] = code_dataset['focal_method'].progress_apply(lambda x: get_ast_representation(x))
	code_dataset['focal_cls_ast'] = code_dataset['focal_cls'].progress_apply(lambda x: get_ast_representation(x))
	code_dataset['callee_ast'] = code_dataset['callee'].progress_apply(lambda x: get_ast_representation(x))

	return code_dataset

Create a dataset-processing class:

In [204]:
class Code2TestPrepareDataset:
	'''Code2Test dataset for the task of generating tests for code'''
	
	def __init__(self, json_input_path):
		'''
		Dataset constructor. Loads the JSON records and builds the DataFrame
		
		Parameters:
		-self
		-json_input_path: path to the input .json file
		'''
		start_time = time.time()
		try:
			with open(json_input_path, 'r') as file:
				data = json.load(file)  # Load the JSON data
				self.code2test_dict = data # Store the parsed records in code2test_dict
		except FileNotFoundError:
			print(f"Error: file {json_input_path} not found.")
			raise # Without the data the dataset cannot be built
		except json.JSONDecodeError:
			print(f"Error: file {json_input_path} is not valid JSON.")
			raise
		except Exception as e:
			print(f"An unexpected error occurred: {e}")
			raise
		self.code_dataset = pd.DataFrame() # Empty code_dataset
		code_dataset = pd.DataFrame(self.code2test_dict) # Build code_dataset from the parsed records
		code_dataset['query'] = code_dataset['conversations'].map(lambda x: x[0]['value']) # Extract the prompt
		code_dataset['response'] = code_dataset['conversations'].map(lambda x: x[1]['value']) # Extract the response
		code_dataset = code_dataset.drop(columns = ['conversations']) # Drop conversations

		self.code_dataset = code_dataset # Store the dataframe
		end_time = time.time()
		print(f"Dataset initialization time: {end_time - start_time:.3f} seconds")
	
	def get_dataset(self):
		'''Simply returns the dataset'''
		return self.code_dataset

	@staticmethod
	def check_template_match(input_text: str, get_features: bool = False):
		'''
		Функция проверяет соответствие запроса шаблону
		
		Параметры:
		- input_text: входной запрос
		- get_features: флаг для осуществления извлечения признаков (default: False)
		
		Возвращает:
		True если соответствует, иначе False;
		Извлеченные признаки
		'''
		# Регулярное выражение для проверки соответствия шаблону
		pattern = re.compile(
			r"You are a professional (?P<language>\w+) software engineer\. You are asked to generate a complete test class for a focal method in a focal class\.\n"
			r"You will be given the following information of the focal method:"
			r"\n1\. Source code of the focal method\."
			r"\n2\. Source code of the focal class\(Code that is not relevant to focal method's execution is filtered\)\."
			r"\n3\. Source code of callee examples of the focal method\."
			r"\nYou will ONLY return unit test code for the focal method including necessary imports and dependencies, make sure it compile without errors, and use reflection to invoke private methods\. "
			r"\nNote that no additional explanations required\."
			r"\n\nHere are the information of the focal method:"
			r"\n1\. Source code of the focal method\.\n"
			r"(?P<methodCode>.*?)\n"
			r"2\. Source code of the focal class\(Codes that are may not related to focal method are filtered\).\n"
			r"(?P<methodTotalCode>.*?)\n"
			r"3\. Source code of callee examples of the focal method.\n"
			r"(?P<callCode_callees_string>.*?)\n"
			r"\nPlease note that the test class you return should include multiple test cases covering different functionalities. There is no upper limit on the number of test cases, but you need to ensure that the test cases provide high test coverage and test extreme and special cases of the code as much as possible\.\n"
			,
			re.DOTALL
		)
		
		# Match the query against the template
		match = pattern.match(input_text)
		if not get_features:
			return bool(match)
		if match:
			return True, (match.group('language'),
						match.group('methodCode'),
						match.group('methodTotalCode'),
						match.group('callCode_callees_string'))
		return False, None

	@staticmethod
	def process_all_samples_quieries(self):
		'''
		Checks that the query of every sample matches the template published on HuggingFace.
		
		Parameters:
		- self: the Code2TestPrepareDataset instance, passed explicitly because the method is declared static
		'''
		for i in tqdm(range(self.code_dataset.shape[0])): # Iterate over the samples and check that each one matches
			assert Code2TestPrepareDataset.check_template_match(self.code_dataset.iloc[i, 1]), "CODE DOESN'T MATCH THE TEMPLATE"
	
	@staticmethod
	def query_processing_map(query):
		'''Helper for use with pandas apply/map: returns the features extracted from a query'''
		_, features = Code2TestPrepareDataset.check_template_match(query, get_features=True)
		return features
	
	@staticmethod
	def get_comments_and_description(text_code: str):
		'''
		Extracts all docstrings and comments from the code.
		
		Parameters:
		- text_code: string with the code
		
		Returns a tuple (docstrings, comments), each joined with newlines
		'''
		
		# Patterns for finding docstrings and comments
		triple_quotes_pattern = r'"""(.*?)"""|\'\'\'(.*?)\'\'\''
		single_comment_pattern = r'#.*'

		# Extract the matches
		triple_quotes_matches = re.findall(triple_quotes_pattern, text_code, re.DOTALL)
		single_comment_matches = re.findall(single_comment_pattern, text_code)

		docstrings = [match[0] or match[1] for match in triple_quotes_matches]
		comments = single_comment_matches
		
		return ("\n".join(docstrings), "\n".join(comments))
	
	@staticmethod
	def clean_text_code(text_code: str) -> str:
		'''
		Strips Markdown code fences from the code.
		
		Parameters:
		- text_code: code as a string
		
		Returns the "clean" code
		
		'''
		cleaned_code = re.sub(r"^```python\n|^```\n|```$", "", text_code, flags=re.MULTILINE)
		return cleaned_code
	
	@staticmethod
	def get_ast_representation(text_code: str) -> str:
		"""
		Получает AST-дерево для кода. Если возникает ошибка при парсинге, возвращает строку 'AST_TOKEN'.
		
		Параметры:
		- text_code: код в виде текста

		Возвращает AST-дерево или 'AST_TOKEN' в случае ошибки.
		"""
		try:
			# Пытаемся сразу построить AST-дерево
			tree = ast.parse(text_code)
			# Преобразуем AST в строку и возвращаем
			ast_tree = ast.dump(tree, indent=4)
			return ast_tree
		except SyntaxError:
			# Если ошибка, возвращаем 'AST_TOKEN'
			return "AST_TOKEN"
	
	def prepare_dataset(self):
		'''Prepares the dataset'''

		start_time_ = time.time()
		
		start_time = time.time()
		self.process_all_samples_quieries(self) # Check that the queries match the given template
		end_time = time.time()
		print(f"Query format check time: {end_time - start_time:.3f} seconds")

		# Extract the features from the query text
		start_time = time.time()
		self.code_dataset['code_features'] = self.code_dataset['query'].progress_apply(lambda query: self.query_processing_map(query))
		self.code_dataset['LANG_TOKEN'] = self.code_dataset['code_features'].progress_apply(lambda features: features[0])
		self.code_dataset['focal_method'] = self.code_dataset['code_features'].progress_apply(lambda features: features[1])
		self.code_dataset['focal_cls'] = self.code_dataset['code_features'].progress_apply(lambda features: features[2])
		self.code_dataset['callee'] = self.code_dataset['code_features'].progress_apply(lambda features: features[3])
		end_time = time.time()
		print(f"Time to extract the global features from the code text: {end_time - start_time:.3f} seconds")
		
		# Extract docstrings and comments

		# For focal_method
		start_time = time.time()
		self.code_dataset['focal_method_docs'] = self.code_dataset['focal_method'].progress_apply(lambda x: self.get_comments_and_description(x)[0])
		self.code_dataset['focal_method_comments'] = self.code_dataset['focal_method'].progress_apply(lambda x: self.get_comments_and_description(x)[1])
		
		# For focal_cls
		self.code_dataset['focal_cls_docs'] = self.code_dataset['focal_cls'].progress_apply(lambda x: self.get_comments_and_description(x)[0])
		self.code_dataset['focal_cls_comments'] = self.code_dataset['focal_cls'].progress_apply(lambda x: self.get_comments_and_description(x)[1])
		
		# For callee
		self.code_dataset['callee_docs'] = self.code_dataset['callee'].progress_apply(lambda x: self.get_comments_and_description(x)[0])
		self.code_dataset['callee_comments'] = self.code_dataset['callee'].progress_apply(lambda x: self.get_comments_and_description(x)[1])
		end_time = time.time()
		print(f"Time to extract comments and docstrings from the code text: {end_time - start_time:.3f} seconds")
		
		# "Чистый" код
		start_time =  time.time()
		self.code_dataset['focal_method'] =self.code_dataset['focal_method'].progress_apply(lambda x: self.clean_text_code(x))
		self.code_dataset['focal_cls'] = self.code_dataset['focal_cls'].progress_apply(lambda x: self.clean_text_code(x))
		self.code_dataset['callee'] = self.code_dataset['callee'].progress_apply(lambda x: self.clean_text_code(x))	
		self.code_dataset['response'] = self.code_dataset['response'].progress_apply(lambda x: self.clean_text_code(x))
		end_time = time.time()	
		print(f"Время очистки текста кода: {end_time - start_time:.3f} секунды")

		# AST parsing
		start_time = time.time()
		self.code_dataset['focal_method_ast'] = self.code_dataset['focal_method'].progress_apply(lambda x: self.get_ast_representation(x))
		self.code_dataset['focal_cls_ast'] = self.code_dataset['focal_cls'].progress_apply(lambda x: self.get_ast_representation(x))
		self.code_dataset['callee_ast'] = self.code_dataset['callee'].progress_apply(lambda x: self.get_ast_representation(x))
		end_time = time.time()
		print(f"Time to build the ASTs from the code text: {end_time - start_time:.3f} seconds")

		end_time_ = time.time()
		print(f"Dataset preparation time: {end_time_ - start_time_:.3f} seconds")

data_path = '/Users/chervonikov_alexey/Desktop/projects/Technopark_Autumn_2024/NN_course_project/data'
folder_path = 'data_Arain_unitTest-FineTuning_example'
json_path = 'zero_shot_multi_unit_test.json'

input_json_path = os.path.join(data_path, folder_path, json_path)
code2test_dataset = Code2TestPrepareDataset(input_json_path)

Dataset initialization time: 12.025 seconds
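
The template check in `check_template_match` relies on named groups combined with `re.DOTALL`, so that `.*?` can span the multi-line code blocks between the fixed prompt sentences. Below is a minimal sketch of the same idea on a shortened template. The pattern `TOY_PATTERN`, the helper `extract_toy_features` and the sample query are hypothetical and only illustrate the mechanism; they are not the full prompt from the dataset.

```python
import re

# Shortened, hypothetical template: the real prompt is much longer.
TOY_PATTERN = re.compile(
    r"You are a professional (?P<language>\w+) software engineer\.\n"
    r"1\. Source code of the focal method\.\n"
    r"(?P<methodCode>.*?)\n"
    r"2\. Source code of the focal class\.\n"
    r"(?P<methodTotalCode>.*)",
    re.DOTALL,  # lets "." match newlines inside the code blocks
)

def extract_toy_features(text):
    """Return (language, method code, class code), or None if the text does not match."""
    match = TOY_PATTERN.match(text)
    if match is None:
        return None
    return (
        match.group("language"),
        match.group("methodCode"),
        match.group("methodTotalCode"),
    )

toy_query = (
    "You are a professional python software engineer.\n"
    "1. Source code of the focal method.\n"
    "def f(x):\n    return x + 1\n"
    "2. Source code of the focal class.\n"
    "class A:\n    pass"
)

toy_features = extract_toy_features(toy_query)
print(toy_features)
```

The lazy quantifier `.*?` stops at the first occurrence of the next fixed sentence, which is what keeps the groups from swallowing each other.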


In [205]:
code2test_dataset.prepare_dataset()

100%|██████████| 281060/281060 [00:06<00:00, 44459.03it/s]


Query format check time: 6.325 seconds


100%|██████████| 281060/281060 [00:12<00:00, 22654.25it/s]
100%|██████████| 281060/281060 [00:00<00:00, 3863742.71it/s]
100%|██████████| 281060/281060 [00:00<00:00, 2916107.61it/s]
100%|██████████| 281060/281060 [00:00<00:00, 2941722.29it/s]
100%|██████████| 281060/281060 [00:00<00:00, 3750218.18it/s]


Time to extract the global features from the code text: 12.774 seconds


100%|██████████| 281060/281060 [00:01<00:00, 198340.94it/s]
100%|██████████| 281060/281060 [00:01<00:00, 187343.73it/s]
100%|██████████| 281060/281060 [00:02<00:00, 105407.33it/s]
100%|██████████| 281060/281060 [00:02<00:00, 106205.20it/s]
100%|██████████| 281060/281060 [00:00<00:00, 356826.44it/s]
100%|██████████| 281060/281060 [00:00<00:00, 354183.66it/s]


Time to extract comments and docstrings from the code text: 9.833 seconds


100%|██████████| 281060/281060 [00:03<00:00, 81301.95it/s]
100%|██████████| 281060/281060 [00:07<00:00, 38857.36it/s]
100%|██████████| 281060/281060 [00:02<00:00, 137387.99it/s]
100%|██████████| 281060/281060 [00:05<00:00, 47981.12it/s]


Code text cleaning time: 18.662 seconds


100%|██████████| 281060/281060 [00:36<00:00, 7769.24it/s] 
100%|██████████| 281060/281060 [01:17<00:00, 3625.27it/s]
100%|██████████| 281060/281060 [00:02<00:00, 109963.46it/s]

Time to build the ASTs from the code text: 117.064 seconds
Dataset preparation time: 164.658 seconds
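
The order of the last two steps matters: the Markdown fences have to be stripped before AST parsing, because backticks are not valid Python and `ast.parse` would raise `SyntaxError`. The sketch below reproduces both steps on a single snippet. The helper names `strip_md_fences` and `ast_or_token` are hypothetical stand-ins for `clean_text_code` and `get_ast_representation`; the fence is built from single backticks only to keep the example easy to embed, and the resulting regular expression is meant to be equivalent to the one used in the class.

```python
import ast
import re

FENCE = "`" * 3  # a Markdown code fence, built from single backticks

def strip_md_fences(text_code):
    # Intended to be the same expression as in clean_text_code.
    pattern = rf"^{FENCE}python\n|^{FENCE}\n|{FENCE}$"
    return re.sub(pattern, "", text_code, flags=re.MULTILINE)

def ast_or_token(text_code):
    # Same fallback idea as in get_ast_representation.
    try:
        return ast.dump(ast.parse(text_code), indent=4)
    except SyntaxError:
        return "AST_TOKEN"

fenced = f"{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}"

raw_result = ast_or_token(fenced)                      # fences still present
clean_result = ast_or_token(strip_md_fences(fenced))   # fences removed first
placeholder_result = ast_or_token("No callee examples provided.")

print(raw_result)
print(clean_result.splitlines()[0])
print(placeholder_result)
```

Plain-English text, such as the placeholder string used when no callee examples are available, is not valid Python either, so it also ends up as the fallback token.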





In [206]:
dataset = code2test_dataset.get_dataset()

In [207]:
dataset.head()

Unnamed: 0,id,query,response,code_features,LANG_TOKEN,focal_method,focal_cls,callee,focal_method_docs,focal_method_comments,focal_cls_docs,focal_cls_comments,callee_docs,callee_comments,focal_method_ast,focal_cls_ast,callee_ast
0,1,You are a professional python software enginee...,"from microdot import Microdot, Response, abort...","(python, ```python\ndef get(self, key, default...",python,"def get(self, key, default=None):\n kl ...","\nclass NoCaseDict(dict):\n def get(self, k...",No callee examples provided.,,,,,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
1,2,You are a professional python software enginee...,"from microdot import Microdot, Response, abort...","(python, ```python\ndef get(self, url_pattern)...",python,"def get(self, url_pattern):\n """"""Decora...","\nclass Microdot:\n def route(self, url_pat...",No callee examples provided.,Decorator that is used to register a function ...,# ...,Decorator that is used to register a function ...,# ...,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
2,3,You are a professional python software enginee...,"from microdot import Microdot, Response, abort...","(python, ```python\ndef post(self, url_pattern...",python,"def post(self, url_pattern):\n """"""Decor...","\nclass Microdot:\n def route(self, url_pat...",No callee examples provided.,Decorator that is used to register a function ...,# ...,Decorator that is used to register a function ...,# ...,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
3,4,You are a professional python software enginee...,"from microdot import Microdot, Response, abort...","(python, ```python\ndef mount(self, subapp, ur...",python,"def mount(self, subapp, url_prefix=''):\n ...","\nclass Microdot:\n def mount(self, subapp,...",No callee examples provided.,"Mount a sub-application, optionally under the ...",,"Mount a sub-application, optionally under the ...",,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n ClassDef(\n ...,AST_TOKEN
4,5,You are a professional python software enginee...,from pyner.named_entity.corpus import bio2bioe...,"(python, ```python\ndef iob2bio(tags):\n pr...",python,def iob2bio(tags):\n processed_tags = [] #...,"\ndef split_tag(tag: str):\n """"""\n Split...",No callee examples provided.,,# should be bio format\n# case1. I-ORG I-ORG\n...,\n Split tag into state and named entity ca...,# should be bio format\n# case1. I-ORG I-ORG\n...,,,Module(\n body=[\n FunctionDef(\n ...,Module(\n body=[\n FunctionDef(\n ...,AST_TOKEN
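
One caveat about the `*_comments` columns: the pattern `#.*` used in `get_comments_and_description` treats any `#` as the start of a comment, including one inside a string literal. The sketch below compares it with a tokenizer-based extraction. The function `comments_via_tokenize` is a hypothetical alternative, not part of the class above, and it assumes the input tokenizes cleanly; on malformed code `tokenize` raises an exception, so a fallback would be needed in practice.

```python
import io
import re
import tokenize

def comments_via_regex(text_code):
    # Same pattern as single_comment_pattern in get_comments_and_description.
    return re.findall(r"#.*", text_code)

def comments_via_tokenize(text_code):
    # Hypothetical alternative: keep only tokens that the Python tokenizer marks as comments.
    tokens = tokenize.generate_tokens(io.StringIO(text_code).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.COMMENT]

snippet = 'url = "http://x/#frag"  # real comment\n'

regex_comments = comments_via_regex(snippet)
token_comments = comments_via_tokenize(snippet)
print(regex_comments)
print(token_comments)
```

The regular expression starts matching at the `#` inside the URL, while the tokenizer returns only the actual comment.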
