
SyntaxEval

Welcome! This repository hosts the source code of SyntaxEval and the results of our paper ‘Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code?’.

Our work discusses the limitations of evaluating Masked Language Models (MLMs) on code completion tasks. We highlight that relying on accuracy-based measurements may lead to an overestimation of models’ capabilities by neglecting the syntax rules of programming languages. To address these issues, we introduce a technique called SyntaxEval, in which syntactic capabilities are used to enhance the evaluation of MLMs. SyntaxEval automates the process of masking elements in the model input based on their Abstract Syntax Trees (ASTs). We conducted a case study on two popular MLMs using data from GitHub repositories. Our results show negative causal effects between the node types and the MLMs’ accuracy. We conclude that the MLMs under study fail to predict some syntactic capabilities.

Prerequisites

If you want to use a GPU to perform the predictions, please make sure PyTorch is correctly installed and a GPU is available: https://pytorch.org
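A quick way to check that PyTorch can actually see your GPUs before running the pipeline (plain PyTorch, independent of SyntaxEval):

import torch

# True if CUDA is available to PyTorch, plus how many devices it sees
print(torch.cuda.is_available())
print(torch.cuda.device_count())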

!nvidia-smi
Sat Dec 17 12:50:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:01:00.0 Off |                    0 |
|  0%   69C    P0   295W / 300W |  29143MiB / 46068MiB |     82%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          Off  | 00000000:25:00.0 Off |                    0 |
|  0%   30C    P8    33W / 300W |     26MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          Off  | 00000000:41:00.0 Off |                    0 |
|  0%   30C    P8    35W / 300W |     26MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          Off  | 00000000:61:00.0 Off |                    0 |
|  0%   35C    P0    82W / 300W |     26MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A40          Off  | 00000000:81:00.0 Off |                    0 |
|  0%   28C    P8    32W / 300W |     26MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A40          Off  | 00000000:A1:00.0 Off |                    0 |
|  0%   58C    P0    90W / 300W |  42659MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A40          Off  | 00000000:C1:00.0 Off |                    0 |
|  0%   62C    P0   296W / 300W |  42659MiB / 46068MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A40          Off  | 00000000:E1:00.0 Off |                    0 |
|  0%   67C    P0   321W / 300W |  42659MiB / 46068MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    0   N/A  N/A   1807667      C   python3.8                       29117MiB |
|    1   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    2   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    3   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    4   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    5   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    5   N/A  N/A   2970647      C   julia                           42633MiB |
|    6   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    6   N/A  N/A   1778799      C   julia                           42633MiB |
|    7   N/A  N/A      3048      G   /usr/lib/xorg/Xorg                 23MiB |
|    7   N/A  N/A   1783425      C   julia                           42633MiB |
+-----------------------------------------------------------------------------+

Installation

Create a virtual environment (you can use conda, mamba, or virtualenv), then activate the environment and go to the project base path.

mamba create -n code-check-list
mamba activate code-check-list
cd CodeCheckList

Now, install CodeCheckList using the package manager:

pip install CodeCheckList

General Instructions

Each module in CodeCheckList can be used independently; you can go to ./nbs if you want to look at more detailed examples for each module (a hedged usage sketch also follows the table below):

Checklist.Loader: Downloads and installs Tree-sitter grammars.
Checklist.Tokenizer: Tokenizes, encodes, and associates AST types for the requested source code snippet, using the model’s BPE tokenizer and the Tree-sitter parser.
Checklist.Masker: Masks occurrences of the requested AST element in the given source code snippet, at a given masking rate.
Checklist.Predictor: Attempts to predict the masked elements of a source code snippet using the selected model; reports the top-k predictions.
Checklist.Judge: Compares the AST representations of the original snippet and the prediction to calculate similarity scores.
Checklist.Evaluator: Iterates over the specified number of samples from code-search-net, masks the AST elements defined by the programming language grammar in all the snippets, and reports the results in a dataframe.
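As a rough illustration of using a module on its own, here is a minimal sketch of the Masker. The import path, constructor, and call signature below are assumptions inferred from the table, not a verified API; see ./nbs for the actual usage.

from CodeCheckList import loader
from CodeCheckList.masker import Masker  # assumed import path

loader.download_grammars(["python"])

# Assumed constructor and call signature; check ./nbs for the real ones
masker = Masker("huggingface/CodeBERTa-small-v1", "python")
masked_snippet = masker("def add(a, b): return a + b",
                        ast_element="identifier", masking_rate=0.25)
print(masked_snippet)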

Full Evaluation Pipeline

Downloading the grammar

First, download the grammar of the programming language of interest using the loader module:

from CodeCheckList import loader

python_language = "python"

################ LOAD GRAMMAR
languages = [python_language]
loader.download_grammars(languages)
/home/svelascodimate/Documents/SEMERU/CodeCheckList/CodeCheckList/grammars

Defining the Evaluator

Define the Evaluator component to perform the evaluation of linguistic capabilities.

First, you need to set up some parameters:

#checkpoint to use
checkpoint = "huggingface/CodeBERTa-small-v1"
#number of samples to evaluate
number_of_samples = 5
#masking rate to apply
masking_rate = 25/100
#top-k predictions per code sample
number_of_predictions_per_sample = 3
#set to True if a GPU is available, else False
gpu_available = True
#save path for the dataframe results
save_path = "output/CodeBERTa-small-v1/"

Now, instantiate the evaluator:

from CodeCheckList.evaluator import Evaluator
evaluator = Evaluator(checkpoint, python_language, gpu_available)
/home/svelascodimate/miniconda3/envs/code-check-list/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

------------------Loading Model into GPU------------------

Loading the samples

Next, you need to define the source code samples to be used in the evaluation.

from datasets import load_dataset 
import CodeCheckList.utils as utils

test_set = utils.get_test_sets(load_dataset("code_search_net", split='test'), python_language, evaluator.tokenizer.tokenizer.max_len_single_sentence, evaluator.tokenizer)
test_set = utils.get_random_sub_set_test_set(test_set, number_of_samples)
No config specified, defaulting to: code_search_net/all
Found cached dataset code_search_net (/home/svelascodimate/.cache/huggingface/datasets/code_search_net/all/1.0.0/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27)
Parameter 'function'=<function get_test_sets.<locals>.<lambda> at 0x7f6604bcb520> of the transform datasets.arrow_dataset.Dataset.filter@2.0.1 couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
  0%|          | 0/101 [00:00<?, ?ba/s]Token indices sequence length is longer than the specified maximum sequence length for this model (517 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 101/101 [00:17<00:00,  5.71ba/s]

Performing the Evaluation

Now, call the evaluator to perform the evaluation of linguistic capabilities.

The evaluation is conducted on each sample, for each AST node type defined in the grammar of the programming language:

print(evaluator.tokenizer.node_types)
['false', 'format_specifier', 'argument_list', 'augmented_assignment', 'exec_statement', 'true', 'exec', 'global', 'for_statement', 'for', '<<', '-=', 'module', '==', 'print', '//=', '[', 'else_clause', 'type', 'subscript', 'tuple_pattern', '<', 'match_statement', 'not_operator', '"', 'float', 'dotted_name', 'or', 'finally', 'pair', 'try_statement', '/', 'set', 'concatenated_string', 'nonlocal', 'async', 'typed_parameter', 'wildcard_import', '>=', 'expression', 'yield', 'assignment', ')', '//', 'global_statement', 'class', '+', 'import_from_statement', 'not', 'parameters', '>>=', 'case_pattern', '^=', 'set_comprehension', '_simple_statement', '*=', 'relative_import', 'as_pattern', 'del', '}', 'conditional_expression', 'pass_statement', 'and', 'as', 'escape_sequence', 'chevron', 'pattern', 'future_import_statement', 'import_prefix', 'continue_statement', 'expression_list', 'list_splat_pattern', 'except_clause', 'if_clause', 'positional_separator', 'comparison_operator', 'return_statement', ':', '(', ',', 'typed_default_parameter', ']', '_compound_statement', 'list_splat', 'named_expression', 'parenthesized_expression', '+=', 'with', 'nonlocal_statement', 'case', 'ERROR', '<>', '|=', 'unary_operator', 'list_pattern', 'ellipsis', ':=', 'list', 'assert_statement', 'function_definition', 'continue', 'else', 'default_parameter', 'delete_statement', 'list_comprehension', 'dictionary', 'identifier', 'as_pattern_target', 'decorated_definition', 'comment', '__future__', 'def', '}}', 'aliased_import', 'match', '**=', '!=', 'class_definition', 'return', 'type_conversion', '{{', '.', '<=', 'generator_expression', '>', 'keyword_argument', 'import', 'from', '|', 'block', '<<=', 'case_clause', 'elif_clause', 'string', 'expression_statement', '@', 'for_in_clause', 'interpolation', '&=', '^', 'format_expression', '-', 'decorator', 'with_item', 'primary_expression', 'finally_clause', 'print_statement', 'if_statement', '>>', 'await', 'boolean_operator', 'binary_operator', 'raise_statement', 'try', '%=', 'keyword_separator', 'import_statement', 'parenthesized_list_splat', 'with_statement', 'with_clause', '**', '@=', '%', 'break_statement', 'dictionary_comprehension', 'slice', 'assert', 'break', '~', 'pass', 'dictionary_splat', 'none', 'in', 'attribute', 'call', 'lambda_parameters', 'elif', 'integer', 'dictionary_splat_pattern', ';', '*', 'tuple', '{', 'pattern_list', '/=', '->', 'raise', 'while_statement', 'parameter', '=', 'except', 'is', 'lambda', '&', 'if', 'while']
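If you only want to look at a subset of these node types, standard list operations work on this attribute; for example (plain Python, nothing CodeCheckList-specific):

# e.g. keep only the statement-level node types
statement_types = [t for t in evaluator.tokenizer.node_types if t.endswith('_statement')]
print(statement_types)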

To perform the evaluation, simply call the evaluator with the defined parameters

results_dataframe = evaluator(test_set, number_of_predictions_per_sample, masking_rate)
-------- evaluating sample:0 --------
-------- evaluating sample:1 --------
-------- evaluating sample:2 --------
-------- evaluating sample:3 --------
-------- evaluating sample:4 --------

Results are reported in a dataframe that can be processed later:

results_dataframe = results_dataframe.sort_values(by=['occurences'], ascending=False)
results_dataframe.head()
ast_element occurences jaccard sorensen_dice levenshtein jaccard_avg sorensen_dice_avg levenshtein_avg
106 identifier 115 ((0.9840425531914894, 1.0, 0.9692307692307692,... ((0.9919571045576407, 1.0, 0.984375, 1.0, 1.0)... ((0.9840425531914894, 1.0, 0.9692307692307692,... (0.991, 0.961, 0.95) (0.995, 0.98, 0.974) (0.991, 0.955, 0.95)
42 ) 29 ((1.0, 1.0, 1.0, 1.0, 1.0), (0.994623655913978... ((1.0, 1.0, 1.0, 1.0, 1.0), (0.997304582210242... ((1.0, 1.0, 1.0, 1.0, 1.0), (0.994623655913978... (1.0, 0.992, 0.993) (1.0, 0.996, 0.996) (1.0, 0.995, 0.992)
78 ( 29 ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... (1.0, 0.997, 0.987) (1.0, 0.999, 0.993) (1.0, 0.997, 0.99)
121 . 28 ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... (1.0, 0.996, 0.989) (1.0, 0.998, 0.995) (1.0, 0.996, 0.989)
24 " 28 ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... ((1.0, 1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.... (1.0, 1.0, 1.0) (1.0, 1.0, 1.0) (1.0, 1.0, 1.0)

Visualizing the Results

Once you have the dataframe with the evaluation results, you can run visualization tasks to analyse the data.

results_dataframe = results_dataframe.drop('jaccard',axis=1)
results_dataframe = results_dataframe.drop('sorensen_dice',axis=1)
results_dataframe = results_dataframe.drop('levenshtein',axis=1)
results_dataframe.head()
ast_element occurences jaccard_avg sorensen_dice_avg levenshtein_avg
106 identifier 115 (0.991, 0.961, 0.95) (0.995, 0.98, 0.974) (0.991, 0.955, 0.95)
42 ) 29 (1.0, 0.992, 0.993) (1.0, 0.996, 0.996) (1.0, 0.995, 0.992)
78 ( 29 (1.0, 0.997, 0.987) (1.0, 0.999, 0.993) (1.0, 0.997, 0.99)
121 . 28 (1.0, 0.996, 0.989) (1.0, 0.998, 0.995) (1.0, 0.996, 0.989)
24 " 28 (1.0, 1.0, 1.0) (1.0, 1.0, 1.0) (1.0, 1.0, 1.0)

You can go to ./experimental_notebooks/result_visualizer.ipynb for a complete example.

Results

Please refer to 00_cda_evaluation.ipynb to observe the results of the Confirmatory Data Analysis for both treatments in our evaluation: masking tokens associated with AST node types, and random masking.

Please refer to 01_causal_evaluation.ipynb to observe the results of our causal evaluation, where we compute correlations and also control for confounders.

Data

The experimental_data folder contains the data that we used for our experiments.
