# CodeNet Dataset

In this notebook I intend to make it easy to inspect all the relevant information about a specific problem. The base dataset is [CodeNet](https://github.com/IBM/Project_CodeNet), a large collection of source files and problem descriptions with metadata. The solutions are written in many programming languages (55+ according to the paper), and each problem has multiple submissions. Most submissions are written in the six most common languages (C++, Python, Java, C, Ruby, C#), with C++ the most frequent, as expected. One interesting aspect of the dataset is that it includes failed submissions with various status codes, such as Compilation Errors, Runtime Errors, Time Limit Exceeded, and Memory Limit Exceeded. This is useful here because the goal is bug detection in source code files.

Before running the notebook, run `source spt.profile` to set up the environment the tokenizer needs. You also have to compile the tokenizer shipped with CodeNet.

## Table of Contents
1. [Imports](#Imports)
1. [Download CodeNet](#Download-CodeNet)
1. [Missing Values](#Missing-Values)
1. [Generate Source Code Pairs](#Generate-Source-Code-Pairs)
1. [Tokenize Source Code Pairs](#Tokenize-Source-Code-Pairs)
1. [Generate Opcodes](#Generate-Opcodes)
1. [Generate Error Pairs](#Generate-Error-Pairs)

## Imports

In [1]:
import myparser
import codenet

import pandas as pd

pd.set_option('display.max_columns', None)

codenet.P = 4

## Download CodeNet

The next code cell downloads the CodeNet dataset from its original repository (the archive is around 80 GB). If you already have the dataset, change the `input_path` variable to point to the root of the dataset; otherwise the notebook will download it into the `../input/` directory.

In [3]:
codenet.download_data()

dataset root dir found


## Missing Values

The dataset also includes a description file for most of the problems, so we can check which problems have one and which do not. The description file can be useful for predicting the topic of a problem (graphs, DP, greedy, etc.).

For problems with missing input files, I think it is better to simply drop the submissions: most of the corresponding description files are written in Chinese, so we cannot easily extract useful information from them, and there are few such problems. Judging by the description files, only about 2 problems take no input from stdin.

To conclude the missing values section: of the 56 problems with a missing name, 54 have no description file, 1 has only an href that links to a 404 page, and the last one is a test problem; the latter two have no submissions anyway. I think it is fair to drop these samples, as they are not useful. That leaves 130 problems with no input/output samples: 128 of them have description files in Chinese, which makes it harder to extract samples, and 2 of them only require printing values (similar to problem p00000). I think it is fine to drop those 2 problems as well, together with the rest of the problems that have no extracted input examples, so that we do not have to remember one or two special cases that could cause bugs later on.
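The filtering described above can be sketched with plain pandas. This is only an illustration on toy data, not the actual body of `codenet.clean_problem_list_v2`: the problem id `p09998` and the `problems_with_input` set are hypothetical stand-ins for "problems whose input samples could be extracted".

```python
import pandas as pd

# Toy problem list using the same "id"/"name"/"dataset" columns as problem_list.csv.
problems_df = pd.DataFrame(
    {
        "name": ["List of Top 3 Hills", None, "Digit Number"],
        "dataset": ["AIZU", "AIZU", "AIZU"],
    },
    index=pd.Index(["p00001", "p09998", "p00002"], name="id"),
)

# Hypothetical: ids of problems for which an input sample was extracted.
problems_with_input = {"p00001", "p09998"}

# Drop problems without a name, then problems without an input sample.
clean_df = problems_df.dropna(subset=["name"])
clean_df = clean_df[clean_df.index.isin(problems_with_input)]
print(clean_df)
```

Only `p00001` survives both filters in this toy example: `p09998` has no name and `p00002` has no input sample.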

In [4]:
codenet.clean_problem_list_v2().to_csv(codenet.problem_list_clean_v2_path)

Cleaning ../input/Project_CodeNet/metadata/problem_list.csv


In [2]:
problem_list_df = pd.read_csv(codenet.problem_list_clean_v2_path, index_col="id")
problem_ids = problem_list_df.index.unique()

print(f"We have {len(problem_list_df)} problems")
print('The distribution of the datasets is')
print(problem_list_df['dataset'].value_counts(normalize=True))
display(problem_list_df.head())
display(problem_list_df.isna().sum())

We have 3867 problems
The distribution of the datasets is
AIZU       0.613654
AtCoder    0.386346
Name: dataset, dtype: float64


id,name,dataset,time_limit,memory_limit,rating,tags,complexity
p00001,List of Top 3 Hills,AIZU,1000.0,131072.0,,,
p00002,Digit Number,AIZU,1000.0,131072.0,,,
p00003,Is it a Right Triangle?,AIZU,1000.0,131072.0,,,
p00004,Simultaneous Equation,AIZU,1000.0,131072.0,,,
p00005,GCD and LCM,AIZU,1000.0,131072.0,,,


name               0
dataset            0
time_limit         0
memory_limit       0
rating          3867
tags            3867
complexity      3867
dtype: int64

## Generate Source Code Pairs
- For each problem:
  - group the solutions by user and sort them by submission date;
  - select consecutive solutions of the form (Error, Accepted); this is a heuristic that narrows the search space, since a user often submits a correct solution right after a wrong one.
- Build a DataFrame from the collected pairs and save it.

Here we generate submission pairs that let us find code fixes within the dataset. With this information we can identify the instructions that have to be deleted, inserted, or changed so that the old code starts to work. Solutions submitted one after another by the same user should be similar in content and differ by only a few changes. Knowing the accepted submission in such a chain tells us how the small mistakes were patched, and therefore which instructions produced the error and which ones removed it. These are the submissions we are interested in; we will look for patterns in them and further filter out those that do not prove helpful.
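As a rough illustration of the pairing heuristic, the sketch below works on a toy metadata table. It is not the code behind `codenet.generate_pairs_v2`; the column names are assumed to match the per-problem CodeNet metadata files, the rows are made up, and `consecutive_fix_pairs` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for one problem's metadata (rows are invented for illustration).
metadata_df = pd.DataFrame(
    {
        "submission_id": ["s1", "s2", "s3", "s4", "s5"],
        "problem_id": ["p1"] * 5,
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "date": [10, 20, 30, 15, 25],
        "language": ["Python"] * 5,
        "filename_ext": ["py"] * 5,
        "status": ["Runtime Error", "Accepted", "Accepted",
                   "Wrong Answer", "Time Limit Exceeded"],
    }
)


def consecutive_fix_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Return (failed, accepted) pairs of consecutive submissions per user."""
    df = df.sort_values(["user_id", "date"]).reset_index(drop=True)
    # Next submission of the same user, aligned with the current row.
    nxt = df.groupby("user_id")[["submission_id", "status"]].shift(-1)
    mask = (df["status"] != "Accepted") & (nxt["status"] == "Accepted")
    return pd.DataFrame(
        {
            "original_id": df.loc[mask, "submission_id"],
            "changed_id": nxt.loc[mask, "submission_id"],
            "original_status": df.loc[mask, "status"],
            "problem_id": df.loc[mask, "problem_id"],
            "language": df.loc[mask, "language"],
            "filename_ext": df.loc[mask, "filename_ext"],
        }
    ).reset_index(drop=True)


pairs_df = consecutive_fix_pairs(metadata_df)
print(pairs_df)
```

In the toy data only user `u1` has a failed submission immediately followed by an accepted one, so the result contains the single pair (`s1`, `s2`).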

In [None]:
codenet.generate_pairs_v2(problem_list_df).to_csv(codenet.generated_pairs_v2_path, index=False)

In [3]:
generated_pairs_df = pd.read_csv(codenet.generated_pairs_v2_path)

display(generated_pairs_df)
display(generated_pairs_df.info())
display(generated_pairs_df.language.value_counts())
display(generated_pairs_df.original_status.value_counts())

print(f"We are left with {len(generated_pairs_df)} submissions in total")

,original_id,changed_id,original_status,problem_id,language,filename_ext
0,s000016565,s604436209,Runtime Error,p03106,Python,py
1,s000023530,s834210063,Runtime Error,p02684,Python,py
2,s000041036,s454210115,Runtime Error,p02584,Python,py
3,s000041460,s952454189,Time Limit Exceeded,p02658,Python,py
4,s000054326,s226572665,Time Limit Exceeded,p02701,Python,py
...,...,...,...,...,...,...
54587,s999855642,s185899176,Runtime Error,p02645,Python,py
54588,s999876744,s590378402,Time Limit Exceeded,p02642,Python,py
54589,s999891212,s835176880,Runtime Error,p03103,Python,py
54590,s999921259,s180756175,Time Limit Exceeded,p02713,Python,py


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54592 entries, 0 to 54591
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   original_id      54592 non-null  object
 1   changed_id       54592 non-null  object
 2   original_status  54592 non-null  object
 3   problem_id       54592 non-null  object
 4   language         54592 non-null  object
 5   filename_ext     54592 non-null  object
dtypes: object(6)
memory usage: 2.5+ MB


None

Python    54592
Name: language, dtype: int64

Runtime Error             34126
Time Limit Exceeded       18385
WA: Presentation Error     1938
Memory Limit Exceeded       134
Output Limit Exceeded         8
Judge Not Available           1
Name: original_status, dtype: int64

We are left with 54592 submissions in total


## Tokenize Source Code Pairs

To generate the token lists I use the [Simplified Parse Tree](https://github.com/IBM/Project_CodeNet/tree/main/tools/spt-generator) generator script from the CodeNet release. The main goal of this stage is to reduce the syntax of each programming language to a sequence of tokens that is easier to process in tasks such as computing differences between source files or pinpointing the part of the source code that produced an error. The output mirrors the file structure of the CodeNet dataset: for each submission, a CSV file with the tokens and their properties (line number, column number, token index) is written under a new root directory, using the same per-problem and per-language layout.
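To give a feeling for what such a token table looks like, here is a small sketch that uses Python's standard `tokenize` module as a stand-in. It is not the CodeNet SPT generator, and the column names (`token_index`, `line`, `column`, `type`, `text`) are assumptions chosen for this example rather than the tool's actual output schema.

```python
import io
import tokenize

import pandas as pd


def python_tokens(source: str) -> pd.DataFrame:
    """Tokenize Python source into rows of (token_index, line, column, type, text)."""
    rows = []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        # Skip structural tokens that carry no visible text of their own.
        if tok.type in (tokenize.ENDMARKER, tokenize.NL):
            continue
        rows.append(
            {
                "line": tok.start[0],      # 1-based line number
                "column": tok.start[1],    # 0-based column offset
                "type": tokenize.tok_name[tok.type],
                "text": tok.string,
            }
        )
    return pd.DataFrame(rows).rename_axis("token_index").reset_index()


tokens_df = python_tokens("N=[]\nprint(N)\n")
print(tokens_df)
```

Each row carries the position of one token, which is the information needed later to diff two submissions token by token.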

In [None]:
codenet.tokenize_pairs_v2(generated_pairs_df)

## Generate Opcodes

In this stage we generate all the modified chunks between the two submissions of a pair. This tells us what changes the user made between the last failed submission and the new one that passed the tests.

From these chunks we can build every arrangement of fixes, with the aim of isolating individual bugs. For example, with 3 modified chunks we can leave 1 chunk unchanged and fix the other 2, which produces a diff that contains only the chunk under analysis. In this way we can obtain, for each chunk in the resulting dataset, the error message produced when the source code is run on the sample input. This approach can yield more examples than keeping only the pairs that differ by a single change, since it includes those pairs plus new, synthetically generated ones.

All we have to do is generate these token sequences and convert each one back to a source file that we can run. The next task is therefore a function that takes the path to a CSV file with the token information and returns the source code as a string. After that we generate all the intermediate versions of each original submission: for each modification, we apply the rest of the modifications to obtain an intermediate result between the original submission and the changed one that passes. In hindsight, taking the changed submission and applying each chunk in reverse gives exactly the same thing: all the intermediate steps between the original file and the changed file.

The one problem with applying such modifications is that the line and column numbers no longer point to valid positions in the modified source code, so we should either not rely on that information in further analysis or recompute it after each change. Another aspect to keep in mind is that an inserted token might be an opening parenthesis, which requires a matching closing parenthesis as well.
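The `tag`, `i1`, `i2`, `j1`, `j2` columns in the output below have the same shape as the opcodes returned by `difflib.SequenceMatcher.get_opcodes()`. The sketch that follows assumes that this is how the chunks are computed (the source does not confirm it) and runs it on two short, hand-written token lists.

```python
from difflib import SequenceMatcher

# Hand-written token lists for a failing and a fixed submission.
original_tokens = ["N", "=", '""', "\n", "print", "(", "N", ")"]
changed_tokens = ["N", "=", "[", "]", "\n", "print", "(", "N", ")"]

matcher = SequenceMatcher(a=original_tokens, b=changed_tokens, autojunk=False)

# Keep only the chunks that differ; "equal" runs are not interesting here.
chunks = [op for op in matcher.get_opcodes() if op[0] != "equal"]

for tag, i1, i2, j1, j2 in chunks:
    # original_tokens[i1:i2] is rewritten into changed_tokens[j1:j2].
    print(tag, original_tokens[i1:i2], "->", changed_tokens[j1:j2])
```

Here the only modified chunk is a `replace` that turns the single token `""` into the two tokens `[` and `]`; `i1:i2` indexes the original token list and `j1:j2` the changed one.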

In [None]:
codenet.generate_opcodes_v2(generated_pairs_df).to_csv(codenet.generated_opcodes_v2_path, index=False)

In [4]:
generated_opcodes_df = pd.read_csv(codenet.generated_opcodes_v2_path)

generated_opcodes_df = generated_opcodes_df.astype({"i1": int, "i2": int, "j1": int, "j2": int})

display(generated_opcodes_df)
display(generated_opcodes_df.info())

print(f"We are left with {len(generated_opcodes_df)} modified chunks in total")

print("\nSample code original")
print(codenet.tokens2source(*generated_pairs_df.iloc[0][["problem_id", "language", "original_id"]]))
print("\nSample code changed")
print(codenet.tokens2source(*generated_pairs_df.iloc[0][["problem_id", "language", "changed_id"]]))

,tag,i1,i2,j1,j2,original_id,changed_id,problem_id
0,insert,80,80,80,94,s010979186,s734710205,p02714
1,insert,81,81,95,105,s010979186,s734710205,p02714
2,replace,88,100,112,114,s010979186,s734710205,p02714
3,delete,104,117,118,118,s010979186,s734710205,p02714
4,delete,124,125,125,125,s010979186,s734710205,p02714
...,...,...,...,...,...,...,...,...
350803,delete,38,49,35,35,s999891212,s835176880,p03103
350804,replace,61,67,47,58,s999891212,s835176880,p03103
350805,delete,68,83,59,59,s999891212,s835176880,p03103
350806,insert,108,108,108,110,s999831708,s237718445,p02588


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350808 entries, 0 to 350807
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   tag          350808 non-null  object
 1   i1           350808 non-null  int64 
 2   i2           350808 non-null  int64 
 3   j1           350808 non-null  int64 
 4   j2           350808 non-null  int64 
 5   original_id  350808 non-null  object
 6   changed_id   350808 non-null  object
 7   problem_id   350808 non-null  object
dtypes: int64(4), object(4)
memory usage: 21.4+ MB


None

We are left with 350808 modified chunks in total

Sample code original
a,b,k=map(int,input().split())
N=""
for i in range(1,min(a,b)):
    if a%i==0 and b%i==0:
            N.append(i)
N.reverse()
print(N[k-1])

Sample code changed
a,b,k=map(int,input().split())
N=[]
for i in range(1,min(a,b)+1):
    if (a%i==0 and b%i==0):
            N.append(i)
N.reverse()
print(N[k-1])


## Generate Error Pairs

In this section we generate the error class for each of the changes found. To obtain an error message for a change, we have to generate a source file that shows which error that modification fixes. First, we take all the changes of a pair and apply all but the one under analysis. This gives a file in which only the analyzed instruction is still buggy. We then run the source code and record the error message for that instruction, and repeat the procedure for each of the other changed instructions in the file. This way we obtained about 250K labeled errors, compared to only 5K examples in the previous attempt, which considered only the files with a single changed instruction. Out of the entire set of generated pairs, a third contained instructions that did not matter for the acceptance of the solution: when analyzed, they produced a return code of zero, indicating that the execution was successful. Of the remaining buggy instructions, half are syntax errors, and the other frequent errors are Python-specific problems, including some indentation bugs and type bugs.
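The "apply all but one change, then run" procedure can be sketched as follows. This is a simplified illustration, not the implementation of `codenet.add_error_description_v2`: the helper names `apply_all_but` and `error_class` are hypothetical, the tokens are assumed to be text fragments that can simply be concatenated, and the chunk list is written by hand in the `(tag, i1, i2, j1, j2)` format used above.

```python
import subprocess
import sys


def apply_all_but(original, changed, chunks, keep_broken):
    """Apply every chunk except chunks[keep_broken] to the original tokens."""
    result, cursor = [], 0
    for idx, (tag, i1, i2, j1, j2) in enumerate(chunks):
        result.extend(original[cursor:i1])      # untouched tokens before the chunk
        if idx == keep_broken:
            result.extend(original[i1:i2])      # keep this chunk as submitted (buggy)
        else:
            result.extend(changed[j1:j2])       # take the fixed version
        cursor = i2
    result.extend(original[cursor:])
    return result


def error_class(source, stdin_text="", timeout=10):
    """Run the source in a fresh interpreter and name the exception, if any."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TLEError"
    if proc.returncode == 0:
        return None
    stderr_lines = proc.stderr.strip().splitlines()
    last_line = stderr_lines[-1] if stderr_lines else ""
    # A traceback ends with "ExceptionName: message"; otherwise report the return code.
    return last_line.split(":")[0] if ":" in last_line else str(proc.returncode)


original = ["N", "=", '""', "\n", "N", ".", "append", "(", "1", ")", "\n",
            "print", "(", "N", "[", "1", "]", ")", "\n"]
changed = ["N", "=", "[", "]", "\n", "N", ".", "append", "(", "1", ")", "\n",
           "print", "(", "N", "[", "0", "]", ")", "\n"]
chunks = [("replace", 2, 3, 2, 4), ("replace", 15, 16, 16, 17)]

for k in range(len(chunks)):
    variant = "".join(apply_all_but(original, changed, chunks, keep_broken=k))
    print(chunks[k], "->", error_class(variant))
```

Each variant differs from the accepted submission in exactly one chunk, so the exception it raises can be attributed to that chunk. A variant that exits with return code zero corresponds to a change that did not matter for acceptance.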

In [None]:
codenet.add_error_description_v2(generated_pairs_df, problem_list_df, generated_opcodes_df).to_csv(codenet.error_pairs_v2_path, index=False)

In [5]:
error_pairs_df = pd.read_csv(codenet.error_pairs_v2_path)

display(error_pairs_df)
display(error_pairs_df.info())

error_pairs_df['error_class'].value_counts()

,original_id,changed_id,original_status,problem_id,language,filename_ext,tag,i1,i2,j1,j2,output,error,returncode,error_class,error_class_extra
0,s000016565,s604436209,Runtime Error,p03106,Python,py,replace,22,23,22,24,,"Traceback (most recent call last):\n File ""<s...",1.0,AttributeError,AttributeError: 'str' object has no attribute ...
1,s000016565,s604436209,Runtime Error,p03106,Python,py,insert,49,49,52,53,,"File ""<string>"", line 4\n if a%i==0 and b...",1.0,SyntaxError,SyntaxError: unmatched ')'
2,s000016565,s604436209,Runtime Error,p03106,Python,py,insert,62,62,66,67,,"File ""<string>"", line 4\n if (a%i==0 and ...",1.0,SyntaxError,SyntaxError: invalid syntax
3,s000023530,s834210063,Runtime Error,p02684,Python,py,insert,81,81,81,88,,"Traceback (most recent call last):\n File ""<s...",1.0,NameError,NameError: name 'flag' is not defined
4,s000023530,s834210063,Runtime Error,p02684,Python,py,insert,179,179,185,237,,"File ""<string>"", line 18\n if s:]\n ...",1.0,SyntaxError,SyntaxError: unmatched ']'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245480,s999921259,s180756175,Time Limit Exceeded,p02713,Python,py,insert,88,88,73,88,,"File ""<string>"", line 9\n sum += math.gcd...",1.0,IndentationError,IndentationError: unexpected indent
245481,s999971803,s030030532,Runtime Error,p03043,Python,py,replace,1,12,1,2,,"Traceback (most recent call last):\n File ""<s...",1.0,ValueError,ValueError: invalid literal for int() with bas...
245482,s999971803,s030030532,Runtime Error,p03043,Python,py,insert,16,16,6,8,,"File ""<string>"", line 1\n N,K = int,input...",1.0,SyntaxError,SyntaxError: unmatched ')'
245483,s999971803,s030030532,Runtime Error,p03043,Python,py,insert,17,17,9,11,,"Traceback (most recent call last):\n File ""<s...",1.0,AttributeError,AttributeError: 'int' object has no attribute ...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245485 entries, 0 to 245484
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   original_id        245485 non-null  object 
 1   changed_id         245485 non-null  object 
 2   original_status    245485 non-null  object 
 3   problem_id         245485 non-null  object 
 4   language           245485 non-null  object 
 5   filename_ext       245485 non-null  object 
 6   tag                245485 non-null  object 
 7   i1                 245485 non-null  int64  
 8   i2                 245485 non-null  int64  
 9   j1                 245485 non-null  int64  
 10  j2                 245485 non-null  int64  
 11  output             2639 non-null    object 
 12  error              245449 non-null  object 
 13  returncode         245485 non-null  float64
 14  error_class        245485 non-null  object 
 15  error_class_extra  245449 non-null  object 
dtypes:

None

SyntaxError                  125672
NameError                     56326
IndentationError              32932
TypeError                     11637
IndexError                     3718
AttributeError                 2716
ValueError                     2355
ModuleNotFoundError            2194
TLEError                       2062
EOFError                       1514
TabError                       1384
ImportError                    1084
UnboundLocalError               964
KeyError                        353
ZeroDivisionError               198
1                                78
RecursionError                   43
FileNotFoundError                43
RuntimeError                     23
OverflowError                    23
-11                              18
OSError                          12
2                                 2
-6                                2
255                               2
Name: error_class, dtype: int64