# CodeNet Dataset

In this notebook I intend to make it very easy to visualize all the necessary information about a specific problem. The base dataset is [CodeNet](https://github.com/IBM/Project_CodeNet), a large collection of source files and problem descriptions with metadata. The solutions are written in multiple programming languages (55+ according to the paper) and each problem has multiple submissions. Most of the submissions are written in the six most common languages (C++, Python, Java, C, Ruby, C#) and, as expected, most of the solutions are in C++. One interesting aspect of the dataset is that it includes failed submissions, with status codes such as Compilation Error, Runtime Error, Time Limit Exceeded, Memory Limit Exceeded, etc. This will prove useful, since we are looking into bug detection in source code files.

Make sure you run `source spt.profile` to set up the environment that the tokenizer needs. You also have to compile the tokenizer from CodeNet.

# Table of Contents
1. [Download CodeNet](#Download-CodeNet)
2. [Missing Values](#Missing-Values)
3. [Generate Source Code Pairs](#Generate-Source-Code-Pairs)
4. [Clean Source Code Pairs](#Clean-Source-Code-Pairs)
5. [Parser Tags Visualization](#Parser-Tags-Visualization)

In [1]:
import parser
import codenet

import pandas as pd

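# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions.
pd.set_option('display.max_columns', None)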

input_path = "../input/"
root_path = input_path + "Project_CodeNet/"

metadata_path = root_path + "metadata/"

problem_list_clean_path = input_path + "problem_list_clean.csv"
generated_pairs_path = input_path + "generated_pairs.csv"
cleaned_generated_pairs_path = input_path + "cleaned_generated_pairs.csv"
token_class_generated_pairs_path = input_path + "token_class_generated_pairs.csv"

## Download CodeNet

The next code cell downloads the CodeNet dataset from its original repository (the archive is around 80GB). If you already have the dataset, change the `input_path` variable to point to the root of the dataset; otherwise the notebook will download it into the `../input/` directory.

In [2]:
codenet.download_data()

dataset root dir found


## Missing Values

The dataset also includes a description file for most of the problems, so we can check which problems have a description and which do not. The description file can be useful for predicting the topic of a problem (graphs, dynamic programming, greedy, etc.).

For problems with missing input files, I think it is also better to just drop the submissions: most of the corresponding description files are written in Chinese, and we cannot really extract any useful information from them. Since so few problems lack an input file, dropping them costs little. Judging by the description files, only about 2 problems read no input from stdin.

To conclude the missing values section: of the 56 missing names in the problem list, 54 are due to missing description files, one is just an href that links to a 404 web page, and the last one is a test problem; the latter two have no submissions anyway. I think it is a fair decision to drop these samples, as they are not useful. That leaves 130 problems with no input/output samples: 128 of them have description files in Chinese, which makes it harder to extract samples, and 2 of them only require printing values (similar to problem p00000). I also think it is fine to drop those 2 problems that need no input together with the rest of the problems that have no extracted input examples, so that we do not have to remember one or two special cases that could cause bugs later on.
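
As a rough illustration of the filtering described above, the sketch below drops problems whose name is missing and problems for which no input sample could be extracted. It is not the actual `codenet.clean_problem_list`: the `has_input_sample` column, the helper `drop_unusable_problems` and the toy rows are assumptions made up for this example.

```python
import pandas as pd

# Toy stand-in for metadata/problem_list.csv (rows are made up for illustration).
toy_problems = pd.DataFrame(
    {
        "name": ["List of Top 3 Hills", None, "Digit Number", "Print Only"],
        "dataset": ["AIZU", "AIZU", "AIZU", "AtCoder"],
        # Hypothetical flag: was an input sample extracted from the description?
        "has_input_sample": [True, False, True, False],
    },
    index=pd.Index(["p00001", "p09998", "p00002", "p09999"], name="id"),
)


def drop_unusable_problems(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only problems that have a name and an extracted input sample."""
    named = df.dropna(subset=["name"])
    return named[named["has_input_sample"]]


clean_toy_problems = drop_unusable_problems(toy_problems)
print(clean_toy_problems.index.tolist())
```

Dropping both groups in one place keeps every later step free of special cases for problems without input.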

In [8]:
problem_list_df = pd.read_csv(metadata_path + 'problem_list.csv', index_col="id")
problem_list_df = codenet.clean_problem_list(problem_list_df)
problem_list_df.to_csv(problem_list_clean_path)

Cleaning ../input/Project_CodeNet/metadata/problem_list.csv


In [2]:
problem_list_df = pd.read_csv(problem_list_clean_path, index_col="id")
problem_ids = problem_list_df.index.unique()

print(f"We have {len(problem_list_df)} problems")
print('The distribution of the datasets is')
print(problem_list_df['dataset'].value_counts(normalize=True))
display(problem_list_df.head())
display(problem_list_df.isna().sum())

We have 3867 problems
The distribution of the datasets is
AIZU       0.613654
AtCoder    0.386346
Name: dataset, dtype: float64


id,name,dataset,time_limit,memory_limit,rating,tags,complexity
p00001,List of Top 3 Hills,AIZU,1000.0,131072.0,,,
p00002,Digit Number,AIZU,1000.0,131072.0,,,
p00003,Is it a Right Triangle?,AIZU,1000.0,131072.0,,,
p00004,Simultaneous Equation,AIZU,1000.0,131072.0,,,
p00005,GCD and LCM,AIZU,1000.0,131072.0,,,


name               0
dataset            0
time_limit         0
memory_limit       0
rating          3867
tags            3867
complexity      3867
dtype: int64

## Generate Source Code Pairs
- for each problem:
  - group solutions by user and select the solutions that are consecutive and of the form (Error, Accepted) with only one instruction (line) changed; this is a heuristic to search a smaller space, since a user might submit a correct solution after a wrong one
  - get the diff lines, the diff operation (add, delete, change) and the error type
- build a df from this list and save it
- sanity check: verify that the error types in this df are the same as the types in the problem metadata (result: success)
- done
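
The selection step in the list above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the actual `codenet.generate_pairs`; the column names (`submission_id`, `user_id`, `date`, `status`) and the helper `consecutive_error_accepted` are assumptions for this example.

```python
import pandas as pd

# Toy stand-in for the metadata of a single problem (values are made up).
toy_meta = pd.DataFrame(
    {
        "submission_id": ["s1", "s2", "s3", "s4", "s5"],
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "date": [10, 20, 30, 15, 25],
        "status": ["Wrong Answer", "Runtime Error", "Accepted", "Accepted", "Wrong Answer"],
    }
)


def consecutive_error_accepted(meta: pd.DataFrame) -> pd.DataFrame:
    """Pair each failed submission with the accepted submission that
    immediately follows it for the same user."""
    ordered = meta.sort_values(["user_id", "date"])
    # Next submission of the same user, aligned on the index of `ordered`.
    nxt = ordered.groupby("user_id")[["submission_id", "status"]].shift(-1)
    mask = (ordered["status"] != "Accepted") & (nxt["status"] == "Accepted")
    return pd.DataFrame(
        {
            "original_id": ordered.loc[mask, "submission_id"],
            "changed_id": nxt.loc[mask, "submission_id"],
            "original_status": ordered.loc[mask, "status"],
        }
    ).reset_index(drop=True)


pairs = consecutive_error_accepted(toy_meta)
print(pairs)
```

Only `s2` is kept here: `s1` is followed by another failure, and `s5` comes after the accepted submission of its user.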

Up until this point we played around with single files and looked at how we can load the source code from the dataset. In this section we will see how we can get the diff between source code files and, more specifically, the instruction (given by the line in the file) that caused a problem.
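
To make the "one changed line" idea concrete, here is a small sketch based on Python's `difflib`. The helper `single_line_change` is hypothetical, and the mapping to `a`/`d`/`c` codes with Unix-diff-style line numbers is my assumption about how a `diff_op` column could be produced, not a description of the actual implementation.

```python
import difflib


def single_line_change(original_src: str, changed_src: str):
    """Return (original_line, diff_op, changed_line) when exactly one line
    was added, deleted or changed, otherwise None. Line numbers are 1-based
    and follow the Unix diff convention (an assumption for this sketch)."""
    a = original_src.splitlines()
    b = changed_src.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if len(ops) != 1:
        return None
    tag, i1, i2, j1, j2 = ops[0]
    if max(i2 - i1, j2 - j1) != 1:
        return None  # the single edit spans more than one line
    code = {"replace": "c", "delete": "d", "insert": "a"}[tag]
    # For inserts/deletes, diff reports the line *after which* the edit happens.
    original_line = i1 if tag == "insert" else i1 + 1
    changed_line = j1 if tag == "delete" else j1 + 1
    return (original_line, code, changed_line)


buggy = "int i;\nfor (i = 0; i <= n; i++)\n    sum += a[i];\n"
fixed = "int i;\nfor (i = 0; i < n; i++)\n    sum += a[i];\n"
noisy = 'int i;\nputs("debug");\nfor (i = 0; i < n; i++)\n    sum += a[i];\n'

print(single_line_change(buggy, fixed))
print(single_line_change(noisy, fixed))
```

Pairs for which the function returns `None` would simply be skipped, which is what keeps the search space small.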

In [None]:
generated_pairs_df = codenet.generate_pairs(problem_list_df)
generated_pairs_df.to_csv(generated_pairs_path, index=False)

In [3]:
generated_pairs_df = pd.read_csv(generated_pairs_path)

display(generated_pairs_df)
display(generated_pairs_df.info())
display(generated_pairs_df.language.value_counts())
display(generated_pairs_df.original_status.value_counts())

print(f"We are left with {len(generated_pairs_df)} submissions in total")

,original_id,changed_id,original_line,diff_op,changed_line,original_status,original_language,problem_id,language,filename_ext
0,s000088266,s609532420,8,c,8,Wrong Answer,C,p02392,C,c
1,s000103279,s107821968,16,c,16,Time Limit Exceeded,C,p00017,C,c
2,s000134983,s025451182,8,d,7,Wrong Answer,C,p02694,C,c
3,s000150407,s325747983,18,c,18,Wrong Answer,C,p02258,C,c
4,s000173494,s918627271,6,c,6,Wrong Answer,C,p02415,C,c
...,...,...,...,...,...,...,...,...,...,...
25490,s999918666,s390104290,1,c,1,Runtime Error,C,p03134,C,c
25491,s999923152,s735413884,11,c,11,Wrong Answer,C,p00014,C,c
25492,s999951931,s536625431,8,c,8,Wrong Answer,C,p02990,C,c
25493,s999971044,s120806942,5,c,5,Wrong Answer,C,p00252,C,c


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25495 entries, 0 to 25494
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_id        25495 non-null  object
 1   changed_id         25495 non-null  object
 2   original_line      25495 non-null  int64 
 3   diff_op            25495 non-null  object
 4   changed_line       25495 non-null  int64 
 5   original_status    25495 non-null  object
 6   original_language  25495 non-null  object
 7   problem_id         25495 non-null  object
 8   language           25495 non-null  object
 9   filename_ext       25495 non-null  object
dtypes: int64(2), object(8)
memory usage: 1.9+ MB


None

C    25495
Name: language, dtype: int64

Wrong Answer              17990
Runtime Error              3614
WA: Presentation Error     3266
Time Limit Exceeded         587
Memory Limit Exceeded        28
Output Limit Exceeded        10
Name: original_status, dtype: int64

We are left with 25495 submissions in total


So we can see that after running this preprocessing function on all the problems we are left with 25,495 pairs of source code files of the form (error, accepted), all of them written in C in the output above. We will have to analyze the source code files that we obtained and make sure they are well suited for being tokenized and used in a machine learning algorithm. The error messages provided in this dataset don't look that useful, so in the next steps I will attempt to improve them by running the source code on sample inputs or by using compilers, code checking tools, etc.

## Clean Source Code Pairs
- for each problem:
  - drop the submissions with compile error status, since we are only interested in runtime errors
  - group solutions by user and select the solutions that are consecutive and of the form (Error, Accepted) with only one instruction changed; this is a heuristic to search a smaller space, since a user might submit a correct solution after a wrong one
  - use the AST Tokenizer for C, C++, Java and Python to generate the tokens for each pair of submission files
  - use an edit distance algorithm to detect the diff between the two submissions and save the information
- build a df from this list and save it

Up until this point we played around with single files and looked at how we can load the source code from the dataset. In this section we will see how we can get the diff between source code files and, more specifically, the instruction (given by the line and token in the file) that caused a problem.

One interesting aspect here is how many users sent more than one submission, have submissions of the form (failed, accepted), and changed only one instruction between them, so that we can understand what change made their code work. To do this we first need to split each source file into tokens and then implement a function that computes the edit distance and the opcodes (`get_opcodes`) of the two token lists. Luckily, the CodeNet repository contains a tool written in Java that can tokenize correct source code (only C, C++, Java and Python), meaning that we have to drop compilation errors, which are not that interesting for this subject anyway. To compute the opcodes for the edit distance we can then use the `SequenceMatcher` class from Python's `difflib`.
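
A minimal sketch of the token-level comparison with `difflib.SequenceMatcher` might look like this. The regex-based `toy_tokenize` is only a stand-in I made up for the example (the notebook relies on the CodeNet tokenizer instead), and `token_opcodes` is a hypothetical helper name.

```python
import difflib
import re


def toy_tokenize(line: str):
    """Very rough tokenizer used only for this example."""
    return re.findall(r"[A-Za-z_]\w*|\d+|<=|>=|==|!=|&&|\|\||\S", line)


def token_opcodes(original_tokens, changed_tokens):
    """Return the non-equal opcodes together with the tokens they cover."""
    matcher = difflib.SequenceMatcher(
        None, original_tokens, changed_tokens, autojunk=False
    )
    return [
        (tag, original_tokens[i1:i2], changed_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


failed = toy_tokenize("for (i = 0; i <= n; i++)")
accepted = toy_tokenize("for (i = 0; i < n; i++)")
changes = token_opcodes(failed, accepted)
print(changes)
```

The tags returned by `get_opcodes` (`replace`, `delete`, `insert`, `equal`) are the same kind of values that show up in the `tag` column of the cleaned pairs.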

Some notes:
- The C tokenizer needs the include statements to be deleted, so token positions might be shifted relative to the original file

In [None]:
generated_pairs_df = codenet.clean_genereated_pairs(generated_pairs_df)
generated_pairs_df.to_csv(cleaned_generated_pairs_path, index=False)

In [4]:
generated_pairs_df = pd.read_csv(cleaned_generated_pairs_path)

display(generated_pairs_df)
display(generated_pairs_df.info())
display(generated_pairs_df['language'].value_counts())
display(generated_pairs_df.groupby('original_id').first()['original_status'].value_counts())

print(f"We are left with {len(generated_pairs_df.groupby('original_id').first())} submissions in total")

,seqnr,start_x,stop_x,text_x,class_x,channel_x,line_x,column_x,start_y,stop_y,text_y,class_y,channel_y,line_y,column_y,tag,problem_id,original_id,changed_id,language,extension,original_language,original_status
0,38,73.0,78.0,printf,identifier,0.0,8.0,1.0,,,,,,,,delete,p03260,s000972575,s938269052,C,c,C,Wrong Answer
1,39,79.0,79.0,(,punctuator,0.0,8.0,7.0,,,,,,,,delete,p03260,s000972575,s938269052,C,c,C,Wrong Answer
2,40,80.0,92.0,"""%d %d %d %d""",stringliteral,0.0,8.0,8.0,,,,,,,,delete,p03260,s000972575,s938269052,C,c,C,Wrong Answer
3,41,93.0,93.0,",",punctuator,0.0,8.0,21.0,,,,,,,,delete,p03260,s000972575,s938269052,C,c,C,Wrong Answer
4,42,94.0,94.0,A,identifier,0.0,8.0,22.0,,,,,,,,delete,p03260,s000972575,s938269052,C,c,C,Wrong Answer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60478,64,174.0,177.0,temp,identifier,0.0,5.0,27.0,174.0,174.0,*,punctuator,0.0,5.0,27.0,replace,p02990,s999951931,s536625431,C,c,C,Wrong Answer
60479,65,,,,,,,,175.0,178.0,temp,identifier,0.0,5.0,28.0,replace,p02990,s999951931,s536625431,C,c,C,Wrong Answer
60480,66,,,,,,,,179.0,179.0,%,punctuator,0.0,5.0,32.0,replace,p02990,s999951931,s536625431,C,c,C,Wrong Answer
60481,67,,,,,,,,180.0,180.0,M,identifier,0.0,5.0,33.0,replace,p02990,s999951931,s536625431,C,c,C,Wrong Answer


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60483 entries, 0 to 60482
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqnr              60483 non-null  int64  
 1   start_x            29350 non-null  float64
 2   stop_x             29350 non-null  float64
 3   text_x             29346 non-null  object 
 4   class_x            29350 non-null  object 
 5   channel_x          29350 non-null  float64
 6   line_x             29350 non-null  float64
 7   column_x           29350 non-null  float64
 8   start_y            47150 non-null  float64
 9   stop_y             47150 non-null  float64
 10  text_y             47145 non-null  object 
 11  class_y            47150 non-null  object 
 12  channel_y          47150 non-null  float64
 13  line_y             47150 non-null  float64
 14  column_y           47150 non-null  float64
 15  tag                60483 non-null  object 
 16  problem_id         604

None

C    60483
Name: language, dtype: int64

Wrong Answer              13707
WA: Presentation Error     3158
Runtime Error              2505
Time Limit Exceeded         446
Memory Limit Exceeded        13
Output Limit Exceeded         8
Name: original_status, dtype: int64

We are left with 19837 submissions in total


## Parser Tags Visualization

In this section we take a look at the most frequent mistakes in the source code of the submissions. We also visualize the types of parser tags and the frequencies of the changed tokens. The parser tags were presented in the previous sections and describe what kind of action was performed since the last submission: was the change an addition, a deletion, or purely a modification of a statement? Beyond that, we are interested in how each kind of change affects each type of instruction.

One interesting result is that function calls are mostly deleted, and most of those calls are related to printing, so we can assume that they were used for debugging and the users forgot to delete them. When it comes to literals, most of the changed literals were strings. These were probably format strings for print or read calls, which caused wrong answers or segmentation faults. Next on the list are integer constants, which could denote problem constraints, maximum or minimum values, lengths of lists or similar aspects. Unlike function calls, literals were mostly modified, which is logical if they were used in variable assignments or as parameters.

Binary operators also stand out: most of the mistakes involve the less-than and greater-than operators, similar to what we can call an "off-by-one" mistake, or to playing with the code by randomly changing instructions to see what happens. Most of these mistakes come up in infinite for and while loops, where the condition should be reversed. Logical operators are also known to cause problems, especially in complex if statements. Finally, bitwise operators are not that frequent, since not many problems require them, but they behave similarly to the logical operators.

Looking at the mistakes involving identifiers, we can see a lot of common names that are typically used for indexing, so again we could be looking at for and while loops. When it comes to assignments, the most common destinations are index variables (the i's and j's), which are more often updated in while loops. However, when we look at changes made to keywords, the while and for keywords are not the most common ones; that is because most of the changes do not touch the keyword itself, but the body of the instruction. With this in mind, it makes a lot of sense that the most commonly changed keywords are types: in algorithmic problem solving you might need a different type to satisfy a certain constraint. For example, a 32-bit int might not be enough if the resulting numbers are huge, and a 64-bit long would do better or even solve the issue. The same goes for using signed types instead of unsigned ones.

The most common unary operation is indexing, and in our data we can see that it is also a common cause of errors: most of the time it leads to index-out-of-bounds errors or segmentation faults. Other unary operations that might be interesting are the negations, logical and bitwise, which are related to the logical operators discussed earlier; errors could appear, for instance, when applying De Morgan's laws in a complex if statement. Last are the commas and semicolons: a comma could be missing between string literals, which would lead to concatenation (in C). All in all, most of the problems seem to appear when printing results and in conditional branches.
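
The counts discussed above are computed per (original, changed) pair rather than per token row. The sketch below, on made-up rows, shows why that matters: a single deleted `printf` call spans several tokens and would otherwise be counted several times. The toy values are assumptions; only the `groupby(...).first().value_counts()` pattern mirrors the analysis cell.

```python
import pandas as pd

# Toy token-level table: one row per changed token, several rows per pair.
toy_tokens = pd.DataFrame(
    {
        "original_id": ["s1", "s1", "s1", "s2", "s3", "s3"],
        "changed_id": ["s9", "s9", "s9", "s8", "s7", "s7"],
        "token_class": ["call", "call", "call", "binary", "call", "call"],
        "tag": ["delete", "delete", "delete", "replace", "delete", "delete"],
    }
)

# Counting rows over-weights edits that touch many tokens.
rows_per_class = toy_tokens["token_class"].value_counts()

# Taking the first row of each pair counts every pair exactly once.
pairs_per_class = (
    toy_tokens.groupby(["original_id", "changed_id"])["token_class"]
    .first()
    .value_counts()
)

print(rows_per_class)
print(pairs_per_class)
```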

In [5]:
generated_pairs_df = codenet.add_token_class(generated_pairs_df)
generated_pairs_df.to_csv(token_class_generated_pairs_path, index=False)

In [5]:
generated_pairs_df = pd.read_csv(token_class_generated_pairs_path)

display(generated_pairs_df.groupby(['original_id', 'changed_id'])['token_class'].first().value_counts())

literal_df = generated_pairs_df[generated_pairs_df['token_class'] == 'literal']
display("Tag for the literals")
display(literal_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(literal_df.groupby(['original_id', 'changed_id'])['class_x'].first().value_counts())

binary_df = generated_pairs_df[generated_pairs_df['token_class'] == 'binary']
display("Tag for the binary operations")
display(binary_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(binary_df.groupby(['original_id', 'changed_id']).apply(parser.get_bop_from_df).value_counts())

identifier_df = generated_pairs_df[generated_pairs_df['token_class'] == 'identifier']
display("Tag for the identifiers")
display(identifier_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(identifier_df.groupby(['original_id', 'changed_id']).apply(parser.get_id_from_df).value_counts())

calls_df = generated_pairs_df[generated_pairs_df['token_class'] == 'call']
display("Tag for the function calls")
display(calls_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(calls_df.groupby(['original_id', 'changed_id']).apply(parser.get_fcall_from_df).value_counts().head())

assigns_df = generated_pairs_df[generated_pairs_df['token_class'] == 'assign']
display("Tag for the assignments")
display(assigns_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(assigns_df.groupby(['original_id', 'changed_id']).apply(parser.get_assign_from_df).value_counts())

keywords_df = generated_pairs_df[generated_pairs_df['token_class'] == 'keyword']
display("Tag for the keywords")
display(keywords_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(keywords_df.groupby(['original_id', 'changed_id']).apply(parser.get_keyword_from_df).value_counts())

unaries_df = generated_pairs_df[generated_pairs_df['token_class'] == 'unary']
display("Tag for the unary operations")
display(unaries_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(unaries_df.groupby(['original_id', 'changed_id']).apply(parser.get_unary_from_df).value_counts())

puncts_df = generated_pairs_df[generated_pairs_df['token_class'] == 'punctuator']
display("Tag for the punctuators")
display(puncts_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(puncts_df.groupby(['original_id', 'changed_id']).apply(parser.get_punct_from_df).value_counts())

literal       8280
binary        2866
identifier    1248
call          1089
assign         532
keyword        332
unary          297
punctuator      45
Name: token_class, dtype: int64

'Tag for the literals'

replace    8267
delete       13
Name: tag, dtype: int64

stringliteral        5223
integerconstant      2809
floatingconstant      132
characterconstant     103
keyword                 9
punctuator              4
Name: class_x, dtype: int64

'Tag for the binary operations'

replace    2368
delete      498
Name: tag, dtype: int64

<     799
>     433
-     341
+     221
<=    214
==    161
>=    145
&&    113
||    109
*      90
/      88
&      62
%      44
^      35
<<      7
|       3
>>      1
dtype: int64

'Tag for the identifiers'

replace    1231
delete       17
Name: tag, dtype: int64

n            174
i            133
a            103
b             72
x             61
            ... 
tetra_sum      1
map            1
strlen         1
insert         1
setNum         1
Length: 222, dtype: int64

'Tag for the function calls'

delete     1053
replace      36
Name: tag, dtype: int64

printf     970
puts        27
scanf       18
putchar     14
strlen       6
dtype: int64

'Tag for the assignments'

replace    429
delete     103
Name: tag, dtype: int64

i            36
j            12
x             9
max           6
n             6
a             5
b             4
t             4
ans           4
c             4
k             2
flag          2
p             2
r             2
tmp           2
min_i         1
heap_size     1
v             1
run           1
rem           1
middle        1
min           1
on            1
lim           1
t_src         1
now           1
y             1
some          1
N             1
X             1
K             1
G_Common      1
sum           1
exceed        1
cou           1
fi            1
s             1
PI            1
R             1
g             1
tmp1          1
f             1
m             1
count         1
sum1          1
dtype: int64

'Tag for the keywords'

replace    273
delete      59
Name: tag, dtype: int64

int         99
float       61
break       40
else        29
char        22
long        21
void        14
double      14
unsigned    11
if           7
continue     5
short        4
return       3
for          1
while        1
dtype: int64

'Tag for the unary operations'

replace    285
delete      12
Name: tag, dtype: int64

]    239
[     48
~      7
!      3
dtype: int64

'Tag for the punctuators'

replace    31
delete     14
Name: tag, dtype: int64

,    28
;    17
dtype: int64