# CodeNet Dataset

This notebook aims to make it easy to visualize all the necessary information about a specific problem. The base dataset is [CodeNet](https://github.com/IBM/Project_CodeNet), a large collection of source files and problem descriptions with metadata. The solutions are written in many programming languages (55+ according to the paper), and each problem has multiple submissions. Most submissions are written in the six most common languages (C++, Python, Java, C, Ruby, C#), with C++ accounting for the largest share. One interesting aspect of the dataset is that it includes failed submissions with various status codes, such as Compilation Error, Runtime Error, Time Limit Exceeded and Memory Limit Exceeded. This is useful here because the goal is bug detection in source code files.

Before running the notebook, run `source spt.profile` to set up the environment that the tokenizer needs. You also have to compile the tokenizer that ships with CodeNet.

# Table of Contents
1. [Download CodeNet](#Download-CodeNet)
2. [Missing Values](#Missing-Values)
3. [Generate Source Code Pairs](#Generate-Source-Code-Pairs)
4. [Tokenize Source Code Pairs](#Tokenize-Source-Code-Pairs)
5. [Parser Tags Visualization](#Parser-Tags-Visualization)
6. [Clean Code Pairs](#Clean-Code-Pairs)
7. [Error Type Description](#Error-Type-Description)
8. [Clean data](#Clean-data)
9. [Generate Labels](#Generate-Labels)

In [51]:
import myparser
import codenet

import pandas as pd

# Use the full option key; the short form can be ambiguous in newer pandas versions.
pd.set_option('display.max_columns', None)

input_path = "../input/"
generated_data_path = input_path + "generated/data/"
root_path = input_path + "Project_CodeNet/"

metadata_path = root_path + "metadata/"

problem_list_clean_path = input_path + "generated/problem_list_clean.csv"
generated_pairs_path = input_path + "generated/generated_pairs.csv"
generated_pairs_tok_path = input_path + "generated/generated_pairs_tok.csv"
token_class_generated_pairs_path = input_path + "generated/token_class_generated_pairs.csv"
clean_pairs_path = input_path + "generated/clean_pairs.csv"
error_pairs_path = input_path + "generated/error_pairs.csv"
clean_error_pairs_path = input_path + "generated/clean_error_pairs.csv"
generate_labels_path = input_path + "generated/generate_labels.csv"

codenet.P = 4

## Download CodeNet

The next code cell downloads the CodeNet dataset from its original repository (the archive is around 80 GB). If you already have the dataset, change the `input_path` variable to point to the root of the dataset. Otherwise the notebook will download it into the `../input/` directory.

In [2]:
codenet.download_data()

dataset root dir found


## Missing Values

The dataset also includes a description file for most of the problems, so we can check which problems do or do not have one. The description file can be useful for predicting the topic of a problem (graphs, dynamic programming, greedy, etc.).

For problems with missing input files, I think it is better to simply drop the submissions: most of the corresponding description files are written in Chinese, so we cannot easily extract useful information from them. Since there are so few problems with no input, dropping them costs little. Judging by the description files, only about 2 problems read no input from stdin.

To conclude the missing values section: 54 of the 56 missing names in the problem list are due to missing description files, 1 is just an href that links to a 404 web page, and the last one is a test problem. The latter 2 problems have no submissions anyway. I think it is fair to drop these samples, as they are not useful. That leaves 130 problems with no input/output samples: 128 of them have description files in Chinese, which makes it harder to extract samples, and 2 of them only require printing values (similar to problem p00000). I think it is also fine to drop those 2 problems together with the rest of the problems that have no extracted input examples, so that we do not have to remember later on that one or two special cases can cause bugs.

In [8]:
problem_list_df = pd.read_csv(metadata_path + 'problem_list.csv', index_col="id")
problem_list_df = codenet.clean_problem_list(problem_list_df)
problem_list_df.to_csv(problem_list_clean_path)

Cleaning ../input/Project_CodeNet/metadata/problem_list.csv


In [2]:
problem_list_df = pd.read_csv(problem_list_clean_path, index_col="id")
problem_ids = problem_list_df.index.unique()

print(f"We have {len(problem_list_df)} problems")
print('The distribution of the datasets is')
print(problem_list_df['dataset'].value_counts(normalize=True))
display(problem_list_df.head())
display(problem_list_df.isna().sum())

We have 3867 problems
The distribution of the datasets is
AIZU       0.613654
AtCoder    0.386346
Name: dataset, dtype: float64


id,name,dataset,time_limit,memory_limit,rating,tags,complexity
p00001,List of Top 3 Hills,AIZU,1000.0,131072.0,,,
p00002,Digit Number,AIZU,1000.0,131072.0,,,
p00003,Is it a Right Triangle?,AIZU,1000.0,131072.0,,,
p00004,Simultaneous Equation,AIZU,1000.0,131072.0,,,
p00005,GCD and LCM,AIZU,1000.0,131072.0,,,


name               0
dataset            0
time_limit         0
memory_limit       0
rating          3867
tags            3867
complexity      3867
dtype: int64

## Generate Source Code Pairs
- for each problem:
- group the solutions by user and select consecutive submissions of the form (Error, Accepted) in which only one instruction (line) changed; this is a heuristic that narrows the search space, since a user often submits a correct solution right after a wrong one
- get the diff lines, the diff operation (add, delete, change) and the error type
- build a DataFrame from this list and save it
- sanity check: the error types in this DataFrame are the same as the types in the problem metadata (result: success)
- done
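
The pairing heuristic above can be sketched in plain Python. This is only an illustration, not the code behind `codenet.generate_pairs`: the function name `consecutive_fix_pairs` is hypothetical, and the field names (`submission_id`, `user_id`, `date`, `language`, `status`) are assumed to match the columns of the per-problem CodeNet metadata files.

```python
from collections import defaultdict


def consecutive_fix_pairs(submissions):
    """Return (failed_id, accepted_id) pairs of consecutive submissions per user.

    `submissions` is a list of dicts for ONE problem. The keys used here are
    assumptions about the metadata layout, not a confirmed schema.
    """
    by_user = defaultdict(list)
    for sub in submissions:
        by_user[sub["user_id"]].append(sub)

    pairs = []
    for user_subs in by_user.values():
        # Order each user's submissions chronologically without mutating the input.
        ordered = sorted(user_subs, key=lambda s: s["date"])
        for prev, curr in zip(ordered, ordered[1:]):
            same_language = prev["language"] == curr["language"]
            failed_then_accepted = (
                prev["status"] != "Accepted" and curr["status"] == "Accepted"
            )
            if same_language and failed_then_accepted:
                pairs.append((prev["submission_id"], curr["submission_id"]))
    return pairs
```

The single-line-change filter is deliberately left out here. It would be applied afterwards to each candidate pair.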

Up to this point we have played around with single files and looked at how to load source code from the dataset. In this section we will see how to get the diff between source code files and, more specifically, the instruction (identified by its line in the file) that caused a problem.
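
A line-level diff of this kind can be sketched with Python's `difflib`. This is a minimal illustration under my own assumptions, not necessarily how the pairs in this notebook were produced: the helper name `single_line_change` is hypothetical, and the one-letter codes `c`, `a`, `d` and the line-number convention are assumed to follow the Unix `diff` normal format, which is consistent with the `diff_op` values that appear in the outputs of this notebook.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the one-letter codes used by Unix diff.
_OP_CODES = {"replace": "c", "insert": "a", "delete": "d"}


def single_line_change(original_src, changed_src):
    """Return (original_line, diff_op, changed_line) if exactly one line differs.

    Returns None when the two sources are identical or differ in more than
    one line.
    """
    a = original_src.splitlines()
    b = changed_src.splitlines()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if len(ops) != 1:
        return None
    tag, i1, i2, j1, j2 = ops[0]
    # Accept only hunks that touch a single line on each affected side.
    if max(i2 - i1, j2 - j1) != 1:
        return None
    # Unix diff convention: 1-based line numbers. For an insertion the original
    # number is the line AFTER which text was added; for a deletion the changed
    # number is the line after which text was removed.
    original_line = i1 if tag == "insert" else i1 + 1
    changed_line = j1 if tag == "delete" else j1 + 1
    return original_line, _OP_CODES[tag], changed_line


print(single_line_change("a = 1\nprint(a[0])\n", "a = [1]\nprint(a[0])\n"))
print(single_line_change("print(math.pi)\n", "import math\nprint(math.pi)\n"))
```

The first call reports a changed line 1, the second an insertion after line 0 of the original file.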

In [None]:
generated_pairs_df = codenet.generate_pairs(problem_list_df)
generated_pairs_df.to_csv(generated_pairs_path, index=False)

In [3]:
generated_pairs_df = pd.read_csv(generated_pairs_path)

display(generated_pairs_df)
display(generated_pairs_df.info())
display(generated_pairs_df.language.value_counts())
display(generated_pairs_df.original_status.value_counts())

print(f"We are left with {len(generated_pairs_df)} submissions in total")

Unnamed: 0,original_id,changed_id,original_line,diff_op,changed_line,original_status,original_language,problem_id,language,filename_ext
0,s000002117,s279971709,98,c,98,Runtime Error,C++14 (GCC 5.4.1),p03160,C++,cpp
1,s000010752,s827043744,4,c,4,Runtime Error,C++14 (GCC 5.4.1),p02754,C++,cpp
2,s000013369,s886874630,13,c,13,Time Limit Exceeded,C++14 (GCC 5.4.1),p03353,C++,cpp
3,s000019909,s548620913,64,c,64,Time Limit Exceeded,C++,p02250,C++,cpp
4,s000039595,s299641757,26,c,26,Runtime Error,JAVA,p02247,Java,java
...,...,...,...,...,...,...,...,...,...,...
63786,s999945536,s649357053,5,c,5,Memory Limit Exceeded,C++14 (GCC 5.4.1),p03717,C++,cpp
63787,s999955927,s677172776,10,c,10,Runtime Error,C++14 (GCC 5.4.1),p03374,C++,cpp
63788,s999958485,s576007819,2,c,2,Runtime Error,C++14 (GCC 5.4.1),p03819,C++,cpp
63789,s999978013,s377979752,43,c,43,Time Limit Exceeded,C++ (GCC 9.2.1),p02596,C++,cpp


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63791 entries, 0 to 63790
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_id        63791 non-null  object
 1   changed_id         63791 non-null  object
 2   original_line      63791 non-null  int64 
 3   diff_op            63791 non-null  object
 4   changed_line       63791 non-null  int64 
 5   original_status    63791 non-null  object
 6   original_language  63791 non-null  object
 7   problem_id         63791 non-null  object
 8   language           63791 non-null  object
 9   filename_ext       63791 non-null  object
dtypes: int64(2), object(8)
memory usage: 4.9+ MB


None

C++       41094
Python    10246
C          7814
Java       4637
Name: language, dtype: int64

Runtime Error             43372
Time Limit Exceeded       10091
WA: Presentation Error     9524
Memory Limit Exceeded       684
Output Limit Exceeded        97
Internal error               16
Judge Not Available           6
Judge System Error            1
Name: original_status, dtype: int64

We are left with 63791 submissions in total


After running this preprocessing function on all the problems, we are left with 63,791 pairs of source code files of the form (error, successful), spread over the four languages shown above (C++, Python, C and Java). We will have to analyze the source code files we obtained and make sure they are suitable for tokenization and for use in a machine learning algorithm. The error messages provided in this dataset do not look very useful, so in the next steps I will attempt to improve them by running the source code on sample inputs or by using compilers, code-checking tools, etc.

## Tokenize Source Code Pairs
- for each problem:
- drop the submissions with a compile error status; we are only interested in runtime errors
- group the solutions by user and select consecutive submissions of the form (Error, Accepted) in which only one instruction changed; this is a heuristic that narrows the search space, since a user often submits a correct solution right after a wrong one
- use the AST tokenizer for C, C++, Java and Python to generate the tokens for each pair of submission files
- use an edit distance algorithm to detect the diff between the two submissions and save the information
- build a DataFrame from this list and save it

Up to this point we have played around with single files and looked at how to load source code from the dataset. In this section we will see how to get the diff between source code files and, more specifically, the instruction (identified by its line and token in the file) that caused a problem.

One interesting question here is how many users sent more than one submission, have submissions of the form (failed, accepted), and changed only one instruction between them, so that we can understand which change made their code work. To find out, we first need to split each source file into tokens and then implement a function that computes the edit distance and the opcodes (`get_opcodes`) of the two token lists. Luckily, the CodeNet repository contains a tool written in Java that can tokenize syntactically correct source code (C, C++, Java and Python only). This means we have to drop compilation errors, which are not that interesting for this subject anyway. To compute the opcodes for the edit distance, we can use the `SequenceMatcher` class from Python's `difflib` module.
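
As a rough illustration of the opcode step, the sketch below runs `difflib.SequenceMatcher` on two token lists. The regex tokenizer `toy_tokenize` is a hypothetical stand-in that I use only to keep the example self-contained. It is not the CodeNet Java tokenizer and it does not handle string literals or multi-character operators.

```python
import re
from difflib import SequenceMatcher

# Toy tokenizer: identifiers/keywords, integers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def toy_tokenize(src):
    """Split source text into a flat list of token strings (illustration only)."""
    return TOKEN_RE.findall(src)


def token_opcodes(original_src, changed_src):
    """Return the non-equal opcodes between the token lists of two sources.

    Each entry is (tag, i1, i2, j1, j2, original_tokens, changed_tokens), where
    tag is 'replace', 'insert' or 'delete' and the indices are token positions.
    """
    a = toy_tokenize(original_src)
    b = toy_tokenize(changed_src)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(token_opcodes("int N = 5;", "long N = 5;"))
print(token_opcodes("a[i]", "a[i-1]"))
```

The first pair yields a single `replace` of `int` by `long`, the second a single `insert` of the tokens `-` and `1`. A pair with exactly one such opcode is what this notebook treats as having a single instruction changed.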

Some notes:
- The C Tokenizer needs to delete the include statements, so they might be shifted in the file

In [None]:
generated_pairs_tok_df = codenet.tokenize_generated_pairs(generated_pairs_df)
generated_pairs_tok_df.to_csv(generated_pairs_tok_path, index=False)

In [4]:
generated_pairs_tok_df = pd.read_csv(generated_pairs_tok_path)

display(generated_pairs_tok_df)
display(generated_pairs_tok_df.info())
display(generated_pairs_tok_df['language'].value_counts())
display(generated_pairs_tok_df.groupby('original_id').first()['original_status'].value_counts())

print(f"We are left with {len(generated_pairs_tok_df.groupby('original_id').first())} submissions in total")

Unnamed: 0,seqnr,start_x,stop_x,text_x,class_x,channel_x,line_x,column_x,start_y,stop_y,text_y,class_y,channel_y,line_y,column_y,tag,problem_id,original_id,changed_id,language,extension,original_language,original_status
0,95,200.0,209.0,"""%d%d%d\n""",stringliteral,0.0,20.0,8.0,200.0,211.0,"""%d %d %d\n""",stringliteral,0.0,20.0,8.0,replace,p02393,s000907095,s097564908,C,c,C,WA: Presentation Error
1,96,210.0,210.0,",",punctuator,0.0,20.0,18.0,212.0,212.0,",",punctuator,0.0,20.0,20.0,replace,p02393,s000907095,s097564908,C,c,C,WA: Presentation Error
2,135,364.0,366.0,int,keyword,0.0,21.0,2.0,364.0,367.0,long,keyword,0.0,21.0,2.0,replace,p03286,s001407680,s276813988,C++,cpp,C++14 (GCC 5.4.1),Runtime Error
3,136,368.0,368.0,N,identifier,0.0,21.0,6.0,369.0,369.0,N,identifier,0.0,21.0,7.0,replace,p03286,s001407680,s276813988,C++,cpp,C++14 (GCC 5.4.1),Runtime Error
4,35,45.0,45.0,&,operators,0.0,3.0,21.0,45.0,47.0,and,keywords,0.0,3.0,21.0,replace,p02681,s001183728,s212071531,Python,py,Python (3.8.2),Runtime Error
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241407,57,,,,,,,,172.0,172.0,5,integer,0.0,8.0,29.0,insert,p02596,s999978013,s377979752,C++,cpp,C++ (GCC 9.2.1),Time Limit Exceeded
241408,58,,,,,,,,173.0,173.0,),operator,0.0,8.0,30.0,insert,p02596,s999978013,s377979752,C++,cpp,C++ (GCC 9.2.1),Time Limit Exceeded
241409,243,684.0,684.0,),operator,0.0,35.0,27.0,684.0,684.0,-,operator,0.0,35.0,27.0,insert,p03107,s999985350,s209658151,C++,cpp,C++14 (GCC 5.4.1),Runtime Error
241410,244,,,,,,,,685.0,685.0,1,integer,0.0,35.0,28.0,insert,p03107,s999985350,s209658151,C++,cpp,C++14 (GCC 5.4.1),Runtime Error


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241412 entries, 0 to 241411
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   seqnr              241412 non-null  int64  
 1   start_x            124270 non-null  float64
 2   stop_x             124270 non-null  float64
 3   text_x             123962 non-null  object 
 4   class_x            124270 non-null  object 
 5   channel_x          124270 non-null  float64
 6   line_x             124270 non-null  float64
 7   column_x           124270 non-null  float64
 8   start_y            209424 non-null  float64
 9   stop_y             209424 non-null  float64
 10  text_y             208687 non-null  object 
 11  class_y            209424 non-null  object 
 12  channel_y          209424 non-null  float64
 13  line_y             209424 non-null  float64
 14  column_y           209424 non-null  float64
 15  tag                241412 non-null  object 
 16  pr

None

C++       180743
Java       20993
C          20664
Python     19012
Name: language, dtype: int64

Runtime Error             32784
WA: Presentation Error     8665
Time Limit Exceeded        7915
Memory Limit Exceeded       553
Output Limit Exceeded        82
Internal error               11
Name: original_status, dtype: int64

We are left with 50010 submissions in total


## Parser Tags Visualization

In this section we look at the most frequent mistakes in the source code of the submissions, and visualize the types of parser tags and the frequencies of the changed tokens. The parser tags, presented in the previous sections, describe what kind of action was performed since the last submission: an addition, a deletion, or a modification of a statement. Beyond this, we are interested in how each kind of change affects each type of instruction.

One interesting aspect of this analysis is that deletions are relatively frequent for function calls, and that most of the affected calls are related to printing or I/O redirection (`freopen`, `print`, `printf`, `println`), so we can assume that these calls were used for debugging and the users forgot to delete them.

When it comes to literals, many of the changed literals were strings. These were probably format strings passed to print or read calls, which caused wrong answers or segmentation faults. Integer constants are just as prominent (they are the largest literal class in the output below). They could denote problem constraints, maximum or minimum values, lengths of lists and similar quantities. The difference between the literals and the function calls is that literals were mostly modified, which is logical if they were used in variable assignments or as parameters.

Binary operators are another interesting case: many of the mistakes involve the less-than and greater-than operators, similar to an "off-by-one" mistake, or to playing with the code by randomly changing instructions to see what happens. Most of these mistakes come up in infinite for and while loops where the condition should be reversed. Logical operators are also known to cause problems, especially in complex if statements. Finally, bitwise operators are not that frequent, since few problems require them, but their mistakes are similar to those of the logical operators.

Looking at the mistakes caused by identifiers, we see many common names that are used for indexing, so again we could be looking at for and while loops. When it comes to variable assignments, the most common cases use index variables as the destination (the i's and j's), which are often used in while loops. However, when we look at changes made to keywords, the while and for keywords are not the most common (`return` leads the list). That is because most changes to a loop do not change the keyword itself but the body of the instruction. With this in mind, it makes sense in algorithmic problem solving that type keywords such as `int` and `long` are among the commonly changed keywords: to satisfy a certain constraint you might need a different variable type. For example, a 32-bit int might not be large enough if the resulting numbers are huge, and a 64-bit long would do better or even solve the issue. The same goes for using signed types instead of unsigned ones.

The most common unary operation is indexing, and it is also a common cause of errors: most of the time it leads to index-out-of-bounds errors or segmentation faults. Other unary operations of interest are the logical and bitwise negations, which are related to the logical operators discussed earlier. Errors could appear, for example, when applying De Morgan's laws to a complex if statement. Last are the commas and semicolons: a missing comma between string literals leads to concatenation (in C). All in all, most of the problems seem to appear when printing results and in conditional branches.
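
To make the `int` versus `long` point concrete, here is a small worked example of why a 32-bit integer can be too small. The helper `wrap_int32` is a hypothetical name, and it only simulates two's-complement wraparound. In C and C++ signed overflow is formally undefined behaviour, so wrapping is merely what is commonly observed, while Java defines it.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def wrap_int32(x):
    """Reduce x to the value a two's-complement 32-bit signed integer would hold."""
    return (x + 2**31) % 2**32 - 2**31


# A typical constraint: two values of up to 10**5 multiplied together.
n = 10**5
product = n * n

print(product > INT32_MAX)        # True: 10**10 does not fit in 32 bits
print(wrap_int32(product))        # 1410065408, a silently wrong result
print(wrap_int32(INT32_MAX + 1))  # -2147483648, the sign flips
```

Replacing `int` with `long` (64 bits) in the original program removes this class of error without touching the algorithm, which matches the kind of single-token fix seen in the keyword counts.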

In [14]:
generated_pairs_tok_df = codenet.add_token_class(generated_pairs_tok_df)
generated_pairs_tok_df.to_csv(token_class_generated_pairs_path, index=False)

In [5]:
generated_pairs_tok_df = pd.read_csv(token_class_generated_pairs_path)

display(generated_pairs_tok_df.groupby(['original_id', 'changed_id'])['token_class'].first().value_counts())

literal_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'literal']
display("Tag for the literals")
display(literal_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(literal_df.groupby(['original_id', 'changed_id'])['class_x'].first().value_counts())

binary_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'binary']
display("Tag for the binary operations")
display(binary_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(binary_df.groupby(['original_id', 'changed_id']).apply(myparser.get_bop_from_df).value_counts())

identifier_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'identifier']
display("Tag for the identifiers")
display(identifier_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(identifier_df.groupby(['original_id', 'changed_id']).apply(myparser.get_id_from_df).value_counts())

calls_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'call']
display("Tag for the function calls")
display(calls_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(calls_df.groupby(['original_id', 'changed_id']).apply(myparser.get_fcall_from_df).value_counts().head())

assigns_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'assign']
display("Tag for the assign")
display(assigns_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(assigns_df.groupby(['original_id', 'changed_id']).apply(myparser.get_assign_from_df).value_counts())

keywords_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'keyword']
display("Tag for the keywords")
display(keywords_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(keywords_df.groupby(['original_id', 'changed_id']).apply(myparser.get_keyword_from_df).value_counts())

unaries_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'unary']
display("Tag for the unary operations")
display(unaries_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(unaries_df.groupby(['original_id', 'changed_id']).apply(myparser.get_unary_from_df).value_counts())

puncts_df = generated_pairs_tok_df[generated_pairs_tok_df['token_class'] == 'punctuator']
display("Tag for the punctuators")
display(puncts_df.groupby(['original_id', 'changed_id'])['tag'].first().value_counts())
display(puncts_df.groupby(['original_id', 'changed_id']).apply(myparser.get_punct_from_df).value_counts())

literal       9549
identifier    9125
unary         7881
binary        7397
call          3210
punctuator    2752
keyword       2344
assign        1447
Name: token_class, dtype: int64

'Tag for the literals'

replace    9186
insert      242
delete      121
Name: tag, dtype: int64

integer              4978
stringliteral        2485
string               1426
floating              314
integerconstant       165
keyword                62
character              42
characterconstant      27
operator               13
'assert'               10
ws                      5
floatingconstant        4
identifier              4
operators               4
punctuator              3
line_break              1
number                  1
dedent                  1
newline                 1
keywords                1
indent                  1
name                    1
Name: class_x, dtype: int64

'Tag for the binary operations'

replace    5688
delete     1257
insert      452
Name: tag, dtype: int64

+     2701
<     1570
-      619
>      521
<=     458
==     406
*      383
>=     185
/      165
%      108
&&      95
||      95
&       57
^       17
|        9
<<       8
dtype: int64

'Tag for the identifiers'

replace    4684
insert     4028
delete      413
Name: tag, dtype: int64

n        1170
N         662
i         547
a         377
s         230
         ... 
Aoj1        1
isope       1
amd         1
sum_c       1
ramen       1
Length: 1442, dtype: int64

'Tag for the function calls'

replace    1836
delete     1374
Name: tag, dtype: int64

freopen    400
print      239
printf     225
println    205
int        196
dtype: int64

'Tag for the assign'

replace    1137
delete      266
insert       44
Name: tag, dtype: int64

i        169
n         35
j         31
k         24
a         19
        ... 
nv         1
P          1
h          1
MAX_S      1
cur        1
Length: 117, dtype: int64

'Tag for the keywords'

insert     1938
replace     313
delete       93
Name: tag, dtype: int64

return       730
for          456
int          303
if           192
while        119
long          85
true          63
break         58
print         55
continue      48
else          43
false         27
elif          23
using         16
bool          15
double        12
auto          12
def            9
in             7
delete         7
unsigned       5
import         5
const          4
char           4
or             4
package        4
short          3
private        3
void           3
True           3
static         3
lambda         2
this           2
do             2
constexpr      2
goto           2
byte           2
as             1
default        1
from           1
not            1
and            1
pass           1
float          1
class          1
operator       1
namespace      1
final          1
dtype: int64

'Tag for the unary operations'

replace    7209
insert      620
delete       52
Name: tag, dtype: int64

]    7472
[     326
!      67
~      16
dtype: int64

'Tag for the punctuators'

insert     2700
replace      33
delete       19
Name: tag, dtype: int64

;    2487
,     265
dtype: int64

## Clean Code Pairs

Here we keep only those source code pairs that appear in both the tokenized version and the generated pairs version, that is, the pairs that have a single instruction changed. Some of the generated pairs differ by only one line, but in reality only about 50K of the 63K pairs have a single instruction modified. If we need extra information later on, we can merge the cleaned pairs with the tokenized version.
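
The filtering step amounts to an inner join on the `(original_id, changed_id)` key. The plain-Python sketch below shows the idea on lists of dicts. The function name `keep_tokenized_pairs` is hypothetical and only mirrors what the pandas `merge` in the next cell does.

```python
def keep_tokenized_pairs(pairs, token_rows):
    """Keep only the pairs whose (original_id, changed_id) key has token rows.

    `pairs` has one dict per generated pair. `token_rows` has one dict per
    changed token, so the same key can appear many times there.
    """
    # Collapsing the token rows into a set plays the role of drop_duplicates().
    tokenized_keys = {(row["original_id"], row["changed_id"]) for row in token_rows}
    return [
        pair
        for pair in pairs
        if (pair["original_id"], pair["changed_id"]) in tokenized_keys
    ]
```

Because the keys are de-duplicated first, each generated pair is kept at most once, even when several token rows refer to it.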

In [16]:
clean_pairs_df = generated_pairs_df.merge(generated_pairs_tok_df[["original_id", "changed_id"]].drop_duplicates())
clean_pairs_df.to_csv(clean_pairs_path, index=False)

In [6]:
clean_pairs_df = pd.read_csv(clean_pairs_path)
clean_pairs_df

Unnamed: 0,original_id,changed_id,original_line,diff_op,changed_line,original_status,original_language,problem_id,language,filename_ext
0,s000002117,s279971709,98,c,98,Runtime Error,C++14 (GCC 5.4.1),p03160,C++,cpp
1,s000019909,s548620913,64,c,64,Time Limit Exceeded,C++,p02250,C++,cpp
2,s000039595,s299641757,26,c,26,Runtime Error,JAVA,p02247,Java,java
3,s000046211,s356098162,48,c,48,Runtime Error,C++ (GCC 9.2.1),p03065,C++,cpp
4,s000055709,s401616856,7,c,7,Runtime Error,C++14 (GCC 5.4.1),p02888,C++,cpp
...,...,...,...,...,...,...,...,...,...,...
50005,s999894525,s751202776,32,c,32,Runtime Error,C++ (GCC 9.2.1),p02596,C++,cpp
50006,s999910703,s281129745,13,c,13,Runtime Error,C++14 (GCC 5.4.1),p03346,C++,cpp
50007,s999945536,s649357053,5,c,5,Memory Limit Exceeded,C++14 (GCC 5.4.1),p03717,C++,cpp
50008,s999978013,s377979752,43,c,43,Time Limit Exceeded,C++ (GCC 9.2.1),p02596,C++,cpp


## Error Type Description

TODO doc
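
Until this section is documented, here is a hedged sketch of what "running the source code on sample inputs" could look like. It is not the implementation of `codenet.add_error_description`. The function name `run_on_sample`, the timeout value and the way the error class is derived are all my assumptions. The negative return codes that map to signal names are POSIX behaviour.

```python
import signal
import subprocess
import sys


def run_on_sample(cmd, sample_input, timeout_s=2.0):
    """Run `cmd` with `sample_input` on stdin and summarize how it ended."""
    try:
        proc = subprocess.run(
            cmd, input=sample_input, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Label chosen for this sketch only.
        return {"output": "", "error": "time limit exceeded",
                "returncode": None, "error_class": "TimeLimitExceeded"}

    error_class = None
    if proc.returncode < 0:
        # On POSIX a negative return code means the process died from a signal.
        error_class = signal.Signals(-proc.returncode).name
    elif proc.returncode != 0:
        # Heuristic: take the exception name from the last non-empty stderr
        # line, e.g. "IndexError: list index out of range".
        lines = [line for line in proc.stderr.splitlines() if line.strip()]
        error_class = lines[-1].split(":", 1)[0] if lines else "NonZeroExit"

    return {"output": proc.stdout, "error": proc.stderr,
            "returncode": proc.returncode, "error_class": error_class}


# Usage: a tiny Python program that fails with an IndexError.
result = run_on_sample([sys.executable, "-c", "print([][0])"], "")
print(result["returncode"], result["error_class"])
```

For compiled languages the same function would be called on the compiled binary. The compilation step is left out of this sketch.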

In [None]:
error_pairs_df = codenet.add_error_description(clean_pairs_df, problem_list_df)
error_pairs_df.to_csv(error_pairs_path, index=False)

In [7]:
error_pairs_df = pd.read_csv(error_pairs_path)
error_pairs_df

Unnamed: 0,original_id,changed_id,original_line,diff_op,changed_line,original_status,original_language,problem_id,language,filename_ext,output,error,returncode,error_class,error_class_extra
0,s000002117,s279971709,98,c,98,Runtime Error,C++14 (GCC 5.4.1),p03160,C++,cpp,,,0,0,
1,s000019909,s548620913,64,c,64,Time Limit Exceeded,C++,p02250,C++,cpp,1\n1\n0\n0\n,,0,0,
2,s000039595,s299641757,26,c,26,Runtime Error,JAVA,p02247,Java,java,0\n3\n4\n,,0,,
3,s000046211,s356098162,48,c,48,Runtime Error,C++ (GCC 9.2.1),p03065,C++,cpp,2\n7\n,,0,0,
4,s000055709,s401616856,7,c,7,Runtime Error,C++14 (GCC 5.4.1),p02888,C++,cpp,1\n,,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50005,s999894525,s751202776,32,c,32,Runtime Error,C++ (GCC 9.2.1),p02596,C++,cpp,4\n,,0,0,
50006,s999910703,s281129745,13,c,13,Runtime Error,C++14 (GCC 5.4.1),p03346,C++,cpp,2,,0,0,
50007,s999945536,s649357053,5,c,5,Memory Limit Exceeded,C++14 (GCC 5.4.1),p03717,C++,cpp,6\n,,0,0,
50008,s999978013,s377979752,43,c,43,Time Limit Exceeded,C++ (GCC 9.2.1),p02596,C++,cpp,4\n,,0,0,


## Clean data

TODO doc

In [8]:
clean_error_pairs_df = codenet.clean_error_list(error_pairs_df)
clean_error_pairs_df.to_csv(clean_error_pairs_path, index=False)

In [13]:
clean_error_pairs_df = pd.read_csv(clean_error_pairs_path)
clean_error_pairs_df

Unnamed: 0,original_id,changed_id,original_line,diff_op,changed_line,original_status,original_language,problem_id,language,filename_ext,output,error,returncode,error_class,error_class_extra
0,s000103279,s107821968,16,c,16,Time Limit Exceeded,C,p00017,C,c,,TLEError: Time limit exceeded,-9,OutOfMemory,TLEError: Time limit exceeded
1,s000139707,s081978093,6,c,6,Runtime Error,C++14 (GCC 5.4.1),p03424,C++,cpp,,terminate called after throwing an instance of...,-6,SIGABRT,terminate called after throwing an instance of...
2,s000186023,s530109551,153,c,153,Time Limit Exceeded,C++ (GCC 9.2.1),p02618,C++,cpp,,TLEError: Time limit exceeded,-9,OutOfMemory,TLEError: Time limit exceeded
3,s000359301,s635308388,12,c,12,Time Limit Exceeded,C++ (Clang 10.0.0),p02803,C++,cpp,,TLEError: Time limit exceeded,-9,OutOfMemory,TLEError: Time limit exceeded
4,s000778835,s833669381,1,c,1,Runtime Error,Python (3.8.2),p02628,Python,py,,"Traceback (most recent call last):\n File ""/h...",1,ValueError,ValueError: invalid literal for int() with bas...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9905,s999341335,s456156683,7,c,7,Runtime Error,Python (3.8.2),p02615,Python,py,,"Traceback (most recent call last):\n File ""/h...",1,IndexError,IndexError: list index out of range
9906,s999352609,s119429512,0,a,1,Runtime Error,Python (3.8.2),p02677,Python,py,,"Traceback (most recent call last):\n File ""/h...",1,NameError,NameError: name 'math' is not defined
9907,s999516884,s693491511,12,a,13,Runtime Error,JAVA,p00168,Java,java,1\n1\n34\n701\n1\n,"Exception in thread ""main"" java.util.NoSuchEle...",1,java.util.NoSuchElementException,"Exception in thread ""main"" java.util.NoSuchEle..."
9908,s999647680,s672945348,26,c,26,Runtime Error,C++14 (GCC 5.4.1),p03426,C++,cpp,,,-11,SIGSEGV,


In [14]:
clean_error_pairs_df['error_class'].value_counts()

SIGSEGV                                       2212
OutOfMemory                                   1838
SyntaxError                                   1664
SIGABRT                                       1205
NameError                                     1138
TypeError                                      385
SIGFPE                                         209
ValueError                                     208
AttributeError                                 180
java.lang.ArrayIndexOutOfBoundsException       177
java.util.NoSuchElementException               155
IndexError                                      99
EOFError                                        79
SIGILL                                          77
java.lang.NullPointerException                  39
IndentationError                                39
java.lang.StringIndexOutOfBoundsException       36
java.util.InputMismatchException                26
java.lang.NumberFormatException                 19
java.lang.ArithmeticException  

In [15]:
clean_error_pairs_df['language'].value_counts()

C++       4749
Python    3851
C          792
Java       518
Name: language, dtype: int64

## Generate Labels

TODO doc

In [None]:
generate_labels_df = codenet.generate_labels(clean_error_pairs_df)
generate_labels_df.to_csv(generate_labels_path, index=False)

In [52]:
generate_labels_df = pd.read_csv(generate_labels_path)
generate_labels_df

Unnamed: 0,tag,i1,i2,j1,j2,problem_id,original_id,changed_id,language,extension,original_language,original_status,output,error,returncode,error_class,error_class_extra
0,replace,109,110,109,110,p00017,s000103279,s107821968,C,c,C,Time Limit Exceeded,,TLEError: Time limit exceeded,-9,OutOfMemory,TLEError: Time limit exceeded
1,insert,22,22,22,25,p03424,s000139707,s081978093,C++,cpp,C++14 (GCC 5.4.1),Runtime Error,,terminate called after throwing an instance of...,-6,SIGABRT,terminate called after throwing an instance of...
2,insert,816,816,816,830,p02618,s000186023,s530109551,C++,cpp,C++ (GCC 9.2.1),Time Limit Exceeded,,TLEError: Time limit exceeded,-9,OutOfMemory,TLEError: Time limit exceeded
3,insert,153,153,153,159,p02803,s000359301,s635308388,C++,cpp,C++ (Clang 10.0.0),Time Limit Exceeded,,TLEError: Time limit exceeded,-9,OutOfMemory,TLEError: Time limit exceeded
4,insert,9,9,9,13,p02628,s000778835,s833669381,Python,py,Python (3.8.2),Runtime Error,,"Traceback (most recent call last):\n File ""/h...",1,ValueError,ValueError: invalid literal for int() with bas...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9905,insert,81,81,81,83,p02615,s999341335,s456156683,Python,py,Python (3.8.2),Runtime Error,,"Traceback (most recent call last):\n File ""/h...",1,IndexError,IndexError: list index out of range
9906,insert,0,0,0,5,p02677,s999352609,s119429512,Python,py,Python (3.8.2),Runtime Error,,"Traceback (most recent call last):\n File ""/h...",1,NameError,NameError: name 'math' is not defined
9907,insert,161,161,161,170,p00168,s999516884,s693491511,Java,java,JAVA,Runtime Error,1\n1\n34\n701\n1\n,"Exception in thread ""main"" java.util.NoSuchEle...",1,java.util.NoSuchElementException,"Exception in thread ""main"" java.util.NoSuchEle..."
9908,replace,176,181,176,181,p03426,s999647680,s672945348,C++,cpp,C++14 (GCC 5.4.1),Runtime Error,,,-11,SIGSEGV,
