# CodeNetPy Dataset

In this notebook I want to make it easy to visualize all the relevant information about a specific problem. The base dataset is [CodeNet](https://github.com/IBM/Project_CodeNet), a large collection of source files and problem descriptions with metadata. The solutions are written in many programming languages (55+ according to the paper), and each problem has multiple submissions. Most submissions are written in the six most common languages (C++, Python, Java, C, Ruby, C#) and, as expected, most of them are in C++. One interesting aspect of the dataset is that it includes failed submissions, with status codes such as Compilation Error, Runtime Error, Time Limit Exceeded, and Memory Limit Exceeded. This is useful here because we are looking into bug detection in source code files.

To be able to run the notebook you have to run the `codenet.py` script first. It will download and preprocess the CodeNetPy dataset.

```console
python3 codenet.py
```

## Table of Contents
1. [Imports](#Imports)
1. [Download CodeNet](#Download-CodeNet)
1. [Missing Values](#Missing-Values)
1. [Generate Source Code Pairs](#Generate-Source-Code-Pairs)
1. [Generate Error Pairs](#Generate-Error-Pairs)

## Imports

In [1]:
import json
import codenet

import pandas as pd

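# Use the fully qualified option name; the bare 'max_columns' pattern is
# ambiguous in recent pandas versions and raises an OptionError.
pd.set_option('display.max_columns', None)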

## Download CodeNet

To generate the same dataset as the one used in this project, download the CodeNet dataset as described in IBM's GitHub repository and run the data processing pipeline from the [bug-detection](https://github.com/alexjercan/bug-detection) GitHub repository.

## Missing Values

The dataset also includes a description file for most of the problems, so we can check which problems have a description and which do not. The description file can be useful for predicting the topic of a problem (graphs, dynamic programming, greedy, etc.).

For problems with missing input files, I think it is also better to drop the submissions: most of their description files are written in Chinese, so we cannot easily extract useful information from them. Since so few files have no input, we can drop them. Judging by the description files, about 2 problems read no input from stdin.

To conclude the missing values section: 54 of the 56 missing names in the problem list are due to missing description files, one is just an href that links to a 404 web page, and the last one is a test problem; the latter two problems have no submissions anyway. I think it is fair to drop these samples, as they are not useful. That leaves 130 problems with no input/output samples: 128 of them have description files in Chinese, which makes it harder to extract samples, and 2 of them only require printing values (similar to problem p00000). I think it is fine to also drop those 2 problems that need no input, along with the rest of the problems that have no extracted input examples, so that we do not have to remember one or two special cases that could cause bugs later on.
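The dropping rules above can be sketched with pandas on a toy problem list. This is only an illustration: the rows are made up, and the `has_input_sample` column is a hypothetical helper flag, not a column of the real dataset.

```python
import pandas as pd

# Made-up miniature of a raw problem list (not the real CodeNet data).
raw_df = pd.DataFrame(
    {
        "id": ["p00000", "p00001", "p00002", "p00003"],
        "name": ["Print-only problem", "List of Top 3 Hills", None, "Is it a Right Triangle?"],
        # Hypothetical flag: True when an input sample could be extracted.
        "has_input_sample": [False, True, True, True],
    }
).set_index("id")

# Rule 1: drop problems whose name is missing.
clean_df = raw_df.dropna(subset=["name"])
# Rule 2: drop problems with no extracted input sample, including the
# ones that legitimately read nothing from stdin.
clean_df = clean_df[clean_df["has_input_sample"]]

print(list(clean_df.index))
```

Dropping the print-only problems together with the rest keeps the later steps free of special cases, at the cost of a couple of usable problems.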

In the next cell we can look at the cleaned problem list dataframe that is contained in the CodeNetPy dataset.

In [2]:
problem_list_df = pd.read_csv(codenet.problem_list_clean_path, index_col='id')
problem_ids = problem_list_df.index.unique()

print(f'We have {len(problem_list_df)} problems')
print('The distribution of the datasets is')
print(problem_list_df['dataset'].value_counts(normalize=True))
display(problem_list_df.head())
display(problem_list_df.isna().sum())

We have 3867 problems
The distribution of the datasets is
AIZU       0.613654
AtCoder    0.386346
Name: dataset, dtype: float64


| id | name | dataset | time_limit | memory_limit | rating | tags | complexity |
|---|---|---|---|---|---|---|---|
| p00001 | List of Top 3 Hills | AIZU | 1000.0 | 131072.0 | NaN | NaN | NaN |
| p00002 | Digit Number | AIZU | 1000.0 | 131072.0 | NaN | NaN | NaN |
| p00003 | Is it a Right Triangle? | AIZU | 1000.0 | 131072.0 | NaN | NaN | NaN |
| p00004 | Simultaneous Equation | AIZU | 1000.0 | 131072.0 | NaN | NaN | NaN |
| p00005 | GCD and LCM | AIZU | 1000.0 | 131072.0 | NaN | NaN | NaN |


name               0
dataset            0
time_limit         0
memory_limit       0
rating          3867
tags            3867
complexity      3867
dtype: int64

## Generate Source Code Pairs
- for each problem:
  - group the submissions by user and sort them by submission date
  - select consecutive submissions of the form (Error, Accepted); this is a heuristic to search a smaller space, since a user might submit a correct solution right after a wrong one
- build a dataframe from this list and save it
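The steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `make_pairs` is a hypothetical helper, the column names (`submission_id`, `problem_id`, `user_id`, `date`, `status`) are assumed to match the CodeNet metadata files, and any status other than `Accepted` is treated as an error, whereas the real pipeline may filter statuses further.

```python
import pandas as pd

def make_pairs(meta: pd.DataFrame) -> pd.DataFrame:
    """Pair each failed submission with the accepted one that directly follows it."""
    # Order each user's attempts on each problem chronologically.
    meta = meta.sort_values(["problem_id", "user_id", "date"])
    groups = meta.groupby(["problem_id", "user_id"], sort=False)
    # Look one submission ahead within the same (problem, user) group.
    next_id = groups["submission_id"].shift(-1)
    next_status = groups["status"].shift(-1)
    # Keep only (Error, Accepted) neighbours.
    mask = (meta["status"] != "Accepted") & (next_status == "Accepted")
    return pd.DataFrame(
        {
            "original_id": meta.loc[mask, "submission_id"],
            "changed_id": next_id[mask],
            "original_status": meta.loc[mask, "status"],
            "problem_id": meta.loc[mask, "problem_id"],
        }
    ).reset_index(drop=True)

# Made-up metadata: u1 fails twice and then succeeds, u2 succeeds first.
meta = pd.DataFrame(
    {
        "submission_id": ["s1", "s2", "s3", "s4", "s5"],
        "problem_id": ["p1"] * 5,
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "date": [3, 1, 2, 1, 2],
        "status": ["Accepted", "Wrong Answer", "Runtime Error", "Accepted", "Runtime Error"],
    }
)
pairs = make_pairs(meta)
print(pairs)
```

In this toy input only `s3` is paired (with `s1`): `s2` is followed by another failure, and `s5` comes after the accepted submission of its user.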

Here we want to generate submission pairs that will allow us to find code fixes within the dataset. With this information we can find the instructions that have to be deleted, inserted, or changed so that the old code starts to work. Solutions submitted one after another by the same user should be similar in content, with only a few changes between them. Knowing the accepted submission in such a chain lets us patch the small mistakes, and it tells us which instructions produced the errors and which correct instructions removed them. These are the submissions we are interested in: we look for patterns in them and further clean out those that will not prove helpful.
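As an illustration of how the deleted, inserted, or changed instructions of a pair could be located, here is a small line-level diff built on the standard library's `difflib`. It is a sketch under the assumption that one instruction corresponds to one source line; `changed_lines` is a hypothetical helper and not necessarily what the pipeline uses.

```python
import difflib

def changed_lines(original_src: str, changed_src: str):
    """Return (tag, old_lines, new_lines) for every non-equal diff block."""
    a = original_src.splitlines()
    b = changed_src.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # tag is one of 'replace', 'delete', 'insert'
    ]

# Toy pair: the failing version uses the undefined name `s`.
original = "S = int(input())\nh = int(s / 3600)\nprint(h)\n"
changed = "S = int(input())\nh = S // 3600\nprint(h)\n"

for tag, old, new in changed_lines(original, changed):
    print(tag, old, new)
```

Each returned block names the kind of edit and the lines involved, which is the granularity needed to replay one change at a time.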

In [3]:
generated_pairs_df = pd.read_csv(codenet.generated_pairs_path)

display(generated_pairs_df)
display(generated_pairs_df.info())
display(generated_pairs_df.language.value_counts())
display(generated_pairs_df.original_status.value_counts())

print(f'We are left with {len(generated_pairs_df)} submissions in total')

| | original_id | changed_id | original_status | problem_id | language | filename_ext |
|---|---|---|---|---|---|---|
| 0 | s000016565 | s604436209 | Runtime Error | p03106 | Python | py |
| 1 | s000023530 | s834210063 | Runtime Error | p02684 | Python | py |
| 2 | s000041036 | s454210115 | Runtime Error | p02584 | Python | py |
| 3 | s000041460 | s952454189 | Time Limit Exceeded | p02658 | Python | py |
| 4 | s000054326 | s226572665 | Time Limit Exceeded | p02701 | Python | py |
| ... | ... | ... | ... | ... | ... | ... |
| 54587 | s999855642 | s185899176 | Runtime Error | p02645 | Python | py |
| 54588 | s999876744 | s590378402 | Time Limit Exceeded | p02642 | Python | py |
| 54589 | s999891212 | s835176880 | Runtime Error | p03103 | Python | py |
| 54590 | s999921259 | s180756175 | Time Limit Exceeded | p02713 | Python | py |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54592 entries, 0 to 54591
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   original_id      54592 non-null  object
 1   changed_id       54592 non-null  object
 2   original_status  54592 non-null  object
 3   problem_id       54592 non-null  object
 4   language         54592 non-null  object
 5   filename_ext     54592 non-null  object
dtypes: object(6)
memory usage: 2.5+ MB


None

Python    54592
Name: language, dtype: int64

Runtime Error             34126
Time Limit Exceeded       18385
WA: Presentation Error     1938
Memory Limit Exceeded       134
Output Limit Exceeded         8
Judge Not Available           1
Name: original_status, dtype: int64

We are left with 54592 submissions in total


## Generate Error Pairs

In this section we generate the error class for each of the changes found. To create an error message for a change, we first generate the corresponding source code file, so that we can analyze what error that modification would produce. We then run the source code and obtain the error message for the analyzed instruction, and repeat this step for the rest of the instructions changed in the same file. This way we obtained 250K labels that contain errors, compared to only 5K examples in the previous attempt, where we considered only the files that had a single instruction changed. Out of the entire set of generated pairs, a third contained instructions that did not matter toward the acceptance of the problem: when analyzed, they produced a return code of zero, indicating that the execution was successful. Out of the remaining buggy instructions, half are syntax errors, and the other more frequent errors are Python-related problems, with some indentation bugs and type bugs.
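The run-and-label step described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's implementation: `run_and_classify` is a hypothetical helper, and the labeling convention (`"0"` for a successful run, `"TLEError"` for a timeout, the exception name taken from the last line of stderr, and the raw return code as a fallback) is only inferred from the labels that appear in the dataset.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

def run_and_classify(src: str, stdin_text: str = "", timeout: float = 2.0):
    """Run `src` as a script and return (returncode, error_class)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "main.py"
        path.write_text(src)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Assumed label for runs that exceed the time budget.
            return None, "TLEError"
    if proc.returncode == 0:
        return 0, "0"
    # The last stderr line of a Python traceback looks like "NameError: ...".
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else ""
    match = re.match(r"([A-Za-z_][\w.]*)(?::|$)", last)
    error_class = match.group(1) if match else str(proc.returncode)
    return proc.returncode, error_class

print(run_and_classify("print(1)"))
print(run_and_classify("print(s)"))
print(run_and_classify("if a=5:\n    pass\n"))
```

Running each candidate file in a separate interpreter process keeps crashes and infinite loops in the analyzed code from affecting the labeling script itself.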

In [4]:
with open(codenet.codenetpy_path, 'r') as f:
    data = json.load(f)
codenetpy_df = pd.DataFrame(data)

display(codenetpy_df)
display(codenetpy_df.info())

codenetpy_df['error_class'].value_counts()

| | original_src | changed_src | problem_id | original_id | changed_id | language | filename_ext | original_status | returncode | error_class | error_class_extra | error | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `a, b, c = map(int, input().split())\n\nif a=5,...` | `a, b, c = map(int, input().split())\niroha = [...` | p04043 | s008791833 | s239750578 | Python | py | Runtime Error | 1 | SyntaxError | SyntaxError: invalid syntax | `File "/home/alex/Documents/research/bug-dete...` | |
| 1 | `S=int(input())\nh=int(S / 3600)\nm=int((s / 36...` | `S = int(input())\nh = S // 3600\nm = S % 3600 ...` | p02390 | s015523893 | s437020613 | Python | py | Runtime Error | 1 | NameError | NameError: name 's' is not defined | `Traceback (most recent call last):\n File "/h...` | |
| 2 | `seep, wolf = map(int, input().rstrip())\n\nif ...` | `seep, wolf = map(int, input().rstrip().split()...` | p02699 | s024536009 | s838012510 | Python | py | Runtime Error | 1 | ValueError | ValueError: invalid literal for int() with bas... | `Traceback (most recent call last):\n File "/h...` | |
| 3 | `n = int(input())\ns = [input() for i in range(...` | `import collections\nn = int(input())\ns = [inp...` | p02701 | s020385685 | s773213317 | Python | py | Time Limit Exceeded | 0 | 0 | | | `2\n` |
| 4 | `import sys\nsys.setrecursionlimit(100000)\n\nd...` | `import sys\nsys.setrecursionlimit(1000000)\n\n...` | p02573 | s012413364 | s819427232 | Python | py | Runtime Error | 0 | 0 | | | `3\n` |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54587 | `import numpy as np\n\ndef main():\n N = int...` | `N = int(input())\nAn = list(map(int, input().s...` | p02642 | s999876744 | s590378402 | Python | py | Time Limit Exceeded | 0 | 0 | | | `3\n` |
| 54588 | `#!/usr/bin/env python3\nimport numpy as np\n\n...` | `#!/usr/bin/env python3\n\n# N ...` | p02624 | s999814239 | s783590686 | Python | py | Time Limit Exceeded | 0 | 0 | | | `23\n` |
| 54589 | `N = int(input())\nK = int(input())\nresult = 0...` | `N,K = map(int,input().split())\nresult = 0\nfo...` | p03043 | s999971803 | s030030532 | Python | py | Runtime Error | 1 | ValueError | ValueError: invalid literal for int() with bas... | `Traceback (most recent call last):\n File "/h...` | |
| 54590 | `from decimal import *\n\nn = int(input())\nans...` | `from decimal import *\n\nn = int(input())\nans...` | p02588 | s999831708 | s237718445 | Python | py | Time Limit Exceeded | 0 | 0 | | | `3\n` |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54592 entries, 0 to 54591
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_src       54592 non-null  object
 1   changed_src        54592 non-null  object
 2   problem_id         54592 non-null  object
 3   original_id        54592 non-null  object
 4   changed_id         54592 non-null  object
 5   language           54592 non-null  object
 6   filename_ext       54592 non-null  object
 7   original_status    54592 non-null  object
 8   returncode         54592 non-null  int64 
 9   error_class        54592 non-null  object
 10  error_class_extra  54592 non-null  object
 11  error              54592 non-null  object
 12  output             54592 non-null  object
dtypes: int64(1), object(12)
memory usage: 5.4+ MB


None

0                            28406
SyntaxError                   8686
NameError                     5262
TypeError                     4202
ValueError                    2444
IndentationError              1081
IndexError                     999
AttributeError                 913
EOFError                       852
TLEError                       525
ModuleNotFoundError            357
TabError                       265
ImportError                    132
ZeroDivisionError               79
KeyError                        66
FileNotFoundError               62
UnboundLocalError               61
1                               58
RecursionError                  13
OverflowError                   12
RuntimeError                     6
-11                              5
OSError                          3
2                                3
255                              1
Name: error_class, dtype: int64