# CodeNetPy Dataset

This notebook makes it easy to inspect all the relevant information about a specific problem. The base dataset is [CodeNet](https://github.com/IBM/Project_CodeNet), a large collection of source files and problem descriptions with metadata. The solutions are written in many programming languages (55+ according to the paper), and each problem has multiple submissions. Most submissions are written in the six most common languages (C++, Python, Java, C, Ruby, C#), and, as expected, most of the solutions are in C++. One interesting aspect of the dataset is that it includes failed submissions, with status codes such as Compilation Error, Runtime Error, Time Limit Exceeded and Memory Limit Exceeded. This is useful here because the goal is bug detection in source code files.

To run the notebook, first run the `codenet.py` script. It downloads and preprocesses the CodeNetPy dataset.

```console
python3 codenet.py
```

## Table of Contents
1. [Imports](#Imports)
1. [Download CodeNet](#Download-CodeNet)
1. [Missing Values](#Missing-Values)
1. [Generate Source Code Pairs](#Generate-Source-Code-Pairs)
1. [Generate Error Pairs](#Generate-Error-Pairs)
1. [Examples of buggy code](#Examples-of-buggy-code)

## Imports

In [1]:
import json

import numpy as np
import pandas as pd

from difflib import SequenceMatcher
from IPython.display import HTML

pd.set_option('display.max_columns', None)

problem_list_clean_path = "../../input/generated/problem_list_clean.csv"
generated_pairs_path = "../../input/generated/generated_pairs.csv"
codenetpy_path = "../../input/generated/codenetpy.json"
filter_codenetpy_path = "../../input/generated/filter_codenetpy.json"

## Download CodeNet

To reproduce the dataset used in this project, download the CodeNet dataset as described in IBM's GitHub repository and run the data processing pipeline from the [bug-detection](https://github.com/alexjercan/bug-detection) GitHub repository.

## Missing Values

The dataset also includes a description file for most of the problems, so we can check which problems have a description and which do not. The description file can be useful for predicting the topic of a problem (graphs, dynamic programming, greedy, etc.).

For problems with missing input files, I think it is also better to simply drop the submissions. Most of the corresponding description files are written in Chinese, so we cannot easily extract useful information from them, and there are few enough of these problems that dropping them costs little. Looking through the description files, only about 2 problems genuinely take no input from stdin.

To conclude the missing values section: of the 56 problems with a missing name in the problem list, 54 are due to missing description files, 1 has a description that is just an href linking to a 404 web page, and the last one is a test problem. The latter 2 problems have no submissions anyway. I think it is fair to drop these samples, as they are not useful. That leaves 130 problems with no input/output samples: 128 of them have description files in Chinese, which makes it harder to extract samples, and 2 of them only require printing values (similar to problem p00000). I think it is also fine to drop those 2 problems together with the rest of the problems that have no extracted input examples, so that we do not have to remember one or two special cases that could cause bugs later on.
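
The dropping rule described above can be sketched in a few lines of plain Python. This is only an illustration, not the code from `codenet.py`: the record layout, the `input_sample` field and the example ids are hypothetical.

```python
def split_problems(problems):
    """Split problem records into ids to keep and ids to drop.

    A problem is dropped when it has no name or no extracted input sample.
    """
    kept, dropped = [], []
    for problem in problems:
        usable = bool(problem.get("name")) and bool(problem.get("input_sample"))
        (kept if usable else dropped).append(problem["id"])
    return kept, dropped


# Hypothetical records, for illustration only
problems = [
    {"id": "x1", "name": "Sum of Two", "input_sample": "1 2\n"},
    {"id": "x2", "name": None, "input_sample": None},         # no description file
    {"id": "x3", "name": "Print Table", "input_sample": ""},  # prints only, reads no stdin
]
kept, dropped = split_problems(problems)
print("kept:", kept)
print("dropped:", dropped)
```

On a dataframe the same rule would likely be a single `dropna` call on the relevant columns; the explicit loop is used here only to make the two conditions visible.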

In the next cell we can look at the cleaned problem list dataframe that is contained in the CodeNetPy dataset.

In [2]:
problem_list_df = pd.read_csv(problem_list_clean_path, index_col='id')
problem_ids = problem_list_df.index.unique()

print(f'We have {len(problem_list_df)} problems')
print('The distribution of the datasets is')
print(problem_list_df['dataset'].value_counts(normalize=True))
display(problem_list_df.head())
display(problem_list_df.isna().sum())

We have 3867 problems
The distribution of the datasets is
AIZU       0.613654
AtCoder    0.386346
Name: dataset, dtype: float64


id,name,dataset,time_limit,memory_limit,rating,tags,complexity
p00001,List of Top 3 Hills,AIZU,1000.0,131072.0,,,
p00002,Digit Number,AIZU,1000.0,131072.0,,,
p00003,Is it a Right Triangle?,AIZU,1000.0,131072.0,,,
p00004,Simultaneous Equation,AIZU,1000.0,131072.0,,,
p00005,GCD and LCM,AIZU,1000.0,131072.0,,,


name               0
dataset            0
time_limit         0
memory_limit       0
rating          3867
tags            3867
complexity      3867
dtype: int64

## Generate Source Code Pairs
- for each problem:
  - group the solutions by user and sort them by submission date
  - select the consecutive solutions of the form (Error, Accepted); this is a heuristic to search a smaller space, since a user might submit a correct solution right after a wrong one
- build a df from this list and save it

Here we generate submission pairs that let us find code fixes within the dataset. With these pairs we can identify the instructions that have to be deleted, inserted or changed so that the old code starts to work. Solutions submitted in succession should be similar in content, with only a few changes between them. Given the accepted submission in such a chain, we can patch the small mistakes in the failing one, and in doing so recover both the error each mistake produces and the instruction that removes it. These are the submissions we are interested in; we look for patterns in them and further clean out the pairs that will not prove helpful.
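
The pairing heuristic can be sketched as follows. This is a minimal sketch, not the actual pipeline: the field names (`submission_id`, `problem_id`, `user_id`, `date`, `status`) are assumptions modeled on the CodeNet metadata, and the sketch treats any status other than `Accepted` as an error, which may be broader than what the real script does.

```python
from itertools import groupby
from operator import itemgetter


def generate_pairs(submissions):
    """Return (original_id, changed_id, original_status) tuples.

    A pair is produced when a user's failing submission is immediately
    followed, in date order, by an accepted one for the same problem.
    """
    # Sort so that each (problem, user) group is contiguous and date-ordered
    ordered = sorted(submissions, key=itemgetter("problem_id", "user_id", "date"))
    pairs = []
    for _, group in groupby(ordered, key=itemgetter("problem_id", "user_id")):
        group = list(group)
        for prev, curr in zip(group, group[1:]):
            if prev["status"] != "Accepted" and curr["status"] == "Accepted":
                pairs.append((prev["submission_id"], curr["submission_id"], prev["status"]))
    return pairs


# Hypothetical submissions, for illustration only
submissions = [
    {"submission_id": "a1", "problem_id": "x1", "user_id": "u1", "date": 1, "status": "Runtime Error"},
    {"submission_id": "a2", "problem_id": "x1", "user_id": "u1", "date": 2, "status": "Accepted"},
    {"submission_id": "a3", "problem_id": "x1", "user_id": "u2", "date": 3, "status": "Accepted"},
    {"submission_id": "a4", "problem_id": "x1", "user_id": "u1", "date": 4, "status": "Accepted"},
]
print(generate_pairs(submissions))
```

Only `a1` and `a2` form a pair here: `a3` belongs to another user, and `a4` follows an already accepted submission.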

In [3]:
generated_pairs_df = pd.read_csv(generated_pairs_path)

display(generated_pairs_df)
display(generated_pairs_df.info())
display(generated_pairs_df.language.value_counts())
display(generated_pairs_df.original_status.value_counts())

print(f'We are left with {len(generated_pairs_df)} submissions in total')

Unnamed: 0,original_id,changed_id,original_status,problem_id,language,filename_ext
0,s000016565,s604436209,Runtime Error,p03106,Python,py
1,s000023530,s834210063,Runtime Error,p02684,Python,py
2,s000041036,s454210115,Runtime Error,p02584,Python,py
3,s000041460,s952454189,Time Limit Exceeded,p02658,Python,py
4,s000054326,s226572665,Time Limit Exceeded,p02701,Python,py
...,...,...,...,...,...,...
54587,s999855642,s185899176,Runtime Error,p02645,Python,py
54588,s999876744,s590378402,Time Limit Exceeded,p02642,Python,py
54589,s999891212,s835176880,Runtime Error,p03103,Python,py
54590,s999921259,s180756175,Time Limit Exceeded,p02713,Python,py


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54592 entries, 0 to 54591
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   original_id      54592 non-null  object
 1   changed_id       54592 non-null  object
 2   original_status  54592 non-null  object
 3   problem_id       54592 non-null  object
 4   language         54592 non-null  object
 5   filename_ext     54592 non-null  object
dtypes: object(6)
memory usage: 2.5+ MB


None

Python    54592
Name: language, dtype: int64

Runtime Error             34126
Time Limit Exceeded       18385
WA: Presentation Error     1938
Memory Limit Exceeded       134
Output Limit Exceeded         8
Judge Not Available           1
Name: original_status, dtype: int64

We are left with 54592 submissions in total


## Generate Error Pairs

In this section we generate the error class for each change that was found. To create an error message for a change, we first generate the corresponding source code file, then run it and record the error message produced by the analyzed instruction. We repeat this step for the rest of the instructions changed in the same file. This way we obtained 250K labels that contain errors, compared to only 5K examples in the previous attempt, which considered only the files that had a single instruction changed. Out of the entire set of generated pairs, a third contained instructions that did not matter for the acceptance of the solution: when analyzed, they produced a return code of zero, indicating that the execution was successful. Of the remaining buggy instructions, half are syntax errors; the other frequent errors are Python-specific problems, including some indentation bugs and type bugs.
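
The run-and-label step can be sketched with the standard library. This is a sketch under stated assumptions, not the pipeline's code: the function name `classify` is hypothetical, the source is run with `python -c` instead of from a generated file, and the fallback to the return code as the label is inferred from the values that appear in the `error_class` column (`0`, `1`, `-11`, ...).

```python
import subprocess
import sys


def classify(source, stdin_text="", timeout=2.0):
    """Run `source` in a fresh interpreter and return (returncode, error_class)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Assumption: timeouts are labeled TLEError, as in the dataset;
        # there is no return code because the process was killed
        return None, "TLEError"
    if proc.returncode == 0:
        return 0, "0"
    # The last stderr line of an uncaught exception looks like "NameError: ..."
    lines = proc.stderr.strip().splitlines()
    name = lines[-1].split(":", 1)[0].strip() if lines else ""
    # Fall back to the return code when no exception name can be parsed
    error_class = name if name.isidentifier() else str(proc.returncode)
    return proc.returncode, error_class


print(classify("print(1)"))
print(classify("print(s)"))
```

The sample input of the problem would be passed through `stdin_text`, so a program that reads more input than the sample provides ends with `EOFError`.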

In [2]:
with open(codenetpy_path, 'r') as f:
    data = json.load(f)

codenetpy_df = pd.DataFrame(data)

display(codenetpy_df)
display(codenetpy_df.info())

codenetpy_df['error_class'].value_counts()

Unnamed: 0,original_src,changed_src,problem_id,original_id,changed_id,language,filename_ext,original_status,returncode,error_class,error_class_extra,error,output
0,l=[]\nwhile True:\n try:\n l.append(...,l=[]\nwhile True:\n try:\n l.append(...,p00001,s196059089,s508355022,Python,py,Runtime Error,1,SyntaxError,SyntaxError: Missing parentheses in call to 'p...,"File ""/home/alex/Documents/research/bug-dete...",
1,"data = []\nfor i in range(0,10):\n data.app...","data = []\nfor i in range(0,10):\n data.app...",p00001,s840313913,s627573591,Python,py,WA: Presentation Error,0,0,,,\n3776\n2848\n2840\n
2,first = 0;\nsecond = 0;\nthird = 0;\nfor var i...,first = 0;\nsecond = 0;\nthird = 0;\nfor var i...,p00001,s973623080,s521689663,Python,py,Runtime Error,1,TypeError,TypeError: '>' not supported between instances...,"Traceback (most recent call last):\n File ""/h...",
3,for i in range(10):\n s.append(int(input())...,s=[]\nfor i in range(10):\n s.append(int(in...,p00001,s393051088,s459982723,Python,py,Runtime Error,1,NameError,NameError: name 's' is not defined,"Traceback (most recent call last):\n File ""/h...",
4,# -*- coding: utf-8 -*-\n\nimport sys\n\ndef t...,# -*- coding: utf-8 -*-\n\nimport sys\n\ndef t...,p00001,s803831828,s403126268,Python,py,Runtime Error,1,NameError,NameError: name 'argv' is not defined,"Traceback (most recent call last):\n File ""/h...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...
54587,"N = int(input())\nL = list(map(int, input().sp...","N = int(input())\nL = list(map(int, input().sp...",p04047,s420396881,s036877571,Python,py,Runtime Error,1,IndexError,IndexError: list index out of range,"Traceback (most recent call last):\n File ""/h...",
54588,\nnum = int(input())\nli = int(input().split()...,n = int(input())\nli = [int(x) for x in input(...,p04047,s838662080,s028090472,Python,py,Runtime Error,1,TypeError,"TypeError: int() argument must be a string, a ...","Traceback (most recent call last):\n File ""/h...",
54589,from collections import Counter\n \nn = int(in...,"n = int(input())\n \narr = list(map(int, input...",p04047,s112597968,s888016571,Python,py,Runtime Error,1,TypeError,TypeError: unsupported operand type(s) for +=:...,"Traceback (most recent call last):\n File ""/h...",
54590,"N,X = input().split()\nN,X = int(N), int(X)\na...","N,X = input().split()\nN,X = int(N), int(X)\na...",p04048,s311100241,s323769427,Python,py,Runtime Error,1,TabError,TabError: inconsistent use of tabs and spaces ...,"File ""/home/alex/Documents/research/bug-dete...",


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54592 entries, 0 to 54591
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_src       54592 non-null  object
 1   changed_src        54592 non-null  object
 2   problem_id         54592 non-null  object
 3   original_id        54592 non-null  object
 4   changed_id         54592 non-null  object
 5   language           54592 non-null  object
 6   filename_ext       54592 non-null  object
 7   original_status    54592 non-null  object
 8   returncode         54592 non-null  int64 
 9   error_class        54592 non-null  object
 10  error_class_extra  54592 non-null  object
 11  error              54592 non-null  object
 12  output             54592 non-null  object
dtypes: int64(1), object(12)
memory usage: 5.4+ MB


None

0                            28406
SyntaxError                   8686
NameError                     5262
TypeError                     4202
ValueError                    2444
IndentationError              1081
IndexError                     999
AttributeError                 913
EOFError                       852
TLEError                       525
ModuleNotFoundError            357
TabError                       265
ImportError                    132
ZeroDivisionError               79
KeyError                        66
FileNotFoundError               62
UnboundLocalError               61
1                               58
RecursionError                  13
OverflowError                   12
RuntimeError                     6
-11                              5
2                                3
OSError                          3
255                              1
Name: error_class, dtype: int64

In [3]:
err_codenetpy_df = codenetpy_df[codenetpy_df["returncode"] != 0]

print(f"Number of return code != 0 sources {len(err_codenetpy_df)} from the total of {len(codenetpy_df)}")
print(f"{len(err_codenetpy_df)/len(codenetpy_df)*100:.2f}% of the sources are labeled as having return code of error during execution")


Number of return code != 0 sources 26137 from the total of 54592
47.88% of the sources are labeled as having return code of error during execution


Even among the sources with a non-zero return code, some do not carry a usable error label. These sources print answers to stdout, such as "Yes" or "0", which we then try to interpret as errors. A first fix is to remove every source whose error class is not a named error such as SyntaxError or NameError; concretely, we filter out everything that does not contain "Error" in the class name. The removed samples are probably examples of wrong answers.

In [4]:
err_codenetpy_df = err_codenetpy_df[err_codenetpy_df["error_class"].str.contains("Error")]

err_codenetpy_df['error_class'].value_counts()

SyntaxError            8686
NameError              5262
TypeError              4202
ValueError             2444
IndentationError       1081
IndexError              999
AttributeError          913
EOFError                852
TLEError                525
ModuleNotFoundError     357
TabError                265
ImportError             132
ZeroDivisionError        79
KeyError                 66
FileNotFoundError        62
UnboundLocalError        61
RecursionError           13
OverflowError            12
RuntimeError              6
OSError                   3
Name: error_class, dtype: int64

In [5]:
with open(filter_codenetpy_path, 'r') as f:
    data = json.load(f)

filter_codenetpy_df = pd.DataFrame(data)

display(filter_codenetpy_df)
display(filter_codenetpy_df.info())

filter_codenetpy_df['error_class'].value_counts()

Unnamed: 0,original_src,changed_src,problem_id,original_id,changed_id,language,filename_ext,original_status,returncode,error_class,error_class_extra,error,output
0,import sys\nmountains = [int(line.readline()) ...,mountains = [int(input()) for _ in range(10)]\...,p00001,s068025056,s198062921,Python,py,Runtime Error,1,AttributeError,AttributeError: 'str' object has no attribute ...,"Traceback (most recent call last):\n File ""/h...",
1,# -*- coding: utf-8 -*-\n\nimport sys\n\ndef t...,# -*- coding: utf-8 -*-\n\nimport sys\n\ndef t...,p00001,s803831828,s403126268,Python,py,Runtime Error,1,NameError,NameError: name 'argv' is not defined,"Traceback (most recent call last):\n File ""/h...",
2,a = []\nfor i in range(10):\n a.append(inpu...,a = []\nfor i in range(10):\n a.append(inpu...,p00001,s209735121,s412417225,Python,py,Runtime Error,1,TypeError,TypeError: 'NoneType' object is not subscriptable,"Traceback (most recent call last):\n File ""/h...",
3,import sys\nheights = sorted([ int(h) for h in...,import sys\nheights = sorted([ int(h) for h in...,p00001,s286499566,s804982567,Python,py,Runtime Error,1,NameError,NameError: name 'reverce' is not defined,"Traceback (most recent call last):\n File ""/h...",
4,for i in range(10):\n s.append(int(input())...,s=[]\nfor i in range(10):\n s.append(int(in...,p00001,s393051088,s459982723,Python,py,Runtime Error,1,NameError,NameError: name 's' is not defined,"Traceback (most recent call last):\n File ""/h...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39839,"try:\n k=input()\nexcept EOFError:\n print(""...","k = int(input())\nli= list(map(int, input().sp...",p04047,s131254454,s616077020,Python,py,Runtime Error,1,AttributeError,AttributeError: 'str' object has no attribute ...,"Traceback (most recent call last):\n File ""/h...",
39840,"import math\nn, x = map(int, input().split())\...","import math\nn, x = map(int, input().split())\...",p04048,s297220978,s109261748,Python,py,Runtime Error,1,NameError,NameError: name 'gcd' is not defined,"Traceback (most recent call last):\n File ""/h...",
39841,\nnum = int(input())\nli = int(input().split()...,n = int(input())\nli = [int(x) for x in input(...,p04047,s838662080,s028090472,Python,py,Runtime Error,1,TypeError,"TypeError: int() argument must be a string, a ...","Traceback (most recent call last):\n File ""/h...",
39842,"N = int(input())\nL = list(map(int, input().sp...","N = int(input())\nL = list(map(int, input().sp...",p04047,s420396881,s036877571,Python,py,Runtime Error,1,IndexError,IndexError: list index out of range,"Traceback (most recent call last):\n File ""/h...",


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39844 entries, 0 to 39843
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_src       39844 non-null  object
 1   changed_src        39844 non-null  object
 2   problem_id         39844 non-null  object
 3   original_id        39844 non-null  object
 4   changed_id         39844 non-null  object
 5   language           39844 non-null  object
 6   filename_ext       39844 non-null  object
 7   original_status    39844 non-null  object
 8   returncode         39844 non-null  int64 
 9   error_class        39844 non-null  object
 10  error_class_extra  39844 non-null  object
 11  error              39844 non-null  object
 12  output             39844 non-null  object
dtypes: int64(1), object(12)
memory usage: 4.0+ MB


None

0                          24564
NameError                   4919
TypeError                   3999
ValueError                  2341
IndexError                   929
AttributeError               847
EOFError                     794
TLEError                     458
SyntaxError                  300
ModuleNotFoundError          223
ImportError                   83
ZeroDivisionError             72
KeyError                      59
UnboundLocalError             55
1                             47
FileNotFoundError             42
OverflowError                 10
RecursionError                10
-11                            4
RuntimeError                   3
2                              3
OSError                        2
255                            1
Name: error_class, dtype: int64

In [6]:
err_filter_codenetpy_df = filter_codenetpy_df[filter_codenetpy_df["returncode"] != 0]

print(f"Number of return code != 0 sources {len(err_filter_codenetpy_df)} from the total of {len(filter_codenetpy_df)}")
print(f"{len(err_filter_codenetpy_df)/len(filter_codenetpy_df)*100:.2f}% of the sources are labeled as having return code of error during execution")

Number of return code != 0 sources 15244 from the total of 39844
38.26% of the sources are labeled as having return code of error during execution


In [9]:
err_filter_codenetpy_df = err_filter_codenetpy_df[err_filter_codenetpy_df["error_class"].str.contains("Error")]

display(err_filter_codenetpy_df.info())
err_filter_codenetpy_df['error_class'].value_counts()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15146 entries, 0 to 39843
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_src       15146 non-null  object
 1   changed_src        15146 non-null  object
 2   problem_id         15146 non-null  object
 3   original_id        15146 non-null  object
 4   changed_id         15146 non-null  object
 5   language           15146 non-null  object
 6   filename_ext       15146 non-null  object
 7   original_status    15146 non-null  object
 8   returncode         15146 non-null  int64 
 9   error_class        15146 non-null  object
 10  error_class_extra  15146 non-null  object
 11  error              15146 non-null  object
 12  output             15146 non-null  object
dtypes: int64(1), object(12)
memory usage: 1.6+ MB


None

NameError              4919
TypeError              3999
ValueError             2341
IndexError              929
AttributeError          847
EOFError                794
TLEError                458
SyntaxError             300
ModuleNotFoundError     223
ImportError              83
ZeroDivisionError        72
KeyError                 59
UnboundLocalError        55
FileNotFoundError        42
OverflowError            10
RecursionError           10
RuntimeError              3
OSError                   2
Name: error_class, dtype: int64

In [8]:
diff_df = codenetpy_df[~codenetpy_df["original_id"].isin(filter_codenetpy_df["original_id"])]

display(diff_df)
display(diff_df.info())

diff_df["error_class"].value_counts()

Unnamed: 0,original_src,changed_src,problem_id,original_id,changed_id,language,filename_ext,original_status,returncode,error_class,error_class_extra,error,output
0,l=[]\nwhile True:\n try:\n l.append(...,l=[]\nwhile True:\n try:\n l.append(...,p00001,s196059089,s508355022,Python,py,Runtime Error,1,SyntaxError,SyntaxError: Missing parentheses in call to 'p...,"File ""/home/alex/Documents/research/bug-dete...",
5,height = []\nwhile 1:\n h = raw_input()\n ...,height = []\nwhile 1:\n try:\n heigh...,p00001,s700829087,s365381008,Python,py,Runtime Error,1,SyntaxError,SyntaxError: Missing parentheses in call to 'p...,"File ""/home/alex/Documents/research/bug-dete...",
6,"cnt = 10\ntop3 = [0,0,0]\n\nfor x in xrange(cn...","\ncnt = 10\ntop3 = [0,0,0]\n\nfor x in xrange(...",p00001,s473008139,s013491427,Python,py,Runtime Error,1,SyntaxError,SyntaxError: Missing parentheses in call to 'p...,"File ""/home/alex/Documents/research/bug-dete...",
8,import sys\ne = sys.stdin.readlines():\nfor i ...,import sys\ne = sys.stdin.readlines()\ne = [in...,p00001,s239033172,s000171711,Python,py,Runtime Error,1,SyntaxError,SyntaxError: invalid syntax,"File ""/home/alex/Documents/research/bug-dete...",
11,try:\n while True:\nexcept EOFError:,a=[int(input()) for i in range(10)]\na.sort(re...,p00001,s761743786,s244934112,Python,py,Runtime Error,1,IndentationError,IndentationError: expected an indented block,"File ""/home/alex/Documents/research/bug-dete...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...
54571,"import sys,math,collections,itertools,bisect\n...","import sys,math,collections,itertools,bisect\n...",p04045,s123932140,s101021309,Python,py,Runtime Error,0,0,,,2000\n
54575,"N, K = map(int, input().split())\nP = list(map...","N, K = map(int, input().split())\nP = list(map...",p04045,s236534621,s993724080,Python,py,Runtime Error,0,0,,,2000\n
54577,"def digit():\n N, K = [int(n) for n in inpu...","def has_no(s, digits):\n for ch in s:\n ...",p04045,s345758312,s307175024,Python,py,Runtime Error,1,SyntaxError,SyntaxError: invalid syntax,"File ""/home/alex/Documents/research/bug-dete...",
54586,"n = int(input())\na = sorted(map(int, input()....","n = int(input())\na = sorted(list(map(int, inp...",p04047,s604076487,s034960174,Python,py,Runtime Error,1,SyntaxError,SyntaxError: invalid syntax,"File ""/home/alex/Documents/research/bug-dete...",


<class 'pandas.core.frame.DataFrame'>
Int64Index: 14748 entries, 0 to 54590
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   original_src       14748 non-null  object
 1   changed_src        14748 non-null  object
 2   problem_id         14748 non-null  object
 3   original_id        14748 non-null  object
 4   changed_id         14748 non-null  object
 5   language           14748 non-null  object
 6   filename_ext       14748 non-null  object
 7   original_status    14748 non-null  object
 8   returncode         14748 non-null  int64 
 9   error_class        14748 non-null  object
 10  error_class_extra  14748 non-null  object
 11  error              14748 non-null  object
 12  output             14748 non-null  object
dtypes: int64(1), object(12)
memory usage: 1.6+ MB


None

SyntaxError                  8386
0                            3842
IndentationError             1081
NameError                     343
TabError                      265
TypeError                     203
ModuleNotFoundError           134
ValueError                    103
IndexError                     70
TLEError                       67
AttributeError                 66
EOFError                       58
ImportError                    49
FileNotFoundError              20
1                              11
KeyError                        7
ZeroDivisionError               7
UnboundLocalError               6
RuntimeError                    3
RecursionError                  3
OverflowError                   2
-11                             1
OSError                         1
Name: error_class, dtype: int64

It looks like flake8 did not find many sources with bugs using static analysis: at first it only reported 27 sources. That seemed wrong, since it should catch indentation and syntax errors, and the cause turned out to be a missing flush. After fixing that, the new number is 47961. The flake8 parameters may still need tuning to make it less strict. There are also a lot of 0 return codes, because I don't have the full input/output tests; the runtime errors probably happened on hidden tests.
With the current flake8 and black settings, 14747 sources are removed (ignoring long lines E501, too many blank lines E303, ambiguous names E741, the F codes and the W warnings).
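
To see why a static pass removes almost all SyntaxError, IndentationError and TabError samples, here is a rough stand-in that uses only the standard library's `ast.parse`. It is not the flake8/black filter used to build `filter_codenetpy.json`, and the function name is hypothetical; it only shows that these three classes can be detected without running the code.

```python
import ast


def static_error_class(source):
    """Return the name of the syntax-level error in `source`, or None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        # IndentationError and TabError are subclasses of SyntaxError
        return type(exc).__name__
    return None


print(static_error_class("print 'hello'"))       # Python 2 style print
print(static_error_class("def f():\nreturn 1"))  # missing indentation
print(static_error_class("x = 1\n"))             # valid code
```

Errors such as NameError or TypeError only surface when the offending line actually executes, which is why they mostly survive this kind of filter.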


One way to keep only runtime errors is to keep the following error classes:
- TypeError
- ValueError
- IndexError
- AttributeError (maybe, if not using something like mypy)
- TLEError
- ZeroDivisionError
- KeyError
- UnboundLocalError (raised when a local variable is referenced before assignment, typically when a function assigns to a name that was meant to be global)
- FileNotFoundError (hard to fix, though: missing input files are difficult to predict)
- RecursionError
- OverflowError
- RuntimeError
- OSError

The counts of these errors differ widely, from 3999 TypeError samples down to 2 OSError samples in the filtered dataset.
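
The selection above can be sketched as a simple allow-list. The names `RUNTIME_ERROR_CLASSES` and `keep_runtime_errors` are hypothetical, and the sketch works on plain dictionaries to stay self-contained.

```python
from collections import Counter

# The error classes listed above
RUNTIME_ERROR_CLASSES = {
    "TypeError", "ValueError", "IndexError", "AttributeError", "TLEError",
    "ZeroDivisionError", "KeyError", "UnboundLocalError", "FileNotFoundError",
    "RecursionError", "OverflowError", "RuntimeError", "OSError",
}


def keep_runtime_errors(samples):
    """Keep only the samples whose error class is in the allow-list."""
    return [s for s in samples if s["error_class"] in RUNTIME_ERROR_CLASSES]


# Hypothetical samples, for illustration only
samples = [
    {"original_id": "a1", "error_class": "TypeError"},
    {"original_id": "a2", "error_class": "SyntaxError"},
    {"original_id": "a3", "error_class": "0"},
    {"original_id": "a4", "error_class": "KeyError"},
]
runtime_samples = keep_runtime_errors(samples)
print(Counter(s["error_class"] for s in runtime_samples))
```

On the dataframes from the previous section the same selection would be `err_filter_codenetpy_df[err_filter_codenetpy_df["error_class"].isin(RUNTIME_ERROR_CLASSES)]`.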

## Examples of buggy code

In this section we look at samples of buggy and accepted submissions. The diff between the two submissions is shown in red (for the buggy one) and blue (for the accepted one). For each case we also show the error message associated with the buggy sample.
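
The highlighting relies on the character-level opcodes of `difflib.SequenceMatcher`. The small helper below, whose name is hypothetical, lists the changed pieces of a pair so the opcodes are easier to read; the two strings are shortened versions of one of the pairs shown earlier.

```python
from difflib import SequenceMatcher


def changed_pieces(original, changed):
    """Return (tag, original_piece, changed_piece) for every non-equal opcode."""
    matcher = SequenceMatcher(None, original, changed)
    return [
        (tag, original[i1:i2], changed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


buggy = "s.append(1)"
fixed = "s=[]\ns.append(1)"
for tag, old, new in changed_pieces(buggy, fixed):
    print(tag, repr(old), "->", repr(new))
```

Each tag is one of `replace`, `delete` or `insert`; for an `insert` the original piece is empty, which is why the display code below treats that case separately.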

In [5]:
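import html

buggy_df = codenetpy_df[codenetpy_df['returncode'] != 0]

def color_source(source_code, mask, color):
    """Render source_code as HTML, coloring the characters where mask is 1."""
    spans = []
    for char, flag in zip(source_code, mask):
        norm_color = 'black'
        if char == ' ':
            # Make spaces visible
            char = "•"
            norm_color = 'lightgrey'
        elif char == '\n':
            # Make line breaks visible
            char = "↵\n"
            norm_color = 'lightgrey'
        else:
            # Escape <, > and & so the source is displayed literally
            char = html.escape(char)
        spans.append(f'<span style="color:{color if flag == 1 else norm_color};">{char}</span>')
    return "<pre>" + "".join(spans) + "</pre>"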

def display_example(i):
    original_src, changed_src, error_class_extra = buggy_df.iloc[i][['original_src', 'changed_src', 'error_class_extra']]

    s = SequenceMatcher(None, original_src, changed_src)
    opcodes = [x for x in s.get_opcodes() if x[0] != "equal"]

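    # One 0/1 flag per character; 1 marks a character touched by the diff
    original_labels = np.zeros(len(original_src), dtype=np.int32)
    changed_labels = np.zeros(len(changed_src), dtype=np.int32)
    for op, i1, i2, j1, j2 in opcodes:
        if op == 'insert':
            # Nothing was removed from the original, so mark the character
            # at the insertion point to show where the new text goes
            original_labels[i1: i1+1] = 1
            changed_labels[j1: j2+1] = 1
        else:
            # 'replace' and 'delete': mark the affected range on both sides
            original_labels[i1:i2] = 1
            changed_labels[j1:j2] = 1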

    original_labels = original_labels.tolist()
    changed_labels = changed_labels.tolist()

    display(HTML(f"<h1>Example {i}</h1>"))
    
    display(HTML("<h2>The source code that is buggy:\n</h2>"))
    display(HTML(color_source(original_src, original_labels, color='red')))

    display(HTML("<h2>The source code that is accepted:\n</h2>"))
    display(HTML(color_source(changed_src, changed_labels, color='blue')))

    display(HTML("<h2>The bug that should be assigned to the original_src:\n</h2>"))
    display(HTML(f"<pre>{error_class_extra}</pre>"))

for i in range(10):
    display_example(i)