# Labelled Issues Dataset
The plan is to mine all closed issues, look up their ids in the tests, and produce a CSV whose first column holds the JSON of an issue and whose second column holds its label (pos, neg or run).

## Mine closed issues from GitHub

In [1]:
import requests, json, csv
from util import issues

In [13]:
allIssues = []
i = 0
while True:
  nextPage = issues(page=i, state='closed', per_page=100)
  if not nextPage:
    break
  allIssues.extend(nextPage)
  print('[{page}]: got {num} issues, in total {numTotal}'.format(page=i, num=len(nextPage), numTotal = len(allIssues)))
  i+=1

[0]: got 100 issues, in total 100
[1]: got 100 issues, in total 200
[2]: got 100 issues, in total 300
[3]: got 100 issues, in total 400
[4]: got 100 issues, in total 500
[5]: got 100 issues, in total 600
[6]: got 100 issues, in total 700
[7]: got 100 issues, in total 800
[8]: got 100 issues, in total 900
[9]: got 100 issues, in total 1000
[10]: got 100 issues, in total 1100
[11]: got 100 issues, in total 1200
[12]: got 100 issues, in total 1300
[13]: got 100 issues, in total 1400
[14]: got 100 issues, in total 1500
[15]: got 100 issues, in total 1600
[16]: got 100 issues, in total 1700
[17]: got 100 issues, in total 1800
[18]: got 100 issues, in total 1900
[19]: got 100 issues, in total 2000
[20]: got 100 issues, in total 2100
[21]: got 100 issues, in total 2200
[22]: got 100 issues, in total 2300
[23]: got 100 issues, in total 2400
[24]: got 100 issues, in total 2500
[25]: got 100 issues, in total 2600
[26]: got 100 issues, in total 2700
[27]: got 100 issues, in total 2800
[28]: got 1
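
The `issues` helper comes from the local `util` module, which is not shown here. A minimal sketch of what such a helper and the paging loop could look like follows, assuming the GitHub REST endpoint for listing repository issues. The names `github_page` and `collect_all`, and the `owner` and `repo` parameters, are hypothetical. GitHub numbers pages from 1, so this sketch starts there; a helper that forwards `page` unchanged may need an offset.

```python
def github_page(owner, repo, page, state="closed", per_page=100):
    """Hypothetical stand-in for util.issues: fetch one page of issues."""
    # Imported lazily so the paging logic below can be tried without network access.
    import requests

    response = requests.get(
        "https://api.github.com/repos/{0}/{1}/issues".format(owner, repo),
        params={"state": state, "page": page, "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def collect_all(fetch_page, first_page=1):
    """Call fetch_page(page) for increasing pages until an empty page comes back."""
    collected = []
    page = first_page
    while True:
        batch = fetch_page(page)
        if not batch:
            return collected
        collected.extend(batch)
        page += 1
```

With these two pieces, the loop above would reduce to something like `collect_all(lambda p: github_page(owner, repo, p))`, where `owner` and `repo` identify the repository being mined.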

In [14]:
issuesWithoutPrs = [i for i in allIssues if 'pull_request' not in i]

In [18]:
len(issuesWithoutPrs)

4365
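
The filter above relies on the fact that GitHub's issue listing returns pull requests too, and marks them with a `pull_request` key. A small sketch on made-up records shows the split; the helper name `split_issues_and_prs` and the sample data are illustrative only.

```python
def split_issues_and_prs(items):
    """Separate plain issues from pull requests by the 'pull_request' key."""
    issues_only = [item for item in items if "pull_request" not in item]
    pull_requests = [item for item in items if "pull_request" in item]
    return issues_only, pull_requests


# Made-up records shaped like entries of the issue listing.
sample = [
    {"number": 1, "title": "Compiler crash"},
    {"number": 2, "title": "Fix compiler crash", "pull_request": {"url": "https://example.invalid/pulls/2"}},
    {"number": 3, "title": "Wrong error message"},
]

issues_only, pull_requests = split_issues_and_prs(sample)
print(len(issues_only), len(pull_requests))
```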

In [17]:
with open('data/labelled-issues/closed_issues.json', 'w') as f:
  json.dump(issuesWithoutPrs, f, indent=2)

## Label them with a corresponding test and write as a JSON file

In [5]:
import json
with open('data/labelled-issues/closed_issues.json', 'r') as f:
  issuesWithoutPrs = json.load(f)
len(issuesWithoutPrs)

4365

In [6]:
def read_tests(tests_name):
  # Each file lists one issue number per line; blank lines are skipped.
  with open('data/labelled-issues/{0}'.format(tests_name)) as f:
    return [int(line) for line in f if line.strip()]
pos_tests = read_tests('pos_tests')
neg_tests = read_tests('neg_tests')
run_tests = read_tests('run_tests')
[len(pos_tests), len(neg_tests), len(run_tests)]

[1382, 535, 634]
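
The notebook does not show how the `pos_tests`, `neg_tests` and `run_tests` files were produced. One possible way to derive such a list, assuming test files are named after the issue they cover (for example a hypothetical `tests/pos/i1234.scala`), is sketched below. The naming convention, the paths and the function `issue_ids` are assumptions, not part of the original workflow.

```python
import re
from pathlib import PurePosixPath

# Assumed convention: the file name starts with 'i' followed by the issue number.
ISSUE_ID = re.compile(r"^i(\d+)")


def issue_ids(test_paths):
    """Return the sorted, de-duplicated issue numbers found in test file names."""
    ids = set()
    for path in test_paths:
        match = ISSUE_ID.match(PurePosixPath(path).name)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


print(issue_ids(["tests/pos/i1234.scala", "tests/pos/i1234b.scala", "tests/neg/i42.scala"]))
```

Writing each returned number on its own line would give a file in the format that `read_tests` expects.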

In [7]:
def labelled(iss):
  # Return the single test label of an issue, or None when the issue
  # matches no test list or more than one (such issues are left out).
  iid = iss['number']
  labels = []
  if iid in pos_tests:
    labels.append('pos')
  if iid in neg_tests:
    labels.append('neg')
  if iid in run_tests:
    labels.append('run')
  return labels[0] if len(labels) == 1 else None

labelled_issues = [{'issue': i, 'label': labelled(i)} for i in issuesWithoutPrs if labelled(i)]
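
Because issues with no label or with several labels are dropped silently, it may be worth checking how the labels are distributed before writing the file. The sketch below does this on made-up ids; `label_of` and `tests_by_label` are hypothetical names, and sets are used instead of lists so that membership checks stay cheap.

```python
from collections import Counter


def label_of(number, tests_by_label):
    """Return the only label whose id set contains number, else None."""
    matches = [label for label, ids in tests_by_label.items() if number in ids]
    return matches[0] if len(matches) == 1 else None


# Made-up ids: issue 3 is ambiguous (pos and neg), issue 6 has no test at all.
tests_by_label = {"pos": {1, 2, 3}, "neg": {3, 4}, "run": {5}}
numbers = [1, 2, 3, 4, 5, 6]

labels = {number: label_of(number, tests_by_label) for number in numbers}
distribution = Counter(label for label in labels.values() if label)
dropped = [number for number, label in labels.items() if label is None]

print(distribution)
print(dropped)
```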

In [9]:
with open('data/labelled-issues/closed_labelled_issues.json', 'w') as f:
  json.dump(labelled_issues, f, indent=2)

## Prepare a CSV file for further exploration
This file records every field that may be relevant for classifying issues.

In [19]:
import json, csv
with open('data/labelled-issues/closed_labelled_issues.json', 'r') as f:
  labelled_issues = json.load(f)

In [59]:
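import mistune
# Assumes mistune 2.x, where AstRenderer yields nodes with 'type', 'text' and 'info' keys.
markdown = mistune.create_markdown(renderer=mistune.AstRenderer())
def get_codeblocks(raw_body, filter_condition = None):
  # filter_condition is None (keep all blocks), 'empty' (blocks without a
  # language tag), a language name such as 'scala', or a predicate on a node.
  if not filter_condition:
    predicate = lambda _: True
  elif isinstance(filter_condition, str):
    if filter_condition == 'empty':
      predicate = lambda node: not node['info']
    else:
      predicate = lambda node: node['info'] and node['info'].lower() == filter_condition.lower()
  else:
    predicate = filter_condition
  # Only top-level nodes are inspected; an issue without a body is treated as empty text.
  return [node for node in markdown(raw_body or '') if node['type'] == 'block_code' and predicate(node)]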
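
As a rough cross-check of the counts that `get_codeblocks` returns, fenced blocks can also be approximated with the standard library alone. The sketch below is a regular-expression approximation, not a Markdown parser: it ignores indented code blocks and tilde fences, and `fenced_blocks` is a hypothetical name. It mirrors the same three filter modes (all blocks, `'empty'`, or a language name).

```python
import re

FENCE = "`" * 3  # built this way so the example contains no literal fence
BLOCK = re.compile(
    r"^" + FENCE + r"[ \t]*([^\n`]*)\n(.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def fenced_blocks(raw_body, language=None):
    """Return fenced code blocks as dicts with 'info' and 'text' keys."""
    # Issue bodies may be missing and may use Windows line endings.
    body = (raw_body or "").replace("\r\n", "\n")
    blocks = [
        {"info": match.group(1).strip() or None, "text": match.group(2)}
        for match in BLOCK.finditer(body)
    ]
    if language is None:
        return blocks
    if language == "empty":
        return [block for block in blocks if not block["info"]]
    return [
        block for block in blocks
        if block["info"] and block["info"].lower() == language.lower()
    ]
```

Large disagreements between the two counts for the same issue would suggest that its body uses constructs this approximation does not cover.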

In [57]:
markdown(labelled_issues[64]['issue']['body'])

[{'type': 'heading',
  'children': [{'type': 'text', 'text': 'Compiler version'}],
  'level': 2},
 {'type': 'paragraph',
  'children': [{'type': 'text', 'text': 'Latest nightly (i.e. '},
   {'type': 'codespan', 'text': 'dottyLatestNightlyBuild.get'},
   {'type': 'text', 'text': ')'}]},
 {'type': 'heading',
  'children': [{'type': 'text', 'text': 'Minimized code'}],
  'level': 2},
 {'type': 'paragraph',
  'children': [{'type': 'text',
    'text': 'Create a simple mirror type object which summons a product mirror, add an extension method for it to be able to be applied to any type:'}]},
 {'type': 'block_code',
  'text': 'object MirrorType {\n  class Container[T]\n\n  inline def decode[T]: String =\n    summonFrom {\n      case ev: Mirror.ProductOf[T] =>\n        s"Product-${new Container[ev.MirroredElemLabels]}" // This is the part that splices in the cast\n      case m: Mirror.SumOf[T] =>\n        "Sum"\n    }\n\n  inline def generic[T]: MirrorType[T] = \n    new MirrorType[T] {\n      

In [62]:
header = [
  'Issue Title',
  'Test Type',
  'Issue URL',
  'Comments count',
  'Scala codeblocks',
  'Empty codeblocks',
  'Total codeblocks',
  'First code block',
  'Second code block',
  'Issue Body',
]
dataset = [header]
for i in labelled_issues:
  issue = i['issue']
  body = issue['body']

  codeblocks = get_codeblocks(body)
  total_codeblocks = len(codeblocks)
  # Keep only issues with exactly two code blocks, so that both
  # code-block columns can be filled.
  if total_codeblocks == 2:
    entry = [
      issue['title'],
      i['label'],
      issue['html_url'],
      issue['comments'],
      len(get_codeblocks(body, 'scala')),
      len(get_codeblocks(body, 'empty')),
      total_codeblocks,
      codeblocks[0]['text'],
      codeblocks[1]['text'],
      body,
    ]

    dataset.append(entry)

In [63]:
# newline='' lets the csv module control line endings, as its documentation recommends.
with open('data/labelled-issues/issues.csv', 'w', newline='') as csvfile:
  w = csv.writer(csvfile)
  w.writerows(dataset)
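
Several columns hold multi-line text (code blocks and issue bodies), so it may help to confirm that the rows survive a round trip through the `csv` module. The sketch below uses an in-memory buffer and made-up rows rather than the real file. It also shows that numeric cells come back as strings.

```python
import csv
import io

# Made-up rows with a multi-line cell and a numeric cell.
rows = [
    ["Issue Title", "Test Type", "Comments count", "First code block"],
    ["Crash in inline def", "pos", 3, "object A {\n  val x = 1\n}"],
]

buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

buffer.seek(0)
restored = list(csv.reader(buffer))

print(restored[1][3] == rows[1][3])  # the multi-line cell is preserved
print(repr(restored[1][2]))          # the count is read back as a string
```

When reading the real file, opening it with `newline=''` keeps embedded line breaks intact in the same way.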