# CIRCT Issue Statistics

This notebook analyzes the GitHub issues exported using [GitHub CSV Tools](https://github.com/gavinr/github-csv-tools).

Because GitHub CSV Tools exports the issues, pull requests, and associated comments as a single CSV file, we first split the data into separate dataframes for easier analysis.

In [None]:
import json
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
# load the raw export
df = pd.read_csv("2023-01-06-14-25-56-issues.csv")
df

In [None]:
# show all columns
# note that not all rows will have an entry for each column
df.columns.values

Each row corresponds to one of the following:
- An issue: A row corresponding to an issue contains values for `issue.*` columns
- A pull request: A row corresponding to a PR has values for the same columns as issues, but the `issue.html_url` is `https://github.com/llvm/circt/pull/<pr_id>` rather than `https://github.com/llvm/circt/issues/<issue_id>`
- A comment: A row corresponding to a comment on an issue or a pull request inherits all the same fields as the issue or pull request, except that the `comment.*` fields are also populated.

Using these facts, we can separate the issues from the pull requests by inspecting the `issue.html_url`, and separate the comments from the parent issue/PR by checking whether the `comment.*` fields are populated or contain `NaN`.

Other notable fields of interest are as follows:
- `issue.url`: The GitHub API URL to retrieve this issue/PR/comment. All rows are guaranteed to contain this field, so we can use it to uniquely index issue/PR threads. However, we must filter out the associated comment rows before calculating statistics.
- `issue.labels`: A JSON-encoded list of labels assigned to this issue/PR. This is how we will filter out issues that correspond to bugs and determine which component of CIRCT the issue is associated with.
- `issue.body`: The text description of the issue.
- `comment.body`: The text of the comment.
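
As a minimal sketch of the classification rules above, the following toy example tags each row with its kind in a single pass. The three-row dataframe, the timestamp value, and the `kind` column are made up for illustration; only the two column names come from the export.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row export: one issue, one PR, and one comment on that PR.
toy = pd.DataFrame({
    "issue.html_url": [
        "https://github.com/llvm/circt/issues/1",
        "https://github.com/llvm/circt/pull/2",
        "https://github.com/llvm/circt/pull/2",
    ],
    "comment.created_at": [np.nan, np.nan, "2023-01-01T00:00:00Z"],
})

# A row belongs to a PR thread if its URL contains "/pull/".
is_pr = toy["issue.html_url"].str.contains("/pull/", regex=False)
# A row is a comment if its comment.* fields are populated.
is_comment = toy["comment.created_at"].notna()

# np.select picks the first matching condition for each row.
toy["kind"] = np.select(
    [is_comment & is_pr, is_comment, is_pr],
    ["pr_comment", "issue_comment", "pr"],
    default="issue",
)
print(toy["kind"].tolist())
```

A single `kind` column like this is an alternative to keeping four separate dataframes; which form is more convenient depends on the statistics computed afterwards.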

In [None]:
# only closed issues
closed_filter = ~pd.isna(df["issue.closed_at"])
closed_prs_and_issues = df[closed_filter]
# separate pull requests and issues
# match the "/pull/" path segment literally rather than any occurrence of "pull"
pr_filter = closed_prs_and_issues["issue.html_url"].str.contains("/pull/", regex=False)
prs_and_comments = closed_prs_and_issues[pr_filter]
issues_and_comments = closed_prs_and_issues[~pr_filter]

# separate comments from pull requests
pr_comment_filter = pd.isna(prs_and_comments["comment.created_at"])
prs = prs_and_comments[pr_comment_filter]
pr_comments = prs_and_comments[~pr_comment_filter]

# separate comments from issues
issue_comment_filter = pd.isna(issues_and_comments["comment.created_at"])
issues = issues_and_comments[issue_comment_filter]
issue_comments = issues_and_comments[~issue_comment_filter]

Because several issues can be opened for the same bug, we index by pull request, assuming that a pull request linking to one or more issues tagged as bugs fixes a single common bug.

In [None]:
def has_label(item, label_name):
    """Return True if the JSON-encoded label list `item` contains `label_name`."""
    return any(label["name"] == label_name for label in json.loads(item))

def is_bug(item):
    return has_label(item, "bug")

bug_mask = issues["issue.labels"].apply(is_bug)
bug_issues = issues[bug_mask]
bug_numbers = bug_issues["issue.number"]
# save this to a file so that we can reference these issues manually later
bug_numbers.to_csv("bug_numbers.csv")
bug_numbers

In [None]:
# get all the labels that are associated with issues that are labeled as bugs
all_bug_labels = set()
for labels_json in bug_issues["issue.labels"]:
    for label in json.loads(labels_json):
        # set.add already ignores duplicates
        all_bug_labels.add(label["name"])
all_bug_labels

In [None]:
total_bugs = len(bug_issues)
print(f"Total Bug Issues: {total_bugs}")
for bug_label in all_bug_labels:
    label_count = bug_issues["issue.labels"].apply(lambda item: has_label(item, bug_label)).sum()
    print(f"Label: {bug_label}, Count: {label_count}, Proportion: {label_count/total_bugs}")
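
The loop above re-parses every JSON label list once per label. A possible alternative, sketched here on made-up data, parses each list once and lets pandas do the counting. The `toy_labels` series and the `component-a`/`component-b` label names are hypothetical stand-ins for `bug_issues["issue.labels"]`.

```python
import json
import pandas as pd

# Hypothetical stand-in for bug_issues["issue.labels"]: JSON-encoded label lists.
toy_labels = pd.Series([
    '[{"name": "bug"}, {"name": "component-a"}]',
    '[{"name": "bug"}]',
    '[{"name": "bug"}, {"name": "component-a"}, {"name": "component-b"}]',
])

# Parse each JSON string once and keep only the label names.
label_names = toy_labels.apply(
    lambda item: [label["name"] for label in json.loads(item)]
)

# explode() yields one row per (issue, label) pair; value_counts() tallies them.
label_counts = label_names.explode().value_counts()
label_proportions = label_counts / len(toy_labels)
print(label_counts)
print(label_proportions)
```

On the real data, replacing `toy_labels` with `bug_issues["issue.labels"]` should give the same counts as the loop, provided no issue lists the same label twice.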

In [None]:
# plot every label except the generic "bug" and "good first issue" labels
labels_to_plot = [label for label in all_bug_labels if label not in ("bug", "good first issue")]
x = list()
y = list()
for bug_label in labels_to_plot:
    x.append(bug_label)
    label_count = bug_issues["issue.labels"].apply(lambda item: has_label(item, bug_label)).sum()
    y.append(label_count)
data = pd.DataFrame({"Issue Label": x, "Number of Occurrences": y})
data.sort_values(by="Number of Occurrences", ascending=False, inplace=True)
ax = sns.barplot(data=data, x="Issue Label", y="Number of Occurrences")
# rotate the tick labels and adjust the layout after the bars have been drawn
plt.xticks(rotation=70)
for container in ax.containers:
    ax.bar_label(container)
plt.tight_layout()

In [None]:
### find all pull requests that mention an issue (assumption is that a PR
### mentioning an issue == a PR fixing said issue)

# first we extract the mentioned issue numbers from each PR
mentioned_numbers = prs["issue.body"].str.extractall(r"#(\d{1,4})")[0]
mentioned_numbers_comments = pr_comments["comment.body"].str.extractall(r"#(\d{1,4})")[0]
# the extractall function returns one row per match. Each row is indexed with a
# multi-index where the first level is the original row index, and the second
# level is the match number (starting from zero)

# we turn these series of mentioned numbers into a boolean mask that
# selects PRs that mention an issue tagged as a bug
def does_mention_bug(numbers):
    """Checks if the given series of number strings references an issue labeled as a bug"""
    return bool(numbers.astype(int).isin(bug_numbers).any())

mention_filter_data = list()
for idx in prs.index:
    mentions_bug = False
    # check if the pr mentions a number in the body text
    if (idx, 0) in mentioned_numbers.index:
        mentions_bug |= does_mention_bug(mentioned_numbers.loc[idx])
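    # check if the pr mentions a number in a comment
    # NOTE: mentioned_numbers_comments is indexed by comment rows, not PR rows,
    # so this lookup never matches for idx in prs.index; comment mentions would
    # first need to be mapped to their parent PR (e.g. via "issue.number")
    if (idx, 0) in mentioned_numbers_comments.index:
        mentions_bug |= does_mention_bug(mentioned_numbers_comments.loc[idx])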
    mention_filter_data.append(mentions_bug)
mention_filter = pd.Series(data=mention_filter_data, index=prs.index)

prs_that_mention = prs[mention_filter]
prs_that_mention

We find only 81 PRs that satisfy this criterion, which seems too low to be correct. Looking for mentions of issues within the PR body and comments may not be a reliable way of matching PRs to issues.

However, in GitHub's web UI, many of the issues are linked to a pull request even if the PR is not explicitly mentioned in the issue description. We may want to look into the GitHub API to see if there's a way to extract this information.
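
One possible starting point is the REST timeline endpoint for issues. The sketch below is untested against the live API and rests on two assumptions: that the endpoint has the URL shown, and that `cross-referenced` events carry the referencing item under `source.issue`, with a `pull_request` key present when that item is a PR. The helper names `fetch_timeline` and `cross_referencing_prs` and the sample events are made up for illustration.

```python
import json
import urllib.request

# Assumption: the REST "timeline" endpoint for a single issue.
TIMELINE_URL = "https://api.github.com/repos/{owner}/{repo}/issues/{number}/timeline"


def fetch_timeline(owner, repo, number, token=None):
    """Fetch the first page of timeline events for one issue (no pagination)."""
    request = urllib.request.Request(
        TIMELINE_URL.format(owner=owner, repo=repo, number=number),
        headers={"Accept": "application/vnd.github+json"},
    )
    if token is not None:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def cross_referencing_prs(events):
    """Return the sorted numbers of pull requests that cross-reference the issue."""
    pr_numbers = set()
    for event in events:
        if event.get("event") != "cross-referenced":
            continue
        source_issue = (event.get("source") or {}).get("issue") or {}
        # Assumption: PRs appear as issue objects carrying a "pull_request" key.
        if "pull_request" in source_issue:
            pr_numbers.add(source_issue["number"])
    return sorted(pr_numbers)


# Hand-written events in the assumed shape; this is not real API output.
sample_events = [
    {"event": "labeled"},
    {"event": "cross-referenced",
     "source": {"type": "issue", "issue": {"number": 12, "pull_request": {}}}},
    {"event": "cross-referenced",
     "source": {"type": "issue", "issue": {"number": 7}}},
]
print(cross_referencing_prs(sample_events))
```

On real data, `cross_referencing_prs(fetch_timeline("llvm", "circt", number))` could be called for each entry in `bug_numbers`, keeping in mind that unauthenticated requests are rate-limited. A cross-reference is also weaker evidence than an explicit link, so the result would still need spot-checking against the web UI.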