# GitHub Repository Metric Analysis

In this notebook, we analyze the GitHub data collected for the repositories listed in [sigs.yaml](https://github.com/open-services-group/community/blob/main/sigs.yaml) and derive some basic metrics, such as the number of open issues/PRs, the number of closed issues/PRs, and the median time to close issues/PRs.

This notebook serves as a template for analyzing different GitHub repositories, so that it can be executed automatically as part of our metrics processing pipeline. It can be run in parallel for different repositories by passing in, as an argument, the GitHub repository to analyze.

(Related issues: [Issue 1](https://github.com/open-services-group/metrics/issues/19))

In [1]:
import os
import datetime as dt
import numpy as np
from dotenv import find_dotenv, load_dotenv
from matplotlib import pyplot as plt
import warnings
import trino

from s3_communication import S3Communication

# Note: The GitHub access token needs to be exported before importing the srcopsmetrics package (current bug)
from srcopsmetrics.entities.issue import Issue  # noqa: E402
from srcopsmetrics.entities.pull_request import PullRequest  # noqa: E402

warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

True

In [2]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_SECRET_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [3]:
## Create a .env file on your local machine with the correct configs
REPO = os.getenv("REPO")
ORG = os.getenv("ORG")
GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")

In [4]:
repo_slug = f"{ORG}/{REPO}"
repo_slug

'os-climate/aicoe-osc-demo'

In [5]:
# Gather the data
!python -m srcopsmetrics.cli -clr $repo_slug -e Issue,PullRequest

INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing os-climate/aicoe-osc-demo ########################

INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Issue inspection
INFO:srcopsmetrics.entities.tools.storage:Loading knowledge locally
INFO:srcopsmetrics.entities.tools.storage:Data from file %s loaded
INFO:srcopsmetrics.entities.interface:Found previous Issue knowledge for os-climate/aicoe-osc-demo with 77 records
INFO:srcopsmetrics.iterator:-------------Issue An

## Issue Metrics

Now, let's fetch the issues for the repository and derive some metrics.

In [6]:
issue = Issue(repo_slug)
issue_df = issue.load_previous_knowledge(is_local=True)
issue_df.head()

KeyboardInterrupt: 

In [None]:
issues_df = issue_df.reset_index()

In [None]:
issues_df.head()

In [None]:
issue_cols_to_drop = ["labels", "interactions"]
issue_df = issues_df.drop(columns=issue_cols_to_drop)

issue_df["org"] = ORG
issue_df["repo"] = REPO

issue_df.head()

## todo: fix s3 communication

In [None]:
s3c.upload_df_to_s3(
    df=issue_df,  # the cleaned frame (columns dropped, org/repo added)
    s3_prefix="open-services-group/metrics/github/issues",
    s3_key=f"{ORG}-{REPO}.parquet",
)

## PR Metrics

Now, let's fetch the PRs for the repository and derive some metrics.

In [None]:
pr = PullRequest(repo_slug)
pr_df = pr.load_previous_knowledge(is_local=True)
pr_df.head()

In [None]:
pr_df = pr_df.reset_index()

In [None]:
pr_df.head()

In [None]:
pr_cols_to_drop = ["interactions", "reviews", "labels", "commits", "changed_files"]
prs_df = pr_df.drop(columns=pr_cols_to_drop)

prs_df["org"] = ORG
prs_df["repo"] = REPO

prs_df.head()

In [None]:
s3c.upload_df_to_s3(
    df=prs_df,
    s3_prefix="open-services-group/metrics/github/prs",
    s3_key=f"{ORG}-{REPO}.parquet",
)

## Create Trino Tables

In [None]:
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# add ability to specify optional dict for specific fields?
# if column name is present, use specified value?
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])
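The comments above `generate_table_schema_pairs` ask whether specific fields could take an explicitly chosen SQL type. A minimal sketch of that idea follows. The function name, the `overrides` parameter, and the example columns are all hypothetical and not part of the notebook.

```python
import pandas as pd

# Same pandas-to-SQL type mapping as in the cell above.
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def generate_table_schema_pairs_with_overrides(df, overrides=None):
    """Hypothetical variant: `overrides` maps a column name to a SQL type."""
    overrides = overrides or {}
    lines = []
    for name, dtype in df.dtypes.items():
        # An explicit override wins; otherwise fall back to the dtype mapping.
        sql_type = overrides.get(name) or _p2smap.get(str(dtype))
        if sql_type is None:
            raise ValueError(f"unexpected pandas column type '{dtype}'")
        lines.append(f"    {name} {sql_type}")
    return ",\n".join(lines)


# Made-up example frame, for illustration only.
example_df = pd.DataFrame(
    {
        "id": pd.Series([1, 2], dtype="int64"),
        "title": pd.Series(["a", "b"], dtype="object"),
        "created_at": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    }
)
schema = generate_table_schema_pairs_with_overrides(
    example_df, overrides={"created_at": "timestamp(3)"}
)
print(schema)
```

With an override in place, a column whose pandas dtype is missing from the mapping no longer has to raise, as long as its SQL type is supplied by name.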

In [None]:
# Create a Trino client
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

In [None]:
cur.execute("show catalogs")
cur.fetchall()[1]

In [None]:
# bucket = s3c.s3_resource.Bucket(os.environ["S3_BUCKET"])
# for i in bucket.objects.all():
#     print(i)

In [None]:
issue_schema = generate_table_schema_pairs(issue_df)

tabledef = """create table if not exists data_science_general.default.issues(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/open-services-group/metrics/github/issues'
)""".format(
    schema=issue_schema,
    s3_bucket=os.environ["S3_BUCKET"],
)

cur.execute(tabledef)
cur.fetchall()
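The same statement could be rendered for other uploaded datasets, such as the PR data stored under the `prs` prefix. The sketch below only builds the SQL string. The helper `render_create_table`, the table name `data_science_general.default.prs`, the bucket name, and the placeholder schema are assumptions for illustration.

```python
def render_create_table(table, schema, s3_bucket, s3_prefix):
    """Render a CREATE TABLE statement for an external Parquet location."""
    return (
        f"create table if not exists {table}(\n"
        f"{schema}\n"
        ") with (\n"
        "    format = 'parquet',\n"
        f"    external_location = 's3a://{s3_bucket}/{s3_prefix}'\n"
        ")"
    )


pr_tabledef = render_create_table(
    table="data_science_general.default.prs",  # hypothetical table name
    schema="    id bigint,\n    title varchar",  # placeholder schema
    s3_bucket="example-bucket",  # placeholder bucket name
    s3_prefix="open-services-group/metrics/github/prs",
)
print(pr_tabledef)
```

In the notebook, the schema string would come from `generate_table_schema_pairs(prs_df)` and the bucket from the `S3_BUCKET` environment variable, and the result would be passed to `cur.execute`.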

### Number of issues created over time

In [None]:
# Let's find the number of issues created daily
issues_created_daily = (
    issues_df["created_at"].groupby(issues_df.created_at.dt.to_period("D")).agg("count")
)

In [None]:
issues_created_daily.head()

In [None]:
issues_created_daily.plot.bar()

plt.xlabel("Days")
locs, labels = plt.xticks()
N = 10
plt.xticks(locs[::N], issues_created_daily.index[::N].strftime("%b %Y"))
plt.xticks(rotation=45)
plt.ylabel("# Issues")
plt.title("# Daily Issues Created")
plt.show()
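The daily counts above only contain days on which at least one issue was created, so the bars are not evenly spaced in time. One possible way to make the gaps visible is to reindex the counts over the full range of days and fill the missing days with zero. This is a sketch on made-up timestamps, not the notebook's data.

```python
import pandas as pd

# Made-up creation timestamps: two issues on Jan 1, one on Jan 4.
created_at = pd.Series(
    pd.to_datetime(["2021-01-01 10:00", "2021-01-01 12:00", "2021-01-04 09:00"]),
    name="created_at",
)

# Same grouping as in the notebook: count the issues created per day.
daily = created_at.groupby(created_at.dt.to_period("D")).count()

# Reindex over every day between the first and last observation,
# so that days without any issue show up with a count of 0.
full_range = pd.period_range(daily.index.min(), daily.index.max(), freq="D")
daily_filled = daily.reindex(full_range, fill_value=0)
print(daily_filled)
```

The same pattern would apply to the daily PR counts.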

### Number of open issues

In [None]:
num_open_issues = issues_df["closed_at"].isna().sum()
num_open_issues

### Number of closed issues

In [None]:
num_close_issues = issues_df["closed_at"].notnull().sum()
num_close_issues

### Median time to close issues

In [None]:
# Calculate the time taken to close an issue
issues_df["time_to_close"] = issues_df.closed_at - issues_df.created_at
issues_df.head()

Now let's find out the median time taken to close issues grouped by month.

In [None]:
issues_closed_monthly = (
    issues_df["time_to_close"]
    .groupby(issues_df.created_at.dt.to_period("M"))
    .agg("median")
)

In [None]:
issues_closed_monthly.head()

We can visualize the trend in the median time to close issues by month. Because a few outliers can dominate the plot, one option is to take the log of the values before plotting. We should also decide which unit to report the median in: days, hours, minutes, or seconds.

Let us first consider the different levels of granularity for the median time to close issues.

In [None]:
# Dividing by a unit timedelta returns floats (fractional units are kept) and
# works across pandas versions; astype("timedelta64[D]") fails on pandas >= 2.0.
# days
issues_closed_monthly_days = issues_closed_monthly / np.timedelta64(1, "D")
# hours
issues_closed_monthly_hours = issues_closed_monthly / np.timedelta64(1, "h")
# minutes
issues_closed_monthly_minutes = issues_closed_monthly / np.timedelta64(1, "m")
# seconds
issues_closed_monthly_seconds = issues_closed_monthly / np.timedelta64(1, "s")

We will now consider the granularity level to be "days" and plot the median time to close issues grouped by months.

In [None]:
issues_closed_monthly_days.plot()
plt.xlabel("Month")
plt.ylabel("Median time to close (days)")
plt.title("Median Time to Close Issues (Monthly)")
plt.show()

### Number of PRs created over time

In [None]:
# Let's find the number of PRs created daily
pr_created_daily = (
    pr_df["created_at"].groupby(pr_df.created_at.dt.to_period("D")).agg("count")
)

In [None]:
pr_created_daily.head()

In [None]:
pr_created_daily.plot.bar()

plt.xlabel("Days")
locs, labels = plt.xticks()
N = 10
plt.xticks(locs[::N], pr_created_daily.index[::N].strftime("%b %Y"))
plt.xticks(rotation=45)
plt.ylabel("# PRs")
plt.title("# Daily PRs Created")
plt.show()

### Number of open PRs

In [None]:
num_open_prs = pr_df["closed_at"].isna().sum()
num_open_prs

### Number of closed PRs

In [None]:
num_close_prs = pr_df["closed_at"].notnull().sum()
num_close_prs

### Ratio of opened to closed PRs over the last 90 days (quarter)

In [None]:
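# Use a single cutoff so both counts cover exactly the same 90-day window
cutoff_90d = dt.datetime.now() - dt.timedelta(days=90)
num_open_prs_90d = len(pr_df[pr_df["created_at"] > cutoff_90d])
num_closed_prs_90d = len(pr_df[pr_df["closed_at"] > cutoff_90d])
print("Number of PRs opened in the last 90 days:", num_open_prs_90d)
print("Number of PRs closed in the last 90 days:", num_closed_prs_90d)

# Avoid a ZeroDivisionError when no PR was closed in the window
ratio = num_open_prs_90d / num_closed_prs_90d if num_closed_prs_90d else float("nan")
print("Ratio of opened to closed PRs over the last 90 days:", ratio)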

A ratio of 1 indicates that PRs were closed at the same rate as they were opened over the past 90 days. A ratio above 1 means the backlog of open PRs grew during that period, so we should aim for a ratio of 1 or lower.
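As a worked example, the ratio can be computed on a handful of made-up PRs with a fixed reference date, which keeps the result reproducible. The helper name `opened_to_closed_ratio` and the toy data are hypothetical.

```python
import datetime as dt

import pandas as pd


def opened_to_closed_ratio(df, now, days=90):
    """Return (opened, closed, ratio) for the window ending at `now`."""
    cutoff = now - dt.timedelta(days=days)
    opened = int((df["created_at"] > cutoff).sum())
    # Open PRs have a missing closed_at, which never compares as True.
    closed = int((df["closed_at"] > cutoff).sum())
    ratio = opened / closed if closed else float("nan")
    return opened, closed, ratio


# Made-up PRs: one old PR closed inside the window, three opened inside it.
toy_prs = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2021-01-01", "2021-05-01", "2021-05-20", "2021-06-01"]
        ),
        "closed_at": pd.to_datetime(["2021-04-15", "2021-05-03", None, None]),
    }
)
opened, closed, ratio = opened_to_closed_ratio(toy_prs, now=dt.datetime(2021, 6, 30))
print(opened, closed, ratio)
```

Here three PRs were opened and two were closed after the cutoff of 2021-04-01, giving a ratio of 1.5. One of the two closed PRs was created before the window, which shows that the two counts do not necessarily refer to the same PRs.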

### Median time to close PRs

In [None]:
# Calculate the time taken to close a PR
pr_df["time_to_close"] = pr_df.closed_at - pr_df.created_at
pr_df.head()

Now let's find out the median time taken to close PRs grouped by month.

In [None]:
prs_closed_monthly = (
    pr_df["time_to_close"].groupby(pr_df.created_at.dt.to_period("M")).agg("median")
)
prs_closed_monthly

We can visualize the trend in the median time to close PRs by month. Because a few outliers can dominate the plot, one option is to take the log of the values before plotting. We should also decide which unit to report the median in: days, hours, minutes, or seconds.

Let us first consider the different levels of granularity for the median time to close PRs.

In [None]:
# Dividing by a unit timedelta returns floats (fractional units are kept) and
# works across pandas versions; astype("timedelta64[D]") fails on pandas >= 2.0.
# days
prs_closed_monthly_days = prs_closed_monthly / np.timedelta64(1, "D")
# hours
prs_closed_monthly_hours = prs_closed_monthly / np.timedelta64(1, "h")
# minutes
prs_closed_monthly_minutes = prs_closed_monthly / np.timedelta64(1, "m")
# seconds
prs_closed_monthly_seconds = prs_closed_monthly / np.timedelta64(1, "s")

We can now normalize the values by taking their natural log.

In [None]:
prs_closed_monthly_days_norm = np.log(prs_closed_monthly_days)
prs_closed_monthly_hours_norm = np.log(prs_closed_monthly_hours)
prs_closed_monthly_minutes_norm = np.log(prs_closed_monthly_minutes)
prs_closed_monthly_seconds_norm = np.log(prs_closed_monthly_seconds)
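One caveat with a plain log is that a median of zero maps to negative infinity, which can happen when PRs are closed almost immediately. A possible alternative, sketched here on made-up values, is `np.log1p`, which computes log(1 + x) and stays finite at zero.

```python
import numpy as np
import pandas as pd

# Made-up monthly medians in hours, including a zero and a large outlier.
median_hours = pd.Series([0.0, 2.0, 48.0, 2000.0])

# log1p(x) = log(1 + x): zero maps to zero instead of -inf.
log_hours = np.log1p(median_hours)

# expm1 is the inverse of log1p and recovers the original hours.
recovered_hours = np.expm1(log_hours)
print(log_hours)
```

The transformed values are no longer in hours, so axis labels on any plot of them should say that a log scale is used.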

In [None]:
prs_closed_monthly_hours_norm

We will now consider the granularity level to be "hours" and plot the log-normalized median time to close PRs grouped by month.

In [None]:
prs_closed_monthly_hours_norm.plot()
plt.xlabel("Month")
plt.ylabel("log(median time to close in hours)")
plt.title("Median Time to Close PRs (Monthly, log scale)")
plt.show()