# Exploring Onboarding Data from Operate First

- One of the goals of this team is to establish Operate First (OPF) as the de facto platform for incubating services.
- To do that, we want to ensure a smooth customer experience.
- We believe that adding onboarding automation and a service catalog can go a long way toward achieving this goal.
- However, to measure the effect of these efforts, we need to put some evaluation metrics in place.
- This notebook explores the data available through GitHub and some possible metrics for measuring efficiency.

In [1]:
from IPython.display import display, Markdown
from dotenv import find_dotenv, load_dotenv

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

# load env vars
load_dotenv(find_dotenv())

# NOTE: the GitHub PAT needs to exist as an env var before importing srcopsmetrics;
# this is a known bug in the library
from srcopsmetrics.entities.issue import Issue  # noqa: E402
from srcopsmetrics.entities.pull_request import PullRequest  # noqa: E402

In [2]:
# default pretty graph settings
sns.set()

In [3]:
# load issue data using an entity, put it into df
issue_entity = Issue("operate-first/support")
issues_df = issue_entity.load_previous_knowledge(is_local=True)
issues_df = issues_df.reset_index()
issues_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
0,575,Deploy curator-operator in Smaug cluster,Earlier we have deployed the curator project i...,skanthed,2022-05-18 23:12:21,,NaT,{},{'dystewart': 1}
1,573,[smaug] opf-jupyterhub namespace has reached i...,,4n4nd,2022-05-05 18:53:29,HumairAK,2022-05-05 19:01:01,"{'area/service/odh': {'color': 'fc05d7', 'labe...","{'4n4nd': 1, 'first-operator[bot]': 75}"
2,572,"As the stakeholder (sig-data-science), I want ...",- [ ] determine if definitions in doc make sen...,HumairAK,2022-05-03 19:15:14,,NaT,{},{}
3,571,"As an OPF admin / PO, I want to define the ""ti...",- [ ] define the metric (issue_close - issue_o...,HumairAK,2022-05-03 19:08:44,,NaT,"{'kind/user-story': {'color': '1D76DB', 'label...",{'HumairAK': 19}
4,570,public operations project board,### Description\n\nI'm interested in being abl...,erikerlandson,2022-05-03 02:20:16,erikerlandson,2022-05-03 20:19:19,"{'kind/question': {'color': 'd455d0', 'labeled...","{'HumairAK': 14, 'erikerlandson': 3}"


In [6]:
# let's take a look at which labels are used
labels_on_issue = issues_df["labels"].apply(
    lambda x: x.keys()
)

# flatten the per-issue label collections and remove duplicates
unique_labels = list(
    {label for labels in labels_on_issue for label in labels}
)

unique_labels[:10]

['onboarding-argocd',
 'kind/feature',
 'kind/website',
 'work-in-progress',
 'kind/demo',
 'lifecycle/rotten',
 'kind/flake',
 'question',
 'area/contributor',
 'priority/important-soon']

In [None]:
# how many times has each label been used
# explode the per-issue label lists into one row per (issue, label) pair,
# then count occurrences of each label (issues without labels are dropped)
label_freq = labels_on_issue.apply(list).explode().value_counts()
label_freq
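As a dependency-free cross-check of the label counts, the same tally can be sketched with `collections.Counter`. The toy list below only mimics the shape of `issues_df["labels"]` (one dict per issue, keyed by label name); its contents and the helper name `count_labels` are made up for illustration.

```python
from collections import Counter

# Toy stand-in for issues_df["labels"]: one dict per issue, keyed by label name.
# The label metadata here is illustrative, not real data.
toy_labels = [
    {"kind/onboarding": {"color": "c7def8"}, "kind/question": {"color": "d455d0"}},
    {"kind/onboarding": {"color": "c7def8"}},
    {},  # an issue with no labels
]


def count_labels(label_dicts):
    """Count how many issues carry each label name."""
    counts = Counter()
    for labels in label_dicts:
        # iterating a dict yields its keys, i.e. the label names
        counts.update(labels.keys())
    return counts


label_counts = count_labels(toy_labels)
print(label_counts.most_common())
```

On the real data, `count_labels(issues_df["labels"])` should agree with the pandas result, assuming every entry in that column is a dict.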

In [4]:
# get issues having the onboarding labels
onboard_labels = {'onboarding', 'kind/onboarding'}
onboard_filter = issues_df["labels"].apply(
    lambda x: len(onboard_labels.intersection(x.keys())) != 0
)
# take a copy so that columns added later do not trigger SettingWithCopyWarning
onboard_issues_df = issues_df.loc[onboard_filter].copy()
onboard_issues_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
7,567,Onboard API Designer,### Target cluster\n\nSmaug\n\n### Team name\n...,riprasad,2022-04-25 22:11:02,HumairAK,2022-05-03 17:19:24,"{'kind/onboarding': {'color': 'c7def8', 'label...","{'riprasad': 107, 'HumairAK': 3}"
15,559,"""smart-village-view""/""openshift-service-ca.crt...",### Requested actions\r\n\r\n#### Resource req...,computate,2022-04-21 18:46:05,computate,2022-04-21 20:45:31,"{'onboarding': {'color': '57bf42', 'labeled_at...",{'computate': 27}
20,553,smart-village-view,### Target cluster\r\n\r\n_No response_\r\n\r\...,computate,2022-04-13 23:31:04,4n4nd,2022-04-20 17:31:56,"{'kind/onboarding': {'color': 'c7def8', 'label...","{'durandom': 7, 'Gregory-Pereira': 24, '4n4nd'..."
26,545,BUCKET: rhosp-cloudops,### Username\n\ncsibbitt\n\n### Desired bucket...,csibbitt,2022-03-23 21:22:18,,NaT,"{'kind/onboarding': {'color': 'c7def8', 'label...",{}
27,544,NEW PROJECT: rhosp-cloudops,### Target cluster\n\n_No response_\n\n### Tea...,csibbitt,2022-03-23 21:21:43,,NaT,"{'kind/onboarding': {'color': 'c7def8', 'label...",{}


In [None]:
# time to close
onboard_issues_df['time_to_close'] = onboard_issues_df['closed_at'] - onboard_issues_df['created_at']

# summary stats
onboard_issues_df['time_to_close'].describe()
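Open issues have `closed_at` set to `NaT`, so their `time_to_close` is missing and they drop out of the summary statistics. The sketch below, on made-up toy data, shows one way to report the number of open issues alongside the mean and median, since the median is less sensitive to a few very slow issues. The frame `toy` and the dict `summary` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy data: two closed issues and one that is still open (closed_at is NaT).
toy = pd.DataFrame(
    {
        "created_at": pd.to_datetime(["2022-04-01", "2022-04-02", "2022-04-03"]),
        "closed_at": pd.to_datetime(["2022-04-02", "2022-04-05", None]),
    }
)
toy["time_to_close"] = toy["closed_at"] - toy["created_at"]

# Only closed issues have a defined time to close.
closed = toy["time_to_close"].dropna()

one_day = pd.Timedelta(days=1)
summary = {
    "n_closed": len(closed),
    "n_open": int(toy["time_to_close"].isna().sum()),
    "mean_days": closed.mean() / one_day,
    "median_days": closed.median() / one_day,
}
print(summary)
```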

In [None]:
# histogram in terms of number of hours
fig, ax = plt.subplots(figsize=(20, 8))
sns.histplot(
    onboard_issues_df['time_to_close'].dt.total_seconds() / 3600,
    ax=ax,
    bins=50,
    stat="probability",
)
plt.ylabel("Proportion of Issues")
plt.xlabel("Time to Close Issue (hours)")
plt.title("Distribution of time taken to close issue")
plt.show()

In [None]:
# issues that took longer than ~6 months (180 days) to close
onboard_issues_df[onboard_issues_df['time_to_close'] > pd.Timedelta(days=180)].head()

In [None]:
# calculate running mean of time to close
mttr_till_now = onboard_issues_df.sort_values(by='created_at')['time_to_close'].dt.total_seconds().expanding().mean()
mttr_till_now = mttr_till_now.rename('mttr_till_now')
mttr_till_now_days = mttr_till_now / (24 * 60 * 60)

# merge with rest of df
onboard_issues_df = onboard_issues_df.merge(
    mttr_till_now_days,
    left_index=True,
    right_index=True,
)
onboard_issues_df.head()
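To make the running mean concrete, here is a minimal sketch on toy data of how `expanding().mean()` behaves once the rows are sorted by creation date, and how the result lines up with the original rows through the index. The column names `days_to_close` and `running_mean_days` are illustrative.

```python
import pandas as pd

# Toy data, deliberately not in chronological order.
toy = pd.DataFrame(
    {
        "created_at": pd.to_datetime(["2022-04-03", "2022-04-01", "2022-04-02"]),
        "days_to_close": [6.0, 2.0, 4.0],
    }
)

# Sort chronologically, then average everything seen so far at each row.
running = toy.sort_values("created_at")["days_to_close"].expanding().mean()

# Assignment aligns on the index, so each row gets the running mean
# as of its own creation date even though toy itself is unsorted.
toy["running_mean_days"] = running
print(toy.sort_values("created_at"))
```

Because the mean is taken over all earlier issues, a single slow issue keeps influencing every later value, which is worth keeping in mind when reading the trend.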

In [None]:
# what does the mean time to close till now look like, over time
fig, ax = plt.subplots(figsize=(20, 8))
sns.lineplot(
    x=onboard_issues_df['created_at'],
    y=onboard_issues_df['mttr_till_now'],
    ax=ax,
)
plt.ylabel("Mean Time to Resolve, aggregated until now (days)")
plt.xlabel("Issue creation date")
plt.title("Running MTTR over time")
plt.show()

In [None]:
# what if we only consider the MTTR per sprint (~14 days)
ttr = onboard_issues_df[["created_at", "time_to_close"]].set_index("created_at")

# convert timedelta to seconds and then to days
ttr["time_to_close"] = ttr["time_to_close"].dt.total_seconds() / (24 * 3600)

# mean time to close (days) for issues created in each 2-week window
ttr.resample("2W").mean()
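The `"2W"` alias uses week-based bins anchored on Sundays, which may not line up with the actual sprint boundaries. If sprints start on a known date, fixed 14-day bins with an explicit `origin` may be a closer fit. The sketch below uses toy data, and the sprint start date of 2022-04-04 is an assumption chosen purely for illustration.

```python
import pandas as pd

# Toy data: days to close, indexed by issue creation date.
toy = pd.Series(
    [1.0, 3.0, 10.0],
    index=pd.to_datetime(["2022-04-04", "2022-04-15", "2022-04-20"]),
    name="days_to_close",
)

# Hypothetical first day of a sprint; replace with the real sprint calendar.
sprint_origin = pd.Timestamp("2022-04-04")

# Fixed 14-day bins counted from sprint_origin; each bin is labeled by its start date.
sprintwise = toy.resample("14D", origin=sprint_origin).mean()
print(sprintwise)
```

Each value is the mean time to close for issues created within that 14-day window, so sprints with few issues will produce noisy estimates.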

In [None]:
# NOTE: the window bounds are an assumed choice (the 14 days up to 2022-04-17)
onboard_issues_df.loc[onboard_issues_df["created_at"].between("2022-04-03", "2022-04-17"), "time_to_close"].mean()