# Exploring Onboarding Data from Operate First

- One of the goals of this team is to establish Operate First (OPF) as the de facto platform for incubating services.
- To do that, we want to ensure a smooth customer experience.
- We believe that adding onboarding automation and a service catalog can go a long way toward achieving this goal.
- However, to measure the effects of these efforts, we need to put some evaluation metrics in place.
- This notebook explores the data available through GitHub and some possible metrics for measuring efficiency.

In [34]:
from IPython.display import display, Markdown
from dotenv import find_dotenv, load_dotenv

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

# load env vars
load_dotenv(find_dotenv())

# NOTE: the GitHub PAT must exist as an env var before importing srcopsmetrics;
# this is a known bug in the library
from srcopsmetrics.entities.issue import Issue  # noqa: E402
from srcopsmetrics.entities.pull_request import PullRequest  # noqa: E402
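
The import-order caveat above can be turned into a fail-fast check that runs right after `load_dotenv` and before the `srcopsmetrics` imports. This is only a minimal sketch: the helper `require_env` is hypothetical, and the variable name `GITHUB_ACCESS_TOKEN` is an assumption that should be checked against the srcopsmetrics documentation.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} must be set before importing srcopsmetrics")
    return value


# Assumed variable name; demonstrated on a plain dict so the sketch runs anywhere.
token = require_env("GITHUB_ACCESS_TOKEN", {"GITHUB_ACCESS_TOKEN": "example-token"})
```

In the notebook itself the call would be `require_env("GITHUB_ACCESS_TOKEN")`, which reads from the real environment.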

In [2]:
# default pretty graph settings
sns.set()

## GitHub Issues related to Onboarding

In [3]:
# load issue data using an entity, put it into df
issue_entity = Issue("operate-first/support")
issues_df = issue_entity.load_previous_knowledge(is_local=True)
issues_df = issues_df.reset_index()
issues_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
0,575,Deploy curator-operator in Smaug cluster,Earlier we have deployed the curator project i...,skanthed,2022-05-18 23:12:21,,NaT,{},{'dystewart': 1}
1,573,[smaug] opf-jupyterhub namespace has reached i...,,4n4nd,2022-05-05 18:53:29,HumairAK,2022-05-05 19:01:01,"{'area/service/odh': {'color': 'fc05d7', 'labe...","{'4n4nd': 1, 'first-operator[bot]': 75}"
2,572,"As the stakeholder (sig-data-science), I want ...",- [ ] determine if definitions in doc make sen...,HumairAK,2022-05-03 19:15:14,,NaT,{},{}
3,571,"As an OPF admin / PO, I want to define the ""ti...",- [ ] define the metric (issue_close - issue_o...,HumairAK,2022-05-03 19:08:44,,NaT,"{'kind/user-story': {'color': '1D76DB', 'label...",{'HumairAK': 19}
4,570,public operations project board,### Description\n\nI'm interested in being abl...,erikerlandson,2022-05-03 02:20:16,erikerlandson,2022-05-03 20:19:19,"{'kind/question': {'color': 'd455d0', 'labeled...","{'HumairAK': 14, 'erikerlandson': 3}"


In [16]:
# let's take a look at which labels are used
labels_on_issue = issues_df["labels"].apply(
    lambda x: list(x.keys())
)

# flatten list of lists and remove duplicates
unique_labels = list(set(
    [l for labellist in labels_on_issue.values for l in labellist]
))

unique_labels[:10]

['priority/important-longterm',
 'work-in-progress',
 'user/experience',
 'triage/needs-information',
 'lifecycle/stale',
 'task',
 'sig/devops',
 'documentation',
 'kind/demo',
 'kind/onboarding']

In [22]:
# how many times has each label been used
label_freq = labels_on_issue.apply(pd.value_counts)

# need to convert from label name being columns to them being rows
# then sum for each label
label_freq = label_freq.melt().groupby("variable").sum()
label_freq = label_freq.sort_values("value", ascending=False)

display(Markdown(
    "### How often does each specific label occur?"
))
label_freq.head(10)

### How often does each specific label occur?

Unnamed: 0_level_0,value
variable,Unnamed: 1_level_1
lifecycle/stale,165.0
lifecycle/rotten,104.0
onboarding,55.0
kind/onboarding,51.0
kind/feature,43.0
human_intervention_required,41.0
kind/bug,39.0
priority/critical-urgent,21.0
kind/documentation,19.0
priority/important-soon,19.0
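
The same per-label counts can probably be obtained more directly with `explode`, which avoids the wide intermediate frame and the `melt`/`groupby` step. The sketch below uses a small made-up Series (`toy_labels`) in place of `issues_df["labels"]`; the label names are borrowed from the tables above, but the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for issues_df["labels"]: one dict of label metadata per issue.
toy_labels = pd.Series([
    {"kind/onboarding": {}, "lifecycle/stale": {}},
    {"onboarding": {}},
    {},
    {"kind/onboarding": {}},
])

# list(dict) yields the keys; explode gives one row per (issue, label) pair.
# Issues without labels become NaN and are dropped by value_counts.
toy_label_freq = (
    toy_labels.apply(lambda labels: list(labels)).explode().value_counts()
)
print(toy_label_freq)
```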


In [23]:
# which ones might be related to onboarding?
label_freq[label_freq.index.str.contains("onboard")]

Unnamed: 0_level_0,value
variable,Unnamed: 1_level_1
onboarding,55.0
kind/onboarding,51.0
onboarding-argocd,1.0


**NOTE** From the unique labels set and from the tables above, it seems like the issues we'd be interested in would have the label `onboarding` or `kind/onboarding`. This has been confirmed after talking to one of our subject matter experts.

In [24]:
# get issues having the onboarding labels
onboard_labels = {'onboarding', 'kind/onboarding'}
onboard_filter = issues_df["labels"].apply(lambda x: len(onboard_labels.intersection(x.keys())) != 0)
onboard_issues_df = issues_df.loc[onboard_filter]
onboard_issues_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
7,567,Onboard API Designer,### Target cluster\n\nSmaug\n\n### Team name\n...,riprasad,2022-04-25 22:11:02,HumairAK,2022-05-03 17:19:24,"{'kind/onboarding': {'color': 'c7def8', 'label...","{'riprasad': 107, 'HumairAK': 3}"
15,559,"""smart-village-view""/""openshift-service-ca.crt...",### Requested actions\r\n\r\n#### Resource req...,computate,2022-04-21 18:46:05,computate,2022-04-21 20:45:31,"{'onboarding': {'color': '57bf42', 'labeled_at...",{'computate': 27}
20,553,smart-village-view,### Target cluster\r\n\r\n_No response_\r\n\r\...,computate,2022-04-13 23:31:04,4n4nd,2022-04-20 17:31:56,"{'kind/onboarding': {'color': 'c7def8', 'label...","{'durandom': 7, 'Gregory-Pereira': 24, '4n4nd'..."
26,545,BUCKET: rhosp-cloudops,### Username\n\ncsibbitt\n\n### Desired bucket...,csibbitt,2022-03-23 21:22:18,,NaT,"{'kind/onboarding': {'color': 'c7def8', 'label...",{}
27,544,NEW PROJECT: rhosp-cloudops,### Target cluster\n\n_No response_\n\n### Tea...,csibbitt,2022-03-23 21:21:43,,NaT,"{'kind/onboarding': {'color': 'c7def8', 'label...",{}


In [25]:
# how often has each combination of labels been used
display(Markdown(
    "### How often (fraction of issues) does each label combination occur?",
))
labels_on_issue.value_counts(normalize=True)

### How often (fraction of issues) does each label combination occur?

[]                                                                                                       0.278119
[onboarding]                                                                                             0.079755
[lifecycle/stale, lifecycle/rotten]                                                                      0.063395
[kind/onboarding]                                                                                        0.055215
[lifecycle/stale]                                                                                        0.051125
                                                                                                           ...   
[kind/feature, priority/important-soon, lifecycle/frozen, triage/accepted]                               0.002045
[kind/bug, priority/important-soon, area/user, human_intervention_required, priority/critical-urgent]    0.002045
[kind/onboarding, priority/important-soon, lifecycle/stale, lifecycle/rotten]           

**NOTE** About 28% of the issues have no labels at all. It is therefore worth using some basic heuristics, followed by a manual check, to see whether any of these label-less issues relate to onboarding and are thus relevant to our analysis.

### Side Quest: Are Label-less Issues Relevant?

In [28]:
# filter to get labelless issues
labelless_filter = issues_df["labels"].apply(len) == 0
labelless_issues_df = issues_df[labelless_filter].sort_values("created_at", ascending=False)
labelless_issues_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
0,575,Deploy curator-operator in Smaug cluster,Earlier we have deployed the curator project i...,skanthed,2022-05-18 23:12:21,,NaT,{},{'dystewart': 1}
2,572,"As the stakeholder (sig-data-science), I want ...",- [ ] determine if definitions in doc make sen...,HumairAK,2022-05-03 19:15:14,,NaT,{},{}
16,558,Username changed and removed access to ressources,**Describe the bug**\r\nIt looks like my usern...,PixelJonas,2022-04-21 12:35:33,HumairAK,2022-05-02 20:20:58,{},{'HumairAK': 27}
21,552,Authentication error when accessing OS-Climate...,**Describe the bug**\r\nUnable to log into htt...,tumido,2022-04-06 16:02:01,tumido,2022-04-07 01:44:19,{},"{'HumairAK': 72, 'harshad16': 5, 'tumido': 7}"
22,551,[TASK] Better use of Github Labels to scope is...,**User story**\r\nIt would be nice to make bet...,Gregory-Pereira,2022-04-02 03:05:59,,NaT,{},{'quaid': 13}


**NOTE** These are fairly recent issues, so we can rule out the possibility that they lack labels because the labels had not been defined when the issues were created. It is therefore worth digging a bit deeper.

In [31]:
# let's see if the title or body mentions onboarding (a missing title/body counts as no match)
in_body = labelless_issues_df["body"].str.contains("onboard", case=False, na=False)
in_title = labelless_issues_df["title"].str.contains("onboard", case=False, na=False)
labelless_issues_df[in_body | in_title]

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
24,547,[TASK] Update the issue and bug templates,**User story**\r\nWith many changes having occ...,Gregory-Pereira,2022-03-28 22:31:12,,NaT,{},"{'HumairAK': 101, 'Gregory-Pereira': 32}"
187,368,[TASK] Provision a cluster for DevConf US 2021...,Zero is not stable enough for a workshop and S...,tumido,2021-08-30 15:25:39,tumido,2021-09-02 02:19:04,{},{'tumido': 1}
191,364,Support for multiple buckets in Trino and acce...,**Is your feature request related to a problem...,HumairAK,2021-08-26 17:35:45,HumairAK,2021-09-10 17:45:48,{},"{'HumairAK': 43, 'erikerlandson': 27}"
207,345,"Issue Template ""Onboard to Continuous Deployme...",**Describe the bug**\r\n\r\nThe issue template...,rbo,2021-08-08 14:09:23,tumido,2021-10-01 04:11:30,{},"{'HumairAK': 10, 'rbo': 18, 'tumido': 3}"
208,344,Update Rick cluster to 4.8.2,"Before onboarding of users, we should try to u...",rbo,2021-08-08 13:50:27,sesheta,2021-10-07 19:24:56,{},"{'tumido': 16, 'rbo': 8}"
218,333,Onboard OS Climate to Operate First,This issue to identify the tools/services that...,HumairAK,2021-08-04 18:57:45,HumairAK,2021-10-07 18:05:47,{},"{'HumairAK': 45, 'erikerlandson': 50, 'durando..."
251,292,[TASK] Convert issue templates to issue forms,**User story**\r\nIssue forms are a new fancy ...,tumido,2021-06-30 16:16:04,tumido,2021-10-11 10:57:33,{},"{'HumairAK': 6, 'tumido': 3}"
372,149,Onboarding rbohne to zero,**Describe the bug**\r\nAfter login into the z...,rbo,2021-03-25 15:13:30,sesheta,2021-03-31 22:35:46,{},"{'tumido': 80, 'rbo': 197, 'durandom': 193, 'H..."
390,129,feat: Make onboarding issue template clearer/e...,We need the onboarding to cluster template a l...,HumairAK,2021-03-16 23:56:25,sesheta,2021-03-19 16:24:29,{},{}
394,124,[TASK] Better wording for the README explainin...,**User story**\r\nHeidi:\r\n> There is onboard...,tumido,2021-03-16 16:12:22,HumairAK,2021-07-07 15:56:13,{},{'HumairAK': 215}


**NOTE** We went through these issues manually and discussed some of them with a subject matter expert. We determined that they are either not strictly related to onboarding, or related to onboarding but outliers or one-off special cases.

### Side Quest Conclusion

Label-less issues do not contain any meaningful information that we want to include in the analysis at this stage, so we can safely skip them.

## GitHub PRs related to Onboarding

## Possible Metrics

### Mean Time to Close Issue

The time delta between issue creation and issue close.

In [32]:
# time to close (assign returns a new DataFrame rather than writing into a slice of issues_df)
onboard_issues_df = onboard_issues_df.assign(
    time_to_close=onboard_issues_df['closed_at'] - onboard_issues_df['created_at']
)

# summary stats
onboard_issues_df['time_to_close'].describe()

count                            91
mean     67 days 19:43:58.736263736
std      90 days 05:37:05.405439012
min                 0 days 00:41:21
25%                 0 days 23:55:02
50%                11 days 06:33:52
75%               134 days 14:22:30
max               294 days 02:54:41
Name: time_to_close, dtype: object

**NOTE** A mean time to close of ~68 days seems rather high. The mean is sensitive to outliers, so it may not be a representative measure; the median time to close of ~11 days seems more reasonable.
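
To illustrate why the mean and the median can diverge this much, here is a small sketch with made-up closing times (the values in `toy_time_to_close` are hypothetical and not taken from the issue data). A single long-running issue is enough to pull the mean far away from the typical case.

```python
import pandas as pd

# Hypothetical closing times in days: four quick issues and one long-running one.
toy_time_to_close = pd.Series(pd.to_timedelta([1, 2, 11, 12, 294], unit="D"))

toy_mean = toy_time_to_close.mean()      # pulled up by the 294-day issue
toy_median = toy_time_to_close.median()  # does not depend on how extreme the outlier is
print(toy_mean, toy_median)
```

Here the mean is 64 days while the median is 11 days, which mirrors the pattern in the summary statistics above.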

In [48]:
onboard_issues_df['closed_at'].dt.date.tolist()
# onboard_issues_df['created_at'].dt.date.tolist()

[datetime.date(2022, 5, 3),
 datetime.date(2022, 4, 21),
 datetime.date(2022, 4, 20),
 NaT,
 NaT,
 NaT,
 datetime.date(2022, 3, 28),
 NaT,
 datetime.date(2022, 3, 23),
 NaT,
 NaT,
 datetime.date(2022, 3, 2),
 datetime.date(2022, 2, 25),
 datetime.date(2022, 3, 10),
 datetime.date(2022, 1, 6),
 NaT,
 datetime.date(2021, 12, 10),
 datetime.date(2021, 12, 7),
 datetime.date(2022, 2, 24),
 datetime.date(2022, 3, 11),
 datetime.date(2021, 11, 24),
 datetime.date(2021, 11, 12),
 datetime.date(2022, 4, 9),
 datetime.date(2022, 5, 2),
 datetime.date(2022, 4, 9),
 datetime.date(2021, 11, 5),
 datetime.date(2022, 1, 24),
 datetime.date(2022, 1, 28),
 datetime.date(2021, 11, 2),
 datetime.date(2022, 2, 1),
 NaT,
 datetime.date(2021, 11, 11),
 datetime.date(2021, 10, 1),
 datetime.date(2021, 10, 19),
 datetime.date(2022, 2, 19),
 datetime.date(2021, 10, 19),
 NaT,
 datetime.date(2022, 2, 6),
 datetime.date(2021, 9, 3),
 datetime.date(2021, 9, 17),
 datetime.date(2022, 3, 27),
 datetime.date(2021, 

In [54]:
pd.isna(onboard_issues_df['closed_at'].dt.date.tolist()[-1])

True

In [57]:
# np.busday_count expects (begindates, enddates), so created_at comes first
np.busday_count(onboard_issues_df['created_at'].dt.date.tolist()[0], onboard_issues_df['closed_at'].dt.date.tolist()[0])

6
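
The sign of the result depends on argument order: `np.busday_count(begindates, enddates)` counts weekdays in the half-open interval from the begin date up to, but not including, the end date, and returns a negative count when the end date is earlier. A minimal sketch using the creation and close dates of issue 567 from the table above (the variable names are illustrative):

```python
import numpy as np

created = np.datetime64("2022-04-25")  # Monday
closed = np.datetime64("2022-05-03")   # Tuesday of the following week

# Counts weekdays in [created, closed): Apr 25-29 and May 2.
busdays_open = np.busday_count(created, closed)
print(busdays_open)
```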

In [56]:
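# business days from creation to close; -1 marks issues that are still open
created = onboard_issues_df['created_at'].values.astype('datetime64[D]')
closed = onboard_issues_df['closed_at'].values.astype('datetime64[D]')
is_closed = ~np.isnat(closed)
busdays_to_close = np.full(len(onboard_issues_df), -1)
busdays_to_close[is_closed] = np.busday_count(created[is_closed], closed[is_closed])
busdays_to_close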

In [None]:
# histogram in terms of number of hours
fig, ax = plt.subplots(figsize=(20, 8))
sns.histplot(
    onboard_issues_df['time_to_close'].dt.total_seconds() / 3600,
    ax=ax,
    bins=50,
    stat="probability",
)
plt.ylabel("Proportion of Issues")
plt.xlabel("Time to Close Issue (hours)")
plt.title("Distribution of time taken to close issue")
plt.show()

In [None]:
# closing time greater than 6 months
onboard_issues_df[onboard_issues_df['time_to_close'] > pd.Timedelta(days=6 * 30)].head()

In [None]:
# calculate running mean of time to close
mttr_till_now = onboard_issues_df.sort_values(by='created_at')['time_to_close'].dt.total_seconds().expanding().mean()
mttr_till_now = mttr_till_now.rename('mttr_till_now')
mttr_till_now_days = mttr_till_now / (24 * 60 * 60)

# merge with rest of df
onboard_issues_df = onboard_issues_df.merge(
    mttr_till_now_days,
    left_index=True,
    right_index=True,
)
onboard_issues_df.head()
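
As a quick illustration of what the expanding mean does, consider the sketch below with made-up values (`toy_days` is hypothetical). Each entry of the result is the mean of all values seen so far; issues that are still open have no time to close and show up as NaN, which `mean()` skips.

```python
import numpy as np
import pandas as pd

# Hypothetical time-to-close values in days, in order of issue creation.
# NaN stands for an issue that is still open.
toy_days = pd.Series([2.0, np.nan, 4.0, 6.0])

# expanding() grows the window from the first row to the current one;
# mean() skips NaN, so an open issue repeats the previous running mean.
toy_running_mean = toy_days.expanding().mean()
print(toy_running_mean.tolist())
```

One consequence is that open issues do not raise the running mean until they are closed, so the most recent values may understate the eventual figure.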

In [None]:
# what does the running mean time to close look like over time?
fig, ax = plt.subplots(figsize=(20, 8))
sns.lineplot(x=onboard_issues_df['created_at'], y=onboard_issues_df['mttr_till_now'], ax=ax)
plt.ylabel("Mean Time to Resolve (days, aggregated until now)")
plt.xlabel("Issue Creation Date")
plt.title("Overall MTTR over time")
plt.show()

In [None]:
# what if we only consider the MTTR within each sprint (~14 days)?
ttr = onboard_issues_df[["created_at", "time_to_close"]].set_index("created_at")

# convert the timedelta to seconds, then to days
ttr["time_to_close"] = ttr["time_to_close"].dt.total_seconds() / (24 * 3600)

# mean time to close per two-week bin
ttr.resample("2W").mean()
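
Note that `"2W"` bins are anchored to calendar weeks (ending on Sundays by default), which may not line up with actual sprint boundaries. One possible alternative, sketched here on made-up data, is a fixed `"14D"` frequency, which starts the first bin at midnight of the earliest creation date. The frame `toy_ttr` and its column `days_to_close` are hypothetical.

```python
import pandas as pd

toy_ttr = pd.DataFrame({
    "created_at": pd.to_datetime(["2022-01-01", "2022-01-05", "2022-01-20"]),
    "days_to_close": [2.0, 4.0, 10.0],
})

# Fixed 14-day bins, starting at midnight of the first creation date:
# [Jan 1, Jan 15) holds the first two issues, [Jan 15, Jan 29) the third.
toy_sprint_mttr = (
    toy_ttr.set_index("created_at")["days_to_close"].resample("14D").mean()
)
print(toy_sprint_mttr)
```

If real sprint start dates are known, passing them explicitly (for example via `pd.cut` on `created_at`) would likely be more faithful than either frequency.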

In [None]:
# assumed sprint window: 14 days starting 2022-04-17 (identical bounds would always select nothing)
onboard_issues_df.loc[onboard_issues_df["created_at"].between("2022-04-17", "2022-05-01"), "time_to_close"].mean()