# Time to Merge Prediction Inference Service

In the previous notebook, we explored some basic machine learning models for predicting the time to merge of a PR. We then deployed the model with the highest F1 score as a service using Seldon. The purpose of this notebook is to check that this service is running as intended and, more specifically, that its model performance is what we expect. To do so, we use the test set from that notebook as the query payload for the service, and then verify that the returned predictions match those obtained during local training and testing.

In [1]:
import os
import ast
import sys
import json
import datetime
from io import StringIO
import requests
from dotenv import load_dotenv, find_dotenv

import numpy as np
import pandas as pd

from sklearn.metrics import classification_report

metric_template_path = "../../../notebooks/data-sources/TestGrid/metrics"
if metric_template_path not in sys.path:
    sys.path.insert(1, metric_template_path)

from ipynb.fs.defs.metric_template import (  # noqa: E402
    CephCommunication,
)

load_dotenv(find_dotenv())

True

In [2]:
## CEPH Bucket variables
## Create a .env file on your local machine with the correct configs.
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
s3_path = "github/thoth"
REMOTE = os.getenv("REMOTE")
INPUT_DATA_PATH = "../../../data/processed/github"

In [3]:
# read raw dataset
data_path = "../../data/raw/GitHub/thoth_PR_data.csv"

if REMOTE:
    print("getting dataset from ceph")
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    s3_object = cc.s3_resource.Object(s3_bucket, "thoth_PR_data.csv")
    file = s3_object.get()["Body"].read().decode("utf-8")
    pr_df = pd.read_csv(StringIO(file))
else:
    # fall back to the local copy so pr_df is always defined
    print("getting dataset from local")
    pr_df = pd.read_csv(data_path)

getting dataset from ceph


In [4]:
# github pr dataset collected using thoth's mi-scheduler
pr_df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,...,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at,org,repo,index
0,0,0,678.0,Automatic update of base-image in CI,Automatic update of base-image in CI.,XS,sesheta,2022-05-09 19:38:22,2022-05-09 19:43:18,harshad16,...,{'sesheta': 210},"{'966742282': {'author': 'sefkhet-abwy[bot]', ...","['approved', 'size/XS', 'ok-to-test']",['1a643bbdf0304b1d7b94e374680901765bb695c1'],['.aicoe-ci.yaml'],2022-05-09 19:38:24,2022-05-09 19:38:24,thoth-station,graph-refresh-job,
1,1,1,677.0,Release of version 0.3.19,"Hey, @harshad16!\n\nOpening this PR to fix the...",XS,khebhut[bot],2022-05-09 19:36:51,2022-05-09 19:43:05,harshad16,...,{'sesheta': 252},"{'966740804': {'author': 'sefkhet-abwy[bot]', ...","['approved', 'size/XS', 'bot', 'needs-ok-to-te...",['050ed88f0b7b0dcaa94ea082d5a1a34862d59848'],"['CHANGELOG.md', 'version.py']",2022-05-09 19:36:53,2022-05-09 19:36:53,thoth-station,graph-refresh-job,
2,2,2,675.0,Automatic update of base-image in CI,Automatic update of base-image in CI.,XS,sesheta,2022-05-09 18:44:54,2022-05-09 19:23:46,harshad16,...,{'sesheta': 447},"{'966683048': {'author': 'sefkhet-abwy[bot]', ...","['approved', 'size/XS', 'ok-to-test']",['1a643bbdf0304b1d7b94e374680901765bb695c1'],['.aicoe-ci.yaml'],2022-05-09 18:44:57,2022-05-09 18:44:57,thoth-station,graph-refresh-job,
3,3,3,674.0,Automatic update of dependencies by Kebechet f...,Kebechet has updated the dependencies to the l...,L,khebhut[bot],2022-05-09 18:44:08,2022-05-09 19:23:27,harshad16,...,{'sesheta': 429},"{'966682266': {'author': 'sefkhet-abwy[bot]', ...","['approved', 'size/L', 'bot', 'needs-ok-to-tes...",['39c9414fdd8575bfa55d2e51ecab3639f42f40da'],['Pipfile.lock'],2022-05-09 18:44:10,2022-05-09 18:44:10,thoth-station,graph-refresh-job,
4,4,4,672.0,Automatic update of dependencies by Kebechet f...,Kebechet has updated the dependencies to the l...,L,khebhut[bot],2022-02-24 17:42:22,2022-02-24 18:11:28,sesheta,...,{'sesheta': 257},"{'892779373': {'author': 'sefkhet-abwy[bot]', ...","['approved', 'size/L', 'bot', 'needs-ok-to-tes...",['d06965a0ab7c603b9adb8dbc747b47e963209a8f'],['Pipfile.lock'],2022-02-24 17:42:25,2022-02-24 17:42:25,thoth-station,graph-refresh-job,


In [5]:
# remove PRs which are still open or were closed without being merged
pr_df = pr_df[pr_df["closed_at"].notna()]
pr_df = pr_df[pr_df["merged_at"].notna()]

In [6]:
pr_df["created_at"] = pr_df["created_at"].apply(
    lambda x: int(datetime.datetime.timestamp(pd.to_datetime(x)))
)
pr_df["closed_at"] = pr_df["closed_at"].apply(
    lambda x: float(datetime.datetime.timestamp(pd.to_datetime(x)))
)
pr_df["merged_at"] = pr_df["merged_at"].apply(
    lambda x: float(datetime.datetime.timestamp(pd.to_datetime(x)))
)
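The cell above turns each timestamp string into Unix epoch seconds. As a minimal standard-library sketch of the same idea, assuming the strings use the `YYYY-MM-DD HH:MM:SS` form shown in the dataframe preview and are meant to be read as UTC (the helper name `to_epoch_seconds` is hypothetical and not part of the notebook):

```python
from datetime import datetime, timezone


def to_epoch_seconds(value: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as UTC and return epoch seconds."""
    # Assumption: the string carries no timezone and should be read as UTC.
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# one day after the epoch
print(to_epoch_seconds("1970-01-02 00:00:00"))
```

Attaching the timezone explicitly keeps the result independent of the machine's local timezone, which the standard library's `datetime.timestamp()` would otherwise use for naive values.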

In [7]:
# read processed and split data created for train/test in the model training notebook
if REMOTE:
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    X_test = cc.read_from_ceph(s3_path, "X_test.parquet")
    y_test = cc.read_from_ceph(s3_path, "y_test.parquet")

else:
    print(
        "The X_test.parquet and y_test.parquet files are not included in the ocp-ci-analysis github repo."
    )
    print(
        "Please set REMOTE=1 in the .env file and read this data from the S3 bucket instead."
    )

In [8]:
X_test

Unnamed: 0,size,created_at_day,created_at_month,created_at_weekday,created_at_hour,changed_files_number,body_size,commits_number,filetype_None,title_wordcount_add,...,title_wordcount_slo,title_wordcount_stage,title_wordcount_toml,title_wordcount_upgrade,title_wordcount_v0,title_wordcount_v1,title_wordcount_v2021,title_wordcount_version,title_wordcount_wip,title_wordcount_💊
4298,2.0,7.0,11.0,3.0,14.0,1.0,14.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13316,1.0,19.0,1.0,0.0,21.0,1.0,23.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
46056,0.0,19.0,1.0,0.0,15.0,2.0,4.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2540,3.0,11.0,9.0,2.0,6.0,3.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28650,0.0,1.0,3.0,1.0,11.0,1.0,23.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28461,0.0,1.0,3.0,1.0,11.0,1.0,23.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
39492,1.0,20.0,1.0,0.0,17.0,1.0,14.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30436,1.0,16.0,9.0,3.0,22.0,1.0,23.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
45481,0.0,19.0,1.0,0.0,20.0,1.0,7.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [9]:
y_test

Unnamed: 0,ttm_class
4298,2
13316,0
46056,0
2540,5
28650,2
...,...
28461,3
39492,5
30436,3
45481,2


In [10]:
# endpoint from the seldon deployment
base_url = "http://thoth-github-ttm-ds-ml-workflows-ws.apps.smaug.na.operate-first.cloud/predict"

In [11]:
# let's extract the raw PR data corresponding to the PRs used in the test set
sample_payload = pr_df.reindex(X_test.index)

In [12]:
sample_payload.head(2)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,...,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at,org,repo,index
4298,450,450,178.0,Automatic update of dependency thoth-storages ...,Dependency thoth-storages was used in version ...,M,sesheta,1573136911,1573137000.0,,...,{'ghost': 18},{},['bot'],['650fd49866318df000c2d3e25865e314b41fe4b2'],['Pipfile.lock'],,,thoth-station,metrics-exporter,
13316,6215,6215,4398.0,💊 Package 'django-bitfield' is hosted on GitHub,This change was automatically generated using ...,S,khebhut[bot],1631186,1631187.0,fridex,...,{},{},['bot'],['f8bdcb1835748e817ee15a7a6e0ca0da9de07407'],['prescriptions/dj_/django-bitfield/gh_link.ya...,,,thoth-station,prescriptions,


In [13]:
sample_payload.changed_files = sample_payload.changed_files.apply(ast.literal_eval)

In [14]:
sample_payload.dtypes

Unnamed: 0.1              int64
Unnamed: 0                int64
id                      float64
title                    object
body                     object
size                     object
created_by               object
created_at                int64
closed_at               float64
closed_by                object
merged_at               float64
merged_by                object
commits_number          float64
changed_files_number    float64
interactions             object
reviews                  object
labels                   object
commits                  object
changed_files            object
first_review_at          object
first_approve_at         object
org                      object
repo                     object
index                   float64
dtype: object

In [15]:
sample_payload

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,...,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at,org,repo,index
4298,450,450,178.0,Automatic update of dependency thoth-storages ...,Dependency thoth-storages was used in version ...,M,sesheta,1573136911,1.573137e+09,,...,{'ghost': 18},{},['bot'],['650fd49866318df000c2d3e25865e314b41fe4b2'],[Pipfile.lock],,,thoth-station,metrics-exporter,
13316,6215,6215,4398.0,💊 Package 'django-bitfield' is hosted on GitHub,This change was automatically generated using ...,S,khebhut[bot],1631186,1.631187e+06,fridex,...,{},{},['bot'],['f8bdcb1835748e817ee15a7a6e0ca0da9de07407'],[prescriptions/dj_/django-bitfield/gh_link.yaml],,,thoth-station,prescriptions,
46056,1017,1017,765.0,Bump advise reporter stage to v0.5.1,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XS,pacospace,1611132,1.611133e+06,sesheta,...,{'sesheta': 65},"{'571991870': {'author': 'fridex', 'words_coun...","['approved', 'size/XS']",['344798e8457a55f84789f3c402e40109bffb2e87'],[advise-reporter/overlays/ocp4-stage/imagestre...,1970-01-19 15:32:12.514,1970-01-19 15:32:12.514,thoth-station,thoth-application,
2540,1272,1272,958.0,PostgreSQL sync package analyzer results,,L,fridex,1568184496,1.568190e+09,,...,"{'todo[bot]': 20, 'ghost': 102, 'fridex': 1}","{'286587429': {'author': 'pacospace', 'words_c...","['approved', 'size/L']","['bfd1d1c6f4c8596fffbcb1e39398f9ddb2562ca9', '...","[.coafile, thoth/storages/graph/models.py, tho...",2019-09-11 06:56:49,2019-09-11 06:56:49,thoth-station,storages,
28650,21549,21549,19155.0,💊 Package 'codacy-coverage' is hosted on GitHub,This change was automatically generated using ...,XS,khebhut[bot],1646133404,1.646134e+09,fridex,...,{},{},['bot'],['0c4bdb6894b40593a47a649cafbb79a8e1db6ec4'],[prescriptions/co_/codacy-coverage/gh_link.yaml],,,thoth-station,prescriptions,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28461,21360,21360,19344.0,💊 Project 'alabaster' has less than 3 maintain...,This change was automatically generated using ...,XS,khebhut[bot],1646134159,1.646135e+09,fridex,...,{},{},"['bot', 'size/XS']",['6ca953b04000c99f670efa511a16ad96fd27b5b0'],[prescriptions/al_/alabaster/pypi_project_main...,,,thoth-station,prescriptions,
39492,215,215,354.0,Automatic update of dependency thoth-common fr...,Dependency thoth-common was used in version 0....,S,sesheta,1579540470,1.579548e+09,,...,{'ghost': 67},{},"['size/S', 'bot']",['0d784b97b437fade48b0725dbea546f40885f591'],[Pipfile.lock],,,thoth-station,package-releases-job,
30436,23335,23335,17353.0,💊 Project 'ukpostcodeparser' was not updated f...,This change was automatically generated using ...,S,khebhut[bot],1631829984,1.631831e+09,fridex,...,{'sesheta': 117},{},"['bot', 'size/S', 'needs-ok-to-test']",['7dbad9dbb429131615efb136cad9daf5d6bad972'],[prescriptions/uk_/ukpostcodeparser/gh_updated...,,,thoth-station,prescriptions,
45481,442,442,1883.0,Bump adviser to v0.39.0 in stage environment,## Related Issues and Dependencies\r\n\r\nRela...,XS,fridex,1629141,1.629354e+06,sesheta,...,"{'fridex': 7, 'sesheta': 65}",{},"['approved', 'size/XS']",['2bd29922e76d1e2b1bc27f5185fb8ba36e5421a6'],[adviser/overlays/ocp4-stage/imagestreamtag.yaml],,,thoth-station,thoth-application,


In [16]:
# convert the dataframe into a numpy array and then to a list (required by seldon)
data = {
    "data": {
        "names": sample_payload.columns.tolist(),
        "ndarray": sample_payload.to_numpy().tolist(),
    }
}

# create the query payload
json_data = json.dumps(data)
headers = {"Content-Type": "application/json"}
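The request body pairs a list of column names with a list of rows. One detail worth knowing is that `json.dumps` writes missing values as a bare `NaN` token by default, which is not standard JSON. The sketch below shows the same payload shape built from plain Python values, with NaN replaced by `null`. It is only an illustration: `build_seldon_payload` is a hypothetical helper, the example row is made up from values seen in the preview, and this notebook does not confirm whether the deployed model treats `null` the same way as `NaN`.

```python
import json
import math


def build_seldon_payload(names, rows):
    """Serialize column names and row values into a {"data": {...}} JSON body."""

    def clean(value):
        # Assumption: missing values arrive as float NaN and are sent as null.
        if isinstance(value, float) and math.isnan(value):
            return None
        return value

    body = {
        "data": {
            "names": list(names),
            "ndarray": [[clean(v) for v in row] for row in rows],
        }
    }
    # allow_nan=False makes json.dumps raise instead of emitting a bare NaN token
    return json.dumps(body, allow_nan=False)


# illustrative row only, not taken verbatim from the dataset
payload = build_seldon_payload(
    ["title", "size", "closed_by"],
    [["Bump adviser to v0.39.0 in stage environment", "XS", float("nan")]],
)
print(payload)
```

Note that `clean` only looks at top-level cell values; list-valued cells such as `changed_files` are passed through unchanged.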

In [17]:
class_dict = {
    0: "0 to 1 min",
    1: "1 to 2 mins",
    2: "2 to 8 mins",
    3: "8 to 20 mins",
    4: "20 mins to 1 hr",
    5: "1 to 4 hrs",
    6: "4 to 18 hrs",
    7: "18 hrs to 3 days",
    8: "3 days to 3 weeks",
    9: "more than 3 weeks",
}

In [18]:
# query our inference service
response = requests.post(base_url, data=json_data, headers=headers)
response

<Response [200]>

In [19]:
# what are the names of the prediction classes
json_response = response.json()
json_response["data"]["names"]

['Class_0',
 'Class_1',
 'Class_2',
 'Class_3',
 'Class_4',
 'Class_5',
 'Class_6',
 'Class_7',
 'Class_8',
 'Class_9']

In [20]:
sample_pr = 20

In [21]:
# probability estimates for each class for a sample PR
json_response["data"]["ndarray"][sample_pr][:10]

[0.0,
 0.7815000000000001,
 0.023333333333333334,
 0.11983333333333332,
 0.005,
 0.020833333333333336,
 0.0016666666666666666,
 0.011000000000000001,
 0.036833333333333336,
 0.0]
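Before turning these numbers into class predictions, it can be useful to sanity-check the response. The following is a minimal sketch, assuming each returned row is meant to be a probability distribution over the class names; `check_probabilities` is a hypothetical helper, not something the service or Seldon provides.

```python
import math


def check_probabilities(names, rows, tol=1e-6):
    """Return the indices of rows that are not valid probability vectors."""
    bad = []
    for i, row in enumerate(rows):
        right_length = len(row) == len(names)
        in_range = all(0.0 <= p <= 1.0 for p in row)
        sums_to_one = math.isclose(sum(row), 1.0, abs_tol=tol)
        if not (right_length and in_range and sums_to_one):
            bad.append(i)
    return bad


names = [f"Class_{i}" for i in range(10)]
# probabilities printed above for the sample PR
sample = [
    0.0,
    0.7815000000000001,
    0.023333333333333334,
    0.11983333333333332,
    0.005,
    0.020833333333333336,
    0.0016666666666666666,
    0.011000000000000001,
    0.036833333333333336,
    0.0,
]
print(check_probabilities(names, [sample]))
```

With the real response, the call would be `check_probabilities(json_response["data"]["names"], json_response["data"]["ndarray"])`, and an empty list means every row passed.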

In [22]:
# get predicted classes from probabilities for each PR
preds = np.argmax(json_response["data"]["ndarray"], axis=1)
print(
    "The PR belongs to class",
    preds[sample_pr],
    "and it is most likely to be merged in",
    class_dict[preds[sample_pr]],
)

The PR belongs to class 1 and it is most likely to be merged in 1 to 2 mins


In [23]:
actual_class = int(y_test["ttm_class"].iloc[sample_pr])
print("The PR was actually merged in", class_dict[actual_class])

The PR was actually merged in 1 to 2 mins


In [24]:
# evaluate results on the entire dataset
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.89      0.98      0.93      3194
           1       0.61      0.47      0.53      1163
           2       0.48      0.53      0.50       971
           3       0.32      0.32      0.32       619
           4       0.40      0.32      0.36       445
           5       0.61      0.64      0.62       658
           6       0.63      0.56      0.59       469
           7       0.75      0.72      0.74       754
           8       0.43      0.44      0.43       348
           9       0.32      0.19      0.24        98

    accuracy                           0.68      8719
   macro avg       0.54      0.52      0.53      8719
weighted avg       0.67      0.68      0.67      8719
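Comparing two classification reports by eye confirms that the aggregate scores agree, but a stricter check is to compare the predictions element by element. Here is a minimal sketch, assuming the local model's predictions for the same test rows have been saved somewhere (this notebook does not load them, and `compare_predictions` is a hypothetical helper):

```python
def compare_predictions(service_preds, local_preds):
    """Return the agreement rate and the positions where predictions differ."""
    if len(service_preds) != len(local_preds):
        raise ValueError("prediction lists must have the same length")
    mismatches = [
        i
        for i, (svc, loc) in enumerate(zip(service_preds, local_preds))
        if svc != loc
    ]
    total = len(service_preds)
    agreement = 1.0 if total == 0 else 1 - len(mismatches) / total
    return agreement, mismatches


# Illustrative values only; in practice these would be the service's argmax
# output and the predictions saved from the local model.
agreement, mismatches = compare_predictions([1, 0, 2, 5], [1, 0, 3, 5])
print(agreement, mismatches)
```

An agreement rate of 1.0 with no mismatches would indicate that the deployed model reproduces the local predictions exactly, which is a stronger statement than matching summary metrics.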



# Conclusion

This notebook shows how raw PR data can be sent to the deployed Seldon service to get time-to-merge predictions. The evaluation scores in the classification report also match the ones we saw in the training notebook, which indicates that our inference service and model are working as expected and are ready to predict times to merge for GitHub PRs.