# Seldon deployment for build log clustering
In this notebook, we deploy a Seldon service for clustering build logs. First, we take the experiments from the [build log clustering notebook](build_log_term_freq.ipynb) and train a scikit-learn pipeline that contains all of the components. Then, we save the model to S3 storage and deploy a Seldon service that uses the saved model. Finally, we test the service by running inference on an example request.

In [1]:
import os
import pandas as pd
import random
import requests
import seaborn as sns
from matplotlib import cm
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import joblib
import boto3
import json
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
# Some settings for the notebook
pd.options.display.float_format = "{:.2f}".format
random.seed(1)
sns.set(rc={"figure.figsize": (15, 5)})
colormap = (
    cm.brg
)  # see https://matplotlib.org/stable/tutorials/colors/colormaps.html for alternatives

# Load Dataset

In [2]:
# Note: periodic jobs only (see FIXME in class Builds)
job_name = "periodic-ci-openshift-release-master-ci-4.8-e2e-gcp"

logs_path = "../../../../data/raw/gcs/build-logs/"  # local cache of build log files
metadata_path = "../../../../data/raw/gcs/build-metadata/"  # path to saved metadata
metadata_file_name = os.path.join(metadata_path, f"{job_name}_build-logs.csv")


def log_path_for(build_id):
    return os.path.join(logs_path, f"{build_id}.txt")


def prow_url_for(build_id):
    project = "origin-ci-test"
    # FIXME: this prefix is only for periodic jobs
    job_prefix = f"logs/{job_name}/"
    return f"https://prow.ci.openshift.org/view/gcs/{project}/{job_prefix}{build_id}"


def clean_df(df):
    """Polishes the metadata DataFrame"""
    build_errors = df[df["result"] == "error"].index
    df.drop(build_errors, inplace=True)  # Remove builds that errored (prow error)
    df["duration"] = df["end"] - df["start"]  # From timestamps to job duration
    df["success"] = df["result"] == "SUCCESS"  # A boolean version of the result
    return df


print("Reading metadata from", metadata_file_name)
df = pd.read_csv(metadata_file_name, index_col=0)
df = clean_df(df)
df

Reading metadata from ../../../../data/raw/gcs/build-metadata/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp_build-logs.csv


Unnamed: 0,result,size,start,end,duration,success
1368338379971760128,FAILURE,5846974,1615072270,1615076702,4432,False
1372646746210963456,SUCCESS,9785,1616099471,1616104582,5111,True
1375709011390763008,SUCCESS,9962,1616829565,1616834000,4435,True
1380281977160077312,SUCCESS,3841,1617919845,1617929494,9649,True
1385025333723402240,SUCCESS,3868,1619050750,1619056186,5436,True
...,...,...,...,...,...,...
1381826019446493184,SUCCESS,3837,1618287973,1618292751,4778,True
1371068681810874368,SUCCESS,9782,1615723223,1615728027,4804,True
1377999575511470080,SUCCESS,12329,1617375678,1617380533,4855,True
1369750447904002048,FAILURE,6467331,1615408933,1615413675,4742,False


In [3]:
# Read the contents of each build log from the local cache
build_logs = []
for build_id in df.index:
    with open(log_path_for(build_id), "r") as f:
        build_logs.append(f.read())

# Train scikit-learn Pipeline

In [4]:
token_pattern = r"\b[a-z][a-z0-9_/\.-]+\b"
vectorizer = TfidfVectorizer(
    min_df=0.03,
    token_pattern=token_pattern,
)

k = 3
kmeans = KMeans(n_clusters=k, random_state=123)

pipeline = Pipeline([("tfidf", vectorizer), ("kmeans", kmeans)])
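
To see which tokens the `token_pattern` above keeps, here is a minimal sketch that uses only the standard `re` module. It assumes the vectorizer's default behaviour of lowercasing the text before applying the pattern; the sample log line is made up for illustration and does not come from the dataset.

```python
import re

# Same pattern as the one passed to TfidfVectorizer above
token_pattern = r"\b[a-z][a-z0-9_/\.-]+\b"

# Hypothetical log line, for illustration only
sample_line = "Pod openshift-etcd/etcd-1 failed at 12:30 with error_code 5"

# Lowercase first, then collect every match of the pattern
tokens = re.findall(token_pattern, sample_line.lower())
print(tokens)
# ['pod', 'openshift-etcd/etcd-1', 'failed', 'at', 'with', 'error_code']
```

Tokens must start with a letter and be at least two characters long, so purely numeric fragments such as timestamps are dropped, while path-like and dashed identifiers stay in one piece.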

In [None]:
pipeline.fit(build_logs)

# Save Pipeline

In [None]:
joblib.dump(pipeline, "model.joblib")

In [5]:
# Sanity check: load the saved model and predict on a slice of the logs
pipeline_loaded = joblib.load("model.joblib")
pipeline_loaded.predict(build_logs[50:75])

array([2, 2, 1, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 0, 2, 2, 0, 1,
       1, 1, 2], dtype=int32)
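
As a quick follow-up to the sanity check, one could count how many of the 25 sampled builds land in each cluster. The sketch below copies the labels from the output above into a plain list; `local_labels` is a name introduced here for illustration.

```python
from collections import Counter

# Labels copied from the prediction output above
local_labels = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 0, 2, 2, 0, 1,
                1, 1, 2]

# Count the builds assigned to each cluster, ordered by cluster id
cluster_sizes = dict(sorted(Counter(local_labels).items()))
print(cluster_sizes)
# {0: 8, 1: 8, 2: 9}
```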

In [6]:
# Read the S3 credentials from environment variables (loaded from .env above)
s3_endpoint_url = os.getenv("S3_ENDPOINT")
aws_access_key_id = os.getenv("S3_ACCESS_KEY")
aws_secret_access_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
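
`os.getenv` returns `None` for unset variables, so a missing entry in the `.env` file would only surface later as a connection or authentication error. A minimal sketch of an early check follows; the helper `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# Variable names used in the cell above
REQUIRED_VARS = ("S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```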

In [7]:
s3_resource = boto3.resource(
    "s3",
    endpoint_url=s3_endpoint_url,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
bucket = s3_resource.Bucket(name=s3_bucket)

In [8]:
# Upload your model
bucket.upload_file("model.joblib", "build-log-clustering/tfidf-kmeans/model.joblib")

# Check if your model exists on s3
objects = [
    obj.key for obj in bucket.objects.filter(Prefix="") if "model.joblib" in obj.key
]
objects

['build-log-clustering/tfidf-kmeans/model.joblib',
 'github/ttm-model/model.joblib',
 'github/ttm-model/pipeline/model.joblib',
 'model.joblib',
 'ocp-ci-analysis/models/build-log-classifier/model.joblib']

# Test Seldon deployment service
We use the deployment [config](seldon_deployment_config.yaml) to deploy a Seldon service.

In [9]:
# Service url
base_url = "http://build-log-clustering-opf-seldon.apps.zero.massopen.cloud/predict"

In [10]:
# Test set (same as locally checked model)
test_list = build_logs[50:75]

In [11]:
# wrap the list of raw log strings in the "ndarray" payload format expected by seldon
data = {"data": {"ndarray": test_list}}

# create the query payload
json_data = json.dumps(data)
headers = {"Content-Type": "application/json"}

# query our inference service
response = requests.post(base_url, data=json_data, headers=headers)
response

<Response [200]>

In [12]:
response.json()

{'data': {'names': [],
  'ndarray': [2,
   2,
   1,
   0,
   1,
   2,
   0,
   0,
   1,
   2,
   0,
   1,
   2,
   2,
   1,
   0,
   0,
   0,
   2,
   2,
   0,
   1,
   1,
   1,
   2]},
 'meta': {'requestPath': {'classifier': 'registry.connect.redhat.com/seldonio/sklearnserver@sha256:88d126455b150291cbb3772f67b4f35a88bb54b15ff7c879022f77fb051615ad'}}}
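
The labels returned by the service can be compared programmatically with the local predictions from the earlier sanity check. The sketch below assumes the response shape shown above (`data.ndarray`); the helpers `extract_cluster_labels` and `mismatches` are hypothetical names, and both label lists are copied from the outputs in this notebook.

```python
def extract_cluster_labels(response_json):
    """Return the cluster labels from a response shaped like the one above."""
    try:
        labels = response_json["data"]["ndarray"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no data.ndarray field") from exc
    return [int(label) for label in labels]


def mismatches(local_labels, remote_labels):
    """Return the positions where local and remote labels differ."""
    if len(local_labels) != len(remote_labels):
        raise ValueError("label lists have different lengths")
    return [i for i, (a, b) in enumerate(zip(local_labels, remote_labels)) if a != b]


# Labels from the local sanity check and from the service response above
local = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 0, 2, 2, 0, 1, 1, 1, 2]
remote_response = {"data": {"names": [], "ndarray": list(local)}, "meta": {}}

remote = extract_cluster_labels(remote_response)
print(mismatches(local, remote))
# []
```

In the notebook itself, `remote_response` would be replaced by `response.json()` and `local` by `pipeline_loaded.predict(test_list).tolist()`.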

# Conclusion
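In this notebook, we saw how to create and save an unsupervised model for clustering build logs. We deployed the model as a Seldon service on OpenShift, using S3 for model storage, and confirmed that the service returns the same cluster labels as the locally loaded model for the same 25 build logs.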