# E2E scenario for Wine dataset on KFP

Steps:
- download dataset
- clean/preprocess data
- train the model and tune hyper-parameters, with results stored in MLflow + MinIO
- serve with Seldon
- perform inference

Artifacts:
- raw data, preprocessed
- model per experiment
- experiment metadata and results

## Tested with

This notebook has been tested with the following core component versions:

|                              |     **Charm**     | **Client** |                            **Image**                           |
|:----------------------------:|:-----------------:|:----------:|:--------------------------------------------------------------:|
| **Kubeflow Pipelines (KFP)** |      2.0/edge     |   1.8.22   |           gcr.io/ml-pipeline/api-server:2.0.0-alpha.7          |
|          **MLflow**          | latest/edge (2.1) |    2.1.1   |             docker.io/ubuntu/mlflow:2.1.1_1.0-22.04            |
|           **MinIO**          |    ckf-1.7/edge   |    6.0.2   |            minio/minio:RELEASE.2021-09-03T03-56-13Z            |
|          **Seldon**          |     1.15/edge     |     N/A    | docker.io/charmedkubeflow/seldon-core-operator:v1.15.0_22.04_1 |

## Setup

In [9]:
# pin kfp to the latest <2.0 version to ensure compatibility
# with the KFP API server version deployed in CKF 1.7
# pin the mlflow client to match the version of the deployed MLflow server
# pin scikit-learn to ensure compatibility with the installed mlflow client
!pip install boto3 kfp==1.8.22 minio mlflow==2.1.1 numpy pyarrow requests "scikit-learn<1.2" tenacity -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyterlab-server 2.21.0 requires jsonschema>=4.17.3, but you have jsonschema 3.2.0 which is incompatible.
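Because the install cell relies on exact pins, it can help to confirm what actually ended up in the kernel's environment. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def check_pins(pins: dict) -> dict:
    """Return a {package: problem} mapping for pins that are not satisfied.

    `pins` maps a distribution name to the exact version expected.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"expected {expected}, found {installed}"
    return problems


# Exact pins taken from the install cell above; an empty dict means all match.
print(check_pins({"kfp": "1.8.22", "mlflow": "2.1.1"}))
```

An empty result only says the pinned versions are present; it does not resolve conflicts such as the `jsonschema` one reported by pip.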

In [1]:
USER="kimonas-sotirchos"

## Download Data

In [2]:
DATA_URL = "https://raw.githubusercontent.com/canonical/kubeflow-examples/main/e2e-wine-kfp-mlflow/winequality-red.csv"
DATA_FILE = "winequality-red.csv"

In [4]:
import requests
from urllib import request

request.urlretrieve(DATA_URL, DATA_FILE)

print(f"File '{DATA_FILE}' downloaded successfully.")

File 'winequality-red.csv' downloaded successfully.
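`urlretrieve` reports success as long as something was written to disk, so a quick structural check on the file can catch a truncated or unexpected download. This is a sketch under assumptions: the helper `validate_csv` is hypothetical, and the expected header is taken from the column names shown in the DataFrame output of this notebook.

```python
import csv
from pathlib import Path

# Column names as they appear in the DataFrame output in this notebook.
EXPECTED_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]


def validate_csv(path: str, delimiter: str = ";") -> int:
    """Check the header row and return the number of non-empty data rows."""
    file = Path(path)
    if not file.is_file() or file.stat().st_size == 0:
        raise ValueError(f"{path} is missing or empty")
    with file.open(newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        if header != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected header: {header}")
        return sum(1 for row in reader if row)


# Usage in the notebook (assumes the download cell has run):
#   n_rows = validate_csv(DATA_FILE)
```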


## Preprocess Data

In [5]:
import pandas as pd

df = pd.read_csv(DATA_FILE, header=0, sep=";")
df.tail()

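|      | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| 1594 | 6.2 | 0.6 | 0.08 | 2.0 | 0.09 | 32.0 | 44.0 | 0.9949 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.55 | 0.1 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.51 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.31 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |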


In [6]:
df.columns = [c.lower().replace(" ", "_") for c in df.columns]
df.tail()

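|      | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| 1594 | 6.2 | 0.6 | 0.08 | 2.0 | 0.09 | 32.0 | 44.0 | 0.9949 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.55 | 0.1 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.51 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.31 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |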


In [7]:
# Let's put it all in a function
def preprocess(file_path: str, output_file: str):
    df = pd.read_csv(file_path, header=0, sep=";")
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]
    df.to_parquet(output_file)

In [10]:
# test out our function
OUTPUT_PARQUET_FILE = "preprocessed.parquet"
preprocess(DATA_FILE, OUTPUT_PARQUET_FILE)

## Train Model

In [12]:
import os
import pandas as pd

from sklearn.linear_model import ElasticNet
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [14]:
df = pd.read_parquet(OUTPUT_PARQUET_FILE)

# Let's split the data into train and test sets
target_column = "quality"
train_x, test_x, train_y, test_y = train_test_split(
    df.drop(columns=[target_column]),
    df[target_column],
    test_size=0.25,
    random_state=42,
    stratify=df[target_column],
)
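Passing `stratify` asks `train_test_split` to keep the proportion of each `quality` value similar in both splits. One way to confirm that is to compare class shares between the two label sets. The helpers below are a standard-library sketch with hypothetical names, shown on toy labels rather than the real data.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


def max_share_gap(a, b):
    """Largest absolute difference in class share between two label sets."""
    shares_a, shares_b = class_shares(a), class_shares(b)
    labels = set(shares_a) | set(shares_b)
    return max(abs(shares_a.get(l, 0.0) - shares_b.get(l, 0.0)) for l in labels)


# Toy labels: both sets are two-thirds 5s and one-third 6s.
toy_train = [5] * 6 + [6] * 3
toy_test = [5] * 2 + [6] * 1
print(class_shares(toy_train))
print(max_share_gap(toy_train, toy_test))

# In the notebook this could be called as: max_share_gap(train_y, test_y)
```

A gap close to zero suggests the split preserved the label distribution; a large gap would point at a missing or ineffective `stratify` argument.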

In [15]:
# Let's visualize the datasets
train_x.tail()

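|      | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| 980 | 9.1 | 0.5 | 0.3 | 1.9 | 0.065 | 8.0 | 17.0 | 0.99774 | 3.32 | 0.71 | 10.5 |
| 507 | 11.2 | 0.67 | 0.55 | 2.3 | 0.084 | 6.0 | 13.0 | 1.0 | 3.17 | 0.71 | 9.5 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 |
| 1166 | 9.9 | 0.54 | 0.26 | 2.0 | 0.111 | 7.0 | 60.0 | 0.99709 | 2.94 | 0.98 | 10.2 |
| 1326 | 6.7 | 0.46 | 0.24 | 1.7 | 0.077 | 18.0 | 34.0 | 0.9948 | 3.39 | 0.6 | 10.6 |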


In [16]:
train_y.tail()

980     6
507     6
1597    5
1166    5
1326    6
Name: quality, dtype: int64

In [17]:
# Train our model, based on ElasticNet
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
lr = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=42)
lr.fit(train_x, train_y)

ElasticNet(alpha=0.5, random_state=42)

In [18]:
lr.predict(test_x)

array([5.64384332, 5.33518038, 5.3776657 , 5.75525211, 5.63549828,
       5.82317551, 5.52714359, 5.53986943, 5.61361854, 5.78970537,
       5.57221271, 5.72522354, 5.55212992, 5.58137503, 5.40099386,
       5.59967448, 5.63568101, 5.58308066, 5.33051206, 5.72426242,
       5.63587221, 5.73609149, 5.71578696, 5.77902862, 5.63160484,
       5.65691267, 5.82434131, 5.60469979, 5.86448997, 5.77643135,
       5.94299876, 5.64282476, 5.64507859, 5.64635776, 5.86953222,
       5.61752546, 5.66372144, 5.55909088, 5.63907016, 5.43194115,
       5.81002332, 5.65957084, 5.72312203, 5.673432  , 5.73288507,
       5.86448997, 5.6403798 , 5.61722089, 5.65669946, 5.38664534,
       5.84962368, 5.35439342, 5.55522296, 5.60235454, 5.54984567,
       5.319705  , 5.7437226 , 5.73661773, 5.85973029, 5.8234022 ,
       5.62036636, 5.65422402, 5.68825088, 5.57846477, 5.63882648,
       5.56204521, 5.63000762, 5.70436581, 5.64781966, 5.65505483,
       5.82729564, 5.7934075 , 5.58372028, 5.7979371 , 5.61667
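`ElasticNet` is a regression model, so the predictions above are continuous values rather than integer quality scores. Metrics such as RMSE and MAE fit that output directly, whereas a classification report would need discrete labels. The sketch below computes both metrics with the standard library and shows one way to map predictions back to integer scores. The function names and the toy numbers are illustrative, and the 0 to 10 score range is an assumption about the dataset; `sklearn.metrics` offers `mean_squared_error` and `mean_absolute_error` for real use.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def to_quality_scores(y_pred, low=0, high=10):
    """Round predictions to integers and clip them to [low, high].

    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [min(high, max(low, round(p))) for p in y_pred]


# Toy values for illustration only (not results from the notebook).
toy_true = [5, 6, 5]
toy_pred = [5.6, 5.3, 5.4]
print(rmse(toy_true, toy_pred), mae(toy_true, toy_pred))
print(to_quality_scores(toy_pred))  # [6, 5, 5]

# In the notebook: rmse(list(test_y), list(lr.predict(test_x)))
```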

## Artifact Tracking

In [19]:
import mlflow

RUN_NAME = USER + "-notebook-elastic-net"
MODEL_DIR = USER + "-notebook-experimentation-model"
REGISTERED_MODEL = USER + "-notebook-wine-elasticnet"

In [21]:
# Runs are logged to the default MLflow experiment.
# Browse them in the MLflow UI at http://18.185.52.103/mlflow/
mlflow.sklearn.autolog()
with mlflow.start_run(run_name=RUN_NAME) as run:
    mlflow.set_tag("author", USER)
    
    lr = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=42)
    lr.fit(train_x, train_y)
    
    mlflow.sklearn.log_model(lr, MODEL_DIR, registered_model_name=REGISTERED_MODEL)
    print(f"{run.info.artifact_uri}/{MODEL_DIR}")

Registered model 'kimonas-sotirchos-notebook-wine-elasticnet' already exists. Creating a new version of this model...
2023/11/04 11:03:24 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: kimonas-sotirchos-notebook-wine-elasticnet, version 3


s3://mlflow/0/e1c0d8a3eb0a44a3be6f598305b69db6/artifacts/kimonas-sotirchos-notebook-experimentation-model


Created version '3' of model 'kimonas-sotirchos-notebook-wine-elasticnet'.
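The run above leaves two ways to address the model: the artifact path printed by the cell, and the registered model name plus version. A small sketch of both URI forms follows; the helper names are hypothetical, while `models:/<name>/<version>` is the MLflow Model Registry URI scheme.

```python
def run_model_uri(artifact_uri: str, model_dir: str) -> str:
    """URI of a model logged under a run, in the form printed by the cell above."""
    return f"{artifact_uri.rstrip('/')}/{model_dir}"


def registry_model_uri(name: str, version: int) -> str:
    """Model Registry URI in the `models:/<name>/<version>` form."""
    return f"models:/{name}/{version}"


uri = registry_model_uri("kimonas-sotirchos-notebook-wine-elasticnet", 3)
print(uri)  # models:/kimonas-sotirchos-notebook-wine-elasticnet/3

# With a reachable tracking server, the model could then be loaded with:
#   import mlflow
#   model = mlflow.pyfunc.load_model(uri)
```

The deployment step later in this notebook uses the run-scoped `s3://` form, since that is the path passed to Seldon as `modelUri`.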


## Deploy Model

In [22]:
def deploy(
    seldon_deployment_name: str = "default_seldon_deployment_name",
    seldon_image: str = "default_seldon_image",
    model_uri: str = "default_model_uri",
    model_name: str = "default_model_name",
):
    import yaml

    from kubernetes import client, config

    manifest = """
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: """ + seldon_deployment_name + """
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: """ + seldon_image + """
          imagePullPolicy: Always
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: """ + model_uri + """
      envSecretRefName: mlflow-server-seldon-rclone-secret
      name: classifier
    name: """ + model_name + """
    replicas: 1
    """

    with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
        namespace = f.read().strip()

    config.load_incluster_config()
    api_instance = client.ApiClient()
    custom_api = client.CustomObjectsApi(api_instance)

    try:
        api_response = custom_api.create_namespaced_custom_object(
            group="machinelearning.seldon.io",
            version="v1",
            plural="seldondeployments",
            namespace=namespace,
            body=yaml.safe_load(manifest),
        )
        print("Custom Resource applied successfully.")
        print(api_response)
    except client.rest.ApiException as e:
        print(f"Failed to apply Custom Resource: {e}")
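`deploy` builds the manifest by splicing strings into YAML text, so a value containing YAML syntax or unusual characters could change the parsed document. An alternative is to build the same structure as a Python dict and skip the YAML round-trip. The sketch below mirrors the fields of the manifest above; `build_seldon_manifest` is a hypothetical helper, not part of the original notebook.

```python
import copy


def build_seldon_manifest(deployment_name, image, model_uri, model_name):
    """Build the SeldonDeployment body as a dict, mirroring the YAML above."""
    probe = {
        "initialDelaySeconds": 80,
        "failureThreshold": 200,
        "periodSeconds": 5,
        "successThreshold": 1,
        "httpGet": {"path": "/health/ping", "port": "http", "scheme": "HTTP"},
    }
    container = {
        "name": "classifier",
        "image": image,
        "imagePullPolicy": "Always",
        # Separate copies so later edits to one probe do not affect the other.
        "livenessProbe": copy.deepcopy(probe),
        "readinessProbe": copy.deepcopy(probe),
    }
    predictor = {
        "componentSpecs": [{"spec": {"containers": [container]}}],
        "graph": {
            "children": [],
            "implementation": "MLFLOW_SERVER",
            "modelUri": model_uri,
            "envSecretRefName": "mlflow-server-seldon-rclone-secret",
            "name": "classifier",
        },
        "name": model_name,
        "replicas": 1,
    }
    return {
        "apiVersion": "machinelearning.seldon.io/v1",
        "kind": "SeldonDeployment",
        "metadata": {"name": deployment_name},
        "spec": {"name": "wines", "predictors": [predictor]},
    }
```

Since `deploy` already passes a dict (the result of `yaml.safe_load`) as `body=`, the returned dict could be passed to `create_namespaced_custom_object` in the same way.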

In [24]:
# Let's try to deploy our model
SELDON_DEPLOYMENT_NAME = USER + "notebook-wine"
SELDON_IMAGE = "seldonio/mlflowserver:1.17.0"
MODEL_NAME = "wine-model"

run = mlflow.last_active_run()
MODEL_URI = f"{run.info.artifact_uri}/{MODEL_DIR}"

In [30]:
deploy(SELDON_DEPLOYMENT_NAME, SELDON_IMAGE, MODEL_URI, MODEL_NAME)

Custom Resource applied successfully.
{'apiVersion': 'machinelearning.seldon.io/v1', 'kind': 'SeldonDeployment', 'metadata': {'creationTimestamp': '2023-11-04T11:08:12Z', 'generation': 1, 'managedFields': [{'apiVersion': 'machinelearning.seldon.io/v1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:spec': {'.': {}, 'f:name': {}, 'f:predictors': {}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2023-11-04T11:08:12Z'}], 'name': 'kimonas-sotirchosnotebook-wine', 'namespace': 'admin', 'resourceVersion': '1411313', 'uid': 'e2e389cd-0549-4531-a1cd-1870ebab4f26'}, 'spec': {'name': 'wines', 'predictors': [{'componentSpecs': [{'spec': {'containers': [{'image': 'seldonio/mlflowserver:1.17.0', 'imagePullPolicy': 'Always', 'livenessProbe': {'failureThreshold': 200, 'httpGet': {'path': '/health/ping', 'port': 'http', 'scheme': 'HTTP'}, 'initialDelaySeconds': 80, 'periodSeconds': 5, 'successThreshold': 1}, 'name': 'classifier', 'readinessProbe': {'failureThreshold': 200, 'httpGet': {'

### Setup K8s Client

In [31]:
from kubernetes import client as k8s_client, config as k8s_config
from kubernetes.client.exceptions import ApiException

with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
    NAMESPACE = f.read().strip()

k8s_config.load_incluster_config()
api_instance = k8s_client.ApiClient()
custom_api = k8s_client.CustomObjectsApi(api_instance)

### Define K8s Helpers

In [32]:
def get_seldon_deployment(name, namespace):
    """Get SeldonDeployment by name."""
    return custom_api.get_namespaced_custom_object(
        group="machinelearning.seldon.io",
        version="v1",
        plural="seldondeployments",
        namespace=namespace,
        name=name,
    )

def delete_seldon_deployment(name, namespace):
    """Delete SeldonDeployment by name."""
    return custom_api.delete_namespaced_custom_object(
        group="machinelearning.seldon.io",
        version="v1",
        plural="seldondeployments",
        namespace=namespace,
        name=name,
    )

In [42]:
# Check the status of the deployment
get_seldon_deployment(SELDON_DEPLOYMENT_NAME, NAMESPACE)["status"]

{'address': {'url': 'http://kimonas-sotirchosnotebook-wine-wine-model.admin.svc.cluster.local:8000/api/v1.0/predictions'},
 'conditions': [{'lastTransitionTime': '2023-11-04T11:11:19Z',
   'reason': 'No Ambassador Mappaings defined',
   'status': 'True',
   'type': 'AmbassadorMappingsReady'},
  {'lastTransitionTime': '2023-11-04T11:11:19Z',
   'message': 'Deployment has minimum availability.',
   'reason': 'MinimumReplicasAvailable',
   'status': 'True',
   'type': 'DeploymentsReady'},
  {'lastTransitionTime': '2023-11-04T11:08:12Z',
   'reason': 'No HPAs defined',
   'status': 'True',
   'type': 'HpasReady'},
  {'lastTransitionTime': '2023-11-04T11:08:12Z',
   'reason': 'No KEDA resources defined',
   'status': 'True',
   'type': 'KedaReady'},
  {'lastTransitionTime': '2023-11-04T11:08:12Z',
   'reason': 'No PDBs defined',
   'status': 'True',
   'type': 'PdbsReady'},
  {'lastTransitionTime': '2023-11-04T11:11:19Z',
   'status': 'True',
   'type': 'Ready'},
  {'lastTransitionTime': '2

## Perform Inference

Wait for the SeldonDeployment to become available, then send it a prediction request.
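The waiting step can be automated by polling the status until a condition of type `Ready` reports `True`, as seen in the status output earlier. This is a minimal sketch with hypothetical helper names; the timeout and interval defaults are arbitrary choices, and the status source is injected as a callable so the loop can be exercised without a cluster.

```python
import time


def is_ready(status: dict) -> bool:
    """True if the status has a condition of type 'Ready' with status 'True'."""
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )


def wait_until_ready(fetch_status, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll `fetch_status()` until it reports Ready or `timeout` seconds pass.

    Returns True when Ready was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready(fetch_status()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# In the notebook, the status could be fetched with the helper defined earlier:
#   wait_until_ready(
#       lambda: get_seldon_deployment(SELDON_DEPLOYMENT_NAME, NAMESPACE).get("status", {})
#   )
```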

### Hit SeldonDeployment for Predictions

In [43]:
sd = get_seldon_deployment(SELDON_DEPLOYMENT_NAME, NAMESPACE)
url = sd['status']['address']['url']
print("SeldonDeployment URL:", url)

SeldonDeployment URL: http://kimonas-sotirchosnotebook-wine-wine-model.admin.svc.cluster.local:8000/api/v1.0/predictions


In [44]:
# prepare the data to send
inference_input = {
  "data": {
      "ndarray": [
          [
              10.1, 0.37, 0.34, 2.4, 0.085, 5.0, 17.0, 0.99683, 3.17, 0.65, 10.6
          ]
      ]
  }
}
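The `ndarray` payload carries bare numbers, so the order of values has to match the feature columns the model was trained on. Building the payload from named features makes that order explicit. In this sketch, `FEATURE_ORDER` follows the `train_x` columns shown earlier, the mapping of the sample values to those names assumes the row above is in that same order, and `build_payload` is a hypothetical helper.

```python
# Column order of train_x, as shown in the training section.
FEATURE_ORDER = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
    "ph", "sulphates", "alcohol",
]


def build_payload(rows):
    """Turn a list of {feature: value} dicts into a Seldon `ndarray` payload."""
    ndarray = []
    for row in rows:
        missing = [name for name in FEATURE_ORDER if name not in row]
        if missing:
            raise KeyError(f"missing features: {missing}")
        ndarray.append([float(row[name]) for name in FEATURE_ORDER])
    return {"data": {"ndarray": ndarray}}


sample = {
    "fixed_acidity": 10.1, "volatile_acidity": 0.37, "citric_acid": 0.34,
    "residual_sugar": 2.4, "chlorides": 0.085, "free_sulfur_dioxide": 5.0,
    "total_sulfur_dioxide": 17.0, "density": 0.99683, "ph": 3.17,
    "sulphates": 0.65, "alcohol": 10.6,
}
print(build_payload([sample]))
```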

In [45]:
# make the request
response = requests.post(url, json=inference_input)
print(response.text)

{"data":{"names":[],"ndarray":[5.737135502528464]},"meta":{"requestPath":{"classifier":"seldonio/mlflowserver:1.17.0"}}}
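The response is JSON with the predictions under `data.ndarray`, so they can be pulled out as a Python list instead of being read off the printed text. The sketch below parses the exact response shown above; `extract_predictions` is a hypothetical helper.

```python
import json


def extract_predictions(response_text: str) -> list:
    """Return the `data.ndarray` list from a Seldon prediction response."""
    body = json.loads(response_text)
    return body["data"]["ndarray"]


# Response text copied from the output above.
sample_response = (
    '{"data":{"names":[],"ndarray":[5.737135502528464]},'
    '"meta":{"requestPath":{"classifier":"seldonio/mlflowserver:1.17.0"}}}'
)
print(extract_predictions(sample_response))

# In the notebook, calling response.raise_for_status() before parsing
# response.text would surface HTTP errors early.
```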



## Delete SeldonDeployment

In [46]:
delete_seldon_deployment(SELDON_DEPLOYMENT_NAME, NAMESPACE)

{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'kimonas-sotirchosnotebook-wine',
  'group': 'machinelearning.seldon.io',
  'kind': 'seldondeployments',
  'uid': 'e2e389cd-0549-4531-a1cd-1870ebab4f26'}}
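Re-running the delete cell after the resource is gone would raise an API error. A cleanup step that tolerates "not found" makes the notebook safe to re-run. The sketch below wraps any delete callable and treats an exception whose `status` attribute is 404 as "already deleted"; the kubernetes client's `ApiException` exposes the HTTP code that way. The wrapper name `delete_if_exists` is hypothetical.

```python
def delete_if_exists(delete, name, namespace):
    """Call `delete(name, namespace)`.

    Returns True if the call succeeded, False if the object was already gone
    (an exception with `status == 404`). Any other exception is re-raised.
    """
    try:
        delete(name, namespace)
    except Exception as exc:
        if getattr(exc, "status", None) == 404:
            return False
        raise
    return True


# In the notebook:
#   delete_if_exists(delete_seldon_deployment, SELDON_DEPLOYMENT_NAME, NAMESPACE)
```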