# AutoMLOps Pipeline - Using Kubeflow components

## Setup Git
This is a prerequisite for using `AutoMLOps.deploy()` with `use_ci=True`.

In [1]:
! git config --global user.email 'user@github.com'
! git config --global user.name 'username'

# Install AutoMLOps

Install AutoMLOps from [PyPI](https://pypi.org/project/google-cloud-automlops/), or locally by cloning the repo and running `pip install .`

In [None]:
! pip3 install google-cloud-automlops --user

# Restart the kernel
Once you've installed the AutoMLOps package, you need to restart the notebook kernel so it can find the package.

**Note: Once this cell has finished running, continue on. You do not need to re-run any of the cells above.**

In [114]:
import os

if not os.getenv('IS_TESTING'):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

### Install additional packages

In [2]:
! pip3 install --upgrade --quiet google-cloud-aiplatform \
                                 google-cloud-storage \
                                 'kfp<2' \
                                 'google-cloud-pipeline-components<2'

### Check the package versions
Check the versions of the packages you installed. The KFP SDK version should be >=1.8.

In [2]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"

KFP SDK version: 1.8.22
google_cloud_pipeline_components version: 1.0.44
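
The version requirement can also be checked programmatically. The following is a minimal sketch, assuming plain dotted version strings such as the one printed above; the helper names `parse_major_minor` and `kfp_version_supported` are hypothetical and not part of KFP or AutoMLOps. The upper bound mirrors the `'kfp<2'` pin used at install time.

```python
def parse_major_minor(version: str) -> tuple:
    """Return (major, minor) as integers from a dotted version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def kfp_version_supported(version: str) -> bool:
    """True when the version is at least 1.8 and below 2.0."""
    return (1, 8) <= parse_major_minor(version) < (2, 0)


# Version string taken from the cell output above
print(kfp_version_supported("1.8.22"))  # True
```

In the notebook itself you could pass `kfp.__version__` to the helper instead of a literal string.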


# Set your project ID
Set your project ID below. If you don't know your project ID, leave the field blank and the following cells may be able to find it.

In [3]:
PROJECT_ID = 'project-id'  # @param {type:"string"}
REGION = "us-central1"

In [4]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


### Create a Cloud Storage bucket

In [5]:
BUCKET_URI = "gs://bucket-uri"  # @param {type:"string"}

### Service Account

In [6]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [7]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    shell_output = !gcloud auth list 2>/dev/null
    SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    print("Service Account:", SERVICE_ACCOUNT)

Service Account: 662741782935-compute@developer.gserviceaccount.com
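
The cell above picks the account by a fixed line position (`shell_output[2]`), which depends on the exact layout of the `gcloud auth list` output. A slightly more defensive sketch is to look for the line marked with `*` instead. The function name `active_account` and the sample output below are illustrative assumptions, not captured from a real session.

```python
def active_account(lines):
    """Return the account on the line marked with '*', or None if no line is marked."""
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("*"):
            # Drop the marker and the padding that follows it
            return stripped.lstrip("*").strip()
    return None


# Hypothetical output lines, shaped like the table that `gcloud auth list` prints
sample_output = [
    "       Credentialed Accounts",
    "ACTIVE  ACCOUNT",
    "*       my-sa@my-project.iam.gserviceaccount.com",
]

print(active_account(sample_output))  # my-sa@my-project.iam.gserviceaccount.com
```

Returning `None` when no account is marked lets the caller decide how to handle a session with no active credentials.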


#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account permission to read and write pipeline artifacts in the bucket from the previous step. You only need to run them once per service account.

In [8]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

No changes made to gs://fmcc-custom-ml-experiments/
No changes made to gs://fmcc-custom-ml-experiments/


### Import AutoMLOps

In [9]:
from google_cloud_automlops import AutoMLOps

### Other imports

In [10]:
import kfp
from kfp.v2 import dsl
from kfp.v2.dsl import (Artifact, ClassificationMetrics, Input, Metrics,
                        Output, component, Dataset, Model, HTML, Markdown)


## Clear the cache
`AutoMLOps.clear_cache` will remove previous instantiations of AutoMLOps components and pipelines. Use this function if you have previously defined a component that you no longer need.

In [44]:
AutoMLOps.clear_cache()

Cache cleared.


#### Vertex AI constants

Set up the following constants for Vertex AI Pipelines:
- `PIPELINE_NAME`: Set name for the Pipeline.
- `PIPELINE_ROOT`: Cloud Storage bucket path to store pipeline artifacts.

In [45]:
# set path for storing the pipeline artifacts
PIPELINE_NAME = "automlops-pipeline"
PIPELINE_ROOT = "{}/pipeline_root/beans".format(BUCKET_URI)

## Define Kubeflow custom components
You must set `output_component_file` for each component. So that AutoMLOps knows where to find the Kubeflow component spec, set it to the string `f"{AutoMLOps.OUTPUT_DIR}/your_component_name.yaml"`.

In [46]:
@dsl.component(
    packages_to_install=["scorecardpy==0.1.9.6"],
    output_component_file = f'{AutoMLOps.OUTPUT_DIR}/credit_score_dataset.yaml'
)
def credit_score_dataset(
    project_id: str,
    dataset_train: Output[Dataset],
    dataset_test: Output[Dataset]
):
    import pandas as pd
    import scorecardpy as sc

    import logging

    # load germancredit data
    data = sc.germancredit()

    # filter variable via missing rate, iv, identical value rate
    dt_s = sc.var_filter(data, y="creditability")

    # breaking dt into train and test
    train, test = sc.split_df(dt_s, 'creditability').values()
    
    # woe binning ------
    bins = sc.woebin(dt_s, y="creditability")
    # sc.woebin_plot(bins)

    # binning adjustment
    breaks_adj = {
        'age.in.years': [26, 35, 40],
        'other.debtors.or.guarantors': ["none", "co-applicant%,%guarantor"]
    }
    bins_adj = sc.woebin(dt_s, y="creditability", breaks_list=breaks_adj)
    
    # converting train and test into woe values
    train_woe = sc.woebin_ply(train, bins_adj)
    test_woe = sc.woebin_ply(test, bins_adj)

    train_woe.to_csv(dataset_train.path, index=False)
    test_woe.to_csv(dataset_test.path, index=False)

In [47]:
@dsl.component(
    packages_to_install=["scorecardpy==0.1.9.6"],
    output_component_file = f'{AutoMLOps.OUTPUT_DIR}/model_train.yaml'
)
def model_train(
    dataset: Input[Dataset],
    model: Output[Artifact],
):
    import pandas as pd
    import pickle
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    train_woe = pd.read_csv(dataset.path)
    y_train = train_woe.loc[:,'creditability']
    X_train = train_woe.loc[:,train_woe.columns != 'creditability']

    model_pipeline =  LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1, random_state=42)

    model_pipeline.fit(X_train, y_train)

    model.metadata["framework"] = "scikit-learn"
    model.metadata["containerSpec"] = {
        "imageUri": "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    }

    file_name = model.path + "/model.pkl"
    import pathlib

    pathlib.Path(model.path).mkdir()
    with open(file_name, "wb") as file:
        pickle.dump(model_pipeline, file)
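
The training component writes the fitted model to `model.pkl` inside the directory given by `model.path`. The following is a minimal local sketch of that save-and-reload pattern, using a temporary directory in place of the artifact path and a plain dictionary in place of the fitted model. The helper name `save_pickle` is hypothetical. It uses `mkdir(parents=True, exist_ok=True)` so that a missing parent or an already existing directory does not raise.

```python
import pathlib
import pickle
import tempfile


def save_pickle(obj, artifact_dir, file_name="model.pkl"):
    """Pickle obj to artifact_dir/file_name, creating the directory if needed."""
    directory = pathlib.Path(artifact_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / file_name
    with open(target, "wb") as fh:
        pickle.dump(obj, fh)
    return target


# Stand-in for a fitted model; any picklable object works for the round trip
fake_model = {"coef": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_pickle(fake_model, pathlib.Path(tmp) / "model")
    with open(saved_path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == fake_model)  # True
```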

In [48]:
@dsl.component(
    packages_to_install=["scorecardpy==0.1.9.6"],
    output_component_file = f'{AutoMLOps.OUTPUT_DIR}/model_evaluate_metric.yaml'
)
def model_evaluate_metric(
    test_set: Input[Dataset],
    model: Input[Model],
    metrics: Output[Metrics],
) -> dict:
    import pandas as pd
    import pickle
    from sklearn.metrics import (roc_curve,
                                 confusion_matrix,
                                 accuracy_score,
                                 precision_score,
                                 recall_score,
                                 f1_score,
                                 log_loss,
                                 roc_auc_score,
                                 average_precision_score)
    
    data = pd.read_csv(test_set.path)
    file_name = model.path + "/model.pkl"
    with open(file_name, "rb") as file:
        model_pipeline = pickle.load(file)
    
    X=data.drop(columns=['creditability'])
    y=data.creditability

    y_pred = model_pipeline.predict(X)

    y_scores = model_pipeline.predict_proba(X)[:, 1]
    
    metrics.log_metric('Framework', 'scikit-learn')
    metrics.log_metric('Threshold','0.5000')
    metrics.log_metric('Precision', precision_score(y, y_pred))
    metrics.log_metric('Recall', recall_score(y, y_pred))
    metrics.log_metric('Accuracy', accuracy_score(y, y_pred))
    metrics.log_metric('F1 score', f1_score(y, y_pred))
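    # Log loss and AUC are computed from predicted probabilities, not thresholded labels
    metrics.log_metric('Log loss', log_loss(y, y_scores))
    metrics.log_metric('ROC AUC', roc_auc_score(y, y_scores))
    metrics.log_metric('ROC PR', average_precision_score(y, y_scores))
    
    output = {'auROC': roc_auc_score(y, y_scores)}
    print(output)
    # Return the dict declared in the function signature
    return output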
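
Ranking metrics such as ROC AUC are defined over continuous scores, so computing them from thresholded labels discards ordering information. The following pure-Python sketch illustrates the difference on a toy example. The function `roc_auc` is a hypothetical stand-in written for illustration (it is not scikit-learn's implementation), and the data is made up.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: every positive example scores higher than every negative one
y_true = [0, 0, 1, 1]
y_scores = [0.2, 0.3, 0.4, 0.9]
# Hard labels at a 0.5 threshold, as a classifier's predict() would return
y_pred = [1 if s >= 0.5 else 0 for s in y_scores]

auc_from_scores = roc_auc(y_true, y_scores)
auc_from_labels = roc_auc(y_true, y_pred)
print(auc_from_scores, auc_from_labels)  # 1.0 0.75
```

The scores rank the examples perfectly, but after thresholding one positive becomes indistinguishable from the negatives, which lowers the reported value.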

## Define pipeline 
Define your pipeline. You can optionally give the pipeline a name and description. Define the structure by listing the components to be called in your pipeline; use `.after` to specify the order of execution.
This pipeline trains the credit score model as a Vertex AI custom job, evaluates it, then uploads and deploys it using components from `google_cloud_pipeline_components`.

In [49]:
@AutoMLOps.pipeline(name="automlops-pipeline")
def pipeline(project: str,
             location: str,
             UUID: str
            ):
    from google_cloud_pipeline_components.v1.endpoint import EndpointCreateOp, ModelDeployOp
    from google_cloud_pipeline_components.v1.model import ModelUploadOp
    from google_cloud_pipeline_components.experimental.custom_job.utils import (
        create_custom_training_job_op_from_component,
    )
    
    data_op = credit_score_dataset(project_id=project)

    custom_job_distributed_training_op = create_custom_training_job_op_from_component(
        model_train,
        replica_count=1
    )

    model_train_op = custom_job_distributed_training_op(
        dataset=data_op.outputs["dataset_train"],
        project=project,
        location=location,
    ).after(data_op)

    model_evaluate_metric_op = model_evaluate_metric(
        test_set=data_op.outputs["dataset_test"],
        model=model_train_op.outputs["model"],
    ).after(model_train_op)
    
    # Sampled Shapley parameters
    parameters = {"sampled_shapley_attribution": {"path_count": 10}}
    
    # Explanation metadata
    COLUMNS = ['purpose_woe', 'installment_rate_in_percentage_of_disposable_income_woe', 'status_of_existing_checking_account_woe', 'housing_woe', 'credit_history_woe', 'savings_account_and_bonds_woe', 'duration_in_month_woe', 'present_employment_since_woe', 'age_in_years_woe', 'other_debtors_or_guarantors_woe', 'other_installment_plans_woe', 'property_woe', 'credit_amount_woe']

    metadata = {
    "inputs":{
        "features": {"index_feature_mapping": COLUMNS, "encoding": "BAG_OF_FEATURES"}
    },
    "outputs":{"creditability": {}}}

    model_upload_op = ModelUploadOp(
        project=project,
        location=location,
        display_name=f"german-credit-score-model-{UUID}",
        unmanaged_container_model=model_train_op.outputs["model"],
        explanation_parameters=parameters,
        explanation_metadata=metadata,
    ).after(model_train_op)

    endpoint_create_op = EndpointCreateOp(
        project=project,
        location=location,
        display_name=f"german-credit-score-endpoint-{UUID}",
    )

    ModelDeployOp(
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=f"german-credit-score-model-{UUID}",
        dedicated_resources_machine_type="n1-standard-4",
        dedicated_resources_min_replica_count=1,
        dedicated_resources_max_replica_count=1,
    ).after(model_upload_op)
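
The execution order implied by the data dependencies and `.after` calls above can be pictured as a dependency graph. The following sketch reproduces that graph with the standard-library `graphlib` module (Python 3.9+) and prints one valid execution order. It does not use KFP. The step names follow the variables in the pipeline, except `model_deploy_op`, which is a hypothetical label for the unnamed `ModelDeployOp` call.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps that must finish before it starts
dependencies = {
    "data_op": set(),
    "model_train_op": {"data_op"},
    "model_evaluate_metric_op": {"data_op", "model_train_op"},
    "model_upload_op": {"model_train_op"},
    "endpoint_create_op": set(),
    "model_deploy_op": {"endpoint_create_op", "model_upload_op"},
}

# static_order() yields the steps so that every dependency comes first
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Several orderings are valid, because steps without a dependency between them (for example `endpoint_create_op` and `data_op`) may run in either order or in parallel.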

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, generate a UUID for each session and append it to the names of the resources you create in this tutorial.

In [50]:
import random
import string

# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
UUID

'2igzh819'

## Define the Pipeline Arguments


In [51]:
pipeline_params = {
    "project": PROJECT_ID,
    "location": REGION,
    "UUID": UUID,
    #"vertex_experiment_tracking_name": "mlops-experiment-name"
}
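
The keys of `pipeline_params` need to line up with the parameter names of the pipeline function. The following is a small sketch of how that could be checked before generating code. It uses a stand-in function with the same parameter names, since the decorated `pipeline` object may not expose its original signature. Both `pipeline_stub` and `check_params` are hypothetical helpers, and the example values are placeholders.

```python
import inspect


def pipeline_stub(project: str, location: str, UUID: str):
    """Stand-in with the same parameter names as the pipeline defined above."""


def check_params(func, params):
    """Return (missing, unexpected) parameter names, each as a sorted list."""
    expected = set(inspect.signature(func).parameters)
    missing = sorted(expected - set(params))
    unexpected = sorted(set(params) - expected)
    return missing, unexpected


# Placeholder values in the same shape as pipeline_params
example_params = {"project": "project-id", "location": "us-central1", "UUID": "2igzh819"}

print(check_params(pipeline_stub, example_params))  # ([], [])
```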

## Generate and Run the pipeline
### AutoMLOps.generate(...) generates the MLOps codebase. Users can specify the tooling and technologies they would like to use in their MLOps pipeline.

In [52]:
AutoMLOps.generate(project_id=PROJECT_ID,
                   pipeline_params=pipeline_params,
                   use_ci=True,
                   naming_prefix="automlops-kfp",
                   schedule_pattern='59 11 * * 0' # retrain every Sunday at 11:59
)

Writing directories under AutoMLOps/
Writing configurations to AutoMLOps/configs/defaults.yaml
Writing README.md to AutoMLOps/README.md
Writing kubeflow pipelines code to AutoMLOps/pipelines, AutoMLOps/components
Writing scripts to AutoMLOps/scripts
Writing submission service code to AutoMLOps/services
Writing gcloud provisioning code to AutoMLOps/provision
Writing cloud build config to AutoMLOps/cloudbuild.yaml
Code Generation Complete.
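
The `schedule_pattern` value is written in the five-field cron style (minute, hour, day of month, month, day of week). The following sketch splits a pattern into named fields to make it easier to read. `describe_cron` is a hypothetical helper, not part of AutoMLOps, and it only labels the fields without validating their ranges.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(pattern: str) -> dict:
    """Label the five whitespace-separated fields of a cron pattern."""
    parts = pattern.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("59 11 * * 0")
print(fields["minute"], fields["hour"], fields["day_of_week"])  # 59 11 0
```

Read this way, `'59 11 * * 0'` fires at minute 59 of hour 11 on day-of-week 0 (Sunday). A pattern for midnight on Sunday would be `'0 0 * * 0'`.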


### AutoMLOps.provision(...) runs provisioning scripts to create and maintain necessary infra for MLOps.

In [53]:
AutoMLOps.provision(hide_warnings=False)

-iam.serviceAccounts.actAs
-storage.buckets.create
-artifactregistry.repositories.list
-pubsub.topics.list
-cloudbuild.builds.list
-pubsub.subscriptions.create
-cloudfunctions.functions.create
-cloudbuild.builds.create
-cloudscheduler.jobs.create
-artifactregistry.repositories.create
-serviceusage.services.enable
-source.repos.create
-pubsub.subscriptions.list
-pubsub.topics.create
-iam.serviceAccounts.create
-cloudfunctions.functions.get
-cloudscheduler.jobs.list
-serviceusage.services.use
-resourcemanager.projects.setIamPolicy
-source.repos.list
-storage.buckets.get
-iam.serviceAccounts.list

You are currently using: keshv@google.com. Please check your account permissions.
The following are the recommended roles for provisioning:
-roles/cloudscheduler.admin
-roles/serviceusage.serviceUsageAdmin
-roles/cloudfunctions.admin
-roles/resourcemanager.projectIamAdmin
-roles/storage.admin
-roles/source.admin
-roles/pubsub.editor
-roles/cloudbuild.builds.editor
-roles/iam.serviceAccountAdmin


### AutoMLOps.deploy(...) builds and pushes the component container, then triggers the pipeline job.

In [55]:
AutoMLOps.deploy(precheck=True,                     # precheck is optional, defaults to True
                 hide_warnings=False)               # hide_warnings is optional, defaults to True

-serviceusage.services.get
-iam.serviceAccounts.get
-cloudbuild.builds.get
-artifactregistry.repositories.get
-cloudfunctions.functions.get
-source.repos.update
-storage.buckets.update
-pubsub.topics.get
-resourcemanager.projects.getIamPolicy
-pubsub.subscriptions.get

You are currently using: keshv@google.com. Please check your account permissions.
The following are the recommended roles for deploying with precheck:
-roles/source.writer
-roles/artifactregistry.reader
-roles/serviceusage.serviceUsageViewer
-roles/storage.admin
-roles/iam.roleViewer
-roles/cloudbuild.builds.editor
-roles/cloudfunctions.viewer
-roles/iam.serviceAccountUser
-roles/pubsub.viewer

Checking for required API services in project fmcc-mlops...
Checking for Artifact Registry in project fmcc-mlops...
Checking for Storage Bucket in project fmcc-mlops...
Checking for Pipeline Runner Service Account in project fmcc-mlops...
Checking for IAM roles on Pipeline Runner Service Account in project fmcc-mlops...
Checking f

### AutoMLOps.deprovision(...): Runs provisioning scripts to tear down MLOps infra created using AutoMLOps

In [None]:
AutoMLOps.deprovision()

### AutoMLOps.launchAll(...): Runs "generate()", "provision()" and "deploy()" all in succession

In [None]:
AutoMLOps.launchAll(
    project_id=PROJECT_ID,
    pipeline_params=pipeline_params,
    use_ci=True,
    naming_prefix="automlops-kfp",
    schedule_pattern='59 11 * * 0' # retrain every Sunday at 11:59
)