# Guided Project 3

**Learning Objectives:**

* Learn how to customize the TFX template for your own dataset
* Learn how to modify the Keras model scaffold provided by `tfx template`

In this guided project, we will use the `tfx template` tool to create a TFX pipeline for the covertype project. This time, instead of reusing an already implemented model as we did in guided project 2, we will adapt the model scaffold generated by `tfx template` so that it can train on the covertype dataset.

**Note:** The covertype dataset is located at
```
gs://workshop-datasets/covertype/small/dataset.csv
```

In [1]:
import os

## Step 1. Environment setup

### Environment Variables

Set up your Kubeflow Pipelines endpoint below the same way you did in guided projects 1 and 2.

In [2]:
ENDPOINT = 'https://164002aec59f7c0d-dot-us-central1.pipelines.googleusercontent.com/'

In [3]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

env: PATH=/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin


In [4]:
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GOOGLE_CLOUD_PROJECT=shell_output[0]

%env GOOGLE_CLOUD_PROJECT={GOOGLE_CLOUD_PROJECT}

env: GOOGLE_CLOUD_PROJECT=qwiklabs-gcp-04-0ad772141888


In [5]:
# Docker image name for the pipeline image.
CUSTOM_TFX_IMAGE = 'gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'
CUSTOM_TFX_IMAGE

'gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline'

You may need to restart the kernel at this point.

### `skaffold` tool setup

In [6]:
%%bash

LOCAL_BIN="/home/jupyter/.local/bin"
SKAFFOLD_URI="https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64"

test -d $LOCAL_BIN || mkdir -p $LOCAL_BIN

which skaffold || (
    curl -Lo skaffold $SKAFFOLD_URI &&
    chmod +x skaffold               &&
    mv skaffold $LOCAL_BIN
)

/usr/local/bin/skaffold


The `PATH` environment variable was extended above to include `/home/jupyter/.local/bin`, so `skaffold` should now be available. You can confirm this with the `which` command:

In [7]:
!which skaffold

/usr/local/bin/skaffold
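
For use outside a notebook, the same `PATH` adjustment and lookup can be sketched in plain Python. The helper name `append_to_path` is hypothetical and not part of this project; `shutil.which` is the standard-library counterpart of the `which` command. Unlike the `%env` cell above, this sketch does not add the directory a second time when it is re-run.

```python
import os
import shutil


def append_to_path(path_value: str, new_dir: str) -> str:
    """Return path_value with new_dir appended, unless it is already listed."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)


LOCAL_BIN = "/home/jupyter/.local/bin"  # same directory as in the cells above
os.environ["PATH"] = append_to_path(os.environ.get("PATH", ""), LOCAL_BIN)

# shutil.which returns the full path of the executable, or None if it is not found.
print(shutil.which("skaffold"))
```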


## Step 2. Copy the predefined template to your project directory.

In this step, we will create a working pipeline project directory and 
files by copying additional files from a predefined template.

You may give your pipeline a different name by changing `PIPELINE_NAME` below.

This will also become the name of the project directory where your files will be put.

In [8]:
PIPELINE_NAME = 'guided_project_3' # Your pipeline name
PROJECT_DIR = os.path.join(os.path.expanduser("."), PIPELINE_NAME)
PROJECT_DIR

'./guided_project_3'

TFX includes the `taxi` template with the TFX Python package.

If you are planning to solve a point-wise prediction problem,
including classification and regression, this template can be used as a starting point.

The `tfx template copy` CLI command copies predefined template files into your project directory.

In [9]:
!tfx template copy \
  --pipeline-name={PIPELINE_NAME} \
  --destination-path={PROJECT_DIR} \
  --model=taxi

2021-11-12 03:10:06.340766: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-11-12 03:10:06.340814: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
CLI
Copying taxi pipeline template
model_analysis.ipynb -> ./guided_project_3/model_analysis.ipynb
pipeline.py -> ./guided_project_3/pipeline/pipeline.py
__init__.py -> ./guided_project_3/pipeline/__init__.py
configs.py -> ./guided_project_3/pipeline/configs.py
kubeflow_dag_runner.py -> ./guided_project_3/kubeflow_dag_runner.py
data_validation.ipynb -> ./guided_project_3/data_validation.ipynb
model.py -> ./guided_project_3/models/keras/model.py
constants.py -> ./guided_project_3/models/keras/constants.py
__init

In [10]:
%cd {PROJECT_DIR}

/home/jupyter/asl-ml-immersion/notebooks/tfx_pipelines/guided_projects/guided_project_3


## Step 3. Browse your copied source files

The TFX template provides basic scaffold files to build a pipeline, including Python source code,
sample data, and Jupyter Notebooks to analyse the output of the pipeline. 

The `taxi` template uses the same Chicago Taxi dataset and ML model as 
the [Airflow Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/airflow_workshop).

Here is a brief introduction to each of the Python files:

`pipeline` - This directory contains the definition of the pipeline
* `configs.py` — defines common constants for pipeline runners
* `pipeline.py` — defines TFX components and a pipeline

`models` - This directory contains ML model definitions.
* `features.py`, `features_test.py` — defines features for the model
* `preprocessing.py`, `preprocessing_test.py` — defines preprocessing jobs using tf::Transform

`models/estimator` - This directory contains an Estimator based model.
* `constants.py` — defines constants of the model
* `model.py`, `model_test.py` — defines DNN model using TF estimator

`models/keras` - This directory contains a Keras based model.
* `constants.py` — defines constants of the model
* `model.py`, `model_test.py` — defines DNN model using Keras

`beam_dag_runner.py`, `kubeflow_dag_runner.py` — define runners for each orchestration engine


**Running the tests:**
You might notice that there are some files with `_test.py` in their name.
These are unit tests of the pipeline, and it is recommended to add more unit
tests as you implement your own pipelines.
You can run unit tests by supplying the module name of the test file with the `-m` flag.
You can usually get a module name by deleting the `.py` extension and replacing `/` with `.`.

For example:

In [27]:
!python -m models.features_test

2021-11-12 05:10:02.571214: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-11-12 05:10:02.571270: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Running tests under Python 3.7.10: /opt/conda/bin/python
[ RUN      ] FeaturesTest.testNumberFeatures
INFO:tensorflow:time(__main__.FeaturesTest.testNumberFeatures): 0.0s
I1112 05:10:05.385713 140033962850112 test_util.py:1973] time(__main__.FeaturesTest.testNumberFeatures): 0.0s
[       OK ] FeaturesTest.testNumberFeatures
[ RUN      ] FeaturesTest.testTransformedName
INFO:tensorflow:time(__main__.FeaturesTest.testTransformedName): 0.0s
I1112 05:10:05.386065 140033962850112 test_util.py:1973] time(__main__.Fea

In [29]:
!python -m models.keras.model_test

2021-11-12 05:11:57.258555: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-11-12 05:11:57.258607: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Running tests under Python 3.7.10: /opt/conda/bin/python
[ RUN      ] ModelTest.testBuildKerasModel
INFO:tensorflow:time(__main__.ModelTest.testBuildKerasModel): 0.0s
I1112 05:12:00.623120 139826107111232 test_util.py:1973] time(__main__.ModelTest.testBuildKerasModel): 0.0s
[  FAILED  ] ModelTest.testBuildKerasModel
[ RUN      ] ModelTest.test_session
[  SKIPPED ] ModelTest.test_session
ERROR: testBuildKerasModel (__main__.ModelTest)
ModelTest.testBuildKerasModel
------------------------------------------------
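
The path-to-module rule described above can be written as a small helper. This is only an illustrative sketch; `path_to_module` is a hypothetical name, not something provided by TFX.

```python
from pathlib import PurePosixPath


def path_to_module(path: str) -> str:
    """Convert a relative path such as 'models/features_test.py' to a module name."""
    p = PurePosixPath(path)
    if p.suffix != ".py":
        raise ValueError(f"expected a .py file, got {path!r}")
    # Drop the extension, then join the path components with dots.
    return ".".join(p.with_suffix("").parts)


print(path_to_module("models/features_test.py"))     # models.features_test
print(path_to_module("models/keras/model_test.py"))  # models.keras.model_test
```

The result is the value passed to `python -m`, run from the project directory.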

## Step 4. Create the artifact store bucket

**Note:** You have probably already completed this step in guided project 1, so you
may skip it if this is the case.

Components in the TFX pipeline will generate outputs for each run as
[ML Metadata Artifacts](https://www.tensorflow.org/tfx/guide/mlmd), and they need to be stored somewhere.
You can use any storage which the KFP cluster can access, and for this example we
will use Google Cloud Storage (GCS).

Let us create this bucket if you haven't created it in guided project 1.
Its name will be `<YOUR_PROJECT>-kubeflowpipelines-default`.

In [30]:
GCS_BUCKET_NAME = GOOGLE_CLOUD_PROJECT + '-kubeflowpipelines-default'
GCS_BUCKET_NAME

'qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default'

In [31]:
!gsutil ls gs://{GCS_BUCKET_NAME} | grep {GCS_BUCKET_NAME} || gsutil mb gs://{GCS_BUCKET_NAME}

gs://qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default/data/
gs://qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default/jobs/
gs://qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default/staging/
gs://qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default/tfx-template/
gs://qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default/tfx_pipeline_output/
gs://qwiklabs-gcp-04-0ad772141888-kubeflowpipelines-default/tmp/
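
The listing shows a `tfx_pipeline_output/` prefix in the bucket. As a sketch, and assuming the generated `kubeflow_dag_runner.py` writes each pipeline's artifacts under `tfx_pipeline_output/<pipeline name>` (check the file to confirm), the artifact location for a project could be derived as follows. The helper `pipeline_root` and the example project ID are illustrative only.

```python
def pipeline_root(bucket_name: str, pipeline_name: str) -> str:
    """Build the assumed GCS location of a pipeline's artifacts."""
    # Assumption: artifacts live under gs://<bucket>/tfx_pipeline_output/<pipeline>.
    return f"gs://{bucket_name}/tfx_pipeline_output/{pipeline_name}"


bucket = "my-project" + "-kubeflowpipelines-default"  # hypothetical project ID
print(pipeline_root(bucket, "guided_project_3"))
```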


## Step 5. Customize the pipeline to your data

We made TFX pipelines for models using the Chicago Taxi dataset and the covertype dataset. Now it's time to put your own data into the pipeline.

Your data can be stored anywhere your pipeline can access, including GCS or BigQuery. You will need to modify the pipeline definition to access your data.

Review the steps in guided projects 1 and 2 for the full details of what needs to be customized. Below is a short summary of these steps:

1. If your data is stored in files, modify the `DATA_PATH` in `kubeflow_dag_runner.py` and set it to the location of your files. If your data is stored in BigQuery, modify `BIG_QUERY_QUERY` in `pipeline/configs.py` to correctly query for your data.
1. Add features in `models/features.py`
1. Modify `models/preprocessing.py` to [transform input data for training](https://www.tensorflow.org/tfx/guide/transform).
1. Modify `models/keras/model.py` and `models/keras/constants.py` to [describe your ML model](https://www.tensorflow.org/tfx/guide/trainer).
1. Modify the `pipeline.py` and `configs.py` so that you can train and deploy on CAIP
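
As a sketch of steps 1 and 2 for the covertype dataset, the edits might look like the following. The column names, the `NUM_CLASSES` value, the constant names, and the `_xf` suffix are assumptions based on the usual covertype schema and a common naming style; check them against the header of `dataset.csv` and the generated `models/features.py` before relying on them.

```python
# kubeflow_dag_runner.py (step 1): point the pipeline at the directory that
# holds the CSV file. CsvExampleGen takes a directory, not a single file.
DATA_PATH = "gs://workshop-datasets/covertype/small/"

# models/features.py (step 2): assumed covertype columns, to be verified
# against the CSV header.
NUMERIC_FEATURE_KEYS = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
CATEGORICAL_FEATURE_KEYS = ["Wilderness_Area", "Soil_Type"]
LABEL_KEY = "Cover_Type"
NUM_CLASSES = 7  # assumption: seven cover types


def transformed_name(key: str) -> str:
    """Name of a feature after preprocessing (the '_xf' suffix is an assumed convention)."""
    return key + "_xf"


def transformed_names(keys):
    """Apply transformed_name to every key in a list."""
    return [transformed_name(key) for key in keys]


print(transformed_names(CATEGORICAL_FEATURE_KEYS))
```

Because covertype is a multi-class problem, a scaffold model that ends in a single binary output would also need its output layer and loss adapted in `models/keras/model.py`.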

We suggest that you take a small sample of the data, and select columns that are easy to preprocess for the sake of time. Here are a few pointers for inspiration:

* [A small slice](https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?date_received_max=2020-11-26&date_received_min=2020-08-26&field=all&format=csv&no_aggs=true&size=119459) of the [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). (You'll probably still need to take only a subset of the rows and columns for the sake of fast model development.)
* The [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which has a number of very interesting datasets.

The easiest way is to create a small CSV file containing your dataset in JupyterLab, and then upload it to a Cloud Storage bucket. This way you can simply use `CsvExampleGen` to connect to your dataset.
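
One way to produce such a small file is to keep only a few columns and the first rows of a larger CSV. The sketch below uses only the standard library; `sample_csv` is a hypothetical helper, and the column names and values in the example are made up for illustration.

```python
import csv
import io


def sample_csv(src, dst, columns, max_rows):
    """Copy the given columns of the first max_rows rows from src to dst."""
    reader = csv.DictReader(src)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in source: {missing}")
    # extrasaction="ignore" silently drops the columns we did not ask for.
    writer = csv.DictWriter(dst, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    written = 0
    for row in reader:
        if written >= max_rows:
            break
        writer.writerow(row)
        written += 1
    return written


# Made-up data standing in for a larger file.
raw = "age,sex,chol,target\n63,1,233,1\n37,1,250,1\n41,0,204,0\n"
out = io.StringIO()
rows = sample_csv(io.StringIO(raw), out, ["age", "target"], max_rows=2)
print(rows)
print(out.getvalue())
```

With real files, pass handles from `open(path, newline="")` instead of `io.StringIO` objects.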

In [38]:
!gsutil cp data/heart.csv gs://{GCS_BUCKET_NAME}/data/heart/heart.csv

Copying file://data/heart.csv [Content-Type=text/csv]...
/ [1 files][ 11.1 KiB/ 11.1 KiB]                                                
Operation completed over 1 objects/11.1 KiB.                                     


In [45]:
!tfx pipeline create  \
--pipeline-path=kubeflow_dag_runner.py \
--endpoint={ENDPOINT} \
--build-target-image={CUSTOM_TFX_IMAGE}

2021-11-12 05:58:44.198074: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-11-12 05:58:44.198127: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
CLI
Creating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
Target image gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline is not used. If the build spec is provided, update the target image in the build spec file build.yaml.
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline -> gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/qwiklabs

## Update pipeline

In [48]:
!tfx pipeline update \
--pipeline-path=kubeflow_dag_runner.py \
--endpoint={ENDPOINT}

2021-11-12 06:13:33.060627: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-11-12 06:13:33.060680: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
CLI
Updating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline -> gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline: Not found. Building
[Skaffold] Starting build...
[Skaffold] Building [gcr.io/qwiklabs-gcp-04-0ad772141888/tfx-pipeline]...
[Skaffo

In [49]:
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}

2021-11-12 06:13:53.878946: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-11-12 06:13:53.879009: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
CLI
Creating a run for pipeline: guided_project_3
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Run created for pipeline: guided_project_3
+------------------+--------------------------------------+----------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| pipeline_name    | run_id                               | status   | created_at                | link         
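
The cells above create the pipeline once, update it after code changes, and then start a run. Outside a notebook, that sequence could be scripted roughly as follows. This is a sketch: the helper names are hypothetical, the endpoint and image values in the example are placeholders, and only the flags already used in the cells above appear here.

```python
import subprocess


def tfx_commands(pipeline_name, endpoint, image, first_time):
    """Return the argument lists for deploying the pipeline and starting a run."""
    if first_time:
        # The first deployment uses `create` and names the image to build.
        deploy = ["tfx", "pipeline", "create",
                  "--pipeline-path=kubeflow_dag_runner.py",
                  f"--endpoint={endpoint}",
                  f"--build-target-image={image}"]
    else:
        # Later deployments use `update`.
        deploy = ["tfx", "pipeline", "update",
                  "--pipeline-path=kubeflow_dag_runner.py",
                  f"--endpoint={endpoint}"]
    run = ["tfx", "run", "create",
           f"--pipeline-name={pipeline_name}",
           f"--endpoint={endpoint}"]
    return [deploy, run]


def deploy_and_run(pipeline_name, endpoint, image, first_time):
    """Execute the commands in order, stopping at the first failure."""
    for argv in tfx_commands(pipeline_name, endpoint, image, first_time):
        subprocess.run(argv, check=True)


# Print the commands instead of running them (placeholder values).
for argv in tfx_commands("guided_project_3", "https://example.invalid/",
                         "gcr.io/my-project/tfx-pipeline", first_time=False):
    print(" ".join(argv))
```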

## License

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.