<a id='top'></a>
<a name="top"></a><!--Need for Colab-->
# Create a TFX pipeline using tfx template

## Beam Orchestrator

1. Introduction
2. Setup
    * 2.1 Setup project structure
3. Check previous pipelines
4. Delete pipeline in the given orchestrator
5. Check available model templates
6. Copy predefined template to project directory
7. Browse the copied source files.
8. Workaround for missing beam_runner.py file
9. Configuration of project
    * 9.1 project_root/pipeline/pipeline.py
    * 9.2 project_root/pipeline/configs.py
    * 9.3 Configure project_root/{orchestrator}_runner.py
    * 9.4 Examine project_root/{ORCHESTRATOR_FILE}
10. Create a new pipeline in the given orchestrator.
11. Create a new run for a pipeline (tfx run create)
12. Add components and update the pipeline, creating new artifacts.
    * 12.1 Add components for data validation
    * 12.2 Run new pipeline instance (tfx run create)
    * 12.3 Add components for training






---
<a id="1.0"></a><a name="1.0"></a>
# 1. Introduction
<a href="#top">[back to top]</a>

1. This is a heavily-annotated version of the original Google tutorial, "Create a TFX pipeline using templates with Beam orchestrator". This project builds a pipeline using the Taxi Trips dataset released by the City of Chicago.

2. The main command-group options to remember are:
* `tfx pipeline` - Create and manage TFX pipelines.
* `tfx run` - Create and manage runs of TFX pipelines on various orchestration platforms.
* `tfx template` - Experimental commands for listing and copying TFX pipeline templates.


3. We define a runner to actually run the pipeline. This serves as the entrypoint to this project.

**Notes**

* TFX CLI uses the KFP (Kubeflow Pipelines) SDK underneath. 

**Resources**

* [Create a TFX pipeline using templates with Beam orchestrator](https://www.tensorflow.org/tfx/tutorials/tfx/template_beam)
* [Using the TFX Command-line Interface](https://github.com/tensorflow/tfx/blob/master/docs/guide/cli.md)
* [Create a TFX pipeline using templates with Local orchestrator](https://www.tensorflow.org/tfx/tutorials/tfx/template_local)
* [Create a TFX pipeline using templates](https://www.tensorflow.org/tfx/tutorials/tfx/template)
* [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew)
* [KFP(Kubeflow Pipelines) SDK](https://github.com/tensorflow/tfx/issues/5020)

**Orchestrators**

* [Beam](https://www.tensorflow.org/tfx/tutorials/tfx/template_beam)
* [Kubeflow on Google Cloud](https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines)
* [Vertex AI example](https://www.tensorflow.org/tfx/tutorials/tfx/gcp/vertex_pipelines_vertex_training)
* [Google Cloud AI example](https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines)


---
# 2. Setup

In [1]:
import sys

# Need if running on Colab or Kaggle
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules
if IS_COLAB or IS_KAGGLE:
    #!{sys.executable} -m pip install --upgrade "tfx<2" &> /dev/null
    !pip install --upgrade tfx &> /dev/null
    !apt-get install tree 
    print()
    print("Need to restart runtime on Colab")

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 20 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tree amd64 1.7.0-5 [40.7 kB]
Fetched 40.7 kB in 0s (297 kB/s)
Selecting previously unselected package tree.
(Reading database ... 155676 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...

Need to restart runtime on Colab


In [1]:
import os
import pathlib
from pathlib import Path
import pprint
import subprocess 
import sys
import time
from time import process_time

pp = pprint.PrettyPrinter(indent=4)

DEBUG = True

# Need to reinit after runtime restart
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

print("Finished with imports.")

Finished with imports.


<a name="2.1"></a>
## 2.1 Setup project structure
<a href="#top">[back to top]</a>

Organize the project as `pipeline_{engine}_{template}{version}`, for example `pipeline_beam_penguin_01`.
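As a rough sketch of the naming scheme, the derived names can be collected in one helper. The function name `project_layout` and the `SUPPORTED_ENGINES` tuple are illustrative and not part of TFX; the formulas mirror the setup cell in this section.

```python
from pathlib import Path

# Engines exercised in this notebook (an assumption of this sketch).
SUPPORTED_ENGINES = ("beam", "local")


def project_layout(engine: str, template: str, version: str = "") -> dict:
    """Derive the names used in this notebook from engine, template and version."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    pipeline_name = f"pipeline_{engine}_{template}{version}"
    project_dir = Path(pipeline_name).resolve()
    return {
        "pipeline_name": pipeline_name,
        "project_dir": project_dir,
        "orchestrator_file": project_dir / f"{engine}_runner.py",
        "pipeline_file": project_dir / "pipeline" / "pipeline.py",
        "output_artifacts_dir": f"tfx_artifacts_{engine}_{template}{version}",
    }


layout = project_layout("beam", "penguin", "_01")
print(layout["pipeline_name"])  # pipeline_beam_penguin_01
```

Raising on an unknown engine avoids the situation where the runner path is silently left undefined.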

In [2]:
# Run these examples:
# beam, taxi, 01  (DONE)
# beam, penguin, 01 (DONE)
# local, taxi, 01 (DONE)
# local, penguin, 01

######

# 1. Define engine [beam, local]
ENGINE = 'beam'

# 2. Define template [taxi, penguin]
TEMPLATE = "penguin"

# If needed, add a version suffix, such as _01
VERSION = '_01'

#####

# 3. Define the project structure
PIPELINE_NAME=f"pipeline_{ENGINE}_{TEMPLATE}{VERSION}"

# Create a project directory with absolute path
PROJECT_DIR = Path(PIPELINE_NAME).resolve()

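# 4. Create shortcut to the runner file [beam_runner.py, local_runner.py]
if ENGINE == 'beam':
    ORCHESTRATOR_FILE = f'{PROJECT_DIR}/beam_runner.py'
elif ENGINE == 'local':
    ORCHESTRATOR_FILE = f'{PROJECT_DIR}/local_runner.py'
else:
    # Fail early instead of leaving ORCHESTRATOR_FILE undefined
    raise ValueError(f"Unsupported ENGINE: {ENGINE}")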

# Create shortcut to pipeline.py
PIPELINE_FILE = f"{PROJECT_DIR}/pipeline/pipeline.py"

# Create output dir for artifacts
OUTPUT_ARTIFACTS_DIR = f"tfx_artifacts_{ENGINE}_{TEMPLATE}{VERSION}"

def HR():
    print("-"*40)
    
current_dir = !pwd
print(f"Current dir:\t{current_dir[0]}")
print(f"PIPELINE_NAME:\t{PIPELINE_NAME}")
print(f"PROJECT_DIR:\t{PROJECT_DIR}")
print(f"PIPELINE_FILE:\t{PIPELINE_FILE}")
print(f"ORCHESTRATOR_FILE:\t{ORCHESTRATOR_FILE}")
print(f"OUTPUT_ARTIFACTS_DIR:\t{OUTPUT_ARTIFACTS_DIR}")

Current dir:	/content
PIPELINE_NAME:	pipeline_beam_penguin_01
PROJECT_DIR:	/content/pipeline_beam_penguin_01
PIPELINE_FILE:	/content/pipeline_beam_penguin_01/pipeline/pipeline.py
ORCHESTRATOR_FILE:	/content/pipeline_beam_penguin_01/beam_runner.py
OUTPUT_ARTIFACTS_DIR:	tfx_artifacts_beam_penguin_01


In [3]:
# Uncomment the lines below if you need to clean up
# !rm -fr {PROJECT_DIR}
# !rm -fr {OUTPUT_ARTIFACTS_DIR}

# print("Done cleaning up project_dir")

---
## 3. Check previous pipelines

To avoid naming conflicts, first check any previous pipeline projects.

In [4]:
# Lists all the pipelines in the given orchestrator.
# https://github.com/tensorflow/tfx/blob/master/tfx/tools/cli/commands/pipeline.py#L278
print(f"Listing pipelines in the {ENGINE} orchestrator:")
HR()
!tfx pipeline list --engine={ENGINE}

Listing pipelines in the beam orchestrator:
----------------------------------------
CLI
Listing all pipelines
No pipelines to display.
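To check for a naming conflict programmatically, the CLI output can be parsed. This is a minimal sketch that assumes the output format printed by the cells in this notebook (a `Listing all pipelines` header, then either `No pipelines to display.` or names framed by dashed lines); the helper `parse_pipeline_list` is hypothetical and the format may differ in other TFX versions.

```python
def parse_pipeline_list(cli_output: str) -> list:
    """Extract pipeline names from `tfx pipeline list` output as shown in this notebook."""
    names = []
    seen_header = False
    for line in cli_output.splitlines():
        text = line.strip()
        if text == "Listing all pipelines":
            seen_header = True
            continue
        if not seen_header or not text:
            continue
        # Skip the dashed separator lines and the "empty" message.
        if set(text) == {"-"} or text == "No pipelines to display.":
            continue
        names.append(text)
    return names


sample = "CLI\nListing all pipelines\nNo pipelines to display.\n"
print(parse_pipeline_list(sample))  # []
```

In a notebook, the output could be captured with `lines = !tfx pipeline list --engine={ENGINE}` and passed in as `"\n".join(lines)`.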


---
## 4. Delete pipeline in the given orchestrator

Even if you delete the project folder, the pipeline is still registered with the orchestrator, so you also have to delete it there if you want to rebuild it.


In [5]:
# https://github.com/tensorflow/tfx/blob/master/tfx/tools/cli/commands/pipeline.py#L245
# print(f"Deleting pipeline {PIPELINE_NAME} in the {ENGINE} orchestrator:")
# HR()
# !tfx pipeline delete --pipeline-name={PIPELINE_NAME} --engine={ENGINE}
# For example,
# !tfx pipeline delete --pipeline-name=pipeline_beam_taxi_01 --engine=beam

## 5. Check available model templates

In [6]:
# https://github.com/tensorflow/tfx/blob/master/tfx/tools/cli/commands/template.py#L32
if DEBUG:
    !tfx template list

CLI
Available templates:
- taxi
- penguin


## 6. Copy predefined template to project directory


Copy a template to the destination directory. At this step, we still do not specify which orchestrator to use.

Usage:
    
```bash
tfx template copy
    --model=model
    --pipeline_name=pipeline-name
    --destination_path=destination-path
```

[Reference](https://github.com/tensorflow/tfx/blob/master/docs/guide/cli.md#copy)

---

This step should create these files:

```
├── data
│   └── data.csv
├── kubeflow_runner.py
├── local_runner.py
├── models
│   ├── constants.py
│   ├── features.py
│   ├── features_test.py
│   ├── model.py
│   ├── model_test.py
│   ├── preprocessing.py
│   └── preprocessing_test.py
└── pipeline
    ├── configs.py
    └── pipeline.py

3 directories, 12 files
```
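A quick way to confirm the copy succeeded is to compare the project directory against the expected listing. The sketch below hard-codes the twelve files from the tree above (the penguin template, with `__init__.py` files ignored); `missing_template_files` is an illustrative helper, not a TFX function.

```python
from pathlib import Path

# Files from the expected tree above (penguin template).
EXPECTED_FILES = [
    "data/data.csv",
    "kubeflow_runner.py",
    "local_runner.py",
    "models/constants.py",
    "models/features.py",
    "models/features_test.py",
    "models/model.py",
    "models/model_test.py",
    "models/preprocessing.py",
    "models/preprocessing_test.py",
    "pipeline/configs.py",
    "pipeline/pipeline.py",
]


def missing_template_files(project_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

After the copy step, `missing_template_files(PROJECT_DIR)` should return an empty list if the template was copied completely.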

In [7]:
print(f"Copying predefined '{TEMPLATE}' template to project directory:")

# https://github.com/tensorflow/tfx/blob/master/tfx/tools/cli/commands/template.py#L59
!tfx template copy \
    --model={TEMPLATE} \
    --pipeline_name={PIPELINE_NAME} \
    --destination_path={PROJECT_DIR} &> /dev/null

print("Done.")

HR()

proc = subprocess.Popen(
    ["tree", PROJECT_DIR, "-I", "__pycache__|__init__.py"], 
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

tree_01_template_copy = proc.communicate()[0]
print(tree_01_template_copy)

Copying predefined 'penguin' template to project directory:
Done.
----------------------------------------
/content/pipeline_beam_penguin_01
├── data
│   └── data.csv
├── kubeflow_runner.py
├── local_runner.py
├── models
│   ├── constants.py
│   ├── features.py
│   ├── features_test.py
│   ├── model.py
│   ├── model_test.py
│   ├── preprocessing.py
│   └── preprocessing_test.py
└── pipeline
    ├── configs.py
    └── pipeline.py

3 directories, 12 files



---
Note that we have not yet registered this pipeline with the orchestrator.

In [8]:
#  Check if pipeline has been registered in orchestrator or not
if DEBUG:
    !tfx pipeline list --engine={ENGINE}

CLI
Listing all pipelines
No pipelines to display.


## 7. Browse the copied source files.

-   `pipeline` - This directory contains the definition of the pipeline
    -   `configs.py` — defines common constants for pipeline runners
    -   `pipeline.py` — defines TFX components and a pipeline
-   `models` - This directory contains ML model definitions.
    -   `features.py`, `features_test.py` — defines features for the model
    -   `preprocessing.py`, `preprocessing_test.py` — defines preprocessing
        jobs using `tf.Transform`
    -   `estimator` - This directory contains an Estimator based model.
        -   `constants.py` — defines constants of the model
        -   `model.py`, `model_test.py` — defines DNN model using TF estimator
    -   `keras` - This directory contains a Keras based model.
        -   `constants.py` — defines constants of the model
        -   `model.py`, `model_test.py` — defines DNN model using Keras
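-   `local_runner.py`, `kubeflow_runner.py` - define runners for each orchestration engine

Note: the tree printed in section 6 for the penguin template shows `constants.py`, `model.py`, and `model_test.py` directly under `models`, with no `estimator` or `keras` subdirectories, and it contains no `beam_runner.py`.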


---
## 8. Workaround for missing beam_runner.py file

The file beam_runner.py is not generated by the template, so we implement it here.

Reference:

* https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_native_keras.py

In [9]:
# Transform and Model
_beam_runner_file = f"{PROJECT_DIR}/beam_runner.py"

In [10]:
%%writefile {_beam_runner_file}
# This file is written from jupyter notebook
# Copyright 2022 George Baptista
# Copyright 2020 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Define BeamDagRunner to run the pipeline."""

import os
from absl import logging

from tfx import v1 as tfx
from pipeline import configs
from pipeline import pipeline

from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# TFX pipeline produces many output files and metadata. All output data will be
# stored under this OUTPUT_DIR.
# NOTE: It is recommended to have a separated OUTPUT_DIR which is *outside* of
#       the source code structure. Please change OUTPUT_DIR to other location
#       where we can store outputs of the pipeline.
OUTPUT_DIR = '.'

# TFX produces two types of outputs, files and metadata.
# - Files will be created under PIPELINE_ROOT directory.
# - Metadata will be written to SQLite database in METADATA_PATH.
PIPELINE_ROOT = os.path.join(OUTPUT_DIR, 'tfx_pipeline_output',
                             configs.PIPELINE_NAME)
METADATA_PATH = os.path.join(OUTPUT_DIR, 'tfx_metadata', configs.PIPELINE_NAME,
                             'metadata.db')

# The last component of the pipeline, "Pusher" will produce serving model under
# SERVING_MODEL_DIR.
SERVING_MODEL_DIR = os.path.join(PIPELINE_ROOT, 'serving_model')

# Specifies data file directory. DATA_PATH should be a directory containing CSV
# files for CsvExampleGen in this example. By default, data files are in the
# `data` directory.
DATA_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data')


def run():
    """Define a pipeline."""

    #tfx.orchestration.BeamDagRunner().run(
    BeamDagRunner().run(
        pipeline.create_pipeline(
            pipeline_name=configs.PIPELINE_NAME,
            pipeline_root=PIPELINE_ROOT,
            data_path=DATA_PATH,
            preprocessing_fn=configs.PREPROCESSING_FN,
            run_fn=configs.RUN_FN,
            train_args=tfx.proto.TrainArgs(num_steps=configs.TRAIN_NUM_STEPS),
            eval_args=tfx.proto.EvalArgs(num_steps=configs.EVAL_NUM_STEPS),
            eval_accuracy_threshold=configs.EVAL_ACCURACY_THRESHOLD,
            serving_model_dir=SERVING_MODEL_DIR,
            metadata_connection_config=tfx.orchestration.metadata
            .sqlite_metadata_connection_config(METADATA_PATH)))


if __name__ == '__main__':
    logging.set_verbosity(logging.INFO)
    run()

Writing /content/pipeline_beam_penguin_01/beam_runner.py
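The runner derives three locations from `OUTPUT_DIR` and the pipeline name. The sketch below restates that derivation as a standalone function so the resulting paths can be inspected without importing TFX; `output_paths` is an illustrative name, and the path components are copied from the runner file above.

```python
import os


def output_paths(output_dir: str, pipeline_name: str) -> dict:
    """Mirror how the runner builds PIPELINE_ROOT, METADATA_PATH and SERVING_MODEL_DIR."""
    pipeline_root = os.path.join(output_dir, "tfx_pipeline_output", pipeline_name)
    metadata_path = os.path.join(
        output_dir, "tfx_metadata", pipeline_name, "metadata.db"
    )
    serving_model_dir = os.path.join(pipeline_root, "serving_model")
    return {
        "pipeline_root": pipeline_root,
        "metadata_path": metadata_path,
        "serving_model_dir": serving_model_dir,
    }


for key, value in output_paths(".", "pipeline_beam_penguin_01").items():
    print(f"{key}: {value}")
```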


---
## 9. Configuration of project

### 9.1 project_root/pipeline/pipeline.py

**Cache setting**

Every time we create a pipeline run, every component runs again, even when its inputs and parameters have not changed.
This is a waste of time and resources, and pipeline caching lets you skip those executions. You can enable caching by specifying `enable_cache=True` for the `Pipeline` object in `pipeline.py`.

Note:

Previously, this step relied on the following command, where `-e` marks the next argument as the script for sed to execute:

```bash
sed -i -e 's/\# enable_cache=True/enable_cache=True/'
```

However, on macOS the `-i` option takes the next argument as the backup-file suffix, so it consumes `-e` and leaves a backup file ending in `py-e`. To fix this, drop `-e` and pass an explicit empty suffix instead:

```bash
sed -i '' 's/\# enable_cache=True/enable_cache=True/' {PIPELINE_FILE}
```
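A portable alternative that sidesteps the `sed -i` differences is to do the edit in Python. This is only a sketch: `uncomment_statement` and `uncomment_in_file` are hypothetical helpers, and unlike the sed expression they only match a commented statement at the start of a line (after indentation).

```python
import re
from pathlib import Path


def uncomment_statement(source: str, statement: str) -> str:
    """Remove the leading '# ' from lines that start with the commented statement."""
    pattern = re.compile(
        rf"^([ \t]*)#[ \t]?({re.escape(statement)})", flags=re.MULTILINE
    )
    return pattern.sub(r"\1\2", source)


def uncomment_in_file(path, statement: str) -> bool:
    """Apply uncomment_statement to a file; return True if the file changed."""
    file_path = Path(path)
    original = file_path.read_text()
    updated = uncomment_statement(original, statement)
    if updated != original:
        file_path.write_text(updated)
    return updated != original
```

For this step the call would be `uncomment_in_file(PIPELINE_FILE, "enable_cache=True")`. Running it twice is harmless because an already uncommented line no longer matches.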

In [11]:
from sys import platform

if platform == "darwin":
    print("OSX")
    OSX_syntax = "''"
else:
    OSX_syntax = ''

print(OSX_syntax)




In [12]:
!sed -i {OSX_syntax} 's/\# enable_cache=True/enable_cache=True/' {PIPELINE_FILE}

print(Path(PIPELINE_FILE).name)
# Check changes
!grep -n 'enable_cache=True' {PIPELINE_FILE} | tr -s " "

pipeline.py
153: enable_cache=True,


### 9.2 project_root/pipeline/configs.py

1. Comment out the `google.auth` code in `pipeline/configs.py` (unless you are actually using it).

2. After commenting out that block, add a placeholder value for `GOOGLE_CLOUD_PROJECT` in its place.

We can do this programmatically with the sed tool.

Resources:
* https://github.com/sharkdp/bat
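The two line-based edits described in this section (prefix a block of lines, then overwrite one line) can also be sketched in plain Python. The helpers `comment_out_lines` and `replace_line` are illustrative names; the `##### ` prefix matches the marker used in the cell below, and line numbers are 1-indexed as in sed.

```python
def comment_out_lines(source: str, first: int, last: int, prefix: str = "##### ") -> str:
    """Prefix lines first..last (1-indexed, inclusive) unless they already carry the prefix."""
    lines = source.splitlines(keepends=True)
    for index in range(first - 1, min(last, len(lines))):
        if not lines[index].startswith(prefix):
            lines[index] = prefix + lines[index]
    return "".join(lines)


def replace_line(source: str, number: int, new_text: str) -> str:
    """Replace line `number` (1-indexed) with new_text, keeping its line ending."""
    lines = source.splitlines(keepends=True)
    ending = "\n" if lines[number - 1].endswith("\n") else ""
    lines[number - 1] = new_text + ending
    return "".join(lines)


demo = "a\nb\nc\nd\n"
print(comment_out_lines(demo, 2, 3))
```

Because each line is checked for the prefix individually, re-running the edit does not stack prefixes.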

In [13]:
# The relevant code is on lines 30-37
# 
# try:
#   import google.auth  # pylint: disable=g-import-not-at-top  # pytype: disable=import-error
#   try:
#     _, GOOGLE_CLOUD_PROJECT = google.auth.default()
#   except google.auth.exceptions.DefaultCredentialsError:
#     GOOGLE_CLOUD_PROJECT = ''
# except ImportError:
#   GOOGLE_CLOUD_PROJECT = ''

# Possible cause of stray .py-e files:
# on macOS, `sed -i -e` treats `-e` as the backup suffix for -i, so the backup is written as <filename>-e.
# https://serverfault.com/questions/939762/update-all-python-files-via-linux-command

config_file = f'{PROJECT_DIR}/pipeline/configs.py'

# Use -i option to edit the original file in-place
already_commented = !grep "#####" {config_file}
if not already_commented:
    # Since we know the exact location, simply use line numbers to comment out block
    !sed -i {OSX_syntax} '30,37 s/^/##### /' {config_file}

# Need dummy value for GOOGLE_CLOUD_PROJECT
# sed on OS X requires the extension to be explicitly specified. 
# The workaround is to set an empty string.
!sed -i {OSX_syntax} "38 s/.*/GOOGLE_CLOUD_PROJECT='placeholder'/" {config_file}

#print("Affected lines in source file:\n")
# print selected lines, with line numbering
#!sed '30,40!d;=' {config_file} | sed 'N;s/\n/ /'

if IS_COLAB:
    !cat {config_file}
else:
    !bat --theme=GitHub --color=always --wrap never {config_file}


# Copyright 2020 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""TFX penguin template configurations.

This file defines environments for a TFX penguin pipeline.
"""

import os  # pylint: disable=unused-import

# TODO(b/149347293): Move more TFX CLI flags into python configuration.

# Pipeline name will be used to identify this pipeline.
PIPELINE_NAME = 'pipeline_beam_penguin_01'

# GCP related configs.

# Following code will retrieve your GCP project. You can c

### 9.3 Configure project_root/{orchestrator}_runner.py

1. Set `OUTPUT_DIR` to the value of `OUTPUT_ARTIFACTS_DIR` (here `tfx_artifacts_beam_penguin_01`). This directory will hold both `tfx_metadata` and `tfx_pipeline_output`.

In [14]:
# It is clearer to use interpolation when passing arguments that contain many quote characters
source = "OUTPUT_DIR = '.'"
target = f"OUTPUT_DIR = '{OUTPUT_ARTIFACTS_DIR}'"

!sed -i {OSX_syntax} "s/$source/$target/" {ORCHESTRATOR_FILE}

# Show line number, use tr -s option to replace instances of repeated chars with a single char. 
# https://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html#tag_20_132_04

print(Path(ORCHESTRATOR_FILE).name)
!grep -n "$target" {ORCHESTRATOR_FILE} | tr -s " "

beam_runner.py
33:OUTPUT_DIR = 'tfx_artifacts_beam_penguin_01'


---
### 9.4 Examine project_root/{ORCHESTRATOR_FILE}

In [15]:
# Print the runner file that was edited in section 9.3
if IS_COLAB:
    !cat {ORCHESTRATOR_FILE}
else:
    !bat --theme=GitHub --color=always --wrap never {ORCHESTRATOR_FILE}

---
## 10. Create a new pipeline in the given orchestrator.

`tfx pipeline`

Usage:

```shell
tfx pipeline create --pipeline_path=pipeline-path
    [
    --endpoint=endpoint
    --engine=engine
    --iap_client_id=iap-client-id
    --namespace=namespace
    --build_image
    --build_base_image=build-base-image
    ]
```

---

Next, create a new pipeline with `tfx pipeline create`.

This registers the pipeline defined in the runner file (here `beam_runner.py`) with the orchestrator without running it, and creates the empty folder `tfx_metadata`. It does not start a run or produce artifacts; that happens in section 11 with `tfx run create`.

Components in the TFX pipeline generate outputs for each run, and they need to be stored somewhere. As the comments in the runner file explain, files are written under `tfx_pipeline_output` and ML Metadata is written to a SQLite database under `tfx_metadata`.

This step should create these files:

```
tfx_artifacts_beam_penguin_01
└── tfx_metadata
    └── pipeline_beam_penguin_01
```

Resources:

* https://github.com/tensorflow/tfx/blob/master/tfx/tools/cli/commands/run.py#L79  
* https://github.com/tensorflow/tfx/blob/master/tfx/tools/cli/commands/pipeline.py#L123
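When calling the CLI from Python rather than with `!`, it can help to build the argument list explicitly. The helper `tfx_argv` below is a hypothetical convenience, not part of TFX; it only formats the flags that this notebook already uses (`--engine`, `--pipeline_path`, `--pipeline_name`) and does not run anything.

```python
import shlex


def tfx_argv(group: str, action: str, **flags) -> list:
    """Build an argument list such as ['tfx', 'pipeline', 'create', '--engine=beam']."""
    argv = ["tfx", group, action]
    for name, value in flags.items():
        argv.append(f"--{name}={value}")
    return argv


argv = tfx_argv("pipeline", "create", engine="beam", pipeline_path="beam_runner.py")
print(shlex.join(argv))
# tfx pipeline create --engine=beam --pipeline_path=beam_runner.py
```

The list can then be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues when paths contain spaces.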

In [16]:
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner
print(BeamDagRunner)

<class 'tfx.orchestration.beam.beam_dag_runner.BeamDagRunner'>


In [17]:
# Note that there is no name parameter here. Instead, this
# implicitly uses the value defined in pipeline/configs.py:
# PIPELINE_NAME = 'pipeline_beam_penguin_01'
# So, we have to make sure there is no naming conflict with other pipelines.
print("Create a new pipeline in the given orchestrator.")
!tfx pipeline create --engine={ENGINE} --pipeline_path={ORCHESTRATOR_FILE}
print("Done.")

Create a new pipeline in the given orchestrator.
CLI
Creating pipeline
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
Pipeline "pipeline_beam_penguin_01" created successfully.
Done.


In [18]:
proc = subprocess.Popen(
    ["tree", OUTPUT_ARTIFACTS_DIR, "-I", "__pycache__|__init__.py"], 
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)
tree_00_tfx_artifacts = proc.communicate()[0]
print(tree_00_tfx_artifacts)

tfx_artifacts_beam_penguin_01
└── tfx_metadata
    └── pipeline_beam_penguin_01

2 directories, 0 files



In [19]:
# Note that we have now registered this pipeline with the orchestrator:
if DEBUG:
    !tfx pipeline list --engine={ENGINE}

CLI
Listing all pipelines
------------------------------
pipeline_beam_penguin_01
------------------------------


In [20]:
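# Check whether any files were added to the project directory after the template copy.
if DEBUG:
    proc = subprocess.Popen(
        ["tree", PROJECT_DIR, "-I", "__pycache__|__init__.py"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True
    )
    tree_02_pipeline_create = proc.communicate()[0]
    # beam_runner.py was written after the first snapshot, so a difference is expected here.
    if tree_01_template_copy != tree_02_pipeline_create:
        print("files not the same")

# `tfx pipeline create --engine={ENGINE} --pipeline_path={ORCHESTRATOR_FILE}`
# does not create any artifacts, but instead registers with the appropriate orchestrator.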

files not the same


**Note** The runner can also be executed directly with `python {ORCHESTRATOR_FILE}`. Section 12.3 times both approaches to see whether this is any faster.

## 11. Create a new run for a pipeline (tfx run create)

Execute the created pipeline using `tfx run create` command.

This creates a new run instance for a pipeline in the orchestrator. In other words, this  is when we actually run the TFX components defined in our project.

This run creates the following files:

```
tfx_artifacts_beam_penguin_01
├── tfx_metadata
│   └── pipeline_beam_penguin_01
│       └── metadata.db
└── tfx_pipeline_output
    └── pipeline_beam_penguin_01
        ├── CsvExampleGen
        │   └── examples
        │       └── 1
        │           ├── Split-eval
        │           │   └── data_tfrecord-00000-of-00001.gz
        │           └── Split-train
        │               └── data_tfrecord-00000-of-00001.gz
        ├── SchemaGen
        │   └── schema
        │       └── 3
        │           └── schema.pbtxt
        └── StatisticsGen
            └── statistics
                └── 2
                    ├── Split-eval
                    │   └── FeatureStats.pb
                    └── Split-train
                        └── FeatureStats.pb

17 directories, 6 files
```
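The listing above follows a `Component/output/id` layout. As a small sketch, the numbered directories can be collected per component output with `pathlib`; `artifact_ids` is an illustrative helper and it assumes only the directory shape shown in the tree.

```python
from pathlib import Path


def artifact_ids(pipeline_root) -> dict:
    """Map 'Component/output' to the sorted numeric directory names found beneath it."""
    result = {}
    root = Path(pipeline_root)
    for component_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for output_dir in sorted(p for p in component_dir.iterdir() if p.is_dir()):
            ids = sorted(
                int(p.name)
                for p in output_dir.iterdir()
                if p.is_dir() and p.name.isdigit()
            )
            if ids:
                result[f"{component_dir.name}/{output_dir.name}"] = ids
    return result
```

Pointed at the `tfx_pipeline_output/<pipeline name>` directory, this gives a compact view of which outputs gained new numbered directories after each run.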

In [21]:
# Start an execution run with the newly created pipeline.
# At this point the pipeline contains three TFX components.
print(f"Create a new run instance for the pipeline in the {ENGINE} orchestrator:")
!tfx run create --engine={ENGINE} --pipeline_name={PIPELINE_NAME} &> /dev/null
print("Finished.")

Create a new run instance for the pipeline in the beam orchestrator:
Finished.


In [22]:
proc = subprocess.Popen(
    ["tree", OUTPUT_ARTIFACTS_DIR, "-I", "__pycache__|__init__.py"], 
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

tree_01_tfx_artifacts = proc.communicate()[0]
print(tree_01_tfx_artifacts)

tfx_artifacts_beam_penguin_01
├── tfx_metadata
│   └── pipeline_beam_penguin_01
│       └── metadata.db
└── tfx_pipeline_output
    └── pipeline_beam_penguin_01
        ├── CsvExampleGen
        │   └── examples
        │       └── 1
        │           ├── Split-eval
        │           │   └── data_tfrecord-00000-of-00001.gz
        │           └── Split-train
        │               └── data_tfrecord-00000-of-00001.gz
        ├── SchemaGen
        │   └── schema
        │       └── 3
        │           └── schema.pbtxt
        └── StatisticsGen
            └── statistics
                └── 2
                    ├── Split-eval
                    │   └── FeatureStats.pb
                    └── Split-train
                        └── FeatureStats.pb

17 directories, 6 files
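The `metadata.db` file in this listing is the SQLite database that the runner comments describe. A hedged way to peek inside it without assuming anything about the ML Metadata schema is to list its tables through `sqlite_master`. The helper `list_tables` is illustrative; it opens the file read-only so that a wrong path raises an error instead of creating an empty database.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path) -> list:
    """Return the table names in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

For this notebook the path would be `Path(OUTPUT_ARTIFACTS_DIR) / "tfx_metadata" / PIPELINE_NAME / "metadata.db"`.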



---

## 12. Add components and update the pipeline, creating new artifacts.

---

### 12.1 Add components for data validation

In this step, you will add components for data validation, including `StatisticsGen`, `SchemaGen`, and `ExampleValidator`. If you are interested in data validation, please see "Get started with TensorFlow Data Validation".

**NOTE**

Make these changes in  `{project_root}/pipeline/pipeline.py`

```python
components.append(example_gen) # already added

components.append(statistics_gen)
components.append(schema_gen)
components.append(example_validator)

```
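Before updating the pipeline, it can be useful to confirm which `components.append(...)` lines are active. The sketch below scans the text of `pipeline.py` with a regular expression; `component_status` is a hypothetical helper and assumes each append call sits on its own line, as in the snippet above.

```python
import re

# Matches an optional leading '#' followed by components.append(<name>).
_APPEND = re.compile(
    r"^[ \t]*(#[ \t]*)?components\.append\((\w+)\)", flags=re.MULTILINE
)


def component_status(source: str) -> dict:
    """Return {component_name: True if at least one append line is uncommented}."""
    status = {}
    for match in _APPEND.finditer(source):
        name = match.group(2)
        active = match.group(1) is None
        status[name] = status.get(name, False) or active
    return status


demo = "  components.append(example_gen)\n  # components.append(statistics_gen)\n"
print(component_status(demo))
# {'example_gen': True, 'statistics_gen': False}
```

In the notebook, `component_status(Path(PIPELINE_FILE).read_text())` would report the state of the real file.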


In [23]:
!sed -i {OSX_syntax} 's/\# components.append(example_gen)/components.append(example_gen)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(statistics_gen)/components.append(statistics_gen)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(schema_gen)/components.append(schema_gen)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(example_validator)/components.append(example_validator)/' {PIPELINE_FILE}

# Show line number, use tr -s option to replace instances of repeated chars with a single char. 
# https://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html#tag_20_132_04
!grep -n 'components.append(example_gen)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(statistics_gen)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(schema_gen)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(example_validator)' {PIPELINE_FILE} | tr -s " "

50: components.append(example_gen)
55: components.append(statistics_gen)
61: components.append(schema_gen)
65: components.append(schema_gen)
71: components.append(example_validator)


In [24]:
# Update the pipeline with the modified pipeline definition.
# How does this know which pipeline to update? Through the runner file passed as `--pipeline_path`.
print("Updating pipeline:")
!tfx pipeline update --engine={ENGINE} --pipeline_path={ORCHESTRATOR_FILE}
print("Done.")

Updating pipeline:
CLI
Updating pipeline
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
Pipeline "pipeline_beam_penguin_01" updated successfully.
Done.


### 12.2 Run new pipeline instance (tfx run create)

In [25]:
# Execute another run of the updated pipeline to create artifacts
print("Running new pipeline instance:")
!tfx run create --engine={ENGINE} --pipeline_name {PIPELINE_NAME} &> /dev/null
print("Done.")

Running new pipeline instance:
Done.


In [26]:
proc = subprocess.Popen(
    ["tree", OUTPUT_ARTIFACTS_DIR, "-I", "__pycache__|__init__.py"], 
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

tree_02_tfx_artifacts = proc.communicate()[0]
print(tree_02_tfx_artifacts)

tfx_artifacts_beam_penguin_01
├── tfx_metadata
│   └── pipeline_beam_penguin_01
│       └── metadata.db
└── tfx_pipeline_output
    └── pipeline_beam_penguin_01
        ├── CsvExampleGen
        │   └── examples
        │       └── 1
        │           ├── Split-eval
        │           │   └── data_tfrecord-00000-of-00001.gz
        │           └── Split-train
        │               └── data_tfrecord-00000-of-00001.gz
        ├── SchemaGen
        │   └── schema
        │       └── 3
        │           └── schema.pbtxt
        └── StatisticsGen
            └── statistics
                └── 2
                    ├── Split-eval
                    │   └── FeatureStats.pb
                    └── Split-train
                        └── FeatureStats.pb

17 directories, 6 files



---
### 12.3 Add components for training

In this step, you will add components for training and model validation including `Transform`, `Trainer`, `Resolver`, `Evaluator`, and `Pusher`.

In [27]:
# Uncomment these components
!sed -i {OSX_syntax} 's/\# components.append(transform)/components.append(transform)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(trainer)/components.append(trainer)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(model_resolver)/components.append(model_resolver)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(evaluator)/components.append(evaluator)/' {PIPELINE_FILE}
!sed -i {OSX_syntax} 's/\# components.append(pusher)/components.append(pusher)/' {PIPELINE_FILE}

# Check the changes
!grep -n 'components.append(transform)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(trainer)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(model_resolver)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(evaluator)' {PIPELINE_FILE} | tr -s " "
!grep -n 'components.append(pusher)' {PIPELINE_FILE} | tr -s " "

79: components.append(transform)
92: components.append(trainer)
102: components.append(model_resolver)
135: components.append(evaluator)
145: components.append(pusher)


In [28]:
# Update the pipeline with the modified pipeline definition.
print("Updating pipeline:")
#!tfx pipeline update --engine={ENGINE} --pipeline_path={ORCHESTRATOR_FILE} &> /dev/null
!tfx pipeline update --engine={ENGINE} --pipeline_path={ORCHESTRATOR_FILE}
print("Done.")

Updating pipeline:
CLI
Updating pipeline
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
Pipeline "pipeline_beam_penguin_01" updated successfully.
Done.


In [29]:
%%time
tic = time.perf_counter()
t1_start = process_time() 

# Execute another run of the updated pipeline to create artifacts
!tfx run create --engine={ENGINE} --pipeline_name {PIPELINE_NAME} &> /dev/null

toc = time.perf_counter()
print(f"Elapsed time {toc - tic:0.4f} seconds")
HR()

t1_stop = process_time()
print(f"CPU process_time: {(t1_stop-t1_start):.6f}") 

Elapsed time 40.6647 seconds
----------------------------------------
CPU process_time: 0.269693
----------------------------------------
CPU times: user 233 ms, sys: 37 ms, total: 270 ms
Wall time: 40.7 s


In [30]:
%%time
tic = time.perf_counter()
t1_start = process_time() 
# Execute another run of the updated pipeline to create artifacts
#!tfx run create --engine={ENGINE} --pipeline_name {PIPELINE_NAME} &> /dev/null

!time python {ORCHESTRATOR_FILE} >/dev/null 2>&1

toc = time.perf_counter()
print(f"Elapsed time {toc - tic:0.4f} seconds")
HR()

t1_stop = process_time()
print(f"CPU process_time: {(t1_stop-t1_start):.6f}") 
HR()


real	0m21.819s
user	0m17.594s
sys	0m2.291s
Elapsed time 21.9071 seconds
----------------------------------------
CPU process_time: 0.164955
----------------------------------------
CPU times: user 142 ms, sys: 22.9 ms, total: 165 ms
Wall time: 21.9 s


This creates artifacts in serving_model, StatisticsGen, Trainer, Transform.
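Since each `tree` snapshot is kept in a variable (`tree_02_tfx_artifacts`, `tree_03_tfx_artifacts`), the new entries can be isolated by diffing two snapshots. This is a rough sketch using `difflib`; `added_lines` is an illustrative helper, and lines whose tree connector changed (for example `└──` becoming `├──`) will also show up as added.

```python
import difflib


def added_lines(before: str, after: str) -> list:
    """Return lines that appear in `after` but not in `before`, in order."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


print(added_lines("x\ny", "x\ny\nz"))  # ['z']
```

In the notebook, `added_lines(tree_02_tfx_artifacts, tree_03_tfx_artifacts)` would list the directories and files that the training components produced.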



In [31]:
proc = subprocess.Popen(
    ["tree", OUTPUT_ARTIFACTS_DIR, "-I", "__pycache__|__init__.py"], 
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

tree_03_tfx_artifacts = proc.communicate()[0]
print(tree_03_tfx_artifacts)

tfx_artifacts_beam_penguin_01
├── tfx_metadata
│   └── pipeline_beam_penguin_01
│       └── metadata.db
└── tfx_pipeline_output
    └── pipeline_beam_penguin_01
        ├── CsvExampleGen
        │   └── examples
        │       └── 1
        │           ├── Split-eval
        │           │   └── data_tfrecord-00000-of-00001.gz
        │           └── Split-train
        │               └── data_tfrecord-00000-of-00001.gz
        ├── Evaluator
        │   ├── blessing
        │   │   ├── 13
        │   │   │   └── BLESSED
        │   │   └── 21
        │   │       └── BLESSED
        │   └── evaluation
        │       ├── 13
        │       │   ├── attributions-00000-of-00001.tfrecord
        │       │   ├── eval_config.json
        │       │   ├── metrics-00000-of-00001.tfrecord
        │       │   ├── plots-00000-of-00001.tfrecord
        │       │   └── validations.tfrecord
        │       └── 21
        │           ├── attributions-00000-of-00001.tfrecord
        │           ├── eva

In [32]:
!tfx pipeline list --engine=local

CLI
Listing all pipelines
No pipelines to display.


In [33]:
!tfx pipeline list --engine=beam

CLI
Listing all pipelines
------------------------------
pipeline_beam_penguin_01
------------------------------
