This repository shows how to build a complete MLOps system with TensorFlow Extended (TFX) and various GCP products such as Vertex Pipeline, Vertex Training, Vertex Endpoint, and Google Cloud Storage. The main goal is to handle two common scenarios: adapting to changes in the codebase and adapting to changes in the data over time. To achieve these, we need three separate pipelines:
- CI/CD pipeline
  - This pipeline is implemented in TFX, GitHub Action, and Vertex Pipeline.
  - GitHub Action detects any changes that occur in the codebase. There are two branches of listening: the first branch watches the whole codebase, while the second branch watches only its data preprocessing and modeling parts.
  - Each branch triggers a different sub-workflow, but the two have a lot in common:
    a. Clone the current codebase
    b. Unit-test the `*_test.py` files
    c. Create the TFX pipeline
    d. Run the TFX pipeline locally
    e. Trigger the TFX pipeline on Vertex Pipeline
  - The only difference between them is the step inserted between d and e: the first branch builds a new Docker image there, while the second branch copies the modules to the cloud location (GCS). A sketch of steps c-e appears after the diagram below.
```mermaid
flowchart LR;
    subgraph GitHub Action
    direction LR
    A[Changes in whole codebase]-->B[Trigger Cloud Build];
    C[Changes in modules]-->D[Trigger Cloud Build];
    end
    subgraph GitHub Sub Action1
    direction LR
    E[Clone Repo]-->F[Unit Test];
    F-->G[Create TFX Pipeline];
    G-->I[Build Docker Image];
    I-->J[Trigger Pipeline on Vertex];
    end
    subgraph GitHub Sub Action2
    direction LR
    K[Clone Repo]-->L[Unit Test];
    L-->M[Create TFX Pipeline];
    M-->O[Copy Modules];
    O-->P[Trigger Pipeline on Vertex];
    end
    B-->E;
    D-->K;
```
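To make steps c-e concrete, here is a minimal sketch using TFX's public `v1` API and the `google-cloud-aiplatform` SDK. The pipeline name, GCS paths, project ID, and the lone `CsvExampleGen` component are placeholders standing in for this repo's actual pipeline definition.

```python
from tfx import v1 as tfx
from google.cloud import aiplatform

def create_pipeline(pipeline_root: str, data_root: str) -> tfx.dsl.Pipeline:
    # Step c: assemble the TFX pipeline. A single CsvExampleGen stands in
    # for this repo's actual preprocessing/training components.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    return tfx.dsl.Pipeline(
        pipeline_name="my-pipeline",  # placeholder name
        pipeline_root=pipeline_root,
        components=[example_gen],
    )

# Step d: run the pipeline locally as a smoke test.
tfx.orchestration.LocalDagRunner().run(
    create_pipeline("/tmp/pipeline_root", "/tmp/data"))

# Step e: compile the same pipeline into a Vertex-compatible spec ...
tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
    output_filename="pipeline.json",
).run(create_pipeline("gs://my-bucket/root", "gs://my-bucket/data"))

# ... and submit it to Vertex Pipeline.
aiplatform.init(project="my-gcp-project", location="us-central1")
aiplatform.PipelineJob(
    display_name="my-pipeline",
    template_path="pipeline.json",
).submit()
```

Running the pipeline locally first acts as a cheap smoke test before spending Vertex resources on the full run.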
- Model evaluation pipeline
  - This pipeline is implemented in TFX, GitHub Action, and Vertex Pipeline.
  - GitHub Action periodically checks whether enough data has been collected to evaluate the currently deployed model on. The model itself is published as a GitHub Release.
  - If there is enough data, it triggers another GitHub Action for model evaluation, which consists of the following steps (see the sketch after this list):
    - Run batch predictions.
    - Evaluate the predictions against a predefined accuracy threshold.
    - When the result is not good enough, launch a Vertex Pipeline written in TFX, consisting of:
      - SpanPreparator, which prepares TFRecords from the collected data and puts them in a new SPAN folder (a new SPAN means model drift has been detected).
      - PipelineTrigger, which triggers the ML pipeline and tells it which SPAN to look up. A sketch of both components appears after the diagram below.
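A minimal sketch of the periodic check and evaluation steps, assuming the collected data sits in a GCS bucket. The bucket names, the threshold value, the model resource name, and the `compute_accuracy` / `trigger_model_retraining_pipeline` helpers are all assumptions for illustration, not this repo's actual values.

```python
from google.cloud import aiplatform, storage

PROJECT = "my-gcp-project"  # placeholder project ID
THRESHOLD = 0.9             # assumed accuracy threshold, not this repo's actual value

# Periodic check: count how many examples have been collected so far.
bucket = storage.Client(project=PROJECT).bucket("my-bucket")  # placeholder bucket
num_collected = sum(1 for _ in bucket.list_blobs(prefix="collected/"))

if num_collected >= 1000:  # assumed "enough data" criterion
    aiplatform.init(project=PROJECT, location="us-central1")

    # Run batch predictions with the currently deployed model.
    model = aiplatform.Model("projects/.../locations/.../models/...")  # placeholder
    job = model.batch_predict(
        job_display_name="model-evaluation",
        gcs_source="gs://my-bucket/collected/examples.jsonl",
        gcs_destination_prefix="gs://my-bucket/predictions",
    )
    job.wait()

    # Compare predictions against ground truth; compute_accuracy is a
    # hypothetical helper, not part of google-cloud-aiplatform.
    accuracy = compute_accuracy(job.output_info.gcs_output_directory)
    if accuracy < THRESHOLD:
        trigger_model_retraining_pipeline()  # hypothetical helper launching the MRP
```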
```mermaid
flowchart LR;
    subgraph Periodic Check-GitHub Action
    direction LR
    A[Check # of collected data]--Enough-->B[Trigger GitHub Action];
    end
    subgraph Model Evaluation-GitHub Action
    direction LR
    C[Batch Inference]--Not Good Enough-->D[Trigger MRP];
    end
    subgraph MRP - Model Retraining Pipeline
    direction LR
    E[SpanPreparator]-->F[PipelineTrigger];
    F-->G[Model Retraining];
    end
    B-->C;
    D-->E;
```
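A rough sketch of what the two custom components could look like with TFX's Python-function component API. The parameter names, the `input-span` pipeline parameter, and the `write_tfrecords` helper are illustrative assumptions rather than this repo's actual signatures.

```python
from tfx import v1 as tfx

@tfx.dsl.components.component
def SpanPreparator(gcs_source: tfx.dsl.components.Parameter[str],
                   gcs_destination: tfx.dsl.components.Parameter[str],
                   latest_span: tfx.dsl.components.Parameter[int]):
    # Convert the collected raw data into TFRecords and write them under a
    # new SPAN folder; a new SPAN signals that drift was detected.
    # write_tfrecords is a hypothetical helper, not a TFX API.
    write_tfrecords(gcs_source, f"{gcs_destination}/span-{latest_span + 1}")

@tfx.dsl.components.component
def PipelineTrigger(pipeline_spec_path: tfx.dsl.components.Parameter[str],
                    span_to_look_up: tfx.dsl.components.Parameter[int]):
    # Imports live inside the body because TFX only captures the function itself.
    from google.cloud import aiplatform
    # Launch the ML pipeline, passing along which SPAN it should read from.
    aiplatform.PipelineJob(
        display_name="model-retraining",
        template_path=pipeline_spec_path,
        parameter_values={"input-span": span_to_look_up},  # assumed parameter name
    ).submit()
```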
👋 NOTE: One could argue that the whole system can be implemented without any cloud services. However, in my opinion, it is non-trivial to achieve a production-ready MLOps system without the help of cloud services.
I am thankful to the ML Developer Programs team at Google for providing GCP support.