# Collecting, Labeling, and Validating Data
Instructor: Robert Crowe, TensorFlow dev engineer at Google

## Introduction to MLE in Production

#### The importance of data
"Data is the hardest part of ML and the most important piece to get right... Broken data is the most common cause of problems in production ML systems"
- Scaling Machine Learning at Uber with Michelangelo, Uber


"No other activity in the machine learning life cycle has a higher return on investment than improving the data a model has access to."
- Feast: Bridging ML Models and Data, Gojek

#### Overview
- Google has found that in ML, the model is typically about 5% of the code required to put an ML application into production 

<img src='img/1.png' width="600" height="300" align="center"/>

<img src='img/2.png' width="600" height="300" align="center"/>

Production machine learning = Machine learning development + Modern software development

#### Managing the entire life cycle of data
- Labeling
- Feature space coverage
- Minimal dimensionality
- Maximum predictive data
- Fairness
- Rare conditions

#### Modern software development
- Accounts for:
    - Scalability
    - Extensibility
    - Configuration
    - Consistency & reproducibility
    - Safety & security
    - Modularity
    - Testability
    - Monitoring (health & performance)
    - Industry best practices
    
#### Challenges in production grade ML
- Build integrated ML systems
- Continuously operate them in production
- Handle continuously changing data
- Optimize compute resource costs

### ML Pipelines
- **ML Pipelines:** Infrastructure for automating, monitoring, and maintaining model training and deployment

<img src='img/3.png' width="800" height="400" align="center"/>

- ML Pipelines are almost always DAGs (Directed Acyclic Graphs)

#### Directed Acyclic Graphs
- A directed acyclic graph (DAG) is a directed graph that has no cycles
- ML pipeline workflows are usually DAGs (although some advanced cases can include cycles)
- A DAG is a collection of all the tasks you want to run, sequenced in a way that reflects their relationships and dependencies

<img src='img/4.png' width="300" height="150" align="center"/>
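
To make the "no cycles" property concrete, here is a minimal sketch in plain Python. The task names and the `has_cycle` helper are hypothetical, chosen only for illustration; the graph is a dictionary mapping each task to the tasks that run after it.

```python
from typing import Dict, List


def has_cycle(graph: Dict[str, List[str]]) -> bool:
    """Return True if the directed graph contains at least one cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    color = {node: WHITE for node in nodes}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:  # edge back into the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


# Hypothetical pipeline: each task points to the task that follows it.
pipeline = {
    "ingest": ["validate"],
    "validate": ["transform"],
    "transform": ["train"],
    "train": ["evaluate"],
    "evaluate": [],
}

# Same graph with an extra edge from the last task back to the first.
with_cycle = dict(pipeline, evaluate=["ingest"])

print(has_cycle(pipeline))    # the acyclic pipeline
print(has_cycle(with_cycle))  # the graph with the back edge
```

The first graph is a valid DAG; the second is not, because following the edges from `ingest` eventually leads back to `ingest`.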

- Orchestrators are responsible for scheduling the various components in an ML Pipeline based on dependencies defined by a DAG
- Orchestrators help with pipeline automation 
- Examples include: Airflow, Argo, Celery, Luigi, Kubeflow
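
As a rough illustration of what an orchestrator does with a DAG, the sketch below uses Python's standard-library `graphlib` (Python 3.9+) to run tasks in dependency order. It is not how Airflow, Kubeflow, or any of the tools above are implemented; `run_pipeline` and the task names are hypothetical, and real orchestrators add scheduling, retries, and distributed execution on top of this idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(dependencies, tasks):
    """Run every task after all of its dependencies; return the execution order."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()  # execute the task's callable
        order.append(name)
    return order


# Hypothetical DAG: each task maps to the set of tasks it depends on.
dependencies = {
    "validate": {"ingest"},
    "transform": {"validate"},
    "train": {"transform"},
    "evaluate": {"train", "validate"},
}

# Stand-in tasks that only record that they ran.
log = []
tasks = {
    name: (lambda n=name: log.append(n))
    for name in ["ingest", "validate", "transform", "train", "evaluate"]
}

executed = run_pipeline(dependencies, tasks)
print(executed)
```

If the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is one reason pipelines are normally kept acyclic.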

### TensorFlow Extended (TFX)
- Open source, end-to-end platform for deploying production ML pipelines (used at Google)

<img src='img/5.png' width="800" height="400" align="center"/>

- A TFX pipeline is a sequence of scalable components designed for high-performance machine learning tasks on large volumes of data
- In this course, we'll be using TFX

<img src='img/6.png' width="800" height="400" align="center"/>

- TFX production components are built on top of open source libraries such as TensorFlow Data Validation and TensorFlow Transform (both covered later), among others
- Components (in orange) leverage these libraries and form your DAG
- As you sequence these components and set up the dependencies between them, you create your DAG, which is your ML pipeline 
- Below, all the boxes in orange are standard components that ship with TFX when you install it with `pip install`

<img src='img/7.png' width="800" height="400" align="center"/>
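
The idea that wiring components together creates the DAG can be sketched without TFX itself. The `Component` class and `build_dag` function below are hypothetical stand-ins, not the TFX API. The component names are borrowed from TFX's standard components, and the artifact names are assumptions for illustration. Each component declares the artifacts it consumes and produces, and the dependency graph falls out of matching consumers to producers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Component:
    """Hypothetical stand-in for a pipeline component (not the TFX API)."""
    name: str
    inputs: List[str] = field(default_factory=list)   # artifacts consumed
    outputs: List[str] = field(default_factory=list)  # artifacts produced


def build_dag(components: List[Component]) -> Dict[str, Set[str]]:
    """Map each component to the set of components whose outputs it consumes."""
    producer = {artifact: c.name for c in components for artifact in c.outputs}
    return {c.name: {producer[a] for a in c.inputs} for c in components}


# Assumed wiring, for illustration only.
components = [
    Component("ExampleGen", outputs=["examples"]),
    Component("StatisticsGen", inputs=["examples"], outputs=["statistics"]),
    Component("SchemaGen", inputs=["statistics"], outputs=["schema"]),
    Component("ExampleValidator", inputs=["statistics", "schema"],
              outputs=["anomalies"]),
]

dag = build_dag(components)
for name, upstream in dag.items():
    print(name, "depends on", sorted(upstream))
```

Nothing here declares the edges directly: they are implied by which component produces each artifact, which is the sense in which sequencing components and setting up their dependencies creates the pipeline's DAG.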

## Collecting Data

<img src='img/x.png' width="800" height="400" align="center"/>