# Collecting, Labeling, and Validating Data
Instructor: Robert Crowe, TensorFlow dev engineer at Google

## Introduction to MLE in Production

#### The importance of data
"Data is the hardest part of ML and the most important piece to get right... Broken data is the most common cause of problems in production ML systems"
- Scaling Machine Learning at Uber with Michelangelo, Uber


"No other activity in the machine learning life cycle has a higher return on investment than improving the data a model has access to."
- Feast: Bridging ML Models and Data, Gojek

#### Overview
- Google has found that in ML, the model is typically about 5% of the code required to put an ML application into production 

<img src='img/1.png' width="600" height="300" align="center"/>

<img src='img/2.png' width="600" height="300" align="center"/>

Production machine learning = Machine learning development + Modern software development

#### Managing the entire life cycle of data
- Labeling
- Feature space coverage
- Minimal dimensionality
- Maximum predictive data
- Fairness
- Rare conditions

#### Modern software development
- Accounts for:
    - Scalability
    - Extensibility
    - Configuration
    - Consistency & reproducibility
    - Safety & security
    - Modularity
    - Testability
    - Monitoring (health & performance)
    - Industry best practices
    
#### Challenges in production grade ML
- Build integrated ML systems
- Continuously operate them in production
- Handle continuously changing data
- Optimize compute resource costs

### ML Pipelines
- **ML Pipelines:** Infrastructure for automating, monitoring, and maintaining model training and deployment

<img src='img/3.png' width="800" height="400" align="center"/>

- ML Pipelines are almost always DAGs (Directed Acyclic Graphs)

#### Directed Acyclic Graphs
- A directed acyclic graph (DAG) is a directed graph that has no cycles
- ML pipeline workflows are usually DAGs (although some advanced cases include cycles)
- A DAG is a collection of all the tasks you want to run, sequenced in a way that reflects their relationships and dependencies

<img src='img/4.png' width="300" height="150" align="center"/>

- Orchestrators are responsible for scheduling the various components in an ML Pipeline based on dependencies defined by a DAG
- Orchestrators help with pipeline automation 
- Examples include: Airflow, Argo, Celery, Luigi, Kubeflow
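
To make the orchestrator's job concrete, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). The step names are hypothetical and only loosely mirror a typical pipeline. Real orchestrators such as Airflow or Kubeflow add scheduling, retries, and distributed execution on top of this basic idea.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "train": {"transform"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(dag):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)

# Making "ingest" depend on "deploy" introduces a cycle, so this is no longer a DAG.
cyclic = dict(pipeline, ingest={"deploy"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Every step appears in the order only after all of its dependencies, which is the guarantee an orchestrator relies on when it schedules components.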

### TensorFlow Extended (TFX)
- Open source, end-to-end platform for deploying production ML pipelines (used at Google)

<img src='img/5.png' width="800" height="400" align="center"/>

- A TFX pipeline is a sequence of scalable components designed for high-performance machine learning tasks on large volumes of data
- In this course, we'll be using TFX

<img src='img/6.png' width="800" height="400" align="center"/>

- TFX production components are built on top of open source libraries such as TensorFlow Data Validation and TensorFlow Transform (both of which we'll also learn about), among others
- Components (in orange) leverage these libraries and form your DAG
- As you sequence these components and set up the dependencies between them, you create your DAG, which is your ML pipeline 
- Below, all the boxes in orange are TFX components that come with TFX when you just do a `pip install`

<img src='img/7.png' width="800" height="400" align="center"/>

## Collecting Data

### Importance of Data
- For most applications, you don't collect data just once; you keep collecting it throughout the lifetime of the application.
- In programming language design, a **first class citizen** is an entity that supports all the operations generally available to other entities.
- In ML, data is a first-class citizen
- Meaningful data:
    - maximize predictive content
    - remove non-informative data
    - feature space coverage
- Data collection is a critical step in building ML systems
- Understand users, translate user needs into data problems
- Ensure data coverage and high predictive signal
- Source, store, and monitor quality data responsibly

#### Example application: suggesting runs
#### Key considerations
- Data availability and collection
    - What kind of/how much data is available?
    - How often does the new data come in?
    - Is it annotated?
        - If not, how hard/expensive is it to get it labeled?
- Translate user needs into data needs
    - Data needed
    - Features needed
    - Labels needed
    


- Get to know your data
    - Identify data sources
    - Check if they are refreshed
    - Consistency for values, units, & data types
    - Monitor outliers and errors
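
As a rough illustration of the consistency and outlier checks above, the sketch below coerces types, converts units, and flags outliers with the interquartile-range (IQR) rule. The records, field names, and the 1.5 × IQR threshold are assumptions chosen for the example, not something the course prescribes.

```python
from statistics import quantiles

# Hypothetical run records; field names and values are made up for illustration.
runs = [
    {"distance": 5.0, "unit": "km"},
    {"distance": 3.1, "unit": "mi"},    # different unit
    {"distance": "10", "unit": "km"},   # wrong type: string instead of float
    {"distance": 8.0, "unit": "km"},
    {"distance": 6.5, "unit": "km"},
    {"distance": 7.2, "unit": "km"},
    {"distance": 500.0, "unit": "km"},  # suspicious value, possibly a logging error
]

KM_PER_MILE = 1.609344


def to_km(record):
    """Coerce the distance to float and convert miles to kilometres."""
    value = float(record["distance"])
    if record["unit"] == "mi":
        value *= KM_PER_MILE
    elif record["unit"] != "km":
        raise ValueError(f"unknown unit: {record['unit']!r}")
    return value


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


distances_km = [to_km(r) for r in runs]
outliers = iqr_outliers(distances_km)
print(outliers)  # [500.0]
```

A flagged value is a prompt to investigate the source, not proof of an error; a rare but genuine ultra-long run would also be flagged here.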
    
#### Dataset issues
- Inconsistent formatting
    - Is zero "0", "0.0", or an indicator of a missing value?
- Compounding errors from other ML models
- Monitor data sources for system issues and outages
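
The formatting question above can be handled by parsing every raw value through a single function that separates "zero" from "missing". This is a minimal sketch; the set of tokens treated as missing is an assumption and would need to be confirmed against each real data source (some sources also use `0` itself as a missing-value sentinel, which this sketch does not attempt to detect).

```python
def parse_numeric(raw, missing_tokens=("", "NA", "null", "None")):
    """Return a float, or None when the raw value denotes a missing entry."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in missing_tokens:
        return None
    return float(text)


# The same quantity arriving in several inconsistent encodings.
raw_column = ["0", "0.0", 0, "", "NA", " 3.5 ", None]
parsed = [parse_numeric(v) for v in raw_column]
print(parsed)  # [0.0, 0.0, 0.0, None, None, 3.5, None]
```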

#### Measure data effectiveness
- Intuition about data value can be misleading
    - Which features have predictive value and which do not?
- **Feature engineering** helps to maximize the predictive signals
- **Feature selection** helps to measure the predictive signals
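
One simple way to check intuition about feature value is to score each feature against the label. The sketch below uses the absolute Pearson correlation, which only captures linear relationships and is just one of many possible scoring methods. The feature names and numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


# Hypothetical features and label.
features = {
    "weekly_distance_km": [10, 20, 30, 40, 50],
    "shoe_size": [42, 42, 42, 42, 42],  # constant, so non-informative
    "noise": [3, 1, 4, 1, 5],
}
label = [1.0, 2.1, 2.9, 4.2, 5.0]

scores = {name: abs(pearson(col, label)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['weekly_distance_km', 'noise', 'shoe_size']
```

Features near the bottom of such a ranking are candidates for removal, although a low linear score does not rule out a nonlinear relationship.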

### Responsible Data: Security, Privacy & Fairness

#### Data security and privacy
- Data collection and management is not just about your model
    - Give users control of what data can be collected
    - Is there a risk of inadvertently revealing user data?
- Compliance with regulations and policies (e.g., GDPR)

#### User privacy
- Protect personally identifiable information
    - Aggregation: replace unique values with a summary value
    - Redaction: remove some data to create a less complete picture
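
The two techniques above can be sketched in a few lines. The record layout, the choice of which fields count as identifying, and the ten-year age buckets are all assumptions made for this example. Applying these steps alone does not guarantee that a dataset is anonymous.

```python
# Hypothetical user records.
users = [
    {"name": "Ana", "email": "ana@example.com", "age": 23, "city": "Lisbon"},
    {"name": "Ben", "email": "ben@example.com", "age": 37, "city": "Leeds"},
]


def redact(record, fields=("name", "email")):
    """Redaction: drop directly identifying fields from a record."""
    return {k: v for k, v in record.items() if k not in fields}


def age_bucket(age, width=10):
    """Aggregation: replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record):
    """Apply redaction first, then aggregate the remaining age value."""
    cleaned = redact(record)
    cleaned["age"] = age_bucket(cleaned["age"])
    return cleaned


protected = [anonymize(u) for u in users]
print(protected)
# [{'age': '20-29', 'city': 'Lisbon'}, {'age': '30-39', 'city': 'Leeds'}]
```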
    
#### How ML systems can fail users
- ML systems should be:
    - Fair
    - Accountable
    - Transparent
    - Explainable
- Ways they can fail users:
    - Representational harm
    - Opportunity denial
    - Disproportionate product failure
    - Harm by disadvantage
    
#### Commit to fairness
- Make sure your models are fair
    - Group fairness, equal accuracy
- Bias in human labeled and/or collected data
- ML models can amplify biases
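
A first check for the "equal accuracy" notion of group fairness is to compute accuracy separately for each group and compare. The labels, predictions, and group assignments below are made up, and equal accuracy is only one of several fairness criteria, so treat this as a sketch rather than a complete audit.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group to its prediction accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical labels, predictions, and group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)  # {'a': 1.0, 'b': 0.5} 0.5
```

A large gap suggests the model fails one group disproportionately and that the data for that group deserves a closer look.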

#### Reducing bias: Design fair labeling systems
- Accurate labels are necessary for supervised learning
- Labeling can be done by
    - Automation (logging or weak supervision)
    - Humans (aka "Raters", often semi-supervised)
    
<img src='img/8.png' width="800" height="400" align="center"/>
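
Labeling by logging can be as simple as joining what the system showed with what the user then did. The sketch below assumes a hypothetical run-suggestion app where a suggestion counts as a positive example if the user later completed that run. The log structure and the labeling rule are illustrative assumptions.

```python
# Hypothetical logs: runs suggested to users, and runs users actually completed.
suggestions = [
    {"user": "u1", "run": "r1"},
    {"user": "u1", "run": "r2"},
    {"user": "u2", "run": "r1"},
]
completions = {("u1", "r2"), ("u2", "r1")}


def label_from_logs(suggested, completed):
    """Attach label 1 if the suggested run was later completed, else 0."""
    return [
        {**s, "label": int((s["user"], s["run"]) in completed)}
        for s in suggested
    ]


labeled = label_from_logs(suggestions, completions)
print([row["label"] for row in labeled])  # [0, 1, 1]
```

Labels derived this way inherit any bias in what was suggested and to whom, which is one reason the labeling design itself needs a fairness review.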

#### Key points
- Ensure rater pool diversity
- Investigate rater context and incentives
- Evaluate rater tools
- Manage cost
- Determine freshness requirements

<img src='img/x.png' width="800" height="400" align="center"/>