---
**Chapter 06**
# **The universal workflow of machine learning**
---

Machine learning is about encoding human experience and observation in a model.

---
# **Universal Workflow**
---
  
  - **Task Definition**
    - Understand the problem
    - Collect the dataset
    - Understand the dataset
    - Define metric of success
  
  - **Model Development**
    - Dataset processing
    - Evaluation protocol
    - Statistical power
    - Regularization
  
  - **Model Deployment**
    - Deploy the model
    - Maintain the model  

---
# **Task Definition**
---

### **<ins />01. Understand the problem**

- What the input will be
- What will be predicted
- What assumptions are being made
- What type of task it is (classification, regression, segmentation, etc.)
- Examples:
  - Photo search: multiclass, multilabel classification
  - Music recommendation: better handled by matrix factorization (collaborative filtering) than by deep learning
- How the existing systems work
- What the constraints will be
  - How new data can be collected (e.g., data on an end-to-end encrypted device is not accessible)
  - Runtime requirements (e.g., an embedded system)

### **<ins />02. Collect the dataset**

- Take dataset from the same environment (same features) where the model will be used
- Classes should be equally represented in the dataset, else account for imbalance
- Example:
  - Take images with the same camera that will be used in production
- Concept drift:
  - Predicting with a model trained on past data assumes the future will behave like the past
  - The dataset remains static while the production environment changes over time
  - As the dataset grows, the number of labelling errors grows too
- The most time-consuming part is framing the problem and collecting and annotating the dataset
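
One way to "account for imbalance" is to weight each class inversely to its frequency. The sketch below is only an illustration under that assumption: the helper name `balanced_class_weights`, the weighting formula and the label values are all hypothetical, not something these notes prescribe.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label column: 8 "cat" samples and only 2 "dog" samples.
labels = ["cat"] * 8 + ["dog"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'cat': 0.625, 'dog': 2.5}
```

With integer class indices as keys, a dictionary of this shape is what the `class_weight` argument of Keras `Model.fit` accepts; oversampling the rare class is another option.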
  
### **<ins />03. Understand the dataset**

- Design tools to thoroughly visualize the dataset and annotations

### **<ins />04. Define metric of success**

- The metric of success guides all technical choices in the project
- ROC AUC: Receiver Operating Characteristic Area Under Curve
- Browse Kaggle competitions to see a range of problems and their success metrics
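
To make ROC AUC concrete: it can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counting as half. The function below is a minimal sketch of that pairwise definition; the name `roc_auc` and the toy scores are illustrative assumptions, and in practice a library implementation would be used.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via pairwise comparison of positive and negative scores.

    y_true holds 0/1 labels, y_score holds the model's scores.
    Runs in O(n_pos * n_neg), which is fine for small examples.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ranked correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the metric suits balanced classification problems where ranking quality matters.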

---
# **Model Development**
---

### **<ins />01. Dataset processing**

- **Data vectorization:**
  
  - Inputs and targets typically must be tensors of floating-point data (e.g., float32)
  
- **Data normalization:**
  
  - Data should have small values:
    - Most values should fall in the 0-1 range
    - Example: divide 8-bit pixel values (0-255) by 255
  - Data should be homogeneous:
    - All features should have roughly the same range
    - Feature-wise normalization (mean=0, std=1)
    - Large or heterogeneous values can trigger large gradient updates that prevent convergence

- **Handle missing values:**
  
  - Feature values missing in training dataset
    - <ins>Categorical Feature</ins>
      - Create a new category/class (**value missing**)
      - Example: In Boston dataset, a feature value is missing in some samples
    - <ins>Numerical Feature</ins>
      - Replace with the mean/median of the same feature in other samples
      - Example: In Torque dataset, position/velocity/acceleration are missing in some samples
  - Feature values missing only in test dataset
    - Duplicate a few samples in the training dataset
    - Drop the feature values expected to be missing from these copies
    - Handle the dropped features as above (categorical/numerical)
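
The normalization and missing-value steps above can be sketched in plain Python. Everything here is an illustrative assumption: the helper names, the `"missing"` label and the toy `velocity` column are hypothetical, and real pipelines would normally apply the same ideas to NumPy arrays or tensors.

```python
from statistics import mean, median, pstdev


def standardize(column):
    """Feature-wise normalization: shift to mean 0 and scale to std 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - mu) / sigma for x in column]


def impute_numerical(column):
    """Replace None with the median of the observed values."""
    fill = median(x for x in column if x is not None)
    return [fill if x is None else x for x in column]


def impute_categorical(column, missing_label="missing"):
    """Replace None with a dedicated 'value missing' category."""
    return [missing_label if x is None else x for x in column]


# Small values: 8-bit pixel intensities scaled into the 0-1 range.
pixels = [0, 51, 255]
scaled_pixels = [p / 255 for p in pixels]

# Numerical feature with a gap, filled and then standardized.
velocity = [2.0, None, 4.0, 6.0]
velocity_filled = impute_numerical(velocity)   # [2.0, 4.0, 4.0, 6.0]
velocity_std = standardize(velocity_filled)

# Categorical feature with a gap.
colors = impute_categorical(["red", None, "blue"])  # ['red', 'missing', 'blue']
```

The mean, standard deviation and median should be computed on the training data only and then reused unchanged for validation and test data.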

### **<ins />02. Evaluation protocol**

  - Methods
    - Holdout Validation
    - K-Fold Cross-Validation
    - Iterated K-Fold Cross-Validation with shuffling
  - Considerations:
    - Training and validation sets should not share redundant (duplicate) samples
    - Validation metrics should track the training metrics; a growing gap between them signals overfitting
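
A minimal sketch of how K-fold splitting partitions the sample indices, assuming contiguous folds and an optional seeded shuffle. The function name `k_fold_indices` and its parameters are hypothetical; holdout validation is the special case of using a single (train, validation) pair.

```python
import random


def k_fold_indices(n_samples, k, seed=None):
    """Yield (train_indices, val_indices) for each of the k folds.

    If seed is given, indices are shuffled first (with a seeded RNG so the
    split is reproducible). Every sample lands in exactly one validation fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, val in folds:
    print(len(train), val)
```

Iterated K-fold with shuffling amounts to repeating this with a different `seed` each time and averaging the validation scores over all runs and folds.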

### **<ins />03. Statistical power**
  
  - Develop the smallest possible model that beats a baseline (some generalization, some overfitting)
    - Feature Engineering
    - Architecture priors
    - Training configurations
  - Develop an overfitting model by increasing:
    - Number of layers
    - Size of layers
    - Number of epochs
  - See Chapters 04-05
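
The notes do not say which baseline to beat. For classification, one common-sense choice is always predicting the most frequent training class; the sketch below assumes that choice, and the helper name and toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, val_labels):
    """Accuracy obtained by always predicting the most frequent training class."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in val_labels if y == majority_class)
    return majority_class, hits / len(val_labels)


# Toy labels: class 0 dominates the training set.
train_labels = [0, 0, 0, 1, 0, 1]
val_labels = [0, 1, 0, 0]
majority_class, baseline = majority_baseline_accuracy(train_labels, val_labels)
print(majority_class, baseline)  # 0 0.75
```

A model whose validation accuracy does not exceed this number has not yet shown statistical power.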

### **<ins />04. Regularization**

  - Methods:
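    - Automated hyperparameter tuning (KerasTuner)
    - L1/L2 regularization (typically for smaller models)
    - Dropout (typically for larger models)
    - Reduce network size
  - Beware of information leaks: every tuning decision based on the validation set leaks information about it into the model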
  - See Chapter 05
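
As a rough illustration of what two of these methods compute, here is a plain-Python sketch of the L1/L2 penalty terms added to the loss and of inverted dropout. The function names, the `strength` parameter and the toy weights are assumptions for illustration; a framework such as Keras applies the same ideas per layer.

```python
import random


def l1_penalty(weights, strength):
    """L1 term added to the loss: strength * sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term added to the loss: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout (training time only).

    Each activation is zeroed with probability `rate`; survivors are scaled
    by 1 / (1 - rate) so the expected sum stays the same.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


weights = [0.5, -1.0, 2.0]
print(l1_penalty(weights, 0.01))  # 0.01 * (0.5 + 1.0 + 2.0)
print(l2_penalty(weights, 0.01))  # 0.01 * (0.25 + 1.0 + 4.0)

rng = random.Random(0)            # seeded for reproducibility
dropped = dropout([1.0] * 8, rate=0.5, rng=rng)
print(dropped)                    # each entry is either 0.0 or 2.0
```

At inference time dropout is disabled and the activations pass through unchanged.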


---
# **Model Deployment**
---

### **<ins />01. Deploy the model**

| Server | Detail |
| --- | --- |
| Data | No sensitive inference data (REST API) |
| Latency | No strict latency requirements |
| Accuracy | Highest accuracy requirements |
| Connection | Internet connection requirements |
| Resource | Server compute resources |
| Deployment | [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) |
| Note | [Industrial TensorFlow (TFX)](https://www.tensorflow.org/tfx) |

| Client | Detail |
| --- | --- |
| Data | Sensitive inference data |
| Latency | Strict latency requirements |
| Accuracy | Tradeoff between accuracy and runtime |
| Connection | No internet connection requirements |
| Resource | User compute resources |
| Deployment | [TensorFlow.js](https://www.tensorflow.org/js) |

| Mobile | Detail |
| --- | --- |
| Data | Sensitive inference data |
| Latency | Strict latency requirements |
| Accuracy | Tradeoff between accuracy and runtime |
| Connection | No internet connection requirements |
| Resource | User compute resources |
| Deployment | [TensorFlow Lite](https://www.tensorflow.org/lite) |

**Optimization:**
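- Weight pruning:
  - Reduce the number of weights by keeping only the most significant ones
- Weight quantization:
  - Convert float32 weights to int8 (roughly a quarter of the size)
- [TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization)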
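
The two optimizations can be sketched on a flat list of weights. This is a simplified illustration, assuming magnitude-based pruning and symmetric linear quantization; the function names are hypothetical and this is not the API of the TensorFlow toolkit linked above.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from least to most significant weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 1.27, -0.01, 0.8]
pruned = prune_by_magnitude(weights, sparsity=0.4)   # drops 0.02 and -0.01
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)
print(pruned)      # [0.0, -0.5, 1.27, 0.0, 0.8]
print(quantized)   # [0, -50, 127, 0, 80]
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for the smaller model.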

### **<ins />02. Maintain the model**

- Keep collecting and annotating new data
- Keep improving the collection and annotation pipeline
- **Pay special attention to samples where the model has low accuracy**
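
One simple way to find the samples that deserve this attention is to break accuracy down per class. The sketch below assumes a classification task; the helper name and the toy labels are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {class: accuracy} computed over the samples of each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {cls: correct[cls] / total[cls] for cls in total}


# Toy production log: the model struggles with "dog".
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "cat", "dog", "cat", "cat", "bird"]
accuracy = per_class_accuracy(y_true, y_pred)
weakest = min(accuracy, key=accuracy.get)
print(weakest)  # dog
```

New collection and annotation effort can then be directed at the weakest classes first.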

---