# Model Deployment Workflow

<img src="../assets/workflow.svg" alt="Workflow" width="100%">

1. **Data Splitting**
   - Isolate changes in data preparation.
   - Prevent issues (bias) caused by processing the whole dataset at once.


2. **Data Preparation**
   - Data needs to be in a format the model can use for training.
   - Most data is dirty.
   - Some data may already have been processed during the exploratory data
   analysis (EDA) step.


3. **Training**
   - Model training works best with clean data.


4. **Evaluation**
   - Evaluate the model on unseen data.
   - Outcome leads to product decisions.


5. **Tuning**
   - Change how the model is trained based on feedback from the previous step.
   - The whole workflow is code, allowing for repeatable iterations.

---

## Workflow Importance

1. **Bad Data**: "Garbage in, garbage out". If the data is bad before training,
the model won't perform well.

2. **Unrepeatable Results**: Structuring the workflow in a repeatable way
prevents errors, such as surprises when moving the work to the final stages of
production.

3. **Not Enough Resources**: Having a workflow provides a way to see how the
model reacts to different resources. It's possible to run on better machines,
reducing the time it takes to complete. It's also possible to calculate the
costs of the workflow when iterating, providing better estimates of the product
timeline.

4. **Model Complexity**: Having a workflow allows for a better understanding of
the entire process in a transparent way. If there's a bad metric or
miscalculation, it's possible to quickly change what needs to be done and rerun
the workflow.

---

## Dataset Principles

1. **Feature Values** are the column values of a row.
   - Used by models to predict the target value.
   - Can be described as the attributes of a given data point.


2. **Target Value** is a selected column to be predicted by the model.
   - In reality, any feature could be a target value.

> Training a model on the entire dataset and then testing it on the same
dataset allows the model to memorize what the data looks like. In order to
have a true idea of its performance, we need to test it on unseen data.
Datasets are therefore split before training, so that **overfitting** shows up
in the evaluation instead of going unnoticed.

There are two methods to split the data:

1. Split the data into **train** and **test**. Train the model on the training
dataset, then test and tune the model on the **test** dataset.

2. Split the data into **train**, **validation**, and **test**. Train the model
on the training dataset, tune the model on the **validation** dataset, and
finally test your model on the **test** dataset.

### Example / Practice

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]])

df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
```

- `test_size=0.2` sends 20% of the rows to the test dataset.
- `random_state=0` makes the splitting repeatable.
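
The second splitting method can be sketched by applying `train_test_split`
twice. This is only a minimal illustration: the toy dataset and the 60/20/20
proportions are assumptions, not something prescribed by the lesson.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with 10 rows (illustration only)
df = pd.DataFrame({"feature": range(10), "target": range(10, 20)})

# First, hold out 20% of the rows as the test dataset
df_rest, df_test = train_test_split(df, test_size=0.2, random_state=0)

# Then take 25% of the remaining rows (20% of the original) for validation
df_train, df_val = train_test_split(df_rest, test_size=0.25, random_state=0)

print(len(df_train), len(df_val), len(df_test))
```

The three resulting datasets do not share any rows, so the test dataset stays
unseen until the final evaluation.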

---

> There's an article about `train_test_split` on Real Python's website,
[here](https://realpython.com/train-test-split-python-data).

---

## Data Cleansing

- Bad data could be **missing** or **wrong**.
- **Remove** missing or invalid data.
- Remove entire rows or columns if data is missing.
- Possibly fix bad values by **replacing them with the average or an
interpolated value**.
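
As a rough sketch of these options with pandas, assuming a hypothetical
DataFrame (the column names and values below are made up for illustration),
the removal and replacement strategies could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical dirty data: missing values and a completely empty column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "height_cm": [170.0, 165.0, np.nan, 180.0],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# Remove columns where every value is missing
df = df.dropna(axis="columns", how="all")

# Replace missing ages with the average of the known ages
df["age"] = df["age"].fillna(df["age"].mean())

# Fill the missing height by linear interpolation between its neighbours
df["height_cm"] = df["height_cm"].interpolate()

# Remove any rows that still contain missing values
df = df.dropna()

print(df)
```

Interpolating between neighbouring rows only makes sense when the row order
carries meaning (for example, a time series), so whether it applies depends on
the dataset.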

---

## Feature Engineering

Modify or extend the current features in a dataset with additional insights or
data, increasing the effectiveness of a model.

1. **Changing Data Types** can be done with `astype()`.

2. **Normalizing Data**, by rescaling numerical data to a common scale,
typically zero mean and unit variance, with Scikit Learn's `StandardScaler`.

3. **Parsing Data Types**, such as datetime strings with `to_datetime()`.

4. **One-Hot Encoding**, i.e. convert categorical data to feature columns with
`get_dummies()`.
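
The four techniques above can be sketched on a single DataFrame. The column
names and values are hypothetical and only serve to show each call once:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data (illustration only)
df = pd.DataFrame({
    "rooms": ["2", "3", "4"],
    "price": [100.0, 200.0, 300.0],
    "sold_on": ["2021-01-15", "2021-02-20", "2021-03-25"],
    "city": ["Lisbon", "Porto", "Lisbon"],
})

# 1. Changing data types: strings to integers
df["rooms"] = df["rooms"].astype(int)

# 2. Normalizing data: rescale to zero mean and unit variance
scaler = StandardScaler()
df["price"] = scaler.fit_transform(df[["price"]]).ravel()

# 3. Parsing data types: datetime strings to datetime values
df["sold_on"] = pd.to_datetime(df["sold_on"])

# 4. One-hot encoding: one column per category
df = pd.get_dummies(df, columns=["city"])

print(df.dtypes)
```

In a real workflow the scaler would usually be fitted on the training dataset
only and then reused to transform the other datasets, in line with splitting
the data before preparing it.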

---

> There's an article on Real Python about data cleansing,
[here](https://realpython.com/python-data-cleaning-numpy-pandas).

---

## Model Training

- Training is resource intensive (memory, CPU, GPU).
- Datasets can be hundreds of GBs in size.
- Sampling the dataset, batch processing, and distributed training are ways to
reduce the amount of data each training run has to handle.

Scikit-Learn's API is standardized across all algorithms:

- `fit()` trains the model.
- `score()` evaluates the model with its default metric (accuracy for
classifiers, $R^2$ for regressors).
- `predict()` is used to predict new target values.
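
A minimal sketch of the three calls, assuming a synthetic regression dataset
from `make_regression` and a `LinearRegression` model (both chosen only for
illustration, any other estimator follows the same pattern):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real, already prepared dataset
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()

# Train the model on the training dataset
model.fit(X_train, y_train)

# Evaluate on unseen data (R2 is the default metric for regressors)
r2 = model.score(X_test, y_test)

# Predict target values for new feature values
predictions = model.predict(X_test)

print(r2, predictions[:3])
```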

---

## Metrics Evaluation

1. **Regression metrics** compare the predicted output with the real output
values, and their difference determines the model performance.
   - **R2** measures the proportion of the target's variance that the model
   explains. Its best value is `1`, a model that always predicts the mean
   scores `0`, and worse models score below `0` (the higher the better).
   - **Root mean square error (RMSE)** measures the standard deviation of
   prediction errors. The lower the better.


2. **Classification metrics** compare the predicted label with the real label,
counting true/false positives (TP, FP) and true/false negatives (TN, FN).
   - $Accuracy = \frac{TP + TN}{Total}$

   - $Precision = \frac{TP}{TP + FP}$

   - $Recall = \frac{TP}{TP + FN}$

   - $F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ is the
   harmonic mean of precision and recall.
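
These metrics are available in `sklearn.metrics`. The sketch below uses small
made-up arrays of real and predicted values, so the numbers only illustrate the
formulas above:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression: hypothetical real and predicted outputs
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

r2 = r2_score(y_true, y_pred)                       # 1 - 2/20 = 0.9
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(0.5)

# Classification: hypothetical labels with TP=3, FP=2, FN=1, TN=2
labels_true = [1, 1, 1, 0, 0, 0, 1, 0]
labels_pred = [1, 1, 0, 0, 0, 1, 1, 1]

accuracy = accuracy_score(labels_true, labels_pred)    # (3 + 2) / 8
precision = precision_score(labels_true, labels_pred)  # 3 / (3 + 2)
recall = recall_score(labels_true, labels_pred)        # 3 / (3 + 1)
f1 = f1_score(labels_true, labels_pred)

print(r2, rmse, accuracy, precision, recall, f1)
```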

---

> A list of metrics available can be found
[here](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

> There's a blog post on Medium about popular metrics,
[here](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce).

---

## Hyperparameters Tuning

- Change the model's configuration, which can increase or decrease its
performance.
- Recommended to start with the defaults.
- There are numerous methods for searching for the best combination (grid
search, randomized search, Bayesian search...)
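
As one possible illustration of grid search, the sketch below uses
Scikit-Learn's `GridSearchCV`. The iris dataset, the decision tree, and the
parameter grid are assumptions picked to keep the example small:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical grid: every combination of these values is tried
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}

# Each combination is scored with 5-fold cross-validation on the training data
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# The best combination is refitted on the whole training dataset,
# so the search object can be scored on the untouched test dataset
test_accuracy = search.score(X_test, y_test)

print(search.best_params_, test_accuracy)
```

Here the cross-validation folds play the role of the validation dataset, and
the test dataset is only used once at the end.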

---

> There's a blog post from Neptune about hyperparameter search, available
[here](https://neptune.ai/blog/hyperparameter-tuning-in-python-a-complete-guide-2020).

---

## Exceptions

- Small datasets may not work with train/validation/test data splitting.
- If a dataset is already cleaned and processed, there's no point in doing it
again.
- Avoid metrics whose relation to the data is not understood.
- If default hyperparameters work well, doing a hyperparameter search may not
be worth it.

---

> MLOps is the broader practice around model deployment, and there is an
article about it
[here](https://ml-ops.org/content/end-to-end-ml-workflow).

---

## Exercises

08. **Dataset Principles**: This basic lesson's goal was to introduce the
    concept of splitting the dataset into three parts: training, validation,
    and test.

09. **Data Cleansing and Feature Engineering**: Having learned about data
    splitting, this exercise provides a hands-on experience cleaning and
    processing it.

10. **Model Training and Evaluation**: After finishing working on the dataset,
    we learn about training the model with this practical exercise. Also, after
    training, we evaluate the model's performance by calculating its score.

11. **Diabetes Model**: This final lesson wraps up by using everything:
    splitting the dataset (no cleaning and processing, since it's already
    done), creating and training a model, tuning the hyperparameters, and
    finally calculating the score.