# Model Development
Model development is an iterative process. After each iteration, you’ll want to compare your model’s performance against its performance in previous iterations and evaluate how suitable this iteration is for production.

## Evaluating ML Models
When considering what model to use, it’s important to consider not only the model’s performance, measured by metrics such as accuracy, F1 score, and log loss, but also its other properties, such as how much data, compute, and time it needs to train, its inference latency, and its interpretability. For example, a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labeled data to start, is much faster to train, is much easier to deploy, and makes it much easier to explain why it produces certain predictions.

## Tips for model selection

### Avoid the state-of-the-art trap
Do not jump straight to state-of-the-art models. Start with simple models, then move to more complex ones.

Researchers often only evaluate models in academic settings, which means that a model being state of the art often means that it performs better than existing models on some static datasets. It doesn’t mean that this model will be fast enough or cheap enough for you to implement. It doesn’t even mean that this model will perform better than other models on your data.

### Start with the simplest models
The Zen of Python states that “simple is better than complex,” and this principle applies to machine learning models too.
- Simple models are easier to deploy.
- Starting with something simple and adding more complex components step-by-step makes it easier to understand your model and debug it.
- The simplest model serves as a baseline to which you can compare your more complex models.

### Avoid human biases in selecting models
The choice of a model should be based on the data and the problem you are trying to solve, not on your personal preferences or biases.
Some people are attracted to specific architectures or models because they are popular or because they are the ones they are most familiar with. This can lead to suboptimal results.

### Evaluate good performance now versus good performance later
The best model now does not always mean the best model two months from now. For example, a tree-based model might work better now because you don’t have a ton of data yet, but two months from now, you might be able to double your amount of training data, and your neural network might perform much better. A simple way to estimate how your model’s performance might change with more data is to use learning curves.
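
A learning curve plots validation performance against the number of training examples. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, and `fit_threshold` is a toy stand-in for a real model, not anything from the original notes.

```python
import random

def fit_threshold(train):
    """Toy model: threshold halfway between the two class means of a 1-D feature."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:  # degenerate sample: fall back to a fixed threshold
        return 0.0
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of examples where 'x >= threshold' agrees with the label."""
    return sum((x >= threshold) == (y == 1) for x, y in data) / len(data)

def learning_curve(train, val, sizes):
    """Validation accuracy of the toy model trained on the first n examples."""
    return [(n, accuracy(fit_threshold(train[:n]), val)) for n in sizes]

rng = random.Random(0)

def sample(n):
    # Synthetic data (assumption): class 1 centred at +1, class 0 centred at -1.
    return [(rng.gauss(1.0 if y else -1.0, 1.0), y)
            for y in (rng.randint(0, 1) for _ in range(n))]

train, val = sample(400), sample(200)
curve = learning_curve(train, val, sizes=[10, 50, 100, 200, 400])
for n, acc in curve:
    print(f"train size={n:4d}  validation accuracy={acc:.3f}")
```

If the curve is still rising at the largest training size, more data is likely to help. scikit-learn also provides `sklearn.model_selection.learning_curve` for real estimators.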

### Evaluate trade-offs
There are many trade-offs you have to make when selecting models. Understanding what’s more important in the performance of your ML system will help you choose the most suitable model.
- One classic example is the trade-off between false positives and false negatives: reducing one type of error typically increases the other.
- Another example is the trade-off between compute requirements and accuracy: a more complex model might deliver higher accuracy but require more powerful machines.
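
To make the false positive versus false negative trade-off concrete, the sketch below sweeps a decision threshold over made-up scores. The scores and labels are illustrative assumptions, not real results.

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given decision threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

# Made-up predicted probabilities of the positive class, with true labels.
scores = [0.1, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.2, 0.5, 0.85):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data, raising the threshold removes false positives and adds false negatives. Which direction to favor depends on the cost of each error in your application.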

### Understand your model’s assumptions
- Prediction assumption: Every model that aims to predict an output Y from an input X makes the assumption that it’s possible to predict Y based on X.
- IID: Neural networks assume that the examples are independent and identically distributed, which means that all the examples are independently drawn from the same joint distribution.
- Smoothness: If an input X produces an output Y, then an input close to X would produce an output proportionally close to Y.
- Tractability: Let X be the input and Z be the latent representation (compressed) of X. Every generative model makes the assumption that it’s tractable to compute the probability P(Z|X).
- Boundaries: A linear classifier assumes that decision boundaries are linear.
- Conditional independence: A naive Bayes classifier assumes that the attribute values are independent of each other given the class.
- Normally distributed: Many statistical methods assume that data is normally distributed.

# Ensembles
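An ensemble combines the predictions of multiple models, called base learners, to make the final prediction. For example, for the task of predicting whether an email is SPAM or NOT SPAM, you might have three different models. The final prediction for each email is the majority vote of all three models.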
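
The majority vote can be sketched in a few lines. The three lists of model outputs below are hypothetical placeholders for real classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three spam classifiers for four emails.
model_outputs = [
    ["SPAM", "NOT SPAM", "SPAM", "NOT SPAM"],  # model 1
    ["SPAM", "SPAM", "NOT SPAM", "NOT SPAM"],  # model 2
    ["NOT SPAM", "SPAM", "SPAM", "NOT SPAM"],  # model 3
]

# zip(*...) groups the three votes for each email together.
ensemble = [majority_vote(votes) for votes in zip(*model_outputs)]
print(ensemble)
```

Using an odd number of models avoids ties in binary classification.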

Ensembling methods are less favored in production because ensembles are more complex to deploy and harder to maintain.

# Bagging
Bagging, shortened from bootstrap aggregating, is designed to improve both the training stability and accuracy of ML algorithms. It reduces variance and helps to avoid overfitting.

Given a dataset, instead of training one classifier on the entire dataset, you sample with replacement to create different datasets, called bootstraps, and train a classification or regression model on each of these bootstraps.
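
A minimal sketch of this procedure for classification follows. It assumes a toy one-feature dataset and a toy base learner (`fit_stump`), both invented for illustration, and it combines the models by majority vote.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    """Toy base learner: threshold at the midpoint of the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    if not pos or not neg:  # bootstrap happened to contain a single class
        return 0.0
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagging_fit(data, n_models, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(data, rng)) for _ in range(n_models)]

def bagging_predict(thresholds, x):
    """Majority vote over the stumps; each stump votes 1 when x >= its threshold."""
    votes = [int(x >= t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

data = [(-2.0, 0), (-1.5, 0), (-0.5, 0), (0.2, 0),
        (0.4, 1), (1.0, 1), (1.8, 1), (2.5, 1)]
models = bagging_fit(data, n_models=5)
print(bagging_predict(models, 2.0), bagging_predict(models, -2.0))
```

For regression, the usual counterpart is to average the predictions of the individual models instead of voting.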

![bagging](./screenshots/bagging.png)

A random forest is an example of bagging. A random forest is a collection of decision trees constructed by both bagging and feature randomness, where each tree can pick only from a random subset of features to use.

# Boosting
Boosting sequentially trains a series of weak learners, where each learner learns from the mistakes of the previous one. The final prediction is a weighted combination of the predictions of the weak learners.
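
As one concrete instance, the sketch below implements AdaBoost with one-feature decision stumps. AdaBoost is chosen here only for illustration, since the text above does not name a specific algorithm, and the dataset is made up so that no single stump can classify it correctly.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` when x >= threshold, and -polarity otherwise."""
    threshold, polarity = stump
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((threshold, polarity), x) != y)
            if best is None or err < best[0]:
                best = (err, (threshold, polarity))
    return best

def adaboost_fit(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, stump)
    for _ in range(rounds):
        err, stump = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this learner's vote
        ensemble.append((alpha, stump))
        # Up-weight misclassified examples so the next learner focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]  # labels in {-1, +1}
ensemble = adaboost_fit(xs, ys, rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

Each round reweights the training examples, so the later stumps concentrate on the points the earlier ones got wrong.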

![boosting](./screenshots/boosting.png)


# Stacking

Stacking means that you train base learners from the training data, then create a meta-learner that combines the outputs of the base learners to output final predictions.
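
The sketch below shows the idea with a deliberately tiny meta-learner: a lookup table that maps each combination of base learner outputs to the most common label. The two base learners are hypothetical fixed rules standing in for trained models.

```python
from collections import Counter, defaultdict

def fit_meta_learner(base_learners, data):
    """For each combination of base outputs, remember the most common label."""
    table = defaultdict(Counter)
    for x, y in data:
        key = tuple(model(x) for model in base_learners)
        table[key][y] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}

def stacked_predict(base_learners, meta, x, default=0):
    """Run the base learners, then let the meta-learner map their outputs to a label."""
    key = tuple(model(x) for model in base_learners)
    return meta.get(key, default)

# Hypothetical base learners: each looks at one feature only.
base_learners = [lambda x: int(x[0] > 0), lambda x: int(x[1] > 0)]

# Toy data: the label is 1 when exactly one of the two features is positive.
data = [((1, 1), 0), ((1, -1), 1), ((-1, 1), 1),
        ((-1, -1), 0), ((2, -3), 1), ((-2, -1), 0)]

meta = fit_meta_learner(base_learners, data)
print(stacked_predict(base_learners, meta, (3, -2)))
```

Neither base learner can solve this task alone, but the meta-learner can by combining their outputs. In practice the meta-learner is usually fitted on predictions for data the base learners were not trained on, to avoid leakage.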

![stacking](./screenshots/stacking.png)


# Experiment Tracking and Versioning
During the model development process, you often have to experiment with many architectures and many different models to choose the best one for your problem. It’s important to keep track of all the definitions needed to re-create an experiment and its relevant artifacts. An artifact is a file generated during an experiment—examples of artifacts can be files that show the loss curve, evaluation loss graph, logs, or intermediate results of a model throughout a training process.

## Experiment tracking
A large part of training an ML model is babysitting the learning processes. Many problems can arise during the training process, including:
- loss not decreasing
- overfitting
- underfitting
- fluctuating weight values
- dead neurons
- running out of memory

It’s important to track what’s going on during training not only to detect and address these issues but also to evaluate whether your model is learning anything useful.

The following is a short list of things you might want to consider tracking for each experiment during its training process:

- Loss curve
- Model performance metrics such as accuracy, F1 score, and perplexity
- Log of corresponding samples, predictions, and ground truth labels
- Speed of the model
- System performance metrics such as memory usage and CPU/GPU utilization
- Parameters and hyperparameters
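
A minimal way to record such information is an append-only log with one JSON record per line. The `ExperimentLogger` class, the file name, and the metric values below are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile
import time

class ExperimentLogger:
    """Append hyperparameters and per-step metrics to a JSON Lines file."""

    def __init__(self, path, hyperparameters):
        self.path = path
        self._write({"type": "hyperparameters", **hyperparameters})

    def _write(self, record):
        record["timestamp"] = time.time()
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def log_step(self, step, **metrics):
        self._write({"type": "metrics", "step": step, **metrics})

def read_log(path):
    """Load every record back, for plotting or comparing runs."""
    with open(path) as f:
        return [json.loads(line) for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
logger = ExperimentLogger(log_path, {"learning_rate": 0.01, "batch_size": 32})
for step in range(3):
    # Placeholder numbers; a real loop would log measured loss and accuracy.
    logger.log_step(step, loss=1.0 / (step + 1), accuracy=0.5 + 0.1 * step)

records = read_log(log_path)
print(len(records), records[0]["type"])
```

Because every record carries a step and a timestamp, the loss curve and the model speed can be reconstructed from the log afterwards.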

## Versioning
Versioning is the process of assigning unique identifiers to different versions of your code, data, and models. Versioning is important because it allows you to track changes to your code, data, and models over time. It also allows you to reproduce experiments and results by being able to go back to a specific version of your code, data, and models.
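
One simple way to obtain unique identifiers is to hash the content being versioned. The sketch below is an assumption-laden illustration: `content_version` is a hypothetical helper, and the config and data bytes are made up.

```python
import hashlib
import json

def content_version(*parts: bytes) -> str:
    """Short identifier derived from content: same content, same version."""
    digest = hashlib.sha256()
    for part in parts:
        # A length prefix keeps the boundaries between parts unambiguous.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()[:12]

config = {"model": "logistic_regression", "learning_rate": 0.01}
data_v1 = b"id,label\n1,0\n2,1\n"
data_v2 = b"id,label\n1,0\n2,1\n3,1\n"

# sort_keys makes the serialized config independent of key order.
config_bytes = json.dumps(config, sort_keys=True).encode()
version_a = content_version(config_bytes, data_v1)
version_b = content_version(config_bytes, data_v2)
print(version_a, version_b)
```

Storing this identifier with every experiment result makes it possible to tell exactly which config and data produced it.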

# Distributed Training
As models are getting bigger and more resource-intensive, companies care a lot more about training at scale.
When your data doesn’t fit into memory, your algorithms for preprocessing, shuffling, and batching data will need to run out of core and in parallel.

### Data parallelism
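You split your data across multiple machines, train your model on all of them, and accumulate gradients. This gives rise to a couple of issues. A challenging problem is how to accurately and effectively accumulate gradients from different machines. As each machine produces its own gradient, if your model waits for all of them to finish a run, which is synchronous stochastic gradient descent (SGD), stragglers will cause the entire system to slow down, wasting time and resources. If the model instead updates its weights using the gradient from each machine separately, which is asynchronous SGD, gradient staleness can become a problem because a gradient might have been computed on weights that have since changed.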
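
The synchronous variant can be simulated on a single machine. In the sketch below, the one-parameter model `y = w * x`, the data, and the learning rate are illustrative assumptions, and the "workers" are just a loop over shards.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the 1-D model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def synchronous_step(w, shards, lr):
    """Every worker computes a gradient on its shard; the average is applied once."""
    grads = [gradient(w, shard) for shard in shards]  # would run in parallel
    return w - lr * sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # generated with w = 3
shards = [data[:2], data[2:]]                         # two equally sized shards

w = 0.0
for _ in range(50):
    w = synchronous_step(w, shards, lr=0.05)
print(round(w, 6))
```

With equally sized shards, the average of the per-shard gradients equals the gradient on the full batch, so the update matches single-machine training. The cost is that every step waits for the slowest worker.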

### Model parallelism
With data parallelism, each worker has its own copy of the whole model and does all the computation necessary for its copy of the model. Model parallelism is when different components of your model are trained on different machines.

![model-parallelism](./screenshots/model-parallelism.png)

**Pipeline parallelism** is a clever technique to make different components of a model on different machines run more in parallel. There are multiple variants to this, but the key idea is to break the computation of each machine into multiple parts. When machine 1 finishes the first part of its computation, it passes the result on to machine 2, then continues to the second part, and so on.
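
The schedule this produces can be sketched as follows. The simulation assumes a forward pass only and that every machine needs one time step per part; real pipeline schedules also have to handle the backward pass.

```python
def pipeline_schedule(n_stages, n_parts):
    """Return {time_step: [(stage, part), ...]} for a forward-only pipeline.

    Stage s can process part p at step s + p: one step after stage s - 1
    finished part p, and one step after stage s itself finished part p - 1.
    """
    schedule = {}
    for stage in range(n_stages):
        for part in range(n_parts):
            schedule.setdefault(stage + part, []).append((stage, part))
    return schedule

schedule = pipeline_schedule(n_stages=2, n_parts=4)
for step in sorted(schedule):
    print(step, schedule[step])

pipelined_steps = len(schedule)  # n_stages + n_parts - 1
sequential_steps = 4 + 4         # machine 2 starts only after machine 1 is done
print(pipelined_steps, sequential_steps)
```

From the second step onward both machines are busy at the same time, which is where the speedup over strictly sequential execution comes from.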

![pipeline-parallelism](./screenshots/pipeline-parallelism.png)

# Baseline Models
These are the models that you compare your model against. They are usually simple models that are easy to implement and understand.

1. **Random baseline**: This model predicts a random class for each example. It tells you how well a model would do by guessing alone.
2. **Simple heuristic baseline**: This model uses a simple heuristic to make predictions. For example, if you are predicting whether an email is spam or not, you might use the heuristic that all emails containing the word “Viagra” are spam.
3. **Zero rule baseline**: This model always predicts the most common class in the training data. It is simple to implement and shows how far class imbalance alone can take you.
4. **Human baseline**: A human expert makes predictions on the test data. This tells you how well your model is doing compared to a human expert.
5. **Existing model baseline**: An existing model that has been trained on the same data makes the predictions. This tells you how well your model is doing compared to what is already available.
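
The random and zero rule baselines take only a few lines to implement. The labels below are made up for illustration, and the random baseline here draws uniformly from the classes seen in training, which is one of several reasonable choices.

```python
import random
from collections import Counter

def zero_rule_baseline(train_labels):
    """Always predict the most common class seen in training."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common

def random_baseline(train_labels, seed=0):
    """Predict a class uniformly at random among the classes seen in training."""
    classes = sorted(set(train_labels))
    rng = random.Random(seed)
    return lambda example: rng.choice(classes)

def accuracy(model, examples, labels):
    return sum(model(x) == y for x, y in zip(examples, labels)) / len(labels)

train_labels = ["NOT SPAM"] * 9 + ["SPAM"]
test_examples = [f"email {i}" for i in range(10)]
test_labels = ["NOT SPAM"] * 8 + ["SPAM"] * 2

zero_rule_acc = accuracy(zero_rule_baseline(train_labels), test_examples, test_labels)
random_acc = accuracy(random_baseline(train_labels), test_examples, test_labels)
print(zero_rule_acc, random_acc)
```

On imbalanced data the zero rule baseline can already reach high accuracy, which is why a model's accuracy is only meaningful relative to it.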

# Evaluation Methods

### Perturbation tests
Perturbation tests are a way to evaluate the robustness of a model. The idea is to introduce small changes to the input data and see how much the output changes. If the output changes a lot, it means that the model is not robust.
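
A minimal sketch of such a test compares accuracy on clean inputs with accuracy on inputs that have Gaussian noise added. The model, data, and noise level below are hypothetical.

```python
import random

def perturbation_test(model, examples, labels, noise_std, seed=0):
    """Return (clean accuracy, accuracy after adding Gaussian noise to every feature)."""
    rng = random.Random(seed)
    noisy = [[v + rng.gauss(0.0, noise_std) for v in x] for x in examples]

    def acc(xs):
        return sum(model(x) == y for x, y in zip(xs, labels)) / len(labels)

    return acc(examples), acc(noisy)

# Hypothetical model: predicts 1 when the sum of the features is positive.
model = lambda x: int(sum(x) > 0)
examples = [[2.0, 1.0], [0.1, 0.1], [-0.1, -0.2], [-3.0, -1.0]]
labels = [1, 1, 0, 0]

clean_acc, noisy_acc = perturbation_test(model, examples, labels, noise_std=0.5)
print(clean_acc, noisy_acc)
```

A large gap between the two numbers suggests the model relies on small input details that noise can destroy.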

### Invariance tests
Similar to perturbation tests, invariance tests are a way to evaluate the robustness of a model. The idea is that certain changes to the input **must not cause changes in the model output**. For example, if you are predicting Covid-19 cases, the model should not change its prediction if you change the name of the patient.
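
The check can be automated by changing only the feature in question and comparing predictions. The patient records and both models in this sketch are hypothetical.

```python
def invariance_test(model, examples, feature, replacement_values):
    """Return the (example, value) pairs where changing only `feature` changes the prediction."""
    failures = []
    for example in examples:
        original = model(example)
        for value in replacement_values:
            changed = {**example, feature: value}
            if model(changed) != original:
                failures.append((example, value))
    return failures

# Hypothetical model that ignores the patient's name, as it should.
def model(patient):
    return int(patient["fever"] and patient["cough"])

# Hypothetical broken model whose output depends on the name.
def name_dependent_model(patient):
    return int(patient["name"] == "Alice")

patients = [
    {"name": "Alice", "fever": True, "cough": True},
    {"name": "Bob", "fever": False, "cough": True},
]
failures = invariance_test(model, patients, "name", ["Carol", "Dave"])
bad_failures = invariance_test(name_dependent_model, patients, "name", ["Carol", "Dave"])
print(len(failures), len(bad_failures))
```

An empty failure list means the model passed on these examples. It does not prove invariance in general.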

### Directional expectation tests
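For some features, a change in the input should move the output in a predictable direction. For example, when developing a model to predict housing prices, keeping all the features the same but increasing the lot size shouldn’t decrease the predicted price, and decreasing the square footage shouldn’t increase it.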
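
Such expectations can be written as simple checks. The price model and its coefficients below are invented for illustration only.

```python
def directional_test(model, example, feature, delta, expect="non_decreasing"):
    """Check that changing one feature by `delta` moves the prediction the expected way."""
    before = model(example)
    after = model({**example, feature: example[feature] + delta})
    if expect == "non_decreasing":
        return after >= before
    return after <= before

# Hypothetical price model, linear in lot size and square footage.
def price_model(house):
    return 50_000 + 20 * house["lot_size"] + 150 * house["square_footage"]

house = {"lot_size": 5_000, "square_footage": 1_800}
lot_ok = directional_test(price_model, house, "lot_size", +500, "non_decreasing")
sqft_ok = directional_test(price_model, house, "square_footage", -200, "non_increasing")
print(lot_ok, sqft_ok)
```

Running the same check over many examples from the test set gives a better picture than a single house.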

### Model calibration
Suppose user A watches romance movies 80% of the time and comedy 20% of the time. If your recommender system shows exactly the movies A will most likely watch, the recommendations will consist of only romance movies because A is much more likely to watch romance than any other type of movies. You might want a more calibrated system whose recommendations are representative of users’ actual watching habits. In this case, they should consist of 80% romance and 20% comedy.

To measure a model's calibration, a simple method is counting: you count the number of times your model outputs the probability X and the frequency Y of that prediction coming true, and plot X against Y.

To calibrate a model in scikit-learn, you can use `sklearn.calibration.CalibratedClassifierCV`.
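
The counting method described above can be sketched by grouping predictions into probability bins. The bin count and the predictions below are illustrative assumptions.

```python
def calibration_curve_by_counting(probs, outcomes, n_bins=5):
    """For each non-empty bin, return (mean predicted probability, observed frequency)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            curve.append((mean_p, freq))
    return curve

# Made-up predicted probabilities and whether the event actually happened.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 1, 1, 1, 0]

curve = calibration_curve_by_counting(probs, outcomes)
for mean_p, freq in curve:
    print(f"predicted={mean_p:.2f}  observed={freq:.2f}")
```

For a perfectly calibrated model the points lie on the diagonal, where predicted probability equals observed frequency. scikit-learn offers `sklearn.calibration.calibration_curve` for the same purpose.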

### Confidence measurement
A model’s confidence is how sure it is about its predictions. For example, if a model predicts that a patient has a 70% chance of having cancer, it’s 70% confident in that prediction. Confidence is important because it can help you decide how much you trust a model’s predictions.

### Slice-based evaluation
Slicing means separating your data into subsets and looking at your model’s performance on each subset separately.
This is useful because it can help you identify biases in your model. For example, if your model performs well on one subset of your data but poorly on another, it might be biased toward the first subset.
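
A minimal sketch of slice-based evaluation groups examples by a slicing function and computes accuracy per group. The `device` feature, labels, and predictions below are made up for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(examples, labels, predictions, slice_fn):
    """Group examples by slice_fn(example) and compute accuracy within each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for x, y, pred in zip(examples, labels, predictions):
        key = slice_fn(x)
        total[key] += 1
        correct[key] += int(pred == y)
    return {key: correct[key] / total[key] for key in total}

examples = [{"device": "mobile"}] * 4 + [{"device": "desktop"}] * 4
labels = [1, 0, 1, 0, 1, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 1, 0]

per_slice = accuracy_by_slice(examples, labels, predictions, lambda x: x["device"])
overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(overall, per_slice)
```

Here the overall accuracy hides the fact that the model is much worse on one slice than on the other.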