# Best practices when training and evaluating ML models using supervised learning

1. Look at your data
   * How many data points?
   * Type of each data point?
   * Missing or erroneous data?
   * Relationships between data points (e.g. temporal, spatial)?
   * Type of dependent variable: will the task be regression or classification?
   * If you can, get an idea of the accuracy of the labeling (e.g. inter-annotator agreement).
   * Compute summary statistics on your data
2. If necessary, transform or normalize your data
   * Fill in missing values, or drop data points
   * Correct erroneous values, or drop data points
   * Consider scaling, translation, or rotation of data (*using what information from step 1?*)
   * Consider data augmentation
   * You may have to embed and/or encode your data
3. Consider dimensionality reduction (*What are some methods for dimensionality reduction?*)
   * PCA
4. Determine how to split your data (*What are some ways?*)
   * Random train/dev/test split
   * Stratified sampling
   * Temporal train/dev/test split
   * k-fold cross-validation
     * without replacement
     * with replacement
5. Select model architectures and define relevant items (*What do we need to define for each neural network architecture?*)
   * number of hidden layers
   * width of each hidden layer
   * nature of connectedness
   * activation function(s)
   * loss function
   * optimization algorithm
6. Figure out the hyperparameters you will vary, and what the possible values or range for each will be (*What are some hyperparameters when using minibatch SGD as the optimization algorithm?*)
   * learning rate
   * number of epochs
   * whether and how to regularize
   * batch size
7. Train and evaluate (for each model architecture and each combination of hyperparameters)
   * As you evaluate, look for underfitting or overfitting
8. Test on **held-out test data**
9. Deploy and monitor
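
As a rough illustration of steps 4 and 6 through 8, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the synthetic data (`y = 2x + 1` plus noise), the 60/20/20 random split, the helper names (`split_data`, `fit_ridge`, `mse`), and the choice of a one-dimensional ridge regression whose only hyperparameter is the regularization strength `lam`. The point is the workflow, not the model: hyperparameters are compared on the dev set, and the test set is used once at the end.

```python
import random

def split_data(points, train_frac=0.6, dev_frac=0.2, seed=0):
    """Step 4: shuffle, then cut into train/dev/test (random split)."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_dev = int(dev_frac * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

def fit_ridge(points, lam):
    """Closed-form 1-D ridge regression; lam is the regularization strength."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    w = sxy / (sxx + lam)
    b = y_mean - w * x_mean
    return w, b

def mse(model, points):
    """Mean squared error of the (w, b) model on a list of (x, y) pairs."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

# Synthetic data, assumed for illustration: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, 2 * x + 1 + rng.gauss(0, 0.1)))

train, dev, test = split_data(data)

# Steps 6-7: fit once per hyperparameter value, score on the dev set only.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
dev_scores = {lam: mse(fit_ridge(train, lam), dev) for lam in candidates}
best_lam = min(dev_scores, key=dev_scores.get)

# Step 8: touch the held-out test set once, with the chosen setting.
test_mse = mse(fit_ridge(train, best_lam), test)
print(f"best lam={best_lam}, dev MSE={dev_scores[best_lam]:.4f}, test MSE={test_mse:.4f}")
```

With several hyperparameters, the same loop would run over every combination (for example with `itertools.product`), which is why the grid in step 6 is worth keeping small.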

## Deploy and monitor: The ground is shifting beneath our feet

Let's say you train a model on irises from 1998. However, now it's 2023 and:
* natural selection may have led to changes in iris sizes
* human selection has led to changes in iris sizes

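This is **covariate shift**: the distribution of the features has changed, even if the relationship between features and species has not.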

Or some biologist may have come along and done genetic testing on our irises, and discovered that some of them were assigned to the wrong species.

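This is **label shift**, loosely speaking. In the stricter sense, label shift means the proportions of the species change while each species still looks the same; corrected labels for the same flowers are closer to concept shift, described next.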

Or, based on the genetic testing, the pesky biologist may have discovered that there's a fourth species of iris in our data - a new species.

This is **concept shift**: the definition of the labels themselves has changed.

The textbook covers some of the theory. Here's some practice:
1. When you deploy your model, monitor its performance over time. This means continuing to label some data periodically, and then comparing your newly labeled data with your previously labeled data. You can discover all three types of shift in this way.
2. If there has been shift, your model may have sufficient generalization to accommodate the shift, or it may not. Evaluate your model on your newly labeled data.
3. If your model's performance has gotten worse, you may need to retrain your model. Retraining schedules for production models range from nightly (recommender systems) to quarterly or yearly (image classification, object detection, NLP).
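
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the petal lengths and labels are made-up numbers, `predict` is a toy threshold rule standing in for a deployed classifier, and the helper names (`mean_shift_score`, `accuracy`, `needs_retraining`) and the 0.05 tolerance are hypothetical choices, not a standard API. A production system would more likely use a proper two-sample test and far more data.

```python
import statistics

def mean_shift_score(reference, recent):
    """Difference in means, in units of the reference standard deviation."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.fmean(recent) - ref_mean) / ref_std

def accuracy(predict, labeled):
    """Fraction of (feature, label) pairs that predict() gets right."""
    return sum(predict(x) == y for x, y in labeled) / len(labeled)

def needs_retraining(baseline_acc, recent_acc, tolerance=0.05):
    """Flag the model when accuracy on fresh labels drops past the tolerance."""
    return baseline_acc - recent_acc > tolerance

# Hypothetical petal lengths (cm): training-era data vs. recent data.
petal_1998 = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1, 5.9, 6.0]
petal_2023 = [1.9, 2.0, 1.8, 5.3, 5.0, 5.6, 6.5, 6.4]

def predict(petal_length):
    """Toy threshold rule standing in for the deployed model."""
    return "setosa" if petal_length < 1.7 else "other"

# Hypothetical labeled samples: the original ones and a freshly labeled batch.
labeled_1998 = [(1.4, "setosa"), (1.5, "setosa"), (4.7, "other"), (5.1, "other")]
labeled_2023 = [(1.9, "setosa"), (2.0, "setosa"), (5.3, "other"), (5.6, "other")]

# Step 1: compare new data with old data.
shift = mean_shift_score(petal_1998, petal_2023)
# Step 2: evaluate the existing model on the newly labeled data.
old_acc = accuracy(predict, labeled_1998)
new_acc = accuracy(predict, labeled_2023)
# Step 3: decide whether the drop is large enough to retrain.
retrain = needs_retraining(old_acc, new_acc)

print(f"shift score: {shift:.2f}, accuracy: {old_acc:.2f} -> {new_acc:.2f}")
print("retrain?", retrain)
```

In this toy case the irises have grown, so the old threshold misclassifies the larger setosa flowers and the accuracy check flags the model for retraining.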


## Generalization, part two

**Team exercise**: In your teams, answer these questions from the reading for today:
1. If we wish to estimate the error of a fixed model to within 0.0001 with probability greater than 99.9%, how many samples do we need?
2. Suppose that somebody else possesses a labeled test set and only makes available the unlabeled inputs (features). Now suppose that you can only access the test set labels by running a model (no restrictions placed on the model class) on each of the unlabeled inputs and receiving the corresponding error. How many models would you need to evaluate before you leak the entire test set and thus could appear to have error 0, regardless of your true error?
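
For the first question, one common tool is Hoeffding's inequality. Assuming the per-example error is bounded in $[0, 1]$ (as with classification error) and the test samples are drawn independently, the gap between the empirical error $\epsilon_\mathcal{D}$ and the true error $\epsilon$ on $n$ samples satisfies $P(|\epsilon_\mathcal{D} - \epsilon| \geq t) \leq 2\exp(-2nt^2)$. Setting the right-hand side to at most $\delta$ and solving for $n$ gives $n \geq \ln(2/\delta) / (2t^2)$. The sketch below wraps that formula in a hypothetical helper, `hoeffding_sample_size`, and tries it on illustrative values rather than the ones in the exercise.

```python
import math

def hoeffding_sample_size(t, delta):
    """Smallest integer n with 2 * exp(-2 * n * t**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * t ** 2))

# Illustrative values (not the exercise's): error within 0.01
# with probability at least 95%, i.e. t = 0.01 and delta = 0.05.
n = hoeffding_sample_size(t=0.01, delta=0.05)
print(n)

# Halving the tolerance t roughly quadruples the required sample size.
n_half = hoeffding_sample_size(t=0.005, delta=0.05)
print(n_half / n)
```

Note how the tolerance enters squared: tightening the estimate is much more expensive than raising the confidence level, which only enters through a logarithm.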

## Optimization, review

We use an optimization algorithm to minimize the loss function for deep learning, and ultimately, to fit a function to the training data that we hope generalizes to test data. 

The function we are trying to fit, in deep learning, is typically *not* a line. What are some issues that arise when trying to fit different types of functions?

* local minima
* saddle points
* vanishing gradients

How do we deal with these issues?

* good parameter initialization
* minibatch SGD
* adjust the learning rate as the slope changes
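
To make the three remedies concrete, here is a minimal sketch of minibatch SGD in plain Python. Several things are assumptions made for the example: the model is a simple line fit to synthetic data (`y = 2x + 1` plus noise), so the loss is convex and none of the issues listed above actually arise; the parameters start from small random values with an assumed scale of 0.01; and the learning rate follows a fixed decay schedule, `lr0 / (1 + decay * epoch)`, which stands in for methods that adapt the step size to the observed gradients. The function name `minibatch_sgd` and its defaults are hypothetical.

```python
import random

def minibatch_sgd(data, lr0=0.2, decay=0.1, batch_size=10, epochs=50, seed=0):
    """Fit y = w * x + b by minibatch SGD on mean squared error."""
    rng = random.Random(seed)
    # Parameter initialization: small random values (assumed scale 0.01).
    w, b = rng.gauss(0, 0.01), rng.gauss(0, 0.01)
    data = data[:]  # copy so the caller's list is not reordered
    for epoch in range(epochs):
        # Learning rate schedule: take smaller steps in later epochs.
        lr = lr0 / (1 + decay * epoch)
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the squared error, averaged over the minibatch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data, assumed for illustration: y = 2x + 1 plus Gaussian noise.
data_rng = random.Random(1)
points = []
for _ in range(200):
    x = data_rng.uniform(-1, 1)
    points.append((x, 2 * x + 1 + data_rng.gauss(0, 0.1)))

w, b = minibatch_sgd(points)
print(f"w={w:.2f}, b={b:.2f}")
```

The noise in each minibatch gradient is part of the appeal: on a non-convex loss it can help the parameters move off saddle points and out of shallow local minima, while the shrinking learning rate lets them settle once they reach a good region.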
