# ML ENGINEERING

<img src="../img/ml-engineering.png" width="60%" >




## Static vs Dynamic Training
- **A static model is trained offline**. That is, we train the model exactly once and then use that trained model for a while.
- **A dynamic model is trained online**. That is, data is continually entering the system and we're incorporating that data into the model through continuous updates.
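
A minimal sketch of the difference, assuming a hypothetical one-weight model `y_hat = w * x` trained with plain gradient steps. The function names, learning rate, and toy data are illustrative assumptions, not part of any particular library:

```python
def sgd_step(w, x, y, lr=0.1):
    """One gradient step on squared error for the one-weight model y_hat = w * x."""
    grad = 2.0 * (w * x - y) * x
    return w - lr * grad


def train_static(dataset, epochs=50):
    """Static training: fit once on a fixed dataset, then freeze the weight."""
    w = 0.0
    for _ in range(epochs):
        for x, y in dataset:
            w = sgd_step(w, x, y)
    return w


def train_dynamic(w, stream):
    """Dynamic training: keep updating the weight as each new example arrives."""
    for x, y in stream:
        w = sgd_step(w, x, y)
    return w


# Toy data (made up): historical examples follow y = 2x, newer ones drift to y = 3x.
static_w = train_static([(1.0, 2.0), (2.0, 4.0)])
dynamic_w = train_dynamic(static_w, [(1.0, 3.0)] * 50)
```

In this toy setup the static weight stays at what the historical data implied, while the dynamically updated weight follows the newer examples.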

## Static vs Dynamic Inference (making predictions)
- **Offline inference**, meaning that you make all possible predictions in a batch, using MapReduce or a similar batch-processing framework. You then write the predictions to an SSTable or Bigtable, and feed these to a cache/lookup table.
- **Online inference**, meaning that you predict on demand, using a server.
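
The two serving styles can be sketched as follows, assuming a hypothetical `predict` function standing in for a trained model and a plain dictionary standing in for the lookup table:

```python
def predict(x):
    """Hypothetical stand-in for a trained model (here a fixed linear function)."""
    return 2.0 * x + 1.0


# Offline inference: score every known input in one batch and store the results.
known_inputs = [0.0, 1.0, 2.0, 3.0]
prediction_table = {x: predict(x) for x in known_inputs}


def serve_offline(x):
    """Serving is a table lookup; inputs missing from the batch get no answer."""
    return prediction_table.get(x)


def serve_online(x):
    """Online inference: run the model at request time, so any input works."""
    return predict(x)
```

The trade-off visible here is that the offline path can only answer for inputs that were in the batch, while the online path handles unseen inputs but pays the model's cost on every request.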



***

# **THE ML FINE PRINT**
The following three basic assumptions guide generalization:

- We draw examples **independently and identically** (i.i.d.) at random from the distribution. In other words, examples don't influence each other. (An alternate explanation: i.i.d. is a way of referring to the randomness of variables.)
- The distribution is **stationary**; that is, the distribution doesn't change within the data set.
- We draw examples from partitions of the **same distribution**.

In practice, we sometimes violate these assumptions. For example:

- Consider a model that chooses ads to display. The i.i.d. assumption would be violated if the model bases its choice of ads, in part, on what ads the user has previously seen.
- Consider a data set that contains retail sales information for a year. Users' purchases change seasonally, which would violate stationarity.


***


# **SUMMARY - HYPERPARAMETER TUNING**



- Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero.
- If the training loss does not converge, train for more epochs.
- If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.
- If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.
- Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.
- Setting the batch size to a very small number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
- For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you'll need to reduce the batch size to enable a batch to fit into memory.
- Remember: the ideal combination of hyperparameters is data dependent, so you must always experiment and verify.
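
The effect of the learning rate can be seen on a toy problem. This sketch assumes the simplest possible loss, `loss(w) = w**2`, and a hypothetical `gradient_descent` helper; the specific rates are illustrative and would not transfer to a real model:

```python
def gradient_descent(lr, steps=20, w0=10.0):
    """Minimize loss(w) = w**2 and return the loss after every step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w -= lr * grad
        losses.append(w * w)
    return losses


slow = gradient_descent(lr=0.001)     # loss decreases too slowly
good = gradient_descent(lr=0.1)       # steep drop at first, then flattens
diverging = gradient_descent(lr=1.1)  # loss grows on every step
```

Plotting the three lists as loss curves should show the patterns described in the bullets above: a nearly flat curve, a converging curve, and a curve that never converges.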

***


# **TRAINING NEURAL NETWORKS - BEST PRACTICES**

## Failure Cases
There are a number of common ways for backpropagation to go wrong.

### Vanishing Gradients
The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.

When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.

The ReLU activation function can help prevent vanishing gradients.
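
A rough illustration of why the product shrinks, assuming every layer contributes the same local derivative and ignoring the weights entirely (a deliberate simplification). The sigmoid comparison is an added assumption; the notes above only name ReLU:

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(z) ** depth


sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 20)  # 0.25 ** 20
relu_chain = chained_gradient(relu_grad, 1.0, 20)        # 1.0 ** 20
```

Even at its steepest point the sigmoid shrinks the gradient by a factor of four per layer, while an active ReLU passes it through unchanged.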

### Exploding Gradients
If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large for training to converge.

Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
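
The "product of many large terms" can be made concrete with a toy chain of linear layers that all share one weight. This is a hypothetical setup chosen only to keep the arithmetic visible:

```python
def input_gradient(weight, depth):
    """d(output)/d(input) for a chain of `depth` linear layers sharing one weight."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each layer multiplies the gradient by its weight
    return grad


large = input_gradient(weight=3.0, depth=30)   # grows as 3 ** 30
modest = input_gradient(weight=1.0, depth=30)  # stays at 1.0
```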

### Dead ReLU Units
Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0.

Lowering the learning rate can help keep ReLU units from dying.
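
A small sketch of a stuck unit, assuming a single ReLU neuron trained with squared error on made-up data. `train_unit` and all the numbers are illustrative assumptions:

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def train_unit(w, b, inputs, targets, lr=0.1, epochs=10):
    """Train one ReLU unit y_hat = relu(w * x + b) with squared error."""
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            z = w * x + b
            upstream = 2.0 * (relu(z) - y)  # d(loss)/d(activation)
            local = relu_grad(z)            # 0 whenever z <= 0
            w -= lr * upstream * local * x
            b -= lr * upstream * local
    return w, b


inputs = [0.5, 1.0, 1.5]
targets = [1.0, 2.0, 3.0]

# The weighted sum is negative for every input, so no gradient ever flows.
dead = train_unit(w=-1.0, b=-1.0, inputs=inputs, targets=targets)

# The weighted sum starts positive, so the unit receives gradient and learns.
alive = train_unit(w=0.5, b=0.0, inputs=inputs, targets=targets)
```

The dead unit returns exactly the parameters it started with, no matter how many epochs it trains for.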

## Dropout Regularization
Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization:

- 0.0 = No dropout regularization.
- 1.0 = Drop out everything. The model learns nothing.
- Values between 0.0 and 1.0 = More useful.
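
A minimal sketch of a dropout step on a list of activations. It assumes the common "inverted dropout" convention of scaling the surviving activations by `1 / (1 - rate)`, which the notes above do not specify. For that reason a rate of exactly 1.0 is rejected here, because the scaling would be undefined:

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, then rescale the survivors.

    Scaling by 1 / (1 - rate) keeps the expected activation unchanged
    (inverted dropout, an assumption of this sketch).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0.0, 1.0)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # seeded so runs are repeatable
activations = [1.0] * 10

unchanged = dropout(activations, 0.0, rng)  # rate 0.0: nothing is dropped
half = dropout(activations, 0.5, rng)       # each value becomes 0.0 or 2.0
```

A fresh random mask is drawn on every call, which matches the idea of dropping units for a single gradient step. At inference time dropout is normally switched off.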

***


# **MULTICLASS NEURAL NETWORKS**

**ONE VS ALL** trains one binary classifier per class. This is a reasonable approach when the number of classes is small.

<img src="../img/one-vs-all.png" width="60%">


**SOFTMAX** is the better choice when there are many classes. It assigns a probability to each class, and the probabilities in the output layer sum to 1.

<img src="../img/softmax.png" width="60%">




- **Full Softmax** calculates a probability for every possible class.
- **Candidate sampling** calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.

Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
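
Full Softmax over a list of logits can be sketched in a few lines. Subtracting the largest logit before exponentiating is a standard numerical-stability trick and does not change the result; the example logits are made up:

```python
import math


def softmax(logits):
    """Return exp(z_i) / sum_j exp(z_j) for every logit z_i."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # one probability per class
```

The cost of the denominator grows with the number of classes, which is the expense that candidate sampling tries to avoid.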

**One Label vs. Many Labels**

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be members of multiple classes. For such examples:

- You may not use Softmax.
- You must rely on multiple logistic regressions.
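
For the multi-label case, one independent logistic (sigmoid) output per class can be sketched like this. The logits and the 0.5 decision threshold are illustrative assumptions:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_probs(logits):
    """One independent logistic regression output per class."""
    return [sigmoid(z) for z in logits]


probs = multilabel_probs([2.0, 1.5, -3.0])
predicted = [p >= 0.5 for p in probs]  # several classes may be True at once
```

Unlike Softmax outputs, these probabilities are not forced to sum to 1, so an example can score highly for more than one class.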


***

# **TRAINING AN EMBEDDING AS A PART OF A LARGER MODEL**

<img src="../img/embeding.png" width="60%">
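
One way to picture an embedding layer inside a larger model is as a weight matrix with one row per item, so that multiplying a one-hot input by the matrix selects that item's row. The sketch below assumes a hypothetical four-item vocabulary embedded in two dimensions. The matrix values are made up; in a real model they would be learned by backpropagation along with the other weights:

```python
def one_hot(index, size):
    """Sparse input: 1.0 at the item's index, 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Row vector times matrix, i.e. a dense layer with no bias or activation."""
    columns = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(columns)]


# Hypothetical embedding weights: 4 items x 2 dimensions (illustrative values).
embedding_matrix = [
    [0.1, 0.9],
    [0.4, -0.2],
    [-0.7, 0.3],
    [0.0, 0.5],
]


def embed(index):
    """Embedding lookup: read the row directly instead of multiplying."""
    return embedding_matrix[index]


dense = vector_times_matrix(one_hot(2, 4), embedding_matrix)
```

Both routes give the same vector for item 2, which is why frameworks typically implement embeddings as a row lookup rather than a full matrix multiplication.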