### Importance of Test/Validation set choices

Choosing a test set that reflects real-world conditions is **the most important** part of a machine learning project.
Choosing a validation set that reflects the test set is the **second most important** part.

Strategy for checking that the test/validation sets are in sync:
- train 5 models of varying accuracy
- plot each model's score on the validation set against its score on the test set
- the candidate validation set whose points fall closest to a straight line is the one to use
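
The check above can be sketched numerically instead of by eye. This is only an illustration: the scores are made-up numbers, the candidate names (`random_split`, `last_two_weeks`) are hypothetical, and using Pearson correlation as the measure of "falling on a line" is an assumption rather than part of the original strategy.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: close to 1.0 when the points lie near a rising line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up test-set scores for 5 models of varying accuracy.
test_scores = [0.62, 0.70, 0.75, 0.81, 0.88]

# Made-up scores of the same 5 models on two candidate validation sets.
candidate_val_scores = {
    "random_split": [0.80, 0.79, 0.86, 0.84, 0.90],
    "last_two_weeks": [0.60, 0.69, 0.73, 0.80, 0.86],
}

# Pick the candidate whose scores track the test scores most closely.
alignment = {
    name: pearson(scores, test_scores)
    for name, scores in candidate_val_scores.items()
}
best_candidate = max(alignment, key=alignment.get)
print(best_candidate, alignment)
```

A scatter plot is still worth looking at, since correlation alone can hide a single model that breaks the trend.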

#### Cross Validation
Randomly shuffle the data and split it into n chunks (folds).  
Train the model n times, using a different chunk as the validation set each time.

- takes n times longer to run
- the data must be safe to shuffle (i.e. not temporal)
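
A minimal sketch of the procedure in plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the toy "predict the mean" model are illustrative assumptions, not a library API.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds chunks."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(xs, ys, n_folds, fit, score):
    """Train n_folds times, holding out a different chunk for validation each time."""
    folds = k_fold_indices(len(xs), n_folds)
    scores = []
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return scores

# Toy model (assumption for the demo): always predict the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_scores = cross_validate(xs, ys, n_folds=5, fit=fit_mean, score=mse)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

The shuffle in `k_fold_indices` is exactly what makes this unsuitable for temporal data: future rows would leak into the training folds.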

### Decision Tree Ensembles

**Ensembling:** a set of weak learners is combined to create a strong learner that performs better than any single one

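A decision tree is a type of nearest-neighbors method (in tree space).  
Trees are data hungry: each split divides the data (roughly in half), so deep branches are left with few samples.

- Pro: highly interpretable, scalable, flexible, work well for most data
- Con: don't extrapolate well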

#### RandomForests
- Pro: harder to screw up, easier to scale
- Pro: no need to normalize the data (only the sort order of each feature determines the splits), which also makes them robust to outliers in the inputs
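
To illustrate why sort order is all that matters, here is a small sketch of how a single tree split is chosen. The function `best_split` and the data are hypothetical, and it handles one feature with no tied values only. Rescaling the feature with any order-preserving transform (here `log`) leaves the chosen partition unchanged, even with an extreme outlier.

```python
import math

def best_split(xs, ys):
    """Try every cut point in sorted order; return (squared error, left-side rows)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        sse = 0.0
        for group in (left, right):
            mean = sum(ys[i] for i in group) / len(group)
            sse += sum((ys[i] - mean) ** 2 for i in group)
        if best is None or sse < best[0]:
            best = (sse, frozenset(left))
    return best

xs = [1.0, 2.0, 3.0, 4.0, 1000.0]   # the last value is an extreme outlier
ys = [1.0, 1.2, 0.9, 5.0, 5.3]

raw_split = best_split(xs, ys)
log_split = best_split([math.log(x) for x in xs], ys)
print(raw_split == log_split)  # same rows on each side either way
```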

#### Gradient Boosting Machines

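In boosting algorithms, each learner is trained taking into account the success of the previous learners. In AdaBoost-style boosting, the sample weights are redistributed after each training step: misclassified examples get larger weights to emphasise the most difficult cases, so subsequent learners focus on them during training. Gradient boosting applies the same idea differently: each new learner is fit to the errors (the negative gradient of the loss) left by the current ensemble.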
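
A minimal sketch of boosting for regression, assuming squared-error loss, where each new weak learner (a one-split stump) is fit to the residuals left by the learners before it. This is the gradient-boosting flavour of the idea rather than the sample-reweighting scheme described above. The names `fit_stump` and `gradient_boost`, the hyperparameters, and the toy data are illustrative and not taken from any library.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

mean_y = sum(ys) / len(ys)
baseline_mse = mean_squared_error(lambda x: mean_y, xs, ys)
boosted_mse = mean_squared_error(gradient_boost(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)
```

Each stump on its own is a poor model; the combined model improves because every round targets whatever error the previous rounds left behind. The errors printed here are training errors, so they say nothing about over-fitting.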

Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability.

If the problem is that the single model performs poorly, Bagging will rarely reduce the bias. However, Boosting could generate a combined model with lower errors, as it builds on the strengths and reduces the pitfalls of the single model.

By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option. Boosting, for its part, does not help avoid over-fitting; in fact, it is prone to this problem itself. For this reason, Bagging is effective more often than Boosting.

### Neural Networks

#### SGD - Chain Rule

- x: inputs
- f(): linear layer
- g(): sigmoid/softmax (non-linearity)
- h(): x-entropy/RMSE (loss function)  
The composed functions produce a single loss value, e.g. h(g(f(x))) = 0.6

- d: derivative wrt weights (w)  
Chain Rule (allows us to calculate all of the derivatives at the same time):  

d h(g(f(x, w))) / dw = h'(u) * g'(v) * df(x, w)/dw

- v = f(x, w)
- u = g(v)  
- loss = h(u)

The derivative (gradient) is a vector with one entry per weight.  
Each entry says how much changing that weight affects the loss: how much changing w1 affects the loss, how much changing w2 affects the loss, and so on.
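
A worked sketch of the chain rule for one training example, assuming a single linear layer with no bias for f, a sigmoid for g, and binary cross-entropy for h. The function names and the numbers are illustrative. The analytic gradient is checked against finite differences, and one SGD step is taken at the end.

```python
import math

def forward_backward(x, y, w):
    """Return (loss, gradient wrt w) for loss = h(g(f(x, w)))."""
    v = sum(wi * xi for wi, xi in zip(w, x))                   # v = f(x, w), linear layer
    u = 1.0 / (1.0 + math.exp(-v))                             # u = g(v), sigmoid
    loss = -(y * math.log(u) + (1 - y) * math.log(1 - u))      # h(u), cross-entropy

    dh_du = -(y / u) + (1 - y) / (1 - u)                       # h'(u)
    dg_dv = u * (1 - u)                                        # g'(v)
    grad = [dh_du * dg_dv * xi for xi in x]                    # df/dw_i = x_i
    return loss, grad

def numeric_grad(x, y, w, eps=1e-6):
    """Central finite differences: nudge one weight at a time."""
    grads = []
    for i in range(len(w)):
        w_hi = list(w)
        w_lo = list(w)
        w_hi[i] += eps
        w_lo[i] -= eps
        loss_hi, _ = forward_backward(x, y, w_hi)
        loss_lo, _ = forward_backward(x, y, w_lo)
        grads.append((loss_hi - loss_lo) / (2 * eps))
    return grads

x = [0.5, -1.2, 2.0]
y = 1.0
w = [0.1, 0.4, -0.3]

loss, grad = forward_backward(x, y, w)
approx = numeric_grad(x, y, w)

# One SGD step: move each weight against its gradient entry.
learning_rate = 0.1
w_new = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
new_loss, _ = forward_backward(x, y, w_new)
print(loss, new_loss, grad, approx)
```

The finite-difference version needs two forward passes per weight, while the chain rule yields every entry of the gradient from a single forward and backward pass.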

General rule of thumb for loss functions:  
- RMSE - regression
- X-entropy - classification

**NLP embedding size:** ~600 has been found empirically to work best.  
Embedding size depends on the complexity and nuance of the variable.  
Human language is incredibly complex, so other variables normally don't need embeddings approaching 600 in size.

### ML pro-tip

- over-parameterize the model for better accuracy
- then use regularization (dropout/weight decay) to limit overfitting and help it generalize
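
A rough sketch of the two regularizers named above, in plain Python. The function names, default values, and the "inverted dropout" scaling convention are assumptions for illustration, not any particular framework's API.

```python
import random

def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """SGD with L2 weight decay: w <- w - lr * (grad + weight_decay * w)."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

def dropout(activations, p=0.5, rng=random):
    """Inverted dropout: zero each activation with probability p and
    scale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# With zero gradient, weight decay alone pulls every weight toward zero.
weights = [1.0, -2.0, 0.5]
decayed = sgd_step(weights, grads=[0.0, 0.0, 0.0])

# Dropout is applied at training time only; at inference activations pass through unchanged.
dropped = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
print(decayed, sum(dropped) / len(dropped))
```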