# Hyperparameters & parameters

### Model Optimization

* This subunit covers hyperparameters and how to optimize them. Unlike standard parameters, which are learned during training, hyperparameters must be set before the learning process begins. There are several standard methods you can use to set hyperparameters for a model. We'll also cover one out-of-the-box method.

#### Parameters vs Hyperparameters
* **Parameters** of your model are the weights *W* and the biases $\beta$.
* **Hyperparameters:** learning rate $\alpha$, number of gradient descent iterations, number of hidden layers *L*, number of hidden units per layer $n^1$, $n^2$, $n^3$, ..., choice of activation function, momentum term, mini-batch size, and various regularization parameters.
* **Hyperparameters** are "parameters" that control the ultimate values of *W* and $\beta$. In other words, the hyperparameters determine the final values of the parameters.
* Note: in some earlier ML literature, $\alpha$ was referred to as a parameter, and it is sometimes still called one.
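To make the distinction concrete, here is a minimal sketch using plain-Python gradient descent on a one-feature linear model. The function name `train_linear_model`, the toy data, and the chosen values are illustrative assumptions, not part of the notes; `w` and `b` play the roles of *W* and $\beta$.

```python
def train_linear_model(xs, ys, learning_rate=0.05, n_iterations=2000):
    """Fit y ~= w * x + b with batch gradient descent.

    learning_rate and n_iterations are hyperparameters: chosen before training.
    w and b are parameters: learned from the data during training.
    """
    w, b = 0.0, 0.0  # parameters start at an arbitrary initial value
    n = len(xs)
    for _ in range(n_iterations):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 3x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]

# Same data, different hyperparameters, different final parameters.
w_fit, b_fit = train_linear_model(xs, ys, learning_rate=0.05, n_iterations=2000)
w_few, b_few = train_linear_model(xs, ys, learning_rate=0.05, n_iterations=5)
```

With enough iterations the learned parameters approach the values that generated the data; with only a few iterations they stop short. The data did not change, only the hyperparameters did.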


* ML and deep learning are very empirical processes.
* $\Rightarrow$ IDEA $\Rightarrow$ CODE $\Rightarrow$ EXPERIMENT $\Rightarrow$ IDEA $\Rightarrow$
* "Empirical" is (arguably) a fancy word for "trial and error".



# Hyperparameter Tuning

* In the realm of machine learning, hyperparameter tuning is a “meta” learning task.
* Machine learning models are basically mathematical functions that represent the relationship between different aspects of data.
* “Training a model” involves using an optimization procedure to determine the model parameters that best “fit” the data.
* There is another set of parameters known as hyperparameters, sometimes also known as “nuisance parameters.” These are values that must be specified outside of the training procedure.
    * Ridge regression and lasso both add a regularization term to linear regression; the weight of the regularization term is called the regularization parameter.
    * Decision trees have hyperparameters such as the desired depth and number of leaves in the tree.
    * Support vector machines (SVMs) require setting a misclassification penalty term.
    * Kernelized SVMs require setting kernel parameters, such as the width of a radial basis function (RBF) kernel.
    * ... etc...
* A regularization hyperparameter controls the capacity of the model, i.e., how flexible the model is, how many degrees of freedom it has in fitting the data.
* Proper control of model capacity can prevent overfitting, which happens when the model is too flexible, and the training process adapts too much to the training data, thereby losing predictive accuracy on new test data. 
* Another type of hyperparameter comes from the training process itself. Training a machine learning model often involves optimizing a loss function (the training metric).
* A number of mathematical optimization techniques may be employed, some of them having parameters of their own.
    * Stochastic gradient descent requires a learning rate or a learning rate schedule.
    * Some optimization methods require a convergence threshold.
    * Random forests and boosted decision trees require setting the total number of trees (though this could also be classified as a type of regularization hyperparameter).
* Since the training process doesn’t set the hyperparameters, there needs to be a meta process that tunes the hyperparameters. This is what we mean by hyperparameter tuning.
* **Hyperparameter tuning is a meta-optimization task.** However, each trial of a particular hyperparameter setting involves training a model, which is itself an inner optimization process. **The outcome of hyperparameter tuning is the best hyperparameter setting, and the outcome of model training is the best model parameter setting.**
* Tuning hyperparameters: For each proposed hyperparameter setting, the inner model training process comes up with a model for the dataset and outputs evaluation results on hold-out or cross-validation datasets. After evaluating a number of hyperparameter settings, the hyperparameter tuner outputs the setting that yields the best performing model. The last step is to train a new model on the entire dataset (training and validation) under the best hyperparameter setting.
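The tuning procedure described above can be sketched as a small grid search with a hold-out validation set. This is a minimal illustration under stated assumptions: the inner model is a one-feature ridge regression without an intercept (so training has a closed-form solution), and the names `fit_ridge`, `tune_reg_strength`, the candidate grid, and the toy data are all hypothetical.

```python
def fit_ridge(xs, ys, reg_strength):
    """Inner optimization: closed-form 1-D ridge regression (no intercept).

    Minimizes sum((w * x - y)^2) + reg_strength * w^2 over the parameter w.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg_strength)


def mean_squared_error(w, xs, ys):
    """Evaluation metric computed on data the model was not trained on."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def tune_reg_strength(train, valid, candidates):
    """Outer optimization: try each hyperparameter setting, keep the best."""
    results = {}
    for reg_strength in candidates:
        w = fit_ridge(train[0], train[1], reg_strength)  # train the model
        results[reg_strength] = mean_squared_error(w, valid[0], valid[1])
    best = min(results, key=results.get)  # lowest validation error wins
    return best, results


# Toy data (illustrative values only), split into training and validation.
train = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
valid = ([4.0, 5.0], [8.1, 9.8])
candidates = [0.0, 0.1, 1.0, 10.0]

best_reg, results = tune_reg_strength(train, valid, candidates)

# Last step: retrain on training + validation data with the best setting.
all_xs = train[0] + valid[0]
all_ys = train[1] + valid[1]
final_w = fit_ridge(all_xs, all_ys, best_reg)
```

The outer loop returns a hyperparameter setting (`best_reg`), while each call to `fit_ridge` returns a model parameter (`w`), mirroring the two levels of optimization. Grid search is only one possible tuner; the same structure applies to other search strategies.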