# <span style='color:#547DCD'> Validating and tuning  </span> 

Several choices must be made by the user before training, e.g., parameter values or architectural choices (*see Probst, Bischl, and Boulesteix (2018) for a study on the impact of hyperparameter tuning on model performance*).

## <span style='color:#7F8BC7'> Learning metrics  </span> 

The parameter values that are set before training are called **hyperparameters**. For an econometric perspective see J. Li, Liao, and Quaedvlieg (2020). 

### <span style='color:#AA9AC2'> Regression analysis  </span> 

The $L^{1}$ and $L^{2}$ norms are the mainstream choices because they are easy to interpret and compute. The first, the mean absolute error (MAE), gives the average distance to the realized value but is not differentiable at zero. The second, the mean squared error (MSE), is differentiable everywhere but is harder to interpret and gives more weight to outliers.
\begin{align*}
    \mathrm{MAE}(\mathbf{y},\tilde{\mathbf{y}}) & = \frac{1}{I} \sum_{i=1}^{I} \lvert y_i - \tilde{y}_i \rvert, \\
    \mathrm{MSE}(\mathbf{y}, \tilde{\mathbf{y}}) & = \frac{1}{I} \sum_{i=1}^{I}(y_i - \tilde{y}_i)^{2},
\end{align*}
and the RMSE is simply the square root of the MSE.
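
As a minimal sketch, the three metrics above can be written in plain Python. The function names and the toy return values are illustrative assumptions, not part of the original text.

```python
import math

def mae(y, y_pred):
    """Mean absolute error: average of |y_i - y_pred_i|."""
    return sum(abs(a - b) for a, b in zip(y, y_pred)) / len(y)

def mse(y, y_pred):
    """Mean squared error: average of (y_i - y_pred_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

def rmse(y, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y, y_pred))

# Toy realized and predicted returns (made-up numbers).
y = [0.02, -0.01, 0.03, -0.04]
y_pred = [0.01, 0.01, 0.02, 0.00]
print(mae(y, y_pred), mse(y, y_pred), rmse(y, y_pred))
```

Because the errors are squared before averaging, the single large miss on the last instance dominates the MSE far more than it does the MAE.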

- We can add weights to produce heterogeneity in the importance of instances. 
- MSE is the most common loss function in ML, but not necessarily the best choice for return prediction in portfolio allocation tasks.
    - We can decompose the loss into three terms: the sum of squared realized returns, the sum of squared predicted returns, and the product between the two (a covariance term if we assume zero means).
        - First term: does not matter, since it does not depend on the predictions.
        - Second term: controls the dispersion of the predictions around zero.
        - Third term (most interesting): the negativity of the cross-product $-2 y_i \tilde{y}_i$ is to the investor's benefit: either both terms are positive and the model has recognized a profitable asset, or both are negative and it has identified a bad opportunity. Problems arise when $y_i$ and $\tilde{y}_i$ have different signs. Standard algorithms do not optimize directly with respect to this indicator.
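
The weighting idea from the first bullet above can be sketched as follows. The function name `weighted_mse` and the choice to normalize by the sum of the weights are assumptions made for illustration; other normalizations are possible.

```python
def weighted_mse(y, y_pred, weights):
    """Squared errors averaged with instance-specific weights.

    Assumption: weights are non-negative, not all zero, and the loss is
    normalized by their sum, so equal weights recover the plain MSE.
    """
    total = sum(weights)
    weighted_errors = sum(
        w * (a - b) ** 2 for w, a, b in zip(weights, y, y_pred)
    )
    return weighted_errors / total

# Toy example: the third instance counts three times as much as the others.
y = [0.02, -0.01, 0.03]
y_pred = [0.01, 0.01, 0.00]
print(weighted_mse(y, y_pred, [1.0, 1.0, 3.0]))
```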


Other indicators used to quantify the quality of a model are presented below:

The $R^{2}$ is computed as usual:
$$
R^{2} (\mathbf{y}, \tilde{\mathbf{y}}) = 1 - \frac{\sum_{i=1}^{I}(y_i - \tilde{y}_i)^{2}}{\sum_{i=1}^{I}(y_i - \bar{y})^{2}}
$$
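This quantity can be computed on the testing sample rather than on the training sample, in which case it can be negative. Sometimes $\bar{y}$ is removed from the denominator, which means the model is compared with the constant zero predictor. This is relevant with returns because the simplest prediction of all is the constant zero value, and $R^{2}$ then measures whether the model beats this naive benchmark. It also avoids relying on sample averages, which can be period dependent. Note that when $\bar{y}$ is the average of the sample on which $R^{2}$ is evaluated, removing it can only enlarge the denominator, so this variant is never smaller than the centered one.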
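
Both variants of the $R^{2}$ can be sketched in a single function. The name `r_squared` and the `demean` flag are hypothetical and only serve to switch between the sample-average benchmark and the zero benchmark.

```python
def r_squared(y, y_pred, demean=True):
    """R^2 against the sample mean (demean=True) or against zero."""
    benchmark = sum(y) / len(y) if demean else 0.0
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - benchmark) ** 2 for a in y)
    # Assumes ss_tot > 0, i.e. the labels are not all equal to the benchmark.
    return 1.0 - ss_res / ss_tot

# Toy test-sample returns and predictions (made-up numbers).
y = [0.02, -0.01, 0.03, -0.04]
y_pred = [0.01, 0.01, 0.02, 0.00]
print(r_squared(y, y_pred, demean=True))
print(r_squared(y, y_pred, demean=False))
```

Nothing in the function restricts it to the training sample, so a model that does worse than the benchmark on the evaluation sample yields a negative value.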

We can also use the mean absolute percentage error and mean squared percentage error:
\begin{gather*}
    \mathrm{MAPE}(\mathbf{y},\tilde{\mathbf{y}}) = \frac{1}{I} \sum_{i=1}^{I} \left\lvert \frac{y_i - \tilde{y}_i}{y_i} \right\rvert, \\
    \mathrm{MSPE}(\mathbf{y},\tilde{\mathbf{y}}) = \frac{1}{I} \sum_{i=1}^{I}\left( \frac{y_i-\tilde{y}_i}{y_i} \right) ^{2}
\end{gather*}
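
A minimal sketch of the two percentage errors follows; the function names are illustrative. Both metrics divide by the label, so the sketch assumes that no $y_i$ is equal to zero, which is worth checking with returns.

```python
def mape(y, y_pred):
    """Mean absolute percentage error; assumes every label is non-zero."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_pred)) / len(y)

def mspe(y, y_pred):
    """Mean squared percentage error; assumes every label is non-zero."""
    return sum(((a - b) / a) ** 2 for a, b in zip(y, y_pred)) / len(y)

# Toy labels and predictions (made-up numbers).
y = [2.0, 4.0]
y_pred = [1.0, 5.0]
print(mape(y, y_pred), mspe(y, y_pred))
```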

When the label is positive and can take large values, it is useful to scale down the magnitude of errors, which can otherwise be very large. One way to do this is to resort to the Root Mean Squared Logarithmic Error (RMSLE), defined as
$$
\mathrm{RMSLE}(\mathbf{y},\tilde{\mathbf{y}}) = \sqrt{\frac{1}{I}\sum_{i=1}^{I} \log\left( \frac{1+y_i}{1 + \tilde{y}_i} \right)^{2} }
$$
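
The RMSLE can be sketched as below, using the usual definition in which the log ratio is squared before averaging. The function name is illustrative, and the sketch assumes that labels and predictions are all strictly greater than $-1$ so that the logarithms exist.

```python
import math

def rmsle(y, y_pred):
    """Root mean squared logarithmic error.

    log((1 + y) / (1 + y_pred)) is computed as log1p(y) - log1p(y_pred).
    Assumes every label and prediction is strictly greater than -1.
    """
    squared_logs = [
        (math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, y_pred)
    ]
    return math.sqrt(sum(squared_logs) / len(y))

# Toy positive labels with very different magnitudes (made-up numbers).
y = [10.0, 1000.0]
y_pred = [12.0, 800.0]
print(rmsle(y, y_pred))
```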

We now turn to a shortcoming of the MSE, which is by far the most widespread metric and objective in regression tasks. A simple decomposition yields:
$$
\mathrm{MSE}(\mathbf{y},\tilde{\mathbf{y}}) = \frac{1}{I}\sum_{i=1}^{I}(y_i^{2} + \tilde{y}_i^{2} - 2 y_i \tilde{y}_i)
$$
Since the first term is given, the model focuses on minimizing the latter two. The second term is the dispersion of model values. The third term is a cross-product. While variations in $\tilde{y}_i$ do matter, the third term is by far the most important, especially in the cross-section. It is more valuable to reduce the MSE by increasing $y_i \tilde{y}_i$. This product is indeed positive when the two terms have the same sign, which is exactly what an investor is looking for: correct directions for the bets. For some algorithms (like neural networks), it is possible to manually specify custom losses. Maximizing the sum of $y_i \tilde{y}_i$ may be a good alternative to vanilla quadratic optimization.
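
The three-term decomposition can be checked numerically with a short sketch. The names `mse_terms` and `cross_product_score` are hypothetical, and the score is only one possible way to express the cross-product objective mentioned above.

```python
def mse_terms(y, y_pred):
    """Return the three averaged terms whose sum equals the MSE."""
    n = len(y)
    label_term = sum(a * a for a in y) / n            # fixed by the data
    prediction_term = sum(b * b for b in y_pred) / n  # dispersion of predictions
    cross_term = -2.0 * sum(a * b for a, b in zip(y, y_pred)) / n
    return label_term, prediction_term, cross_term

def cross_product_score(y, y_pred):
    """Average of y_i * y_pred_i; same-sign pairs contribute positively."""
    return sum(a * b for a, b in zip(y, y_pred)) / len(y)

# Toy returns and predictions (made-up numbers).
y = [0.02, -0.01, 0.03]
y_pred = [0.01, 0.01, 0.02]
direct_mse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)
assert abs(sum(mse_terms(y, y_pred)) - direct_mse) < 1e-12
print(mse_terms(y, y_pred), cross_product_score(y, y_pred))
```

One caveat if the cross-product is used as an objective on its own: it grows without bound when predictions are scaled up, so the predictions would presumably need to be bounded or normalized in some way.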

### <span style='color:#AA9AC2'> Classification analysis  </span> 

In binary classification, it is convenient to think in terms of true versus false. In an investment setting, true can be related to a positive return, or to a return above that of a benchmark, with false being the opposite.

We have the following types of outcomes:
- frequency of true positives: $\mathrm{TP} = I^{-1}\sum_{i=1}^{I} 1_{\{ y_i = \tilde{y}_i = 1 \}}$
- frequency of true negatives: $\mathrm{TN} = I^{-1}\sum_{i=1}^{I} 1_{\{ y_i = \tilde{y}_i = 0 \}}$
- frequency of false positives: $\mathrm{FP} = I^{-1}\sum_{i=1}^{I} 1_{\{ y_i = 0,  \tilde{y}_i = 1 \}}$
- frequency of false negatives: $\mathrm{FN} = I^{-1}\sum_{i=1}^{I} 1_{\{ y_i = 1, \tilde{y}_i = 0 \}}$

These four frequencies can be arranged in a confusion matrix. In an investment setting, it reads:

| Prediction / Label | Positive | Negative |
|--------------------|----------|----------|
| **Positive**       |    TP    |   FP     |
| **Negative**       |    FN    |   TN     |

The rows pertain to predictions and the columns to realized labels.
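
The four frequencies can be computed directly from 0/1 labels and predictions, as in the sketch below. The function name and the dictionary layout are illustrative assumptions.

```python
def confusion_frequencies(y, y_pred):
    """Frequencies of TP, TN, FP and FN for binary (0/1) labels."""
    n = len(y)
    pairs = list(zip(y, y_pred))
    return {
        "TP": sum(1 for a, b in pairs if a == 1 and b == 1) / n,
        "TN": sum(1 for a, b in pairs if a == 0 and b == 0) / n,
        "FP": sum(1 for a, b in pairs if a == 0 and b == 1) / n,
        "FN": sum(1 for a, b in pairs if a == 1 and b == 0) / n,
    }

# Toy example: 1 = asset outperformed, 0 = it did not (made-up data).
y = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
freq = confusion_frequencies(y, y_pred)
print(freq)
```

Since every instance falls into exactly one of the four cells, the frequencies sum to one.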

Among the two types of errors, type I (false positives) is the most daunting for investors because it has a direct effect on the portfolio. The type II error (false negatives) is simply a missed opportunity and is somewhat less impactful. Finally, true negatives are those assets which are correctly excluded from the portfolio.

We can create the following metrics from this baseline:
- **Accuracy**: $\mathrm{TP} + \mathrm{TN}$ is the percentage of correct forecasts;
- **Recall**: $\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$ measures the ability to detect a winning strategy/asset (left column analysis). Also known as sensitivity or true positive rate (TPR);
- **Precision**: $\frac{\mathrm{TP}}{\mathrm{FP} + \mathrm{TP}}$ is the probability that a predicted positive is indeed a good investment (top row analysis);
- **Specificity**: $\frac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}$ measures the proportion of actual negatives that are correctly identified as such;
- **Fallout**: $\frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = 1-\text{Specificity}$ is the probability of a false alarm (or false positive rate), i.e., the frequency at which the algorithm falsely detects performing assets;
- **F-score**: $\mathbf{F}_1 = 2 \times \frac{\text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$ is the harmonic average of recall and precision.
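
The metrics listed above can be sketched as one function of the four frequencies. The function name and the returned dictionary are illustrative, and the sketch assumes that none of the denominators is zero (for instance, at least one positive prediction and one positive label).

```python
def classification_metrics(tp, tn, fp, fn):
    """Metrics derived from the frequencies (or counts) TP, TN, FP, FN."""
    recall = tp / (tp + fn)          # true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "fallout": fp / (fp + tn),   # equals 1 - specificity
        "f1": 2.0 * recall * precision / (recall + precision),
    }

# Toy frequencies that sum to one (made-up numbers).
metrics = classification_metrics(tp=0.4, tn=0.2, fp=0.1, fn=0.3)
print(metrics)
```

Dividing by the total in the accuracy line leaves the value unchanged when frequencies are passed, and makes the function usable with raw counts as well.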