# Regression metrics and optimization

#### Metrics optimization : our plan
**1. Regression**
  * **MSE, (R)MSE, R-squared**
  * **MAE**
  * **(R)MSPE, MAPE**
  * **(R)MSLE**
2. Classification
  * Accuracy
  * Logloss
  * AUC
  * Cohen's Kappa

## RMSE, MSE, R-squared

<br>
<center>How do you optimize them?</center>
<center>Just fit the right model!</center>

$$MSE = \frac{1}{N}\sum^N_{i=1}(y_i - \hat{y_i})^2$$

$$RMSE = \sqrt{MSE}$$

$$R^2 = 1 - \frac{MSE}{\frac{1}{N}\sum^N_{i=1}(y_i - \bar{y})^2}, \qquad \bar{y} = \frac{1}{N}\sum^N_{i=1}y_i$$
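
A minimal NumPy sketch of the three metrics, assuming `y_true` and `y_pred` are equal-length arrays. The function names are illustrative and not taken from any library. $R^2$ is computed here as one minus the MSE divided by the MSE of the constant mean prediction.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: same minimizer as MSE, original units."""
    return np.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """R-squared: 1 - MSE / (MSE of always predicting the mean target)."""
    y_true = np.asarray(y_true, dtype=float)
    baseline = np.mean((y_true - y_true.mean()) ** 2)
    return 1.0 - mse(y_true, y_pred) / baseline

# Toy data (made up for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])
print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

For a fixed dataset, RMSE and $R^2$ are monotonic transformations of MSE, so a model that minimizes MSE also optimizes the other two.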

### Models supporting MSE optimization

#### Tree-based
> XGBoost, LightGBM<br>
> sklearn.RandomForestRegressor (can split based on MSE)

#### Linear Models
> sklearn.<>Regression<br>
> sklearn.SGDRegressor (uses gradient descent to train // very versatile)<br>
> Vowpal Wabbit (squared loss)

#### Neural nets
> PyTorch, Keras, TensorFlow, etc.

Synonyms: L2 loss (Read the docs!)

## MAE

$$MAE = \frac{1}{N}\sum^N_{i=1}|y_i - \hat{y_i}|$$

How to optimize? --- Again, just run the right model!



### Models supporting MAE optimization

#### Tree-based
* XGBoost cannot optimize MAE because the second derivative of MAE is zero, while LightGBM can
  * So you can still use gradient boosted decision trees for this metric
* An MAE criterion was implemented for RandomForestRegressor from sklearn
  * But note that the running time will be quite high compared with the MSE criterion.

> LightGBM<br>
> sklearn.RandomForestRegressor (can split based on MAE)

#### Linear Models
* Unfortunately, linear models from sklearn, including SGDRegressor, cannot optimize MAE natively
  * BUT there is a loss function called **`Huber Loss`** implemented in some of the models
  * Basically it's very similar to MAE, especially when the errors are large
> Vowpal Wabbit (different name: `quantile loss`)

#### Neural nets
* As we discussed, MAE is not differentiable only when the predictions are equal to the target (= rare case)
* That is why we can train any neural net to optimize MAE directly
> PyTorch, Keras, TensorFlow, etc.

Synonyms: L1, Median Regression (Read the docs!)

### MAE : optimal constant

![mae-optimal-const](../img/mae-optimal-const.png)
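
A small numerical check of the "median regression" synonym, as a sketch on made-up data: scanning a grid of constant predictions, the constant with the lowest MAE lands on the median of the targets, while the lowest MSE lands on the mean. The grid and the target values are assumptions chosen only for illustration.

```python
import numpy as np

# Made-up targets with one large outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions on a grid with step 0.01
candidates = np.linspace(0.0, 100.0, 10001)

# Error of every candidate constant against every target
diff = y[None, :] - candidates[:, None]
mae_per_const = np.abs(diff).mean(axis=1)
mse_per_const = (diff ** 2).mean(axis=1)

best_mae_const = candidates[mae_per_const.argmin()]
best_mse_const = candidates[mse_per_const.argmin()]

print("best constant for MAE:", best_mae_const, "median:", np.median(y))
print("best constant for MSE:", best_mse_const, "mean:", y.mean())
```

The outlier pulls the MSE-optimal constant far away from most of the data, while the MAE-optimal constant stays with the bulk of the targets.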

You can actually make up your own smooth function whose plot looks like the MAE error.
* the most famous one is `Huber loss` - it's basically a mix between MSE and MAE.
  * MSE is computed when the error is small, so we can safely approach zero error
  * MAE is computed for large errors, giving robustness
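
The mix described above can be sketched as follows. This uses the common textbook form of Huber loss with a threshold `delta` (an assumed parameter name and default; libraries may scale or name it differently): quadratic for errors up to `delta`, linear beyond it.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: MSE-like for small errors, MAE-like for large ones."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_err = np.abs(err)
    quadratic = 0.5 * err ** 2                  # used when |error| <= delta
    linear = delta * (abs_err - 0.5 * delta)    # used when |error| >  delta
    return np.mean(np.where(abs_err <= delta, quadratic, linear))

# Small error falls in the quadratic zone, large error in the linear zone
print(huber_loss([0.0], [0.5]))
print(huber_loss([0.0], [3.0]))
```

The two branches meet at `|error| == delta` with the same value and slope, which is what makes the function smooth where plain MAE has a kink.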

## MSPE and MAPE

$$MSPE = \frac{100\%}{N}\sum^N_{i=1}\left(\frac{y_i-\hat{y_i}}{y_i}\right)^2$$

$$MAPE = \frac{100\%}{N}\sum^N_{i=1}\left|\frac{y_i-\hat{y_i}}{y_i}\right|$$
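
A minimal NumPy sketch of both metrics, assuming all targets are nonzero; the function names are illustrative. The toy example shows that the same absolute error of 1 costs ten times more on a target of 10 than on a target of 100.

```python
import numpy as np

def mspe(y_true, y_pred):
    """Mean squared percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(((y_true - y_pred) / y_true) ** 2)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Same absolute error (1.0) on a small and on a large target
y_true = np.array([10.0, 100.0])
y_pred = np.array([11.0, 101.0])
relative_errors = np.abs(y_true - y_pred) / y_true   # 0.1 and 0.01
print(relative_errors, mape(y_true, y_pred), mspe(y_true, y_pred))
```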

<br>
<center>How do you optimize them?</center>

* It's much harder to find a model that can optimize them out of the box
  * We can always implement a custom loss for either XGBoost or a neural net
  * Or we can optimize a different metric and do early stopping
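
One way to read the custom-loss option: second-order gradient boosting libraries typically ask a custom objective for the per-object gradient and hessian of the loss with respect to the prediction. The sketch below derives them for the per-object MSPE term $((y_i-\hat{y_i})/y_i)^2$. The function name `mspe_grad_hess` is hypothetical, and the exact callback signature a given library expects is not shown here and should be taken from its documentation.

```python
import numpy as np

def mspe_grad_hess(y_true, y_pred):
    """Per-object derivatives of ((y - p) / y)**2 with respect to p."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    grad = 2.0 * (y_pred - y_true) / y_true ** 2   # first derivative
    hess = 2.0 / y_true ** 2                       # second derivative (constant in p)
    return grad, hess

# Made-up targets and predictions
y_true = np.array([2.0, 5.0, 10.0])
y_pred = np.array([1.5, 6.0, 9.0])
grad, hess = mspe_grad_hess(y_true, y_pred)
print(grad, hess)
```

Unlike MAE, the hessian here is strictly positive for every object, so second-order methods have no trouble with it; it just shrinks quickly as the target grows.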
  
#### But there are several approaches that ...

## MSPE (MAPE) as weighted MSE (MAE)

This approach is based on the fact that MSPE is a weighted version of MSE and MAPE is a weighted version of MAE.


![mspe-as-weighted-mse](../img/mspe-as-weighted-mse.png)

* The common denominator just ensures that the weights sum to 1, but this is not required
* Intuitively, the sample weights indicate how important each object is to us while training the model
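
A sketch of the weights, assuming the usual form $w_i \propto 1/y_i^2$ for MSPE and $w_i \propto 1/|y_i|$ for MAPE, normalized to sum to 1; the function names are illustrative. The check at the end confirms that the weighted MSE differs from MSPE only by a constant factor that does not depend on the predictions.

```python
import numpy as np

def mspe_weights(y):
    """Sample weights that turn MSE into (a rescaled) MSPE."""
    inv = 1.0 / np.asarray(y, dtype=float) ** 2
    return inv / inv.sum()

def mape_weights(y):
    """Sample weights that turn MAE into (a rescaled) MAPE."""
    inv = 1.0 / np.abs(np.asarray(y, dtype=float))
    return inv / inv.sum()

# Made-up targets and predictions
y_true = np.array([1.0, 10.0, 100.0])
y_pred = np.array([2.0, 11.0, 101.0])

w = mspe_weights(y_true)
weighted_mse = np.sum(w * (y_true - y_pred) ** 2)
mspe_value = 100.0 * np.mean(((y_true - y_pred) / y_true) ** 2)

# Constant linking the two: depends only on the targets
scale = 100.0 / len(y_true) * np.sum(1.0 / y_true ** 2)
print(w, weighted_mse * scale, mspe_value)
```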

#### The smaller the target, the more important the object
#### How do we use this knowledge?

## MSPE (MAPE)

### Use weights for samples (`sample_weights`)
- And use MSE (MAE)
  - The model will actually optimize the desired MSPE loss!
- *Not every library accepts sample weights*
  * XGBoost, LightGBM accept
  * Neural nets
    - Easy to implement if not supported
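
To illustrate the effect of sample weights without depending on a particular library, the sketch below uses a plain NumPy weighted least-squares fit as a stand-in for "a model trained with MSE and `sample_weights`". The helper names, the synthetic data, and the choice of a linear model are all assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y, sample_weight=None):
    """Least-squares fit of y ~ X @ coef + intercept, optionally weighted."""
    A = np.column_stack([X, np.ones(len(X))])
    if sample_weight is None:
        sample_weight = np.ones(len(y))
    sw = np.sqrt(sample_weight)  # scaling rows by sqrt(w) weights squared errors by w
    params, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return params

def predict_linear(params, X):
    return np.column_stack([X, np.ones(len(X))]) @ params

def mspe(y_true, y_pred):
    return 100.0 * np.mean(((y_true - y_pred) / y_true) ** 2)

# Synthetic data: a curved, strictly positive target fitted by a straight line
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 1))
y = X[:, 0] ** 2 + 1.0 + rng.normal(0.0, 0.2, size=200)

weights = 1.0 / y ** 2  # MSPE weights; normalization is optional

plain = predict_linear(fit_linear(X, y), X)
weighted = predict_linear(fit_linear(X, y, sample_weight=weights), X)

mspe_plain = mspe(y, plain)
mspe_weighted = mspe(y, weighted)
print(mspe_plain, mspe_weighted)
```

Because the weighted MSE is proportional to MSPE, the weighted fit is the linear model with the lowest train MSPE, so it cannot do worse on this metric than the unweighted one.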
    
#### But there is another method which works whenever a library can optimize MSE/MAE

### Resample the train set
- `df.sample(weights=sample_weights)`
- And use *any* model that optimizes MSE (MAE) 
- Fit the model on the resampled data
- **Usually you need to resample several times and fit a model each time**
  - Then average the models' predictions; the score will be much better and more stable

It is important to **set the probability of each object being sampled to the weight we've calculated**. The size of the new data set is up to you.
* For example, you can sample twice as many objects as there were in the original train set.
* Note that we do not need to do anything with the test set
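
The resampling recipe can be sketched with NumPy alone, using `rng.choice(..., p=probs)` as a counterpart to `df.sample(weights=...)`. The synthetic data, the unweighted `np.polyfit` line standing in for "any model that optimizes MSE", and the number of rounds are assumptions made for the example.

```python
import numpy as np

def mspe(y_true, y_pred):
    return 100.0 * np.mean(((y_true - y_pred) / y_true) ** 2)

# Synthetic train set with strictly positive targets
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = x ** 2 + 1.0 + rng.normal(0.0, 0.2, size=n)

# Sampling probabilities = normalized MSPE weights
probs = 1.0 / y ** 2
probs /= probs.sum()

n_rounds = 20
preds, resampled_means = [], []
for _ in range(n_rounds):
    # Draw twice as many objects as in the original train set, with replacement
    idx = rng.choice(n, size=2 * n, replace=True, p=probs)
    coefs = np.polyfit(x[idx], y[idx], deg=1)   # plain (unweighted) MSE fit
    preds.append(np.polyval(coefs, x))
    resampled_means.append(y[idx].mean())

avg_pred = np.mean(preds, axis=0)               # average the models' predictions
plain_pred = np.polyval(np.polyfit(x, y, deg=1), x)

print(mspe(y, plain_pred), mspe(y, avg_pred))
```

Small targets are drawn far more often, so each unweighted fit behaves roughly like a weighted one; averaging over rounds smooths out the sampling noise.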

#### There is another way we can optimize MSPE
* If the errors are small, we can optimize the predictions in logarithmic scale, in the same way as for RMSLE. For details, see the reading materials.

## RMSLE

![rmsle](../img/rmsle-opt.png)

Quite easy to optimize because of the connection with MSE loss. All we need to do is ...

1. Apply a transform to the target variable in the train set
  * $z_i = \log(y_{i}+1)$
2. Fit a model with MSE loss to the transformed target
3. To get a prediction for a test object, first obtain the prediction $\hat{z}$ in the logarithmic scale just by calling `model.predict()` or something like that
4. Do an inverse transform from the logarithmic scale back to the original one by exponentiating $\hat{z}$ and subtracting one
  * $\hat{y_i} = \exp(\hat{z_i}) - 1$
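
The four steps can be sketched end to end with NumPy. The synthetic data and the straight-line `np.polyfit` fit (an ordinary least-squares, i.e. MSE, model) are assumptions standing in for whatever MSE model is actually used; `np.log1p` and `np.expm1` compute $\log(1+y)$ and $\exp(z)-1$.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log(1 + y) values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Synthetic non-negative targets with multiplicative noise
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, size=300)
x_test = rng.uniform(0.0, 10.0, size=100)
y_train = np.exp(0.3 * x_train) * rng.lognormal(0.0, 0.1, size=300)
y_test = np.exp(0.3 * x_test) * rng.lognormal(0.0, 0.1, size=100)

# 1. Transform the train target to the logarithmic scale
z_train = np.log1p(y_train)

# 2. Fit a model with MSE loss to the transformed target
coefs = np.polyfit(x_train, z_train, deg=1)

# 3. Predict in the logarithmic scale
z_hat = np.polyval(coefs, x_test)

# 4. Invert the transform to get predictions in the original scale
y_hat = np.expm1(z_hat)

print(rmsle(y_test, y_hat))
```

The RMSLE of `y_hat` equals the plain RMSE between `z_hat` and `log1p(y_test)`, which is why fitting with MSE in the transformed space optimizes the metric directly.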