# Comparing Two Models

1. Actual vs Predicted
2. Simple Quantile Plots
3. Double Lift Charts
4. Loss Ratio Charts
5. Gini Index

## Actual vs Predicted Plots

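- Plot actual values $y_i$ on the y-axis against predicted values $\mu_i$ on the x-axis for each model; the closer the points lie to the line $y = x$, the better the fit.<br><br>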

<center><img src ='images/Act_vs_Pred.JPG'></center>

#### Important Considerations

- Create these plots on holdout data.
- Aggregate the data before plotting:
    - Sort by predicted target variable.
    - Group into 100 buckets of equal aggregate model weight.
    - Calculate the average actual and predicted target variable values in each bucket.
- Plot on a log scale, which matters most when the target variable is highly skewed.
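
The aggregation steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a prescribed implementation: the function name `aggregate_buckets`, its arguments, and the rule that assigns a record to a bucket by the midpoint of its cumulative weight are illustrative choices, and `actual` and `predicted` are assumed to be per-unit-of-weight values (e.g. pure premiums).

```python
def aggregate_buckets(actual, predicted, weight, n_buckets=100):
    """Return one (avg_predicted, avg_actual) pair per bucket.

    Hypothetical helper: records are sorted by prediction and cut into
    buckets of roughly equal total weight.
    """
    rows = sorted(zip(predicted, actual, weight))  # sort by predicted value
    total_weight = sum(weight)
    # Per bucket: weighted predicted sum, weighted actual sum, weight sum.
    sums = [[0.0, 0.0, 0.0] for _ in range(n_buckets)]
    cumulative = 0.0
    for pred, act, w in rows:
        # Assumption: place each record by the midpoint of its weight span.
        midpoint = cumulative + w / 2.0
        b = min(int(midpoint / total_weight * n_buckets), n_buckets - 1)
        sums[b][0] += pred * w
        sums[b][1] += act * w
        sums[b][2] += w
        cumulative += w
    # Weighted averages; buckets that received no records are skipped.
    return [(p / w, a / w) for p, a, w in sums if w > 0]


# Made-up holdout data: four records, equal weights, two buckets.
points = aggregate_buckets(
    actual=[90, 230, 280, 420],
    predicted=[100, 200, 300, 400],
    weight=[1, 1, 1, 1],
    n_buckets=2,
)
print(points)  # [(150.0, 160.0), (350.0, 350.0)]
```

Each pair is one point on the chart. Taking logs of both coordinates, or using log-scaled axes in the plotting tool, gives the log-scale view.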

## Simple Quantile Plots

- Visual representation of model's ability to accurately differentiate between the best and the worst risks.<br><br>

<b>Steps</b><br>
- Sort dataset based on predicted target variable (smallest to largest).
- Bucket the data into quantiles (quintiles, deciles, etc.) with equal volume of exposures.
- Calculate average actual and predicted target variable values in each quantile.
    - Both can be divided by the overall average predicted value for ease of interpretation.
- Plot the two values for each quantile.


<center><img src = 'images/Quantile_Plots.JPG'></center>

#### Criteria for "winning" model

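1. Predictive accuracy: predicted values track actual values in each quantile.
2. Monotonicity: actual values increase from one quantile to the next.
3. Vertical distance between the first and last quantiles: a larger spread in actual values means the model better separates the best and worst risks.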
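
The three criteria can be turned into simple numbers once the per-quantile averages are available. The sketch below is illustrative only: `quantile_plot_summary` is a hypothetical helper, the choice of mean absolute error as the accuracy measure is an assumption, and the quintile values are made up.

```python
def quantile_plot_summary(avg_actual, avg_predicted):
    """Summarize one model's quantile plot (hypothetical helper).

    Inputs are per-quantile averages, ordered from the lowest to the
    highest predicted quantile.
    """
    n = len(avg_actual)
    # Accuracy: average gap between actual and predicted across quantiles.
    mean_abs_error = sum(abs(a - p) for a, p in zip(avg_actual, avg_predicted)) / n
    # Monotonicity: count quantiles where actual falls below the previous one.
    reversals = sum(1 for prev, cur in zip(avg_actual, avg_actual[1:]) if cur < prev)
    # Vertical distance between the last and first quantiles of actual.
    lift = avg_actual[-1] - avg_actual[0]
    return {"mean_abs_error": mean_abs_error, "reversals": reversals, "lift": lift}


# Made-up quintile relativities for two competing models.
model_a = quantile_plot_summary([0.5, 0.8, 1.0, 1.2, 1.5], [0.55, 0.75, 1.0, 1.25, 1.45])
model_b = quantile_plot_summary([0.8, 0.7, 1.0, 1.1, 1.4], [0.6, 0.8, 1.0, 1.2, 1.4])
print(model_a)
print(model_b)
```

With these made-up numbers, model A has the smaller error, no reversals, and the larger lift, so it would win on all three criteria.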

## Double Lift Charts

- Unlike other plots, this directly compares two models.

<b>Steps</b><br>
- Calculate $\color{blue}{\text{Sort Ratio} = \frac{\text{Model A Predicted Loss Cost}}{\text{Model B Predicted Loss Cost}}}$<br>
- Sort dataset based on Sort Ratio (smallest to largest).
- Bucket data into quantiles.
- Calculate avg. pure premium in each bucket for Actual, Model A and Model B.
- Divide each bucket average by the corresponding overall avg. (Actual, Model A and Model B).
- Plot the results.

<center><img src='images/Double_Lift.JPG'></center><br><br>

- The winning model is the one whose relativities most closely match the actual pure premium relativities in each quantile.

- Can also create double lift chart by plotting percent error from the actual value.<br><br>

$$ \frac{\text{Predicted Loss Cost}}{\text{Actual Loss Cost}} - 1$$<br><br>

- Winning model would be closest to the y = 0 line.
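
The double lift steps, including the percent-error variant, can be sketched as follows. This is a hedged illustration rather than a reference implementation: `double_lift` and its argument names are hypothetical, all three inputs are assumed to be pure premiums per unit of exposure, Model B's predictions are assumed to be non-zero, and buckets are cut by equal exposure using a midpoint rule chosen for simplicity.

```python
def double_lift(actual, pred_a, pred_b, exposure, n_buckets):
    """Return one dict of relativities and percent errors per bucket."""
    # Sort by Sort Ratio = Model A prediction / Model B prediction.
    rows = sorted(zip(actual, pred_a, pred_b, exposure),
                  key=lambda r: r[1] / r[2])
    total = sum(exposure)
    # Exposure-weighted sums per bucket: actual, model A, model B, exposure.
    sums = [[0.0, 0.0, 0.0, 0.0] for _ in range(n_buckets)]
    cumulative = 0.0
    for act, a, b, e in rows:
        k = min(int((cumulative + e / 2.0) / total * n_buckets), n_buckets - 1)
        for j, value in enumerate((act, a, b)):
            sums[k][j] += value * e
        sums[k][3] += e
        cumulative += e
    # Overall exposure-weighted averages for actual, model A, model B.
    overall = [sum(s[j] for s in sums) / total for j in range(3)]
    table = []
    for s in sums:
        if s[3] == 0:
            continue  # skip buckets that received no records
        avg_act, avg_a, avg_b = (s[j] / s[3] for j in range(3))
        table.append({
            "actual_rel": avg_act / overall[0],
            "model_a_rel": avg_a / overall[1],
            "model_b_rel": avg_b / overall[2],
            # Percent error variant: predicted / actual - 1
            # (assumes the bucket's actual average is non-zero).
            "model_a_pct_error": avg_a / avg_act - 1.0,
            "model_b_pct_error": avg_b / avg_act - 1.0,
        })
    return table


# Made-up data: Model A matches actual exactly, Model B predicts a flat 250.
table = double_lift(
    actual=[100, 300, 200, 400],
    pred_a=[100, 300, 200, 400],
    pred_b=[250, 250, 250, 250],
    exposure=[1, 1, 1, 1],
    n_buckets=2,
)
for row in table:
    print(row)
```

In this toy case Model A's relativities coincide with the actual relativities and its percent error is zero in both buckets, while Model B's relativity stays at 1.0, so Model A is the winning model.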

## Loss Ratio Charts

- Steps
    - Sort the dataset based on model prediction.
    - Bucket into quantiles with same volume of exposures.
    - Calculate actual loss ratio within each bucket.<br><br>
    


<center><img src='images/LR_Chart.JPG'></center><br><br>

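- If the current plan already priced every risk accurately, actual loss ratios would be roughly equal across buckets. Variation in loss ratios therefore shows that the model identifies risk differences the current plan misses.
- The greater the vertical distance between the first and last buckets, the better the model segments risk relative to the current plan.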
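
A loss ratio chart can be computed along the lines below. Treat this as a sketch under assumptions: `loss_ratio_chart` is a hypothetical helper, `prediction` is whatever model output the data is sorted by, `loss` and `premium` are total dollars per record with `premium` coming from the current plan, and the equal-exposure bucketing uses a simple midpoint rule.

```python
def loss_ratio_chart(prediction, loss, premium, exposure, n_buckets):
    """Return the actual loss ratio in each bucket, lowest prediction first."""
    rows = sorted(zip(prediction, loss, premium, exposure))  # sort by prediction
    total = sum(exposure)
    sums = [[0.0, 0.0] for _ in range(n_buckets)]  # loss, premium per bucket
    cumulative = 0.0
    for _, l, p, e in rows:
        k = min(int((cumulative + e / 2.0) / total * n_buckets), n_buckets - 1)
        sums[k][0] += l
        sums[k][1] += p
        cumulative += e
    # Actual loss ratio = bucket losses / bucket premium.
    return [l / p for l, p in sums if p > 0]


# Made-up data: four policies with equal exposure and premium.
ratios = loss_ratio_chart(
    prediction=[0.5, 0.9, 0.6, 1.1],
    loss=[40, 100, 70, 110],
    premium=[100, 100, 100, 100],
    exposure=[1, 1, 1, 1],
    n_buckets=2,
)
print(ratios)
print(ratios[-1] - ratios[0])  # vertical distance between last and first bucket
```

Here the bucket loss ratios come out near 0.55 and 1.05, so the model separates risks that the current premium treats alike.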

## Gini Index

- Quantifies the ability of the model to differentiate between the best and worst risks.

- Steps
    - Sort dataset based on model predicted loss cost (lowest to highest).
    - On x-axis, plot cumulative percentage of exposures.
    - On y-axis, plot cumulative percentage of actual losses.
        - Cumulative exposures grow faster than cumulative losses because a small share of exposures accounts for most of the losses, so the curve lies below the line of equality.

<center><img src='images/Gini_Curve.JPG'></center><br><br>

- The resulting curve is the Lorenz curve. The Gini index is twice the area between the Lorenz curve and the line of equality.
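
The Gini calculation can be sketched with the trapezoid rule. The helper name `gini_index` and the sample data are made up for illustration; `predicted` is assumed to be the model's loss cost per exposure, `actual_loss` the total actual loss on each record, and the sort is assumed to run from the lowest to the highest prediction.

```python
def gini_index(predicted, actual_loss, exposure):
    """Twice the area between the line of equality and the Lorenz curve."""
    rows = sorted(zip(predicted, actual_loss, exposure))  # best risks first
    total_loss = sum(actual_loss)
    total_exposure = sum(exposure)
    area_under_lorenz = 0.0
    x_prev = y_prev = 0.0
    cum_exposure = cum_loss = 0.0
    for _, loss, e in rows:
        cum_exposure += e
        cum_loss += loss
        x = cum_exposure / total_exposure  # cumulative share of exposures
        y = cum_loss / total_loss          # cumulative share of actual losses
        area_under_lorenz += (x - x_prev) * (y + y_prev) / 2.0  # trapezoid
        x_prev, y_prev = x, y
    # The area under the line of equality is 0.5.
    return 2.0 * (0.5 - area_under_lorenz)


# Made-up data: losses concentrate in the records with the highest predictions.
g_model = gini_index([10, 20, 30, 40], [0, 10, 30, 60], [1, 1, 1, 1])
# Same predictions, but losses spread evenly: no differentiation.
g_flat = gini_index([10, 20, 30, 40], [25, 25, 25, 25], [1, 1, 1, 1])
print(g_model, g_flat)
```

In this toy example the first call gives a Gini index of about 0.5 and the second about 0, since evenly spread losses trace the line of equality.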

# Validation of Logistic Regression Models

- Actual vs Predicted
- Lorenz Curve
- ROC Curve

## ROC Curves

- Used for probability models.
- Use a <b>discrimination threshold</b> to convert predicted probabilities into a binary output.
    - Predicted probabilities higher than the threshold signal action.
    - A lower threshold results in more true positives and fewer false negatives.
        - Cost: more false positives and fewer true negatives.<br><br>


$$
\begin{array}{c|cc}
 & \text{Predicted}  & \text{Predicted}\\
\text{Actual} & \text{Fraud} & \text{No Fraud}\\
\hline
\text{Fraud} &  \text{True Positive} & \text{False Negative}\\
\text{No Fraud} &  \text{False Positive} & \text{True Negative}
\end{array}
$$

<center><img src='images/Confusion_Matrix.JPG'></center>

- <b>Sensitivity / true positive rate / hit rate:</b> $\frac{\text{True positives}}{\text{Total positives}} = \frac{39}{109}$<br><br>
- <b>Specificity:</b> $\frac{\text{True negatives}}{\text{Total negatives}} = \frac{673}{704}$<br><br>

- A random model yields true positives and false positives in the same proportion as the overall mix of positives and negatives in the data, regardless of the threshold.
    - Its ROC curve therefore follows the line of equality.
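
The threshold, the confusion-matrix counts, and the two rates can be illustrated with a few lines of Python. The helper names and the six-record dataset are hypothetical; the final two lines simply redo the arithmetic for the fractions quoted above, with the false negative and false positive counts inferred as 109 - 39 and 704 - 673.

```python
def confusion_counts(y_true, prob, threshold):
    """Count (tp, fn, fp, tn) for 0/1 labels; hypothetical helper."""
    tp = fn = fp = tn = 0
    for y, p in zip(y_true, prob):
        flagged = p > threshold  # probabilities above the threshold signal action
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    return tp / (tp + fn)  # true positives / total positives


def specificity(tn, fp):
    return tn / (tn + fp)  # true negatives / total negatives


# Made-up labels (1 = fraud) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.1]
print(confusion_counts(y_true, prob, 0.5))  # (2, 1, 1, 2)
print(confusion_counts(y_true, prob, 0.2))  # (3, 0, 2, 1)

# Arithmetic for the fractions quoted above.
print(round(sensitivity(39, 109 - 39), 3))    # 0.358
print(round(specificity(673, 704 - 673), 3))  # 0.956
```

Lowering the threshold from 0.5 to 0.2 raises true positives from 2 to 3 and removes the false negative, at the cost of one more false positive and one fewer true negative.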

#### Plotting

- On x-axis, false positive rate = 1 - specificity.
- On y-axis, true positive rate.<br><br>

<center><img src = 'images/ROC.JPG'></center>

- An ROC curve above the line of equality is desired.
- A model with no predictive power has an AUROC of 0.5.
- A model with maximum predictive power has an AUROC of 1.<br><br>

$$\begin{align} \text{AUROC} & = \frac{1}{2} \cdot \text{Normalized Gini} + \frac{1}{2} \\ \\
\text{Normalized Gini} & = \frac{\text{Gini index}}{\text{Gini index of "perfect" model}} \end{align}$$
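
The relationship between AUROC and the normalized Gini can be checked numerically on a small example. This is a sketch under assumptions: the helper names are hypothetical, the Lorenz curve is built on record counts with positives playing the role of losses, the "perfect" model is taken to be one that ranks every positive above every negative, and the scores are assumed to have no ties.

```python
def roc_points(y_true, scores):
    """ROC points (false positive rate, true positive rate), one per threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Flag every record scoring at or above the threshold t.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auroc(y_true, scores):
    """Area under the ROC curve by the trapezoid rule."""
    pts = roc_points(y_true, scores)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def gini_index(y_true, scores):
    """Twice the area between the line of equality and the Lorenz curve."""
    n = len(y_true)
    total_pos = sum(y_true)
    area = 0.0
    cum_pos = 0
    y_prev = 0.0
    for _, y in sorted(zip(scores, y_true)):  # lowest score first
        cum_pos += y
        y_cur = cum_pos / total_pos
        area += (y_prev + y_cur) / 2.0 / n  # each record adds 1/n on the x-axis
        y_prev = y_cur
    return 2.0 * (0.5 - area)


def normalized_gini(y_true, scores):
    # Scoring each record by its own label gives the "perfect" ordering.
    return gini_index(y_true, scores) / gini_index(y_true, y_true)


# Made-up labels and scores with no ties.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
area = auroc(y_true, scores)
norm_gini = normalized_gini(y_true, scores)
print(area, 0.5 * norm_gini + 0.5)
```

Both printed values agree (8/9 up to floating-point rounding) for this example, consistent with the formula above.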

## Coverage Options and GLM

- Coverage options (e.g. deductibles) are best analyzed outside of the GLM, because GLM-indicated factors for them can be distorted.
    - There could be correlation with variables not included in the model.
    - There could be a selection effect.
        - e.g. UW forcing high-risk insureds to have higher deductibles.<br><br>

- Charging rates based on anything other than pure loss elimination could lead to changes in insured behavior.
        

## Territory Modeling

- Territory should be included in the classification model as an offset.
- Likewise, the territory model should include the classification plan as an offset.

## Ensembling

- Take a straight average of the model predictions.
- The model errors should be as uncorrelated as possible (e.g. models built by independent teams).
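
A toy example shows why uncorrelated errors matter. Everything here is made up for illustration: the helper names, the two sets of predictions, and the use of mean squared error as the yardstick are assumptions, and the error patterns were chosen so that the two models' errors have zero correlation.

```python
def ensemble_average(*model_predictions):
    """Straight average of the predictions from each model, record by record."""
    return [sum(preds) / len(preds) for preds in zip(*model_predictions)]


def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 10, 10, 10]
model_a = [12, 8, 12, 8]   # errors: +2, -2, +2, -2
model_b = [12, 12, 8, 8]   # errors: +2, +2, -2, -2 (uncorrelated with A's)
combined = ensemble_average(model_a, model_b)

print(combined)                # [12.0, 10.0, 10.0, 8.0]
print(mse(actual, model_a))    # 4.0
print(mse(actual, model_b))    # 4.0
print(mse(actual, combined))   # 2.0
```

Where the two models err in opposite directions the errors cancel, so the average has half the mean squared error of either model alone. If the errors were perfectly correlated, averaging would bring no improvement.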