## Components of GLM<br><br>

- <b>Systematic component:</b> The portion of the outcome that is explained by predictors in the model.
<br><br>
- <b>Random component:</b> The portion of the outcome driven by causes that are not accounted for in the model. Includes "pure randomness".

#### Random Component

- The biggest assumption made about each risk ($y_i$) is that it belongs to some exponential family of distributions:<br><br>

$$y_i \sim Exponential(\mu_i, \phi)$$<br><br>

- $\mu_i$ is the mean of the record's distribution.<br>
- $\phi$ is the dispersion parameter. It is the same for all records.

$$ Var(y_i) = \frac{\phi \cdot V(\mu_i)}{\omega_i}$$
<br><br>
- $V(\mu_i)$ is the variance function for the selected distribution.
<br><br>
- The mean of every exponential family distribution is $\mu$.
<br><br>
- $\omega_i$ is the weight assigned to observation $i$. Twice the weight $\rightarrow$ half the variance.

$$
\begin{array}{c|c}
\text{Distribution} & \text{Variance Function [V($\mu$)]}\\
\hline
\text{Normal} & 1 \\
\text{Poisson} & \mu \\ 
\text{Gamma} & \mu^2 \\
\text{Inverse Gaussian} & \mu^3 \\
\text{Negative Binomial} & \mu(1+ \kappa \mu) \\
\text{Binomial} & \mu(1-\mu) \\
\text{Tweedie} & \mu^p
\end{array}
$$
<br><br>
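- For the distributions typically used for insurance losses (Poisson, Gamma, Inverse Gaussian, Tweedie), variance is an increasing function of $\mu$, which makes sense since higher expected losses have higher variance in losses.
- The $\phi$ parameter for negative binomial is restricted to 1.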
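
As a minimal sketch of the variance relationship above, the following Python snippet evaluates $Var(y_i) = \phi \cdot V(\mu_i) / \omega_i$ for a few of the tabulated variance functions. The helper name `glm_variance` and the example numbers are illustrative assumptions, not part of the original notes.

```python
# Variance functions V(mu) from the table above, keyed by distribution name.
VARIANCE_FUNCTIONS = {
    "normal": lambda mu: 1.0,
    "poisson": lambda mu: mu,
    "gamma": lambda mu: mu ** 2,
    "inverse_gaussian": lambda mu: mu ** 3,
    "binomial": lambda mu: mu * (1.0 - mu),
}


def glm_variance(distribution, mu, phi=1.0, weight=1.0):
    """Return Var(y) = phi * V(mu) / weight for one record (hypothetical helper)."""
    return phi * VARIANCE_FUNCTIONS[distribution](mu) / weight


# Illustrative numbers only: a gamma record with mean 1,000 and dispersion 0.5.
var_w1 = glm_variance("gamma", mu=1000.0, phi=0.5, weight=1.0)
var_w2 = glm_variance("gamma", mu=1000.0, phi=0.5, weight=2.0)
print(var_w1, var_w2)  # doubling the weight halves the variance
```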

#### Systematic Component
<br>
$$g(\mu_i) = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + \text{offset}$$

- g(.) is the link function.
    - Provides flexibility in relating the model prediction ($\mu_i$) to predictors.
    - The log link function allows for a multiplicative rating structure.
    - For binary target variables, the logit link function is used.
- RHS of the equation is called the <b>linear predictor</b>.

#### Log Link
<br>
$$ln(\mu_i) = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + \text{ln(offset)}$$
<br>
$$\Downarrow$$
<br>
$$\mu_i = e^{\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip}}\cdot \text{offset}$$
<br>
$$\Downarrow$$
<br>
$$\mu_i = e^{\beta_0} \cdot e^{\beta_1 x_{i1}} \cdot e^{\beta_2 x_{i2}} \cdot \ldots  \cdot e^{\beta_p x_{ip}} \cdot \text{offset}$$
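
The equivalence between the additive log-scale form and the multiplicative rating structure can be checked numerically. The coefficients, predictor values, and offset below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical fitted coefficients and one record's predictor values.
beta_0 = -2.0
betas = [0.3, -0.1]
x = [1.0, 2.0]
offset = 0.5  # e.g. half a car-year of exposure

# Additive on the log scale: ln(mu) = beta_0 + sum(beta_j * x_j) + ln(offset)
log_mu = beta_0 + sum(b * v for b, v in zip(betas, x)) + math.log(offset)
mu_additive = math.exp(log_mu)

# Multiplicative on the original scale: base rate times one factor per predictor.
factors = [math.exp(b * v) for b, v in zip(betas, x)]
mu_multiplicative = math.exp(beta_0) * math.prod(factors) * offset

print(mu_additive, mu_multiplicative)
```

Both numbers agree up to floating-point rounding, which is why a log-link GLM maps directly onto a base rate multiplied by rating factors.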

#### Logit Link
<br>
$$\begin{align}ln \left( \frac{\mu_i}{1-\mu_i} \right) & = ln \left( \frac{\text{Prob of event i occurring}}{\text{Prob of event i not occurring}} \right) \\ \\
& = \underbrace{\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + \text{ln(offset)}}_{x_i} \end{align}$$

$$\Downarrow$$

$$\frac{\mu_i}{1-\mu_i} = e^{\beta_0} \cdot e^{\beta_1 x_{i1}} \cdot e^{\beta_2 x_{i2}} \cdot \ldots  \cdot e^{\beta_p x_{ip}} \cdot \text{offset}$$

$$\Downarrow$$

$$\mu_i = \frac{1}{1+e^{-x_i}}$$
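
A short sketch of the logit and its inverse (the logistic function) follows; the function names are chosen here for illustration. It shows that any real-valued linear predictor $x_i$ maps to a probability strictly between 0 and 1, and that the two functions undo each other.

```python
import math


def logit(mu):
    """Log-odds of a probability mu in (0, 1)."""
    return math.log(mu / (1.0 - mu))


def logistic(x):
    """Inverse of the logit: maps any real linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-3.0, 0.0, 3.0):
    mu = logistic(x)
    print(x, round(mu, 4), round(logit(mu), 4))
```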

### Variable Significance

- We need to test coefficients to determine if they are indeed significant / non-zero.
- Helpful Statistics provided by GLM software:
    - Standard Error
    - P-Value
    - Confidence Interval

#### Standard Error

- Small std error gives us confidence that the estimated coefficient is close to the true value.
- Larger datasets produce smaller std errors.
- The larger the estimate of $\phi$, the larger the std errors.
    - Since variance depends on $\phi$.
    

#### P-Value

- Derived from std error.
- Gives the probability of observing a coefficient estimate at least this far from zero by chance if the true value is zero (or any other chosen null value).
- A p-value of .05 or less is typically desired when making variable selection decisions.
    - If the p-value is less than .05, we can reject the null hypothesis that the true value of the coefficient is zero.
- For a categorical variable in a GLM, each level's p-value tests whether that level differs from the base class.

#### Confidence Interval

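- Gives a reasonable range of values for the true coefficient, given the estimate and its std error.
- Can be seen as the complement of the p-value: a 95% confidence interval that excludes zero corresponds to a p-value below .05.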
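
To make the link between std error, p-value, and confidence interval concrete, here is a minimal sketch assuming a normal (Wald) approximation. Actual GLM software may use a different test statistic, and the function name and numbers are hypothetical.

```python
import math


def wald_summary(estimate, std_error, null_value=0.0, z_crit=1.96):
    """Two-sided Wald z-test and ~95% confidence interval for one coefficient."""
    z = (estimate - null_value) / std_error
    # Two-sided p-value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return z, p_value, ci


# Hypothetical coefficients: same estimate, different standard errors.
for se in (0.10, 0.40):
    z, p, (lo, hi) = wald_summary(estimate=0.50, std_error=se)
    significant = p < 0.05
    excludes_zero = lo > 0.0 or hi < 0.0
    print(f"se={se}: z={z:.2f}, p={p:.4f}, CI=({lo:.3f}, {hi:.3f}), "
          f"significant={significant}, CI excludes 0={excludes_zero}")
```

With the smaller std error the interval excludes zero and the p-value is below .05; with the larger std error the same estimate is no longer significant.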

### Continuous Variables

- Unlike the levels of a categorical variable, the values of a continuous predictor are ordered and related to one another.
- When log link is used, it is almost always appropriate to take the natural log of a continuous predictor.
    - This allows the scale of the predictor to match the scale of the target variable.
    - The response curve doesn't need to be restricted to the exponential form.
        - Can have increasing at decreasing rate (generally seen in insurance models) and not just increasing at increasing rate.
        - Can have linear relation.<br><br>

$$ln(\mu_i) = \beta_0 + \beta_{1} \cdot ln(x_i)$$

$$\big\Downarrow$$

$$\mu_i = e^{\beta_0} \cdot x_i^{\beta_{1}}$$
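
The three response shapes described above follow from the size of $\beta_1$ in $\mu_i = e^{\beta_0} \cdot x_i^{\beta_1}$. The sketch below, with illustrative exponents and $e^{\beta_0}$ set to 1, compares the change in the prediction over two consecutive unit steps in $x$.

```python
def predicted_mean(x, beta_0_exp, beta_1):
    """mu = e^{beta_0} * x^{beta_1}, with e^{beta_0} passed in directly."""
    return beta_0_exp * x ** beta_1


# Illustrative exponents: below 1, equal to 1, and above 1.
for beta_1 in (0.5, 1.0, 2.0):
    first_step = predicted_mean(2.0, 1.0, beta_1) - predicted_mean(1.0, 1.0, beta_1)
    second_step = predicted_mean(3.0, 1.0, beta_1) - predicted_mean(2.0, 1.0, beta_1)
    print(beta_1, round(first_step, 4), round(second_step, 4))
```

An exponent below 1 gives shrinking steps (increasing at a decreasing rate), an exponent of 1 gives equal steps (linear), and an exponent above 1 gives growing steps.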

<center><img src='images/Continuous_Vars.JPG'></center>

- For negative beta estimates (logged case), the graph is flipped around y = 0 axis.

#### Example

- Assume base AOI is 100k
- Record's AOI is 200K
- The coefficient estimate is .62
- Then indicated relativity for 200k AOI:

$$\text{Relativity =} \frac{\text{200,000}^{.62}}{\text{100,000}^{.62}} = 2^{.62}= 1.54$$
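
The same calculation as a small Python sketch (the function name is chosen here for illustration):

```python
def aoi_relativity(aoi, base_aoi, coefficient):
    """Relativity of a logged continuous predictor: (aoi / base_aoi) ** coefficient."""
    return (aoi / base_aoi) ** coefficient


relativity = aoi_relativity(aoi=200_000, base_aoi=100_000, coefficient=0.62)
print(round(relativity, 2))  # 1.54
```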

#### Cases where it may be desirable to include continuous variable in unlogged form

- When x is a temporal variable meant to pick up trend.
    - Trend is often modeled as an exponential function.<br><br>

- If the variable can take the value 0.
    - ln(0) is undefined.
    - Need to adjust before logging.<br><br>

- Unlogged form should rarely be used for continuous variables.

<table><tr><td><img src='images/GLM_Eval_1.JPG'></td><td><img src='images/GLM_Eval_2.JPG'></td></tr></table>

- The high p-value for van means that we are not confident that this class is different from the base class "sedan".
- The error bar for "van" crosses the y = 0 line, indicating that the parameter estimate is not significant at the 95% confidence level.
- When we set "van" as the base class, size of error bars and p-values increase.
    - The predicted coefficients are still equivalent.
        - Can subtract coefficient of "sedan" to get the same coefficients as above.<br><br>
        
<center><img src='images/GLM_Eval_3.JPG'></center>

- Use Base Class with the most data!
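
Changing the base class only shifts every level's coefficient by a constant (the intercept absorbs the opposite shift), so predictions are unchanged. A minimal sketch with hypothetical coefficients, not the values from the figures above:

```python
def rebase(coefficients, new_base):
    """Re-express level coefficients relative to a different base level."""
    shift = coefficients[new_base]
    return {level: value - shift for level, value in coefficients.items()}


# Hypothetical log-link coefficients with "sedan" as the base class (coefficient 0).
sedan_base = {"sedan": 0.0, "van": 0.05, "truck": 0.30}
van_base = rebase(sedan_base, "van")
print(van_base)

# Differences between levels, and hence predictions, are unchanged.
print(sedan_base["truck"] - sedan_base["van"], van_base["truck"] - van_base["van"])
```

What does change with the base class is the std error of each comparison, which is why the base class with the most data gives the tightest error bars.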

A GLM with only one categorical variable will, regardless of the distribution and link function, predict the (weighted) average observed value of each level.

## Offsets

- Used when the effect of some variables is fixed in advance rather than estimated by the model (e.g., territory relativities, or deductible factors, which are usually based on loss elimination techniques).
- The coefficient is constrained to be 1, so with a log link the offset enters the prediction as a multiplicative factor raised to the power of 1.
- When using log link, the offset variable is also logged.

- Can be used to adjust for exposure difference when modeling claim counts.
    - A 1-year policy term will have twice the expected claim counts as a 6-month policy term.<br><br>
    
$$\begin{align} ln(\mu_i) & = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + ln(\text{1 car-year}) \\ \\
& = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + ln(1) \\ \\
\mu_i & = e^{\beta_0} \cdot e^{\beta_1 x_{i1}} \cdot e^{\beta_2 x_{i2}} \cdot \ldots  \cdot e^{\beta_p x_{ip}} \cdot 1 \end{align}$$
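
A small sketch of the exposure offset under a log link: the logged exposure is added to the linear predictor with a coefficient fixed at 1, so the expected claim count scales in proportion to exposure. The linear predictor value is hypothetical.

```python
import math


def expected_claim_count(linear_predictor, exposure):
    """Log-link mean with ln(exposure) as an offset (coefficient fixed at 1)."""
    return math.exp(linear_predictor + math.log(exposure))


# Hypothetical linear predictor for one risk, excluding the offset.
eta = -2.3
full_year = expected_claim_count(eta, exposure=1.0)
half_year = expected_claim_count(eta, exposure=0.5)
print(full_year, half_year, full_year / half_year)
```

The ratio of the two predictions is 2 (up to rounding), matching the statement that a 1-year term has twice the expected claim count of a 6-month term.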

## Severity Model Distributions

- <b>Gamma</b>
    - Most commonly used.
- <b>Inverse Gaussian</b>
    - Wider tail and sharper peak than Gamma.
- Both are right-skewed and have lower bound at 0.

## Frequency Model Distributions

- <b>Poisson</b>
    - Most commonly used.
    - Models the count of events occurring within a fixed timeframe.
    - GLM implementation allows it to be continuous as well.
    - Overdispersed Poisson (ODP) is used because, in addition to the Poisson variance, there is variance in the Poisson mean ($\mu$).
    - Poisson and ODP distributions produce the same estimates of coefficients.
        - Model diagnostics might be too optimistic under Poisson.
            - Due to lower variance since $\phi$ = 1.



- <b>Negative Binomial: </b>
    
$$\begin{align} y & \sim Poisson(\mu = \theta) \\ 
\theta & \sim Gamma(....)\end{align}$$<br>

$$V(\mu) = \mu(1+\kappa \mu)$$<br>

- $\phi$ = 1
- $\kappa$ is the overdispersion parameter.
- As $\kappa \rightarrow 0$ then Negative Binomial $\rightarrow$ Poisson.
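
The Poisson-gamma mixture can be simulated to check the variance function $\mu(1+\kappa\mu)$. This is a sketch using only the Python standard library. It assumes the gamma mixing distribution has shape $1/\kappa$ and scale $\kappa\mu$ (one parameterization with mean $\mu$ and variance $\kappa\mu^2$), and the values of $\mu$ and $\kappa$ are illustrative.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def negative_binomial_draw(rng, mu, kappa):
    """Poisson count whose mean theta is gamma-distributed with mean mu."""
    # Gamma with shape 1/kappa and scale kappa*mu has mean mu and variance kappa*mu^2.
    theta = rng.gammavariate(1.0 / kappa, kappa * mu)
    return poisson_draw(rng, theta)


rng = random.Random(2024)
mu, kappa = 2.0, 0.5
sample = [negative_binomial_draw(rng, mu, kappa) for _ in range(50_000)]

sample_mean = statistics.fmean(sample)
sample_var = statistics.variance(sample)
print(sample_mean, sample_var, mu * (1.0 + kappa * mu))
```

The sample variance should land near $\mu(1+\kappa\mu) = 4$, well above the Poisson variance of 2 for the same mean.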

## Pure Premium Model Distribution

#### Tweedie <br>

- Tweedie models a Poisson-gamma process where claims occur following the Poisson process and loss amounts follow a Gamma distribution.
    - Can be thought of as "Poisson-distributed sum of gamma distributions".<br>

$$\begin{align}  
\text{Gamma mean} & = \alpha \theta \\ 
\text{Tweedie mean ($\mu$)} & = \lambda \cdot ( \alpha \theta) \\ 
V(\mu) & = \mu^p  \\ 
\phi & = \frac{\lambda^{1-p} \cdot (\alpha \theta)^{2-p}}{2-p} \end{align}$$<br>







$$ p = \frac{\alpha + 2}{\alpha + 1}$$<br>

- The power parameter $p$ can be any real number outside the open interval (0, 1); the values used in practice are 0 and those in [1, 3].<br><br>
- p is strictly a function of gamma parameter $\alpha$, which is a function of gamma's coefficient of variation (CV).
    - CV $\rightarrow 0 \text{ then } p \rightarrow 1$
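
The formulas above can be checked numerically: the Tweedie variance $\phi \cdot \mu^p$ should equal the variance of the underlying compound Poisson-gamma process, $\lambda \alpha \theta^2 (1+\alpha)$. The sketch below uses illustrative parameter values and a hypothetical helper name. It also shows $p \rightarrow 1$ as the gamma CV shrinks, using $\alpha = 1/CV^2$.

```python
def tweedie_parameters(claim_frequency, gamma_shape, gamma_scale):
    """Map Poisson-gamma parameters (lambda, alpha, theta) to Tweedie (mu, p, phi)."""
    severity_mean = gamma_shape * gamma_scale
    mu = claim_frequency * severity_mean
    p = (gamma_shape + 2.0) / (gamma_shape + 1.0)
    phi = claim_frequency ** (1.0 - p) * severity_mean ** (2.0 - p) / (2.0 - p)
    return mu, p, phi


# Illustrative values: 0.1 expected claims, gamma severity with alpha=2, theta=500.
lam, alpha, theta = 0.1, 2.0, 500.0
mu, p, phi = tweedie_parameters(lam, alpha, theta)

# Variance implied by the Tweedie form versus the compound Poisson-gamma variance.
tweedie_variance = phi * mu ** p
compound_variance = lam * alpha * theta ** 2 * (1.0 + alpha)
print(mu, p, tweedie_variance, compound_variance)

# As the severity CV shrinks, alpha = 1 / CV^2 grows and p approaches 1.
for cv in (1.0, 0.1, 0.01):
    shape = 1.0 / cv ** 2
    print(cv, (shape + 2.0) / (shape + 1.0))
```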




$$
\begin{array}{c|c|c}
p & \text{Distribution} & \text{Variance Function}\\
\hline
0 & \text{Normal} & \mu^0 = 1\\
1 & \text{Poisson} & \mu \\
2 & \text{Gamma} & \mu^2 \\
3 & \text{Inverse Gaussian} & \mu^3
\end{array}
$$<br><br>

- Values of $p$ between 1 and 2 are well suited to modeling pure premium or loss ratio.<br>


#### Important Implicit assumption of Tweedie GLM

- It assumes that frequency and severity move in the same direction (i.e., an increase in the target variable is made up of an increase in both frequency and severity).
    - Problematic: some predictors might increase severity but reduce frequency.
    - In practice, Tweedie GLMs are still fairly robust to violations of this assumption and can produce very strong models.

## Probability Model Distribution

#### Binomial Distribution

- $\mu$ = probability that the event will occur.
- Used with the logit link function: $g(\mu) = ln \frac{\mu}{1-\mu}$<br>
    - Can't use log link since the linear predictor (RHS) is unbounded, while $\mu$ must lie between 0 and 1.<br><br>
    



- Logit: $ln \frac{\mu}{1-\mu}$ and logistic: $\frac{1}{1+e^{-x}}$<br><br>

<center><img src='images/Logit.JPG'></center>

- A linear predictor of 0 indicates a 50% probability.

$$\begin{align} ln \frac{\mu}{1-\mu} & = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + \text{offset} \\ \\
& = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip} + ln\frac{p}{1-p} \\ \\
\frac{\mu}{1-\mu} & =e^{\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+ \cdots + \beta_p x_{ip}} \cdot \frac{p}{1-p}\\ \\
& =e^{\beta_0} \cdot e^{\beta_1 x_{i1}} \cdot e^{\beta_2 x_{i2}} \cdot \ldots  \cdot e^{\beta_p x_{ip}} \cdot \frac{p}{1-p} \end{align}$$<br><br>

- Assume $\beta_1 = .24$; then $e^{.24} - 1 = 27\%$ means 27% higher odds of occurrence than the base class (not 27% higher probability).
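
Because the logit link is multiplicative in the odds rather than in the probability, the effect of a coefficient on the probability depends on the base probability. A short sketch for the $\beta_1 = .24$ example, with hypothetical base-class probabilities:

```python
import math


def apply_odds_ratio(base_probability, coefficient):
    """Scale the base odds by e^coefficient and convert back to a probability."""
    base_odds = base_probability / (1.0 - base_probability)
    new_odds = base_odds * math.exp(coefficient)
    return new_odds / (1.0 + new_odds)


odds_ratio = math.exp(0.24)
print(round(odds_ratio, 4))  # 1.2712

# Hypothetical base-class probabilities; the probability lift depends on the base.
for base_p in (0.01, 0.10, 0.50):
    new_p = apply_odds_ratio(base_p, 0.24)
    print(base_p, round(new_p, 4), round(new_p / base_p - 1.0, 4))
```

For rare events the relative change in probability is close to the change in odds, and it shrinks as the base probability grows.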


## Correlations Among Predictor Variables

- GLM can easily handle moderate correlation.
- Higher correlation between variables causes issues for GLM since the same information appears twice.
    - Coefficients may behave erratically, taking extremely high or low values.
    - Std errors become quite large.
    - Such a model is called unstable.

#### Ways of dealing with high correlation

1. Remove all but one of the highly correlated variables.
    - Could lead to loss of some unique information.
2. Use principal components analysis (PCA) or factor analysis to create new uncorrelated variables.
    - Requires time-consuming analysis.

#### Multicollinearity

- When two or more predictors are strongly predictive of another predictor variable.
- Variance inflation factor (VIF) can be used to detect multicollinearity.
    - Measures the increase in (squared) std error due to the presence of collinearity with other predictors.
    - VIF of 10 or higher is considered high.
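
As a simplified sketch, the VIF of a predictor is $1/(1-R^2)$, where $R^2$ comes from regressing that predictor on the others. With only one other predictor, $R^2$ is just the squared Pearson correlation, which is the special case coded below. The data and variable names are made up for illustration, and a real analysis would regress each predictor on all of the others.

```python
import math


def pearson_correlation(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF = 1 / (1 - R^2); with one other predictor, R^2 is the squared correlation."""
    r = pearson_correlation(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data: one predictor, a nearly redundant copy, and an unrelated one.
vehicle_age = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
nearly_same = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
unrelated = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]

print(vif_two_predictors(vehicle_age, nearly_same))
print(vif_two_predictors(vehicle_age, unrelated))
```

The nearly redundant pair produces a VIF far above the threshold of 10, while the unrelated pair stays close to 1.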

#### Aliasing

- Occurs when two predictors are perfectly correlated, i.e., one is a linear function of the other (e.g., a male indicator and a female indicator).
- GLM will not converge in this case.
    - Most GLM software automatically detects this and drops one of the predictor variables.

## GLM Limitations

1. <b>GLMs assign full credibility to the data</b>  
    - It only warns about thin data through high std errors and p-values.<br><br>
    
2. <b>GLMs assume the <u>randomness</u> of outcomes is uncorrelated</b>
    - Large groups of correlated outcomes cause the GLM to produce sub-optimal predictions.