# Model Evaluation

In [1]:
# Load data manipulation package
import numpy as np
import pandas as pd

# Load data visualization package
import matplotlib.pyplot as plt
import seaborn as sns

## **Model Evaluation**
---

## **1. Model Estimation Fit**
---

#### **1.1 Likelihood Ratio Test**

The likelihood ratio test compares two maximized values of the likelihood function:
1. The maximum over the parameter values allowed under the **null hypothesis**,
2. The maximum over the larger set of parameter values allowed under the **alternative hypothesis**.

**The hypotheses**:
  - Null Hypothesis: the logit model only contains intercept $\beta_{0}$.
  - Alternative Hypothesis: the logit model contains $\beta_{0}, \beta_{1}, \dots, \beta_{p}$.
    - $\text H_{0} : \beta = \beta_{0}$
    - $\text H_{1} : \beta = \beta_{0}, \beta_{1}, \dots, \beta_{p}$

**Test statistic**:
- The likelihood ratio (LLR):

$$
\begin{align*}
\text{LLR} &= -2 \log \left (\frac{\ell_{0}}{\ell_{1}}\right) \\
\text{LLR} &= -2 (L_{0}-L_{1})
\end{align*}
$$

<br>

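- $\ell_{0}$ and $\ell_{1}$ are the maximized likelihoods under the null and the alternative hypothesis; $L_{0}=\log \ell_{0}$ and $L_{1}=\log \ell_{1}$ are the corresponding log likelihoods.
- $L_{0}$ is evaluated at the ML estimate of $\beta_{0}$ in the intercept-only model.
- $L_{1}$ is evaluated at the ML estimates of $\beta_{0}, \beta_{1}, \dots, \beta_{p}$.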

**Rejection region**:
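- Reject null hypothesis ($\text H_{0}$) if $\text{LLR}>\chi^{2}_{\alpha,\; \text{df}=p}$, where the degrees of freedom $p$ equal the number of parameters added under the alternative hypothesis, or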
- Reject null hypothesis ($\text H_{0}$) if $\text{P-value}<\alpha$.
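
As a minimal sketch of the test, the snippet below uses a small made-up data set with one binary predictor. It relies on a known property: for a logit model with an intercept and a single binary predictor, the ML fitted probabilities equal the observed proportion of successes in each group, so no optimizer is needed. The counts and the names `log_lik` and `groups` are illustrative assumptions. The p-value uses the identity that the chi-squared survival function with df = 1 equals `erfc(sqrt(x / 2))`.

```python
import math

def log_lik(successes, n, p):
    """Bernoulli log-likelihood of `successes` ones in `n` trials at probability p."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Hypothetical data: (number of successes, number of observations) per group
groups = {"x=0": (3, 10), "x=1": (8, 10)}

total_s = sum(s for s, n in groups.values())
total_n = sum(n for s, n in groups.values())

# M0 (intercept only): one common probability, estimated by the overall proportion
L0 = log_lik(total_s, total_n, total_s / total_n)

# M1 (intercept + binary predictor): one probability per group
L1 = sum(log_lik(s, n, s / n) for s, n in groups.values())

# Both forms of the test statistic give the same value
llr_from_logs = -2 * (L0 - L1)
llr_from_ratio = -2 * math.log(math.exp(L0) / math.exp(L1))

# df = 1 because M1 has one more parameter than M0
p_value = math.erfc(math.sqrt(llr_from_logs / 2))

print(f"LLR = {llr_from_logs:.3f}, p-value = {p_value:.4f}")
```

With these counts the p-value falls below 0.05, so the intercept-only model would be rejected at that level.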

#### **1.2 Deviance**

The deviance is the likelihood-ratio statistic for comparing model M to the saturated model.

  - Deviance:
$$
\text{Deviance} = -2(L_{M}-L_{S})
$$

<br>

- $L_{M}$ is the maximized log-likelihood value for a model $M$ of interest.
- $L_{S}$ is the maximized log-likelihood value for the most complex model possible or saturated model.

We can compare the models by comparing their deviances.
- $M_{0}$ with maximized log-likelihood $L_{0}$.
- $M_{1}$ with maximized log-likelihood $L_{1}$.
$$
\text{Deviance difference} = -2(L_{0}-L_{1})
$$

<br>

The difference is large when $M_{0}$ fits poorly compared with $M_{1}$.
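
The relationship between the two deviances and their difference can be sketched as follows. The sketch assumes ungrouped binary data, for which the saturated model reproduces every observation exactly and therefore has $L_{S}=0$. The helper `deviance` is a hypothetical name, and the two log-likelihoods are the rounded values reported for the horseshoe crab models later in this notebook.

```python
def deviance(ll_model, ll_saturated=0.0):
    """Deviance = -2 (L_M - L_S); L_S defaults to 0 (ungrouped binary data)."""
    return -2 * (ll_model - ll_saturated)

# Rounded log-likelihoods of the intercept-only model and the width model
L0, L1 = -112.88, -97.226

dev_m0 = deviance(L0)
dev_m1 = deviance(L1)

# L_S cancels in the difference, leaving -2 (L0 - L1)
dev_diff = dev_m0 - dev_m1

print(f"Deviance M0 = {dev_m0:.3f}")
print(f"Deviance M1 = {dev_m1:.3f}")
print(f"Difference  = {dev_diff:.3f}")
```

Because $L_{S}$ cancels, the deviance difference between two nested models is the same quantity as the likelihood-ratio statistic that compares them.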


#### **1.3 AIC (Akaike Information Criterion)**

- AIC penalizes a model for having many parameters.
- A higher maximized log likelihood is rewarded, but each additional parameter costs one unit of log likelihood.
- The preferred model is the one that minimizes:

$$
\text{AIC} = -2(\text{log likelihood} - \text{number of parameters in model})
$$
- When comparing two models, the smaller value of AIC indicates the better model.
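
A short sketch of how the penalty works, using two made-up candidate models (the log-likelihoods, parameter counts, and the helper name `aic` are assumptions for illustration only):

```python
def aic(log_likelihood, n_params):
    """AIC = -2 (log likelihood - number of parameters in model)."""
    return -2 * (log_likelihood - n_params)

# Hypothetical candidates: (maximized log-likelihood, number of parameters)
candidates = {
    "small model": (-100.0, 2),
    "large model": (-99.5, 4),
}

aic_values = {name: aic(ll, k) for name, (ll, k) in candidates.items()}

# The preferred model is the one with the smallest AIC
best = min(aic_values, key=aic_values.get)

for name, value in aic_values.items():
    print(f"{name}: AIC = {value:.1f}")
print(f"Preferred: {best}")
```

The large model has the higher log likelihood, but the gain of 0.5 does not offset its two extra parameters, so the small model has the lower AIC.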


## **2. Predictive Performance**
---

#### **2.1 Classification Matrix**

- The classification matrix cross-tabulates the actual outcome with the predicted outcome.
- Recall the decision rule: $\hat{y}=1$ if $\pi(x) > 0.5$, otherwise $\hat{y}=0$.

<img src="../assets/classification_matrix.jpg" width=400>

- Correctly predicted outcomes:
  - True Positive (TP)
  - True Negative (TN)

- Incorrectly predicted outcomes:
  - False Positive (FP)
  - False Negative (FN)

**Accuracy**
- The proportion of all observations that are classified correctly, counting both true positives and true negatives.
$$
\text{Accuracy} = \frac{\text{TP+TN}}{\text{TP+FP+TN+FN}}
$$

**Sensitivity (True Positive Rate)**
- The percentage of actual positive outcomes that were correctly predicted.
$$
\text{Sensitivity} = \frac{\text{TP}}{\text{TP+FN}}
$$

**Specificity (True Negative Rate)**
- The percentage of actual negative outcomes that were correctly predicted.
$$
\text{Specificity} = \frac{\text{TN}}{\text{TN+FP}}
$$
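
The three measures follow directly from the four cell counts. Here is a minimal sketch with made-up counts; the function name `classification_metrics` and the numbers are assumptions for illustration, not results from the notebook's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from classification-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Hypothetical counts
metrics = classification_metrics(tp=40, fp=10, tn=30, fn=20)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```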

#### **2.2 ROC and AUC**

- In logistic regression, we use cut-off probability $\pi_{0}=0.5$ to classify the predicted probabilities.
- The prediction is $\hat{y}=1$ if $\pi(x) > \pi_{0}$, otherwise $\hat{y}=0$.
- However, if :
  - A **low proportion of observations have $y=1$**, the model fit may never have $\pi_{i}>0.5$, in which case one **never predicts $\hat{y}=1$**.
  - A **high proportion of observations have $y=1$**, the model fit may always have $\pi_{i}>0.5$, in which case one **always predicts $\hat{y}=1$**.
- Another possibility takes $\pi_{0}$ as the sample proportion of 1 outcomes.
- How do we know the chosen $\pi_{0}$ is the optimum cut-off probability?
---
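
The effect of the cut-off can be illustrated with a small made-up sample in which the outcome is rare. The labels, fitted probabilities, and the helper `classify` below are hypothetical; they are only meant to show how the two cut-off choices discussed above behave.

```python
def classify(probabilities, cutoff):
    """Predict 1 when the fitted probability exceeds the cut-off, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]

# Hypothetical rare-outcome sample: 2 of 10 observations have y = 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
fitted = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.30, 0.35, 0.45]

# Alternative cut-off: the sample proportion of 1 outcomes
sample_proportion = sum(y_true) / len(y_true)

pred_half = classify(fitted, 0.5)
pred_prop = classify(fitted, sample_proportion)

print("cut-off 0.5:", pred_half)
print(f"cut-off {sample_proportion}:", pred_prop)
```

With the cut-off at 0.5 no observation is predicted as 1, while the sample-proportion cut-off flags both actual positives at the cost of two false positives.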

- Choosing a different cut-off probability produces a different classification matrix, and therefore different values of accuracy, sensitivity, and specificity.
- The Receiver Operating Characteristic (ROC) curve plots $\text{sensitivity}$ as a function of $(1 - \text{specificity})$ across the possible cut-off probabilities.
  - The $y$-axis: $\text{Sensitivity (True Positive Rate)}$
  - The $x$-axis: $1- \text{Specificity (False Positive Rate)}$
  
  $$
  \text{FPR} = \frac{\text{FP}}{\text{FP+TN}}
  $$

<img src="../assets/ROC.jpg" width = 500>

- The most preferable cut-off values are those whose points lie toward the upper left of the chart, where the **$\text{sensitivity}$ is high** and the **$1 - \text{specificity}$ is low**.

- The AUC (area under the curve) is particularly useful for comparing different models.

<img src="../assets/AUC.jpg" width = 500>

- Models with higher AUC values have ROC curves that extend further toward the upper left of the graph, where the combinations of sensitivity and specificity are most desirable.
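
A minimal sketch of how the ROC points and the AUC could be computed by hand is given below. The labels and scores are made up, and `roc_points` and `auc` are hypothetical helper names. The sketch classifies an observation as 1 when its score is greater than or equal to the cut-off (a small departure from the strict inequality used above, so that every distinct score can serve as a cut-off), and it integrates the curve with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per cut-off, running from (0, 0) to (1, 1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Each distinct score is used as a cut-off, from the strictest to the loosest
    cutoffs = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is predicted 1
    for c in cutoffs:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= c and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= c and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical outcomes and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

points = roc_points(y_true, scores)
area = auc(points)

print(points)
print(f"AUC = {area:.2f}")
```

The AUC obtained this way equals the share of (positive, negative) pairs in which the positive observation receives the higher score, which is one way to read it as a ranking measure.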

# **Model Evaluation**
---

In [14]:
# Import dataset from csv file
data = pd.read_csv('../data/horseshoe_crab.csv')
data.drop(columns=['index'], inplace=True)

# Table check
data.head()

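|   | Color | Spine | Width | Weight | Satellite |
|---|-------|-------|-------|--------|-----------|
| 0 | 2 | 3 | 28.3 | 3.05 | 8 |
| 1 | 3 | 3 | 26.0 | 2.6 | 4 |
| 2 | 3 | 3 | 25.6 | 2.15 | 0 |
| 3 | 4 | 2 | 21.0 | 1.85 | 0 |
| 4 | 2 | 3 | 29.0 | 3.0 | 1 |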


In [15]:
# Information check
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Color      173 non-null    int64  
 1   Spine      173 non-null    int64  
 2   Width      173 non-null    float64
 3   Weight     173 non-null    float64
 4   Satellite  173 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 6.9 KB


- The dataset has **173 observations** of **5 variables**:
  - Color : multicategory (ordinal)
  - Spine : multicategory (nominal)
  - Width : continuous
  - Weight : continuous
  - Satellite : discrete (**response variable**)
- We will treat the response variable Satellite as a binary response (0 or 1).
- We need to recode the number of satellites as 1 when satellites > 0 and as 0 when satellites = 0.

In [16]:
# Code the response variable Satellite as binary:
# 0 if the crab has no satellites, 1 if it has at least one
data['Satellite'] = (data['Satellite'] > 0).astype(int)

# Data check
data.head()

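|   | Color | Spine | Width | Weight | Satellite |
|---|-------|-------|-------|--------|-----------|
| 0 | 2 | 3 | 28.3 | 3.05 | 1 |
| 1 | 3 | 3 | 26.0 | 2.6 | 1 |
| 2 | 3 | 3 | 25.6 | 2.15 | 0 |
| 3 | 4 | 2 | 21.0 | 1.85 | 0 |
| 4 | 2 | 3 | 29.0 | 3.0 | 1 |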


## **1. Model Estimation Fit**
---

### **1.1 Likelihood Ratio Test**
---

We want to assess the logit model $M_{1}$ :
$$
\text{logit(satellite)} = \beta_{0} + \beta_{1}(\text{width})
$$
compared to null model $M_{0}$ :
$$
\text{logit(satellite)} = \beta_{0}
$$
<br>
**Hypotheses** :
- Null Hypothesis: the logit model only contains intercept $\beta_{0}$.
- Alternative Hypothesis: the logit model contains $\beta_{0}$ and $\beta_{1}$.
    - $\text H_{0} : \beta = \beta_{0}$
    - $\text H_{1} : \beta = \beta_{0}, \beta_{1}$

**Test statistic**:
- The likelihood ratio (LLR):

 $$
\text{LLR} = -2 (L_{0}-L_{1})
$$

<br>

- $L_{0}$ is the maximum log likelihood of $M_{0}$.
- $L_{1}$ is the maximum log likelihood of $M_{1}$.


In [17]:
# Define the response satellite and predictor width
satellite = data['Satellite']
width = data[['Width']]

In [19]:
# Modeling with statsmodels.formula
import statsmodels.formula.api as smf

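# Model fitting
# Note: `satellite` and `width` are not columns of `data` (those are capitalized);
# the formula resolves them from the variables defined in the previous cell
formula = 'satellite ~ width'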
model_width = smf.logit(formula = formula,
                        data = data)
result_width = model_width.fit()

# Print the result
print(result_width.summary())

Optimization terminated successfully.
         Current function value: 0.562002
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              satellite   No. Observations:                  173
Model:                          Logit   Df Residuals:                      171
Method:                           MLE   Df Model:                            1
Date:                Mon, 28 Oct 2024   Pseudo R-squ.:                  0.1387
Time:                        21:03:48   Log-Likelihood:                -97.226
converged:                       True   LL-Null:                       -112.88
Covariance Type:            nonrobust   LLR p-value:                 2.204e-08
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -12.3508      2.629     -4.698      0.000     -17.503      -7.199
width          0.4972      0.102      4.887      0.000       0.298       0.697

- The statsmodels result object provides the quantities needed for the LLR test.
- See the details [here](https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.html).
- Some attributes of the model result:
  - `llr` : the likelihood ratio statistic
  - `llr_pvalue` : the chi-squared probability (p-value)
  - `llnull` : log-likelihood of null model (intercept only)
  - `llf` : maximum log-likelihood of model of interest
- Example:
  - Run `result_model.llr` to get the LLR value, where `result_model` stands for a fitted result such as `result_width`.


In [20]:
# Extract L0 as ll_null
ll_null = result_width.llnull

print(f"L0 = {ll_null:.2f}")

L0 = -112.88


In [21]:
# Extract L1 as ll_width
ll_width = result_width.llf

print(f"L1 = {ll_width:.3f}")

L1 = -97.226


Then, calculate $\text{LLR} = -2 (L_{0}-L_{1})$.

In [22]:
# Calculate LLR
llr_width_scratch = -2*(ll_null - ll_width)

# Print LLR from the summary output to cross check the scratch
llr_width_summary = result_width.llr

print(f"LLR from scratch = {llr_width_scratch:.3f}")
print(f"LLR from summary = {llr_width_summary:.3f}")

LLR from scratch = 31.306
LLR from summary = 31.306


In [23]:
# Extract P-value of LLR
p_val_llr = result_width.llr_pvalue

print(f"P-value of LLR = {p_val_llr:.3e}")

P-value of LLR = 2.204e-08


**Rejection decision** :
- Since $\text{P-value of LLR} < \alpha=0.05$, the null hypothesis $\beta=\beta_{0}$ **is rejected** at $\alpha=0.05$.

**Conclusion** :
> The predictor width has a statistically significant effect in the logit model.

Next, we want to assess another logit model $M_{2}$ :
$$
\text{logit(satellite)} = \beta_{0} + \beta_{1}(\text{width}) + \beta_{2}(\text{weight})
$$
compared to null model $M_{0}$ :
$$
\text{logit(satellite)} = \beta_{0}
$$

<br>

**Hypotheses** :
- Null Hypothesis: the logit model only contains intercept $\beta_{0}$.
- Alternative Hypothesis: the logit model contains $\beta_{0}$, $\beta_{1}$, and $\beta_{2}$.
  - $\text H_{0} : \beta = \beta_{0}$
  - $\text H_{1} : \beta = \beta_{0}, \beta_{1}, \beta_{2}$

**Test statistic**:
  - The likelihood ratio (LLR):

$$
\text{LLR} = -2 (L_{0}-L_{2})
$$

<br>

- $L_{0}$ is the maximum log likelihood of $M_{0}$.
- $L_{2}$ is the maximum log likelihood of $M_{2}$.


In [24]:
# Define the response variable
satellite = data['Satellite']

# Define the predictors
width = data[['Width']]
weight = data[['Weight']]

In [25]:
# Modeling with statsmodels.formula

# Model fitting
formula = 'satellite ~ width + weight'
model_width_weight = smf.logit(formula = formula,
                               data = data)
result_width_weight = model_width_weight.fit()

# Print the result
print(result_width_weight.summary())

Optimization terminated successfully.
         Current function value: 0.557489
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              satellite   No. Observations:                  173
Model:                          Logit   Df Residuals:                      170
Method:                           MLE   Df Model:                            2
Date:                Mon, 28 Oct 2024   Pseudo R-squ.:                  0.1456
Time:                        21:05:25   Log-Likelihood:                -96.446
converged:                       True   LL-Null:                       -112.88
Covariance Type:            nonrobust   LLR p-value:                 7.294e-08
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -9.3540      3.528     -2.651      0.008     -16.269      -2.439
width          0.3068      0.182      1.686      0.092      -0.050       0.663

Then, calculate $\text{LLR} = -2 (L_{0}-L_{2})$.
Alternatively, extract the LLR and its P-value directly from the statsmodels result.

In [26]:
# Extract the LLR
llr_width_weight = result_width_weight.llr

print(f"LLR M2 = {llr_width_weight:.3f}")

LLR M2 = 32.867


In [27]:
# Extract P-value of LLR
p_val_M2 = result_width_weight.llr_pvalue

print(f"P-value of LLR M2 = {p_val_M2:.3e}")

P-value of LLR M2 = 7.294e-08


**Rejection decision** :
- Since $\text{P-value of LLR} < \alpha=0.05$, the null hypothesis $\beta=\beta_{0}$ **is rejected** at $\alpha=0.05$.

**Conclusion** :
> Model 2 fits better than the null model: at least one of the predictors width and weight has an effect in the logit model.
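
As a cross-check, the statistic and its P-value for $M_{2}$ can be approximately reproduced by hand from the rounded log-likelihoods in the summary output. This is only a sketch: the variable names are illustrative, and it uses the fact that $M_{2}$ adds two parameters to $M_{0}$, so the reference distribution has df = 2, for which the chi-squared survival function is simply $e^{-x/2}$. Small differences from the statsmodels values come from rounding.

```python
import math

# Rounded log-likelihoods copied from the summary output
L0 = -112.88    # intercept only
L2 = -96.446    # width + weight

llr_m2_scratch = -2 * (L0 - L2)

# For df = 2 the chi-squared survival function is exp(-x / 2)
p_value_m2_scratch = math.exp(-llr_m2_scratch / 2)

print(f"LLR M2 from scratch     = {llr_m2_scratch:.3f}")
print(f"P-value M2 from scratch = {p_value_m2_scratch:.3e}")
```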

### **1.2 Deviance**
---

We want to compare the two models by comparing their deviances.
  - $M_{1}$ with maximized log-likelihood $L_{1}$ and logit model:
  $$
  \text{logit(satellite)} = \beta_{0} + \beta_{1}(\text{width})
  $$
  - $M_{2}$ with maximized log-likelihood $L_{2}$ and logit model:
  $$
  \text{logit(satellite)} = \beta_{0} + \beta_{1}(\text{width}) + \beta_{2}(\text{weight})
  $$

<br>

The deviance difference:
$$
\text{Deviance difference} = -2(L_{1}-L_{2})
$$
<br>

The difference is large when $M_{1}$ fits poorly compared with $M_{2}$.

In [28]:
# Extract L1 as ll_width
ll_width = result_width.llf

print(f"L1 = {ll_width:.3f}")

L1 = -97.226


In [29]:
# Extract L2 as ll_width_weight
ll_width_weight = result_width_weight.llf

print(f"L2 = {ll_width_weight:.3f}")

L2 = -96.446


Then, calculate $\text{Deviance difference} = -2(L_{1}-L_{2})$.

In [30]:
# Calculate deviance difference
deviance_diff = -2*(ll_width - ll_width_weight)

print(f"Deviance difference = {deviance_diff:.3f}")

Deviance difference = 1.561


In [31]:
# Extract tabulated chi-squared of deviance difference
import scipy.stats as stats

chi_sq = stats.chi2.ppf(q = 1-0.05,
                        df = 1)

print(f"Tabulated chi-sq = {chi_sq:.4f}")

Tabulated chi-sq = 3.8415


**Rejection decision** :
- Since $\text{deviance difference}<\chi^{2}_{\alpha=0.05,\; \text{df}=1}$, we **fail to reject** the null hypothesis that the model only contains $\beta_{0}$ and $\beta_{1}$ (that is, $\beta_{2}=0$).

**Conclusion** :
> Adding the predictor weight does not significantly improve the fit of the logit model. This suggests that we don't need weight once width is already in the model.
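
The same decision can be reached through a P-value. The sketch below uses only the standard library and relies on the fact that a chi-squared variable with df = 1 is the square of a standard normal variable; the variable names are illustrative, and the deviance difference is the rounded value printed earlier.

```python
import math
from statistics import NormalDist

dev_diff_printed = 1.561   # rounded deviance difference between M1 and M2
alpha = 0.05

std_normal = NormalDist()

# Upper-alpha critical value of chi-squared (df = 1)
# = squared two-sided critical value of the standard normal
critical_value = std_normal.inv_cdf(1 - alpha / 2) ** 2

# P(chi2_1 > x) = P(|Z| > sqrt(x))
p_value_dev = 2 * (1 - std_normal.cdf(math.sqrt(dev_diff_printed)))

print(f"Critical value = {critical_value:.4f}")
print(f"P-value        = {p_value_dev:.4f}")
```

The P-value is well above 0.05, which agrees with the comparison against the tabulated chi-squared value.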

### **1.3 AIC**
---

- When comparing two models, the smaller value of AIC indicates the better model.
$$
\text{AIC} = -2(\text{log likelihood} - \text{number of parameters in model})
$$
- We want to compare the two models from the previous section by their AIC values.
  - $M_{1}$ with maximized log-likelihood $L_{1}$ and logit model:
  $$
  \text{logit(satellite)} = \beta_{0} + \beta_{1}(\text{width})
  $$
  - $M_{2}$ with maximized log-likelihood $L_{2}$ and logit model:
  $$
  \text{logit(satellite)} = \beta_{0} + \beta_{1}(\text{width}) + \beta_{2}(\text{weight})
  $$

---
Model 1 (Width only)

In [32]:
# Print the result
print(result_width.summary())

                           Logit Regression Results                           
Dep. Variable:              satellite   No. Observations:                  173
Model:                          Logit   Df Residuals:                      171
Method:                           MLE   Df Model:                            1
Date:                Mon, 28 Oct 2024   Pseudo R-squ.:                  0.1387
Time:                        21:09:23   Log-Likelihood:                -97.226
converged:                       True   LL-Null:                       -112.88
Covariance Type:            nonrobust   LLR p-value:                 2.204e-08
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -12.3508      2.629     -4.698      0.000     -17.503      -7.199
width          0.4972      0.102      4.887      0.000       0.298       0.697


- Statsmodels package yields output for AIC.
- You can see the details [here](https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.html)
- Example:
  - Run `result_model.aic` to get the AIC value of the model.

In [33]:
# Extract AIC of model 1
aic_width = result_width.aic

print(f"AIC model 1 = {aic_width:.2f}")

AIC model 1 = 198.45


Alternatively, calculate it with the formula $\text{AIC} = -2(\text{log likelihood} - \text{number of parameters in model})$, where the number of parameters in model 1 is 2 ($\beta_{0}$ and $\beta_{1}$).

In [34]:
# Calculate with the formula
aic_width_scratch = -2*(ll_width - 2)
aic_width_scratch

np.float64(198.45266392876505)

---
Model 2 (adding Weight)

In [35]:
# Print the result
print(result_width_weight.summary())

                           Logit Regression Results                           
Dep. Variable:              satellite   No. Observations:                  173
Model:                          Logit   Df Residuals:                      170
Method:                           MLE   Df Model:                            2
Date:                Mon, 28 Oct 2024   Pseudo R-squ.:                  0.1456
Time:                        21:10:00   Log-Likelihood:                -96.446
converged:                       True   LL-Null:                       -112.88
Covariance Type:            nonrobust   LLR p-value:                 7.294e-08
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -9.3540      3.528     -2.651      0.008     -16.269      -2.439
width          0.3068      0.182      1.686      0.092      -0.050       0.663
weight         0.8336      0.671      1.242      0.2

In [36]:
# Extract AIC of model 2
aic_width_weight = result_width_weight.aic

# Print and compare AIC model 1 & 2
print(f"AIC model 1 = {aic_width:.2f}")
print(f"AIC model 2 = {aic_width_weight:.2f}")

AIC model 1 = 198.45
AIC model 2 = 198.89


**Conclusion**
> Model 1, with the single predictor width, yields the smaller AIC, so the improvement in fit from adding weight does not justify the extra parameter.

This conclusion is consistent with the one reached by comparing the two models through their deviance difference.
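
The agreement between the two criteria is not a coincidence. From the AIC formula, the gap between the AIC values of two nested models equals twice the number of added parameters minus their deviance difference. A short sketch with the rounded log-likelihoods from the summaries (variable names are illustrative):

```python
# Rounded log-likelihoods and parameter counts of the two fitted models
ll_m1, k_m1 = -97.226, 2    # beta_0, beta_1
ll_m2, k_m2 = -96.446, 3    # beta_0, beta_1, beta_2

aic_m1 = -2 * (ll_m1 - k_m1)
aic_m2 = -2 * (ll_m2 - k_m2)

dev_diff = -2 * (ll_m1 - ll_m2)
extra_params = k_m2 - k_m1

# AIC(M2) - AIC(M1) = 2 * (extra parameters) - deviance difference
aic_gap = aic_m2 - aic_m1

print(f"AIC gap                  = {aic_gap:.3f}")
print(f"2 * extra - dev. diff.   = {2 * extra_params - dev_diff:.3f}")
```

AIC therefore prefers the larger model only when the deviance difference exceeds 2 per added parameter, whereas the chi-squared test at $\alpha=0.05$ with df = 1 requires it to exceed about 3.84. A deviance difference of roughly 1.56 falls short of both thresholds.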

## **2. Predictive Performance**
---