The `log-likelihood` measures how well a model explains the observed data. It is the logarithm of the likelihood function. A higher log-likelihood indicates a better fit.
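
To make the definition concrete, here is a minimal sketch that evaluates a Gaussian log-likelihood by hand. The observations, the two candidate means, and the function name are hypothetical and chosen only for illustration; the candidate whose mean sits closer to the data receives the higher value.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log N(x | mu, sigma^2) over all observations."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - squared_error / (2 * sigma ** 2))

# Hypothetical observations clustered around 5
data = [4.8, 5.1, 5.3, 4.9, 5.0]

ll_close = gaussian_log_likelihood(data, mu=5.0, sigma=0.2)  # mean near the data
ll_far = gaussian_log_likelihood(data, mu=7.0, sigma=0.2)    # mean far from the data

print("Log-likelihood with mu = 5.0:", ll_close)
print("Log-likelihood with mu = 7.0:", ll_far)
```

The first value is larger, which is what "better fit" means in likelihood terms.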

Machine learning models can be compared using their log-likelihood.

There are three scenarios of comparison.  

1. `Direct Comparison` (Same Complexity)    
If two models have the same number of parameters:

    - Compare raw log-likelihood values  
    - Higher value = better model  
2. `Likelihood Ratio Test` (Nested Models)
For comparing nested models (one is a special case of the other):

    $\Lambda = -2 \times (\log L_{\text{restricted}} - \log L_{\text{full}})$  
    The `degrees of freedom` are needed for the significance test. Here, they equal the difference between the number of parameters estimated in the complex (full) model and the number estimated in the simpler (restricted) model. Under the null hypothesis that the restricted model is adequate, $\Lambda$ approximately follows a chi-square distribution with that many degrees of freedom.

3. `Information Criteria` (Different Complexity)
    If models differ in complexity, use:

    AIC: Akaike Information Criterion

    $\text{AIC} = 2k - 2\log L$

    BIC: Bayesian Information Criterion

    $\text{BIC} = k \log n - 2\log L$

    Where:

    $k$ = number of parameters

    $n$ = number of observations

Lower AIC/BIC = better model. Both criteria add a penalty that grows with the number of parameters, which discourages overfitting.
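
The two criteria can be sketched directly from their formulas. The log-likelihoods, parameter counts, and sample size below are hypothetical numbers, picked so that the more complex model has the higher log-likelihood but still loses once the penalty is applied.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k log n - 2 log L."""
    return k * math.log(n) - 2 * log_likelihood

n = 100  # hypothetical number of observations

# Hypothetical fits: the complex model explains the data slightly better
simple_ll, simple_k = -120.0, 2
complex_ll, complex_k = -118.5, 5

aic_simple, aic_complex = aic(simple_ll, simple_k), aic(complex_ll, complex_k)
bic_simple, bic_complex = bic(simple_ll, simple_k, n), bic(complex_ll, complex_k, n)

print("AIC simple / complex:", aic_simple, aic_complex)
print("BIC simple / complex:", bic_simple, bic_complex)
```

With these numbers the simple model has the lower AIC and the lower BIC, because the gain of 1.5 in log-likelihood does not offset three extra parameters.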

The following example uses the electricity cost dataset from [Kaggle](https://www.kaggle.com/datasets/shalmamuji/electricity-cost-prediction-dataset).

In [3]:
import pandas as pd 
df = pd.read_csv("electricity_cost_dataset.csv")
df.head(3)

| Unnamed: 0 | site area | structure type | water consumption | recycling rate | utilisation rate | air qality index | issue reolution time | resident count | electricity cost |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1360 | Mixed-use | 2519.0 | 69 | 52 | 188 | 1 | 72 | 1420.0 |
| 1 | 4272 | Mixed-use | 2324.0 | 50 | 76 | 165 | 65 | 261 | 3298.0 |
| 2 | 3592 | Mixed-use | 2701.0 | 20 | 94 | 198 | 39 | 117 | 3115.0 |


**Direct Comparison:** Fit two models with the `same number of parameters` and compare their log-likelihood values.

In [6]:
import statsmodels.api as sm
from scipy.stats import chi2

y = df['electricity cost']
X1 = df['resident count']
X2 = df['utilisation rate']

# No constant is added, so each model estimates exactly one parameter (the slope)
model1 = sm.OLS(y, X1).fit()
model2 = sm.OLS(y, X2).fit()

print("Log-likelihood Model 1:", model1.llf)
print("Log-likelihood Model 2:", model2.llf)

Log-likelihood Model 1: -91277.63101816681
Log-likelihood Model 2: -85537.56000102143


In this case, `Model 2` has the higher log-likelihood, indicating that it fits the data better.
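
As a rough sketch of what such a log-likelihood value represents for a least-squares fit, the snippet below computes it by hand. It assumes normally distributed errors and the maximum-likelihood variance estimate SSR / n, which, to my understanding, is also how the OLS `llf` value is obtained. The toy data and helper names are hypothetical and are not taken from the dataset.

```python
import math

def fit_through_origin(x, y):
    """Least-squares slope for y = b * x (one parameter, no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def ols_log_likelihood(x, y, slope):
    """Gaussian log-likelihood evaluated at the ML variance estimate SSR / n."""
    n = len(y)
    ssr = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * (math.log(2 * math.pi) + math.log(ssr / n) + 1)

# Hypothetical toy data: y follows x_a closely and x_b only loosely
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_a = [1.0, 2.0, 3.0, 4.0, 5.0]
x_b = [3.0, 1.0, 4.0, 1.0, 5.0]

ll_a = ols_log_likelihood(x_a, y, fit_through_origin(x_a, y))
ll_b = ols_log_likelihood(x_b, y, fit_through_origin(x_b, y))

print("Log-likelihood with predictor x_a:", ll_a)
print("Log-likelihood with predictor x_b:", ll_b)
```

Both toy models have one parameter, so the raw values are directly comparable: the smaller the residual sum of squares, the higher the log-likelihood.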

**Likelihood Ratio Test:** This test compares nested models, that is, models with a hierarchical relationship in which the simpler model is a special case of the more complex one.

`Example:` ARMA(P1,Q1) is nested within ARMA(P2,Q2) if all of the conditions below are satisfied:  
`P2+Q2` > `P1+Q1`  
`P2>=P1`  
`Q2>=Q1`
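
The three conditions can be written as a small check. The function name `is_nested_arma` is hypothetical and only encodes the rule stated above.

```python
def is_nested_arma(p1, q1, p2, q2):
    """Return True if ARMA(p1, q1) is nested within ARMA(p2, q2)."""
    return p2 >= p1 and q2 >= q1 and (p2 + q2) > (p1 + q1)

print(is_nested_arma(1, 0, 2, 1))  # True: both orders grow or stay equal
print(is_nested_arma(2, 0, 1, 2))  # False: the AR order shrinks
print(is_nested_arma(1, 1, 1, 1))  # False: identical models are not strictly nested
```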

In [8]:
y = df['electricity cost']
X_full = sm.add_constant(df[['resident count', 'utilisation rate']])  # Full model
X_reduced = sm.add_constant(df[['resident count']])            # Nested model

# Fit both models
model_full = sm.OLS(y, X_full).fit()
model_reduced = sm.OLS(y, X_reduced).fit()

# Extract log-likelihoods
ll_full = model_full.llf
ll_reduced = model_reduced.llf
# Likelihood ratio statistic: -2 * (logL_restricted - logL_full)
lr_stat = -2 * (ll_reduced - ll_full)
df_diff = X_full.shape[1] - X_reduced.shape[1]  # Degrees of freedom: extra parameters in the full model
p_value = chi2.sf(lr_stat, df_diff)

if p_value < 0.05:
    print("Full model gives a significantly improved fit")
else:
    print("Reduced model is sufficient")

Full model gives a significantly improved fit
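
For the common case of exactly one extra parameter, the chi-square tail probability has a closed form, so the arithmetic behind the test can be sketched with the standard library alone. The log-likelihood values and the function name are hypothetical; for other degrees of freedom, `chi2.sf` as used above remains the practical choice.

```python
import math

def lr_test_one_extra_parameter(ll_reduced, ll_full):
    """Likelihood ratio test when the full model has exactly one extra parameter."""
    # Clamp at zero to guard against tiny negative values from rounding
    lr_stat = max(-2 * (ll_reduced - ll_full), 0.0)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(lr_stat / 2))
    return lr_stat, p_value

# Hypothetical log-likelihoods of a reduced and a full model
lr_stat, p_value = lr_test_one_extra_parameter(ll_reduced=-120.0, ll_full=-117.0)

print("LR statistic:", lr_stat)
print("p-value:", p_value)
```

Here the statistic is 6.0, which exceeds the familiar 5% critical value of about 3.84 for one degree of freedom, so the extra parameter would be judged worthwhile.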
