## Missingness Analysis & Imputation Evaluation Demo

This notebook demonstrates how to analyze missingness in a dataset and evaluate imputation quality using the `missmecha.analysis` module.

We show:
- Column-wise and overall missing rate analysis
- Visual inspection of missing patterns
- Evaluation of imputation quality using RMSE / Accuracy, depending on variable type


### A Note on AvgERR

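`AvgERR` evaluates imputation quality with a metric chosen according to each variable's type. For a continuous variable it is one of the error metrics below, depending on the selected method; for a categorical variable it is classification accuracy: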

$$
\text{AvgErr}(v_j) =
\begin{cases}
\frac{1}{n} \sum\limits_{i=1}^{n} |X_{ij} - \hat{X}_{ij}|, & \text{if } v_j \text{ is continuous (MAE)} \\
\sqrt{\frac{1}{n} \sum\limits_{i=1}^{n} (X_{ij} - \hat{X}_{ij})^2}, & \text{if } v_j \text{ is continuous (RMSE)} \\
\frac{1}{n} \sum\limits_{i=1}^{n} (X_{ij} - \hat{X}_{ij})^2, & \text{if } v_j \text{ is continuous (MSE)} \\
\frac{1}{n} \sum\limits_{i=1}^{n} \text{Acc}(X_{ij}, \hat{X}_{ij}), & \text{if } v_j \text{ is categorical}
\end{cases}
$$
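
To make the four cases concrete, here is a minimal NumPy sketch that evaluates each of them on small, made-up vectors. The arrays are hypothetical and stand for the true and imputed values at the imputed positions; this is not the library's own code.

```python
import numpy as np

# Hypothetical true and imputed values for one continuous variable.
x_true = np.array([30.0, 22.0, 41.0])
x_hat = np.array([34.0, 22.0, 38.0])

diff = x_true - x_hat              # [-4, 0, 3]
mae = np.mean(np.abs(diff))        # (4 + 0 + 3) / 3
mse = np.mean(diff ** 2)           # (16 + 0 + 9) / 3
rmse = np.sqrt(mse)

# Hypothetical labels for one categorical variable.
# Acc is 1 where the imputed label matches the truth, 0 otherwise.
c_true = np.array(["F", "M", "F", "M"])
c_hat = np.array(["F", "M", "M", "M"])
accuracy = np.mean(c_true == c_hat)  # 3 of 4 labels match

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  Acc={accuracy:.2f}")
```

Note that the three continuous metrics are errors (lower is better), while accuracy is a score (higher is better), so numeric and categorical results should not be averaged together.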


When a `status` dictionary is passed to `evaluate_imputation`, the function automatically applies the metric that matches each column's declared type:
- **Numerical columns** use the selected method (RMSE or MAE)
- **Categorical/discrete columns** use classification accuracy
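
The dispatch described above might look roughly like the sketch below. The function `avg_err_sketch` and its `method` parameter are hypothetical names introduced for illustration; the sketch assumes the `"num"`, `"cat"`, and `"disc"` labels used later in this notebook and scores only the cells that were missing in the incomplete data. It is not the `missmecha` implementation.

```python
import numpy as np
import pandas as pd

def avg_err_sketch(truth, filled, incomplete, status, method="rmse"):
    """Hypothetical per-column scoring; not missmecha's code."""
    scores = {}
    for col, kind in status.items():
        mask = incomplete[col].isna()            # score only the imputed cells
        t, f = truth.loc[mask, col], filled.loc[mask, col]
        if kind == "num":
            err = t.astype(float) - f.astype(float)
            if method == "mae":
                scores[col] = float(err.abs().mean())
            else:                                # default here: RMSE
                scores[col] = float(np.sqrt((err ** 2).mean()))
        else:                                    # "cat" or "disc": accuracy
            scores[col] = float((t == f).mean())
    return scores

# Tiny made-up example: one imputed cell per column.
truth = pd.DataFrame({"x": [1.0, 2.0, 3.0], "g": ["a", "b", "a"]})
incomplete = pd.DataFrame({"x": [1.0, np.nan, 3.0], "g": ["a", "b", None]})
filled = pd.DataFrame({"x": [1.0, 2.5, 3.0], "g": ["a", "b", "a"]})

scores = avg_err_sketch(truth, filled, incomplete, {"x": "num", "g": "cat"})
print(scores)
```

A column with no missing cells yields `NaN` in this sketch, since the mean of an empty selection is undefined.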

## Setup
We start by importing the required packages and the analysis functions, then build a small mixed-type dataset and inject missing values into it.


In [None]:
import pandas as pd
import numpy as np

from missmecha.analysis import evaluate_imputation, compute_missing_rate

### Create fully observed mixed-type dataset


In [52]:
df_true = pd.DataFrame({
    "age": [25, 30, 22, 40, 35, 50],
    "income": [3000, 4500, 2800, 5200, 4100, 6000],
    "gender": ["M", "F", "M", "F", "F", "M"],
    "job_level": ["junior", "mid", "junior", "senior", "mid", "senior"]
})
df_true

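|   | age | income | gender | job_level |
|---|-----|--------|--------|-----------|
| 0 | 25 | 3000 | M | junior |
| 1 | 30 | 4500 | F | mid |
| 2 | 22 | 2800 | M | junior |
| 3 | 40 | 5200 | F | senior |
| 4 | 35 | 4100 | F | mid |
| 5 | 50 | 6000 | M | senior |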


### Inject missing values

In [53]:
df_incomplete = df_true.copy()
df_incomplete.loc[1, "age"] = np.nan
df_incomplete.loc[2, "income"] = np.nan
df_incomplete.loc[3, "gender"] = np.nan
df_incomplete.loc[4, "job_level"] = np.nan
df_incomplete

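|   | age | income | gender | job_level |
|---|-----|--------|--------|-----------|
| 0 | 25.0 | 3000.0 | M | junior |
| 1 | NaN | 4500.0 | F | mid |
| 2 | 22.0 | NaN | M | junior |
| 3 | 40.0 | 5200.0 | NaN | senior |
| 4 | 35.0 | 4100.0 | F | NaN |
| 5 | 50.0 | 6000.0 | M | senior |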


In [54]:
compute_missing_rate(df_incomplete)

Overall missing rate: 16.67%
4 / 24 total values are missing.

Top variables by missing rate:


| column | n_missing | missing_rate (%) | n_unique | dtype | n_total |
|--------|-----------|------------------|----------|-------|---------|
| age | 1 | 16.67 | 5 | float64 | 6 |
| income | 1 | 16.67 | 5 | float64 | 6 |
| gender | 1 | 16.67 | 2 | object | 6 |
| job_level | 1 | 16.67 | 3 | object | 6 |


{'report':            n_missing  missing_rate (%)  n_unique    dtype  n_total
 column                                                            
 age                1             16.67         5  float64        6
 income             1             16.67         5  float64        6
 gender             1             16.67         2   object        6
 job_level          1             16.67         3   object        6,
 'overall_missing_rate': np.float64(16.67)}
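
The figures in this report can be cross-checked with plain pandas. The sketch below rebuilds the incomplete dataset and recomputes the per-column and overall missing rates. It is an independent check, not a description of how `compute_missing_rate` works internally.

```python
import numpy as np
import pandas as pd

# Same incomplete data as above, written out so the block is self-contained.
df = pd.DataFrame({
    "age": [25, np.nan, 22, 40, 35, 50],
    "income": [3000, 4500, np.nan, 5200, 4100, 6000],
    "gender": ["M", "F", "M", np.nan, "F", "M"],
    "job_level": ["junior", "mid", "junior", "senior", np.nan, "senior"],
})

per_column = (df.isna().mean() * 100).round(2)   # percent missing per column
n_missing = int(df.isna().sum().sum())           # missing cells in the whole frame
overall = round(100 * n_missing / df.size, 2)    # df.size counts all cells (6 x 4)

print(per_column)
print(f"{n_missing} / {df.size} missing -> {overall}%")
```

Because every column has exactly one missing cell out of six, the per-column rate and the overall rate are the same here; in general they differ.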

### Impute missing values (rounded mean for numeric, mode for categorical)

In [55]:
df_filled = df_incomplete.copy()

for col in df_filled.columns:
    if df_filled[col].dtype.kind in "iufc":
        df_filled[col] = df_filled[col].fillna(round(df_filled[col].mean()))
    else:
        df_filled[col] = df_filled[col].fillna(df_filled[col].mode()[0])

df_filled

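|   | age | income | gender | job_level |
|---|-----|--------|--------|-----------|
| 0 | 25.0 | 3000.0 | M | junior |
| 1 | 34.0 | 4500.0 | F | mid |
| 2 | 22.0 | 4560.0 | M | junior |
| 3 | 40.0 | 5200.0 | M | senior |
| 4 | 35.0 | 4100.0 | F | junior |
| 5 | 50.0 | 6000.0 | M | senior |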
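
The fill values in this table follow directly from the observed entries. The short check below recomputes them from the non-missing values of each column, written out by hand.

```python
import pandas as pd

# Observed (non-missing) values of each column in df_incomplete.
age_observed = pd.Series([25, 22, 40, 35, 50])
income_observed = pd.Series([3000, 4500, 5200, 4100, 6000])
gender_observed = pd.Series(["M", "F", "M", "F", "M"])
job_observed = pd.Series(["junior", "mid", "junior", "senior", "senior"])

age_fill = round(age_observed.mean())        # mean 34.4 rounds to 34
income_fill = round(income_observed.mean())  # mean is exactly 4560
gender_fill = gender_observed.mode()[0]      # "M" appears 3 times, "F" twice
job_fill = job_observed.mode()[0]            # tie between "junior" and "senior"

print(age_fill, income_fill, gender_fill, job_fill)
```

Two details are worth keeping in mind. `Series.mode()` returns tied modes in sorted order, so `[0]` picks `"junior"` only because it sorts before `"senior"`. Python's built-in `round` rounds exact halves to the nearest even integer (`round(2.5)` is `2`), which does not affect these values but can matter for other data.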


### Define variable types


In [56]:
status = {
    "age": "num",
    "income": "num",
    "gender": "cat",
    "job_level": "disc"
}


### Run `evaluate_imputation()` with AvgERR logic



In [57]:
results = evaluate_imputation(
    ground_truth=df_true,
    filled_df=df_filled,
    incomplete_df=df_incomplete,
    status=status
)

In [58]:
print("Column-wise scores:")
for k, v in results["column_scores"].items():
    print(f"  {k}: {v:.2f}")

print(f"\nOverall numeric score (MAE): {results['overall_numeric_score']:.2f}")
print(f"Overall categorical score (Accuracy): {results['overall_categorical_score']:.2f}")


Column-wise scores:
  age: 4.00
  income: 1760.00
  gender: 0.00
  job_level: 0.00

Overall numeric score (MAE): 882.00
Overall categorical score (Accuracy): 0.00
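
These scores can be reproduced by hand from the single imputed cell in each column. The sketch below assumes, based only on the printed output, that the overall numeric score is the unweighted mean of the per-column numeric scores; that aggregation rule is an inference, not something taken from the library documentation.

```python
import numpy as np

# (true value, imputed value) at each column's single missing cell.
age_true, age_hat = 30, 34
income_true, income_hat = 2800, 4560

age_err = abs(age_true - age_hat)            # 4
income_err = abs(income_true - income_hat)   # 1760
overall_numeric = float(np.mean([age_err, income_err]))  # (4 + 1760) / 2

gender_acc = float("F" == "M")        # mode imputation gave "M", truth is "F"
job_acc = float("mid" == "junior")    # mode imputation gave "junior", truth is "mid"

print(age_err, income_err, overall_numeric, gender_acc, job_acc)
```

With only one imputed cell per column, MAE and RMSE coincide, so the column scores alone do not reveal which method was applied. The overall numeric score is dominated by `income` because the columns are on very different scales; standardizing numeric columns before scoring may give a more balanced summary.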
