# Statistical Analysis Pipeline Demo

This notebook demonstrates the "Diagnosis First, Inference Second" workflow using the Python pipeline.

## Setup

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Add src to path
sys.path.append(os.path.abspath('../src'))

from eda import EDA_Diagnosis
from method_selector import MethodSelector
from inference import Inference

# Ensure plots show inline
%matplotlib inline

## Case Study A: The "Ideal" Scenario (Parametric Path)

**Scenario:** Comparison of reaction times (RT) between two independent groups (Control vs Treatment).
**Expectation:** Both groups are normally distributed with equal variance, so the pipeline should select the Independent T-test.

In [None]:
# 1. Data Generation
np.random.seed(42)
n = 50
group_a = np.random.normal(loc=300, scale=30, size=n)
group_b = np.random.normal(loc=320, scale=30, size=n)

df_a = pd.DataFrame({'RT': group_a, 'Group': 'Control'})
df_b = pd.DataFrame({'RT': group_b, 'Group': 'Treatment'})
df_ideal = pd.concat([df_a, df_b]).reset_index(drop=True)

df_ideal.head()

### Step 1: EDA & Diagnosis
We check the distribution of the data.

In [None]:
# Run EDA on the full dataset (visual check) or per group
eda = EDA_Diagnosis(df_ideal['RT'])
stats_desc = eda.descriptive_stats()
print("Descriptive Stats:", stats_desc)

# Check normality on the pooled column for brevity. Because the two group means differ,
# the pooled data is a mixture of two normals; checking each group (or the residuals) is more rigorous.
dist_check = eda.check_distribution()
print("Distribution Check:", dist_check)

eda.visualize()
plt.show()

**Interpretation:**
- Skewness is low (near 0).
- The Shapiro-Wilk test does not reject normality (p > 0.05).
- Histograms look bell-shaped.
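
The check above runs on the pooled column. As a rough sketch of the stricter per-group check, the cell below calls `scipy.stats.shapiro` on each group separately. The helper name `shapiro_by_group` and the regenerated sample data are assumptions made for illustration; they are not part of the `eda` module.

```python
import numpy as np
from scipy import stats

# Illustrative data with the same shape as Case Study A (not the notebook's exact draws)
rng = np.random.default_rng(42)
groups = {
    "Control": rng.normal(loc=300, scale=30, size=50),
    "Treatment": rng.normal(loc=320, scale=30, size=50),
}

def shapiro_by_group(samples, alpha=0.05):
    """Run Shapiro-Wilk on each group and collect W, p, and sample skewness."""
    report = {}
    for name, values in samples.items():
        w_stat, p_value = stats.shapiro(values)
        report[name] = {
            "W": float(w_stat),
            "p": float(p_value),
            "skew": float(stats.skew(values)),
            # p > alpha means normality is not rejected; it does not prove normality
            "normality_not_rejected": bool(p_value > alpha),
        }
    return report

normality_report = shapiro_by_group(groups)
for name, row in normality_report.items():
    print(name, row)
```

A group-wise check avoids flagging the data as non-normal only because the two group means differ.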

### Step 2: Method Selection

In [None]:
selector = MethodSelector(df_ideal, group_col='Group')

# Check Homogeneity
var_check = selector.check_homogeneity('RT')
print("Homogeneity (Levene):", var_check)

# Recommend
method, advice = selector.recommend_method(
    normality_check=dist_check['Is_Normal_Assumption'], 
    homogeneity_check=var_check['Homogeneity']
)

print(f"Recommended Method: {method}")
print(f"Advice: {advice}")
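
The internals of `MethodSelector` are not shown in this notebook. The sketch below is one plausible version of the decision for two independent groups, using `scipy.stats.levene` for the homogeneity check. The function name `choose_two_group_test` and the Mann-Whitney fallback are assumptions, not the module's actual behavior.

```python
import numpy as np
from scipy import stats

def choose_two_group_test(a, b, is_normal, alpha=0.05):
    """Hypothetical rule: map assumption checks to a two-sample test name."""
    # Median-centered Levene test (Brown-Forsythe variant) for equal variances
    _, p_levene = stats.levene(a, b, center="median")
    if not is_normal:
        # Assumed fallback when normality is rejected
        return "Mann-Whitney U", float(p_levene)
    if p_levene > alpha:
        return "Independent T-test", float(p_levene)
    return "Welch's T-test", float(p_levene)

# Same spread, shifted mean: equal variances, so Student's t-test is chosen
same_spread = choose_two_group_test(np.arange(10.0), np.arange(10.0) + 5, is_normal=True)
# Very different spread: Welch's t-test is chosen
diff_spread = choose_two_group_test(np.linspace(0, 1, 30), np.linspace(0, 100, 30), is_normal=True)
print(same_spread, diff_spread)
```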

### Step 3: Statistical Inference

In [None]:
inf = Inference(df_ideal)

if method == 'Independent T-test':
    res = inf.run_ttest('RT', 'Group', equal_var=True)
    print(res)
elif method == "Welch's T-test":
    res = inf.run_ttest('RT', 'Group', equal_var=False)
    print(res)
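else:
    # No t-test applies; the non-parametric fallback is not implemented in this demo
    print(f"No t-test selected (method: {method}); a non-parametric alternative is required.")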
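
As a cross-check that does not depend on the `Inference` class, the same three branches can be sketched with SciPy directly. The dispatcher `run_two_group_test` is a hypothetical helper, and using Mann-Whitney U as the fallback is an assumption about what the "alternative test" would be.

```python
import numpy as np
from scipy import stats

def run_two_group_test(method, a, b):
    """Hypothetical dispatcher from a method name to a SciPy two-sample test."""
    if method == "Independent T-test":
        return stats.ttest_ind(a, b, equal_var=True)   # Student's t-test
    if method == "Welch's T-test":
        return stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    # Assumed non-parametric fallback: rank-based Mann-Whitney U test
    return stats.mannwhitneyu(a, b, alternative="two-sided")

# Illustrative data with the same shape as Case Study A
rng = np.random.default_rng(42)
control = rng.normal(300, 30, 50)
treatment = rng.normal(320, 30, 50)

student = run_two_group_test("Independent T-test", control, treatment)
welch = run_two_group_test("Welch's T-test", control, treatment)
fallback = run_two_group_test("Mann-Whitney U", control, treatment)
print(student.statistic, welch.statistic, fallback.pvalue)
```

With equal group sizes, the Student and Welch t statistics coincide; the two tests differ only in their degrees of freedom, and therefore in their p-values.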

---
## Case Study B: The "Messy" Scenario (Real-world)

**Scenario:** Repeated measures from multiple subjects (hierarchical data), with right-skewed noise and outliers.
**Expectation:** The pipeline detects the hierarchical structure and recommends a linear mixed model (LMM).

In [None]:
# 1. Data Generation (Hierarchical + Skewed)
np.random.seed(99)
n_subjects = 20
n_trials = 10

data_rows = []
for subj in range(n_subjects):
    # Random intercept for each subject
    subj_intercept = np.random.normal(0, 5)
    for trial in range(n_trials):
        condition = np.random.choice(['A', 'B'])
        # Condition effect (A=0, B=2)
        cond_effect = 2 if condition == 'B' else 0
        # Lognormal noise (right-skewed, strictly positive; mean exp(0.125) ≈ 1.13)
        noise = np.random.lognormal(0, 0.5)
        
        value = 10 + subj_intercept + cond_effect + noise
        data_rows.append({'SubjectID': f'S{subj}', 'Condition': condition, 'Value': value})

df_messy = pd.DataFrame(data_rows)
df_messy.head()

### Step 1: EDA & Diagnosis

In [None]:
eda_messy = EDA_Diagnosis(df_messy['Value'])
print(eda_messy.descriptive_stats())
print(eda_messy.check_distribution())

# Check Outliers
outliers = eda_messy.check_outliers()
print(f"Number of Outliers (IQR): {len(outliers)}")

eda_messy.visualize()
plt.show()

**Interpretation:**
- Kurtosis and Skewness indicate non-normality.
- Plots show right-skew.
- Outliers detected.
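
For reference, a minimal sketch of the IQR rule that `check_outliers()` presumably applies, together with the sample skewness. The function name `iqr_outliers` and the fence multiplier `k=1.5` are assumptions; the module's own thresholds may differ.

```python
import numpy as np
from scipy import stats

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Small hand-checkable example: only 100 lies beyond the upper fence
flagged = iqr_outliers([1, 2, 3, 4, 5, 100])
print(flagged)

# Right-skewed sample: lognormal draws have positive skewness
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
sample_skew = stats.skew(skewed)
print(f"skewness = {sample_skew:.2f}, outliers = {len(iqr_outliers(skewed))}")
```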

### Step 2: Method Selection

In [None]:
# Initialize with ID column to signal Hierarchy
selector_messy = MethodSelector(df_messy, group_col='Condition', id_col='SubjectID')

# Normality is hard-coded to False based on the Step 1 diagnosis;
# homogeneity is assumed here rather than tested.
method_messy, advice_messy = selector_messy.recommend_method(
    normality_check=False,
    homogeneity_check=True
)

print(f"Recommended Method: {method_messy}")
print(f"Advice: {advice_messy}")
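
How `MethodSelector` uses `id_col` is not shown here. One simple way to detect a hierarchical structure, sketched below, is to test whether any ID appears on more than one row. The helper names `has_repeated_measures` and `describe_hierarchy` are hypothetical.

```python
import pandas as pd

def has_repeated_measures(df, id_col):
    """Return True when at least one ID occurs on more than one row."""
    return bool(df[id_col].duplicated().any())

def describe_hierarchy(df, id_col):
    """Summarize how many observations each subject contributes."""
    counts = df.groupby(id_col).size()
    return {
        "n_subjects": int(counts.size),
        "min_obs": int(counts.min()),
        "max_obs": int(counts.max()),
    }

# Toy data: subject S0 has two rows, so observations are not independent
toy = pd.DataFrame({
    "SubjectID": ["S0", "S0", "S1"],
    "Value": [10.2, 11.0, 9.7],
})
print(has_repeated_measures(toy, "SubjectID"), describe_hierarchy(toy, "SubjectID"))
```

When this check returns True, tests that assume independent observations (t-test, Mann-Whitney U) are not appropriate, which is why a mixed model is the expected recommendation.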

### Step 3: Statistical Inference (LMM)
Since LMM is recommended, we run it using `statsmodels`.

In [None]:
inf_messy = Inference(df_messy)

if 'LMM' in method_messy:
    print("Running Linear Mixed Model...")
    # Fixed effect of Condition plus a random intercept per subject
    # (lme4 notation: Value ~ Condition + (1|SubjectID)); the grouping column is passed separately
    res_lmm = inf_messy.run_lmm('Value ~ Condition', 'SubjectID')
    print(res_lmm.summary())
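
The body of `run_lmm` is not shown. As a sketch of what a random-intercept model looks like when fitted with `statsmodels` directly, the cell below uses `statsmodels.formula.api.mixedlm` on data simulated in the same way as Case Study B. The simulated draws differ from the notebook's, and the assumption that `run_lmm` wraps this call is unverified.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate hierarchical data: 20 subjects x 10 trials, true Condition B effect = 2
rng = np.random.default_rng(99)
rows = []
for subj in range(20):
    subj_intercept = rng.normal(0, 5)  # random intercept per subject
    for _ in range(10):
        condition = "B" if rng.random() < 0.5 else "A"
        cond_effect = 2.0 if condition == "B" else 0.0
        noise = rng.lognormal(0, 0.5)  # right-skewed noise
        rows.append({
            "SubjectID": f"S{subj}",
            "Condition": condition,
            "Value": 10 + subj_intercept + cond_effect + noise,
        })
df = pd.DataFrame(rows)

# Random-intercept model: fixed effect of Condition, grouped by subject
model = smf.mixedlm("Value ~ Condition", data=df, groups=df["SubjectID"])
result = model.fit()

# Estimated difference between Condition B and the reference level A
condition_effect = result.params["Condition[T.B]"]
print(f"Estimated Condition B effect: {condition_effect:.2f}")
```

The random intercept absorbs the between-subject variation, so the Condition estimate is driven by within-subject contrasts.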