# libsemx Demonstration

This notebook demonstrates the capabilities of `libsemx` using the datasets provided in the `data/` folder. We will cover:

1.  **Linear Mixed Models (LMM)** using `sleepstudy.csv`.
2.  **Structural Equation Modeling (SEM)** using `bfi.csv`.
3.  **Survival Analysis** using `ovarian_survival.csv`.

## Prerequisites

Ensure `libsemx` is installed and available in your Python environment. If you are running from the source repository, add its `python/` directory to the import path, either by setting `PYTHONPATH` or by extending `sys.path` as the first cell does.

In [None]:
import sys

# Add the python directory to sys.path if running from source
sys.path.append("../../python")

import pandas as pd
import numpy as np
import semx
import matplotlib.pyplot as plt

print(f"libsemx version: {getattr(semx, '__version__', 'unknown')}")

libsemx version: 0.0.0.dev0


## 1. Linear Mixed Models (LMM)

We will use the `sleepstudy` dataset to demonstrate a random slope model. The model predicts reaction time based on days of sleep deprivation, with random intercepts and slopes for each subject.

**Model:** `Reaction ~ Days + (Days | Subject)`
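
To make the formula concrete, the sketch below simulates data with the same structure: each subject gets its own intercept and its own `Days` slope, drawn around shared population values. Every number in it (`beta0`, `beta1`, `re_cov`, `sigma`) is a made-up illustration value, not an estimate from `sleepstudy`, and the code uses only NumPy and pandas rather than any `libsemx` call. Only the layout (18 subjects, days 0 to 9) mirrors the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_subjects, n_days = 18, 10

# Hypothetical population values, chosen only for illustration
beta0, beta1 = 2.5, 0.1                 # shared intercept and Days slope
re_cov = np.array([[0.060, 0.001],      # covariance of the per-subject
                   [0.001, 0.003]])     # (intercept, slope) deviations
sigma = 0.25                            # residual standard deviation

# One (intercept, slope) deviation pair per subject
b = rng.multivariate_normal(np.zeros(2), re_cov, size=n_subjects)

# Long format: one row per subject per day
subject = np.repeat(np.arange(n_subjects), n_days)
days = np.tile(np.arange(n_days), n_subjects)
mean = (beta0 + b[subject, 0]) + (beta1 + b[subject, 1]) * days

sim_df = pd.DataFrame({
    "Reaction": mean + rng.normal(0.0, sigma, size=mean.size),
    "Days": days,
    "Subject": subject,
})
print(sim_df.head())
```

The `(Days | Subject)` term corresponds to the two columns of `b`: one deviation for the intercept and one for the slope, allowed to covary within a subject.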

In [2]:
# Load data
sleep_df = pd.read_csv("../../data/sleepstudy.csv")

# Preprocess
# Drop the index column from Rdatasets
if "rownames" in sleep_df.columns:
    sleep_df = sleep_df.drop(columns=["rownames"])

# Scale Reaction from milliseconds to units of 100 ms to improve numerical
# stability and convergence speed. Raw values are roughly 200-450, which can
# cause ill-conditioning. All estimates below are therefore in units of 100 ms.
sleep_df["Reaction"] = sleep_df["Reaction"] / 100.0

# Subject is left as loaded (integer IDs); no factorization is applied here.
# libsemx also handles string group labels.
print(sleep_df.head())

# Define Model
lmm_model = semx.Model(
    equations=["Reaction ~ Days + (Days | Subject)"],
    families={"Reaction": "gaussian", "Days": "gaussian"},
    # 'Subject' is automatically inferred as grouping from the random effect syntax,
    # but we can be explicit if we want.
)

# Fit Model
# Raise max_iterations to give the optimizer room to converge
lmm_fit = lmm_model.fit(sleep_df, max_iterations=2000)

# Display Summary
print(lmm_fit.summary())

# Display Variance Components
print("\nVariance Components:")
print(lmm_fit.variance_components())

   Reaction  Days  Subject
0  2.495600     0      308
1  2.587047     1      308
2  2.508006     2      308
3  3.214398     3      308
4  3.568519     4      308
Optimization converged: True
Iterations: 222
Log-likelihood: -216.307
AIC: 440.6, BIC: 453.4

                       Estimate  Std.Error   z-value       P(>|z|)
beta_Reaction_on_Days  0.329581   0.070998  4.642141  3.448173e-06
cov_re_1_0             2.465010   0.434149  5.677803  1.364359e-08
cov_re_1_1            -0.210931   0.081405 -2.591128  9.566202e-03
cov_re_1_2             0.000001   0.028010  0.000044  9.999649e-01

Variance Components:
  Group      Name1 Name2  Variance  Std.Dev  Corr
Subject _intercept        6.076276 2.465010   NaN
Subject _intercept  Days -0.519946      NaN  -1.0
Subject       Days        0.044492 0.210931   NaN
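
The `cov_re_*` rows in the summary and the variance-components table appear to be linked through a Cholesky factorization: treating the three `cov_re_1_*` estimates as the entries of a lower-triangular matrix `L` and forming `L @ L.T` reproduces the reported variances and covariance. This reading is inferred from the printed numbers, not from `libsemx` documentation, so take the mapping of parameters to matrix positions as an assumption.

```python
import numpy as np

# Estimates copied from the summary above.
# Their placement inside L is an assumption, not documented behavior.
L = np.array([
    [2.465010, 0.0],        # cov_re_1_0
    [-0.210931, 0.000001],  # cov_re_1_1, cov_re_1_2
])

cov = L @ L.T                       # implied random-effects covariance matrix
sd = np.sqrt(np.diag(cov))          # standard deviations of intercept and slope
corr = cov[0, 1] / (sd[0] * sd[1])  # intercept-slope correlation

print(np.round(cov, 6))
print(np.round(sd, 6))
print(round(corr, 4))
```

A second diagonal entry of essentially zero means the slope deviation is a fixed multiple of the intercept deviation, which is why the correlation comes out at -1.0. That pattern usually signals a fit on the boundary of the parameter space, so the random-effects estimates deserve caution.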



## 2. Structural Equation Modeling (SEM)

We will use the `bfi` dataset to perform a Confirmatory Factor Analysis (CFA) of the Big Five personality traits.

**Latent Factors:**
*   **Agreeableness**: A1, A2, A3, A4, A5
*   **Conscientiousness**: C1, C2, C3, C4, C5
*   **Extraversion**: E1, E2, E3, E4, E5
*   **Neuroticism**: N1, N2, N3, N4, N5
*   **Openness**: O1, O2, O3, O4, O5
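
Because every factor is measured by five items sharing a one-letter prefix, the `=~` measurement equations can be generated rather than typed out. The sketch below does that in plain Python; `factor_prefixes` and `build_equations` are hypothetical names introduced for illustration, not part of `libsemx`.

```python
# Factor name -> item prefix, following the list above
factor_prefixes = {
    "Agreeableness": "A",
    "Conscientiousness": "C",
    "Extraversion": "E",
    "Neuroticism": "N",
    "Openness": "O",
}

def build_equations(prefixes, n_items=5):
    """Return one 'Factor =~ X1 + ... + Xn' string per factor."""
    equations = []
    for factor, prefix in prefixes.items():
        items = " + ".join(f"{prefix}{i}" for i in range(1, n_items + 1))
        equations.append(f"{factor} =~ {items}")
    return equations

equations = build_equations(factor_prefixes)
for eq in equations:
    print(eq)
```

Generating the strings keeps the factor structure in one place, which helps when trying variants such as dropping an item or a whole factor.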

In [3]:
# Load data
bfi_df = pd.read_csv("../../data/bfi.csv")

# Preprocess
if "rownames" in bfi_df.columns:
    bfi_df = bfi_df.drop(columns=["rownames"])

# Drop missing values
bfi_df = bfi_df.dropna()

# SIMPLIFICATION: Use a subset of data for demonstration purposes
# The full dataset (N=2800) with 25 variables creates a very large covariance matrix
# if not optimized for block-diagonal structure. We use N=1000 for this demo.
bfi_df = bfi_df.head(1000).copy()

print(f"Data shape for analysis: {bfi_df.shape}")

# Standardize the data (Critical for SEM convergence speed and stability)
# We standardize the personality items (A1-A5, C1-C5, etc.)
item_cols = [f"{L}{i}" for L in "ACENO" for i in range(1, 6)]
# Ensure these columns exist in the dataframe
item_cols = [c for c in item_cols if c in bfi_df.columns]

# Apply z-score standardization
bfi_df[item_cols] = (bfi_df[item_cols] - bfi_df[item_cols].mean()) / bfi_df[item_cols].std()

# Define Model
# We define the measurement model for the 5 factors.
equations = [
    "Agreeableness =~ A1 + A2 + A3 + A4 + A5",
    "Conscientiousness =~ C1 + C2 + C3 + C4 + C5",
    "Extraversion =~ E1 + E2 + E3 + E4 + E5",
    "Neuroticism =~ N1 + N2 + N3 + N4 + N5",
    "Openness =~ O1 + O2 + O3 + O4 + O5"
]

# Specify families for observed variables
families = {var: "gaussian" for var in item_cols}

cfa_model = semx.Model(
    equations=equations,
    families=families
)

# Fit Model
print("Fitting CFA model...")
try:
    cfa_fit = cfa_model.fit(bfi_df, max_iterations=1000)
    # Display Summary
    print(cfa_fit.summary())
except Exception as e:
    print(f"Model fitting failed: {e}")

Data shape for analysis: (1000, 28)
Adding default covariances for 5 latents
Fitting CFA model...
Optimization converged: False
Iterations: 1000
Log-likelihood: -33779.445
AIC: 67628.9, BIC: 67800.7

                                         Estimate  Std.Error  z-value  P(>|z|)
lambda_A2_on_Agreeableness              -2.103898        NaN      NaN      NaN
lambda_A3_on_Agreeableness              -2.257983        NaN      NaN      NaN
lambda_A4_on_Agreeableness              -1.781128        NaN      NaN      NaN
lambda_A5_on_Agreeableness              -2.266001        NaN      NaN      NaN
lambda_C2_on_Conscientiousness           1.113348        NaN      NaN      NaN
lambda_C3_on_Conscientiousness           1.022162        NaN      NaN      NaN
lambda_C4_on_Conscientiousness          -1.184976        NaN      NaN      NaN
lambda_C5_on_Conscientiousness          -1.095560        NaN      NaN      NaN
lambda_E2_on_Extraversion                1.250735        NaN      NaN      NaN
lambda_E3_
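
The information criteria in the fit header can be checked by hand from the log-likelihood with the standard formulas `AIC = 2k - 2*logL` and `BIC = k*ln(n) - 2*logL`. The parameter count is not printed, so `k = 35` below is an inference: it is the value implied by the reported AIC. `information_criteria` is a hypothetical helper, not a `libsemx` function.

```python
import math

def information_criteria(loglik, k, n):
    """Return (AIC, BIC) for a fit with k free parameters and n observations."""
    aic = 2 * k - 2 * loglik
    bic = k * math.log(n) - 2 * loglik
    return aic, bic

# Log-likelihood and sample size from the output above; k is inferred, not reported
aic, bic = information_criteria(loglik=-33779.445, k=35, n=1000)
print(f"AIC: {aic:.1f}, BIC: {bic:.1f}")
```

With these inputs both values agree with the header to one decimal place, which supports the inferred parameter count.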

## 3. Survival Analysis

We will use the `ovarian_survival` dataset to demonstrate a survival model. We predict survival time based on age and treatment group.

**Model:** `Surv(futime, fustat) ~ age + rx`

In [4]:
# Load data
ovarian_df = pd.read_csv("../../data/ovarian_survival.csv")

# Preprocess
if "rownames" in ovarian_df.columns:
    ovarian_df = ovarian_df.drop(columns=["rownames"])

print(ovarian_df.head())

# Define Model
# Note: 'rx' is the treatment group, coded 1 or 2. We enter it as a numeric predictor for simplicity;
# a categorical predictor with more than two levels would need dummy encoding.
surv_model = semx.Model(
    equations=["Surv(futime, fustat) ~ age + rx"],
    # For survival models, the family is typically "weibull" or "exponential".
    # The key in families dict matches the time variable name.
    families={"futime": "weibull", "age": "gaussian", "rx": "gaussian"}
)

# Fit Model
surv_fit = surv_model.fit(ovarian_df)

# Display Summary
print(surv_fit.summary())

   futime  fustat      age  resid.ds  rx  ecog.ps
0      59       1  72.3315         2   1        1
1     115       1  74.4932         2   1        1
2     156       1  66.4658         2   1        2
3     421       0  53.3644         2   2        1
4     431       1  50.3397         2   1        1
Optimization converged: True
Iterations: 16
Log-likelihood: -42480.951
Chi-square: 84373.198 (df=7)
P-value: 0.000
CFI: -5669.504, TLI: -2429.216, RMSEA: 21.530, SRMR: 0.762
AIC: 84965.9, BIC: 84968.4

                    Estimate  Std.Error   z-value   P(>|z|)
beta_futime_on_age  0.058006   0.026852  2.160249  0.030753
beta_futime_on_rx   3.346563   1.165783  2.870657  0.004096
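
The `Surv(futime, fustat)` term pairs each follow-up time with an event indicator, and the two kinds of rows enter a parametric likelihood differently: an observed event (`fustat = 1`) contributes the log-density at its time, while a censored row (`fustat = 0`) contributes only the log-survival probability. The sketch below writes this out for a Weibull model without covariates, using the five rows printed above. The shape/scale parameterization, the function name `weibull_loglik`, and the parameter values are assumptions for illustration; `libsemx` may parameterize its Weibull family differently.

```python
import numpy as np

def weibull_loglik(time, event, shape, scale):
    """Right-censored Weibull log-likelihood (shape/scale parameterization)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    cum_hazard = (time / scale) ** shape  # H(t) = -log S(t)
    log_hazard = np.log(shape / scale) + (shape - 1.0) * np.log(time / scale)
    # Events add log h(t) - H(t) = log f(t); censored rows add only -H(t) = log S(t)
    return float(np.sum(event * log_hazard - cum_hazard))

# First five rows of the ovarian data shown above
futime = [59, 115, 156, 421, 431]
fustat = [1, 1, 1, 0, 1]

# Illustrative parameter values, not fitted estimates
ll_weibull = weibull_loglik(futime, fustat, shape=1.5, scale=500.0)
ll_exponential = weibull_loglik(futime, fustat, shape=1.0, scale=500.0)
print(ll_weibull, ll_exponential)
```

Setting `shape=1.0` reduces the Weibull to the exponential model, the other family mentioned in the model definition, which gives a quick sanity check on the formula.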


## Conclusion

This notebook demonstrated the versatility of `libsemx` in handling different types of statistical models:
1.  **Mixed Models** for hierarchical data.
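2.  **SEM/CFA** for latent variable modeling (the CFA run shown here reached the 1000-iteration limit without converging, so its estimates are provisional).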
3.  **Survival Analysis** for time-to-event data.

Explore the `data/` folder for more datasets like `pbc.csv` (mixed outcomes) and `mdp_*.csv` (genomic data) to try more advanced features!