In this example, we need to import `numpy`, `pandas`, `graphviz`, and `IPython` in addition to `lingam`.

In [3]:
import numpy as np
import pandas as pd
import graphviz
import IPython
from IPython.display import display, HTML
import lingam

print([np.__version__, pd.__version__, graphviz.__version__, IPython.__version__, lingam.__version__])

np.set_printoptions(precision=3, suppress=True)
np.random.seed(100)

['1.26.4', '2.2.3', '0.20.3', '8.2.0', '1.10.0']


# Missingness-LiNGAM (m-LiNGAM)
This notebook explains how to use `mLiNGAM` to learn a LiNGAM model when the dataset contains missing values. `mLiNGAM` also reconstructs the missingness mechanism that caused the data to be missing, which makes it possible to identify the missingness category of each variable: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
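
As a rough illustration of the three categories, the sketch below masks a variable `y` under each mechanism. It uses plain NumPy and is independent of `lingam`; the variable names, coefficients, and thresholds are made up for this illustration. Under MCAR the missingness is independent of the data, under MAR it depends only on a fully observed variable, and under MNAR it depends on the partially observed variable itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.laplace(size=n)             # fully observed variable
y = 0.9 * x + rng.laplace(size=n)   # variable that will receive missing entries

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# In each mask, True marks an entry of y that would be missing.
r_mcar = rng.random(n) < 0.3               # MCAR: independent of x and y
r_mar = rng.random(n) < sigmoid(x - 1.0)   # MAR: depends only on the observed x
r_mnar = rng.random(n) < sigmoid(y - 1.0)  # MNAR: depends on y itself

print("mean of y on all rows:", round(y.mean(), 3))
for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]:
    print(name, "missing fraction:", round(r.mean(), 3),
          "mean of observed y:", round(y[~r].mean(), 3))
```

In this toy setup the mean of the observed `y` stays close to the full-sample mean only under MCAR; under MAR and MNAR the larger values of `y` are removed more often, so the observed mean shifts downward.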

First, we initialize a test graph and generate a dataset affected by missingness.

In [13]:
# Initialize the parameters
sample_size=5000
B_true = np.array([[0,0,0,0],[1.1,0,0,0],[0,0.9,0,0],[0.8,0,1.2,0]])

# Generate the data
X = np.random.laplace(size=sample_size, scale=1.0)
Z = 1.1*X + np.random.laplace(size=sample_size, scale=1.0)
Y = 0.9*Z + np.random.laplace(size=sample_size, scale=1.0)
W = 0.8*X + 1.2*Y + np.random.laplace(size=sample_size, scale=1.0)
Ry = (0.5*W -1 + np.random.logistic(size=sample_size, scale=1.0))>0

# Mask missing values
Ys = Y.copy()
Ys[Ry==1]=np.nan

# Dataset
ds = pd.DataFrame(np.array([X, Z, Ys, W]).T ,columns=['X', 'Z', 'Ys', 'W'])
print("Fraction of rows affected by missing data: ", Ys[Ry==1].size/Ys.size)

Fraction of rows affected by missing data:  0.3558
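
Because the noise added to `0.5*W - 1` is standard logistic, the mechanism above is a logistic model: $P(R_Y = 1 \mid W) = \sigma(0.5\,W - 1)$, where $\sigma$ is the sigmoid function. The following sketch checks this numerically. The Laplace stand-in for `W`, the sample size, and the bin edges are assumptions chosen only for this check; they are not part of the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
w = rng.laplace(size=n, scale=3.0)             # stand-in for W (illustrative only)
ry = (0.5 * w - 1 + rng.logistic(size=n)) > 0  # same rule as in the cell above

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Compare the empirical missing rate with the logistic prediction inside bins of w.
edges = np.array([-4.0, -2.0, 0.0, 2.0, 4.0, 6.0])
rows = []
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (w >= lo) & (w < hi)
    empirical = ry[in_bin].mean()
    predicted = sigmoid(0.5 * w[in_bin] - 1).mean()
    rows.append((lo, hi, empirical, predicted))
    print(f"w in [{lo:4.1f}, {hi:4.1f}): empirical={empirical:.3f}, predicted={predicted:.3f}")
```

The coefficient 0.5 on `W` is the value drawn on the ground-truth edge from $W$ to $R_Y$ later in the notebook.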


Then, we run both `DirectLiNGAM` and `mLiNGAM` on the same dataset. Since `DirectLiNGAM` cannot handle missing values, we apply it after list-wise deletion, that is, after removing every row that contains at least one missing value.
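
As a small aside, list-wise deletion can be sketched on a toy table as follows. The DataFrame `toy` and its values are hypothetical and only illustrate the operation; the boolean-mask form and `DataFrame.dropna()` should select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two incomplete rows
toy = pd.DataFrame({
    "X": [0.1, 0.2, 0.3, 0.4],
    "Ys": [1.0, np.nan, 3.0, np.nan],
    "W": [5.0, 6.0, 7.0, 8.0],
})

# Keep only rows without any missing entry (list-wise deletion)
has_missing = toy.isna().any(axis=1)
complete_cases = toy[~has_missing]

print(complete_cases)
print("Same rows as dropna():", complete_cases.equals(toy.dropna()))
```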

In [None]:
dl = lingam.DirectLiNGAM(random_state=100, prior_knowledge=None)
dl.fit(ds[~np.any(np.isnan(ds), axis=1)])

ml = lingam.mLiNGAM(random_state=100, prior_knowledge=None)
ml.fit(ds);

We can now compare the ground truth graph, the `mLiNGAM` output graph, and the list-wise deletion `DirectLiNGAM` graph.

In [15]:
node_labels = ['X', 'Z', 'Y', 'W']

dotTrue = graphviz.Digraph(engine='neato')
dotML = graphviz.Digraph(engine='neato')
dotDL = graphviz.Digraph(engine='neato')

# Add nodes with labels
for d in [dotTrue, dotML, dotDL]:
    d.node('0', 'X', pos='0,1!')
    d.node('1', 'Z', pos='1,2!')
    d.node('2', 'Y', pos='2,1!')
    d.node('3', 'W', pos='1,0!')

# Add edges from adjacency matrix
n = B_true.shape[0]
for i in range(n):
    for j in range(n):
        weightTrue = B_true[j, i]
        weightDL = dl.adjacency_matrix_[j, i]
        weightML = ml.adjacency_matrix_[j, i]
        if B_true[j, i] != 0:
            dotTrue.edge(str(i), str(j), label=str(round(weightTrue, 2)))
        if ml.adjacency_matrix_[j, i] != 0:
            dotML.edge(str(i), str(j), label=str(round(weightML, 2)))
        if dl.adjacency_matrix_[j, i] != 0:
            dotDL.edge(str(i), str(j), label=str(round(weightDL, 2)))

# Add the missingness indicator Ry to the ground truth and to m-LiNGAM's output
dotML.node('4', 'Ry', pos='3,0!')
dotTrue.node('4', 'Ry', pos='3,0!')
dotTrue.edge('3', '4', label=str(0.5))
for k, missingness_parent in enumerate(ml._missingness_mechanisms_parents[2]):
    dotML.edge(str(missingness_parent), '4', label=str(round(ml._missingness_mechanisms_coef[2][k+1], 2)))

# Render the three graphs as SVG strings
svgTrue = dotTrue.pipe(format='svg').decode('utf-8')
svgML = dotML.pipe(format='svg').decode('utf-8')
svgDL = dotDL.pipe(format='svg').decode('utf-8')

# Display side by side using HTML
html = f"""
<div style="display: flex; gap: 40px; align-items: flex-start;">
<div>
    <h3 style="text-align: center;">Ground Truth</h3>
    {svgTrue}
  </div>
  <div>
    <h3 style="text-align: center;">m-LiNGAM</h3>
    {svgML}
  </div>
  <div>
    <h3 style="text-align: center;">List-wise deletion DirectLiNGAM</h3>
    {svgDL}
  </div>
</div>
"""

display(HTML(html))

As expected, naively applying `DirectLiNGAM` to the list-wise deleted dataset produces extraneous edges and biased parameter estimates, whereas `mLiNGAM` recovers both the graph structure and the parameters more accurately. Note that `mLiNGAM` also reconstructs the missingness mechanism, correctly identifying that the data are missing at random (MAR), with $W$ as the parent of $R_Y$, the missingness indicator of variable $Y$.
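
To give some intuition for why list-wise deletion biases the parameters here, the sketch below regenerates data from the same structural equations and compares an ordinary least-squares slope of $Y$ on $Z$ computed on all rows with the slope computed only on rows where $Y$ is observed. This is a plain OLS regression used as a simplified stand-in, not what `DirectLiNGAM` does internally; the helper `ols_slope`, the seed, and the sample size are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Same structural equations as in the data-generation cell
x = rng.laplace(size=n)
z = 1.1 * x + rng.laplace(size=n)
y = 0.9 * z + rng.laplace(size=n)
w = 0.8 * x + 1.2 * y + rng.laplace(size=n)
ry = (0.5 * w - 1 + rng.logistic(size=n)) > 0  # True where Y would be missing

def ols_slope(pred, target):
    """Slope of a least-squares fit of target on pred, with an intercept."""
    design = np.column_stack([np.ones_like(pred), pred])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1]

slope_full = ols_slope(z, y)                # uses every row (not available in practice)
slope_complete = ols_slope(z[~ry], y[~ry])  # uses only rows where Y is observed

print("true coefficient Z -> Y: 0.9")
print("slope on all rows:      ", round(slope_full, 3))
print("slope on complete cases:", round(slope_complete, 3))
```

Because $R_Y$ depends on $W$, which is a descendant of $Y$, keeping only the complete cases selects rows with systematically lower values of $Y$ at high values of $Z$, which attenuates the estimated slope.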