# Chapter 4 - Testing the Causal Markov Property on the Transportation DAG

This notebook is a code companion to Chapter 4 of the book [Causal Machine Learning](https://www.manning.com/books/causal-machine-learning) by Robert Osazuwa Ness. It is not an exact copy of the code in Chapter 4, but it captures the core elements.

<a href="https://colab.research.google.com/github/altdeep/causalML/blob/master/book/chapter%204/Testing_Markov_Property_on_Transportation_DAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#!pip install pgmpy==0.1.24

Causal relationships impose conditional independence constraints on the joint probability distribution of the variables in the data generating process. [D-separation](https://networkx.org/documentation/stable/reference/algorithms/d_separation.html) is a graphical criterion used to determine whether a set of variables is independent of another set of variables, given a third set. The causal Markov assumption says that if our selected causal DAG is true, then variables that are d-separated in the graph will be conditionally independent. Let’s revisit the transportation model:

![transportation DAG](images/transportation_DAG.png)

* Age (A): Recorded as young (young) for individuals up to and including 29 years, adult (adult) for individuals between 30 and 60 years old (inclusive), and old (old) for people 61 and over.
* Gender (S): The self-reported gender of an individual, recorded as male (M), female (F), or other (O).
* Education (E): The highest level of education or training completed by the individual, recorded as either high school (high) or university degree (uni).
* Occupation (O): Whether the individual is an employee (emp) or a self-employed (self) worker.
* Residence (R): The population size of the city the individual lives in, recorded as small (small) or big (big).
* Travel (T): The means of transport favored by the individual, recorded as car (car), train (train) or other (other).

Age (A) and Gender (S) determine Education (E). Education causes Occupation (O) and Residence (R). Occupation and Residence cause Travel (T).
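To make d-separation concrete, here is a minimal pure-Python sketch that checks individual d-separation statements on this DAG using the standard moralized-ancestral-graph criterion. The names `EDGES` and `d_separated` are illustrative and are not part of pgmpy; the notebook itself relies on pgmpy to enumerate these statements.

```python
from itertools import combinations

# Edges of the transportation DAG described above, as (parent, child) pairs.
EDGES = [("A", "E"), ("S", "E"), ("E", "O"), ("E", "R"), ("O", "T"), ("R", "T")]


def d_separated(edges, xs, ys, zs):
    """Return True if xs and ys are d-separated given zs.

    Illustrative helper (not a pgmpy function). Assumes xs, ys, zs are disjoint.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for parent, child in edges:
        parents.setdefault(parent, set())
        parents.setdefault(child, set()).add(parent)

    # Step 1: keep only the nodes in xs, ys, zs and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, set()))

    # Step 2: moralize. Link each child to its parents, link co-parents
    # of a common child, and treat every link as undirected.
    neighbors = {node: set() for node in keep}
    for child in keep:
        child_parents = parents.get(child, set())
        for parent in child_parents:
            neighbors[child].add(parent)
            neighbors[parent].add(child)
        for p, q in combinations(sorted(child_parents), 2):
            neighbors[p].add(q)
            neighbors[q].add(p)

    # Step 3: remove the conditioning set and search for a path from xs to ys.
    seen, stack = set(), list(xs)
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue
        seen.add(node)
        stack.extend(neighbors[node])
    return not (seen & ys)


print(d_separated(EDGES, {"A"}, {"S"}, set()))       # True: E is an unobserved collider
print(d_separated(EDGES, {"A"}, {"S"}, {"E"}))       # False: conditioning on E opens the collider
print(d_separated(EDGES, {"A"}, {"T"}, {"O", "R"}))  # True: O and R block both paths to T
print(d_separated(EDGES, {"O"}, {"R"}, {"E", "T"}))  # False: conditioning on T opens O -> T <- R
```

The second and fourth calls show why conditioning on more variables does not always help: conditioning on a common effect can create a dependence that was not there before.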

In [2]:
import pandas as pd
import numpy as np
from pgmpy.base.DAG import DAG
from pgmpy.estimators.CITests import chi_square
from pgmpy.independencies import IndependenceAssertion

First we build the graph. The method `get_independencies` will enumerate the d-separation statements that hold for this graph. Note that pgmpy attempts to remove redundant d-separations from this enumeration.

In [3]:
G = DAG()
G.add_edges_from(
    [
      ('A','E'),
      ('S','E'),
      ('E','O'),
      ('E','R'),
      ('O','T'),
      ('R','T')
    ]
)

dseps = G.get_independencies()
print(dseps)
print(f"Total number of assertions: {len(dseps.get_assertions())}") 

(A ⟂ S)
(A ⟂ O, R, T | E)
(A ⟂ O, R, T | E, S)
(A ⟂ T | O, R)
(A ⟂ R, T | O, E)
(A ⟂ O, T | R, E)
(A ⟂ O, R | E, T)
(A ⟂ T | O, R, S)
(A ⟂ R, T | O, E, S)
(A ⟂ O, T | R, E, S)
(A ⟂ O, R | T, E, S)
(A ⟂ T | O, R, E)
(A ⟂ R | O, E, T)
(A ⟂ O | R, E, T)
(A ⟂ T | O, R, E, S)
(A ⟂ R | T, O, E, S)
(A ⟂ O | T, R, E, S)
(S ⟂ A)
(S ⟂ O, R, T | E)
(S ⟂ O, R, T | A, E)
(S ⟂ T | O, R)
(S ⟂ R, T | O, E)
(S ⟂ O, T | R, E)
(S ⟂ O, R | E, T)
(S ⟂ T | A, O, R)
(S ⟂ R, T | A, O, E)
(S ⟂ O, T | A, R, E)
(S ⟂ O, R | A, E, T)
(S ⟂ T | O, R, E)
(S ⟂ R | O, E, T)
(S ⟂ O | R, E, T)
(S ⟂ T | A, O, E, R)
(S ⟂ R | A, O, E, T)
(S ⟂ O | A, R, E, T)
(O ⟂ A, R, S | E)
(O ⟂ R, S | A, E)
(O ⟂ A, R | E, S)
(O ⟂ A, S | R, E)
(O ⟂ A, S | E, T)
(O ⟂ R | A, E, S)
(O ⟂ S | A, R, E)
(O ⟂ S | A, E, T)
(O ⟂ A | R, E, S)
(O ⟂ A | T, E, S)
(O ⟂ A, S | R, E, T)
(O ⟂ S | A, R, E, T)
(O ⟂ A | T, R, E, S)
(R ⟂ A, O, S | E)
(R ⟂ O, S | A, E)
(R ⟂ A, O | E, S)
(R ⟂ A, S | O, E)
(R ⟂ A, S | E, T)
(R ⟂ O | A, E, S)
(R ⟂ S | A, O, E)
(R 

We test our DAG by checking whether the data show evidence of these conditional independencies. First, let's load the data.

In [4]:
survey_url = "https://raw.githubusercontent.com/altdeep/causalML/master/datasets/transportation_survey.csv"
full_data = pd.read_csv(survey_url)

The easiest way (but not the only way) to test for conditional independence is to run a classical [frequentist statistical hypothesis test](https://en.wikipedia.org/wiki/Statistical_hypothesis_test) for independence. Below, I write a function that will run a [Chi-square test for independence](https://en.wikipedia.org/wiki/Chi-squared_test). The core element of the code is the function:

`chi_square(X=X, Y=Y, Z=Z, data=data, boolean=True, significance_level=significance)`

If the `boolean` argument is False, the test returns a tuple containing the chi-squared test statistic, the p-value, and the [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)) used to calculate the p-value. In this test, lower p-values are evidence against independence, and therefore evidence against our model.
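To give a feel for what a conditional chi-square test computes, here is a dependency-free sketch that splits the data by the values of Z, computes a Pearson chi-square statistic for X versus Y within each stratum, and sums the statistics and degrees of freedom. This is one common construction and is not claimed to match pgmpy's implementation in every detail. The helpers `conditional_chi_square` and `make_rows` and the toy rows are made up for illustration.

```python
from collections import Counter, defaultdict


def conditional_chi_square(rows, x, y, z):
    """Return (statistic, dof) for testing x independent of y given z.

    Illustrative helper, not pgmpy's implementation: a Pearson chi-square
    statistic is computed within each stratum of z, then summed.
    """
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[name] for name in z)].append((row[x], row[y]))

    statistic, dof = 0.0, 0
    for pairs in strata.values():
        n = len(pairs)
        x_counts = Counter(a for a, _ in pairs)
        y_counts = Counter(b for _, b in pairs)
        observed = Counter(pairs)
        for a, n_a in x_counts.items():
            for b, n_b in y_counts.items():
                expected = n_a * n_b / n  # count expected under independence
                statistic += (observed[(a, b)] - expected) ** 2 / expected
        dof += (len(x_counts) - 1) * (len(y_counts) - 1)
    return statistic, dof


def make_rows(e, counts):
    """Expand {(S, T): count} into one dict per made-up respondent."""
    return [{"E": e, "S": s, "T": t} for (s, t), k in counts.items() for _ in range(k)]


# Toy data (not the survey): within each level of E, S and T are exactly independent.
rows = make_rows("high", {("M", "car"): 2, ("M", "train"): 1, ("F", "car"): 2, ("F", "train"): 1})
rows += make_rows("uni", {("M", "car"): 1, ("M", "train"): 2, ("F", "car"): 1, ("F", "train"): 2})

print(conditional_chi_square(rows, "S", "T", ["E"]))  # S vs T within levels of E
print(conditional_chi_square(rows, "T", "E", []))     # T vs E with no conditioning
```

The p-value is then the upper-tail probability of a chi-square distribution with that many degrees of freedom (for example via `scipy.stats.chi2.sf`); it is left out here to keep the sketch free of dependencies.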

For each of the above d-separation statements, I run the test.

In [5]:
def test_dsep(dsep: IndependenceAssertion, data, boolean=True, significance=0.01):
  """Run a chi-square test for each pair (X, Y) in the assertion (Xs ⟂ Ys | Z)."""
  xs, ys, zs = dsep.get_assertion()
  Z = list(zs)
  test_outputs = []
  # An assertion such as (A ⟂ O, R, T | E) is split into one test per pair.
  for X in xs:
    for Y in ys:
      test_result = chi_square(X=X, Y=Y, Z=Z, data=data, boolean=boolean, significance_level=significance)
      test_outputs.append((IndependenceAssertion(X, Y, Z), test_result))
  return test_outputs

results = [test_dsep(dsep, data=full_data, boolean=False) for dsep in dseps.get_assertions()]
results_flat = [item for sublist in results for item in sublist]
results = {k: v for k, v in results_flat}
print(results)

{(A ⟂ S): (1.9853449962420004, 0.37058497877829005, 2), (A ⟂ O | E): (2.0985435790243034, 0.7176399600516277, 4), (A ⟂ R | E): (2.8026991954963454, 0.5913668739613169, 4), (A ⟂ T | E): (10.693950409937166, 0.21965055000059597, 8), (A ⟂ O | E, S): (6.5410721048388965, 0.5868555930315218, 8), (A ⟂ R | E, S): (9.27447234495111, 0.31967361894631896, 8), (A ⟂ T | E, S): (16.9969080647128, 0.3857972070728527, 16), (A ⟂ T | O, R): (17.303945408152316, 0.13851716499180977, 12), (A ⟂ R | O, E): (3.8133663932876196, 0.7019157718399125, 6), (A ⟂ T | O, E): (14.661327352311131, 0.26048193401591613, 12), (A ⟂ O | R, E): (3.1768654060661357, 0.7863434276852411, 6), (A ⟂ T | R, E): (13.067469907330542, 0.4426137158614294, 13), (A ⟂ O | E, T): (4.857705338926742, 0.9004783671841152, 10), (A ⟂ R | E, T): (7.405368145500495, 0.6867011056470256, 10), (A ⟂ T | O, R, S): (25.44919823281326, 0.276079664102723, 22), (A ⟂ R | O, E, S): (9.337520274798791, 0.40671936433378186, 9), (A ⟂ T | O, E, S): (16.260859

To make life a bit easier, we can define a cut-off and evaluate how many tests meet (or fail to meet) the cut-off. Since this is a statistical hypothesis test, the most straightforward cut-off is a [significance level](https://en.wikipedia.org/wiki/Statistical_significance), against which we can directly compare the p-value. If the `boolean` argument in `chi_square` is True and we provide a significance level, then the test returns True if the p-value is greater than the significance level (the test fails to reject independence); otherwise it returns False. Below, we'll look at the proportion of tests that beat this significance level.

In [6]:
results = [test_dsep(dsep, data=full_data, boolean=True, significance=.1) for dsep in dseps.get_assertions()]
results_flat = [item for sublist in results for item in sublist]
results = {k: v for k, v in results_flat}
print(results)
sum(results.values())/len(dseps.get_assertions())

{(A ⟂ S): True, (A ⟂ O | E): True, (A ⟂ R | E): True, (A ⟂ T | E): True, (A ⟂ O | E, S): True, (A ⟂ R | E, S): True, (A ⟂ T | E, S): True, (A ⟂ T | O, R): True, (A ⟂ R | O, E): True, (A ⟂ T | O, E): True, (A ⟂ O | R, E): True, (A ⟂ T | R, E): True, (A ⟂ O | E, T): True, (A ⟂ R | E, T): True, (A ⟂ T | O, R, S): True, (A ⟂ R | O, E, S): True, (A ⟂ T | O, E, S): True, (A ⟂ O | R, E, S): True, (A ⟂ T | R, E, S): True, (A ⟂ O | S, E, T): True, (A ⟂ R | S, E, T): True, (A ⟂ T | O, R, E): True, (A ⟂ R | O, E, T): True, (A ⟂ O | R, E, T): True, (A ⟂ T | O, R, E, S): True, (A ⟂ R | O, S, E, T): True, (A ⟂ O | S, R, E, T): True, (S ⟂ O | E): True, (S ⟂ R | E): True, (S ⟂ T | E): True, (S ⟂ O | A, E): True, (S ⟂ R | A, E): True, (S ⟂ T | A, E): True, (S ⟂ T | O, R): True, (S ⟂ R | O, E): True, (S ⟂ T | O, E): True, (S ⟂ O | R, E): True, (S ⟂ T | R, E): True, (S ⟂ O | E, T): True, (S ⟂ R | E, T): True, (S ⟂ T | A, O, R): True, (S ⟂ R | A, O, E): True, (S ⟂ T | A, O, E): True, (S ⟂ O | A, R, E): Tr

0.7625
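The decision rule described above can be sketched in a few lines. The p-values below are copied from the `boolean=False` output earlier in the notebook, with plain strings standing in for the `IndependenceAssertion` keys. The helpers `passes` and `pass_rate` are hypothetical names, and the strict `>` comparison follows the description above rather than a reading of pgmpy's source.

```python
# A handful of p-values copied from the boolean=False output above.
p_values = {
    "A ⟂ S": 0.37058497877829005,
    "A ⟂ O | E": 0.7176399600516277,
    "A ⟂ T | E": 0.21965055000059597,
    "A ⟂ R | E, S": 0.31967361894631896,
    "A ⟂ T | O, R": 0.13851716499180977,
}


def passes(p_value, significance_level):
    """True when the test fails to reject independence at the given level."""
    return p_value > significance_level


def pass_rate(p_values, significance_level):
    """Fraction of the tests actually run that fail to reject independence."""
    decisions = [passes(p, significance_level) for p in p_values.values()]
    return sum(decisions) / len(decisions)


print(pass_rate(p_values, 0.1))   # all five p-values exceed 0.1
print(pass_rate(p_values, 0.25))  # a stricter cut-off rejects two of the five
```

Note that the denominator here is the number of pairwise tests that were run. That number can differ from `len(dseps.get_assertions())`, because a single assertion such as (A ⟂ O, R, T | E) expands into several pairwise tests.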

This approach is not conceptually perfect, for reasons outlined in the book. For one, p-values depend on the size of the data. Typically, the more data you have, the more sensitive the test becomes to small dependencies, including coincidental [spurious correlation](https://en.wikipedia.org/wiki/Spurious_relationship), so variables that are approximately independent start to look dependent: the bigger the data, the smaller the p-values tend to be (more evidence against independence). The proportion of passing tests therefore changes with the sample size. For example, if we use only the first 30 data points, the proportion changes; here it goes down.

In [7]:
N = 30
data = full_data[:N]
results = [test_dsep(dsep, data=data, boolean=True, significance=.1) for dsep in dseps.get_assertions()]
results_flat = [item for sublist in results for item in sublist]
results = {k: v for k, v in results_flat}
print(results)
sum(results.values())/len(dseps.get_assertions())


{(A ⟂ S): True, (A ⟂ O | E): True, (A ⟂ R | E): True, (A ⟂ T | E): True, (A ⟂ O | E, S): True, (A ⟂ R | E, S): True, (A ⟂ T | E, S): True, (A ⟂ T | O, R): True, (A ⟂ R | O, E): True, (A ⟂ T | O, E): True, (A ⟂ O | R, E): True, (A ⟂ T | R, E): True, (A ⟂ O | E, T): True, (A ⟂ R | E, T): True, (A ⟂ T | O, R, S): True, (A ⟂ R | O, E, S): True, (A ⟂ T | O, E, S): True, (A ⟂ O | R, E, S): True, (A ⟂ T | R, E, S): True, (A ⟂ O | S, E, T): True, (A ⟂ R | S, E, T): True, (A ⟂ T | O, R, E): True, (A ⟂ R | O, E, T): True, (A ⟂ O | R, E, T): True, (A ⟂ T | O, R, E, S): True, (A ⟂ R | O, S, E, T): True, (A ⟂ O | S, R, E, T): True, (S ⟂ O | E): True, (S ⟂ R | E): True, (S ⟂ T | E): True, (S ⟂ O | A, E): True, (S ⟂ R | A, E): True, (S ⟂ T | A, E): True, (S ⟂ T | O, R): True, (S ⟂ R | O, E): True, (S ⟂ T | O, E): True, (S ⟂ O | R, E): True, (S ⟂ T | R, E): True, (S ⟂ O | E, T): True, (S ⟂ R | E, T): True, (S ⟂ T | A, O, R): True, (S ⟂ R | A, O, E): True, (S ⟂ T | A, O, E): True, (S ⟂ O | A, R, E): Tr

0.7125
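The dependence of p-values on sample size is easiest to see when the strength of the dependence is held fixed and only the number of observations grows. The sketch below uses a made-up 2x2 table with a weak association and scales every cell by the same factor. `chi_square_2x2` is an illustrative helper, and its p-value uses the closed form for one degree of freedom, so it applies only to 2x2 tables without a continuity correction.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 dof, no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: the same weak 55% vs 45% association at every sample size.
base = [[11, 9], [9, 11]]
for scale in (1, 10, 100):
    table = [[scale * count for count in row] for row in base]
    statistic, p_value = chi_square_2x2(table)
    print(f"n={40 * scale:5d}  statistic={statistic:7.2f}  p-value={p_value:.2e}")
```

With the proportions held fixed, the statistic grows linearly with the sample size, so the same weak association passes at a 0.1 significance level with 40 observations and is rejected with 400.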

## Alternatives to canonical statistical tests for independence

There are alternatives to traditional "parametric" statistical hypothesis tests (i.e., tests that use the chi-square, Normal, or another canonical distribution to calculate a p-value) for testing the causal Markov assumption. The library [PyWhy-Stats](https://github.com/py-why/pywhy-stats) has some more sophisticated approaches that can be applied to causal inference. You can also simply try prediction; if (S ⟂ T | E), then a model that predicts S given E and T should not perform much better than a model that predicts S given only E.
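The prediction idea can be sketched with a very simple classifier. The example below uses synthetic data (not the survey) in which S and T each depend only on E, fits a majority-class lookup table with and without T as a feature, and compares held-out accuracy. All names (`fit_lookup`, `accuracy`, `simulate`) and the probabilities are assumptions made for illustration; in practice you would swap in your own model and data.

```python
import random
from collections import Counter, defaultdict


def fit_lookup(rows, features, target):
    """Majority-class lookup table: the most common target per feature combination."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[f] for f in features)][row[target]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    # Fall back to the overall majority class for unseen feature combinations.
    fallback = Counter(row[target] for row in rows).most_common(1)[0][0]
    return lambda row: table.get(tuple(row[f] for f in features), fallback)


def accuracy(predict, rows, target):
    """Share of rows whose target value is predicted correctly."""
    return sum(predict(row) == row[target] for row in rows) / len(rows)


def simulate(n, rng):
    """Synthetic rows in which S and T each depend only on E, so (S ⟂ T | E) holds."""
    rows = []
    for _ in range(n):
        e = rng.choice(["high", "uni"])
        s = "M" if rng.random() < (0.7 if e == "high" else 0.3) else "F"
        weights = [0.5, 0.3, 0.2] if e == "high" else [0.2, 0.3, 0.5]
        t = rng.choices(["car", "train", "other"], weights=weights)[0]
        rows.append({"E": e, "S": s, "T": t})
    return rows


rng = random.Random(0)
train, test = simulate(2000, rng), simulate(2000, rng)

acc_e = accuracy(fit_lookup(train, ["E"], "S"), test, "S")
acc_et = accuracy(fit_lookup(train, ["E", "T"], "S"), test, "S")
print(f"accuracy given E:       {acc_e:.3f}")
print(f"accuracy given E and T: {acc_et:.3f}")
print(f"gain from adding T:     {acc_et - acc_e:+.3f}")
```

A gain close to zero is consistent with (S ⟂ T | E). A clear gain on held-out data would count as evidence against the DAG.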

Don't get too hung up on statistical rigor -- the goal is just to evaluate your causal DAG. You want to quickly <i>falsify</i> your causal DAG if it is a bad model and move on to finding a good one.