<a href="https://colab.research.google.com/github/altdeep/causalML/blob/master/book/chapter%204/Testing_Markov_Property_on_Transportation_DAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pgmpy

In [None]:
import pandas as pd
import numpy as np
from pgmpy.base.DAG import DAG
from pgmpy.estimators.CITests import chi_square
from pgmpy.independencies import IndependenceAssertion

In [None]:
# Build the causal DAG.
G = DAG()
G.add_edges_from(
    [
      ('A','E'),
      ('S','E'),
      ('E','O'),
      ('E','R'),
      ('O','T'),
      ('R','T')
    ]
)

# Load N examples from the data
survey_url = "https://raw.githubusercontent.com/altdeep/causalML/master/datasets/transportation_survey.csv"

N = 30
full_data = pd.read_csv(survey_url)
data = full_data[:N]

# List D-Separations
dseps = G.get_independencies()
print(dseps)

# Run Chi-squared tests for independence
significance = .01

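def test_dsep(dsep: IndependenceAssertion):
  """Run one chi-squared test per (X, Y) pair in a d-separation statement."""
  # get_assertion() returns the three variable sets: (X set, Y set, Z set).
  xs, ys, zs = dsep.get_assertion()
  Z = list(zs)
  test_outputs = []
  for X in xs:
    for Y in ys:
      # With boolean=True, a True result means independence was not rejected.
      test_result = chi_square(X=X, Y=Y, Z=Z, data=data, boolean=True, significance_level=significance)
      test_outputs.append((IndependenceAssertion(X, Y, Z), test_result))
  return test_outputs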

results = [test_dsep(dsep) for dsep in dseps.get_assertions()]
results_flat = [item for sublist in results for item in sublist]
results = {k: v for k, v in results_flat}
print(results)

# Hint on how to count the number of Trues.
sum(results.values())

## Learning questions

### *How do conditional independence tests like the Chi-squared test used in the code work?*  

The null hypothesis is independence; the alternative hypothesis is dependence. To test whether X is conditionally independent of Y given Z, the test looks in the data for a statistically significant association between X and Y within each value of Z. If the p-value is greater than the significance level, the test fails to reject the null hypothesis, and we treat the independence as consistent with the data.
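
To make the mechanics concrete, here is a minimal standard-library sketch of a stratified Pearson chi-squared test. It is not pgmpy's implementation: the helper names `chi2_sf` and `conditional_chi_square` and the toy counts are assumptions made up for illustration. Within each value of Z the code compares observed counts to the counts expected under independence, then sums the statistics and degrees of freedom across strata.

```python
import math
from collections import Counter, defaultdict


def chi2_sf(stat, dof):
    """Upper-tail probability P(chi2 >= stat) for an integer dof >= 1."""
    y = stat / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-y) * sum_{i < dof/2} y^i / i!
        return math.exp(-y) * sum(y**i / math.factorial(i) for i in range(dof // 2))
    # Odd dof: start from the dof = 1 tail and add one term per extra 2 dof.
    tail = math.erfc(math.sqrt(y))
    for i in range(1, (dof - 1) // 2 + 1):
        tail += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return tail


def conditional_chi_square(rows, x, y, z):
    """Test x independent of y given the columns in z; returns (stat, dof, p)."""
    # One contingency table of (x, y) counts per combination of z values.
    strata = defaultdict(Counter)
    for row in rows:
        strata[tuple(row[c] for c in z)][(row[x], row[y])] += 1

    stat, dof = 0.0, 0
    for table in strata.values():
        n = sum(table.values())
        x_tot, y_tot = Counter(), Counter()
        for (xv, yv), count in table.items():
            x_tot[xv] += count
            y_tot[yv] += count
        for xv in x_tot:
            for yv in y_tot:
                expected = x_tot[xv] * y_tot[yv] / n  # count under independence
                stat += (table[(xv, yv)] - expected) ** 2 / expected
        dof += (len(x_tot) - 1) * (len(y_tot) - 1)
    p_value = chi2_sf(stat, dof) if dof > 0 else 1.0
    return stat, dof, p_value


# Toy counts keyed by (X, Y, Z): Z drives both X and Y, so X and Y are
# associated marginally but independent within each value of Z.
counts = {
    (0, 0, 0): 36, (0, 1, 0): 12, (1, 0, 0): 12, (1, 1, 0): 4,
    (0, 0, 1): 4, (0, 1, 1): 12, (1, 0, 1): 12, (1, 1, 1): 36,
}
rows = [
    {"X": xv, "Y": yv, "Z": zv}
    for (xv, yv, zv), n in counts.items()
    for _ in range(n)
]

significance = 0.01
marginal = conditional_chi_square(rows, "X", "Y", [])
conditional = conditional_chi_square(rows, "X", "Y", ["Z"])
print("X vs Y:        ", marginal, marginal[2] >= significance)
print("X vs Y given Z:", conditional, conditional[2] >= significance)
```

On these toy counts the marginal test has a statistic of 8 on 1 degree of freedom (p-value about 0.005), so independence is rejected at the 0.01 level. The test given Z has a statistic of 0 and a p-value of 1, so independence is not rejected.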

### *What is the impact of sample size on this analysis?*

In this analysis, we took the first 30 rows of the data, so the sample size in each test was 30. Test statistics and their p-values depend on sample size, so the number of p-values that fall above or below the threshold depends on the size of the data. The size of the data has nothing to do with ground-truth causality. As an exercise, vary the variable N; you should see the number of tests that return True change with N.
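
The dependence on sample size can be sketched without the survey data. The snippet below assumes a hypothetical joint distribution `joint` for two binary variables with a weak association, and builds an idealized table in which every cell holds exactly its expected share of N rows. The strength of the association never changes; only N does.

```python
import math


def chi_square_2x2_p_value(table):
    """P-value of the Pearson chi-squared test on a 2x2 table (1 dof)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(2)
        for j in range(2)
    )
    # For 1 degree of freedom the chi-squared tail is erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical joint probabilities: the diagonal is slightly over-represented.
joint = [[0.28, 0.22], [0.22, 0.28]]
significance = 0.01

p_values = {}
for n in (30, 100, 300, 1000, 3000):
    # Idealized (possibly non-integer) counts: n times each cell probability.
    table = [[n * p for p in row] for row in joint]
    p_values[n] = chi_square_2x2_p_value(table)
    print(n, p_values[n], p_values[n] >= significance)
```

Here the statistic grows in proportion to N, so the p-value shrinks steadily. At the 0.01 level the test does not reject independence for N up to 300 and rejects it from N = 1000 onward, even though the underlying association is identical in every row of the output.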

### *Why does the number of validated d-separations go up when the significance level goes down?*

The CI test concludes in favor of independence when the p-value is greater than the significance threshold, so lowering the threshold makes more tests pass.
Data size matters too: when any association is present, however weak, p-values tend to go down as data size goes up, and up as data size goes down. The more data you have, the more likely the test is to flag small associations as dependence, even when they are practically negligible.
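
As a small illustration of the threshold effect, the sketch below counts how many tests pass at three significance levels. The p-values are made up for illustration and do not come from the transportation survey; `count_passing` is a hypothetical helper that applies the rule of passing when the p-value is at or above the threshold.

```python
# Hypothetical p-values, one per tested d-separation (made up for illustration).
p_values = [0.62, 0.31, 0.08, 0.04, 0.03, 0.008, 0.002, 0.0004]


def count_passing(p_values, significance):
    """Count tests that fail to reject independence at this level."""
    return sum(p >= significance for p in p_values)


passing = {alpha: count_passing(p_values, alpha) for alpha in (0.05, 0.01, 0.001)}
print(passing)  # {0.05: 3, 0.01: 5, 0.001: 7}
```

The p-values themselves never change; only the bar for rejecting independence moves, so a lower threshold can only keep or increase the number of passing tests.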

The presence of a d-separation implies independence. But don't confuse the presence of a d-separation with *evidence* of independence; the two are not the same. For better or for worse, evidence of independence depends on data size.