# causalgraph (cg) with Tigramite and DoWhy

This tutorial shows how to learn a causal graph with ```Tigramite```, export it to ```cg```,
and then perform causal inference with ```DoWhy```.


> ```DoWhy``` is a Python library that aims to spark causal thinking and analysis. DoWhy provides a principled four-step interface for causal inference that focuses on explicitly modeling causal assumptions and validating them as much as possible. The key feature of DoWhy is its state-of-the-art refutation API that can automatically test causal assumptions for any estimation method, thus making inference more robust and accessible to non-experts. DoWhy supports estimation of the average causal effect for ```backdoor```, ```frontdoor```, ```instrumental variable``` and other identification methods, and estimation of the conditional effect (```CATE```) through an integration with the EconML library. 
> ([Source - https://microsoft.github.io/dowhy/](https://microsoft.github.io/dowhy/); there you can find more detailed information about DoWhy as well.)



NOTE: This notebook is not intended to provide an extensive overview of the features of DoWhy but rather to show how to start working with DoWhy when you have a cg graph.

Required packages to run this tutorial are:
- dowhy==0.7.1
- tigramite==4.2.2.1
- pandas==1.2.3
- causalgraph (cg)
- jupyter (to run this notebook)
- [owlready2 0.35](https://owlready2.readthedocs.io/en/v0.35/) (backend for causalgraph store)

First, import the required packages.

In [None]:
import os
import sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))
# general imports
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# dowhy imports
from dowhy import CausalModel
# tigramite imports
from tigramite.pcmci import PCMCI
from tigramite import plotting as tp
from tigramite import data_processing as pp
from tigramite.independence_tests import ParCorr
# causalgraph imports
from causalgraph import Graph
from causalgraph.utils import mapping
from causalgraph.utils.path_utils import get_project_root

## Tigramite

Use the built-in Tigramite function ```var_process``` to create time series data with linear dependencies.  
In the given example, the values of time series 0 and time series 1 are computed in the following way:  
```timeseries0(t) = 3 * timeseries1(t-4) + noise```  
```timeseries1(t) = 2 * timeseries2(t-3) + 0.6 * timeseries4(t-1) + noise```  

In [None]:
# create time series data
np.random.seed(41)
T = 3000 # time series length
links_coeffs = {0: [((1, -4), 3)                    ],
                1: [((2, -3), 2), ((4, -1), 0.6)    ],
                2: [                                ],
                3: [((1, -2), 0.7)                  ],
                4: [                                ],
                5: [((1, -2), 0.7)                  ]}
data, _ = pp.var_process(links_coeffs, T=T)
plt.plot(data[:200,1])
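
To make the two equations above concrete, here is a minimal pure-Python sketch of such a lagged linear process. It is not Tigramite's implementation of ```var_process```; the helper name ```simulate_lagged_process``` and the unit-variance Gaussian noise are assumptions made for illustration.

```python
import random

def simulate_lagged_process(links, n_steps, seed=0):
    """Simulate x_j(t) = sum(coeff * x_i(t + lag)) + noise, with lag < 0."""
    rng = random.Random(seed)
    series = [[0.0] * len(links) for _ in range(n_steps)]
    for t in range(n_steps):
        for j, parents in links.items():
            value = rng.gauss(0.0, 1.0)  # assumed unit-variance Gaussian noise
            for (i, lag), coeff in parents:
                if t + lag >= 0:  # ignore parents before the start of the series
                    value += coeff * series[t + lag][i]
            series[t][j] = value
    return series

# same link structure as in the tutorial: {effect: [((cause, lag), coefficient), ...]}
links = {0: [((1, -4), 3)],
         1: [((2, -3), 2), ((4, -1), 0.6)],
         2: [],
         3: [((1, -2), 0.7)],
         4: [],
         5: [((1, -2), 0.7)]}
toy = simulate_lagged_process(links, n_steps=2000)

# subtracting the lagged parent term from series 0 leaves only its noise
residuals = [toy[t][0] - 3 * toy[t - 4][1] for t in range(4, len(toy))]
```

Because every link has a lag of at least one step, each value only depends on rows that have already been filled in.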

Create a Tigramite dataframe and assign names to the time series. Then initialize ```ParCorr```, a conditional independence test based on partial correlation that is suited to detecting linear dependencies, and pass it to the ```PCMCI``` algorithm, which is initialized as well.

In [None]:
# init dataframe and variable names
var_names = ['bumpy_feeling', 'flat_tire', 'thorns_on_road', 'noise', 'glass_on_road', 'steering_problems']
dataframe = pp.DataFrame(data, datatime=np.arange(len(data)), var_names=var_names)
# init independence test and algorithm
parcorr = ParCorr(significance='analytic', confidence='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr)
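
The idea behind a partial correlation test can be sketched in a few lines of pure Python. The example below is not Tigramite's ```ParCorr``` code; it uses the closed-form partial correlation for a single conditioning variable, and the helper names and toy data are assumptions for illustration. Two variables that share a common driver are correlated, but the correlation vanishes once the driver is conditioned on.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear influence of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]  # common driver
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

raw = pearson(x, y)                  # clearly non-zero: x and y share the driver z
conditioned = partial_corr(x, y, z)  # close to zero: no direct link remains
print(round(raw, 2), round(conditioned, 2))
```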

Run PCMCI and plot the results.

In [None]:
# run algorithm
results = pcmci.run_pcmci(tau_max=8, pc_alpha=None)
q_matrix = pcmci.get_corrected_pvalues(p_matrix=results['p_matrix'], fdr_method='fdr_bh')
results['q_matrix'] = q_matrix
# get links and plot graph
link_matrix = pcmci.return_significant_links(pq_matrix=q_matrix,
              val_matrix=results['val_matrix'], alpha_level=0.01)['link_matrix']
tp.plot_graph(val_matrix=results['val_matrix'], link_matrix=link_matrix,
              var_names=var_names,link_colorbar_label='cross-MCI',
              node_colorbar_label='auto-MCI', figsize=(12, 7)) 
plt.show()
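
The option ```fdr_method='fdr_bh'``` refers to the Benjamini-Hochberg false discovery rate correction. A minimal pure-Python sketch of that adjustment on a flat list of p-values follows. The function name is hypothetical, and Tigramite's own routine works on the whole p-value matrix and may treat some entries differently, so the numbers need not match its output.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down so the adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

Links are then kept when their adjusted value falls below the chosen level, which corresponds to ```alpha_level=0.01``` in the cell above.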

To export the Tigramite results to cg, every edge needs a name. The names are stored in a dictionary that maps each edge name to its cause and effect.  
This dictionary is created here.

In [None]:
# create dictionary of edges
edge_names = {}
d1, d2, d3 = np.nonzero(link_matrix)
inds = [(d1[i], d2[i], d3[i]) for i in range(np.sum(link_matrix))]
for i in range(len(inds)):
    cause = inds[i][0]
    effect = inds[i][1]
    edge_names['Edge_'+str(i)] = { 'cause': var_names[cause],
                                   'effect': var_names[effect]}
print(json.dumps(edge_names, indent=4))
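
The dictionary above keeps only cause and effect and drops the lag stored in the third index of ```link_matrix```. The following sketch shows the index convention on a toy nested list standing in for the link matrix, and how the lag could be kept alongside each edge. The ```lag``` key is an assumption for illustration and is not necessarily understood by the cg reader.

```python
var_names = ['bumpy_feeling', 'flat_tire', 'thorns_on_road']
n_vars, max_lag = len(var_names), 4

# toy_links[i][j][tau] is True when variable i at time t - tau drives variable j at time t
toy_links = [[[False] * (max_lag + 1) for _ in range(n_vars)] for _ in range(n_vars)]
toy_links[1][0][4] = True  # flat_tire(t-4) -> bumpy_feeling(t)
toy_links[2][1][3] = True  # thorns_on_road(t-3) -> flat_tire(t)

edges_with_lag = {}
for cause in range(n_vars):
    for effect in range(n_vars):
        for lag in range(max_lag + 1):
            if toy_links[cause][effect][lag]:
                name = 'Edge_' + str(len(edges_with_lag))
                edges_with_lag[name] = {'cause': var_names[cause],
                                        'effect': var_names[effect],
                                        'lag': lag}
print(edges_with_lag)
```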

## causalgraph (cg)
The results of the Tigramite analysis are now imported into a cg graph.  
To do so, a new cg graph is initialized first.

In [None]:
# instantiate graph
sql_file_name = 'dowhy_example.sqlite3'
if os.path.exists(sql_file_name):
    os.remove(sql_file_name)
    print(f"Deleted old db with name {sql_file_name}")
G = Graph(sql_db_filename=sql_file_name)

Converting a causal graph from Tigramite to the cg format is done by using a dictionary that describes the causal graph.  
This dictionary is created from the Tigramite results. The dictionary contains all the properties describing the graph.

In [None]:
# insert tigramite result in dictionary
graph_dict = G.readwrite.tigra.read(var_names, edge_names, link_matrix, q_matrix, 1)
print(json.dumps(graph_dict, indent=4))

The cg graph is updated with the information in the dictionary.  
Then, it is exported to ```.gml``` since this file format is used by DoWhy.

In [None]:
mapping.update_graph_from_dict(G.store, graph_dict)
# export graph to gml
G.readwrite.nx.export_gml('./gml_graph')
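
GML is a plain-text format made of nested key-value lists. For orientation, the sketch below renders a list of cause-effect pairs as a directed GML string by hand. The helper ```edges_to_gml``` is hypothetical, and the file written by cg may contain additional attributes or a different layout.

```python
def edges_to_gml(edges):
    """Render (cause, effect) pairs as a directed GML graph string."""
    nodes = sorted({name for edge in edges for name in edge})
    ids = {name: i for i, name in enumerate(nodes)}
    lines = ['graph [', '  directed 1']
    for name in nodes:
        lines.append(f'  node [ id {ids[name]} label "{name}" ]')
    for cause, effect in edges:
        lines.append(f'  edge [ source {ids[cause]} target {ids[effect]} ]')
    lines.append(']')
    return '\n'.join(lines)

gml_text = edges_to_gml([('thorns_on_road', 'flat_tire'),
                         ('flat_tire', 'bumpy_feeling')])
print(gml_text)
```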

## DoWhy
DoWhy takes a graph and data as input. The graph already exists as a gml file and encodes our assumptions about the dependencies between the variables.  
The data also exists, but only as time series. [DoWhy does not support time series data directly](https://github.com/microsoft/dowhy/issues/174). Since Tigramite detected the time lags between the variables, each time series is shifted by its lag relative to ```bumpy_feeling``` so that all causal dependencies line up in the same row. For example, ```thorns_on_road``` acts on ```flat_tire``` with lag 3 and ```flat_tire``` acts on ```bumpy_feeling``` with lag 4, so ```thorns_on_road``` is shifted by 7 steps.  
Then, a pandas dataframe is created from the shifted time series.

In [None]:
# shift each column by its lag relative to bumpy_feeling (column 0),
# so that related values end up in the same row
lags = {1: 4, 2: 7, 3: 2, 4: 5, 5: 2}  # column index -> shift in time steps
shifted = data.copy()
for col, lag in lags.items():
    shifted[lag:, col] = data[:-lag, col]
# drop the first 10 rows, which have no valid lagged values (the largest lag is 7)
df = pd.DataFrame(shifted[10:], columns=var_names)
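
The effect of such a shift is easiest to see on a tiny example. In the sketch below, which uses made-up toy values, the effect at time ```t``` is twice the cause at time ```t - 2```. Pairing each effect with the cause two steps earlier puts related values in the same row.

```python
cause = [float(t) for t in range(10)]
# the effect at time t is twice the cause at time t - 2
effect = [0.0, 0.0] + [2 * cause[t - 2] for t in range(2, 10)]
lag = 2

# pair effect(t) with cause(t - lag) and drop the first `lag` steps
aligned = [(cause[t - lag], effect[t]) for t in range(lag, len(cause))]
print(aligned[:3])
```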

Doing ```Causal Inference``` with DoWhy usually involves four steps, which are explained in the following sections.\
NOTE: The descriptions of the steps are taken directly from the [official documentation](https://microsoft.github.io/dowhy/#).

```1. Model a causal problem```

DoWhy creates an underlying causal graphical model for each problem. This serves to make each causal assumption explicit. Currently, DoWhy supports two formats for graph input: gml (preferred) and dot. We strongly suggest using gml as the input format, as it works well with networkx. You can provide the graph either as a .gml file or as a string. If you prefer to use dot format, you will need to install additional packages (pydot or pygraphviz, see the installation section of the DoWhy documentation). Both .dot files and string format are supported. While not recommended, you can also specify common causes and/or instruments directly instead of providing a graph. ([Source](https://microsoft.github.io/dowhy/#i-model-a-causal-problem))

The gml graph is imported. Then, a causal model is created by instantiating DoWhy's ```CausalModel``` class with the graph, the data, and the treatment and outcome whose dependency should be examined. In this case, we want to know the effect of ```thorns_on_road``` on ```bumpy_feeling```.  

In [None]:
# read graph from gml and create causal model with dowhy
with open('./gml_graph.gml', 'r') as gml_file:
    graph_str = gml_file.read()
model = CausalModel(data=df, treatment=['thorns_on_road'],
                    outcome=['bumpy_feeling'], graph=graph_str)
# show model
model.view_model()
plt.show()

```2. Identify a target estimand under the model.```

Based on the causal graph, DoWhy finds all possible ways of identifying a desired causal effect based on the graphical model. It uses graph-based criteria and do-calculus to find expressions that can identify the causal effect. Supported identification criteria are ```Back-door criterion```, ```Front-door criterion```, ```Instrumental Variables```, ```Mediation``` (Direct and indirect effect identification). ([Source](https://microsoft.github.io/dowhy/#ii-identify-a-target-estimand-under-the-model))

In [None]:
# identification
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
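
For a graph without unobserved confounders, the direct causes (parents) of the treatment form one valid back-door adjustment set. The sketch below applies this rule to the edges implied by the simulated links, assuming PCMCI recovered them exactly. It is a simplified stand-in for illustration, not DoWhy's identification algorithm, and the helper ```parents``` is hypothetical.

```python
# edges implied by the simulated links, assuming PCMCI recovered them exactly
edges = [('flat_tire', 'bumpy_feeling'),
         ('thorns_on_road', 'flat_tire'),
         ('glass_on_road', 'flat_tire'),
         ('flat_tire', 'noise'),
         ('flat_tire', 'steering_problems')]

def parents(node, edges):
    """Direct causes of `node` in the edge list."""
    return sorted(cause for cause, effect in edges if effect == node)

# thorns_on_road has no parents, so nothing needs to be adjusted for
adjust_for_thorns = parents('thorns_on_road', edges)
# with flat_tire as the treatment, its two parents would be adjusted for
adjust_for_flat_tire = parents('flat_tire', edges)
print(adjust_for_thorns, adjust_for_flat_tire)
```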

```3. Estimate causal effect based on the identified estimand```

DoWhy supports methods based on both the back-door criterion and instrumental variables. It also provides non-parametric confidence intervals and a permutation test for testing the statistical significance of the obtained estimate. ([Source](https://microsoft.github.io/dowhy/#iii-estimate-causal-effect-based-on-the-identified-estimand))

In [None]:
# estimation
causal_estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.linear_regression")
print(causal_estimate)
print("Causal Estimate is " + str(causal_estimate.value))
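
In the simulated data, ```thorns_on_road``` acts on ```flat_tire``` with coefficient 2 and ```flat_tire``` acts on ```bumpy_feeling``` with coefficient 3, so in a linear model the total effect along this path is their product, 6. The pure-Python sketch below checks this on freshly simulated, already aligned toy data with a plain least-squares slope. It illustrates the principle only and is not DoWhy's estimator; the helper ```ols_slope``` and the toy data are assumptions.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var

rng = random.Random(0)
n = 5000
thorns = [rng.gauss(0, 1) for _ in range(n)]
flat_tire = [2 * x + rng.gauss(0, 1) for x in thorns]  # thorns -> flat_tire, coefficient 2
bumpy = [3 * f + rng.gauss(0, 1) for f in flat_tire]   # flat_tire -> bumpy, coefficient 3

total_effect = ols_slope(thorns, bumpy)  # expected to be close to 2 * 3 = 6
print(round(total_effect, 2))
```

If the time shift and the learned graph are right, the DoWhy estimate above would be expected to land near the same value.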

```4. Refute the obtained estimate```

Having access to multiple refutation methods to validate an effect estimate from a causal estimator is a key benefit of using DoWhy. ([Source](https://microsoft.github.io/dowhy/#iv-refute-the-obtained-estimate))\
NOTE: Refutation is not shown in this tutorial.
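
As a rough illustration of what a refutation check does, the sketch below mimics a placebo-treatment test in pure Python: the treatment values are shuffled, which destroys any causal link to the outcome, and the effect is re-estimated. An estimate that stayed far from zero under the placebo would cast doubt on the estimator. This is a simplified stand-in on toy data, not DoWhy's implementation; in DoWhy, refuters are invoked through ```model.refute_estimate```.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var

rng = random.Random(1)
treatment = [rng.gauss(0, 1) for _ in range(5000)]
outcome = [6 * x + rng.gauss(0, 1) for x in treatment]  # assumed true effect of 6

placebo = treatment[:]
rng.shuffle(placebo)  # placebo treatment: same values, random order

original_estimate = ols_slope(treatment, outcome)  # close to 6
placebo_estimate = ols_slope(placebo, outcome)     # close to 0
print(round(original_estimate, 2), round(placebo_estimate, 2))
```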