[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/causality-discovery/quickstarters/baseline/baseline.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/causality-discovery/assets/banner.webp)

## DAG Competition - Baseline Notebook

This notebook introduces participants to the competition and suggests some possible starting points. The suggestions presented here are not binding and can be taken in any number of directions.

## The problem

Discovering causal relationships among variables from observational data is important in fields such as healthcare and economics. Participants in this competition receive datasets with known causal graphs and use them to develop algorithms that recover the underlying causal structure. The focus is on identifying how the other variables influence the relationship between two key variables, X (treatment) and Y (outcome). Both unsupervised and supervised methods are welcome, and submissions are evaluated on the accuracy of the predicted causal links. Successful solutions will improve causal inference methods, aiding decision-making and understanding in various domains.

### Preliminary step

In [None]:
%pip install crunch-cli --upgrade
%pip install gcastle torch

In [None]:
# update the token via https://hub.crunchdao.com/competitions/causality-discovery/submit/via/notebook

!crunch setup --notebook causality-discovery default --token aaaabbbbccccddddeeeeffff

In [None]:
"""
This is a basic example of what you need to do to enter the competition.
The code will not have access to the internet (or any socket related operation).
"""

import os
import typing

import castle.algorithms
import joblib
import networkx as nx
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# keep me, I am needed by castle
import torch

In [2]:
import crunch
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>


The following function is provided to help you turn your predicted graph into a DAG when it is not one already. It also ensures that there is an edge from X to Y, as the competition design requires. This is only one way to get such a result, and not necessarily the best one for the competition. An improved algorithm for obtaining a DAG from your predicted graph could lead to better results.

In [3]:
def fix_DAG(g):
    """
    Ensure that the graph is a DAG and has an edge X→Y

    We look for cycles, and remove an edge in each cycle, until there are no cycles left.

    Inputs: g: nx.DiGraph
    Output: g: nx.DiGraph

    This function provides just a possible solution to the problem
    of DAG-ifying a graph. Other solutions can be conceived that could
    be better for the competition.
    """

    assert 'X' in g.nodes
    assert 'Y' in g.nodes

    gg = g.copy()

    # Add X→Y if it is missing
    if ('X', 'Y') not in gg.edges:
        gg.add_edge('X', 'Y')

    # Look for cycles and remove them
    while not nx.is_directed_acyclic_graph(gg):

        h = gg.copy()

        # Remove all the sources and sinks
        while True:
            finished = True

            for i, v in nx.in_degree_centrality(h).items():
                if v == 0:
                    h.remove_node(i)
                    finished = False

            for i, v in nx.out_degree_centrality(h).items():
                if v == 0:
                    h.remove_node(i)
                    finished = False

            if finished:
                break

        # Find a cycle with a random walk, starting at the first remaining node
        # (every node left in h has at least one outgoing edge, so the walk never gets stuck)
        node = list(h.nodes)[0]
        cycle = [node]
        while True:
            edges = list(h.out_edges(node))
            _, node = edges[np.random.choice(len(edges))]

            if node in cycle:
                break

            cycle.append(node)

        # We have a path that ends with a cycle: remove the beginning, if it is not part of the cycle
        cycle = np.array(cycle)
        i = np.argwhere(cycle == node)[0][0]
        cycle = cycle[i:]
        cycle = cycle.tolist() + [node]

        # Edges in that cycle
        edges = list(zip(cycle[:-1], cycle[1:]))

        # Pick an edge at random, but make sure it is not X→Y -- we want to keep that one
        edges = [e for e in edges if e != ('X', 'Y')]
        edge = edges[np.random.choice(len(edges))]

        gg.remove_edge(*edge)

    return gg
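
As the docstring notes, other ways of DAG-ifying a graph are possible. The sketch below is one deterministic variant, not the competition's reference method: it relies on `nx.find_cycle` to locate a cycle and removes the first edge of that cycle that is not X→Y. The function name `break_cycles` and its `keep` parameter are hypothetical names introduced here for illustration.

```python
import networkx as nx


def break_cycles(g: nx.DiGraph, keep=("X", "Y")) -> nx.DiGraph:
    """Return an acyclic copy of `g` that contains the edge `keep`.

    Hypothetical alternative to fix_DAG: no randomness, so the same
    input graph always gives the same output graph.
    """
    for node in keep:
        if node not in g:
            raise ValueError(f"node {node!r} is missing from the graph")

    dag = g.copy()
    dag.add_edge(*keep)  # no-op if the edge already exists

    while True:
        try:
            # List of (u, v) edges forming one directed cycle
            cycle = nx.find_cycle(dag)
        except nx.NetworkXNoCycle:
            return dag

        # Remove the first edge of the cycle that is not the protected one
        u, v = next(e for e in cycle if (e[0], e[1]) != keep)
        dag.remove_edge(u, v)


# A three-node cycle that passes through X -> Y
cyclic = nx.DiGraph([("X", "Y"), ("Y", "Z"), ("Z", "X")])
acyclic = break_cycles(cyclic)

# An undirected X - Y link, encoded as edges in both directions
two_way = nx.DiGraph([("X", "Y"), ("Y", "X")])
oriented = break_cycles(two_way)
```

Which edge gets removed depends on the traversal order of `nx.find_cycle`, so this variant is reproducible but no better informed than the random choice above. A stronger approach would rank the candidate edges by some measure of confidence before removing one.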

This is the core of the solution: it reads one dataset at a time, applies the PC algorithm, ensures that the result is a DAG, and collects the results into a single data frame in the required format, ready for submission.

In [4]:
# Uncomment what you need!
def train(
    X_train: typing.Dict[str, pd.DataFrame],
    y_train: typing.Dict[str, pd.DataFrame],
    # number_of_features: int,
    model_directory_path: str,
    # id_column_name: str,
    # prediction_column_name: str,
    # has_gpu: bool,
) -> None:
    # TODO replace me with a real model
    model = ...

    joblib.dump(
        model,
        os.path.join(model_directory_path, "model.joblib")
    )

In [5]:
# Uncomment what you need!
def infer(
    X_test: typing.Dict[str, pd.DataFrame],
    # number_of_features: int,
    model_directory_path: str,
    id_column_name: str,
    prediction_column_name: str,
    # has_gpu: bool,
    # has_trained: bool,
) -> pd.DataFrame:
    # TODO use me
    # model = joblib.load(os.path.join(model_directory_path, "model.joblib"))

    predictions = {}
    for dataset_id in tqdm(X_test):
        X = X_test[dataset_id]

        nodes = X.columns
        model = castle.algorithms.PC()
        model.learn(X)

        A_hat = pd.DataFrame(model.causal_matrix, columns=nodes, index=nodes)
        g_hat = nx.from_pandas_adjacency(A_hat, create_using=nx.DiGraph)
        g_hat = fix_DAG(g_hat)

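        # Pass nodelist explicitly so that rows and columns are guaranteed to follow the order of `nodes`
        G = pd.DataFrame(nx.to_numpy_array(g_hat, nodelist=list(nodes)).astype(int), columns=nodes, index=nodes)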
        for i in nodes:
            for j in nodes:
                predictions[f'{dataset_id}_{i}_{j}'] = int(G.loc[i, j])

    return pd.DataFrame(
        predictions.items(),
        columns=[id_column_name, prediction_column_name]
    )
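
To make the output format concrete, the following sketch shows how a single adjacency matrix is flattened into one row per ordered pair of nodes, with identifiers of the form `{dataset_id}_{i}_{j}` as in `infer` above. The helper name `adjacency_to_rows`, the dataset id `"00001"`, and the column names `"example_id"` and `"prediction"` are assumptions made for illustration. In the real run the column names are passed in by the runner.

```python
import pandas as pd


def adjacency_to_rows(
    dataset_id: str,
    adjacency: pd.DataFrame,
    id_column_name: str = "example_id",          # hypothetical column name
    prediction_column_name: str = "prediction",  # hypothetical column name
) -> pd.DataFrame:
    """Flatten an adjacency matrix into one row per (source, target) pair."""
    rows = [
        (f"{dataset_id}_{i}_{j}", int(adjacency.loc[i, j]))
        for i in adjacency.index
        for j in adjacency.columns
    ]
    return pd.DataFrame(rows, columns=[id_column_name, prediction_column_name])


# Toy graph with edges X -> Y, Z -> X and Z -> Y (row = source, column = target)
nodes = ["X", "Y", "Z"]
adjacency = pd.DataFrame(
    [[0, 1, 0],
     [0, 0, 0],
     [1, 1, 0]],
    index=nodes,
    columns=nodes,
)

rows = adjacency_to_rows("00001", adjacency)
print(rows)
```

A dataset with n variables therefore contributes n × n rows, including the diagonal pairs such as `00001_X_X`, which are always 0 for a DAG.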

In [None]:
crunch.test(
    no_determinism_check=True
)

print("Download this notebook and submit it to the platform: https://hub.crunchdao.com/competitions/causality-discovery/submit/via/notebook")  