This notebook is a quickstarter for DataCrunching. It is a collection of code snippets and explanations to get you started with the Causality Discovery code competition. It should give you a good starting point to understand how to work with the Crunch Foundation infrastructure.

It shows how to load the data, how to create a submission file and how to submit it.

## Environment setup


- Environment variables are configured to specify the API and web base URLs.
- The `crunch-cli` package is upgraded to the latest version.
- The notebook is set up to interact with the competition using a provided token. Update the token via https://hub.crunchdao.io/competitions/causality-discovery/submit/via/notebook


In [1]:
# Set environment variables
%env API_BASE_URL=http://api.hub.crunchdao.io
%env WEB_BASE_URL=http://hub.crunchdao.io


env: API_BASE_URL=http://api.hub.crunchdao.io
env: WEB_BASE_URL=http://hub.crunchdao.io


In [2]:
# Upgrade crunch-cli
%pip install crunch-cli --upgrade

# Setup crunch-cli with token
!crunch setup --notebook causality-discovery default --token cjMlVzlafiyv0F8YxrUsQ8SSTJJ2aoXTdMDKYq0sAnoXMoENm7VrxXxE5NgFTOUx

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 24.1.1 -> 24.1.2
[notice] To update, run: python3 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

---
Your token seems to have expired or is invalid.

Please follow this link to copy and paste your new setup command:
http://hub.crunchdao.io/competitions/causality-discovery/submit

If you think that is an error, please contact an administrator.


## Import
IMPORTANT: To avoid issues related to library versions, it is strongly recommended to specify the version of each library you import. This ensures that the notebook is reproducible in the Crunch Foundation environment, without unintended changes to the behavior of your code.
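
As a minimal sketch of how to find the versions worth recording, the standard-library `importlib.metadata` module can report what is installed in the current environment. The helper name `installed_versions` and the package list are illustrative assumptions, not part of the Crunch tooling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Illustrative list; adjust it to the libraries the notebook really imports.
packages = ["numpy", "pandas", "scikit-learn", "dcor", "pingouin"]
for name, ver in installed_versions(packages).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The printed `name==version` pairs can then be kept next to the imports or in a requirements file.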

In [3]:
from pathlib import Path
from glob import glob
import os
import pickle

import joblib
import numpy as np
import pandas as pd
import crunch
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import RidgeClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


## Loading all datasets 

### Get the data

In [4]:
# Load data using Crunch
crunch = crunch.load_notebook()
X_train, y_train, X_test = crunch.load_data()

loaded inline runner with module: <module '__main__'>
download data/X_train.pickle from https://crunchdao--competition--staging.s3.eu-west-1.amazonaws.com/data-releases/34/X_train.pickle (7591133 bytes)
already exists: file length match
download data/y_train.pickle from https://crunchdao--competition--staging.s3.eu-west-1.amazonaws.com/data-releases/34/y_train.pickle (98523 bytes)
already exists: file length match
download data/X_test.pickle from https://crunchdao--competition--staging.s3.eu-west-1.amazonaws.com/data-releases/34/X_test_reduced.pickle (329528 bytes)
already exists: file length match
download data/y_test.pickle from https://crunchdao--competition--staging.s3.eu-west-1.amazonaws.com/data-releases/34/y_test_reduced.pickle (4935 bytes)
already exists: file length match
download data/example_prediction.parquet from https://crunchdao--competition--staging.s3.eu-west-1.amazonaws.com/data-releases/34/example_prediction_reduced.parquet (3939 bytes)
already exists: file length ma

### Helper function
For each variable in each dataset, we create a vector of features based on the relationship between that variable and, for example, `X` or `Y`, such as the correlation value or the result of a t-test at a given $p$-value threshold (1 = null hypothesis rejected, 0 = not rejected).
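
To make the idea concrete, here is a small sketch of one such feature pair: a correlation value plus a 0/1 flag for the thresholded $p$-value. The function name `correlation_features` and the synthetic data are assumptions made for illustration; the helper functions defined further down follow the same pattern on the real datasets.

```python
import numpy as np
from scipy.stats import pearsonr


def correlation_features(v, x, pvalue_threshold=0.05):
    """Return the Pearson correlation and a 0/1 flag for p-value <= threshold."""
    corr, pvalue = pearsonr(v, x)
    return {
        "corr(v,X)": float(corr),
        f"pvalue(corr(v,X))<={pvalue_threshold}": float(pvalue <= pvalue_threshold),
    }


rng = np.random.default_rng(0)  # synthetic data, for illustration only
x = rng.normal(size=200)
v = 2.0 * x + 0.1 * rng.normal(size=200)  # strongly dependent on x by construction
print(correlation_features(v, x))
```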


The architecture is extensible, and the existing feature sets give an idea of how to create new features. In short, each family of features (called a feature set), such as *correlation*, loads all the files one by one and creates the desired features (correlation-based ones, in this case). All feature sets are merged at the end to create `Xy`, matching on dataset and variable in the join. The computation of each feature set is parallelized. This way, if some feature sets are slow, we can save them, load them at a later stage, and spend time only on creating new ones.
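
The merge step can be sketched on a toy dataset as follows. The two feature sets (`mean_features`, `range_features`) and the dataset id are hypothetical and exist only to show the shape each feature set is expected to return: one row per variable, with `dataset` and `variable` columns to join on.

```python
from functools import reduce

import pandas as pd


def mean_features(dataset, dataset_id):
    """Toy feature set: the mean of every variable other than X and Y."""
    variables = dataset.columns.drop(["X", "Y"])
    df = pd.DataFrame({"variable": variables, "mean(v)": [dataset[v].mean() for v in variables]})
    df["dataset"] = dataset_id
    return df


def range_features(dataset, dataset_id):
    """Toy feature set: max - min of every variable other than X and Y."""
    variables = dataset.columns.drop(["X", "Y"])
    df = pd.DataFrame({"variable": variables, "range(v)": [dataset[v].max() - dataset[v].min() for v in variables]})
    df["dataset"] = dataset_id
    return df


toy = pd.DataFrame({"X": [0.0, 1.0, 2.0], "Y": [1.0, 0.0, 1.0], "0": [3.0, 4.0, 5.0], "1": [2.0, 2.0, 8.0]})
feature_sets = [f(toy, dataset_id=7) for f in (mean_features, range_features)]

# Same join as in the notebook: one row per (dataset, variable) pair.
merged = reduce(lambda left, right: pd.merge(left, right, on=["dataset", "variable"]), feature_sets)
print(merged)
```

Because every feature set shares the same two key columns, a slow feature set can be cached on disk and merged back later without recomputing the others.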

In [5]:
import numpy as np
from pathlib import Path
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from collections import defaultdict
from tqdm import tqdm
from glob import glob
from os.path import basename
from sklearn.linear_model import Ridge, RidgeClassifier, RidgeClassifierCV
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from dcor import distance_correlation
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from itertools import combinations, product
from functools import reduce
from pingouin.correlation import partial_corr
from joblib import Parallel, delayed
from scipy.stats import ttest_rel, pearsonr, spearmanr
from dcor.homogeneity import energy_test



# This dictionary maps graph structures into our class labels:
r = defaultdict(str)
r[(0,0,0,1)] = 'Cause of X'
r[(1,0,0,0)] = 'Consequence of X'
r[(0,0,1,1)] = 'Confounder'
r[(1,1,0,0)] = 'Collider'
r[(1,0,1,0)] = 'Mediator'
r[(0,0,0,0)] = 'Independent'
r[(0,0,1,0)] = 'Cause of Y'
r[(0,1,0,0)] = 'Consequence of Y'


def get_labels(A):
    """
    Classify the nodes of A as "collider", "confounder", etc., wrt the edge X→Y

    For each node i, we look at the role of i wrt X→Y, ignoring all other nodes.
    There are 8 possible cases:
    - Cause of X
    - Consequence of X
    - Confounder
    - Collider
    - Mediator
    - Independent
    - Cause of Y
    - Consequence of Y

    Caveat:
    - The notions of "confounder", "collider", etc. only make sense for small, textbook graphs.

    Input:  A: adjacency_matrix
    Output: dictionary, with the nodes as keys (excluding 'X' and 'Y'), and class as value

    """
    global r
    res = {}
    for node in A.columns:
        if node in ("X", "Y"):  # exact match rather than a substring test
            continue
        B = A.loc[('X','Y',node),:].loc[:,('X','Y',node)]
        res[node] = r[ tuple( B.values[ [0,1,2,2], [2,2,1,0] ] ) ]

    return res



def ttest(dataset, pvalue_threshold=0.05):
    variables = dataset.columns.drop(["X", "Y"])
    df = []
    for variable in variables:
        ttest_vX = ttest_rel(dataset[variable], dataset["X"])
        ttest_vY = ttest_rel(dataset[variable], dataset["Y"])
        d = {
            "variable": variable,
            "ttest(v,X)": ttest_vX.statistic,
            f"pvalue(ttest(v,X))<={pvalue_threshold}": (ttest_vX.pvalue <= pvalue_threshold).astype(float),
            "ttest(v,Y)": ttest_vY.statistic,
            f"pvalue(ttest(v,Y))<={pvalue_threshold}": (ttest_vY.pvalue <= pvalue_threshold).astype(float),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    ttest_XY = ttest_rel(dataset["X"], dataset["Y"])
    df["ttest(X,Y)"] = ttest_XY.statistic
    df[f"pvalue(ttest(X,Y))<={pvalue_threshold}"] = (ttest_XY.pvalue <= pvalue_threshold).astype(float)
    # the t-test sometimes returns NaN when the variance is 0, so we fill with 0
    df.fillna(0, inplace=True)
    return df


def pearson_correlation(dataset):
    variables = dataset.columns.drop(["X", "Y"])
    df = []
    for variable in variables:
        tmp = dataset.corr().drop([variable], axis="columns").loc[variable].abs()  # TODO: this can be computed just once at the beginning and then sliced at each iteration
        d = {
            "variable": variable,
            "corr(v,X)": dataset[[variable, "X"]].corr().loc[variable, "X"],
            "corr(v,Y)": dataset[[variable, "Y"]].corr().loc[variable, "Y"],
            "max(corr(v, others))": tmp.max(),
            "min(corr(v, others))": tmp.min(),
            "mean(corr(v, others))": tmp.mean(),
            "std(corr(v, others))": tmp.std(),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    df["corr(X,Y)"] = dataset[["X", "Y"]].corr().loc["X", "Y"]
    return df


def pearson_correlation2(dataset):
    """Same as correlation but excluding X and Y in max, min, mean, std computations.

    Note: nothing really changes in the results of the classification task.
    """
    variables = dataset.columns.drop(["X", "Y"])
    df = []
    for variable in variables:
        tmp = dataset[variables].corr().drop([variable], axis="columns").loc[variable].abs()
        d = {
            "variable": variable,
            "corr(v,X)": dataset[[variable, "X"]].corr().loc[variable, "X"],
            "corr(v,Y)": dataset[[variable, "Y"]].corr().loc[variable, "Y"],
            "max(corr(v, others))": tmp.max(),
            "min(corr(v, others))": tmp.min(),
            "mean(corr(v, others))": tmp.mean(),
            "std(corr(v, others))": tmp.std(),
        }
        if len(tmp) == 1:
            d["std(corr(v, others))"] = 0.0
        elif len(tmp) == 0:
            d["max(corr(v, others))"] = 0.0
            d["min(corr(v, others))"] = 0.0
            d["mean(corr(v, others))"] = 0.0
            d["std(corr(v, others))"] = 0.0
        else:
            pass

        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    df["corr(X,Y)"] = dataset[["X", "Y"]].corr().loc["X", "Y"]
    return df


def pearson_correlation_test(dataset, pvalue_threshold=0.05):
    """Same as correlation2 but with significance tests (pvalue <= pvalue_threshold)
    """
    variables = dataset.columns # .drop(["X", "Y"])
    corr_pvalue = pd.DataFrame(
        index=pd.MultiIndex.from_tuples(product(variables, variables)),
        columns=["correlation", "pvalue"],
    )
    for variable1, variable2 in corr_pvalue.index:
        corr_pvalue.loc[(variable1, variable2), "correlation"], corr_pvalue.loc[(variable1, variable2), "pvalue"] = pearsonr(dataset[variable1], dataset[variable2])

    correlations = corr_pvalue.reset_index().pivot(index="level_0", columns="level_1", values="correlation")
    pvalues_test = (corr_pvalue.reset_index().pivot(index="level_0", columns="level_1", values="pvalue") <= pvalue_threshold).astype(float)

    df = []
    for variable in variables.drop(["X", "Y"]):
        d = {
            "variable": variable,
            "pearson(v,X)": correlations.loc[variable, "X"],
            f"pvalue(pearson(v,X))<={pvalue_threshold}": pvalues_test.loc[variable, "X"],
            "pearson(v,Y)": correlations.loc[variable, "Y"],
            f"pvalue(pearson(v,Y))<={pvalue_threshold}": pvalues_test.loc[variable, "Y"],
            "max(pearson(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().max(),
            "min(pearson(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().min(),
            "mean(pearson(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().mean(),
            "std(pearson(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().std(),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    df["pearson(X,Y)"] = correlations.loc["X", "Y"]
    df[f"pvalue(pearson(X,Y))<={pvalue_threshold}"] = pvalues_test.loc["X", "Y"]
    # pearsonr sometimes returns NaN when the variance is 0, so we fill with 0
    df.fillna(0, inplace=True)
    return df


def spearman_correlation_test(dataset, pvalue_threshold=0.05):
    """Same as correlation2 but with significance tests (pvalue <= pvalue_threshold)
    """
    variables = dataset.columns # .drop(["X", "Y"])
    corr_pvalue = pd.DataFrame(
        # index=pd.MultiIndex.from_tuples(list(combinations(variables, 2))),
        index=pd.MultiIndex.from_tuples(product(variables, variables)),
        columns=["correlation", "pvalue"],
    )
    for variable1, variable2 in corr_pvalue.index:
        corr_pvalue.loc[(variable1, variable2), "correlation"], corr_pvalue.loc[(variable1, variable2), "pvalue"] = spearmanr(dataset[variable1], dataset[variable2])

    correlations = corr_pvalue.reset_index().pivot(index="level_0", columns="level_1", values="correlation")
    pvalues_test = (corr_pvalue.reset_index().pivot(index="level_0", columns="level_1", values="pvalue") <= pvalue_threshold).astype(float)

    df = []
    for variable in variables.drop(["X", "Y"]):
        d = {
            "variable": variable,
            "spearman(v,X)": correlations.loc[variable, "X"],
            f"pvalue(spearman(v,X))<={pvalue_threshold}": pvalues_test.loc[variable, "X"],
            "spearman(v,Y)": correlations.loc[variable, "Y"],
            f"pvalue(spearman(v,Y))<={pvalue_threshold}": pvalues_test.loc[variable, "Y"],
            "max(spearman(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().max(),
            "min(spearman(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().min(),
            "mean(spearman(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().mean(),
            "std(spearman(v, others))": correlations.drop([variable], axis="columns").loc[variable].abs().std(),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    df["spearman(X,Y)"] = correlations.loc["X", "Y"]
    df[f"pvalue(spearman(X,Y))<={pvalue_threshold}"] = pvalues_test.loc["X", "Y"]
    return df


def mutual_information(dataset):
    variables = dataset.columns.drop(["X", "Y"])
    df = []
    for variable in variables:
        tmp = mutual_info_regression(dataset.drop([variable], axis="columns"), dataset[variable])
        d = {
            "variable": variable,
            "MI(v,X)": mutual_info_regression(dataset[[variable]], dataset["X"], discrete_features=False)[0],
            "MI(v,Y)": mutual_info_regression(dataset[[variable]], dataset["Y"], discrete_features=False)[0],
            "max(MI(v, others))": tmp.max(),
            "min(MI(v, others))": tmp.min(),
            "mean(MI(v, others))": tmp.mean(),
            "std(MI(v, others))": tmp.std(),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    df["MI(X,Y)"] = mutual_info_regression(dataset[["X"]], dataset["Y"], discrete_features=False)[0]
    return df


def distance_correlation_features(dataset):
    """distance correlation between each variable and X, Y, [X,Y], and the
    rest of the variables.
    """
    variables = dataset.columns.drop(["X", "Y"])
    df = []
    for variable in variables:
        d = {
            "variable": variable,
            "dcor(v,X)": distance_correlation(dataset[[variable]], dataset["X"]),
            "dcor(v,Y)": distance_correlation(dataset[[variable]], dataset["Y"]),
            "dcor(v,[X,Y])": distance_correlation(dataset[[variable]], dataset[["X","Y"]]),
            "dcor(v,not([v,X,Y]))": distance_correlation(dataset[[variable]], dataset.drop([variable, "X","Y"], axis="columns")),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    df["dcor(X,Y)"] = distance_correlation(dataset["X"], dataset["Y"])
    return df


def energy_distance_test(dataset, pvalue_threshold=0.05):
    variables = dataset.columns.drop(["X", "Y"])
    df = []
    for variable in variables:
        energy_test_vX = energy_test(dataset[[variable]], dataset["X"])
        energy_test_vY = energy_test(dataset[[variable]], dataset["Y"])
        d = {
            "variable": variable,
            "energy_test(v,X))": energy_test_vX.statistic,
            f"pvalue(energy_test(v,X))<={pvalue_threshold}": (energy_test_vX.pvalue <= pvalue_threshold).astype(float),
            "energy_test(v,Y))": energy_test_vY.statistic,
            f"pvalue(energy_test(v,Y))<={pvalue_threshold}": (energy_test_vY.pvalue <= pvalue_threshold).astype(float),
        }
        df.append(d)

    df = pd.DataFrame(df)
    df["dataset"] = dataset.name
    energy_test_XY = energy_test(dataset[["X"]], dataset["Y"])
    df["energy_test(X,Y)"] = energy_test_XY.statistic
    df[f"pvalue(energy_test(X,Y))<={pvalue_threshold}"] = (energy_test_XY.pvalue <= pvalue_threshold).astype(float)
    return df


def label(adjacency_matrix):
    labels = get_labels(adjacency_matrix)
    variables = adjacency_matrix.columns.drop(["X", "Y"])
    df = pd.DataFrame(
        {
            "variable": variables,
            "label": [labels[variable] for variable in variables],
        }
    )
    df["dataset"] = adjacency_matrix.name
    return df


def create_some_columns(filenames, function):
    df = []
    for filename in tqdm(filenames):
        dataset_number = int(basename(filename).split(".")[0])
        dataset = pd.read_csv(filename, index_col=0 if function == label else None)  # hack to have index_col=0 for label
        dataset.name = dataset_number
        df_dataset = function(dataset)
        df_dataset["dataset"] = dataset_number
        df.append(df_dataset)

    df = pd.concat(df, axis="index").reset_index(drop=True)
    return df


def create_some_columns_parallel(filenames, function, n_jobs=-1):
    def f(filename, function):
        dataset_number = int(basename(filename).split(".")[0])
        dataset = pd.read_csv(filename, index_col=0 if function == label else None)  # hack to have index_col=0 for label
        dataset.name = dataset_number
        df_dataset = function(dataset)
        df_dataset["dataset"] = dataset_number
        return df_dataset

    df = Parallel(n_jobs=n_jobs)(delayed(f)(filename, function) for filename in tqdm(filenames))
    df = pd.concat(df, axis="index").reset_index(drop=True)
    return df


def create_all_columns(functions_filenames, n_jobs=-1):
    columns = []
    for function, filenames in functions_filenames.items():
        print(f"set: {function.__name__}")
        feature_set = create_some_columns_parallel(filenames, function, n_jobs=n_jobs)
        columns.append(feature_set)

    # Merge all feature sets into a single dataframe:
    columns = reduce(
        lambda left, right: pd.merge(
            left, right, on=["dataset", "variable"]
        ),
        columns,
    )
    return columns


def create_submission(X_test, filename="submission.csv"):
    submission_file = {}
    for name, group in tqdm(X_test.groupby("dataset")):
        variables_labels = group[["variable", "label_predicted"]].set_index("variable")
        variables = variables_labels.index.tolist()
        variables_all = ["X", "Y"] + variables
        adjacency_matrix = pd.DataFrame(index=variables_all, columns=variables_all)
        adjacency_matrix.index.name = "parent"
        adjacency_matrix[:] = 0
        adjacency_matrix.loc["X", "Y"] = 1
        for v in variables:
            l = variables_labels.loc[v].item()
            if l == "Cause of X":
                adjacency_matrix.loc[v, "X"] = 1
            elif l == "Cause of Y":
                adjacency_matrix.loc[v, "Y"] = 1
            elif l == "Consequence of X":
                adjacency_matrix.loc["X", v] = 1
            elif l == "Consequence of Y":
                adjacency_matrix.loc["Y", v] = 1
            elif l == "Confounder":
                adjacency_matrix.loc[v, "X"] = 1
                adjacency_matrix.loc[v, "Y"] = 1
            elif l == "Collider":
                adjacency_matrix.loc["X", v] = 1
                adjacency_matrix.loc["Y", v] = 1
            elif l == "Mediator":
                adjacency_matrix.loc["X", v] = 1
                adjacency_matrix.loc[v, "Y"] = 1
            elif l == "Independent":
                pass  # no edge between v and X or Y

        for i in variables_all:
            for j in variables_all:
                submission_file[f'{name:05d}_{i}_{j}'] = int(adjacency_matrix.loc[i,j])

    submission_file = pd.Series(submission_file)
    submission_file = submission_file.reset_index()
    submission_file.columns = ['example_id', 'prediction']
    print(f"Saving submission to {filename}")
    submission_file.to_csv(filename, index=False)
    return submission_file
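
The indexing trick in `get_labels` is compact, so here is a small worked example on a three-node graph. It is a sketch on toy data: the node name `Z` is an assumption, and the rows of the matrix are read as parents and the columns as children, as in `create_submission`.

```python
import pandas as pd

nodes = ["X", "Y", "Z"]
A = pd.DataFrame(0, index=nodes, columns=nodes)  # A.loc[parent, child] = 1 for an edge
A.loc["X", "Y"] = 1
A.loc["Z", "X"] = 1
A.loc["Z", "Y"] = 1

B = A.loc[nodes, nodes].to_numpy()
# Fancy indexing picks four (row, column) pairs: (X→Z, Y→Z, Z→Y, Z→X)
pattern = tuple(int(value) for value in B[[0, 1, 2, 2], [2, 2, 1, 0]])
print(pattern)
```

Here `Z` points to both `X` and `Y`, so the pattern is `(0, 0, 1, 1)`, which the dictionary `r` maps to `'Confounder'`.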

### Preprocessing and feature extraction  

In [6]:
# Define the paths for pickle files
data_paths = {
    'X_train': './data/X_train.pickle',
    'y_train': './data/y_train.pickle',
    'X_test': './data/X_test.pickle'
}

# Define the output directories for CSV files
output_dirs = {
    'X_train': './data/train/X/',
    'y_train': './data/train/y/',
    'X_test': './data/test/X/'
}

# Function to create directories if they do not exist
def create_output_directories(output_dirs):
    for dir_path in output_dirs.values():
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
            print(f"Created directory: {dir_path}")
        else:
            print(f"Directory already exists: {dir_path}")

# Create the output directories
create_output_directories(output_dirs)


# Process each pickle file
for key, pickle_path in data_paths.items():
    # Load the pickle file
    with open(pickle_path, 'rb') as file:
        data_dict = pickle.load(file)

    # Save each DataFrame in the dictionary to a CSV file
    for sub_key, df in data_dict.items():
        output_dir = output_dirs[key]
        Path(output_dir).mkdir(parents=True, exist_ok=True)  # Ensure the directory exists
        csv_path = Path(output_dir) / f'{sub_key}.csv'
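        # Keep the index only for y_train: `label` reads these files with index_col=0
        # (this assumes the adjacency matrices carry the node names in their index)
        df.to_csv(csv_path, index=(key == 'y_train'))


In [7]:
datadir = Path("data") / "train"  # same location as output_dirs above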
print("Retrieving filenames")
filenames_X = sorted(glob(str(datadir / "X" / "*.csv")))
filenames_y = sorted(glob(str(datadir / "y" / "*.csv")))

print(f"Creating Xy from {len(filenames_X)} datasets")
Xy = create_all_columns(
    {
        ttest: filenames_X,
        pearson_correlation_test: filenames_X,
        mutual_information: filenames_X,
        distance_correlation_features: filenames_X,
        energy_distance_test: filenames_X,
        label: filenames_y
    }
)
print("Adding numeric labels")
le = LabelEncoder()
Xy["y"] = le.fit_transform(Xy["label"])
# reordering columns:
Xy = Xy[["dataset", "variable"] + Xy.columns.drop(["dataset", "variable", "label", "y"]).tolist() + ["label", "y"]]
display(Xy)

Retrieving filenames
Creating Xy from 114 datasets
set: ttest


100%|██████████| 114/114 [00:01<00:00, 61.68it/s]


set: pearson_correlation_test


100%|██████████| 114/114 [00:00<00:00, 202.20it/s]


set: mutual_information


100%|██████████| 114/114 [00:03<00:00, 34.64it/s]


set: distance_correlation_features


100%|██████████| 114/114 [02:42<00:00,  1.43s/it]


set: energy_distance_test


100%|██████████| 114/114 [00:05<00:00, 19.71it/s]


set: label


100%|██████████| 114/114 [00:00<00:00, 601.21it/s]


Adding numeric labels


Unnamed: 0,dataset,variable,"ttest(v,X)","pvalue(ttest(v,X))<=0.05","ttest(v,Y)","pvalue(ttest(v,Y))<=0.05","ttest(X,Y)","pvalue(ttest(X,Y))<=0.05","pearson(v,X)","pvalue(pearson(v,X))<=0.05",...,"dcor(v,not([v,X,Y]))","dcor(X,Y)","energy_test(v,X))","pvalue(energy_test(v,X))<=0.05","energy_test(v,Y))","pvalue(energy_test(v,Y))<=0.05","energy_test(X,Y)","pvalue(energy_test(X,Y))<=0.05",label,y
0,3,2,-1.986131e-16,0.0,0.000000e+00,0.0,-3.006382e-16,0.0,-0.917882,1.0,...,0.852501,0.872689,1.665335e-13,0.0,-1.665335e-13,0.0,-1.110223e-13,0.0,Cause of X,0
1,3,3,9.936606e-17,0.0,1.137520e-16,0.0,-3.006382e-16,0.0,-0.915585,1.0,...,0.852501,0.872689,2.220446e-13,0.0,1.110223e-13,0.0,-1.110223e-13,0.0,Collider,2
2,10,0,-3.270930e-17,0.0,9.676273e-17,0.0,3.437833e-17,0.0,-0.104880,1.0,...,0.543819,0.095028,0.000000e+00,0.0,-5.551115e-14,0.0,-1.665335e-13,0.0,Cause of Y,1
3,10,1,-4.584965e-17,0.0,1.750369e-16,0.0,3.437833e-17,0.0,0.121371,1.0,...,0.359845,0.095028,5.551115e-14,0.0,-1.110223e-13,0.0,-1.665335e-13,0.0,Cause of Y,1
4,10,3,2.926888e-17,0.0,6.687996e-17,0.0,3.437833e-17,0.0,-0.379893,1.0,...,0.205578,0.095028,-1.110223e-13,0.0,5.551115e-14,0.0,-1.665335e-13,0.0,Cause of X,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
713,406,7,2.140600e-16,0.0,2.100344e-16,0.0,1.994522e-16,0.0,0.896808,1.0,...,0.971216,0.878168,-3.330669e-13,0.0,-1.665335e-13,0.0,-1.665335e-13,0.0,Consequence of X,4
714,409,0,-2.139994e-16,0.0,-3.002144e-16,0.0,1.780548e-16,0.0,0.586998,1.0,...,0.414781,0.739884,1.665335e-13,0.0,-3.885781e-13,0.0,2.220446e-13,0.0,Confounder,3
715,409,1,-2.958868e-17,0.0,-1.166739e-16,0.0,1.780548e-16,0.0,0.662444,1.0,...,0.564421,0.739884,3.330669e-13,0.0,0.000000e+00,0.0,2.220446e-13,0.0,Mediator,7
716,409,2,-2.526721e-16,0.0,-1.281781e-16,0.0,1.780548e-16,0.0,-0.185013,1.0,...,0.223753,0.739884,1.110223e-13,0.0,0.000000e+00,0.0,2.220446e-13,0.0,Independent,6


### Extracting `X`, `y`, and grouping

Groups are essential for cross-validation: we do not want some variables of a dataset to be in the training set while other variables of the same dataset are in the test set, because this would artificially inflate the results.
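
As a hedged illustration of what group-aware splitting does, the sketch below uses scikit-learn's `GroupKFold` on toy arrays (the values and dataset ids are made up). Each fold's test set contains whole datasets only, so no dataset appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 rows (variables) that belong to 3 datasets.
X_toy = np.arange(12, dtype=float).reshape(6, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1])
groups_toy = np.array([3, 3, 10, 10, 12, 12])

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X_toy, y_toy, groups_toy):
    train_groups = set(groups_toy[train_idx].tolist())
    test_groups = set(groups_toy[test_idx].tolist())
    overlaps.append(train_groups & test_groups)
    print("train datasets:", sorted(train_groups), "test datasets:", sorted(test_groups))
```

Note that a splitter has to use the `groups` argument for it to matter: the plain and stratified k-fold splitters accept it but ignore it.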

In [8]:
print("Extracting X, y, and group")
X = Xy.drop(["variable", "dataset", "label", "y"], axis="columns")
y = Xy["y"]
group = Xy["dataset"]
display(X)
display(y)
display(group)

Extracting X, y, and group


Unnamed: 0,"ttest(v,X)","pvalue(ttest(v,X))<=0.05","ttest(v,Y)","pvalue(ttest(v,Y))<=0.05","ttest(X,Y)","pvalue(ttest(X,Y))<=0.05","pearson(v,X)","pvalue(pearson(v,X))<=0.05","pearson(v,Y)","pvalue(pearson(v,Y))<=0.05",...,"dcor(v,Y)","dcor(v,[X,Y])","dcor(v,not([v,X,Y]))","dcor(X,Y)","energy_test(v,X))","pvalue(energy_test(v,X))<=0.05","energy_test(v,Y))","pvalue(energy_test(v,Y))<=0.05","energy_test(X,Y)","pvalue(energy_test(X,Y))<=0.05"
0,-1.986131e-16,0.0,0.000000e+00,0.0,-3.006382e-16,0.0,-0.917882,1.0,0.792132,1.0,...,0.810337,0.910528,0.852501,0.872689,1.665335e-13,0.0,-1.665335e-13,0.0,-1.110223e-13,0.0
1,9.936606e-17,0.0,1.137520e-16,0.0,-3.006382e-16,0.0,-0.915585,1.0,0.977161,1.0,...,0.975479,0.974374,0.852501,0.872689,2.220446e-13,0.0,1.110223e-13,0.0,-1.110223e-13,0.0
2,-3.270930e-17,0.0,9.676273e-17,0.0,3.437833e-17,0.0,-0.104880,1.0,0.494988,1.0,...,0.529882,0.452116,0.543819,0.095028,0.000000e+00,0.0,-5.551115e-14,0.0,-1.665335e-13,0.0
3,-4.584965e-17,0.0,1.750369e-16,0.0,3.437833e-17,0.0,0.121371,1.0,0.035420,0.0,...,0.132359,0.141118,0.359845,0.095028,5.551115e-14,0.0,-1.110223e-13,0.0,-1.665335e-13,0.0
4,2.926888e-17,0.0,6.687996e-17,0.0,3.437833e-17,0.0,-0.379893,1.0,-0.057123,0.0,...,0.079846,0.471113,0.205578,0.095028,-1.110223e-13,0.0,5.551115e-14,0.0,-1.665335e-13,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
713,2.140600e-16,0.0,2.100344e-16,0.0,1.994522e-16,0.0,0.896808,1.0,0.892814,1.0,...,0.892603,0.920827,0.971216,0.878168,-3.330669e-13,0.0,-1.665335e-13,0.0,-1.665335e-13,0.0
714,-2.139994e-16,0.0,-3.002144e-16,0.0,1.780548e-16,0.0,0.586998,1.0,0.527831,1.0,...,0.525221,0.602685,0.414781,0.739884,1.665335e-13,0.0,-3.885781e-13,0.0,2.220446e-13,0.0
715,-2.958868e-17,0.0,-1.166739e-16,0.0,1.780548e-16,0.0,0.662444,1.0,0.804614,1.0,...,0.777567,0.754745,0.564421,0.739884,3.330669e-13,0.0,0.000000e+00,0.0,2.220446e-13,0.0
716,-2.526721e-16,0.0,-1.281781e-16,0.0,1.780548e-16,0.0,-0.185013,1.0,-0.151198,1.0,...,0.141962,0.169959,0.223753,0.739884,1.110223e-13,0.0,0.000000e+00,0.0,2.220446e-13,0.0


0      0
1      2
2      1
3      1
4      0
      ..
713    4
714    3
715    7
716    6
717    4
Name: y, Length: 718, dtype: int64

0        3
1        3
2       10
3       10
4       10
      ... 
713    406
714    409
715    409
716    409
717    409
Name: dataset, Length: 718, dtype: int64

## Train

The `train` function trains, evaluates, and saves a list of machine learning models using cross-validation. 

In [9]:
def train(models, X, y, groups=None, model_directory_path="models"):
    """
    Train, evaluate, and save a list of models using cross-validation.

    Parameters:
    - models: List of models to be evaluated.
    - X: Feature matrix.
    - y: Target vector.
    - groups: Optional array of group labels for group-based cross-validation.
    - model_directory_path: Directory where the models will be saved.

    Returns:
    - None
    """
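    from sklearn.model_selection import GroupKFold  # imported here to keep this cell self-contained

    os.makedirs(model_directory_path, exist_ok=True)

    for model in models:
        # Fit the model on the entire dataset
        model.fit(X, y)

        # Evaluate the model. With its default splitter, cross_val_score ignores `groups`,
        # so GroupKFold is passed explicitly to keep each dataset inside a single fold.
        cv = GroupKFold(n_splits=5) if groups is not None else 5
        results = cross_val_score(model, X, y, groups=groups, cv=cv, verbose=True, scoring="balanced_accuracy")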
        print(f"{model}: mean balanced accuracy = {results.mean():.4f}")

        # Save the model
        model_name = type(model).__name__  # Get the name of the model class
        model_file_path = os.path.join(model_directory_path, f"{model_name}.joblib")
        joblib.dump(model, model_file_path)
        print(f"Model saved to {model_file_path}")

# Define the models
models = [
    make_pipeline(StandardScaler(), RidgeClassifierCV(class_weight="balanced")),
    RandomForestClassifier(n_estimators=100, max_depth=3, n_jobs=-1, class_weight="balanced"),
    RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=-1, class_weight="balanced"),
    RandomForestClassifier(n_estimators=100, max_depth=7, n_jobs=-1, class_weight="balanced"),
    RandomForestClassifier(n_estimators=100, max_depth=11, n_jobs=-1, class_weight="balanced"),
    RandomForestClassifier(n_estimators=100, max_depth=13, n_jobs=-1, class_weight="balanced"),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, class_weight="balanced"),
    DecisionTreeClassifier(class_weight="balanced", max_depth=3),
    DecisionTreeClassifier(class_weight="balanced", max_depth=5),
    DecisionTreeClassifier(class_weight="balanced", max_depth=7),
    DecisionTreeClassifier(class_weight="balanced", max_depth=11),
]

train(models, X, y, groups=group)




Pipeline(steps=[('standardscaler', StandardScaler()),
                ('ridgeclassifiercv',
                 RidgeClassifierCV(class_weight='balanced'))]): mean balanced accuracy = 0.2708
Model saved to models/Pipeline.joblib




RandomForestClassifier(class_weight='balanced', max_depth=3, n_jobs=-1): mean balanced accuracy = 0.2931
Model saved to models/RandomForestClassifier.joblib




RandomForestClassifier(class_weight='balanced', max_depth=5, n_jobs=-1): mean balanced accuracy = 0.2949
Model saved to models/RandomForestClassifier.joblib




RandomForestClassifier(class_weight='balanced', max_depth=7, n_jobs=-1): mean balanced accuracy = 0.3238
Model saved to models/RandomForestClassifier.joblib




RandomForestClassifier(class_weight='balanced', max_depth=11, n_jobs=-1): mean balanced accuracy = 0.2785
Model saved to models/RandomForestClassifier.joblib




RandomForestClassifier(class_weight='balanced', max_depth=13, n_jobs=-1): mean balanced accuracy = 0.2673
Model saved to models/RandomForestClassifier.joblib




RandomForestClassifier(class_weight='balanced', n_jobs=-1): mean balanced accuracy = 0.2458
Model saved to models/RandomForestClassifier.joblib
DecisionTreeClassifier(class_weight='balanced', max_depth=3): mean balanced accuracy = 0.2632
Model saved to models/DecisionTreeClassifier.joblib
DecisionTreeClassifier(class_weight='balanced', max_depth=5): mean balanced accuracy = 0.2987
Model saved to models/DecisionTreeClassifier.joblib
DecisionTreeClassifier(class_weight='balanced', max_depth=7): mean balanced accuracy = 0.2964
Model saved to models/DecisionTreeClassifier.joblib




DecisionTreeClassifier(class_weight='balanced', max_depth=11): mean balanced accuracy = 0.2588
Model saved to models/DecisionTreeClassifier.joblib
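
The log above shows several models being written to the same file, because the file name is derived from the class name alone, so only the last model of each class survives on disk. One possible way to keep them apart is sketched below. The helper `unique_model_paths` and the stand-in class `DummyModel` are hypothetical and only illustrate the naming scheme.

```python
import os


def unique_model_paths(models, model_directory_path="models"):
    """Build one file path per model by prefixing the class name with its position."""
    paths = []
    for index, model in enumerate(models):
        model_name = type(model).__name__
        paths.append(os.path.join(model_directory_path, f"{index:02d}_{model_name}.joblib"))
    return paths


class DummyModel:  # stand-in so the sketch runs without scikit-learn
    pass


paths = unique_model_paths([DummyModel(), DummyModel()])
print(paths)
```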


## Infer

The infer function that will be called by the Crunch platform is defined below.



In [14]:
def infer(model_directory_path="models", X_test=None, le=None):
    """
    Load models from the specified directory and make predictions on the test data.

    Parameters:
    - model_directory_path: Directory where the models are saved.
    - X_test: Ignored on input; the test features are rebuilt from the test CSV files.
    - le: LabelEncoder instance used for transforming labels.

    Returns:
    - predictions: Dictionary with model names as keys and predictions as values.
    - X_test: DataFrame with predictions as new columns.
    """
    # Same location as output_dirs['X_test'] in the preprocessing cell
    filenames_test_X = sorted(glob(str(Path("data") / "test" / "X" / "*.csv")))
    print(f"Creating X_test from {len(filenames_test_X)} datasets")
    X_test = create_all_columns(
        {
            ttest: filenames_test_X,
            pearson_correlation_test: filenames_test_X,
            mutual_information: filenames_test_X,
            distance_correlation_features: filenames_test_X,
            energy_distance_test: filenames_test_X,
        }
    )

    predictions = {}
    
    # Check if the model directory exists
    if not os.path.exists(model_directory_path):
        raise FileNotFoundError(f"Model directory {model_directory_path} does not exist.")
    
    # Get a list of all model files in the directory
    model_files = [f for f in os.listdir(model_directory_path) if f.endswith(".joblib")]
    
    for model_file in model_files:
        # Load the model
        model_name = os.path.splitext(model_file)[0]
        model_path = os.path.join(model_directory_path, model_file)
        
        try:
            model = joblib.load(model_path)
            
            # Make predictions
            if X_test is not None:
                # Also drop prediction columns added by a previously loaded model
                X_test_features = X_test.drop(["dataset", "variable", "y_predicted", "label_predicted"], axis="columns", errors='ignore')
                try:
                    y_predicted = model.predict(X_test_features)
                    predictions[model_name] = y_predicted
                    
                    # Add predictions to X_test
                    X_test["y_predicted"] = y_predicted
                    
                    if le is not None:
                        X_test["label_predicted"] = le.inverse_transform(y_predicted)
                    else:
                        print("LabelEncoder instance is not provided. Predictions will not be inverse-transformed.")
                        
                except NotFittedError:
                    print(f"Model {model_name} is not fitted.")
                    predictions[model_name] = None
            else:
                raise ValueError("X_test must be provided for predictions.")
        
        except Exception as e:
            print(f"An error occurred with model {model_name}: {e}")
            predictions[model_name] = None
    
    # Display the updated DataFrame
    print(X_test.head())
    
    return predictions, X_test
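
`infer` depends on the fitted `LabelEncoder` to turn numeric predictions back into class names. The sketch below shows that round trip on the eight labels used in this notebook; the variable name `le_demo` is an assumption, chosen so as not to clash with the notebook's `le`.

```python
from sklearn.preprocessing import LabelEncoder

labels = [
    "Cause of X", "Consequence of X", "Confounder", "Collider",
    "Mediator", "Independent", "Cause of Y", "Consequence of Y",
]
le_demo = LabelEncoder().fit(labels)

# Classes are stored in sorted order, which fixes the integer codes.
print(dict(zip(le_demo.classes_, range(len(le_demo.classes_)))))

codes = le_demo.transform(["Confounder", "Mediator"])
decoded = le_demo.inverse_transform([0, 3, 7])
print(codes, decoded)
```

The resulting codes agree with the `label` and `y` columns displayed earlier (for example `Confounder` is 3 and `Mediator` is 7), which is why the same encoder instance must be reused at inference time.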

## Creating the submission file

In [15]:
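# Run inference first so that X_test carries the "label_predicted" column
predictions, X_test = infer(le=le)
submission = create_submission(X_test, filename="supervised_baseline.csv")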

100%|██████████| 5/5 [00:00<00:00, 391.70it/s]

Saving submission to supervised_baseline.csv
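
For reference, the layout of the submission file can be sketched without pandas. The helper `submission_rows` is hypothetical; it mirrors the `example_id` format used in `create_submission` (zero-padded dataset id, parent, child) on a made-up three-node graph.

```python
def submission_rows(dataset_id, nodes, edges):
    """Flatten a graph into (example_id, prediction) pairs.

    `edges` is a set of (parent, child) tuples; every ordered pair of nodes
    gets one row, with prediction 1 if the edge is present and 0 otherwise.
    """
    rows = []
    for parent in nodes:
        for child in nodes:
            example_id = f"{dataset_id:05d}_{parent}_{child}"
            rows.append((example_id, int((parent, child) in edges)))
    return rows


rows = submission_rows(3, ["X", "Y", "2"], {("X", "Y"), ("2", "X")})
for example_id, prediction in rows:
    print(example_id, prediction)
```

A dataset with `n` nodes therefore contributes `n * n` rows, including the diagonal entries.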





## Test your model


In [13]:
#crunch.test()