# Figure 4 analysis

Co-annotation prediction, module detection, and gene function prediction evaluation code can be run using the **BIONIC-evals** library from [this repository](https://github.com/duncster94/BIONIC-evals). The README provides instructions for installing and running the library. A config file is provided to replicate the performance evaluations from Fig. 4.

**NOTE:** The module detection and gene function prediction evaluations are computationally intensive and take a long time to run. I am happy to implement multiprocessing support to speed these up. If you are interested in this feature, please open an [issue](https://github.com/duncster94/BIONIC-evals/issues).

This notebook contains the dataset splitting procedure used to generate the training and corresponding testing sets for the co-annotation prediction, module detection, and gene function prediction evaluations.

**NOTE:** Due to the size of the co-annotation files and GitHub's file size constraints, I have opted to show the splitting process for IntAct only. The train/test splitting process detailed in this notebook is identical for GO and KEGG, however, and can be applied to those standards with the corresponding input files (available [here](https://figshare.com/articles/dataset/Evaluation_Standards/16629139)).

In [1]:
import math
import json
import numpy as np
import pandas as pd
from pathlib import Path

output_path = Path("../data/standards")

Define co-annotation standard splitting function.

In [2]:
def split_coannotation(standard: pd.DataFrame, test_genes: set, trial: int, name: str, save: bool = False):
    print("Splitting co-annotation...")
    test_standard = []

    for row in standard.itertuples():
        gene_1, gene_2 = row[1], row[2]
        if gene_1 in test_genes or gene_2 in test_genes:
            test_standard.append(row[1:])

    test_standard = pd.DataFrame(test_standard)

    if save:
        test_standard.to_csv(
            output_path / f"{name}-coannotation-test-{trial}.csv", index=False, header=False
        )
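
The row-by-row filter above keeps every gene pair in which at least one gene is a test gene. As a rough sketch, the same rule can be written with vectorized pandas operations. The helper name `filter_test_pairs`, the toy gene names, and the third (label) column are assumptions made for illustration only; the sketch only relies on the first two columns holding the gene pair, as in `split_coannotation`.

```python
import pandas as pd


def filter_test_pairs(standard: pd.DataFrame, test_genes: set) -> pd.DataFrame:
    """Keep rows whose first or second column contains a test gene."""
    test_list = list(test_genes)
    in_test = standard.iloc[:, 0].isin(test_list) | standard.iloc[:, 1].isin(test_list)
    return standard[in_test].reset_index(drop=True)


# Toy standard with made-up gene names: gene 1, gene 2, hypothetical label
toy = pd.DataFrame([["A", "B", 1], ["C", "D", 0], ["B", "E", 1]])
test_pairs = filter_test_pairs(toy, {"B"})
print(test_pairs)
```

On large co-annotation files this form may be noticeably faster than iterating with `itertuples`, although that has not been measured here.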

Define module standard splitting function.

In [3]:
def split_module(standard: dict, test_genes: set, trial: int, name: str, save: bool = False):
    print("Splitting module...")

    test_standard = {}

    # Keep a module, with all of its genes, if at least one member is a test gene.
    for module, genes in standard.items():
        genes_ = set(genes)
        if len(genes_.intersection(test_genes)) > 0:
            test_standard[module] = genes

    if save:
        with (output_path / f"{name}-module-test-{trial}.json").open("w") as f:
            json.dump(test_standard, f)
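
To make the module rule concrete, here is a small sketch on toy data. The helper name `modules_touching_test` and the module and gene names are made up for illustration; the selection rule mirrors `split_module`.

```python
def modules_touching_test(standard: dict, test_genes: set) -> dict:
    # Same rule as split_module: keep a module, with all of its genes,
    # if at least one member is a test gene.
    return {module: genes for module, genes in standard.items() if set(genes) & test_genes}


toy_modules = {"complex_1": ["A", "B", "C"], "complex_2": ["D", "E"]}
kept = modules_touching_test(toy_modules, {"C"})
print(kept)
```

Note that a retained module is kept whole, so it can still contain genes that are not in the test set (here `A` and `B`).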

Define gene function prediction splitting function. This function also produces the training set (used by BIONIC) for the given split.

In [4]:
def split_supervised(standard: dict, test_genes: set, trial: int, name: str, save: bool = False):
    print("Splitting supervised...")

    test_standard = {gene: complexes for gene, complexes in standard.items() if gene in test_genes}
    train_standard = {gene: complexes for gene, complexes in standard.items() if gene not in test_genes}

    if save:
        with (output_path / f"{name}-supervised-test-{trial}.json").open("w") as f:
            json.dump(test_standard, f)
        with (output_path / f"{name}-supervised-train-{trial}.json").open("w") as f:
            json.dump(train_standard, f)
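
Unlike the co-annotation and module splits, the gene function prediction split is a strict partition of genes: every gene lands in exactly one of the train or test dictionaries. A minimal sketch on toy labels illustrates this; `partition_labels` and the label names are hypothetical and only mirror the two dictionary comprehensions in `split_supervised`.

```python
def partition_labels(standard: dict, test_genes: set):
    """Split a gene -> labels mapping into disjoint train and test mappings."""
    test = {gene: labels for gene, labels in standard.items() if gene in test_genes}
    train = {gene: labels for gene, labels in standard.items() if gene not in test_genes}
    return train, test


toy_labels = {"A": ["complex_1"], "B": ["complex_1", "complex_2"], "C": ["complex_2"]}
train, test = partition_labels(toy_labels, {"B"})

# The two mappings never share a gene, and together they cover the standard.
assert not set(train) & set(test)
assert set(train) | set(test) == set(toy_labels)
```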

Import IntAct standards and run the splitting functions. To save the outputs of this notebook, pass `save=True` to each splitting function.
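
The splitting loop in this notebook shuffles the gene list with `np.random.shuffle` and does not fix a random seed, so re-running it produces different splits each time. If repeatable splits are needed, one option is sketched below. The function name `make_splits`, the seed value, and the toy gene names are assumptions for illustration, and this is not the procedure used to generate the published splits.

```python
import math

import numpy as np


def make_splits(genes, n_trials=10, test_size=0.2, seed=0):
    """Return a list of (train_genes, test_genes) set pairs, repeatable for a given seed."""
    rng = np.random.default_rng(seed)
    ordered = sorted(genes)  # fixed starting order, independent of set iteration order
    train_size = math.floor((1 - test_size) * len(ordered))
    splits = []
    for _ in range(n_trials):
        shuffled = rng.permutation(ordered)
        splits.append((set(shuffled[:train_size]), set(shuffled[train_size:])))
    return splits


toy_genes = [f"g{i}" for i in range(10)]
splits = make_splits(toy_genes, n_trials=3, seed=42)
for train_genes, test_genes in splits:
    print(len(train_genes), len(test_genes))
```

Sorting the genes first matters because the notebook builds its gene list from a `set`, whose iteration order for strings can differ between Python sessions.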

In [5]:
n_trials = 10
test_size = 0.2  # 20% test size

coannotation = pd.read_csv(output_path / "yeast-IntAct-complex-coannotation.csv", header=None)

with (output_path / "yeast-IntAct-complex-modules.json").open("r") as f:
    module = json.load(f)

with (output_path / "yeast-IntAct-complex-labels.json").open("r") as f:
    supervised = json.load(f)  # gene function prediction standard

genes = list(set(supervised.keys()))

for trial in range(n_trials):
    print(f"Trial: {trial}")
    shuffled_genes = np.array(genes)
    np.random.shuffle(shuffled_genes)
    train_size = math.floor((1 - test_size) * len(shuffled_genes))

    train_genes = shuffled_genes[:train_size]
    test_genes = set(shuffled_genes[train_size:])

    # pass `save=True` to save results
    split_coannotation(coannotation, test_genes=test_genes, trial=trial, name="IntAct")
    split_module(module, test_genes=test_genes, trial=trial, name="IntAct")
    split_supervised(supervised, test_genes=test_genes, trial=trial, name="IntAct")

    print("\n")

Trial: 0
Splitting co-annotation...
Splitting module...
Splitting supervised...


Trial: 1
Splitting co-annotation...
Splitting module...
Splitting supervised...


