# Tutorial for Usage of DIGEST

## Setup

In [1]:
import os
import sys
import json
import pandas as pd
# ==== import single validation script ====
from pathlib import Path
sys.path.append(str(Path.cwd().parent))
from single_validation import single_validation



## ID versus Set

Compare a target set of genes against a single disease ID.<br>
Note: With `runs=1000`, the runtime is around 00:00:15 (hh:mm:ss).

### Set input parameters

In [2]:
ref_id = "0007079"
ref_id_type = "mondo"
tar_set = "input/target_gene_set.txt"
tar_id_type = "uniprot"
mode = "id-set"
out_dir = "results/"
runs = 1000

### Run script

In [3]:
single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, ref=ref_id, ref_id=ref_id_type,
                  enriched=False, out_dir=out_dir, runs=runs)

[00:00:00|148.45MB] Starting validation ...
[00:00:00|148.45MB] Load mappings for input into cache ...
querying 1-26...done.
Finished.
1 input query terms found dup hits:
	[('P34998', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
querying 1-25...done.
Finished.
[00:00:04|438.30MB] Validation of input ...
[00:00:04|438.30MB] Validation of random runs ...
[00:00:18|438.30MB] Calculating p-values ...
[00:00:18|438.30MB] Save files
[00:00:18|438.30MB] Finished validation


### Inspect results

In [4]:
with open(out_dir+'digest_'+mode+'_result.json', 'r') as f:
    data = json.load(f)
pd.DataFrame.from_dict(data)

| | input_values | p_values |
|---|---|---|
| pathway.kegg | 0.365 | 1.0 |


The `input_values` column holds the raw values for the input set; the `p_values` column holds the p-values of the input computed against the 1000 random runs.
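With 1000 random runs, p-values of this kind are commonly estimated as `(k + 1) / (runs + 1)`, where `k` counts the random runs that score at least as well as the input. Under that estimate the smallest attainable value is 1/1001 ≈ 0.000999 and the largest is 1.0. The sketch below illustrates the idea only: the function name `empirical_p_value`, the "greater is better" direction, and the synthetic random scores are assumptions, not DIGEST's actual implementation.

```python
import numpy as np


def empirical_p_value(observed, random_scores, greater_is_better=True):
    """Empirical p-value with the +1 correction: (k + 1) / (n + 1).

    k is the number of random scores that are at least as extreme as
    the observed score; n is the number of random runs.
    """
    random_scores = np.asarray(random_scores, dtype=float)
    if greater_is_better:
        k = np.count_nonzero(random_scores >= observed)
    else:
        k = np.count_nonzero(random_scores <= observed)
    return (k + 1) / (random_scores.size + 1)


# Synthetic stand-in for the scores of 1000 random runs (not DIGEST output).
rng = np.random.default_rng(0)
random_scores = rng.uniform(0.0, 0.5, size=1000)

p_best = empirical_p_value(0.9, random_scores)    # no random run reaches 0.9
p_worst = empirical_p_value(-1.0, random_scores)  # every random run reaches -1.0

print(round(p_best, 6), p_worst)  # 0.000999 1.0
```

The `+1` in numerator and denominator keeps the estimate from ever being exactly zero, which is why a fixed number of runs puts a floor on the reported p-value.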

## Set versus Set

Compare a target set of genes against a reference set of genes.<br>
Note: With `runs=1000`, the runtime is around 00:00:20 (hh:mm:ss).

### Set input parameters

In [5]:
ref_set = "input/0007079_reference_gene_set.txt"
ref_id_type = "uniprot"
tar_set = "input/target_gene_set.txt"
tar_id_type = "uniprot"
mode = "set-set"
out_dir = "results/"
runs = 1000

### Non-Enriched setup
Use all attribute values mapped to the reference IDs.

### Run script

In [6]:
single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, ref=ref_set, ref_id=ref_id_type, 
                  out_dir=out_dir, runs=runs, enriched=False)

[00:00:00|438.30MB] Starting validation ...
[00:00:00|438.30MB] Load mappings for input into cache ...
querying 1-26...done.
Finished.
1 input query terms found dup hits:
	[('P34998', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
querying 1-25...done.
Finished.
[00:00:04|458.66MB] Validation of input ...
[00:00:04|458.66MB] Validation of random runs ...
[00:00:19|458.66MB] Calculating p-values ...
[00:00:19|458.66MB] Save files
[00:00:19|458.66MB] Finished validation


### Inspect results

In [7]:
with open(out_dir+'digest_'+mode+'_result.json', 'r') as f:
    data = json.load(f)
pd.DataFrame.from_dict(data)

| | input_values | p_values |
|---|---|---|
| go.BP | 0.56 | 1.0 |
| go.CC | 0.91 | 1.0 |
| go.MF | 0.905 | 1.0 |
| pathway.kegg | 0.365 | 1.0 |


The `input_values` column holds the raw values for the input set; the `p_values` column holds the p-values of the input computed against the 1000 random runs.

### Enriched setup
Use only the enriched attribute values mapped to the reference IDs.

### Run script

In [8]:
single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, ref=ref_set, ref_id=ref_id_type, 
                  out_dir=out_dir, runs=runs, enriched=True)

[00:00:00|458.66MB] Starting validation ...
[00:00:00|458.66MB] Load mappings for input into cache ...
querying 1-26...done.
Finished.
1 input query terms found dup hits:
	[('P34998', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
querying 1-25...done.
Finished.
[00:00:21|478.18MB] Validation of input ...
[00:00:21|478.18MB] Validation of random runs ...
[00:00:38|478.18MB] Calculating p-values ...
[00:00:38|478.18MB] Save files
[00:00:38|478.18MB] Finished validation


### Inspect results

In [9]:
with open(out_dir+'digest_'+mode+'_result.json', 'r') as f:
    data = json.load(f)
pd.DataFrame.from_dict(data)

| | input_values | p_values |
|---|---|---|
| go.BP | 0.365 | 1.0 |
| go.CC | 0.0 | 1.0 |
| go.MF | 0.085 | 0.979021 |
| pathway.kegg | 0.365 | 1.0 |


The `input_values` column holds the raw values for the input set; the `p_values` column holds the p-values of the input computed against the 1000 random runs.
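Both set-set runs use the same `out_dir` and `mode`, so the inspection cells read the same file path, and the later run presumably replaces the earlier file. To compare the two setups it can help to keep each result in its own variable. The sketch below does this with values copied from the two result tables in this section; the variable names are illustrative only.

```python
import pandas as pd

# Values copied from the non-enriched set-set result table.
non_enriched = {
    "input_values": {"go.BP": 0.56, "go.CC": 0.91, "go.MF": 0.905, "pathway.kegg": 0.365},
    "p_values": {"go.BP": 1.0, "go.CC": 1.0, "go.MF": 1.0, "pathway.kegg": 1.0},
}
# Values copied from the enriched set-set result table.
enriched = {
    "input_values": {"go.BP": 0.365, "go.CC": 0.0, "go.MF": 0.085, "pathway.kegg": 0.365},
    "p_values": {"go.BP": 1.0, "go.CC": 1.0, "go.MF": 0.979021, "pathway.kegg": 1.0},
}

non_enriched_df = pd.DataFrame.from_dict(non_enriched)
enriched_df = pd.DataFrame.from_dict(enriched)

# Side-by-side view: the top column level names the setup.
comparison = pd.concat(
    {"non_enriched": non_enriched_df, "enriched": enriched_df}, axis=1
)

# Change in the raw input value when switching to the enriched setup.
delta = enriched_df["input_values"] - non_enriched_df["input_values"]

print(comparison)
print(delta)
```

In a real session, the two dictionaries would come from `json.load` calls made right after each run, before the next run writes its result.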

## Set itself

Validate a target set of genes on its own, without a reference ID or reference set.<br>
Note: With `runs=1000`, the runtime is around 00:00:51 (hh:mm:ss).

### Set input parameters

In [10]:
tar_set = "input/target_gene_set.txt"
tar_id_type = "uniprot"
mode = "set"
out_dir = "results/"
runs = 1000

### Run script

In [11]:
single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, out_dir=out_dir, runs=runs)

[00:00:00|478.18MB] Starting validation ...
[00:00:00|478.18MB] Load mappings for input into cache ...
[00:00:01|482.74MB] Load distances for input into cache ...
querying 1-26...done.
Finished.
1 input query terms found dup hits:
	[('P34998', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
querying 1-25...done.
Finished.
[00:00:09|1717.99MB] Validation of input ...
[00:00:19|1760.62MB] Validation of random runs ...
[00:00:50|1765.59MB] Calculating p-values ...
[00:00:50|1765.59MB] Save files
[00:00:50|1765.59MB] Finished validation


### Inspect results

In [12]:
with open(out_dir+'digest_'+mode+'_result.json', 'r') as f:
    data = json.load(f)
pd.DataFrame.from_dict(data)

| | input_values | p_values |
|---|---|---|
| go.BP | 0.017596 | 0.000999 |
| go.CC | 0.104296 | 0.000999 |
| go.MF | 0.107253 | 0.067932 |
| pathway.kegg | 0.029286 | 0.000999 |


The `input_values` column holds the raw values for the input set; the `p_values` column holds the p-values of the input computed against the 1000 random runs.
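A common next step is to flag the attributes whose p-value falls below a chosen threshold. The sketch below uses the values from the result table in this section; the threshold `ALPHA = 0.05` is a conventional choice assumed here, not something prescribed by DIGEST.

```python
import pandas as pd

# Values copied from the "Set itself" result table.
data = {
    "input_values": {
        "go.BP": 0.017596,
        "go.CC": 0.104296,
        "go.MF": 0.107253,
        "pathway.kegg": 0.029286,
    },
    "p_values": {
        "go.BP": 0.000999,
        "go.CC": 0.000999,
        "go.MF": 0.067932,
        "pathway.kegg": 0.000999,
    },
}

ALPHA = 0.05  # assumed significance threshold

df = pd.DataFrame.from_dict(data)
df["significant"] = df["p_values"] < ALPHA

# Attribute names whose p-value is below the threshold.
significant_attributes = df.index[df["significant"]].tolist()

print(df)
print(significant_attributes)  # ['go.BP', 'go.CC', 'pathway.kegg']
```

With these numbers, `go.MF` (p = 0.067932) is the only attribute that does not pass the assumed threshold.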

## Cluster itself

Validate a target clustering of diseases using the Dunn index and the silhouette score. The random runs are cluster-size-preserving perturbations of the cluster assignments.<br>
Note: With `runs=1000`, the runtime is around 00:02:45 (hh:mm:ss).
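A cluster-size-preserving perturbation can be pictured as shuffling the cluster labels across the items: every cluster keeps its size, but its members change. The sketch below shows this together with the textbook definition of the Dunn index (smallest between-cluster distance divided by largest within-cluster distance) on a toy distance matrix. The function names and the toy data are assumptions for illustration; DIGEST's own implementation may differ.

```python
import numpy as np


def perturb_assignments(labels, rng):
    """Return a shuffled copy of the labels; each cluster keeps its size."""
    return rng.permutation(np.asarray(labels))


def dunn_index(dist, labels):
    """Textbook Dunn index on a precomputed, symmetric distance matrix."""
    labels = np.asarray(labels)
    same_cluster = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    within = dist[same_cluster & off_diagonal]   # distances inside clusters
    between = dist[~same_cluster]                # distances across clusters
    if within.size == 0 or between.size == 0 or within.max() == 0:
        return float("nan")
    return between.min() / within.max()


# Toy data: two well-separated groups on a line (not DIGEST data).
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(points[:, None] - points[None, :])
labels = np.array([0, 0, 0, 1, 1, 1])

rng = np.random.default_rng(42)
shuffled = perturb_assignments(labels, rng)

original_di = dunn_index(dist, labels)
shuffled_di = dunn_index(dist, shuffled)

print("cluster sizes before:", np.bincount(labels))
print("cluster sizes after: ", np.bincount(shuffled))
print("Dunn index, original labels:", original_di)
print("Dunn index, shuffled labels:", shuffled_di)
```

Repeating the shuffle many times yields a background distribution of scores against which the score of the original assignment can be compared.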

### Set input parameters

In [13]:
tar_set = "input/target_disease_cluster.txt"
tar_id_type = "ICD-10"
mode = "cluster"
out_dir = "results/"
runs = 1000

### Run script

In [14]:
single_validation(tar=tar_set, tar_id=tar_id_type, mode=mode, out_dir=out_dir, runs=runs)

[00:00:00|1765.59MB] Starting validation ...
[00:00:00|1765.59MB] Load mappings for input into cache ...
[00:00:01|1798.86MB] Load distances for input into cache ...
[00:00:01|1947.71MB] Load input data ...
[00:00:01|1947.71MB] Validation of input ...
[00:00:02|1949.25MB] Validation of random runs ...
[00:02:45|1950.39MB] Save files
[00:02:45|1950.39MB] Finished validation


### Inspect results

In [23]:
with open(out_dir+'digest_'+mode+'_result.json', 'r') as f:
    data = json.load(f)
pd.concat({k: pd.DataFrame(v) for k, v in data.items()})

| | | di | ss |
|---|---|---|---|
| input_values | disgenet.genes_related_to_disease | 0.0 | -0.608696 |
| input_values | disgenet.variants_related_to_disease | 0.0 | -0.606749 |
| input_values | ctd.pathway_related_to_disease | 0.0 | -0.632568 |
| p_values | disgenet.genes_related_to_disease | 0.137862 | 0.632368 |
| p_values | disgenet.variants_related_to_disease | 0.420579 | 0.518482 |
| p_values | ctd.pathway_related_to_disease | 0.424575 | 0.332667 |


The rows under `input_values` hold the raw Dunn index (`di`) and silhouette score (`ss`) for the input clustering; the rows under `p_values` hold the p-values of the input computed against the 1000 random runs.
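The cluster result is nested one level deeper than the other modes, which is why the inspection cell uses `pd.concat` over a dictionary of data frames. If one row per attribute is easier to read, the stacked table can be pivoted with `unstack`. The sketch below rebuilds the nested dictionary from the values shown above; the exact JSON layout is inferred from the displayed output and should be treated as an assumption.

```python
import pandas as pd

# Nested layout inferred from the displayed table (values copied from it).
data = {
    "input_values": {
        "di": {
            "disgenet.genes_related_to_disease": 0.0,
            "disgenet.variants_related_to_disease": 0.0,
            "ctd.pathway_related_to_disease": 0.0,
        },
        "ss": {
            "disgenet.genes_related_to_disease": -0.608696,
            "disgenet.variants_related_to_disease": -0.606749,
            "ctd.pathway_related_to_disease": -0.632568,
        },
    },
    "p_values": {
        "di": {
            "disgenet.genes_related_to_disease": 0.137862,
            "disgenet.variants_related_to_disease": 0.420579,
            "ctd.pathway_related_to_disease": 0.424575,
        },
        "ss": {
            "disgenet.genes_related_to_disease": 0.632368,
            "disgenet.variants_related_to_disease": 0.518482,
            "ctd.pathway_related_to_disease": 0.332667,
        },
    },
}

# Same construction as in the inspection cell: rows are (kind, attribute).
table = pd.concat({k: pd.DataFrame(v) for k, v in data.items()})

# Move the outer row level (input_values / p_values) into the columns,
# leaving one row per attribute and columns of the form (measure, kind).
wide = table.unstack(level=0)

print(wide)
```

After the pivot, `wide[("ss", "p_values")]` gives the silhouette-score p-values for all attributes as a single column.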