# OVERVIEW

Determining the context-specific circuitry of biological pathways is a fundamental goal of molecular cell biology.
ACSNI combines prior knowledge of biological processes (a gene set) with a deep neural network to decompose gene
expression profiles (GEP) into pathway activities and identify unknown pathway components; see Anene et al., 2021.

# Required inputs

1. Gene expression matrix with genes in rows and samples in columns (format .csv).
2. Gene set membership file representing prior knowledge of gene functions (format .csv).
   For a single-gene analysis, the second input is instead a gene name provided at run time.
3. Optional: a weights file (integer values) for the genes in the second input.
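
As a minimal sketch of the first two inputs, the snippet below builds a toy expression matrix (genes in rows, samples in columns) and a toy gene set file using only the standard library. The gene set layout (a gene column plus one binary membership column named after the pathway) is an assumption based on the "Prior matrix, binary" description in the help text; the gene names, sample names, column header `gene` and the `TOY_PATHWAY` label are all hypothetical.

```python
import csv
import io


def to_csv_text(header, rows):
    """Serialise a header and a list of rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical expression matrix: genes in rows, samples in columns.
expression_csv = to_csv_text(
    ["gene", "sample_1", "sample_2", "sample_3"],
    [
        ["GENE_A", 5.1, 4.8, 6.0],
        ["GENE_B", 0.2, 0.1, 0.3],
        ["GENE_C", 7.4, 7.9, 6.5],
    ],
)

# Hypothetical gene set file: 1 marks membership (assumed layout).
gene_set_csv = to_csv_text(
    ["gene", "TOY_PATHWAY"],
    [["GENE_A", 1], ["GENE_B", 0], ["GENE_C", 1]],
)

print(expression_csv)
print(gene_set_csv)
```

Comparing your own files against the included example inputs is the safest way to confirm the layout ACSNI expects.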


# CASES

For this tutorial, we will use the cases reported in the manuscript (Anene et al., 2021) to demonstrate how to set up and run ACSNI, interpret the results, and navigate the extended database.

# INSTALLATION

You can install ACSNI with pip, which automatically installs the required packages (e.g. TensorFlow).

Ensure you have Python version 3.8 installed before running the code below; if not, see https://www.python.org/downloads/

In [None]:
!pip3 install ACSNI
# or 
!pip install ACSNI

The above should install the latest version of ACSNI. 

Besides pinning a specific version during installation (the standard pip approach),
you can also install directly from the .wheel file provided at https://github.com/caanene1/ACSNI or
compile the code yourself.
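
To confirm what ended up in your environment, a short check such as the sketch below may help. It uses only the standard library (`importlib.metadata`, available from Python 3.8) and assumes the distribution is registered under the name `ACSNI`, as in the pip command above; the helper names are hypothetical.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_at_least(major, minor):
    """Check the running interpreter against a minimum version."""
    return sys.version_info >= (major, minor)


print("Python >= 3.8:", python_is_at_least(3, 8))
print("ACSNI version:", installed_version("ACSNI"))
```

If the second line prints `None`, the package is not visible to the interpreter running the notebook, which usually means pip installed it into a different environment.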

ACSNI has three entry commands:
- ACSNI-run : multiple genes prior
- ACSNI-derive : single gene prior
- ACSNI-get : phenotype linking

Run a command with the option -h to check the installation and list its parameters.
Below are the arguments for the ACSNI-run entry point.

You can also import and call ACSNI functions from regular Python code.

In [36]:
!ACSNI-run -h

usage: ACSNI-run [-h] [-m MAD] [-b BOOT] [-c ALPHA] [-p LP] [-f FULL] -i INPUT
                 -t PRIOR [-w WEIGHT] [-s SEED]

System biology information extraction for genomics.

optional arguments:
  -h, --help            show this help message and exit
  -m MAD, --mad MAD     Minimum median absolute deviance for geneSets
  -b BOOT, --boot BOOT  Number of ensemble models to run
  -c ALPHA, --alpha ALPHA
                        Alpha threshold to make prediction calls
  -p LP, --lp LP        Dimension of the pathway layer. It is also half of the
                        subprocess,set to 0 or default for automatic
                        estimation
  -f FULL, --full FULL  Run tool in 1=full 0=sub (error only) mode
  -i INPUT, --input INPUT
                        Input expression data (.csv)
  -t PRIOR, --prior PRIOR
                        Prior matrix, binary
  -w WEIGHT, --weight WEIGHT
                        Use weights for the genes
  -s SEED, --seed SEED  Set seed for reproduci

Except for the two required arguments -i and -t, all arguments have well-tested defaults.
You can tune these parameters to your specific needs, but do so with caution.
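
If you prefer to launch runs from a script rather than a notebook cell, the sketch below assembles an ACSNI-run command line from Python. It uses only flags that appear in the help text above; the function name and its keyword arguments are hypothetical, and options left as `None` are simply omitted so that the tool's own defaults apply.

```python
import shlex


def build_acsni_run_command(input_csv, prior_csv, full=None, lp=None,
                            boot=None, mad=None, seed=None):
    """Assemble an ACSNI-run argument list; unset options keep the tool's defaults."""
    command = ["ACSNI-run", "-i", input_csv, "-t", prior_csv]
    optional = [("-f", full), ("-p", lp), ("-b", boot), ("-m", mad), ("-s", seed)]
    for flag, value in optional:
        # Compare against None so that a legitimate value of 0 is still passed.
        if value is not None:
            command += [flag, str(value)]
    return command


command = build_acsni_run_command("TCGA_.csv", "mTOR.csv", full=1, lp=0, boot=5, mad=2.5)
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues when file names contain spaces.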

# mTOR case
The first case infers the extended mTOR signalling network in clear cell renal cell carcinoma (ccRCC).

Input Files (included): 
    1. TCGA_.csv - gene expression matrix (source-TCGA)
    2. mTOR.csv - mTOR gene set from Pathway interaction database
    3. sample_info.csv - sample phenotype for the first input

Here, we set the -f to 1 for the full run and output
                 -p to 0 for automatic estimation of layers
                 -b to 5 for five models and 
                 -m to 2.5 for the minimum median absolute deviation
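
To make the -m threshold concrete, the sketch below computes the median absolute deviation (MAD) of each gene across samples and keeps the genes that reach a minimum value. How ACSNI applies the threshold internally is not shown in this tutorial, so the filtering rule (keep genes with MAD greater than or equal to the threshold) is an assumption, and the expression values are hypothetical.

```python
from statistics import median


def median_absolute_deviation(values):
    """MAD: the median of the absolute deviations from the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def filter_by_mad(expression, minimum_mad):
    """Keep genes whose MAD across samples reaches the threshold (assumed rule)."""
    return {
        gene: values
        for gene, values in expression.items()
        if median_absolute_deviation(values) >= minimum_mad
    }


# Hypothetical expression values, one list of samples per gene.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 100.0],   # one outlier, low MAD
    "GENE_B": [0.0, 5.0, 10.0, 15.0, 20.0],  # spread out, higher MAD
    "GENE_C": [7.0, 7.0, 7.0, 7.0, 7.0],     # constant, MAD of zero
}

kept = filter_by_mad(expression, minimum_mad=2.5)
print(sorted(kept))
```

Because the MAD is robust to outliers, a single extreme sample (as in `GENE_A`) does not make a gene look variable.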

In [37]:
!ACSNI-run -i TCGA_.csv -t mTOR.csv -f 1 -p 0 -b 5 -m 2.5

Running for PID_MTOR_4PATHWAY
Results will be saved to TCGA_PID_MTOR_4PATHWAY-704AHXJ
Geneset with 67 genes in the expression
146 samples
2021-04-02 20:13:24.916282: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-02 20:13:24.919062: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-02 20:13:25.041742: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
The optimal number of dimension is 11
Geneset with 67 genes in the expression
146 samples
Geneset with 67 genes in the expression
146 samples
Geneset with 67 genes in the expression
146 samples
Geneset with 67 genes in the expre

The command will print the run progress and location of the output.
As shown above, the results are in the folder "TCGA_PID_MTOR_4PATHWAY-704AHXJ".
We can list the contents of the output folder below.

In [38]:
!ls TCGA_PID_MTOR_4PATHWAY-704AHXJ

NULL_TCGA_.csv    Network_TCGA_.csv dbsTCGA_.ptl


The output of ACSNI-run can be identified by the prefix, specifically:

NULL - randomly shuffled expression matrix derived from -i. You can use it to check randomness in the predictions. For this, rerun the model with the argument -i set to this file.

dbs - the database of intermediate files and run details.

Network - the inferred network and components. Most users only need this output.
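
When processing many runs, it can help to locate each output by its prefix programmatically. The sketch below classifies the three file names listed above; the role labels and the function name are hypothetical, and in practice the list of names would come from something like `os.listdir` on the results folder.

```python
def classify_outputs(filenames):
    """Map each known output role to the file name carrying its prefix."""
    roles = {
        "NULL_": "shuffled_expression",
        "Network_": "network",
        "dbs": "database",
    }
    found = {}
    for name in filenames:
        for prefix, role in roles.items():
            if name.startswith(prefix):
                found[role] = name
    return found


# File names as listed in the output folder above.
files = ["NULL_TCGA_.csv", "Network_TCGA_.csv", "dbsTCGA_.ptl"]
print(classify_outputs(files))
```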

In [39]:
!head TCGA_PID_MTOR_4PATHWAY-704AHXJ/Network_TCGA_.csv

name,sub,Direction
ANAPC13,AE_SE8IA2_0,0.0067188637331128
ANKRD46,AE_SE8IA2_0,0.0033135202247649
ATP9A,AE_SE8IA2_0,0.0136904744431376
BPGM,AE_SE8IA2_0,0.0053974902257323
C18orf32,AE_SE8IA2_0,0.0077944286167621
C6orf226,AE_SE8IA2_0,0.0003798712568823
C8orf80,AE_SE8IA2_0,-0.0013907115207985
CCDC56,AE_SE8IA2_0,0.0074006947688758
COL24A1,AE_SE8IA2_0,0.0001676924148341


It has three columns:
    1. name - gene name
    2. sub - name of the inferred sub-network
    3. Direction - strength and direction of the interaction within the sub-network
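
As a rough illustration of working with these three columns, the sketch below groups genes by sub-network and splits them by the sign of Direction. It embeds a few of the rows printed above so that it runs on its own; for a real run you would read the Network file from disk instead. Treating zero as positive is an arbitrary choice made for this example.

```python
import csv
import io
from collections import defaultdict

# A few rows of Network_TCGA_.csv, copied from the output shown above.
network_csv = """name,sub,Direction
ANAPC13,AE_SE8IA2_0,0.0067188637331128
ATP9A,AE_SE8IA2_0,0.0136904744431376
C8orf80,AE_SE8IA2_0,-0.0013907115207985
COL24A1,AE_SE8IA2_0,0.0001676924148341
"""


def group_by_subnetwork(csv_text):
    """Return {sub: {"positive": [...], "negative": [...]}} keyed on the sign of Direction."""
    groups = defaultdict(lambda: {"positive": [], "negative": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        sign = "positive" if float(row["Direction"]) >= 0 else "negative"
        groups[row["sub"]][sign].append(row["name"])
    return dict(groups)


groups = group_by_subnetwork(network_csv)
print(groups["AE_SE8IA2_0"])
```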

# Advanced usage
You can further inspect the dbs output of ACSNI-run using the Python pickle module. Internally, the database is an instance of a Python class with methods and members.

This database also contains predictions made through linear decomposition approaches; see Anene et al., 2021.

In [40]:
# Import the os and pickle packages
import os
import pickle

# Load the database (the context manager closes the file afterwards)
with open("TCGA_PID_MTOR_4PATHWAY-704AHXJ/dbsTCGA_.ptl", "rb") as handle:
    database = pickle.load(handle)

# Display a summary of the loaded database
database


ACSNI result with 14 modules over 5 bootstraps.

In [41]:
# Extract run information
database.get_run_info()

# Extract predicted network and save to file
output = database.get_p()
output.to_csv("TCGA_PID_MTOR_4PATHWAY-704AHXJ/results.csv")

# This file has extended output, including results from decomposing with PCA, NMF and median
output.head(5)

name,AE_SE8IA2_0,AE_SE8IA2_1,AE_SE8IA2_2,AE_SE8IA2_3,AE_SE8IA2_4,AE_SE8IA2_5,AE_SE8IA2_6,AE_SE8IA2_7,AE_SE8IA2_8,AE_SE8IA2_9,...,AE_QD6XXF_8,AE_QD6XXF_9,AE_QD6XXF_10,PCA_QD6XXF_0,NMF_QD6XXF_0,NMF_QD6XXF_1,MEDIAN_QD6XXF_0,Predicted,Sum_stat,Boot_Count
A1BG,B,B,B,B,B,B,B,B,B,B,...,B,B,B,B,B,B,B,0,0,5
A1CF,B,B,B,B,B,B,B,B,B,B,...,B,B,B,B,B,B,B,0,1,5
A2LD1,B,B,B,B,B,B,B,B,B,B,...,B,B,B,B,B,B,B,0,1,5
A2M,B,B,B,B,B,B,B,B,B,B,...,B,B,B,B,B,B,B,0,0,5
A4GALT,B,B,B,B,B,B,B,B,B,B,...,B,B,B,B,B,B,B,0,0,5


Above, the column prefix identifies the decomposition method, which allows additional assessment against linear methods:
- AE_ : autoencoder decomposition
- NMF_ : non-negative matrix factorisation
- PCA_ : principal component analysis
- MEDIAN_ : simple median expression
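
To compare methods, you may want to separate the columns by prefix first. The sketch below does this for a subset of the column names shown in the table above; the function name and the `METHOD_PREFIXES` mapping are hypothetical helpers. With the DataFrame from the previous cell, the same function could be applied to `output.columns`.

```python
from collections import defaultdict

# Prefixes described above, mapped to the method they denote.
METHOD_PREFIXES = {
    "AE": "autoencoder decomposition",
    "NMF": "non-negative matrix factorisation",
    "PCA": "principal component analysis",
    "MEDIAN": "simple median expression",
}


def group_columns_by_method(columns):
    """Group column names by method prefix, ignoring summary columns."""
    grouped = defaultdict(list)
    for column in columns:
        prefix = column.split("_", 1)[0]
        if prefix in METHOD_PREFIXES:
            grouped[prefix].append(column)
    return dict(grouped)


# A subset of the column names shown in the table above.
columns = [
    "AE_SE8IA2_0", "AE_QD6XXF_10", "PCA_QD6XXF_0", "NMF_QD6XXF_0",
    "NMF_QD6XXF_1", "MEDIAN_QD6XXF_0", "Predicted", "Sum_stat", "Boot_Count",
]

grouped = group_columns_by_method(columns)
for prefix, names in grouped.items():
    print(prefix, "->", names)
```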

> Linking to phenotype
Further, you can link the predicted sub-processes to clinical or biological information.
The included "sample_info.csv" contains a grouping variable with two levels: Normal and Tumor.

We can check the file:

In [1]:
!head sample_info.csv

ID,group
TCGA.A3.3358.11A.01R.1541.07,Normal
TCGA.A3.3358.01A.01R.1541.07,Tumor
TCGA.A3.3387.11A.01R.1541.07,Normal
TCGA.A3.3387.01A.01R.1541.07,Tumor
TCGA.B0.4700.11A.01R.1541.07,Normal
TCGA.B0.4700.01A.02R.1541.07,Tumor
TCGA.B0.4712.11A.02R.1503.07,Normal
TCGA.B0.4712.01A.01R.1503.07,Tumor
TCGA.B0.5402.11A.01R.1503.07,Normal


Then, we can run the ACSNI-get command with the database and the sample information file.

In [42]:
!ACSNI-get -r TCGA_PID_MTOR_4PATHWAY-704AHXJ/dbsTCGA_.ptl -v sample_info.csv -c character
!head group_to\ subprocess\ associations.csv

Statistics of variations in subprocesses explained by group
q25 0.5068493150684932 
 q75 0.5856164383561644 
 mean 0.5620174346201743 
 std 0.09858342604975479
sub,0,Association
AE_SE8IA2_0,0.5342465753424658,Weak
AE_SE8IA2_1,0.5068493150684932,Weak
AE_SE8IA2_2,0.4863013698630137,Weak
AE_SE8IA2_3,0.6712328767123288,Strong
AE_SE8IA2_4,0.589041095890411,Strong
AE_SE8IA2_5,0.5068493150684932,Weak
AE_SE8IA2_6,0.7054794520547946,Strong
AE_SE8IA2_7,0.5958904109589042,Strong
AE_SE8IA2_8,0.4863013698630137,Weak
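
To pull out only the strongly associated sub-processes, you can filter the association table on its label column. The sketch below embeds the rows printed above so that it runs on its own; the function name is hypothetical, and the column named `0` is read as the association score purely from the layout of the output.

```python
import csv
import io

# Rows of the association table, copied from the output shown above.
associations_csv = """sub,0,Association
AE_SE8IA2_0,0.5342465753424658,Weak
AE_SE8IA2_1,0.5068493150684932,Weak
AE_SE8IA2_2,0.4863013698630137,Weak
AE_SE8IA2_3,0.6712328767123288,Strong
AE_SE8IA2_4,0.589041095890411,Strong
AE_SE8IA2_5,0.5068493150684932,Weak
AE_SE8IA2_6,0.7054794520547946,Strong
AE_SE8IA2_7,0.5958904109589042,Strong
AE_SE8IA2_8,0.4863013698630137,Weak
"""


def strong_subprocesses(csv_text):
    """Return (sub, score) pairs labelled Strong, highest score first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    strong = [(row["sub"], float(row["0"])) for row in rows
              if row["Association"] == "Strong"]
    return sorted(strong, key=lambda pair: pair[1], reverse=True)


strong = strong_subprocesses(associations_csv)
for name, score in strong:
    print(name, round(score, 3))
```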


# ATF2 case
The second case investigates ATF2-dependent bZIP transcriptional output in healthy aortic artery tissue.

Input Files (included): 
    1. AA_.csv - gene expression matrix (source-GTEx)
    2. ATF2.csv - ATF2 gene set from Pathway interaction database

Here, we set the -f to 1 for a full run and output
                 -p to 0 for automatic estimation of layers
                 -b to 10 for ten models and 
                 -m to 1.2 for the minimum median absolute deviation

In [43]:
!ACSNI-run -i AA_.csv -t ATF2.csv -f 1 -p 0 -b 10 -m 1.2

Running for PID_ATF2_PATHWAY
Results will be saved to AA_PID_ATF2_PATHWAY-51GJJUE
Geneset with 37 genes in the expression
432 samples
2021-04-02 20:34:22.983361: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-02 20:34:22.983526: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-02 20:34:23.045616: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
The optimal number of dimension is 5
Geneset with 37 genes in the expression
432 samples
Geneset with 37 genes in the expression
432 samples
Geneset with 37 genes in the expression
432 samples
Geneset with 37 genes in the expression

# HOTAIRM1 case
The third case explores the regulatory network of the lncRNA HOTAIRM1 in healthy kidney.
It is a single-gene case, so we need the ACSNI-derive command.

Input Files (not included):
    1. KID_.csv - gene expression matrix (source-GTEx)
    2. "ENSG00000233429" - gene ID for HOTAIRM1
    3. btype.csv - gene biotype information for the first input
    4. exclude.csv - biotypes to exclude, used together with the third input

Before running the command, we can check the arguments as before.

In [3]:
!ACSNI-derive -h

usage: ACSNI-derive [-h] -i INPUT -g GENE [-f BIO_FILE] [-b BIO_TYPE] [-m MAD]
                    [-p LP] [-c ALPHA] [-ex EXCLUDE] [-t CT] [-z PC]
                    [-u CORR_FILE]

De-Novo generation of gene sets

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input expression data (.csv)
  -g GENE, --gene GENE  Gene ID/symbol to analyse
  -f BIO_FILE, --bio_file BIO_FILE
                        Gene Bio_type table (.csv)
  -b BIO_TYPE, --bio_type BIO_TYPE
                        Gene Bio_type of interest
  -m MAD, --mad MAD     Minimum median absolute deviation
  -p LP, --lp LP        Percentage of gene_set for model layers
  -c ALPHA, --alpha ALPHA
                        Alpha threshold to make prediction calls
  -ex EXCLUDE, --exclude EXCLUDE
                        Name of bio_types to exclude in csv format
  -t CT, --ct CT        Threshold to use for correlation
  -z PC, --pc P

Run the command

In [None]:
!ACSNI-derive -f btype.csv -b "lncRNA" -i KID_.csv -m 1.2 -g "ENSG00000233429" --ct 0.80 --pc 5 --exclude exclude.csv

This run has three outputs, which can be identified by their prefix. They include
"NULL" data for randomness assessment and "dbs" for intermediate files, as described before.
Note that the "dbs" file here cannot be used with the ACSNI-get command, as the result would not be meaningful.
Instead, you can correlate the expression of the single gene with your phenotype directly.
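
One simple way to relate a single gene to a two-level phenotype is a Pearson correlation against a 0/1 coding of the groups (a point-biserial correlation). The sketch below is a standard-library illustration of that idea, not part of ACSNI; the expression values and the group labels are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical expression of the single gene, one value per sample.
gene_expression = [2.1, 5.9, 1.8, 6.4, 2.5, 5.5]

# Hypothetical two-level phenotype for the same samples, coded as 0/1.
phenotype = ["group_1", "group_2", "group_1", "group_2", "group_1", "group_2"]
coded = [1.0 if label == "group_2" else 0.0 for label in phenotype]

r = pearson_correlation(gene_expression, coded)
print("correlation with phenotype:", round(r, 3))
```

For continuous phenotypes the same function applies without the 0/1 coding; for real analyses, a statistics package that also reports a p-value is preferable.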

Finally, this command writes its predictions to the "Predicted" output, which is a list of predicted genes.
Most users of this command only need this file.

# Conclusion
The tutorial shows how to use the different ACSNI commands. The predicted genes are components of the analysed pathway or network. Ultimately, you should apply orthogonal validation to refine the predictions; see the original manuscript for an extended discussion of such approaches.

The included R scripts demonstrate some downstream analyses; other approaches are available depending on your intended use case (hypothesis generation or conclusive insight).