# The main script
This is the batch script for EG prediction used for the experiments.
The manual page of EG_prediction.py is the following:

In [1]:
!python EG_prediction.py -h

usage: EG_prediction.py [-h] -i <inputfile> [<inputfile> ...]
                        [-c <chunks> [<chunks> ...]]
                        [-X <excludelabels> [<excludelabels> ...]]
                        [-L <labelname>] -l <labelfile> [-A <aliases>]
                        [-b <seed>] [-r <repeat>] [-f <folds>] [-j <jobs>]
                        [-B] [-sf <subfolds>] [-P] [-ba] [-fx]
                        [-n <normalize>] [-o <outfile>] [-s <scorefile>]

PLOS COMPBIO

options:
  -h, --help            show this help message and exit
  -i <inputfile> [<inputfile> ...], --inputfile <inputfile> [<inputfile> ...]
                        input attribute filename list
  -c <chunks> [<chunks> ...], --chunks <chunks> [<chunks> ...]
                        no of chunks for attribute filename list
  -X <excludelabels> [<excludelabels> ...], --excludelabels <excludelabels> [<excludelabels> ...]
                        labels to exclude (default NaN, values any list)
  -L <labelname>, --label

# Script execution for label prediction experiments
Run this script with different arguments to reproduce the results reported in the paper. For example, *Table 3. "E vs NE" classification performance based on HELP labelling* is produced as follows:

1. Set the datapath where the label files and attribute files are placed (``datapath = "../datafinal"``).
2. Choose the tissue (``tissue = "Kidney"``).
3. Rename ``aE`` and ``sNE`` as ``NE`` to address the "E vs NE" problem (``aliases = "-A \"{'aE': 'NE', 'sNE':'NE'}\""``).
4. Use all labelled genes (``excludeflags = ""``).
5. Set the subsampling factor (``sfolds = "4"`` for a 1:4 ratio of E:NE).
6. Specify the input attributes (``-i {datapath}/{tissue}_BIO.csv ...``).
7. Specify how many chunks each attribute file is split into (``chunks = "-c 1 5 1"``).
8. Set the parallelism level (``njobs = "-1"`` for all CPUs).
9. Enable the balancing option of the LightGBM classifier (``-ba``).
10. Enable the probability output of the classifier (``-P``).
11. Enable batch mode execution with no debug printing (``-B``).
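
The steps above can also be assembled outside a notebook. The following is a minimal sketch, assuming the script is launched from plain Python rather than with the ``!python`` cell magic; ``build_command`` and its parameter names are hypothetical helpers introduced here for illustration, and only flags listed in the help text above are used.

```python
import shlex

def build_command(datapath, tissue, attributes, chunks=(), aliases=None,
                  exclude=(), normalize="std", subfolds=4, jobs=-1):
    """Assemble the argument list for EG_prediction.py (hypothetical helper)."""
    cmd = ["python", "EG_prediction.py", "-i"]
    # One attribute file per suffix, e.g. ../datafinal/Kidney_BIO.csv
    cmd += [f"{datapath}/{tissue}_{suffix}.csv" for suffix in attributes]
    if chunks:
        cmd += ["-c", *[str(c) for c in chunks]]
    cmd += ["-l", f"{datapath}/{tissue}_HELP.csv"]
    if aliases:
        # The notebook passes the renaming dictionary as one quoted string
        cmd += ["-A", repr(aliases)]
    if exclude:
        cmd += ["-X", *exclude]
    cmd += ["-n", normalize, "-sf", str(subfolds),
            "-ba", "-j", str(jobs), "-P", "-B"]
    return cmd

# "E vs NE" on kidney, mirroring steps 1-11
cmd = build_command(
    datapath="../datafinal",
    tissue="Kidney",
    attributes=["BIO", "CCcfs", "EmbN2V_128"],
    chunks=[1, 5, 1],
    aliases={"aE": "NE", "sNE": "NE"},
)
print(shlex.join(cmd))
# To actually launch the script: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids the nested shell quoting needed for the ``-A`` dictionary in the notebook cell.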

In [12]:
datapath = "../datafinal"
tissue = "Kidney"                               # or 'Lung'
labelfile = f"{tissue}_HELP.csv"                # label filename
aliases = "-A \"{'aE': 'NE', 'sNE':'NE'}\""     # label renaming dictionary {'oldlabel': 'newlabel', ...}, e.g. {'aE': 'NE'} replaces the aE label with NE
#aliases = ""
#excludeflags = "-X aE"                         # label to remove, e.g. -X aE removes the aE label and reduces to the "E vs sNE" problem
excludeflags = ""
njobs = "-1"                                    # parallelism level: -1 = all CPUs, 1 = sequential, n = n CPUs
sfolds = "4"                                    # dataset subsampling factor (0 means no subsampling), e.g. 4 for a 1:4 ratio of <min-class>:<major-class>
chunks = "-c 1 5 1"                             # chunks list specification (for files split in chunks)
normalization = "-n std"                        # normalization mode (std, max), e.g. -n std applies z-score normalization
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            {chunks} \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags} \
                            {normalization} \
                            -sf {sfolds} \
                            -ba -j {njobs} -P -B

METHOD: LGBM	MODE: prob	BALANCE: yes
PROBL: E vs NE
INPUT: Kidney_BIO.csv Kidney_CCcfs.csv Kidney_EmbN2V_128.csv
LABEL: Kidney_HELP.csv DISTRIB: E : 1242, NE: 4809
SUBSAMPLE: 1:4
+-------------+-------------------------------+
|             | measure                       |
|-------------+-------------------------------|
| ROC-AUC     | 0.9528±0.0071                 |
| Accuracy    | 0.9081±0.0081                 |
| BA          | 0.8629±0.0130                 |
| Sensitivity | 0.7862±0.0264                 |
| Specificity | 0.9397±0.0092                 |
| MCC         | 0.7209±0.0238                 |
| CM          | [[9764, 2656], [2902, 45188]] |
+-------------+-------------------------------+
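
As a sanity check on the table above, the rate measures can be recomputed from the ``CM`` row. This is a sketch under an assumption about the matrix layout: rows are taken as the true classes in the order [E, NE] and columns as the predicted classes, with E as the positive class. The layout is not stated in the output; it is inferred from the row sums (12420 and 48090), which are ten times the class counts in the ``DISTRIB`` line. ``metrics_from_cm`` is a name introduced here for illustration.

```python
from math import sqrt

def metrics_from_cm(cm):
    """Derive rate metrics from a 2x2 confusion matrix.

    Assumed layout (not confirmed by the script output):
    [[TP, FN], [FP, TN]] with E as the positive class.
    """
    (tp, fn), (fp, tn) = cm
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (sensitivity + specificity) / 2
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "BA": balanced_accuracy,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "MCC": mcc,
    }

# Confusion matrix copied from the kidney "E vs NE" output
m = metrics_from_cm([[9764, 2656], [2902, 45188]])
for name, value in m.items():
    print(f"{name:12s} {value:.4f}")
```

Under this layout, the pooled accuracy, BA, sensitivity and specificity agree with the reported means to about four decimals. The pooled MCC is close to the reported value but not identical, presumably because the table reports an average over runs and MCC is not linear in the counts.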


As another example, to reproduce the 6th row of *Table 6. “E vs aE” classification performance based on HELP labelling* on the lung tissue:

1. Set another tissue (``tissue = "Lung"``).
2. Do not rename any labels (``aliases = ""``).
3. Remove labelled genes: exclude all genes labelled as sNE (``excludeflags = "-X sNE"``).
4. Disable subsampling, since the E and aE classes have similar sizes (``sfolds = "0"``).
5. Specify the input attributes (``-i {datapath}/{tissue}_BIO.csv ...``).
6. Omit the chunks specification: none of the attribute files is split (``chunks = ""``).

All other arguments are as before.
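
The two label set-ups used in these examples, renaming with ``-A`` and removal with ``-X``, can be illustrated on a toy label table. This is only a sketch of the intended effect, not the script's actual implementation: ``prepare_labels`` and the gene identifiers ``G1`` to ``G4`` are made up for illustration, and the order in which removal and renaming are applied is an assumption (the examples here never combine the two).

```python
def prepare_labels(labels, aliases=None, exclude=()):
    """Drop excluded labels, then rename the remaining ones.

    Hypothetical helper: mimics the documented effect of -X and -A.
    """
    aliases = aliases or {}
    return {
        gene: aliases.get(label, label)
        for gene, label in labels.items()
        if label not in exclude
    }

# Toy labelling with the three HELP classes (made-up gene IDs)
labels = {"G1": "E", "G2": "aE", "G3": "sNE", "G4": "E"}

# "E vs NE": merge aE and sNE into NE, keep every gene
e_vs_ne = prepare_labels(labels, aliases={"aE": "NE", "sNE": "NE"})

# "E vs aE": no renaming, drop every gene labelled sNE
e_vs_ae = prepare_labels(labels, exclude=["sNE"])

print(e_vs_ne)
print(e_vs_ae)
```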


In [14]:
datapath = "../datafinal"
tissue = "Lung"                                 # or 'Kidney'
labelfile = f"{tissue}_HELP.csv"                # label filename
#aliases = "-A \"{'aE': 'NE', 'sNE':'NE'}\""    
aliases = ""                                    # renaming is not needed
excludeflags = "-X sNE"                         # label to remove is sNE for the "E vs aE" problem
#excludeflags = ""
njobs = "-1"                                    # parallelism level: -1 = all CPUs, 1 = sequential, n = n CPUs
sfolds = "0"                                    # dataset subsampling is disabled: E and aE classes have similar sizes
chunks = ""                                     # chunks list is not required (no input file is split)
normalization = "-n std"                        # normalization mode (std, max), e.g. -n std applies z-score normalization
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags} \
                            {normalization} \
                            -sf {sfolds} \
                            -ba -j {njobs} -P -B

METHOD: LGBM	MODE: prob	BALANCE: yes
PROBL: E vs aE
INPUT: Lung_BIO.csv Lung_EmbN2V_128.csv
LABEL: Lung_HELP.csv DISTRIB: E : 1224, aE: 2759
SUBSAMPLE: NONE
+-------------+-------------------------------+
|             | measure                       |
|-------------+-------------------------------|
| ROC-AUC     | 0.8630±0.0135                 |
| Accuracy    | 0.8067±0.0149                 |
| BA          | 0.7658±0.0183                 |
| Sensitivity | 0.6596±0.0316                 |
| Specificity | 0.8719±0.0140                 |
| MCC         | 0.5398±0.0356                 |
| CM          | [[8073, 4167], [3533, 24057]] |
+-------------+-------------------------------+
