<a target="_blank" href="https://colab.research.google.com/github/giordamaug/HELP/blob/main/HELPpy/notebooks/experiment.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://www.kaggle.com/notebooks/welcome?src=https://github.com/giordamaug/HELP/blob/main/HELPpy/notebooks/experiment.ipynb">
  <img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle"/>
</a>

### 1. Install HELP from GitHub
Skip this cell if you have already installed HELP.

In [None]:
!pip install git+https://github.com/giordamaug/HELP.git

### 2. Download the input files
For a chosen tissue (here `Kidney`), download from GitHub the label file (here `Kidney_HELP.csv`, computed as in Example 1) and the attribute files (here BIO `Kidney_BIO.csv`, CCcfs `Kidney_CCcfs_0.csv`, ..., `Kidney_CCcfs_4.csv`, and N2V `Kidney_EmbN2V_128.csv`).
Skip this step if you already have these input files locally.

In [None]:
tissue='Kidney'
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv
for i in range(5):
  !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv
#!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCBeder.csv

Other attribute files (CCBeder) are listed but commented out; uncomment them to experiment with different data.
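
For readers who prefer to check their local files before downloading, a minimal Python sketch is given below. It lists the expected input files for a tissue and reports those not yet present. The helper names (`input_files`, `missing_files`) and the `first_chunk` parameter are illustrative assumptions, not part of HELP; the chunk numbering follows the download loop above (`range(5)`, that is 0 to 4), so adjust `first_chunk` if the repository numbers the chunks differently.

```python
from pathlib import Path

BASE_URL = "https://raw.githubusercontent.com/giordamaug/HELP/main/data"

def input_files(tissue, n_chunks=5, first_chunk=0):
    """Return the label and attribute file names for a tissue (hypothetical helper)."""
    names = [f"{tissue}_HELP.csv", f"{tissue}_BIO.csv"]
    # Assumption: CCcfs chunks are numbered consecutively starting at `first_chunk`.
    names += [f"{tissue}_CCcfs_{i}.csv"
              for i in range(first_chunk, first_chunk + n_chunks)]
    names.append(f"{tissue}_EmbN2V_128.csv")
    return names

def missing_files(tissue, folder="."):
    """Return (name, url) pairs for the input files not found in `folder`."""
    return [(name, f"{BASE_URL}/{name}")
            for name in input_files(tissue)
            if not (Path(folder) / name).exists()]

for name, url in missing_files("Kidney"):
    print(f"missing {name}: download from {url}")
```

The sketch performs no network access; it only tells you which `wget` lines above are still needed.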

### 3. Download the script for the experiments and show the man page
Download `EG_prediction.py`, the batch script used for the EG prediction experiments, and display its help message:

In [1]:
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/HELPpy/notebooks/EG_prediction.py
!python EG_prediction.py -h

usage: EG_prediction.py [-h] -i <inputfile> [<inputfile> ...]
                        [-c <chunks> [<chunks> ...]]
                        [-X <excludelabels> [<excludelabels> ...]]
                        [-L <labelname>] -l <labelfile> [-A <aliases>]
                        [-b <seed>] [-r <repeat>] [-f <folds>] [-j <jobs>]
                        [-B] [-v <voters>] [-ba] [-fx] [-n <normalize>]
                        [-o <outfile>] [-s <scorefile>] [-p <predfile>]

PLOS COMPBIO

options:
  -h, --help            show this help message and exit
  -i <inputfile> [<inputfile> ...], --inputfile <inputfile> [<inputfile> ...]
                        input attribute filename list
  -c <chunks> [<chunks> ...], --chunks <chunks> [<chunks> ...]
                        no of chunks for attribute filename list
  -X <excludelabels> [<excludelabels> ...], --excludelabels <excludelabels> [<excludelabels> ...]
                        labels to exclude (default NaN, values any list)
  -L <labelname>,

### 4. Run the E vs NE experiments
This cell reproduces the Kidney results reported in Table 3 (A) of the HELP paper when `tissue` is set to `"Kidney"`. The run shown here uses `Brain` and was interrupted before completion.

In [14]:
datapath = "../data"
tissue = "Brain"                                # or 'Kidney', or 'Lung'
labelfile = f"{tissue}_HELP.csv"                # label filename
aliases = "-A \"{'aE':0, 'sNE':0, 'E':1}\""     # dictionary for renaming labels before prediction: e.g. {'oldlabel': 'newlabel'}
excludeflags = ""                               # label to remove (none for E vs NE problem)
njobs = "-1"                                    # parallelism level: -1 = all cpus, 1 = sequential
nchunks = ""                                    # no. of chunks for each input attribute file: e.g. 1 5 (BIO is one chunk, CCcfs is split into 5 chunks)
voters = "-v 13"                                # no. of voters on classifier ensemble
estimators = "-e 200"                           # no. of estimators in classifier
lr = "-lr 0.1"                                  # learning rate
repeats = "-r 10"                               # no. of iterations for experiments 
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags}  \
                            {voters} {estimators} {repeats} \
                            -n std {lr} \
                            -j {njobs}

Brain_BIO.csv: 19456it [00:00, 157718.57it/s]                                   
[Brain_BIO] found 58547 Nan...
[Brain_BIO] No Nan fixing...
[Brain_BIO] found 2 Nan...
[Brain_BIO] Removing 2 constant features ...
[Brain_BIO] Normalization with std ...
Brain_CCcfs.csv: 19456it [00:08, 2232.72it/s]                                   
[Brain_CCcfs] found 6735590 Nan...
[Brain_CCcfs] No Nan fixing...
[Brain_CCcfs] found 3 Nan...
[Brain_CCcfs] Removing 3 constant features ...
[Brain_CCcfs] Normalization with std ...
Brain_EmbN2V_128.csv: 19456it [00:00, 38071.90it/s]                             
[Brain_EmbN2V_128] found 0 Nan...
[Brain_EmbN2V_128] No Nan fixing...
[Brain_EmbN2V_128] found 0 Nan...
[Brain_EmbN2V_128] Removing 0 constant features ...
[Brain_EmbN2V_128] No normalization...
- removing label []
- replacing label aE with 0
- replacing label sNE with 0
- replacing label E with 1
DATASET: (17244, 3453), LABEL: (17244, 1)
Running par on 8 cpus...
^C
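
The log above reports two preprocessing steps for each attribute file: removal of constant features and normalization with `std`. The sketch below illustrates one plausible reading of those steps on a toy table. It assumes that `std` means z-score standardization (subtract the column mean, divide by the column standard deviation) and that NaN entries are ignored when computing the statistics. It is not the code of `EG_prediction.py`, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def drop_constant_columns(columns):
    """Keep only columns with more than one distinct non-NaN value."""
    kept = {}
    for name, values in columns.items():
        observed = {v for v in values if not math.isnan(v)}
        if len(observed) > 1:
            kept[name] = values
    return kept

def standardize(values):
    """Z-score a column, leaving NaN entries untouched.

    Assumption: 'std' is read here as z-score standardization.
    """
    observed = [v for v in values if not math.isnan(v)]
    mu, sigma = mean(observed), pstdev(observed)
    return [v if math.isnan(v) else (v - mu) / sigma for v in values]

# Toy attribute table: column name -> one value per gene.
table = {
    "expr": [1.0, 2.0, 3.0, math.nan],
    "flag": [0.0, 0.0, 0.0, 0.0],      # constant: removed
    "all_nan": [math.nan] * 4,         # no observed values: removed too
}
table = drop_constant_columns(table)
table = {name: standardize(vals) for name, vals in table.items()}
print(table)
```

Dropping constant columns first guarantees a non-zero standard deviation in the second step, which is one reason the two operations are usually applied in this order.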


### 5. Run the E vs sNE experiments
This cell reproduces the Kidney results reported in Table 4 (A) of the HELP paper. Genes labelled `aE` are removed (`excludeflags = "-X aE"`), leaving a binary E vs sNE problem.
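
To make the effect of `-X` and `-A` concrete, the following sketch applies the same two operations to a toy set of labels: first drop the excluded labels, then rename the remaining ones. It assumes the alias argument can be read as a Python dict literal, as suggested by the command strings in this notebook. `prepare_labels` is a hypothetical helper, not a HELP function, and the gene names are placeholders.

```python
import ast

def prepare_labels(labels, aliases="{}", exclude=()):
    """Drop genes whose label is in `exclude`, then rename labels via `aliases`.

    Hypothetical helper; assumes `aliases` is a Python dict literal.
    """
    mapping = ast.literal_eval(aliases)          # e.g. "{'sNE': 0, 'E': 1}"
    kept = {gene: lab for gene, lab in labels.items() if lab not in exclude}
    return {gene: mapping.get(lab, lab) for gene, lab in kept.items()}

# Toy labels: gene -> class.
labels = {"g1": "E", "g2": "aE", "g3": "sNE", "g4": "sNE"}

# E vs NE: keep every gene, merge aE and sNE into class 0.
e_vs_ne = prepare_labels(labels, "{'aE': 0, 'sNE': 0, 'E': 1}")

# E vs sNE: remove aE genes first, then binarize the rest.
e_vs_sne = prepare_labels(labels, "{'sNE': 0, 'E': 1}", exclude=("aE",))

print(e_vs_ne)
print(e_vs_sne)
```

The only difference between the two problems is whether `aE` genes are merged into the negative class or left out of the dataset.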

In [28]:
datapath = "../data"
tissue = "Kidney"                               # or 'Lung', or 'Brain'
labelfile = f"{tissue}_HELP.csv"                # label filename
excludeflags = "-X aE"                          # label to remove: e.g. -X aE (for E vs sNE problem)
aliases = "-A \"{'sNE':0, 'E':1}\""             # dictionary for renaming labels before prediction: e.g. {'oldlabel': 'newlabel'}
njobs = "-1"                                    # parallelism level: -1 = all cpus, 1 = sequential
estimators = "-e 200"                           # no. of estimators in classifier
lr = "-lr 0.1"                                  # learning rate
voters = "-v 10"                                # no. of voters on classifier ensemble
repeats = "-r 10"                               # no. of iterations for experiments 
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags}  \
                            {voters} {estimators} {repeats} \
                            -n std {lr} \
                            -j {njobs} -B

5-fold: 100%|████████████████████████████████████| 5/5 [25:33<00:00, 306.79s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:34<00:00, 306.89s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:35<00:00, 307.04s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:35<00:00, 307.07s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:35<00:00, 307.17s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:35<00:00, 307.17s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:37<00:00, 307.51s/it]
5-fold: 100%|████████████████████████████████████| 5/5 [25:37<00:00, 307.52s/it]
5-fold: 100%|█████████████████████████████████████| 5/5 [02:57<00:00, 35.59s/it]
5-fold: 100%|█████████████████████████████████████| 5/5 [02:59<00:00, 35.81s/it]
METHOD: VotingEnsembleLGBM(boosting_type=gbdt,learning_rate=0.1,n_estimators=200,n_voters=10,...)
INPUT: Kidney_BIO.csv Kidney_CCcfs.csv Kidney_EmbN2V_128.csv
LABEL: Kidney_HELP.csv (0:12886

Note that this experiment takes a long time when run sequentially (`-j 1`).
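
Outside a notebook, the same command can be assembled in Python and launched with `subprocess`. The sketch below only builds the argument list from the flags used in the cells above. `build_command` and its parameter names are illustrative assumptions, and the actual call is left commented out because it requires the script and the data files to be present.

```python
import shlex

def build_command(datapath, tissue, aliases, exclude=None, voters=10,
                  estimators=200, repeats=10, lr=0.1, njobs=-1, extra=()):
    """Build the EG_prediction.py argument list (hypothetical helper)."""
    cmd = ["python", "EG_prediction.py", "-i"]
    cmd += [f"{datapath}/{tissue}_{kind}.csv"
            for kind in ("BIO", "CCcfs", "EmbN2V_128")]
    # repr() of a dict yields the literal form used for -A in this notebook.
    cmd += ["-l", f"{datapath}/{tissue}_HELP.csv", "-A", repr(aliases)]
    if exclude:
        cmd += ["-X", *exclude]
    cmd += ["-v", str(voters), "-e", str(estimators), "-r", str(repeats),
            "-n", "std", "-lr", str(lr), "-j", str(njobs), *extra]
    return cmd

# E vs sNE settings from the cell above.
cmd = build_command("../data", "Kidney", {"sNE": 0, "E": 1},
                    exclude=["aE"], extra=["-B"])
print(shlex.join(cmd))
# import subprocess; subprocess.run(cmd, check=True)  # uncomment to launch
```

Passing the arguments as a list avoids the nested quoting needed for the `-A` dictionary in the shell form.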