# Code

Running this notebook end-to-end reproduces the solution. A step-by-step guide is also provided.

In [None]:
import sys
import os
import glob
import yaml

In [None]:
!cat config.yaml

with open('config.yaml') as f:
    CONFIG = yaml.safe_load(f)
    
BASE_PATH = CONFIG['base_path']
CONFIG_PATH = os.path.join(BASE_PATH, 'config.yaml')
RAPIDS_ENV = os.path.join(BASE_PATH, CONFIG['rapids-env'])
PYTORCH_ENV = os.path.join(BASE_PATH, CONFIG['pytorch-env'])

# 1. Preparation

### 1.1. Setup envs

Create the following python envs:

    1) `pytorch-env` - env to deal with all DL models
    2) `rapids-env`  - env to preprocess via RAPIDS and train py-boost and logregs

In [None]:
!./create-rapids-env.sh {BASE_PATH}
!./create-pytorch-env.sh {BASE_PATH}

### 1.2. Get the input data

Here we describe what should be stored in the working dir to reproduce the results.

The following data layout was provided by Kaggle:

    ./Train - cafa train data
    ./Test (targets) - cafa test data
    ./sample_submission.tsv - cafa sample submission
    ./IA.txt - cafa information accretion (IA) weights

    
The following are the solution code libraries, scripts, and notebooks used for training:

    ./protlib
    ./protnn
    ./nn_solution
    
And the installed envs

    ./pytorch-env
    ./rapids-env

### 1.3. Produce the helpers data

First, we preprocess the input data and store everything in a format that is convenient for us to handle and manipulate. Here is the structure:

    ./helpers
        ./fasta - fasta files stored as feather
            ./train_seq.feather
            ./test_seq.feather
        ./real_targets - targets stored as n_proteins x n_terms parquet containing 0/1/NaN values
            ./biological_process
                ./part_0.parquet
                ...
                ./part_14.parquet
                ./nulls.pkl - NaN rate of each term
                ./priors.pkl - prior mean of each term (excluding NaN cells, like np.nanmean)
            ./cellular_component
            ./molecular_function
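
As a rough illustration of what `nulls.pkl` and `priors.pkl` hold, the sketch below computes the NaN rate and the NaN-excluding mean of each term column for a tiny 0/1/NaN target matrix. It is a plain-Python stand-in for a per-column `np.nanmean`; the function name `column_stats` and the toy matrix are assumptions made for illustration, not code from `create_helpers.py`.

```python
import math


def column_stats(matrix):
    """Return (null_rates, priors) per column of a row-major 0/1/NaN matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    null_rates, priors = [], []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        known = [v for v in column if not math.isnan(v)]
        # Share of proteins whose label for this term is unknown.
        null_rates.append((n_rows - len(known)) / n_rows)
        # Mean over known cells only; NaN if the term has no known cell.
        priors.append(sum(known) / len(known) if known else math.nan)
    return null_rates, priors


nan = math.nan
# Hypothetical targets: 4 proteins x 3 terms.
targets = [
    [1.0, 0.0, nan],
    [0.0, nan, nan],
    [1.0, 1.0, 0.0],
    [1.0, nan, 1.0],
]
null_rates, priors = column_stats(targets)
print(null_rates)  # [0.0, 0.5, 0.5]
print(priors)      # [0.75, 0.5, 0.5]
```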
            

In [None]:
%%time
# parse fasta files and save as feather
!{RAPIDS_ENV} protlib/scripts/parse_fasta.py \
    --config-path {CONFIG_PATH}

# convert targets to parquet and calculate priors
!{RAPIDS_ENV} protlib/scripts/create_helpers.py \
    --config-path {CONFIG_PATH} \
    --batch-size 10000

### 1.4. Get external data

Datasets downloaded from outside:

    ./temporal - extra data downloaded from http://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/
    
The first step is downloading and parsing the datasets. After parsing, the script separates the datasets by evidence code. The most important split for us is the kaggle/no-kaggle split: `kaggle` refers to annotations with experimental evidence codes, and `no-kaggle` refers to electronic labeling, which is used as features for the stacker models.
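
The kaggle/no-kaggle split can be pictured with the minimal sketch below. It assumes annotations are available as `(protein, term, evidence_code)` tuples and that the experimental set consists of the standard GO experimental codes; the exact code set and splitting logic of `parse_go_single.py` may differ, and all identifiers in the example are hypothetical.

```python
# Standard GO experimental evidence codes.
# Assumption: the real script may use a different or larger set.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def split_by_evidence(annotations):
    """Split (protein, term, code) tuples into kaggle / no-kaggle pair lists."""
    kaggle, no_kaggle = [], []
    for protein, term, code in annotations:
        # Simplification: every non-experimental code goes to no-kaggle.
        bucket = kaggle if code in EXPERIMENTAL_CODES else no_kaggle
        bucket.append((protein, term))
    return kaggle, no_kaggle


# Hypothetical annotations; IEA is the electronic-annotation code.
annotations = [
    ("P00001", "GO:0005515", "IPI"),
    ("P00001", "GO:0005737", "IEA"),
    ("P00002", "GO:0003677", "IDA"),
    ("P00002", "GO:0006355", "IEA"),
]
kaggle, no_kaggle = split_by_evidence(annotations)
print(kaggle)     # experimental pairs, candidates for targets
print(no_kaggle)  # electronic pairs, usable as stacker features
```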

In [None]:
# download external data from ebi.ac.uk
!{RAPIDS_ENV} protlib/scripts/downloads/dw_goant.py \
    --config-path {CONFIG_PATH}

# parse the files
!{RAPIDS_ENV} protlib/scripts/parse_go_single.py \
    --file goa_uniprot_all.gaf.216.gz \
    --config-path {CONFIG_PATH}

!{RAPIDS_ENV} protlib/scripts/parse_go_single.py \
    --file goa_uniprot_all.gaf.214.gz \
    --config-path {CONFIG_PATH} \
    --output old214

The next step is propagation. Since the ebi.ac.uk datasets contain labels without propagation, we apply the rules provided in the organizer's repo to label more terms. We do this only for the `goa_uniprot_all.gaf.216.gz` dataset, since it was the current dataset during the active competition phase.

In [None]:
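folder = os.path.join(BASE_PATH, 'temporal')

# propagate every parsed train/test label file; outputs get a `prop_` prefix
label_files = glob.glob(folder + '/labels/train*') + glob.glob(folder + '/labels/test*')
for file in label_files:
    name = os.path.join(folder, 'labels', 'prop_' + os.path.basename(file))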

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/prop_tsv.py \
        --path {file} \
        --graph {BASE_PATH}/Train/go-basic.obo \
        --output {name} \
        --device 0 \
        --batch_size 30000 \
        --batch_inner 5000
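
Conceptually, propagation adds every ancestor of an annotated term to the protein's label set. The sketch below shows this on a toy DAG. It is a simplified illustration and not the GPU-batched logic of `prop_tsv.py`: it ignores relation types in the ontology file, and the term IDs, protein names, and function names are hypothetical.

```python
def ancestors(term, parents, cache):
    """Return all ancestors of `term` in a DAG given as {term: [parents]}."""
    if term not in cache:
        found = set()
        for parent in parents.get(term, []):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[term] = found
    return cache[term]


def propagate(labels, parents):
    """Extend each protein's term set with all ancestors of its terms."""
    cache = {}
    propagated = {}
    for protein, terms in labels.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents, cache)
        propagated[protein] = full
    return propagated


# Toy ontology: T4 has two parents (T2, T3), both children of the root T1.
parents = {"T4": ["T2", "T3"], "T2": ["T1"], "T3": ["T1"], "T1": []}
labels = {"protA": {"T4"}, "protB": {"T2"}}
result = propagate(labels, parents)
print(sorted(result["protA"]))  # ['T1', 'T2', 'T3', 'T4']
print(sorted(result["protB"]))  # ['T1', 'T2']
```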

The last part is reproducing MT's datasets, which are commonly used in all public kernels. We didn't use them directly, but we used the `cafa-terms-diff` dataset, which represents the difference between our labeling obtained by parsing the `goa_uniprot_all.gaf.216.gz` dataset and the `all_dict.pkl` dataset given by MT. As he claims in the discussion [here](https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/discussion/404853#2329935), he used the same FTP source as we did. However, our source is more recent than the public one, so the difference is effectively temporal. After analysis, we found that we can reproduce it as the difference between the `goa_uniprot_all.gaf.216.gz` and `goa_uniprot_all.gaf.214.gz` sources. So we simply create the `cafa-terms-diff` dataset with the given script. The only difference between the source in the Kaggle script and the one used here is deduplication: we removed duplicated protein/term pairs from the dataset, which has almost zero impact on the metric value (less than 1e-4).
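
The idea of a temporal difference with deduplication can be sketched as a set difference over protein/term pairs. This is only an illustration under the assumption that each release can be reduced to a list of `(protein, term)` pairs; it is not the code of `reproduce_mt.py`, and the pairs below are made up.

```python
def temporal_diff(new_pairs, old_pairs):
    """Return pairs present in the newer release but not in the older one.

    Converting to sets also removes duplicated protein/term pairs.
    """
    return sorted(set(new_pairs) - set(old_pairs))


# Hypothetical (protein, term) pairs from an older and a newer release.
release_old = [("P1", "GO:A"), ("P1", "GO:B"), ("P2", "GO:A")]
release_new = [
    ("P1", "GO:A"),
    ("P1", "GO:B"),
    ("P1", "GO:C"),  # new annotation
    ("P1", "GO:C"),  # duplicate row, dropped by the set conversion
    ("P2", "GO:A"),
    ("P3", "GO:B"),  # new protein
]
diff = temporal_diff(release_new, release_old)
print(diff)  # [('P1', 'GO:C'), ('P3', 'GO:B')]
```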


In [None]:
# create datasets
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/reproduce_mt.py \
    --path {BASE_PATH}/temporal \
    --graph {BASE_PATH}/Train/go-basic.obo

# propagate quickgo51.tsv
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/prop_tsv.py \
    --path {BASE_PATH}/temporal/quickgo51.tsv \
    --graph {BASE_PATH}/Train/go-basic.obo \
    --output {BASE_PATH}/temporal/prop_quickgo51.tsv \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 5000

### 1.5 Preparation step for neural networks

Produce the helper files needed to train the NN models.

In [None]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/nn_solution/prepare.py \
    --config-path {CONFIG_PATH}

# 2. Embeddings

In [None]:
!mkdir -p embeds

### 2.1 T5 pretrained inference

In [None]:
%%time
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/t5.py \
    --config-path {CONFIG_PATH} \
    --device 0

### 2.2 ESM pretrained inference

In [None]:
%%time
!{PYTORCH_ENV} {BASE_PATH}/nn_solution/esm2sm.py \
    --config-path {CONFIG_PATH} \
    --device 0

# 3. Base models

In [None]:
!mkdir -p models

### 3.1. Train and inference py-boost models

GBDT models description:

1) Features: T5 + taxon, targets: multilabel

2) Features: T5 + taxon, targets: conditional

3) Features: T5 + ESM + taxon, targets: multilabel

4) Features: T5 + ESM + taxon, targets: conditional

Pipeline and hyperparameters are the same for all the models. Each model has 4500 outputs: BP 3000, MF 1000, CC 500. All models can be run in parallel to save time. We used a single V100 32GB, and each model type requires about 15 hours to train the 5-fold CV loop. 32GB of GPU RAM is required; otherwise OOM will occur.

In [None]:
for model_name in ['pb_t54500_raw', 'pb_t54500_cond', 'pb_t5esm4500_raw', 'pb_t5esm4500_cond', ]:

    print(f'Training {model_name}')

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/train_pb.py \
        --config-path {CONFIG_PATH} \
        --model-name {model_name} \
        --device 0
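
If several GPUs are available, the four runs above could be launched in parallel instead of sequentially. The sketch below only builds the command lines, assigning devices round-robin, and leaves launching to an optional helper. The flags are the ones used in the cell above; the interpreter path, script path, config path, and the helper names `build_commands` and `launch` are placeholders for illustration.

```python
import subprocess


def build_commands(python_bin, script, config_path, model_names, n_devices):
    """Build one training command per model, cycling over GPU ids."""
    commands = []
    for i, name in enumerate(model_names):
        commands.append([
            python_bin, script,
            "--config-path", config_path,
            "--model-name", name,
            "--device", str(i % n_devices),
        ])
    return commands


def launch(commands):
    """Start all commands at once and wait for them (not called here)."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


models = ["pb_t54500_raw", "pb_t54500_cond",
          "pb_t5esm4500_raw", "pb_t5esm4500_cond"]
# Placeholder paths; in the notebook these come from RAPIDS_ENV,
# BASE_PATH and CONFIG_PATH.
commands = build_commands(
    "path/to/rapids-python", "protlib/scripts/train_pb.py",
    "config.yaml", models, n_devices=2,
)
for cmd in commands:
    print(" ".join(cmd))
```

With two devices, two models share each GPU, so they would still need to be queued per device to respect the 32GB memory requirement.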


### 3.2. Train and inference logreg models

Logistic Regression models description:

1) Features: T5 + taxon, targets: multilabel

2) Features: T5 + taxon, targets: conditional


Pipeline and hyperparameters are the same for all the models. Each model has 13500 outputs: BP 10000, MF 2000, CC 1500. All models can be run in parallel to save time. We used a single V100 32GB, and training the 5-fold CV loop requires about 10 hours for model 1 and 2 hours for model 2. 32GB of GPU RAM is required; otherwise OOM will occur.

In [None]:
for model_name in ['lin_t5_raw', 'lin_t54500_cond']:

    print(f'Training {model_name}')

    !{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/train_lin.py \
        --config-path {CONFIG_PATH} \
        --model-name {model_name} \
        --device 0


### 3.3. Train and inference NN models

In [None]:
# first, create train folds (the same as used for pb_t54500_cond model)
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/create_gkf.py \
    --config-path {CONFIG_PATH}

In [None]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/nn_solution/train_models.py \
    --config-path {CONFIG_PATH} \
    --device 0

In [None]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/nn_solution/inference_models.py \
    --config-path {CONFIG_PATH} \
    --device 0

In [None]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/nn_solution/make_pkl.py \
    --config-path {CONFIG_PATH}

# 4. Final model

### 4.1. Train GCN models

This step trains 3 independent stacking models, one for each ontology. Models are trained on a single V100 GPU, which takes about 13 hours for BP, 4 hours for MF, and 2 hours for CC. 32 GB of GPU RAM is required. Training can be parallelized if 2 GPUs are available: BP on one and MF/CC on the other.
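
The suggested BP versus MF/CC split follows from the training times. As a small worked example, the sketch below applies a greedy longest-job-first assignment to the durations quoted above; the function name `schedule` is hypothetical, and the heuristic is a generic one used here for illustration, not part of the solution code.

```python
import heapq


def schedule(durations, n_gpus):
    """Assign jobs to GPUs greedily, longest job first."""
    heap = [(0, gpu) for gpu in range(n_gpus)]  # (busy hours, gpu id)
    heapq.heapify(heap)
    plan = {gpu: [] for gpu in range(n_gpus)}
    for job, hours in sorted(durations.items(), key=lambda kv: -kv[1]):
        busy, gpu = heapq.heappop(heap)  # least-loaded GPU so far
        plan[gpu].append(job)
        heapq.heappush(heap, (busy + hours, gpu))
    makespan = max(busy for busy, _ in heap)
    return plan, makespan


hours = {"bp": 13, "mf": 4, "cc": 2}  # approximate hours from the text
plan, makespan = schedule(hours, n_gpus=2)
print(plan)      # {0: ['bp'], 1: ['mf', 'cc']}
print(makespan)  # 13, versus 19 hours when run sequentially
```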

In [None]:
%%time

for ont in ['bp', 'mf', 'cc']:
    !{PYTORCH_ENV} {BASE_PATH}/protnn/scripts/train_gcn.py \
        --config-path {CONFIG_PATH} \
        --ontology {ont} \
        --device 0

### 4.2. Inference GCN models and TTA

Inference and Test-Time-Augmentation


In [None]:
%%time

!{PYTORCH_ENV} {BASE_PATH}/protnn/scripts/predict_gcn.py \
    --config-path {CONFIG_PATH} \
    --device 0

### 4.3. Postprocessing and building the submission file

In [None]:
# we have 4 TTA predictions, so we aggregate them by averaging
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/collect_ttas.py \
    --config-path {CONFIG_PATH} \
    --device 0

# create 0.3 * pred + 0.7 * max children propagation
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/step.py \
    --config-path {CONFIG_PATH} \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 3000 \
    --lr 0.7 \
    --direction min

# create 0.3 * pred + 0.7 * min parents propagation
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/step.py \
    --config-path {CONFIG_PATH} \
    --device 0 \
    --batch_size 30000 \
    --batch_inner 3000 \
    --lr 0.7 \
    --direction max

# average the min-prop and max-prop solutions, then mix in the cafa-terms-diff and quickgo51 datasets from 1.4
!{RAPIDS_ENV} {BASE_PATH}/protlib/scripts/postproc/make_submission.py \
    --config-path {CONFIG_PATH} \
    --device 0 \
    --max-rate 0.5
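
One way to read the blending comments in the cell above is sketched below on a three-term chain. The sketch follows the wording of the comments (`0.3 * pred + 0.7 * propagated`) and does not try to model the `--direction` flag or the batching of `step.py`, whose internals are not shown here. The toy graph, scores, and function names are all assumptions.

```python
def average(predictions):
    """Element-wise mean of several {term: score} dicts."""
    n = len(predictions)
    return {t: sum(p[t] for p in predictions) / n for t in predictions[0]}


def max_children(pred, children):
    """Raise each term's score to the max over itself and its descendants."""
    memo = {}

    def visit(term):
        if term not in memo:
            below = [visit(c) for c in children.get(term, [])]
            memo[term] = max([pred[term]] + below)
        return memo[term]

    return {t: visit(t) for t in pred}


def min_parents(pred, parents):
    """Lower each term's score to the min over itself and its ancestors."""
    memo = {}

    def visit(term):
        if term not in memo:
            above = [visit(p) for p in parents.get(term, [])]
            memo[term] = min([pred[term]] + above)
        return memo[term]

    return {t: visit(t) for t in pred}


def blend(pred, propagated, lr):
    """Compute (1 - lr) * pred + lr * propagated for every term."""
    return {t: (1 - lr) * pred[t] + lr * propagated[t] for t in pred}


# Toy chain root -> mid -> leaf with an inconsistent raw prediction.
children = {"root": ["mid"], "mid": ["leaf"]}
parents = {"mid": ["root"], "leaf": ["mid"]}
pred = {"root": 0.2, "mid": 0.6, "leaf": 0.4}

max_prop = blend(pred, max_children(pred, children), lr=0.7)
min_prop = blend(pred, min_parents(pred, parents), lr=0.7)
final = average([max_prop, min_prop])
print({t: round(v, 2) for t, v in final.items()})
# {'root': 0.34, 'mid': 0.46, 'leaf': 0.33}
```

The same `average` helper also covers the TTA aggregation step, where the inputs would be the 4 TTA prediction sets.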

In [None]:
!head {BASE_PATH}/sub/submission.tsv