# PhageHostLearn - v3.3.klebsiella - inference

An AI-based phage-host interaction prediction framework with K-loci and receptor-binding proteins (RBPs) at its core. This version of PhageHostLearn is for *Klebsiella pneumoniae*-related phages.

This notebook offers complete functionality to make predictions for new bacteria, new phages, or both, using a trained PhageHostLearn prediction model for *Klebsiella* phage-host interactions.

**Overview of this notebook**
1. Initial set-up
2. Processing phage genomes and bacterial genomes into RBPs and K-locus proteins, respectively
3. Computing feature representations based on ESM-2 and hyperdimensional computing (HDC)
4. Predicting new interactions and ranking

**Architecture of the PhageHostLearn framework**: 
- Multi-RBP setting: phages consisting of one or more RBPs (multi-instance)
- K-loci proteins (multi-instance) 
- Embeddings for both based on ESM-2 language models and HDC
- Combined XGBoost model (for language embeddings) and Random Forest (for HDC embeddings) to make predictions

## 1. Initial set-up

PhageHostLearn takes as inputs phage genomes and bacterial genomes that are later transformed into phage RBPs and bacterial K-locus proteins. To do this data processing, you'll need to do the following:
1. Set up a TEST folder for all the test data that PhageHostLearn stores and generates, and set `test_path` in the code block below to this folder. `training_path` is the path to the training data as used in the `phagehostlearn_training.ipynb` notebook.
2. In the TEST folder, create one subfolder for the new phage genomes and/or one for the new bacterial genomes. Collect the genomes as individual FASTA files and store them in their respective subfolders. You can also make predictions for only new bacteria or only new phages against the training set; in that case you need just one subfolder.
3. Install [PHANOTATE](https://github.com/deprekate/PHANOTATE) and [Kaptive](https://github.com/katholt/Kaptive), both of which you'll need to process the phage and bacterial genomes. Locate PHANOTATE and write its path in the 2.1 code block below. **(Can be simplified by copying PHANOTATE into code folder.)** From the Kaptive repository, copy the .gbk databases into the training data folder.
4. Optionally install [bio_embeddings](https://github.com/sacdallago/bio_embeddings) to compute the protein embeddings needed for RBP detection locally, or opt to do this step in the cloud for faster results (see instructions below).
5. Install [fair-esm](https://github.com/facebookresearch/esm) to compute ESM-2 embeddings for the PhageHostLearn interaction prediction models.
6. Install [Julia](https://julialang.org) to compute hyperdimensional embeddings for the PhageHostLearn interaction prediction models. **extra info on packages etc...**

In [1]:
training_path = '/Users/dimi/GoogleDrive/PhD/4_PHAGEHOST_LEARNING/42_DATA/Valencia_data'
training_suffix = 'Valencia'
test_path = '/Users/dimi/GoogleDrive/PhD/4_PHAGEHOST_LEARNING/42_DATA/inference'
test_suffix = '_test'
results_path = '/Users/dimi/GoogleDrive/PhD/4_PHAGEHOST_LEARNING/43_RESULTS/inference'
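Before running the processing steps, it can help to confirm that the genome subfolders actually contain FASTA files. The sketch below is a hypothetical helper, not part of PhageHostLearn; the function name and the list of accepted file extensions are assumptions.

```python
from pathlib import Path

# Hypothetical helper, not part of PhageHostLearn.
# The accepted extensions are an assumption; adjust them to your own files.
FASTA_EXTENSIONS = {'.fasta', '.fa', '.fna'}

def list_fasta_files(folder):
    """Return the sorted FASTA files found directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f'Missing folder: {folder}')
    return sorted(p for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() in FASTA_EXTENSIONS)
```

For example, `list_fasta_files(test_path + '/phages_genomes')` would list the phage genomes that section 2.1 is about to process, and raise an error if the subfolder does not exist.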

## 2. Data processing

The data processing of PhageHostLearn consists of four consecutive steps: (1) phage gene calling with PHANOTATE, (2) phage protein embedding with bio_embeddings, (3) phage RBP detection and (4) bacterial genome processing with Kaptive.

- Test new phages against the bacteria in the training set: only run the processing steps for the phage genomes (2.1-2.3)
- Test new bacteria against the phages in the training set: only run the processing steps for the bacterial genomes (2.4)
- Test combinations of new phages and new bacteria: run all the processing steps.

In [2]:
import phagehostlearn_processing as phlp

#### 2.1 PHANOTATE

In [None]:
phage_genomes_path = test_path+'/phages_genomes'
phanotate_path = '/opt/homebrew/Caskroom/miniforge/base/envs/ML1/bin/phanotate.py'
phlp.phanotate_processing(test_path, phage_genomes_path, phanotate_path, data_suffix=test_suffix)

#### 2.2 Protein embeddings

The code block below computes protein embeddings for all of the detected phage genes (translated to proteins) using the bio_embeddings package (see Initial set-up). This might take a while on CPU. Alternatively, you can run this step in Google Colab or on Kaggle using the `compute_embeddings_cloud.ipynb` notebook, which does exactly the same thing.

In [None]:
phlp.compute_protein_embeddings(test_path, data_suffix=test_suffix)

#### 2.3 PhageRBPdetect

Either copy the `RBPdetect_phageRBPs.hmm` and `RBPdetect_xgb_hmm.json` files into the folder that the code block below points to (`test_path` as written), or provide their absolute paths in the code block below.

In [None]:
pfam_path = test_path+'/RBPdetect_phageRBPs.hmm'
hmmer_path = '/Users/Dimi/hmmer-3.3.1'
xgb_path = test_path+'/RBPdetect_xgb_hmm.json'
gene_embeddings_path = test_path+'/phage_protein_embeddings'+test_suffix+'.csv'
phlp.phageRBPdetect(test_path, pfam_path, hmmer_path, xgb_path, gene_embeddings_path, data_suffix=test_suffix)

#### 2.4 Kaptive

In [4]:
bact_genomes_path = test_path+'/clinical_strains'
kaptive_database_path = training_path+'/Klebsiella_k_locus_primary_reference.gbk'
phlp.process_bacterial_genomes(test_path, bact_genomes_path, kaptive_database_path, data_suffix=test_suffix)


## 3. Feature construction

Feature construction starts from the RBPbase.csv and the Locibase.json in either the training_path or the test_path, depending on the setting you want to test. Adjust the second code block below accordingly. If computing the ESM-2 embeddings takes too long, you might opt to do this step in the cloud or on a high-performance computer.

- Test new phages against the bacteria in the training set: only run the feature steps for the phage (3.1, 3.3, 3.4). Set the correct paths to Locibase_train, RBPbase_test and the embeddings!
- Test new bacteria against the phages in the training set: only run the feature steps for the bacteria (3.2, 3.3, 3.4). Set the correct paths to Locibase_test, RBPbase_train and the embeddings!
- Test combinations of new phages and new bacteria: run all the feature steps and set the paths to Locibase_test, RBPbase_test and the embeddings.
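The three settings above differ only in which folder and suffix each base file is read from. As a minimal sketch, the choice can be written as a small function; `select_base_paths` and its two boolean flags are hypothetical and not part of PhageHostLearn, while the file name pattern follows the code block below.

```python
def select_base_paths(training_path, training_suffix, test_path, test_suffix,
                      new_phages, new_bacteria):
    """Return (locibase_path, rbpbase_path) for the chosen test setting.

    Hypothetical helper: new bacteria are read from the test folder,
    known bacteria from the training folder, and likewise for phages.
    """
    if not (new_phages or new_bacteria):
        raise ValueError('Select new phages, new bacteria, or both.')
    loci_dir, loci_suffix = ((test_path, test_suffix) if new_bacteria
                             else (training_path, training_suffix))
    rbp_dir, rbp_suffix = ((test_path, test_suffix) if new_phages
                           else (training_path, training_suffix))
    locibase_path = loci_dir + '/Locibase' + loci_suffix + '.json'
    rbpbase_path = rbp_dir + '/RBPbase' + rbp_suffix + '.csv'
    return locibase_path, rbpbase_path
```

With `new_phages=False` and `new_bacteria=True`, this returns the same two paths as the code block that follows (new bacteria against the training phages).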

In [5]:
import phagehostlearn_features as phlf

In [6]:
locibase_path = test_path+'/Locibase'+test_suffix+'.json'
rbpbase_path = training_path+'/RBPbase'+training_suffix+'.csv'

#### 3.1 ESM-2 RBP features

In [None]:
phlf.compute_esm2_embeddings_rbp(test_path, data_suffix=test_suffix)

#### 3.2 ESM-2 loci features

In [7]:
phlf.compute_esm2_embeddings_loci(test_path, data_suffix=test_suffix)



#### 3.3 HDC features

In [9]:
phlf.compute_hdc_embedding(test_path, test_suffix, locibase_path, rbpbase_path, mode='test')

b'Loading data...\nComputing loci representations...\nComputing RBP representations...\nDone!\n'


#### 3.4 Construct feature matrices

In [10]:
rbp_embeddings_path = training_path+'/esm2_embeddings_rbp'+training_suffix+'.csv'
loci_embeddings_path = test_path+'/esm2_embeddings_loci'+test_suffix+'.csv'
hdc_embeddings_path = test_path+'/hdc_features'+test_suffix+'.txt'

In [11]:
features_esm2, features_hdc, groups_bact = phlf.construct_feature_matrices(
    training_path, training_suffix, loci_embeddings_path,
    rbp_embeddings_path, hdc_embeddings_path, mode='test')

Dimensions match? True
Dimensions match? True
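This notebook does not show what `construct_feature_matrices` does internally. Purely as an illustration of how a pairwise feature matrix and a matching group vector can be built, the sketch below assumes one vector per bacterium, mean pooling over the RBP vectors of each phage, and concatenation of the two; the function name and all three choices are assumptions, not the confirmed PhageHostLearn implementation.

```python
import numpy as np

def pair_features(loci_vectors, phage_rbp_vectors):
    """Build one feature row per (bacterium, phage) pair.

    loci_vectors: dict bacterium -> 1D array (assumed already pooled)
    phage_rbp_vectors: dict phage -> 2D array of shape (n_rbps, dim)
    Returns the feature matrix and, per row, the index of its bacterium.
    """
    features, groups = [], []
    for i, loci_vec in enumerate(loci_vectors.values()):
        for rbps in phage_rbp_vectors.values():
            phage_vec = np.mean(rbps, axis=0)  # assumed multi-RBP pooling
            features.append(np.concatenate([loci_vec, phage_vec]))
            groups.append(i)
    return np.vstack(features), np.asarray(groups)

# toy data: two bacteria, two phages (one with two RBPs)
loci = {'B1': np.array([1.0, 0.0]), 'B2': np.array([0.0, 1.0])}
rbps = {'P1': np.array([[1.0, 1.0], [3.0, 3.0]]), 'P2': np.array([[2.0, 2.0]])}
X, groups = pair_features(loci, rbps)
```

In this layout, rows that share a group value belong to the same bacterium and appear in the same phage order, which is the property the ranking code in section 4 relies on.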


## 4. Predict and rank new interactions

The goal is to score every phage for each bacterium, and then use these prediction scores to rank the candidate phages per bacterium.
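As a toy illustration of this per-bacterium ranking, the sketch below uses made-up scores and a hypothetical `rank_phages` function. It assumes that the scores are stored as one flat array, that `groups` marks which bacterium each score belongs to, and that the scores of every bacterium follow the same phage order.

```python
import numpy as np

def rank_phages(scores, groups, phages):
    """Rank phages from highest to lowest score within each group."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    ranked = {}
    for group in np.unique(groups):
        group_scores = scores[groups == group]
        order = np.argsort(-group_scores, kind='stable')  # descending
        ranked[int(group)] = [(phages[j], float(group_scores[j])) for j in order]
    return ranked

# made-up scores for two bacteria (groups 0 and 1) and three phages
toy_scores = [0.1, 0.9, 0.5, 0.7, 0.2, 0.3]
toy_groups = [0, 0, 0, 1, 1, 1]
toy_ranked = rank_phages(toy_scores, toy_groups, ['P1', 'P2', 'P3'])
```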

In [12]:
import math
import pickle
import subprocess
import numpy as np
import pandas as pd
import phagehostlearn_utils as phlu
import matplotlib.pyplot as plt
from matplotlib import cm
from joblib import dump, load
from tqdm.notebook import tqdm
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut, GroupShuffleSplit, GroupKFold
from sklearn.metrics import roc_auc_score, auc, precision_recall_curve
%matplotlib inline

#### 4.1 Make predictions with trained models

In [16]:
# ESM-2 FEATURES + XGBoost model
xgb = XGBClassifier()
xgb.load_model('phagehostlearn_esm2_xgb.json')
scores_xgb = xgb.predict_proba(features_esm2)[:,1]

In [17]:
# HDC FEATURES + RF model
rf = load('phagehostlearn_hdc_rf.joblib')
scores_rf = rf.predict_proba(features_hdc)[:,1]

In [18]:
# combine scores with uninorm operator
scores = np.asarray([phlu.uninorm(scores_rf[j], scores_xgb[j]) for j in range(len(scores_xgb))])
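The definition of `phlu.uninorm` is not shown in this notebook. One common uninorm for combining two probabilities is the 3-Π operator, $U(x, y) = \frac{xy}{xy + (1-x)(1-y)}$, which has 0.5 as its neutral element and reinforces scores that agree. The sketch below implements that operator under the assumption that it resembles what `phlu.uninorm` computes; the function name `uninorm_3pi` is hypothetical.

```python
import numpy as np

def uninorm_3pi(x, y):
    """3-Pi uninorm of two probabilities (scalars or arrays).

    Assumption: this may differ from phlu.uninorm, which is not shown here.
    The combination of 0 and 1 is undefined (0/0) and is returned as NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = x * y
    denominator = numerator + (1.0 - x) * (1.0 - y)
    out = np.full_like(denominator, np.nan)
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out
```

With this operator, a score of 0.5 from one model leaves the other model's score unchanged, two scores above 0.5 yield a combined score higher than either, and two scores below 0.5 yield a lower one.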

#### 4.2 Save predictions as a matrix and ranked list

In [20]:
# save prediction scores in an interaction matrix (rows: bacteria, columns: phages)
groups_bact = np.asarray(groups_bact)
loci_embeddings = pd.read_csv(loci_embeddings_path)
rbp_embeddings = pd.read_csv(rbp_embeddings_path)
bacteria = list(loci_embeddings['accession'])
# the column labels are only correct if this phage order matches the
# order of the scores within each group
phages = list(set(rbp_embeddings['phage_ID']))

score_matrix = np.zeros((len(bacteria), len(phages)))
# each group holds the scores of one bacterium against all phages
for i, group in enumerate(list(set(groups_bact))):
    scores_this_group = scores[groups_bact == group]
    score_matrix[i, :] = scores_this_group
results = pd.DataFrame(score_matrix, index=bacteria, columns=phages)
# note: index=False leaves the bacterial accessions out of the CSV
results.to_csv(results_path+'/prediction_results'+test_suffix+'.csv', index=False)

In [21]:
# rank the phages per bacterium
ranked = {}
for group in list(set(groups_bact)):
    scores_this_group = scores[groups_bact == group]
    ranked_phages = [(x, y) for y, x in sorted(zip(scores_this_group, phages), reverse=True)]
    ranked[bacteria[group]] = ranked_phages

# save results
with open(results_path+'/ranked_results'+test_suffix+'.pickle', 'wb') as f:
    pickle.dump(ranked, f)

In [22]:
# read results
with open(results_path+'/ranked_results'+test_suffix+'.pickle', 'rb') as f:
    ranked_results = pickle.load(f)
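The ranked results are a dictionary that maps each bacterium to a list of `(phage, score)` tuples sorted from high to low. For inspection or export, such a dictionary can be flattened into a table with the top-k phages per bacterium. The sketch below uses a hypothetical `ranked_to_frame` function and made-up toy data, not actual predictions.

```python
import pandas as pd

def ranked_to_frame(ranked, top_k=None):
    """Flatten {bacterium: [(phage, score), ...]} into a tidy DataFrame.

    top_k=None keeps every phage; otherwise only the first top_k per bacterium.
    """
    rows = []
    for bacterium, phage_scores in ranked.items():
        for rank, (phage, score) in enumerate(phage_scores[:top_k], start=1):
            rows.append({'bacterium': bacterium, 'rank': rank,
                         'phage': phage, 'score': score})
    return pd.DataFrame(rows, columns=['bacterium', 'rank', 'phage', 'score'])

# made-up toy data in the same structure as the ranked results
toy = {'B1': [('P2', 0.9), ('P3', 0.5), ('P1', 0.1)],
       'B2': [('P1', 0.7), ('P3', 0.3), ('P2', 0.2)]}
top2 = ranked_to_frame(toy, top_k=2)
```

Calling `ranked_to_frame(ranked_results, top_k=5)` on the loaded results would give the five highest-scoring phages per bacterium in a single table.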

In [52]:
ranked_results['K12100']

[('S8c', 0.9979714494230796),
 ('K2064PH2', 0.9967457766707655),
 ('K30lambda2', 0.9964650061988013),
 ('K7PH164C4', 0.9544444538462791),
 ('K2069PH1', 0.9176106571196438),
 ('K34PH164', 0.11846747294794044),
 ('K65PH164', 0.03696121099958089),
 ('S8b', 0.03228285483663257),
 ('K14PH164C1', 0.003790375488850914),
 ('S13d', 0.002407232654668489),
 ('K29PH164C1', 0.0011377869062445913),
 ('S8a', 0.0006171603493324597),
 ('K12P1_1', 0.0005919065449237807),
 ('K62PH164C2', 0.0005570856803069899),
 ('D7b', 0.0005187688568607069),
 ('K54lambda1', 0.00045461982394830907),
 ('K52PH129C1', 0.0004146484415177026),
 ('K49PH164C2', 0.0004131264852125165),
 ('K33PH14C2', 0.000387483758193262),
 ('K32PH164C1', 0.00024252166996545694),
 ('K69PH164C2', 0.00022923682479192002),
 ('K17alfa61', 0.00020281286129450763),
 ('K6PH25C3', 0.00018187610389542986),
 ('K35PH164C3', 0.00013496991451511495),
 ('S9a', 0.00010871191480544478),
 ('K74PH129C2', 0.00010187250982677063),
 ('K54lambda2', 9.032312109001998