# PhageHostLearn.*klebsiella* - inference

This notebook offers complete functionality to make predictions for new bacteria, phages or both, using a trained PhageHostLearn prediction model for Klebsiella phage-host interactions.

**Overview of this notebook**
1. Initial set-up
2. Processing phage genomes and bacterial genomes into RBPs and K-locus proteins, respectively
3. Computing feature representations based on ESM-2
4. Predicting new interactions and ranking
5. Reading and interpreting the results

**Architecture of the PhageHostLearn framework**:
- Multi-RBP setting: phages consisting of one or more RBPs (multi-instance)
- K-locus proteins (multi-instance)
- Embeddings for both based on the ESM-2 language model
- An XGBoost model on top of the language embeddings to make predictions
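
To make the multi-instance idea concrete, here is a minimal sketch of how several protein embeddings could be turned into one feature vector per phage-bacterium pair. The pooling by averaging and the concatenation are assumptions made for illustration (this section does not state how PhageHostLearn combines the instances), the function names are hypothetical, and the 4-dimensional toy vectors stand in for much larger ESM-2 embeddings.

```python
import numpy as np

def pool_instances(embeddings):
    """Average a (n_proteins, dim) array into a single (dim,) vector (assumed pooling)."""
    return np.asarray(embeddings, dtype=float).mean(axis=0)

def pair_features(locus_embeddings, rbp_embeddings):
    """Concatenate the pooled K-locus vector with the pooled RBP vector (assumed layout)."""
    return np.concatenate([pool_instances(locus_embeddings),
                           pool_instances(rbp_embeddings)])

# toy embeddings with made-up values
locus = np.array([[1.0, 0.0, 2.0, 4.0],
                  [3.0, 2.0, 0.0, 0.0]])   # a bacterium with 2 K-locus proteins
rbps = np.array([[0.5, 0.5, 0.5, 0.5]])    # a phage with 1 RBP

pair_vector = pair_features(locus, rbps)
print(pair_vector.shape)  # (8,)
```

A classifier such as XGBoost would then receive one such vector per phage-bacterium pair.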

## 1. Initial set-up

DISCLAIMER: PhageHostLearn has so far only been evaluated for predicting interactions between new bacterial strains and the phages seen in training. While we believe PhageHostLearn can also be used to make predictions for new phages against known bacteria, or even for entirely new bacteria-phage combinations, we have not yet evaluated these settings in our study.

PhageHostLearn takes as inputs phage genomes and bacterial genomes that are later transformed into phage RBPs and bacterial K-locus proteins. To do this data processing, you'll need to do the following:
1. If you haven't already, download and install all of the following software: [HMMER](http://hmmer.org/), [PHANOTATE](https://github.com/deprekate/PHANOTATE), [Kaptive](https://github.com/katholt/Kaptive), [fair-esm](https://github.com/facebookresearch/esm) and optionally [bio_embeddings](https://github.com/sacdallago/bio_embeddings). The latter is only needed if you want to compute the embeddings for RBP detection locally; this can also be done in the cloud with the provided notebook `PTBembeddings_cloud.ipynb`.

2. Create a new data folder containing two subfolders: one for phage genomes and one for bacterial genomes. In each subfolder, collect the genomes as individual FASTA files. If you want to make predictions for new bacterial strains against the phages in training (or vice versa), download the training data from [Zenodo](https://zenodo.org/records/8095914) and put those genomes in the corresponding subfolder.

3. Set the paths to the files and folders below, then run each of the code cells (select and press shift+enter).

In [1]:
# data paths
path = './data'
phages_path = path+'/phage_genomes'
bacteria_path = path+'/bacteria_genomes'
pfam_path = 'RBPdetect_phageRBPs.hmm'
xgb_path = 'RBPdetect_xgb_hmm.json'
kaptive_db_path = path+'/Klebsiella_k_locus_primary_reference.gbk'
suffix = 'inference'

# software paths
hmmer_path = '/path/to/hmmer'
phanotate_path = '/path/to/phanotate.py'

## 2. Data processing

The data processing of PhageHostLearn consists of four consecutive steps: (1) phage gene calling with PHANOTATE, (2) phage protein embedding with bio_embeddings, (3) phage RBP detection and (4) bacterial genome processing with Kaptive.

Expected outputs: (1) an RBPbase.csv file with the detected RBPs, (2) a Locibase.json file with the detected K-locus proteins.

In [2]:
import phagehostlearn_processing as phlp

In [None]:
# run PHANOTATE (uses the phanotate_path set in section 1)
phlp.phanotate_processing(path, phages_path, phanotate_path, data_suffix=suffix)

In [None]:
# run PTB embeddings (can be done faster in the cloud, see PTB_embeddings.ipynb)
phlp.compute_protein_embeddings(path, data_suffix=suffix)

In [None]:
# run PhageRBPdetect
gene_embeddings_file = path+'/phage_protein_embeddings'+suffix+'.csv'
phlp.phageRBPdetect(path, pfam_path, hmmer_path, xgb_path, gene_embeddings_file, data_suffix=suffix)

In [4]:
# run Kaptive
phlp.process_bacterial_genomes(path, bacteria_path, kaptive_db_path, data_suffix=suffix)


## 3. Feature construction

This step starts from the RBPbase.csv and Locibase.json files in the data path. Expected outputs: (1) a .csv file with RBP embeddings, (2) a .csv file with loci embeddings. The last function returns two Python objects: the ESM-2 feature matrix (`features_esm2`) and `groups_bact`, which is used below to select the rows belonging to each bacterium. If computing the ESM-2 embeddings takes too long, you might opt to do this step in the cloud or on a high-performance computer.

In [3]:
import phagehostlearn_features as phlf

In [None]:
# ESM-2 features for RBPs
phlf.compute_esm2_embeddings_rbp(path, data_suffix=suffix)

In [15]:
# ESM-2 features for loci
phlf.compute_esm2_embeddings_loci(path, data_suffix=suffix)

100%|███████████████████████████████████████████| 31/31 [14:29<00:00, 28.03s/it]


In [5]:
# Construct feature matrices
rbp_embeddings_path = path+'/esm2_embeddings_rbp'+suffix+'.csv'
loci_embeddings_path = path+'/esm2_embeddings_loci'+suffix+'.csv'
features_esm2, groups_bact = phlf.construct_feature_matrices(path, suffix, loci_embeddings_path, rbp_embeddings_path, mode='test')
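
The prediction cells in the next section fill one row of the score matrix per bacterium, which only works if every bacterium contributes exactly one feature row per phage. The sketch below checks that layout on toy data. It assumes that `groups_bact` holds one non-negative integer bacterium index per row; `check_feature_layout` is a hypothetical helper, not a PhageHostLearn function.

```python
import numpy as np

def check_feature_layout(features, groups, n_bacteria, n_phages):
    """Raise ValueError unless every bacterium has exactly one row per phage."""
    features = np.asarray(features)
    groups = np.asarray(groups)
    if features.shape[0] != len(groups):
        raise ValueError('features and groups have different numbers of rows')
    # count how many rows belong to each bacterium index
    counts = np.bincount(groups, minlength=n_bacteria)
    if len(counts) != n_bacteria or not np.all(counts == n_phages):
        raise ValueError('expected %d rows for each of %d bacteria' % (n_phages, n_bacteria))
    return True

# toy data: 2 bacteria x 3 phages, 8 features per pair (made-up sizes)
toy_features = np.zeros((6, 8))
toy_groups = np.repeat(np.arange(2), 3)   # [0, 0, 0, 1, 1, 1]
print(check_feature_layout(toy_features, toy_groups, n_bacteria=2, n_phages=3))  # True
```

In this notebook the same check could be run on `features_esm2` and `groups_bact` once the lists of bacteria and phages are known.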

## 4. Predict and rank new interactions

The goal is to score every phage against each bacterium, and then use the prediction scores to rank the candidate phages per bacterium.
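
As a toy illustration of this per-bacterium ranking, the sketch below uses made-up scores, names and group labels (none of them come from the PhageHostLearn data). It assumes one score per phage-bacterium pair and an integer group label per score that indexes into the list of bacteria.

```python
import numpy as np

# made-up prediction scores for 2 bacteria x 3 phages
scores = np.array([0.10, 0.90, 0.40, 0.80, 0.20, 0.05])
groups = np.array([0, 0, 0, 1, 1, 1])      # bacterium index of each score
phages = ['phageA', 'phageB', 'phageC']    # phage order within each group
bacteria = ['bact1', 'bact2']

ranking = {}
for group in np.unique(groups):
    group_scores = scores[groups == group]
    order = np.argsort(-group_scores)      # indices from highest to lowest score
    ranking[bacteria[group]] = [(phages[k], float(group_scores[k])) for k in order]

print(ranking['bact1'])  # [('phageB', 0.9), ('phageC', 0.4), ('phageA', 0.1)]
```

The cells that follow apply the same idea to the real model output.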

In [7]:
# load the needed libraries
import pickle
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
%matplotlib inline

In [11]:
# Load the XGBoost model and make predictions
xgb = XGBClassifier()
xgb.load_model('phagehostlearn_esm2_xgb.json')
scores_xgb = xgb.predict_proba(features_esm2)[:,1]

In [14]:
# save prediction scores in an interaction matrix (rows: bacteria, columns: phages)
groups_bact = np.asarray(groups_bact)
loci_embeddings = pd.read_csv(loci_embeddings_path)
rbp_embeddings = pd.read_csv(rbp_embeddings_path)
bacteria = list(loci_embeddings['accession'])
# the column order is assumed to match the phage order used by construct_feature_matrices
phages = list(set(rbp_embeddings['phage_ID']))

score_matrix = np.zeros((len(bacteria), len(phages)))
# each group is the integer index of a bacterium in `bacteria`;
# np.unique returns the groups in sorted order, unlike set()
for group in np.unique(groups_bact):
    score_matrix[group, :] = scores_xgb[groups_bact == group]
results = pd.DataFrame(score_matrix, index=bacteria, columns=phages)
# keep the index so the bacterial accessions are written to the file
results.to_csv(path+'/prediction_results'+suffix+'.csv')

In [16]:
# rank the phages per bacterium, from highest to lowest prediction score
ranked = {}
for group in np.unique(groups_bact):
    scores_this_group = scores_xgb[groups_bact == group]
    # list of (phage, score) tuples sorted by descending score
    ranked_phages = sorted(zip(phages, scores_this_group), key=lambda pair: pair[1], reverse=True)
    ranked[bacteria[group]] = ranked_phages

# save results
with open(path+'/ranked_results'+suffix+'.pickle', 'wb') as f:
    pickle.dump(ranked, f)

## 5. Read & interpret results

In [8]:
# read results
with open(path+'/ranked_results'+suffix+'.pickle', 'rb') as f:
    ranked_results = pickle.load(f)

In [62]:
# print the top prediction scores per bacterium (columns are ranks 0 to top-1)
top = 5
scores = np.zeros((len(ranked_results), top))
for i, acc in enumerate(ranked_results):
    topscores = [round(score, 3) for (_, score) in ranked_results[acc][:top]]
    # slice the target so that fewer than `top` phages does not raise an error
    scores[i, :len(topscores)] = topscores
pd.DataFrame(scores, index=list(ranked_results.keys()))

| Bacterium | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| A1002KPN | 0.056 | 0.032 | 0.003 | 0.0 | 0.0 |
| E0204 | 1.0 | 1.0 | 1.0 | 0.985 | 0.961 |
| k4159 | 0.895 | 0.692 | 0.006 | 0.0 | 0.0 |
| 9517_7_8 | 0.849 | 0.422 | 0.007 | 0.004 | 0.003 |
| A1009KPN | 0.999 | 0.998 | 0.992 | 0.992 | 0.949 |
| K0006KPN | 0.056 | 0.032 | 0.003 | 0.0 | 0.0 |
| D0006 | 0.056 | 0.032 | 0.003 | 0.0 | 0.0 |
| K12100 | 0.999 | 0.999 | 0.992 | 0.99 | 0.954 |
| H0502KPN | 0.056 | 0.032 | 0.003 | 0.0 | 0.0 |
| K11933 | 0.999 | 0.999 | 0.992 | 0.99 | 0.954 |
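
The table above shows only scores, without the names of the ranked phages. The following sketch flattens a dictionary shaped like `ranked_results` (bacterium mapped to a list of `(phage, score)` tuples) into a long table that keeps the phage names. `top_hits` is a hypothetical helper, the toy names and scores are made up, and the 0.5 cut-off is an arbitrary illustration rather than a threshold recommended by PhageHostLearn.

```python
import pandas as pd

def top_hits(ranked_results, top=5, min_score=0.5):
    """Return the `top` ranked phages per bacterium that score at least `min_score`."""
    rows = []
    for accession, ranking in ranked_results.items():
        for rank, (phage, score) in enumerate(ranking[:top], start=1):
            if score >= min_score:
                rows.append({'bacterium': accession, 'rank': rank,
                             'phage': phage, 'score': round(float(score), 3)})
    return pd.DataFrame(rows, columns=['bacterium', 'rank', 'phage', 'score'])

# toy input in the same shape as ranked_results (made-up names and scores)
toy = {'bact1': [('phageB', 0.9), ('phageC', 0.4)],
       'bact2': [('phageA', 0.8), ('phageB', 0.2)]}
hits = top_hits(toy, top=2, min_score=0.5)
print(hits)
```

Calling `top_hits(ranked_results)` on the real results would give one row per retained phage-bacterium pair, which is easier to filter or export than the score-only matrix.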
