# PhageHostLearn - v3.3.klebsiella - human in the loop

An AI-based phage-host interaction prediction framework with K-loci and receptor-binding proteins (RBPs) at its core. This version of PhageHostLearn targets *Klebsiella pneumoniae* phages.

This notebook lets you add new data to the PhageHostLearn framework and retrain the PhageHostLearn prediction models without reprocessing all data from scratch. It assumes that you have completed the initial set-up carried out in `phagehostlearn_training.ipynb`.

**Overview of this notebook**
- Setting A: new validated interactions for the same data
- Setting B or C: new interactions for either new phages or bacteria (= new data)
- Setting D: new interactions for new combinations of phages AND bacteria

**Architecture of the PhageHostLearn framework**: 
- Multi-RBP setting: phages consisting of one or more RBPs (multi-instance)
- K-loci proteins (multi-instance)
- Embeddings for both, based on ESM-2 language models and hyperdimensional computing (HDC)
- Combined XGBoost model (for language embeddings) and Random Forest (for HDC embeddings) to make predictions
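
To make the multi-instance idea concrete, the sketch below pools several per-protein embeddings into one vector per phage (or per K-locus) and concatenates the two into a single feature vector for a phage-bacterium pair. Mean pooling, the embedding dimension and the function names `pool_instances` and `pair_features` are assumptions for illustration only; the actual feature construction is done by `phagehostlearn_features`.

```python
import numpy as np

def pool_instances(embeddings):
    """Average a (n_proteins, dim) array of embeddings into one (dim,) vector."""
    embeddings = np.asarray(embeddings, dtype=float)
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError("expected a non-empty 2D array of embeddings")
    return embeddings.mean(axis=0)

def pair_features(loci_embeddings, rbp_embeddings):
    """Concatenate the pooled K-locus vector and the pooled RBP vector."""
    return np.concatenate([pool_instances(loci_embeddings),
                           pool_instances(rbp_embeddings)])

# Toy data (hypothetical): a K-locus with 3 proteins and a phage with 2 RBPs,
# each protein represented by a 4-dimensional embedding
rng = np.random.default_rng(0)
loci = rng.normal(size=(3, 4))
rbps = rng.normal(size=(2, 4))
features = pair_features(loci, rbps)
print(features.shape)
```

Pooling gives every pair a fixed-length feature vector regardless of how many RBPs or K-locus proteins are involved, which is what a tabular classifier such as XGBoost or a random forest expects.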

In [1]:
import numpy as np
import pandas as pd
from joblib import dump, load
from tqdm.notebook import tqdm
from xgboost import XGBClassifier
import phagehostlearn_utils as phlu
import phagehostlearn_features as phlf
import phagehostlearn_processing as phlp
from sklearn.ensemble import RandomForestClassifier

In [2]:
general_path = '/Users/dimi/desktop/simpletest'
results_path = '/Users/dimi/desktop/simpletest/results'
#general_path = '/Users/dimi/GoogleDrive/PhD/4_PHAGEHOST_LEARNING/42_DATA/Valencia_data'
#results_path = '/Users/dimi/GoogleDrive/PhD/4_PHAGEHOST_LEARNING/43_RESULTS/models'
data_suffix = 'Valencia'

## Setting A: adding validated interactions for the same data

In this setting, we're not adding new phages or bacterial hosts, but we have tested new interactions for the phages and bacteria that are already present in the dataset. In this scenario, we only need to add those new interactions to our interaction matrix and retrain from there.

#### A.1 Manually add the validated interactions in the .xlsx file with interactions
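
If you prefer to script this step instead of editing the sheet by hand, a minimal sketch could look as follows. It assumes the interaction matrix has bacteria as rows and phages as columns, with 1 for a positive interaction, 0 for a negative one and empty cells for untested pairs. The helper `add_validated_interactions` and the toy names are hypothetical.

```python
import numpy as np
import pandas as pd

def add_validated_interactions(matrix, validated):
    """Return a copy of matrix with (bacterium, phage, label) triples filled in.

    Raises KeyError if a bacterium or phage is not already in the matrix,
    because Setting A only covers known phages and bacteria.
    """
    updated = matrix.copy()
    for bacterium, phage, label in validated:
        if bacterium not in updated.index:
            raise KeyError(f"unknown bacterium: {bacterium}")
        if phage not in updated.columns:
            raise KeyError(f"unknown phage: {phage}")
        updated.loc[bacterium, phage] = label
    return updated

# Hypothetical toy matrix; NaN marks an untested pair
matrix = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 0.0]],
    index=["K1", "K2"], columns=["phageA", "phageB"],
)
updated = add_validated_interactions(
    matrix, [("K1", "phageB", 0), ("K2", "phageA", 1)]
)
print(updated)
```

For the real file, the matrix could be loaded with `pd.read_excel(interactions_xlsx_path, index_col=0)` and written back with `DataFrame.to_excel`, assuming the first column of the sheet holds the bacterial names.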

#### A.2 Reconstruct interaction matrix and feature matrices

In [None]:
interactions_xlsx_path = general_path+'/klebsiella_phage_host_interactions.xlsx'
phlp.process_interactions(general_path, interactions_xlsx_path, data_suffix=data_suffix)

In [None]:
features_esm2, features_hdc, labels, groups_loci, groups_phage = phlf.construct_feature_matrices(
    general_path, data_suffix=data_suffix)

#### A.3 Retrain & save models

In [None]:
cpus=6
labels = np.asarray(labels)

In [None]:
# ESM-2 FEATURES + XGBoost model
# ratio of positives to negatives; scale_pos_weight below is its inverse (negatives / positives)
imbalance = np.sum(labels == 1) / np.sum(labels == 0)
xgb = XGBClassifier(scale_pos_weight=1/imbalance, learning_rate=0.2, n_estimators=250, max_depth=7,
                    n_jobs=cpus, eval_metric='logloss', use_label_encoder=False)
xgb.fit(features_esm2, labels)
xgb.save_model('phagehostlearn_esm2_xgb.json')
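
The cell above passes `1/imbalance` as `scale_pos_weight`, which equals the number of negatives divided by the number of positives, the ratio the XGBoost documentation suggests as a typical value for imbalanced binary data. The sketch below isolates that computation on toy labels; the helper name `positive_class_weight` and the guard against an empty class are additions for illustration, not part of the notebook.

```python
import numpy as np

def positive_class_weight(labels):
    """Return n_negative / n_positive for a binary 0/1 label array."""
    labels = np.asarray(labels)
    n_pos = int(np.sum(labels == 1))
    n_neg = int(np.sum(labels == 0))
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute a weight")
    return n_neg / n_pos

toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # 2 positives, 8 negatives
weight = positive_class_weight(toy_labels)
print(weight)  # 4.0
```

With this weight, each positive example counts as much as four negatives in the loss, which compensates for positives being rare in phage-host interaction data.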

In [None]:
# HDC FEATURES + RF model
rf = RandomForestClassifier(n_estimators=1000, max_depth=5, class_weight='balanced', n_jobs=cpus)
rf.fit(features_hdc, labels)
dump(rf, 'phagehostlearn_hdc_rf.joblib')

## Setting B/C: adding new phages or bacteria + interactions

In this setting, we're adding either new phages tested against the known bacteria, or new bacteria tested against the known phages. This entails adding the new genome FASTA files to the respective folders (see `phagehostlearn_training.ipynb`) and manually adding the new rows or columns to the interactions .xlsx file. Alternatively, you can create a `new_interactions.xlsx` file and combine it with the old interaction matrix in Python.
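
Combining a separate sheet with the old interaction matrix in Python could be sketched as below. It assumes both matrices have bacteria as rows and phages as columns, and that a value in the new matrix should override the old one when both exist. The helper `combine_interactions` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def combine_interactions(old, new):
    """Merge two bacteria-by-phage matrices.

    The result spans the union of rows and columns. Where both matrices
    hold a value, the one from `new` wins; untested pairs stay NaN.
    """
    return new.combine_first(old)

old = pd.DataFrame([[1.0, 0.0], [0.0, np.nan]],
                   index=["K1", "K2"], columns=["phageA", "phageB"])
# One new phage (phageC) tested against the known bacteria,
# plus a result for the previously untested K2/phageB pair
new = pd.DataFrame([[0.0, np.nan], [1.0, 1.0]],
                   index=["K1", "K2"], columns=["phageC", "phageB"])
combined = combine_interactions(old, new)
print(combined)
```

`DataFrame.combine_first` keeps the caller's non-missing values and falls back to the other frame elsewhere, so existing results are preserved unless the new sheet explicitly replaces them.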

#### BC.1 Manually add the new phage genomes or bacterial genomes to their designated folders

#### BC.2 Rerun the relevant processing steps with the `add=True` parameter

If you've added new phage genomes, you'll have to rerun PHANOTATE, the protein embedding computation and PhageRBPdetect. If you've added new bacterial genomes, you'll have to rerun Kaptive. Afterwards, rerun the processing of the interaction matrix.
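
The rule above can be written down as a small helper that returns the steps to rerun, in order. This is only a sketch of the decision logic; `steps_to_rerun` is a hypothetical function, and the step names are labels for the cells below rather than calls into the `phlp` module.

```python
def steps_to_rerun(new_phages, new_bacteria):
    """List the processing steps to rerun, in order, for the added data."""
    steps = []
    if new_phages:
        steps += ["PHANOTATE", "protein embeddings", "PhageRBPdetect"]
    if new_bacteria:
        steps += ["Kaptive"]
    # The interaction matrix is reprocessed in every setting
    steps.append("process interactions")
    return steps

phage_steps = steps_to_rerun(new_phages=True, new_bacteria=False)
bacteria_steps = steps_to_rerun(new_phages=False, new_bacteria=True)
print(phage_steps)
print(bacteria_steps)
```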

In [None]:
# PHANOTATE
phage_genomes_path = general_path+'/phages_genomes'
phanotate_path = '/opt/homebrew/Caskroom/miniforge/base/envs/ML1/bin/phanotate.py'
phlp.phanotate_processing(general_path, phage_genomes_path, phanotate_path, data_suffix=data_suffix, add=True)

In [None]:
# Protein embeddings (alternatively run in Google Colab or Kaggle)
phlp.compute_protein_embeddings(general_path, data_suffix=data_suffix, add=True)

In [None]:
# PhageRBPdetect
pfam_path = general_path+'/RBPdetect_phageRBPs.hmm'
hmmer_path = '/Users/Dimi/hmmer-3.3.1'
xgb_path = general_path+'/RBPdetect_xgb_hmm.json'
gene_embeddings_path = general_path+'/phage_protein_embeddings'+data_suffix+'.csv'
phlp.phageRBPdetect(general_path, pfam_path, hmmer_path, xgb_path, gene_embeddings_path, data_suffix=data_suffix)

In [None]:
# Kaptive
#bact_genomes_path = general_path+'/klebsiella_genomes/fasta_files'
bact_genomes_path = general_path+'/klebsiella_genomes'
kaptive_database_path = general_path+'/Klebsiella_k_locus_primary_reference.gbk'
phlp.process_bacterial_genomes(general_path, bact_genomes_path, kaptive_database_path, 
                               data_suffix=data_suffix, add=True)

#### BC.3 Manually add the new bacteria (as rows) or phages (as columns) and interactions in the interactions.xlsx Excel sheet
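
Manual edits to a spreadsheet are easy to get slightly wrong, so a quick sanity check before reconstructing the interaction matrix can help. The sketch below assumes bacteria as rows, phages as columns, and cells that are 0, 1 or empty; `check_interaction_matrix` is a hypothetical helper, not part of the PhageHostLearn modules.

```python
import numpy as np
import pandas as pd

def check_interaction_matrix(matrix):
    """Return a list of human-readable problems found in the matrix."""
    problems = []
    if matrix.index.has_duplicates:
        problems.append("duplicate bacterium names in rows")
    if matrix.columns.has_duplicates:
        problems.append("duplicate phage names in columns")
    # Non-numeric cells become NaN here, but they are still flagged below
    # because they were not empty in the original matrix
    numeric = matrix.apply(pd.to_numeric, errors="coerce")
    valid = (matrix.isna() | numeric.isin([0, 1])).to_numpy()
    n_invalid = int((~valid).sum())
    if n_invalid:
        problems.append(f"{n_invalid} cell(s) are not 0, 1 or empty")
    return problems

# Hypothetical toy matrix with a duplicated row name and an invalid value (2)
matrix = pd.DataFrame([[1, 0], [2, np.nan], [0, 1]],
                      index=["K1", "K2", "K2"], columns=["phageA", "phageB"])
problems = check_interaction_matrix(matrix)
for problem in problems:
    print(problem)
```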

#### BC.4 Reconstruct interaction matrix and feature matrices

In [7]:
interactions_xlsx_path = general_path+'/klebsiella_phage_host_interactions.xlsx'
phlp.process_interactions(general_path, interactions_xlsx_path, data_suffix=data_suffix)

In [None]:
# compute ESM-2 RBP embeddings
phlf.compute_esm2_embeddings_rbp(general_path, data_suffix=data_suffix, add=True)

In [None]:
# compute ESM-2 loci embeddings
phlf.compute_esm2_embeddings_loci(general_path, data_suffix=data_suffix, add=True)

In [None]:
# compute HDC embeddings
locibase_path = general_path+'/Locibase'+data_suffix+'.json'
rbpbase_path = general_path+'/RBPbase'+data_suffix+'.csv'
phlf.compute_hdc_embedding(general_path, data_suffix, locibase_path, rbpbase_path, mode='train')

In [None]:
features_esm2, features_hdc, labels, groups_loci, groups_phage = phlf.construct_feature_matrices(
    general_path, data_suffix=data_suffix)

#### BC.5 Retrain and save the models

In [None]:
cpus=6
labels = np.asarray(labels)

In [None]:
# ESM-2 FEATURES + XGBoost model
# ratio of positives to negatives; scale_pos_weight below is its inverse (negatives / positives)
imbalance = np.sum(labels == 1) / np.sum(labels == 0)
xgb = XGBClassifier(scale_pos_weight=1/imbalance, learning_rate=0.2, n_estimators=250, max_depth=7,
                    n_jobs=cpus, eval_metric='logloss', use_label_encoder=False)
xgb.fit(features_esm2, labels)
xgb.save_model('phagehostlearn_esm2_xgb.json')

In [None]:
# HDC FEATURES + RF model
rf = RandomForestClassifier(n_estimators=1000, max_depth=5, class_weight='balanced', n_jobs=cpus)
rf.fit(features_hdc, labels)
dump(rf, 'phagehostlearn_hdc_rf.joblib')

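## Setting D: adding new phages AND bacteria + interactions

In this setting, both new phages and new bacteria are added, so the phage and bacterial processing steps described under Setting B/C both apply before the feature matrices are reconstructed and the models are retrained.

#### D.1 Retrain and save the models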

In [11]:
cpus=6
labels = np.asarray(labels)

In [12]:
# ESM-2 FEATURES + XGBoost model
# ratio of positives to negatives; scale_pos_weight below is its inverse (negatives / positives)
imbalance = np.sum(labels == 1) / np.sum(labels == 0)
xgb = XGBClassifier(scale_pos_weight=1/imbalance, learning_rate=0.2, n_estimators=250, max_depth=7,
                    n_jobs=cpus, eval_metric='logloss', use_label_encoder=False)
xgb.fit(features_esm2, labels)
xgb.save_model('phagehostlearn_esm2_xgb.json')

In [13]:
# HDC FEATURES + RF model
rf = RandomForestClassifier(n_estimators=1000, max_depth=5, class_weight='balanced', n_jobs=cpus)
rf.fit(features_hdc, labels)
dump(rf, 'phagehostlearn_hdc_rf.joblib') 

['phagehostlearn_hdc_rf.joblib']