# Dual identification of novel phage receptor-binding proteins based on protein domains and machine learning
# *Predicting new sequences*


This notebook, together with the 'RBPdetect_protein_embeddings' notebook, can be used to make predictions for protein sequences with our domain-based and machine-learning-based approaches. Follow these steps:
- Prepare a FASTA file with the **protein** sequences you want to make predictions for.
- Download HMMER (http://hmmer.org) and note the path to the folder it was unpacked into (you need this path below).
- Make sure all the necessary software packages (*Libraries* below) are installed.
- Compute the necessary protein language embeddings using the 'RBPdetect_protein_embeddings' notebook. This is best done on Kaggle, where a GPU speeds up the computation. Download the computed embeddings from Kaggle afterwards.
- Copy the FASTA file and computed embeddings to the data folder of this GitHub repository. The data folder should also contain RBPdetect_xgb_model.json and RBPdetect_phageRBPs.hmm.
- Fill in the necessary file names in the second code block.
- Run the code blocks below to make predictions based on the domain-based approach and machine-learning-based approach.
- The resulting dataframe contains one row per submitted protein sequence, with a binary prediction (0/1) for each method: '0' means the sequence is predicted not to be an RBP, and '1' means it is predicted to be an RBP.
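
Because the first step calls for protein (not nucleotide) sequences, a quick sanity check on the FASTA file can catch a wrong input early. The sketch below is an illustration only and not part of RBPdetect: the helper names `read_fasta` and `looks_like_nucleotide`, the 0.9 cutoff, and the example records are all assumptions, and the parser is a plain-Python stand-in so the block runs without Biopython.

```python
from io import StringIO

NUCLEOTIDE_LETTERS = set("ACGTUN")

def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # the identifier is the first word of the header line
            identifier, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)

def looks_like_nucleotide(sequence, cutoff=0.9):
    """Heuristic (assumed cutoff): True if most letters are A/C/G/T/U/N."""
    sequence = sequence.upper()
    if not sequence:
        return False
    hits = sum(letter in NUCLEOTIDE_LETTERS for letter in sequence)
    return hits / len(sequence) >= cutoff

# hypothetical example input: one protein record and one DNA record
example = StringIO(">prot1 tail fiber\nMKTLLVAGSE\nQWERTY\n>dna1\nATGCGTACGTTAGC\n")
records = list(read_fasta(example))
suspicious = [name for name, seq in records if looks_like_nucleotide(seq)]
print(suspicious)  # ['dna1']
```

Any identifier reported here would need to be translated to protein before running the steps above. The heuristic can misfire on very short or low-complexity proteins, so treat a hit as a prompt to inspect the record rather than as proof.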

#### Libraries and files

In [1]:
from xgboost import XGBClassifier
from Bio import SeqIO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import RBPdetect_utils as rbpu

In [2]:
pfam_file = 'data/RBPdetect_phageRBPs.hmm'
xgb_file = 'data/RBPdetect_xgb_model.json'
fasta_file = 'data/examples.fasta'
hmmer_path = ... # e.g. '/Users/Sally/hmmer-3.3.1'
embeddings_file = ... # e.g. 'data/embeddings.csv'

#### Domain-based approach

In [None]:
# press the .hmm file for further use
output, err = rbpu.hmmpress_python(hmmer_path, pfam_file)
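
HMMER's `hmmpress` writes four binary index files next to the profile file (suffixes `.h3f`, `.h3i`, `.h3m`, `.h3p`) and refuses to overwrite existing ones unless it is run with `-f`. The sketch below checks whether a profile file has already been pressed. The helper name `missing_pressed_files` and the demo file names are hypothetical, and how `rbpu.hmmpress_python` handles an already-pressed file is not shown in this notebook.

```python
import tempfile
from pathlib import Path

# index files that hmmpress creates alongside <name>.hmm
PRESSED_SUFFIXES = ('.h3f', '.h3i', '.h3m', '.h3p')

def missing_pressed_files(hmm_file):
    """Return the names of hmmpress index files that do not exist yet."""
    hmm_file = Path(hmm_file)
    return [hmm_file.name + suffix for suffix in PRESSED_SUFFIXES
            if not hmm_file.with_name(hmm_file.name + suffix).exists()]

# demo in a temporary folder with placeholder files (not real HMM data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_hmm = Path(tmp_dir) / 'demo.hmm'
    demo_hmm.write_text('placeholder, not a real HMM profile\n')
    (Path(tmp_dir) / 'demo.hmm.h3f').touch()
    (Path(tmp_dir) / 'demo.hmm.h3i').touch()
    demo_missing = missing_pressed_files(demo_hmm)

print(demo_missing)  # ['demo.hmm.h3m', 'demo.hmm.h3p']
```

An empty list means all four index files are present, in which case the pressing step may not need to be repeated.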

In [5]:
# define HMMs to be detected as RBP-related
N_blocks = ['Phage_T7_tail', 'Tail_spike_N', 'Prophage_tail', 'BppU_N', 'Mtd_N', 
            'Head_binding', 'DUF3751', 'End_N_terminal', 'phage_tail_N', 'Prophage_tailD1',
            'DUF2163', 'Phage_fiber_2', 'unknown_N0', 'unknown_N1', 'unknown_N2', 'unknown_N3', 'unknown_N4',
            'unknown_N6', 'unknown_N10', 'unknown_N11', 'unknown_N12', 'unknown_N13', 'unknown_N17', 'unknown_N19',
            'unknown_N23', 'unknown_N24', 'unknown_N26', 'unknown_N29', 'unknown_N36', 'unknown_N45', 'unknown_N48',
            'unknown_N49', 'unknown_N53', 'unknown_N57', 'unknown_N60', 'unknown_N61', 'unknown_N65', 'unknown_N73', 
            'unknown_N82', 'unknown_N83', 'unknown_N101', 'unknown_N114', 'unknown_N119', 'unknown_N122', 
            'unknown_N163', 'unknown_N174', 'unknown_N192', 'unknown_N200', 'unknown_N206', 'unknown_N208']
C_blocks = ['Lipase_GDSL_2', 'Pectate_lyase_3', 'gp37_C', 'Beta_helix', 'Gp58', 'End_beta_propel', 
            'End_tail_spike', 'End_beta_barrel', 'PhageP22-tail', 'Phage_spike_2', 
            'gp12-short_mid', 'Collar', 
            'unknown_C2', 'unknown_C3', 'unknown_C8', 'unknown_C15', 'unknown_C35', 'unknown_C54', 'unknown_C76', 
            'unknown_C100', 'unknown_C105', 'unknown_C112', 'unknown_C123', 'unknown_C179', 'unknown_C201', 
            'unknown_C203', 'unknown_C228', 'unknown_C234', 'unknown_C242', 'unknown_C258', 'unknown_C262', 
            'unknown_C267', 'unknown_C268', 'unknown_C274', 'unknown_C286', 'unknown_C292', 'unknown_C294', 
            'Peptidase_S74', 'Phage_fiber_C', 'S_tail_recep_bd', 'CBM_4_9', 'DUF1983', 'DUF3672']

In [None]:
# do domain-based detections
domain_based_detections = rbpu.RBPdetect_domains_protein(hmmer_path, pfam_file, fasta_file, N_blocks=N_blocks, 
                                                         C_blocks=C_blocks, detect_others=False)

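# label each protein in the FASTA file: 1 if it has a domain-based detection, 0 otherwise
names = [record.id for record in SeqIO.parse(fasta_file, 'fasta')]
detected_ids = set(domain_based_detections['identifier'])  # set gives fast membership tests
domain_preds = [1 if pid in detected_ids else 0 for pid in names]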
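
The labeling rule used here is simple: a protein gets a 1 if its identifier appears at least once in the detection table, and a 0 otherwise. As an illustration, the same rule can be written with pandas `isin`. The toy identifiers and the names `toy_names`, `toy_detections`, and `toy_domain_preds` are hypothetical stand-ins; only the `identifier` column name comes from the code in this notebook.

```python
import pandas as pd

# hypothetical stand-ins for the FASTA identifiers and the detection table;
# a protein may appear in several rows if more than one domain was detected
toy_names = ['protA', 'protB', 'protC', 'protD']
toy_detections = pd.DataFrame({'identifier': ['protC', 'protA', 'protC']})

# 1 if the identifier occurs in the detection table, else 0, in FASTA order
toy_domain_preds = (
    pd.Series(toy_names).isin(toy_detections['identifier']).astype(int).tolist()
)
print(toy_domain_preds)  # [1, 0, 1, 0]
```

The output keeps the order of the FASTA identifiers, which matters later when the two prediction vectors are placed side by side.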

#### Machine-learning-based approach

In [None]:
# load protein embeddings to make predictions for
embeddings_df = pd.read_csv(embeddings_file)
# skip the first column (presumably the protein identifiers) and keep the embedding values
embeddings = np.asarray(embeddings_df.iloc[:, 1:])
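
The two prediction vectors are later combined row by row, so the embeddings need to be in the same order as the sequences in the FASTA file. The sketch below shows one way to check and enforce that order. It assumes that the first column of the embeddings file holds the protein identifiers, which this notebook does not state explicitly; the column name `ID`, the variable names, and the toy values are hypothetical.

```python
import pandas as pd

# hypothetical stand-ins: FASTA identifiers and an embeddings table in a different order
toy_fasta_ids = ['protA', 'protB', 'protC']
toy_embeddings_df = pd.DataFrame({
    'ID': ['protC', 'protA', 'protB'],  # assumed identifier column
    '0': [0.3, 0.1, 0.2],
    '1': [3.0, 1.0, 2.0],
})

id_column = toy_embeddings_df.columns[0]

# fail early if a sequence has no embedding
missing_ids = set(toy_fasta_ids) - set(toy_embeddings_df[id_column])
if missing_ids:
    raise ValueError(f'No embedding found for: {sorted(missing_ids)}')

# reorder the rows to follow the FASTA order, then drop the identifiers
aligned = toy_embeddings_df.set_index(id_column).loc[toy_fasta_ids]
toy_features = aligned.to_numpy()
print(toy_features.shape)  # (3, 2)
```

If the embeddings were computed from the same FASTA file without reordering, this step changes nothing, but it makes a mismatch visible instead of silently pairing predictions with the wrong proteins.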

In [None]:
# load trained model
xgb_saved = XGBClassifier()
xgb_saved.load_model(xgb_file)

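# make predictions with the XGBoost model
# column 1 of predict_proba holds the probability of the positive (RBP) class
score_xgb = xgb_saved.predict_proba(embeddings)[:,1]
# convert probabilities to binary labels: 1 if the score exceeds 0.5, else 0
preds_xgb = (score_xgb > 0.5).astype(int)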
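
The binary label depends on the cutoff applied to the predicted probability. The toy example below illustrates how the 0.5 cutoff used here behaves, including for a score of exactly 0.5, and how a stricter cutoff would change the labels. The scores and the 0.9 alternative are made up for illustration and are not recommended values.

```python
import numpy as np

# hypothetical RBP probabilities for four proteins
toy_scores = np.array([0.05, 0.50, 0.51, 0.97])

# same rule as in this notebook: strictly greater than 0.5 counts as RBP
default_labels = (toy_scores > 0.5).astype(int)

# a stricter, purely illustrative cutoff yields fewer positive calls
strict_labels = (toy_scores > 0.9).astype(int)

print(default_labels.tolist())  # [0, 0, 1, 1]
print(strict_labels.tolist())   # [0, 0, 0, 1]
```

Keeping the raw scores alongside the labels makes it possible to revisit the cutoff later without rerunning the model.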

#### Save predictions of both methods together

In [None]:
results = pd.DataFrame({'domain_based_predictions':domain_preds, 'machine_learning_predictions':preds_xgb})
results.to_csv('results_predictions.csv', index=False)
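
The saved table has one row per protein but no identifier column, so rows can only be matched to proteins by position. The sketch below, using toy values, shows one way to attach identifiers and summarize where the two methods agree. The column names `protein_id`, `both_methods`, and `either_method` are hypothetical additions and not part of the original output.

```python
import pandas as pd

# hypothetical identifiers and predictions, all in the same (FASTA) order
toy_ids = ['protA', 'protB', 'protC', 'protD']
toy_domain = [1, 0, 1, 0]
toy_ml = [1, 1, 0, 0]

toy_results = pd.DataFrame({
    'protein_id': toy_ids,
    'domain_based_predictions': toy_domain,
    'machine_learning_predictions': toy_ml,
})

# 1 only if both methods call the protein an RBP
toy_results['both_methods'] = (
    toy_results['domain_based_predictions'] & toy_results['machine_learning_predictions']
)
# 1 if at least one method calls the protein an RBP
toy_results['either_method'] = (
    toy_results['domain_based_predictions'] | toy_results['machine_learning_predictions']
)
print(toy_results)
```

Whether to rely on the intersection or the union of the two methods depends on how many false positives can be tolerated downstream.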