# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Technically, this notebook only needs `consolidated/rbp_embeddings_seqvec.csv` and `consolidated/rbp_embeddings_saprot_relaxed_r3.csv`.
1. Create a folder named `inphared` inside `data`, and save the extracted `consolidated` folder inside `data/inphared`.
1. Download `fasta.tar.gz` from this [link](https://drive.google.com/file/d/1NMFR3JrrrCHLoCMQp2nia4dgtcXs5x05/view?usp=sharing), and extract it. This should result in a folder named `fasta`. <br> Technically, this notebook only needs the `.clstr` files inside `fasta`.
1. Save the extracted `fasta` folder inside `data/inphared`.

### Option 2: Generate the prerequisite files yourself
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Technically, this notebook only needs `consolidated/rbp_embeddings_seqvec.csv` and `consolidated/rbp_embeddings_saprot_relaxed_r3.csv`.
1. Consolidate the sequences of the proteins with predicted structures into a single FASTA file. <br>
   For reproducibility, we provide our consolidated FASTA file [here](https://drive.google.com/file/d/1LTZte1f4lreQ5MXWeM-y2Mtp9z96pXS7/view?usp=sharing).
1. Generate the protein clusters by running CD-HIT on this FASTA file at a sequence similarity threshold of 100%, following the instructions [here](https://github.com/weizhongli/cdhit). 
1. Rename the resulting `.clstr` file to `complete-struct-100.fasta.clstr` and the resulting FASTA file (containing only the representative sequences) to `complete-struct-100.fasta`. 
1. Generate `complete-struct-80.fasta.clstr`, `complete-struct-60.fasta.clstr`, and `complete-struct-40.fasta.clstr` by running CD-HIT on `complete-struct-100.fasta` at sequence similarity thresholds of 80%, 60%, and 40%, respectively.
1. Create a folder named `fasta` inside `data/inphared`, and save the four `.clstr` files inside `data/inphared/fasta`.
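
The CD-HIT steps above can be sketched as a small Python helper that only builds the command lines. This is an illustration under assumptions, not the exact invocation used by the authors: the input name `consolidated.fasta` and the helper names are hypothetical, and the word sizes passed to `-n` follow the ranges suggested in the CD-HIT user guide. Because CD-HIT writes its cluster file to `<output>.clstr`, naming the output `complete-struct-100.fasta` directly yields the file names expected by this notebook.

```python
from pathlib import Path


def word_size(identity: float) -> int:
    """Word size (-n) suggested by the CD-HIT user guide for a given identity."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2  # intended for 0.4 <= identity < 0.5


def cdhit_commands(consolidated_fasta: str, out_dir: str = "data/inphared/fasta"):
    """Build the four CD-HIT command lines (100%, then 80%, 60%, 40%)."""
    out = Path(out_dir)
    rep_100 = (out / "complete-struct-100.fasta").as_posix()

    # First pass: cluster the consolidated FASTA file at 100% similarity.
    commands = [
        ["cd-hit", "-i", consolidated_fasta, "-o", rep_100, "-c", "1.0", "-n", "5"]
    ]

    # Later passes all start from the 100% representative sequences.
    for percent in (80, 60, 40):
        identity = percent / 100
        output = (out / f"complete-struct-{percent}.fasta").as_posix()
        commands.append(
            ["cd-hit", "-i", rep_100, "-o", output,
             "-c", str(identity), "-n", str(word_size(identity))]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical input file name; replace with your consolidated FASTA file.
    for command in cdhit_commands("consolidated.fasta"):
        print(" ".join(command))
```

To actually run the clustering, each list can be passed to `subprocess.run(command, check=True)`, provided `cd-hit` is installed and on your `PATH`.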

### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_seqvec.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_saprot_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `fasta` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `complete-struct-100.fasta.clstr` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `complete-struct-80.fasta.clstr` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `complete-struct-60.fasta.clstr` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `complete-struct-40.fasta.clstr` <br>
↳ `5.10. Benchmarking - Classifier Building & Evaluation (SeqVec).ipynb` (this notebook) <br>
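
Before running the notebook, it may help to confirm that the folder structure above is in place. The following is a minimal sketch; the function name `missing_prerequisites` and the default path are illustrative assumptions, and the file list simply mirrors the tree shown above.

```python
from pathlib import Path

# Files listed in the folder structure, relative to data/inphared.
REQUIRED_FILES = [
    "consolidated/rbp_embeddings_seqvec.csv",
    "consolidated/rbp_embeddings_saprot_relaxed_r3.csv",
    "fasta/complete-struct-100.fasta.clstr",
    "fasta/complete-struct-80.fasta.clstr",
    "fasta/complete-struct-60.fasta.clstr",
    "fasta/complete-struct-40.fasta.clstr",
]


def missing_prerequisites(inphared_dir="data/inphared"):
    """Return the required files that are not present under inphared_dir."""
    root = Path(inphared_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing prerequisite files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All prerequisite files found.")
```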

<hr>

# 📁 Output files

The output files (i.e., the results of evaluating the model's performance) are saved in `temp/results` and are already included in the cloned repository. <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [9]:
import warnings

import pandas as pd
import sklearn

import ConstantsUtil
import ClassificationUtil

%load_ext autoreload
%autoreload 2

In [10]:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", 50)

pd.options.mode.chained_assignment = None

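# Register the filter at module level. Filters set inside a
# `warnings.catch_warnings()` block are discarded when the block exits.
warnings.filterwarnings(
    "ignore", category=sklearn.exceptions.UndefinedMetricWarning
)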

In [11]:
constants = ConstantsUtil.ConstantsUtil()
util = ClassificationUtil.ClassificationUtil()

<hr>

# Part II: Classifier Building and Evaluation

Train a multilayer perceptron, and evaluate its performance at train-versus-test sequence similarity thresholds of 100%, 80%, 60%, and 40% and at different confidence thresholds.
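
To illustrate what a confidence threshold does, here is a minimal sketch. It assumes that the threshold `k` is applied to the highest predicted class probability and that predictions falling below it are relabeled as `others`. This notebook does not spell out the rule, so treat the function `predict_with_threshold` and its behavior as a hypothetical illustration rather than the logic inside `ClassificationUtil`.

```python
def predict_with_threshold(probabilities, class_names, k, fallback="others"):
    """Return the top class per sample, or `fallback` if its probability is below k.

    Assumption: k is compared against the highest class probability.
    """
    predictions = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the top class
        predictions.append(class_names[best] if row[best] >= k else fallback)
    return predictions


if __name__ == "__main__":
    classes = ["escherichia", "klebsiella", "others"]
    probs = [
        [0.70, 0.20, 0.10],  # confident prediction
        [0.40, 0.35, 0.25],  # low-confidence prediction
    ]
    print(predict_with_threshold(probs, classes, k=0.0))
    print(predict_with_threshold(probs, classes, k=0.5))
```

Under this assumption, raising `k` trades recall on the target genera for fewer low-confidence assignments, which is why the evaluation is repeated at several values of `k`.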

In [15]:
models = ["SEQVEC"]

for similarity in range(100, 39, -20):
    for model in models:
        model = model.lower()
        df, df_all, protein_clusters = util.filter_proteins_based_on_struct_and_seq_sim(
            f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_{model}.csv",
            f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_saprot_relaxed_r3.csv",
            f"{constants.INPHARED}/{constants.FASTA}/complete-struct-{similarity}.fasta.clstr",
        )

        # Cluster members are included only below the 100% similarity threshold.
        include_proteins_in_cluster = similarity != 100

        print(f"*** {model}, similarity = {similarity}% ***")
        util.classify(
            df,
            model + "-mlp-eskapee-smotetomek",
            similarity,
            genus=[
                "enterococcus",
                "staphylococcus",
                "klebsiella",
                "acinetobacter",
                "pseudomonas",
                "enterobacter",
                "escherichia",
            ],
            include_proteins_in_cluster=include_proteins_in_cluster,
            rbp_embeddings_all=df_all,
            protein_clusters=protein_clusters,
            undersample_others=True,
            oversample_technique="SMOTETomek",
            model="MLP-prot",
            batch_size=256,
            learning_rate=1e-3,
            dropout=0.3,
        )

*** seqvec, similarity = 100% ***
Constructing training and test sets...
Training set shape: (16934, 1024)
Test set shape: (2340, 1024)
Training the model...
Saving evaluation results...
Confidence threshold k: 0.0%
                precision    recall  f1-score   support

 acinetobacter     0.9459    0.6306    0.7568       111
  enterobacter     0.2981    0.4138    0.3466       116
  enterococcus     0.7377    0.8824    0.8036        51
   escherichia     0.8643    0.8269    0.8452      1040
    klebsiella     0.8458    0.8568    0.8513       461
        others     0.0000    0.0000    0.0000        51
   pseudomonas     0.7947    0.9407    0.8616       354
staphylococcus     0.9080    0.9487    0.9279       156

      accuracy                         0.8115      2340
     macro avg     0.6743    0.6875    0.6741      2340
  weighted avg     0.8073    0.8115    0.8062      2340

Confidence threshold k: 10.0%
                precision    recall  f1-score   support

 acinetobacter     0.9