# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `rbp_saprot_mask_embeddings.tar.gz` from this [link](https://drive.google.com/file/d/1N6mWO0gG82oP99NqA_pSiAcXFZW6Xk9o/view?usp=sharing), and extract it. This should result in a folder named `rbp_saprot_mask_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save the extracted `rbp_saprot_mask_embeddings` folder inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook needs only `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 
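
The extract-and-place steps above can also be scripted. The following is a minimal sketch that uses only Python's standard library. It assumes both archives have already been downloaded manually, and `extract_archive` is a hypothetical helper that is not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, destination):
    """Extract a .tar.gz archive into `destination`, creating the folder if needed.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return destination
```

For example, calling `extract_archive("rbp_saprot_mask_embeddings.tar.gz", "data/inphared/structure")` and then `extract_archive("consolidated.tar.gz", "data/inphared")` should reproduce the layout described above, assuming both archives sit in the current working directory.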

### Option 2: Generate the prerequisite files yourself (may take a couple of weeks!)
1. Use ColabFold to predict the protein structures of the sequences in `data/inphared/fasta/complete`, following the instructions [here](https://github.com/YoshitakaMo/localcolabfold). <br>Refer to our paper for the parameters we used when running ColabFold. <br>For reproducibility, we provide the results of running ColabFold [here](https://drive.google.com/file/d/1ZPRdaHwsFOPksLbOyQerREG0gY0p4-AT/view?usp=sharing).
1. Encode the predicted structures using SaProt's structure-aware alphabet, following the instructions [here](https://github.com/westlake-repl/SaProt?tab=readme-ov-file#convert-protein-structure-into-structure-aware-sequence). <br>**Make sure to set the `plddt_mask` parameter of `get_struc_seq()` to `True`.** <br>For reproducibility, we provide the results of this encoding step [here](https://drive.google.com/file/d/10vflEnYUJOVoTWXbYaWg5DGiFobGeT6s/view?usp=sharing).
1. Feed the results of the encoding step to SaProt in order to generate the structure-aware embeddings, following the instructions [here](https://github.com/westlake-repl/SaProt/issues/14). 
1. Save each embedding following this naming convention: `<protein_id>_relaxed.r3.pdb.pt`, and consolidate all the embeddings inside a folder named `rbp_saprot_mask_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_mask_embeddings` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook needs only `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 
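
The naming convention in step 4 matters because the protein ID is later recovered from each filename. The sketch below shows one way to build and parse such filenames. The helper names are hypothetical and only illustrate the `<protein_id>_relaxed.r3.pdb.pt` pattern; they are not functions from this repository or from SaProt.

```python
from pathlib import Path

# Suffix taken from the naming convention above.
EMBEDDING_SUFFIX = "_relaxed.r3.pdb.pt"


def embedding_filename(protein_id):
    """Return the expected embedding filename for a protein ID."""
    return f"{protein_id}{EMBEDDING_SUFFIX}"


def protein_id_from_filename(filename):
    """Recover the protein ID from an embedding filename or path."""
    name = Path(filename).name
    if not name.endswith(EMBEDDING_SUFFIX):
        raise ValueError(f"{name!r} does not end with {EMBEDDING_SUFFIX!r}")
    return name[: -len(EMBEDDING_SUFFIX)]
```

Stripping the fixed suffix, rather than splitting on `.` or `_`, keeps protein IDs such as `AAA74324.1` intact even though they contain a dot.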

### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_mask_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `AAA74324.1_relaxed.r3.pdb.pt` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `3.3. Data Consolidation (SaProt with Low-Confidence Masking).ipynb` (this notebook) <br>
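
Before running the notebook, it may help to confirm that the folder structure above is in place. The following is a minimal sketch of such a check. `missing_prerequisites` is a hypothetical helper, and it assumes it is called with the folder that contains this notebook as `root`.

```python
from pathlib import Path


def missing_prerequisites(root="."):
    """Return a list of human-readable problems with the expected layout.

    Hypothetical helper; an empty list means the layout shown above was found.
    """
    inphared = Path(root) / "data" / "inphared"
    problems = []

    rbp_csv = inphared / "consolidated" / "rbp.csv"
    if not rbp_csv.is_file():
        problems.append(f"missing file: {rbp_csv}")

    embeddings = inphared / "structure" / "rbp_saprot_mask_embeddings"
    if not embeddings.is_dir():
        problems.append(f"missing folder: {embeddings}")
    elif not any(embeddings.glob("*_relaxed.r3.pdb.pt")):
        problems.append(f"no *_relaxed.r3.pdb.pt files in: {embeddings}")

    return problems
```

Running `print(missing_prerequisites())` from the notebook's folder should print an empty list when everything is in place.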

<hr>

# 📁 Output files

If you would like to skip running this notebook, download its output files instead:

1. Download `rbp_saprot_mask_relaxed_r3.csv` from this [link](https://drive.google.com/file/d/15M25MbPMmfpk9rAy2I5Y3SlqC4Gi-EId/view?usp=sharing). This CSV file consolidates the embeddings.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_mask_relaxed_r3.csv` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook generates only `consolidated/rbp_embeddings_saprot_mask_relaxed_r3.csv`, which consolidates the phage-host information and the embeddings.
1. Save the extracted `consolidated` folder inside `data/inphared`.


### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_saprot_mask_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_mask_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_mask_relaxed_r3.csv` <br>
↳ `3.3. Data Consolidation (SaProt with Low-Confidence Masking).ipynb` (this notebook) <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import pandas as pd

import ConstantsUtil
import StructureUtil

%load_ext autoreload
%autoreload 2

In [2]:
constants = ConstantsUtil.ConstantsUtil()
util = StructureUtil.StructureUtil()

<hr>

# Part II: Consolidation of the SaProt embeddings

Consolidate the embeddings into a single data frame.

In [3]:
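# Load the per-protein embeddings (files ending in `_relaxed.r3.pdb.pt`) into
# one data frame: one row per protein, with columns s1 to s1280.
rbp_saprot_mask_relaxed_r3 = util.convert_saprot_pt_to_df(
    f"{constants.INPHARED}/{constants.STRUCTURE_SAPROT_MASK}", "_relaxed.r3"
)
rbp_saprot_mask_relaxed_r3.head()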

100%|████████████████████████████████████████████████████████████████████████████| 28977/28977 [47:34<00:00, 10.15it/s]


| | Protein ID | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AAA74324.1 | -0.008504 | -0.014574 | 0.011509 | 0.036482 | -0.038966 | -0.103524 | -0.081406 | 0.018075 | 0.011727 | ... | -0.041147 | -0.006205 | -0.005391 | 0.068507 | -0.28156 | 0.037484 | -0.038498 | -0.030083 | -0.045225 | -0.049191 |
| 1 | AAA74331.1 | 0.004274 | 0.010737 | -0.068663 | 8.5e-05 | -0.0679 | -0.072488 | -0.060677 | 0.061941 | 0.018569 | ... | -0.050606 | -0.047386 | -0.005598 | 0.068039 | -0.306017 | 0.034562 | -0.011731 | -0.030635 | -0.016533 | 0.026753 |
| 2 | AAA98578.2 | 0.016729 | 0.020582 | 0.003032 | 0.04445 | -0.052762 | -0.076996 | -0.07439 | 0.022359 | 0.009708 | ... | -0.035153 | -0.012395 | -0.03051 | 0.097494 | -0.281786 | 0.014246 | -0.024108 | 0.011808 | -0.012506 | 0.015401 |
| 3 | AAB09218.1 | -0.022866 | 0.045344 | -0.009843 | 0.048203 | -0.067898 | -0.028404 | -0.060779 | 0.013387 | 0.009138 | ... | -0.016834 | 0.027542 | -0.016799 | 0.068709 | -0.352338 | 0.05111 | -0.048934 | -0.033355 | -0.033687 | -0.054914 |
| 4 | AAB70057.1 | -0.026706 | 0.033634 | 0.003055 | 0.019606 | -0.051299 | -0.065443 | -0.080492 | 0.039519 | 0.044541 | ... | -0.002792 | 0.013223 | -0.023988 | 0.077445 | -0.304381 | 0.026808 | -0.020154 | -0.018398 | -0.035166 | -0.060861 |
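
The implementation of `convert_saprot_pt_to_df` is not shown in this notebook. The following is only a rough sketch of how such a consolidation step could work, under two assumptions: each `.pt` file holds a single tensor that flattens to one fixed-length vector, and the protein ID is the filename with the suffix removed. `consolidate_embeddings` and `load_with_torch` are hypothetical names, not the repository's own functions.

```python
from pathlib import Path

import pandas as pd


def load_with_torch(path):
    """Assumed loader: the file stores one tensor, flattened to a list."""
    import torch  # imported lazily; only needed when reading real .pt files

    return torch.load(path, map_location="cpu").flatten().tolist()


def consolidate_embeddings(folder, suffix="_relaxed.r3", load_vector=load_with_torch):
    """Build one row per protein: 'Protein ID' followed by s1, s2, ..."""
    tail = f"{suffix}.pdb.pt"
    rows = []
    for path in sorted(Path(folder).glob(f"*{tail}")):
        protein_id = path.name[: -len(tail)]
        rows.append([protein_id, *load_vector(path)])

    if not rows:
        return pd.DataFrame(columns=["Protein ID"])

    # The first entry of each row is the ID; the rest are embedding values.
    columns = ["Protein ID"] + [f"s{i}" for i in range(1, len(rows[0]))]
    return pd.DataFrame(rows, columns=columns)
```

Passing the loader as a parameter keeps the sketch testable without PyTorch installed: any function that maps a path to a list of floats can stand in for `load_with_torch`.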


In [4]:
rbp_saprot_mask_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CSV_SAPROT_MASK}_relaxed_r3.csv", index=False
)

Combine the embeddings data frame with the data frame containing the phage-host information.

In [5]:
rbp_nucleotide_seq = pd.read_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/{constants.INPHARED_RBP_DATA}"
)
rbp_nucleotide_seq.head()

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | Genbank Division | Isolation Host (beware inconsistent and nonsense values) | Host Superkingdom | Host Phylum | Host Class | Host Order | Host Family | Year-Month | Protein Sequence | Nucleotide Sequence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | PHG | Wolbachia sp. wCauB | bacteria | pseudomonadota | alphaproteobacteria | rickettsiales | anaplasmataceae | 2016-07 | MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV... | ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA... |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD... | GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG... |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV... | GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT... |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD... | TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT... |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS... | TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA... |


In [6]:
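# Keep only the proteins present in both data frames (inner join).
# validate="one_to_one" raises a MergeError if a Protein ID is duplicated on either side.
rbp_structure_embeddings_relaxed_r3 = pd.merge(
    rbp_nucleotide_seq,
    rbp_saprot_mask_relaxed_r3,
    how="inner",
    validate="one_to_one",
    on="Protein ID",
)
rbp_structure_embeddings_relaxed_r3.head()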

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | -0.035965 | 0.008695 | -0.034298 | 0.051132 | -0.309808 | 0.038125 | -0.01948 | -0.035563 | -0.020148 | -0.059373 |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | 0.047789 | 0.008117 | 0.018838 | 0.031085 | -0.289683 | 0.044725 | -0.068815 | -0.005499 | -0.015393 | -0.028916 |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.065459 | 0.043109 | -0.006144 | 0.043708 | -0.312704 | 0.063939 | -0.111784 | -0.087121 | -0.071293 | -0.056054 |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.027433 | -0.000972 | -0.065013 | 0.052423 | -0.329274 | -0.010864 | -0.04541 | -0.050286 | -0.056554 | -0.049343 |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.03701 | 0.006602 | -0.035198 | 0.124595 | -0.300518 | -0.031494 | -0.003482 | -0.013482 | 0.018368 | -0.066325 |
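
Because the merge above uses `how="inner"`, any protein that appears in only one of the two data frames is dropped without a warning. The toy example below illustrates this behavior and shows one possible way to list the unmatched IDs with an outer merge and pandas' `indicator` option. The data and the `unmatched_ids` helper are made up for illustration and are not part of this repository.

```python
import pandas as pd

# Toy stand-ins for the phage-host table and the embeddings table.
phage_host = pd.DataFrame(
    {"Protein ID": ["P1", "P2", "P3"], "Host Family": ["a", "b", "c"]}
)
embeddings = pd.DataFrame({"Protein ID": ["P2", "P3", "P4"], "s1": [0.1, 0.2, 0.3]})

# Same arguments as the notebook's merge: only P2 and P3 occur on both sides.
merged = pd.merge(
    phage_host, embeddings, how="inner", validate="one_to_one", on="Protein ID"
)


def unmatched_ids(left, right, key="Protein ID"):
    """Map each ID found on only one side to 'left_only' or 'right_only'."""
    outer = pd.merge(left[[key]], right[[key]], how="outer", on=key, indicator=True)
    only_one_side = outer.loc[outer["_merge"] != "both"]
    return only_one_side.set_index(key)["_merge"].astype(str).to_dict()
```

Checking `unmatched_ids(rbp_nucleotide_seq, rbp_saprot_mask_relaxed_r3)` before the inner merge would show which proteins lack an embedding or lack phage-host information.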


In [7]:
rbp_structure_embeddings_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_saprot_mask_relaxed_r3.csv",
    index=False,
)