# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `rbp_saprot_struct_mask_embeddings.tar.gz` from this [link](https://drive.google.com/file/d/1GAUsVFQSvKJ2COU1-jUQ5lC3yBUDx9ut/view?usp=sharing), and extract it. This should produce a folder named `rbp_saprot_struct_mask_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Then create a folder named `structure` inside `data/inphared`, and save the extracted `rbp_saprot_struct_mask_embeddings` folder inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should produce a folder named `consolidated`. <br> Strictly speaking, this notebook needs only `consolidated/rbp.csv` from this folder.
1. Save the extracted `consolidated` folder inside `data/inphared`.

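The extraction and placement steps in Option 1 can also be scripted. The following is a minimal sketch, assuming both archives were downloaded into the current working directory under the file names given above and that each archive contains a single top-level folder as described. The helper name `extract_archive` is hypothetical and is not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, destination):
    """Extract a .tar.gz archive into `destination`, creating the folder if needed."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return destination


# Assumed download locations (current working directory); adjust as needed.
ARCHIVES = {
    "rbp_saprot_struct_mask_embeddings.tar.gz": "data/inphared/structure",
    "consolidated.tar.gz": "data/inphared",
}

for archive_name, target_folder in ARCHIVES.items():
    if Path(archive_name).exists():  # skip archives that have not been downloaded
        extract_archive(archive_name, target_folder)
```

Run the script from the `experiments` folder so that the relative `data/...` paths match the folder structure expected by this notebook.
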
### Option 2: Generate the prerequisite files yourself (may take a couple of weeks!)
1. Use ColabFold to predict the protein structures of the sequences in `data/inphared/fasta/complete`, following the instructions [here](https://github.com/YoshitakaMo/localcolabfold). <br>Refer to our paper for the parameters we used to run ColabFold. <br>For reproducibility, we provide the results of running ColabFold [here](https://drive.google.com/file/d/1ZPRdaHwsFOPksLbOyQerREG0gY0p4-AT/view?usp=sharing).
1. Encode the predicted structures using SaProt's structure-aware alphabet, following the instructions [here](https://github.com/westlake-repl/SaProt?tab=readme-ov-file#convert-protein-structure-into-structure-aware-sequence). <br>**Make sure to change all the structure (lowercase) tokens to `#`.** <br>For reproducibility, we provide the results of this encoding step [here](https://drive.google.com/file/d/1Cwwp8iyX94LGqx55fRMUVwHVGPFDHehr/view?usp=sharing).
1. Feed the results of the encoding step to SaProt to generate the structure-aware embeddings, following the instructions [here](https://github.com/westlake-repl/SaProt/issues/14).
1. Save each embedding using the naming convention `<protein_id>_relaxed.r3.pdb.pt`, and place all the embeddings in a folder named `rbp_saprot_struct_mask_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Then create a folder named `structure` inside `data/inphared`, and save `rbp_saprot_struct_mask_embeddings` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should produce a folder named `consolidated`. <br> Strictly speaking, this notebook needs only `consolidated/rbp.csv` from this folder.
1. Save the extracted `consolidated` folder inside `data/inphared`.

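The masking step and the file naming convention in Option 2 can be illustrated with a short sketch. It assumes that a structure-aware sequence interleaves uppercase amino acid letters with lowercase structure tokens, as described in the SaProt instructions linked above. The function names are hypothetical and are not part of this repository or of SaProt.

```python
import re


def mask_structure_tokens(sa_sequence):
    """Replace every lowercase (structure) token with '#', keeping the amino acids."""
    return re.sub(r"[a-z]", "#", sa_sequence)


def embedding_filename(protein_id):
    """Build the file name expected by this notebook for a protein's embedding."""
    return f"{protein_id}_relaxed.r3.pdb.pt"


print(mask_structure_tokens("MdEvVpQpLrVyQdYaKv"))  # M#E#V#Q#L#V#Q#Y#K#
print(embedding_filename("AAA74324.1"))  # AAA74324.1_relaxed.r3.pdb.pt
```
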
### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_struct_mask_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `AAA74324.1_relaxed.r3.pdb.pt` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `3.4. Data Consolidation (SaProt with Structure Masking).ipynb` (this notebook) <br>

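Before running the notebook, it may help to confirm that the layout above is in place. The following is a minimal sketch, assuming it is run from the `experiments` folder. The function name `find_missing_prerequisites` is hypothetical and is not part of this repository.

```python
from pathlib import Path


def find_missing_prerequisites(data_root="data"):
    """Return the prerequisite paths that are missing under `data_root`."""
    inphared = Path(data_root) / "inphared"
    missing = []

    rbp_csv = inphared / "consolidated" / "rbp.csv"
    if not rbp_csv.is_file():
        missing.append(str(rbp_csv))

    embeddings = inphared / "structure" / "rbp_saprot_struct_mask_embeddings"
    if not embeddings.is_dir():
        missing.append(str(embeddings))
    elif not any(embeddings.glob("*_relaxed.r3.pdb.pt")):
        # The folder exists but holds no file that follows the naming convention.
        missing.append(str(embeddings / "*_relaxed.r3.pdb.pt"))

    return missing


missing = find_missing_prerequisites()
print("All prerequisites found." if not missing else f"Missing: {missing}")
```
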
<hr>

# 📁 Output files

1. If you would like to skip running this notebook, download `rbp_saprot_struct_mask_relaxed_r3.csv` from this [link](https://drive.google.com/file/d/1eeQphah4GVjxms8vutlt43HuEFmTUTug/view?usp=sharing). This CSV file consolidates the embeddings.

1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_struct_mask_relaxed_r3.csv` inside `data/inphared/structure`.

1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should produce a folder named `consolidated`. <br> Within `consolidated`, this notebook generates only `rbp_embeddings_saprot_struct_mask_relaxed_r3.csv`, which combines the phage-host information with the embeddings.

1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_saprot_struct_mask_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_struct_mask_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_struct_mask_relaxed_r3.csv` <br>
↳ `3.4. Data Consolidation (SaProt with Structure Masking).ipynb` (this notebook) <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import pandas as pd

import ConstantsUtil
import StructureUtil

%load_ext autoreload
%autoreload 2

In [2]:
constants = ConstantsUtil.ConstantsUtil()
util = StructureUtil.StructureUtil()

<hr>

# Part II: Consolidation of the SaProt embeddings

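Consolidate the per-protein SaProt embeddings (one `.pt` file per protein) into a single data frame with one row per protein and one column per embedding dimension (`s1` to `s1280`).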

In [3]:
rbp_saprot_struct_mask_relaxed_r3 = util.convert_saprot_pt_to_df(
    f"{constants.INPHARED}/{constants.STRUCTURE_SAPROT_STRUCT_MASK}", "_relaxed.r3"
)
rbp_saprot_struct_mask_relaxed_r3.head()

100%|████████████████████████████████████████████████████████████████████████████| 28977/28977 [29:36<00:00, 16.32it/s]


| | Protein ID | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA74324.1 | 0.017188 | -0.038714 | 0.076578 | 0.050484 | -0.026651 | -0.126441 | -0.086992 | 0.026037 | -0.002465 | ... | -0.100108 | 0.000492 | -0.029353 | 0.063228 | -0.317542 | 0.087773 | -0.048079 | -0.018086 | 0.007781 | -0.035918 |
| 1 | AAA74331.1 | 0.030815 | -0.010798 | 0.045598 | -0.006667 | -0.02816 | -0.051516 | -0.11303 | 0.038848 | 0.025976 | ... | -0.166662 | -0.059483 | -0.07161 | 0.055553 | -0.342379 | 0.071236 | -0.027485 | -0.021648 | 0.043277 | 0.024916 |
| 2 | AAA98578.2 | 0.021491 | 0.00794 | 0.076282 | 0.022202 | -0.035941 | -0.063123 | -0.105316 | 0.006735 | 0.001159 | ... | -0.117928 | 0.013834 | -0.044441 | 0.092925 | -0.321899 | 0.07249 | -0.042483 | -0.007342 | 0.031304 | -0.006507 |
| 3 | AAB09218.1 | -0.003662 | -0.000101 | 0.073738 | 0.035639 | -0.038402 | -0.051332 | -0.071285 | -0.008385 | -0.005631 | ... | -0.105274 | 0.000808 | -0.036236 | 0.066126 | -0.366389 | 0.085536 | -0.065895 | -0.030888 | 0.011904 | -0.053355 |
| 4 | AAB70057.1 | 0.013627 | -0.01425 | 0.076644 | 0.015864 | -0.019512 | -0.091388 | -0.093953 | 0.003854 | 0.019544 | ... | -0.116157 | 0.029425 | -0.067303 | 0.1086 | -0.302618 | 0.040414 | -0.01423 | -0.044296 | 0.011156 | -0.051728 |

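The implementation of `StructureUtil.convert_saprot_pt_to_df` is not shown in this notebook. As a rough illustration of what such a consolidation step involves, the sketch below reads every `.pt` file whose name contains the given suffix, takes the text before the suffix as the protein ID, and stacks the vectors into a data frame with columns `s1`, `s2`, and so on. It assumes that each file stores a one-dimensional vector of floats. The function name `consolidate_embeddings` and its `load` parameter are hypothetical.

```python
from pathlib import Path

import pandas as pd


def consolidate_embeddings(folder, suffix, load=None):
    """Stack per-protein embedding files into one data frame (illustrative only)."""
    if load is None:
        import torch  # imported lazily so the sketch can be tried without PyTorch

        load = torch.load

    rows = []
    for path in sorted(Path(folder).glob(f"*{suffix}*.pt")):
        # "AAA74324.1_relaxed.r3.pdb.pt" -> "AAA74324.1"
        protein_id = path.name.split(suffix)[0]
        # Assumption: the file holds a 1-D vector (tensor or list) of floats.
        vector = [float(value) for value in load(path)]
        rows.append([protein_id] + vector)

    if not rows:
        return pd.DataFrame(columns=["Protein ID"])

    width = len(rows[0]) - 1
    columns = ["Protein ID"] + [f"s{i}" for i in range(1, width + 1)]
    return pd.DataFrame(rows, columns=columns)
```

Passing a custom `load` callable makes it possible to try the function on small test files without PyTorch installed; with the default, the files are read with `torch.load`.
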

In [4]:
rbp_saprot_struct_mask_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CSV_SAPROT_STRUCT_MASK}_relaxed_r3.csv",
    index=False,
)

Combine the embeddings data frame with the data frame containing the phage-host information.

In [5]:
rbp_nucleotide_seq = pd.read_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/{constants.INPHARED_RBP_DATA}"
)
rbp_nucleotide_seq.head()

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | Genbank Division | Isolation Host (beware inconsistent and nonsense values) | Host Superkingdom | Host Phylum | Host Class | Host Order | Host Family | Year-Month | Protein Sequence | Nucleotide Sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | PHG | Wolbachia sp. wCauB | bacteria | pseudomonadota | alphaproteobacteria | rickettsiales | anaplasmataceae | 2016-07 | MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV... | ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA... |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD... | GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG... |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV... | GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT... |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD... | TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT... |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS... | TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA... |


In [6]:
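# Inner join on "Protein ID": keep only the proteins that have both phage-host
# information and an embedding. validate="one_to_one" raises an error if a
# protein ID appears more than once in either data frame.
rbp_structure_embeddings_relaxed_r3 = pd.merge(
    rbp_nucleotide_seq,
    rbp_saprot_struct_mask_relaxed_r3,
    how="inner",
    validate="one_to_one",
    on="Protein ID",
)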
rbp_structure_embeddings_relaxed_r3.head()

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | -0.10508 | 0.022164 | -0.022043 | 0.047761 | -0.343902 | 0.045124 | -0.048649 | -0.034725 | 0.008094 | -0.025113 |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.076139 | 0.023233 | -0.025228 | 0.042919 | -0.306105 | 0.081539 | -0.082745 | -0.003921 | 0.044274 | -0.028975 |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.101477 | 0.036944 | -0.020289 | 0.026964 | -0.322689 | 0.082593 | -0.108646 | -0.069036 | -0.016981 | -0.035543 |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.111074 | -0.013124 | -0.079112 | 0.032979 | -0.352964 | 0.013509 | -0.013255 | -0.066899 | -0.012351 | -0.054145 |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.121753 | 0.018492 | -0.026304 | 0.124295 | -0.323911 | 0.003967 | -0.000575 | -0.03904 | 0.018505 | -0.033633 |

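To make the behavior of the merge settings concrete, the following toy example uses made-up protein IDs and values (not taken from the dataset). It shows that an inner join keeps only the proteins present in both data frames, and that `validate="one_to_one"` raises a `MergeError` when a protein ID is duplicated.

```python
import pandas as pd

# Toy data frames; the IDs and values are placeholders for illustration only.
phage_host = pd.DataFrame(
    {"Protein ID": ["P1", "P2", "P3"], "Host Family": ["a", "b", "c"]}
)
embeddings = pd.DataFrame({"Protein ID": ["P2", "P3", "P4"], "s1": [0.1, 0.2, 0.3]})

# Inner join: P1 has no embedding and P4 has no phage-host record, so both drop out.
merged = pd.merge(
    phage_host, embeddings, how="inner", validate="one_to_one", on="Protein ID"
)
print(merged["Protein ID"].tolist())  # ['P2', 'P3']

# Duplicating a protein ID violates the one-to-one constraint.
duplicated = pd.concat([embeddings, embeddings.iloc[[0]]], ignore_index=True)
try:
    pd.merge(
        phage_host, duplicated, how="inner", validate="one_to_one", on="Protein ID"
    )
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)  # True
```
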

In [7]:
rbp_structure_embeddings_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_saprot_struct_mask_relaxed_r3.csv",
    index=False,
)