# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `rbp_saprot_seq_mask_embeddings.tar.gz` from this [link](https://drive.google.com/file/d/1__Rok7MoEbTJ3P8iO3Z-bA7_pFUX4CoO/view?usp=sharing), and unzip it. This should result in a folder named `rbp_saprot_seq_mask_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save the extracted `rbp_saprot_seq_mask_embeddings` folder inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and unzip it. This should result in a folder named `consolidated`. <br> Technically, this notebook only needs `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 
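
The extraction and placement steps above can also be scripted. The following is a minimal sketch, assuming both archives were downloaded manually into the current working directory and that each archive contains a single top-level folder as described above. The helper name `extract_archive` is hypothetical and is not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, destination):
    """Extract a .tar.gz archive into `destination`, creating the folder if needed."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return destination


# Assumed download locations (hypothetical); adjust to wherever the archives were saved.
DOWNLOADS = {
    "rbp_saprot_seq_mask_embeddings.tar.gz": "data/inphared/structure",
    "consolidated.tar.gz": "data/inphared",
}

for archive_name, target_folder in DOWNLOADS.items():
    if Path(archive_name).exists():
        extract_archive(archive_name, target_folder)
    else:
        print(f"Skipping {archive_name}: file not found in the current directory")
```

Run this from the `experiments` folder so that the relative `data/...` paths match the folder structure this notebook expects.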

### Option 2: Generate the prerequisite files yourself (may take a couple of weeks!)
1. Use ColabFold to predict the protein structures of the sequences in `data/inphared/fasta/complete`, following the instructions [here](https://github.com/YoshitakaMo/localcolabfold). <br>Refer to our paper for the parameters we used to run ColabFold. <br>For reproducibility, we provide the results of running ColabFold [here](https://drive.google.com/file/d/1ZPRdaHwsFOPksLbOyQerREG0gY0p4-AT/view?usp=sharing).
1. Encode the predicted structures using SaProt's structure-aware alphabet, following the instructions [here](https://github.com/westlake-repl/SaProt?tab=readme-ov-file#convert-protein-structure-into-structure-aware-sequence). <br>**Make sure to change all the residue (uppercase) tokens to `#`.** <br>For reproducibility, we provide the results of this encoding step [here](https://drive.google.com/file/d/1ziE8krlisUQ_M-jVNYaRSVjyax0Eeai2/view?usp=sharing).
1. Feed the results of the encoding step to SaProt in order to generate the structure-aware embeddings, following the instructions [here](https://github.com/westlake-repl/SaProt/issues/14). 
1. Save each embedding following this naming convention: `<protein_id>_relaxed.r3.pdb.pt`, and consolidate all the embeddings inside a folder named `rbp_saprot_seq_mask_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_seq_mask_embeddings` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and unzip it. This should result in a folder named `consolidated`. <br> Technically, this notebook only needs `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 
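
The masking requirement in the encoding step (changing every uppercase residue token to `#`) can be expressed in one line. This is a minimal sketch, assuming the structure-aware sequence is a plain string in which uppercase letters are residue tokens and lowercase letters are structure tokens. The function name `mask_residue_tokens` and the example string are hypothetical.

```python
import re


def mask_residue_tokens(structure_aware_seq):
    """Replace every uppercase (residue) token with '#', leaving other characters untouched."""
    return re.sub(r"[A-Z]", "#", structure_aware_seq)


# Made-up example string, for illustration only.
example = "MdEvVrQp"
print(mask_residue_tokens(example))  # prints "#d#v#r#p"
```

Because only uppercase letters are replaced, running the function on an already masked sequence leaves it unchanged.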

### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_seq_mask_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `AAA74324.1_relaxed.r3.pdb.pt` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `3.5. Data Consolidation (SaProt with Sequence Masking).ipynb` (this notebook) <br>
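
Before running the notebook, it may help to confirm that this layout is in place. The sketch below assumes the script is run from the `experiments` folder. The helper names `find_missing_prerequisites` and `protein_id_from_filename` are hypothetical and are not part of `StructureUtil`.

```python
from pathlib import Path

# Suffix taken from the naming convention <protein_id>_relaxed.r3.pdb.pt
EMBEDDING_SUFFIX = "_relaxed.r3.pdb.pt"


def find_missing_prerequisites(data_root="data"):
    """Return the prerequisite paths that do not exist under `data_root`."""
    inphared = Path(data_root) / "inphared"
    required = [
        inphared / "consolidated" / "rbp.csv",
        inphared / "structure" / "rbp_saprot_seq_mask_embeddings",
    ]
    return [str(path) for path in required if not path.exists()]


def protein_id_from_filename(filename):
    """Recover the protein ID from an embedding filename."""
    name = Path(filename).name
    if not name.endswith(EMBEDDING_SUFFIX):
        raise ValueError(f"Unexpected embedding filename: {name}")
    return name[: -len(EMBEDDING_SUFFIX)]


print(find_missing_prerequisites())  # an empty list means both prerequisites were found
print(protein_id_from_filename("AAA74324.1_relaxed.r3.pdb.pt"))  # prints "AAA74324.1"
```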

<hr>

# 📁 Output files

1. If you would like to skip running this notebook, download `rbp_saprot_seq_mask_relaxed_r3.csv` from this [link](https://drive.google.com/file/d/1TTNlUVcaNaWHXMq4n962JTvFEfvGsbVj/view?usp=sharing). This CSV file consolidates the embeddings.

1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_seq_mask_relaxed_r3.csv` inside `data/inphared/structure`.

1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook generates only `consolidated/rbp_embeddings_saprot_seq_mask_relaxed_r3.csv`, which consolidates the phage-host information and the embeddings.

1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_saprot_seq_mask_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_seq_mask_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_seq_mask_relaxed_r3.csv` <br>
↳ `3.5. Data Consolidation (SaProt with Sequence Masking).ipynb` (this notebook) <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import pandas as pd

import ConstantsUtil
import StructureUtil

%load_ext autoreload
%autoreload 2

In [2]:
constants = ConstantsUtil.ConstantsUtil()
util = StructureUtil.StructureUtil()

<hr>

# Part II: Consolidation of the SaProt embeddings

Consolidate the embeddings into a single data frame.

In [3]:
rbp_saprot_seq_mask_relaxed_r3 = util.convert_saprot_pt_to_df(
    f"{constants.INPHARED}/{constants.STRUCTURE_SAPROT_SEQ_MASK}", "_relaxed.r3"
)
rbp_saprot_seq_mask_relaxed_r3.head()

100%|████████████████████████████████████████████████████████████████████████████| 28977/28977 [27:28<00:00, 17.57it/s]


| | Protein ID | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AAA74324.1 | 0.005713 | -0.002139 | 0.013508 | -0.021627 | 0.040859 | -0.055532 | -0.006189 | -0.006692 | 0.004894 | ... | -0.0466 | -0.008128 | -0.055793 | 0.085335 | 0.048325 | -0.007057 | 0.015243 | 0.036059 | 0.01861 | -0.05262 |
| 1 | AAA74331.1 | 0.018215 | -0.00164 | 0.017582 | -0.055588 | 0.013366 | -0.038805 | 0.002043 | 0.054023 | 0.013386 | ... | -0.022242 | -0.040679 | -0.024889 | 0.06915 | 0.055251 | 0.043178 | 0.005399 | 0.049476 | 0.030549 | 0.00381 |
| 2 | AAA98578.2 | 0.016032 | 0.040192 | -0.002418 | -0.020064 | 0.035645 | -0.026979 | -0.003795 | -0.004521 | 0.015021 | ... | -0.054726 | -0.017037 | -0.03928 | 0.072694 | 0.050315 | 0.004953 | -0.011 | 0.052622 | 0.029711 | -0.011149 |
| 3 | AAB09218.1 | 0.005787 | 0.029028 | 0.014474 | 0.008848 | 0.0041 | -0.016634 | -0.019308 | 0.012828 | 0.009144 | ... | -0.029474 | 0.008304 | -0.041756 | 0.076812 | -0.010433 | 0.021119 | -0.004414 | 0.039085 | 0.001066 | -0.032709 |
| 4 | AAB70057.1 | -0.002437 | 0.007826 | 0.029049 | -0.011243 | -0.0103 | -0.054436 | -0.011496 | 0.044409 | 0.031361 | ... | -0.036546 | -0.002583 | -0.039133 | 0.082181 | 0.03202 | -0.001363 | 0.025395 | 0.036071 | -0.005219 | -0.026583 |
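
The internals of `util.convert_saprot_pt_to_df` are not shown in this notebook. As a rough illustration of the kind of consolidation involved, the sketch below assumes that each protein's embedding has already been loaded into memory as a flat list of floats. The function name `embeddings_to_df` and the toy 4-dimensional vectors are hypothetical; only the column names (`Protein ID`, `s1`, `s2`, ...) mirror the output above.

```python
import pandas as pd


def embeddings_to_df(embeddings):
    """Build one row per protein from a {protein_id: vector} mapping."""
    rows = []
    for protein_id, vector in sorted(embeddings.items()):
        row = {"Protein ID": protein_id}
        # Name the embedding dimensions s1, s2, ... to match the notebook's output.
        row.update({f"s{i}": value for i, value in enumerate(vector, start=1)})
        rows.append(row)
    return pd.DataFrame(rows)


# Toy values for illustration only; the real embeddings have 1280 dimensions.
toy = {
    "AAA74331.1": [0.1, 0.2, 0.3, 0.4],
    "AAA74324.1": [0.5, 0.6, 0.7, 0.8],
}
toy_df = embeddings_to_df(toy)
print(toy_df.columns.tolist())  # ['Protein ID', 's1', 's2', 's3', 's4']
```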


In [4]:
rbp_saprot_seq_mask_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CSV_SAPROT_SEQ_MASK}_relaxed_r3.csv",
    index=False,
)

Combine the embeddings data frame with the data frame containing the phage-host information.

In [5]:
rbp_nucleotide_seq = pd.read_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/{constants.INPHARED_RBP_DATA}"
)
rbp_nucleotide_seq.head()

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | Genbank Division | Isolation Host (beware inconsistent and nonsense values) | Host Superkingdom | Host Phylum | Host Class | Host Order | Host Family | Year-Month | Protein Sequence | Nucleotide Sequence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | PHG | Wolbachia sp. wCauB | bacteria | pseudomonadota | alphaproteobacteria | rickettsiales | anaplasmataceae | 2016-07 | MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV... | ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA... |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD... | GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG... |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV... | GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT... |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD... | TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT... |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS... | TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA... |


In [6]:
rbp_structure_embeddings_relaxed_r3 = pd.merge(
    rbp_nucleotide_seq,
    rbp_saprot_seq_mask_relaxed_r3,
    how="inner",
    validate="one_to_one",
    on="Protein ID",
)
rbp_structure_embeddings_relaxed_r3.head()

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | -0.051533 | -0.022794 | -0.064905 | 0.076057 | 0.044809 | 0.028928 | -0.003874 | 0.027649 | 0.063658 | -0.040593 |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.033253 | -0.027492 | -0.053185 | 0.081067 | 0.03714 | 0.031542 | -0.060938 | 0.044506 | 0.06062 | 0.005437 |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.044021 | -0.02281 | -0.114099 | 0.141799 | 0.018311 | 0.044526 | -0.018967 | 0.01437 | 0.043462 | -0.09003 |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.02268 | 0.022862 | -0.063694 | 0.089508 | 0.019966 | 0.006555 | -0.004751 | 0.026278 | -0.00774 | -0.044939 |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.034211 | 0.007903 | -0.047379 | 0.079803 | 0.025154 | -0.020763 | 0.035391 | 0.03905 | 0.029944 | -0.051824 |
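
To illustrate what the `how="inner"` and `validate="one_to_one"` arguments do in the merge above, here is a small sketch on toy data frames. The protein IDs `P1` to `P3` and all values are made up for illustration. The inner join keeps only proteins present in both frames, and `validate="one_to_one"` makes pandas raise a `MergeError` if a `Protein ID` appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for the phage-host table and the embeddings table.
hosts = pd.DataFrame(
    {"Protein ID": ["P1", "P2", "P3"], "Host Family": ["a", "b", "c"]}
)
embeddings = pd.DataFrame({"Protein ID": ["P1", "P2"], "s1": [0.1, 0.2]})

merged = pd.merge(
    hosts, embeddings, how="inner", validate="one_to_one", on="Protein ID"
)
print(len(merged))  # 2

# Proteins without an embedding are silently dropped by the inner join.
dropped = sorted(set(hosts["Protein ID"]) - set(merged["Protein ID"]))
print(dropped)  # ['P3']

# A duplicated key violates the one-to-one assumption and is rejected.
duplicated = pd.concat([embeddings, embeddings.iloc[[0]]], ignore_index=True)
try:
    pd.merge(hosts, duplicated, how="inner", validate="one_to_one", on="Protein ID")
except pd.errors.MergeError as error:
    print(f"Merge rejected: {error}")
```

Comparing the row counts of the inputs and the merged frame, as done with `dropped` here, is a quick way to see how many proteins lack an embedding.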


In [7]:
rbp_structure_embeddings_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_saprot_seq_mask_relaxed_r3.csv",
    index=False,
)
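
Code that consumes the consolidated CSV will typically need to separate the embedding columns from the phage-host metadata. This is a minimal sketch, assuming the embedding columns are exactly those named `s` followed by a number, as in the output above. The helper name `split_embedding_columns` and the toy data frame are hypothetical.

```python
import re

import pandas as pd


def split_embedding_columns(df):
    """Split a data frame into (metadata, embeddings) based on column names like s1, s2, ..."""
    embedding_cols = [col for col in df.columns if re.fullmatch(r"s\d+", str(col))]
    metadata_cols = [col for col in df.columns if col not in embedding_cols]
    return df[metadata_cols], df[embedding_cols]


# Toy frame for illustration; the real file has 1280 embedding columns.
toy = pd.DataFrame(
    {"Protein ID": ["P1"], "Host Family": ["x"], "s1": [0.1], "s2": [0.2]}
)
metadata, features = split_embedding_columns(toy)
print(features.shape)  # (1, 2)
```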