# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `rbp_saprot_embeddings.tar.gz` from this [link](https://drive.google.com/file/d/1l1r41Ze56tXQv_U_KShjECpdaoHffJ8d/view?usp=sharing), and extract it. This should result in a folder named `rbp_saprot_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save the extracted `rbp_saprot_embeddings` folder inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook needs only `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 
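
The extraction and placement steps above can also be scripted. The following is a minimal sketch, assuming both archives have already been downloaded into one folder and that each archive contains a single top-level folder with the name given above. The helper names `extract_archive` and `place_prerequisites` are hypothetical and are not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive: Path, destination: Path) -> None:
    """Extract a .tar.gz archive into `destination`, creating it if needed."""
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)


def place_prerequisites(download_dir: Path, data_dir: Path) -> None:
    """Arrange the two downloaded archives as described in Option 1.

    Assumes each archive unpacks to one top-level folder
    (`rbp_saprot_embeddings` and `consolidated`, respectively).
    """
    inphared = data_dir / "inphared"
    # rbp_saprot_embeddings belongs in data/inphared/structure
    extract_archive(
        download_dir / "rbp_saprot_embeddings.tar.gz", inphared / "structure"
    )
    # consolidated belongs directly in data/inphared
    extract_archive(download_dir / "consolidated.tar.gz", inphared)
```

For example, `place_prerequisites(Path.home() / "Downloads", Path("data"))` would populate `data/inphared` when run from the `experiments` folder.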

### Option 2: Generate the prerequisite files yourself (may take a couple of weeks!)
1. Use ColabFold to predict the protein structures of the sequences in `data/inphared/fasta/complete`, following the instructions [here](https://github.com/YoshitakaMo/localcolabfold). <br>Refer to our paper for the parameters we used to run ColabFold. <br>For reproducibility, we provide the results of running ColabFold [here](https://drive.google.com/file/d/1ZPRdaHwsFOPksLbOyQerREG0gY0p4-AT/view?usp=sharing).
1. Encode the predicted structures using SaProt's structure-aware alphabet, following the instructions [here](https://github.com/westlake-repl/SaProt?tab=readme-ov-file#convert-protein-structure-into-structure-aware-sequence). <br>For reproducibility, we provide the results of this encoding step [here](https://drive.google.com/file/d/1KgtuT7jY8ZsNlcUglaTjY2ZKyQ2Uoid2/view?usp=sharing).
1. Feed the results of the encoding step to SaProt in order to generate the structure-aware embeddings, following the instructions [here](https://github.com/westlake-repl/SaProt/issues/14). 
1. Save each embedding following this naming convention: `<protein_id>_relaxed.r3.pdb.pt`, and consolidate all the embeddings inside a folder named `rbp_saprot_embeddings`.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_embeddings` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook needs only `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 
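
The naming convention in the steps above is easy to get slightly wrong when saving thousands of files, so it may help to build and parse file names in one place. The sketch below is illustrative only; `embedding_filename` and `protein_id_from_filename` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path

# Suffix taken from the naming convention `<protein_id>_relaxed.r3.pdb.pt`
SUFFIX = "_relaxed.r3.pdb.pt"


def embedding_filename(protein_id: str) -> str:
    """Return the file name under which a protein's embedding should be saved."""
    return f"{protein_id}{SUFFIX}"


def protein_id_from_filename(filename: str) -> str:
    """Recover the protein ID from an embedding file name or path."""
    name = Path(filename).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"{name!r} does not follow the naming convention")
    return name[: -len(SUFFIX)]
```

With these helpers, `embedding_filename("AAA74324.1")` gives `AAA74324.1_relaxed.r3.pdb.pt`, and `protein_id_from_filename` reverses the mapping.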

### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `AAA74324.1_relaxed.r3.pdb.pt` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `3.0. Data Consolidation (SaProt).ipynb` (this notebook) <br>
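
Before running the notebook, the layout above can be verified programmatically. This is a minimal sketch under the assumption that the check is run with the `experiments` folder as its argument; `missing_prerequisites` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def missing_prerequisites(experiments_dir: Path) -> list[str]:
    """Return the prerequisite paths (relative to `experiments`) that are absent."""
    inphared = experiments_dir / "data" / "inphared"
    missing = []

    # The phage-host information read later in the notebook
    if not (inphared / "consolidated" / "rbp.csv").is_file():
        missing.append("data/inphared/consolidated/rbp.csv")

    # At least one embedding file following the documented naming convention
    embeddings = inphared / "structure" / "rbp_saprot_embeddings"
    if not embeddings.is_dir() or not any(embeddings.glob("*_relaxed.r3.pdb.pt")):
        missing.append(
            "data/inphared/structure/rbp_saprot_embeddings/*_relaxed.r3.pdb.pt"
        )
    return missing
```

An empty list from `missing_prerequisites(Path("."))` suggests that the folder structure matches the one shown above.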

<hr>

# 📁 Output files

To skip running this notebook, download its output files instead:

1. Download `rbp_saprot_relaxed_r3.csv` from this [link](https://drive.google.com/file/d/1rY65V6wKvfVzC0AENyERMHJIY0b432r6/view?usp=sharing). This CSV file consolidates the embeddings.

1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_saprot_relaxed_r3.csv` inside `data/inphared/structure`.

1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook generates only `consolidated/rbp_embeddings_saprot_relaxed_r3.csv`, which consolidates the phage-host information and the embeddings. 

1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_saprot_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_saprot_relaxed_r3.csv` <br>
↳ `3.0. Data Consolidation (SaProt).ipynb` (this notebook) <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import pandas as pd

# Project helper modules
import ConstantsUtil
import StructureUtil

# Automatically reload imported modules before each cell is executed
%load_ext autoreload
%autoreload 2

In [2]:
constants = ConstantsUtil.ConstantsUtil()
util = StructureUtil.StructureUtil()

<hr>

# Part II: Consolidation of the SaProt embeddings

Consolidate the embeddings into a single data frame.

In [3]:
rbp_saprot_relaxed_r3 = util.convert_saprot_pt_to_df(
    f"{constants.INPHARED}/{constants.STRUCTURE_SAPROT}", "_relaxed.r3"
)
rbp_saprot_relaxed_r3.head()

100%|████████████████████████████████████████████████████████████████████████| 144885/144885 [2:39:28<00:00, 15.14it/s]


| | Protein ID | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA74324.1 | 0.008288 | -0.009606 | 0.020807 | 0.021564 | -0.009143 | -0.087742 | -0.060801 | 0.028307 | 0.013417 | ... | -0.012403 | -0.019525 | -0.032019 | 0.070852 | -0.286942 | 0.040566 | -0.010063 | 0.000259 | -0.038517 | -0.037485 |
| 1 | AAA74331.1 | 0.003823 | 0.02592 | -0.061967 | -0.021629 | -0.041467 | -0.061801 | -0.056501 | 0.047991 | 0.017251 | ... | -0.042297 | -0.053365 | -0.020807 | 0.054125 | -0.309149 | 0.036426 | -0.012347 | -0.004565 | -0.005417 | 0.015338 |
| 2 | AAA98578.2 | 0.009858 | 0.026908 | -0.004928 | 0.037355 | -0.044466 | -0.078154 | -0.075969 | 0.022719 | 0.022212 | ... | -0.024859 | -0.01684 | -0.026218 | 0.087479 | -0.286686 | 0.006222 | -0.022076 | 0.015804 | -0.014977 | 0.012638 |
| 3 | AAB09218.1 | -0.022079 | 0.056763 | -0.013127 | 0.037152 | -0.051844 | -0.022957 | -0.056361 | 0.016737 | 0.010021 | ... | -0.00144 | 0.021459 | -0.018839 | 0.055356 | -0.355242 | 0.056944 | -0.037552 | -0.022838 | -0.026101 | -0.055208 |
| 4 | AAB70057.1 | -0.026343 | 0.032874 | 0.002995 | 0.01929 | -0.050711 | -0.064622 | -0.079275 | 0.040141 | 0.042575 | ... | -0.001203 | 0.012924 | -0.025169 | 0.077736 | -0.305134 | 0.026284 | -0.020112 | -0.018148 | -0.035134 | -0.061033 |
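
To make the shape of this output concrete, here is a rough sketch of how per-protein embedding files could be stacked into a data frame with the columns `Protein ID`, `s1`, `s2`, and so on. It is not the implementation of `StructureUtil.convert_saprot_pt_to_df`; the function name `embeddings_to_df`, the `load` parameter, and the assumption that each file holds a single fixed-length vector are all hypothetical. With PyTorch installed, `load` might be something like `lambda path: torch.load(path).tolist()`, again assuming a 1-D tensor per file.

```python
from pathlib import Path
from typing import Callable, Sequence

import pandas as pd


def embeddings_to_df(
    folder: Path,
    suffix: str,
    load: Callable[[Path], Sequence[float]],
) -> pd.DataFrame:
    """Stack one embedding vector per file into a single data frame.

    Assumes files are named `<protein_id><suffix>.pdb.pt` and that `load`
    returns one fixed-length vector for each file.
    """
    file_suffix = f"{suffix}.pdb.pt"
    rows = []
    for path in sorted(folder.glob(f"*{file_suffix}")):
        # Strip the suffix to recover the protein ID
        protein_id = path.name[: -len(file_suffix)]
        vector = [float(value) for value in load(path)]
        rows.append([protein_id, *vector])

    if not rows:
        return pd.DataFrame(columns=["Protein ID"])

    # Columns s1..sN, where N is the embedding dimension
    columns = ["Protein ID"] + [f"s{i}" for i in range(1, len(rows[0]))]
    return pd.DataFrame(rows, columns=columns)
```

Passing the loader as a parameter keeps the sketch independent of any particular serialization library, so it can be tried on small hand-made files first.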


In [4]:
rbp_saprot_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CSV_SAPROT}_relaxed_r3.csv", index=False
)

Combine the embeddings data frame with the data frame containing the phage-host information.

In [5]:
rbp_nucleotide_seq = pd.read_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/{constants.INPHARED_RBP_DATA}"
)
rbp_nucleotide_seq.head()

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | Genbank Division | Isolation Host (beware inconsistent and nonsense values) | Host Superkingdom | Host Phylum | Host Class | Host Order | Host Family | Year-Month | Protein Sequence | Nucleotide Sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | PHG | Wolbachia sp. wCauB | bacteria | pseudomonadota | alphaproteobacteria | rickettsiales | anaplasmataceae | 2016-07 | MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV... | ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA... |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD... | GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG... |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV... | GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT... |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD... | TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT... |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | PHG | Unspecified | bacteria | cyanobacteriota | cyanophyceae | chroococcales | microcystaceae | 2023-08 | MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS... | TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA... |


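In [6]:
# The inner join keeps only the proteins present in both data frames;
# validate="one_to_one" raises an error if a Protein ID is duplicated on either side
rbp_structure_embeddings_relaxed_r3 = pd.merge(
    rbp_nucleotide_seq,
    rbp_saprot_relaxed_r3,
    how="inner",
    validate="one_to_one",
    on="Protein ID",
)
rbp_structure_embeddings_relaxed_r3.head()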

| | Protein ID | Accession | Description | Classification | Genome Length (bp) | Jumbophage | molGC (%) | Molecule | Modification Date | Number CDS | ... | s1271 | s1272 | s1273 | s1274 | s1275 | s1276 | s1277 | s1278 | s1279 | s1280 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAD16801.1 | AB161975 | Wolbachia phage WOcauB1 | Wolbachia phage WOcauB1 Viruses | 20484 | False | 36.878 | DNA | 2016-07-26 | 27 | ... | -0.020959 | 0.001529 | -0.035668 | 0.039971 | -0.319399 | 0.056207 | -0.012979 | -0.005412 | -0.00658 | -0.067688 |
| 1 | BAF36105.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | 0.053417 | -0.005265 | 0.011391 | 0.027982 | -0.298075 | 0.044058 | -0.048746 | 0.003599 | -0.004581 | -0.030106 |
| 2 | BAF36110.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.013609 | 0.008764 | -0.03898 | 0.047949 | -0.333071 | 0.033905 | -0.084859 | -0.047825 | -0.048753 | -0.085169 |
| 3 | BAF36131.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.020721 | 0.00364 | -0.071769 | 0.045996 | -0.343412 | -0.011713 | -0.043975 | -0.032833 | -0.054919 | -0.050537 |
| 4 | BAF36132.1 | AB231700 | Microcystis phage LMM01 | Microcystis phage LMM01 Fukuivirus LMM01 Fukui... | 162109 | False | 45.953 | DNA | 2023-08-22 | 189 | ... | -0.023929 | 0.00559 | -0.041527 | 0.118564 | -0.318104 | -0.038388 | 0.012538 | -0.000927 | 0.001577 | -0.071272 |
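
The merge above silently drops any protein that appears in only one of the two data frames, and it fails outright if a `Protein ID` is duplicated. The toy example below illustrates both behaviors and one way to list the unmatched IDs with pandas' `indicator` option. The data frames `host_info` and `embeddings` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the phage-host table and the embeddings table
host_info = pd.DataFrame(
    {"Protein ID": ["P1", "P2", "P3"], "Host Family": ["a", "b", "c"]}
)
embeddings = pd.DataFrame({"Protein ID": ["P2", "P3", "P4"], "s1": [0.1, 0.2, 0.3]})

# Same options as in the notebook: only IDs present on both sides survive
merged = pd.merge(
    host_info, embeddings, how="inner", validate="one_to_one", on="Protein ID"
)

# An outer merge with indicator=True shows which IDs were dropped and why
audit = pd.merge(host_info, embeddings, how="outer", on="Protein ID", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["Protein ID", "_merge"]]

# A duplicated Protein ID violates the one-to-one check and raises MergeError
duplicated = pd.concat([embeddings, embeddings.iloc[[0]]], ignore_index=True)
duplicate_rejected = False
try:
    pd.merge(
        host_info, duplicated, how="inner", validate="one_to_one", on="Protein ID"
    )
except pd.errors.MergeError:
    duplicate_rejected = True
```

Here `merged` contains only `P2` and `P3`, while `unmatched` lists `P1` (left only) and `P4` (right only). Running a similar audit on the real data frames is one way to see how many proteins lack an embedding.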


In [7]:
rbp_structure_embeddings_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_saprot_relaxed_r3.csv",
    index=False,
)
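
When the consolidated CSV is loaded again later, the embedding columns can be separated from the phage-host columns by name. The sketch below assumes that the embedding columns are exactly those named `s` followed by a number, as in the output shown earlier, and that no phage-host column follows the same pattern. `split_features` is a hypothetical helper, not part of this repository.

```python
import re

import pandas as pd

# Embedding columns are assumed to be named s1, s2, ..., as in the output above
EMBEDDING_COLUMN = re.compile(r"^s\d+$")


def split_features(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a consolidated data frame into (phage-host columns, embedding columns)."""
    embedding_columns = [
        column for column in df.columns if EMBEDDING_COLUMN.match(str(column))
    ]
    return df.drop(columns=embedding_columns), df[embedding_columns]
```

For the file written above, the second data frame would be expected to have 1280 columns, one per embedding dimension.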