# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `rbp_prostt5_3di_embeddings.h5` from this [link](https://drive.google.com/file/d/1fz56eDOY3q0Ac585gZerQGFQZLsG2y27/view?usp=sharing), and unzip it.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_prostt5_3di_embeddings.h5` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and unzip it. This should result in a folder named `consolidated`. <br> Technically, this notebook only needs `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Option 2: Generate the prerequisite files yourself (may take a couple of days!)
1. Use ColabFold to predict the protein structures of the sequences in `data/inphared/fasta/complete`, following the instructions [here](https://github.com/YoshitakaMo/localcolabfold). <br>Refer to our paper for the parameters at which we ran ColabFold. <br>For reproducibility, we provide the results of running ColabFold [here](https://drive.google.com/file/d/1ZPRdaHwsFOPksLbOyQerREG0gY0p4-AT/view?usp=sharing).
1. Encode the predicted structures into 3Di tokens using Foldseek, following the instructions [here](https://github.com/steineggerlab/foldseek?tab=readme-ov-file#tutorial-video). <br>For reproducibility, we provide the results of this encoding step [here](https://drive.google.com/file/d/1_GkC6NH2FiHh0AwqA7SARGjrkDHxE8dP/view?usp=sharing).
1. Feed the 3Di tokens to ProstT5 in order to generate the embeddings, following the instructions [here](https://github.com/mheinzinger/ProstT5/tree/main/scripts). 
1. Consolidate the embeddings into a single HDF5 file named `rbp_prostt5_3di_embeddings.h5`. 
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_prostt5_3di_embeddings.h5` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and unzip it. This should result in a folder named `consolidated`. <br> Technically, this notebook only needs `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared` (the `inphared` folder was created in the earlier step).
   
### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_prostt5_3di_embeddings.h5` <br>
↳ `3.6. Data Consolidation (ProstT5 - 3Di Tokens).ipynb` (this notebook) <br>
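
Before running the notebook, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the original notebook: the helper name `missing_prerequisites` is hypothetical, and the two paths are simply copied from the folder tree above. It creates the expected folders if they are absent and reports which prerequisite files still need to be downloaded.

```python
from pathlib import Path

# Paths copied from the folder tree above, relative to the notebook's folder.
REQUIRED_FILES = [
    Path("data/inphared/consolidated/rbp.csv"),
    Path("data/inphared/structure/rbp_prostt5_3di_embeddings.h5"),
]


def missing_prerequisites(root="."):
    """Create the expected folders under `root` and list the files still missing.

    Hypothetical helper for illustration; it never downloads anything.
    """
    root = Path(root)
    for relative in REQUIRED_FILES:
        # Make data/inphared/consolidated and data/inphared/structure if needed.
        (root / relative.parent).mkdir(parents=True, exist_ok=True)
    return [p.as_posix() for p in REQUIRED_FILES if not (root / p).is_file()]
```

Calling `missing_prerequisites()` from the `experiments` folder returns an empty list once both files are saved in the right place.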

<hr>

# 📁 Output files

1. If you would like to skip running this notebook, download `rbp_prostt5_3di_relaxed_r3.csv` from this [link](https://drive.google.com/file/d/1QfUzxwbfK_Lk42SB7aeP7DbTNgJGJy6p/view?usp=sharing). This CSV file consolidates the embeddings.

1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_prostt5_3di_relaxed_r3.csv` inside `data/inphared/structure`.

1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and unzip it. This should result in a folder named `consolidated`. <br> Technically, this notebook generates only `consolidated/rbp_embeddings_prostt5_3di_relaxed_r3.csv`, which consolidates the phage-host information and the embeddings. 

1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_prostt5_3di_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_prostt5_3di_embeddings.h5` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_prostt5_3di_relaxed_r3.csv` <br>
↳ `3.6. Data Consolidation (ProstT5 - 3Di Tokens).ipynb` (this notebook) <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [9]:
import pandas as pd

import ConstantsUtil
import StructureUtil

%load_ext autoreload
%autoreload 2

In [10]:
constants = ConstantsUtil.ConstantsUtil()
util = StructureUtil.StructureUtil()

<hr>

# Part II: Consolidation of the ProstT5 embeddings

Consolidate the embeddings into a single data frame.

In [14]:
rbp_prostt5_relaxed_r3 = util.convert_prostt5_h5_to_df(
    f"{constants.INPHARED}/{constants.STRUCTURE_PROSTT5_3Di}", "_relaxed_r3_pdb"
)
rbp_prostt5_relaxed_r3.head()

100%|████████████████████████████████████████████████████████████████████████████| 28977/28977 [23:12<00:00, 20.81it/s]


Unnamed: 0,Protein ID,s1,s2,s3,s4,s5,s6,s7,s8,s9,...,s1015,s1016,s1017,s1018,s1019,s1020,s1021,s1022,s1023,s1024
0,AAA74324_1_relaxed_r3_pdb,-0.008888,0.010727,0.026886,0.002453,0.006351,-0.001094,0.018707,-0.026306,-0.034698,...,-0.001292,-0.026657,0.008186,0.004055,-0.000457,0.013771,0.004559,0.031921,0.003489,-0.022736
1,AAA74331_1_relaxed_r3_pdb,0.008423,-0.006035,0.02562,0.018661,0.00758,-0.031082,0.020325,-0.010162,-0.002855,...,0.012993,0.058075,-0.03363,-0.013634,0.029434,-0.039337,-0.026611,-0.004993,0.021866,0.000198
2,AAA98578_2_relaxed_r3_pdb,-0.026337,0.014565,0.0336,0.019547,0.055298,0.024292,0.049438,0.012581,0.010193,...,0.04364,0.023865,-0.026382,-0.036011,0.025574,-0.017929,0.064148,0.00613,0.002041,-0.022598
3,AAB09218_1_relaxed_r3_pdb,-0.000686,0.03096,0.063782,-0.007919,0.03598,0.007408,0.08606,-0.041595,0.032379,...,0.02771,0.028091,-0.031738,-0.00737,-0.011612,0.044464,0.035645,0.031067,-0.033722,0.008881
4,AAB70057_1_relaxed_r3_pdb,-0.002926,-0.011444,0.025513,0.030548,0.007385,-0.010201,0.056671,-0.035797,-0.015465,...,-0.002586,0.052551,-0.021255,0.022583,0.012131,-0.007809,0.025818,0.020477,-0.004868,0.003401
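
The implementation of `convert_prostt5_h5_to_df` lives in `StructureUtil` and is not shown in this notebook. As a rough sketch of what such a conversion might look like, the function below assumes that the HDF5 file stores one 1-D dataset of 1024 values per protein, keyed by the protein ID; that layout is an assumption, not something confirmed here. The name `embeddings_to_df` is hypothetical, and the function accepts any mapping from ID to vector, so an open `h5py.File` or a plain dictionary both work.

```python
import numpy as np
import pandas as pd


def embeddings_to_df(store, dim=1024):
    """Turn a mapping of protein ID -> 1-D embedding into a wide data frame.

    `store` can be an open h5py.File (assumed layout: one dataset per protein)
    or any dict-like object. The columns mirror the output above:
    "Protein ID", then s1 ... s<dim>.
    """
    rows = []
    for protein_id in store:
        vector = np.asarray(store[protein_id], dtype=float)
        if vector.ndim != 1 or vector.shape[0] != dim:
            raise ValueError(
                f"{protein_id}: expected a vector of length {dim}, "
                f"got shape {vector.shape}"
            )
        rows.append([protein_id, *vector])
    columns = ["Protein ID"] + [f"s{i}" for i in range(1, dim + 1)]
    return pd.DataFrame(rows, columns=columns)
```

With `h5py` installed, usage would be along the lines of `with h5py.File(path, "r") as f: df = embeddings_to_df(f)`. The length check is there so that a per-residue (2-D) embedding is rejected rather than silently flattened.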


In [15]:
rbp_prostt5_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CSV_PROSTT5_3Di}_relaxed_r3.csv", index=False
)

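Combine the embeddings data frame with the data frame containing the phage-host information. The two data frames format protein IDs differently (for example, `BAD16801.1` versus `AAA74324_1_relaxed_r3_pdb`), so both are first given a common `Protein ID (Clean)` column, which then serves as the join key.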

In [17]:
rbp_nucleotide_seq = pd.read_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/{constants.INPHARED_RBP_DATA}"
)
rbp_nucleotide_seq.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,...,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Year-Month,Protein Sequence,Nucleotide Sequence
0,BAD16801.1,AB161975,Wolbachia phage WOcauB1,Wolbachia phage WOcauB1 Viruses,20484,False,36.878,DNA,2016-07-26,27,...,PHG,Wolbachia sp. wCauB,bacteria,pseudomonadota,alphaproteobacteria,rickettsiales,anaplasmataceae,2016-07,MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV...,ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA...
1,BAF36105.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...
2,BAF36110.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...
3,BAF36131.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...
4,BAF36132.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...


In [18]:
rbp_nucleotide_seq_clean = util.add_prostt5_id_df(rbp_nucleotide_seq)
rbp_nucleotide_seq_clean.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,...,Isolation Host (beware inconsistent and nonsense values),Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Year-Month,Protein Sequence,Nucleotide Sequence,Protein ID (Clean)
0,BAD16801.1,AB161975,Wolbachia phage WOcauB1,Wolbachia phage WOcauB1 Viruses,20484,False,36.878,DNA,2016-07-26,27,...,Wolbachia sp. wCauB,bacteria,pseudomonadota,alphaproteobacteria,rickettsiales,anaplasmataceae,2016-07,MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV...,ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA...,BAD16801_1
1,BAF36105.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...,BAF36105_1
2,BAF36110.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...,BAF36110_1
3,BAF36131.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...,BAF36131_1
4,BAF36132.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...,BAF36132_1


In [19]:
rbp_prostt5_relaxed_r3_clean = pd.read_csv(
    f"{constants.INPHARED}/{constants.CSV_PROSTT5_3Di}_relaxed_r3.csv"
)
rbp_prostt5_relaxed_r3_clean = util.sanitize_prostt5_df(
    rbp_prostt5_relaxed_r3_clean, "_relaxed_r3_pdb"
)
rbp_prostt5_relaxed_r3_clean.head()

Unnamed: 0,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,...,s1016,s1017,s1018,s1019,s1020,s1021,s1022,s1023,s1024,Protein ID (Clean)
0,-0.008888,0.010727,0.026886,0.002453,0.006351,-0.001094,0.018707,-0.026306,-0.034698,0.01046,...,-0.026657,0.008186,0.004055,-0.000457,0.013771,0.004559,0.031921,0.003489,-0.022736,AAA74324_1
1,0.008423,-0.006035,0.02562,0.018661,0.00758,-0.031082,0.020325,-0.010162,-0.002855,0.018906,...,0.058075,-0.03363,-0.013634,0.029434,-0.039337,-0.026611,-0.004993,0.021866,0.000198,AAA74331_1
2,-0.026337,0.014565,0.0336,0.019547,0.055298,0.024292,0.049438,0.012581,0.010193,-0.002279,...,0.023865,-0.026382,-0.036011,0.025574,-0.017929,0.064148,0.00613,0.002041,-0.022598,AAA98578_2
3,-0.000686,0.03096,0.063782,-0.007919,0.03598,0.007408,0.08606,-0.041595,0.032379,0.04657,...,0.028091,-0.031738,-0.00737,-0.011612,0.044464,0.035645,0.031067,-0.033722,0.008881,AAB09218_1
4,-0.002926,-0.011444,0.025513,0.030548,0.007385,-0.010201,0.056671,-0.035797,-0.015465,0.026566,...,0.052551,-0.021255,0.022583,0.012131,-0.007809,0.025818,0.020477,-0.004868,0.003401,AAB70057_1
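
Judging from the outputs above, the cleaning steps map `BAD16801.1` to `BAD16801_1` on the phage-host side and `AAA74324_1_relaxed_r3_pdb` to `AAA74324_1` on the embeddings side. The actual logic is in `add_prostt5_id_df` and `sanitize_prostt5_df`, which are not shown here; the sketch below only reproduces the observed behavior, and both function names are hypothetical.

```python
def clean_protein_id(protein_id):
    """Replace the version dot with an underscore, e.g. BAD16801.1 -> BAD16801_1."""
    return protein_id.replace(".", "_")


def strip_structure_suffix(embedding_id, suffix="_relaxed_r3_pdb"):
    """Drop the structure-file suffix from an embedding ID, if present."""
    if embedding_id.endswith(suffix):
        return embedding_id[: -len(suffix)]
    return embedding_id


print(clean_protein_id("BAD16801.1"))                       # BAD16801_1
print(strip_structure_suffix("AAA74324_1_relaxed_r3_pdb"))  # AAA74324_1
```

Applied column-wise (for instance with `Series.map`), the two functions would produce matching keys on both sides of the join.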


In [20]:
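# Inner join on the cleaned ID: rows present in only one of the two data
# frames are dropped. validate="one_to_one" raises a MergeError if the
# cleaned ID is duplicated on either side.
rbp_structure_embeddings_relaxed_r3 = pd.merge(
    rbp_nucleotide_seq_clean,
    rbp_prostt5_relaxed_r3_clean,
    how="inner",
    validate="one_to_one",
    on="Protein ID (Clean)",
)
# The cleaned ID was only needed as the join key.
del rbp_structure_embeddings_relaxed_r3["Protein ID (Clean)"]
rbp_structure_embeddings_relaxed_r3.head()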

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,...,s1015,s1016,s1017,s1018,s1019,s1020,s1021,s1022,s1023,s1024
0,BAD16801.1,AB161975,Wolbachia phage WOcauB1,Wolbachia phage WOcauB1 Viruses,20484,False,36.878,DNA,2016-07-26,27,...,0.007641,-0.026047,-0.017426,-0.059143,-0.006275,-0.030853,0.028748,-0.0495,-0.021011,-0.003014
1,BAF36105.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,0.011131,-0.02356,-0.010529,-0.023682,0.02449,0.000429,0.022018,-0.016663,-0.037628,-0.002579
2,BAF36110.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,-0.041626,0.041809,0.035767,-0.003069,0.004318,0.036224,0.028839,0.037933,-0.019272,0.002094
3,BAF36131.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,-0.022156,0.03363,0.011124,0.019257,0.006161,0.042145,0.021805,0.035278,-0.013054,-0.016052
4,BAF36132.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,0.012314,0.002008,0.012672,0.014526,0.004807,0.007988,-0.002621,0.02121,-0.000929,-0.048462
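
Because the merge uses `how="inner"`, any protein that lacks an embedding (or any embedding without phage-host metadata) disappears from the result without a warning. One way to see what was dropped is to run an outer merge with pandas' `indicator=True` option and split the result. This is an optional diagnostic sketch, not a step from the original notebook; the function name `merge_with_report` is hypothetical.

```python
import pandas as pd


def merge_with_report(metadata, embeddings, key="Protein ID (Clean)"):
    """Return (matched rows, unmatched keys) for a one-to-one join on `key`.

    The matched frame equals what an inner join would give, with the join key
    removed; the unmatched frame lists keys found on one side only.
    """
    merged = pd.merge(
        metadata,
        embeddings,
        how="outer",
        on=key,
        validate="one_to_one",  # still raises MergeError on duplicate keys
        indicator=True,         # adds a "_merge" column: left_only/right_only/both
    )
    both = merged["_merge"] == "both"
    unmatched = merged.loc[~both, [key, "_merge"]].reset_index(drop=True)
    matched = merged.loc[both].drop(columns=[key, "_merge"]).reset_index(drop=True)
    return matched, unmatched
```

An empty `unmatched` frame indicates that every protein in `rbp.csv` has an embedding and vice versa; otherwise the `_merge` column shows which side each missing ID came from.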


In [21]:
rbp_structure_embeddings_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_prostt5_3di_relaxed_r3.csv",
    index=False,
)