# PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings

<b>Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2, 3</sup> & Anish M.S. Shrestha<sup>1, 2</sup></b>

<sup>1</sup> Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila 1004, Philippines <br>
<sup>3</sup> Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, 3052, Australia

✉️ gonzales.markedward@gmail.com, jennifer.ureta@gmail.com, anish.shrestha@dlsu.edu.ph

<hr>

# 💡 Prerequisites

### Option 1: Download the prerequisite files
1. Download `rbp_prostt5_embeddings.h5` from this [link](https://drive.google.com/file/d/1oNJkzVwTJmy7D38KGOnzn3PsLfDBavmG/view?usp=sharing), and unzip it.
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_prostt5_embeddings.h5` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Strictly speaking, this notebook needs only `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Option 2: Generate the prerequisite files yourself (may take a couple of hours!)
1. Feed the protein sequences to ProstT5 in order to generate the embeddings, following the instructions [here](https://github.com/mheinzinger/ProstT5/tree/main/scripts). 
1. Consolidate the embeddings into a single HDF5 file named `rbp_prostt5_embeddings.h5`. 
1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_prostt5_embeddings.h5` inside `data/inphared/structure`.
1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Strictly speaking, this notebook needs only `consolidated/rbp.csv`.
1. Save the extracted `consolidated` folder inside `data/inphared` (the folder created in the earlier step).
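
If the embeddings were generated in several batches, the consolidation step might look like the sketch below. It assumes that each batch is an HDF5 file storing one top-level dataset per protein, keyed by protein ID, and that IDs are unique across batches. The function name `merge_h5_files` and the batch file names are hypothetical and not part of this repository.

```python
import h5py


def merge_h5_files(input_paths, output_path):
    """Copy every top-level dataset from the input HDF5 files into one file.

    Assumption: each input file holds one dataset per protein (no groups),
    keyed by protein ID, and IDs do not repeat across files.
    """
    n_copied = 0
    with h5py.File(output_path, "w") as out_file:
        for input_path in input_paths:
            with h5py.File(input_path, "r") as in_file:
                for protein_id in in_file.keys():
                    if protein_id in out_file:
                        raise ValueError(f"Duplicate protein ID: {protein_id}")
                    # [()] reads the whole dataset into memory as a NumPy array.
                    out_file.create_dataset(protein_id, data=in_file[protein_id][()])
                    n_copied += 1
    return n_copied
```

A call such as `merge_h5_files(["batch_1.h5", "batch_2.h5"], "rbp_prostt5_embeddings.h5")` would then write the single file that this notebook expects.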
   
### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_prostt5_embeddings.h5` <br>
↳ `3.1. Data Consolidation (ProstT5 - AA Tokens).ipynb` (this notebook) <br>
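
Before running the notebook, it may help to confirm that the layout above is in place. The following is a minimal sketch that uses only the standard library; the paths are copied from the folder structure listed above, while the names `REQUIRED_FILES` and `find_missing` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path

# Paths relative to the `experiments` folder that contains this notebook.
REQUIRED_FILES = [
    Path("data/inphared/consolidated/rbp.csv"),
    Path("data/inphared/structure/rbp_prostt5_embeddings.h5"),
]


def find_missing(base_dir=".", required=REQUIRED_FILES):
    """Return the required files that do not exist under base_dir."""
    base = Path(base_dir)
    return [path for path in required if not (base / path).is_file()]


missing = find_missing()
if missing:
    print("Missing prerequisite files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All prerequisite files are in place.")
```

Run it from the `experiments` folder, or pass that folder as `base_dir`.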

<hr>

# 📁 Output files

1. To skip running this notebook, download `rbp_prostt5_relaxed_r3.csv` from this [link](https://drive.google.com/file/d/1PLrfpkUd37G8jbYInWFoghlw_SGHogSV/view?usp=sharing). This CSV file consolidates the ProstT5 embeddings into a single table.

1. Create a folder named `inphared` inside `data`. <br>
   Create a folder named `structure`, this time inside `data/inphared`, and save `rbp_prostt5_relaxed_r3.csv` inside `data/inphared/structure`.

1. Download `consolidated.tar.gz` from this [link](https://drive.google.com/file/d/1yQSXwlb37dm2ZLXGJHdIM5vmrzwPAwvI/view?usp=sharing), and extract it. This should result in a folder named `consolidated`. <br> Of the files in this folder, this notebook generates only `consolidated/rbp_embeddings_prostt5_relaxed_r3.csv`, which combines the phage-host information with the embeddings.

1. Save the extracted `consolidated` folder inside `data/inphared`. 


### Resulting folder structure

`experiments` (parent folder of this notebook) <br> 
↳ `data` <br>
&nbsp; &nbsp;↳ `inphared` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `consolidated` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_prostt5_relaxed_r3.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `structure` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_prostt5_embeddings.h5` <br>
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_prostt5_relaxed_r3.csv` <br>
↳ `3.1. Data Consolidation (ProstT5 - AA Tokens).ipynb` (this notebook) <br>

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import pandas as pd

import ConstantsUtil
import StructureUtil

%load_ext autoreload
%autoreload 2

In [2]:
constants = ConstantsUtil.ConstantsUtil()
util = StructureUtil.StructureUtil()

<hr>

# Part II: Consolidation of the ProstT5 embeddings

Consolidate the embeddings into a single data frame.

In [3]:
rbp_prostt5_relaxed_r3 = util.convert_prostt5_h5_to_df(
    f"{constants.INPHARED}/{constants.STRUCTURE_PROSTT5}", "_relaxed_r3_pdb"
)
rbp_prostt5_relaxed_r3.head()

100%|█████████████████████████████████████████████████████████████████████████| 144885/144885 [23:53<00:00, 101.07it/s]


Unnamed: 0,Protein ID,s1,s2,s3,s4,s5,s6,s7,s8,s9,...,s1015,s1016,s1017,s1018,s1019,s1020,s1021,s1022,s1023,s1024
0,AAA74324_1_relaxed_r3_pdb,-0.008888,0.010757,0.026901,0.002495,0.006329,-0.001117,0.018707,-0.026306,-0.034668,...,-0.001294,-0.026688,0.008217,0.004044,-0.000456,0.013741,0.00457,0.031921,0.003466,-0.02272
1,AAA74331_1_relaxed_r3_pdb,0.008438,-0.00605,0.025681,0.018616,0.007584,-0.031097,0.02037,-0.010178,-0.00285,...,0.013016,0.058044,-0.033661,-0.013657,0.029449,-0.039368,-0.026581,-0.004993,0.021896,0.000183
2,AAA98578_2_relaxed_r3_pdb,-0.026337,0.014549,0.0336,0.019562,0.055298,0.024307,0.049469,0.012596,0.010185,...,0.04364,0.023865,-0.026382,-0.036011,0.025574,-0.017944,0.064209,0.006153,0.002045,-0.022614
3,AAB09218_1_relaxed_r3_pdb,-0.000671,0.03096,0.063782,-0.00798,0.036011,0.00742,0.08606,-0.041656,0.03241,...,0.027679,0.028122,-0.031738,-0.00737,-0.011604,0.044434,0.035614,0.031067,-0.033691,0.008873
4,AAB70057_1_relaxed_r3_pdb,-0.002905,-0.011444,0.025513,0.030548,0.007416,-0.010193,0.056671,-0.035767,-0.01548,...,-0.002598,0.052521,-0.021255,0.022614,0.012123,-0.007881,0.025803,0.020477,-0.004837,0.003382
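
The implementation of `util.convert_prostt5_h5_to_df` is not shown in this notebook. As a rough illustration of the idea only, the sketch below turns a mapping from protein ID to embedding vector into a data frame with the same column layout as the output above (`Protein ID`, then `s1`, `s2`, ...). The function `embeddings_to_df` is hypothetical, and the sketch ignores the second argument passed to the real method, whose exact role is not documented here. A plain dictionary stands in for the HDF5 file.

```python
import numpy as np
import pandas as pd


def embeddings_to_df(embeddings):
    """Build a data frame with one row per protein from an ID -> vector mapping.

    `embeddings` can be any mapping whose values convert to 1-D arrays of
    equal length (a dict here; an open HDF5 file object behaves similarly).
    """
    ids = list(embeddings.keys())
    matrix = np.vstack([np.asarray(embeddings[key]) for key in ids])
    # Name the embedding dimensions s1, s2, ..., as in the output above.
    columns = [f"s{i}" for i in range(1, matrix.shape[1] + 1)]
    df = pd.DataFrame(matrix, columns=columns)
    df.insert(0, "Protein ID", ids)
    return df


# Toy three-dimensional vectors; the real embeddings have 1024 dimensions.
toy = {
    "PROT1_1_relaxed_r3_pdb": np.array([0.1, 0.2, 0.3]),
    "PROT2_1_relaxed_r3_pdb": np.array([0.4, 0.5, 0.6]),
}
toy_df = embeddings_to_df(toy)
print(toy_df)
```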


In [4]:
rbp_prostt5_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CSV_PROSTT5}_relaxed_r3.csv", index=False
)

Combine the embeddings data frame with the data frame containing the phage-host information.

In [5]:
rbp_nucleotide_seq = pd.read_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/{constants.INPHARED_RBP_DATA}"
)
rbp_nucleotide_seq.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,...,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Year-Month,Protein Sequence,Nucleotide Sequence
0,BAD16801.1,AB161975,Wolbachia phage WOcauB1,Wolbachia phage WOcauB1 Viruses,20484,False,36.878,DNA,2016-07-26,27,...,PHG,Wolbachia sp. wCauB,bacteria,pseudomonadota,alphaproteobacteria,rickettsiales,anaplasmataceae,2016-07,MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV...,ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA...
1,BAF36105.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...
2,BAF36110.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...
3,BAF36131.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...
4,BAF36132.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,PHG,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...


In [6]:
rbp_nucleotide_seq_clean = util.add_prostt5_id_df(rbp_nucleotide_seq)
rbp_nucleotide_seq_clean.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,...,Isolation Host (beware inconsistent and nonsense values),Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Year-Month,Protein Sequence,Nucleotide Sequence,Protein ID (Clean)
0,BAD16801.1,AB161975,Wolbachia phage WOcauB1,Wolbachia phage WOcauB1 Viruses,20484,False,36.878,DNA,2016-07-26,27,...,Wolbachia sp. wCauB,bacteria,pseudomonadota,alphaproteobacteria,rickettsiales,anaplasmataceae,2016-07,MKEAIYQRIKDLAANSTPDQLAYLAKSLELIADKKAISNVVQMTEV...,ATGAAAGAAGCAATATACCAAAGGATAAAGGATTTAGCAGCAAATA...,BAD16801_1
1,BAF36105.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...,BAF36105_1
2,BAF36110.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...,BAF36110_1
3,BAF36131.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...,BAF36131_1
4,BAF36132.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,Unspecified,bacteria,cyanobacteriota,cyanophyceae,chroococcales,microcystaceae,2023-08,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...,BAF36132_1
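
Judging from the output above, the `Protein ID (Clean)` column replaces the period in each protein ID with an underscore (for example, `BAD16801.1` becomes `BAD16801_1`), so that the IDs match the keys used for the embeddings. The sketch below reproduces that transformation on toy data. It is an assumption about what `util.add_prostt5_id_df` does, not its actual implementation, and `add_clean_id` is a hypothetical name.

```python
import pandas as pd


def add_clean_id(df, id_column="Protein ID", clean_column="Protein ID (Clean)"):
    """Return a copy of df with a column in which '.' in the ID becomes '_'."""
    result = df.copy()
    # regex=False treats "." as a literal period rather than a wildcard.
    result[clean_column] = result[id_column].str.replace(".", "_", regex=False)
    return result


toy = pd.DataFrame({"Protein ID": ["BAD16801.1", "BAF36105.1"]})
toy_clean = add_clean_id(toy)
print(toy_clean)
```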


In [7]:
rbp_prostt5_relaxed_r3_clean = pd.read_csv(
    f"{constants.INPHARED}/{constants.CSV_PROSTT5}_relaxed_r3.csv"
)
rbp_prostt5_relaxed_r3_clean = util.sanitize_prostt5_df(
    rbp_prostt5_relaxed_r3_clean, "_relaxed_r3_pdb"
)
rbp_prostt5_relaxed_r3_clean.head()

Unnamed: 0,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,...,s1016,s1017,s1018,s1019,s1020,s1021,s1022,s1023,s1024,Protein ID (Clean)
0,-0.008888,0.010757,0.026901,0.002495,0.006329,-0.001117,0.018707,-0.026306,-0.034668,0.010445,...,-0.026688,0.008217,0.004044,-0.000456,0.013741,0.00457,0.031921,0.003466,-0.02272,AAA74324_1
1,0.008438,-0.00605,0.025681,0.018616,0.007584,-0.031097,0.02037,-0.010178,-0.00285,0.018906,...,0.058044,-0.033661,-0.013657,0.029449,-0.039368,-0.026581,-0.004993,0.021896,0.000183,AAA74331_1
2,-0.026337,0.014549,0.0336,0.019562,0.055298,0.024307,0.049469,0.012596,0.010185,-0.002304,...,0.023865,-0.026382,-0.036011,0.025574,-0.017944,0.064209,0.006153,0.002045,-0.022614,AAA98578_2
3,-0.000671,0.03096,0.063782,-0.00798,0.036011,0.00742,0.08606,-0.041656,0.03241,0.0466,...,0.028122,-0.031738,-0.00737,-0.011604,0.044434,0.035614,0.031067,-0.033691,0.008873,AAB09218_1
4,-0.002905,-0.011444,0.025513,0.030548,0.007416,-0.010193,0.056671,-0.035767,-0.01548,0.026581,...,0.052521,-0.021255,0.022614,0.012123,-0.007881,0.025803,0.020477,-0.004837,0.003382,AAB70057_1
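
Comparing this output with the earlier embeddings table suggests that the sanitization step strips the `_relaxed_r3_pdb` suffix from each ID and stores the result in `Protein ID (Clean)`, dropping the original `Protein ID` column. A minimal sketch of that behavior, assuming nothing else is changed, is given below; `strip_id_suffix` is a hypothetical stand-in for `util.sanitize_prostt5_df`.

```python
import re

import pandas as pd


def strip_id_suffix(df, suffix, id_column="Protein ID", clean_column="Protein ID (Clean)"):
    """Return a copy of df whose ID column is replaced by a suffix-free version."""
    result = df.copy()
    # Anchor the pattern at the end so only a trailing suffix is removed.
    pattern = re.escape(suffix) + "$"
    result[clean_column] = result[id_column].str.replace(pattern, "", regex=True)
    return result.drop(columns=[id_column])


toy = pd.DataFrame(
    {
        "Protein ID": ["AAA74324_1_relaxed_r3_pdb", "AAA98578_2_relaxed_r3_pdb"],
        "s1": [-0.008888, -0.026337],
    }
)
toy_sanitized = strip_id_suffix(toy, "_relaxed_r3_pdb")
print(toy_sanitized)
```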


In [8]:
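# Inner join: keep only the proteins that have both phage-host information and an embedding.
# validate="one_to_one" raises an error if a cleaned protein ID is duplicated in either data frame.
rbp_structure_embeddings_relaxed_r3 = pd.merge(
    rbp_nucleotide_seq_clean,
    rbp_prostt5_relaxed_r3_clean,
    how="inner",
    validate="one_to_one",
    on="Protein ID (Clean)",
)
# The cleaned ID was needed only as the join key.
del rbp_structure_embeddings_relaxed_r3["Protein ID (Clean)"]
rbp_structure_embeddings_relaxed_r3.head()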

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,...,s1015,s1016,s1017,s1018,s1019,s1020,s1021,s1022,s1023,s1024
0,BAD16801.1,AB161975,Wolbachia phage WOcauB1,Wolbachia phage WOcauB1 Viruses,20484,False,36.878,DNA,2016-07-26,27,...,0.007633,-0.026031,-0.017395,-0.059082,-0.006268,-0.030869,0.028748,-0.0495,-0.020996,-0.002974
1,BAF36105.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,0.011124,-0.023529,-0.010574,-0.023682,0.02449,0.00042,0.022034,-0.016678,-0.037628,-0.002535
2,BAF36110.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,-0.041595,0.04187,0.035736,-0.00304,0.004318,0.036255,0.028839,0.037933,-0.019272,0.002094
3,BAF36131.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,-0.022156,0.03363,0.011124,0.019257,0.006161,0.042145,0.021805,0.035278,-0.013054,-0.016052
4,BAF36132.1,AB231700,Microcystis phage LMM01,Microcystis phage LMM01 Fukuivirus LMM01 Fukui...,162109,False,45.953,DNA,2023-08-22,189,...,0.012329,0.001997,0.012733,0.014519,0.004818,0.00798,-0.002657,0.02121,-0.000913,-0.048462
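
Because the merge is an inner join, proteins that appear in only one of the two data frames are dropped silently. One way to see how many rows fall into that category is an outer join with pandas' `indicator=True` option, sketched below on toy data. The toy frames and the variable names `audit` and `counts` are illustrative only and are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for the phage-host table and the embeddings table.
left = pd.DataFrame(
    {"Protein ID (Clean)": ["A_1", "B_1", "C_1"], "Host Family": ["x", "y", "z"]}
)
right = pd.DataFrame(
    {"Protein ID (Clean)": ["B_1", "C_1", "D_1"], "s1": [0.1, 0.2, 0.3]}
)

# indicator=True adds a "_merge" column marking where each row came from:
# "both", "left_only", or "right_only".
audit = pd.merge(
    left,
    right,
    how="outer",
    on="Protein ID (Clean)",
    validate="one_to_one",
    indicator=True,
)
counts = audit["_merge"].astype(str).value_counts().to_dict()
print(counts)
```

Rows marked `both` are the ones an inner join would keep; `left_only` rows lack an embedding and `right_only` rows lack phage-host information.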


In [9]:
rbp_structure_embeddings_relaxed_r3.to_csv(
    f"{constants.INPHARED}/{constants.CONSOLIDATED}/rbp_embeddings_prostt5_relaxed_r3.csv",
    index=False,
)