# Protein embeddings improve phage-host interaction prediction

**Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2</sup> & Anish M.S. Shrestha<sup>1, 2</sup>**

<sup>1</sup> Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines 

{mark_gonzales, jennifer.ureta, anish.shrestha}@dlsu.edu.ph

<hr>

## 💡 FASTA & Embeddings

This notebook assumes that you already have the FASTA files (from running [`1. Sequence Preprocessing.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/1.%20Sequence%20Preprocessing.ipynb)) and the RBP embeddings (from running [`4. Protein Embedding Generation.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/4.%20Protein%20Embedding%20Generation.ipynb)).

Alternatively, you may:
- Download the FASTA files from [Google Drive](https://drive.google.com/drive/folders/16ZBXZCpC0OmldtPPIy5sEBtS4EVohorT?usp=sharing) and save the downloaded `fasta` folder inside the `inphared` directory located in the same folder as this notebook. 
- Download the protein embeddings from these Google Drive directories: [Part 1](https://drive.google.com/drive/folders/1deenrDQIr3xcl9QCYH-nPhmpY8x2drQw?usp=sharing) and [Part 2](https://drive.google.com/drive/folders/1jnBFNsC6zJISkc6IAz56257MSXKjY0Ez?usp=sharing). Consolidate the downloaded folders inside a single `embeddings` directory and save it inside the `inphared` directory located in the same folder as this notebook. 

The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp; ↳ `embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
&nbsp; &nbsp; ↳ `fasta` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `hypothetical` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `nucleotide` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp` <br>
↳ `5. Data Consolidation.ipynb` (this notebook) <br>
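
Before running the rest of the notebook, it may help to confirm that the layout above is in place. The following is a minimal sketch, assuming the notebook's folder is the current working directory; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names introduced here for illustration and are not part of the repository.

```python
from pathlib import Path

# Directory names copied from the folder structure above.
# `missing_dirs` is a hypothetical helper, not part of the repository.
EXPECTED_DIRS = [
    "inphared/embeddings",
    "inphared/fasta/hypothetical",
    "inphared/fasta/nucleotide",
    "inphared/fasta/rbp",
]


def missing_dirs(root="."):
    """Return the expected subdirectories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# An empty list means every expected directory was found.
print(missing_dirs())
```

The check only looks at directory names; it does not verify that the FASTA files or embeddings inside them are complete.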

<hr>

## 📁 Output Files
If you would like to skip running this notebook, you may download the output phage-host-features CSV files from [Google Drive](https://drive.google.com/drive/folders/1xNoA6dxkN4jzVNCg_7YNjdPZzl51Jo9M?usp=sharing). Save the downloaded `data` folder inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp;↳ `data` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp_embeddings_esm.csv` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
&nbsp; &nbsp;↳ `embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
&nbsp; &nbsp;↳ `fasta` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `hypothetical` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `nucleotide` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp` <br>
↳ `5. Data Consolidation.ipynb` (this notebook) <br>

<hr>

## Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import math
import os

from collections import defaultdict

import pandas as pd

from ConstantsUtil import ConstantsUtil
from ClassificationUtil import ClassificationUtil

%load_ext autoreload
%autoreload 2

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 50)

pd.options.mode.chained_assignment = None

In [3]:
constants = ConstantsUtil()
util = ClassificationUtil(complete_embeddings_dir = constants.COMPLETE_EMBEDDINGS)

<hr>

## Part II: Data Consolidation

Load the phage-host dataset.

In [4]:
inphared = pd.read_csv(f'{constants.TEMP_PREPROCESSING}/{constants.INPHARED_WITH_HOSTS}')
orig_shape = inphared.shape

Keep only the entries (phages) with annotated RBPs.

In [5]:
inphared = inphared[inphared['Accession'].isin(util.get_phages())]
inphared.head()

Unnamed: 0,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,MK250029,Prevotella phage Lak-C1,Prevotella phage Lak-C1 Myoviridae Caudovirice...,540217,True,25.796,DNA,13-JAN-2019,830,47.108434,52.891566,68.324951,,30,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
1,MK250028,Prevotella phage Lak-B9,Prevotella phage Lak-B9 Myoviridae Caudovirice...,550053,True,26.012,DNA,13-JAN-2019,859,52.270081,47.729919,69.188424,,29,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
2,MK250027,Prevotella phage Lak-B8,Prevotella phage Lak-B8 Myoviridae Caudovirice...,551627,True,26.022,DNA,13-JAN-2019,860,53.023256,46.976744,69.318761,,33,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
3,MK250026,Prevotella phage Lak-B7,Prevotella phage Lak-B7 Myoviridae Caudovirice...,550702,True,26.02,DNA,13-JAN-2019,859,53.201397,46.798603,69.363285,,33,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
4,MK250025,Prevotella phage Lak-B6,Prevotella phage Lak-B6 Myoviridae Caudovirice...,546689,True,26.029,DNA,13-JAN-2019,847,52.656434,47.343566,69.118274,,30,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
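
The filtering step relies on `Series.isin`, which builds a boolean mask over the `Accession` column. The sketch below illustrates the idea on toy data; the accession values are made up, and `phages_with_rbps` merely stands in for whatever collection `util.get_phages()` returns.

```python
import pandas as pd

# Toy stand-ins: the accessions are illustrative, and `phages_with_rbps`
# plays the role of the collection returned by `util.get_phages()`.
phages = pd.DataFrame({
    "Accession": ["A1", "A2", "A3"],
    "Host": ["escherichia", "salmonella", "ralstonia"],
})
phages_with_rbps = {"A1", "A3"}

# Keep only the rows whose accession appears in the collection.
filtered = phages[phages["Accession"].isin(phages_with_rbps)]
print(filtered["Accession"].tolist())
```

Note that the filtered frame keeps the original row labels, which is why the index of a filtered table need not be contiguous.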


### Some Statistics

Number of phages

In [6]:
print('Original:\t', orig_shape[0])
print('With RBPs:\t', inphared.shape[0])

Original:	 15823
With RBPs:	 9583


Get the IDs of the RBPs and the accession IDs of the phages to which they belong.

In [7]:
rbps_with_accession = util.get_rbps()

In [8]:
rbp_df = pd.DataFrame(rbps_with_accession, columns = ['Protein ID', 'Accession'])
rbp_df.head()

Unnamed: 0,Protein ID,Accession
0,BAF36105.1,AB231700
1,BAF36110.1,AB231700
2,BAF36131.1,AB231700
3,BAF36132.1,AB231700
4,BAF36193.1,AB231700


In [9]:
rbp_with_phage = pd.merge(rbp_df, inphared, how = 'inner', validate = 'many_to_one')
rbp_with_phage.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
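
The `validate = 'many_to_one'` argument makes pandas check that the merge keys are unique in the right-hand table, so each RBP is matched to at most one phage record. A minimal sketch on toy data (all identifiers below are illustrative) shows both the normal case and the error raised when the right-hand table contains a duplicated key:

```python
import pandas as pd

# Toy tables: several proteins can share one phage accession.
rbps = pd.DataFrame({
    "Protein ID": ["P1", "P2", "P3"],
    "Accession": ["A1", "A1", "A2"],
})
phages = pd.DataFrame({
    "Accession": ["A1", "A2"],
    "Host": ["microcystis", "escherichia"],
})

# With no `on` argument, pandas merges on the shared column (`Accession`).
merged = pd.merge(rbps, phages, how="inner", validate="many_to_one")
print(merged.shape)

# A duplicated accession on the right-hand side violates many-to-one.
phages_dup = pd.concat([phages, phages.iloc[[0]]], ignore_index=True)
try:
    pd.merge(rbps, phages_dup, how="inner", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("MergeError raised:", raised)
```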


Cast the `Modification Date` column to a `datetime` data type.

In [10]:
rbp_with_phage['Modification Date'] = pd.to_datetime(rbp_with_phage['Modification Date'])

Add a `Year-Month` column derived from `Modification Date`.

In [11]:
rbp_with_phage['Year-Month'] = rbp_with_phage['Modification Date'].dt.to_period('M')
rbp_with_phage.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
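
The two date steps can be reproduced on a small series. The sketch below passes an explicit `format`, which is an assumption made here for clarity (the notebook lets pandas infer the format), and it assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Dates in the same style as the raw `Modification Date` column.
dates = pd.Series(["13-JAN-2019", "14-JUL-2021"])

# Explicit format (an assumption; the notebook relies on inference).
parsed = pd.to_datetime(dates, format="%d-%b-%Y")

# Collapse each timestamp to its calendar month.
year_month = parsed.dt.to_period("M")
print(year_month.astype(str).tolist())
```

A monthly period discards the day, so all records modified in the same month share one `Year-Month` value.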


Add the taxonomic information of the host bacteria.

In [12]:
host_taxonomy = pd.DataFrame(util.get_host_taxonomy(rbp_with_phage), 
                             columns = ['Host Superkingdom', 'Host Phylum', 'Host Class', 'Host Order', 'Host Family', 'Host'])
host_taxonomy.head()

Unnamed: 0,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Host
0,bacteria,cyanobacteria,,chroococcales,microcystaceae,microcystis
1,bacteria,proteobacteria,gammaproteobacteria,enterobacterales,enterobacteriaceae,escherichia
2,bacteria,proteobacteria,gammaproteobacteria,enterobacterales,enterobacteriaceae,enterobacter
3,bacteria,proteobacteria,gammaproteobacteria,enterobacterales,enterobacteriaceae,salmonella
4,bacteria,proteobacteria,betaproteobacteria,burkholderiales,burkholderiaceae,ralstonia


In [13]:
rbp_taxonomy = pd.merge(rbp_with_phage, host_taxonomy, how = 'inner', validate = 'many_to_one')
rbp_taxonomy.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae


Add the protein sequence.

In [14]:
inphared_fasta_hypothetical = f'{constants.INPHARED}/{constants.FASTA}/{constants.HYPOTHETICAL}'
inphared_fasta_rbp = f'{constants.INPHARED}/{constants.FASTA}/{constants.RBP}'

rbp_sequences = pd.DataFrame(util.get_sequences(rbps_with_accession, True,                                    
                                                f'{inphared_fasta_hypothetical}/{constants.GENBANK}',
                                                f'{inphared_fasta_hypothetical}/{constants.PROKKA}',
                                                f'{inphared_fasta_rbp}/{constants.GENBANK}',
                                                f'{inphared_fasta_rbp}/{constants.PROKKA}'),
                             columns = ['Protein ID', 'Protein Sequence'])
rbp_sequences.head()

Unnamed: 0,Protein ID,Protein Sequence
0,BAF36105.1,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...
1,BAF36110.1,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...
2,BAF36131.1,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...
3,BAF36132.1,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...
4,BAF36193.1,MVNYRYRLSRLLIPGGIPDPEIGEVELFLASDRQGYINNIDLPPDP...


In [15]:
rbp_protein_seq = pd.merge(rbp_taxonomy, rbp_sequences, how = 'inner',
                           validate = 'one_to_one')
rbp_protein_seq.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Protein Sequence
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MVNYRYRLSRLLIPGGIPDPEIGEVELFLASDRQGYINNIDLPPDP...


Add the nucleotide sequence.

In [16]:
inphared_ffn_hypothetical = f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}'
inphared_ffn_rbp = f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}'

rbp_nucleotide_sequences = pd.DataFrame(util.get_sequences(rbps_with_accession, False, 
                                                           f'{inphared_ffn_hypothetical}/{constants.GENBANK}',
                                                           f'{inphared_ffn_hypothetical}/{constants.PROKKA}',
                                                           f'{inphared_ffn_rbp}/{constants.GENBANK}',
                                                           f'{inphared_ffn_rbp}/{constants.PROKKA}'),
                                        columns = ['Protein ID', 'Nucleotide Sequence'])
rbp_nucleotide_sequences.head()

Unnamed: 0,Protein ID,Nucleotide Sequence
0,BAF36105.1,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...
1,BAF36110.1,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...
2,BAF36131.1,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...
3,BAF36132.1,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...
4,BAF36193.1,TTGGTTAATTATCGTTATAGATTATCACGACTACTAATCCCGGGGG...


In [17]:
rbp_nucleotide_seq = pd.merge(rbp_protein_seq, rbp_nucleotide_sequences, how = 'inner',
                              validate = 'one_to_one')
rbp_nucleotide_seq.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Protein Sequence,Nucleotide Sequence
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MVNYRYRLSRLLIPGGIPDPEIGEVELFLASDRQGYINNIDLPPDP...,TTGGTTAATTATCGTTATAGATTATCACGACTACTAATCCCGGGGG...


Verify that every protein ID appears only once in the table.

In [18]:
rbp_nucleotide_seq.shape[0] == rbp_nucleotide_seq['Protein ID'].nunique()

True
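
If the check above ever returned `False`, the offending rows would need to be located. One possible way to do so, sketched on toy data with illustrative identifiers, is `Series.duplicated` with `keep=False`, which flags every occurrence of a repeated value:

```python
import pandas as pd

# Toy table in which the protein ID "P2" appears twice.
table = pd.DataFrame({
    "Protein ID": ["P1", "P2", "P2"],
    "Accession": ["A1", "A1", "A2"],
})

# Equivalent to comparing the row count with `nunique()`.
print(table["Protein ID"].is_unique)

# `keep=False` marks all copies of a repeated ID, not just the later ones.
repeated = table[table["Protein ID"].duplicated(keep=False)]
print(repeated)
```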

In [19]:
rbp_nucleotide_seq.shape

(24752, 36)

Save the resulting table to a CSV file.

In [20]:
data_dir = f'{constants.INPHARED}/{constants.DATA}'
os.makedirs(data_dir, exist_ok = True)

rbp_nucleotide_seq.to_csv(os.path.join(data_dir, constants.INPHARED_RBP_DATA), index = False)

In [21]:
rbp_nucleotide_seq = pd.read_csv(f'{constants.INPHARED}/{constants.DATA}/{constants.INPHARED_RBP_DATA}')
rbp_nucleotide_seq.shape

(24752, 36)
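
One thing to keep in mind about the CSV round trip: the file stores text only, so a `datetime` column is not read back as a `datetime` unless this is requested. The sketch below uses an in-memory buffer and a toy one-row table to illustrate this; whether downstream notebooks need the dates re-parsed is not stated here, so treat the `parse_dates` call as an optional suggestion.

```python
import io

import pandas as pd

# Toy table with the same two date-derived columns as the real one.
df = pd.DataFrame({
    "Protein ID": ["P1"],
    "Modification Date": pd.to_datetime(["2021-07-14"]),
})
df["Year-Month"] = df["Modification Date"].dt.to_period("M")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read returns the date column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking for `parse_dates` restores the datetime type.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Modification Date"])

print(plain["Modification Date"].dtype, restored["Modification Date"].dtype)
```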

### Append the protein embeddings to the table

List the protein language models.

In [22]:
plm_list = list(constants.PLM.keys())
plm_list

['PROTTRANSBERT',
 'PROTXLNET',
 'PROTTRANSALBERT',
 'PROTT5',
 'ESM',
 'ESM1B',
 'SEQVEC']

Append the protein embeddings for each language model.

In [23]:
for plm in plm_list:
    # Load the embeddings generated by this protein language model
    embeddings_df = util.get_rbp_embeddings_df(plm, f'{constants.INPHARED}/{constants.PLM[plm]}/{constants.COMPLETE}')
    rbp_embeddings = pd.merge(rbp_nucleotide_seq, embeddings_df, how = 'inner', validate = 'one_to_one')
    print(plm, ":", rbp_embeddings.shape)

    # Save one CSV file per protein language model
    rbp_embeddings.to_csv(os.path.join(f'{constants.INPHARED}/{constants.DATA}', constants.PLM_EMBEDDINGS_CSV[plm]), index = False)

PROTTRANSBERT : (24752, 1060)
PROTXLNET : (24752, 1060)
PROTTRANSALBERT : (24752, 4132)
PROTT5 : (24752, 1060)
ESM : (24752, 1316)
ESM1B : (24752, 1316)
SEQVEC : (24752, 1060)
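
The widths printed above can be read as 36 metadata columns plus one column per embedding dimension. The arithmetic below is a sketch that assumes `Protein ID` is the only column shared between the consolidated table and each embeddings table, so the merge adds exactly one column per dimension.

```python
# Width of the consolidated table, taken from the shape (24752, 36) above.
N_METADATA_COLUMNS = 36

# Widths of the merged tables, copied from the printed shapes.
merged_widths = {
    "PROTTRANSBERT": 1060,
    "PROTXLNET": 1060,
    "PROTTRANSALBERT": 4132,
    "PROTT5": 1060,
    "ESM": 1316,
    "ESM1B": 1316,
    "SEQVEC": 1060,
}

# Under the stated assumption, the remainder is the embedding dimension.
embedding_dims = {plm: width - N_METADATA_COLUMNS for plm, width in merged_widths.items()}

for plm, dim in embedding_dims.items():
    print(f"{plm}: {dim}")
```

Every merged table also keeps all 24752 rows, which suggests that each RBP in the consolidated table has an embedding from every model.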
