# Protein embeddings improve phage-host interaction prediction

**Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2</sup> & Anish M.S. Shrestha<sup>1, 2</sup>**

<sup>1</sup> Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines 

{mark_gonzales, jennifer.ureta, anish.shrestha}@dlsu.edu.ph

<hr>

## 💡 FASTA & Embeddings

This notebook assumes that you already have the FASTA files (from running [`1. Sequence Preprocessing.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/1.%20Sequence%20Preprocessing.ipynb)) and the RBP embeddings (from running [`4. Protein Embedding Generation.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/4.%20Protein%20Embedding%20Generation.ipynb)).

Alternatively, you may:
- Download the FASTA files from [Google Drive](https://drive.google.com/drive/folders/16ZBXZCpC0OmldtPPIy5sEBtS4EVohorT?usp=sharing) and save the downloaded folder inside the `inphared` directory located in the same folder as this notebook. 
- Download the protein embeddings from these Google Drive directories: [Part 1](https://drive.google.com/drive/folders/1deenrDQIr3xcl9QCYH-nPhmpY8x2drQw?usp=sharing). Consolidate the downloaded folders inside a single `embeddings` directory and save it inside the `inphared` directory located in the same folder as this notebook. 

The folder structure should look like this:

`inphared` <br>
↳ `embeddings` <br>
&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp; ↳ ... <br>
↳ `fasta` <br>
&nbsp; &nbsp; ↳ `hypothetical` <br>
&nbsp; &nbsp; ↳ `nucleotide` <br>
&nbsp; &nbsp; ↳ `rbp`
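
Before running the rest of the notebook, it may help to confirm that the layout is in place. The following is a minimal sketch, not part of the original pipeline: `EXPECTED_SUBDIRS` and `find_missing_dirs` are hypothetical names, and only the subdirectories named explicitly above are checked (the elided entries under `embeddings` are left out).

```python
from pathlib import Path

# Subdirectories named explicitly in the layout above; the elided entries
# under `embeddings` ("...") are intentionally not listed here.
EXPECTED_SUBDIRS = [
    "embeddings/esm",
    "embeddings/esm1b",
    "fasta/hypothetical",
    "fasta/nucleotide",
    "fasta/rbp",
]


def find_missing_dirs(root):
    """Return the expected subdirectories that do not exist under `root`."""
    root = Path(root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


# Assumes the notebook is run from the folder that contains `inphared`.
missing = find_missing_dirs("inphared")
print("Missing directories:", missing or "none")
```

An empty result means every listed folder was found; anything else points to a download that was saved in the wrong place.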

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import math
import os

from collections import defaultdict

import pandas as pd

from ConstantsUtil import ConstantsUtil
from ClassificationUtil import ClassificationUtil

%load_ext autoreload
%autoreload 2

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 50)

pd.options.mode.chained_assignment = None

<hr>

# Load the necessary utility classes

In [3]:
constants = ConstantsUtil()
util = ClassificationUtil(complete_embeddings_dir = constants.COMPLETE_EMBEDDINGS)

<hr>

In [4]:
inphared = pd.read_csv(f'{constants.TEMP_PREPROCESSING}/{constants.INPHARED_WITH_HOSTS}')
orig_shape = inphared.shape

In [5]:
inphared = inphared[inphared['Accession'].isin(util.get_phages())]
inphared.head()

Unnamed: 0,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,MK250029,Prevotella phage Lak-C1,Prevotella phage Lak-C1 Myoviridae Caudovirice...,540217,True,25.796,DNA,13-JAN-2019,830,47.108434,52.891566,68.324951,,30,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
1,MK250028,Prevotella phage Lak-B9,Prevotella phage Lak-B9 Myoviridae Caudovirice...,550053,True,26.012,DNA,13-JAN-2019,859,52.270081,47.729919,69.188424,,29,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
2,MK250027,Prevotella phage Lak-B8,Prevotella phage Lak-B8 Myoviridae Caudovirice...,551627,True,26.022,DNA,13-JAN-2019,860,53.023256,46.976744,69.318761,,33,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
3,MK250026,Prevotella phage Lak-B7,Prevotella phage Lak-B7 Myoviridae Caudovirice...,550702,True,26.02,DNA,13-JAN-2019,859,53.201397,46.798603,69.363285,,33,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
4,MK250025,Prevotella phage Lak-B6,Prevotella phage Lak-B6 Myoviridae Caudovirice...,546689,True,26.029,DNA,13-JAN-2019,847,52.656434,47.343566,69.118274,,30,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.


### Some Statistics

In [6]:
print('Original:\t', orig_shape[0])
print('With RBPs:\t', inphared.shape[0])

Original:	 15823
With RBPs:	 9583
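
As a quick worked check, the two counts above can be turned into a retention rate. The numbers are copied from the printed output; the variable names are illustrative only.

```python
# Counts printed by the cell above.
original = 15823
with_rbps = 9583

retained = with_rbps / original
print(f"Retained: {with_rbps} of {original} ({retained:.1%})")
print(f"Dropped:  {original - with_rbps}")
```

Roughly 60.6% of the entries survive the filter, so 6,240 entries are dropped at this step.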


<hr>

In [7]:
rbps_with_accession = util.get_rbps()

In [8]:
rbp_df = pd.DataFrame(rbps_with_accession, columns = ['Protein ID', 'Accession'])
rbp_df.head()

Unnamed: 0,Protein ID,Accession
0,BAF36105.1,AB231700
1,BAF36110.1,AB231700
2,BAF36131.1,AB231700
3,BAF36132.1,AB231700
4,BAF36193.1,AB231700


In [9]:
rbp_with_phage = pd.merge(rbp_df, inphared, how = 'inner', validate = 'many_to_one')
rbp_with_phage.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,14-JUL-2021,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified
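
The `validate = 'many_to_one'` argument in the merge above acts as a safety check: many RBPs may map to one phage, but each accession should appear only once on the phage side. The sketch below illustrates that behavior on toy frames; the data and the names `proteins`, `phages`, and `phages_dup` are made up for the example.

```python
import pandas as pd

# Toy stand-ins for `rbp_df` (many proteins per phage) and `inphared`
# (one row per phage); the values are illustrative only.
proteins = pd.DataFrame({
    "Protein ID": ["P1", "P2", "P3"],
    "Accession": ["A1", "A1", "A2"],
})
phages = pd.DataFrame({
    "Accession": ["A1", "A2"],
    "Host": ["microcystis", "escherichia"],
})

# With no `on` argument, pandas joins on the columns both frames share
# (here, only "Accession").
merged = pd.merge(proteins, phages, how="inner", validate="many_to_one")
print(merged)

# A duplicated accession on the right-hand side violates many-to-one,
# so pandas raises instead of silently duplicating protein rows.
phages_dup = pd.concat([phages, phages.iloc[[0]]], ignore_index=True)
try:
    pd.merge(proteins, phages_dup, how="inner", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("MergeError raised:", raised)
```

Without `validate`, the second merge would succeed and quietly return an extra row for each protein of the duplicated phage.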


### Convert `Modification Date` to datetime

In [10]:
rbp_with_phage['Modification Date'] = pd.to_datetime(rbp_with_phage['Modification Date'])

### Add `Year-Month` column

In [11]:
rbp_with_phage['Year-Month'] = rbp_with_phage['Modification Date'].dt.to_period('M')
rbp_with_phage.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07
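
The two steps above rely on pandas inferring the date format. As a minimal sketch, the same conversion can be written with an explicit format string; the sample dates are taken from the tables above, and the choice to pass `format` is a suggestion rather than something the original cells do.

```python
import pandas as pd

# Date strings in the style of the `Modification Date` column.
dates = pd.Series(["14-JUL-2021", "13-JAN-2019"])

# %d = day of month, %b = abbreviated month name, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%Y")

# Collapse each timestamp to a monthly period, as in the `Year-Month` column.
year_month = parsed.dt.to_period("M")
print(year_month.astype(str).tolist())
```

Stating the format makes the parse fail loudly on a malformed date instead of guessing.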


### Add host taxonomic information

In [12]:
host_taxonomy = pd.DataFrame(util.get_host_taxonomy(rbp_with_phage), 
                             columns = ['Host Superkingdom', 'Host Phylum', 'Host Class', 'Host Order', 'Host Family', 'Host'])
host_taxonomy.head()

Unnamed: 0,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Host
0,bacteria,cyanobacteria,,chroococcales,microcystaceae,microcystis
1,bacteria,proteobacteria,gammaproteobacteria,enterobacterales,enterobacteriaceae,escherichia
2,bacteria,proteobacteria,gammaproteobacteria,enterobacterales,enterobacteriaceae,enterobacter
3,bacteria,proteobacteria,gammaproteobacteria,enterobacterales,enterobacteriaceae,salmonella
4,bacteria,proteobacteria,betaproteobacteria,burkholderiales,burkholderiaceae,ralstonia


In [13]:
rbp_taxonomy = pd.merge(rbp_with_phage, host_taxonomy, how = 'inner', validate = 'many_to_one')
rbp_taxonomy.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae


### Add protein sequence (translation)

In [14]:
inphared_fasta_hypothetical = f'{constants.INPHARED}/{constants.FASTA}/{constants.HYPOTHETICAL}'
inphared_fasta_rbp = f'{constants.INPHARED}/{constants.FASTA}/{constants.RBP}'

rbp_sequences = pd.DataFrame(util.get_sequences(rbps_with_accession, True,                                    
                                                f'{inphared_fasta_hypothetical}/{constants.GENBANK}',
                                                f'{inphared_fasta_hypothetical}/{constants.PROKKA}',
                                                f'{inphared_fasta_rbp}/{constants.GENBANK}',
                                                f'{inphared_fasta_rbp}/{constants.PROKKA}'),
                             columns = ['Protein ID', 'Protein Sequence'])
rbp_sequences.head()

Unnamed: 0,Protein ID,Protein Sequence
0,BAF36105.1,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...
1,BAF36110.1,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...
2,BAF36131.1,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...
3,BAF36132.1,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...
4,BAF36193.1,MVNYRYRLSRLLIPGGIPDPEIGEVELFLASDRQGYINNIDLPPDP...


In [15]:
rbp_protein_seq = pd.merge(rbp_taxonomy, rbp_sequences, how = 'inner',
                           validate = 'one_to_one')
rbp_protein_seq.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Protein Sequence
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MVNYRYRLSRLLIPGGIPDPEIGEVELFLASDRQGYINNIDLPPDP...


### Add nucleotide sequence

In [16]:
inphared_ffn_hypothetical = f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}'
inphared_ffn_rbp = f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}'

rbp_nucleotide_sequences = pd.DataFrame(util.get_sequences(rbps_with_accession, False, 
                                                           f'{inphared_ffn_hypothetical}/{constants.GENBANK}',
                                                           f'{inphared_ffn_hypothetical}/{constants.PROKKA}',
                                                           f'{inphared_ffn_rbp}/{constants.GENBANK}',
                                                           f'{inphared_ffn_rbp}/{constants.PROKKA}'),
                                        columns = ['Protein ID', 'Nucleotide Sequence'])
rbp_nucleotide_sequences.head()

Unnamed: 0,Protein ID,Nucleotide Sequence
0,BAF36105.1,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...
1,BAF36110.1,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...
2,BAF36131.1,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...
3,BAF36132.1,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...
4,BAF36193.1,TTGGTTAATTATCGTTATAGATTATCACGACTACTAATCCCGGGGG...


In [17]:
rbp_nucleotide_seq = pd.merge(rbp_protein_seq, rbp_nucleotide_sequences, how = 'inner',
                              validate = 'one_to_one')
rbp_nucleotide_seq.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values),Year-Month,Host Superkingdom,Host Phylum,Host Class,Host Order,Host Family,Protein Sequence,Nucleotide Sequence
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MHQNISKENRGNYNNGIRPRIFMITTIDFRDIQAACIKQLDDMSKD...,GTGCATCAAAATATTTCAAAGGAGAATCGTGGAAACTATAACAACG...
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MRIFYIHHPFLATHRYLLSNAYSTPYTDSITKLTTSYSSMPIILSV...,GTGAGGATTTTTTATATCCACCATCCATTCCTCGCTACTCACCGAT...
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MLTDVDIQALIDASISGLSGEMPIVANIAARNALSLTKNTQVLVLD...,TTGCTGACAGATGTCGATATTCAGGCATTAATTGATGCCTCAATTT...
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MFGVFIVRREGGYIGTQPNWDAANRPGNWDILDVYNRQRRNLWIQS...,TTGTTCGGAGTTTTTATCGTGAGGCGTGAAGGCGGCTATATCGGAA...
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,PHG,Unspecified,2021-07,bacteria,cyanobacteria,,chroococcales,microcystaceae,MVNYRYRLSRLLIPGGIPDPEIGEVELFLASDRQGYINNIDLPPDP...,TTGGTTAATTATCGTTATAGATTATCACGACTACTAATCCCGGGGG...


Verify that every row corresponds to a unique protein ID.

In [18]:
rbp_nucleotide_seq.shape[0] == rbp_nucleotide_seq['Protein ID'].nunique()

True
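
If the check above ever returned `False`, the next step would be to find the offending IDs. The following sketch shows one way to do that on a toy frame; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Protein ID": ["P1", "P2", "P2", "P3"]})

# Equivalent to comparing the row count with nunique().
print("All unique:", df["Protein ID"].is_unique)

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = df[df["Protein ID"].duplicated(keep=False)]
print(dupes)
```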

In [19]:
rbp_nucleotide_seq.shape

(24752, 36)

### Save to CSV file

In [21]:
data_dir = f'{constants.INPHARED}/{constants.DATA}'
os.makedirs(data_dir, exist_ok = True)

rbp_nucleotide_seq.to_csv(os.path.join(data_dir, constants.INPHARED_RBP_DATA), index = False)

In [22]:
rbp_nucleotide_seq = pd.read_csv(f'{constants.INPHARED}/{constants.DATA}/{constants.INPHARED_RBP_DATA}')
rbp_nucleotide_seq.shape

(24752, 36)
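
The shape survives the round trip through CSV, but column types do not: a CSV file stores dates and periods as plain text. The sketch below demonstrates this on a one-row toy frame and shows one way to restore the types on load. Whether later steps need the datetime type is not stated here, so treat the restoration as optional.

```python
import io

import pandas as pd

df = pd.DataFrame({"Modification Date": pd.to_datetime(["2021-07-14"])})
df["Year-Month"] = df["Modification Date"].dt.to_period("M")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain read: both columns come back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with date parsing, then rebuild the monthly period column.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Modification Date"])
restored["Year-Month"] = restored["Modification Date"].dt.to_period("M")

print(plain.dtypes)
print(restored.dtypes)
```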

<hr>

# Get PLM embeddings

In [23]:
plm_list = list(constants.PLM.keys())
plm_list

['PROTTRANSBERT',
 'PROTXLNET',
 'PROTTRANSALBERT',
 'PROTT5',
 'ESM',
 'ESM1B',
 'SEQVEC']

Cycle through the different protein language models by changing the value of `INDEX`.

In [24]:
INDEX = 0
plm = plm_list[INDEX]
plm

'PROTTRANSBERT'
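
For reference, the mapping from `INDEX` to model name can be listed as follows. The names are copied from the `plm_list` output above; `process_plm` is a hypothetical placeholder for the per-model cells that follow, included only to sketch how the manual cycling could become a loop.

```python
# Names copied from the `plm_list` output above.
plm_list = ["PROTTRANSBERT", "PROTXLNET", "PROTTRANSALBERT",
            "PROTT5", "ESM", "ESM1B", "SEQVEC"]


def process_plm(plm):
    """Hypothetical placeholder for the per-model steps (load the
    embeddings, merge with the RBP table, save the CSV)."""
    return f"processed {plm}"


# Valid values of INDEX run from 0 to len(plm_list) - 1.
for index, plm in enumerate(plm_list):
    print(index, plm, "->", process_plm(plm))
```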

In [27]:
if not os.path.exists(f'{constants.INPHARED}/{constants.EMBEDDINGS}'):
    os.makedirs(f'{constants.INPHARED}/{constants.EMBEDDINGS}')

rbp_df = util.get_rbp_embeddings_df(plm, f'{constants.INPHARED}/{constants.PLM[plm]}/{constants.COMPLETE}')
rbp_df.head()

Unnamed: 0,Protein ID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,...,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024
0,BAF36105.1,0.011343,-0.025608,-0.023644,0.011327,0.030272,-0.010992,-0.008051,0.00383,-0.047815,0.046852,-0.013404,-0.046046,0.001363,0.047042,-0.003473,-0.020164,0.022428,0.020981,0.003707,-0.268043,0.012791,-0.008935,-0.008863,0.000673,...,-0.042151,-0.003302,-0.007857,-0.016415,0.006296,0.03353,-0.024245,-0.03138,-0.004161,0.029706,-0.026781,-0.006438,0.003705,8e-05,0.037692,0.015032,-0.030837,-0.026536,-0.012994,0.06541,0.017548,-0.026453,0.008197,-0.016817,0.011521
1,BAF36110.1,0.068575,-0.018504,0.012405,0.03299,0.019897,0.005218,-0.035998,-0.016614,-0.026085,0.022749,-0.006757,-0.022649,0.018968,0.063115,-0.023281,-0.040188,0.02028,-0.006473,0.001024,-0.073634,0.033587,0.020627,-0.012305,-0.035192,...,-0.016343,-0.007508,-0.018304,-0.030201,0.030389,0.013438,0.020754,-0.04893,0.05231,0.012141,-0.045077,-0.001699,0.040026,0.000505,0.082433,-0.00484,-0.03971,-0.039865,0.001029,0.091339,-0.011433,-0.043716,0.017314,1.5e-05,0.05379
2,BAF36131.1,0.014763,-0.026582,0.036431,0.022586,0.061328,0.000776,-0.025405,-0.029366,-0.100102,0.056073,-0.01397,-0.014491,-0.043729,0.035473,0.002547,-0.008221,-0.010369,0.026538,0.010232,0.063491,0.045908,0.004861,0.013222,-0.041022,...,-0.037931,-0.021319,-0.025819,0.023642,0.01262,-0.004665,-0.011568,-0.017803,-0.001289,0.050931,-0.065078,0.044811,-0.011598,-0.026189,0.113922,-0.022101,-0.032947,-0.013377,0.002333,0.118846,-0.008951,0.009829,0.011692,-0.006371,0.034871
3,BAF36132.1,0.031441,-0.020482,-0.01282,0.005248,0.079325,0.004676,-0.018973,-0.033595,-0.047655,0.043803,-0.001279,0.001606,0.002301,0.045565,-0.032164,-0.028279,0.000878,-0.012325,-0.00838,0.098496,0.027793,-0.003735,-0.010564,-0.039872,...,-0.023285,-0.025137,-0.019966,-0.002471,0.043135,0.009154,0.002943,0.005217,8.5e-05,0.037465,-0.011625,-0.01182,-0.004639,0.025781,0.062478,-0.019769,-0.065851,-0.006687,0.014266,0.097479,0.019128,0.001769,0.018548,-0.002265,0.024396
4,BAF36193.1,0.061502,-0.015091,-0.030157,0.025261,0.049047,0.011626,-0.014671,0.011538,-0.0336,0.039354,-0.004341,-0.052534,0.006163,0.05648,-0.02572,-0.021862,-0.004352,0.016171,-0.022362,-0.18654,-0.002528,-0.017337,0.000641,-0.014039,...,-0.040964,0.018762,-0.006258,-0.016687,0.005086,0.048235,-0.023831,-0.008983,-0.020717,-0.008471,-0.024475,-0.014686,0.011615,0.01839,0.035523,0.002646,-0.01373,-0.021774,-5.6e-05,0.074237,-0.003927,-0.041172,0.008673,-0.026502,0.010039


In [28]:
rbp_embeddings = pd.merge(rbp_nucleotide_seq, rbp_df, how = 'inner', validate = 'one_to_one')
rbp_embeddings.head()

Unnamed: 0,Protein ID,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,...,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024
0,BAF36105.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,...,-0.042151,-0.003302,-0.007857,-0.016415,0.006296,0.03353,-0.024245,-0.03138,-0.004161,0.029706,-0.026781,-0.006438,0.003705,8e-05,0.037692,0.015032,-0.030837,-0.026536,-0.012994,0.06541,0.017548,-0.026453,0.008197,-0.016817,0.011521
1,BAF36110.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,...,-0.016343,-0.007508,-0.018304,-0.030201,0.030389,0.013438,0.020754,-0.04893,0.05231,0.012141,-0.045077,-0.001699,0.040026,0.000505,0.082433,-0.00484,-0.03971,-0.039865,0.001029,0.091339,-0.011433,-0.043716,0.017314,1.5e-05,0.05379
2,BAF36131.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,...,-0.037931,-0.021319,-0.025819,0.023642,0.01262,-0.004665,-0.011568,-0.017803,-0.001289,0.050931,-0.065078,0.044811,-0.011598,-0.026189,0.113922,-0.022101,-0.032947,-0.013377,0.002333,0.118846,-0.008951,0.009829,0.011692,-0.006371,0.034871
3,BAF36132.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,...,-0.023285,-0.025137,-0.019966,-0.002471,0.043135,0.009154,0.002943,0.005217,8.5e-05,0.037465,-0.011625,-0.01182,-0.004639,0.025781,0.062478,-0.019769,-0.065851,-0.006687,0.014266,0.097479,0.019128,0.001769,0.018548,-0.002265,0.024396
4,BAF36193.1,AB231700,Microcystis virus Ma-LMM01,Microcystis virus Ma-LMM01 Fukuivirus Caudovir...,162109,False,45.953,DNA,2021-07-14,189,34.391534,65.608466,93.542616,,2,microcystis,Fukuivirus,Fukuivirus,Unclassified,Unclassified,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,...,-0.040964,0.018762,-0.006258,-0.016687,0.005086,0.048235,-0.023831,-0.008983,-0.020717,-0.008471,-0.024475,-0.014686,0.011615,0.01839,0.035523,0.002646,-0.01373,-0.021774,-5.6e-05,0.074237,-0.003927,-0.041172,0.008673,-0.026502,0.010039


In [29]:
rbp_embeddings.shape

(24752, 1060)
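
The column count of 1060 can be checked by hand from the shapes shown earlier. The figures below apply to the run shown here (`PROTTRANSBERT`); the embedding width may differ for other models.

```python
metadata_columns = 36        # columns of rbp_nucleotide_seq, from its shape (24752, 36)
embedding_dimensions = 1024  # embedding columns labelled 1 to 1024 in rbp_df
shared_key_columns = 1       # "Protein ID" appears in both frames

# The embeddings frame holds the key column plus one column per dimension.
embedding_frame_columns = embedding_dimensions + shared_key_columns

# A merge keeps a single copy of the shared key column.
merged_columns = metadata_columns + embedding_frame_columns - shared_key_columns
print(merged_columns)  # 1060
```

The row count is unchanged at 24752, which is consistent with the one-to-one inner merge finding an embedding for every RBP.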

In [30]:
rbp_embeddings.to_csv(os.path.join(f'{constants.INPHARED}/{constants.DATA}', constants.PLM_EMBEDDINGS_CSV[plm]), index = False)