# Protein embeddings improve phage-host interaction prediction

**Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2</sup> & Anish M.S. Shrestha<sup>1, 2</sup>**

<sup>1</sup> Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines 

{mark_gonzales, jennifer.ureta, anish.shrestha}@dlsu.edu.ph

<hr>

## 💡 ProtBert Embeddings
This notebook assumes that you have already converted the annotated RBP and hypothetical protein sequences (from running [`1. Sequence Preprocessing.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/1.%20Sequence%20Preprocessing.ipynb)) into ProtBert embeddings. Refer to [`4. Protein Embedding Generation.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/4.%20Protein%20Embedding%20Generation.ipynb) for the script to generate these embeddings.

Alternatively, you may download the protein embeddings from these Google Drive directories: [Part 1](https://drive.google.com/drive/folders/1deenrDQIr3xcl9QCYH-nPhmpY8x2drQw?usp=sharing) and [Part 2](https://drive.google.com/drive/folders/1jnBFNsC6zJISkc6IAz56257MSXKjY0Ez?usp=sharing). Consolidate the downloaded folders into a single `embeddings` directory and save it inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp;↳ `embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `3. RBP Computational Prediction.ipynb` (this notebook) <br>

Strictly speaking, this notebook needs only the embeddings saved in `inphared/embeddings/prottransbert/rbp` and `inphared/embeddings/prottransbert/hypothetical`.
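Before running the cells below, it may help to confirm that the two input directories are in place. The following is a minimal sketch, assuming the notebook is run from the `experiments` folder; the helper name `missing_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Input directories named in the paragraph above, relative to the notebook's folder.
REQUIRED_DIRS = [
    "inphared/embeddings/prottransbert/rbp",
    "inphared/embeddings/prottransbert/hypothetical",
]


def missing_dirs(base="."):
    """Return the required directories that do not exist under `base`."""
    base = Path(base)
    return [d for d in REQUIRED_DIRS if not (base / d).is_dir()]


missing = missing_dirs()
if missing:
    print("Missing embedding directories:", ", ".join(missing))
else:
    print("All required embedding directories are present.")
```

An empty result means both directories were found; otherwise, the listed paths still need to be generated or downloaded.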

<hr>

## 📁 Output Files
If you would like to skip running this notebook, you may download the protein embeddings from these Google Drive directories: [Part 1](https://drive.google.com/drive/folders/1deenrDQIr3xcl9QCYH-nPhmpY8x2drQw?usp=sharing) and [Part 2](https://drive.google.com/drive/folders/1jnBFNsC6zJISkc6IAz56257MSXKjY0Ez?usp=sharing). Consolidate the downloaded folders into a single `embeddings` directory and save it inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp;↳ `embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `3. RBP Computational Prediction.ipynb` (this notebook) <br>

To be specific, this notebook generates the contents of `inphared/embeddings/prottransbert/complete`.

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import os
import shutil

import pandas as pd

from ConstantsUtil import ConstantsUtil
from RBPPredictionUtil import RBPPredictionUtil

%load_ext autoreload
%autoreload 2

In [2]:
constants = ConstantsUtil()
util = RBPPredictionUtil()

Copy all the RBP ProtBert embeddings:
- **FROM `inphared/embeddings/prottransbert/rbp`**: Contains the ProtBert embeddings of the annotated RBPs, i.e., the RBPs selected based on annotations from GenBank (for sequences with gene annotations) or Prokka (for sequences without gene annotations)
- **TO `inphared/embeddings/prottransbert/complete`**: Contains the ProtBert embeddings of the annotated RBPs, together with the ProtBert embeddings of the hypothetical proteins that this notebook predicts to be RBPs

In [3]:
# Note: copytree raises FileExistsError if the destination directory already exists
shutil.copytree(f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.RBP}",
                f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.COMPLETE}")

'inphared/embeddings/prottransbert/complete'
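If the notebook is rerun after `complete` has already been created, the plain `shutil.copytree` call stops with `FileExistsError`. One possible workaround, sketched below under the assumption of Python 3.8 or later, is the `dirs_exist_ok` flag. The wrapper name `copy_embeddings` is hypothetical and not part of the repository.

```python
import shutil


def copy_embeddings(src, dst):
    """Copy the directory tree at `src` into `dst`, tolerating an existing `dst`.

    With dirs_exist_ok=True (Python 3.8+), files are merged into an existing
    destination and same-named files are overwritten, instead of the call
    raising FileExistsError.
    """
    return shutil.copytree(src, dst, dirs_exist_ok=True)
```

Because same-named files are overwritten, a rerun would reset any annotated-RBP files in `complete` to the versions in `rbp`; whether that is desirable depends on how the later cells modify those files.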

<hr>

# Part II: Computational Prediction of RBPs

Feed the ProtBert embeddings of the hypothetical proteins to the XGBoost model from this [RBP prediction study](https://www.mdpi.com/1999-4915/14/6/1329) by Boeckaerts <i>et al.</i> (2022) to predict whether they are RBPs.

In [4]:
util.predict_rbps(constants.XGB_RBP_PREDICTION, 
                  f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.HYPOTHETICAL}/{constants.GENBANK}", 
                  f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.COMPLETE}/{constants.GENBANK}")

AB231700-hypothetical-embeddings.csv
AB255436-hypothetical-embeddings.csv
AB366653-hypothetical-embeddings.csv
AB370205-hypothetical-embeddings.csv
AB370268-hypothetical-embeddings.csv
AB605730-hypothetical-embeddings.csv
AB647160-hypothetical-embeddings.csv
AB711120-hypothetical-embeddings.csv
AB716666-hypothetical-embeddings.csv
AB720063-hypothetical-embeddings.csv
AB720064-hypothetical-embeddings.csv
AB757801-hypothetical-embeddings.csv
AB775548-hypothetical-embeddings.csv
AB853330-hypothetical-embeddings.csv
AB853331-hypothetical-embeddings.csv
AB863625-hypothetical-embeddings.csv
AB897757-hypothetical-embeddings.csv
AB910392-hypothetical-embeddings.csv
AB910393-hypothetical-embeddings.csv
AB916497-hypothetical-embeddings.csv
AB981169-hypothetical-embeddings.csv
AF011378-hypothetical-embeddings.csv
AF020713-hypothetical-embeddings.csv
AF065411-hypothetical-embeddings.csv
AF069308-hypothetical-embeddings.csv
AF125520-hypothetical-embeddings.csv
AF165214-hypothetical-embeddings.csv
...

GU071103-hypothetical-embeddings.csv
GU071105-hypothetical-embeddings.csv
GU071106-hypothetical-embeddings.csv
GU071108-hypothetical-embeddings.csv
GU075905-hypothetical-embeddings.csv
GU169904-hypothetical-embeddings.csv
GU196281-hypothetical-embeddings.csv
GU229986-hypothetical-embeddings.csv
GU296433-hypothetical-embeddings.csv
GU396103-hypothetical-embeddings.csv
GU459069-hypothetical-embeddings.csv
GU477322-hypothetical-embeddings.csv
GU557055-hypothetical-embeddings.csv
GU936714-hypothetical-embeddings.csv
GU949551-hypothetical-embeddings.csv
HE608841-hypothetical-embeddings.csv
HE806280-hypothetical-embeddings.csv
HE815464-hypothetical-embeddings.csv
HE956704-hypothetical-embeddings.csv
HE956707-hypothetical-embeddings.csv
HE956711-hypothetical-embeddings.csv
HE983844-hypothetical-embeddings.csv
HE983845-hypothetical-embeddings.csv
HF543949-hypothetical-embeddings.csv
HF569089-hypothetical-embeddings.csv
HF569091-hypothetical-embeddings.csv
HG428758-hypothetical-embeddings.csv
...

JQ066768-hypothetical-embeddings.csv
JQ086369-hypothetical-embeddings.csv
JQ086370-hypothetical-embeddings.csv
JQ086371-hypothetical-embeddings.csv
JQ086372-hypothetical-embeddings.csv
JQ086373-hypothetical-embeddings.csv
JQ086374-hypothetical-embeddings.csv
JQ086375-hypothetical-embeddings.csv
JQ086377-hypothetical-embeddings.csv
JQ177062-hypothetical-embeddings.csv
JQ177065-hypothetical-embeddings.csv
JQ182726-hypothetical-embeddings.csv
JQ182727-hypothetical-embeddings.csv
JQ182730-hypothetical-embeddings.csv
JQ182731-hypothetical-embeddings.csv
JQ182732-hypothetical-embeddings.csv
JQ182733-hypothetical-embeddings.csv
JQ182734-hypothetical-embeddings.csv
JQ182736-hypothetical-embeddings.csv
JQ245707-hypothetical-embeddings.csv
JQ246028-hypothetical-embeddings.csv
JQ288021-hypothetical-embeddings.csv
JQ340389-hypothetical-embeddings.csv
JQ340774-hypothetical-embeddings.csv
JQ446452-hypothetical-embeddings.csv
JQ513383-hypothetical-embeddings.csv
JQ619704-hypothetical-embeddings.csv
...

KF024731-hypothetical-embeddings.csv
KF024732-hypothetical-embeddings.csv
KF024734-hypothetical-embeddings.csv
KF114876-hypothetical-embeddings.csv
KF114877-hypothetical-embeddings.csv
KF114879-hypothetical-embeddings.csv
KF114880-hypothetical-embeddings.csv
KF147891-hypothetical-embeddings.csv
KF148055-hypothetical-embeddings.csv
KF148616-hypothetical-embeddings.csv
KF156338-hypothetical-embeddings.csv
KF156339-hypothetical-embeddings.csv
KF156340-hypothetical-embeddings.csv
KF188414-hypothetical-embeddings.csv
KF192075-hypothetical-embeddings.csv
KF208639-hypothetical-embeddings.csv
KF279417-hypothetical-embeddings.csv
KF301602-hypothetical-embeddings.csv
KF306380-hypothetical-embeddings.csv
KF322026-hypothetical-embeddings.csv
KF356199-hypothetical-embeddings.csv
KF381361-hypothetical-embeddings.csv
KF416343-hypothetical-embeddings.csv
KF475786-hypothetical-embeddings.csv
KF493883-hypothetical-embeddings.csv
KF534715-hypothetical-embeddings.csv
KF550303-hypothetical-embeddings.csv
...

KJ094031-hypothetical-embeddings.csv
KJ094033-hypothetical-embeddings.csv
KJ101592-hypothetical-embeddings.csv
KJ133688-hypothetical-embeddings.csv
KJ133689-hypothetical-embeddings.csv
KJ133690-hypothetical-embeddings.csv
KJ133691-hypothetical-embeddings.csv
KJ133692-hypothetical-embeddings.csv
KJ133693-hypothetical-embeddings.csv
KJ133694-hypothetical-embeddings.csv
KJ133695-hypothetical-embeddings.csv
KJ133696-hypothetical-embeddings.csv
KJ133697-hypothetical-embeddings.csv
KJ133698-hypothetical-embeddings.csv
KJ133699-hypothetical-embeddings.csv
KJ133700-hypothetical-embeddings.csv
KJ133701-hypothetical-embeddings.csv
KJ133702-hypothetical-embeddings.csv
KJ133703-hypothetical-embeddings.csv
KJ133706-hypothetical-embeddings.csv
KJ133707-hypothetical-embeddings.csv
KJ135004-hypothetical-embeddings.csv
KJ156985-hypothetical-embeddings.csv
KJ159566-hypothetical-embeddings.csv
KJ173786-hypothetical-embeddings.csv
KJ174156-hypothetical-embeddings.csv
KJ174317-hypothetical-embeddings.csv
...

KR080196-hypothetical-embeddings.csv
KR080202-hypothetical-embeddings.csv
KR080204-hypothetical-embeddings.csv
KR093625-hypothetical-embeddings.csv
KR093626-hypothetical-embeddings.csv
KR093628-hypothetical-embeddings.csv
KR131710-hypothetical-embeddings.csv
KR136259-hypothetical-embeddings.csv
KR136260-hypothetical-embeddings.csv
KR233164-hypothetical-embeddings.csv
KR262148-hypothetical-embeddings.csv
KR269718-hypothetical-embeddings.csv
KR269719-hypothetical-embeddings.csv
KR296689-hypothetical-embeddings.csv
KR296692-hypothetical-embeddings.csv
KR296694-hypothetical-embeddings.csv
KR296695-hypothetical-embeddings.csv
KR422353-hypothetical-embeddings.csv
KR534323-hypothetical-embeddings.csv
KR537871-hypothetical-embeddings.csv
KR537872-hypothetical-embeddings.csv
KR560069-hypothetical-embeddings.csv
KR604693-hypothetical-embeddings.csv
KR698074-hypothetical-embeddings.csv
KR824843-hypothetical-embeddings.csv
KR902361-hypothetical-embeddings.csv
KR905066-hypothetical-embeddings.csv
...

KU886222-hypothetical-embeddings.csv
KU886224-hypothetical-embeddings.csv
KU892558-hypothetical-embeddings.csv
KU927500-hypothetical-embeddings.csv
KU935715-hypothetical-embeddings.csv
KU946962-hypothetical-embeddings.csv
KU948710-hypothetical-embeddings.csv
KU963245-hypothetical-embeddings.csv
KU981050-hypothetical-embeddings.csv
KU984979-hypothetical-embeddings.csv
KU984980-hypothetical-embeddings.csv
KU997639-hypothetical-embeddings.csv
KU998240-hypothetical-embeddings.csv
KU998245-hypothetical-embeddings.csv
KX011169-hypothetical-embeddings.csv
KX017521-hypothetical-embeddings.csv
KX066068-hypothetical-embeddings.csv
KX098389-hypothetical-embeddings.csv
KX119174-hypothetical-embeddings.csv
KX119175-hypothetical-embeddings.csv
KX119177-hypothetical-embeddings.csv
KX119182-hypothetical-embeddings.csv
KX119188-hypothetical-embeddings.csv
KX119191-hypothetical-embeddings.csv
KX119192-hypothetical-embeddings.csv
KX119194-hypothetical-embeddings.csv
KX119195-hypothetical-embeddings.csv
...

LC727700-hypothetical-embeddings.csv
LC727701-hypothetical-embeddings.csv
LK985321-hypothetical-embeddings.csv
LN610577-hypothetical-embeddings.csv
LN610578-hypothetical-embeddings.csv
LN610590-hypothetical-embeddings.csv
LN681534-hypothetical-embeddings.csv
LN681537-hypothetical-embeddings.csv
LN681539-hypothetical-embeddings.csv
LN681541-hypothetical-embeddings.csv
LN681542-hypothetical-embeddings.csv
LN828717-hypothetical-embeddings.csv
LN881727-hypothetical-embeddings.csv
LN881729-hypothetical-embeddings.csv
LN881730-hypothetical-embeddings.csv
LN881731-hypothetical-embeddings.csv
LN881732-hypothetical-embeddings.csv
LN881733-hypothetical-embeddings.csv
LN881734-hypothetical-embeddings.csv
LN881735-hypothetical-embeddings.csv
LN881736-hypothetical-embeddings.csv
LN881737-hypothetical-embeddings.csv
LN881738-hypothetical-embeddings.csv
LN887844-hypothetical-embeddings.csv
LN887948-hypothetical-embeddings.csv
LN890663-hypothetical-embeddings.csv
LN898172-hypothetical-embeddings.csv
...

MF788075-hypothetical-embeddings.csv
MF805716-hypothetical-embeddings.csv
MF893271-hypothetical-embeddings.csv
MF893340-hypothetical-embeddings.csv
MF919493-hypothetical-embeddings.csv
MF919494-hypothetical-embeddings.csv
MF919495-hypothetical-embeddings.csv
MF919499-hypothetical-embeddings.csv
MF919504-hypothetical-embeddings.csv
MF919506-hypothetical-embeddings.csv
MF919510-hypothetical-embeddings.csv
MF919512-hypothetical-embeddings.csv
MF919513-hypothetical-embeddings.csv
MF919514-hypothetical-embeddings.csv
MF919524-hypothetical-embeddings.csv
MF919525-hypothetical-embeddings.csv
MF919527-hypothetical-embeddings.csv
MF919529-hypothetical-embeddings.csv
MF919534-hypothetical-embeddings.csv
MF919535-hypothetical-embeddings.csv
MF919540-hypothetical-embeddings.csv
MF919541-hypothetical-embeddings.csv
MF919542-hypothetical-embeddings.csv
MF959998-hypothetical-embeddings.csv
MF959999-hypothetical-embeddings.csv
MF974396-hypothetical-embeddings.csv
MF974397-hypothetical-embeddings.csv
...

MG592569-hypothetical-embeddings.csv
MG592570-hypothetical-embeddings.csv
MG592571-hypothetical-embeddings.csv
MG592572-hypothetical-embeddings.csv
MG592575-hypothetical-embeddings.csv
MG592577-hypothetical-embeddings.csv
MG592578-hypothetical-embeddings.csv
MG592579-hypothetical-embeddings.csv
MG592583-hypothetical-embeddings.csv
MG592585-hypothetical-embeddings.csv
MG592586-hypothetical-embeddings.csv
MG592587-hypothetical-embeddings.csv
MG592588-hypothetical-embeddings.csv
MG592589-hypothetical-embeddings.csv
MG592590-hypothetical-embeddings.csv
MG592591-hypothetical-embeddings.csv
MG592592-hypothetical-embeddings.csv
MG592593-hypothetical-embeddings.csv
MG592594-hypothetical-embeddings.csv
MG592595-hypothetical-embeddings.csv
MG592596-hypothetical-embeddings.csv
MG592600-hypothetical-embeddings.csv
MG592601-hypothetical-embeddings.csv
MG592602-hypothetical-embeddings.csv
MG592603-hypothetical-embeddings.csv
MG592604-hypothetical-embeddings.csv
MG592605-hypothetical-embeddings.csv
...

MH153807-hypothetical-embeddings.csv
MH153809-hypothetical-embeddings.csv
MH153812-hypothetical-embeddings.csv
MH153813-hypothetical-embeddings.csv
MH155868-hypothetical-embeddings.csv
MH155877-hypothetical-embeddings.csv
MH155880-hypothetical-embeddings.csv
MH160767-hypothetical-embeddings.csv
MH171093-hypothetical-embeddings.csv
MH171094-hypothetical-embeddings.csv
MH171095-hypothetical-embeddings.csv
MH178096-hypothetical-embeddings.csv
MH179470-hypothetical-embeddings.csv
MH179471-hypothetical-embeddings.csv
MH179474-hypothetical-embeddings.csv
MH179477-hypothetical-embeddings.csv
MH179480-hypothetical-embeddings.csv
MH181876-hypothetical-embeddings.csv
MH191398-hypothetical-embeddings.csv
MH203051-hypothetical-embeddings.csv
MH221128-hypothetical-embeddings.csv
MH221129-hypothetical-embeddings.csv
MH229862-hypothetical-embeddings.csv
MH229863-hypothetical-embeddings.csv
MH229864-hypothetical-embeddings.csv
MH229865-hypothetical-embeddings.csv
MH230177-hypothetical-embeddings.csv
...

MH807815-hypothetical-embeddings.csv
MH807816-hypothetical-embeddings.csv
MH807817-hypothetical-embeddings.csv
MH807818-hypothetical-embeddings.csv
MH807819-hypothetical-embeddings.csv
MH807820-hypothetical-embeddings.csv
MH809528-hypothetical-embeddings.csv
MH809529-hypothetical-embeddings.csv
MH809530-hypothetical-embeddings.csv
MH809531-hypothetical-embeddings.csv
MH809532-hypothetical-embeddings.csv
MH809533-hypothetical-embeddings.csv
MH816848-hypothetical-embeddings.csv
MH816966-hypothetical-embeddings.csv
MH817999-hypothetical-embeddings.csv
MH825697-hypothetical-embeddings.csv
MH825706-hypothetical-embeddings.csv
MH825707-hypothetical-embeddings.csv
MH825711-hypothetical-embeddings.csv
MH825712-hypothetical-embeddings.csv
MH825713-hypothetical-embeddings.csv
MH837626-hypothetical-embeddings.csv
MH853786-hypothetical-embeddings.csv
MH853787-hypothetical-embeddings.csv
MH853788-hypothetical-embeddings.csv
MH880817-hypothetical-embeddings.csv
MH884508-hypothetical-embeddings.csv
...

MK359361-hypothetical-embeddings.csv
MK359362-hypothetical-embeddings.csv
MK359363-hypothetical-embeddings.csv
MK359364-hypothetical-embeddings.csv
MK359365-hypothetical-embeddings.csv
MK359366-hypothetical-embeddings.csv
MK360025-hypothetical-embeddings.csv
MK368614-hypothetical-embeddings.csv
MK370036-hypothetical-embeddings.csv
MK372342-hypothetical-embeddings.csv
MK373770-hypothetical-embeddings.csv
MK373771-hypothetical-embeddings.csv
MK373772-hypothetical-embeddings.csv
MK373773-hypothetical-embeddings.csv
MK373774-hypothetical-embeddings.csv
MK373775-hypothetical-embeddings.csv
MK373776-hypothetical-embeddings.csv
MK373777-hypothetical-embeddings.csv
MK373778-hypothetical-embeddings.csv
MK373779-hypothetical-embeddings.csv
MK373780-hypothetical-embeddings.csv
MK373781-hypothetical-embeddings.csv
MK373782-hypothetical-embeddings.csv
MK373783-hypothetical-embeddings.csv
MK373784-hypothetical-embeddings.csv
MK373785-hypothetical-embeddings.csv
MK373786-hypothetical-embeddings.csv
...

MK511031-hypothetical-embeddings.csv
MK511032-hypothetical-embeddings.csv
MK511033-hypothetical-embeddings.csv
MK511034-hypothetical-embeddings.csv
MK511035-hypothetical-embeddings.csv
MK511036-hypothetical-embeddings.csv
MK511038-hypothetical-embeddings.csv
MK511039-hypothetical-embeddings.csv
MK511040-hypothetical-embeddings.csv
MK511041-hypothetical-embeddings.csv
MK511042-hypothetical-embeddings.csv
MK511043-hypothetical-embeddings.csv
MK511044-hypothetical-embeddings.csv
MK511045-hypothetical-embeddings.csv
MK511046-hypothetical-embeddings.csv
MK511047-hypothetical-embeddings.csv
MK511048-hypothetical-embeddings.csv
MK511049-hypothetical-embeddings.csv
MK511050-hypothetical-embeddings.csv
MK511051-hypothetical-embeddings.csv
MK511057-hypothetical-embeddings.csv
MK511059-hypothetical-embeddings.csv
MK511060-hypothetical-embeddings.csv
MK511061-hypothetical-embeddings.csv
MK511063-hypothetical-embeddings.csv
MK511065-hypothetical-embeddings.csv
MK521904-hypothetical-embeddings.csv
...

MK907239-hypothetical-embeddings.csv
MK907240-hypothetical-embeddings.csv
MK907241-hypothetical-embeddings.csv
MK907243-hypothetical-embeddings.csv
MK907244-hypothetical-embeddings.csv
MK907245-hypothetical-embeddings.csv
MK907246-hypothetical-embeddings.csv
MK907247-hypothetical-embeddings.csv
MK907248-hypothetical-embeddings.csv
MK907249-hypothetical-embeddings.csv
MK907251-hypothetical-embeddings.csv
MK907252-hypothetical-embeddings.csv
MK907253-hypothetical-embeddings.csv
MK907254-hypothetical-embeddings.csv
MK907255-hypothetical-embeddings.csv
MK907257-hypothetical-embeddings.csv
MK907259-hypothetical-embeddings.csv
MK907260-hypothetical-embeddings.csv
MK907261-hypothetical-embeddings.csv
MK907262-hypothetical-embeddings.csv
MK907263-hypothetical-embeddings.csv
MK907264-hypothetical-embeddings.csv
MK907267-hypothetical-embeddings.csv
MK907268-hypothetical-embeddings.csv
MK907269-hypothetical-embeddings.csv
MK907270-hypothetical-embeddings.csv
MK907271-hypothetical-embeddings.csv
...

MN364664-hypothetical-embeddings.csv
MN369747-hypothetical-embeddings.csv
MN369749-hypothetical-embeddings.csv
MN369752-hypothetical-embeddings.csv
MN369753-hypothetical-embeddings.csv
MN369755-hypothetical-embeddings.csv
MN369760-hypothetical-embeddings.csv
MN369766-hypothetical-embeddings.csv
MN379739-hypothetical-embeddings.csv
MN379740-hypothetical-embeddings.csv
MN384978-hypothetical-embeddings.csv
MN384979-hypothetical-embeddings.csv
MN393079-hypothetical-embeddings.csv
MN393473-hypothetical-embeddings.csv
MN395285-hypothetical-embeddings.csv
MN402506-hypothetical-embeddings.csv
MN414250-hypothetical-embeddings.csv
MN417334-hypothetical-embeddings.csv
MN419153-hypothetical-embeddings.csv
MN428047-hypothetical-embeddings.csv
MN428059-hypothetical-embeddings.csv
MN428063-hypothetical-embeddings.csv
MN428065-hypothetical-embeddings.csv
MN428066-hypothetical-embeddings.csv
MN434093-hypothetical-embeddings.csv
MN434096-hypothetical-embeddings.csv
MN445182-hypothetical-embeddings.csv
...

MN908694-hypothetical-embeddings.csv
MN927226-hypothetical-embeddings.csv
MN929097-hypothetical-embeddings.csv
MN935203-hypothetical-embeddings.csv
MN937349-hypothetical-embeddings.csv
MN939539-hypothetical-embeddings.csv
MN940411-hypothetical-embeddings.csv
MN953776-hypothetical-embeddings.csv
MN954399-hypothetical-embeddings.csv
MN956514-hypothetical-embeddings.csv
MN958086-hypothetical-embeddings.csv
MN966730-hypothetical-embeddings.csv
MN966731-hypothetical-embeddings.csv
MN966732-hypothetical-embeddings.csv
MN988461-hypothetical-embeddings.csv
MN988462-hypothetical-embeddings.csv
MN988463-hypothetical-embeddings.csv
MN988465-hypothetical-embeddings.csv
MN988470-hypothetical-embeddings.csv
MN988472-hypothetical-embeddings.csv
MN988482-hypothetical-embeddings.csv
MN988483-hypothetical-embeddings.csv
MN988484-hypothetical-embeddings.csv
MN988486-hypothetical-embeddings.csv
MN988487-hypothetical-embeddings.csv
MN988490-hypothetical-embeddings.csv
MN988495-hypothetical-embeddings.csv
...

MT270409-hypothetical-embeddings.csv
MT310850-hypothetical-embeddings.csv
MT310863-hypothetical-embeddings.csv
MT310889-hypothetical-embeddings.csv
MT310890-hypothetical-embeddings.csv
MT310891-hypothetical-embeddings.csv
MT310894-hypothetical-embeddings.csv
MT316461-hypothetical-embeddings.csv
MT325768-hypothetical-embeddings.csv
MT331608-hypothetical-embeddings.csv
MT334653-hypothetical-embeddings.csv
MT338525-hypothetical-embeddings.csv
MT341500-hypothetical-embeddings.csv
MT345684-hypothetical-embeddings.csv
MT354569-hypothetical-embeddings.csv
MT354570-hypothetical-embeddings.csv
MT360680-hypothetical-embeddings.csv
MT360681-hypothetical-embeddings.csv
MT360682-hypothetical-embeddings.csv
MT361768-hypothetical-embeddings.csv
MT361972-hypothetical-embeddings.csv
MT366568-hypothetical-embeddings.csv
MT366580-hypothetical-embeddings.csv
MT366760-hypothetical-embeddings.csv
MT366761-hypothetical-embeddings.csv
MT366762-hypothetical-embeddings.csv
MT366945-hypothetical-embeddings.csv
...

MT682716-hypothetical-embeddings.csv
MT684587-hypothetical-embeddings.csv
MT684593-hypothetical-embeddings.csv
MT684595-hypothetical-embeddings.csv
MT684596-hypothetical-embeddings.csv
MT701590-hypothetical-embeddings.csv
MT701592-hypothetical-embeddings.csv
MT701595-hypothetical-embeddings.csv
MT701596-hypothetical-embeddings.csv
MT701597-hypothetical-embeddings.csv
MT701598-hypothetical-embeddings.csv
MT708547-hypothetical-embeddings.csv
MT708548-hypothetical-embeddings.csv
MT711887-hypothetical-embeddings.csv
MT711888-hypothetical-embeddings.csv
MT711977-hypothetical-embeddings.csv
MT713136-hypothetical-embeddings.csv
MT720689-hypothetical-embeddings.csv
MT723933-hypothetical-embeddings.csv
MT723943-hypothetical-embeddings.csv
MT723944-hypothetical-embeddings.csv
MT723945-hypothetical-embeddings.csv
MT732432-hypothetical-embeddings.csv
MT732433-hypothetical-embeddings.csv
MT732434-hypothetical-embeddings.csv
MT732435-hypothetical-embeddings.csv
MT732436-hypothetical-embeddings.csv
...

MW147367-hypothetical-embeddings.csv
MW147599-hypothetical-embeddings.csv
MW149272-hypothetical-embeddings.csv
MW149274-hypothetical-embeddings.csv
MW149275-hypothetical-embeddings.csv
MW151244-hypothetical-embeddings.csv
MW161461-hypothetical-embeddings.csv
MW161462-hypothetical-embeddings.csv
MW161463-hypothetical-embeddings.csv
MW161464-hypothetical-embeddings.csv
MW161465-hypothetical-embeddings.csv
MW161467-hypothetical-embeddings.csv
MW161468-hypothetical-embeddings.csv
MW175414-hypothetical-embeddings.csv
MW175491-hypothetical-embeddings.csv
MW176032-hypothetical-embeddings.csv
MW176033-hypothetical-embeddings.csv
MW176034-hypothetical-embeddings.csv
MW205203-hypothetical-embeddings.csv
MW206381-hypothetical-embeddings.csv
MW218148-hypothetical-embeddings.csv
MW221967-hypothetical-embeddings.csv
MW239124-hypothetical-embeddings.csv
MW239157-hypothetical-embeddings.csv
MW247144-hypothetical-embeddings.csv
MW247145-hypothetical-embeddings.csv
MW247147-hypothetical-embeddings.csv
...

MW822537-hypothetical-embeddings.csv
MW822538-hypothetical-embeddings.csv
MW822539-hypothetical-embeddings.csv
MW822601-hypothetical-embeddings.csv
MW824369-hypothetical-embeddings.csv
MW824370-hypothetical-embeddings.csv
MW824371-hypothetical-embeddings.csv
MW824372-hypothetical-embeddings.csv
MW824373-hypothetical-embeddings.csv
MW824374-hypothetical-embeddings.csv
MW824375-hypothetical-embeddings.csv
MW824376-hypothetical-embeddings.csv
MW824377-hypothetical-embeddings.csv
MW824378-hypothetical-embeddings.csv
MW824379-hypothetical-embeddings.csv
MW824380-hypothetical-embeddings.csv
MW824381-hypothetical-embeddings.csv
MW824382-hypothetical-embeddings.csv
MW824383-hypothetical-embeddings.csv
MW824384-hypothetical-embeddings.csv
MW824385-hypothetical-embeddings.csv
MW824386-hypothetical-embeddings.csv
MW824387-hypothetical-embeddings.csv
MW824388-hypothetical-embeddings.csv
MW824389-hypothetical-embeddings.csv
MW824390-hypothetical-embeddings.csv
MW824391-hypothetical-embeddings.csv
...

MZ357096-hypothetical-embeddings.csv
MZ358387-hypothetical-embeddings.csv
MZ359670-hypothetical-embeddings.csv
MZ374361-hypothetical-embeddings.csv
MZ375357-hypothetical-embeddings.csv
MZ375358-hypothetical-embeddings.csv
MZ384014-hypothetical-embeddings.csv
MZ388551-hypothetical-embeddings.csv
MZ388554-hypothetical-embeddings.csv
MZ388556-hypothetical-embeddings.csv
MZ394712-hypothetical-embeddings.csv
MZ398240-hypothetical-embeddings.csv
MZ398241-hypothetical-embeddings.csv
MZ398242-hypothetical-embeddings.csv
MZ398246-hypothetical-embeddings.csv
MZ398247-hypothetical-embeddings.csv
MZ398248-hypothetical-embeddings.csv
MZ422438-hypothetical-embeddings.csv
MZ424864-hypothetical-embeddings.csv
MZ424865-hypothetical-embeddings.csv
MZ427930-hypothetical-embeddings.csv
MZ428226-hypothetical-embeddings.csv
MZ428228-hypothetical-embeddings.csv
MZ443769-hypothetical-embeddings.csv
MZ443770-hypothetical-embeddings.csv
MZ443771-hypothetical-embeddings.csv
MZ443772-hypothetical-embeddings.csv
...

OK428535-hypothetical-embeddings.csv
OK428602-hypothetical-embeddings.csv
OK483199-hypothetical-embeddings.csv
OK483201-hypothetical-embeddings.csv
OK490494-hypothetical-embeddings.csv
OK499971-hypothetical-embeddings.csv
OK499972-hypothetical-embeddings.csv
OK499973-hypothetical-embeddings.csv
OK499974-hypothetical-embeddings.csv
OK499975-hypothetical-embeddings.csv
OK499976-hypothetical-embeddings.csv
OK499977-hypothetical-embeddings.csv
OK499983-hypothetical-embeddings.csv
OK499986-hypothetical-embeddings.csv
OK499988-hypothetical-embeddings.csv
OK499990-hypothetical-embeddings.csv
OK499993-hypothetical-embeddings.csv
OK499994-hypothetical-embeddings.csv
OK499995-hypothetical-embeddings.csv
OK499996-hypothetical-embeddings.csv
OK499998-hypothetical-embeddings.csv
OK500000-hypothetical-embeddings.csv
OK539824-hypothetical-embeddings.csv
OK539825-hypothetical-embeddings.csv
OK539826-hypothetical-embeddings.csv
OK562429-hypothetical-embeddings.csv
OK562670-hypothetical-embeddings.csv
...

OM735688-hypothetical-embeddings.csv
OM782452-hypothetical-embeddings.csv
OM810291-hypothetical-embeddings.csv
OM837731-hypothetical-embeddings.csv
OM864357-hypothetical-embeddings.csv
OM912978-hypothetical-embeddings.csv
OM913894-hypothetical-embeddings.csv
OM937766-hypothetical-embeddings.csv
OM953433-hypothetical-embeddings.csv
OM953790-hypothetical-embeddings.csv
OM971648-hypothetical-embeddings.csv
OM982619-hypothetical-embeddings.csv
OM982620-hypothetical-embeddings.csv
OM982621-hypothetical-embeddings.csv
OM982646-hypothetical-embeddings.csv
OM982647-hypothetical-embeddings.csv
OM982668-hypothetical-embeddings.csv
OM982669-hypothetical-embeddings.csv
OM982670-hypothetical-embeddings.csv
OM982671-hypothetical-embeddings.csv
OM982672-hypothetical-embeddings.csv
OM982673-hypothetical-embeddings.csv
OM982674-hypothetical-embeddings.csv
ON000910-hypothetical-embeddings.csv
ON042478-hypothetical-embeddings.csv
ON045001-hypothetical-embeddings.csv
ON062054-hypothetical-embeddings.csv
...

ON857932-hypothetical-embeddings.csv
ON857934-hypothetical-embeddings.csv
ON857935-hypothetical-embeddings.csv
ON857937-hypothetical-embeddings.csv
ON857938-hypothetical-embeddings.csv
ON857939-hypothetical-embeddings.csv
ON857940-hypothetical-embeddings.csv
ON857943-hypothetical-embeddings.csv
ON862890-hypothetical-embeddings.csv
ON866946-hypothetical-embeddings.csv
ON881243-hypothetical-embeddings.csv
ON886116-hypothetical-embeddings.csv
ON911714-hypothetical-embeddings.csv
ON911716-hypothetical-embeddings.csv
ON911717-hypothetical-embeddings.csv
ON911718-hypothetical-embeddings.csv
ON922919-hypothetical-embeddings.csv
ON950090-hypothetical-embeddings.csv
ON960072-hypothetical-embeddings.csv
ON970564-hypothetical-embeddings.csv
ON970566-hypothetical-embeddings.csv
ON970569-hypothetical-embeddings.csv
ON970595-hypothetical-embeddings.csv
ON970597-hypothetical-embeddings.csv
ON970600-hypothetical-embeddings.csv
ON970603-hypothetical-embeddings.csv
ON970610-hypothetical-embeddings.csv
...

In [5]:
util.predict_rbps(constants.XGB_RBP_PREDICTION, 
                  f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.HYPOTHETICAL}/{constants.PROKKA}", 
                  f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.COMPLETE}/{constants.PROKKA}")

CP017838-hypothetical-embeddings.csv
CP027117-hypothetical-embeddings.csv
CP050061-hypothetical-embeddings.csv
CP051279-hypothetical-embeddings.csv
CP051285-hypothetical-embeddings.csv
CP052843-hypothetical-embeddings.csv
CP053388-hypothetical-embeddings.csv
CP054387-hypothetical-embeddings.csv
CP058330-hypothetical-embeddings.csv
CP059058-hypothetical-embeddings.csv
CP062422-hypothetical-embeddings.csv
CP062434-hypothetical-embeddings.csv
CP062442-hypothetical-embeddings.csv
CP062444-hypothetical-embeddings.csv
CP062451-hypothetical-embeddings.csv
CP062454-hypothetical-embeddings.csv
CP062461-hypothetical-embeddings.csv
CP062462-hypothetical-embeddings.csv
CP063417-hypothetical-embeddings.csv
CP067352-hypothetical-embeddings.csv
CP069347-hypothetical-embeddings.csv
CP071049-hypothetical-embeddings.csv
CP091911-hypothetical-embeddings.csv
CP103976-hypothetical-embeddings.csv
DQ163912-hypothetical-embeddings.csv
DQ163915-hypothetical-embeddings.csv
DQ163916-hypothetical-embeddings.csv
...

OK490440-hypothetical-embeddings.csv
OK490441-hypothetical-embeddings.csv
OK490443-hypothetical-embeddings.csv
OK490444-hypothetical-embeddings.csv
OK490446-hypothetical-embeddings.csv
OK490447-hypothetical-embeddings.csv
OK490449-hypothetical-embeddings.csv
OK490450-hypothetical-embeddings.csv
OK490451-hypothetical-embeddings.csv
OK490453-hypothetical-embeddings.csv
OK490454-hypothetical-embeddings.csv
OK490456-hypothetical-embeddings.csv
OK490458-hypothetical-embeddings.csv
OK490459-hypothetical-embeddings.csv
OK638201-hypothetical-embeddings.csv
OK638202-hypothetical-embeddings.csv
OK638203-hypothetical-embeddings.csv
OK905446-hypothetical-embeddings.csv
OM234792-hypothetical-embeddings.csv
OM373550-hypothetical-embeddings.csv
OM373551-hypothetical-embeddings.csv
OM373552-hypothetical-embeddings.csv
OM373553-hypothetical-embeddings.csv
OM373554-hypothetical-embeddings.csv
OM373555-hypothetical-embeddings.csv
OM373556-hypothetical-embeddings.csv
OM373561-hypothetical-embeddings.csv
...
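The implementation of `util.predict_rbps` is not shown in this notebook, so the following is only a hypothetical outline of the general idea: read each per-genome CSV of embeddings, ask a classifier which rows look like RBPs, and keep those rows. The function name `filter_predicted_rbps`, the `is_rbp` callable (a stand-in for the trained XGBoost model), and the assumed CSV layout (one ID column followed by embedding values) are all illustrative assumptions, not the repository's actual code or file format.

```python
import csv
from pathlib import Path


def filter_predicted_rbps(input_dir, output_dir, is_rbp):
    """Keep, for every CSV in `input_dir`, only the rows flagged by `is_rbp`.

    Hypothetical outline: `is_rbp` stands in for a trained classifier and
    receives the embedding of one protein as a list of floats. The first
    column of each row is assumed to be an identifier.
    Returns a dict mapping each file name to the number of rows kept.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    kept = {}
    for path in sorted(Path(input_dir).glob("*.csv")):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:  # Skip empty files
                continue
            rows = [row for row in reader
                    if is_rbp([float(value) for value in row[1:]])]
        kept[path.name] = len(rows)
        if rows:
            # Write only the rows predicted to be RBPs
            with open(output_dir / path.name, "w", newline="") as handle:
                writer = csv.writer(handle)
                writer.writerow(header)
                writer.writerows(rows)
    return kept
```

With a real model, `is_rbp` would wrap the classifier's prediction for a single embedding; here it can be any function that returns `True` or `False`.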

Consolidate the ProtBert embeddings of the RBPs (annotated and computationally predicted) in one directory: `inphared/embeddings/prottransbert/complete/master`.

In [6]:
complete_genbank = f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.COMPLETE}/{constants.GENBANK}"
complete_prokka = f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.COMPLETE}/{constants.PROKKA}"
complete_master = f"{constants.INPHARED}/{constants.PLM['PROTTRANSBERT']}/{constants.COMPLETE}/{constants.MASTER}"

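# Create the destination directory if it does not exist yet
os.makedirs(complete_master, exist_ok=True)

# Copy the GenBank-based embeddings first
for file in os.listdir(complete_genbank):
    shutil.copy(f'{complete_genbank}/{file}', complete_master)

# Copy the Prokka-based embeddings, skipping file names that are already present
for file in os.listdir(complete_prokka):
    if not os.path.exists(f'{complete_master}/{file}'):
        shutil.copy(f'{complete_prokka}/{file}', complete_master)
    else:
        # A file name present in both directories is unexpected;
        # nothing should be printed
        print(file)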
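The consolidation above can also be written as a reusable function that returns the colliding file names instead of printing them. This is a minimal sketch; the name `consolidate` is hypothetical. Unlike the cell above, it never overwrites a file that is already in the destination, including files from the first source directory.

```python
import shutil
from pathlib import Path


def consolidate(sources, destination):
    """Copy the files of each source directory into `destination`.

    Earlier sources take precedence: a file whose name already exists in
    `destination` is not copied, and its name is collected in the returned list.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    skipped = []
    for source in sources:
        for path in sorted(Path(source).iterdir()):
            if not path.is_file():
                continue
            target = destination / path.name
            if target.exists():
                skipped.append(path.name)
            else:
                shutil.copy(path, target)
    return skipped
```

Calling `consolidate([complete_genbank, complete_prokka], complete_master)` on a fresh `master` directory should return an empty list if, as the comment in the cell expects, no file name appears in both directories.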