# Protein embeddings improve phage-host interaction prediction

**Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2</sup> & Anish M.S. Shrestha<sup>1, 2</sup>**

<sup>1</sup> Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines 

{mark_gonzales, jennifer.ureta, anish.shrestha}@dlsu.edu.ph

<hr>

## ⚠️ Memory Requirement of Protein Embeddings

Loading the pretrained protein language models may require more memory than some local machines have available. We recommend running this notebook on [Google Colab](https://colab.research.google.com/) or another cloud-based service with a GPU. In particular, the largest model, ProtT5, consumes 5.9 GB of GPU memory.
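
As a quick check before loading a model, the sketch below reports the total memory of the first visible GPU. It assumes PyTorch is importable in the runtime; the helper name `gpu_memory_gb` and the constant `PROTT5_GPU_MEMORY_GB` are illustrative and not part of this notebook.

```python
def gpu_memory_gb(device_index=0):
    """Return the total memory of a CUDA device in GiB, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is installed in the runtime
    except ImportError:
        return None

    if not torch.cuda.is_available():
        return None

    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return total_bytes / 1024 ** 3


PROTT5_GPU_MEMORY_GB = 5.9  # figure quoted above for the largest model

memory = gpu_memory_gb()
if memory is None:
    print('No CUDA-capable GPU detected (or PyTorch is not installed).')
elif memory < PROTT5_GPU_MEMORY_GB:
    print(f'GPU has {memory:.1f} GB; ProtT5 may not fit.')
else:
    print(f'GPU has {memory:.1f} GB.')
```

The comparison is only a rough guide, since other processes may already be using part of the GPU memory.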

<hr>

## 💡 FASTA Files
This notebook assumes that you have generated the FASTA files containing the annotated RBP and hypothetical protein sequences (from running [`1. Sequence Preprocessing.ipynb`](https://github.com/bioinfodlsu/phage-host-prediction/blob/main/experiments/1.%20Sequence%20Preprocessing.ipynb)). 

Alternatively, you may download the FASTA files from [Google Drive](https://drive.google.com/drive/folders/16ZBXZCpC0OmldtPPIy5sEBtS4EVohorT?usp=sharing). Save the downloaded `fasta` folder inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp;↳ `fasta` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `hypothetical` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `nucleotide` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp` <br>
↳ `4. Protein Embedding Generation.ipynb` (this notebook) <br>
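
The layout above can be verified before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `missing_fasta_dirs` is hypothetical, and it assumes the current working directory is the `experiments` folder.

```python
from pathlib import Path

# Subfolders expected under inphared/fasta, as listed in the tree above
EXPECTED_FASTA_SUBDIRS = ('hypothetical', 'nucleotide', 'rbp')


def missing_fasta_dirs(experiments_dir='.'):
    """Return the expected FASTA subfolders that are absent under inphared/fasta."""
    fasta_dir = Path(experiments_dir) / 'inphared' / 'fasta'
    return [name for name in EXPECTED_FASTA_SUBDIRS if not (fasta_dir / name).is_dir()]


missing = missing_fasta_dirs('.')
if missing:
    print('Missing folders under inphared/fasta:', ', '.join(missing))
else:
    print('FASTA folder structure looks complete.')
```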

<hr>

## 📁 Output Files
If you would like to skip running this notebook, you may download the protein embeddings from these Google Drive directories: [Part 1](https://drive.google.com/drive/folders/1deenrDQIr3xcl9QCYH-nPhmpY8x2drQw?usp=sharing) and [Part 2](https://drive.google.com/drive/folders/1jnBFNsC6zJISkc6IAz56257MSXKjY0Ez?usp=sharing). Consolidate the downloaded folders into a single `embeddings` directory and save it inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp;↳ `embeddings` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
↳ `4. Protein Embedding Generation.ipynb` (this notebook) <br>
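
The consolidation step can also be scripted. The sketch below merges several downloaded folders into one `embeddings` directory. It assumes that each downloaded folder directly contains the per-model subfolders (`esm`, `esm1b`, and so on), which may not match the actual layout of the Drive folders; the function name and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def consolidate_embeddings(part_dirs, target_dir):
    """Copy the contents of each downloaded folder into one embeddings directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for part in part_dirs:
        # dirs_exist_ok (Python 3.8+) merges into subfolders that already exist
        shutil.copytree(part, target, dirs_exist_ok=True)
    return target


# Example call with hypothetical download locations:
# consolidate_embeddings(['downloads/part1', 'downloads/part2'], 'inphared/embeddings')
```

Files with the same relative path in more than one part are overwritten by the later part, so check for name clashes if that matters.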

<hr>

# Part I: Preliminaries

Install and import the necessary libraries and modules.

In [None]:
!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null
!pip install scikit_learn==1.0.2
!pip install pyyaml==5.4.1

In [None]:
import numpy as np
import pandas as pd

from tqdm import tqdm
from Bio import SeqIO

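Import the embedder class of the protein language model to use. The cell below imports ProtBert; to use a different model, substitute the corresponding class name from the table.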

Protein Language Model | Import
-- | --
SeqVec | `SeqVecEmbedder`
ESM | `ESMEmbedder`
ESM-1b | `ESM1bEmbedder`
ProtBert | `ProtTransBertBFDEmbedder`
ProtXLNet | `ProtTransXLNetUniRef100Embedder`
ProtAlbert | `ProtTransAlbertBFDEmbedder`
ProtT5 | `ProtTransT5XLU50Embedder`

In [None]:
from bio_embeddings.embed import ProtTransBertBFDEmbedder

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

<hr>

# Part II: Generation of Protein Embeddings

The functions below generate the protein embeddings for the proteins in a given FASTA file:
- Use `compute_protein_embeddings_esm` for ESM and ESM-1b. Sequences longer than 1022 amino acids are split into non-overlapping subsequences of at most 1022 amino acids each, and their per-residue embeddings are concatenated before averaging (this is the [workaround](https://github.com/brianhie/evolocity/issues/2) suggested by the developers).
- Use `compute_protein_embeddings` for all other language models.
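
To illustrate the splitting and averaging used for ESM and ESM-1b, here is a pure-Python sketch in which plain lists stand in for the arrays returned by the embedder. The helper names `split_sequence` and `mean_of_concatenated` and the toy vectors are illustrative only.

```python
def split_sequence(sequence, max_len=1022):
    """Split a sequence into consecutive, non-overlapping chunks of at most max_len."""
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


def mean_of_concatenated(chunk_embeddings):
    """Average per-residue vectors after concatenating the chunks.

    chunk_embeddings is a list of chunks, each a list of per-residue vectors.
    """
    residues = [vector for chunk in chunk_embeddings for vector in chunk]
    dims = len(residues[0])
    return [sum(vector[d] for vector in residues) / len(residues) for d in range(dims)]


chunks = split_sequence('M' * 2500)
print([len(chunk) for chunk in chunks])  # [1022, 1022, 456]

# Toy 2-dimensional "embeddings": a 2-residue chunk and a 1-residue chunk
toy = [[[1.0, 0.0], [3.0, 0.0]], [[2.0, 6.0]]]
print(mean_of_concatenated(toy))  # [2.0, 2.0]
```

Because the per-residue vectors are concatenated before averaging, every residue contributes equally to the final embedding, regardless of which chunk it falls in.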

**Parameters**:
- `embedder`: Protein language model
- `fasta_file`: FASTA file containing the proteins
- `results_dir`: Path of the directory where the resulting embeddings are saved
- `prefix`: Name of the phage whose selected proteins are to be converted to embeddings; it is prepended to the output filename (`<prefix>-embeddings.csv`)

In [None]:
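# Maximum sequence length (in amino acids) that ESM and ESM-1b are given at once
ESM_MAX_LENGTH = 1022


def compute_protein_embeddings(embedder, fasta_file, results_dir, prefix=''):
    # Parse the FASTA file once so that the IDs and sequences stay aligned
    records = list(SeqIO.parse(fasta_file, 'fasta'))
    names = [record.id for record in records]
    sequences = [str(record.seq) for record in records]

    # Embed each protein and reduce its per-residue embeddings to a single vector
    embeddings = [embedder.reduce_per_protein(embedder.embed(sequence)) for sequence in tqdm(sequences)]

    # One row per protein: the ID followed by the embedding dimensions
    embeddings_df = pd.concat([pd.DataFrame({'ID': names}), pd.DataFrame(embeddings)], axis=1)
    embeddings_df.to_csv(results_dir + prefix + '-embeddings.csv', index=False)


def compute_protein_embeddings_esm(embedder, fasta_file, results_dir, prefix=''):
    records = list(SeqIO.parse(fasta_file, 'fasta'))
    names = [record.id for record in records]
    
    embeddings = []
    
    for record in records:
        sequence = str(record.seq)
        if len(sequence) <= ESM_MAX_LENGTH:
            embedding = embedder.reduce_per_protein(embedder.embed(sequence))
        else:
            # Split into non-overlapping chunks of at most ESM_MAX_LENGTH residues
            # (the previous two-way split left an overlong second chunk for sequences
            # longer than 2044), then concatenate the per-residue embeddings before reducing
            chunks = [sequence[i:i + ESM_MAX_LENGTH] for i in range(0, len(sequence), ESM_MAX_LENGTH)]
            per_residue = np.concatenate([embedder.embed(chunk) for chunk in chunks])
            embedding = embedder.reduce_per_protein(per_residue)
        
        embeddings.append(embedding)

    embeddings_df = pd.concat([pd.DataFrame({'ID': names}), pd.DataFrame(embeddings)], axis=1)
    embeddings_df.to_csv(results_dir + prefix + '-embeddings.csv', index=False)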
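
Each output CSV has an `ID` column followed by one column per embedding dimension (pandas labels these `0`, `1`, `2`, and so on). As a minimal sketch of reading such a file back with only the standard library, consider the hypothetical helper `load_embeddings` below:

```python
import csv


def load_embeddings(csv_path):
    """Map each protein ID to its embedding (a list of floats)."""
    embeddings = {}
    with open(csv_path, newline='') as handle:
        reader = csv.reader(handle)
        header = next(reader)  # 'ID' followed by the dimension indices
        if header[0] != 'ID':
            raise ValueError(f'Unexpected first column: {header[0]}')
        for row in reader:
            embeddings[row[0]] = [float(value) for value in row[1:]]
    return embeddings
```

If a file contains repeated IDs, later rows overwrite earlier ones in this sketch; keep a list of rows instead if duplicates need to be preserved.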

Load the protein language model.

Protein Language Model | Constructor
-- | --
SeqVec | `SeqVecEmbedder`
ESM | `ESMEmbedder`
ESM-1b | `ESM1bEmbedder`
ProtBert | `ProtTransBertBFDEmbedder`
ProtXLNet | `ProtTransXLNetUniRef100Embedder`
ProtAlbert | `ProtTransAlbertBFDEmbedder`
ProtT5 | `ProtTransT5XLU50Embedder`
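
When switching between models, the table above can be kept as a lookup so that the matching embedding function is chosen consistently. This is only a sketch: the names `EMBEDDER_CLASSES`, `ESM_MODELS`, `needs_esm_function`, and `load_embedder` are hypothetical, and `load_embedder` assumes `bio_embeddings` is installed.

```python
import importlib

# Model name -> embedder class name in bio_embeddings.embed (from the table above)
EMBEDDER_CLASSES = {
    'SeqVec': 'SeqVecEmbedder',
    'ESM': 'ESMEmbedder',
    'ESM-1b': 'ESM1bEmbedder',
    'ProtBert': 'ProtTransBertBFDEmbedder',
    'ProtXLNet': 'ProtTransXLNetUniRef100Embedder',
    'ProtAlbert': 'ProtTransAlbertBFDEmbedder',
    'ProtT5': 'ProtTransT5XLU50Embedder',
}

# Models that need compute_protein_embeddings_esm
ESM_MODELS = {'ESM', 'ESM-1b'}


def needs_esm_function(model_name):
    """True if compute_protein_embeddings_esm should be used for this model."""
    if model_name not in EMBEDDER_CLASSES:
        raise KeyError(f'Unknown model: {model_name}')
    return model_name in ESM_MODELS


def load_embedder(model_name):
    """Instantiate the embedder for a model (requires bio_embeddings)."""
    module = importlib.import_module('bio_embeddings.embed')
    return getattr(module, EMBEDDER_CLASSES[model_name])()
```

For example, `needs_esm_function('ESM-1b')` returns `True`, while `needs_esm_function('ProtBert')` returns `False`.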

In [None]:
embedder = ProtTransBertBFDEmbedder()

Supply the directory names:
- `HYPOTHETICAL_FASTA_DIR`: Directory where the FASTA files containing the protein sequences are located
- `HYPOTHETICAL_EMBEDDINGS_DIR`: Directory where the CSV files containing the embeddings are to be saved

In [None]:
HYPOTHETICAL_FASTA_DIR = f''
HYPOTHETICAL_EMBEDDINGS_DIR = f''

Load the FASTA files containing the protein sequences to be embedded.

In [None]:
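import os

# os.listdir returns entries in arbitrary order; sort so that IDX_RESUME
# refers to the same file across sessions
hypothetical_fasta_files = sorted(os.listdir(HYPOTHETICAL_FASTA_DIR))

len(hypothetical_fasta_files)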
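
The generation loop strips the last six characters of each filename, which assumes every entry in the directory ends with `.fasta`. If the directory may contain other entries (for example, hidden files or subfolders), a filter along these lines could be applied first. The helper name `list_fasta_files` is hypothetical.

```python
import os


def list_fasta_files(directory, extension='.fasta'):
    """Return the sorted names of regular files in directory with the given extension."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(extension) and os.path.isfile(os.path.join(directory, name))
    )
```

Sorting also keeps the order stable between sessions, which matters when resuming from a saved index.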

Generate the protein embeddings.

**⚠️ IMPORTANT**: If the embedder is ESM or ESM-1b, call `compute_protein_embeddings_esm` instead of `compute_protein_embeddings`.

In [None]:
IDX_RESUME = 0    # Adjust as needed (e.g., resuming after Google Colab hangs or times out)

for idx, hypothetical_file in enumerate(hypothetical_fasta_files[IDX_RESUME:], start=IDX_RESUME):
    # [:-6] strips the ".fasta" extension (six characters) to form the output filename prefix
    compute_protein_embeddings(embedder, f'{HYPOTHETICAL_FASTA_DIR}/{hypothetical_file}',
                               HYPOTHETICAL_EMBEDDINGS_DIR,
                               f'/{hypothetical_file[:-6]}')

    # Display progress; to resume after an interruption, set IDX_RESUME to the last printed index + 1
    print(idx, ":", hypothetical_file)
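
As an alternative to tracking the resume index by hand, already-processed files can be detected from the outputs themselves. The sketch below assumes the output naming used in this notebook (`<name>-embeddings.csv` inside the embeddings directory); the helper name `pending_fasta_files` is hypothetical.

```python
import os


def pending_fasta_files(fasta_files, embeddings_dir):
    """Return the FASTA files whose '<name>-embeddings.csv' does not exist yet."""
    pending = []
    for fasta_file in fasta_files:
        name = os.path.splitext(fasta_file)[0]
        output_path = os.path.join(embeddings_dir, f'{name}-embeddings.csv')
        if not os.path.exists(output_path):
            pending.append(fasta_file)
    return pending
```

Iterating over `pending_fasta_files(hypothetical_fasta_files, HYPOTHETICAL_EMBEDDINGS_DIR)` would then skip completed files. A CSV left incomplete by an interrupted write would still count as done here, so it may be worth deleting the most recent output before resuming.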