# Protein embeddings improve phage-host interaction prediction

**Mark Edward M. Gonzales<sup>1, 2</sup> & Anish M.S. Shrestha<sup>1, 2</sup>**

<sup>1</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines <br>
<sup>2</sup> Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines

{mark_gonzales, anish.shrestha}@dlsu.edu.ph

<hr>

## 📁 Output Files
To skip running this notebook, download the precomputed protein embeddings from this [Google Drive](https://drive.google.com/drive/folders/1deenrDQIr3xcl9QCYH-nPhmpY8x2drQw?usp=sharing) and save them inside the `inphared` directory. The folder structure should look like this:

`inphared` <br>
↳ `embeddings` <br>
&nbsp; &nbsp; ↳ `esm` <br>
&nbsp; &nbsp; ↳ `esm1b` <br>
&nbsp; &nbsp; ↳ ...
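
The sketch below shows one way to read such a file back without extra dependencies. It assumes the downloaded CSVs share the layout written later in this notebook (an `ID` column followed by one numeric column per embedding dimension); the helper `load_embeddings_csv` and the demo file are hypothetical and only illustrate the format.

```python
import csv
import os
import tempfile


def load_embeddings_csv(path):
    """Return {protein_id: [float, ...]} from a CSV that has an 'ID' column."""
    embeddings = {}
    with open(path, newline='') as handle:
        reader = csv.reader(handle)
        header = next(reader)
        id_col = header.index('ID')
        for row in reader:
            # Every column other than 'ID' is treated as one embedding dimension
            values = [float(v) for i, v in enumerate(row) if i != id_col]
            embeddings[row[id_col]] = values
    return embeddings


# Hypothetical two-protein file with 3-dimensional vectors, for illustration only
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo-embeddings.csv')
with open(demo_path, 'w', newline='') as handle:
    writer = csv.writer(handle)
    writer.writerow(['ID', '0', '1', '2'])
    writer.writerow(['protein_A', 0.1, 0.2, 0.3])
    writer.writerow(['protein_B', 0.4, 0.5, 0.6])

demo_embeddings = load_embeddings_csv(demo_path)
print(len(demo_embeddings), len(demo_embeddings['protein_A']))
```

With pandas available, `pd.read_csv(path)` followed by `set_index('ID')` would give the same information as a DataFrame.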

<hr>

## ⚠️ Memory Requirement of Protein Embeddings

Loading a pretrained protein language model and computing embeddings with it can require more memory than some local machines have. We recommend running this notebook on [Google Colab](https://colab.research.google.com/) or another cloud-based service with a GPU.

<hr>

# Part I: Preliminaries

Install and import the necessary libraries and modules.

In [None]:
!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null
!pip install scikit_learn==1.0.2
!pip install pyyaml==5.4.1

In [None]:
import numpy as np
import pandas as pd
import glob

from tqdm import tqdm
from datetime import date

from Bio import SeqIO
from bio_embeddings.embed import ProtTransBertBFDEmbedder

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

<hr>

# Part II: Generation of Protein Embeddings

In [None]:
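def compute_protein_embeddings(embedder, fasta_file, results_dir, prefix=''):
    """Embed every protein in `fasta_file` and save one per-protein vector per row as a CSV."""
    # Parse the FASTA file once so that IDs and sequences stay in the same order
    records = list(SeqIO.parse(fasta_file, 'fasta'))
    names = [record.id for record in records]
    sequences = [str(record.seq) for record in records]

    # Embed each sequence per residue, then reduce it to a single fixed-length vector
    embeddings = [embedder.reduce_per_protein(embedder.embed(sequence)) for sequence in tqdm(sequences)]
    embeddings_df = pd.concat([pd.DataFrame({'ID': names}), pd.DataFrame(embeddings)], axis=1)
    embeddings_df.to_csv(results_dir + prefix + '-embeddings.csv', index=False)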


def compute_protein_embeddings_esm(embedder, fasta_file, results_dir, prefix='', max_len=1022):
    """Variant for ESM embedders, which accept at most `max_len` residues per input."""
    names = []
    embeddings = []

    for record in SeqIO.parse(fasta_file, 'fasta'):
        names.append(record.id)
        sequence = str(record.seq)
        if len(sequence) <= max_len:
            embedding = embedder.reduce_per_protein(embedder.embed(sequence))
        else:
            # Embed consecutive windows of at most max_len residues (also covers
            # sequences longer than 2 * max_len), then reduce over all residues
            chunks = [embedder.embed(sequence[i:i + max_len])
                      for i in range(0, len(sequence), max_len)]
            embedding = embedder.reduce_per_protein(np.concatenate(chunks))

        embeddings.append(embedding)

    embeddings_df = pd.concat([pd.DataFrame({'ID': names}), pd.DataFrame(embeddings)], axis=1)
    embeddings_df.to_csv(results_dir + prefix + '-embeddings.csv', index=False)
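
To make the windowing logic in the ESM variant concrete, here is a minimal, dependency-free sketch. It assumes that the per-protein reduction is a mean over residue vectors and that the embedder accepts at most 1022 residues per input; `toy_embed` is a hypothetical stand-in for `embedder.embed`, and the other helper names are illustrative as well.

```python
def split_into_windows(sequence, max_len=1022):
    """Split a sequence into consecutive windows of at most max_len residues."""
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


def mean_pool(per_residue_vectors):
    """Average per-residue vectors into a single fixed-length vector."""
    n = len(per_residue_vectors)
    dim = len(per_residue_vectors[0])
    return [sum(vec[d] for vec in per_residue_vectors) / n for d in range(dim)]


def toy_embed(window):
    """Hypothetical stand-in for embedder.embed: one 2-d vector per residue."""
    return [[float(ord(residue) % 7), 1.0] for residue in window]


def embed_long_sequence(sequence, embed=toy_embed, max_len=1022):
    """Embed each window separately, then pool over all residues at once."""
    per_residue = []
    for window in split_into_windows(sequence, max_len):
        per_residue.extend(embed(window))
    return mean_pool(per_residue)


long_sequence = 'MKV' * 1000  # 3000 residues, longer than two windows
windows = split_into_windows(long_sequence)
print([len(w) for w in windows])                # [1022, 1022, 956]
print(len(embed_long_sequence(long_sequence)))  # 2
```

Pooling over the concatenated residues weights every residue equally, whereas averaging the per-window means would overweight residues in a short final window. Note that each window is embedded without context from its neighbors, so this is an approximation of embedding the full sequence.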

Load the protein language model. To load other protein language models, refer to the documentation of the [`bio_embeddings`](https://docs.bioembeddings.com/v0.2.3/api/bio_embeddings.embed.html) package.

In [None]:
embedder = ProtTransBertBFDEmbedder()

Supply the directory names:
- `HYPOTHETICAL_FASTA_DIR`: Directory where the FASTA files containing the protein sequences are located
- `HYPOTHETICAL_EMBEDDINGS_DIR`: Directory where the CSV files containing the embeddings are to be saved

In [None]:
HYPOTHETICAL_FASTA_DIR = ''
HYPOTHETICAL_EMBEDDINGS_DIR = ''

Load the FASTA files containing the protein sequences to be embedded.

In [None]:
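import os

# os.listdir returns entries in arbitrary order; sort them so that IDX_RESUME refers
# to the same file across sessions, and keep only files with the .fasta extension
hypothetical_fasta_files = sorted(f for f in os.listdir(HYPOTHETICAL_FASTA_DIR) if f.endswith('.fasta'))

len(hypothetical_fasta_files)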

Generate the protein embeddings.

In [None]:
IDX_RESUME = 0    # Adjust as needed (e.g., resuming after Google Colab hangs or times out)

for idx, hypothetical_file in enumerate(hypothetical_fasta_files[IDX_RESUME:], start=IDX_RESUME):
    # Strip the ".fasta" extension (six characters) to build the output file prefix
    compute_protein_embeddings(embedder, f'{HYPOTHETICAL_FASTA_DIR}/{hypothetical_file}',
                               HYPOTHETICAL_EMBEDDINGS_DIR,
                               f'/{hypothetical_file[:-6]}')

    # Display progress: index and name of the file that was just processed
    print(idx, ":", hypothetical_file)
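
As an alternative to tracking the resume index by hand, the sketch below selects only those FASTA files that do not yet have a matching `-embeddings.csv` output. It is not part of the original workflow: the helper `pending_fasta_files` and the demo file names are hypothetical, and it assumes the output naming used above (`<name>-embeddings.csv` for `<name>.fasta`).

```python
import os
import tempfile


def pending_fasta_files(fasta_dir, embeddings_dir, suffix='.fasta'):
    """List FASTA files that do not yet have a matching '-embeddings.csv'."""
    pending = []
    for name in sorted(os.listdir(fasta_dir)):
        if not name.endswith(suffix):
            continue  # Ignore anything that is not a FASTA file
        stem = name[:-len(suffix)]
        output_path = os.path.join(embeddings_dir, stem + '-embeddings.csv')
        if not os.path.exists(output_path):
            pending.append(name)
    return pending


# Hypothetical demo: two FASTA files, one of which already has its embeddings
fasta_dir = tempfile.mkdtemp()
embeddings_dir = tempfile.mkdtemp()
for name in ['phage1.fasta', 'phage2.fasta', 'notes.txt']:
    open(os.path.join(fasta_dir, name), 'w').close()
open(os.path.join(embeddings_dir, 'phage1-embeddings.csv'), 'w').close()

print(pending_fasta_files(fasta_dir, embeddings_dir))  # ['phage2.fasta']
```

One caveat: a session that dies while a CSV is being written can leave a partial file behind, which this check would treat as finished. Deleting the most recently modified output before resuming is a simple safeguard.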