# Retrieving PPIs

The package lets you search the Protein Data Bank (PDB) for protein-protein interactions (PPIs) similar to a query PPI. The search can be based on either the interface structure or the protein sequence of interest.

In [2]:
from ppiref.comparison import IDist
from ppiref.retrieval import MMSeqs2PPIRetriever
from ppiref.definitions import PPIREF_DATA_DIR, PPIREF_TEST_DATA_DIR
import pandas as pd

# Suppress Graphein log
from loguru import logger
logger.disable('graphein')

In this example, we use near-duplicate homooligomeric PPIs that involve different sequences (taken from Figure 3 of the ["Revealing data leakage in protein interaction benchmarks"](https://arxiv.org/abs/2404.10457) paper). We search the PDB for PPIs similar to one of the entries (1k3f), aiming to retrieve the other one (1k9s).

<p align="center">
  <img width="350" src="./_static/images/1k3f_1k9s.png"/>
</p>

Fast search requires precomputed data: iDist embeddings for the interface search and an MMseqs2 database for the sequence search. Therefore, we first download `ppi_6A_stats.zip`.

In [3]:
from ppiref.utils.misc import download_from_zenodo
download_from_zenodo('ppi_6A_stats.zip')

Downloading: 100%|██████████| 3.10G/3.10G [05:03<00:00, 10.2MiB/s]
Extracting: 100%|██████████| 15/15 [00:29<00:00,  1.98s/files]


## By similar interface structure

One can find PPI interfaces in the PDB that are structurally similar to the query PPI by using the precomputed iDist embeddings. Under the hood, iDist builds an `sklearn` index over all PPI embeddings and uses it to find the neighbors of the query embedding within the near-duplicate radius (0.04 by default, validated for 6 Å interfaces).
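
The radius search described above can be sketched in plain Python. This is only an illustration, not the package's implementation: it replaces the `sklearn` index with a brute-force scan, assumes Euclidean distance, and the function name `near_duplicates`, the two-dimensional toy embeddings, and their IDs are all hypothetical.

```python
import math

def near_duplicates(embeddings, ids, query, radius=0.04):
    """Return (distance, id) pairs within `radius` of `query`, nearest first.

    Brute-force stand-in for an index-based radius query; Euclidean
    distance is an assumption made for this sketch.
    """
    hits = []
    for ppi_id, emb in zip(ids, embeddings):
        d = math.dist(query, emb)
        if d <= radius:
            hits.append((d, ppi_id))
    return sorted(hits)

# Hypothetical toy data: three points close to the origin, one far away
toy_ids = ["ppi_a", "ppi_b", "ppi_c", "ppi_d"]
toy_embeddings = [(0.0, 0.0), (0.01, 0.0), (0.03, 0.0), (1.0, 1.0)]

toy_hits = near_duplicates(toy_embeddings, toy_ids, query=(0.0, 0.0))
for dist, ppi_id in toy_hits:
    print(f"{ppi_id}\t{dist:.3f}")
```

Only the points inside the radius are returned, so the number of hits depends on the data rather than on a fixed `k`.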

In [4]:
# Initialize IDist and read embeddings for all PPI interfaces in PPIRef (i.e., all PPIs in PDB)
idist = IDist()
idist.read_embeddings(PPIREF_DATA_DIR / 'ppiref/ppi_6A_stats/idist_emb.csv', dropna=True)

# Embed your query PPI interface
ppi_dir = PPIREF_TEST_DATA_DIR / 'ppi_dir'
query_ppi_path = ppi_dir / 'k3/1k3f_C_E.pdb'
query_embedding = idist.embed(query_ppi_path, store=False)

# Query for near-duplicate PPIs and show the 10 most similar ones
dists, ppi_ids = idist.query(query_embedding)
df_idist = pd.DataFrame({'PPI': ppi_ids, 'iDist': dists}).head(10)
df_idist

| | PPI | iDist |
|---|---|---|
| 0 | 1k3f_C_E | 0.0 |
| 1 | 1k3f_A_D | 0.019316 |
| 2 | 1u1g_C_D | 0.029032 |
| 3 | 1sj9_A_F | 0.029668 |
| 4 | 8a7d_C_Q | 0.029722 |
| 5 | 5efo_A_B | 0.029956 |
| 6 | 2hrd_A_F | 0.030052 |
| 7 | 1sj9_B_D | 0.030148 |
| 8 | 1u1e_C_D | 0.030332 |
| 9 | 1u1d_C_D | 0.030373 |


iDist retrieves 1k9s as a near duplicate of 1k3f based on the interface structure.

In [5]:
'1k9s' in [x.split('_')[0] for x in ppi_ids]

True
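
The check above relies on PPI IDs having the form `<PDB ID>_<chain>_<chain>`, as in the results listed in this section. As a small sketch, retrieved IDs could be grouped by PDB entry while skipping the query's own entry. The helper `group_by_pdb` is hypothetical and not part of the package; the IDs are a subset of the results shown above.

```python
from collections import defaultdict

def group_by_pdb(ppi_ids, exclude=None):
    """Map each PDB ID to its list of (chain, chain) pairs.

    Hypothetical helper: assumes IDs look like '<pdb>_<chain>_<chain>'.
    Entries whose PDB ID equals `exclude` are skipped.
    """
    groups = defaultdict(list)
    for ppi_id in ppi_ids:
        pdb_id, chain_1, chain_2 = ppi_id.split("_")
        if pdb_id != exclude:
            groups[pdb_id].append((chain_1, chain_2))
    return dict(groups)

hits = ["1k3f_C_E", "1k3f_A_D", "1u1g_C_D", "1sj9_A_F", "1sj9_B_D"]
by_entry = group_by_pdb(hits, exclude="1k3f")
print(by_entry)
```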

## By similar sequence

One can also find PPIs in the PDB that involve sequences similar to the one of interest, using the prepared [MMseqs2](https://github.com/soedinglab/mmseqs2) database. Install MMseqs2 according to the official documentation; you can then use the wrapper as shown below. Under the hood, the wrapper runs `mmseqs easy-search` with default parameters.
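
For orientation, `mmseqs easy-search` takes a query FASTA file, a target database, an output file, and a temporary directory, and by default writes tab-separated hits in a BLAST-like format. The sketch below builds such a command (without running it) and parses one output line. It is an assumption about what a wrapper might do, not the code of `MMSeqs2PPIRetriever`; the helper names, file paths, and the sample line are hypothetical.

```python
# Default column order of the tab-separated easy-search output
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]

def easy_search_command(query_fasta, target_db, out_path, tmp_dir="tmp"):
    """Build the argument list for an easy-search call (not executed here)."""
    return ["mmseqs", "easy-search",
            str(query_fasta), str(target_db), str(out_path), str(tmp_dir)]

def parse_m8_line(line):
    """Parse one tab-separated hit into a dict; identity is cast to float."""
    hit = dict(zip(M8_COLUMNS, line.rstrip("\n").split("\t")))
    hit["fident"] = float(hit["fident"])
    return hit

# Hypothetical paths; pass the list to subprocess.run(cmd, check=True) to execute
cmd = easy_search_command("query.fasta", "mmseqs_db/db", "hits.m8")
print(" ".join(cmd))

# Made-up sample line, for illustration only
sample = "query_1\ttarget_1\t1.000\t253\t0\t0\t1\t253\t1\t253\t1.0E-150\t500"
hit = parse_m8_line(sample)
print(hit["target"], hit["fident"])
```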

In [6]:
# Initialize the wrapper around the MMseqs2 database that stores all sequences from PPIRef
mmseqs2 = MMSeqs2PPIRetriever(PPIREF_DATA_DIR / 'ppiref/ppi_6A_stats/mmseqs_db/db')

# Prepare your fasta file (for example by downloading from Uniprot)
query_path = PPIREF_TEST_DATA_DIR / 'misc/1k3f.fasta'

# Query the MMseqs2 database for PPIs involving sequences similar to the query sequence
# (returns sequence similarities, PPI IDs, and the partner chains matching the query sequence)
seq_sims, ppi_ids, partners = mmseqs2.query(query_path)
df_mmseqs2 = pd.DataFrame({'PPI': ppi_ids, 'Sequence similarity': seq_sims, 'Chain': partners})
df_mmseqs2.head(10)

| | PPI | Sequence similarity | Chain |
|---|---|---|---|
| 0 | 1u1c_A_B | 1.0 | A |
| 1 | 1u1c_A_C | 1.0 | A |
| 2 | 1rxs_M_m | 1.0 | m |
| 3 | 1rxs_N_m | 1.0 | m |
| 4 | 1rxu_E_F | 1.0 | F |
| 5 | 1rxu_A_F | 1.0 | F |
| 6 | 1u1e_C_D | 1.0 | D |
| 7 | 1u1e_D_E | 1.0 | D |
| 8 | 1rxs_M_o | 1.0 | o |
| 9 | 1rxs_O_o | 1.0 | o |


Because 1k3f and 1k9s share low sequence identity, the sequence search does not retrieve 1k9s as a near duplicate of 1k3f.

In [7]:
'1k9s' in [x.split('_')[0] for x in ppi_ids]

False
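
To see how the two searches complement each other, their results can be compared as sets. The sketch below uses the ten PPI IDs listed for each search on this page; in the notebook, `df_idist['PPI']` and `df_mmseqs2['PPI']` would be the natural inputs. The variable names are illustrative.

```python
# Top-10 PPI IDs copied from the two result listings above
idist_top10 = [
    "1k3f_C_E", "1k3f_A_D", "1u1g_C_D", "1sj9_A_F", "8a7d_C_Q",
    "5efo_A_B", "2hrd_A_F", "1sj9_B_D", "1u1e_C_D", "1u1d_C_D",
]
mmseqs_top10 = [
    "1u1c_A_B", "1u1c_A_C", "1rxs_M_m", "1rxs_N_m", "1rxu_E_F",
    "1rxu_A_F", "1u1e_C_D", "1u1e_D_E", "1rxs_M_o", "1rxs_O_o",
]

mmseqs_set = set(mmseqs_top10)

# PPIs found by both searches
shared = sorted(set(idist_top10) & mmseqs_set)

# PPIs found only by the interface search, kept in iDist order
structure_only = [ppi for ppi in idist_top10 if ppi not in mmseqs_set]

print("Found by both:", shared)
print("Found by interface search only:", len(structure_only))
```

The two top-10 lists barely overlap, which is consistent with the searches capturing different notions of similarity.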