# Deduplicating PPIs

iDist provides a fast way to perform pairwise comparison of large sets of protein-protein interactions (PPIs). The method can therefore be used to deduplicate PPI datasets, which may be crucial for removing redundancy in the data and avoiding bias in downstream analyses or machine learning.

In [2]:
from ppiref.comparison import IDist
from ppiref.definitions import PPIREF_TEST_DATA_DIR

# Suppress Graphein log
from loguru import logger
logger.disable('graphein')

In this example, we will reuse the near-duplicate PPIs from the previous tutorial "Comparing PPIs" (taken from Figure 1 in the ["Learning to design protein-protein interactions with enhanced generalization"](https://arxiv.org/pdf/2310.18515.pdf) paper).

<p align="center">
  <img width="350" src="./_static/images/1p7z_3p9r.png"/>
</p>

In [3]:
ppis = [
    PPIREF_TEST_DATA_DIR / 'ppi/1p7z_A_C.pdb',
    PPIREF_TEST_DATA_DIR / 'ppi/3p9r_B_D.pdb',
]

Since iDist is based on a simple vectorization of PPIs, PPIs can be deduplicated by thresholding distances in the embedding space. The validated threshold used by default is 0.04, which is suitable for 6 Å heavy-atom interfaces. As a first step, we embed the PPIs using iDist.

In [4]:
idist = IDist(max_workers=2)

idist.embed(ppis[0])
idist.embed(ppis[1])
# Or alternatively for large sets:
# idist.embed_parallel(ppis)

idist.embeddings

{'1p7z_A_C': array([0.01731814, 0.00347079, 0.03816472, 0.04164083, 0.02777858,
        0.03990269, 0.02775753, 0.03471655, 0.01386247, 0.03121286,
        0.00694047, 0.03820259, 0.04857357, 0.0173353 , 0.03992103,
        0.02428673, 0.02429712, 0.01386962, 0.00347335, 0.00866032]),
 '3p9r_B_D': array([0.01904621, 0.00347081, 0.03816642, 0.04163909, 0.02777779,
        0.04163979, 0.02775733, 0.03471782, 0.01386243, 0.03121038,
        0.00694084, 0.03820353, 0.04857332, 0.01733598, 0.03818163,
        0.02428163, 0.02430092, 0.01386946, 0.00347302, 0.00693259])}
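To see why these two PPIs count as near-duplicates, the distance between the two embeddings printed above can be checked by hand. The following is a minimal sketch using only the Python standard library. It assumes that the embeddings are compared with the Euclidean (L2) distance, which is an assumption made for illustration; the metric used internally by iDist may differ. The variable names are hypothetical.

```python
import math

# Embeddings copied from the output above
emb_1p7z = [0.01731814, 0.00347079, 0.03816472, 0.04164083, 0.02777858,
            0.03990269, 0.02775753, 0.03471655, 0.01386247, 0.03121286,
            0.00694047, 0.03820259, 0.04857357, 0.0173353, 0.03992103,
            0.02428673, 0.02429712, 0.01386962, 0.00347335, 0.00866032]
emb_3p9r = [0.01904621, 0.00347081, 0.03816642, 0.04163909, 0.02777779,
            0.04163979, 0.02775733, 0.03471782, 0.01386243, 0.03121038,
            0.00694084, 0.03820353, 0.04857332, 0.01733598, 0.03818163,
            0.02428163, 0.02430092, 0.01386946, 0.00347302, 0.00693259]

# Assumption: Euclidean distance (iDist may use a different metric internally)
distance = math.dist(emb_1p7z, emb_3p9r)

# Default threshold quoted in the text above
threshold = 0.04

print(f'distance = {distance:.4f}, near-duplicate: {distance < threshold}')
```

Under this assumption the distance is about 0.0035, roughly an order of magnitude below the 0.04 threshold, so the pair would be treated as near-duplicates.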

Next, we deduplicate the PPIs. The method removes every PPI whose embedding is closer than the threshold to the embedding of another PPI. Each pair is compared in one direction only (i.e., `a<->b` but not `b<->a`), so that representative PPIs are kept instead of both members of a near-duplicate pair being removed.

In [5]:
idist.deduplicate_embeddings()
idist.embeddings

Processing adjacency chunks: 100%|██████████| 1/1 [00:00<00:00, 487.09it/s]


{'3p9r_B_D': array([0.01904621, 0.00347081, 0.03816642, 0.04163909, 0.02777779,
        0.04163979, 0.02775733, 0.03471782, 0.01386243, 0.03121038,
        0.00694084, 0.03820353, 0.04857332, 0.01733598, 0.03818163,
        0.02428163, 0.02430092, 0.01386946, 0.00347302, 0.00693259])}
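The one-sided thresholding described above can be illustrated with a small standalone sketch. This is not the implementation used by `IDist`; it is a simplified illustration on toy data. The function name `deduplicate_sketch`, the toy embeddings, the use of the Euclidean distance, and the choice to drop the earlier member of each close pair (consistent with the output above, where `1p7z_A_C` is removed and `3p9r_B_D` is kept) are all assumptions.

```python
import math

def deduplicate_sketch(embeddings, threshold=0.04):
    """Illustrative one-sided deduplication (not the IDist implementation).

    Each pair is visited once (earlier key vs. later key). If the two
    embeddings are closer than `threshold`, the earlier key is dropped,
    so the later member of the pair survives as a representative.
    """
    keys = list(embeddings)
    removed = set()
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            # Assumption: Euclidean distance between embedding vectors
            if math.dist(embeddings[a], embeddings[b]) < threshold:
                removed.add(a)
                break
    return {k: v for k, v in embeddings.items() if k not in removed}

# Hypothetical toy embeddings: ppi_a and ppi_b are near-duplicates
toy = {
    'ppi_a': [0.00, 0.00],
    'ppi_b': [0.01, 0.00],
    'ppi_c': [0.50, 0.50],
}

kept = deduplicate_sketch(toy)
print(list(kept))
```

Because every pair is checked in one direction only, exactly one member of the near-duplicate pair (`ppi_b`) remains, together with the unrelated `ppi_c`. This quadratic loop is only meant to convey the idea and would be slow on large datasets.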