# Extracting PPIs

The ``ppiref.extraction.PPIExtractor`` class enables extracting protein-protein interactions (PPIs) from PDB files based on inter-atomic distances.

In [2]:
from ppiref.extraction import PPIExtractor
from ppiref.definitions import PPIREF_TEST_DATA_DIR

Prepare a .pdb file. In this example, we will use the [1bui.pdb](https://www.rcsb.org/structure/1bui) file from the Protein Data Bank, which contains three interacting proteins: staphylokinase (chain C, pink), microplasmin (chain A, blue), and microplasmin (chain B, green). We will then extract different types of protein-protein interfaces from this file.

<p align="center">
  <img align="center" width="500" src="./_static/images/1bui.png"/>
</p>

In [3]:
pdb_file = PPIREF_TEST_DATA_DIR / 'pdb/1bui.pdb'

Initialize a PPI extractor based on 10 Å contacts between heavy atoms. Additionally, calculate the buried surface area (BSA) of the PPIs (slow).

In [4]:
ppi_dir = PPIREF_TEST_DATA_DIR / 'ppi_dir'
extractor = PPIExtractor(
    out_dir=ppi_dir,
    kind='heavy',
    radius=10.,
    bsa=True  # buried surface area calculation is slow
)
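To illustrate what a distance-based contact between heavy atoms means, here is a minimal standard-library sketch. It is not the `PPIExtractor` implementation: the toy atom records, the `heavy_atom_contacts` helper, and the inclusive `<=` comparison are all assumptions made for illustration.

```python
import math

# Hypothetical toy atoms: (chain, element, (x, y, z)), coordinates in angstroms.
atoms = [
    ("A", "C", (0.0, 0.0, 0.0)),
    ("A", "H", (1.0, 0.0, 0.0)),   # hydrogen, ignored as a non-heavy atom
    ("C", "N", (3.0, 4.0, 0.0)),   # 5 angstroms from the first atom of chain A
    ("C", "O", (30.0, 0.0, 0.0)),  # too far from chain A to be in contact
]


def heavy_atom_contacts(atoms, chain_1, chain_2, radius):
    """Return pairs of heavy atoms (one from each chain) at most `radius` apart."""
    heavy_1 = [a for a in atoms if a[0] == chain_1 and a[1] != "H"]
    heavy_2 = [a for a in atoms if a[0] == chain_2 and a[1] != "H"]
    return [
        (a, b)
        for a in heavy_1
        for b in heavy_2
        if math.dist(a[2], b[2]) <= radius  # assumed inclusive cutoff
    ]


contacts = heavy_atom_contacts(atoms, "A", "C", radius=10.0)
print(len(contacts))
```

In this toy setup, only the carbon of chain A and the nitrogen of chain C qualify, so the two chains would be considered interacting under a 10 Å heavy-atom criterion.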

**Example 1.** Extract all contact-based dimeric PPIs from a PDB file. This will extract three interfaces: A-C, A-B, and B-C.

In [5]:
extractor.extract(pdb_file)
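The three interfaces mentioned above correspond to the unordered pairs of the three chains. A short sketch of that enumeration follows; it only lists candidate pairs and makes no claim about how `PPIExtractor` iterates over chains internally. A pair yields an interface only if the two chains are actually in contact.

```python
from itertools import combinations

# Chains present in 1bui.pdb, as described above.
chains = ["A", "B", "C"]

# Every unordered pair of distinct chains is a candidate dimeric interface.
candidate_pairs = list(combinations(chains, 2))
print(candidate_pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```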

**Example 2.** Extract all contact-based dimeric PPIs between a subset of chains from a PDB file. Here the result is the same as in Example 1, but restricting the partners is useful for complexes containing more chains.

In [6]:
extractor.extract(pdb_file, partners=['A', 'B', 'C'])

**Example 3.** Extract a contact-based PPI between two specified chains (dimer).

In [7]:
extractor.extract(pdb_file, partners=['A', 'C'])

**Example 4.** Extract a contact-based PPI between three specified chains (trimer).

In [8]:
ppi_dir = PPIREF_TEST_DATA_DIR / 'ppi_dir'
extractor = PPIExtractor(
    out_dir=ppi_dir,
    join=True  # enables joining all pairwise dimeric interfaces into a single oligomeric interface
)
extractor.extract(pdb_file, partners=['A', 'B', 'C'])
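One way to picture the effect of `join=True` is as a union of the pairwise interfaces. The sketch below assumes this set-union reading; the residue sets are invented toy data, and the actual joining logic in the library may differ.

```python
# Hypothetical interface residues per dimer, as (chain, residue number) pairs.
dimer_interfaces = {
    ("A", "B"): {("A", 10), ("B", 55)},
    ("A", "C"): {("A", 10), ("A", 42), ("C", 7)},
    ("B", "C"): {("B", 55), ("C", 7), ("C", 8)},
}

# Joining, under this assumption, keeps every residue that takes part
# in at least one pairwise interface, without duplicates.
joined = set().union(*dimer_interfaces.values())
print(sorted(joined))
```

Residues shared by several dimeric interfaces, such as `("A", 10)` here, appear only once in the joined interface.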

**Example 5.** Extract a complete dimer complex by setting a very large expansion radius around the interface (for illustration purposes only).

In [9]:
ppi_complexes_dir = PPIREF_TEST_DATA_DIR / 'ppi_dir_complexes'
extractor_complexes = PPIExtractor(
    out_dir=ppi_complexes_dir,
    kind='heavy',
    radius=6.,
    expansion_radius=1_000_000.
)
extractor_complexes.extract(pdb_file, partners=['A', 'C'])

**Example 6.** Extract all PPIs from all .pdb files in a directory in parallel.

In [10]:
extractor = PPIExtractor(out_dir=ppi_dir, max_workers=2)
pdb_dir = PPIREF_TEST_DATA_DIR / 'pdb'
extractor.extract_parallel(pdb_dir)

Collecting input files: 100%|██████████| 8/8 [00:00<00:00, 3921.28it/s]
Filtering input files with pattern '.*\.pdb: 100%|██████████| 8/8 [00:00<00:00, 11052.18it/s]
Filtering processed files: 100%|██████████| 8/8 [00:00<00:00, 85163.53it/s]
  0%|          | 0/1 [00:00<?, ?it/s]


100%|██████████| 1/1 [00:06<00:00,  6.26s/it]


Print all the extracted files.

In [11]:
for path in ppi_dir.rglob('*.pdb'):
    print(path.relative_to(ppi_dir.parent))

ppi_dir/k3/1k3f_B_D.pdb
ppi_dir/k3/1k3f_B_F.pdb
ppi_dir/k3/1k3f_D_E.pdb
ppi_dir/k3/1k3f_B_C.pdb
ppi_dir/k3/1k3f_D_F.pdb
ppi_dir/k3/1k3f_C_F.pdb
ppi_dir/k3/1k3f_A_D.pdb
ppi_dir/k3/1k3f_A_E.pdb
ppi_dir/k3/1k3f_C_E.pdb
ppi_dir/k3/1k3f_A_B.pdb
ppi_dir/k3/1k3f_E_F.pdb
ppi_dir/k3/1k3f_A_C.pdb
ppi_dir/p7/1p7z_B_D.pdb
ppi_dir/p7/1p7z_B_C.pdb
ppi_dir/p7/1p7z_C_D.pdb
ppi_dir/p7/1p7z_A_D.pdb
ppi_dir/p7/1p7z_A_B.pdb
ppi_dir/p9/3p9r_B_C.pdb
ppi_dir/p9/3p9r_A_C.pdb
ppi_dir/p9/3p9r_A_B.pdb
ppi_dir/p9/3p9r_C_D.pdb
ppi_dir/p9/3p9r_A_D.pdb
ppi_dir/0g/10gs_A_B.pdb
ppi_dir/a0/1a0n_A_B.pdb
ppi_dir/a0/1a02_F_J.pdb
ppi_dir/a0/1a02_F_N.pdb
ppi_dir/a0/1a02_J_N.pdb
ppi_dir/ah/1ahw_A_C.pdb
ppi_dir/ah/1ahw_E_F.pdb
ppi_dir/ah/1ahw_A_B.pdb
ppi_dir/ah/1ahw_A_F.pdb
ppi_dir/ah/1ahw_D_F.pdb
ppi_dir/ah/1ahw_B_C.pdb
ppi_dir/ah/1ahw_D_E.pdb
ppi_dir/bu/1bui_A_B_C.pdb
ppi_dir/bu/1bui_A_C.pdb
ppi_dir/bu/1bui_A_B.pdb
ppi_dir/bu/1bui_B_C.pdb
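
The file names in the listing above follow the pattern `<pdb id>_<chain>_<chain>[...].pdb`, and each subdirectory matches the two middle characters of the PDB ID. Assuming this naming convention holds in general, the extracted files can be grouped by structure with a small helper. `parse_ppi_path` is a hypothetical name and not part of `ppiref`.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# A few paths copied from the listing above.
paths = [
    "ppi_dir/bu/1bui_A_B_C.pdb",
    "ppi_dir/bu/1bui_A_C.pdb",
    "ppi_dir/0g/10gs_A_B.pdb",
]


def parse_ppi_path(path):
    """Split a PPI file name into its PDB ID and a tuple of partner chains."""
    stem = PurePosixPath(path).stem  # e.g. '1bui_A_B_C'
    pdb_id, *chain_ids = stem.split("_")
    return pdb_id, tuple(chain_ids)


# Group the partner tuples by the structure they were extracted from.
by_pdb = defaultdict(list)
for p in paths:
    pdb_id, chain_ids = parse_ppi_path(p)
    by_pdb[pdb_id].append(chain_ids)

print(dict(by_pdb))
```

With this grouping, the trimeric interface from Example 4 shows up as the three-chain tuple under `1bui`, next to the dimeric interfaces of the same structure.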
