# GSD Make dataframes sorted on interactions with RNA for PDB codes 7c79, 7c7a, and 6w6v

List the chains ranked by the extent of their interaction with the RNA in the cryo-EM structures of RNase MRP.

This effort mainly builds on the examples of bringing in data and filtering in [Making dataframes derived from PDBePISA interface lists/reports clearer by adding protein names and filtering to nucleic acid chains](../working_with_dataframes_and_making_clearer.ipynb) and the previous notebook in that series.

-----

### Preparation

#### Preparation: initial items

In [1]:
# Get script file if not yet retrieved / check if file exists
import os
file_needed = "pisa_interface_list_to_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbepisa-utilities/{file_needed}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43262  100 43262    0     0   177k      0 --:--:-- --:--:-- --:--:--  176k


In [2]:
import pandas as pd

In [3]:
# import the main function of the script into the notebook's namespace
# so that it can be used here
from pisa_interface_list_to_df import pisa_interface_list_to_df

#### Preparation: make initial dataframes

In [4]:
pdbs_to_use = ['7c79','7c7a','6w6v']
dfs=[pisa_interface_list_to_df(x) for x in pdbs_to_use]


### Making the data easier to read by replacing chain designation characters with molecule names

Protein Data Bank structure files use characters that correspond to each chain. The issue is that these characters are largely arbitrary. Often the same complex in different structures uses a different character as the designation.  
Thanks to the pandas `df.replace()` method, we can swap out the characters for the names of the molecules.  
Let's do that for a few of them while leaving some others unchanged for contrast.

In [5]:
chain2name_pairs = {
    "A":"Nme 1 RNA",
    "B":"Pop1p",
    "C":"Pop3p",
    "D":"Pop4p",
    "E":"Pop5p",
    "F":"Pop6p",
    "G":"Pop7p",
    "H":"Pop8p",
    "I":"Rpp1p$_a$",
    "J":"Rpp1p$_b$",
    "K":"Snm1p",
    "L":"Rmp1p",
    "R":"pre-rRNA",
}
dfs_tmp = dfs.copy()
# Iterate over the copied list so that `dfs` can be modified while iterating
for indx, df in enumerate(dfs_tmp):
    for k,v in chain2name_pairs.items():
        df = df.replace(k,v)
    dfs[indx] = df
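
As a minimal sketch of the same idea, `DataFrame.replace()` also accepts the whole mapping in a single call and swaps only those cells whose entire value matches a key. The toy table and its flat column names below are made up for illustration and are not the real PISA output.

```python
import pandas as pd

# Hypothetical two-row table standing in for a PISA interface list
toy = pd.DataFrame({
    "Chain 1": ["B", "D"],
    "Chain 2": ["A", "K"],
    "Hydrogen bonds": [111, 13],
})

# Subset of the chain-to-name mapping, chosen for this illustration
toy_names = {"A": "Nme 1 RNA", "B": "Pop1p", "D": "Pop4p", "K": "Snm1p"}

# A dict passed to replace() matches whole cell values, so the numeric
# column is left untouched
renamed = toy.replace(toy_names)
print(renamed)
```

Because matching is on whole cell values rather than substrings, a single-letter chain ID should not alter longer strings that merely contain that letter.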

In [6]:
num_to_display = 0
from IPython.display import display, HTML
display(HTML(f'<b>Interface interactions for {pdbs_to_use[num_to_display]}:</b>'))
display(dfs[num_to_display])

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,799,205,46836,◊,Nme 1 RNA,932,124,61000,8014.9,-107.9,0.984,111,0,0,0.0
1,2,Pop4p,254,59,19198,◊,Rpp1p$_b$,231,54,16519,2274.8,-27.5,0.1,19,1,0,0.0
2,3,Pop4p,212,56,19198,◊,Nme 1 RNA,238,34,61000,1883.3,-26.4,0.923,26,0,0,0.0
3,4,Pop7p,192,56,8815,◊,Nme 1 RNA,215,25,61000,1814.8,-12.2,0.683,21,0,0,0.0
4,5,Pop1p,173,47,46836,◊,Rmp1p,178,51,8356,1746.3,-20.4,0.023,11,0,0,0.0
5,6,Pop4p,164,40,19198,◊,Snm1p,162,41,8050,1600.0,-19.2,0.212,13,7,0,0.0
6,7,Pop3p,137,35,9801,◊,Snm1p,150,40,8050,1413.6,-22.7,0.093,6,0,0,0.0
7,8,Pop5p,134,39,9572,◊,Rpp1p$_a$,128,33,12564,1307.6,-21.3,0.043,9,3,0,0.0
8,9,Pop1p,114,32,46836,◊,Pop7p,111,28,8815,1081.1,-7.9,0.299,8,8,0,0.0
9,10,Rmp1p,108,28,8356,◊,Nme 1 RNA,148,22,61000,1059.5,-7.5,0.941,19,0,0,0.0


In [7]:
num_to_display = 1
display(HTML(f'<b>Interface interactions for {pdbs_to_use[num_to_display]}:</b>'))
display(dfs[num_to_display])

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,785,202,46857,◊,Nme 1 RNA,925,122,61103,7955.8,-103.8,0.984,105,0,0,0.0
1,2,Pop4p,257,61,19422,◊,Rpp1p$_b$,226,54,16616,2326.1,-26.4,0.109,15,1,0,0.0
2,3,Pop4p,208,58,19422,◊,Nme 1 RNA,243,34,61103,1947.0,-28.8,0.885,26,0,0,0.0
3,4,Pop7p,191,56,8780,◊,Nme 1 RNA,214,25,61103,1816.9,-13.0,0.704,20,0,0,0.0
4,5,Pop1p,176,46,46857,◊,Rmp1p,175,48,8598,1798.6,-18.1,0.037,14,2,0,0.0
5,6,Pop4p,172,40,19422,◊,Snm1p,167,42,7998,1627.3,-17.7,0.287,15,7,0,0.0
6,7,Pop5p,135,39,9607,◊,Rpp1p$_a$,130,33,12486,1300.1,-18.3,0.089,15,3,0,0.0
7,8,Pop3p,124,35,9634,◊,Snm1p,136,37,7998,1261.7,-20.0,0.108,6,0,0,0.0
8,9,Pop1p,115,33,46857,◊,Pop7p,113,29,8780,1133.3,-9.1,0.238,9,6,0,0.0
9,10,Rmp1p,95,24,8598,◊,Nme 1 RNA,133,20,61103,991.6,-10.8,0.885,11,0,0,0.0


In [8]:
num_to_display = 2
display(HTML(f'<b>Interface interactions for {pdbs_to_use[num_to_display]}:</b>'))
display(dfs[num_to_display])

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,695,184,46970,◊,Nme 1 RNA,830,113,55355,7090.3,-105.9,0.962,105,0,0,0.0
1,2,Pop4p,242,59,15665,◊,Rpp1p$_b$,236,54,16318,2262.6,-24.3,0.149,19,2,0,0.0
2,3,Pop7p,199,59,9439,◊,Nme 1 RNA,225,27,55355,1856.5,-16.6,0.754,23,0,0,0.0
3,4,Pop4p,182,42,15665,◊,Snm1p,164,40,6927,1673.4,-15.2,0.342,10,1,0,0.0
4,5,Pop4p,152,43,15665,◊,Nme 1 RNA,175,25,55355,1444.6,-27.3,0.574,15,0,0,0.0
5,6,Pop1p,128,36,46970,◊,Rmp1p,137,39,9046,1314.6,-13.6,0.267,4,0,0,0.0
6,7,Pop5p,124,39,10680,◊,Rpp1p$_a$,119,32,12899,1267.0,-20.4,0.016,11,1,0,0.0
7,8,Pop6p,104,27,10434,◊,Nme 1 RNA,122,22,55355,1020.1,-6.8,0.89,13,0,0,0.0
8,9,Pop1p,115,32,46970,◊,Pop7p,110,28,9439,1017.8,-10.2,0.287,6,2,0,0.0
9,10,Rmp1p,92,26,9046,◊,Nme 1 RNA,119,18,55355,932.9,-6.7,0.962,18,0,0,0.0


## Filter to just those involving the RNA.

One of the great features of the PDBePISA interface data, compared to PDBsum, is that it accounts for interactions with nucleic acid chains.

**What if you just wanted to narrow down to the interactions that involved contact to the nucleic acid chains?**  


This makes it obvious which molecules contribute most to the interactions with the RNA.

In [9]:
# Worked example for the first structure, to make the iterative approach in the next cell clearer.
num_to_display = 0
display(HTML(f'<b>RNA contact interactions for {pdbs_to_use[num_to_display]}:</b>'))
df = dfs[num_to_display]
#filter to just those involving the RNA
display(df[df[df.columns[1]].str.contains('RNA')|df[df.columns[6]].str.contains('RNA')])

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,799,205,46836,◊,Nme 1 RNA,932,124,61000,8014.9,-107.9,0.984,111,0,0,0.0
2,3,Pop4p,212,56,19198,◊,Nme 1 RNA,238,34,61000,1883.3,-26.4,0.923,26,0,0,0.0
3,4,Pop7p,192,56,8815,◊,Nme 1 RNA,215,25,61000,1814.8,-12.2,0.683,21,0,0,0.0
9,10,Rmp1p,108,28,8356,◊,Nme 1 RNA,148,22,61000,1059.5,-7.5,0.941,19,0,0,0.0
10,11,Pop6p,104,29,9941,◊,Nme 1 RNA,122,22,61000,981.6,-11.5,0.707,10,0,0,0.0
12,13,Pop5p,62,20,9572,◊,Nme 1 RNA,97,18,61000,821.1,-5.5,0.861,12,0,0,0.0
18,19,Rpp1p$_b$,36,11,16519,◊,Nme 1 RNA,50,10,61000,365.4,-1.3,0.866,6,0,0,0.0
19,20,Rpp1p$_a$,37,14,12564,◊,Nme 1 RNA,40,7,61000,335.7,-3.1,0.597,2,0,0,0.0
21,22,Snm1p,20,7,8050,◊,Nme 1 RNA,35,6,61000,237.2,-0.7,0.867,3,0,0,0.0
25,26,Pop8p,10,4,8780,◊,Nme 1 RNA,14,6,61000,102.4,-3.4,0.467,0,0,0,0.0
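
The filter above picks the two chain-label columns by position (`df.columns[1]` and `df.columns[6]`). A hedged alternative, sketched here on a small toy table, is to select them by their two-level labels instead. The toy rows and the reduced set of columns are assumptions for illustration; the real dataframes have more columns.

```python
import pandas as pd

# Toy stand-in for an interface table with two-level column labels
columns = pd.MultiIndex.from_tuples([
    ("Chain 1", "Chain label"),
    ("Chain 2", "Chain label"),
    ("Interface", "Area (Å$^2$)"),
])
toy = pd.DataFrame(
    [
        ["Pop1p", "Nme 1 RNA", 8014.9],
        ["Pop4p", "Rpp1p$_b$", 2274.8],
        ["Nme 1 RNA", "pre-rRNA", 365.5],
        ["Pop4p", "Snm1p", 1600.0],
    ],
    columns=columns,
)

# Select each chain-label column by its (top level, second level) tuple
chain1 = toy[("Chain 1", "Chain label")]
chain2 = toy[("Chain 2", "Chain label")]

# Keep rows where either partner's name contains 'RNA'
involves_rna = chain1.str.contains("RNA") | chain2.str.contains("RNA")
rna_only = toy[involves_rna]
print(rna_only)
```

Selecting by label keeps working if the column order changes, at the cost of depending on the exact header text.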


In [10]:
# Make dataframes with just the contacts to the RNA
df_rs = dfs.copy()
for indx, x in enumerate(dfs):
    df_r = x.copy()
    # Keep rows where either chain label contains 'RNA'
    df_r = df_r[df_r[df_r.columns[1]].str.contains('RNA')|df_r[df_r.columns[6]].str.contains('RNA')]
    df_rs[indx] = df_r

In [11]:
# Show the dataframes limited to just the contacts with the RNA
from IPython.display import display, HTML
for indx, d in enumerate(df_rs):
    display(HTML(f'<b>RNA contact interactions for {pdbs_to_use[indx]}:</b>'))
    display(d)
    display(HTML('<br>'))

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,799,205,46836,◊,Nme 1 RNA,932,124,61000,8014.9,-107.9,0.984,111,0,0,0.0
2,3,Pop4p,212,56,19198,◊,Nme 1 RNA,238,34,61000,1883.3,-26.4,0.923,26,0,0,0.0
3,4,Pop7p,192,56,8815,◊,Nme 1 RNA,215,25,61000,1814.8,-12.2,0.683,21,0,0,0.0
9,10,Rmp1p,108,28,8356,◊,Nme 1 RNA,148,22,61000,1059.5,-7.5,0.941,19,0,0,0.0
10,11,Pop6p,104,29,9941,◊,Nme 1 RNA,122,22,61000,981.6,-11.5,0.707,10,0,0,0.0
12,13,Pop5p,62,20,9572,◊,Nme 1 RNA,97,18,61000,821.1,-5.5,0.861,12,0,0,0.0
18,19,Rpp1p$_b$,36,11,16519,◊,Nme 1 RNA,50,10,61000,365.4,-1.3,0.866,6,0,0,0.0
19,20,Rpp1p$_a$,37,14,12564,◊,Nme 1 RNA,40,7,61000,335.7,-3.1,0.597,2,0,0,0.0
21,22,Snm1p,20,7,8050,◊,Nme 1 RNA,35,6,61000,237.2,-0.7,0.867,3,0,0,0.0
25,26,Pop8p,10,4,8780,◊,Nme 1 RNA,14,6,61000,102.4,-3.4,0.467,0,0,0,0.0


Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,785,202,46857,◊,Nme 1 RNA,925,122,61103,7955.8,-103.8,0.984,105,0,0,0.0
2,3,Pop4p,208,58,19422,◊,Nme 1 RNA,243,34,61103,1947.0,-28.8,0.885,26,0,0,0.0
3,4,Pop7p,191,56,8780,◊,Nme 1 RNA,214,25,61103,1816.9,-13.0,0.704,20,0,0,0.0
9,10,Rmp1p,95,24,8598,◊,Nme 1 RNA,133,20,61103,991.6,-10.8,0.885,11,0,0,0.0
10,11,Pop6p,107,30,9915,◊,Nme 1 RNA,116,22,61103,976.8,-10.9,0.695,10,0,0,0.0
12,13,Pop5p,72,20,9607,◊,Nme 1 RNA,99,19,61103,837.7,-7.8,0.742,14,0,0,0.0
18,19,Rpp1p$_b$,38,11,16616,◊,Nme 1 RNA,54,10,61103,366.6,-1.8,0.88,5,0,0,0.0
19,20,Nme 1 RNA,51,9,61103,◊,pre-rRNA,42,8,2309,365.5,-5.7,0.778,1,0,0,0.0
20,21,Rpp1p$_a$,36,13,12486,◊,Nme 1 RNA,37,7,61103,309.8,-3.8,0.509,1,0,0,0.0
22,23,Snm1p,21,6,7998,◊,Nme 1 RNA,29,6,61103,224.9,1.6,0.912,3,0,0,0.0


Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,Pop1p,695,184,46970,◊,Nme 1 RNA,830,113,55355,7090.3,-105.9,0.962,105,0,0,0.0
2,3,Pop7p,199,59,9439,◊,Nme 1 RNA,225,27,55355,1856.5,-16.6,0.754,23,0,0,0.0
4,5,Pop4p,152,43,15665,◊,Nme 1 RNA,175,25,55355,1444.6,-27.3,0.574,15,0,0,0.0
7,8,Pop6p,104,27,10434,◊,Nme 1 RNA,122,22,55355,1020.1,-6.8,0.89,13,0,0,0.0
9,10,Rmp1p,92,26,9046,◊,Nme 1 RNA,119,18,55355,932.9,-6.7,0.962,18,0,0,0.0
10,11,Pop5p,71,20,10680,◊,Nme 1 RNA,103,20,55355,831.6,-8.3,0.87,24,0,0,0.0
15,16,Rpp1p$_a$,42,14,12899,◊,Nme 1 RNA,51,8,55355,394.2,-4.8,0.655,4,0,0,0.0
18,19,Rpp1p$_b$,29,9,16318,◊,Nme 1 RNA,33,8,55355,280.9,-0.5,0.733,5,0,0,0.0
19,20,Snm1p,20,6,6927,◊,Nme 1 RNA,26,5,55355,215.4,-1.7,0.756,2,0,0,0.0
23,24,Pop8p,8,2,8662,◊,Nme 1 RNA,11,4,55355,88.2,-2.3,0.56,0,0,0,0.0


These filtered dataframes make it obvious which molecules contribute most to the interactions with the RNA in each of the three structures.
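
The tables shown here already appear in descending order of interface area. If a filtered table ever needed to be ranked explicitly, one way to do it is sketched below on a toy table. The rows are deliberately shuffled, and the reduced column set is an assumption for illustration.

```python
import pandas as pd

area_col = ("Interface", "Area (Å$^2$)")
columns = pd.MultiIndex.from_tuples([
    ("Chain 1", "Chain label"),
    ("Chain 2", "Chain label"),
    area_col,
])

# Toy RNA-contact rows, intentionally out of order
toy = pd.DataFrame(
    [
        ["Pop7p", "Nme 1 RNA", 1814.8],
        ["Pop1p", "Nme 1 RNA", 8014.9],
        ["Pop4p", "Nme 1 RNA", 1883.3],
    ],
    columns=columns,
)

# Sort the area column from largest to smallest, then reorder the rows to match
order = toy[area_col].sort_values(ascending=False).index
ranked = toy.loc[order].reset_index(drop=True)
print(ranked[("Chain 1", "Chain label")].tolist())
```

Interface area is only one possible ranking key; the same pattern would apply to another column, such as the hydrogen bond count.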

--------

Enjoy!
