## Making the multiple interface-reporting dataframes generated via snakemake clearer by adding molecule names

This notebook builds on the previous notebook, [Making multiple interface-reporting dataframes for several structures using snakemake](notebooks/making_multiple_interface_dataframes_using_snakemake.ipynb); however, you don't need to have just run that Jupyter notebook, as there is an option to retrieve a previously generated set of resulting dataframes and use them. 

The approach used in the second notebook, [Making dataframes derived from PDBePISA interface lists/reports clearer by adding protein names and filtering to nucleic acid chains](working_with_dataframes_and_making_clearer.ipynb), to add the names of the molecules to the dataframes in place of the chain designations is scaled up here.


----

If you just ran the previous notebook in this session, you don't need to run this next cell; however, it has been set up to cause no issues if you run it anyway.  
If you don't yet have the dataframes resulting from the previous notebook, it will get a set of demonstration results previously generated so that the rest of the cells in this notebook will work.

In [1]:
# Check whether result dataframes already seem to be present. If there
# don't seem to be any, get a set and unpack it.
suffix_for_int_dfs = "_PISAinterface_summary_pickled_df.pkl"
import os
import sys
import fnmatch
dfs_file_pattern = f"*{suffix_for_int_dfs}"
interface_dfs_files = []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, dfs_file_pattern):
        interface_dfs_files.append(file)
if not interface_dfs_files:
    !curl -OL https://gist.github.com/fomightez/28636f6cc6561e4892384bddc8900eb3/raw/e82fcb71bf3afafcea5764b4b2e794d2dd8f679a/collection_of_interface_dfsDec0720211917.tar.gz
    !tar xzf collection_of_interface_dfsDec0720211917.tar.gz
    # verify it worked and provide feedback
    for file in os.listdir('.'):
        if fnmatch.fnmatch(file, dfs_file_pattern):
            interface_dfs_files.append(file)
    if interface_dfs_files:
        sys.stderr.write("A set of interface dataframes like those the "
            "previous notebook would make have been retrieved.\nYou should be "
            "able to now run this notebook.")
    else:
        sys.stderr.write("No interface dataframes are present. THIS ISN'T GOING TO WORK!")
else:
    sys.stderr.write("A set of interface dataframes are present and\n"
        "executing the cells in this notebook should work now.")

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  8866  100  8866    0     0  41429      0 --:--:-- --:--:-- --:--:-- 41429


A set of interface dataframes like those the previous notebook would make have been retrieved.
You should be able to now run this notebook.

To define the replacements that make the dataframes more readable, the following cell relates the text to change to the new values in a set of key-value pairings. If you ran the demonstration and are continuing to use that, you can just run the cell. If you are trying to make this notebook edit your own reports, you'll need to change the text on the left of the colon to match the chain designations and the text on the right to match the molecule names. (The text on the left must match the chain designation exactly, including letter case, because `DataFrame.replace` as used in the swapping cell only changes cells whose entire value equals the key.)

In [2]:
chain2name_pairs = {
    "R":"WDR5",
    "N":"RBBP5",
    "K":"MLL1",
    "T":"SET1",
    "A":"Histone H3$_a$",
    "E":"Histone H3$_b$",
    "B":"Histone H4$_a$",
    "F":"Histone H4$_b$",
    "C":"Histone H2A$_a$",
    "G":"Histone H2A$_b$",
    "D":"Histone H2B$_1.1a$",
    "H":"Histone H2B$_1.1b$",
    "I":"DNA$_a$",
    "J":"DNA$_b$",
}
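
To make the matching behavior concrete, here is a minimal sketch on a made-up toy dataframe (the column names, values, and `toy_pairs` are illustrative assumptions, not taken from the real reports). It shows that a plain `DataFrame.replace` call swaps only cells whose whole value equals a key, so a lowercase `r` or a longer string such as `RN` is left alone.

```python
import pandas as pd

# Toy stand-in for the chain-label columns; all values are made up.
toy_df = pd.DataFrame({
    "Chain 1": ["R", "r", "RN", "N"],
    "Chain 2": ["N", "K", "R", "n"],
})
toy_pairs = {"R": "WDR5", "N": "RBBP5"}  # hypothetical subset of the pairings

# Without regex=True, replace() matches entire cell values, case-sensitively.
swapped = toy_df.replace(toy_pairs)
print(swapped["Chain 1"].tolist())  # ['WDR5', 'r', 'RN', 'RBBP5']
print(swapped["Chain 2"].tolist())  # ['RBBP5', 'K', 'WDR5', 'n']
```

Because only whole-cell matches are swapped, a single-letter key such as `"A"` will not accidentally alter longer text that happens to contain that letter.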

Then run the next cell to go through every dataframe file and swap the text on the left side of the colon for the text on the right side.

In [3]:
# Make a list of the report-containing dataframe files and then make the text swaps
suffix_for_int_dfs = "_PISAinterface_summary_pickled_df.pkl"
import os
import sys
import pandas as pd
import glob
import re
dfs_file_pattern = f"*{suffix_for_int_dfs}"
interface_dfs_files = glob.glob(dfs_file_pattern)
def make_swaps(file_name,key_value_pairs):
    '''
    Takes a file name for a pickled dataframe, loads the dataframe into memory,
    replaces every cell matching a key with the corresponding value, and 
    re-saves the dataframe over the original pickled file, so that the 
    dataframe read later has the names in place of the chain designations.
    Nothing is returned from this function.
    '''
    output_file_name = "temp.pkl"
    df = pd.read_pickle(file_name)
    for k,v in key_value_pairs.items():
        df = df.replace(k, v)
    #deal with special case of 6kiw having MLL3 and not MLL1
    if "6kiv" in file_name:
        df = df.replace("MLL1","MLL3")
    df.to_pickle(output_file_name)
    # replace the original file with edited
    !mv {output_file_name} {file_name}
    # Feedback
    sys.stderr.write("Chain designations swapped for molecule names in {}.\n".format(file_name))

for dff in interface_dfs_files:
    make_swaps(dff,chain2name_pairs)

Chain designations swapped for molecule names in 6kix_PISAinterface_summary_pickled_df.pkl.
Chain designations swapped for molecule names in 6kiv_PISAinterface_summary_pickled_df.pkl.
Chain designations swapped for molecule names in 6kiu_PISAinterface_summary_pickled_df.pkl.
Chain designations swapped for molecule names in 6kiw_PISAinterface_summary_pickled_df.pkl.
Chain designations swapped for molecule names in 6kiz_PISAinterface_summary_pickled_df.pkl.
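
If you want to reuse the swapping step outside Jupyter, where the `!mv` shell call is not available, something like the following sketch might work. The function name `make_swaps_portable` and the demonstration file are hypothetical; the sketch passes the whole dictionary to `replace()` in one call, which should give the same result as looping over the pairs as long as no replacement value is itself one of the keys, and it uses `os.replace` from the standard library to overwrite the original file.

```python
import os
import tempfile
import pandas as pd

def make_swaps_portable(file_name, key_value_pairs):
    """Load a pickled dataframe, swap whole-cell matches, overwrite the file."""
    df = pd.read_pickle(file_name)
    # A dict passed to replace() applies all the swaps in one call.
    df = df.replace(key_value_pairs)
    # Save beside the original, then move the edited file over the original.
    temp_name = file_name + ".tmp"
    df.to_pickle(temp_name)
    os.replace(temp_name, file_name)

# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "demo_PISAinterface_summary_pickled_df.pkl")
    pd.DataFrame({"Chain label": ["R", "N", "K"]}).to_pickle(demo_file)
    make_swaps_portable(demo_file, {"R": "WDR5", "N": "RBBP5", "K": "MLL1"})
    result = pd.read_pickle(demo_file)["Chain label"].tolist()
print(result)  # ['WDR5', 'RBBP5', 'MLL1']
```

Any per-structure special case, like the MLL1/MLL3 one handled in the cell above, would still need to be added to such a function.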


Check the dataframe files by examining them. Here is one:

In [4]:
dfm = pd.read_pickle("6kix_PISAinterface_summary_pickled_df.pkl")
dfm

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,Id,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 7_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1.0,1.0,DNA$_a$,701.0,145.0,32782.0,◊,DNA$_b$,713.0,145.0,33005,6712.6,-85.7,0.983,361,0,0,0.0
1,2.0,2.0,Histone H2A$_a$,243.0,58.0,9526.0,◊,Histone H2B$_1.1a$,240.0,57.0,8616,2485.5,-47.3,0.016,14,5,0,0.0
2,,3.0,Histone H2A$_b$,235.0,55.0,9202.0,◊,Histone H2B$_1.1b$,254.0,61.0,8361,2481.2,-43.7,0.031,14,6,0,0.0
3,,,,,,,,,,,**_Average:_**,2483.3,-45.5,0.023,14,6,0,0.0
4,3.0,4.0,Histone H3$_a$,243.0,54.0,9187.0,◊,Histone H4$_a$,222.0,51.0,7092,2425.3,-40.5,0.038,12,4,0,0.0
5,,5.0,Histone H3$_b$,232.0,54.0,9190.0,◊,Histone H4$_b$,223.0,49.0,6757,2402.9,-41.7,0.029,16,4,0,0.0
6,,,,,,,,,,,**_Average:_**,2414.1,-41.1,0.033,14,4,0,0.0
7,4.0,6.0,RBBP5,85.0,23.0,20484.0,◊,WDR5,101.0,37.0,13249,851.4,-9.4,0.156,10,0,0,0.0
8,5.0,7.0,RBBP5,93.0,17.0,20484.0,◊,MLL1,94.0,31.0,12534,818.8,-3.4,0.375,10,3,0,0.0
9,6.0,8.0,MLL1,71.0,17.0,12534.0,◊,WDR5,90.0,34.0,13249,817.9,-4.7,0.4,8,2,0,0.0


The chain designations should have been swapped for the names of the molecules.
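
Besides looking over the displayed dataframe, the swap could be checked programmatically. The sketch below assumes the two-level column headers seen in the output above, with `Chain label` as the second level; the helper name `leftover_designations` and the toy data are hypothetical. It reports any original designations still sitting in a chain-label column.

```python
import pandas as pd

def leftover_designations(df, key_value_pairs):
    """Return the chain designations still present in 'Chain label' columns."""
    # Column labels of a two-level header are tuples: (top level, second level).
    label_cols = [col for col in df.columns if col[1] == "Chain label"]
    remaining = set()
    for col in label_cols:
        remaining |= set(df[col].dropna()) & set(key_value_pairs)
    return remaining

# Toy dataframe in which one designation ("K") was deliberately left unswapped.
columns = pd.MultiIndex.from_tuples(
    [("Chain 1", "Chain label"), ("Chain 2", "Chain label")]
)
toy = pd.DataFrame([["RBBP5", "WDR5"], ["K", "WDR5"]], columns=columns)
print(leftover_designations(toy, {"R": "WDR5", "N": "RBBP5", "K": "MLL1"}))  # {'K'}
```

An empty set returned for a real dataframe would suggest every designation in the dictionary was swapped.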

Notice this just uses a short, easily edited script. The idea is that this step wouldn't be commonly done, as most structures don't share the same chain designations, and thus adding in molecule names would be limited by having to make the key-value dictionaries and deal with special cases. For example, even among this group, MLL3 is found in one structure whereas MLL1 is in the others; contrast the MLL1 rows at the bottom of the dataframe above with the MLL3 rows at the top of the 6kiv dataframe displayed after the next code cell. (Note the `special case of 6kiw having MLL3 and not MLL1` lines dealing with this.) 

In [5]:
dfv = pd.read_pickle("6kiv_PISAinterface_summary_pickled_df.pkl")
dfv.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,Id,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 7_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1.0,1.0,DNA$_a$,693.0,145.0,32328.0,◊,DNA$_b$,688.0,145.0,32489,6598.2,-91.1,0.971,365,0,0,0.0
1,2.0,2.0,Histone H3$_a$,257.0,58.0,9108.0,◊,Histone H4$_a$,241.0,55.0,8034,2566.8,-39.8,0.047,17,5,0,0.0
2,,3.0,Histone H3$_b$,252.0,59.0,9111.0,◊,Histone H4$_b$,240.0,53.0,7067,2524.2,-42.2,0.05,14,3,0,0.0
3,,,,,,,,,,,**_Average:_**,2545.5,-41.0,0.049,16,4,0,0.0
4,3.0,4.0,Histone H2A$_b$,242.0,57.0,9089.0,◊,Histone H2B$_1.1b$,251.0,63.0,8250,2474.1,-43.9,0.029,15,4,0,0.0
5,,5.0,Histone H2A$_a$,246.0,57.0,9931.0,◊,Histone H2B$_1.1a$,235.0,58.0,8462,2461.4,-46.7,0.012,14,1,0,0.0
6,,,,,,,,,,,**_Average:_**,2467.7,-45.3,0.021,15,3,0,0.0
7,4.0,6.0,RBBP5,127.0,33.0,21887.0,◊,WDR5,136.0,44.0,12950,1230.2,-13.0,0.146,9,1,0,0.0
8,5.0,7.0,MLL3,88.0,20.0,12795.0,◊,WDR5,119.0,41.0,12950,978.9,-9.6,0.225,7,1,0,0.0
9,6.0,8.0,MLL3,84.0,29.0,12795.0,◊,RBBP5,78.0,18.0,21887,774.2,-3.5,0.47,9,4,0,0.0


However, this approach could also be combined with Snakemake to make a more automated workflow. I'm imagining supplying the names to the Snakemake workflow with an input file in which each line has the designation and the molecule name separated by a tab. (A possible further enhancement would be mining the names of the molecules and corresponding chains from RCSB in an automated manner, instead of requiring a dictionary to be made. Even this may be of limited use, though, as the preferred names aren't always featured clearly in the RCSB data. For example, extracting the names used in this demo would have been difficult. Plus there's the `M [auth R]` complication in the 'Chains' listing at RCSB; see [here](https://www.rcsb.org/structure/6kix) for an example.)
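
As a rough sketch of the tab-separated input idea, the following reads designation and name pairs into a dictionary shaped like `chain2name_pairs`. The function name `read_chain2name_pairs`, the skipping of `#` comment lines, and the example content are all assumptions made for illustration; nothing here is part of the existing workflow.

```python
import csv
import io

def read_chain2name_pairs(handle):
    """Build a designation-to-name dict from tab-separated lines."""
    pairs = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and (by assumption) lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"Expected two tab-separated fields, got: {row}")
        designation, name = (field.strip() for field in row)
        pairs[designation] = name
    return pairs

# An in-memory stand-in for a file; with a real file, pass open(path, newline="").
example = io.StringIO("# designation\tname\nR\tWDR5\nN\tRBBP5\n\nK\tMLL1\n")
chain2name_from_file = read_chain2name_pairs(example)
print(chain2name_from_file)  # {'R': 'WDR5', 'N': 'RBBP5', 'K': 'MLL1'}
```

The resulting dictionary could then be handed to `make_swaps()` in place of the hand-written one.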

-----

Now, if you haven't yet, you can go back to the previous notebook, [Making multiple interface-reporting dataframes for several structures using snakemake](notebooks/making_multiple_interface_dataframes_using_snakemake.ipynb), run it for the structures that interest you, and then come back and edit this notebook to make your reports have the names of the molecules.

Or [click here to open the Index](../index.ipynb) listing available notebooks and select other notebooks in the series.

-----