## Making the multiple equivalents reports generated via snakemake clearer by substituting protein names

This notebook builds on the previous notebook, [Using snakemake with multiple chains or structures to report if residues interacting with a specific chain have equivalent residues in hhsuite-generated alignments](Using%20snakemake%20with%20multiple%20chains%20or%20structures%20to%20report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20hhsuite-generated%20alignments.ipynb); however, you don't need to have just run that Jupyter notebook, because there is an option to retrieve a previously generated set of resulting notebooks and use those instead. 

This is based on a related notebook for reports made from PDBsum information, [Making the multiple reports generated via snakemake clearer by adding protein names](https://nbviewer.jupyter.org/github/fomightez/pdbsum-binder/blob/main/notebooks/Making%20the%20multiple%20reports%20generated%20via%20snakemake%20clearer%20by%20adding%20protein%20names.ipynb), that also basically does find/replace to substitute in the protein names. (That notebook can be run from sessions launched from my [pdbsum-binder repo](https://github.com/fomightez/pdbsum-binder).)  
Here, the situation is a little more difficult because two sets of chains in the output generated by [Using snakemake with multiple chains or structures to report if residues interacting with a specific chain have equivalent residues in hhsuite-generated alignments](Using%20snakemake%20with%20multiple%20chains%20or%20structures%20to%20report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20hhsuite-generated%20alignments.ipynb) share chain designations (A and B) yet correspond to different proteins in different generated reports. Specifically, the report generated from the line `6zdt A B results_S288C_NOP1.hhr 1` in the `equiv_check_matrix.txt` table does not involve the same proteins as the reports generated from the other lines in the `equiv_check_matrix.txt` table where A and B appear.


----

If you just ran the previous notebook in this session, you don't need to run this next cell; however, it has been set up to cause no issues if you run it anyway.  
If you don't yet have notebooks resulting from the previous notebook, it will get a set of demonstration results previously generated so that the rest of the cells in this notebook will work.

In [1]:
# Check whether report notebooks already seem to be present. If none
# are found, download a demonstration set and unpack it.
prefix_for_report_nbs = "equivalents_report_for_"
import os
import sys
import fnmatch
equivalents_report_nb_pattern = f"{prefix_for_report_nbs}*.ipynb"
equivalents_report_nbs = []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, equivalents_report_nb_pattern):
        equivalents_report_nbs.append(file)
if not equivalents_report_nbs:
    !curl -OL https://gist.github.com/fomightez/31035455a59f211654ad7667fe23a6a5/raw/255c089a3da8da91efb831beb8d94cd5dba7cba4/equivalents_report_nbsJul1520212039.tar.gz
    !tar xzf equivalents_report_nbsJul1520212039.tar.gz
    # verify it worked and provide feedback
    for file in os.listdir('.'):
        if fnmatch.fnmatch(file, equivalents_report_nb_pattern):
            equivalents_report_nbs.append(file)
    if equivalents_report_nbs:
        sys.stderr.write("A set of notebooks with reports like those the "
            "previous notebook would make have been retrieved.\nYou should be "
            "able to now run this notebook.")
    else:
        sys.stderr.write("No notebooks are present. THIS ISN'T GOING TO WORK!")
else:
    sys.stderr.write("A set of notebooks with reports are present and\n"
        "executing the cells in this notebook should work now.")

A set of notebooks with reports are present and
executing the cells in this notebook should work now.

To define the replacements that will make the notebooks more readable, the following cell relates the text to change to new values in a set of key-value pairings. If you ran the demonstration and are continuing to use that, you can just run the cell. If you are trying to make this notebook edit your own reports, you'll need to change the text on the left of the colon to match the chain designations and the text on the right to match the protein names.  
Note that a separate set of pairings exists for the files corresponding to PDB entry `6zdt`.  
(The case of the letters in the text on the left side will be ignored.)

In [2]:
most_chain2name_pairs = {
    "Chain A":"Rpo21p",
    "Chain B":"Rpb2p",
    "Chain C":"Rpb3p",
    "Chain D":"Rpb4p",
    "Chain E":"Rpb5p",
    "Chain F":"Rpb6p",
    "Chain G":"Rpb7p",
    "Chain H":"Rpb8p",
    "Chain I":"Rpb9p",
    "Chain K":"Rpb11p",
}
sixzdt_chain2name_pairs = {
    "Chain A":"fibrillarin (Nop1p)",
    "Chain B":"Nop56p",
}
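To make the effect of these pairings concrete, here is a minimal sketch of a case-insensitive substitution applied to a made-up sentence. The `swap_labels` helper, `demo_pairs`, and `sample_text` are illustrative assumptions, not part of the reports or of this notebook's own code. The sketch also appends a word boundary (`\b`) to each key, which is a small departure from the plain case-insensitive match used later in this notebook, so that a key such as "Chain A" does not match the start of a phrase like "chain and".

```python
import re

def swap_labels(text, pairs):
    """Replace each key in `pairs` with its value, ignoring letter case."""
    for label, name in pairs.items():
        # The trailing \b keeps "Chain A" from matching inside "chain and".
        pattern = re.compile(re.escape(label) + r"\b", re.IGNORECASE)
        # A function replacement inserts `name` literally (no backslash handling).
        text = pattern.sub(lambda _match, name=name: name, text)
    return text

# Hypothetical pairings and text, for illustration only.
demo_pairs = {"Chain A": "Rpo21p", "Chain B": "Rpb2p"}
sample_text = "Residues of chain A contact CHAIN B; each chain and its partner are listed."
print(swap_labels(sample_text, demo_pairs))
```

Whether the word boundary is needed depends on the wording in your own reports; if a phrase such as "chain and" never appears, the plain match behaves the same way.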

Then run the next cell to go through every notebook and swap the text on the left side of the colon to be the text on the right side for the appropriate chain designations and structures.

In [3]:
# Make a list of most of the report-containing notebooks.
# For those with `6zdt` in the name, make a separate list of notebooks.
# Use each list to make the appropriate text swaps.
prefix_for_report_nbs = "equivalents_report_for_"
import os
import sys
import fnmatch
import re
equivalents_report_nb_pattern = f"{prefix_for_report_nbs}*.ipynb"
equivalents_report_nbs = []
sixzdt_report_nbs = []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, equivalents_report_nb_pattern):
        if '6zdt' in file:
            sixzdt_report_nbs.append(file)
        else:
            equivalents_report_nbs.append(file)
def make_swaps(file_name,key_value_pairs):
    '''
    Takes a file name and a dictionary of key-value pairs and replaces every
    occurrence of each key in the file (ignoring case) with the
    corresponding value.
    Saves the fixed file. Nothing is returned from this function.
    '''
    output_file_name = "temp.txt"
    with open(file_name, 'r') as thefile:
        nb_text=thefile.read()
    for k,v in key_value_pairs.items():
        #nb_text=nb_text.replace(k.lower(),v) # if wasn't case insensitive for key
        # case-insensitive string replacement from https://stackoverflow.com/a/919067/8508004
        insensitive = re.compile(re.escape(k), re.IGNORECASE)
        nb_text = insensitive.sub(v, nb_text)
    with open(output_file_name, 'w') as output_file:
        output_file.write(nb_text)
    # replace the original file with the edited one; `os.replace` also
    # copes with file names containing spaces or shell metacharacters
    os.replace(output_file_name, file_name)
    # Feedback
    sys.stderr.write("Chain designations swapped for names in {}.\n".format(file_name))

for nbn in equivalents_report_nbs:
    make_swaps(nbn,most_chain2name_pairs)
for nbn in sixzdt_report_nbs:
    make_swaps(nbn,sixzdt_chain2name_pairs)

Chain designations swapped for names in equivalents_report_for_3rzo_A_K_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_D_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_G_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_E_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_I_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_B_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_H_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_3rzo_A_B_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_C_alignment_example.hhr.ipynb.
Chain designations swapped for names in equivalents_report_for_5vvs_A_F_alignment_example.h

Check the notebook files by examining them. The occurrences of 'chain' followed by the chain designations should have been swapped for the protein names.
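As a complement to examining the files by eye, the following is a minimal sketch of a programmatic check that looks for any 'chain' plus single-character designation still left in a piece of text. The function name `find_leftover_chain_labels` and the two sample strings are hypothetical, and the pattern assumes that designations are a single letter or digit following the word 'chain'.

```python
import re

def find_leftover_chain_labels(text):
    """Return the sorted, de-duplicated chain designations still in `text`."""
    # 'chain', whitespace, then one letter or digit standing on its own
    pattern = re.compile(r"\bchain\s+([A-Za-z0-9])\b", re.IGNORECASE)
    return sorted({match.group(1).upper() for match in pattern.finditer(text)})

# Made-up strings standing in for report text after and before the swap.
edited = "Rpo21p residues that interact with Rpb4p"
unedited = "chain A residues that interact with Chain D"
print(find_leftover_chain_labels(edited))    # []
print(find_leftover_chain_labels(unedited))  # ['A', 'D']
```

To apply it to a report, read a notebook file into a string and pass that string to the function; an empty list suggests every designation covered by your pairings was swapped.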

-----

Now you can go back to the previous notebook and run through the sets of interactions that interest you. For each, you need alignments from the HH-suite3 programs showing the homology, and the query protein must occur in a structure where it interacts with other chains. Then come back and edit this notebook so that your reports show the protein names.


-----