# Using snakemake with multiple chains or structures to report if residues interacting with a specific chain have equivalent residues in hhsuite-generated alignments

This notebook builds on some of the basics covered in [Report if residues interacting with a specific chain have equivalent residues in an hhsuite-generated alignment](Report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20an%20hhsuite-generated%20alignment.ipynb) in order to examine residues from many different chains or structures.

----

The previous notebook, [Report if residues interacting with a specific chain have equivalent residues in an hhsuite-generated alignment](Report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20an%20hhsuite-generated%20alignment.ipynb), stepped through making reports on whether residues in a chain that interact with a specific other chain are conserved in a related alignment generated by HH-suite3 software.  
Is there a way to scale this up to make reports for many combinations of chains in several pairs of related structures or even unrelated structures? This may be especially helpful in cases where the structures involved are large complexes and several versions of the structure were solved.

This notebook spells out a way to do this with minimal effort. In fact, you only really need to know the PDB code identifiers of the structures you are interested in, the chain designations in each structure, and the name of the alignment file. You'll fill out a table to define the structures, chains, & alignments, and then kick off the process to make Jupyter notebooks containing the reports on conservation within the provided alignments for the residues from the specified structures and chains. Snakemake is used to run the process seen in the previous notebooks as a pipeline that does the analysis for each of the combinations specified.

Snakemake is used here as workflow management software; however, the wonderful features it provides aren't fully covered or demonstrated. To provide more background on using Snakemake, I used a distinct and simpler workflow in the notebook 'Making multiple interface-reporting dataframes for several structures using snakemake', available when you go [here](https://github.com/fomightez/pdbepisa-binder) and click `launch binder`. That Jupyter notebook also suggests further resources for learning Snakemake.

-----

**Step #1:** Make a table with columns separated by spaces and each line as a row. Each row specifies a structure, two chains from the structure, an alignment file from HH-suite3-related software where the first chain was the query, and the hit number in that alignment to check for the equivalent residues in the hit sequence, for a total of five required items per line. If your query and homolog are the best hit, the hit will be the first one and the number to use in the last required entry will be `1`. This is expected to be the typical case. Optionally, each line of the table can also specify whether the details of the residues with equivalents should be included in the report, and whether the details of the residues without equivalents should be included in the report. Those two settings default to `True` and `False`, respectively, if they are left off and just five items are given on a line in the 'equiv_check_matrix.txt' table. Seven items will occur on lines where all variables are provided.  
The following illustrates the content of such a table.

```text
5vvs A B alignment_example.hhr 1 True False
5vvs A C alignment_example.hhr 1
5vvs A D alignment_example.hhr 1 True False
5vvs A E alignment_example.hhr 1
5vvs A F alignment_example.hhr 1 True True
5vvs A G alignment_example.hhr 1 False
5vvs A H alignment_example.hhr 1
5vvs A I alignment_example.hhr 1
3rzo A B alignment_example.hhr 1
3rzo A K alignment_example.hhr 1
6zdt A B results_S288C_NOP1.hhr 1
```
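The five required items and two optional settings described above can be read with a few lines of Python. The following is a minimal sketch, not the parsing that `equiv_snakefile` performs (its contents are not shown here). The function name `parse_matrix_line` and the dictionary keys are hypothetical, and it assumes the optional settings are written literally as `True` or `False`.

```python
def parse_matrix_line(line):
    """Split one table line into its items, applying the documented defaults.

    Hypothetical helper for illustration; not taken from equiv_snakefile.
    """
    parts = line.split()
    if not 5 <= len(parts) <= 7:
        raise ValueError(f"expected 5 to 7 items, got {len(parts)}: {line!r}")
    structure, chain1, chain2, hhr_file, hit = parts[:5]
    # Optional sixth item: details for residues WITH equivalents (default True).
    details_with = (parts[5] == "True") if len(parts) > 5 else True
    # Optional seventh item: details for residues WITHOUT equivalents (default False).
    details_without = (parts[6] == "True") if len(parts) > 6 else False
    return {
        "structure": structure,
        "chain1": chain1,
        "chain2": chain2,
        "hhr_file": hhr_file,
        "hit_number": int(hit),
        "details_with": details_with,
        "details_without": details_without,
    }


# A six-item line: the sixth item is given, the seventh falls back to its default.
print(parse_matrix_line("5vvs A G alignment_example.hhr 1 False"))
```

Running a helper like this over each line before starting the pipeline is one way to catch a line with too few or too many items early.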

You can open a text file in Jupyter and directly edit the file to make your table. For the sake of the demonstration, this will be done using code within this notebook found in the cell below.

If it helps, you can think about the columns for each line as the following, using the nomenclature from the first few code cells of the previous notebook, [Report if residues interacting with a specific chain have equivalent residues in an hhsuite-generated alignment](Report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20an%20hhsuite-generated%20alignment.ipynb).

```text
structure structure_chain1 structure_chain2 hhr_file hit_number details_for_those_with_equivalents details_for_those_without_equivalents
```

Where `hhr_file` specifies the HH-suite3 results file, which was `results_S288C_NOP1.hhr` in the original example notebook, and `hit_number` is a descriptive label used here for the fifth required item, the number of the hit in that alignment.

The individual lines make independent reports, and so the entries on different lines don't need to be related in any way. In fact, the last one in the example is not related to the ones above it, which involve chain A (Rpb1p) in PDB entries 5vvs & 3rzo. The one on the last row actually corresponds to the same combination as in [Report if residues interacting with a specific chain have equivalent residues in an hhsuite-generated alignment](Report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20an%20hhsuite-generated%20alignment.ipynb); that line thus shows how a single line could be used to repeat the contents of the bulk of that notebook using snakemake.

**Step #2:** Save the table with the following name, `equiv_check_matrix.txt`. It has to have that name for the table to be recognized and processed to make the Jupyter notebook files with the reports.

Running the following will generate an `equiv_check_matrix.txt` file here with the indicated content; however, you can, and will want to, skip running this if you have already made your own table, because running it will replace your file. Alternatively, you can edit the code below to make a table with the contents that interest you.

```python
s='''5vvs A B alignment_example.hhr 1 True False
5vvs A C alignment_example.hhr 1
5vvs A D alignment_example.hhr 1 True False
5vvs A E alignment_example.hhr 1
5vvs A F alignment_example.hhr 1 True True
5vvs A G alignment_example.hhr 1 False
5vvs A H alignment_example.hhr 1
5vvs A I alignment_example.hhr 1
3rzo A B alignment_example.hhr 1
3rzo A K alignment_example.hhr 1
6zdt A B results_S288C_NOP1.hhr 1
'''
%store s >equiv_check_matrix.txt
```

```text
Writing 's' (str) to file 'equiv_check_matrix.txt'.
```
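Because the cell above replaces any existing `equiv_check_matrix.txt`, a guarded write may be preferable once you have a table of your own. The following is a minimal sketch in plain Python, offered as an alternative to the `%store` approach rather than as part of the pipeline; the function name `write_table_if_missing` is hypothetical.

```python
from pathlib import Path


def write_table_if_missing(table_text, path="equiv_check_matrix.txt"):
    """Write the table to `path` unless a file is already there.

    Returns True if the file was written, False if an existing file was kept.
    Hypothetical helper for illustration.
    """
    target = Path(path)
    if target.exists():
        # Leave a table you already made untouched.
        return False
    target.write_text(table_text)
    return True
```

In the notebook this could be called as `write_table_if_missing(s)`, where `s` is the string defined in the cell above.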


**Step #3:** Get the HH-suite3-generated results files (`*.hhr` files).  
The example `equiv_check_matrix.txt` lists that it needs `alignment_example.hhr` and `results_S288C_NOP1.hhr`. The latter is the same example input data used in [Report if residues interacting with a specific chain have equivalent residues in an hhsuite-generated alignment](Report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20an%20hhsuite-generated%20alignment.ipynb).  
Running the next cell will get those files. You can skip running it if you are providing your own HH-suite3-generated results files (`*.hhr` files).

```python
import os
file_needed = "results_S288C_NOP1.hhr"
if not os.path.isfile(file_needed):
    !curl -OL https://gist.githubusercontent.com/fomightez/cbdca72f5990146170f6d15789234dfb/raw/92f6ad30056b4c930d480b6da69265b0101e0cc6/results_S288C_NOP1.hhr
file_needed = "alignment_example.hhr"
if not os.path.isfile(file_needed):
    !curl -o alignment_example.hhr -L https://gist.githubusercontent.com/fomightez/cbdca72f5990146170f6d15789234dfb/raw/41969f2923da9b9c1ca370d5c41c9cde5e8ce9a4/xalignment_example.hhr
```

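If you are supplying your own `*.hhr` files, it can help to confirm that every file named in the table is actually present before starting the pipeline. The following is a minimal sketch under the assumption that the results file name is the fourth item on each line, as in the example table; the function name `missing_hhr_files` is hypothetical and not part of `equiv_snakefile`.

```python
from pathlib import Path


def missing_hhr_files(table_text, directory="."):
    """Return the results files named in the table that are not in `directory`.

    Hypothetical helper for illustration; assumes the file name is the
    fourth whitespace-separated item on each line.
    """
    needed = []
    for line in table_text.splitlines():
        parts = line.split()
        # Skip blank or short lines; record each file name only once.
        if len(parts) >= 4 and parts[3] not in needed:
            needed.append(parts[3])
    return [name for name in needed if not (Path(directory) / name).is_file()]
```

For example, `missing_hhr_files(Path("equiv_check_matrix.txt").read_text())` returns an empty list when everything the table needs is in the working directory.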


**Step #4:** Run snakemake, pointing it at the corresponding Snakefile, `equiv_snakefile`. It will process the `equiv_check_matrix.txt` file to extract the information and make an individual notebook with the analysis of the interactions for each line. This will be very similar to running the previous notebooks in this series with the items spelled out on each line.  
The file snakemake uses in this pipeline, named `equiv_snakefile`, is already here. Snakefiles use a Python-based syntax, and you can examine the text if you wish.  
The demonstration takes about a minute or less to complete.

```python
!snakemake -s equiv_snakefile --cores 1
```

```text
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                                             count    min threads    max threads
--------------------------------------------  -------  -------------  -------------
all                                                 1              1              1
convert_scripts_to_nb_and_run_using_jupytext       11              1
```

(For those knowledgeable with snakemake, I will note that I set the number of cores to one because I found that with eight a race condition would occasionally ensue, where some of the auxiliary scripts fetched in the course of running the report-generating notebooks would be overwritten while they were being accessed by another notebook, causing failures. Using one core avoids that hazard. I will add, though, that in most cases if you use multiple cores, you can easily get the additional files and a new archive made by running snakemake with your chosen number of cores again. I never saw a race hazard with my clean rule, and so if you want to quickly start over you can run `!snakemake -s equiv_snakefile --cores 8 clean`.)
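The run and clean invocations mentioned here differ only in the core count and the target, so they can be assembled by one small function. The following is a minimal sketch that builds the same command lines for use with `subprocess` instead of the `!` shell escape. The function names are hypothetical, and the optional `-n` flag is Snakemake's dry-run switch, which lists the planned jobs without running them.

```python
import subprocess


def snakemake_command(cores=1, target=None, dry_run=False, snakefile="equiv_snakefile"):
    """Build the argument list for a snakemake call (hypothetical helper)."""
    cmd = ["snakemake", "-s", snakefile, "--cores", str(cores)]
    if dry_run:
        # -n: show what would be done without executing any job.
        cmd.append("-n")
    if target:
        # A target such as the 'clean' rule goes after the options.
        cmd.append(target)
    return cmd


def run_snakemake(**kwargs):
    """Run snakemake with the built command, raising if it exits with an error."""
    return subprocess.run(snakemake_command(**kwargs), check=True)
```

With this, `run_snakemake(cores=1)` corresponds to the single-core run above and `run_snakemake(cores=8, target="clean")` to the reset command.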

**Step #5:** Verify the Jupyter notebooks with the reports were generated.  
If you ran the demo ones, you can click [here](equivalents_report_for_5vvs_A_D_alignment_example.hhr.ipynb) to open one of them. For the others...  
You can go to the dashboard and see the output of running snakemake. To do that, click on the Jupyter logo in the upper left of this notebook. On that page, look in the notebooks directory, where you should see files that begin with `equivalents_report_` and end with `.ipynb`. You can examine some of them to ensure all is as expected.

If things seem to be working and you haven't run your data yet, run `!snakemake -s equiv_snakefile --cores 8 clean` in a cell to reset things, then edit & save `equiv_check_matrix.txt` to have your information, and then run the `!snakemake -s equiv_snakefile --cores 1` step above again.

**Step #6:** If this was anything other than the demonstration run, download the archive containing all the Jupyter notebooks bundled together.  
For ease in downloading, all the created notebooks have been saved as a compressed archive so that you only need to retrieve and keep track of one file. The file you are looking for begins with `equivalents_report_nbs` in front of a date/time stamp and ends with `.tar.gz`. The snakemake run will actually highlight this archive towards the very bottom of the run, following the words 'Be sure to download'.  
**Download that file from this remote, temporary session to your local computer.** You should see this archive file ending in `.tar.gz` on the dashboard. Tick the checkbox next to it to select it and then select `Download` to bring it from the remote JupyterHub session to your computer. If you don't retrieve that file and the session ends, you'll need to re-run to get the results again.
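Before the session ends, it may be worth confirming that every line of the table produced a report notebook, as a scripted version of the verification step. The following is a minimal sketch that assumes the report names follow the pattern of the demo file linked above, `equivalents_report_for_<structure>_<chain1>_<chain2>_<hhr_file>.ipynb`. That pattern is inferred from that single example rather than read from `equiv_snakefile`, and the function names are hypothetical.

```python
from pathlib import Path


def expected_report_name(line):
    """Guess the report notebook name for one table line (assumed pattern)."""
    structure, chain1, chain2, hhr_file = line.split()[:4]
    return f"equivalents_report_for_{structure}_{chain1}_{chain2}_{hhr_file}.ipynb"


def missing_reports(table_text, directory="."):
    """Return expected report notebooks that are not present in `directory`."""
    names = [expected_report_name(line)
             for line in table_text.splitlines() if line.strip()]
    return [name for name in names if not (Path(directory) / name).is_file()]
```

An empty list from `missing_reports(Path("equiv_check_matrix.txt").read_text())` suggests that each line of the table has a matching notebook, provided the assumed naming pattern holds.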

You should be able to unpack that archive using your favorite software for extracting compressed files. If that proves difficult, you can always reopen a session like the one you used to run this series of notebooks, upload the archive, and then run the following command in a Jupyter notebook cell to unpack it:

```bash
!tar xzf equivalents_report_nbs*
```

(If you are running that command on the command line, leave off the exclamation mark.)
You can then examine the files in the session or download the individual Jupyter notebooks similar to the advice on how to download the archive given above.
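If `tar` is not available where you are unpacking, Python's standard library can do the same job. The following is a minimal sketch using the `tarfile` module; the function name `unpack_report_archives` is hypothetical, and the file pattern is the archive naming described above.

```python
import glob
import os
import tarfile


def unpack_report_archives(directory=".", dest="."):
    """Extract every equivalents_report_nbs*.tar.gz found in `directory`.

    Returns the list of archives that were unpacked. Hypothetical helper.
    """
    pattern = os.path.join(directory, "equivalents_report_nbs*.tar.gz")
    unpacked = []
    for archive_path in sorted(glob.glob(pattern)):
        with tarfile.open(archive_path, "r:gz") as archive:
            # Use the safer 'data' extraction filter where this Python has it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            archive.extractall(dest, **kwargs)
        unpacked.append(archive_path)
    return unpacked
```

Calling `unpack_report_archives()` in the directory holding the downloaded archive places the notebooks alongside it.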

In the next notebook in this series, [Making the multiple equivalents reports generated via snakemake clearer by substituting protein names](Making%20the%20multiple%20reports%20generated%20via%20snakemake%20clearer%20by%20adding%20protein%20names.ipynb), I work through how to make the reports more human readable by swapping the chain designations with the actual names of the proteins. This is similar to the approach for making the report more human readable that was discussed at the bottom of the previous notebook, [Report if residues interacting with a specific chain have equivalent residues in an hhsuite-generated alignment](Report%20if%20residues%20interacting%20with%20a%20specific%20chain%20have%20equivalent%20residues%20in%20an%20hhsuite-generated%20alignment.ipynb); however, the find/replace will be done on all the notebooks at once, based on the file name beginning with `equivalents_report_for_` and ending with `.ipynb`.

If this notebook has you interested in learning more about Snakemake as workflow management software, the notebook 'Making multiple interface-reporting dataframes for several structures using snakemake' uses a distinct and simpler workflow to provide more background. It is available when you go [here](https://github.com/fomightez/pdbepisa-binder) and click `launch binder`. That Jupyter notebook also suggests further resources for learning Snakemake.

-----

Please continue on with the next notebook in this series, [Making the multiple equivalents reports generated via snakemake clearer by substituting protein names](Making%20the%20multiple%20reports%20generated%20via%20snakemake%20clearer%20by%20adding%20protein%20names.ipynb).


-----