# Determine residues that match to a reference from MSA and use to construct fit commands

Often one wants to superimpose (a.k.a. overlay/superpose/fit/dock) similar macromolecular structures on one another. To get Pymol's or Jmol's fit/compare commands to work, you need to know the matching 'spine' or 'backbone' atoms. With several structures, some of which may be just fragments, determining the matching atoms by hand can be tedious and sometimes error-prone. This notebook shows how to determine what is there and builds up in steps to generating the fit commands for the same chains shared by different structures.

Because you will often be superimposing a theoretical model on a known structure, or will want to use a specific structure as the example structure, you'll often have a sequence in your multiple sequence alignment that corresponds to a 'reference structure'. The 'reference id' used here refers to the identifier of that sequence.

The fit commands are intended to be used in Pymol and Jmol, with separate sections for each one.

The process starts with a script named `MSA_to_corresponding_residue_numbers.py`. It lays the groundwork for turning an alignment into commands that can be used in molecular structure visualization, and since I didn't have a demo of that script elsewhere yet, I illustrate its use first.

Technically, that script is a 'sequence analysis' script; however, it, like `categorize_residues_based_on_conservation_relative_consensus_line.py` demonstrated in another notebook here, is very useful for bridging sequence analysis to molecular structure analysis as I hope the later part of this notebook illustrates.


----
 
<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

### Preparation

The next cell will retrieve the necessary scripts.

In [3]:
#get scripts to use here
import os
if not os.path.isfile("MSA_to_corresponding_residue_numbers.py"):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/MSA_to_corresponding_residue_numbers.py
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/check_seq_frag_in_MSAclustal_intact_viaFASTA.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29699  100 29699    0     0   169k      0 --:--:-- --:--:-- --:--:--  170k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12393  100 12393    0     0   104k      0 --:--:-- --:--:-- --:--:--  104k


An alignment file is needed as input. The next cell will handle retrieving one.

Details of the retrieved alignment:

FASTA-formatted sequences were downloaded from under 'Sequence' at [here for Stv1p](https://www.yeastgenome.org/locus/S000004658/protein) and [here for Vph1p](https://www.yeastgenome.org/locus/S000005796/protein). These were then combined into one file, the asterisks at the end of each sequence removed (in order to avoid the error `*** WARNING ***  Invalid character '*' in FASTA sequence data, ignored`), combined with two human sequences, and submitted for alignment by [MUSCLE here](https://www.ebi.ac.uk/Tools/msa/muscle/). Default settings were used. The alignment was produced in Clustal format with the consensus symbols line along the bottom.

Furthermore, for purposes of illustration here, **the residues that didn't pair with the start of all were left off. MEANING THAT SEVERAL OF THESE SEQUENCES DON'T REPRESENT THE FULL PROTEIN SEQUENCE.**

In [5]:
# Get an alignment file (actually two because going to show with all residues first)
if not os.path.isfile("alignment.clw"):
    !curl -o alignment.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/6abce38569475c68fa32182c4e0eaadbb8b0cf3b/uw_yeast_plus_two_human_homologs.clw
    !curl -o alignment_with_all_residues.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/6abce38569475c68fa32182c4e0eaadbb8b0cf3b/Stv1p_Vph1p_muscle_alignment.clw

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6139  100  6139    0     0  30695      0 --:--:-- --:--:-- --:--:-- 30542
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3633  100  3633    0     0  20183      0 --:--:-- --:--:-- --:--:-- 20296


In [3]:
#verify have alignment file
!head alignment.clw

CLUSTAL multiple sequence alignment by MUSCLE (3.8)


BAH13127.1      FRSEEMTLA--QLFLQSEAAYCCVSELGELGKVQFRDLNPDVNVFQRKFVNEVRRCEEMD
EAW98433.1      MTATEMRCVGRSFYIHG---------------LSIIKLNQNVSSFQRKFVGEVKRCEELE
VPH1            FRSAEMALV--QFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIRRLDNVE
STV1            FRSADMTYV--QLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLRRFDEVE
                : : :*  .  .:::                 . . .** .:  *** :*.::.* ::::

BAH13127.1      RKLRFVEKEIRKANIPIM------------DTGENPEVPFPRDMI---------------


Note that all but one of the sequences don't begin with methionine. As mentioned above, for purposes of illustration here, **the residues that didn't pair with the start of all were left off. MEANING THAT SEVERAL OF THESE SEQUENCES DON'T REPRESENT THE FULL PROTEIN SEQUENCE.**

You'll want to upload your own alignments to the active Jupyter session in the typical way; if you click the Jupyter logo in the upper right you'll be taken to a dashboard with a file handling user interface. (Or, once it is retrieved, open `alignment.clw` and replace its contents with your own alignment.)

Note: because white space is critical for the consensus symbols line, it is best to save the alignment file directly from EMBL-EBI and use that file as input for this script, rather than copying and pasting from the site. For example, with copy-paste it is easy to miss the spaces on the last consensus symbols line when two sequences mismatch across the entire last row of an alignment.

## Check sequence of edited alignment

In the process of editing a multiple sequence alignment file or collecting portions to align, it is easy to erroneously delete sequence. This section demonstrates using `check_seq_frag_in_MSAclustal_intact_viaFASTA.py` to make sure nothing in the fragment used here has been deleted and to verify that the edited sequence in the alignment file is valid. It checks against a user-provided FASTA file, which ideally comes directly from an 'official' source. It is related to **a similar, yet different script**, `check_seq_in_MSAclustal_consistent_with_FASTA.py`, **that makes sure nothing AT ALL is different/deleted relative to the FASTA sequence**. The script used here, `check_seq_frag_in_MSAclustal_intact_viaFASTA.py`, is not concerned with either the start or the end of the sequence.

`check_seq_frag_in_MSAclustal_intact_viaFASTA.py` is used here because parts of the sequences in the MSA have been deleted. For the `MSA_to_corresponding_residue_numbers.py` script to yield usable data, the fragment that remains in the alignment must be intact, or the numbers won't match the reference structure.
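The idea behind that kind of check can be sketched in a few lines. This is only a minimal illustration of the concept, not the code of `check_seq_frag_in_MSAclustal_intact_viaFASTA.py`; the function name `find_fragment_start` and the toy sequences are made up for the example. It strips the gap characters from one aligned row and asks whether what remains occurs as a single contiguous stretch in the full-length sequence.

```python
def find_fragment_start(aligned_seq, full_seq, gap="-"):
    """Return the 1-based position in full_seq where the de-gapped
    aligned fragment begins, or None if it is not found intact."""
    fragment = aligned_seq.replace(gap, "").upper()
    index = full_seq.upper().find(fragment)  # -1 when there is no contiguous match
    return None if index == -1 else index + 1


# Toy data (hypothetical): the alignment row lacks the first four residues.
full_length = "MAEKFRSAEMALVQFYIPQEISRD"
aligned_row = "FRSAEMALV--QFYIPQEISRD"

start = find_fragment_start(aligned_row, full_length)
print(start)  # 5
```

A result of `None` would suggest residues were lost inside the fragment during editing. The position returned is also the kind of starting residue number that is needed later when a sequence in the alignment doesn't begin at residue 1.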

For the FASTA file, VPH1 and STV1 came from the [Saccharomyces Genome Database (SGD)](https://www.yeastgenome.org/) page for each respective encoding gene. From the page for a particular gene, you can go to the 'Protein' tab (example: [STV1 protein tab](https://www.yeastgenome.org/locus/S000004658/protein)) and then, under 'Sequence' about halfway down the page, click on the button 'Download Sequence (.fsa)' to get a FASTA file for the protein corresponding to that gene.

Now use that FASTA file to check whether the fragment present in the MSA is intact.

In [4]:
!python check_seq_frag_in_MSAclustal_intact_viaFASTA.py alignment.clw VPH1

  File "check_seq_frag_in_MSAclustal_intact_viaFASTA.py", line 256
    if seq infasta_seq:
                     ^
SyntaxError: invalid syntax


Note that the run captured above stopped with a `SyntaxError` at line 256 of the script, so it did not actually perform the check. When the check passes, the sequence of one of the sequences in the manually edited multiple sequence alignment has been confirmed against the official record, verifying that no deletions were introduced into the fragment during editing. It is best to repeat the process for the other sequences as well.

Once the fragment is known to be intact (and the same has been determined for the others), we can continue on to the next section, where the multiple sequence alignment is used to collect residues that correspond to the reference structure.

## Use residue script via command line

The script takes...

#### Display `USAGE`

In [5]:
!python MSA_to_corresponding_residue_numbers.py -h

usage: MSA_to_corresponding_residue_numbers.py [-h]
                                               ALIGNMENT_FILE REF_ID
                                               [N [N ...]]

MSA_to_corresponding_residue_numbers.py takes akes a multiple sequence
alignment in Clustal format and determines the corresponding residue numbers
for aligned residues for a specified reference sequence in the alignment and
the pairing of it with every other aligned sequence in the multiple sequence
alignment. The reference sequence is specified by the identifier at the start
of each line in the sequence blocks. **** Script by Wayne Decatur (fomightez @
github) ***

positional arguments:
  ALIGNMENT_FILE  Name of file of alignment text file (CLUSTAL format).
  REF_ID          Identifier that matches the sequence in the MSA to use as a
                  reference to get matches to other sequences. Presumably,
                  this sequence has a corresponding structural model
                  available.
  

#### Command line use example #1: basic command (tabular text files output)

At minimum, the script needs the alignment file name followed by the identifier of the sequence to use as the reference.

In [6]:
%run MSA_to_corresponding_residue_numbers.py alignment_with_all_residues.clw VPH1 


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...VPH1 reference sequence collected from alignment...query compared...
made match pairings into intervals of start and end where possible...

made dataframe of matched start and ends...DONE.

File 'VPH1_residues_matched_to_STV1.tsv' saved.


See the result:

In [7]:
!head VPH1_residues_matched_to_STV1.tsv

VPH1_start	VPH1_end	STV1_start	STV1_end
2	84	1	83
90	92	84	86
93	107	91	105
108	152	117	161
153	171	167	185
172	186	189	203
187	222	240	275
225	267	276	318
271	325	319	373
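
To make the intervals in that table easier to interpret, here is a minimal sketch of how such start/end pairs can be derived from two gapped rows of an alignment. It is an illustration under assumptions, not the script's actual code: `matched_intervals` is a hypothetical helper, and it treats any column where neither sequence has a gap as a match, whether or not the residues are identical.

```python
def matched_intervals(ref_aln, query_aln, ref_start=1, query_start=1, gap="-"):
    """Pair residue numbers column by column, then merge runs of
    consecutive pairs into (ref_start, ref_end, query_start, query_end)."""
    pairs = []
    ref_pos, query_pos = ref_start, query_start
    for ref_char, query_char in zip(ref_aln, query_aln):
        if ref_char != gap and query_char != gap:
            pairs.append((ref_pos, query_pos))
        # Residue numbers only advance where a residue is actually present.
        if ref_char != gap:
            ref_pos += 1
        if query_char != gap:
            query_pos += 1

    intervals = []
    for ref_num, query_num in pairs:
        last = intervals[-1] if intervals else None
        if last and last[1] == ref_num - 1 and last[3] == query_num - 1:
            last[1], last[3] = ref_num, query_num  # extend the current run
        else:
            intervals.append([ref_num, ref_num, query_num, query_num])
    return [tuple(interval) for interval in intervals]


# Toy rows (hypothetical): a gap in each sequence splits the match in two.
print(matched_intervals("ABC-DE", "AB-XDE"))  # [(1, 2, 1, 2), (4, 5, 4, 5)]
```

A gap in either sequence ends the current interval, which is why the table above is broken into several rows rather than one long span.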


#### Command line use example #2: designate the starting residue shown in the alignment

However, in many cases one or more of the sequences will lack the initial or perhaps the final residues. If the missing residues are at the beginning, an offset value needs to be provided. **If at least one sequence in the multiple sequence alignment doesn't begin with residue number 1 and you want to provide an offset value, a value for each sequence must be provided. Simply put `1` if it starts with the first residue.**

In [8]:
%run MSA_to_corresponding_residue_numbers.py alignment.clw VPH1 5 1 5 5


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...VPH1 reference sequence collected from alignment...query compared...query compared...query compared...
made match pairings into intervals of start and end where possible...

made dataframes of matched start and ends...DONE.

File 'VPH1_residues_matched_to_BAH13127.1.tsv' saved.
File 'VPH1_residues_matched_to_EAW98433.1.tsv' saved.
File 'VPH1_residues_matched_to_STV1.tsv' saved.
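
With several sequences it is easy to put the start values in the wrong order. The sketch below is one way to build the list from a dictionary keyed by identifier. It rests on an assumption: that the values are expected in the order the sequences appear in the alignment, which the example above suggests (EAW98433.1, the one sequence beginning with methionine, is second and gets `1`) but which is worth confirming for your own case. The helper `ids_in_alignment_order` is hypothetical, and the sample text is a shortened stand-in for `alignment.clw`.

```python
def ids_in_alignment_order(clustal_text):
    """Collect sequence identifiers in the order they first appear."""
    ids = []
    for line in clustal_text.splitlines():
        # Skip the header, blank separator lines, and consensus lines
        # (consensus lines begin with white space).
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        identifier = line.split()[0]
        if identifier not in ids:
            ids.append(identifier)
    return ids


sample = """CLUSTAL multiple sequence alignment by MUSCLE (3.8)


BAH13127.1      FRSEEMTLA--QLFLQ
EAW98433.1      MTATEMRCVGRSFYIH
VPH1            FRSAEMALV--QFYIP
STV1            FRSADMTYV--QLYIP
                : : :*  .  .:::

BAH13127.1      RKLRFVEKEIRKANIP
"""

# Start values taken from the command above; anything unlisted defaults to 1.
start_by_id = {"VPH1": 5, "STV1": 5, "BAH13127.1": 5}
ordered_ids = ids_in_alignment_order(sample)
offsets = [start_by_id.get(identifier, 1) for identifier in ordered_ids]
print(ordered_ids)
print(offsets)  # [5, 1, 5, 5]
```

The resulting list can be typed after the reference identifier on the command line or passed along when calling the function from Python.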


Note: if you will be working in Python subsequently, check out the next section, which illustrates importing the main function of the script into a Jupyter notebook or IPython session and passing dataframes back into the notebook directly. This option offers the most ease in using the data in downstream steps.


## Use residue matching function via import

In addition to being run from the command line, the main function can be imported into a Jupyter notebook (or IPython session), where it can pass back dataframe(s) with the results. This section illustrates that.

First you import the function into the notebook or IPython environment.

In [9]:
from MSA_to_corresponding_residue_numbers import MSA_to_corresponding_residue_numbers

That command looks a bit redundant because the first 'from' is addressing the name of the script. The convention / syntax is not to include the extension though. The second part is specifying to import the function `MSA_to_corresponding_residue_numbers()`.

Now that `MSA_to_corresponding_residue_numbers()` is imported, it can be used. As with using the script from the command line, the function has a number of options, including a way to provide a list of starting residue numbers in case one or more sequences in the multiple sequence alignment don't begin with residue number one.

#### function use example #1: basic command



In [11]:
MSA_to_corresponding_residue_numbers("alignment_with_all_residues.clw","VPH1")


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...VPH1 reference sequence collected from alignment...query compared...
made match pairings into intervals of start and end where possible...

made dataframe of matched start and ends...DONE.

File 'VPH1_residues_matched_to_STV1.tsv' saved.


In [12]:
!head VPH1_residues_matched_to_STV1.tsv

VPH1_start	VPH1_end	STV1_start	STV1_end
2	84	1	83
90	92	84	86
93	107	91	105
108	152	117	161
153	171	167	185
172	186	189	203
187	222	240	275
225	267	276	318
271	325	319	373


#### function use example #2: supply start residues

As with the command line version, if any sequence doesn't begin with residue number one, a list of the first residue positions for all sequences must be supplied. When calling the function, use the `supplied_start_pos` argument to supply the list.

In [13]:
MSA_to_corresponding_residue_numbers("alignment.clw","VPH1", supplied_start_pos = [5,1,5,5])


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...VPH1 reference sequence collected from alignment...query compared...query compared...query compared...
made match pairings into intervals of start and end where possible...

made dataframes of matched start and ends...DONE.

File 'VPH1_residues_matched_to_BAH13127.1.tsv' saved.
File 'VPH1_residues_matched_to_EAW98433.1.tsv' saved.
File 'VPH1_residues_matched_to_STV1.tsv' saved.


In [14]:
!head VPH1_residues_matched_to_STV1.tsv

VPH1_start	VPH1_end	STV1_start	STV1_end
5	80	5	80
86	88	81	83
89	103	88	102
104	148	114	158
149	167	164	182
168	182	186	200
183	218	237	272
221	263	273	315
267	321	316	370


#### function use example #3: return dataframes

For those wishing to use this data in Python, a dictionary of dataframes can be returned by setting `return_dfs` to `True`. The identifier of the query sequence in each match will be the key. The reference identifier is returned as well so that it is available as a variable for subsequent use.

The next cell demonstrates getting the dataframes in as active objects in the notebook environment.

In [20]:
from IPython.display import display, HTML
ref_id, dfs_by_id = MSA_to_corresponding_residue_numbers("alignment.clw","VPH1", return_dfs = True , supplied_start_pos = [5,1,5,5])
for id_,df in dfs_by_id.items():
    print(id_)
    display(df.head())

BAH13127.1



**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...VPH1 reference sequence collected from alignment...query compared...query compared...query compared...
made match pairings into intervals of start and end where possible...

made dataframes of matched start and ends...DONE.


Unnamed: 0,VPH1_start,VPH1_end,BAH13127.1_start,BAH13127.1_end
0,5,80,5,80
1,89,103,81,95
2,126,136,96,106
3,160,166,107,113
4,169,448,114,393


EAW98433.1


Unnamed: 0,VPH1_start,VPH1_end,EAW98433.1_start,EAW98433.1_end
0,5,13,1,9
1,14,19,12,17
2,35,80,18,63
3,89,103,64,78
4,105,148,79,122


STV1


Unnamed: 0,VPH1_start,VPH1_end,STV1_start,STV1_end
0,5,80,5,80
1,86,88,81,83
2,89,103,88,102
3,104,148,114,158
4,149,167,164,182


The final section of the notebook, below, builds on the output dataframes to produce molecular visualization commands for use in fitting the matching residues in structures using Pymol and/or Jmol.

## Using the matched residues to make Pymol fit commands

This section builds on the dataframes output to produce molecular visualization commands for use in Pymol. Information on structure being used:
[6C6L: Yeast Vacuolar ATPase Vo in lipid nanodisc](http://www.rcsb.org/structure/6C6L). VPH1 is Chain A

First we'll run the command to make sure the related dataframe is in the namespace of the notebook.

In [6]:
from MSA_to_corresponding_residue_numbers import MSA_to_corresponding_residue_numbers
ref_id, dfs_by_id = MSA_to_corresponding_residue_numbers("alignment.clw","VPH1", return_dfs = True , supplied_start_pos = [5,1,5,5])


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...VPH1 reference sequence collected from alignment...query compared...query compared...query compared...
made match pairings into intervals of start and end where possible...

made dataframes of matched start and ends...DONE.


Now the dataframes can be used to make commands based on the identifiers and matching residue numbering.

Some additional data needs to be provided because the corresponding chains haven't been specified yet. For example, the model made of STV1 has the chain designation 'A'. Even though the structure [PDB id: 6C6L](http://www.rcsb.org/structure/6C6L) is quite complex, with fifteen chains (eight unique), VPH1 is designated chain 'A' for reasons related to the subunit nomenclature.

It won't always be this simple. One model might have a protein of interest as chain 'I', while a homolog is chain 'K' in another.

Additionally, the PDB id codes, and/or the identifiers used in the corresponding Pymol session for structures not in the PDB, must be incorporated prior to constructing the commands as well.
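
Before building selections, it may be worth confirming that every matched interval spans the same number of residues on the reference side and the query side, since Pymol's `pair_fit` expects the two selections to contain the same number of atoms in matching order. The sketch below is a hedged illustration: `unequal_intervals` is a hypothetical helper, and the rows are the first three from the STV1 dataframe shown earlier. With a real dataframe, `df.itertuples(index=False)` would supply rows in the same form.

```python
def unequal_intervals(rows):
    """Return the rows whose reference and query spans differ in length."""
    return [
        (ref_start, ref_end, query_start, query_end)
        for ref_start, ref_end, query_start, query_end in rows
        if (ref_end - ref_start) != (query_end - query_start)
    ]


# First three VPH1/STV1 intervals from the dataframe output above.
rows = [(5, 80, 5, 80), (86, 88, 81, 83), (89, 103, 88, 102)]
problems = unequal_intervals(rows)
print(problems)  # []
```

An empty list only says the numbering is consistent. Residues that are unresolved in one of the actual structures would still leave the two selections with different atom counts, so that needs to be checked against the structures themselves.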

In [26]:
# further prep for generating commands
ref_id_chain = 'A' # for VPH1, from http://www.rcsb.org/structure/6C6L
query_chains_by_id = {'BAH13127.1':'G',
                      'EAW98433.1':'H',
                      'STV1':'A'}
ref_chain_structure = "6C6L"
structures_by_id = {'BAH13127.1':'BAH1',
                      'EAW98433.1':'EAW9',
                      'STV1':'STV1_model'}

# Convert each dataframe to something similar to `ranges_o_residues_resolved_per_chain_pairs` from the notebook
# `Using Biopython PDB module to list resolved residues and construct fit commands`.
# Example of `ranges_o_residues_resolved_per_chain_pairs` from that notebook to base it on:
# `{('A', 'F'): ['4:121', '129:158'], ('B', 'G'): ['14:104', '125:140']}`
# However, there a single list of ranges served both chains of a pair, so it is more complex here in that regard!
# Since residue positions are needed for both chains, it will look more like this for
# one reference-query chain pair:
# `{('A', 'F'): (['4:121', '129:158'],['x:xxx', 'xxx:xxx'])}`
# And since there should be one of these for each pairing, each dictionary is stored
# in a dictionary with the query id as the key, like so (only two shown as an example):
# `{'BAH13127.1': {('A', 'G'): (['4:121', '129:158'],['x:xxx', 'xxx:xxx'])}, 'STV1':{('A', 'A'): (['14:104', '125:140'],['x:xxx', 'xxx:xxx'])}}`
matched_ranges_o_residues_per_chain_pairs_by_id = {}
for id_,df in dfs_by_id.items():
    matched_ranges_o_residues_per_chain_pairs = {}
    chain_pair = (ref_id_chain,query_chains_by_id[id_])
    #iterate through the rows of the dataframe making the two lists
    ref_list = []
    query_list = []
    for row in df.itertuples(index=False):
        ref_list.append(f'{row[0]}:{row[1]}')
        query_list.append(f'{row[2]}:{row[3]}')
    matched_ranges_o_residues_per_chain_pairs[chain_pair] = (ref_list,query_list)
    matched_ranges_o_residues_per_chain_pairs_by_id[id_] = matched_ranges_o_residues_per_chain_pairs

# form commands
formatted_commands = ""
for id_,matched_ranges_per_ch_p in matched_ranges_o_residues_per_chain_pairs_by_id.items():
    selection_pairs = [] #to collect two item tuple elements, each item being a name
    for chain_pair in matched_ranges_per_ch_p:
        structures = (ref_chain_structure, structures_by_id[id_])
        selection_name_pair = []
        for indx,chain in enumerate(chain_pair):
            sel_nom = f"{structures[indx]}ch{chain}CA" #`CA` at end stands for `name CA` / alpha-carbon
            formatted_commands += f"select {sel_nom},"
            formatted_commands += "|".join( f" {structures[indx]} and resid {pos_range} and chain {chain} and name CA " 
                                           for pos_range in matched_ranges_per_ch_p[chain_pair][indx] ) 
            formatted_commands += "\n"
            selection_name_pair.append(sel_nom)
        selection_pairs.append(selection_name_pair)
    first_structure_selections = "|".join(f" {x[0]} " for x in selection_pairs)
    second_structure_selections = "|".join(f" {x[1]} " for x in selection_pairs)
    formatted_commands += f"pair_fit ( {first_structure_selections}), ( {second_structure_selections})"
    formatted_commands += "\n\n"
        
print("\n\n")
print("FORMATTED PYMOL COMMANDS:")
print(" ")
print(formatted_commands)




FORMATTED PYMOL COMMANDS:
 
select 6C6LchACA, 6C6L and resid 5:80 and chain A and name CA | 6C6L and resid 89:103 and chain A and name CA | 6C6L and resid 126:136 and chain A and name CA | 6C6L and resid 160:166 and chain A and name CA | 6C6L and resid 169:448 and chain A and name CA | 6C6L and resid 449:491 and chain A and name CA | 6C6L and resid 492:507 and chain A and name CA | 6C6L and resid 508:614 and chain A and name CA | 6C6L and resid 615:656 and chain A and name CA | 6C6L and resid 657:671 and chain A and name CA | 6C6L and resid 672:693 and chain A and name CA | 6C6L and resid 703:819 and chain A and name CA | 6C6L and resid 826:834 and chain A and name CA 
select BAH1chGCA, BAH1 and resid 5:80 and chain G and name CA | BAH1 and resid 81:95 and chain G and name CA | BAH1 and resid 96:106 and chain G and name CA | BAH1 and resid 107:113 and chain G and name CA | BAH1 and resid 114:393 and chain G and name CA | BAH1 and resid 395:437 and chain G and name CA | BAH1 and resi
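
Since the generated commands are long, it can be handier to save them to a file than to copy them from the notebook output. The sketch below is a minimal example under assumptions: `save_commands` and the file name `fit_commands.pml` are made up here, and `example_commands` is a shortened stand-in for the `formatted_commands` string built above.

```python
from pathlib import Path


def save_commands(commands, filename="fit_commands.pml"):
    """Write the command text to a file, ending with a single newline."""
    path = Path(filename)
    path.write_text(commands.rstrip() + "\n")
    return path


# Shortened stand-in for `formatted_commands` (one range per selection).
example_commands = (
    "select 6C6LchACA, 6C6L and resid 5:80 and chain A and name CA \n"
    "select STV1_modelchACA, STV1_model and resid 5:80 and chain A and name CA \n"
    "pair_fit ( 6C6LchACA ), ( STV1_modelchACA )\n\n"
)

saved_path = save_commands(example_commands)
print(saved_path)
```

As far as I know, a file of Pymol commands like this can then be run inside a Pymol session, once the structures are loaded under the matching object names, by entering `@fit_commands.pml` on the Pymol command line.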


## Using the matching residue information to make Jmol commands

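A similar process can be used to make Jmol / JSmol commands for the `compare` command, Jmol/JSmol's equivalent of Pymol's fit commands. Note that the cells below have not been run here: they rely on `categorize_residues_based_on_conservation_relative_consensus_line.py`, which is not among the scripts retrieved in the Preparation section, and as written they build per-residue `select`, `spacefill`, and `color` commands rather than `compare` commands.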

In [None]:
#JMOL
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[160,31,95]"
strong_siml_color = "[239, 120, 160]"
weak_siml_color = "[242, 200, 224]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select {res}:{chain};"
            building_output += f"spacefill on;"
            building_output += f"color {color_dict[category]}\n"
        #save and reset output string
        %store building_output > {category}_jmol_commands.txt
        building_output = ""
# Comment out the above line & uncomment the next line if you want to save as single file
#%store building_output > all_jmol_commands.txt

Jmol commands all in a single file:

In [None]:
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)

In [None]:
#JMOL
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[160,31,95]"
strong_siml_color = "[239, 120, 160]"
weak_siml_color = "[242, 200, 224]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select {res}:{chain};"
            building_output += f"spacefill on;"
            building_output += f"color {color_dict[category]}\n"
%store building_output > all_jmol_commands.txt

Showing that worked:

In [None]:
!head all_jmol_commands.txt
!echo " "
!tail all_jmol_commands.txt

------