# Determine residues that match to a reference from MSA and use to construct fit commands

Often one wants to superimpose (a.k.a. overlay/superpose/fit/dock) similar macromolecular structures on one another. You need to know the matching 'spine' or 'backbone' atoms to get Pymol's or Jmol's fit/compare commands to work. With several structures, some of which may be just fragments, determining the matching atoms by hand can be tedious and error-prone. This notebook shows how to determine what is there and builds up in steps to generating the fit commands for the same chains shared by different structures.

Because you will often be superimposing a theoretical model on a known structure, or want to use a specific structure as the example structure, you'll often have a sequence in your multiple sequence alignment that corresponds to a 'reference structure'. The 'reference id' used here refers to the identifier of that sequence.

The fit commands are intended to be used in Pymol and Jmol, with separate sections for each one.

The process relies initially on use of a script entitled `MSA_to_corresponding_residue_numbers.py`. Because that lays the groundwork for the process of creating commands from an alignment that can be used on the molecular structure visualization, and I didn't have a demo of that script yet elsewhere, I illustrate use of that script first. 

Technically, that script is a 'sequence analysis' script; however, it, like `categorize_residues_based_on_conservation_relative_consensus_line.py` demonstrated in another notebook here, is very useful for bridging sequence analysis to molecular structure analysis as I hope the later part of this notebook illustrates.


----
 
<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

### Preparation

The next cell will retrieve the necessary scripts.

In [1]:
#get scripts to use here
!curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/MSA_to_corresponding_residue_numbers.py
!curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/check_seq_frag_in_MSAclustal_intact_viaFASTA.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26493  100 26493    0     0   256k      0 --:--:-- --:--:-- --:--:--  256k


An alignment file is needed as input. The next cell will handle retrieving one.

Details of the retrieved alignment:

FASTA-formatted sequences were downloaded from under 'Sequence' [here for Stv1p](https://www.yeastgenome.org/locus/S000004658/protein) and [here for Vph1p](https://www.yeastgenome.org/locus/S000005796/protein). These were then combined into one file, the asterisks at the end of each sequence were removed (in order to avoid the error `*** WARNING ***  Invalid character '*' in FASTA sequence data, ignored`), the result was combined with two human sequences, and the set was submitted for alignment by [MUSCLE here](https://www.ebi.ac.uk/Tools/msa/muscle/). Default settings were used. The alignment was produced in Clustal format with the consensus symbols line along the bottom.

Furthermore, for purposes of illustration here, **the residues that didn't pair with the start of all sequences were left off. MEANING THAT SEVERAL OF THESE SEQUENCES DON'T REPRESENT THE FULL PROTEIN SEQUENCE.**
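
The asterisk removal mentioned above can be done by hand, but a few lines of Python will do it too. This is only a minimal sketch of one way to go about it; the function name `strip_terminal_asterisks` and the demo records are made up for illustration and are not part of the scripts used in this notebook.

```python
def strip_terminal_asterisks(fasta_text):
    """Return FASTA text with '*' removed from sequence lines (headers untouched)."""
    cleaned_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned_lines.append(line)                    # header line: keep as-is
        else:
            cleaned_lines.append(line.replace("*", ""))   # drop stop-codon markers
    return "\n".join(cleaned_lines) + "\n"

# Made-up demo records, just to show the effect
example = ">STV1 demo\nMNQEEAIF\nRSADMT*\n>VPH1 demo\nMAEKEEAIF*\n"
cleaned = strip_terminal_asterisks(example)
print(cleaned)
```

To use it on a real download, read the `.fsa` file into a string, pass it through the function, and write the result to the file you will submit for alignment.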

In [2]:
# Get an alignment file (actually two because going to show with all residues first)
!curl -o alignment.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/6abce38569475c68fa32182c4e0eaadbb8b0cf3b/uw_yeast_plus_two_human_homologs.clw
!curl -o alignment_with_all_residues.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/6abce38569475c68fa32182c4e0eaadbb8b0cf3b/Stv1p_Vph1p_muscle_alignment.clw

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3633  100  3633    0     0  30275      0 --:--:-- --:--:-- --:--:-- 30275


In [3]:
#verify have alignment file
!head alignment.clw

CLUSTAL multiple sequence alignment by MUSCLE (3.8)


STV1            -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR
VPH1            MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR
                  ::********:*: **:*** *: *: :: **::.:. . ***..: **** :**::*

STV1            RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE
VPH1            RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI---------
                *:*:***   :: .:::**  : :     **:    * ..::  *  ..:*         


Note that all but one of the sequences don't begin with methionine. As stated above, for purposes of illustration, **the residues that didn't pair with the start of all sequences were left off, so several of these sequences don't represent the full protein sequence.**

You'll want to upload your own alignments to the active Jupyter session in the typical way; if you click the Jupyter logo in the upper left, you'll be taken to a dashboard with a file handling user interface. (Or, once it is retrieved, open `alignment.clw` and replace its contents with your own alignment.)

Note, because white space is critical for the consensus symbols line, it is best to save the alignment file directly from EMBL-EBI and use that file as input for this script, rather than doing copy-paste from the site. For example, via copy-paste it may be easy to miss the spaces on the last line of the consensus symbols line in the case of two sequences that mismatch for the span of the entire last row of an alignment.
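
If you suspect that copy-paste has eaten some of that white space, a quick check is possible before running anything else. The sketch below is an assumption-laden illustration, not something the scripts here do: it assumes a Clustal layout in which each block of sequence lines is followed directly by its consensus line, and the function name `check_consensus_widths` and the tiny demo alignment are made up.

```python
def check_consensus_widths(clustal_text):
    """Return (block_number, message) tuples for blocks whose consensus line is missing or short."""
    problems = []
    seq_lines = []        # sequence lines of the block currently being read
    block_number = 0
    # Skip the 'CLUSTAL ...' header; the appended "" makes sure the last block is checked.
    for line in clustal_text.splitlines()[1:] + [""]:
        if line and not line[0].isspace():
            seq_lines.append(line)      # a line starting with an identifier
            continue
        if not seq_lines:
            continue                    # spacer line between blocks
        # The first line after a run of sequence lines should be the consensus line.
        block_number += 1
        identifier, residues = seq_lines[0].split()[:2]
        expected = seq_lines[0].index(residues, len(identifier)) + len(residues)
        if len(line) < expected:
            problems.append(
                (block_number, f"consensus line is {len(line)} characters, expected {expected}")
            )
        seq_lines = []
    return problems

# Made-up two-block alignment; the second block mismatches across its whole row
header = "CLUSTAL multiple sequence alignment by MUSCLE (3.8)"
block_1 = ["seqA".ljust(16) + "MKT", "seqB".ljust(16) + "MRT", " " * 16 + "*:*"]
block_2 = ["seqA".ljust(16) + "AAA", "seqB".ljust(16) + "GGG", " " * 19]
intact = "\n".join([header, "", *block_1, "", *block_2]) + "\n"
pasted = "\n".join(line.rstrip() for line in intact.splitlines()) + "\n"  # simulates lost spaces

print(check_consensus_widths(intact))
print(check_consensus_widths(pasted))
```

An empty list means every consensus line spans the full width of its block; anything reported is a block worth re-downloading rather than re-pasting.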

## Check sequence of edited alignment

In the process of editing a multiple sequence alignment file or collecting portions to align, it is easy to erroneously delete sequence. This section demonstrates using `check_seq_frag_in_MSAclustal_intact_viaFASTA.py` to make sure nothing in the fragment used here has been deleted, and to verify that the edited sequence in the alignment file is valid. It checks against a user-provided FASTA, which should preferably come directly from an 'official' source. It is related to **a similar, yet different script**, `check_seq_in_MSAclustal_consistent_with_FASTA.py`, **that makes sure nothing AT ALL is different/deleted relative to the FASTA sequence**. The `check_seq_frag_in_MSAclustal_intact_viaFASTA.py` script used here is not concerned with either the start or the end of the sequence.

The reason that `check_seq_frag_in_MSAclustal_intact_viaFASTA.py` is used here is because part of the sequences in the MSA to be used here have been deleted and in order for the `MSA_to_corresponding_residue_numbers.py` script to yield useable data it is important that the fragment it is aligned to is intact or the numbers won't match the reference structure.

For the FASTA file, VPH1 and STV1 came from the [Saccharomyces Genome Database (SGD)](https://www.yeastgenome.org/) page for each respective encoding gene. From the page for a particular gene, you can go to the 'Protein' tab (example: [STV1 protein tab](https://www.yeastgenome.org/locus/S000004658/protein)) and then, under 'Sequence' about halfway down the page, click the 'Download Sequence (.fsa)' button to get a FASTA file for the protein corresponding to that gene.

Now use that FASTA to check whether the fragment present in the MSA is intact.

In [None]:
!python check_seq_frag_in_MSAclustal_intact_viaFASTA.py alignment.clw VPH1

Yay! One of the sequences in the manually-edited multiple sequence alignment has been confirmed against the official record, verifying no deletions were introduced into the fragment during editing. It is best to repeat the process with the others as well.

Now that we know that the fragment is intact (and could determine for others) we can continue on to next section where the multiple sequence alignment is used to collect residues that correspond to the reference structure.
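
To make the idea behind that check concrete, here is a minimal sketch of the underlying test: strip the gaps from the aligned row and see whether what remains occurs as one contiguous piece of the full-length sequence. This is not the code of `check_seq_frag_in_MSAclustal_intact_viaFASTA.py`; the function name `locate_fragment` and the short sequences are made up for illustration. As a bonus, the position it reports is the kind of start residue number that gets supplied later when a sequence doesn't begin at residue 1.

```python
def locate_fragment(aligned_seq, full_seq, gap="-"):
    """Return the 1-based start of the ungapped aligned sequence in full_seq, or None if not intact."""
    fragment = aligned_seq.replace(gap, "")   # alignment rows contain gap characters
    index = full_seq.find(fragment)           # -1 when the fragment is not contiguous in full_seq
    return None if index == -1 else index + 1

full_seq = "MAEKEEAIFRSAEMALVQFYIP"   # made-up stand-in for an official record
aligned_row = "EEAIF--RSAEMAL"        # fragment as it might appear in an alignment row

print(locate_fragment(aligned_row, full_seq))        # → 5
print(locate_fragment("EEAIF--RAEMAL", full_seq))    # → None (an internal residue was deleted)
```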

## Use residue script via command line

The script takes...

#### Display `USAGE`

In [4]:
!python MSA_to_corresponding_residue_numbers.py -h

usage: categorize_residues_based_on_conservation_relative_consensus_line.py
       [-h] [-og {single,separate}] [-ot {tabular_text,panel_data}]
       ALIGNMENT_FILE ID

categorize_residues_based_on_conservation_relative_consensus_line.py takes an
multiple sequence alignment (in CLUSTAL format) that has a consensus line, say
from MUSCLE at https://www.ebi.ac.uk/Tools/msa/muscle/, and for a specific
sequence in the alignment categorizes the residues that are identical,
strongly , similar, weakly similar, or unconserved in the alignment. ****
Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  ALIGNMENT_FILE        Name of file of alignmnet text file to use to
                        categorize residues.
  ID                    Identifier that has the residues to categorize
                        relative consensus symbols.

optional arguments:
  -h, --help            show this help message and exit
  -og {single,separate}, --output_grouping {single,separate}
     

#### Command line use example #1: basic command (tabular text files output)

At minimum, the script needs the alignment file name followed by the identifier of the sequence to use as the reference.

In [5]:
%run MSA_to_corresponding_residue_numbers.py alignment_with_all_residues.clw VPH1 


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.


See the result:

In [6]:
!head VPH1_residues_matched_to_STV1.tsv

category	residue_positions
identical	5,6,7,8,9,10,11,12,14,17,18,20,21,22,24,27,33,34,43,44,45,50,51,52,53,56,57,60,61,63,65,66,67,78,79,90,91,93,101,107,109,111,118,120,123,129,136,137,138,142,151,155,158,166,170,176,188,189,191,193,195,196,199,202,203,204,205,207,208,209,210,211,213,218,219,221,229,233,236,237,239,241,242,247,255,256,271,278,283,284,288,290,291,294,295,298,299,302,306,309,315,316,317,323,324,325,335,336,337,338,339,340,342,345,346,349,350,353,357,360,361,370,373,374,377,378,379,381,382,383,384,385,386,387,390,391,392,393,395,397,398,399,400,402,404,405,406,407,408,409,411,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,434,436,437,439,440,441,442,443,445,449,451,453,454,455,456,457,458,459,460,461,462,463,466,467,468,469,471,472,474,475,476,478,479,480,481,482,483,484,485,487,488,489,490,491,492,493,494,496,497,502,503,504,505,506,508,512,514,515,517,518,519,521,522,523,524,525,527,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,54

#### Command line use example #2: designate the starting residue shown in the alignment

However, in many cases one or more of the sequences won't have the initial, or perhaps the final, residues. If the missing residues are at the beginning, an offset value needs to be provided. **If at least one sequence in the multiple sequence alignment doesn't begin with residue number 1 and you want to provide an offset value, a value for each sequence must be provided. Simply put `1` if it starts with the first residue.**

In [7]:
%run MSA_to_corresponding_residue_numbers.py alignment.clw VPH1 5 1 5 5


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data saved in tabular text form (tab-separated form) as:
       'not_conserved_VPH1_residues.tsv'.

       'strongly_similar_VPH1_residues.tsv'.

       'identical_VPH1_residues.tsv'.

       'weakly_similar_VPH1_residues.tsv'.


Note that if you will subsequently be working in Python, you should check out the next section, which illustrates importing the main function of the script into a Jupyter notebook or IPython session and passing dataframes back into the notebook directly. This option offers the most ease when using the data in downstream steps.


## Use residue matching function via import

In addition to being run from the command line, the main function can be imported into a Jupyter notebook (or IPython session), where it can pass back dataframe(s) with the results. This section illustrates that.

First you import the function into the notebook or IPython environment.

In [13]:
from MSA_to_corresponding_residue_numbers import MSA_to_corresponding_residue_numbers

That command looks a bit redundant because the first 'from' is addressing the name of the script. The convention / syntax is not to include the extension though. The second part is specifying to import the function `MSA_to_corresponding_residue_numbers()`.

Now that `MSA_to_corresponding_residue_numbers()` is imported, it can be used. As with using the script from the command line, the function has a number of options, including a way to provide a list of starting residue positions in case one or more sequences in the multiple sequence alignment don't begin with residue number one.

#### function use example #1: basic command



In [14]:
MSA_to_corresponding_residue_numbers("alignment_with_all_residues.clw","VPH1")


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.


In [2]:
!head VPH1_residues_matched_to_STV1.tsv

VPH1_start	VPH1_end	STV1_start	STV1_end
5	80	5	80
86	88	81	83
89	103	88	102
104	148	114	158
149	167	164	182
168	182	186	200
183	218	237	272
221	263	273	315
267	321	316	370
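
To give a feel for where start/end pairs like those come from, here is a minimal sketch of the general idea: walk the two aligned rows column by column, count residues in each, and record every stretch where neither row has a gap. This is my own illustration under simple assumptions (two rows, `-` as the gap character) and not the implementation inside `MSA_to_corresponding_residue_numbers.py`; the name `matched_ranges` and the toy rows are made up.

```python
def matched_ranges(ref_row, query_row, ref_start=1, query_start=1, gap="-"):
    """Return [(ref_first, ref_last, query_first, query_last), ...] for gap-free aligned stretches."""
    ranges = []
    ref_pos, query_pos = ref_start - 1, query_start - 1   # number of the last residue seen
    current = None                                        # stretch currently being extended
    for ref_char, query_char in zip(ref_row, query_row):
        if ref_char != gap:
            ref_pos += 1
        if query_char != gap:
            query_pos += 1
        if ref_char != gap and query_char != gap:
            if current is None:
                current = [ref_pos, ref_pos, query_pos, query_pos]
            else:
                current[1], current[3] = ref_pos, query_pos
        elif current is not None:
            ranges.append(tuple(current))                 # a gap closes the stretch
            current = None
    if current is not None:
        ranges.append(tuple(current))
    return ranges

ref_row = "MKT--AYIAK"      # toy rows, not taken from the real alignment
query_row = "MRTGG-YLAK"
print(matched_ranges(ref_row, query_row))                # → [(1, 3, 1, 3), (5, 8, 6, 9)]
print(matched_ranges(ref_row, query_row, ref_start=5))   # → [(5, 7, 1, 3), (9, 12, 6, 9)]
```

The second call shows the effect of a start position other than 1: the reference numbering shifts while the query numbering stays put.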


#### function use example #2: supply start residues

As with the command line version, if any sequence doesn't begin with residue number one, a list of the first residue positions for all sequences must be supplied. When calling the function, use the `supplied_start_pos` argument to supply the list.

In [None]:
MSA_to_corresponding_residue_numbers("alignment.clw","VPH1", supplied_start_pos = [5,1,5,5])

In [1]:
!head VPH1_residues_matched_to_STV1.tsv

VPH1_start	VPH1_end	STV1_start	STV1_end
5	80	5	80
86	88	81	83
89	103	88	102
104	148	114	158
149	167	164	182
168	182	186	200
183	218	237	272
221	263	273	315
267	321	316	370
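
If you'd rather pick the saved table up outside of pandas, the standard library is enough. The sketch below assumes the file looks like the output shown above (a header row followed by four tab-separated integer columns); the helper name `read_matched_ranges` is made up, and the in-memory demo text simply repeats the first two rows displayed above.

```python
import csv
import io

def read_matched_ranges(handle):
    """Read a tab-separated table of range boundaries into (header, list of int tuples)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)                                  # e.g. VPH1_start, VPH1_end, ...
    rows = [tuple(int(value) for value in row) for row in reader if row]
    return header, rows

# In-memory stand-in for the saved file
demo = io.StringIO("VPH1_start\tVPH1_end\tSTV1_start\tSTV1_end\n5\t80\t5\t80\n86\t88\t81\t83\n")
header, rows = read_matched_ranges(demo)
print(header)
print(rows)
```

With the real file, open it with `open("VPH1_residues_matched_to_STV1.tsv", newline="")` and pass the handle to the same function.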


#### function use example #3: return dataframes

For those wishing to use this data in Python, a dictionary of dataframes can be returned by setting `return_dfs` to `True`. The identifier of the query sequence in each match will be the key. The reference identifier is returned as well so that it is available as a variable for subsequent use.

The next cell demonstrates getting the dataframes in as active objects in the notebook environment.

In [16]:
ref_id, dfs_by_id = MSA_to_corresponding_residue_numbers("alignment.clw","VPH1", return_dfs = True , supplied_start_pos = [5,1,5,5])
for id_ in dfs_by_id:
    print(id_, dfs_by_id[id_].head())  # look up each dataframe by its query identifier


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

Unnamed: 0,category,residue_positions
0,identical,"[5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 18, 20, 21..."
1,strongly_similar,"[3, 4, 13, 15, 19, 25, 28, 30, 31, 35, 36, 38,..."
2,weakly_similar,"[37, 39, 41, 46, 47, 74, 95, 96, 104, 105, 112..."
3,not_conserved,"[1, 2, 16, 23, 26, 29, 32, 40, 42, 49, 54, 68,..."


The final section of the notebook, below, builds on the output of dataframes to produce molecular visualization commands for use in fitting the matching residues in structures using Pymol and/or Jmol.

## Using the matched residues to make Pymol fit commands

This section builds on the dataframes output to produce molecular visualization commands for use in Pymol. Information on structure being used:
[6C6L: Yeast Vacuolar ATPase Vo in lipid nanodisc](http://www.rcsb.org/structure/6C6L). VPH1 is Chain A

First we'll run the command to make sure the related dataframe is in the namespace of the notebook.

In [20]:
from MSA_to_corresponding_residue_numbers import MSA_to_corresponding_residue_numbers
ref_id, dfs_by_id = MSA_to_corresponding_residue_numbers("alignment.clw","VPH1", return_dfs = True , supplied_start_pos = [5,1,5,5])


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

Now the dataframes can be used to make commands based on the identifiers and matching residue numbering.

Some additional data needs to be provided because the equivalent chains weren't yet provided. For example, the model made of STV1 has chain designation of 'A'. Even though the structure [PDB id: 6C6L](http://www.rcsb.org/structure/6C6L) is quite complex with fifteen chains (eight unique), VPH1 is designated chain 'A' for reasons related to the subunit nomenclature.

It won't always be this simple. One model might have a protein of interest as chain 'I', while a homolog is chain 'K' in another.

In [None]:
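# further prep for generating commands
ref_id_chain = 'A' # for VPH1, from http://www.rcsb.org/structure/6C6L
query_chains_by_id = {'BAH13127.1':'G',
                      'EAW98433.1':'H',
                      'STV1':'A'}

# generate commands
# NOTE: `structures` and `ranges_o_residues_resolved_per_chain_pairs` are defined in the
# code cell of the next section; run that cell first or this loop raises a NameError.
formatted_commands = ""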
for chain_pair in ranges_o_residues_resolved_per_chain_pairs:
    for indx,chain in enumerate(chain_pair):
        formatted_commands += f"select {structures[indx].id}ch{chain}CA," #`CA` at end stands for `name CA` / alpha-carbon
        formatted_commands += "|".join( f" {structures[indx].id} and resid {pos_range} and chain {chain} and name CA " 
                                       for pos_range in ranges_o_residues_resolved_per_chain_pairs[chain_pair] ) 
        #for pos_range in ranges_o_residues_resolved_per_chain_pairs[chain_pair]:
         #   formatted_commands += f"{structures[indx].id} and resid {pos_range} and chain {chain} and name CA"
        formatted_commands += "\n"

# residues_resolved_per_chain_b
shared_positions_per_chain_pairs
ranges_o_residues_resolved_per_chain_pairs
print("\n\n")
print ("FORMATTED PYMOL COMMANDS:")
print(" ")
print(formatted_commands)
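
As a bridge from the matched ranges to something Pymol can act on, here is a minimal sketch that turns range rows into two named alpha-carbon selections plus a `pair_fit` line. Several things here are assumptions on my part: the rows are laid out as (reference start, reference end, query start, query end) like the table shown earlier, the object name `stv1_model` is hypothetical, and the function name `pymol_fit_commands` is made up. Keep in mind that `pair_fit` pairs atoms in order and needs the same number of atoms in both selections, so residues that are unresolved in either structure would have to be removed from the ranges first.

```python
def pymol_fit_commands(ranges, ref_obj, ref_chain, query_obj, query_chain):
    """Build Pymol select + pair_fit commands from (ref_first, ref_last, query_first, query_last) rows."""
    def selection(obj, chain, spans):
        # Pymol's resi accepts lists of ranges joined with '+', e.g. 5-80+86-88
        resi = "+".join(f"{first}-{last}" for first, last in spans)
        return f"{obj} and chain {chain} and name CA and resi {resi}"

    ref_spans = [(row[0], row[1]) for row in ranges]
    query_spans = [(row[2], row[3]) for row in ranges]
    ref_name = f"{ref_obj}_ch{ref_chain}_CA"
    query_name = f"{query_obj}_ch{query_chain}_CA"
    return "\n".join([
        f"select {ref_name}, {selection(ref_obj, ref_chain, ref_spans)}",
        f"select {query_name}, {selection(query_obj, query_chain, query_spans)}",
        f"pair_fit {query_name}, {ref_name}",   # first selection is moved onto the second
    ])

# First two rows of the table shown earlier (VPH1 start/end, STV1 start/end)
demo_ranges = [(5, 80, 5, 80), (86, 88, 81, 83)]
commands = pymol_fit_commands(demo_ranges, "6c6l", "A", "stv1_model", "A")
print(commands)
```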

## Writing commands for fitting matching residues of chains shared between two structures

The output can be used in Pymol to structurally align the chains.

In [6]:
# positions of resolved residues shared between two structures formatted into pair_fit commands
allow_MSE = True

from Bio.PDB import PDBParser  # only PDBParser is needed below
from collections import defaultdict

def range_extract(lst):
    # from https://www.rosettacode.org/wiki/Range_extraction#Python
    'Yield 2-tuple ranges or 1-tuple single elements from list of increasing ints'
    lenlst = len(lst)
    i = 0
    while i< lenlst:
        low = lst[i]
        while i <lenlst-1 and lst[i]+1 == lst[i+1]: i +=1
        hi = lst[i]
        if   hi - low >= 2:
            yield (low, hi)
        elif hi - low == 1:
            yield (low,)
            yield (hi,)
        else:
            yield (low,)
        i += 1
def printr(ranges):
    # from https://www.rosettacode.org/wiki/Range_extraction#Python
    print( ','.join( (('%i-%i' % r) if len(r) == 2 else '%i' % r)
                     for r in ranges ) )

    
def get_nice_ranges(ranges):
    return [ (('%i-%i' % r) if len(r) == 2 else '%i' % r) for r in ranges ]

def get_nice_ranges_wcolon(ranges):
    return [ (('%i:%i' % r) if len(r) == 2 else '%i' % r) for r in ranges ]

# Read in structure a ('first' structure)
structure = PDBParser().get_structure('3IAB', '3IAB.pdb')
residues_resolved_per_chain = defaultdict(list)
for model in structure:
    for chain in model:
        for residue in chain:
            #print (str(chain.id),residue.id[1])
            #print (str(chain.id),residue.id)
            '''
            if str(chain.id) == 'K':
                print (residue.id[0])
                '''
            #print [residue.id[0]]
            if allow_MSE:
                if residue.id[0] == 'H_MSE':
                    residues_resolved_per_chain[str(chain.id)].append(residue.id[1])
                else:
                    if not residue.id[0].startswith('H_'):
                        residues_resolved_per_chain[str(chain.id)].append(residue.id[1])
            else:
                if not residue.id[0].startswith('H_'):
                    residues_resolved_per_chain[str(chain.id)].append(residue.id[1])
# Read in structure b ('second' structure)
structure_b = PDBParser().get_structure('6AGB', '6AGB.pdb')
residues_resolved_per_chain_b = defaultdict(list)
for model_b in structure_b:
    for chain_b in model_b:
        for residue_b in chain_b:
            #print (str(chain_b.id),residue_b.id[1])
            #print (str(chain_b.id),residue_b.id)
            '''
            if str(chain_b.id) == 'K':
                print (residue_b.id[0])
                '''
            #print [residue_b.id[0]]
            if allow_MSE:
                if residue_b.id[0] == 'H_MSE':
                    residues_resolved_per_chain_b[str(chain_b.id)].append(residue_b.id[1])
                else:
                    if not residue_b.id[0].startswith('H_'):
                        residues_resolved_per_chain_b[str(chain_b.id)].append(residue_b.id[1])
            else:
                if not residue_b.id[0].startswith('H_'):
                    residues_resolved_per_chain_b[str(chain_b.id)].append(residue_b.id[1])

# further prep for last section
structures = (structure,structure_b) #will need for calling each in the order used later, i.e. structure a and b

chain_pairs = [('A','F'),('B','G')] #putting chain identifier from 'first' structure first in each pair.
                
#now that have position info for residues of both structures, get the shared positions
shared_positions_per_chain_pairs = {}
for chain_pair in chain_pairs:
    # sorted() because sets are unordered and range_extract() expects increasing ints
    shared = sorted(set(residues_resolved_per_chain[chain_pair[0]]) & set(residues_resolved_per_chain_b[chain_pair[1]]) )
    shared_positions_per_chain_pairs[chain_pair] = shared

#Now format more concisely
print ("Shared positions, concisely stated:")
ranges_o_residues_resolved_per_chain_pairs = {}
for k,lst in shared_positions_per_chain_pairs.items():
    print(k)
    #print (list(range_extract(lst))) # to see the list of tuples unformatted
    printr(range_extract(lst)) # tuples formatted as ranges
    ranges_o_residues_resolved_per_chain_pairs[k] = get_nice_ranges_wcolon(range_extract(lst))
# form commands
formatted_commands = ""
for chain_pair in ranges_o_residues_resolved_per_chain_pairs:
    for indx,chain in enumerate(chain_pair):
        formatted_commands += f"select {structures[indx].id}ch{chain}CA," #`CA` at end stands for `name CA` / alpha-carbon
        formatted_commands += "|".join( f" {structures[indx].id} and resid {pos_range} and chain {chain} and name CA " 
                                       for pos_range in ranges_o_residues_resolved_per_chain_pairs[chain_pair] ) 
        #for pos_range in ranges_o_residues_resolved_per_chain_pairs[chain_pair]:
         #   formatted_commands += f"{structures[indx].id} and resid {pos_range} and chain {chain} and name CA"
        formatted_commands += "\n"

# residues_resolved_per_chain_b
shared_positions_per_chain_pairs
ranges_o_residues_resolved_per_chain_pairs
print("\n\n")
print ("FORMATTED PYMOL COMMANDS:")
print(" ")
print(formatted_commands)



Shared positions, concisely stated:
('A', 'F')
4-121,129-158
('B', 'G')
14-104,125-140



FORMATTED PYMOL COMMANDS:
 
select 3IABchACA, 3IAB and resid 4:121 and chain A and name CA | 3IAB and resid 129:158 and chain A and name CA 
select 6AGBchFCA, 6AGB and resid 4:121 and chain F and name CA | 6AGB and resid 129:158 and chain F and name CA 
select 3IABchBCA, 3IAB and resid 14:104 and chain B and name CA | 3IAB and resid 125:140 and chain B and name CA 
select 6AGBchGCA, 6AGB and resid 14:104 and chain G and name CA | 6AGB and resid 125:140 and chain G and name CA 



In [21]:
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[0.627,0.121,0.372]"
strong_siml_color = "[0.937, 0.470, 0.627]"
weak_siml_color = "[0.949, 0.784, 0.878]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select resi{res}, (chain {chain} and resi {res})\n"
            building_output += f"show spheres, resi{res}\nset sphere_scale, 1, resi{res}\n"
            building_output += f"color {color_dict[category]}, resi{res}\n"
        #save and reset output string
        %store building_output > {category}_commands.txt
        building_output = ""
# Comment out the above line & uncomment the next line if you want to save as single file
#%store building_output > all_commands.txt


Writing 'building_output' (str) to file 'identical_commands.txt'.
Writing 'building_output' (str) to file 'strongly_similar_commands.txt'.
Writing 'building_output' (str) to file 'weakly_similar_commands.txt'.


To show that worked:

In [22]:
!head identical_commands.txt
!echo " "
!tail identical_commands.txt

select resi5, (chain A and resi 5)
show spheres, resi5
set sphere_scale, 1, resi5
color [0.627,0.121,0.372], resi5
select resi6, (chain A and resi 6)
show spheres, resi6
set sphere_scale, 1, resi6
color [0.627,0.121,0.372], resi6
select resi7, (chain A and resi 7)
show spheres, resi7
 
set sphere_scale, 1, resi820
color [0.627,0.121,0.372], resi820
select resi822, (chain A and resi 822)
show spheres, resi822
set sphere_scale, 1, resi822
color [0.627,0.121,0.372], resi822
select resi830, (chain A and resi 830)
show spheres, resi830
set sphere_scale, 1, resi830
color [0.627,0.121,0.372], resi830
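
One selection per residue works, but it makes for long command files. A more compact alternative, offered here only as a sketch, is to collapse each category's residue list into a single `resi` expression using Pymol's range (`-`) and list (`+`) syntax. The helper name `compact_resi` and the selection name `identical_res` are made up; the residue numbers are the first few from the 'identical' category shown earlier.

```python
def compact_resi(positions):
    """Collapse residue numbers into Pymol 'resi' list syntax, e.g. 5-12+14+17-18."""
    spans = []
    for pos in sorted(set(positions)):
        if spans and pos == spans[-1][1] + 1:
            spans[-1][1] = pos             # extend the current run
        else:
            spans.append([pos, pos])       # start a new run
    return "+".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)

identical = [5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 18, 20, 21, 22, 24]
chain = "A"
command = f"select identical_res, chain {chain} and resi {compact_resi(identical)}"
print(command)   # → select identical_res, chain A and resi 5-12+14+17-18+20-22+24
```

The `show spheres` and `color` lines can then be issued once per category against the named selection instead of once per residue.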


Demonstrating outputting to a single file:

In [23]:
# ALL to a single file
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[0.627,0.121,0.372]"
strong_siml_color = "[0.937, 0.470, 0.627]"
weak_siml_color = "[0.949, 0.784, 0.878]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select resi{res}, (chain {chain} and resi {res})\n"
            building_output += f"show spheres, resi{res}\nset sphere_scale, 1, resi{res}\n"
            building_output += f"color {color_dict[category]}, resi{res}\n"
%store building_output > all_commands.txt

Writing 'building_output' (str) to file 'all_commands.txt'.


Showing that worked:

In [24]:
!head all_commands.txt
!echo " "
!tail all_commands.txt

select resi5, (chain A and resi 5)
show spheres, resi5
set sphere_scale, 1, resi5
color [0.627,0.121,0.372], resi5
select resi6, (chain A and resi 6)
show spheres, resi6
set sphere_scale, 1, resi6
color [0.627,0.121,0.372], resi6
select resi7, (chain A and resi 7)
show spheres, resi7
 
set sphere_scale, 1, resi783
color [0.949, 0.784, 0.878], resi783
select resi816, (chain A and resi 816)
show spheres, resi816
set sphere_scale, 1, resi816
color [0.949, 0.784, 0.878], resi816
select resi833, (chain A and resi 833)
show spheres, resi833
set sphere_scale, 1, resi833
color [0.949, 0.784, 0.878], resi833


## Using the matching residue information to make Jmol commands

A similar process can be used to make Jmol / JSmol commands for the `compare` command, Jmol/JSmol's equivalent of Pymol's fit commands.
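
For the `compare` command itself, a sketch of how matched ranges could be turned into a command string follows. Treat the generated syntax as my best understanding of Jmol's `compare ... SUBSET ... ATOMS ... ROTATE TRANSLATE` form and check it against the Jmol scripting documentation before relying on it. The assumptions: the reference structure was loaded first (model `1.1`) and the structure to be moved second (model `2.1`), the range rows are (reference start, reference end, query start, query end), and the function name `jmol_compare_command` is made up.

```python
def jmol_compare_command(ranges, ref_chain, query_chain, ref_model="1.1", query_model="2.1"):
    """Build a Jmol compare command pairing query residue ranges with reference ranges."""
    # Each pair lists the range in the moved (query) model first, then the reference range.
    pairs = " ".join(
        f"{{{q_first}-{q_last}:{query_chain}}} {{{r_first}-{r_last}:{ref_chain}}}"
        for r_first, r_last, q_first, q_last in ranges
    )
    return (f"compare {{{query_model}}} {{{ref_model}}} SUBSET {{*.CA}} "
            f"ATOMS {pairs} ROTATE TRANSLATE")

demo_ranges = [(5, 80, 5, 80), (86, 88, 81, 83)]   # first two rows of the table shown earlier
jmol_command = jmol_compare_command(demo_ranges, "A", "A")
print(jmol_command)
```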

In [25]:
#JMOL
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[160,31,95]"
strong_siml_color = "[239, 120, 160]"
weak_siml_color = "[242, 200, 224]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select {res}:{chain};"
            building_output += f"spacefill on;"
            building_output += f"color {color_dict[category]}\n"
        #save and reset output string
        %store building_output > {category}_jmol_commands.txt
        building_output = ""
# Comment out the above line & uncomment the next line if you want to save as single file
#%store building_output > all_jmol_commands.txt

Writing 'building_output' (str) to file 'identical_jmol_commands.txt'.
Writing 'building_output' (str) to file 'strongly_similar_jmol_commands.txt'.
Writing 'building_output' (str) to file 'weakly_similar_jmol_commands.txt'.



**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

Jmol commands all in a single file:

In [26]:
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

In [27]:
#JMOL
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[160,31,95]"
strong_siml_color = "[239, 120, 160]"
weak_siml_color = "[242, 200, 224]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select {res}:{chain};"
            building_output += f"spacefill on;"
            building_output += f"color {color_dict[category]}\n"
%store building_output > all_jmol_commands.txt

Writing 'building_output' (str) to file 'all_jmol_commands.txt'.


Showing that worked:

In [28]:
!head all_jmol_commands.txt
!echo " "
!tail all_jmol_commands.txt

select 5:A;spacefill on;color [160,31,95]
select 6:A;spacefill on;color [160,31,95]
select 7:A;spacefill on;color [160,31,95]
select 8:A;spacefill on;color [160,31,95]
select 9:A;spacefill on;color [160,31,95]
select 10:A;spacefill on;color [160,31,95]
select 11:A;spacefill on;color [160,31,95]
select 12:A;spacefill on;color [160,31,95]
select 14:A;spacefill on;color [160,31,95]
select 17:A;spacefill on;color [160,31,95]
 
select 756:A;spacefill on;color [242, 200, 224]
select 760:A;spacefill on;color [242, 200, 224]
select 762:A;spacefill on;color [242, 200, 224]
select 763:A;spacefill on;color [242, 200, 224]
select 766:A;spacefill on;color [242, 200, 224]
select 770:A;spacefill on;color [242, 200, 224]
select 779:A;spacefill on;color [242, 200, 224]
select 783:A;spacefill on;color [242, 200, 224]
select 816:A;spacefill on;color [242, 200, 224]
select 833:A;spacefill on;color [242, 200, 224]


------