# Categorize conservation in an MSA and use that to generate molvis commands

This notebook works through categorizing conservation from a multiple sequence alignment (MSA) and then using those categories to generate commands for molecular visualization.

The process starts with a script named `categorize_residues_based_on_conservation_relative_consensus_line.py`. Because that script lays the groundwork for turning an alignment into commands that can be used in molecular structure visualization, and I didn't yet have a demo of it elsewhere, I illustrate use of that script first.

Technically, that script is a 'sequence analysis' script; however, it is very useful for bridging sequence analysis to molecular structure analysis as I hope the later part of this notebook illustrates.


----
 
<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

### Preparation

The next cell will retrieve the necessary script.

In [1]:
#get script to use to categorize
!curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/categorize_residues_based_on_conservation_relative_consensus_line.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26493  100 26493    0     0   256k      0 --:--:-- --:--:-- --:--:--  256k


An alignment file is needed as input. The next cell will handle retrieving one.

Details of the retrieved alignment:

FASTA-formatted sequences were downloaded from under 'Sequence' on the protein pages [for Stv1p](https://www.yeastgenome.org/locus/S000004658/protein) and [for Vph1p](https://www.yeastgenome.org/locus/S000005796/protein). These were then combined into one file, the asterisk at the end of each sequence was removed (to avoid the error `*** WARNING ***  Invalid character '*' in FASTA sequence data, ignored`), and the file was submitted for alignment by [MUSCLE here](https://www.ebi.ac.uk/Tools/msa/muscle/). Default settings were used. The alignment was produced in Clustal format with a consensus symbols line along the bottom.

In [2]:
# Get an alignment file
!curl -o alignment.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/e4cfaae56c69d9c21ad5357b820e6397747c3a88/Stv1p_Vph1p_muscle_alignment.clw

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3633  100  3633    0     0  30275      0 --:--:-- --:--:-- --:--:-- 30275


In [3]:
#verify have alignment file
!head alignment.clw

CLUSTAL multiple sequence alignment by MUSCLE (3.8)


STV1            -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR
VPH1            MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR
                  ::********:*: **:*** *: *: :: **::.:. . ***..: **** :**::*

STV1            RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE
VPH1            RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI---------
                *:*:***   :: .:::**  : :     **:    * ..::  *  ..:*         


You'll want to upload your own alignments to the active Jupyter session in the typical way: clicking the Jupyter logo near the top of the page takes you to a dashboard with a file-handling user interface. (Or, once it is retrieved, open `alignment.clw` and replace its contents with your own alignment.)

Note: because white space is critical for the consensus symbols line, it is best to save the alignment file directly from EMBL-EBI and use that file as input for this script, rather than copying and pasting from the site. For example, with copy-paste it is easy to miss the spaces that make up the last consensus symbols line when the two sequences mismatch across the entire last row of the alignment.
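
One way to catch a consensus line that lost its spaces is to compare its length with the sequence rows above it. The following is only a rough sketch, not part of the script: the function name `find_short_consensus_lines` is made up here, and it assumes the file keeps trailing spaces on every consensus row (as the example alignment above does) and has no residue counts at the ends of the sequence rows.

```python
def find_short_consensus_lines(clustal_text):
    """Return 1-based numbers of blocks whose consensus row is missing or too short."""
    lines = clustal_text.splitlines()[1:]  # skip the header line
    problems = []
    block_number = 0
    i = 0
    while i < len(lines):
        # Sequence rows start with an identifier, i.e. a non-space character.
        if lines[i] and not lines[i][0].isspace():
            block_number += 1
            expected_width = 0
            while i < len(lines) and lines[i] and not lines[i][0].isspace():
                expected_width = max(expected_width, len(lines[i]))
                i += 1
            # The row right after the sequence rows should be the consensus row.
            consensus = lines[i] if i < len(lines) else ""
            if len(consensus) < expected_width:
                problems.append(block_number)
        i += 1
    return problems

# Tiny made-up alignment; the consensus row keeps its trailing space.
intact = "CLUSTAL header\n\n\nA   MK-\nB   MKL\n    ** \n"
# Same alignment after the trailing space was lost, as copy-paste can do.
stripped = "CLUSTAL header\n\n\nA   MK-\nB   MKL\n    **\n"
print(find_short_consensus_lines(intact))
print(find_short_consensus_lines(stripped))
```

An empty list suggests every block has a full-width consensus row; any block numbers returned are worth inspecting by eye before running the script.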

## Use categorizing script via command line

The script takes a multiple sequence alignment (in CLUSTAL format) that has a consensus line, say from MUSCLE at https://www.ebi.ac.uk/Tools/msa/muscle/, and, for a specific sequence in the alignment, categorizes the residues that are identical, strongly similar, or weakly similar in the alignment. It also categorizes those that are not conserved.

See 'What do the consensus symbols mean in the alignment?' [here](https://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#23) for explanation of the indicators. 

Importantly, residue positions in the results use ordinary one-based numbering, where the first residue of the sequence is number one.
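
To make the categories and the one-based numbering concrete, here is a minimal sketch of how a consensus line can be mapped onto one sequence of the alignment. It is not the script's actual code: the names `SYMBOL_TO_CATEGORY` and `categorize_positions` are made up for illustration, and the input is assumed to be a single aligned sequence plus a consensus string of the same length.

```python
# Hypothetical helper for illustration; not taken from the script itself.
SYMBOL_TO_CATEGORY = {
    "*": "identical",
    ":": "strongly_similar",
    ".": "weakly_similar",
    " ": "not_conserved",
}

def categorize_positions(aligned_seq, consensus, gap="-"):
    """Group one-based residue positions of `aligned_seq` by consensus symbol."""
    if len(aligned_seq) != len(consensus):
        raise ValueError("aligned sequence and consensus line differ in length")
    categories = {name: [] for name in SYMBOL_TO_CATEGORY.values()}
    position = 0  # counts residues only, so gaps do not consume a number
    for residue, symbol in zip(aligned_seq, consensus):
        if residue == gap:
            continue
        position += 1
        categories[SYMBOL_TO_CATEGORY[symbol]].append(position)
    return categories

# First 16 alignment columns of the example alignment used in this notebook.
consensus = "  ::********:*: "
vph1 = categorize_positions("MAEKEEAIFRSAEMAL", consensus)
stv1 = categorize_positions("-MNQEEAIFRSADMTY", consensus)
print(vph1["identical"])  # numbering follows VPH1, which has no gap here
print(stv1["identical"])  # shifted down by one because of STV1's leading gap
```

The same alignment columns give different position numbers for the two sequences, which is why the script needs to be told which sequence identifier to report on.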

The script assumes the CLUSTAL alignment is provided with a header line, as it would come from EMBL-EBI's MUSCLE. (It doesn't check that the header matches anything; it just assumes nothing in the first line is of interest.)

#### Display `USAGE`

In [4]:
!python categorize_residues_based_on_conservation_relative_consensus_line.py -h

usage: categorize_residues_based_on_conservation_relative_consensus_line.py
       [-h] [-og {single,separate}] [-ot {tabular_text,panel_data}]
       ALIGNMENT_FILE ID

categorize_residues_based_on_conservation_relative_consensus_line.py takes an
multiple sequence alignment (in CLUSTAL format) that has a consensus line, say
from MUSCLE at https://www.ebi.ac.uk/Tools/msa/muscle/, and for a specific
sequence in the alignment categorizes the residues that are identical,
strongly , similar, weakly similar, or unconserved in the alignment. ****
Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  ALIGNMENT_FILE        Name of file of alignmnet text file to use to
                        categorize residues.
  ID                    Identifier that has the residues to categorize
                        relative consensus symbols.

optional arguments:
  -h, --help            show this help message and exit
  -og {single,separate}, --output_grouping {single,separate}
     

#### Command line use example #1: basic command (tabular text file)

At a minimum, the script needs the alignment file name followed by the identifier of the sequence for which to detail conserved positions.

In [5]:
%run categorize_residues_based_on_conservation_relative_consensus_line.py alignment.clw VPH1 


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.


See the result:

In [6]:
!head categorized_consv_VPH1_residues.tsv

category	residue_positions
identical	5,6,7,8,9,10,11,12,14,17,18,20,21,22,24,27,33,34,43,44,45,50,51,52,53,56,57,60,61,63,65,66,67,78,79,90,91,93,101,107,109,111,118,120,123,129,136,137,138,142,151,155,158,166,170,176,188,189,191,193,195,196,199,202,203,204,205,207,208,209,210,211,213,218,219,221,229,233,236,237,239,241,242,247,255,256,271,278,283,284,288,290,291,294,295,298,299,302,306,309,315,316,317,323,324,325,335,336,337,338,339,340,342,345,346,349,350,353,357,360,361,370,373,374,377,378,379,381,382,383,384,385,386,387,390,391,392,393,395,397,398,399,400,402,404,405,406,407,408,409,411,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,434,436,437,439,440,441,442,443,445,449,451,453,454,455,456,457,458,459,460,461,462,463,466,467,468,469,471,472,474,475,476,478,479,480,481,482,483,484,485,487,488,489,490,491,492,493,494,496,497,502,503,504,505,506,508,512,514,515,517,518,519,521,522,523,524,525,527,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,54

That produces output where each line is a category, a tab separator, and then the listing of the residue positions that match that category separated by a comma.
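
If you want to read that single-file layout back into Python without pandas, something along these lines should work. This is a sketch under the assumption that the file has exactly the two columns shown above (`category` and `residue_positions`); the function name `parse_categorized_tsv` is hypothetical.

```python
import csv
import io

def parse_categorized_tsv(handle):
    """Return {category: [int positions]} from the single-file TSV layout."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["category"]: [int(p) for p in row["residue_positions"].split(",") if p]
        for row in reader
    }

# Small stand-in for the real file so the sketch runs on its own.
example = io.StringIO(
    "category\tresidue_positions\n"
    "identical\t5,6,7\n"
    "strongly_similar\t3,4\n"
)
parsed = parse_categorized_tsv(example)
print(parsed)

# With the real output file, the call would look like:
# with open("categorized_consv_VPH1_residues.tsv", newline="") as handle:
#     positions = parse_categorized_tsv(handle)
```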

There are other options for the output from the command line, though. These control how the details are grouped, i.e., in one file or a separate file for each category, and whether the output is tabular text or pickled dataframes. These are demonstrated in the next examples.

#### Command line use example #2: separate tabular text files for each category

Separate files for the output can be specified using the `--output_grouping` option, abbreviated `-og`.

The next cell demonstrates setting it to result in tab-separated tabular text files for each category.

In [7]:
%run categorize_residues_based_on_conservation_relative_consensus_line.py alignment.clw VPH1 --output_grouping separate 


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data saved in tabular text form (tab-separated form) as:
       'not_conserved_VPH1_residues.tsv'.

       'strongly_similar_VPH1_residues.tsv'.

       'identical_VPH1_residues.tsv'.

       'weakly_similar_VPH1_residues.tsv'.


Let's look at one of those files to compare with the default of 'single' grouping.

In [8]:
!head weakly_similar_VPH1_residues.tsv

weakly_similar_residue_pos
37
39
41
46
47
74
95
96
104


This produces a listing of the residue positions, one per line.

#### Command line use example #3: a dataframe stored in file form as output

For those wishing to use this data in Python subsequently, Pandas dataframes in pickled form can be specified using the `--output_type` option, abbreviated `-ot`.

The next cell demonstrates setting it to result in a pickled dataframe.

In [9]:
%run categorize_residues_based_on_conservation_relative_consensus_line.py alignment.clw VPH1 --output_type panel_data 


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data saved in pickled dataframe form as 'categorized_consv_VPH1_residues.pkl'.


To show that worked, the next cell reads the pickled dataframe back into this notebook.

The first column of this dataframe is the categories. The residue positions column is a list of the positions for each category.

In [10]:
import pandas as pd
cat_df = pd.read_pickle("categorized_consv_VPH1_residues.pkl")
cat_df

Unnamed: 0,category,residue_positions
0,identical,"[5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 18, 20, 21..."
1,strongly_similar,"[3, 4, 13, 15, 19, 25, 28, 30, 31, 35, 36, 38,..."
2,weakly_similar,"[37, 39, 41, 46, 47, 74, 95, 96, 104, 105, 112..."
3,not_conserved,"[1, 2, 16, 23, 26, 29, 32, 40, 42, 49, 54, 68,..."


Note: if you plan to keep working in Python, check out the section 'Use categorizing function via import' below. It illustrates importing the main function of the script into a Jupyter notebook or IPython session so that it passes an active dataframe (or several dataframes, if you don't want all the categories in a single one) directly back into the notebook.

#### Command line use example #4: separate panel data (dataframes)

The two optional settings can also be combined to output a pickled dataframe for each category of conservation.

In [11]:
%run categorize_residues_based_on_conservation_relative_consensus_line.py alignment.clw VPH1 -og separate -ot panel_data 


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data saved in pickled dataframe form as:
       'not_conserved_VPH1_residues.pkl'.

       'strongly_similar_VPH1_residues.pkl'.

       'identical_VPH1_residues.pkl'.

       'weakly_similar_VPH1_residues.pkl'.


Each dataframe lists the pertinent residue positions when this setting is used.

The next cell shows that by reading one of the produced pickled dataframe files back into this notebook session. The first few lines of the data are shown.

In [12]:
import pandas as pd
cat_df = pd.read_pickle("weakly_similar_VPH1_residues.pkl")
cat_df.head()

Unnamed: 0,weakly_similar_residue_pos
0,37
1,39
2,41
3,46
4,47


## Use categorizing function via import

In addition to being run from the command line, the main function can be imported into a Jupyter notebook (or IPython session), where it can pass back dataframe(s) with the results. This section illustrates that.

First you import the function into the notebook or IPython environment.

In [13]:
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line

That command looks a bit redundant because the script and its main function share a name. The part after `from` names the script file; by convention the `.py` extension is left off. The part after `import` names the function to bring in, `categorize_residues_based_on_conservation_relative_consensus_line()`.

Now that `categorize_residues_based_on_conservation_relative_consensus_line()` is imported, it can be used. As with using the script from the command line, the function has a number of options that can be used.

#### function use example #1: basic command (tabular text file)

At a minimum, the function needs the alignment file name followed by the identifier of the sequence for which to detail conserved positions.

In [14]:
categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1")


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.


That produces the same output as the basic command line form, i.e., a single tabular text file with each category followed by a tab separator and then a list of the corresponding residue positions. The positions are separated by commas.

#### function use example #2: separate tabular text files for each category

Separate files for the output can be specified when calling the function by setting `output_separate` to `True`. The default is for the output to be tab-separated text, so no other settings are needed when calling the function.

The next cell demonstrates setting it to result in tab-separated tabular text files for each category.

In [15]:
categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", output_separate = True)


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Data saved in tabular text form (tab-separated form) as:
       'not_conserved_VPH1_residues.tsv'.

       'strongly_similar_VPH1_residues.tsv'.

       'identical_VPH1_residues.tsv'.

       'weakly_similar_VPH1_residues.tsv'.


#### function use example #3: return a dataframe

For those wishing to utilize this data in Python, a dataframe can be returned by setting `return_panel_data` to true.

The next cell demonstrates this, with the returned dataframe becoming an active object in the notebook environment.

In [16]:
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)
df


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

Unnamed: 0,category,residue_positions
0,identical,"[5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 18, 20, 21..."
1,strongly_similar,"[3, 4, 13, 15, 19, 25, 28, 30, 31, 35, 36, 38,..."
2,weakly_similar,"[37, 39, 41, 46, 47, 74, 95, 96, 104, 105, 112..."
3,not_conserved,"[1, 2, 16, 23, 26, 29, 32, 40, 42, 49, 54, 68,..."


For each category, the residue positions are a Python list.

Using such a dataframe to continue on in Python and produce something useful is demonstrated in the final section of the notebook, after this last example.
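
When one row per residue is more convenient than one list per category, the list column can be unpacked with pandas' `DataFrame.explode`. The sketch below uses a small stand-in dataframe with the same two columns as the one returned above; `df_example` and `long_df` are names chosen just for this illustration.

```python
import pandas as pd

# Stand-in with the same two columns as the dataframe returned above.
df_example = pd.DataFrame({
    "category": ["identical", "strongly_similar"],
    "residue_positions": [[5, 6, 7], [3, 4]],
})

# One row per residue position, ordered along the sequence.
long_df = (
    df_example.explode("residue_positions")
    .rename(columns={"residue_positions": "residue_position"})
    .astype({"residue_position": int})
    .sort_values("residue_position")
    .reset_index(drop=True)
)
print(long_df)
```

In that long form it is straightforward to filter by position range or merge with other per-residue annotations.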

#### function use example #4: return a separate dataframe for each category

Alternatively, a separate dataframe can be produced for each category by also setting `output_separate` to `True` when calling the function with `return_panel_data = True`.

In [17]:
df_dict = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True, output_separate = True)


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Data saved in tabular text form (tab-separated form) as:
       'not_conserved_VPH1_residues.tsv'.

       'strongly_similar_VPH1_residues.tsv'.

       'identical_VPH1_residues.tsv'.

       'weakly_similar_VPH1_residues.tsv'.

Returning a a dictionary of 4 dataframes , one dataframe for each category. The dictionary keys
are the categories:'not_conserved','strongly_similar','identical','weakly_similar'.

The individual dataframes are returned as a dictionary of dataframes with the categories as keys. In each dataframe, each row is a residue position in that category.

In [18]:
from IPython.display import display
# trick from https://stackoverflow.com/a/29665452/8508004
# show the first rows of each per-category dataframe under a label
for k in df_dict:
    print(f"This is the dataframe for '{k}':")
    display(df_dict[k].head())

This is the dataframe for 'not_conserved':


Unnamed: 0,not_conserved_residue_pos
0,1
1,2
2,16
3,23
4,26


This is the dataframe for 'strongly_similar':


Unnamed: 0,strongly_similar_residue_pos
0,3
1,4
2,13
3,15
4,19


This is the dataframe for 'identical':


Unnamed: 0,identical_residue_pos
0,5
1,6
2,7
3,8
4,9


This is the dataframe for 'weakly_similar':


Unnamed: 0,weakly_similar_residue_pos
0,37
1,39
2,41
3,46
4,47


Or, if you need to display them ordered from most conserved to least:

In [19]:
#ordered most conserved to least for display
sep_dfs_ordered = ([df_dict['identical'],
            df_dict['strongly_similar'],df_dict['weakly_similar'],
            df_dict['not_conserved']])
from IPython.display import display, HTML
# trick from https://stackoverflow.com/a/29665452/8508004
for e in sep_dfs_ordered:
    display(e.head())

Unnamed: 0,identical_residue_pos
0,5
1,6
2,7
3,8
4,9


Unnamed: 0,strongly_similar_residue_pos
0,3
1,4
2,13
3,15
4,19


Unnamed: 0,weakly_similar_residue_pos
0,37
1,39
2,41
3,46
4,47


Unnamed: 0,not_conserved_residue_pos
0,1
1,2
2,16
3,23
4,26
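
A dictionary of per-category dataframes can also be turned around into a lookup from residue position to category. The following is a sketch only, assuming each dataframe has a single column of positions as shown above; `position_to_category` and `frames_example` are hypothetical names.

```python
import pandas as pd

def position_to_category(frames):
    """Map each residue position to its category, given {category: dataframe}."""
    lookup = {}
    for category, frame in frames.items():
        # Each per-category dataframe has a single column of positions.
        for position in frame.iloc[:, 0]:
            lookup[int(position)] = category
    return lookup

# Stand-ins shaped like the per-category dataframes displayed above.
frames_example = {
    "identical": pd.DataFrame({"identical_residue_pos": [5, 6]}),
    "not_conserved": pd.DataFrame({"not_conserved_residue_pos": [1, 2]}),
}
lookup = position_to_category(frames_example)
print(lookup[5], lookup[1])
```

With the real result, `position_to_category(df_dict)` would answer questions such as "which category is residue 37 in?" with a single dictionary access.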


The final section of the notebook, below, builds on the single dataframe output to produce molecular visualization commands for use in PyMOL and/or Jmol.

## Using the categorization to make PyMOL commands

This section builds on the single dataframe output to produce molecular visualization commands for use in PyMOL. The structure used is [6C6L: Yeast Vacuolar ATPase Vo in lipid nanodisc](http://www.rcsb.org/structure/6C6L), in which VPH1 is chain A.

First we'll run the command to make sure the related dataframe is in the namespace of the notebook.

In [20]:
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

Now that dataframe can be used to make commands based on the categories and residue numbering.

Because this example has a lot of residues in each category, I am outputting a separate file for each so that the first part of each can be examined. If your protein has fewer residues, you may prefer a single file.

In [21]:
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[0.627,0.121,0.372]"
strong_siml_color = "[0.937, 0.470, 0.627]"
weak_siml_color = "[0.949, 0.784, 0.878]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select resi{res}, (chain {chain} and resi {res})\n"
            building_output += f"show spheres, resi{res}\nset sphere_scale, 1, resi{res}\n"
            building_output += f"color {color_dict[category]}, resi{res}\n"
        #save and reset output string
        %store building_output > {category}_commands.txt
        building_output = ""
# To save everything as a single file instead, comment out the two lines above (the `%store` and the reset) and uncomment the next line
#%store building_output > all_commands.txt


Writing 'building_output' (str) to file 'identical_commands.txt'.
Writing 'building_output' (str) to file 'strongly_similar_commands.txt'.
Writing 'building_output' (str) to file 'weakly_similar_commands.txt'.


To show that worked:

In [22]:
!head identical_commands.txt
!echo " "
!tail identical_commands.txt

select resi5, (chain A and resi 5)
show spheres, resi5
set sphere_scale, 1, resi5
color [0.627,0.121,0.372], resi5
select resi6, (chain A and resi 6)
show spheres, resi6
set sphere_scale, 1, resi6
color [0.627,0.121,0.372], resi6
select resi7, (chain A and resi 7)
show spheres, resi7
 
set sphere_scale, 1, resi820
color [0.627,0.121,0.372], resi820
select resi822, (chain A and resi 822)
show spheres, resi822
set sphere_scale, 1, resi822
color [0.627,0.121,0.372], resi822
select resi830, (chain A and resi 830)
show spheres, resi830
set sphere_scale, 1, resi830
color [0.627,0.121,0.372], resi830
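
The commands above use one selection per residue. A more compact variant is to make one selection per category, relying on PyMOL's `+`-separated `resi` lists and on `set_color` to define a named color. The sketch below was not run against PyMOL as part of this notebook; the function name, the `consv_` prefix, and the example data are assumptions for illustration, and it presumes the residue numbering in the structure matches the sequence numbering from the alignment.

```python
def pymol_commands_by_category(positions_by_category, colors, chain="A"):
    """Build one color definition and one selection per conservation category."""
    lines = []
    for category, rgb in colors.items():
        positions = positions_by_category.get(category, [])
        if not positions:
            continue  # nothing to select for this category
        name = f"consv_{category}"
        resi_list = "+".join(str(p) for p in positions)
        lines.append(f"set_color {name}_color, {rgb}")
        lines.append(f"select {name}, (chain {chain} and resi {resi_list})")
        lines.append(f"show spheres, {name}")
        lines.append(f"color {name}_color, {name}")
    return "\n".join(lines) + "\n"

# Made-up positions; the colors are the ones used in the cells above.
example_positions = {"identical": [5, 6, 7], "weakly_similar": [37, 39]}
example_colors = {
    "identical": "[0.627,0.121,0.372]",
    "weakly_similar": "[0.949, 0.784, 0.878]",
}
commands = pymol_commands_by_category(example_positions, example_colors)
print(commands)

# Plain-Python alternative to the %store magic for saving, for example:
# from pathlib import Path
# Path("all_commands_compact.txt").write_text(commands)
```

With the real data, `dict(zip(df.category, df.residue_positions))` would take the place of `example_positions`.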


Demonstrating outputting to a single file:

In [23]:
# ALL to a single file
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[0.627,0.121,0.372]"
strong_siml_color = "[0.937, 0.470, 0.627]"
weak_siml_color = "[0.949, 0.784, 0.878]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select resi{res}, (chain {chain} and resi {res})\n"
            building_output += f"show spheres, resi{res}\nset sphere_scale, 1, resi{res}\n"
            building_output += f"color {color_dict[category]}, resi{res}\n"
%store building_output > all_commands.txt

Writing 'building_output' (str) to file 'all_commands.txt'.


Showing that worked:

In [24]:
!head all_commands.txt
!echo " "
!tail all_commands.txt

select resi5, (chain A and resi 5)
show spheres, resi5
set sphere_scale, 1, resi5
color [0.627,0.121,0.372], resi5
select resi6, (chain A and resi 6)
show spheres, resi6
set sphere_scale, 1, resi6
color [0.627,0.121,0.372], resi6
select resi7, (chain A and resi 7)
show spheres, resi7
 
set sphere_scale, 1, resi783
color [0.949, 0.784, 0.878], resi783
select resi816, (chain A and resi 816)
show spheres, resi816
set sphere_scale, 1, resi816
color [0.949, 0.784, 0.878], resi816
select resi833, (chain A and resi 833)
show spheres, resi833
set sphere_scale, 1, resi833
color [0.949, 0.784, 0.878], resi833


## Using the categorization to make Jmol commands

A similar process can be used to make Jmol / JSmol commands.

In [25]:
#JMOL
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[160,31,95]"
strong_siml_color = "[239, 120, 160]"
weak_siml_color = "[242, 200, 224]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select {res}:{chain};"
            building_output += f"spacefill on;"
            building_output += f"color {color_dict[category]}\n"
        #save and reset output string
        %store building_output > {category}_jmol_commands.txt
        building_output = ""
# To save everything as a single file instead, comment out the two lines above (the `%store` and the reset) and uncomment the next line
#%store building_output > all_jmol_commands.txt

Writing 'building_output' (str) to file 'identical_jmol_commands.txt'.
Writing 'building_output' (str) to file 'strongly_similar_jmol_commands.txt'.
Writing 'building_output' (str) to file 'weakly_similar_jmol_commands.txt'.



**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

Jmol commands all in a single file:

In [26]:
from categorize_residues_based_on_conservation_relative_consensus_line import categorize_residues_based_on_conservation_relative_consensus_line
df = categorize_residues_based_on_conservation_relative_consensus_line("alignment.clw","VPH1", return_panel_data = True)


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 110 (give or take a few).**
Alignment read...top line identifier determined as 'STV1'...
...Processing information about sequence with id of 'VPH1'.

Data a saved in tabular text form (tab-separated form) as 'categorized_consv_VPH1_residues.tsv'.

Returning a single dataframe of residue positions, one list per category.

In [27]:
#JMOL
structure = "6c6l"
chain = "A"  #Vph1p chain in PDB id: 6c6l
identical_color = "[160,31,95]"
strong_siml_color = "[239, 120, 160]"
weak_siml_color = "[242, 200, 224]"
color_dict = {
    "identical":identical_color,
    "strongly_similar":strong_siml_color,
    "weakly_similar":weak_siml_color
            }
df_dict = dict(zip(df.category,df.residue_positions))
assert list(color_dict.keys()) == list(df_dict.keys())[:3] , "keys not identical"
building_output = ""
for category in df_dict:
    if category != "not_conserved":
        for res in df_dict[category]:
            building_output += f"select {res}:{chain};"
            building_output += f"spacefill on;"
            building_output += f"color {color_dict[category]}\n"
%store building_output > all_jmol_commands.txt

Writing 'building_output' (str) to file 'all_jmol_commands.txt'.


Showing that worked:

In [28]:
!head all_jmol_commands.txt
!echo " "
!tail all_jmol_commands.txt

select 5:A;spacefill on;color [160,31,95]
select 6:A;spacefill on;color [160,31,95]
select 7:A;spacefill on;color [160,31,95]
select 8:A;spacefill on;color [160,31,95]
select 9:A;spacefill on;color [160,31,95]
select 10:A;spacefill on;color [160,31,95]
select 11:A;spacefill on;color [160,31,95]
select 12:A;spacefill on;color [160,31,95]
select 14:A;spacefill on;color [160,31,95]
select 17:A;spacefill on;color [160,31,95]
 
select 756:A;spacefill on;color [242, 200, 224]
select 760:A;spacefill on;color [242, 200, 224]
select 762:A;spacefill on;color [242, 200, 224]
select 763:A;spacefill on;color [242, 200, 224]
select 766:A;spacefill on;color [242, 200, 224]
select 770:A;spacefill on;color [242, 200, 224]
select 779:A;spacefill on;color [242, 200, 224]
select 783:A;spacefill on;color [242, 200, 224]
select 816:A;spacefill on;color [242, 200, 224]
select 833:A;spacefill on;color [242, 200, 224]
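
Since many conserved positions are consecutive, the Jmol commands can be shortened by collapsing runs of positions into ranges and joining them with commas, which Jmol's `select` treats as "or". This is a sketch rather than something tested in Jmol here; `collapse_ranges` and `jmol_select_expression` are names invented for this example.

```python
def collapse_ranges(positions):
    """Turn integers into sorted (start, end) runs, e.g. 5,6,7,9 -> (5,7),(9,9)."""
    runs = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos  # extend the current run
        else:
            runs.append([pos, pos])  # start a new run
    return [tuple(run) for run in runs]

def jmol_select_expression(positions, chain="A"):
    """Build a comma-separated Jmol atom expression such as '5-12:A,14:A'."""
    parts = []
    for start, end in collapse_ranges(positions):
        span = str(start) if start == end else f"{start}-{end}"
        parts.append(f"{span}:{chain}")
    return ",".join(parts)

# First few 'identical' positions from the output above.
expr = jmol_select_expression([5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 18])
print(f"select {expr};spacefill on;color [160,31,95]")
```

That yields one command line per category instead of one per residue, which is easier to paste into a Jmol console.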


------