# GSD: 1_vs_21_vs_2_vs_12

I tried to note, for each section, which of my general demonstration notebooks for scripts or command-line processes it was based on.

-----

## Prepare for using the residue matching script on MSA by obtaining FASTAs and validating

This section is largely based on early parts of a Jupyter notebook [here](http://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/notebooks/Determine%20residues%20that%20match%20to%20a%20reference%20from%20MSA%20and%20use%20to%20construct%20fit%20commands.ipynb). It can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder/tree/master/).

In [None]:
# Retrieve scripts needed for this section
# Get the retrieving script to use for yeast protein sequences
!curl -O https://raw.githubusercontent.com/fomightez/yeastmine/4eb96c1c4714df336076b53ea26b6139ef6cddbb/get_protein_seq_as_FASTA.py
# Get the validating script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/88d2076fcb6014bdad7f07b293cead16ae2e9fe0/alignment-utilities/check_seq_frag_in_MSAclustal_intact_viaFASTA.py

In [None]:
# Need a dependency (Beautiful Soup) for the next cell
!pip install beautifulsoup4

In [None]:
# Get FASTA file for the non yeast one
!curl -o 1x0t_A.fa "http://oca.weizmann.ac.il/oca-bin/send-seq?1x0t_A"
# remove HTML to leave actual FASTA
# based on https://stackoverflow.com/a/21577649/8508004 and https://unix.stackexchange.com/a/64747
from bs4 import BeautifulSoup
oca_file_to_fix = "1x0t_A.fa"
with open(oca_file_to_fix) as handle:
    soup = BeautifulSoup(handle, "html.parser")
# the FASTA text is the content of the page's <pre> element (the last one is kept if there are several)
for pre in soup.find_all("pre"):
    fasta = pre.contents
# overwrite the downloaded HTML with just the FASTA text
%store fasta[0] > {oca_file_to_fix}
# Get FASTA file for yeast ones
genes = ["rpr2", "snm1", "rpa12"]
for gene in genes:
    %run get_protein_seq_as_FASTA.py {gene}
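
The HTML-stripping step in the cell above can also be sketched with only the standard library, in case Beautiful Soup is not available. This is a swapped-in technique (`html.parser` instead of `bs4`), not what the cell uses; `PreTextExtractor`, `fasta_from_html`, and the example page are hypothetical names and made-up data for illustration.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Only keep text that sits between <pre> and </pre>
        if self._in_pre:
            self.chunks.append(data)


def fasta_from_html(html_text):
    """Return the text wrapped in <pre> tags, without surrounding whitespace."""
    parser = PreTextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip()


# Made-up, shortened page standing in for what the server returns
example_page = "<html><body><pre>>1x0t_A\nERIDTLFTLAERV\n</pre></body></html>"
example_fasta = fasta_from_html(example_page)
print(example_fasta)
```

The real page may wrap the sequence differently, so check the saved file by eye before relying on either approach.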

Note that `get_protein_seq_as_FASTA.py` now (May 2019) causes a `RuntimeError: generator raised StopIteration` error in Python 3.7. (I suspect this is related to the second item under 'Changes in Python Behavior' [here](https://docs.python.org/3.7/whatsnew/3.7.html#changes-in-python-behavior).) So I ran the `get_protein_seq_as_FASTA.py` part in Python 2.7 launched from my mcscan-blast-binder, downloaded the resulting files to my local system, and uploaded them here to bypass the problem for now.
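
For reference, the behavior change I suspect is the one where a `StopIteration` escaping from inside a generator gets converted to `RuntimeError` in Python 3.7+. I have not checked which line of the script triggers it, so the following is only a minimal sketch of the general pattern and its usual fix; both function names are made up for illustration.

```python
def first_items_old_style(iterables):
    """Yield the first item of each iterable; next() can leak StopIteration."""
    for iterable in iterables:
        yield next(iter(iterable))  # raises StopIteration on an empty iterable


def first_items_fixed(iterables):
    """Same idea, but end the generator explicitly instead of leaking StopIteration."""
    for iterable in iterables:
        try:
            yield next(iter(iterable))
        except StopIteration:
            return  # a plain return is the supported way to finish a generator


data = [[1, 2], [], [3]]

try:
    leaked = list(first_items_old_style(data))
except RuntimeError as err:
    # Python 3.7+ converts the leaked StopIteration into RuntimeError
    leaked = str(err)

print(leaked)
print(list(first_items_fixed(data)))
```

In Python 2.7 the first version would silently stop at the empty item, which is consistent with the script working there.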

In [None]:
#associate produced files with the MSA ids
fasta_files_byMSAid = {}
fasta_files_byMSAid["Snm1"] = "S288C_YDR478W_SNM1_protein.fsa"
fasta_files_byMSAid["Rpr2"] = "S288C_YIR015W_RPR2_protein.fsa"
fasta_files_byMSAid["Rpa12"] = "S288C_YJR063W_RPA12_protein.fsa"
fasta_files_byMSAid["1x0t.1.A"] = "1x0t_A.fa"

In [None]:
# Get the multiple sequence alignment
!curl -o alignment.clw https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/7ef7cfdaa2c9f9974f22fd60be3cfe7d1935cd86/uy_alignmt.clw

Now we have everything needed to validate the sequences as suitable for use by the script that uses residue matching to generate equivalent positions from the multiple sequence alignment.
While validating, we will also collect the position of the first residue for each sequence in the multiple sequence alignment because **this information will be needed later** for the residue matching script.

In [None]:
# Validate, for each sequence in the alignment, that the fragment is contiguous once it starts.
# Also collect the position of the first residue for each sequence in the multiple sequence alignment.
from check_seq_frag_in_MSAclustal_intact_viaFASTA import check_seq_frag_in_MSAclustal_intact_viaFASTA
import time
import sys
start_positions_by_id = {}
for MSAid, fasta_file in fasta_files_byMSAid.items():
    print(MSAid, fasta_file)
    consistent, start_pos = check_seq_frag_in_MSAclustal_intact_viaFASTA("alignment.clw", MSAid, fasta_file, return_TF = True, report_start_residue_pos = True)
    if consistent:
        time.sleep(0.301) # pause before printing to stdout so it will show up after stderr
        print ("The sequence for {} is contiguous and valid.".format(MSAid))
        print ("The sequence for {} starts at position {} in the MSA.".format(MSAid,start_pos))
    else:
        print ("The sequence for {} is NOT valid. This will be A PROBLEM.".format(MSAid))
        sys.exit(1)
    start_positions_by_id[MSAid] = start_pos
#start_positions_by_id
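
To make the idea behind the check concrete, here is a minimal sketch of my reading of it: strip the gaps from a row of the alignment, confirm that what remains is one contiguous stretch of the full sequence from the FASTA file, and report the 1-based residue number at which that stretch starts. This is not the code of `check_seq_frag_in_MSAclustal_intact_viaFASTA.py`; `fragment_start_position` and the example sequences are made up for illustration.

```python
def fragment_start_position(aligned_row, full_sequence, gap="-"):
    """Return the 1-based position in `full_sequence` where the ungapped
    alignment row begins, or None if the row is not a contiguous stretch
    of the full sequence."""
    fragment = aligned_row.replace(gap, "")
    index = full_sequence.find(fragment)
    if not fragment or index == -1:
        return None
    return index + 1


# Made-up full sequence and alignment rows
full_sequence = "MSTLNYLYQISAYQ"
good_row = "---LNYLY-QISAYQ"   # gaps removed -> LNYLYQISAYQ, contiguous in the full sequence
bad_row = "---LNY---QISAYQ"    # gaps removed -> LNYQISAYQ, residues missing in the middle

print(fragment_start_position(good_row, full_sequence))
print(fragment_start_position(bad_row, full_sequence))
```

A row that fails this kind of check usually means residues were deleted during hand-editing of the alignment, which would throw off every residue number derived from it.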

----

## Merging multiple structures in PDB format into single file with multiple models

This section is based largely on the section entitled 'Python script method to merge' [here](http://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/notebooks/Split%20and%20combine%20multimodel%20PDB%20files.ipynb#Python-script-method-to-merge). It can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder/tree/master/).


In [None]:
# Get the merging script
!curl -O https://raw.githubusercontent.com/fomightez/structurework/82b3200c0a0b432ffc23eb2105d0e8c82b42aa81/python_scripts/merge_multi_PDBs_into_single_file.py

I placed the files I wanted to merge in one directory, named `models`, within a session running this notebook from [here](https://github.com/fomightez/cl_demo-binder/tree/master/). 

These are the files:

- 6AGB_K.pdb
- Snm1p_I-TASSER_model1.pdb
- 1x0t.pdb
- 4c3h_chainI.pdb


6AGB_K.pdb/6AH3_K.pdb were generated from [here](https://github.com/fomightez/bio3d-binder). 4c3h_chainI.pdb was generated from [here](http://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/notebooks/Split%20pdb%20files%20into%20chains.ipynb#Bash/sed-method) and renamed. (I just happened to try a different approach between the two because I had a different session launched when making the 4c3h chain files.)

I will need the order of the models to match the order I'll use when generating commands. To control the order, I'll take advantage of the ability of `merge_multi_PDBs_into_single_file.py` to recognize numbers at the end of the main part of the PDB file names and use those as the order.
So the files were renamed to:

- 6AGB_K_1.pdb
- Snm1p_I-TASSER_model1_2.pdb
- 1x0t_3.pdb
- 4c3h_chainI_4.pdb
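
As a rough sketch of how a trailing number can drive the order (this is an assumption about the general idea, not the actual logic inside `merge_multi_PDBs_into_single_file.py`), the file names can be sorted on the digits at the end of the name without its extension. `trailing_number` is a hypothetical helper.

```python
import re
from pathlib import Path


def trailing_number(filename):
    """Return the integer at the end of the file name stem, or infinity if none."""
    match = re.search(r"(\d+)$", Path(filename).stem)
    return int(match.group(1)) if match else float("inf")


# The renamed files, deliberately listed out of order
files = [
    "4c3h_chainI_4.pdb",
    "1x0t_3.pdb",
    "6AGB_K_1.pdb",
    "Snm1p_I-TASSER_model1_2.pdb",
]
ordered = sorted(files, key=trailing_number)
print(ordered)
```

Files without a trailing number sort last here; that choice is mine, made only to keep the sketch total.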

The following cells have the commands I ran after that. (Note I commented out these two commands at the end of this section because I don't want to do the steps in this section each time I re-run the notebook. I had tried using `%%script false` at the top of each cell so they don't run, but that doesn't seem to work anymore [if it ever did?].) 

In [None]:
#%run merge_multi_PDBs_into_single_file.py models

In [None]:
#rename the produced file to something more specific
#!mv models.pdb rpr2_snm1_1x0t_rpa12.pdb
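
For orientation, a multi-model PDB file wraps each structure between a `MODEL n` record and an `ENDMDL` record. The following is a minimal sketch of that wrapping, assuming each input is already plain PDB text; it ignores header records and other details the real script may handle, and `merge_as_models` plus the two tiny inputs are made up for illustration.

```python
def merge_as_models(pdb_texts):
    """Wrap each PDB text in MODEL/ENDMDL records, numbering models from 1."""
    merged = []
    for model_number, text in enumerate(pdb_texts, start=1):
        merged.append(f"MODEL     {model_number:>4}")
        for line in text.splitlines():
            record = line[:6].strip()
            # Drop records that would break the new model boundaries
            if record in ("MODEL", "ENDMDL", "END"):
                continue
            merged.append(line)
        merged.append("ENDMDL")
    merged.append("END")
    return "\n".join(merged) + "\n"


# Two made-up single-atom "structures" with zeroed coordinates
structure_a = "ATOM      1  CA  LEU K  37       0.000   0.000   0.000  1.00  0.00           C\nEND\n"
structure_b = "ATOM      1  CA  SER A  14       0.000   0.000   0.000  1.00  0.00           C\nEND\n"

merged_text = merge_as_models([structure_a, structure_b])
print(merged_text)
```

The order of the list passed in becomes the model numbering, which is why the file order mattered above.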

I collected the resulting file `rpr2_snm1_1x0t_rpa12.pdb` and uploaded it to Proteopedia; find it [here](http://proteopedia.org/wiki/index.php/Image:Rpr2_snm1_1x0t_rpa12.pdb).

I opened the structure file in Proteopedia using the scene scripting tool and then saved a scene. I placed the text specifying that scene in the wikitext that defines a structure window on a page. 

----

## Use residue matching script to collect information about residue positions in the alignment

This section is largely based on last section of a Jupyter notebook [here](http://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/notebooks/Determine%20residues%20that%20match%20to%20a%20reference%20from%20MSA%20and%20use%20to%20construct%20fit%20commands.ipynb). It can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder/tree/master/).

In [22]:
# Get the residue position matching script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/88d2076fcb6014bdad7f07b293cead16ae2e9fe0/alignment-utilities/MSA_to_corresponding_residue_numbers.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30126  100 30126    0     0  91847      0 --:--:-- --:--:-- --:--:-- 91847


In [23]:
from MSA_to_corresponding_residue_numbers import MSA_to_corresponding_residue_numbers
start_posns = [start_positions_by_id["Snm1"],
               start_positions_by_id["1x0t.1.A"],
               start_positions_by_id["Rpr2"],
               start_positions_by_id["Rpa12"],
              ]
ref_id, dfs_by_id = MSA_to_corresponding_residue_numbers("alignment.clw","Rpr2", return_dfs = True , supplied_start_pos = start_posns)


**NOTE: gap indicator in this script is currently set to '-'. If
that does not match what provided alignment uses to indicate gaps,
please change the setting within the script code under
'USER ADJUSTABLE VALUES' around line 130 (give or take a few).**
Alignment file read...Rpr2 reference sequence collected from alignment...query compared...query compared...query compared...
made match pairings into intervals of start and end where possible...

made dataframes of matched start and ends...DONE.


In [24]:
from IPython.display import display, HTML
for id_,df in dfs_by_id.items():
    print(id_)
    display(df)

Snm1


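| | Rpr2_start | Rpr2_end | Snm1_start | Snm1_end |
|---|---|---|---|---|
| 0 | 37 | 48 | 14 | 25 |
| 1 | 56 | 63 | 26 | 33 |
| 2 | 64 | 72 | 35 | 43 |
| 3 | 74 | 86 | 44 | 56 |
| 4 | 87 | 108 | 58 | 79 |
| 5 | 109 | 115 | 100 | 106 |
| 6 | 116 | 144 | 108 | 136 |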


1x0t.1.A


| | Rpr2_start | Rpr2_end | 1x0t.1.A_start | 1x0t.1.A_end |
|---|---|---|---|---|
| 0 | 37 | 47 | 23 | 33 |
| 1 | 56 | 63 | 34 | 41 |
| 2 | 64 | 72 | 43 | 51 |
| 3 | 74 | 108 | 52 | 86 |
| 4 | 109 | 115 | 91 | 97 |
| 5 | 116 | 130 | 99 | 113 |


Rpa12


| | Rpr2_start | Rpr2_end | Rpa12_start | Rpa12_end |
|---|---|---|---|---|
| 0 | 87 | 98 | 83 | 94 |
| 1 | 105 | 108 | 95 | 98 |
| 2 | 109 | 115 | 108 | 114 |
| 3 | 116 | 125 | 116 | 125 |
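
The step the script reports as "made match pairings into intervals of start and end where possible" can be pictured as collapsing runs of matched positions in which both the reference and the query numbers increase by one. The following is a minimal sketch of that idea, not the script's own code; `pairs_to_intervals` is a hypothetical name, and the example pairs are built to correspond to the first two Rpa12 rows.

```python
def pairs_to_intervals(matched_pairs):
    """Collapse (ref_pos, query_pos) pairs into (ref_start, ref_end,
    query_start, query_end) intervals wherever both positions advance by 1."""
    intervals = []
    for ref_pos, query_pos in matched_pairs:
        extends_last = (
            intervals
            and ref_pos == intervals[-1][1] + 1
            and query_pos == intervals[-1][3] + 1
        )
        if extends_last:
            intervals[-1][1] = ref_pos
            intervals[-1][3] = query_pos
        else:
            intervals.append([ref_pos, ref_pos, query_pos, query_pos])
    return [tuple(interval) for interval in intervals]


# Rpr2 87-98 pairs with Rpa12 83-94, then Rpr2 105-108 pairs with Rpa12 95-98
matched_pairs = list(zip(range(87, 99), range(83, 95))) + list(zip(range(105, 109), range(95, 99)))
intervals = pairs_to_intervals(matched_pairs)
print(intervals)
```

A new interval begins whenever either sequence skips a position, which is what a gap in the other sequence causes.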


----

## Using the matching residue information to make Jmol commands

(This section is largely based on last section of a Jupyter notebook [here](http://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/notebooks/Determine%20residues%20that%20match%20to%20a%20reference%20from%20MSA%20and%20use%20to%20construct%20fit%20commands.ipynb). It can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder/tree/master/).)

Information from earlier needs to be incorporated to do this. That is done in the first few lines in the next cell.

The [Interactive Scripting Documentation states that by default, without `SUBSET` specified, `COMPARE` uses 'spine' atoms](https://chemapps.stolaf.edu/jmol/docs/#compare), and you might imagine that is a reasonable option since it is the default. However, I got much better results by specifying the alpha-carbon backbone directly. 

This moves each query structure to dock onto the reference structure.

In [25]:
# further prep for Jmol generating commands
ref_id_chain = 'K' # for Rpr2, from http://www.rcsb.org/structure/6AGB or http://www.rcsb.org/structure/6AH3
# for `query_chains_by_id`, use ids from the MSA file as keys and the
# chain designations from the PDB files as the values
query_chains_by_id = {'Snm1':'A',
                      '1x0t.1.A':'A',
                      'Rpa12':'I'}
ref_chain_structure = "6AH3_K"

# for `structures_by_id`, use ids from the MSA file as keys and 
# the ids in the PDB files (or any text without dashes that makes sense) as values
structures_by_id = {'Snm1':'Snm1p',
                      '1x0t.1.A':'1x0t',
                      'Rpa12':'Rpa12p'}

# for `models_by_id`, use ids from the MSA file as keys and the
# model number it corresponds to in the merged multi-model as the value
ref_model_num = 1
models_by_id = {'Snm1':2,
                '1x0t.1.A':3,
                'Rpa12':4}


from itertools import product
from string import ascii_lowercase
arbitrary_ids = (''.join(i) for i in product(ascii_lowercase, repeat = 2)) # see comment at https://stackoverflow.com/a/7074066/8508004


# Convert each dataframe to something similar to `ranges_o_residues_resolved_per_chain_pairs` from the notebook 
# `Using Biopython PDB module to list resolved residues and construct fit commands`.
# Example of `ranges_o_residues_resolved_per_chain_pairs` from that notebook to base it on:
# `{('A', 'F'): ['4:121', '129:158'], ('B', 'G'): ['14:104', '125:140']}`
# However, those were in the same chain, so it is more complex here in that regard!
# Since residue positions are needed for both chains, it will be more like below for
# one reference-query chain pair, using the range format that Jmol uses:
# `{('A', 'F'): [('4-121', 'x-xxx' ),('129-158', 'xxx-xxx')]}`
# And since there will be one of these for each pairing, I can store each dictionary
# in a dictionary keyed by the query id, like so (only two shown as an example):
# `{'BAH13127.1': {('A', 'G'): [('4-121', 'x-xxx' ),('129-158', 'xxx-xxx')]}, 'STV1':{('A', 'A'): [('14-104', 'x-xxx'),('125-140', 'xxx-xxx')]}}`  
paired_residue_sets_by_id = {}
for id_,df in dfs_by_id.items():
    paired_residue_sets = {}
    chain_pair = (ref_id_chain,query_chains_by_id[id_])
    #iterate through the rows of the dataframe making list of the pairings
    pairings = []
    for row in df.itertuples(index=False):
        pairings.append((f'{row[0]}-{row[1]}',f'{row[2]}-{row[3]}'))
    paired_residue_sets[chain_pair] = pairings
    paired_residue_sets_by_id[id_] = paired_residue_sets


# form commands
formatted_commands = ""
for id_,paired_residue_sets in paired_residue_sets_by_id.items():
    # Step through the paired ranges defining selections based on them.
    # Collect the names of those selections in `selection_pair_names` because once defined they
    # need to be called later in the compare command in two-item sets, one atom set for each chain.
    selection_pair_names = [] # to collect two-item tuples, each item being the arbitrary name of an atom set
    for chain_pair, paired_list in paired_residue_sets.items():
        ref_chain,query_chain = chain_pair
        for pos_pair in paired_list:
            sel_id = next(arbitrary_ids)
            query_sel_nom = f"sel_{sel_id}_model{structures_by_id[id_]}"
            formatted_commands += f"define {query_sel_nom} {pos_pair[1]}:{query_chain} and model=1.{models_by_id[id_]};\n"
            ref_sel_nom = f"sel_{sel_id}_model{ref_chain_structure}"
            formatted_commands += f"define {ref_sel_nom} {pos_pair[0]}:{ref_chain} and model=1.{ref_model_num};\n"
            selection_pair_names.append((query_sel_nom,ref_sel_nom))
    #print(selection_pair_names)
    pairs_of_atom_sets = ", ".join(f"{{{names[0]}}}{{{names[1]}}}" for names in selection_pair_names)
    #print(pairs_of_atom_sets)
    formatted_commands += f"origcoord = {{1.{models_by_id[id_]}}}.xyz.all;\n"
    formatted_commands += f"compare {{1.{models_by_id[id_]}}} {{1.{ref_model_num}}} SUBSET {{*.CA}} ATOMS {pairs_of_atom_sets} ROTATE TRANSLATE -2.5;\n"
    formatted_commands += "\n\n"
        
# residues_resolved_per_chain_b
#shared_positions_per_chain_pairs
#ranges_o_residues_resolved_per_chain_pairs
print("\n\n")
print ("FORMATTED JMOL/JSMOL COMMANDS:")
print(" ")
print(formatted_commands)




FORMATTED JMOL/JSMOL COMMANDS:
 
define sel_aa_modelSnm1p 14-25:A and model=1.2;
define sel_aa_model6AH3_K 37-48:K and model=1.1;
define sel_ab_modelSnm1p 26-33:A and model=1.2;
define sel_ab_model6AH3_K 56-63:K and model=1.1;
define sel_ac_modelSnm1p 35-43:A and model=1.2;
define sel_ac_model6AH3_K 64-72:K and model=1.1;
define sel_ad_modelSnm1p 44-56:A and model=1.2;
define sel_ad_model6AH3_K 74-86:K and model=1.1;
define sel_ae_modelSnm1p 58-79:A and model=1.2;
define sel_ae_model6AH3_K 87-108:K and model=1.1;
define sel_af_modelSnm1p 100-106:A and model=1.2;
define sel_af_model6AH3_K 109-115:K and model=1.1;
define sel_ag_modelSnm1p 108-136:A and model=1.2;
define sel_ag_model6AH3_K 116-144:K and model=1.1;
origcoord = {1.2}.xyz.all;
compare {1.2} {1.1} SUBSET {*.CA} ATOMS {sel_aa_modelSnm1p}{sel_aa_model6AH3_K}, {sel_ab_modelSnm1p}{sel_ab_model6AH3_K}, {sel_ac_modelSnm1p}{sel_ac_model6AH3_K}, {sel_ad_modelSnm1p}{sel_ad_model6AH3_K}, {sel_ae_modelSnm1p}{sel_ae_model6AH3_K}, {sel

In [26]:
%store formatted_commands > all_jmol_commands.txt

Writing 'formatted_commands' (str) to file 'all_jmol_commands.txt'.


Downloaded `all_jmol_commands.txt` to my local drive and changed the file name to something more specific there, `jmol_fit_commands_1_vs_21_vs_2_vs_12.txt`.

Loaded the Proteopedia page I made earlier and then brought up the `Console` for the structure window containing the unsuperimposed structures.

Ran `model 0` to show all models and then pasted in the Jmol commands to do the fits. When the superposing was complete, I used the following to save the structures docked upon one another.

    write COORDS PDB "rpr2_snm1_1x0t_rpa12FIT.pdb"

Saved that to my local drive. It keeps them as multi-model PDB files.
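
A quick sanity check on a saved multi-model file is to count alpha-carbons per model, since that shows both that all models made it into the file and roughly how many residues each has. This is a minimal sketch that assumes standard fixed-column PDB records (atom name in columns 13-16); `ca_atoms_per_model` and the tiny example text are made up for illustration.

```python
def ca_atoms_per_model(pdb_text):
    """Map each model number to the count of ATOM records named CA."""
    counts = {}
    current_model = None
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            current_model = int(line.split()[1])
            counts[current_model] = 0
        elif line.startswith("ATOM") and current_model is not None:
            # Atom name occupies columns 13-16 (0-based slice 12:16)
            if line[12:16].strip() == "CA":
                counts[current_model] += 1
    return counts


# Made-up two-model text; coordinates are omitted since they are not used here
example_text = "\n".join([
    "MODEL        1",
    "ATOM      1  N   LEU K  37",
    "ATOM      2  CA  LEU K  37",
    "ATOM      3  CA  ASN K  38",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  CA  SER A  14",
    "ENDMDL",
])
ca_counts = ca_atoms_per_model(example_text)
print(ca_counts)
```

For the file saved here I would expect four keys, one per merged structure.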

----



## Using the matching residue information to make Pymol commands

Note that I didn't end up needing to do this when originally just aligning the rpr2/snm1 and related homologs, because I already created `rpr2_snm1_1x0t_rpa12FIT.pdb` in Jmol that I could just open in Pymol.

(This section is largely based on the 'Pymol' section of a Jupyter notebook [here](http://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/notebooks/Determine%20residues%20that%20match%20to%20a%20reference%20from%20MSA%20and%20use%20to%20construct%20fit%20commands.ipynb). It can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder/tree/master/).)

Information from earlier needs to be incorporated to do this. That is done in the first few lines in the next cell.

This moves each query structure to dock onto the reference structure.

In [27]:
# further prep for generating commands
ref_id_chain = 'K' # for Rpr2, from http://www.rcsb.org/structure/6AGB or http://www.rcsb.org/structure/6AH3
# for `query_chains_by_id`, use ids from the MSA file as keys and the
# chain designations from the PDB files as the values
query_chains_by_id = {'Snm1':'A',
                      '1x0t.1.A':'A',
                      'Rpa12':'I'}
ref_chain_structure = "6AH3_K"
# for `structures_by_id`, use ids from the MSA file as keys and 
# the ids in the PDB files as values
structures_by_id = {'Snm1':'Snm1p',
                      '1x0t.1.A':'1x0t',
                      'Rpa12':'Rpa12p'}
# Convert each dataframe to something similar to `ranges_o_residues_resolved_per_chain_pairs` from the notebook 
# `Using Biopython PDB module to list resolved residues and construct fit commands`.
# Example of `ranges_o_residues_resolved_per_chain_pairs` from that notebook to base it on:
# `{('A', 'F'): ['4:121', '129:158'], ('B', 'G'): ['14:104', '125:140']}`
# However, those were in the same chain, so it is more complex here in that regard!
# Since residue positions are needed for both chains, it will be more like below for
# one reference-query chain pair:
# `{('A', 'F'): (['4:121', '129:158'],['x:xxx', 'xxx:xxx'])}`
# And since there will be one of these for each pairing, I can store each dictionary
# in a dictionary keyed by the query id, like so (only two shown as an example):
# `{'BAH13127.1': {('A', 'G'): (['4:121', '129:158'],['x:xxx', 'xxx:xxx'])}, 'STV1':{('A', 'A'): (['14:104', '125:140'],['x:xxx', 'xxx:xxx'])}}`  
matched_ranges_o_residues_per_chain_pairs_by_id = {}
for id_,df in dfs_by_id.items():
    matched_ranges_o_residues_per_chain_pairs = {}
    chain_pair = (ref_id_chain,query_chains_by_id[id_])
    #iterate through the rows of the dataframe making the two lists
    ref_list = []
    query_list = []
    for row in df.itertuples(index=False):
        ref_list.append(f'{row[0]}:{row[1]}')
        query_list.append(f'{row[2]}:{row[3]}')
    matched_ranges_o_residues_per_chain_pairs[chain_pair] = (ref_list,query_list)
    matched_ranges_o_residues_per_chain_pairs_by_id[id_] = matched_ranges_o_residues_per_chain_pairs

# form commands
formatted_commands = ""
for id_,matched_ranges_per_ch_p in matched_ranges_o_residues_per_chain_pairs_by_id.items():
    selection_pairs = [] #to collect two item tuple elements, each item being a name
    for chain_pair in matched_ranges_per_ch_p:
        structures = (ref_chain_structure, structures_by_id[id_])
        selection_name_pair = []
        for indx,chain in enumerate(chain_pair):
            sel_nom = f"{structures[indx]}ch{chain}CA" #`CA` at end stands for `name CA` / alpha-carbon
            formatted_commands += f"select {sel_nom},"
            formatted_commands += "|".join( f" {structures[indx]} and resid {pos_range} and chain {chain} and name CA " 
                                           for pos_range in matched_ranges_per_ch_p[chain_pair][indx] ) 
            #for pos_range in ranges_o_residues_resolved_per_chain_pairs[chain_pair]:
             #   formatted_commands += f"{structures[indx].id} and resid {pos_range} and chain {chain} and name CA"
            formatted_commands += "\n"
            selection_name_pair.append(sel_nom)
        selection_pairs.append(selection_name_pair)
    first_structure_selections = "|".join(f" {x[0]} " for x in selection_pairs)
    second_structure_selections = "|".join(f" {x[1]} " for x in selection_pairs)
    formatted_commands += f"pair_fit ( {first_structure_selections}), ( {second_structure_selections})"
    formatted_commands += "\n\n"
        
# residues_resolved_per_chain_b
#shared_positions_per_chain_pairs
#ranges_o_residues_resolved_per_chain_pairs
print("\n\n")
print ("FORMATTED PYMOL COMMANDS:")
print(" ")
print(formatted_commands)




FORMATTED PYMOL COMMANDS:
 
select 6AH3_KchKCA, 6AH3_K and resid 37:48 and chain K and name CA | 6AH3_K and resid 56:63 and chain K and name CA | 6AH3_K and resid 64:72 and chain K and name CA | 6AH3_K and resid 74:86 and chain K and name CA | 6AH3_K and resid 87:108 and chain K and name CA | 6AH3_K and resid 109:115 and chain K and name CA | 6AH3_K and resid 116:144 and chain K and name CA 
select Snm1pchACA, Snm1p and resid 14:25 and chain A and name CA | Snm1p and resid 26:33 and chain A and name CA | Snm1p and resid 35:43 and chain A and name CA | Snm1p and resid 44:56 and chain A and name CA | Snm1p and resid 58:79 and chain A and name CA | Snm1p and resid 100:106 and chain A and name CA | Snm1p and resid 108:136 and chain A and name CA 
pair_fit (  6AH3_KchKCA ), (  Snm1pchACA )

select 6AH3_KchKCA, 6AH3_K and resid 37:47 and chain K and name CA | 6AH3_K and resid 56:63 and chain K and name CA | 6AH3_K and resid 64:72 and chain K and name CA | 6AH3_K and resid 74:108 and chain

In [28]:
%store formatted_commands > all_pymol_commands.txt

Writing 'formatted_commands' (str) to file 'all_pymol_commands.txt'.


Downloaded `all_pymol_commands.txt` to my local drive and changed the file name to something more specific there, `pymol_fit_commands_1_vs_21_vs_2_vs_12.txt`.
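
Because `pair_fit` pairs atoms one-to-one, both selections need the same number of atoms. A cheap check before pasting commands into PyMOL is to confirm that each reference range spans as many residue numbers as its matched query range. This is a minimal sketch with hypothetical helper names, and it only compares the numbering; residues missing from a structure file would still cause a mismatch that this cannot see.

```python
def range_length(range_text, sep=":"):
    """Number of residue numbers covered by a 'start:end' range."""
    start, end = (int(value) for value in range_text.split(sep))
    return end - start + 1


def mismatched_ranges(ref_ranges, query_ranges):
    """Return the (ref, query) range pairs whose lengths differ."""
    return [
        (ref, query)
        for ref, query in zip(ref_ranges, query_ranges)
        if range_length(ref) != range_length(query)
    ]


# Ranges for the Rpr2 (reference) vs. Snm1 (query) pairing shown above
ref_ranges = ["37:48", "56:63", "64:72", "74:86", "87:108", "109:115", "116:144"]
query_ranges = ["14:25", "26:33", "35:43", "44:56", "58:79", "100:106", "108:136"]

problems = mismatched_ranges(ref_ranges, query_ranges)
total_pairs = sum(range_length(r) for r in ref_ranges)
print(problems, total_pairs)
```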

----

## Add consensus to the alignment

This section is largely based on the 'Add a consensus symbol line to an MSA' section of a Jupyter notebook [here](https://nbviewer.jupyter.org/github/fomightez/cl_sq_demo-binder/blob/master/notebooks/Use%20biopython%20to%20make%20valid%20CLUSTAL%20formatted%20MSAs%2C%20check%20sequence%20of%20manually%20edited%20alignment%2C%20and%20add%20consensus%20line.ipynb#Add-a-consensus-symbol-line-to-an-MSA). It can be launched in active form from [here](https://github.com/fomightez/cl_sq_demo-binder).


In [29]:
#Get the symbol adding script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/88d2076fcb6014bdad7f07b293cead16ae2e9fe0/alignment-utilities/calculate_cons_for_clustal_protein.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24121  100 24121    0     0   120k      0 --:--:-- --:--:-- --:--:--  120k


In [30]:
%run calculate_cons_for_clustal_protein.py alignment.clw

Alignment file read...top line identifier determined as 'Snm1'...bottom line identifier determined as 'Rpa12'...
individual lines for each sequence identifier parsed...determining conservation of aligned sequences...

Alignment with conservation indication symbols added saved as 'alignment_plusCONS.clw'.
Finished.
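
To give a feel for what the symbols mean, here is a minimal sketch that marks fully conserved columns with `*` and columns whose residues all fall within one of the Clustal 'strong' groups with `:`. It is not the code of `calculate_cons_for_clustal_protein.py`: it leaves out the weaker `.` symbol, and treating any column containing a gap as unmarked is my assumption. The example rows are the first 14 columns of the second alignment block.

```python
# Clustal 'strong' conservation groups
STRONG_GROUPS = ("STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW")


def consensus_symbol(column, gap="-"):
    """Return '*', ':' or ' ' for one alignment column (tuple of residues)."""
    residues = set(column)
    if gap in residues:
        return " "  # assumption: columns with any gap get no symbol
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    return " "


def consensus_line(aligned_rows):
    """Build the symbol line for equal-length aligned rows."""
    return "".join(consensus_symbol(column) for column in zip(*aligned_rows))


rows = [
    "EQLKSGKFCSHCGC",  # Snm1
    "RKWK-RRYCKRCHT",  # 1x0t.1.A
    "PTIK-RTICKKCHR",  # Rpr2
    "-----KEKCPQCGN",  # Rpa12
]
symbols = consensus_line(rows)
print(repr(symbols))
```

For this slice the sketch marks the two conserved cysteines with `*` and the H/R/Q/Q column between them with `:`.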


In [31]:
!cat alignment_plusCONS.clw

CLUSTAL plus hand-editing

Snm1      MNKDQAEKYQERSLRQKYNLLHVLP-------TLNSRALSGLYYKNFHNS-VKRYQIMLP
1x0t.1.A  -----------ERIDTLFTLAERV--------ARYSPDLAKRYVELALEI-QKKAKVKIP
Rpr2      -------------LNYLYQISAYQTRARQKARTDAHTPLA-RNYIKSMDLISKKTKTSLL
Rpa12     ------------------------------------------------------------
                                                                      

Snm1      EQLKSGKFCSHCGCVYVPNFNASLQLTTNTEQGDSDELGGESMEGPKKCIQVNCLNCEKS
1x0t.1.A  RKWK-RRYCKRCHTFLIPGVNARVRLRTKR----------------MPHVVITCLECGYI
Rpr2      PTIK-RTICKKCHRLLWTPKKLEITSD--------------------GALSVMC-GCGTV
Rpa12     -----KEKCPQCGNEEM------NYHTLQ--LRSADE---------GATVFYTCTSCGYK
                  * :*                                     :   *  *   

Snm1      KLFEWKSEFVVPTFGQDVSPMINSTSSGKVSYAVKKPQKSKTSTGKERSKKRKLNSLTNL
1x0t.1.A  MRYPYLREVK--------------------------------------------------
Rpr2      KRFNIGADPNYRTYSEREGNLLNS------------------------------------
Rpa12     FRTNN----------------