# GSD: Conversion of FASTA alignment format to clustal and making of molvis commands

The impetus for this notebook is that I used [I-TASSER](https://zhanglab.ccmb.med.umich.edu/I-TASSER/) to model a protein using the 'Specify template with alignment' option, which requires a multiple sequence alignment (MSA) in FASTA format as part of the submission. Now that the model is back, I want to use that MSA to produce the fit commands for PyMOL.  

The basis for this process is split among the following notebooks, which should be available in this session. The order of the listing roughly corresponds to how the steps fit together:

- [Convert MSA in FASTA format to clustal format using Biopython](../Conversion%20of%20FASTA%20alignment%20format%20to%20clustal%20using%20Biopython.ipynb)
- [Determine residues that match to a reference from multiple sequence alignment and use to construct fit commands](../Determine%20residues%20that%20match%20to%20a%20reference%20from%20MSA%20and%20use%20to%20construct%20fit%20commands.ipynb)



-----

## Preparation: 

Preparation steps:
- get the necessary scripts
- make the input for the Biopython conversion step

In [None]:
#get scripts to use here
import os
if not os.path.isfile("MSA_to_corresponding_residue_numbers.py"):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/MSA_to_corresponding_residue_numbers.py

In [None]:
s='''FOR MY EYES ONLY
'''

In [None]:
%store s >aligned_seqs.fa
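
Before converting, it can be worth confirming that the FASTA text really is an alignment, meaning every sequence (gaps included) has the same length, since alignment parsers generally expect that. The following is only a minimal sketch using the standard library; the helper names and the two-sequence example are made up for illustration and are not part of the scripts used in this notebook.

```python
def read_fasta_alignment(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def alignment_is_rectangular(records):
    """Return True if every aligned sequence (gaps included) has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Hypothetical two-sequence alignment, made up for illustration only
example_fasta = """>ref_seq
MKT-AYIAKQR
>query_seq
MKTWAY--KQR
"""
example_records = read_fasta_alignment(example_fasta)
print(alignment_is_rectangular(example_records))
```

To check the real input, the same two helpers could be applied to the contents of `aligned_seqs.fa` instead of `example_fasta`.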

### Conversion step

The conversion step takes the FASTA-formatted alignment saved above as `aligned_seqs.fa` and writes it out in Clustal format as `aligned_seqs.clw`.

In [None]:
# Based on section 'File Format Conversion' at https://biopython.org/wiki/AlignIO
from Bio import AlignIO

# Context managers close both files even if the conversion raises an error
with open("aligned_seqs.fa", "r") as input_handle, open("aligned_seqs.clw", "w") as output_handle:
    alignments = AlignIO.parse(input_handle, "fasta")
    AlignIO.write(alignments, output_handle, "clustal")

### Verify conversion result

Verify it worked by examining the resulting file `aligned_seqs.clw`.

In [None]:
!cat aligned_seqs.clw

## Using the aligned sequences to make the Fit commands for PyMOL

We'll use the main function of `MSA_to_corresponding_residue_numbers.py` to process the multiple sequence alignment in Clustal format. For a specified reference sequence in the alignment, it determines the corresponding residue numbers of aligned residues in the pairing of that reference with every other aligned sequence. We'll then use the resulting dataframe of corresponding residue numbers to produce molecular visualization commands for doing the fit in PyMOL.
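
To make the idea of corresponding residue numbers concrete, here is a minimal sketch of how gap-free stretches of a pairwise alignment can be turned into matched residue ranges. It is not the implementation in `MSA_to_corresponding_residue_numbers.py`; the function name, the example sequences, and the `ref_start`/`query_start` parameters (meant to play a role similar to what `supplied_start_pos` presumably does) are assumptions for illustration.

```python
def matched_ranges(ref_aln, query_aln, ref_start=1, query_start=1):
    """Return (ref_first, ref_last, query_first, query_last) tuples, one per
    contiguous run of alignment columns where neither sequence has a gap."""
    ranges = []
    # Positions hold the number of the most recently seen residue
    ref_pos, query_pos = ref_start - 1, query_start - 1
    run = None  # current run as [ref_first, ref_last, query_first, query_last]
    for r, q in zip(ref_aln, query_aln):
        if r != "-":
            ref_pos += 1
        if q != "-":
            query_pos += 1
        if r != "-" and q != "-":
            if run is None:
                run = [ref_pos, ref_pos, query_pos, query_pos]
            else:
                run[1], run[3] = ref_pos, query_pos
        elif run is not None:
            # A gap in either sequence ends the current matched run
            ranges.append(tuple(run))
            run = None
    if run is not None:
        ranges.append(tuple(run))
    return ranges


# Hypothetical aligned pair; the query numbering is assumed to start at 3
example_ranges = matched_ranges("MKT-AYIAKQR", "MKTWAY--KQR",
                                ref_start=1, query_start=3)
print(example_ranges)
# → [(1, 3, 3, 5), (4, 5, 7, 8), (8, 10, 9, 11)]
```

Each tuple has four values, mirroring the four columns that the command-building code later in this notebook reads from each dataframe row.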

First we'll run the command to make the needed dataframe and bring it into the namespace of the notebook.

In [None]:
from MSA_to_corresponding_residue_numbers import MSA_to_corresponding_residue_numbers
ref_id, dfs_by_id = MSA_to_corresponding_residue_numbers("aligned_seqs.clw","7c7a:L", return_dfs = True , supplied_start_pos = [1,3])

Because there are only two sequences in the alignment, only one dataframe is actually made. However, the script and process are set up to handle multiple sequences in the alignment being compared to a reference sequence. So bear in mind that `dfs_by_id` is a collection of dataframes that happens to hold only one here, and that the code below was worked out to loop over several.

In [None]:
print(ref_id)

In [None]:
for ea_id in dfs_by_id:
    display(dfs_by_id[ea_id])

Now the dataframe can be used to make commands based on the identifiers and matching residue numbering.

Some additional data needs to be provided because the equivalent chains of the model file and the structure weren't yet provided. 

Additionally, the PDB id codes, or for structures not in the PDB the identifiers used in the corresponding PyMOL session, must be specified before constructing the commands.

In [None]:
# further prep for generating commands
ref_id_chain = 'L' # for VPH1, from http://www.rcsb.org/structure/6C6L
# for `query_chains_by_id`, use ids from the MSA file as keys and the
# chain designations from the PDB files as the values
query_chains_by_id = {'alignment_id_of_protein_being_modeled':'A',}
ref_chain_structure = "7c7aBIOL"
# for `structures_by_id`, use ids from the MSA file as keys and 
# the ids in the PDB files as values
structures_by_id = {'alignment_id_of_protein_being_modeled':'model1',}

# Convert each dataframe to something similar to `ranges_o_residues_resolved_per_chain_pairs` from the notebook 
# `Using Biopython PDB module to list resolved residues and construct fit commands`
# Example of `ranges_o_residues_resolved_per_chain_pairs` from that notebook to base it on:
# `{('A', 'F'): ['4:121', '129:158'], ('B', 'G'): ['14:104', '125:140']}`
# However, those were in the same chain, so things are more complex here in that regard!
# Since residue positions are needed for both chains, it will be more like below for
# one reference-query chain pair:
# `{('A', 'F'): (['4:121', '129:158'],['x:xxx', 'xxx:xxx'])}`
# And since there should be one of these for each pairing, each dictionary can be stored
# in a dictionary keyed by the query id, like so (only two shown as an example):
# `{'BAH13127.1': {('A', 'G'): (['4:121', '129:158'],['x:xxx', 'xxx:xxx'])}, 'STV1':{('A', 'A'): (['14:104', '125:140'],['x:xxx', 'xxx:xxx'])}}`  
matched_ranges_o_residues_per_chain_pairs_by_id = {}
for id_,df in dfs_by_id.items():
    matched_ranges_o_residues_per_chain_pairs = {}
    chain_pair = (ref_id_chain,query_chains_by_id[id_])
    #iterate through the rows of the dataframe making the two lists
    ref_list = []
    query_list = []
    for row in df.itertuples(index=False):
        ref_list.append(f'{row[0]}:{row[1]}')
        query_list.append(f'{row[2]}:{row[3]}')
    matched_ranges_o_residues_per_chain_pairs[chain_pair] = (ref_list,query_list)
    matched_ranges_o_residues_per_chain_pairs_by_id[id_] = matched_ranges_o_residues_per_chain_pairs

# form commands
formatted_commands = ""
for id_,matched_ranges_per_ch_p in matched_ranges_o_residues_per_chain_pairs_by_id.items():
    selection_pairs = [] #to collect two item tuple elements, each item being a name
    for chain_pair in matched_ranges_per_ch_p:
        structures = (ref_chain_structure, structures_by_id[id_])
        selection_name_pair = []
        for indx,chain in enumerate(chain_pair):
            sel_nom = f"{structures[indx]}ch{chain}CA" #`CA` at end stands for `name CA` / alpha-carbon
            formatted_commands += f"select {sel_nom},"
            formatted_commands += "|".join( f" {structures[indx]} and resid {pos_range} and chain {chain} and name CA " 
                                           for pos_range in matched_ranges_per_ch_p[chain_pair][indx] ) 
            #for pos_range in ranges_o_residues_resolved_per_chain_pairs[chain_pair]:
             #   formatted_commands += f"{structures[indx].id} and resid {pos_range} and chain {chain} and name CA"
            formatted_commands += "\n"
            selection_name_pair.append(sel_nom)
        selection_pairs.append(selection_name_pair)
    first_structure_selections = "|".join(f" {x[0]} " for x in selection_pairs)
    second_structure_selections = "|".join(f" {x[1]} " for x in selection_pairs)
    formatted_commands += f"pair_fit ( {first_structure_selections}), ( {second_structure_selections})"
    formatted_commands += "\n\n"
        
# residues_resolved_per_chain_b
#shared_positions_per_chain_pairs
#ranges_o_residues_resolved_per_chain_pairs
print("\n\n")
print ("FORMATTED PYMOL COMMANDS:")
print(" ")
print(formatted_commands)
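
To show the shape of the text that results, here is a condensed sketch of the same command-building idea for a single reference-query pair. The function name and the residue ranges are hypothetical and made up for illustration; the structure and chain identifiers reuse the values from the cell above. Each range is wrapped in parentheses here only to make the grouping explicit, which differs slightly from the spacing the cell above emits.

```python
def build_fit_commands(structures, chains, ranges):
    """Build PyMOL `select` and `pair_fit` text for one reference-query pair.

    Each argument is a 2-tuple ordered (reference, query); `ranges` holds two
    equally long lists of 'start:end' residue ranges.
    """
    commands, names = [], []
    for structure, chain, range_list in zip(structures, chains, ranges):
        name = f"{structure}ch{chain}CA"  # `CA` marks an alpha-carbon selection
        terms = " | ".join(
            f"({structure} and resid {r} and chain {chain} and name CA)"
            for r in range_list
        )
        commands.append(f"select {name}, {terms}")
        names.append(name)
    # Fit the reference selection onto the query selection
    commands.append(f"pair_fit ({names[0]}), ({names[1]})")
    return "\n".join(commands)


# Made-up residue ranges, for illustration only
example_commands = build_fit_commands(
    structures=("7c7aBIOL", "model1"),
    chains=("L", "A"),
    ranges=(["1:3", "4:5"], ["3:5", "7:8"]),
)
print(example_commands)
```

The output is two `select` lines, one per structure, followed by a single `pair_fit` line that refers to the two named selections.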

-------

Enjoy.