# Using Python to examine HH-suite results - example making multi-fasta sequence format from query and hit sequences

Run this in sessions launched from [my HH-suite3-binder repo](https://github.com/fomightez/hhsuite3-binder) because the software is already installed.   
This follows from the notebook in this series entitled [Using Python to examine HH-suite3 results](Using%20Python%20to%20examine%20HH-suite3%20results.ipynb).

Here, using the basics from that previous notebook, I will demonstrate a specific example of going from the text-based HH-suite3 results output files, which end in `.hhr`, to an aligned multi-FASTA version of the sequence alignment between the query and a hit. This is similar to the format that I-TASSER needs for submitting an alignment, and so it can be a useful step if you now want to model the structure of a homolog using the HH-suite3 sequence alignment when the macromolecular structure for the query is already known.


-----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them an <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----



## Preparation

This preparation is based on what is already covered in the notebook entitled [Using Python to examine HH-suite3 results](Using%20Python%20to%20examine%20HH-suite3%20results.ipynb); however, it does not require you to have run that notebook recently. Just run everything under this 'Preparation' section to continue on with the rest of the notebook.

Get the `hhsuite3_results_to_df.py` script that will be used to mine the results files.

In [1]:
import os
file_needed = "hhsuite3_results_to_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/hhsuite3-utilities/hhsuite3_results_to_df.py

Get some example files with HH-suite3 query results.  
For ease, I'll just use some of the results files that the biopython package uses to test that its parsing code is working well. They are publicly available [here](https://github.com/biopython/biopython/tree/master/Tests/HHsuite); the `README` file [there](https://github.com/biopython/biopython/tree/master/Tests/HHsuite) summarizes the content of those files.

If you prefer to use your own result files, upload your file to this session and change the appropriate file names in later steps.  
(Note: a couple of the `.hhr` files had `Done!` close to the last line of the file. I hadn't seen that in the `.hhr` files from HH-suite3 that I developed with, and it caused errors for my `hhsuite3_results_to_df.py`. To make the files more consistent, I'm removing it. I assume it comes from a different, presumably older, format version of HH-suite, since biopython, which is the source of these test files, mentions that version in its current documentation.)

In [26]:
import os
files_needed = ["hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr",
                "2uvo_hhblits.hhr",
                "2uvo_hhsearch.hhr",
                "hhpred_9590198.hhr"]
url_prefix = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/HHsuite/"
for file_needed in files_needed:
    if not os.path.isfile(file_needed):
        !curl -OL {url_prefix+file_needed}
!sed -i 's/Done!//g' 2uvo_hhblits.hhr
!sed -i 's/Done!//g' 2uvo_hhsearch.hhr

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17947  100 17947    0     0  41834      0 --:--:-- --:--:-- --:--:-- 41834
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38142  100 38142    0     0   164k      0 --:--:-- --:--:-- --:--:--  164k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39235  100 39235    0     0  49980      0 --:--:-- --:--:-- --:--:-- 49917
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40978  100 40978    0     0  98268      0 --:--:-- --:--:-- --:--:-- 98268
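
The `sed` calls above edit the two files in place. If you are working somewhere `sed -i` behaves differently or is unavailable, the same clean-up can be sketched in plain Python. This is only an illustrative alternative; the helper name `strip_done_marker` is my own invention and is not part of `hhsuite3_results_to_df.py` or biopython.

```python
from pathlib import Path


def strip_done_marker(hhr_path, marker="Done!"):
    """Remove every occurrence of `marker` from a file, editing it in place.

    Mirrors what `sed -i 's/Done!//g'` does. Returns the number of
    occurrences that were removed. (Hypothetical helper for illustration.)
    """
    path = Path(hhr_path)
    text = path.read_text()
    count = text.count(marker)
    if count:
        path.write_text(text.replace(marker, ""))
    return count


# Only touch the two files handled by `sed` above, and only if present.
for name in ["2uvo_hhblits.hhr", "2uvo_hhsearch.hhr"]:
    if Path(name).is_file():
        strip_done_marker(name)
```

Like the `sed` version, this leaves an empty line where `Done!` used to be rather than deleting the line.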


Define a function that breaks up a long sequence into chunks suitable for use as lines of output.

In [11]:
# adapted from https://github.com/fomightez/structurework/blob/0580b7470a866f5dc9f75aaf5200a5a5415b4f5d/pdbsum-utilities/similarities_in_proteinprotein_interactions.py#L161
def generate_seq_chunks(seq_string, chunk_size = 60):
    '''
    This takes a sequence as a string and breaks it up into a list of strings
    of set length, with graceful handling of the last line, which will most
    likely not be full length.
    The list of sequence strings gets returned.

    `chunk_size = 60` sets the number of residues per line in the FASTA
    output; note this chunking to multiple lines is the opposite of what
    PatMatch's `unjustify_fasta.pl` does.

    Note `chunk_size` defaults to 60 here, but optionally a different one
    can be provided.

    I believe my chunking code is based on 
    https://stackoverflow.com/a/13673133/8508004 or 
    https://stackoverflow.com/a/9475354/8508004 , see my gist 
    https://gist.github.com/fomightez/ef7583919dde51f3569731ca1c5247ba for some 
    notes on it and more related.
    There's a related code in my script 
    `similarities_in_proteinprotein_interactions.py` inside my repo 
    structurework/pdbsum-utilities/  and in 
    `Searching for homologs among deduced proteins from a genome using BLAST and Python.ipynb`
    & `Searching for coding sequences in genomes using BLAST and Python.ipynb`
    & `notebooks/GSD/GSD Rpb1_orthologs_in_1011_genomes.ipynb` inside my repo 
    blast-binder.
    '''
    return [seq_string[i:i+chunk_size] for i in range(
        0, len(seq_string),chunk_size)]
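
As a quick check of how the chunking behaves, here is a minimal sketch that repeats the body of the function and applies it to a made-up 130-residue string. The toy sequence is purely illustrative and does not come from any results file.

```python
def generate_seq_chunks(seq_string, chunk_size=60):
    """Break a sequence string into a list of strings of at most chunk_size."""
    return [seq_string[i:i+chunk_size] for i in range(
        0, len(seq_string), chunk_size)]


toy_seq = "ACDEFGHIKL" * 13  # made-up 130-residue sequence for illustration
chunks = generate_seq_chunks(toy_seq)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # prints [60, 60, 10]: two full lines and a partial one

# Joining the chunks back together recovers the original sequence.
rejoined = "".join(chunks)
print(rejoined == toy_seq)  # prints True
```

Because the slicing never runs past the end of the string, the last chunk simply holds whatever residues are left, and an empty string yields an empty list.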

----

## Using hhsuite parsing code from biopython

This section builds on the section 'Using hhsuite parsing code from biopython' from [Using Python to examine HH-suite3 results](Using%20Python%20to%20examine%20HH-suite3%20results.ipynb). You'll want to review that notebook first if you are having any issues following some of the initial steps to access the `.hhr` file and parse it.

This approach uses biopython to parse an HH-suite3 results `.hhr` file into Python objects that play well with biopython / Python. Biopython has already been installed here. If you are trying to run this Jupyter notebook elsewhere, you'd need to install it first.

In [3]:
from Bio.SearchIO import parse # import module of Biopython needed into active kernel namespace

The next command will parse an `.hhr` file. Note that the `, "hhsuite2-text"` part specifies the format the biopython parser will use. If you've paid attention, you'll know HH-suite3 is the current release. It looks like Biopython hasn't caught up with that yet; fortunately, in my experience the `hhsuite2-text` option works to a great extent with current, HH-suite3-produced `.hhr` files.

In [27]:
parsed_file_info = parse("hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr", "hhsuite2-text") # based on test_SearchIO_hhsuite2_text.py and https://biopython.org/docs/1.75/api/Bio.SearchIO.html
results = list(parsed_file_info)

Here, I just want to deal with the best hit, so the loops below use the slices `results[:1]` and `result.hsps[:1]` to visit only the first result and its first HSP.
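
A side benefit of slicing with `[:1]` instead of indexing with `[0]` is that it degrades gracefully when there is nothing to process. The sketch below uses an empty list as a stand-in for a results file that yielded no results; it does not involve biopython at all.

```python
empty_results = []  # stand-in for a parse that yielded no results

# Slicing never raises; on an empty list it just gives back an empty list,
# so a `for` loop over it is skipped without error.
first_or_nothing = empty_results[:1]
times_looped = 0
for result in first_or_nothing:
    times_looped += 1  # never reached when the list is empty

# Indexing the same empty list raises IndexError instead.
try:
    empty_results[0]
    indexing_failed = False
except IndexError:
    indexing_failed = True

print(first_or_nothing, times_looped, indexing_failed)  # prints [] 0 True
```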

In [28]:
for result in results[:1]:
    for hsp in result.hsps[:1]:
        '''
        print(hsp.score)
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(hsp.query.id)
        print(hsp.hit.description) # I guessed `.hit.description` this based on `self.assertEqual("T", str(hsp.hit.seq))` in test_SearchIO_hhsuite2_text.py because I didn't see any example for this one)
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(hsp.hit.seq)
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(hsp.query.seq)
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(generate_seq_chunks(hsp.query.seq))
        print(generate_seq_chunks(hsp.hit.seq))
        '''
        seq_fa_query = ">" + hsp.query.id.split()[0] + "\n"+"\n".join(
                    generate_seq_chunks(str(hsp.query.seq)))
        print(seq_fa_query)
        seq_fa_hit = ">" + hsp.hit.description.split()[0] + "\n"+"\n".join(
                    generate_seq_chunks(str(hsp.hit.seq)))
        print(seq_fa_hit)

>sp|Q9BSU1|CP070_HUMAN
SLGNEQWEFTLGMPLAQAVAILQKHCRIIKNVQVLYSEQSPLSHDLILNLTQDGIKLMFD
AFNQRLKVIEVCDLTKVKLKYCGVHFNSQAIAPTIEQIDQSFGATHPGVYNSAEQLFHLN
FRGLSFSFQLDSWTEAPKYEPNFAHGLASLQIPHGATVKRMYIYSGNSLQDTKAPMMPLS
CFLGNVYAESVDVLRDGTGPAGLRLRLLAAGCGPGLLADAKMRVFERSVYFGDSCQDVLS
MLGSPHKVFYKSEDKMKIHSPSPHKQVPSKCNDYFFNYFTLGVDILFDANTHKVKKFVLH
TNYPGHYNFNIYHRCEFKIPLAIKKENADGQTE--TCTTYSKWDNIQELLGHPVEKPVVL
HRSSSPNNTNPFGSTFCFGLQRMIFEVMQNNHIASVTLY
>UPF0183
EQWE----FALGMPLAQAISILQKHCRIIKNVQVLYSEQMPLSHDLILNLTQDGIKLLFD
ACNQRLKVIEVYDLTKVKLKYCGVHFNSQAIAPTIEQIDQSFGATHPGVYNAAEQLFHLN
FRGLSFSFQLDSWSEAPKYEPNFAHGLASLQIPHGATVKRMYIYSGNNLQETKAPAMPLA
CFLGNVYAECVEVLRDGAGPLGLKLRLLTAGCGPGVLADTKVRAVERSIYFGDSCQDVLS
ALGSPHKVFYKSEDKMKIHSPSPHKQVPSKCNDYFFNYYILGVDILFDSTTHLVKKFVLH
TNFPGHYNFNIYHRCDFKIPLIIKKDGADAHSEDCILTTYSKWDQIQELLGHPMEKPVVL
HRSSSANNTNPFGSTFCFGLQRMIFEVMQNNHIASVTLY


So that took the aligned sequences for the best hit in the HH-suite3 results `.hhr` file, extracted the query and hit sequences, and formatted them as aligned multi-FASTA entries.

**Note the aligned sequences have the same length even though, if you discount the gaps, they are not the same length.**  
In other words, the interleaved complex alignment blocks have been converted to aligned FASTA format sequence entries.
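
To make that length property concrete, here is a minimal sketch that parses aligned FASTA text and compares the lengths with and without gap characters. The two short sequences and the helper name `read_fasta_text` are made up for illustration; to check the real output you would paste in the entries printed above instead.

```python
def read_fasta_text(fasta_text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    current_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


# Made-up aligned pair for illustration; not taken from any results file.
toy_fasta = """>toy_query
MKTAYIAK--QRQISFVK
>toy_hit
MK----AKEEQRQLSFVK
"""

records = read_fasta_text(toy_fasta)
aligned_lengths = {seq_id: len(seq) for seq_id, seq in records.items()}
ungapped_lengths = {seq_id: len(seq.replace("-", ""))
                    for seq_id, seq in records.items()}
print(aligned_lengths)   # lengths including gap characters
print(ungapped_lengths)  # lengths after removing gap characters
```

In an aligned FASTA file every entry should report the same aligned length, while the ungapped lengths are free to differ.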


You may wish to review the sequences in the alignment of the query to the first hit in `hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr` to understand what is going on.
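
If you would rather have the two entries in a file than printed to the screen, something along the following lines could work. This is a hedged sketch: `make_fasta_entry`, `write_aligned_fasta`, and the output file name are hypothetical names of my own, and the demonstration uses made-up sequences written to a temporary directory.

```python
import os
import tempfile


def make_fasta_entry(seq_id, seq_string, chunk_size=60):
    """Build one FASTA entry with the sequence wrapped at chunk_size."""
    lines = [seq_string[i:i+chunk_size]
             for i in range(0, len(seq_string), chunk_size)]
    return ">" + seq_id + "\n" + "\n".join(lines)


def write_aligned_fasta(entries, output_path):
    """Write (identifier, aligned sequence) pairs to a multi-FASTA file."""
    with open(output_path, "w") as handle:
        for seq_id, seq_string in entries:
            handle.write(make_fasta_entry(seq_id, seq_string) + "\n")


# Made-up aligned pair for illustration; not taken from any results file.
toy_entries = [("toy_query", "MKTAYIAK--QRQISFVK"),
               ("toy_hit", "MK----AKEEQRQLSFVK")]
out_path = os.path.join(tempfile.mkdtemp(), "toy_aligned.fa")
write_aligned_fasta(toy_entries, out_path)
with open(out_path) as handle:
    written_text = handle.read()
print(written_text)
```

In this notebook, the entries would instead be built inside the loop from `hsp.query.id.split()[0]` with `str(hsp.query.seq)` and from `hsp.hit.description.split()[0]` with `str(hsp.hit.seq)`, and `out_path` would be a file name of your choosing.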

----

Continue on with the next notebook in the series, [???](???.ipynb). That notebook builds on the groundwork here and in previous notebooks in this series to demonstrate ???.

----