# Using Python to examine HH-suite3 results

Run this in sessions launched from [my HH-suite3-binder repo](https://github.com/fomightez/hhsuite3-binder) because the software is already installed.   
This follows from the notebook in this series entitled [Practical use of HH-suite3 on the command line in Jupyter via MyBinder.org: Basics](notebooks/basics.ipynb).

Here, I will demonstrate going from the text-based HH-suite3 results output files, which end in `.hhr`, to useful derived forms of the information contained in them using Python.

First, I'll demonstrate using the script I developed, `hhsuite3_results_to_df.py`, which converts an HH-suite3 results file to a Pandas dataframe. The dataframe can then easily be used further with Python's Pandas module.

Second, I'll demonstrate the use of biopython to parse HH-suite3 results output files.


-----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them an <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----



## Preparation

Most of the set-up is handled. The popular Python modules Pandas and Biopython should already be installed if you are running this from Binder sessions launched from my [hhsuite3-binder](https://github.com/fomightez/hhsuite3-binder). 

#### Some specific steps that bring in specific scripts and input files will be done in this section.





Get the `hhsuite3_results_to_df.py` script that will be used to mine the results files.

In [1]:
import os
file_needed = "hhsuite3_results_to_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/hhsuite3-utilities/hhsuite3_results_to_df.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 28448  100 28448    0     0  65397      0 --:--:-- --:--:-- --:--:-- 65397
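
For use outside Jupyter, where the `!curl` shell magic is not available, the same "download only if missing" step can be sketched in plain Python. This is a minimal sketch and not part of the original workflow; the helper name `fetch_if_missing` is hypothetical, and it relies only on the standard library's `urllib.request.urlretrieve`.

```python
import os
import urllib.request


def fetch_if_missing(url, dest=None):
    """Download `url` to `dest` unless a file by that name already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper for illustration; not part of HH-suite3 or the script.)
    """
    if dest is None:
        # Default to the last path component of the URL, as `curl -O` does.
        dest = url.rsplit("/", 1)[-1]
    if os.path.isfile(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True
```

Calling `fetch_if_missing("https://raw.githubusercontent.com/fomightez/sequencework/master/hhsuite3-utilities/hhsuite3_results_to_df.py")` should then leave `hhsuite3_results_to_df.py` in the working directory, just as the cell above does.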


Get some example files with HH-suite3 query results.  
For ease, I'll just use some of the results files that the Biopython package uses to test that its parsing code is working well. They are publicly available [here](https://github.com/biopython/biopython/tree/master/Tests/HHsuite); the `README` file [there](https://github.com/biopython/biopython/tree/master/Tests/HHsuite) summarizes the content of those files.

If you prefer to use your own result files, upload your file to this session and change the appropriate file names in later steps.  
(Note: a couple of the `.hhr` files had `Done!` close to the last line of the file. I hadn't seen that in the `.hhr` files from HH-suite3 that I developed with, and it caused errors for my `hhsuite3_results_to_df.py`. To make the files more consistent, I'm removing it, assuming it comes from a different format version of HH-suite, since Biopython, the source of these test files, mentions a specific version in its current documentation.)

In [2]:
import os
files_needed = ["hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr",
                "2uvo_hhblits.hhr",
                "2uvo_hhsearch.hhr",
                "hhpred_9590198.hhr"]
url_prefix = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/HHsuite/"
for file_needed in files_needed:
    if not os.path.isfile(file_needed):
        !curl -OL {url_prefix+file_needed}
!sed -i 's/Done!//g' 2uvo_hhblits.hhr
!sed -i 's/Done!//g' 2uvo_hhsearch.hhr

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17947  100 17947    0     0  81208      0 --:--:-- --:--:-- --:--:-- 80842
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38142  100 38142    0     0   220k      0 --:--:-- --:--:-- --:--:--  219k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39235  100 39235    0     0   126k      0 --:--:-- --:--:-- --:--:--  126k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40978  100 40978    0     0   144k      0 --:--:-- --:--:-- --:--:--  144k
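
The `sed` clean-up used in the cell above can also be done in Python, which may be handy on systems where `sed -i` behaves differently or is not installed. The following is a rough equivalent under the assumption that the files are plain text; the function name `remove_done_marker` is made up for this illustration.

```python
from pathlib import Path


def remove_done_marker(path, marker="Done!"):
    """Delete every occurrence of `marker` from the text file at `path`, in place.

    Returns the number of occurrences removed, so 0 means the file was untouched.
    (Hypothetical helper mirroring `sed -i 's/Done!//g' FILE`.)
    """
    p = Path(path)
    text = p.read_text()
    count = text.count(marker)
    if count:
        p.write_text(text.replace(marker, ""))
    return count
```

For the two files cleaned above, that would be `remove_done_marker("2uvo_hhblits.hhr")` and `remove_done_marker("2uvo_hhsearch.hhr")`. Like the `sed` command, this leaves an empty line where the marker used to be.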


## Use `hhsuite3_results_to_df` to process a results file on the command line

We have the script, `hhsuite3_results_to_df.py`, and a text results file for it to process. To process an example file, run the next command, where we use Python to run the script and direct it at the results file, `2uvo_hhsearch.hhr`, that we just retrieved a few cells ago.

In [3]:
%run hhsuite3_results_to_df.py 2uvo_hhsearch.hhr

Provided results read and converted to a dataframe...

A dataframe of the data has been saved as a file
in a manner where other Python programs can access it (pickled form).
RESULTING DATAFRAME is stored as ==> 'hhs3_results_pickled_df.pkl'

As of writing this, the script we are using outputs a file that is a binary, compact form of the dataframe. (That means it is tiny and not human readable. It is called 'pickled'. Saving in that form may seem odd, but as illustrated [here](#Output-to-more-universal,-table-like-formats) below, this is a very malleable form. Even more pertinent for dealing with data in Jupyter notebooks, there is actually an easier way to interact with this script in a Jupyter notebook that skips saving this intermediate file. So hang on through the long, more traditional way of doing this before the easier way is introduced. I saved it in the compact form and not the more typical tab-delimited form because we mostly won't go this route and might as well make tiny files while working toward a better route. It is easy to convert back and forth using the pickled form, assuming you can match the Pandas/Python versions.)
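
To make the "convert back and forth" point concrete, here is a small sketch of a round trip from a dataframe to a pickle file, back into memory, and out to a tab-delimited file. The tiny `demo` dataframe is a stand-in built for this illustration from a few values in the `2uvo_hhsearch.hhr` results, and the file names are made up; only standard Pandas calls (`to_pickle`, `read_pickle`, `to_csv`, `read_csv`) are used.

```python
import os
import tempfile

import pandas as pd

# Stand-in dataframe for illustration; not produced by the script itself.
demo = pd.DataFrame({
    "hit_num": [1, 2, 3],
    "hid": ["2uvo_A", "2wga", "2uvo_A"],
    "Probab": [100.0, 99.95, 99.84],
})

# Work in a scratch directory so nothing in the session is overwritten.
workdir = tempfile.mkdtemp()
pkl_path = os.path.join(workdir, "demo_df.pkl")
tsv_path = os.path.join(workdir, "demo_df.tsv")

demo.to_pickle(pkl_path)                           # compact binary form
restored = pd.read_pickle(pkl_path)                # back into active memory
restored.to_csv(tsv_path, sep="\t", index=False)   # human-readable, tab-delimited
from_tsv = pd.read_csv(tsv_path, sep="\t")         # and read the text form back
```

The pickled form preserves the dataframe exactly but is tied to compatible Pandas/Python versions, whereas the tab-delimited form can be opened by almost anything at the cost of Pandas having to re-infer the column types on reading.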

We can take that file where the dataframe is pickled and bring it into active memory in this notebook with another command from the Pandas library. First, we have to import the Pandas library.
Run the next command to bring the dataframe into active memory. Note that the file name is the one reported when we ran the script in the cell above.

In [4]:
import pandas as pd
df = pd.read_pickle("hhs3_results_pickled_df.pkl")

When that last cell runs, you won't see any output, but something happened. We can look at that dataframe by calling it in a cell.

In [5]:
df

Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,249.39,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,204.56,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,163.39,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...
3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,157.96,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,154.69,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...
5,6,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.78,1.7e-23,152.97,...,114.3,7,4889999999999999999999999999999999999999999999...,rcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKCG...,.||.++.+.+|.++.|||.|+|||.+++||+.+||.++|+.+.+|+...
6,7,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1uha_A,"1uha_A Lectin-D2; chitin-binding domain, sugar...",82,99.54,1.1e-18,113.58,...,73.1,89,589999999999999999999999999999999999654,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PECGERASGKRCPNGKCCSQWGYCGTTDNYCGQGCQSQ-C-DYWRC...,~~cg~~~~~~~C~~g~CCs~~G~Cg~~~~~c~~~c~~~-~-~~g~C...,CCCCCCCCCCcCCCCCccCCCccccCccccccCCcccc-c-ccccc...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-T-TTTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||+++++.+|.+++|||+|+||+.+.+||+++||..+
7,8,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1en2_A,"1en2_A UDA, agglutinin isolectin I/agglutinin ...",89,99.54,1.2e-18,115.94,...,74.1,82,79999999999999999999999999999999999769985,rcgsqaggatctnnqccsqygycgfgaeycgagcqggpcra----d...,RCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENK-CWSGERSD...,~cg~~~~~~~C~~~~CCS~~G~CG~~~~~C~~~Cq~~-c~~~~~~~...,CCCCCCCCccCCCCcccCCCceecccccccCCCCcCC-CcccccCC...,BCTTTTTSCCCGGGCEEETTSBEESSHHHHSTTEEES-CGGGCCTT...,RCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRA----D...,|||++.++.+|.++.|||+++|||.+.+||+++||..||++
8,9,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1en2_A,"1en2_A UDA, agglutinin isolectin I/agglutinin ...",89,99.41,5.1e-17,108.07,...,74.0,82,689999999999999999999999999999999999868,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwt----...,ERCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENK-CWSGERS...,~~cg~~~~~~~C~~~~CCS~~G~CG~~~~~C~~~Cq~~-c~~~~~~...,CCCCCCCCCccCCCCcccCCCceecccccccCCCCcCC-CcccccC...,CBCTTTTTSCCCGGGCEEETTSBEESSHHHHSTTEEES-CGGGCCT...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWT----...,+|||.+..+..||+++|||+++|||.+-+||+++||..+
9,10,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1uha_A,"1uha_A Lectin-D2; chitin-binding domain, sugar...",82,99.38,1e-16,104.25,...,72.3,89,5889999999999999999999999999999999986,kcgsqaggklcpnnlccsqwgfcglgsefcgggcqsgacstdkpcg...,ECGERASGKRCPNGKCCSQWGYCGTTDNYCGQGCQSQ--CDYWRCG...,~cg~~~~~~~C~~g~CCs~~G~Cg~~~~~c~~~c~~~--~~~g~Cg...,CCCCCCCCCcCCCCCccCCCccccCccccccCCcccc--ccccccc...,CSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC--TTTTBCB...,KCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGGGCQSGACSTDKPCG...,.||++++++.||+++|||+|||||.+.+||+.+||+.


You'll notice when the list of data is large, that the Jupyter environment represents just the head and tail to make it more reasonable. There are ways you can have Jupyter display it all which we won't go into here. 

Instead, we'll start to show some methods of dataframes that make them convenient. For example, you can use the `head` method to see the top portion of the dataframe.

In [6]:
df.head()

Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,249.39,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,204.56,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,163.39,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...
3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,157.96,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,154.69,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...
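
Beyond `head`, a few other everyday dataframe operations are selecting columns, filtering rows, and sorting. The sketch below uses a small stand-in dataframe, `demo`, built from a handful of columns and the five rows shown above, so that it runs on its own; with the real data you would apply the same expressions to `df`. The probability cutoff of 99.83 is arbitrary and chosen only for illustration.

```python
import pandas as pd

# Stand-in for `df`: a few columns and rows copied from the output above.
demo = pd.DataFrame({
    "hit_num": [1, 2, 3, 4, 5],
    "hid": ["2uvo_A", "2wga", "2uvo_A", "1ulk_A", "1ulk_A"],
    "hit_length": [171, 164, 171, 126, 126],
    "Probab": [100.0, 99.95, 99.84, 99.84, 99.82],
    "E-value": [4.6e-42, 2.8e-33, 1.1e-25, 1.5e-25, 6.5e-25],
})

# Keep only a few columns of interest.
subset = demo[["hit_num", "hid", "Probab", "E-value"]]

# Boolean filtering: hits whose probability exceeds an arbitrary cutoff.
confident = subset[subset["Probab"] > 99.83]

# Sort by hit length (shortest first), breaking ties by E-value.
by_length = demo.sort_values(["hit_length", "E-value"])
```

Because the `E-value` column name contains a hyphen, it has to be accessed with bracket notation (`demo["E-value"]`) rather than attribute notation.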


Note that this script keeps each hit as a separate entry in the resulting dataframe. The Biopython HH-suite parser, also demonstrated below in this Jupyter notebook, makes a text table that combines the 'high-scoring segment pairs' or 'high-scoring pairs' (HSPs) that correspond to the same accession identifier. You could always use Pandas' `groupby` in a downstream step to achieve a similar accounting with `hhsuite3_results_to_df.py`. Here is an example where the 32 rows collapse to sixteen sets of two if you group by the hit identifier. For example, if you scroll up you'll notice hit_nums #8 & #9 are associated with the same identifier, which is 1en2_A. Those become the first group when grouping this way:

In [7]:
grouped = df.groupby('hid')
for h_id, grouped_df in grouped:
    print(h_id)
    display(grouped_df)

1en2_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
7,8,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1en2_A,"1en2_A UDA, agglutinin isolectin I/agglutinin ...",89,99.54,1.2e-18,115.94,...,74.1,82,79999999999999999999999999999999999769985,rcgsqaggatctnnqccsqygycgfgaeycgagcqggpcra----d...,RCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENK-CWSGERSD...,~cg~~~~~~~C~~~~CCS~~G~CG~~~~~C~~~Cq~~-c~~~~~~~...,CCCCCCCCccCCCCcccCCCceecccccccCCCCcCC-CcccccCC...,BCTTTTTSCCCGGGCEEETTSBEESSHHHHSTTEEES-CGGGCCTT...,RCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRA----D...,|||++.++.+|.++.|||+++|||.+.+||+++||..||++
8,9,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1en2_A,"1en2_A UDA, agglutinin isolectin I/agglutinin ...",89,99.41,5.1e-17,108.07,...,74.0,82,689999999999999999999999999999999999868,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwt----...,ERCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENK-CWSGERS...,~~cg~~~~~~~C~~~~CCS~~G~CG~~~~~C~~~Cq~~-c~~~~~~...,CCCCCCCCCccCCCCcccCCCceecccccccCCCCcCC-CcccccC...,CBCTTTTTSCCCGGGCEEETTSBEESSHHHHSTTEEES-CGGGCCT...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWT----...,+|||.+..+..||+++|||+++|||.+-+||+++||..+


1mmc_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
15,16,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1mmc_A,"1mmc_A AC-AMP2, antimicrobial peptide 2; antif...",30,97.83,2.8e-08,58.62,...,25.7,141,788889999999999999999999996,qaggklcpnnlccsqwgfcglgsefcg,ECVRGRCPSGMCCSQFGYCGKGPKYCG,~~~~~~C~~~~CCS~~G~CG~t~~~C~,cCccCCCCCCCcccccceeCCchHhhC,CCSSSCCSTTCEECTTSCEESSHHHHC,QAGGKLCPNNLCCSQWGFCGLGSEFCG,||+...||+++|||+|||||.+.+||+
22,23,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1mmc_A,"1mmc_A AC-AMP2, antimicrobial peptide 2; antif...",30,97.14,1.9e-06,50.76,...,24.5,141,4456689999999999999999999974,qgsnmecpnnlccsqygycgmggdycgk,ECVRGRCPSGMCCSQFGYCGKGPKYCGR,~~~~~~C~~~~CCS~~G~CG~t~~~C~~,cCccCCCCCCCcccccceeCCchHhhCc,CCSSSCCSTTCEECTTSCEESSHHHHCC,QGSNMECPNNLCCSQYGYCGMGGDYCGK,|..+..||+++|||+|+|||.+.+||++


1p9g_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
14,15,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1p9g_A,"1p9g_A EAFP 2; antifungal peptide, atomic reso...",41,97.88,2e-08,62.0,...,33.7,130,6899,ercgeqgsnmecpnnlccsqygycgmggdycgkg-cqng,ETCA-SRCPRPCNAGLCCSIYGYCGSGAAYCGAGNCRCQ,~~CG-~~~~~~C~~~~CCS~~G~CG~t~~~C~~~~Cq~~,CCcC-CcCCcccCCCCeECccceeCCCccccCCCccccC,CCGG-GGTTCCSCTTCEEETTSCEECSHHHHSTTTEEEC,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKG-CQNG,+|||
16,17,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1p9g_A,"1p9g_A EAFP 2; antifungal peptide, atomic reso...",41,97.83,2.9e-08,61.32,...,32.1,130,477,kcgsqaggklcpnnlccsqwgfcglgsefcggg-cqsg,TCA-SRCPRPCNAGLCCSIYGYCGSGAAYCGAGNCRCQ,~CG-~~~~~~C~~~~CCS~~G~CG~t~~~C~~~~Cq~~,CcC-CcCCcccCCCCeECccceeCCCccccCCCccccC,CGG-GGTTCCSCTTCEEETTSCEECSHHHHSTTTEEEC,KCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGGG-CQSG,.||


1uha_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
6,7,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1uha_A,"1uha_A Lectin-D2; chitin-binding domain, sugar...",82,99.54,1.1e-18,113.58,...,73.1,89,589999999999999999999999999999999999654,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PECGERASGKRCPNGKCCSQWGYCGTTDNYCGQGCQSQ-C-DYWRC...,~~cg~~~~~~~C~~g~CCs~~G~Cg~~~~~c~~~c~~~-~-~~g~C...,CCCCCCCCCCcCCCCCccCCCccccCccccccCCcccc-c-ccccc...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-T-TTTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||+++++.+|.+++|||+|+||+.+.+||+++||..+
9,10,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1uha_A,"1uha_A Lectin-D2; chitin-binding domain, sugar...",82,99.38,1e-16,104.25,...,72.3,89,5889999999999999999999999999999999986,kcgsqaggklcpnnlccsqwgfcglgsefcgggcqsgacstdkpcg...,ECGERASGKRCPNGKCCSQWGYCGTTDNYCGQGCQSQ--CDYWRCG...,~cg~~~~~~~C~~g~CCs~~G~Cg~~~~~c~~~c~~~--~~~g~Cg...,CCCCCCCCCcCCCCCccCCCccccCccccccCCcccc--ccccccc...,CSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC--TTTTBCB...,KCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGGGCQSGACSTDKPCG...,.||++++++.||+++|||+|||||.+.+||+.+||+.


1ulk_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,157.96,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,154.69,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...


1wga


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
26,27,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1wga,1wga ; lectin (agglutinin); NMR {},164,92.12,0.024,45.43,...,136.2,7,4568888999999999999999999999999999998875487544,nmecpnnlccsqygycgmggdycgkgcqngacwtskrcgsqaggat...,XXXCXXXXCCXXXXXCXXXXXXCXXXCXXXXCXXXXXCXXX--XXX...,~~~c~~~~cc~~~~~c~~~~~~c~~~c~~~~c~~~~~c~~~--~~~...,ccccccccccccccccccccccccccccccccccccccccc--ccc...,,NMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCGSQAGGAT...,...|....||.....|......|...|....|.....|...|...|
31,32,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1wga,1wga ; lectin (agglutinin); NMR {},164,72.0,1.3,35.47,...,98.0,7,456788899999999999999999999999999998887548754,nmecpnnlccsqygycgmggdycgkgcqngacwtskrcgsqaggat...,XXXCXXXXCCXXXXXCXXXXXXCXXXCXXXXCXXXXXCXXX--XXX...,~~~c~~~~cc~~~~~c~~~~~~c~~~c~~~~c~~~~~c~~~--~~~...,ccccccccccccccccccccccccccccccccccccccccc--ccc...,,NMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCGSQAGGAT...,...|....||.....|......|...|....|.....|...|...


1wkx_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
11,12,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1wkx_A,"1wkx_A Hevein isoform 2; allergen, lectin, agg...",43,98.12,3.1e-09,64.73,...,34.0,128,47888888999999999999999999999996,kcgsqaggklcpnnlccsqwgfcglgsefcgg--gcqsg,QCGRQAGGKLCPDNLCCSQWGWCGSTDEYCSPDHNCQSN,~CG~~~~~~~C~~~~CCS~~G~CG~t~~~C~~~~~Cq~~,CCCCcCCCcccCCCCeEeecCcccCCcccccCCCCccCC,CSBGGGTTBCCSTTCEECTTSCEESSHHHHCGGGTCCBS,KCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGG--GCQSG,.||+++++..||+++|||+|||||...+||+.
17,18,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1wkx_A,"1wkx_A Hevein isoform 2; allergen, lectin, agg...",43,97.76,4.7e-08,59.52,...,34.7,128,589999888999999999999999999999997,ercgeqgsnmecpnnlccsqygycgmggdycgk--gcqng,EQCGRQAGGKLCPDNLCCSQWGWCGSTDEYCSPDHNCQSN,~~CG~~~~~~~C~~~~CCS~~G~CG~t~~~C~~~~~Cq~~,CCCCCcCCCcccCCCCeEeecCcccCCcccccCCCCccCC,CCSBGGGTTBCCSTTCEECTTSCEESSHHHHCGGGTCCBS,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGK--GCQNG,+|||.+..+..||+++|||+|+|||+..|||+.


2dkv_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
21,22,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2dkv_A,"2dkv_A Chitinase; whole structure, oryza sativ...",309,97.24,1e-06,69.81,...,35.1,138,4888889999999999999999999999999999987,kcgsqaggklcpnnlccsqwgfcglgsefcgggcqsg,QCGAQAGGARCPNCLCCSRWGWCGTTSDFCGDGCQSQ,~cG~~~~~~~c~~~~ccs~~g~cg~~~~~C~~~cq~~,CCCCCCCCCcCCCCCeeCcCCcccCCccccCccccCC,BCSTTTTTCCCGGGCEECTTSBEESSHHHHSTTCCBC,KCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGGGCQSG,.||++.++..||.++|||+|||||...+||+.|||+.
24,25,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2dkv_A,"2dkv_A Chitinase; whole structure, oryza sativ...",309,96.79,1e-05,64.37,...,35.8,138,47999999999999999999999999999999999965,krcgsqaggatctnnqccsqygycgfgaeycgagcqgg,EQCGAQAGGARCPNCLCCSRWGWCGTTSDFCGDGCQSQ,~~cG~~~~~~~c~~~~ccs~~g~cg~~~~~C~~~cq~~,CCCCCCCCCCcCCCCCeeCcCCcccCCccccCccccCC,CBCSTTTTTCCCGGGCEECTTSBEESSHHHHSTTCCBC,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGG,.+||++.++.+|..+.|||++||||-..+||+.+||..


2kus_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
20,21,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2kus_A,2kus_A SM-AMP-1.1A; plant antimicrobial peptid...,35,97.56,1.7e-07,55.43,...,29.5,136,46788998887,radikcgsqaggklcpnnlccsqwgfcglgsefcg,GPNGQCGPGWG--GCRGGLCCSQYGYCGSGPKYCA,~~~~~CG~~~g--~C~~g~CCS~~G~CG~~~~~C~,CCCcccCCCCC--cCCCCcEECCCceecCChhhhC,CTTCBCBTTTB--CCCTTCEECTTSBEECSHHHHC,RADIKCGSQAGGKLCPNNLCCSQWGFCGLGSEFCG,+.|..||.+++
25,26,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2kus_A,2kus_A SM-AMP-1.1A; plant antimicrobial peptid...,35,96.68,1.6e-05,46.9,...,28.1,136,457888887,skrcgsqaggatctnnqccsqygycgfgaeycga,NGQCGPGWG--GCRGGLCCSQYGYCGSGPKYCAH,~~~CG~~~g--~C~~g~CCS~~G~CG~~~~~C~~,CcccCCCCC--cCCCCcEECCCceecCChhhhCc,TCBCBTTTB--CCCTTCEECTTSBEECSHHHHCC,SKRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGA,..+||.+++


2lb7_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
13,14,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2lb7_A,"2lb7_A WAMP-1A, antimicrobial peptide 1A; anti...",44,97.97,1e-08,61.74,...,33.9,127,578888889999999999999999999999987,kcgsqaggklcpnnlccsqwgfcglgsefcggg-cqsg,RCGDQARGAKCPNCLCCGKYGFCGSGDAYCGAGSCQSQ,~CG~~~~~~~C~~~~CCS~~G~CG~t~~~C~~~~Cq~~,CCcCCCCCcccCCCCcCCcceeecCCccccCCCCccCC,ECBGGGTTBCCCTTCEEETTTEEECSHHHHSTTSEEEC,KCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGGG-CQSG,.||++++++.||+++|||+|||||...+||+.+
18,19,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2lb7_A,"2lb7_A WAMP-1A, antimicrobial peptide 1A; anti...",44,97.58,1.5e-07,56.59,...,34.3,127,4788888888999999999999999999999987,ercgeqgsnmecpnnlccsqygycgmggdycgkg-cqng,QRCGDQARGAKCPNCLCCGKYGFCGSGDAYCGAGSCQSQ,~~CG~~~~~~~C~~~~CCS~~G~CG~t~~~C~~~~Cq~~,CCCcCCCCCcccCCCCcCCcceeecCCccccCCCCccCC,EECBGGGTTBCCCTTCEEETTTEEECSHHHHSTTSEEEC,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKG-CQNG,++||.+..+..||+++|||+|+|||+..|||+.+


2n1s_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
19,20,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2n1s_A,"2n1s_A AMP-2; antimicrobial peptide, ICK, cyst...",30,97.58,1.6e-07,55.11,...,23.2,141,34568999999999999999999985,aggklcpnnlccsqwgfcglgsefcg,CYRGRCSGGLCCSKYGYCGSGPAYCG,cg~~~C~~~~CCs~~G~CGtt~~~C~,cCCCCCCCCCccccccccCcchhhcC,CBTTBCSTTCEECTTSBEECSHHHHC,AGGKLCPNNLCCSQWGFCGLGSEFCG,.|...||+++|||+|||||.+.+||+
23,24,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2n1s_A,"2n1s_A AMP-2; antimicrobial peptide, ICK, cyst...",30,97.02,3.5e-06,49.35,...,23.7,141,566,rcgeqgsnmecpnnlccsqygycgmggdycg,QCY----RGRCSGGLCCSKYGYCGSGPAYCG,~cg----~~~C~~~~CCs~~G~CGtt~~~C~,hcC----CCCCCCCCccccccccCcchhhcC,BCB----TTBCSTTCEECTTSBEECSHHHHC,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCG,+||


2uvo_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,249.39,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,163.39,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...


2wga


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,204.56,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
5,6,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.78,1.7e-23,152.97,...,114.3,7,4889999999999999999999999999999999999999999999...,rcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKCG...,.||.++.+.+|.++.|||.|+|||.+++||+.+||.++|+.+.+|+...


4mpi_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
10,11,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,4mpi_A,"4mpi_A Class I chitinase; hevein-like domain, ...",45,98.2,1.6e-09,66.41,...,36.7,126,456888888899999999999999999999999999985,dikcgsqaggklcpnnlccsqwgfcglgsefcgggcqsgacs,MEQCGRQAGGALCPGGLCCSQYGWCANTPEYCGSGCQSQ-CD,~~~CG~~~~~~~C~~~~CCs~~G~CG~t~~~C~~gCq~~-c~,ccccCCcCCCcccCCCCcCcccceecCCccccccccccc-CC,CCBCBGGGTTBCCGGGCEECTTSBEECSHHHHSTTCCBC-TT,DIKCGSQAGGKLCPNNLCCSQWGFCGLGSEFCGGGCQSGACS,...||.+.+...||+++|||+|||||...+||+.+||+.
12,13,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,4mpi_A,"4mpi_A Class I chitinase; hevein-like domain, ...",45,98.0,8e-09,63.24,...,36.5,126,6888888899999999999999999999999999975,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwts,QCGRQAGGALCPGGLCCSQYGWCANTPEYCGSGCQSQ-CDGG,~CG~~~~~~~C~~~~CCs~~G~CG~t~~~C~~gCq~~-c~~~,ccCCcCCCcccCCCCcCcccceecCCccccccccccc-CCCC,BCBGGGTTBCCGGGCEECTTSBEECSHHHHSTTCCBC-TTCC,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTS,+||.+..+..||+++|||+|+|||+..|||+++||..


4z8i_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
27,28,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,4z8i_A,"4z8i_A BBTPGRP3, peptidoglycan recognition pro...",236,88.54,0.1,39.32,...,27.7,65,5566677665,radikcgsqa-----ggklcpn---nlccsqwgfcglgsefcg,RSDGRCGPNYPAPDANPGECNPHAVDHCCSEWGWCGRETSHCT,r~d~rCg~~~~~~~~~~~~C~~~~~~~CCs~~~~cg~~~~~c~,CCCCCCCCCCCCCCCCCcccCCCCCCCCCCCCCEEeCCccccc,CSSSBCBSSSCBTTBSSBBCCTTSSCCEECTTSBEECSHHHHH,RADIKCGSQA-----GGKLCPN---NLCCSQWGFCGLGSEFCG,|.|-+||...
28,29,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,4z8i_A,"4z8i_A BBTPGRP3, peptidoglycan recognition pro...",236,88.13,0.12,39.05,...,26.8,65,34666665,skrcgsqa-----ggatctn---nqccsqygycgfgaeycg,DGRCGPNYPAPDANPGECNPHAVDHCCSEWGWCGRETSHCT,d~rCg~~~~~~~~~~~~C~~~~~~~CCs~~~~cg~~~~~c~,CCCCCCCCCCCCCCCcccCCCCCCCCCCCCCEEeCCccccc,SSBCBSSSCBTTBSSBBCCTTSSCCEECTTSBEECSHHHHH,SKRCGSQA-----GGATCTN---NQCCSQYGYCGFGAEYCG,..|||...


4zxm_A


Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
29,30,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,4zxm_A,4zxm_A PGRP domain of peptidoglycan recognitio...,256,85.66,0.21,38.06,...,0.0,85,3455555554,radikcgsqa-----ggklcpn---nlccsqwgfcglgsefcg,RSDGRCGPNYPAPDANPGECNPHAVDHCCSEWGWCGRETSHCT,~~d~~Cg~~~~~~~~~~~~C~~~~~~~CCs~~~~Cg~~~~~C~,CCCCCCCCCCCCCCCCCcccCCCCCCCCCCCCCeEeCCCCCcC,-------------------------------------------,RADIKCGSQA-----GGKLCPN---NLCCSQWGFCGLGSEFCG,|.|-+||...
30,31,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,4zxm_A,4zxm_A PGRP domain of peptidoglycan recognitio...,256,85.31,0.23,37.89,...,0.0,85,4566655,krcgsqa-----ggatctn---nqccsqygycgfgaeycg-agcq,GRCGPNYPAPDANPGECNPHAVDHCCSEWGWCGRETSHCTCSSCV,~~Cg~~~~~~~~~~~~C~~~~~~~CCs~~~~Cg~~~~~C~~~~c~,CCCCCCCCCCCCCCcccCCCCCCCCCCCCCeEeCCCCCcCCcccc,---------------------------------------------,KRCGSQA-----GGATCTN---NQCCSQYGYCGFGAEYCG-AGCQ,.|||...


Same data as earlier, but grouped by hit identifier (`hid`), so we can clearly see which rows belong to the same hit and immediately get a sense of which hits have more than one aligned region.
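
As a minimal sketch of the kind of grouping that yields per-hit subsets like those displayed above, the following uses a small stand-in dataframe. The column names `hit_num`, `hid`, and `Probab` and the five values are copied from the first rows shown in this notebook; the names `toy_df` and `rows_per_hit` are made up for illustration, and this is not necessarily the exact code used earlier.

```python
import pandas as pd

# Stand-in for the real results dataframe: the first five hits shown in this notebook
toy_df = pd.DataFrame({
    "hit_num": [1, 2, 3, 4, 5],
    "hid": ["2uvo_A", "2wga", "2uvo_A", "1ulk_A", "1ulk_A"],
    "Probab": [100.0, 99.95, 99.84, 99.84, 99.82],
})

# Count how many alignment rows each hit identifier has
rows_per_hit = toy_df.groupby("hid").size()
print(rows_per_hit)

# Walk through the groups one at a time, similar to the per-hit display above
for hit_id, group in toy_df.groupby("hid"):
    print(hit_id)
    print(group)
```

With the real dataframe you would substitute `df` for `toy_df`.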

You may want to get a sense of what else you can do by examining the first two notebooks that come up when you launch a session from my [blast-binder](https://github.com/fomightez/blast-binder) site. Those first two notebooks cover some of the ways to use a dataframe containing BLAST results.

Shortly, we'll cover how to bring the dataframe we just made into the notebook without dealing with a file intermediate; however, next I'll demonstrate how to save it as text for use elsewhere, such as in Excel.

## Output to more universal, table-like formats

I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, most everyone is accustomed to dealing with text-based tables or even Excel. In fact, a text-based table, perhaps tab- or comma-delimited, would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. You can even go back later from the table to the dataframe, which may be important if you are moving between different versions of Python/Pandas, as I briefly mentioned parenthetically above.

**First, generating a text-based table.**

In [8]:
#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file
df.to_csv('hhsuite_results.tsv', sep='\t',index = False) #add `,header=False` to leave off header, too

Because `df.to_csv()` defaults to dealing with csv, you can simply use `df.to_csv('example.csv',index = False)` for comma-delimited (comma-separated) files.

You can see that worked by looking at the first few lines with the next command. (Feel free to make the number higher or delete the number altogether. I restricted it to the first few lines to keep the output small.)

In [9]:
!head -5 hhsuite_results.tsv

hit_num	qid	qtitle	query_length	hid	htitle	hit_length	Probab	E-value	Score	Aligned_cols	Identities	Similarity	Sum_probs	size_diff	aln_confidence	qconsensus	hseq	hconsensus	hss_pred	hss_dssp	qseq	pw_conservation
1	2UVO:A|PDBID|CHAIN|SEQUENCE	2UVO:A|PDBID|CHAIN|SEQUENCE	171	2uvo_A	2uvo_A Agglutinin isolectin 1; carbohydrate-binding protein, hevein domain, chitin-binding, GERM agglutinin, chitin-binding protein; HET: NDG NAG GOL; 1.40A {Triticum aestivum} PDB: 1wgc_A* 2cwg_A* 2x3t_A* 4aml_A* 7wga_A 9wga_A 2wgc_A 1wgt_A 1k7t_A* 1k7v_A* 1k7u_A 2x52_A* 1t0w_A*	171	100.0	4.6e-42	249.39	171	100%	2.05	166.9	0	799999999999999999999999999999999999999999999999999999999999999999999999999999999999999999998899999999999999999999999999999999999999999999999999999999999999999999999999986	ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikcgsqaggklcpnnlccsqwgfcglgsefcgggcqsgacstdkpcgkdaggrvctnnyccskwgscgigpgycgagcqsggcdg	ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG

If you need to go back from a tab-separated table to a dataframe, you can run something like the following cell.

In [10]:
reverted_df = pd.read_csv('hhsuite_results.tsv', sep='\t')
reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too

For a comma-delimited (CSV) file you'd use `df = pd.read_csv('example.csv')` because the `pd.read_csv()` function defaults to a comma as the separator (`sep` parameter).
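
To make the write-then-read round trip concrete, here is a minimal sketch that runs on its own. It uses a tiny stand-in dataframe and an in-memory buffer rather than a file on disk; the names `toy_df`, `buffer`, and `round_tripped` are made up for illustration, while the column names and values are taken from the first two hits shown in this notebook.

```python
import io
import pandas as pd

# Small stand-in for the results dataframe
toy_df = pd.DataFrame({
    "hit_num": [1, 2],
    "hid": ["2uvo_A", "2wga"],
    "Probab": [100.0, 99.95],
})

# Write tab-separated text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t", index=False)

# Rewind the buffer and read the table back into a dataframe
buffer.seek(0)
round_tripped = pd.read_csv(buffer, sep="\t")

# Compare the original and the round-tripped dataframes
print(toy_df.equals(round_tripped))
```

Dropping `sep="\t"` from both calls would give the comma-delimited variant.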

You can verify that read from the text-based table by viewing it with the next line.

In [11]:
reverted_df.head()

Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,249.39,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,204.56,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,163.39,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...
3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,157.96,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,154.69,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...


**Generating an Excel spreadsheet from a dataframe.**

Because this is a specialized need, a special module is required that I didn't bother installing by default, so it needs to be installed before generating the Excel file. Running the next cell will do both.

In [12]:
%pip install openpyxl
# save to excel (KEEPS multiINDEX, and makes sparse to look good in Excel straight out of Python)
df.to_excel('hhsuite_results.xlsx') # after openpyxl installed

Note: you may need to restart the kernel to use updated packages.


You'll need to download the file first to your computer and then view it locally as there is no viewer in the Jupyter environment.

Additionally, it is possible to add styles to dataframes, and styles such as shading of cells and coloring of text will be carried over to the Excel document as well.

Excel files can be read in to Pandas dataframes directly without needing to go to a text based intermediate first.

In [13]:
# read Excel
df_from_excel = pd.read_excel('hhsuite_results.xlsx',engine='openpyxl') # see https://stackoverflow.com/a/65266270/8508004, which notes xlrd no longer supports xlsx

That can be viewed to convince yourself it worked by running the next command.

In [14]:
df_from_excel.head()

Unnamed: 0.1,Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
1,1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
2,2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...
3,3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...
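
Notice that the dataframe read back from Excel has gained an extra `Unnamed: 0` column, which is the index that was saved along with the data. The sketch below reproduces that effect with an in-memory CSV so it runs without `openpyxl`; `to_excel()` and `read_excel()` accept the same `index` and `index_col` parameters. The toy data and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({"hit_num": [1, 2], "hid": ["2uvo_A", "2wga"]})

# Saving WITH the index (the default) adds an unlabeled first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
read_naive = pd.read_csv(with_index)
print(list(read_naive.columns))

# Option 1: tell the reader that the first column is the index
with_index.seek(0)
read_index_col = pd.read_csv(with_index, index_col=0)
print(list(read_index_col.columns))

# Option 2: leave the index out when saving
no_index = io.StringIO()
toy_df.to_csv(no_index, index=False)
no_index.seek(0)
read_no_index = pd.read_csv(no_index)
print(list(read_no_index.columns))
```

For the Excel case, that would correspond to `df.to_excel('hhsuite_results.xlsx', index=False)` when saving, or `pd.read_excel('hhsuite_results.xlsx', engine='openpyxl', index_col=0)` when reading.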


Next, we'll cover how to bring the dataframe we just made into the notebook without dealing with a file intermediate.


## Use the main function `hhsuite3_results_to_df` to process a results file directly in Python/Jupyter

First we'll check for the script we'll use and get it if we don't already have it. 

(The thinking is once you know what you are doing you may have skipped all the steps above and not have the script you'll need yet. It cannot hurt to check and if it isn't present, bring it here.)

In [15]:
# Get a file if not yet retrieved / check if file exists
import os
file_needed = "hhsuite3_results_to_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/hhsuite3-utilities/hhsuite3_results_to_df.py
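
The cell above relies on the IPython `!curl` shell escape. If you ever need the same check-then-fetch step in plain Python, a minimal sketch using only the standard library could look like the following. The helper name `fetch_if_missing` is made up for illustration; the URL in the commented example is the one used in the cell above.

```python
import os
import urllib.request

def fetch_if_missing(url, filename):
    """Download `url` to `filename` unless that file already exists.

    Returns True if a download happened and False if the file was present.
    """
    if os.path.isfile(filename):
        return False
    urllib.request.urlretrieve(url, filename)
    return True

# Example call (commented out so nothing is downloaded by accident):
# fetch_if_missing(
#     "https://raw.githubusercontent.com/fomightez/sequencework/master/"
#     "hhsuite3-utilities/hhsuite3_results_to_df.py",
#     "hhsuite3_results_to_df.py",
# )
```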

This is going to rely on approaches very similar to those illustrated [here](https://github.com/fomightez/patmatch-binder/blob/6f7630b2ee061079a72cd117127328fd1abfa6c7/notebooks/PatMatch%20with%20more%20Python.ipynb#Passing-results-data-into-active-memory-without-a-file-intermediate) and [here](https://github.com/fomightez/patmatch-binder/blob/6f7630b2ee061079a72cd117127328fd1abfa6c7/notebooks/Sending%20PatMatch%20output%20directly%20to%20Python.ipynb##Running-Patmatch-and-passing-the-results-to-Python-without-creating-an-output-file-intermediate).

We obtained the `hhsuite3_results_to_df.py` script in the preparation steps above. However, instead of using it as an external script as we did earlier in this notebook, we want to use the core function of that script within this notebook for the options that involve no file intermediates. Similar to the way we imported a lot of other useful modules in the first notebook and in a cell above, you can run the next cell to bring the main function associated with the `hhsuite3_results_to_df.py` script, aptly named `hhsuite3_results_to_df`, into the memory of this notebook's computational environment. (As written, the command below looks a bit redundant; however, the `from` part of the command references the `hhsuite3_results_to_df.py` script itself, and it doesn't need the `.py` extension because `import` only deals with such files.)

In [16]:
from hhsuite3_results_to_df import hhsuite3_results_to_df

We can demonstrate that worked by calling the function.

In [17]:
hhsuite3_results_to_df()

TypeError: hhsuite3_results_to_df() missing 1 required positional argument: 'results'

If the module was not imported, you'd see `ModuleNotFoundError: No module named 'hhsuite3_results_to_df'`, but instead you should see it saying it is missing `results` to act on because you passed it nothing.
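
Another way to confirm that an imported function is available, without triggering an error, is to inspect its signature. The sketch below uses a stand-in function so that it runs on its own. The stand-in's parameter names mirror the `results`, `return_df`, and `pickle_df` arguments used in this notebook, but the default values given here are assumptions and are not taken from the script.

```python
import inspect

def stand_in_results_to_df(results, return_df=True, pickle_df=True):
    """Hypothetical placeholder for the imported function."""
    return None

# Show the parameters the function accepts
signature = inspect.signature(stand_in_results_to_df)
print(signature)

# Parameters without a default value are the required ones
required = [
    name for name, param in signature.parameters.items()
    if param.default is inspect.Parameter.empty
]
print(required)
```

In the notebook you would pass the real `hhsuite3_results_to_df` to `inspect.signature()` instead of the stand-in.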

After importing the main function of that script into this running notebook, you are ready to demonstrate the approach that doesn't require a file intermediate. The imported `hhsuite3_results_to_df` function is used within the computational environment of the notebook, and the dataframe produced is assigned to a variable in the running notebook. In the end, the results are in an active dataframe in the notebook without needing to read the pickled dataframe. **Bear in mind, though, that the pickled dataframe still gets made, and it is good to download and keep it since you'll find it convenient for getting back into an analysis without rerunning earlier steps.**

In [18]:
direct_df = hhsuite3_results_to_df("2uvo_hhsearch.hhr")
direct_df.head()

Provided results read and converted to a dataframe...

A dataframe of the data has been saved as a file
in a manner where other Python programs can access it (pickled form).
RESULTING DATAFRAME is stored as ==> 'hhs3_results_pickled_df.pkl'

Returning a dataframe with the information as well.

Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,249.39,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,204.56,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,163.39,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...
3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,157.96,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,154.69,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...


This may be how you prefer to use the script. Either option exists.
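
Since the pickled dataframe still gets made, reading it back in a later session is one line with `pd.read_pickle()`. The sketch below shows the round trip with a stand-in dataframe and a temporary file so that it runs anywhere; the file name `toy_pickled_df.pkl` and the variable names are made up for illustration. In this notebook the file reported in the output above is `hhs3_results_pickled_df.pkl`.

```python
import os
import tempfile
import pandas as pd

toy_df = pd.DataFrame({"hit_num": [1, 2], "hid": ["2uvo_A", "2wga"]})

# Write the stand-in dataframe to a pickle file in a temporary directory
tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "toy_pickled_df.pkl")
toy_df.to_pickle(pickle_path)

# Later, possibly in a different session, read it back
restored_df = pd.read_pickle(pickle_path)
print(restored_df.equals(toy_df))
```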



### Demonstrate using it to process multiple files

Scan the search results directory for the individual results files that begin with `2uvo_` and process those.  
This is to give a practical example of how you can use the main function of `hhsuite3_results_to_df.py` to loop over multiple files easily.

In [19]:
import os
import sys
import fnmatch
name_part_to_match_prefix = "2uvo_"
extension_to_match = ".hhr"
results_file_names= []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, f"{name_part_to_match_prefix}*{extension_to_match}"):
        results_file_names.append(file)
print(f"A total of {len(results_file_names)} files identified to process.")

A total of 2 files identified to process.
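
The same file scan can also be written with `pathlib`, which some find easier to read than combining `os.listdir()` with `fnmatch`. This is a sketch of an alternative rather than the approach used above; the helper name `find_results_files` is made up for illustration, and its defaults simply repeat the prefix and extension from the cell above.

```python
from pathlib import Path

def find_results_files(directory=".", prefix="2uvo_", extension=".hhr"):
    """Return sorted names of files in `directory` matching prefix*extension."""
    pattern = f"{prefix}*{extension}"
    return sorted(
        path.name for path in Path(directory).glob(pattern) if path.is_file()
    )

# Scan the current working directory, as the cell above does
results_file_names = find_results_files()
print(f"A total of {len(results_file_names)} files identified to process.")
```

Sorting the names makes the processing order reproducible, which `os.listdir()` does not guarantee.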


From each result file, make a dataframe of the results.  

In [20]:
results_fn_prefix = "2uvo_" # the results file name prefix
df_dict = {}
for file in results_file_names:
    file_id = file[:-4] # `2uvo_hhsearch.hhr` gives `2uvo_hhsearch`
    df_dict[file_id] =  hhsuite3_results_to_df(file, return_df = True, pickle_df=False)
print(len(df_dict))

2


Provided results read and converted to a dataframe...

A dataframe of the data was not stored for use
elsewhere because `no_pickling` was specified.

Returning a dataframe with the information as well.Provided results read and converted to a dataframe...

A dataframe of the data was not stored for use
elsewhere because `no_pickling` was specified.

Returning a dataframe with the information as well.

The dictionary keys are the file names without the `.hhr` part:

In [21]:
list(df_dict.keys())

['2uvo_hhblits', '2uvo_hhsearch']
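
The keys above were produced by slicing off the last four characters with `file[:-4]`, which works because `.hhr` is exactly four characters long. A slightly more general sketch, which does not depend on the extension length, uses `Path.stem` from the standard library. The list `file_names` below just repeats the two file names from this notebook.

```python
from pathlib import Path

file_names = ["2uvo_hhblits.hhr", "2uvo_hhsearch.hhr"]

# `.stem` drops only the final extension, whatever its length
file_ids = [Path(name).stem for name in file_names]
print(file_ids)
```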

To see the dataframe corresponding to '2uvo_hhsearch':

In [22]:
df_dict['2uvo_hhsearch'].head()

Unnamed: 0,hit_num,qid,qtitle,query_length,hid,htitle,hit_length,Probab,E-value,Score,...,Sum_probs,size_diff,aln_confidence,qconsensus,hseq,hconsensus,hss_pred,hss_dssp,qseq,pw_conservation
0,1,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,100.0,4.6e-42,249.39,...,166.9,0,7999999999999999999999999999999999999999999999...,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,||||++.++..||++.|||+|+|||.+.+||+++||.+.|++..+|...
1,2,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2wga,2wga ; lectin (agglutinin); NMR {},164,99.95,2.8000000000000002e-33,204.56,...,153.2,7,5888999999999999999999999999999999999999999999...,rcgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrcg...,GXGCXGXXMYCSTNNCCXXWESCGSGGYXCGEGCNLGACQXGXPCX...,~~g~~~~~~~c~~~~CCs~~g~Cg~~~~~c~~~C~~~~c~~~~~cg...,ccCCCcccccCCCCceECCcceECCCCccccCccccCcccccceec...,CSGGGSSCCCCSTTCEECTTSCEECSTTTTSTTCCSSSCSSCCCSS...,RCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRCG...,.||.+..+..||+++|||.|+|||.+.+||+++||.++|++..+|+...
2,3,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,2uvo_A,2uvo_A Agglutinin isolectin 1; carbohydrate-bi...,171,99.84,1.1e-25,163.39,...,121.1,0,4899999999999999999999999999999999999999999999...,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,~~cg~~~~~~~c~~~~CCS~~g~Cg~~~~~Cg~gC~~~~c~~~~~c...,CCCCCCCCCcCCCCCCeeCCCCeECCCcccccCCcccccccccccc...,CBCBGGGTTBBCGGGCEECTTSBEEBSHHHHSTTCCBSSCSSCCBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,.|||.++.+.+|.++.|||+|+|||.+++||+++||.++|..|..|...
3,4,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.84,1.5000000000000001e-25,157.96,...,110.7,45,699999999999999999999999999999999999766,krcgsqaggatctnnqccsqygycgfgaeycgagcqggpcradikc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,KRCGSQAGGATCTNNQCCSQYGYCGFGAEYCGAGCQGGPCRADIKC...,+|||.|+++.+|.++.|||+++|||.+.+||+++||...
4,5,2UVO:A|PDBID|CHAIN|SEQUENCE,2UVO:A|PDBID|CHAIN|SEQUENCE,171,1ulk_A,"1ulk_A Lectin-C; chitin-binding protein, hevei...",126,99.82,6.5e-25,154.69,...,109.2,45,489999999999999999999999999999999999765,ercgeqgsnmecpnnlccsqygycgmggdycgkgcqngacwtskrc...,PVCGVRASGRVCPDGYCCSQWGYCGTTEEYCGKGCQSQ-CD-YNRC...,~~cg~~~~~~~c~~g~CCs~~g~CG~~~~~Cg~gCq~~-c~-~~~C...,CcCcCCCCCCCCCCCCeECCCCeeCCCccccCCCcccc-ce-eeec...,CCSBGGGTTBCCGGGCEECTTSCEESSHHHHSTTCCBC-TT-TTBC...,ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...,.|||.|.++..||+++|||+++|||.+.+||+++||...


That illustrates how you can easily iterate over multiple `.hhr` result files.
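
Once the dataframes are in a dictionary, one possible next step is to stack them into a single dataframe that records which file each row came from. The following is a sketch under stated assumptions: the two small dataframes are stand-ins with placeholder values, and the names `toy_df_dict`, `combined`, `source`, and `row` are made up for illustration.

```python
import pandas as pd

# Stand-ins for the dataframes produced from each results file (placeholder values)
toy_df_dict = {
    "2uvo_hhblits": pd.DataFrame({"hid": ["2uvo_A", "2wga"], "Probab": [100.0, 99.95]}),
    "2uvo_hhsearch": pd.DataFrame({"hid": ["2uvo_A", "1ulk_A"], "Probab": [100.0, 99.84]}),
}

# Passing a dictionary to pd.concat uses its keys as an outer index level;
# reset_index then turns that level into an ordinary `source` column
combined = (
    pd.concat(toy_df_dict, names=["source", "row"])
    .reset_index(level="source")
)
print(combined)
```

With the real data you would pass `df_dict` in place of `toy_df_dict`.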

-----

**CAVEAT for `hhsuite3_results_to_df.py`**

Note that **there is a caveat**: `hhsuite3_results_to_df.py` presently doesn't work on [hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr](https://github.com/biopython/biopython/blob/master/Tests/HHsuite/hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr). See the note [here](https://github.com/fomightez/sequencework/blob/master/hhsuite3-utilities/README.md). The most direct remedy would be to **demonstrate a pre-processing step here** that would then allow `hhsuite3_results_to_df.py` to work on [hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr](https://github.com/biopython/biopython/blob/master/Tests/HHsuite/hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr).

----

## Using hhsuite parsing code from biopython

The section above showed a script that did all the handling to produce a Pandas dataframe from an HH-suite3 results `.hhr` file. However, there is also the option of using biopython to parse HH-suite3 results `.hhr` files into Python objects that play well with biopython / Python. Biopython has already been installed here. If you are trying to run this Jupyter notebook elsewhere, you'd need to install it yourself.

This section will demonstrate some of how this can be done. This is not meant to be all-encompassing. This may not be tractable unless you have some familiarity with biopython/Python. Determining what to demonstrate here involved examining [hhsuite2_text.py](https://github.com/biopython/biopython/blob/master/Bio/SearchIO/HHsuiteIO/hhsuite2_text.py) and [test_SearchIO_hhsuite2_text.py](https://github.com/biopython/biopython/blob/master/Tests/test_SearchIO_hhsuite2_text.py), and consulting [the documentation](https://biopython.org/docs/1.75/api/Bio.SearchIO.html). (With some additional resources from [an older tutorial](https://biopython-cn.readthedocs.io/zh_CN/latest/en/chr08.html).) Beyond that I had to work out some of this by trial and error because I was not overly familiar with the HSP and HSPFragment modules within Biopython. I've added the use of `dir()` at the end here to explore what attributes are available; however, I liked looking at the resources listed above to get a sense of how those related to the parsed results file and Biopython.

In [23]:
from Bio.SearchIO import parse # import module of Biopython needed into active kernel namespace

The next command will parse an `.hhr` file. Note that the `, "hhsuite2-text"` part specifies the format the biopython parser will use. If you've paid attention, you'll know HH-suite3 is the current release. It looks like Biopython hasn't caught up with that yet; fortunately, I've verified that the `hhsuite2-text` option seems to work to a great extent with current, HH-suite3-produced `.hhr` files.

In [24]:
parsed_file_info = parse("hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr", "hhsuite2-text") # based on test_SearchIO_hhsuite2_text.py and https://biopython.org/docs/1.75/api/Bio.SearchIO.html

In [25]:
type(parsed_file_info)

generator

The `parsed_file_info` created is a Python generator.

We can begin to get a sense of what it is by converting it to a list object and looking at that. 

In [26]:
results = list(parsed_file_info)
results

[QueryResult(id='sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens OX=9606 GN=C16orf70 PE=1 SV=1', 12 hits)]

Looks like a single `QueryResult` object is represented. It has an identifier and says it has a dozen hits.

We can access the identifier for that first and only `QueryResult` object, like so:

In [27]:
results[0].id

'sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens OX=9606 GN=C16orf70 PE=1 SV=1'

Python is zero-indexed so the `[0]` specifies the first in the code above.

Now that we sort of know what we've got, let's start over and access the `QueryResult` object more like one normally would. That way we don't need that cluttered `results[0]` to get at it. The first and only item in the generator got consumed when we converted it into a list, so we'll go back to making that `parsed_file_info` again.
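
If generators being 'consumed' is unfamiliar, the following toy sketch shows the behavior without needing Biopython or a results file. The generator function `toy_parse` is a made-up stand-in that yields a single item, much like the parser yields a single query result here.

```python
def toy_parse():
    """Stand-in generator that yields a single item, like one query result."""
    yield "only item"

toy_gen = toy_parse()
first_pass = list(toy_gen)   # consumes everything the generator can yield
second_pass = list(toy_gen)  # nothing is left, so this list is empty
print(first_pass, second_pass)

# Calling the generator function again gives a fresh generator
fresh_item = next(toy_parse())
print(fresh_item)
```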

In [28]:
parsed_file_info = parse("hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr", "hhsuite2-text") # based on test_SearchIO_hhsuite2_text.py and https://biopython.org/docs/1.75/api/Bio.SearchIO.html

This time we'll access the first object in the generator with the command below. (I'm assuming most `.hhr` files have only this one `QueryResult` object, and so this is likely how you'd usually proceed.)

In [29]:
results = next(parsed_file_info)

Let's examine this new version of `results`.

In [30]:
results

QueryResult(id='sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens OX=9606 GN=C16orf70 PE=1 SV=1', 12 hits)

(Below, we'll actually `print(results)`. You are welcome to do that here yourself. You'll observe the output is different because Python allows customizing the representation of an object when you output it with `print()` and this can differ from when you just display that item.)
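
The difference between displaying an object and printing it comes from two special methods a Python class can define, `__repr__` and `__str__`. The toy class below is a hypothetical illustration of that mechanism and is not how Biopython implements `QueryResult`.

```python
class ToyResult:
    """Hypothetical class showing the two kinds of text representation."""

    def __init__(self, identifier, n_hits):
        self.identifier = identifier
        self.n_hits = n_hits

    def __repr__(self):
        # Used by repr(), which is what a bare name at the end of a cell shows by default
        return f"ToyResult(id={self.identifier!r}, {self.n_hits} hits)"

    def __str__(self):
        # Used by print()
        return f"Query: {self.identifier}\n Hits: {self.n_hits}"

toy = ToyResult("example_query", 12)
print(repr(toy))  # the compact, one-line form
print(toy)        # the longer, human-oriented form
```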

Alright, now we have this one `QueryResult` object and we can access the identifier this way.

In [31]:
results.id

'sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens OX=9606 GN=C16orf70 PE=1 SV=1'

Given the `12 hits` in the `QueryResult` object representation, you may have picked up that this object lets us explore those hits. We'll iterate on those, revealing what type of object each is.

In [32]:
for result in results:
    print(type(result))

<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>
<class 'Bio.SearchIO._model.hit.Hit'>


Each is an instance of the `Bio.SearchIO._model.hit.Hit` class. This is a special Biopython class.

Each `Bio.SearchIO._model.hit.Hit` object has attributes we can access. As a way to get a flavor for how to explore them, we'll first introduce the concept behind one of them. 

Actually, if you happened to examine the file [hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr](https://raw.githubusercontent.com/biopython/biopython/master/Tests/HHsuite/hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr), you may find it interesting that there are only 12 hits. Running the following code will display the first lines of the file so you can compare.

In [33]:
!head -n25 hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr

Query         sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens OX=9606 GN=C16orf70 PE=1 SV=1
Match_columns 422
No_of_seqs    149 out of 573
Neff          6.62119
Searched_HMMs 16712
Date          Wed Feb 13 09:26:07 2019
Command       /home/shah/hh-suite/build/bin/hhsearch -i /home/shah/seq/q9bsu1/hhblits/q9bsu1_uniclust_w_ss.a3m -d /home/shah/db/pfamA_30/pfam -o /home/shah/seq/q9bsu1/hhsearch_q9bsu1_uniclust_w_ss_pfamA_30.hhr -p 20 -Z 250 -loc -z 1 -b 1 -B 250 -ssm 2 -sc 1 -seq 1 -dbstrlen 10000 -norealign -maxres 32000 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 PF03676.13 ; UPF0183 ; Unchara 100.0  2E-106  1E-110  822.8  46.8  393   11-407     1-395 (395)
  2 PF04355.12 ; SmpA_OmlA ; SmpA   71.6    0.97 5.8E-05   31.5   1.2   20  239-258    12-31  (73)
  3 PF11399.7 ; DUF3192 ; Protein   53.5     4.1 0.00025   31.0   1.6   20  239-258    34-53  (104)
  4 PF14172.5 ; DUF4309 ; Domain o  52.4     7.

See how the `No` (number) column goes up to sixteen. Why then are there only a dozen `Bio.SearchIO._model.hit.Hit` objects in `results`?

This is because Biopython gathers all the 'high-scoring segment pairs' (or 'high-scoring pairs'), a.k.a. `HSPs`, that share the same hit identifier under the data for that single 'Hit'. If you examine the 'Hit' column in the table above, you'll see that one identifier can appear in more than one row; for example, PF14504.5 has three HSPs. We can easily see that by printing the `results` object now. (An aside above discussed why this shows something different than just putting `results` in a cell.)

In [34]:
print(results)

Program: HHSUITE (<unknown version>)
  Query: sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens OX=9606 GN=C16orf70 PE=1 SV=1 (422)
         <unknown description>
 Target: <unknown target>
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  PF03676.13  UPF0183 ; Uncharacterised protein family (U...
            1      1  PF04355.12  SmpA_OmlA ; SmpA / OmlA family
            2      1  PF11399.7  DUF3192 ; Protein of unknown function (DUF3192)
            3      2  PF14172.5  DUF4309 ; Domain of unknown function (DUF4309)
            4      1  PF11006.7  DUF2845 ; Protein of unknown function (DUF2845)
            5      3  PF14504.5  CAP_assoc_N ; CAP-associated N-terminal
            6      1  PF14454.5  Prok_Ub ; Prokaryotic Ubiquitin
            7      1  PF06185.11  YecM ; YecM protein
            8    

The sixth one down, PF14504.5, has `3` in the `# HSP` column.
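
The grouping of table rows under shared hit identifiers can be sketched in a few lines. This is only an illustration of the idea, not Biopython's actual implementation; the helper `group_rows_by_hit` and the toy rows (identifiers and E-values) are made up.

```python
def group_rows_by_hit(rows):
    """Group (hit_id, evalue) rows by hit_id, keeping first-seen order."""
    grouped = {}
    for hit_id, evalue in rows:
        grouped.setdefault(hit_id, []).append(evalue)
    return grouped


# Toy rows (made-up identifiers and E-values, not taken from the results file)
toy_rows = [
    ("HIT_A", 1e-50),
    ("HIT_B", 0.5),
    ("HIT_C", 2.0),
    ("HIT_B", 7.0),
    ("HIT_C", 9.0),
    ("HIT_C", 30.0),
]

grouped = group_rows_by_hit(toy_rows)
for hit_id, evalues in grouped.items():
    # One line per hit, with the number of rows gathered under it
    print(hit_id, len(evalues))
```

Six rows collapse to three hits here, in the same way the sixteen rows of the `.hhr` table collapse to twelve `Hit` objects.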

Let's see how we can access the HSP data for each hit. First, let's get the number of HSPs for each hit by applying Python's `len()` function.

In [35]:
for result in results:
    print(len(result.hsps))

1
1
1
2
1
3
1
1
2
1
1
1
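
As a quick sanity check, the counts above should add up to the sixteen rows of the `.hhr` summary table. The list below is copied by hand from the output above.

```python
# Number of HSPs per hit, copied from the output above
hsps_per_hit = [1, 1, 1, 2, 1, 3, 1, 1, 2, 1, 1, 1]

print(len(hsps_per_hit))  # number of Hit objects
print(sum(hsps_per_hit))  # total HSPs, one per row of the summary table
```

In the notebook itself, `sum(len(result.hsps) for result in results)` should give the same total without copying anything by hand.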


Now how do we get the individual HSPs from the parsed results? We'll explore that.  
To explore better, yet keep the output short, we'll mainly limit our exploration in the next few cells to the first two hits by iterating on `results[:2]`. Keep in mind there are ten more, though.

In [36]:
for result in results[:2]:
    for hsp in result.hsps:
        print(hsp)
        print("\n")

      Query: sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens O...
        Hit: PF03676.13 UPF0183 ; Uncharacterised protein family (UPF0183)
Query range: [10:407] (None)
  Hit range: [0:395] (None)
Quick stats: evalue 2e-106; bitscore ?
  Fragments: 1 (399 columns)
     Query - SLGNEQWEFTLGMPLAQAVAILQKHCRIIKNVQVLYSEQSPLSHDLILNLTQDGIKLMF~~~SVTLY
       Hit - EQWE----FALGMPLAQAISILQKHCRIIKNVQVLYSEQMPLSHDLILNLTQDGIKLLF~~~SVTLY


      Query: sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens O...
        Hit: PF04355.12 SmpA_OmlA ; SmpA / OmlA family
Query range: [238:258] (None)
  Hit range: [11:31] (None)
Quick stats: evalue 0.97; bitscore ?
  Fragments: 1 (20 columns)
     Query - VYFGDSCQDVLSMLGSPHKV
       Hit - LQIGMSESQVTYLLGNPMLR




Because we already knew each of these two hits has only one HSP, we could have used the code here:
    
```python
for result in results[:2]:
    print(result.hsps[0])
    print("\n")
```
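
Rather than printing, you may want to collect the HSPs into plain rows for later use. The sketch below relies only on attribute access (`.id`, `.hsps`, `.evalue`, `.score`). It assumes those attributes are populated for HH-suite results, which matches the printouts in this notebook but is worth confirming on your own data. The helper `hsps_to_rows` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def hsps_to_rows(query_result):
    """Return one dict per HSP, tagged with the ID of the hit it belongs to."""
    rows = []
    for hit in query_result:
        for hsp in hit.hsps:
            rows.append(
                {"hit_id": hit.id, "evalue": hsp.evalue, "score": hsp.score}
            )
    return rows


# Stand-in objects (made-up values) so the sketch runs without the results file.
# With the real data you would call hsps_to_rows(results) instead.
stand_in = [
    SimpleNamespace(id="HIT_A", hsps=[SimpleNamespace(evalue=1e-50, score=800.0)]),
    SimpleNamespace(
        id="HIT_B",
        hsps=[
            SimpleNamespace(evalue=0.5, score=30.0),
            SimpleNamespace(evalue=7.0, score=25.0),
        ],
    ),
]

for row in hsps_to_rows(stand_in):
    print(row)
```

A list of dicts like this can be handed straight to `pandas.DataFrame(...)` if you prefer to continue in Pandas.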

Because we know the sixth hit has more, we can look at that one quickly. In other words, display the three HSPs for PF14504.5:

In [37]:
for hsp in results[5].hsps:
    print(hsp)
    print("\n")

      Query: sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens O...
        Hit: PF14504.5 CAP_assoc_N ; CAP-associated N-terminal
Query range: [240:259] (None)
  Hit range: [0:19] (None)
Quick stats: evalue 9.7; bitscore ?
  Fragments: 1 (19 columns)
     Query - FGDSCQDVLSMLGSPHKVF
       Hit - IGKNASDLQVLLGDPERKD


      Query: sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens O...
        Hit: PF14504.5 CAP_assoc_N ; CAP-associated N-terminal
Query range: [238:309] (None)
  Hit range: [61:128] (None)
Quick stats: evalue 17; bitscore ?
  Fragments: 1 (80 columns)
     Query - VYFGDSCQDVLSMLGSPHKV---------FYKSEDKMKIHSPSPHKQVPSKCNDYFFNY~~~KKFVL
       Hit - FHIGQPVSEIYSSVFIDTNINFQYKGSSYRFELSE-------------DDLNTRPLIKA~~~SSIRY


      Query: sp|Q9BSU1|CP070_HUMAN UPF0183 protein C16orf70 OS=Homo sapiens O...
        Hit: PF14504.5 CAP_assoc_N ; CAP-associated N-terminal
Query range: [12:83] (None)
  Hit range: [60:130] (None)
Quick stats: evalue 83; bitscore ?
 

Note that the sequence representations above are shortened for ease of viewing in this summarized form. Below, I'll show a route you can use to get the actual sequences.
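
One detail in the HSP printouts above is worth spelling out: the ranges do not match the `.hhr` table exactly. The table lists `11-407` for the first hit's query range, while Biopython reports `[10:407]`. Biopython's SearchIO uses Python-style coordinates (0-based start, end excluded), whereas the `.hhr` file uses 1-based, inclusive positions. The helper below is a hypothetical illustration of that conversion, not a Biopython function.

```python
def to_zero_based_half_open(start, end):
    """Convert a 1-based inclusive range to a 0-based, end-exclusive one."""
    return start - 1, end


# Ranges as written in the `.hhr` summary table for the first two hits
print(to_zero_based_half_open(11, 407))   # Query HMM column, hit 1
print(to_zero_based_half_open(1, 395))    # Template HMM column, hit 1
print(to_zero_based_half_open(239, 258))  # Query HMM column, hit 2
print(to_zero_based_half_open(12, 31))    # Template HMM column, hit 2
```

The converted pairs correspond to the `Query range` and `Hit range` values printed for those two hits.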

We'll return to looping on the first two hits to give a sense of what can be examined. To make the correspondence easier to see, each data item is followed by the text `***------HELPFUL DATA SEPARATOR-------***`:

In [38]:
for result in results[:2]:
    for hsp in result.hsps:
        print(hsp.score)
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(hsp.hit.description)  # I guessed `.hit.description` based on `self.assertEqual("T", str(hsp.hit.seq))` in test_SearchIO_hhsuite2_text.py because I didn't see an example for this one
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(hsp.hit.seq)
        print("***------HELPFUL DATA SEPARATOR-------***")
        print(hsp.query.seq)
        print("***------HELPFUL DATA SEPARATOR-------***")

822.75
***------HELPFUL DATA SEPARATOR-------***
UPF0183 ; Uncharacterised protein family (UPF0183)
***------HELPFUL DATA SEPARATOR-------***
EQWE----FALGMPLAQAISILQKHCRIIKNVQVLYSEQMPLSHDLILNLTQDGIKLLFDACNQRLKVIEVYDLTKVKLKYCGVHFNSQAIAPTIEQIDQSFGATHPGVYNAAEQLFHLNFRGLSFSFQLDSWSEAPKYEPNFAHGLASLQIPHGATVKRMYIYSGNNLQETKAPAMPLACFLGNVYAECVEVLRDGAGPLGLKLRLLTAGCGPGVLADTKVRAVERSIYFGDSCQDVLSALGSPHKVFYKSEDKMKIHSPSPHKQVPSKCNDYFFNYYILGVDILFDSTTHLVKKFVLHTNFPGHYNFNIYHRCDFKIPLIIKKDGADAHSEDCILTTYSKWDQIQELLGHPMEKPVVLHRSSSANNTNPFGSTFCFGLQRMIFEVMQNNHIASVTLY
***------HELPFUL DATA SEPARATOR-------***
SLGNEQWEFTLGMPLAQAVAILQKHCRIIKNVQVLYSEQSPLSHDLILNLTQDGIKLMFDAFNQRLKVIEVCDLTKVKLKYCGVHFNSQAIAPTIEQIDQSFGATHPGVYNSAEQLFHLNFRGLSFSFQLDSWTEAPKYEPNFAHGLASLQIPHGATVKRMYIYSGNSLQDTKAPMMPLSCFLGNVYAESVDVLRDGTGPAGLRLRLLAAGCGPGLLADAKMRVFERSVYFGDSCQDVLSMLGSPHKVFYKSEDKMKIHSPSPHKQVPSKCNDYFFNYFTLGVDILFDANTHKVKKFVLHTNYPGHYNFNIYHRCEFKIPLAIKKENADGQTE--TCTTYSKWDNIQELLGHPVEKPVVLHRSSSPNNTNPFGSTFCFGLQRMIFEVMQNNHIASVTLY
***------HELPFUL
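
Once you have the full aligned strings, you can compare them column by column. Below is a minimal sketch that counts identical columns, skipping any column with a gap (`-`) in either sequence. The helper `aligned_identity` is hypothetical. The two strings are the complete 20-column alignment for the second hit (PF04355.12), copied from the HSP printout earlier; in the notebook you would pass `str(hsp.query.seq)` and `str(hsp.hit.seq)` instead.

```python
def aligned_identity(query_aln, hit_aln, gap="-"):
    """Return (identical columns, columns where neither sequence has a gap)."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for q, h in zip(query_aln, hit_aln):
        if q == gap or h == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if q == h:
            identical += 1
    return identical, compared


# Aligned strings for the second hit (PF04355.12), copied from the printout
query_aln = "VYFGDSCQDVLSMLGSPHKV"
hit_aln = "LQIGMSESQVTYLLGNPMLR"

identical, compared = aligned_identity(query_aln, hit_aln)
print(identical, compared)
print(query_aln.replace("-", ""))  # ungapped query segment (no gaps in this one)
```

For this pair, 6 of the 20 aligned columns are identical.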

To get a sense of what attributes are available, you can use Python's `dir()` function. Here is a demonstration of using it to list the attributes of the `hsp.hit` object I used above:

In [39]:
# print(dir(results[0].hsps[0].hit)) # <-- advanced route to what below reveals
for result in results[:1]:
    for hsp in result.hsps:
        print(dir(hsp.hit))

['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']


You can ignore those beginning with `_`, which leaves:

```python
'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper'
```

And you can further drill down. For example, public attributes under `hsp.hit.seq` would be:

```python
'back_transcribe', 'complement', 'complement_rna', 'count', 'count_overlap', 'encode', 'endswith', 'find', 'index', 'join', 'lower', 'lstrip', 'reverse_complement', 'reverse_complement_rna', 'rfind', 'rindex', 'rsplit', 'rstrip', 'split', 'startswith', 'strip', 'tomutable', 'transcribe', 'translate', 'ungap', 'upper'
```
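
The filtering done by eye above can also be done in code. Here is a small sketch using a hypothetical helper, `public_names`, demonstrated on an ordinary string so that it runs anywhere.

```python
def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Demonstrated on a plain string; any object works the same way
names = public_names("ACGT")
print(names[:5])
```

In this notebook, `public_names(results[0].hsps[0].hit)` should give the shortened list shown above for `hsp.hit`.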

At the top of this section entitled 'Using hhsuite parsing code from biopython', I listed some additional resources that gave me a better sense of what is available from the parsed results file and how those pieces fit into the Biopython operational model.

----

Continue on with the next notebook in the series, [????](?????.ipynb). That notebook builds on the ground work here and in previous notebooks in this series to demonstrate .... .

----