# Demo of script to get sequence from multiFASTA file when description contains matching text

To demonstrate my script `get_seq_from_multiFASTA_with_match_in_description.py`, I'll use it to collect protein sequences from the PacBio-sequenced yeast genomes of [Yue et al 2017](https://www.ncbi.nlm.nih.gov/pubmed/28416820).

Reference for sequence data:  
[Contrasting evolutionary genome dynamics between domesticated and wild yeasts.
Yue JX, Li J, Aigrain L, Hallin J, Persson K, Oliver K, Bergström A, Coupland P, Warringer J, Lagomarsino MC, Fischer G, Durbin R, Liti G. Nat Genet. 2017 Jun;49(6):913-924. doi: 10.1038/ng.3847. Epub 2017 Apr 17. PMID: 28416820](https://www.ncbi.nlm.nih.gov/pubmed/28416820)

This is meant to represent a typical workflow where a combination of these steps might be used.

------

##  Basic use

Given a sequence file with multiple entries in FASTA format (a.k.a. a multiFASTA file), this script gets the sequence whose description line contains a match to the provided text.
It returns only the first match it finds.
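
To make the "first match in the description line" behavior concrete, here is a minimal sketch in plain Python. It is not the script's actual code: the function name `first_record_matching` is hypothetical, and the second record in the toy data is a placeholder. The first record's header and leading residues are taken from the COX1 output shown later in this demo.

```python
def first_record_matching(fasta_text, text_to_match, case_sensitive=False):
    """Return the first FASTA record whose description line contains the text.

    The record (description line plus sequence lines) comes back as a string;
    None is returned when no description line matches.
    Hypothetical helper for illustration, not the real script's function.
    """
    needle = text_to_match if case_sensitive else text_to_match.lower()
    # Split on ">" at the start of a line; the first chunk is whatever
    # precedes the first record (usually empty), so it is skipped.
    for chunk in ("\n" + fasta_text).split("\n>")[1:]:
        description = chunk.split("\n", 1)[0]
        haystack = description if case_sensitive else description.lower()
        if needle in haystack:
            return ">" + chunk.rstrip("\n") + "\n"
    return None


# Toy data: the second record is a made-up placeholder.
example = ">COX1|COX1.t01|COX1\nMVQRWLYS\n>GENE2|GENE2.t01|GENE2\nMKT\n"

print(first_record_matching(example, "cox1"))                       # default ignores case
print(first_record_matching(example, "cox1", case_sensitive=True))  # no match
```

Because the loop returns at the first hit, any later records that also match are never examined, which mirrors the "first match only" behavior described above.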

Before moving on to a more representative use case, this section shows the basics of using the script on the command line. (On the 'proper' command line you wouldn't need the exclamation points in front of these commands; I added them so the commands work in this notebook.)

In [3]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/get_seq_from_multiFASTA_with_match_in_description.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9941  100  9941    0     0  72036      0 --:--:-- --:--:-- --:--:-- 72562


#### Display USAGE block

In [5]:
!python get_seq_from_multiFASTA_with_match_in_description.py -h

usage: get_seq_from_multiFASTA_with_match_in_description.py
       [-h] [-cs] SEQUENCE_FILE TEXT_TO_MATCH

get_seq_from_multiFASTA_with_match_in_description.py takes any sequences in
FASTA format and gets the first sequence with a description line containing a
match to provided text string. For example, if provided a multi-sequence FASTA
file and a gene identifier, such as `YDL140C`, it will pull out the first
sequence matching that anywhere in description line. Defaults to ignoring
case. **** Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  SEQUENCE_FILE         Name of sequence file to search.
  TEXT_TO_MATCH         Text to match.

optional arguments:
  -h, --help            show this help message and exit
  -cs, --case_sensitive
                        Add this flag if you want to force matching to be
                        case-sensitive.
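
For readers curious how a usage block like the one above comes about, the following is a sketch of an `argparse` declaration that would accept the same arguments. It is an assumption about the general shape only, not the script's real source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser accepting the arguments listed in the USAGE block."""
    parser = argparse.ArgumentParser(
        prog="get_seq_from_multiFASTA_with_match_in_description.py",
        description="Get the first sequence whose description line "
                    "contains a match to the provided text.")
    parser.add_argument("sequence_file", metavar="SEQUENCE_FILE",
                        help="Name of sequence file to search.")
    parser.add_argument("text_to_match", metavar="TEXT_TO_MATCH",
                        help="Text to match.")
    # store_true makes the flag default to False, i.e. case is ignored
    # unless `-cs` is given.
    parser.add_argument("-cs", "--case_sensitive", action="store_true",
                        help="Force matching to be case-sensitive.")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["DBVPG6044.mt.pep.fa", "COX1", "-cs"])
print(args.sequence_file, args.text_to_match, args.case_sensitive)
```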


Next, we'll get example data and extract a sequence.

In [10]:
# Get a file of protein sequences
!curl -OL http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_PEP/DBVPG6044.mt.pep.fa.gz
!gunzip DBVPG6044.mt.pep.fa.gz

In [11]:
# Extract
!python get_seq_from_multiFASTA_with_match_in_description.py DBVPG6044.mt.pep.fa cox1



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_cox1.fa'.
*****************DONE**************************


The next cell shows it worked.

In [12]:
!head seq_cox1.fa

>COX1|COX1.t01|COX1
MVQRWLYSTNAKDIAVLYFMLAIFSGMAGTAMSLIIRLELAAPGSQYLHGNSQLFNVLVTGHAVLMIFFLVMPALMGGFGNYLLPLMIGATDTAFPRINNIAFWVLPMGLVCLVTSTLVESGAGTGWTVYPPLSSIQAHSGPSVDLAIFALHLTSISSLLGAINFIVTTLNMRTNGMTMHKLPLFVWSIFITAFLLLLSLPVLSAGITMLLLDRNFNTSFFEVAGGGDPILYEHLFWFFGHPEVYILIIPGFGIISHVVSTYSKKPVFGEISMVYAMASIGLLGFLVWSHHMYIVGLDADTRAYFTSATMIIAIPTGIKIFSWLATIYGGSIRLATPMLYAIAFLFLFTMGGLTGVALANASLDVAFHDTYYVVGHFHYVLSMGAIFSLFAGYYYWSPQILGLNYNEKLAQIQFWLIFIGANVIFFPMHFLGINGMPRRIPDYPDAFAGWNYVASIGSFIATLSLFLFIYILYDQLVNGLNNKVNNKSVIYAKAPDFVESNTIFNLNTVKSSSIEFLLTSPPAVHSFNTPAVQS

There is an optional flag that can be added to make the matching case sensitive. (The default is to not be case sensitive.)

In [14]:
!python get_seq_from_multiFASTA_with_match_in_description.py DBVPG6044.mt.pep.fa cox1 -cs

**ERROR:No match to provided text found in description line for ANY sequence record.  ***ERROR*** 
EXITING.


In [15]:
!python get_seq_from_multiFASTA_with_match_in_description.py DBVPG6044.mt.pep.fa COX1 -cs



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_COX1.fa'.
*****************DONE**************************


## Use script in a Jupyter notebook to collect sequences from a series of PacBio-sequenced genomes

This will demonstrate importing the script's main function into a notebook in order to collect protein sequences from a collection of PacBio-sequenced yeast genomes from [Yue et al 2017](https://www.ncbi.nlm.nih.gov/pubmed/28416820).

This is meant to demonstrate using Python and the script to accomplish a series of steps.

In [19]:
# Get the script if the above section wasn't run
import os
file_needed = "get_seq_from_multiFASTA_with_match_in_description.py"
if not os.path.isfile(file_needed):
    !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/get_seq_from_multiFASTA_with_match_in_description.py

In [21]:
# Prepare for getting PacBio sequences (Yue et al 2017)
# Make a list of the strain designations
yue_et_al_strains = ["S288C","DBVPG6044","DBVPG6765","SK1","Y12",
                     "YPS128","UWOPS034614","CBS432","N44","YPS138",
                     "UFRJ50816","UWOPS919171"]

In [23]:
# Get & unpack protein sequences from strains 
for s in yue_et_al_strains:
    !curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_PEP/{s}.pep.fa.gz
    !gunzip -f {s}.pep.fa.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2342      0 --:--:-- --:--:-- --:--:--  2373
100 1653k  100 1653k    0     0  3451k      0 --:--:-- --:--:-- --:--:-- 3451k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3708      0 --:--:-- --:--:-- --:--:--  3708
100 1660k  100 1660k    0     0  7125k      0 --:--:-- --:--:-- --:--:-- 7125k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1711      0 --:--:-- --:--:-- --:--:--  1711
100 1655k  100 1655k    0     0  3221k      0 --:--:-- --:--:-- --:--:-- 3221k
  % Total    % Received % Xferd  Average Speed   Tim
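
On a system without the `gunzip` command, the unpacking step could be done with Python's standard library instead. This is a minimal sketch under that assumption; the helper name `gunzip` is mine and simply echoes the command it stands in for.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(gz_path):
    """Decompress `name.gz` to `name` in the same directory; return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        shutil.copyfileobj(source, target)  # streams, so large files are fine
    return out_path
```

With the strain list above, `for s in yue_et_al_strains: gunzip(f"{s}.pep.fa.gz")` would be the rough equivalent of the `!gunzip -f` line. Unlike the `gunzip` command, this sketch leaves the `.gz` file in place.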

With the protein sequences available, we are ready to step through each file and collect the protein sequence for a gene which we will designate with the [SGD](https://www.yeastgenome.org/) systematic identifier. Fortunately, Yue et al. had already annotated the description line of the protein encoding sequences with the identifier of the corresponding gene from the SGD reference sequence for S288C.

One more preparation step is to bring the main function from `get_seq_from_multiFASTA_with_match_in_description.py` into the notebook environment. The next cell does that.

In [1]:
# bring the main function from `get_seq_from_multiFASTA_with_match_in_description.py` into the namespace
from get_seq_from_multiFASTA_with_match_in_description import get_seq_from_multiFASTA_with_match_in_description

In [17]:
# Use fnmatch to find match to the protein sequence file names 
# so only check in the peptide fasta files (skip the index files ending 
# in `.fai` and the mito sequence example from above)

gene_to_match = "YDL140C"


fn_to_check = "pep.fa" #part of file name in files to search
sequences = "" #initialize a string to collect sequences in

import os
import fnmatch
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+fn_to_check):
        if not file.endswith(".fai") and file != "DBVPG6044.mt.pep.fa":
            sequences += get_seq_from_multiFASTA_with_match_in_description(
                file,gene_to_match, return_record_as_string=True)
            sequences += "\n" # so the next entry is on a new line



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE*****************

Let's save the file of the sequences because we'll probably want them.

In [18]:
%store sequences > "YDL140C_orthologs.fa"

Writing 'sequences' (str) to file 'YDL140C_orthologs.fa'.


Let's check that:

In [20]:
!head -n 2 YDL140C_orthologs.fa

>CBS432_04G00980|CBS432_04T00980.1|YDL140C
MVGQQYSSAPLRTVKEVQFGLFSPEEVRAISVAKIRFPETMDETQTRAKIGGLNDPRLGSIDRNLKCQTCQEGMNECPGHFGHIDLAKPVFHVGFIAKIKKVCECVCMHCGKLLLDEHNELMRQALAIKDSKKRFAAIWTLCKTKMVCETDVPSEDDPTQLVSRGGCGNTQPTVRKDGLKLVGSWKKDRASGDADEPELRVLSTEEILNIFKHISVKDFTSLGFNEVFSRPEWMILTCLPVPPPPVRPSISFNESQRGEDDLTFKLADILKANISLETLEHNGAPHHAIEEAESLLQFHVATYMDNDIAGQPQALQKSGRPVKSIRARLKGKEGRIRGNLMGKRVDFSARTVISGDPNLELDQVGVPKSIAKTLTYPEVVTPYNIDRLTQLVRNGPNEHPGAKYVIRDSGDRIDLRYSKRAGDIQLQYGWKVERHIMDNDPVLFNRQPSLHKMSMMAHRVKVIPYSTFRLNLSVTSPYNADFDGDEMNLHVPQSEETRAELSQLCAVPLQIVSPQSNKPCMGIVQDTLCGIRKLTLRDTFIELDQVLNMLYWVPDWDGVIPTPAIIKPKPLWSGKQILSVAIPNGIHLQRFDEGTTLLSPKDNGMLIIDGQIIFGVVEKKTVGSSNGGLIHVVTREKGPQVCAKLFGNIQKVVNFWLLHNGFSTGIGDTIADGPTMREITETIAEAKKKVLDVTKEAQANLLTAKHGMTLRESFEDNVVRFLNEARDKAGRLAEVNLKDLNNVKQMVMAGSKGSFINIAQMSACVGQQSVEGKRIAFGFVDRTLPHFSKDDYSPESKGFVENSYLRGLTPQEFFFHAMGGREGLIDTAVKTAETGYIQRRLVKALEDIMVHYDNTTRNSLGNVIQFIYGEDGMDAAHIEKQSLDTIGGSDTAFERRYRIDLLNTDHTLDPSLLESGSEILGDLKLQVLLDEEYKQLVKDRKFLREVFVDGEANWPLP
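
`head -n 2` only shows the first record. To check how many records were collected and which identifiers they carry, a small helper along these lines could be used. It is a sketch: `fasta_ids` is a hypothetical name, and the toy string below uses a placeholder second record and shortened sequences rather than real data.

```python
def fasta_ids(fasta_text):
    """Return the identifier (text up to the first whitespace) of each record."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]


# Toy stand-in for the collected `sequences` string; the second header is
# a made-up placeholder and the sequences are shortened.
toy = (">CBS432_04G00980|CBS432_04T00980.1|YDL140C\nMVGQQYSS\n\n"
       ">strainB_gene|strainB_transcript|YDL140C\nMVGQQYSS\n\n")

ids = fasta_ids(toy)
print(len(ids), "records")
print(all("YDL140C" in identifier for identifier in ids))
```

In the notebook, `fasta_ids(sequences)` would be the call to make; one record per strain file searched is what you would hope to see.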

Since the sequence in this case is from yeast, the FASTA can also be obtained using my script `get_protein_seq_as_FASTA.py`, which will work in a Binder session launched from the same place this active notebook was launched. See more about `get_protein_seq_as_FASTA.py` [here](https://github.com/fomightez/yeastmine).

Yay! One of the sequences in the manually-edited multiple sequence alignment has been confirmed against the official record, verifying that no errors were introduced during editing. It is best to repeat the process for the others as well.

## Optional: Adjust width of alignment

Biopython produces alignments with 50 sequence characters per line, even if you started with ones that have 60 sequence characters per line, like those that come from [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/) or similar tools. Clustal-formatted data with 50 sequence characters per line can be adjusted back to 60 (or some other widths) using [MView](https://www.ebi.ac.uk/Tools/msa/mview/) and then parsing the output into actual CLUSTAL format. (Admittedly, it is a little kludgy, but it works, and relying on [MView](https://www.ebi.ac.uk/Tools/msa/mview/) for most of the heavy lifting means less software to maintain.)

This illustrates doing that. 

#### Step 1: Use MView to get alignment with width desired.

First, the alignment with fifty sequence characters per line needs to be submitted to [MView](https://www.ebi.ac.uk/Tools/msa/mview/) and the settings adjusted to make sure the right form comes back. The following lines describe doing that. **YOU DO NOT ACTUALLY NEED TO DO THAT FOR THE EXAMPLE; a pre-made version will be retrieved in the next cell to save you from needing to actually do these steps.**
 
In the top box at [MView](https://www.ebi.ac.uk/Tools/msa/mview/), paste in the text of the alignment that was written with Biopython's `AlignIO.write()` method above. In that example, the file containing the alignment text is called `standardized_alignment.clw`. 

Set `INPUT FORMAT` to 'CLUSTAL'.

Under `STEP 3 - Set output parameters` at [MView](https://www.ebi.ac.uk/Tools/msa/mview/), click on the 'More options...' button and adjust the settings to match the image below. Adjust the 'ALIGNMENT WIDTH' in those options to what you'd like. 

![options_settings](../imgs/mview_settings_for_parse.png)

Essentially, in addition to choosing 'MVIEW' as `OUTPUT FORMAT`, 'ON' for `ALIGNMENT`, and the desired width, most settings are set to 'OFF', in particular `HTML MARKUP`, `RULER`, and `CONSENSUS`.

Submit the job and let it run.  
We need to bring the result into this Jupyter session. Click the `Download Alignment File` button just above the text output. Currently, clicking that in my browser brings up a page with just the output text, where I need to select all the text, copy the entire block, and paste it into a file here in the Jupyter session. Call it `mview_out.txt`. 

You can run the next cell to get the `mview_out.txt` file that would result from those steps.

In [1]:
# Get pre-made mview output
!curl -o mview_out.txt https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/5860f3b4c6aaf25d348dd9a188670cb89e68792e/uv_mview_output.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2735  100  2735    0     0  11540      0 --:--:-- --:--:-- --:--:-- 11540


To show we have that:

In [2]:
!head mview_out.txt

Reference sequence (1): STV1 Identities normalised by aligned length.
1 STV1 100.0% 100.0%  -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR 
2 VPH1  91.5%  47.2%  MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR 

1 STV1 100.0% 100.0%  RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE 
2 VPH1  91.5%  47.2%  RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI--------- 

1 STV1 100.0% 100.0%  NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP 
2 VPH1  91.5%  47.2%  --DDYVRNASYLEERLIQMEDATDQIEVQKNDLEQYRFILQSGDEFF-----LKGDNTDS 



#### Step 2: Use my script to convert that to standardized CLUSTAL format.

From the view above, you'll see that MView got things most of the way there. Specifically, each sequence block contains the identifier and the sequence at the specified width. My script mainly parses those out and adjusts a few things to make it standard CLUSTAL format.  
The main source for the 'standardized' specification seems to be [here](http://meme-suite.org/doc/clustalw-format.html); however, see the top of this notebook for more about the specification.
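
To give a feel for the parsing involved, here is a minimal sketch of that kind of conversion. It is not the code of `mview_to_CLUSTAL.py`; the function name is hypothetical, and it assumes every data row has exactly the five whitespace-separated fields seen in the output above (row number, identifier, two percentages, sequence). The 16-character identifier column is likewise taken from the example output, not from the specification.

```python
def mview_rows_to_clustal(mview_text, id_width=16):
    """Turn plain MView rows into CLUSTAL-style text (illustrative sketch)."""
    sequences = {}  # identifier -> full gapped sequence, in order of appearance
    width = None    # characters per line, taken from the first data row
    for line in mview_text.splitlines():
        fields = line.split()
        # Data rows start with the row number and have five fields; this
        # skips the "Reference sequence ..." header and blank lines.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        identifier, chunk = fields[1], fields[4]
        if width is None:
            width = len(chunk)
        sequences[identifier] = sequences.get(identifier, "") + chunk
    if not sequences:
        raise ValueError("no MView data rows found")
    out = ["CLUSTAL multiple sequence alignment (sketch)", ""]
    length = max(len(seq) for seq in sequences.values())
    for start in range(0, length, width):
        for identifier, seq in sequences.items():
            out.append(identifier.ljust(id_width) + seq[start:start + width])
        out.append("")  # blank line between blocks
    return "\n".join(out)


# Rows shortened from the MView output shown above.
mview_text = """Reference sequence (1): STV1 Identities normalised by aligned length.
1 STV1 100.0% 100.0%  -MNQEEAIFR
2 VPH1  91.5%  47.2%  MAEKEEAIFR

1 STV1 100.0% 100.0%  SADMTYVQLY
2 VPH1  91.5%  47.2%  SAEMALVQFY
"""
print(mview_rows_to_clustal(mview_text))
```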

The next cell will retrieve the script.

In [3]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/mview_to_CLUSTAL.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13692  100 13692    0     0   129k      0 --:--:-- --:--:-- --:--:--  129k


Now we will point the script at that `mview_out.txt` file.

In [4]:
%run mview_to_CLUSTAL.py mview_out.txt

MView output read...collected identifiers and sequences...arranging for output...

Alignment converted from MView to CLUSTAL saved as 'mview_out_clustalized.clw'.
Finished.


To see the result:

In [5]:
!head mview_out_clustalized.clw

CLUSTAL multiple sequence alignment by mview_to_CLUSTAL (0.1.0)

STV1            -MNQEEAIFRSADMTYVQLYIPLEVIREVTFLLGKMSVFMVMDLNKDLTAFQRGYVNQLR
VPH1            MAEKEEAIFRSAEMALVQFYIPQEISRDSAYTLGQLGLVQFRDLNSKVRAFQRTFVNEIR

STV1            RFDEVERMVGFLNEVVEKHAAETW-----KYILHIDDEGNDIAQPDMADLINTMEPLSLE
VPH1            RLDNVERQYRYFYSLLKKHDIKLYEGDTDKYL----DGSGELYVPPSGSVI---------

STV1            NVNDMVKEITDCESRARQLDESLDSLRSKLNDLLEQRQVIFECSKFIEVNPGIAGRATNP
VPH1            --DDYVRNASYLEERLIQMEDATDQIEVQKNDLEQYRFILQSGDEFF-----LKGDNTDS


Let's rename that back to something more general for use below.

In [55]:
!mv mview_out_clustalized.clw aligned.clw

## Add a consensus symbol line to an MSA

Multiple sequence alignments from various sources don't come with the consensus symbols line typically provided by [EMBL-EBI's MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/). These symbols are described [here](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ#BioinformaticsToolsFAQ-WhatdoconsensussymbolsrepresentinaMultipleSequenceAlignment?). Sometimes the symbols get lost, or need so much updating after manual editing that it is easier to remove them and add them again from scratch. Here `calculate_cons_for_clustal_protein.py` is used to add a consensus line to a multiple sequence alignment. I have a separate script for nucleic acids, called `calculate_cons_for_clustal_nucleic.py`; see more about it [here](https://github.com/fomightez/sequencework/tree/master/alignment-utilities).
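
As a rough illustration of what computing such a line involves for proteins, here is a sketch. It is not the code of `calculate_cons_for_clustal_protein.py`. The residue groups are the Clustal "strong" and "weak" conservation groups as I recall them, so check them against the Clustal documentation before relying on them; the function names are hypothetical.

```python
# Residue groups reproduced from memory of the Clustal documentation
# (an assumption to verify, not taken from the script).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:          # any gap in the column: no symbol
        return " "
    if len(residues) == 1:       # fully conserved
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def consensus_line(aligned_seqs):
    """Build the symbols line for equal-length gapped sequences."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all be the same length")
    return "".join(consensus_symbol("".join(col)) for col in zip(*aligned_seqs))


# First six columns of the STV1/VPH1 alignment used earlier.
print(repr(consensus_line(["-MNQEE", "MAEKEE"])))
```

The real script also has to read and write the CLUSTAL blocks and keep the symbols aligned under the sequence columns, which this sketch leaves out.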

(Note the width will be 60 and not 50 if the optional step was included.)

Possible subsequent use for the consensus symbols line:  

Beyond visually displaying relatedness in a multiple sequence alignment, these symbols can be used for categorizing residues to make commands for highlighting in molecular visualization. See [here for an example](https://nbviewer.jupyter.org/github/fomightez/cl_demo-binder/blob/master/cl_demo-binder%20Categorize%20conservation%20in%20a%20MSA%20and%20use%20that%20to%20generate%20molvis%20commands.ipynb) that uses the `categorize_residues_based_on_conservation_relative_consensus_line.py` script described [here](https://github.com/fomightez/sequencework/tree/master/alignment-utilities). The notebook can be launched in active form from [here](https://github.com/fomightez/cl_demo-binder) by selecting the 'Categorize conservation in a MSA and use that to generate molvis commands' page from the index. The demo was put in the structure work demo series because it was mainly developed to work towards making commands for molecular visualization.
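
The core idea of that categorization can be sketched in a few lines: walk along one aligned sequence together with the consensus line and record the residue number (counting only non-gap characters) under each symbol. This is my own illustration with hypothetical names, not the code of the script linked above.

```python
def categorize_positions(reference_aligned, consensus):
    """Group residue numbers of the reference sequence by consensus symbol."""
    labels = {"*": "identical", ":": "strongly_similar", ".": "weakly_similar"}
    categories = {name: [] for name in labels.values()}
    categories["not_conserved"] = []
    # Trailing spaces of a consensus line are often stripped; pad them back.
    consensus = consensus.ljust(len(reference_aligned))
    residue_number = 0
    for residue, symbol in zip(reference_aligned, consensus):
        if residue == "-":
            continue  # gaps in the reference have no residue number
        residue_number += 1
        categories[labels.get(symbol, "not_conserved")].append(residue_number)
    return categories


# First six alignment columns of STV1 from earlier, with a matching
# six-character consensus line.
print(categorize_positions("-MNQEE", "  ::**"))
```

The resulting lists of residue numbers are the sort of thing that could then be formatted into selection commands for a molecular viewer.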
