# Demo of script to get sequence from multiFASTA file when description contains matching text

To demonstrate the use of my script `get_seq_from_multiFASTA_with_match_in_description.py`, I'll use it to collect protein sequences from a collection of PacBio-sequenced yeast genomes from [Yue et al 2017](https://www.ncbi.nlm.nih.gov/pubmed/28416820).

Reference for sequence data:  
[Contrasting evolutionary genome dynamics between domesticated and wild yeasts.
Yue JX, Li J, Aigrain L, Hallin J, Persson K, Oliver K, Bergström A, Coupland P, Warringer J, Lagomarsino MC, Fischer G, Durbin R, Liti G. Nat Genet. 2017 Jun;49(6):913-924. doi: 10.1038/ng.3847. Epub 2017 Apr 17. PMID: 28416820](https://www.ncbi.nlm.nih.gov/pubmed/28416820)

This is meant to represent a typical workflow where a combination of these steps might be used.

------

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

##  Basic use

This script gets a sequence from a file with multiple sequence entries in FASTA format (a.k.a. a multiFASTA file) if the description line of an entry contains a match to the provided text.
It returns only the first match it finds.
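
The matching rule just described can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the script's actual code (the script relies on `pyfaidx`, installed below); the function name `first_match` and the toy records are made up for the example.

```python
def first_match(fasta_text, text_to_match, case_sensitive=False):
    """Return (description, sequence) for the first record whose description
    line contains text_to_match, or None if no record matches."""
    needle = text_to_match if case_sensitive else text_to_match.lower()
    description, seq_lines, found = None, [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if found:
                break  # the record after the first match begins; stop here
            description = line[1:]
            haystack = description if case_sensitive else description.lower()
            found = needle in haystack
            seq_lines = []
        elif found:
            seq_lines.append(line.strip())
    return (description, "".join(seq_lines)) if found else None


# Toy multiFASTA text (hypothetical records, not from the Yue et al. data)
toy = ">geneA|first copy\nMKT\nLLV\n>geneB\nMAA\n>geneA|second copy\nMGG\n"
print(first_match(toy, "genea"))                       # ('geneA|first copy', 'MKTLLV')
print(first_match(toy, "genea", case_sensitive=True))  # None
```

Note that the second `geneA` record is never reached: once a match is found, the search stops at the next description line.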

Before moving on to a more representative situation, this section shows the basics of using the script on the command line. (On the 'proper' command line you wouldn't need the exclamation points I put in front of these commands to make them work in this notebook.)

In [1]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/get_seq_from_multiFASTA_with_match_in_description.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9941  100  9941    0     0  51507      0 --:--:-- --:--:-- --:--:-- 51507


In [2]:
#install a necessary dependency
!pip install pyfaidx

Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Building wheels for collected packages: pyfaidx
  Running setup.py bdist_wheel for pyfaidx ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2


#### Display USAGE block

In [3]:
!python get_seq_from_multiFASTA_with_match_in_description.py -h

usage: get_seq_from_multiFASTA_with_match_in_description.py
       [-h] [-cs] SEQUENCE_FILE TEXT_TO_MATCH

get_seq_from_multiFASTA_with_match_in_description.py takes any sequences in
FASTA format and gets the first sequence with a description line containing a
match to provided text string. For example, if provided a multi-sequence FASTA
file and a gene identifier, such as `YDL140C`, it will pull out the first
sequence matching that anywhere in description line. Defaults to ignoring
case. **** Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  SEQUENCE_FILE         Name of sequence file to search.
  TEXT_TO_MATCH         Text to match.

optional arguments:
  -h, --help            show this help message and exit
  -cs, --case_sensitive
                        Add this flag if you want to force matching to be
                        case-sensitive.


Next, we'll get example data and extract a sequence.

In [4]:
# Get a file of protein sequences
!curl -OL http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_PEP/DBVPG6044.mt.pep.fa.gz
!gunzip DBVPG6044.mt.pep.fa.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2253      0 --:--:-- --:--:-- --:--:--  2253
100  1413  100  1413    0     0   8028      0 --:--:-- --:--:-- --:--:-- 28260


In [5]:
# Extract
!python get_seq_from_multiFASTA_with_match_in_description.py DBVPG6044.mt.pep.fa cox1



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_cox1.fa'.
*****************DONE**************************


The next cell shows it worked.

In [6]:
!head seq_cox1.fa

>COX1|COX1.t01|COX1
MVQRWLYSTNAKDIAVLYFMLAIFSGMAGTAMSLIIRLELAAPGSQYLHGNSQLFNVLVTGHAVLMIFFLVMPALMGGFGNYLLPLMIGATDTAFPRINNIAFWVLPMGLVCLVTSTLVESGAGTGWTVYPPLSSIQAHSGPSVDLAIFALHLTSISSLLGAINFIVTTLNMRTNGMTMHKLPLFVWSIFITAFLLLLSLPVLSAGITMLLLDRNFNTSFFEVAGGGDPILYEHLFWFFGHPEVYILIIPGFGIISHVVSTYSKKPVFGEISMVYAMASIGLLGFLVWSHHMYIVGLDADTRAYFTSATMIIAIPTGIKIFSWLATIYGGSIRLATPMLYAIAFLFLFTMGGLTGVALANASLDVAFHDTYYVVGHFHYVLSMGAIFSLFAGYYYWSPQILGLNYNEKLAQIQFWLIFIGANVIFFPMHFLGINGMPRRIPDYPDAFAGWNYVASIGSFIATLSLFLFIYILYDQLVNGLNNKVNNKSVIYAKAPDFVESNTIFNLNTVKSSSIEFLLTSPPAVHSFNTPAVQS

There is an optional flag that can be added to make the matching case sensitive. (The default is to not be case sensitive.)

In [7]:
!python get_seq_from_multiFASTA_with_match_in_description.py DBVPG6044.mt.pep.fa cox1 -cs

**ERROR:No match to provided text found in description line for ANY sequence record.  ***ERROR*** 
EXITING.


In [8]:
!python get_seq_from_multiFASTA_with_match_in_description.py DBVPG6044.mt.pep.fa COX1 -cs



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_COX1.fa'.
*****************DONE**************************


## Use script in a Jupyter notebook to collect sequences from a series of PacBio-sequenced genomes

This section demonstrates importing the script's main function into a notebook in order to collect protein sequences from a collection of PacBio-sequenced yeast genomes from [Yue et al 2017](https://www.ncbi.nlm.nih.gov/pubmed/28416820).

This is meant to demonstrate using Python and the script to accomplish a series of steps.

In [9]:
# Get the script if the above section wasn't run
import os
file_needed = "get_seq_from_multiFASTA_with_match_in_description.py"
if not os.path.isfile(file_needed):
    !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/get_seq_from_multiFASTA_with_match_in_description.py

In [10]:
# Prepare for getting PacBio (Yue et al 2017) sequences
# Make a list of the strain designations
yue_et_al_strains = ["S288C","DBVPG6044","DBVPG6765","SK1","Y12",
                     "YPS128","UWOPS034614","CBS432","N44","YPS138",
                     "UFRJ50816","UWOPS919171"]

In [11]:
# Get & unpack protein sequences from strains 
for s in yue_et_al_strains:
    !curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_PEP/{s}.pep.fa.gz
    !gunzip -f {s}.pep.fa.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3708      0 --:--:-- --:--:-- --:--:--  3632
100 1653k  100 1653k    0     0  5661k      0 --:--:-- --:--:-- --:--:-- 5661k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3296      0 --:--:-- --:--:-- --:--:--  3358
100 1660k  100 1660k    0     0  3539k      0 --:--:-- --:--:-- --:--:-- 3539k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3358      0 --:--:-- --:--:-- --:--:--  3358
100 1655k  100 1655k    0     0  4160k      0 --:--:-- --:--:-- --:--:-- 4160k
  % Total    % Received % Xferd  Average Speed   Tim
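
Outside of a notebook, where the `!` shell escapes are unavailable, the same download-and-unpack loop could be written with the standard library alone. The following is a sketch under that assumption; the helper names `pep_url`, `gunzip`, and `fetch_strain` are hypothetical, and `fetch_strain` needs network access, so it is defined here but not called.

```python
import gzip
import os
import shutil
import urllib.request

BASE_URL = "http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_PEP/"

def pep_url(strain):
    """Build the download URL for one strain's protein FASTA archive."""
    return BASE_URL + strain + ".pep.fa.gz"

def gunzip(gz_path):
    """Decompress gz_path next to itself, remove the archive (like `gunzip -f`),
    and return the path of the decompressed file."""
    out_path = gz_path[:-len(".gz")]
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
    return out_path

def fetch_strain(strain, dest_dir="."):
    """Download and unpack one strain's protein sequences (requires network)."""
    gz_path = os.path.join(dest_dir, strain + ".pep.fa.gz")
    urllib.request.urlretrieve(pep_url(strain), gz_path)
    return gunzip(gz_path)
```

With these in place, `for s in yue_et_al_strains: fetch_strain(s)` would stand in for the two shell commands in the loop.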

With the protein sequences available, we are ready to step through each file and collect the protein sequence for a gene which we will designate with the [SGD](https://www.yeastgenome.org/) systematic identifier. Fortunately, Yue et al. had already annotated the description line of the protein encoding sequences with the identifier of the corresponding gene from the SGD reference sequence for S288C.
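
To make the annotation concrete, here is a small sketch that splits one description line from these files into its parts. It assumes a pipe-delimited layout of gene ID, transcript ID, and SGD identifier, which is inferred from the headers themselves rather than taken from any documentation by Yue et al.; the field names and the helper `parse_description` are hypothetical.

```python
from collections import namedtuple

# Field names are assumptions inferred from the header layout
Header = namedtuple("Header", ["gene_id", "transcript_id", "sgd_id"])

def parse_description(line):
    """Split a '>'-prefixed, pipe-delimited description line into fields."""
    fields = line.lstrip(">").strip().split("|")
    if len(fields) != 3:
        raise ValueError("expected 3 pipe-delimited fields, got %d" % len(fields))
    return Header(*fields)

header = parse_description(">CBS432_04G00980|CBS432_04T00980.1|YDL140C")
print(header.sgd_id)  # YDL140C
```

Because the script matches the text anywhere in the description line, no such parsing is needed to use it; the sketch only shows where the SGD identifier sits.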

One more preparation step is to bring the main function from `get_seq_from_multiFASTA_with_match_in_description.py` into the notebook environment. The next cell does that.

In [12]:
# bring the main function from `get_seq_from_multiFASTA_with_match_in_description.py` into the namespace
from get_seq_from_multiFASTA_with_match_in_description import get_seq_from_multiFASTA_with_match_in_description

In [13]:
# Use fnmatch to find match to the protein sequence file names 
# so only check in the peptide fasta files (skip the index files ending 
# in `.fai` and the mito sequence example from above)

gene_to_match = "YDL140C"


fn_to_check = "pep.fa" #part of file name in files to search
sequences = "" #initialize a string to collect sequences in

import os
import fnmatch
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+fn_to_check):
        if not file.endswith(".fai") and file != "DBVPG6044.mt.pep.fa":
            sequences += get_seq_from_multiFASTA_with_match_in_description(
                file,gene_to_match, return_record_as_string=True)
            sequences += "\n" # so the next entry is on a new line



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_YDL140C.fa'.
*****************DONE**************************


*****************DONE*****************

Let's save the file of the sequences because we'll probably want them.

In [14]:
%store sequences > "YDL140C_pacbio_orthologs.fa"

Writing 'sequences' (str) to file 'YDL140C_pacbio_orthologs.fa'.
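
`%store ... >` is an IPython magic, so it only works inside a notebook or an IPython session. In a plain Python script much the same result could be had with an ordinary file write. A minimal sketch, in which `sequences` is given a stand-in value (hypothetical records) and a different output name is used so the block runs by itself without touching the real file:

```python
# Stand-in for the `sequences` string built in the notebook (hypothetical records)
sequences = ">strainA_gene|strainA_tx|YDL140C\nMVGQ\n>strainB_gene|strainB_tx|YDL140C\nMVGQ\n"

out_name = "example_orthologs.fa"  # hypothetical file name for this sketch
with open(out_name, "w") as handle:
    handle.write(sequences)
```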


Let's check that:

In [15]:
!head -n 4 YDL140C_pacbio_orthologs.fa

>CBS432_04G00980|CBS432_04T00980.1|YDL140C
MVGQQYSSAPLRTVKEVQFGLFSPEEVRAISVAKIRFPETMDETQTRAKIGGLNDPRLGSIDRNLKCQTCQEGMNECPGHFGHIDLAKPVFHVGFIAKIKKVCECVCMHCGKLLLDEHNELMRQALAIKDSKKRFAAIWTLCKTKMVCETDVPSEDDPTQLVSRGGCGNTQPTVRKDGLKLVGSWKKDRASGDADEPELRVLSTEEILNIFKHISVKDFTSLGFNEVFSRPEWMILTCLPVPPPPVRPSISFNESQRGEDDLTFKLADILKANISLETLEHNGAPHHAIEEAESLLQFHVATYMDNDIAGQPQALQKSGRPVKSIRARLKGKEGRIRGNLMGKRVDFSARTVISGDPNLELDQVGVPKSIAKTLTYPEVVTPYNIDRLTQLVRNGPNEHPGAKYVIRDSGDRIDLRYSKRAGDIQLQYGWKVERHIMDNDPVLFNRQPSLHKMSMMAHRVKVIPYSTFRLNLSVTSPYNADFDGDEMNLHVPQSEETRAELSQLCAVPLQIVSPQSNKPCMGIVQDTLCGIRKLTLRDTFIELDQVLNMLYWVPDWDGVIPTPAIIKPKPLWSGKQILSVAIPNGIHLQRFDEGTTLLSPKDNGMLIIDGQIIFGVVEKKTVGSSNGGLIHVVTREKGPQVCAKLFGNIQKVVNFWLLHNGFSTGIGDTIADGPTMREITETIAEAKKKVLDVTKEAQANLLTAKHGMTLRESFEDNVVRFLNEARDKAGRLAEVNLKDLNNVKQMVMAGSKGSFINIAQMSACVGQQSVEGKRIAFGFVDRTLPHFSKDDYSPESKGFVENSYLRGLTPQEFFFHAMGGREGLIDTAVKTAETGYIQRRLVKALEDIMVHYDNTTRNSLGNVIQFIYGEDGMDAAHIEKQSLDTIGGSDTAFERRYRIDLLNTDHTLDPSLLESGSEILGDLKLQVLLDEEYKQLVKDRKFLREVFVDGEANWPLP

Note that `-n 4` shows only two sequences, because each record occupies two lines: a description line and a sequence line.
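
Since `head` only shows the start of the file, a fuller check is to count the description lines and confirm there is one record per strain. A sketch, using a short stand-in string in place of the real file contents (the helper `count_records` is hypothetical):

```python
def count_records(fasta_text):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

# Stand-in text: two records, each a description line plus one sequence line
example = ">strainA|tx1|YDL140C\nMVGQQYSS\n>strainB|tx1|YDL140C\nMVGQQYSA\n"
print(count_records(example))  # 2
```

In the notebook, `count_records(sequences)` would be expected to equal `len(yue_et_al_strains)`, which is 12, provided every strain's file contained a match.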

Save your file to your local machine.