# Demo of script to get specific subsequence from FASTA file

To demonstrate my script `extract_subsequence_from_FASTA.py`, I'll use it to extract subsequences from a collection of sequences derived from PacBio-sequenced yeast genomes from [Yue et al., 2017](https://www.ncbi.nlm.nih.gov/pubmed/28416820).

Reference for sequence data:  
[Contrasting evolutionary genome dynamics between domesticated and wild yeasts.
Yue JX, Li J, Aigrain L, Hallin J, Persson K, Oliver K, Bergström A, Coupland P, Warringer J, Lagomarsino MC, Fischer G, Durbin R, Liti G. Nat Genet. 2017 Jun;49(6):913-924. doi: 10.1038/ng.3847. Epub 2017 Apr 17. PMID: 28416820](https://www.ncbi.nlm.nih.gov/pubmed/28416820)


------

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

---

##  Basic use

This script extracts a subsequence from a sequence file in FASTA format. The file can contain a single sequence or several (a multi-FASTA file). You provide an identifier to specify which sequence in the multi-FASTA file to extract from. You always need to supply something for the identifier parameter when calling the script or its main function, but that text can be nonsensical if the file contains only one sequence, because the identifier is disregarded in that case.  
The only other requirement is a start and an end position to specify the subsequence. Positions use the usual convention in which the first residue is numbered one.
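
To make the numbering convention concrete, here is a minimal sketch of how a one-based range such as `101-200` maps onto Python's zero-based, end-exclusive slicing. It assumes the end position is inclusive, which is what the 100-residue outputs for `101-200` and `201-300` in this notebook suggest. The helper name `slice_one_based` is hypothetical and is not part of the script.

```python
def slice_one_based(sequence, start, end):
    """Return residues start..end of sequence, counting from one, end inclusive."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError(f"invalid range {start}-{end} for length {len(sequence)}")
    # One-based inclusive start..end corresponds to zero-based [start - 1:end].
    return sequence[start - 1:end]


toy = "ABCDEFGHIJ"  # toy sequence, not real data
first_three = slice_one_based(toy, 1, 3)
middle = slice_one_based(toy, 4, 6)
print(first_three, middle)
```

With this convention the length of the extracted piece is always `end - start + 1`.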



In [1]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/extract_subsequence_from_FASTA.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16612  100 16612    0     0   197k      0 --:--:-- --:--:-- --:--:--  197k


In [2]:
# Install a necessary dependency
!pip install pyfaidx

Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Building wheels for collected packages: pyfaidx
  Running setup.py bdist_wheel for pyfaidx ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2


#### Display USAGE block

In [3]:
!python extract_subsequence_from_FASTA.py -h

usage: extract_subsequence_from_FASTA.py [-h] [-uc] [-kd] [-os OUTPUT_SUFFIX]
                                         SEQUENCE_FILE RECORD_ID START_and_END

extract_subsequence_from_FASTA.py takes a sequence file (FASTA-format), an
identifier, and start and end range and extracts a sub sequence covering that
region from the matching sequence in the provided FASTA file. Produces the
sequence in FASTA format. (The FASTA-formatted sequence file is assumed by
default to be a multi-FASTA, i.e., multiple sequences in the provided file,
although it definitely doesn't have to be. In case it is only a single
sequence, the record id becomes moot, see below.) **** Script by Wayne Decatur
(fomightez @ github) ***

positional arguments:
  SEQUENCE_FILE         Name of sequence file to use as input. Must be FASTA
                        format. Can be a multi-FASTA file, i.e., multiple
                        sequences in FASTA format in one file.
  RECORD_ID             Specific identifier of sequ

Next, we'll get example data and extract a sequence.

In [4]:
# Get files of sequences
!curl -OL http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/SK1.genome.fa.gz
!gunzip SK1.genome.fa.gz
!curl -OL http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_CDS/SK1.cds.fa.gz
!gunzip SK1.cds.fa.gz
!curl -o S288C_YOR270C_VPH1_protein.fsa https://gist.githubusercontent.com/fomightez/f46b0624f1d8e3abb6ff908fc447e63b/raw/7ef7cfdaa2c9f9974f22fd60be3cfe7d1935cd86/ux_S288C_YOR270C_VPH1_protein.fsa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   5085      0 --:--:-- --:--:-- --:--:--  5235
100 3406k  100 3406k    0     0  15.4M      0 --:--:-- --:--:-- --:--:-- 15.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   5235      0 --:--:-- --:--:-- --:--:--  5393
100 2445k  100 2445k    0     0  12.0M      0 --:--:-- --:--:-- --:--:-- 49.2M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   886  100   886    0     0   7572      0 --:--:-- --:--:-- --:--:--  7572


In [5]:
# Extract
!python extract_subsequence_from_FASTA.py SK1.genome.fa chrIII 101-200



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedchrIII.fa'.
*****************DONE**************************


The next cell shows it worked.

In [6]:
!head seq_extractedchrIII.fa

>chrIII:101-200
CCCACACACACACACCACACCACACCACACCCACACACCACACCCACACACACACCACACCCACACACACCACACCCACACCCACACCACACCCACCACA

Sometimes the record identifier contains characters that the shell treats specially. Here the identifier includes `|`, which the shell interprets as a pipe, so the first cell below fails to run. Wrapping the identifier in quotes, as in the cell after that one, allows it to work.

In [7]:
!python extract_subsequence_from_FASTA.py SK1.cds.fa SK1_01G00030|SK1_01T00030.1|YHR213W 1-50

/bin/sh: 1: SK1_01T00030.1: not found
/bin/sh: 1: YHR213W: not found
usage: extract_subsequence_from_FASTA.py [-h] [-uc] [-kd] [-os OUTPUT_SUFFIX]
                                         SEQUENCE_FILE RECORD_ID START_and_END
extract_subsequence_from_FASTA.py: error: the following arguments are required: START_and_END


In [8]:
!python extract_subsequence_from_FASTA.py SK1.cds.fa "SK1_01G00030|SK1_01T00030.1|YHR213W" 1-50



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedSK1_01G00030|SK1_01T00030.1|YHR213W.fa'.
*****************DONE**************************


In [9]:
!head "seq_extractedSK1_01G00030|SK1_01T00030.1|YHR213W.fa"

>SK1_01G00030|SK1_01T00030.1|YHR213W:1-50
ATGACAGGTTACTTTTTACCACCACAAACAAGTTCTTACACGTTCAGGTT
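
The quoting fix above works because `|` is the shell's pipe character, so an unquoted identifier gets split into separate commands. If you build such command lines from Python, the standard library's `shlex.quote` can add the quoting for you. This is only a sketch of that idea; the variable names are illustrative.

```python
import shlex

record_id = "SK1_01G00030|SK1_01T00030.1|YHR213W"

# shlex.quote wraps the text in single quotes when it contains shell-special characters.
quoted_id = shlex.quote(record_id)
command = f"python extract_subsequence_from_FASTA.py SK1.cds.fa {quoted_id} 1-50"
print(command)

# Splitting the command the way a POSIX shell would recovers the identifier intact.
recovered = shlex.split(command)[3]
```

In a notebook cell the resulting string could be run with `!{command}`.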

There is an optional flag, `--keep_description` (short form `-kd`), that makes the result keep the entire description line.

In [10]:
!python extract_subsequence_from_FASTA.py S288C_YOR270C_VPH1_protein.fsa VPH1 201-300 --keep_description

Single sequence with id of 'VPH1' provided in the sequence file.
It will be used as the source of the sequence covering the provided positions.



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedVPH1.fa'.
*****************DONE**************************


In [11]:
!head seq_extractedVPH1.fa

>VPH1 YOR270C SGDID:S000005796:201-300
QILWRVLRGNLFFKTVEIEQPVYDVKTREYKHKNAFIVFSHGDLIIKRIRKIAESLDANLYDVDSSNEGRSQQLAKVNKNLSDLYTVLKTTSTTLESELY

Note that if it had been run without the flag, this would have been the result.

In [12]:
!python extract_subsequence_from_FASTA.py S288C_YOR270C_VPH1_protein.fsa VPH1 201-300

Single sequence with id of 'VPH1' provided in the sequence file.
It will be used as the source of the sequence covering the provided positions.



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedVPH1.fa'.
*****************DONE**************************


In [13]:
!head seq_extractedVPH1.fa

>VPH1:201-300
QILWRVLRGNLFFKTVEIEQPVYDVKTREYKHKNAFIVFSHGDLIIKRIRKIAESLDANLYDVDSSNEGRSQQLAKVNKNLSDLYTVLKTTSTTLESELY

The description line without the `-kd` flag ends at the point the first space occurs in the original description line of the single entry in `S288C_YOR270C_VPH1_protein.fsa`.

In [14]:
!head -n 1 S288C_YOR270C_VPH1_protein.fsa

>VPH1 YOR270C SGDID:S000005796
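
The two header styles seen above can be reproduced with a few lines of Python. This is a re-creation of the observed output, not the script's actual code, and the function name `build_header` is hypothetical.

```python
def build_header(description_line, start, end, keep_description=False):
    """Mimic the output headers shown above for a given source description line."""
    text = description_line.lstrip(">").strip()
    # Without the flag, only the text before the first whitespace is kept.
    label = text if keep_description else text.split()[0]
    return f">{label}:{start}-{end}"


source_line = ">VPH1 YOR270C SGDID:S000005796"
short_header = build_header(source_line, 201, 300)
full_header = build_header(source_line, 201, 300, keep_description=True)
print(short_header)
print(full_header)
```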


Note that because `S288C_YOR270C_VPH1_protein.fsa` contains only one sequence, we can enter any arbitrary text for the record identifier argument, as it is disregarded.

In [15]:
%run extract_subsequence_from_FASTA.py S288C_YOR270C_VPH1_protein.fsa SUPER_DUPER_NONSENSE 501-800

Single sequence with id of 'VPH1' provided in the sequence file.
It will be used as the source of the sequence covering the provided positions.



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedVPH1.fa'.
*****************DONE**************************


## Use script in a Jupyter notebook 

This will demonstrate importing the main function into a notebook.


In [16]:
# Get the script if the above section wasn't run
import os
file_needed = "extract_subsequence_from_FASTA.py"
if not os.path.isfile(file_needed):
    !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/extract_subsequence_from_FASTA.py

Let's make the example `!python extract_subsequence_from_FASTA.py SK1.genome.fa chrIII 101-200` from above work here to get the result directly into the notebook environment as a Python object.

First we need to import the main function from the script file.

In [17]:
from extract_subsequence_from_FASTA import extract_subsequence_from_FASTA

Now to try using that.

In [18]:
result = extract_subsequence_from_FASTA("SK1.genome.fa", "chrIII",region_str="101-200", return_record_as_string=True)



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedchrIII.fa'.
*****************DONE**************************


We now have the record as `result` and can use it in further steps. For example, we can parse the returned Python string to get the last ten residues.

In [19]:
end_of_seq = result.split()[1][-10:]
end_of_seq

'ACCCACCACA'
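
The `result.split()[1]` shortcut relies on the header containing no spaces and the sequence sitting on a single line. When `keep_description=True` is used, the header does contain spaces, so a line-based split is safer. Below is a minimal sketch; `split_fasta_record` is a hypothetical helper, and the example record is a shortened, artificially wrapped version of the VPH1 output, constructed only for illustration.

```python
def split_fasta_record(record):
    """Split a single FASTA record string into (header, sequence)."""
    lines = record.strip().splitlines()
    header = lines[0].lstrip(">")
    # Join all remaining lines so wrapped sequences are handled too.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


# Illustrative record: header with spaces, sequence wrapped over two lines.
example = ">VPH1 YOR270C SGDID:S000005796:201-300\nQILWRVLRGN\nLFFKTVEIEQ"
header, sequence = split_fasta_record(example)
last_five = sequence[-5:]
print(header)
print(last_five)
```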

Note that the `keep_description` option can also be used here; set it to `True` when calling the main function.

In [20]:
fasta = extract_subsequence_from_FASTA("S288C_YOR270C_VPH1_protein.fsa", "blahBLAHblah",region_str="201-300", keep_description = True, return_record_as_string=True)
fasta

Single sequence with id of 'VPH1' provided in the sequence file.
It will be used as the source of the sequence covering the provided positions.



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedVPH1.fa'.
*****************DONE**************************


'>VPH1 YOR270C SGDID:S000005796:201-300\nQILWRVLRGNLFFKTVEIEQPVYDVKTREYKHKNAFIVFSHGDLIIKRIRKIAESLDANLYDVDSSNEGRSQQLAKVNKNLSDLYTVLKTTSTTLESELY'

The next cell shows that result without the flag.

In [21]:
fasta = extract_subsequence_from_FASTA("S288C_YOR270C_VPH1_protein.fsa", "blahBLAHblah",region_str="201-300", return_record_as_string=True)
fasta

Single sequence with id of 'VPH1' provided in the sequence file.
It will be used as the source of the sequence covering the provided positions.



*****************DONE**************************
Extracted sequence saved in FASTA format as 'seq_extractedVPH1.fa'.
*****************DONE**************************


'>VPH1:201-300\nQILWRVLRGNLFFKTVEIEQPVYDVKTREYKHKNAFIVFSHGDLIIKRIRKIAESLDANLYDVDSSNEGRSQQLAKVNKNLSDLYTVLKTTSTTLESELY'

If you want to save an object from within the notebook, you can do that too.

In [22]:
%store end_of_seq > end.fa

Writing 'end_of_seq' (str) to file 'end.fa'.


Let's check that:

In [23]:
!head end.fa

ACCCACCACA
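
Note that `end.fa` as written above holds only the bare residues, with no `>` header line. If you want a complete FASTA record instead, plain Python file writing is one option. The sketch below assumes a hypothetical helper `save_fasta` and writes to a temporary directory; the header `chrIII:191-200` is my own label for the last ten residues of the `101-200` extract.

```python
import tempfile
from pathlib import Path


def save_fasta(path, header, sequence, width=60):
    """Write one FASTA record, wrapping the sequence at `width` characters."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    text = f">{header}\n" + "\n".join(wrapped) + "\n"
    Path(path).write_text(text)
    return text


# Write into a temporary directory so no existing file is overwritten.
out_path = Path(tempfile.mkdtemp()) / "end_with_header.fa"
written = save_fasta(out_path, "chrIII:191-200", "ACCCACCACA")
print(written, end="")
```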


Save your file to your local machine.

Enjoy!