# Demonstration of `get_specified_length_of_end_of_seq_from_FASTA.py` script

If you'd like an active Jupyter session to run this notebook, launch one by clicking [here](https://mybinder.org/v2/gh/fomightez/clausen_ribonucleotides/master), and then upload this notebook to the session that starts up.  
Otherwise, the static version is rendered more nicely via [here](https://nbviewer.jupyter.org/github/fomightez/sequencework/blob/master/Extract_from_FASTA/demo%20get_seq_following_seq_from_FASTA.ipynb).

----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

You'll need the current version of the script to run this notebook, and the next cell will get that. (Remember, if you want to make things more reproducible when you use the script with your own data, you'll want to edit calls such as this to fetch a specific version of the script. How to do this is touched upon in the comment below [here](https://stackoverflow.com/a/48587645/8508004).)

In [1]:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_from_FASTA/get_specified_length_of_end_of_seq_from_FASTA.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11812  100 11812    0     0  78746      0 --:--:-- --:--:-- --:--:-- 78746
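
As a minimal sketch of the reproducibility point above: raw GitHub URLs accept a branch name, tag, or commit SHA in the position where `master` appears, so a pinned URL can be built by swapping that one path component. The helper name `raw_script_url` and the placeholder `<commit-sha>` are hypothetical and only for illustration; substitute a real commit identifier from the repository history.

```python
RAW_BASE = "https://raw.githubusercontent.com"
SCRIPT_PATH = "Extract_from_FASTA/get_specified_length_of_end_of_seq_from_FASTA.py"


def raw_script_url(ref="master"):
    """Build the raw download URL for the script at a given git ref.

    `ref` may be a branch name, a tag, or a full commit SHA.
    (Hypothetical helper, not part of the script or repository.)
    """
    return "{}/fomightez/sequencework/{}/{}".format(RAW_BASE, ref, SCRIPT_PATH)


# Current version, same URL as used in the cell above.
print(raw_script_url())

# Pinned version: replace the placeholder with a real commit SHA.
print(raw_script_url("<commit-sha>"))
```

The resulting URL can then be used in place of the one in the `!curl -O` call.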


## Display Usage / Help Block

In [2]:
%run get_specified_length_of_end_of_seq_from_FASTA.py -h

usage: get_specified_length_of_end_of_seq_from_FASTA [-h]
                                                     SEQUENCE_FILE RECORD_ID
                                                     NUMBER_TO_GET

get_specified_length_of_end_of_seq_from_FASTA takes a sequence file (FASTA-
format), & a record id, and a number (integer), and extracts a sequence of
specified length from the end of the indicated sequence. The number provided
is what specifies the length extracted. (The FASTA-formatted sequence file is
assumed by default to be a multi-FASTA, i.e., multiple sequences in the
provided file, although it definitely doesn't have to be. In case it is only a
single sequence, the record id becomes moot, see below.) A sequence string of
the specified length will be returned. Redirect the output to a file if that
is what is needed. **** Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  SEQUENCE_FILE  Name of sequence file to use as input. Must be FASTA format.
           

To read more about this script beyond that and what is covered below, see [here](https://github.com/fomightez/sequencework/tree/master/Extract_from_FASTA).
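
To make the description in the help text concrete, here is a rough pure-Python sketch of the idea: read the records, pick one by id (or ignore the id when there is only one record), and slice off the end. This is an illustration under stated assumptions and not the script's actual code, which may use a parsing library and differ in detail. The names `read_fasta` and `get_end_of_record` are hypothetical.

```python
import os
import tempfile


def read_fasta(path):
    """Return a list of (record_id, sequence) tuples from a FASTA file."""
    records = []
    current_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if current_id is not None:
                    records.append((current_id, "".join(chunks)))
                # Assumption: the id is the first word of the header line.
                header = line[1:].strip()
                current_id = header.split()[0] if header else ""
                chunks = []
            elif current_id is not None:
                chunks.append(line)
    if current_id is not None:
        records.append((current_id, "".join(chunks)))
    return records


def get_end_of_record(path, record_id, number_to_get):
    """Return the last `number_to_get` residues of the chosen record."""
    records = read_fasta(path)
    if len(records) == 1:
        seq = records[0][1]  # only one record, so the id is moot
    else:
        matches = [seq for rid, seq in records if rid == record_id]
        if not matches:
            raise ValueError("No record with id {!r}".format(record_id))
        seq = matches[0]
    if number_to_get <= 0:
        return ""
    # An oversized request simply yields the whole sequence.
    return seq[-number_to_get:]


# Small demonstration with a throwaway two-record file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa")
with open(demo_path, "w") as handle:
    handle.write(">rec_A\nAAAACCCC\nGGGG\n>rec_B\nTTTTTTTTAC\n")

print(get_end_of_record(demo_path, "rec_A", 6))   # CCGGGG
print(get_end_of_record(demo_path, "rec_B", 50))  # TTTTTTTTAC
```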

-----

## Basic use example set #1: Using from the command line (or equivalent / similar)

### Preparing for usage example

The following cell will generate a set of mock sequences in `mock_seqs.fa`. (This makes use of Maurice HT Ling's [randomseq.py](http://www.medcrave.com/articles/det/15292/Randomseq-python-command-ndash-line-random-sequence-generator).)

In [3]:
# get a script and use it to generate a mock sequence file so that we can collect the end of one of the sequences

#!curl -O https://raw.githubusercontent.com/amarallab/NullSeq/master/nullseq.py
#!curl -O https://raw.githubusercontent.com/amarallab/NullSeq/master/NullSeq_Functions.py
#%run nullseq.py -l 436
!pip install fire
!curl -O https://raw.githubusercontent.com/mauriceling/bactome/master/randomseq.py
#!python randomseq.py FLS -- --help #???<-- That didn't match the typical Python help syntax I was familiar with, but it is noted at https://github.com/google/python-fire
!python randomseq.py FLS --length=706 --n=3 > mock_seqs.fa

Collecting fire
  Downloading https://files.pythonhosted.org/packages/5a/b7/205702f348aab198baecd1d8344a90748cb68f53bdcd1cc30cbc08e47d3e/fire-0.1.3.tar.gz
Building wheels for collected packages: fire
  Building wheel for fire (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/2a/1a/4d/6b30377c3051e76559d1185c1dbbfff15aed31f87acdd14c22
Successfully built fire
Installing collected packages: fire
Successfully installed fire-0.1.3
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25229  100 25229    0     0   241k      0 --:--:-- --:--:-- --:--:--  241k


In [4]:
!head mock_seqs.fa

> Test_1
AAACGACGGTCAACACAAAACAACGCCAATCTCGAAGCCGTATCACCACAGGCGAAATTCGGCACGAGATCGACAGATACATCGAGGTGCCTCGGTCAGACTCATCAGTGTTCTCGCCAGTTTTCGCCGTCACGGAAGATACATCGATTTTCGCCACCGGGTCACGGGACACCAGAACTTTTCAAGTGGCGGTCCGATCGCTCAGGTTCGGCACCCTTTCATCACAATCTCACACCATCCGCACCAAGCGGGGAGACGCGCGGCGATCCACGGACGACCTCCTCCCCCGGTTTACCTCAGTCCAATACGACTCCGTCCCCGTGTGTGGGCTCACTATCCGTATTCTTCATCGCTACAGGTACCCGGAGATTTCGGTCCGCGTTCCGGCGGTTCGGAAACCCGTATTACGCGCAAAGGGGCATTCCGCCCCGGTCGCGCTCCCCGCAGGTTACCATCGCATCAAACGCCTCAAGTTCATTCCACACTTCTTCCAGAAACTCGGAAGATCCAGGCTATATTTCTTCAAAACCGTCTTACTTTCCGCACGAGATCGTCTTTTCTTTTCTTTTTCAGAACATCGCATCCGAATTACATATCCCAAAATCAAGATATCCAGGACCAAGTTTCAAGTACCGCTATCATCAGTTTTATTCTACCCAAAGGTGGTGCGAGAAAGTACCCGTCCAAAAGACATACAGCGTATACG
> Test_2
ATTTCCATACGATCCGATTTTCGCATTACCTACCCGAGATCGGAACTTACGCGTGGAGCACGATCAGACCTTTACTACTACAATCCGGCACGAACGTCAGCGTACGAGATTCAGGAAAGGCATTCAACATCAATATATTCCCGGGGCAGTTCACCTCAGGGGACCGACGAGAGCGAAAAACCAAAGTTCAAAGCTCCCAGGTCTCAACTTCCATTATTACTCCACGCGCACCGGCCACTCGGGTCATACCCATATACCCCGGACTACACAAAAAG
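
If `randomseq.py` or `fire` cannot be fetched, a comparable mock file can be produced with the standard library alone. The sketch below is an assumption-laden stand-in rather than a reimplementation of `randomseq.py`: it draws each base uniformly from `ACGT`, writes headers without a space after `>`, and uses the hypothetical function name `write_mock_fasta` and a file name of my own choosing.

```python
import os
import random
import tempfile


def write_mock_fasta(path, n_records=3, length=706, seed=None):
    """Write `n_records` random DNA sequences of `length` residues to `path`.

    Record ids follow the Test_1, Test_2, ... pattern seen above.
    (Hypothetical helper; uniform base frequencies are an assumption.)
    """
    rng = random.Random(seed)
    with open(path, "w") as handle:
        for i in range(1, n_records + 1):
            seq = "".join(rng.choice("ACGT") for _ in range(length))
            handle.write(">Test_{}\n{}\n".format(i, seq))


# Written to a temporary directory here so nothing in the session is overwritten.
mock_path = os.path.join(tempfile.mkdtemp(), "mock_seqs_alt.fa")
write_mock_fasta(mock_path, n_records=3, length=706, seed=0)
```

Passing a fixed `seed` makes the mock file repeatable between runs, which is handy when sharing a demonstration.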

### Run the script

In [5]:
%%bash
python get_specified_length_of_end_of_seq_from_FASTA.py mock_seqs.fa Test_2 90

CGTGGCCTCAGCGACTACGAGCCGACGCGGCCTACGGCGATATCAGAGAGGGGCCGAACCGCGCCCAAAACCTCGCTTCGGGAGTCAAGA


In the above cell and elsewhere in this notebook, `%%bash` cell magic is used to send the cell's contents to the shell to run as if on the command line.

You could simply run something like `python get_specified_length_of_end_of_seq_from_FASTA.py mock_seqs.fa Test_2 90` if you are working on the command line directly. In fact, the terminal is available from the Jupyter dashboard (or from the JupyterLab launcher) and you can feel free to try running the command below in a terminal in this Jupyter session if you'd like.

    python get_specified_length_of_end_of_seq_from_FASTA.py mock_seqs.fa Test_2 90


Another example of using the script is in the cell below. This time the stderr stream shows some feedback, highlighted in pink.

In [6]:
%%bash
python get_specified_length_of_end_of_seq_from_FASTA.py mock_seqs.fa Test_2 887

ATTTCCATACGATCCGATTTTCGCATTACCTACCCGAGATCGGAACTTACGCGTGGAGCACGATCAGACCTTTACTACTACAATCCGGCACGAACGTCAGCGTACGAGATTCAGGAAAGGCATTCAACATCAATATATTCCCGGGGCAGTTCACCTCAGGGGACCGACGAGAGCGAAAAACCAAAGTTCAAAGCTCCCAGGTCTCAACTTCCATTATTACTCCACGCGCACCGGCCACTCGGGTCATACCCATATACCCCGGACTACACAAAAAGGACAGAGCAAAGACTCCGTCGATCATACAACGTTACTCCTCCAACTCGCTTCCTATACAACTCAACGGGGCTCGGCAATCCCCCGTGCCCGACCTCCCCGATCACCTCCATCAGGCTTCGGCCTCGGTCAGCTCAAACCGCGCAGCATCGGACAAAAAACCGCAAACATATCCCAGAGGTTTATCGGGCACTCCGGACGCAGAAAACGTACTCTCCGGTGTGCAAAAACATTCGGCGATACTACGGCCATTCTCCATCGGATTACCGAGCACAGGGCCATTCAAGTCAGGGGGCGCCAATCGGTCGGAGAATTTTTCCCAAGAGTCACTACCGCATCCTCTCGTGGCCTCAGCGACTACGAGCCGACGCGGCCTACGGCGATATCAGAGAGGGGCCGAACCGCGCCCAAAACCTCGCTTCGGGAGTCAAGA


Note that the sepecified number of residues to get, 887, exceeds the length of the specified record, which is 706 residues in length.
The entire sequences has been returned.
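
Returning the whole sequence for an oversized request mirrors how Python string slicing behaves. Whether the script relies on plain slicing internally is an assumption on my part; the snippet below only illustrates the language behavior, including one edge case worth guarding against when writing similar code. The helper name `tail` is hypothetical.

```python
seq = "ACGTACGT"

# A negative start index larger than the string length is clamped,
# so the whole string comes back rather than an error.
print(seq[-3:])    # CGT
print(seq[-20:])   # ACGTACGT

# Edge case: -0 is just 0, so asking for "the last 0 residues" via
# seq[-0:] returns everything instead of an empty string.
print(seq[-0:])    # ACGTACGT


def tail(sequence, n):
    """Return the last n residues, treating n <= 0 as 'nothing'."""
    return sequence[-n:] if n > 0 else ""


print(repr(tail(seq, 0)))  # ''
```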

You may wish to redirect the output sequence text to a file. The next cell demonstrates that, and the one after it shows it worked by displaying the generated file.

In [7]:
%%bash
python get_specified_length_of_end_of_seq_from_FASTA.py mock_seqs.fa Test_2 887 > redirect_test.fa

Note that the sepecified number of residues to get, 887, exceeds the length of the specified record, which is 706 residues in length.
The entire sequences has been returned.

In [8]:
!head redirect_test.fa

ATTTCCATACGATCCGATTTTCGCATTACCTACCCGAGATCGGAACTTACGCGTGGAGCACGATCAGACCTTTACTACTACAATCCGGCACGAACGTCAGCGTACGAGATTCAGGAAAGGCATTCAACATCAATATATTCCCGGGGCAGTTCACCTCAGGGGACCGACGAGAGCGAAAAACCAAAGTTCAAAGCTCCCAGGTCTCAACTTCCATTATTACTCCACGCGCACCGGCCACTCGGGTCATACCCATATACCCCGGACTACACAAAAAGGACAGAGCAAAGACTCCGTCGATCATACAACGTTACTCCTCCAACTCGCTTCCTATACAACTCAACGGGGCTCGGCAATCCCCCGTGCCCGACCTCCCCGATCACCTCCATCAGGCTTCGGCCTCGGTCAGCTCAAACCGCGCAGCATCGGACAAAAAACCGCAAACATATCCCAGAGGTTTATCGGGCACTCCGGACGCAGAAAACGTACTCTCCGGTGTGCAAAAACATTCGGCGATACTACGGCCATTCTCCATCGGATTACCGAGCACAGGGCCATTCAAGTCAGGGGGCGCCAATCGGTCGGAGAATTTTTCCCAAGAGTCACTACCGCATCCTCTCGTGGCCTCAGCGACTACGAGCCGACGCGGCCTACGGCGATATCAGAGAGGGGCCGAACCGCGCCCAAAACCTCGCTTCGGGAGTCAAGA


(The cell above uses another Jupyter notebook / IPython trick to send a command to the command line: anything on a line after an exclamation point `!` is executed by the system shell. However, when I ran the script that way, e.g., `!python get_specified_length_of_end_of_seq_from_FASTA.py mock_seqs.fa Test_2 887`, the stderr output did not get the special display formatting that it gets with the `%%bash` cell magic. Hence, I used `%%bash` in the demo when calling the script.)

Note that the redirection operator was used just above in a way that only sent the stdout stream to the file. You can adapt that further as you see fit; more about redirect options can be found [here](https://www.brianstorti.com/understanding-shell-script-idiom-redirect/).
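
If you would rather keep the two streams apart from within Python instead of using shell redirection, the standard library's `subprocess.run` can capture stdout and stderr separately. The sketch below uses a tiny inline stand-in command so that it runs anywhere; the helper name `run_and_split` is hypothetical, and the commented-out call shows how it might be pointed at the script, assuming the script and `mock_seqs.fa` are in the working directory.

```python
import subprocess
import sys


def run_and_split(cmd):
    """Run `cmd` (a list of strings) and return (stdout, stderr) as text.

    check=True raises CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout, result.stderr


# Stand-in command that writes one line to each stream.
stand_in = [
    sys.executable,
    "-c",
    "import sys; print('ACGT'); print('note for the user', file=sys.stderr)",
]
out, err = run_and_split(stand_in)
print("stdout:", out.strip())
print("stderr:", err.strip())

# With the real script (assumes the files exist in the working directory):
# out, err = run_and_split([sys.executable,
#     "get_specified_length_of_end_of_seq_from_FASTA.py",
#     "mock_seqs.fa", "Test_2", "887"])
```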



*Remember you can dispense with providing an actual record id if there is only one record.*

The next cell makes a file with only one record so that can be demonstrated.

In [9]:
#make a mock 'single sequence' file
!pip install fire
import os
script_needed = "randomseq.py"
if not os.path.isfile(script_needed):
    !curl -O https://raw.githubusercontent.com/mauriceling/bactome/master/randomseq.py
!python randomseq.py FLS --length=315 --n=1 > single_sequence.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25229  100 25229    0     0   150k      0 --:--:-- --:--:-- --:--:--  149k


Now when you invoke the script, you still have to provide *something* for the record identifier, but it can be any string. In the example below, `moot` is used. It is completely irrelevant, but the 'placeholder' makes the command have all the parts needed.

In [10]:
%%bash
python get_specified_length_of_end_of_seq_from_FASTA.py single_sequence.fa moot 87

TCCCTACCGCGCTTCCACCACTACGTATCGGCGCGTGCAAAGGAACGCAAGACTACGAACGCCCACTCCATTCACGGGCACGCGATT


Single sequence with id of 'Test_1' provided in the sequence file.
It will be used to extract the last 87 characters in it.



If you are used to using Jupyter notebooks, you can use `%run` instead of `python get_specified_length_of_end_of_seq_from_FASTA.py single_sequence.fa moot 87` to get the same result.

In [11]:
%run get_specified_length_of_end_of_seq_from_FASTA.py single_sequence.fa moot 87

TCCCTACCGCGCTTCCACCACTACGTATCGGCGCGTGCAAAGGAACGCAAGACTACGAACGCCCACTCCATTCACGGGCACGCGATT


Single sequence with id of 'Test_1' provided in the sequence file.
It will be used to extract the last 87 characters in it.



However, you cannot simply add the shell redirection operator, `>`, to commands using `%run`. In the Jupyter notebook environment, `%run` hands things to IPython and not to the command line, so the redirect operator is not available.

To do the equivalent, you can use the `%%capture` cell magic to make the output a Python object, which you can then direct Python to save to a file. The idea is that having the output as a Python object in the notebook namespace gives you more options out of the gate than having the output go straight into a file. The cells that end this section illustrate this.

In [12]:
%%capture cell_output
%run get_specified_length_of_end_of_seq_from_FASTA.py single_sequence.fa moot 87

In [13]:
cell_output.stdout

'TCCCTACCGCGCTTCCACCACTACGTATCGGCGCGTGCAAAGGAACGCAAGACTACGAACGCCCACTCCATTCACGGGCACGCGATT\n'

In [14]:
cell_output.stderr

"Single sequence with id of 'Test_1' provided in the sequence file.\nIt will be used to extract the last 87 characters in it.\n\n"

In [15]:
curious_seq = 'TCGCT'
if curious_seq in cell_output.stdout:
    print("The sequence {} occurs in {}.".format(curious_seq, cell_output.stdout))

In [16]:
#save to a file
%store cell_output.stdout > py_out.fa

Writing 'cell_output.stdout' (str) to file 'py_out.fa'.


In [17]:
# demonstrate the file saving worked
!head py_out.fa

TCCCTACCGCGCTTCCACCACTACGTATCGGCGCGTGCAAAGGAACGCAAGACTACGAACGCCCACTCCATTCACGGGCACGCGATT
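
The `%%capture` and `%store` magics only exist inside IPython / Jupyter. In a plain Python script, a roughly equivalent pattern is to capture printed output with `contextlib.redirect_stdout` and write the string to a file yourself. The following is a minimal sketch; `print_sequence` is a hypothetical stand-in for whatever code prints the sequence, and the output file name is arbitrary.

```python
import contextlib
import io
import tempfile
from pathlib import Path


def print_sequence():
    # Hypothetical stand-in for code that prints a sequence to stdout.
    print("TCCCTACCGCG")


# Capture everything printed to stdout inside the with-block.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_sequence()
captured = buffer.getvalue()

# The captured text is an ordinary string, so it can be inspected first ...
if "CCGCG" in captured:
    print("Motif found in captured output.")

# ... and then saved to a file with ordinary file handling.
out_path = Path(tempfile.mkdtemp()) / "py_out_plain.fa"
out_path.write_text(captured)
```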


------

## Basic use example set #2: Use the main function via import

This is very useful when working in a Jupyter notebook to build the script into a pipeline or workflow.

Prepare first by importing the main function from the script into the notebook environment.

In [18]:
from get_specified_length_of_end_of_seq_from_FASTA import get_specified_length_of_end_of_seq_from_FASTA

(That call will look redundant; however, it actually means `from the file get_specified_length_of_end_of_seq_from_FASTA.py import the get_specified_length_of_end_of_seq_from_FASTA() function`.)

Then call that function and provide the needed arguments in the call. The needed arguments are the `sequence file`, `record id` of the specific sequence to mine, and the `number of residues` to get at the end of the sequence.

The function will return the resulting sequence text as a string, and so the function call should be assigned to a variable in order to handle the output of the function subsequently as desired.

In [19]:
end_seq = get_specified_length_of_end_of_seq_from_FASTA("mock_seqs.fa", "Test_2", 200)

In [20]:
print(end_seq)

TCGGCGATACTACGGCCATTCTCCATCGGATTACCGAGCACAGGGCCATTCAAGTCAGGGGGCGCCAATCGGTCGGAGAATTTTTCCCAAGAGTCACTACCGCATCCTCTCGTGGCCTCAGCGACTACGAGCCGACGCGGCCTACGGCGATATCAGAGAGGGGCCGAACCGCGCCCAAAACCTCGCTTCGGGAGTCAAGA
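
Because the function hands back an ordinary Python string, the result can feed straight into further steps of a workflow. As a small sketch of what that might look like, the two helpers below compute the GC fraction and the reverse complement of a sequence string. Both helper names are hypothetical and not part of the script, and a short literal stands in for `end_seq` so the example runs on its own.

```python
def gc_fraction(seq):
    """Return the fraction of residues that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


# Stand-in for a value returned by the imported function.
example_end_seq = "AACG"
print(gc_fraction(example_end_seq))         # 0.5
print(reverse_complement(example_end_seq))  # CGTT
```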


*Remember you can dispense with providing an actual, real record id if there is only one record.*

You just need to supply *something* in that spot as a 'placeholder'.

In [21]:
end_seq = get_specified_length_of_end_of_seq_from_FASTA("single_sequence.fa", "MOOT_AGAIN", 600)

Single sequence with id of 'Test_1' provided in the sequence file.
It will be used to extract the last 600 characters in it.

Note that the sepecified number of residues to get, 600, exceeds the length of the specified record, which is 315 residues in length.
The entire sequences has been returned.


----

Enjoy!

Upload your own sequence files to any running Jupyter session and adapt the commands in this notebook to extract from within them. Edit the notebook or copy the necessary cells to make the script work with your own data.

----
### ADVANCED DEVELOPMENT NOTE

If editing the script (***ATYPICAL***) and using import of the main function to test changes here in this Jupyter notebook, you'll need to run the following code in order to specifically trigger import of the updated version of the code for the function subsequent to any edit. Otherwise, without a restart of the kernel, the notebook environment will see any call to import the function and essentially ignore it as it considers that function already imported into the notebook environment.

In [22]:
# Run this to have new code reflected in the version of the function in memory within the notebook namespace
import importlib
import get_specified_length_of_end_of_seq_from_FASTA
importlib.reload(get_specified_length_of_end_of_seq_from_FASTA)
from get_specified_length_of_end_of_seq_from_FASTA import get_specified_length_of_end_of_seq_from_FASTA
# approach from https://stackoverflow.com/a/11724154/8508004
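
To see why the reload step matters, the self-contained sketch below writes a throwaway module, imports a function from it, edits the file, and shows that a repeated `import` alone keeps the old code while `importlib.reload` picks up the change. The module name `demo_reload_mod` and the function `answer` are made up for the demonstration. Bytecode caching is switched off here as a precaution so that a quick rewrite of the file is never masked by a stale cache.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # precaution for this quick rewrite demo

# Create a throwaway module in a temporary directory.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "demo_reload_mod.py")
with open(module_path, "w") as handle:
    handle.write("def answer():\n    return 1\n")

sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import demo_reload_mod
from demo_reload_mod import answer
first = answer()

# Simulate editing the script on disk.
with open(module_path, "w") as handle:
    handle.write("def answer():\n    return 2222\n")

# A plain re-import is a no-op because the module is already loaded.
import demo_reload_mod
from demo_reload_mod import answer
still_old = answer()

# Reloading re-executes the file; re-importing the name rebinds it.
importlib.invalidate_caches()
importlib.reload(demo_reload_mod)
from demo_reload_mod import answer
updated = answer()

print(first, still_old, updated)  # 1 1 2222
```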

----
