# Demonstration of `report_coordinates_for_seq_within_FASTA.py` script

If you'd like an active Jupyter session to run this notebook, launch one by clicking [here](https://mybinder.org/v2/gh/fomightez/clausen_ribonucleotides/master), and then upload this notebook to the session that starts up.  
Otherwise, the static version is rendered more nicely via [here](https://nbviewer.jupyter.org/github/fomightez/sequencework/blob/master/Extract_from_FASTA/demo%20report_coordinates_for_seq_within_FASTA.ipynb).

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

You'll need the current version of the script to run this notebook, and the next cell will get that. (Remember, if you want to make things more reproducible when you use the script with your own data, you'll want to edit calls such as this to fetch a specific version of the script. How to do this is touched upon in the comment below [here](https://stackoverflow.com/a/48587645/8508004).)

In [1]:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/Extract_Details_or_Annotation/report_coordinates_for_seq_within_FASTA.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17309  100 17309    0     0  51362      0 --:--:-- --:--:-- --:--:-- 51210
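
On the reproducibility point made above: one option is to replace `master` in the raw URL with a tag or commit hash. The following is only a minimal sketch of building such a URL. The helper name `raw_script_url` is introduced here for illustration, and the `PINNED_REF` value is a hypothetical placeholder, not a real commit of the repository.

```python
# Build a raw.githubusercontent.com URL for the script at a chosen ref.
REPO = "fomightez/sequencework"
SCRIPT_PATH = "Extract_Details_or_Annotation/report_coordinates_for_seq_within_FASTA.py"


def raw_script_url(ref="master"):
    """Return the raw URL of the script at the given branch, tag, or commit."""
    return "https://raw.githubusercontent.com/{}/{}/{}".format(REPO, ref, SCRIPT_PATH)


# HYPOTHETICAL placeholder: substitute a real commit hash from the repository history.
PINNED_REF = "<commit-hash>"

print(raw_script_url())            # the URL used in the cell above
print(raw_script_url(PINNED_REF))  # same file, frozen at one version
```

Once a real ref is filled in, the resulting URL can be handed to `curl -O` in the same way as in the cell above.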


## Display Usage / Help Block

In [2]:
%run report_coordinates_for_seq_within_FASTA.py -h

usage: report_coordinates_for_seq_within_FASTA.py [-h] [-rgs]
                                                  SEQUENCE_FILE RECORD_ID
                                                  PATTERN

report_coordinates_for_seq_within_FASTA.py takes a sequence pattern string, a
sequence file (FASTA-format), and a record id, and reports the start and end
coordinates of that sequence within the specified FASTA record. Importantly,
the coordinates numbering is in 'common' terms where the position numbered one
corresponds to the first position. **** Script by Wayne Decatur (fomightez @
github) ***

positional arguments:
  SEQUENCE_FILE         Name of sequence file to use as input. Must be FASTA
                        format. Can be a multi-FASTA file, i.e., multiple
                        sequences in FASTA format in one file.
  RECORD_ID             Specific identifier of sequence entry in sequence file
                        to search. If the provided sequence file only contains
          

To read more about this script beyond that and what is covered below, see [here](https://github.com/fomightez/sequencework/tree/master/Extract_Details_or_Annotation).

-----

## Basic use examples set #1: Using from the command line (or equivalent / similar)

### Preparing for usage example

In [3]:
#write example FASTA to file
s = '''>evoli
atctgatctggggcgaaatgagactgatctgatctggtctgtggcg
>smer
atctgaatctgagactatatgagactgatctgatctgctctgaagc
'''

!echo "{s}" > sequence.fa

### Run the script

In [4]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tCtgAGactatatgagactgatctgatctgctctgaag

8	45


The script outputs the 'start' and 'end' coordinates spanned by the sequence, separated by a tab character.

**Note** that the cell above illustrates that the comparison is case-insensitive.
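
To make the coordinate convention concrete, here is a minimal sketch of how 1-based, inclusive coordinates can be derived from a case-insensitive search with Python's `re` module. This is not the script's actual implementation; the function name `first_match_coordinates` is an assumption made for illustration, and the match count relies on `re.finditer` reporting non-overlapping matches.

```python
import re


def first_match_coordinates(sequence, pattern):
    """Return (start, end, n_matches) using 1-based inclusive coordinates, or None."""
    matches = list(re.finditer(pattern, sequence, flags=re.IGNORECASE))
    if not matches:
        return None
    first = matches[0]
    # match.start() is 0-based and match.end() is exclusive, so adding 1 to
    # the start and using the end as-is gives 1-based inclusive values.
    return first.start() + 1, first.end(), len(matches)


smer = "atctgaatctgagactatatgagactgatctgatctgctctgaagc"
start, end, n = first_match_coordinates(smer, "tCtgAGactatatgagactgatctgatctgctctgaag")
print("{}\t{}".format(start, end))  # 8	45
```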


In the above cell and elsewhere in this notebook, `%%bash` cell magic is used to send this to the shell to run as if on the command line. 

You could simply run something like `python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tCtgagc` if you are working on the command line directly. In fact, the terminal is available from the Jupyter dashboard (or from the JupyterLab launcher), so feel free to try running the command below in a terminal in this Jupyter session if you'd like.

    python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tCtgagc


Another example of using the script is in the cell below. This time the stderr stream shows some feedback, highlighted in pink.

In [5]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tct

2	4


5 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

You may wish to redirect the output sequence text to a file. The next cell demonstrates that, and the one after it shows it worked by displaying the generated file.

In [6]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tCtgAGactatatgagactgatctgatct > redirect_test.tsv

In [7]:
!head redirect_test.tsv

8	36


(The cell above uses another Jupyter notebook / IPython trick to send a command to the command line: anything on a line after an exclamation point `!` is executed on the system command line. However, when I tried that style, e.g., `!python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tct`, I saw no special display formatting of the stderr, unlike with the `%%bash` cell magic. Hence, I used `%%bash` in the demo when calling the script.)

Note that the redirection operator was used just above in a way that only sent the stdout stream to the file. You can adapt that further as you see fit; more about redirect options can be found [here](https://www.brianstorti.com/understanding-shell-script-idiom-redirect/).
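
If you would rather handle both streams from Python than from the shell, `subprocess.run` can capture stdout and stderr separately. The sketch below uses a stand-in child process (an assumption for illustration, not the script itself) that prints coordinates to stdout and a note to stderr; the file names are also just examples.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in child process: writes coordinates to stdout and a note to stderr,
# mimicking the two streams seen in the cells above.
child_code = "import sys; print('8\\t36'); print('a note for the user', file=sys.stderr)"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # keep stdout and stderr as separate strings
    text=True,
    check=True,
)

out_dir = Path(tempfile.mkdtemp())
(out_dir / "coordinates.tsv").write_text(result.stdout)  # comparable to `> file`
(out_dir / "messages.log").write_text(result.stderr)     # comparable to `2> file`
print(repr(result.stdout), repr(result.stderr))
```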



*Remember you can dispense with providing an actual record id if there is only one record.*

In [8]:
#write example FASTA-formatted with one sequence to file
s = '''>evoli
atctgatctggggcgaaatgagactgatctgatctggtctgtggcg
'''

!echo "{s}" > single_sequence.fa

You still have to provide *something* for the record identifier, but it can be any string. In the example below, `moot` is used. It is completely irrelevant, but the 'placeholder' makes the command have all the parts needed.

In [9]:
%%bash
python report_coordinates_for_seq_within_FASTA.py single_sequence.fa moot tctgaTCTGGG

2	12


Single sequence with id of 'evoli' provided in the sequence file.
It will be used to search for the provided sequence pattern
and provide the coordinates spanned.



If you are used to using Jupyter notebooks, you can use `%run` in place of `python` in a command such as `python report_coordinates_for_seq_within_FASTA.py sequence.fa smer tctgaGAC` to get the same result.

In [10]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa smer tctgaGAC

8	15


However, one cannot simply add the shell redirection operator, `>`, to commands using `%run`. In the Jupyter notebook environment, `%run` sends things to IPython and not to the command line, so the redirect operator is not available.

To do the equivalent, you can add the `%%capture` cell magic to make the output a Python object, which you can then have Python save to a file. The idea is that having the output as a Python object in the notebook namespace gives you more options out of the gate than having the output go immediately into a file. The following cells that end this section illustrate this.

In [11]:
%%capture cell_output
pattern = 'tctga'
%run report_coordinates_for_seq_within_FASTA.py sequence.fa smer {pattern}

In [12]:
cell_output.stdout

'2\t6\n'

In [13]:
cell_output.stderr

'4 matches to the sequence found in the specified sequence. The coordinates\nof the match encountered first have been returned.'

Note that the `\t` seen in the output from `cell_output.stdout` represents a tab character.

In [14]:
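curious_coordinate = 7
# split the tab-separated output into [start, end] so whole values are compared
# (a plain substring test would also 'find' 7 inside a coordinate such as 17)
start_end = cell_output.stdout.split()
if str(curious_coordinate) in start_end:
    print("'{}' is among the coordinates.".format(curious_coordinate))
    if str(curious_coordinate) == start_end[1]:
        pos = "end"
    else:
        pos = "start"
    print("It is at the '{}'.".format(pos))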

In [15]:
#save to a file
%store cell_output.stdout > py_out.tsv

Writing 'cell_output.stdout' (str) to file 'py_out.tsv'.


In [16]:
# demonstrate the file saving worked
!head py_out.tsv

2	6


------

## Basic use example set #2: Use the main function via import

Very useful for when using this in a Jupyter notebook to build into a pipeline or workflow.

Prepare first by importing the main function from the script into the notebook environment.

In [17]:
from report_coordinates_for_seq_within_FASTA import report_coordinates_for_seq_within_FASTA

(That call will look redundant; however, it actually means `from the file report_coordinates_for_seq_within_FASTA.py  import the report_coordinates_for_seq_within_FASTA() function`.)

Then call that function and provide the needed arguments in the call. The needed arguments are the `sequence file`, the `record id` of the specific sequence to search for the pattern within (it can be gibberish if there is only one sequence inside the sequence file), and the `sequence pattern to search for`.

The function will return the resulting coordinates as a string, and so the function call should be assigned to a variable in order to handle the output of the function subsequently as desired.

In [18]:
coordinates = report_coordinates_for_seq_within_FASTA("sequence.fa", "evoli", "GATCTGGGGCGA")

In [19]:
print (coordinates)

5	16


*Remember you can dispense with providing an actual, real record id if there is only one record.*

You just need to supply *something* in that spot as a 'placeholder'.

In [20]:
coordinates = report_coordinates_for_seq_within_FASTA("single_sequence.fa", "MOOT_AGAIN", "GATCTGGGGCGA")
coordinates

Single sequence with id of 'evoli' provided in the sequence file.
It will be used to search for the provided sequence pattern
and provide the coordinates spanned.



'5\t16'

----

## More advanced use examples #1: Parse out the start and end

To keep things simple, this script only returns a text string with the 'start' and 'end' coordinates. The specific information can be parsed out if you like. This section illustrates that, building on 'Basic use example set #2: Use the main function via import' from above.

It takes advantage of two things: 
1. In the returned string, a `tab` is used as the separator between the two items. So we can get a list of the items by splitting on that `tab`.
2. Python can unpack a list (see [here](https://stackoverflow.com/questions/34308337/unpack-list-to-variables) for more on that).

In [21]:
from report_coordinates_for_seq_within_FASTA import report_coordinates_for_seq_within_FASTA
coordinates = report_coordinates_for_seq_within_FASTA("single_sequence.fa", "MOOT_AGAIN", "GATCTGGGGCGA")
start,end = coordinates.split("\t")
print("start:",start)
print("end:", end)

start: 5
end: 16


Single sequence with id of 'evoli' provided in the sequence file.
It will be used to search for the provided sequence pattern
and provide the coordinates spanned.
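
The parsed values are still strings at this point. As a small follow-on sketch (with the returned string and the `evoli` sequence hard-coded here as assumptions so the block runs on its own), they can be converted to integers and used to slice out the matched region:

```python
coordinates = "5\t16"  # the string returned in the cell above
evoli = "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg"

start, end = (int(x) for x in coordinates.split("\t"))
# 1-based inclusive coordinates -> 0-based, end-exclusive Python slice
matched = evoli[start - 1:end]
print(matched)                         # gatctggggcga
print(len(matched) == end - start + 1)  # True
```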



----

## More advanced use examples #2: Use with regular expressions

Sequence patterns to search for can be regular expression search terms (see [Appendix 2 of Haddock and Dunn's Practical Computing for Biologists](http://practicalcomputing.org/files/PCfB_Appendices.pdf)). However, it can be tricky to input some of the symbols and special characters that regular expression search terms tend to use and get them interpreted exactly as expected, especially in light of the many ways one can call this script or the associated function in a Jupyter notebook.

I illustrate some of the things I found to work here.

In [22]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli g{{2,}}

10	13


3 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

That regular expression search term is equivalent to `g{2,}` and searches for two or more `g` in a row (or `G`, because I make the comparison case-insensitive regardless of the case used in the input expression). Note that the curly brackets have to be doubled up so that they pass from IPython to Python as single brackets. (Single brackets got converted to parentheses for some reason.) I worked this out by printing the input received from the command right before the search, and luckily tried what I had learned from [here](https://stackoverflow.com/a/5466478/8508004) about dealing with brackets and `.format()`.
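
The doubling convention is the same one `str.format()` uses for literal curly brackets, which plain Python can show. The last two lines are a guess at where the parentheses came from, on the assumption that IPython evaluates the text inside single brackets as a Python expression: `2,` evaluates to the one-item tuple `(2,)`.

```python
# In str.format-style templates, '{{' and '}}' are escapes for literal brackets.
doubled = "g{{2,}}"
as_passed_on = doubled.format()
print(as_passed_on)  # g{2,}

# A single pair is treated as a replacement field rather than literal text.
print("g{n}".format(n=3))  # g3

# ASSUMPTION about the observed parentheses: the bracket contents get evaluated.
evaluated = "g" + str(eval("2,"))
print(evaluated)  # g(2,)
```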



#### When using the function call, it seems no special escaping is needed.

**This is probably the best route to use regular expressions.**

In [23]:
coordinates = report_coordinates_for_seq_within_FASTA("sequence.fa", "evoli", "g{3,}")
coordinates

'10\t13'

In [24]:
coordinates = report_coordinates_for_seq_within_FASTA("sequence.fa", "evoli", "tctgg*")
coordinates

5 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

'2\t5'

In [25]:
coordinates = report_coordinates_for_seq_within_FASTA("sequence.fa", "evoli", "tctggg*")
coordinates

2 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

'7\t13'

In [26]:
#write example with blocks of unknown nucleotides in FASTA to file
s = '''>smar
atctgatNNNNNNNNNNNNNNNNNNNNNNNtgatctggtctgtggcg
>colc
atctgaatctgagactatatNNNNNNNNNNNNNNtctgctctgaagc
'''

!echo "{s}" > sequencewn.fa

coors = report_coordinates_for_seq_within_FASTA("sequencewn.fa", "colc", "N{5,}")
coors

'21\t34'
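
The function reports only the first match. If every block of unknown nucleotides is wanted, a few lines with Python's `re` module can list them all. This is a sketch independent of the script; `n_runs` is a name made up for this example, and the `colc` sequence from the cell above is rebuilt inline.

```python
import re


def n_runs(sequence, min_len=5):
    """Return 1-based inclusive (start, end) for each run of N at least min_len long."""
    pattern = "N{%d,}" % min_len
    return [(m.start() + 1, m.end())
            for m in re.finditer(pattern, sequence, flags=re.IGNORECASE)]


colc = "atctgaatctgagactatat" + "N" * 14 + "tctgctctgaagc"
print(n_runs(colc))  # [(21, 34)]
```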

*Although that method is the most direct and easiest way to use regular expressions, I can imagine it won't cover all cases, so I detail my additional findings in the rest of this section.*

Interestingly, a different approach to escaping the brackets is necessary when using the `%%bash` cell magic.

In [27]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa smer a\{2,\}

6	7


2 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

Yet, if you add in quotes you can get away without escaping the brackets.

In [28]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa smer "a{2,}"

6	7


2 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

The cell below shows it works when using the exclamation mark way to send commands to shell, too.

In [29]:
!python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli g{{3,}}

10	13


In [30]:
a = !python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli g\{3,\}
a

['10\t13']

Except the double-brackets method **fails** if you want to assign the output to a variable using the exclamation point approach.

In [31]:
a = !python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli g{{3,}}
a

['usage: report_coordinates_for_seq_within_FASTA.py [-h] [-rgs]',
 '                                                  SEQUENCE_FILE RECORD_ID',
 '                                                  PATTERN',
 'report_coordinates_for_seq_within_FASTA.py: error: unrecognized arguments: g']

In other words, `a = !python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli g{{3,}}` failed to give the expected result even though it assigned something to the variable.

Adding quotes around the pattern fixes it.

In [32]:
a = !python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "g{{3,}}"
a

['10\t13']

Since the `%run` approach works, though, I can use that to assign to variables within a Jupyter notebook if needed.

In [33]:
%%capture cell_output
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli g{{3,}}

In [34]:
cell_output.stdout

'10\t13\n'

As the cells below show, other complex regular expression search terms work if quotes are used. In fact, it seems the one above works both with and without quotes around the pattern.

In [35]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "...."

1	4


11 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

In [36]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "g{{3,}}"

10	13


That last one was used at the start of this section without quotes.

Use of an asterisk in the regular expression search term with the `%run` approach seems to be allowed if handled like in the `%%bash` approach.

In [37]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli tctggg\*

7	13


2 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

In [38]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli tctggg\*

7	13


2 matches to the sequence found in the specified sequence. The coordinates
of the match encountered first have been returned.

----

## More advanced use examples #3: Dealing with gaps

The default behaviour of the script is to remove gaps represented by dashes from any sequence pattern provided. The idea is that many use cases will involve searching for sequence patterns that have gaps because the sequence text was copied from a sequence alignment, and it seems like a waste of effort to have the user clean the sequences ahead of time.
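
As a rough picture of what removing gaps amounts to, the sketch below strips dashes from a pattern before searching. It is an illustration under assumptions, not the script's own code; `degap` is a hypothetical helper name.

```python
import re


def degap(pattern):
    """Remove alignment gap characters (dashes) from a sequence pattern."""
    return pattern.replace("-", "")


evoli = "atctgatctggggcgaaatgagactgatctgatctggtctgtggcg"
cleaned = degap("GATCTGGG------GCGA")
match = re.search(cleaned, evoli, flags=re.IGNORECASE)
print(cleaned)                                         # GATCTGGGGCGA
print("{}\t{}".format(match.start() + 1, match.end()))  # 5	16
```

One consequence of blanket dash removal in this sketch is that a regular expression character range such as `[a-g]` would be altered as well.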

In [39]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli GATCTGGG------GCGA

5	16


In [40]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "GATCTGGG------GCGA"

5	16


In [41]:
coordinates = report_coordinates_for_seq_within_FASTA("sequence.fa", "evoli", "GATCTGGG------------------------------GCGA")
coordinates

'5\t16'

In [42]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "---GCGA"

usage: report_coordinates_for_seq_within_FASTA.py [-h] [-rgs]
                                                  SEQUENCE_FILE RECORD_ID
                                                  PATTERN
report_coordinates_for_seq_within_FASTA.py: error: the following arguments are required: PATTERN


SystemExit: 2

In [43]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "G---GCGA"

12	16


In [44]:
coordinates = report_coordinates_for_seq_within_FASTA("sequence.fa", "evoli", "----GCGA")
coordinates

'13\t16'

But note that the next one fails even though the cell just above works. When I include `%tb` to get the full traceback, it seems to be related to how `"----GCGA"` is being interpreted by argparse.

In [45]:
%run report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "----GCGA"

usage: report_coordinates_for_seq_within_FASTA.py [-h] [-rgs]
                                                  SEQUENCE_FILE RECORD_ID
                                                  PATTERN
report_coordinates_for_seq_within_FASTA.py: error: the following arguments are required: PATTERN


SystemExit: 2

It even fails when run on the command line via the `%%bash` magic.

In [46]:
%%bash
python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "----GCGA"

usage: report_coordinates_for_seq_within_FASTA.py [-h] [-rgs]
                                                  SEQUENCE_FILE RECORD_ID
                                                  PATTERN
report_coordinates_for_seq_within_FASTA.py: error: the following arguments are required: PATTERN


CalledProcessError: Command 'b'python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli "----GCGA"\n'' returned non-zero exit status 2.
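
The failures above are consistent with general argparse behaviour: a command-line argument that starts with a dash is read as an option flag, so the positional PATTERN is reported as missing. argparse also accepts a bare `--` to mark everything after it as positional. The sketch below shows both with a hypothetical stand-in parser that has the same three positional arguments; it is not the script's own parser.

```python
import argparse

# HYPOTHETICAL stand-in parser mirroring the three positional arguments.
parser = argparse.ArgumentParser(prog="demo_parser")
parser.add_argument("sequence_file")
parser.add_argument("record_id")
parser.add_argument("pattern")

# A leading-dash pattern is taken for an option, so parsing exits with status 2.
try:
    parser.parse_args(["sequence.fa", "evoli", "----GCGA"])
    exit_code = None
except SystemExit as err:
    exit_code = err.code

# With '--' in front, the same text is accepted as the positional pattern.
args = parser.parse_args(["sequence.fa", "evoli", "--", "----GCGA"])
print(exit_code, args.pattern)
```

On the command line the corresponding form would be `python report_coordinates_for_seq_within_FASTA.py sequence.fa evoli -- ----GCGA`; that was not tested with the script in this notebook, so treat it as something to try rather than a confirmed fix.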


----

Enjoy!

Upload your own sequence files to any running Jupyter session and adapt the commands in this notebook to search within them. Edit the notebook or copy the necessary cells to make the script work with your own data.

----
### ADVANCED DEVELOPMENT NOTE

If editing the script (***ATYPICAL***) and using import of the main function to test changes here in this Jupyter notebook, you'll need to run the following code in order to specifically trigger import of the updated version of the code for the function subsequent to any edit. Otherwise, without a restart of the kernel, the notebook environment will see any call to import the function and essentially ignore it as it considers that function already imported into the notebook environment.

In [None]:
# Run this to have new code reflected in the version of the function in memory within the notebook namespace
import importlib
import report_coordinates_for_seq_within_FASTA
importlib.reload(report_coordinates_for_seq_within_FASTA)
from report_coordinates_for_seq_within_FASTA import report_coordinates_for_seq_within_FASTA
# above lines from https://stackoverflow.com/a/11724154/8508004

----
