# demonstration of `replace_unusual_nts_within_FASTA.py` script

If you'd like an active Jupyter session to run this notebook, launch one by clicking [here](https://mybinder.org/v2/gh/fomightez/cl_sq_demo-binder/master?filepath=index.ipynb), and then select 'Demo of script to replace unusual nts in a FASTA file' from the available notebooks listed there.  
Otherwise, the static version is rendered more nicely via [here](https://nbviewer.org/github/fomightez/cl_sq_demo-binder/blob/master/notebooks/demo%20demo%20replace_unusual_nts_within_FASTA.ipynb).

As it stands, you have to edit the script itself to change the character used in the substitution. Search for `character_for_subbing` in the 'USER ADJUSTABLE VALUES' section of the script.

This script uses the main function in my script `summarize_all_nts_even_ambiguous_present_in_FASTA.py` to summarize the nucleotides, which gives a better sense of the amount of unusual vs. normal nucleotides. If all goes right, it handles this behind the scenes. I'm pointing it out mainly so anyone reading about this script will see that the summarizing functionality is available separately from the replacement steps. (Also, if fetching `summarize_all_nts_even_ambiguous_present_in_FASTA.py` fails, the script will ask you to place that script in the same location as `replace_unusual_nts_within_FASTA.py`, so it may be nice to know why it is asking that.)



<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

You'll need the current version of the script to run this notebook, and the next cell will get that. (Remember, if you want to make things more reproducible when you use the script with your own data, you'll want to edit calls such as this to fetch a specific version of the script. How to do this is touched upon in the comment below [here](https://stackoverflow.com/a/48587645/8508004).)

In [1]:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/AdjustFASTA_or_FASTQ/replace_unusual_nts_within_FASTA.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 34809  100 34809    0     0   173k      0 --:--:-- --:--:-- --:--:--  173k
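
As a rough illustration of the reproducibility point above, the branch name in a raw GitHub URL can be swapped for a commit hash or tag so that the same version of the script is always fetched. The helper name `pinned_raw_url` is hypothetical and used only for this sketch; no real commit hash is given here, so substitute one from the repository's history.

```python
# Hypothetical helper (not part of the script): build a raw GitHub URL for a given ref.
# `ref` can be a branch name, a tag, or a full commit hash.
def pinned_raw_url(ref,
                   owner="fomightez",
                   repo="sequencework",
                   path="AdjustFASTA_or_FASTQ/replace_unusual_nts_within_FASTA.py"):
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"

# Using "master" reproduces the URL fetched in the cell above;
# pass a commit hash from the repository instead to pin a specific version.
print(pinned_raw_url("master"))
```

The resulting URL can then be used in place of the one in the `curl` call.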


In [2]:
%pip install rich

Collecting rich
  Downloading rich-12.6.0-py3-none-any.whl (237 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 237.5/237.5 kB 5.0 MB/s eta 0:00:00
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 kB 3.2 MB/s eta 0:00:00
Installing collected packages: commonmark, rich
Successfully installed commonmark-0.9.1 rich-12.6.0
Note: you may need to restart the kernel to use updated packages.


## Display Usage / Help Block

In [3]:
%run replace_unusual_nts_within_FASTA.py -h

usage: replace_unusual_nts_within_FASTA.py [-h] [-cs] [-mc]
                                           [-os OUTPUT_SUFFIX]
                                           SEQUENCE_FILE

replace_unusual_nts_within_FASTA.py takes a sequence file (FASTA-format) and
replaces anything in the sequences that isn't G,A,T,C, or N with a ?. If the
sequence(s) only contains G,A,T,C, or N no modified version is produced.
Assumes multi-FASTA, but single sequence entry is fine, too. When running on
the command line, it will also print out a summary table of counts of
nucleotides and other character in each sequence and totals. When calling the
main function it will, by default, return a dataframe with this information.
In case of unusual nucleotides present, a copy of the FASTA sequence file with
the unusual nucleotides replaced by ? will be produced. Only valid for DNA
sequences; script has no step checking for data type, and so you are
responsible for verifying appropriate input. By default, the summar

To read more about this script beyond that and what is covered below, see [here](https://github.com/fomightez/sequencework/tree/master/AdjustFASTA_or_FASTQ).
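
To make the behavior described in the help text concrete, here is a minimal sketch of the core idea: replace anything in a sequence that isn't G, A, T, C, or N with `?`. This is not the script's actual implementation; the function name `sub_unusual` and the regular expression are assumptions made for illustration, and the sketch treats upper and lower case alike.

```python
import re

# Assumption for this sketch: only G, A, T, C, and N (either case) are expected.
UNUSUAL_CHARS = re.compile(r"[^GATCNgatcn]")

def sub_unusual(seq, sub_char="?"):
    """Return `seq` with every unexpected character replaced by `sub_char`."""
    # A function is used as the replacement so `sub_char` is inserted literally.
    return UNUSUAL_CHARS.sub(lambda match: sub_char, seq)

print(sub_unusual("atgcKnnQ*d"))  # prints atgc?nn???
```

The actual script additionally handles reading and writing the FASTA file and producing the summary table.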

-----

## Basic use examples set #1: Using from the command line (or equivalent / similar)

### Preparing for usage example

In [4]:
#write example FASTA to file
s = '''>evoli
atctgatctggggcgaaatgagkactgatctgatctggtctgtggcgQtqQ*d
>smer
atctgaatctgagactatatgagkactgatctgatctgctctgaagc
'''

!echo "{s}" > sequence.fa

### Run the script

In [5]:
%%bash
python replace_unusual_nts_within_FASTA.py sequence.fa

┏━━━━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━┳━━━━━┳━━━━━┳━━━━━┳━━━┳━━━━━━━━━━━┳━━━━━┓
┃       ┃  A ┃  T ┃  C ┃  G ┃ K ┃   Q ┃   * ┃   D ┃ N ┃ Total_nts ┃ % N ┃
┡━━━━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━╇━━━━━╇━━━━━╇━━━━━╇━━━╇━━━━━━━━━━━╇━━━━━┩
│ evoli │  9 │ 14 │  8 │ 16 │ 1 │ 3.0 │ 1.0 │ 1.0 │ 0 │        53 │ 0.0 │
│  smer │ 13 │ 14 │  9 │ 10 │ 1 │ 0.0 │ 0.0 │ 0.0 │ 0 │        47 │ 0.0 │
│ TOTAL │ 22 │ 28 │ 17 │ 26 │ 2 │ 3.0 │ 1.0 │ 1.0 │ 0 │       100 │ 0.0 │
└───────┴────┴────┴────┴────┴───┴─────┴─────┴─────┴───┴───────────┴─────┘



Obtaining script containing a function to use to summarize the nucleotides and other letters present...
2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: Q,*,D,K.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

Note the table summarizing the nts present **will have color when run in an actual terminal**. (Assuming it is a modern terminal that supports at least a few colors.)

That can be simulated in a notebook, like so:

In [6]:
%run replace_unusual_nts_within_FASTA.py sequence.fa

2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: D,K,*,Q.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

It is like using the `%%bash` magic as shown above, but with color! This is because `%run` is special and works well with Jupyter.

However, those used to working in a terminal may prefer no color. That is possible while using the full-featured `%run` to call the script by adding the `--mono` flag:

In [7]:
%run replace_unusual_nts_within_FASTA.py --mono sequence.fa

2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: D,K,*,Q.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

That `--mono` flag can be abbreviated `-mc`.


How does the generated file look?

In [8]:
cat sequence_subbed.fa

>evoli SUBBED
atctgatctggggcgaaatgag?actgatctgatctggtctgtggcg?t????
>smer SUBBED
atctgaatctgagactatatgag?actgatctgatctgctctgaagc
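
If you'd like a quick programmatic check of the output file rather than reading it by eye, the following sketch counts the substitution character per record. The function name `count_sub_chars` is hypothetical and the parsing is deliberately simple (plain FASTA with `>` header lines); it is not part of the script.

```python
from pathlib import Path

def count_sub_chars(fasta_path, sub_char="?"):
    """Return {record id: count of `sub_char` in that record's sequence lines}."""
    counts = {}
    current = None
    for line in Path(fasta_path).read_text().splitlines():
        if line.startswith(">"):
            # Use the first word of the header as the record id.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            counts[current] = 0
        elif current is not None:
            counts[current] += line.count(sub_char)
    return counts

# Only runs if the output file from the cells above is present.
if Path("sequence_subbed.fa").exists():
    print(count_sub_chars("sequence_subbed.fa"))
```

For the file shown above, this should report 6 for `evoli` and 1 for `smer`, which agrees with the unusual-character columns of the summary table.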


You can use a different flag to make the accounting of the letters in the sequence case-sensitive, like so:

In [9]:
%run replace_unusual_nts_within_FASTA.py -cs sequence.fa

2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: d,k,q,Q,*.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

The full version of that flag, if you want to write it out, would be `--case_sensitive`.   
If you want the case-sensitive summary, I'd suggest you also collect the 'default', case-insensitive summary table, as I think they complement each other well.
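
To illustrate how the two kinds of accounting differ, here is a small sketch using `collections.Counter` on the `evoli` sequence from the example. It only mimics the idea of case-insensitive versus case-sensitive counting and is not how the script itself tallies characters.

```python
from collections import Counter

# The 'evoli' sequence from the example FASTA above.
seq = "atctgatctggggcgaaatgagkactgatctgatctggtctgtggcgQtqQ*d"

# Case-insensitive: fold everything to upper case before counting.
case_insensitive = Counter(seq.upper())

# Case-sensitive: count the characters exactly as written.
case_sensitive = Counter(seq)

print(case_insensitive["Q"])                      # prints 3
print(case_sensitive["Q"], case_sensitive["q"])   # prints 2 1
```

The case-insensitive count lumps `Q` and `q` together, while the case-sensitive count keeps them apart, which is why the two summaries complement each other.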

------

## Basic use example set #2: Use the main function via import

Very useful for when using this in a Jupyter notebook to build into a pipeline or workflow.

Prepare first by importing the main function from the script into the notebook environment.

In [10]:
from replace_unusual_nts_within_FASTA import replace_unusual_nts_within_FASTA

(That call will look redundant; however, it actually means *from the file* `replace_unusual_nts_within_FASTA.py`  *import the* `replace_unusual_nts_within_FASTA()` *function*.)

With the main function imported into the namespace, we are ready to call it. The only required argument is the sequence file. Optionally, you can make the accounting case-sensitive with `case_sensitive=True`. 

The function will return a dataframe and generate a new file.

In [11]:
df = replace_unusual_nts_within_FASTA("sequence.fa")
df

2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: D,K,*,Q.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

Unnamed: 0,A,T,C,G,K,Q,*,D,N,Total_nts,% N
evoli,9,14,8,16,1,3.0,1.0,1.0,0,53,0.0
smer,13,14,9,10,1,0.0,0.0,0.0,0,47,0.0
TOTAL,22,28,17,26,2,3.0,1.0,1.0,0,100,0.0


In [12]:
df = replace_unusual_nts_within_FASTA("sequence.fa", case_sensitive=True)
df

2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: d,k,q,Q,*.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

Unnamed: 0,a,t,c,g,k,Q,q,*,d,N,n,Total_nts,% N,% n,% N&n
evoli,9,14,8,16,1,2.0,1.0,1.0,1.0,0,0,53,0.0,0.0,0.0
smer,13,14,9,10,1,0.0,0.0,0.0,0.0,0,0,47,0.0,0.0,0.0
TOTAL,22,28,17,26,2,2.0,1.0,1.0,1.0,0,0,100,0.0,0.0,0.0


The `mono` setting is moot when using the main function inside Python. That setting is only for when using the script in the terminal, or terminal-equivalent, setting.

If you don't want the dataframe returned and instead just want the script to replace the unusual nucleotides, you can optionally add `return_df = False`. (The case-sensitive setting is only reflected in the returned dataframe, so the value of `case_sensitive` doesn't matter when the dataframe isn't returned, and I'm leaving it out here.)

In [13]:
replace_unusual_nts_within_FASTA("sequence.fa", return_df = False) 

2 sequences provided in the sequence file.



*****************DONE**************************
The following letters not expected to be among the sequences were observed: D,K,*,Q.
These unusual characters were replaced with `?` and a modified version 
of 'sequence.fa' was saved as the output file 'sequence_subbed.fa'.

----

Enjoy!

Upload your own sequence files to any running Jupyter session and adapt the commands in this notebook to process them. Edit the notebook or copy the necessary cells to make the script work with your own data.

----
### ADVANCED DEVELOPMENT NOTE

If editing the script (***ATYPICAL***) and using import of the main function to test changes here in this Jupyter notebook, you'll need to run the following code in order to specifically trigger import of the updated version of the code for the function subsequent to any edit. Otherwise, without a restart of the kernel, the notebook environment will see any call to import the function and essentially ignore it as it considers that function already imported into the notebook environment.

In [14]:
# Run this to have new code reflected in the version of the function in memory within the notebook namespace
import importlib
import replace_unusual_nts_within_FASTA
importlib.reload(replace_unusual_nts_within_FASTA)
from replace_unusual_nts_within_FASTA import replace_unusual_nts_within_FASTA
# above lines adapted from https://stackoverflow.com/a/11724154/8508004

----
