# Demonstrate the script `add_source_organism_info_to_FASTA.py`

This script will add the organism name in brackets to FASTA sequence entries.

The header (description) lines of FASTA files are not standardized, which can make some steps difficult to do programmatically. This script is meant to help by adding the organism name in brackets so that it is easily parsed by other scripts, such as my script [compare_organisms_in_two_files_of_fasta_entries.py](https://github.com/fomightez/sequencework/tree/master/CompareFASTA_or_FASTQ), or by regular expressions; for an example, see [this Biostars answer by @cpad0112](https://www.biostars.org/p/9505505/#9505519).

Caveats:

- for now it works on GI numbers which [are being phased out]().
- the description line has to match a pattern like the demonstration entries from NCBI Genbank where the GI number can be parsed out.
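To make the second caveat concrete, here is a minimal sketch of pulling a GI number out of a description line shaped like the demonstration entries. The pattern and the function name `extract_gi` are assumptions made for illustration; the script's own parsing may differ.

```python
import re

# Hypothetical pattern for illustration only; the script itself may parse
# the description line differently.
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def extract_gi(description_line):
    """Return the GI number as a string, or None if the line does not match."""
    match = GI_PATTERN.match(description_line)
    return match.group(1) if match else None


header = (
    ">gi|685422210|ref|XM_009231696.1| Gaeumannomyces graminis var. tritici "
    "R3-111a-1 hypothetical protein partial mRNA"
)
print(extract_gi(header))
print(extract_gi(">XM_009231696.1 description line without a GI number"))
```

A description line that lacks the `>gi|<number>|` prefix yields `None` here, which illustrates why entries that do not follow this layout cannot be handled.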

-----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell, click on the cell and either click the <i class="fa-play fa"></i> button on the toolbar above, or hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook. Selecting <b>Cell</b> > <b>Run All</b> from the menu above the toolbar is a shortcut for running all the cells in the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

## Demonstrating the script to make a dataframe from PDBePISA interface lists/reports

When you go to https://www.ebi.ac.uk/pdbe/pisa/pistart.html, press the `Launch PDBePISA` button, enter your favorite PDB identifier code, and then press the 'Interfaces' button, you'll get an Interface report/list in the form of a table with the information for the structure of the complex linked to that code. The Interface report/list you see looks great; however, that table isn't set up for easy use in downstream analysis.  
The script that will be demonstrated here overcomes that issue and provides you with a Pandas dataframe that represents the data you'd see at the site. Pandas dataframes are computational objects you can easily use in subsequent analysis or save into forms you can easily bring into Excel.

If you haven't encountered Pandas dataframes before, I suggest you see the first two notebooks that come up when you launch a session from my [blast-binder](https://github.com/fomightez/blast-binder) site. Those first two notebooks cover some of the ways to use a dataframe containing BLAST results.


### Preparation: What is needed to use the script?

Three things are needed:

- An environment where you can run the script.

- The script itself.

- A FASTA sequence file with entries that match what you typically get for GenBank entries from NCBI.

This demonstration notebook provides the first two requirements right in your browser.  

The demo will cover supplying an example of the third requirement, or you can substitute your own file to apply the script. You may wish to run the demonstration at least once and then try that.

Let's finish the preparation and then begin with an example.

### Preparation: Fetch the script.

The script is stored on GitHub, and running the next cell will bring a copy of it into the working directory here. (It is not included in the repository where this launches from, to ensure you always get the most current version, which is assumed to be the best available at the time.)

In [1]:
# Get a file if not yet retrieved / check if file exists
import os
file_needed = "add_source_organism_info_to_FASTA.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/AdjustFASTA_or_FASTQ/{file_needed}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20188  100 20188    0     0  98000      0 --:--:-- --:--:-- --:--:-- 98000


### Preparation: Make an example FASTA file

Running the next cell will make a FASTA file present in the current working directory so that the demonstration script can work on it.  
If you are already familiar with using the script, you can upload your own sequence file in FASTA format or, after you make the file, replace the content with your sequences in FASTA format.

Source of the demonstration sequences in the file:

- From [here](https://www.ncbi.nlm.nih.gov/nuccore/XM_009231696.1?report=fasta) for [GenBank entry XM_009231696.1](https://www.ncbi.nlm.nih.gov/nuccore/XM_009231696.1).
- From [here](https://www.ncbi.nlm.nih.gov/nuccore/NM_211569.1?report=fasta) for [GenBank entry NM_211569.1](https://www.ncbi.nlm.nih.gov/nuccore/NM_211569.1).

In [2]:
s='''>gi|685422210|ref|XM_009231696.1| Gaeumannomyces graminis var. tritici R3-111a-1 hypothetical protein partial mRNA
ATGACGACGGACAGCACCAACCGCGCCTCGCCGCTGGCCAGGCTCGAGCAGGCCCGGCAGCTGCTCGCCA
CCGCGCGCGAGATCCTGGACCGGTTCAACTACAAGCACAAGAACCAGCACCGCCTGTCCAAGTGGTGGGC
CGCCTTCGACACCCTGCGCAGGCACCTCGCCAAGCTGGAGCGGGACGAAATCACGCCTGCTGTCGGGGCC
CTCGAGCTCGCCGCCCGCCGGTCGGCGGCCCGCGGCTCCGGTGGCGGCGGCTCCAAGCGTCCCCGGGACC
CTGCTGCCGCCGTGGACGTGGCCGTGATGCCGCCCGGCGTCGGGGCGAAATCCAAATGGATGCGCGACTA
CCTGATCCCCAGGGCCTACCCCTCCTTCACCCAGCTGGCTGCGGACAATCAGTTCGCCCACCTAGGGCTT
CTCTTGCTCGGCGTGCTGGCCCAGGTTCACGATCTCGTGTCTCTTATCTTGCCCCCGCGCATAGAGGAAG
ATGACGCCGAGGGTGTAAAGTGTGCGGCTGCGTCTCCGCTTCGTATCGAAGCACGGCCGGTGCCGGCAGT
GGGAACTGCCGCCCTTGTGCCTGCAGTTGCTTCCAGCAATCCAGATGATTTGGGCGTTGCTGTGTCCCGG
GACGACGTTGCACCTGCAAAACGTGACGTGCGCAGCAGAAGCGAAGATGCATGCGCGCGATTAAACGGCA
AGAAGCGTGAAAAACAGACCAGCACCAAACCCACCCTCTCAGAAGAGGCATCGCCAGGCCCAGACGAGGT
GACTAGGCCAGTTCGCGAGGCGCCGCTAGGTCCATCTGCGGCCAAGCCCAAGGTCAAGAAAAAGGCACGT
AGCGACGACGCCAAAGAGGAGAAGAAAAAGAAGAAGAGAAAGAAGGGCGACGAGTTTGACGACCTGTTTA
GCTCGCTGATGTAG


>gi|47075178|ref|NM_211569.1| Ashbya gossypii ATCC 10895 AGL160Wp (AGOS_AGL160W) mRNA, complete cds
ATGTCGGACAAGGCTTTGCGCGCTGGTGAGGATGGCACGGAGATCCGTAATGCGCTTCGCAGCCTACAGC
AAGAGCTGCGAGTCATTCACATCCTGTATCACAGGAACAAGAACCAGCACCGCGTAGCTACATGGTGGAA
GCAGCTCAATTCGCTTAAGCGCAGTGTGAGTCAGGTGGTTACAGTGACTAGTAAGCCGGTGCGCACAGAG
GCAGATCTGGAGGCGCTGGCAGGGTTGTTGCGGCGGTTTGCGGTGCGGCAGGCGCCGGCGATGTACTACG
AGTTTAACGGCGTGATTGCGCTAGGACAATTCGTGACGCTGGGAGTGGTGCTGGTGGCAGCGCTTGCGCG
CGTTTGGGCACTGTACGGGCAGCTGCGTGAGGCTCTCGGGCTACTGCCAGTGCGGGCGGCACAGGCGGAG
CGCGAGTGCGACGTTGCACCTACTGAAGAGATCGGTGAAGAGGTGGCTGTGGCGGTGGCGGCGTCGCCGC
CCGGCGCAGCCGCGCTGCCTGGCGGCAAGCGAATCAAGAAGAAAAGCAAGAGCAAACGTTCTGCGATCGA
CGACATTTTCGGCTGA
'''
%store s >sequences.fa

Writing 's' (str) to file 'sequences.fa'.
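The `%store` line above is an IPython magic, so it only works inside Jupyter or IPython. As a rough sketch of how you might write a FASTA file with plain Python instead, the following uses `pathlib`. The file name `sequences_plain_python.fa` and the two short entries are placeholders invented for this illustration; they are not real GenBank records and would not be suitable input for the script.

```python
from pathlib import Path

# Placeholder content for illustration only; substitute real FASTA entries
# that follow the NCBI GenBank description line layout.
fasta_text = """>example_entry_1 placeholder description
ATGACGACGG
>example_entry_2 placeholder description
ATGTCGGACA
"""

# Hypothetical file name chosen so the demonstration file is not overwritten.
output_path = Path("sequences_plain_python.fa")
output_path.write_text(fasta_text)

# Quick sanity check: count the description lines that were written.
header_count = sum(
    1 for line in output_path.read_text().splitlines() if line.startswith(">")
)
print(f"{header_count} entries written to {output_path}")
```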


### Using the script

We will cover an example of using the script, much like you would on the command line in a terminal interface on your computer. If that doesn't yet mean much, that's okay because an interface much like that is provided right here.

If you've run the cells above, we now have the script. To process the sequence file, run the next command, where we use Python to run the script `add_source_organism_info_to_FASTA.py` and tell it to process the sequence file just made by providing the file name as text after the script name, with `%run` in front.

In [3]:
%run add_source_organism_info_to_FASTA.py sequences.fa

Reading in your FASTA file...
Sending NCBI identifiers to fetch records...Fetching records 1 thru 2...
Concluded.
2 entries had the source organism added.
The new file 'sequences_with_source.fa' has been created in same directory as the input file.



We can view the produced file to see the result.

In [4]:
!cat sequences_with_source.fa

>gi|685422210|ref|XM_009231696.1| Gaeumannomyces graminis var. tritici R3-111a-1 hypothetical protein partial mRNA [Gaeumannomyces tritici R3-111a-1]
ATGACGACGGACAGCACCAACCGCGCCTCGCCGCTGGCCAGGCTCGAGCAGGCCCGGCAGCTGCTCGCCA
CCGCGCGCGAGATCCTGGACCGGTTCAACTACAAGCACAAGAACCAGCACCGCCTGTCCAAGTGGTGGGC
CGCCTTCGACACCCTGCGCAGGCACCTCGCCAAGCTGGAGCGGGACGAAATCACGCCTGCTGTCGGGGCC
CTCGAGCTCGCCGCCCGCCGGTCGGCGGCCCGCGGCTCCGGTGGCGGCGGCTCCAAGCGTCCCCGGGACC
CTGCTGCCGCCGTGGACGTGGCCGTGATGCCGCCCGGCGTCGGGGCGAAATCCAAATGGATGCGCGACTA
CCTGATCCCCAGGGCCTACCCCTCCTTCACCCAGCTGGCTGCGGACAATCAGTTCGCCCACCTAGGGCTT
CTCTTGCTCGGCGTGCTGGCCCAGGTTCACGATCTCGTGTCTCTTATCTTGCCCCCGCGCATAGAGGAAG
ATGACGCCGAGGGTGTAAAGTGTGCGGCTGCGTCTCCGCTTCGTATCGAAGCACGGCCGGTGCCGGCAGT
GGGAACTGCCGCCCTTGTGCCTGCAGTTGCTTCCAGCAATCCAGATGATTTGGGCGTTGCTGTGTCCCGG
GACGACGTTGCACCTGCAAAACGTGACGTGCGCAGCAGAAGCGAAGATGCATGCGCGCGATTAAACGGCA
AGAAGCGTGAAAAACAGACCAGCACCAAACCCACCCTCTCAGAAGAGGCATCGCCAGGCCCAGACGAGGT
GACTAGGCCAGTTCGCGAGGCGCCGCTAGGTCCATCTGCGGCCAAGCCCAAGGTCAA

Note the organism name now occurs in brackets.
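Once the organism name is in brackets, it can be pulled out with a short regular expression. The following is a minimal sketch, assuming the organism is the last bracketed text on the description line; the function name `organism_from_description` is made up for this example and is not part of the script.

```python
import re

# Assumption: the organism is the final bracketed item on the description line.
ORGANISM_PATTERN = re.compile(r"\[([^\[\]]+)\]\s*$")


def organism_from_description(description_line):
    """Return the bracketed organism name, or None if no brackets are found."""
    match = ORGANISM_PATTERN.search(description_line)
    return match.group(1) if match else None


header = (
    ">gi|685422210|ref|XM_009231696.1| Gaeumannomyces graminis var. tritici "
    "R3-111a-1 hypothetical protein partial mRNA "
    "[Gaeumannomyces tritici R3-111a-1]"
)
print(organism_from_description(header))
```

The same function returns `None` for a description line that has no bracketed text, which gives a simple way to spot entries that were not annotated.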

## Usage

You can get the help or 'manual' information on usage by adding `--help` or `-h` when calling the script.

In [5]:
%run add_source_organism_info_to_FASTA.py --help

usage: add_source_organism_info_to_FASTA.py [-h] InputFile

add_source_organism_info_to_FASTA.py adds to each description line the source
organism information in between brackets if it appears not present based on
lack of brackets. Written by Wayne Decatur --> Fomightez @ Github or Twitter.

positional arguments:
  InputFile   file of sequences in FASTA format. REQUIRED.

optional arguments:
  -h, --help  show this help message and exit


## What about on your machine without Jupyter (or IPython)?

**IMPORTANTLY:**  
On your own machine, outside of Jupyter (or IPython), you'd replace `%run` with `python` (or perhaps `python3`, depending on your Python installation) if you wanted to run the script on your typical command line. So using the script in a terminal would look something like the following after the prompt:

```shell
python add_source_organism_info_to_FASTA.py sequences.fa
```

That is telling your system to use Python to run the script and the rest is the name of the script and then the sequence file to pass to the script. Substitute your sequence file.

Note: you'd have to have the script placed in that working directory. If it's not already there, you may be able to use the following command to get it:

```shell
curl -OL https://raw.githubusercontent.com/fomightez/sequencework/master/AdjustFASTA_or_FASTQ/add_source_organism_info_to_FASTA.py
```

If that fails, try the following to use `wget`:

```shell
wget https://raw.githubusercontent.com/fomightez/sequencework/master/AdjustFASTA_or_FASTQ/add_source_organism_info_to_FASTA.py
```

Both `wget` and `curl` do pretty much the same thing; however, some machines only have one or the other installed.
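If neither tool is available, Python's standard library can fetch the file as well. Below is a minimal sketch that mirrors the check-then-download logic of the first code cell in this notebook; the function name `fetch_if_missing` is an assumption made for this illustration.

```python
import os
import urllib.request

SCRIPT_URL = (
    "https://raw.githubusercontent.com/fomightez/sequencework/master/"
    "AdjustFASTA_or_FASTQ/add_source_organism_info_to_FASTA.py"
)


def fetch_if_missing(url, destination):
    """Download url to destination unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.isfile(destination):
        return False
    urllib.request.urlretrieve(url, destination)
    return True
```

Calling `fetch_if_missing(SCRIPT_URL, "add_source_organism_info_to_FASTA.py")` in the directory where you plan to work should place the script there if it is not already present.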

*OPTIONAL*: Jupyter actually has a terminal as part of it, so if you wanted, you could use `python add_source_organism_info_to_FASTA.py sequences.fa` to run the script in this active session. You can get the terminal by pressing the Jupyter icon in the upper left and then opening a terminal from the page that comes up. You'd need to use `cd` to change into the `notebooks` directory where this notebook is currently being run.

**That covers the basics of using the script.** 


Enjoy.


--------

Go to the index page and click through to notebooks after the next in the series if you'd like.

------

-----