# Demo of script to get intergenic gap sizes from mitochondrial annotation file

To demonstrate my script `measure_intergenic_regions_in_mito_annotations.py`, found [here](https://github.com/fomightez/sequencework/tree/master/annotation-utilities), I'll use the Saccharomyces Genome Database (SGD) reference genome data.

Because the script assumes the genome is circular when calculating the distance between the last and first annotated features, it is intended only for intergenic regions of circular mitochondrial DNA (fungal, and possibly others).
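As a rough illustration of what the circular assumption means, the sketch below computes the gaps between consecutive features and then adds one wrap-around gap from the last feature back to the first. The function name `circular_gaps`, the toy coordinates, and the convention that a gap is the next start minus the previous end are assumptions made for illustration; this is not the script's actual code.

```python
def circular_gaps(features, genome_length):
    """Return gaps between consecutive (start, end) features on a circular molecule.

    Assumes the features are listed in order along the genome with 1-based
    coordinates. Gap convention (an assumption): next start minus previous end.
    """
    gaps = []
    for (_, prev_end), (next_start, _) in zip(features, features[1:]):
        gaps.append(next_start - prev_end)
    # Wrap-around gap: from the end of the last feature, past the origin,
    # to the start of the first feature.
    first_start = features[0][0]
    last_end = features[-1][1]
    gaps.append((genome_length - last_end) + first_start)
    return gaps


toy_features = [(100, 200), (300, 400), (900, 950)]  # hypothetical coordinates
toy_gaps = circular_gaps(toy_features, genome_length=1000)
print(toy_gaps)
```

With a linear molecule the last entry would not exist, which is why the number of gaps equals the number of features only in the circular case.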

References for sequence data:  
- [The reference genome sequence of Saccharomyces cerevisiae: then and now. Engel SR, Dietrich FS, Fisk DG, Binkley G, Balakrishnan R, Costanzo MC, Dwight SS, Hitz BC, Karra K, Nash RS, Weng S, Wong ED, Lloyd P, Skrzypek MS, Miyasato SR, Simison M, Cherry JM. G3 (Bethesda). 2014 Mar 20;4(3):389-98. doi: 10.1534/g3.113.008995.(PMID: 24374639)](https://www.ncbi.nlm.nih.gov/pubmed/24374639)

- [Saccharomyces Genome Database: the genomics resource of budding yeast. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, Simison M, Weng S, Wong ED. Nucleic Acids Res. 2012 Jan;40(Database issue):D700-5. doi: 10.1093/nar/gkr1029. Epub 2011 Nov 21. (PMID: 22110037)](https://www.ncbi.nlm.nih.gov/pubmed/22110037)


------

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

---

##  Basic use

This script takes two files: an annotation file in GFF format (MFannot output that has been converted to gff by a subsequent script) and the corresponding mitochondrial sequence file in FASTA format. From these it determines the sizes of the intergenic gaps, assuming a circular genome.  
Both file names are required positional arguments, with the annotation file given first and the sequence file second.



In [1]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/annotation-utilities/measure_intergenic_regions_in_mito_annotations.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15535  100 15535    0     0  68738      0 --:--:-- --:--:-- --:--:-- 72934


In [2]:
#install a necessary dependency
!pip install pyfaidx

Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Building wheels for collected packages: pyfaidx
  Building wheel for pyfaidx (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2


#### Display USAGE block

In [3]:
!python measure_intergenic_regions_in_mito_annotations.py -h

usage: measure_intergenic_regions_in_mito_annotations.py [-h]
                                                         GFF_FILE SEQ_FILE

measure_intergenic_regions_in_mito_annotations.py takes output from MFannot
that has been converted to a gff file by a subsequent script, a corresponding
fungal mitochondrial sequence file, and determines intergenic gap sizes.
Assumes CIRCULAR genome. **** Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  GFF_FILE    Name of annotaion results file (gff format) from a mitochondrial
              genome to parse and possibly fix.
  SEQ_FILE    Name of file containing the fungal mitochondrial sequence (FASTA
              format) corresponding to the annotation file.

optional arguments:
  -h, --help  show this help message and exit


Next, we'll get an example fungal mitochondrial sequence and corresponding annotation and determine the intergenic gaps.

Get the annotation file into the active directory. (The source is one I made; see `Annotating SGD Reference S. cerevisiae S288C sequence mitochondrial genome with MFannot April 15 2019.ipynb`. Briefly, I used the Docker version of [MFannot](https://github.com/BFL-lab/Mfannot) to annotate the sequence from SGD, converted the output from MFannot to gff3 format using `mfannot2gff3.pl` from [here](https://github.com/yjx1217/LRSDAY/blob/master/scripts/mfannot2gff3.pl), and then fixed the large ribosomal subunit annotation with `fix_lsu_rRNA_annotation_in_gff_resulting_from_mfannot.py` from [here](https://github.com/fomightez/sequencework/tree/master/Adjust_Annotation).)

In [4]:
!mv ../data/SGD_REF.mitoANNOTATED_by_MFANNOT.gff3 .
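To get a feel for what the annotation file holds, here is a minimal sketch that pulls the coordinates of `gene` rows out of GFF3 text using only the standard library. The helper name `gene_coordinates` is hypothetical, the `##gff-version 3` header line in the sample is an assumption, and the two `gene` rows are copied from the annotation file; the script itself may parse the file differently.

```python
import io


def gene_coordinates(handle, feature_type="gene"):
    """Yield (ID, start, end, strand) for each row of the given feature type."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # GFF3 rows have nine tab-separated columns.
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        attributes = dict(
            item.split("=", 1) for item in columns[8].split(";") if "=" in item
        )
        yield attributes.get("ID", ""), int(columns[3]), int(columns[4]), columns[6]


# Small in-memory sample so the sketch runs without the real file.
sample = (
    "##gff-version 3\n"
    "SGD_mito\tmfannot\tgene\t50731\t50809\t.\t+\t.\tID=trnE(cuc)_1;Name=trnE(cuc)_1\n"
    "SGD_mito\tmfannot\tmRNA\t50731\t50809\t.\t+\t.\t"
    "ID=trnE(cuc)_1.mRNA.1;Name=trnE(cuc)_1.mRNA.1;Parent=trnE(cuc)_1\n"
    "SGD_mito\tmfannot\tgene\t53676\t53758\t.\t-\t.\tID=trnE(cuc)_2;Name=trnE(cuc)_2\n"
)
rows = list(gene_coordinates(io.StringIO(sample)))
for row in rows:
    print(row)
```

To try it on the real file, pass `open("SGD_REF.mitoANNOTATED_by_MFANNOT.gff3")` in place of the `io.StringIO` object.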

Get the S. cerevisiae mitochondrial sequence.

In [5]:
!curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 87344  100 87344    0     0   207k      0 --:--:-- --:--:-- --:--:--  207k


Fix the description line to be more concise and descriptive.

In [6]:
import sys
sys.stderr.write("****BEFORE FIXING****")
!head -n 1 chrmt.fsa
sys.stderr.write("****NOW FIXING****\n")
!sed -i '1s/.*/>SGD_mito/' chrmt.fsa
sys.stderr.write("****AFTER FIXING****")
!head -n 1 chrmt.fsa

****BEFORE FIXING****

>ref|NC_001224| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [location=mitochondrion] [top=circular] [note=R10-1-1]


****NOW FIXING****
****AFTER FIXING****

>SGD_mito
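The `sed -i '1s/.*/.../'` form used above is GNU sed syntax; BSD/macOS sed expects a backup suffix argument after `-i`, so the same command can fail there. A pure-Python equivalent is sketched below. The function name `replace_fasta_header` and the placeholder sequence `ACGT` in the demo are hypothetical, and the sketch assumes the file holds a single record whose header is the first line.

```python
import tempfile
from pathlib import Path


def replace_fasta_header(fasta_path, new_id):
    """Replace the first line of a single-record FASTA file with '>new_id'."""
    path = Path(fasta_path)
    lines = path.read_text().splitlines(keepends=True)
    if not lines or not lines[0].startswith(">"):
        raise ValueError(f"{fasta_path} does not start with a FASTA header line")
    lines[0] = f">{new_id}\n"
    path.write_text("".join(lines))


# Demonstrate on a throwaway file rather than the real chrmt.fsa.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fsa"
    demo.write_text(">ref|NC_001224| [org=Saccharomyces cerevisiae]\nACGT\n")
    replace_fasta_header(demo, "SGD_mito")
    demo_lines = demo.read_text().splitlines()
print(demo_lines[0])
```

In the notebook, `replace_fasta_header("chrmt.fsa", "SGD_mito")` would stand in for the `sed` line.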


The next cell is close to how the script would be run on a command line. There you'd leave off the `!`; it only tells Jupyter to pass the command to the shell.

In [7]:
# Run analysis on equivalent of command line
!python measure_intergenic_regions_in_mito_annotations.py SGD_REF.mitoANNOTATED_by_MFANNOT.gff3 chrmt.fsa

Provided genome 'chrmt.fsa' is 85779 bps in length.
The intergenic gaps observed:
5325,333,1195,4374,965,675,6107,1096,3076,1248,614,634,2867,7261,-3720,1415,481,109,1428,34,779,177,87,783,810,4,487,244,590,525,1128,1053,1,2918,589,371,608,5013,206,750

The mean size of the gaps is 1316 bps.
The median size of the gaps is 712.5 bps.

A file listing the intergenic gaps has been saved as 'SGD_REF.mitoANNOTATED_by_MFANNOT_intergenic_gap_sizes.tsv'.
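As a quick sanity check, the reported mean and median can be recomputed from the printed list with Python's `statistics` module. The list below is copied from the output above; the variable name `gaps` is just for illustration.

```python
import statistics

# Gap sizes copied from the script's printed output.
gaps = [
    5325, 333, 1195, 4374, 965, 675, 6107, 1096, 3076, 1248,
    614, 634, 2867, 7261, -3720, 1415, 481, 109, 1428, 34,
    779, 177, 87, 783, 810, 4, 487, 244, 590, 525,
    1128, 1053, 1, 2918, 589, 371, 608, 5013, 206, 750,
]

print("number of gaps:", len(gaps))
print("mean:", statistics.mean(gaps))
print("median:", statistics.median(gaps))
```

The recomputed values agree with the 1316 bps mean and 712.5 bps median that the script reports for these 40 gaps.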


The next cell further shows it worked by showing the output file.

In [8]:
!cat SGD_REF.mitoANNOTATED_by_MFANNOT_intergenic_gap_sizes.tsv

gaps (bps)
5325
333
1195
4374
965
675
6107
1096
3076
1248
614
634
2867
7261
-3720
1415
481
109
1428
34
779
177
87
783
810
4
487
244
590
525
1128
1053
1
2918
589
371
608
5013
206
750
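The saved file is a single column with a `gaps (bps)` header, so it is easy to read back in later. Below is a minimal standard-library sketch; the helper name `read_gap_sizes` is hypothetical, and the in-memory sample holds only the first three values from the output above.

```python
import csv
import io


def read_gap_sizes(handle):
    """Read the single 'gaps (bps)' column into a list of ints."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [int(row["gaps (bps)"]) for row in reader]


# Small in-memory stand-in for the real .tsv file.
sample_tsv = io.StringIO("gaps (bps)\n5325\n333\n1195\n")
sizes = read_gap_sizes(sample_tsv)
print(sizes)
```

For the real file, pass `open("SGD_REF.mitoANNOTATED_by_MFANNOT_intergenic_gap_sizes.tsv", newline="")` instead of the `io.StringIO` object.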


## Use script in a Jupyter notebook 

This will demonstrate importing the main function into a notebook.


In [9]:
# Get the script if the above section wasn't run
import os
file_needed = "measure_intergenic_regions_in_mito_annotations.py"
if not os.path.isfile(file_needed):
    !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/annotation-utilities/measure_intergenic_regions_in_mito_annotations.py

Let's make the example `!python measure_intergenic_regions_in_mito_annotations.py SGD_REF.mitoANNOTATED_by_MFANNOT.gff3 chrmt.fsa` from above work here, so that the result comes directly into the notebook environment as a Python object.

First we need to import the main function from the script file.

In [10]:
from measure_intergenic_regions_in_mito_annotations import measure_intergenic_regions_in_mito_annotations

Now to try using that.

In [11]:
measure_intergenic_regions_in_mito_annotations("SGD_REF.mitoANNOTATED_by_MFANNOT.gff3", "chrmt.fsa")

Provided genome 'chrmt.fsa' is 85779 bps in length.
The intergenic gaps observed:
5325,333,1195,4374,965,675,6107,1096,3076,1248,614,634,2867,7261,-3720,1415,481,109,1428,34,779,177,87,783,810,4,487,244,590,525,1128,1053,1,2918,589,371,608,5013,206,750

The mean size of the gaps is 1316 bps.
The median size of the gaps is 712.5 bps.

A file listing the intergenic gaps has been saved as 'SGD_REF.mitoANNOTATED_by_MFANNOT_intergenic_gap_sizes.tsv'.


Works. But we are now in a Jupyter notebook environment, so it would be nice to get the results as a pandas dataframe. To do that, just add `return_df=True` when calling the function.

In [12]:
result = measure_intergenic_regions_in_mito_annotations("SGD_REF.mitoANNOTATED_by_MFANNOT.gff3", "chrmt.fsa", return_df=True)

Provided genome 'chrmt.fsa' is 85779 bps in length.
The intergenic gaps observed:
5325,333,1195,4374,965,675,6107,1096,3076,1248,614,634,2867,7261,-3720,1415,481,109,1428,34,779,177,87,783,810,4,487,244,590,525,1128,1053,1,2918,589,371,608,5013,206,750

The mean size of the gaps is 1316 bps.
The median size of the gaps is 712.5 bps.

A file listing the intergenic gaps has been saved as 'SGD_REF.mitoANNOTATED_by_MFANNOT_intergenic_gap_sizes.tsv'.


Now we can look at the returned dataframe.

In [13]:
result

|   | gaps (bps) |
|---|---|
| 0 | 5325 |
| 1 | 333 |
| 2 | 1195 |
| 3 | 4374 |
| 4 | 965 |
| 5 | 675 |
| 6 | 6107 |
| 7 | 1096 |
| 8 | 3076 |
| 9 | 1248 |


We now have that as `result` and we can use it for further steps. For example, we can use it to get information.

In [14]:
result.describe()

|   | gaps (bps) |
|---|---|
| count | 40.0 |
| mean | 1316.0 |
| std | 1966.518217 |
| min | -3720.0 |
| 25% | 361.5 |
| 50% | 712.5 |
| 75% | 1289.75 |
| max | 7261.0 |


In [15]:
result.median()

gaps (bps)    712.5
dtype: float64

In [16]:
result.mean()

gaps (bps)    1316.0
dtype: float64

In [17]:
result.min()

gaps (bps)   -3720
dtype: int64

This negative value is caused by a series of annotation artifacts(?) that MFannot makes just in front of the sequence for the large ribosomal subunit RNA, `rnl`. (I call them artifacts because they aren't shown in [Turk et al., 2013](https://www.ncbi.nlm.nih.gov/pubmed/24143261).) The following lines from the annotation show this:

```text
SGD_mito	mfannot	gene	50731	50809	.	+	.	ID=trnE(cuc)_1;Name=trnE(cuc)_1
SGD_mito	mfannot	mRNA	50731	50809	.	+	.	ID=trnE(cuc)_1.mRNA.1;Name=trnE(cuc)_1.mRNA.1;Parent=trnE(cuc)_1
SGD_mito	mfannot	exon	50731	50809	.	+	.	ID=trnE(cuc)_1.exon.1;Name=trnE(cuc)_1.exon.1;Parent=trnE(cuc)_1.mRNA.1
SGD_mito	mfannot	CDS	50731	50809	.	+	.	ID=trnE(cuc)_1.CDS.1;Name=trnE(cuc)_1.CDS.1;Parent=trnE(cuc)_1.mRNA.1
SGD_mito	mfannot	gene	53676	53758	.	-	.	ID=trnE(cuc)_2;Name=trnE(cuc)_2
SGD_mito	mfannot	mRNA	53676	53758	.	-	.	ID=trnE(cuc)_2.mRNA.1;Name=trnE(cuc)_2.mRNA.1;Parent=trnE(cuc)_2
SGD_mito	mfannot	exon	53676	53758	.	-	.	ID=trnE(cuc)_2.exon.1;Name=trnE(cuc)_2.exon.1;Parent=trnE(cuc)_2.mRNA.1
SGD_mito	mfannot	CDS	53676	53758	.	-	.	ID=trnE(cuc)_2.CDS.1;Name=trnE(cuc)_2.CDS.1;Parent=trnE(cuc)_2.mRNA.1
SGD_mito	mfannot	gene	61019	61729	.	+	.	ID=orf236;Name=orf236
SGD_mito	mfannot	mRNA	61019	61729	.	+	.	ID=orf236.mRNA.1;Name=orf236.mRNA.1;Parent=orf236
SGD_mito	mfannot	exon	61019	61729	.	+	.	ID=orf236.exon.1;Name=orf236.exon.1;Parent=orf236.mRNA.1
SGD_mito	mfannot	CDS	61019	61729	.	+	.	ID=orf236.CDS.1;Name=orf236.CDS.1;Parent=orf236.mRNA.1
SGD_mito	mfannot	rRNA	58009	62447	.	+	.	ID=rnl;Name=rnl
```

The overlap of `orf236` with `rnl` causes the value of -3720, which is the lowest extreme.
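The arithmetic behind that number can be checked by hand from the coordinates in the annotation lines. The sketch below assumes the gap is taken as the start of the next listed feature minus the end of the previous one; that convention is inferred from the reported value, not read from the script's source.

```python
orf236 = (61019, 61729)  # start, end from the `orf236` gene line above
rnl = (58009, 62447)     # start, end from the `rnl` rRNA line above

# Assumed gap convention: start of the next listed feature minus
# end of the previous one. Here `rnl` is listed right after `orf236`.
gap = rnl[0] - orf236[1]
print("gap:", gap)

# `orf236` lies entirely inside `rnl`, which is why the "gap" is negative.
nested = rnl[0] <= orf236[0] and orf236[1] <= rnl[1]
print("orf236 nested in rnl:", nested)
```

Under that assumption the result matches the -3720 reported by the script.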

Enjoy!