# GSD: Calculating ambiguous nucleotides and %G+C for the mitochondrial genomes of NCYC3594 and NCYC3585 from Wolters et al 2015

At the end of `GSD Calculating_GC_for_nuclear_and_mitochondrial_genomes_of_SGD_reference_and_PB_set.ipynb`, where I was discussing the range in %G+C for *cerevisiae*, I noted that although I knew the reported %G+C for the two *cerevisiae* strains sequenced by PacBio in [Wolters et al 2015 PMID: 26062918](https://www.ncbi.nlm.nih.gov/pubmed/26062918), specifically strains NCYC3594 and NCYC3585, I'd need to analyze the sequences myself to assess ambiguous nts. This is that effort. They only did the mitochondrial DNA by single-molecule sequencing, so I am not going to deal with the nuclear genome.


Reference for the sequence data:  
- [Population structure of mitochondrial genomes in Saccharomyces cerevisiae. Wolters JF, Chiu K, Fiumera HL. BMC Genomics. 2015 Jun 11;16:451. doi: 10.1186/s12864-015-1664-4. PMID: 26062918](https://www.ncbi.nlm.nih.gov/pubmed/26062918). Article [direct](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1664-4).
  
Uses code developed for `GSD Calculating_GC_for_nuclear_and_mitochondrial_genomes_of_SGD_reference_and_PB_set.ipynb`, `GSD Assessing_ambiguous_nts_in_complete_PB_genomes.ipynb`, `GSD Assessing_ambiguous_nts_in_nuclear_PB_genomes.ipynb`, `GSD Assessing_ambiguous_nts_in_1011_collection_genomes.ipynb`, and `GSD Calculating_GC_for_nuclear_and_mitochondrial_genomes_of_SGD_reference.ipynb`.

This should run in almost any Jupyter environment with Pandas and the 'essential' modules.

The sequence data is going to come directly from `SGDs288CplusPacBio_ADJUSTEDplusWoltersnW303forALIGNERS.fa`. I had made this for another purpose and adjusted the 'start' site to match the SGD reference sequence; see `Counting putative promoters and origins in current mito genomes collection.ipynb`. For counting nucleotides, the position of the start site doesn't matter, but this is a handy way to get the sequences I need while only having to upload one file.
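To illustrate why the adjusted start site has no effect on the counts, here is a minimal sketch using a made-up toy sequence (not data from this notebook). The helper name `rotate` is hypothetical and only stands in for the idea of re-starting a circular genome at a different position.

```python
from collections import Counter

def rotate(seq, new_start):
    """Return seq re-started at index new_start, as for a circular genome."""
    return seq[new_start:] + seq[:new_start]

# Made-up toy sequence for illustration only
toy = "GATTACAGGCATNAT"
rotated = rotate(toy, 6)

# The order of the letters changes, but the tally of each letter does not
same_composition = Counter(toy) == Counter(rotated)
print(rotated, same_composition)
```

Because the letter tallies are identical before and after rotation, any value derived from them, such as %G+C or the number of `N`s, is unchanged too.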
   
  
-----

## Preparation

Get the packages and sequence data necessary.


Uploaded the following to where I was running this notebook in a Binder session using Jupyter Dashboard or JupyterLab's drag-and-drop:
- 'SGDs288CplusPacBio_ADJUSTEDplusWoltersnW303forALIGNERS.fa'


In [1]:
!pip install pyfaidx

Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Building wheels for collected packages: pyfaidx
  Building wheel for pyfaidx (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2


usage: faidx [-h] [-b BED] [-o OUT]
             [-i {bed,chromsizes,nucleotide,transposed}] [-c] [-r] [-y]
             [-a SIZE_RANGE] [-n | -f] [-t] [-x] [-l] [-s DEFAULT_SEQ]
             [-d DELIMITER] [-e HEADER_FUNCTION]
             [-u {stop,first,last,longest,shortest}] [-g REGEX] [-v] [-m | -M]
             [--no-output] [--no-rebuild] [--version]
             fasta [regions [regions ...]]

Fetch sequences from FASTA. If no regions are specified, all entries in the
input file are returned. Input FASTA file must be consistently line-wrapped,
and line wrapping of output is based on input line lengths.

positional arguments:
  fasta                 FASTA file
  regions               space separated regions of sequence to fetch e.g.
                        chr1:1-1000

optional arguments:
  -h, --help            show this help message and exit
  --no-rebuild          do not rebuild the .fai index even if it is out of
                        date. default: False
  --version      

Split the uploaded file into one file per genome by running these commands. (Additionally, this makes the names of the files consistently end in `*.mito.fa`.)

In [4]:
import os
import sys
import fnmatch
example_produced_file = "NCYC3594.mito.fa"
if not os.path.isfile(example_produced_file): #so won't run again if already ran
    name_part_to_match = ".fa"
    name_part_to_expand_to = ".mito.fa"
    old_files_with_ext = []
    for file in os.listdir('.'):
        if fnmatch.fnmatch(file, '*'+name_part_to_match):
            old_files_with_ext.append(file)
    files_to_not_touch_despite_match = old_files_with_ext     
    seq_file = "SGDs288CplusPacBio_ADJUSTEDplusWoltersnW303forALIGNERS.fa"
    !faidx --split-files {seq_file}
    new_fasta = []
    for file in os.listdir('.'):
        if fnmatch.fnmatch(file, '*'+name_part_to_match) and file not in files_to_not_touch_despite_match:
            new_fasta.append(file)
    #fix name if it needs fixing
    for file in new_fasta:
        if not fnmatch.fnmatch(file, '*'+name_part_to_expand_to):
            new_file_name = file.split(".fa")[0] + name_part_to_expand_to
            !mv {file} {new_file_name}

Make a list of the mitochondrial genomes.

In [11]:
import os
import sys
import fnmatch
name_part_to_match = ".mito.fa"
genomes= []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+name_part_to_match):
        #print (file)
        #first_part_filen = file.rsplit(name_part_to_match,1)[0]
        genomes.append(file)
len(genomes)

14

In [8]:
import time

def executeSomething():
    #code here
    print ('.')
    time.sleep(480) #60 seconds times 8 minutes

while True:
    executeSomething()

.


KeyboardInterrupt: 


Now you are prepared to analyze each genome.

## Calculating ambiguous nts and %G+C for the mitochondrial genomes

Mainly want NCYC3594 and NCYC3585.

First count all the letters present and make a dataframe and then add a column with %G+C:

In [13]:
from pyfaidx import Fasta
import pandas as pd
import collections
nt_counts = {}
for g in genomes:
    if ".mito.fa" in g:
        strain_id = g.split(".mito.fa")[0]
    else:
        strain_id = g.split(".re.fa")[0][18:]
    concatenated_seqs = ""
    chrs = Fasta(g)
    for x in chrs:
        #print(x.name)
        concatenated_seqs += str(x)
    nt_counts[strain_id] = collections.Counter(concatenated_seqs)
nt_count_df = pd.DataFrame.from_dict(nt_counts, orient='index').fillna(0)
nt_count_df["Total_nts"] = nt_count_df.sum(1)
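def percent_GCcalc(items):
    '''
    takes a row of three values (C count, G count, total nts) and returns
    the sum of the first two as a fraction of the third item (the total)
    '''
    # positional access via .iloc, since each row arrives as a labeled Series
    return (items.iloc[0] + items.iloc[1])/items.iloc[2]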
nt_count_df['%G+C'] = nt_count_df[['C','G','Total_nts']].apply(percent_GCcalc, axis=1)
nt_count_df = nt_count_df.sort_values('Total_nts',ascending=False)
#nt_count_df = nt_count_df.sort_values(['% N', 'Total_nts'],ascending=[0,0])
nt_count_df = nt_count_df.sort_index()
# df.iloc[np.lexsort((df.index, df.A.values))] # from https://stackoverflow.com/a/49354905/8508004
#nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','% N':'{:.2%}'})
nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','%G+C':'{:.2%}'})
nt_count_df_styled

| Strain | C | T | A | G | N | Total_nts | %G+C |
|---|---|---|---|---|---|---|---|
| DBVPG6044 | 6790 | 33027 | 33789 | 7487 | 0 | 81100.0 | 17.61% |
| DBVPG6765 | 6161 | 33876 | 34663 | 7022 | 0 | 81700.0 | 16.13% |
| NCYC3585 | 5447 | 32201 | 32856 | 6092 | 0 | 76600.0 | 15.06% |
| NCYC3594 | 5977 | 32772 | 33412 | 6756 | 0 | 78900.0 | 16.13% |
| S288c | 6865 | 34943 | 36174 | 7811 | 0 | 85800.0 | 17.11% |
| SGDS288C | 6863 | 34934 | 36169 | 7813 | 0 | 85800.0 | 17.11% |
| SK1 | 7079 | 34384 | 35306 | 7869 | 0 | 84600.0 | 17.66% |
| UFRJ50816 | 5502 | 31938 | 33611 | 6335 | 0 | 77400.0 | 15.30% |
| UWOPS034614 | 4913 | 31590 | 32191 | 5485 | 0 | 74200.0 | 14.02% |
| UWOPS919171 | 4879 | 30683 | 31981 | 5628 | 0 | 73200.0 | 14.36% |


Done. Turns out there are no unknown nucleotides in NCYC3585 or NCYC3594.
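As a quick arithmetic check, the %G+C for the two strains of interest can be recomputed by hand from the per-letter counts shown in the output for NCYC3585 and NCYC3594. This is only a sketch: the counts are copied from that output, and the helper name `gc_fraction` is hypothetical. The `Total_nts` column in the output appears to be displayed in rounded form, so the exact sums computed here differ slightly from it.

```python
# Per-letter counts copied from the notebook output for the two strains
counts = {
    "NCYC3585": {"C": 5447, "T": 32201, "A": 32856, "G": 6092, "N": 0},
    "NCYC3594": {"C": 5977, "T": 32772, "A": 33412, "G": 6756, "N": 0},
}

def gc_fraction(letter_counts):
    """Fraction of G plus C among all counted letters (hypothetical helper)."""
    total = sum(letter_counts.values())
    return (letter_counts["G"] + letter_counts["C"]) / total

for strain, c in counts.items():
    print(f"{strain}: total={sum(c.values())}, %G+C={gc_fraction(c):.2%}")
# NCYC3585: total=76596, %G+C=15.06%
# NCYC3594: total=78917, %G+C=16.13%
```

Both values agree with the `%G+C` column computed by the dataframe code.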
Sort on %G+C.

In [14]:
nt_count_df_alt = nt_count_df.sort_values('%G+C',ascending=False)
#nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','% N':'{:.2%}'})
nt_count_df_styled_alt = nt_count_df_alt.style.format({'Total_nts':'{:.2E}','%G+C':'{:.2%}'})
nt_count_df_styled_alt

| Strain | C | T | A | G | N | Total_nts | %G+C |
|---|---|---|---|---|---|---|---|
| SK1 | 7079 | 34384 | 35306 | 7869 | 0 | 84600.0 | 17.66% |
| DBVPG6044 | 6790 | 33027 | 33789 | 7487 | 0 | 81100.0 | 17.61% |
| SGDS288C | 6863 | 34934 | 36169 | 7813 | 0 | 85800.0 | 17.11% |
| S288c | 6865 | 34943 | 36174 | 7811 | 0 | 85800.0 | 17.11% |
| W303 | 7593 | 38585 | 40089 | 8594 | 10 | 94900.0 | 17.06% |
| Y12 | 6424 | 34065 | 35073 | 7306 | 0 | 82900.0 | 16.57% |
| NCYC3594 | 5977 | 32772 | 33412 | 6756 | 0 | 78900.0 | 16.13% |
| DBVPG6765 | 6161 | 33876 | 34663 | 7022 | 0 | 81700.0 | 16.13% |
| UFRJ50816 | 5502 | 31938 | 33611 | 6335 | 0 | 77400.0 | 15.30% |
| NCYC3585 | 5447 | 32201 | 32856 | 6092 | 0 | 76600.0 | 15.06% |


I noted the lack of ambiguous nts at the end of `GSD Calculating_GC_for_nuclear_and_mitochondrial_genomes_of_SGD_reference_and_PB_set.ipynb`, where I had first pondered whether I could take the numbers reported in Wolters et al., 2015 at face value. Turns out I can, because there is nothing but `G`, `A`, `T`, and `C` in those two assemblies.
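The "nothing but `GATC`" check can also be made explicit rather than read off the dataframe columns. The following is a minimal sketch, not the code used in this notebook: the function name `ambiguous_symbols` and the toy input strings are assumptions for illustration, and it simply reports every character that is not one of the four unambiguous bases.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")

def ambiguous_symbols(seq):
    """Return a dict of every non-ACGT character in seq and its count."""
    tallies = Counter(seq.upper())
    return {sym: n for sym, n in tallies.items() if sym not in UNAMBIGUOUS}

# Toy strings for illustration only, not real genome data
clean = ambiguous_symbols("GATCGATC")
flagged = ambiguous_symbols("GATNNCRATC")
print(clean)    # {}
print(flagged)  # {'N': 2, 'R': 1}
```

An empty dict corresponds to an assembly made up solely of `G`, `A`, `T`, and `C`. Anything else, such as `N` or another IUPAC ambiguity code like `R`, shows up with its count.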

----