# GSD: Calculating %G+C for the nuclear and mitochondrial genomes of S288C reference sequence at SGD and the PacBio set

What is the exact %G+C for the nuclear and mitochondrial genomes of the *S. cerevisiae* S288C reference sequence at SGD and of a dozen-or-so PacBio-sequenced *cerevisiae* and *paradoxus* genomes from [Yue et al., 2017](https://www.ncbi.nlm.nih.gov/pubmed/28416820)?


References for the sequence data:  
- [Contrasting evolutionary genome dynamics between domesticated and wild yeasts.
Yue JX, Li J, Aigrain L, Hallin J, Persson K, Oliver K, Bergström A, Coupland P, Warringer J, Lagomarsino MC, Fischer G, Durbin R, Liti G. Nat Genet. 2017 Jun;49(6):913-924. doi: 10.1038/ng.3847. Epub 2017 Apr 17. PMID: 28416820](https://www.ncbi.nlm.nih.gov/pubmed/28416820)


- [Life with 6000 genes. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG. Science. 1996 Oct 25;274(5287):546, 563-7. PMID: 8849441](https://www.ncbi.nlm.nih.gov/pubmed/8849441)
  
Links to the specific data are included in the code.  
  
Uses code developed for `GSD Assessing_ambiguous_nts_in_complete_PB_genomes.ipynb`, `GSD Assessing_ambiguous_nts_in_nuclear_PB_genomes.ipynb`, `GSD Assessing_ambiguous_nts_in_1011_collection_genomes.ipynb`, `GSD Calculating_GC_for_nuclear_and_mitochondrial_genomes_of_SGD_reference.ipynb`.  
  
-----

## Preparation

Get the packages and sequence data necessary.


In [1]:
!pip install pyfaidx

Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Building wheels for collected packages: pyfaidx
  Building wheel for pyfaidx (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2


Get the S288C genome data by running these commands.

In [2]:
# Cannot simply use `S288C_reference_sequence_R64-2-1_20150113.fsa` at https://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz
# because want nuclear & mito sequences here to be separate
# NUCLEAR
def add_strain_id_to_description_line(file,strain_id):
    '''
    Takes a file and edits every description line to add
    strain_id right after the leading ">".
    
    Saves the fixed file
    '''
    import sys
    output_file_name = "temp.txt"
    # prepare output file for saving so it will be open and ready
    with open(output_file_name, 'w') as output_file:

        # read in the input file
        with open(file, 'r') as input_handler:
            # prepare to give feedback later or allow skipping to certain start
            lines_processed = 0

            for line in input_handler:
                lines_processed += 1
                if line.startswith(">"):
                    rest_o_line = line.split(">")
                    new_line = ">"+strain_id + rest_o_line[1]
                else:
                    new_line = line
                
                # Send text to output
                output_file.write(new_line)

    
    # replace the original file with edited
    !mv temp.txt {file}
    # Feedback
    sys.stderr.write("\n{} chromosome identifiers tagged.".format(file))
chr_nums = ["01","02","03","04","05","06","07","08","09","10","11",
    "12","13","14","15","16"]
g_fns = []
for chr_num in chr_nums:
    fn = "chr"+chr_num+".fsa"
    !curl -OL https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/{fn}
    g_fns.append(fn)
    # edit or replace description lines to work similar to PacBio
    tag = "SGD_REFchr"+chr_num+" "
    add_strain_id_to_description_line(fn,tag)
# concatenate all nuclear chromosomes into one file, as they are at https://yjx1217.github.io/Yeast_PacBio_2016/data/
!cat {" ".join(g_fns)} > SGD_REF_nuc.genome.fa
!rm chr*.fsa

#MITO
!curl -OL https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa
# Edit the description line to be clear and concise for later & rename to match nuclear
!sed -i '1s/.*/>SGD_REFmito/' chrmt.fsa
#!head chrmt.fsa
!mv chrmt.fsa SGD_REF_mito.genome.fa

# IF WANTED ALL SGD reference sequence, including mitochondria, as one file, could just use this:
'''
# Get SGD reference sequence that includes nuclear and mitochondrial sequence as one file,
# among others. I'll use file name for the reference genome worked out 
# in `GSD Assessing_ambiguous_nts_in_nuclear_PB_genomes.ipynb`, so more of the
# previously worked out code will work.
!curl -OL https://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz
!tar -xzf S288C_reference_genome_Current_Release.tgz
!rm S288C_reference_genome_Current_Release.tgz
!mv S288C_reference_genome_R64-2-1_20150113/S288C_reference_sequence_R64-2-1_20150113.fsa ./SGD_REF.genome.fa
!rm -rf S288C_reference_genome_R64-2-1_20150113
''';

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  228k  100  228k    0     0   545k      0 --:--:-- --:--:-- --:--:--  545k



chr01.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  807k  100  807k    0     0  1778k      0 --:--:-- --:--:-- --:--:-- 1774k



chr02.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  314k  100  314k    0     0   838k      0 --:--:-- --:--:-- --:--:--  838k



chr03.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1521k  100 1521k    0     0  3042k      0 --:--:-- --:--:-- --:--:-- 3042k



chr04.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  572k  100  572k    0     0  1338k      0 --:--:-- --:--:-- --:--:-- 1335k



chr05.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  268k  100  268k    0     0   646k      0 --:--:-- --:--:-- --:--:--  646k



chr06.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1083k  100 1083k    0     0  2215k      0 --:--:-- --:--:-- --:--:-- 2210k



chr07.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  558k  100  558k    0     0  1269k      0 --:--:-- --:--:-- --:--:-- 1269k



chr08.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  436k  100  436k    0     0  1057k      0 --:--:-- --:--:-- --:--:-- 1057k



chr09.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  740k  100  740k    0     0  1595k      0 --:--:-- --:--:-- --:--:-- 1595k



chr10.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  662k  100  662k    0     0  1565k      0 --:--:-- --:--:-- --:--:-- 1565k



chr11.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1070k  100 1070k    0     0  2363k      0 --:--:-- --:--:-- --:--:-- 2358k



chr12.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  917k  100  917k    0     0  1982k      0 --:--:-- --:--:-- --:--:-- 1982k



chr13.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  778k  100  778k    0     0  1707k      0 --:--:-- --:--:-- --:--:-- 1704k



chr14.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1083k  100 1083k    0     0  2365k      0 --:--:-- --:--:-- --:--:-- 2365k



chr15.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  941k  100  941k    0     0  1901k      0 --:--:-- --:--:-- --:--:-- 1901k



chr16.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 87344  100 87344    0     0   275k      0 --:--:-- --:--:-- --:--:--  274k
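
The description-line tagging done in the cell above relies on the IPython `!mv` shell escape, so the function only works inside a notebook. Below is a minimal sketch of the same idea in plain Python, assuming it is acceptable to write to a temporary file in the same directory and swap it in with `os.replace`. The function name `tag_description_lines` and the demo header text are hypothetical, used only for illustration. Unlike the original, it keeps everything after the first `>` even if a later `>` appears in the description.

```python
import os
import tempfile


def tag_description_lines(path, strain_id):
    """Prefix the text of every FASTA description line in `path` with `strain_id`."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file next to the original so the final swap stays on one filesystem.
    with tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as tmp:
        with open(path) as src:
            for line in src:
                if line.startswith(">"):
                    line = ">" + strain_id + line[1:]
                tmp.write(line)
    # Replace the original file with the edited copy.
    os.replace(tmp.name, path)


# Demo on a throwaway file; the header below is made up for illustration.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.fsa")
with open(demo_path, "w") as fh:
    fh.write(">ref|NC_001133| [chromosome=I]\nACGT\n")

tag_description_lines(demo_path, "SGD_REFchr01 ")

with open(demo_path) as fh:
    tagged_lines = fh.read().splitlines()
print(tagged_lines[0])
```

Sequence lines pass through untouched, so only the identifiers change.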


Get the sequences from Yue et al., 2017 (PacBio-sequenced genomes) using the next two cells.

In [3]:
import pandas as pd
# Prepare for getting PacBio (Yue et al 2017 sequences)
#make a list of the strain designations
yue_et_al_strains = ["S288C","DBVPG6044","DBVPG6765","SK1","Y12",
                     "YPS128","UWOPS034614","CBS432","N44","YPS138",
                     "UFRJ50816","UWOPS919171"]
# Get & unpack the genome sequences from strains 
for s in yue_et_al_strains:
    !curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/{s}.genome.fa.gz
    !curl -OL http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/{s}.mt.genome.fa.gz
    !gunzip -f {s}.genome.fa.gz
    !gunzip -f {s}.mt.genome.fa.gz
    # rename the files to follow the convention used for SGD reference
    !mv {s}.genome.fa {s}_nuc.genome.fa
    !mv {s}.mt.genome.fa {s}_mito.genome.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2472      0 --:--:-- --:--:-- --:--:--  2472
100 3687k  100 3687k    0     0  8232k      0 --:--:-- --:--:-- --:--:-- 8232k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3955      0 --:--:-- --:--:-- --:--:--  3955
100 22109  100 22109    0     0   156k      0 --:--:-- --:--:-- --:--:--  156k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3490      0 --:--:-- --:--:-- --:--:--  3490
100 3387k  100 3387k    0     0  14.0M      0 --:--:-- --:--:-- --:--:-- 14.0M
  % Total    % Received % Xferd  Average Speed   Tim

In [4]:
# add identifiers to each `chr` so results for each strain clear later
for s in yue_et_al_strains:
    add_strain_id_to_description_line(s+"_nuc.genome.fa",s)
    add_strain_id_to_description_line(s+"_mito.genome.fa",s)


S288C_nuc.genome.fa chromosome identifiers tagged.
S288C_mito.genome.fa chromosome identifiers tagged.
DBVPG6044_nuc.genome.fa chromosome identifiers tagged.
DBVPG6044_mito.genome.fa chromosome identifiers tagged.
DBVPG6765_nuc.genome.fa chromosome identifiers tagged.
DBVPG6765_mito.genome.fa chromosome identifiers tagged.
SK1_nuc.genome.fa chromosome identifiers tagged.
SK1_mito.genome.fa chromosome identifiers tagged.
Y12_nuc.genome.fa chromosome identifiers tagged.
Y12_mito.genome.fa chromosome identifiers tagged.
YPS128_mito.genome.fa chromosome identifiers tagged.
UWOPS034614_nuc.genome.fa chromosome identifiers tagged.
UWOPS034614_mito.genome.fa chromosome identifiers tagged.
CBS432_nuc.genome.fa chromosome identifiers tagged.
CBS432_mito.genome.fa chromosome identifiers tagged.
N44_nuc.genome.fa chromosome identifiers tagged.
N44_mito.genome.fa chromosome identifiers tagged.
YPS138_nuc.genome.fa chromosome identifiers tagged.
YPS138_mito.genome.fa chromosome identifiers tagged.

In [5]:
# Make a list of all `genome.fa` files, excluding `genome.fa.nhr`, `genome.fa.nin`, and `genome.fa.nsq`
# The excluding was only necessary because I had run some BLAST queries preliminarily in development. Normally, 
# it would just be the `.re.fa` at the outset. (But keeping because removal ability could be useful.)
fn_to_check = "genome.fa" 
genomes = []
import os
import fnmatch
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+fn_to_check):
        if not file.endswith(".nhr") and not file.endswith(".nin") and not file.endswith(".nsq") :
            genomes.append(file)
genomes

['S288C_mito.genome.fa',
 'SGD_REF_mito.genome.fa',
 'Y12_mito.genome.fa',
 'DBVPG6044_mito.genome.fa',
 'N44_mito.genome.fa',
 'S288C_nuc.genome.fa',
 'DBVPG6765_mito.genome.fa',
 'UFRJ50816_mito.genome.fa',
 'UWOPS919171_nuc.genome.fa',
 'CBS432_nuc.genome.fa',
 'YPS128_nuc.genome.fa',
 'SK1_nuc.genome.fa',
 'YPS138_mito.genome.fa',
 'SGD_REF_nuc.genome.fa',
 'CBS432_mito.genome.fa',
 'YPS128_mito.genome.fa',
 'SK1_mito.genome.fa',
 'UWOPS919171_mito.genome.fa',
 'DBVPG6044_nuc.genome.fa',
 'UWOPS034614_nuc.genome.fa',
 'N44_nuc.genome.fa',
 'UWOPS034614_mito.genome.fa',
 'UFRJ50816_nuc.genome.fa',
 'YPS138_nuc.genome.fa',
 'DBVPG6765_nuc.genome.fa',
 'Y12_nuc.genome.fa']
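
As an aside on the file listing above: `fnmatch` and `glob` match a pattern against the whole file name, so a pattern ending in `genome.fa` should already skip BLAST index files such as `genome.fa.nhr`. The following is a small sketch of an equivalent listing with `glob`, run against a temporary directory of empty placeholder files. The helper name `list_genome_files` and the `A_...` file names are hypothetical.

```python
import glob
import os
import tempfile


def list_genome_files(directory="."):
    """Return the sorted names of files in `directory` that end in `genome.fa`."""
    pattern = os.path.join(directory, "*genome.fa")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


# Demo with empty placeholder files, including a BLAST-style index file.
demo_dir = tempfile.mkdtemp()
for name in ["A_nuc.genome.fa", "A_mito.genome.fa", "A_nuc.genome.fa.nhr"]:
    open(os.path.join(demo_dir, name), "w").close()

found = list_genome_files(demo_dir)
print(found)
```

Sorting the result also makes the order reproducible, which `os.listdir` does not guarantee.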


Now you are prepared to analyze each genome.

## Calculating %G+C for the nuclear and mitochondrial genomes of S288C reference sequence at SGD, along with the PacBio-sequenced genomes from Yue et al., 2017

First count all the letters present and make a dataframe and then add a column with %G+C:

In [6]:
from pyfaidx import Fasta
import pandas as pd
import collections
nt_counts = {}
for g in genomes:
    if ".genome.fa" in g:
        strain_id = g.split(".genome.fa")[0]
    else:
        strain_id = g.split(".re.fa")[0][18:]
    concatenated_seqs = ""
    chrs = Fasta(g)
    for x in chrs:
        #print(x.name)
        concatenated_seqs += str(x)
    nt_counts[strain_id] = collections.Counter(concatenated_seqs)
nt_count_df = pd.DataFrame.from_dict(nt_counts, orient='index').fillna(0)
nt_count_df["Total_nts"] = nt_count_df.sum(1)
def percent_GCcalc(items):
    '''
    takes a list of three and calculates the fraction that the sum of the first
    two items makes up of the total (third item)
    '''
    return (items[0] + items[1])/items[2]
nt_count_df['%G+C'] = nt_count_df[['C','G','Total_nts']].apply(percent_GCcalc, axis=1)
nt_count_df = nt_count_df.sort_values('Total_nts',ascending=False)
#nt_count_df = nt_count_df.sort_values(['% N', 'Total_nts'],ascending=[0,0])
nt_count_df = nt_count_df.sort_index()
# df.iloc[np.lexsort((df.index, df.A.values))] # from https://stackoverflow.com/a/49354905/8508004
#nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','% N':'{:.2%}'})
nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','%G+C':'{:.2%}'})
nt_count_df_styled

| | A | T | G | C | N | Total_nts | %G+C |
|---|---|---|---|---|---|---|---|
| CBS432_mito | 31189 | 30251 | 5343 | 4699 | 0 | 71500.0 | 14.05% |
| CBS432_nuc | 3683966 | 3675728 | 2322732 | 2321545 | 17357 | 12000000.0 | 38.63% |
| DBVPG6044_mito | 33789 | 33027 | 7487 | 6790 | 0 | 81100.0 | 17.61% |
| DBVPG6044_nuc | 3689297 | 3677723 | 2285884 | 2287780 | 17357 | 12000000.0 | 38.25% |
| DBVPG6765_mito | 34663 | 33876 | 7022 | 6161 | 0 | 81700.0 | 16.13% |
| DBVPG6765_nuc | 3647252 | 3632750 | 2258080 | 2257749 | 17357 | 11800000.0 | 38.23% |
| N44_mito | 30428 | 29537 | 5343 | 4640 | 0 | 69900.0 | 14.27% |
| N44_nuc | 3620330 | 3617409 | 2278564 | 2278029 | 17357 | 11800000.0 | 38.58% |
| S288C_mito | 36174 | 34943 | 7811 | 6865 | 0 | 85800.0 | 17.11% |
| S288C_nuc | 3752659 | 3738132 | 2323469 | 2325532 | 17357 | 12200000.0 | 38.24% |
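
Note that `Total_nts` in the cell above is the sum of every counted character, so any `N` positions sit in the denominator of %G+C. Here is a minimal sketch of how much that choice matters, using the `CBS432_nuc` counts from the table above. The function `gc_fraction` and its `include_n` flag are hypothetical names for illustration, not part of the notebook.

```python
def gc_fraction(counts, include_n=True):
    """Return (G + C) / total for a dict of nucleotide counts.

    With include_n=True the denominator covers every counted character,
    matching the Total_nts column; otherwise only A, C, G and T are used.
    """
    gc = counts.get("G", 0) + counts.get("C", 0)
    if include_n:
        total = sum(counts.values())
    else:
        total = sum(counts.get(nt, 0) for nt in "ACGT")
    return gc / total if total else 0.0


# Counts copied from the CBS432_nuc row of the table above.
cbs432_nuc = {"A": 3683966, "T": 3675728, "G": 2322732, "C": 2321545, "N": 17357}

with_n = gc_fraction(cbs432_nuc)
without_n = gc_fraction(cbs432_nuc, include_n=False)
print("{:.2%} with N, {:.2%} without N".format(with_n, without_n))
```

For this row the two denominators differ by only a few hundredths of a percentage point, but the gap grows with the share of ambiguous positions.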


To make it easier to compare the different mitochondrial sequences to each other, we can sort on the %G+C. Plus, we'll add the species info.

In [7]:
# adding species info preparation
species_dict = {
    "SGD_REF":"cerevisiae",
    "S288C":"cerevisiae",
    "DBVPG6765":"cerevisiae",
    "Y12":"cerevisiae",
    "YPS128":"cerevisiae",
    "UWOPS034614":"cerevisiae",
    "CBS432":"paradoxus",
    "N44":"paradoxus",
    "YPS138":"paradoxus",
    "UWOPS919171":"paradoxus",
    "SK1":"cerevisiae",
    "DBVPG6044":"cerevisiae",
    "UFRJ50816":"paradoxus",
} # dictionary came from `GSD Rpb1_orthologs_in_PB_genomes.ipynb` originally
# make the species_dict work for the nuclear- and mito-distinguished versions
nuc_species = {k+"_nuc":v for k,v in species_dict.items()}
mito_species = {k+"_mito":v for k,v in species_dict.items()}
#merge the dictionaries
species_dict_for_nuc_n_mito = {**nuc_species, **mito_species}

nt_count_df_alt = nt_count_df.sort_values('%G+C',ascending=False)
#nt_count_df_alt = nt_count_df_alt.sort_values(['Total_nts','%G+C'],ascending=[0,0])
nt_count_df_alt['species'] = nt_count_df_alt.index.map(species_dict_for_nuc_n_mito)
# while adding species info, add it to the first dataframe above too
nt_count_df['species'] = nt_count_df.index.map(species_dict_for_nuc_n_mito)
#nt_count_df = nt_count_df.sort_values(['% N', 'Total_nts'],ascending=[0,0])
#nt_count_df = nt_count_df.sort_index()
# df.iloc[np.lexsort((df.index, df.A.values))] # from https://stackoverflow.com/a/49354905/8508004
#nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','% N':'{:.2%}'})
nt_count_df_styled_alt = nt_count_df_alt.style.format({'Total_nts':'{:.2E}','%G+C':'{:.2%}'})
nt_count_df_styled_alt

| | A | T | G | C | N | Total_nts | %G+C | species |
|---|---|---|---|---|---|---|---|---|
| CBS432_nuc | 3683966 | 3675728 | 2322732 | 2321545 | 17357 | 12000000.0 | 38.63% | paradoxus |
| N44_nuc | 3620330 | 3617409 | 2278564 | 2278029 | 17357 | 11800000.0 | 38.58% | paradoxus |
| YPS138_nuc | 3629766 | 3623944 | 2272582 | 2268073 | 17357 | 11800000.0 | 38.44% | paradoxus |
| UFRJ50816_nuc | 3735782 | 3732052 | 2326900 | 2325192 | 17357 | 12100000.0 | 38.33% | paradoxus |
| SGD_REF_nuc | 3730180 | 3718146 | 2309287 | 2313713 | 0 | 12100000.0 | 38.30% | cerevisiae |
| SK1_nuc | 3719655 | 3707198 | 2307796 | 2311279 | 17357 | 12100000.0 | 38.29% | cerevisiae |
| UWOPS919171_nuc | 3652729 | 3646729 | 2270324 | 2269479 | 17357 | 11900000.0 | 38.29% | paradoxus |
| YPS128_nuc | 3672962 | 3663303 | 2279727 | 2277125 | 17357 | 11900000.0 | 38.26% | cerevisiae |
| DBVPG6044_nuc | 3689297 | 3677723 | 2285884 | 2287780 | 17357 | 12000000.0 | 38.25% | cerevisiae |
| S288C_nuc | 3752659 | 3738132 | 2323469 | 2325532 | 17357 | 12200000.0 | 38.24% | cerevisiae |

The mitochondrial sequences vary considerably in %G+C and dip rather low. That is notable given that [Langkjaer (more accurately Langkjær) et al., 2003 (PMID: 12799436)](https://www.ncbi.nlm.nih.gov/pubmed/12799436), the first hit that currently comes up when you search 'percent GC cerevisiae mitochondrial', says, "The S.cerevisiae mtDNA is characterised by a very low GC content, 17–18%..." (I need to see how the number of classified GC elements and ori sequences relates to that pattern.) *paradoxus* tends to be on the lower side; however, the lowest of this group is a *cerevisiae*.
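
Because nuclear and mitochondrial entries share one dataframe, sorting on %G+C interleaves nothing useful: all nuclear rows land above all mitochondrial ones. Below is a small sketch of one way to look at the mitochondrial rows alone and summarize them per species, assuming the `_mito` suffix convention used for the index. The demo dataframe holds only the five mitochondrial rows visible in the first table above plus one nuclear row, so it illustrates the filtering step and says nothing about the full set of strains.

```python
import pandas as pd

# Fractions copied from the rows shown in the first table above.
demo = pd.DataFrame(
    {
        "%G+C": [0.1405, 0.1761, 0.1613, 0.1427, 0.1711, 0.3824],
        "species": ["paradoxus", "cerevisiae", "cerevisiae",
                    "paradoxus", "cerevisiae", "cerevisiae"],
    },
    index=["CBS432_mito", "DBVPG6044_mito", "DBVPG6765_mito",
           "N44_mito", "S288C_mito", "S288C_nuc"],
)

# Keep only mitochondrial entries, lowest %G+C first.
mito_only = demo[demo.index.str.endswith("_mito")].sort_values("%G+C")

# Range of %G+C per species among the mitochondrial entries.
per_species = mito_only.groupby("species")["%G+C"].agg(["min", "max"])

print(mito_only)
print(per_species)
```

In the notebook itself, the same filter could be applied to `nt_count_df` once the `species` column has been added.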

In [8]:
#nt_count_df.to_csv('SGD_REF_and_PB_nt_count.tsv', sep='\t',index = False)    
#nt_count_df.to_pickle("SGD_REF_and_PB_nt_count.pkl")

----