# GSD: Calculating %G+C for the nuclear and mitochondrial genomes of S288C reference sequence at SGD

What is the exact %G+C of the nuclear and mitochondrial genomes in the *S. cerevisiae* S288C reference sequence at SGD?


Reference for sequence data:  
- [Life with 6000 genes. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG. Science. 1996 Oct 25;274(5287):546, 563-7. PMID: 8849441](https://www.ncbi.nlm.nih.gov/pubmed/8849441)
  
Links to the specific data are included in the code.  
  
Uses code developed for `GSD Assessing_ambiguous_nts_in_complete_PB_genomes.ipynb`, `GSD Assessing_ambiguous_nts_in_nuclear_PB_genomes.ipynb` & `GSD Assessing_ambiguous_nts_in_1011_collection_genomes.ipynb`.  
  
-----

## Preparation

Get the necessary packages and sequence data.


In [1]:
!pip install pyfaidx

Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Building wheels for collected packages: pyfaidx
  Building wheel for pyfaidx (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2


Get the genome data by running these commands.

In [2]:
# Cannot simply use `S288C_reference_sequence_R64-2-1_20150113.fsa` from http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz
# because the nuclear & mito sequences need to be in separate files here
# NUCLEAR
def add_strain_id_to_description_line(file,strain_id):
    '''
    Takes a FASTA file and edits every description line to add 
    strain_id immediately after the leading `>`.
    
    Overwrites the original file with the edited version.
    '''
    import sys
    output_file_name = "temp.txt"
    # prepare output file for saving so it will be open and ready
    with open(output_file_name, 'w') as output_file:

        # read in the input file
        with open(file, 'r') as input_handler:
            # prepare to give feedback later or allow skipping to certain start
            lines_processed = 0

            for line in input_handler:
                lines_processed += 1
                if line.startswith(">"):
                    # strip only the leading ">" so any later ">" in the line is kept
                    new_line = ">" + strain_id + line[1:]
                else:
                    new_line = line
                
                # Send text to output
                output_file.write(new_line)

    
    # replace the original file with edited
    !mv temp.txt {file}
    # Feedback
    sys.stderr.write("\n{} chromosome identifiers tagged.".format(file))
chr_nums = ["01","02","03","04","05","06","07","08","09","10","11",
    "12","13","14","15","16"]
g_fns = []
for chr_num in chr_nums:
    fn = "chr"+chr_num+".fsa"
    !curl -OL http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/{fn}
    g_fns.append(fn)
    # edit description lines to work similarly to the PacBio data
    tag = "SGD_REFchr"+chr_num+" "
    add_strain_id_to_description_line(fn,tag)
# concatenate all nuclear chromosomes into one file, matching the layout of the files from https://yjx1217.github.io/Yeast_PacBio_2016/data/
!cat {" ".join(g_fns)} > SGD_REF_nuc.genome.fa
!rm chr*.fsa

#MITO
!curl -OL http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa
# Edit the description line to be clear and concise for later use & rename to match the nuclear naming
!sed -i '1s/.*/>SGD_REFmito/' chrmt.fsa
#!head chrmt.fsa
!mv chrmt.fsa SGD_REF_mito.genome.fa

# If the entire SGD reference sequence, including mitochondria, is wanted as one file, this could be used instead:
'''
# Get SGD reference sequence that includes nuclear and mitochondrial sequence as one file,
# among others. I'll use file name for the reference genome worked out 
# in `GSD Assessing_ambiguous_nts_in_nuclear_PB_genomes.ipynb`, so more of the
# previously worked out code will work.
!curl -OL http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz
!tar -xzf S288C_reference_genome_Current_Release.tgz
!rm S288C_reference_genome_Current_Release.tgz
!mv S288C_reference_genome_R64-2-1_20150113/S288C_reference_sequence_R64-2-1_20150113.fsa ./SGD_REF.genome.fa
!rm -rf S288C_reference_genome_R64-2-1_20150113
''';

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  228k  100  228k    0     0   587k      0 --:--:-- --:--:-- --:--:--  587k



chr01.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  807k  100  807k    0     0  1964k      0 --:--:-- --:--:-- --:--:-- 1964k



chr02.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  314k  100  314k    0     0   782k      0 --:--:-- --:--:-- --:--:--  782k



chr03.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1521k  100 1521k    0     0  3155k      0 --:--:-- --:--:-- --:--:-- 3149k



chr04.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  572k  100  572k    0     0  1278k      0 --:--:-- --:--:-- --:--:-- 1278k



chr05.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  268k  100  268k    0     0   727k      0 --:--:-- --:--:-- --:--:--  727k



chr06.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1083k  100 1083k    0     0  2091k      0 --:--:-- --:--:-- --:--:-- 2087k



chr07.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  558k  100  558k    0     0  1266k      0 --:--:-- --:--:-- --:--:-- 1266k



chr08.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  436k  100  436k    0     0  1078k      0 --:--:-- --:--:-- --:--:-- 1076k



chr09.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  740k  100  740k    0     0  1694k      0 --:--:-- --:--:-- --:--:-- 1694k



chr10.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  662k  100  662k    0     0  1611k      0 --:--:-- --:--:-- --:--:-- 1607k



chr11.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1070k  100 1070k    0     0  2422k      0 --:--:-- --:--:-- --:--:-- 2422k



chr12.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  917k  100  917k    0     0  2039k      0 --:--:-- --:--:-- --:--:-- 2039k



chr13.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  778k  100  778k    0     0  1774k      0 --:--:-- --:--:-- --:--:-- 1770k



chr14.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1083k  100 1083k    0     0  2381k      0 --:--:-- --:--:-- --:--:-- 2386k



chr15.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  941k  100  941k    0     0  1941k      0 --:--:-- --:--:-- --:--:-- 1937k



chr16.fsa chromosome identifiers tagged.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 87344  100 87344    0     0   287k      0 --:--:-- --:--:-- --:--:--  287k
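
The header tagging above relies on the IPython `!mv` shell escape, so it only runs inside a notebook. As a rough sketch of the same idea in plain Python, the function below rewrites the description lines through a temporary file and then swaps it into place. The name `tag_fasta_headers` and the demo record are made up for illustration and are not part of the original notebook.

```python
import os
import tempfile


def tag_fasta_headers(path, tag):
    """Prefix every FASTA description line in `path` with `tag`, in place.

    Hypothetical helper for illustration; it mirrors the notebook's
    add_strain_id_to_description_line without any shell commands.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the edited copy next to the original so the final swap
    # stays on the same filesystem.
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        with open(path) as source:
            for line in source:
                if line.startswith(">"):
                    # Insert the tag right after the leading ">".
                    line = ">" + tag + line[1:]
                tmp.write(line)
    # Replace the original file with the edited copy.
    os.replace(tmp.name, path)


# Demo on a made-up single-record file (not real SGD data).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "chr01.fsa")
    with open(demo_path, "w") as handle:
        handle.write(">demo record [chromosome=I]\nCCACACC\n")
    tag_fasta_headers(demo_path, "SGD_REFchr01 ")
    with open(demo_path) as handle:
        tagged = handle.read()
print(tagged)
```

Writing to a temporary file first means an interrupted run leaves the original FASTA file untouched.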


In [3]:
# Make a list of all `genome.fa` files, excluding `genome.fa.nhr`, `genome.fa.nin`, and `genome.fa.nsq`.
# The exclusion was only necessary because I had run some BLAST queries preliminarily in development. Normally, 
# it would just be the `.re.fa` at the outset. (But keeping because removal ability could be useful.)
fn_to_check = "genome.fa" 
genomes = []
import os
import fnmatch
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+fn_to_check):
        if not file.endswith(".nhr") and not file.endswith(".nin") and not file.endswith(".nsq") :
            genomes.append(file)
genomes

['SGD_REF_mito.genome.fa', 'SGD_REF_nuc.genome.fa']
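
The same listing can be sketched with `pathlib`. A glob pattern has to match the whole file name, so a pattern ending in `genome.fa` already skips BLAST index files such as `*.genome.fa.nhr`. The function name `list_genome_files` and the demo file names are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def list_genome_files(directory=".", suffix="genome.fa"):
    """Return the sorted names of files in `directory` that end with `suffix`."""
    return sorted(
        p.name for p in Path(directory).glob("*" + suffix) if p.is_file()
    )


# Demo in a scratch directory with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in [
        "SGD_REF_nuc.genome.fa",
        "SGD_REF_mito.genome.fa",
        "SGD_REF_nuc.genome.fa.nhr",  # BLAST index file, should be skipped
        "notes.txt",
    ]:
        (Path(demo_dir) / name).touch()
    found = list_genome_files(demo_dir)
print(found)
```

Sorting the names makes the order independent of how the operating system happens to list the directory.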


Now you are prepared to analyze each genome.

## Calculating %G+C for the nuclear and mitochondrial genomes of S288C reference sequence at SGD

First, count every letter present in each genome, collect the counts in a dataframe, and then add a column with %G+C:

In [4]:
from pyfaidx import Fasta
import pandas as pd
import collections
nt_counts = {}
for g in genomes:
    if ".genome.fa" in g:
        strain_id = g.split(".genome.fa")[0]
    else:
        strain_id = g.split(".re.fa")[0][18:]
    concatenated_seqs = ""
    chrs = Fasta(g)
    for x in chrs:
        #print(x.name)
        concatenated_seqs += str(x)
    nt_counts[strain_id] = collections.Counter(concatenated_seqs)
nt_count_df = pd.DataFrame.from_dict(nt_counts, orient='index').fillna(0)
nt_count_df["Total_nts"] = nt_count_df.sum(1)
def percent_GCcalc(items):
    '''
    takes a sequence of three items (C count, G count, total) and 
    returns the sum of the first two as a fraction of the third
    '''
    return (items[0] + items[1])/items[2]
nt_count_df['%G+C'] = nt_count_df[['C','G','Total_nts']].apply(percent_GCcalc, axis=1)
nt_count_df = nt_count_df.sort_values('Total_nts',ascending=False)
#nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','% N':'{:.2%}'})
nt_count_df_styled = nt_count_df.style.format({'Total_nts':'{:.2E}','%G+C':'{:.2%}'})
nt_count_df_styled

| | T | C | A | G | Total_nts | %G+C |
|---|---|---|---|---|---|---|
| SGD_REF_nuc | 3718146 | 2313713 | 3730180 | 2309287 | 1.21E+07 | 38.30% |
| SGD_REF_mito | 34934 | 6863 | 36169 | 7813 | 8.58E+04 | 17.11% |
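
As a check on the table above, the percentages can be recomputed directly from the four per-nucleotide counts, using %G+C = (G + C) / (A + C + G + T). The short sketch below does that arithmetic with the counts copied from the table. It also prints the exact totals, which the `{:.2E}` display format rounds to three significant figures. The helper name `gc_fraction` is an assumption for illustration.

```python
# Counts copied from the output table above.
counts = {
    "SGD_REF_nuc": {"T": 3718146, "C": 2313713, "A": 3730180, "G": 2309287},
    "SGD_REF_mito": {"T": 34934, "C": 6863, "A": 36169, "G": 7813},
}


def gc_fraction(nt_counts):
    """Return (G + C) divided by the total of all counted letters."""
    total = sum(nt_counts.values())
    return (nt_counts["G"] + nt_counts["C"]) / total


for genome, nts in counts.items():
    total = sum(nts.values())
    print(f"{genome}: total = {total:,} nt, %G+C = {gc_fraction(nts):.2%}")
# SGD_REF_nuc: total = 12,071,326 nt, %G+C = 38.30%
# SGD_REF_mito: total = 85,779 nt, %G+C = 17.11%
```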


In [5]:
#nt_count_df.to_csv('SGD_REF_nt_count.tsv', sep='\t',index = False)    
#nt_count_df.to_pickle("SGD_REF_nt_count.pkl")
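
The counting cell earlier in this section joins all sequences of a genome into one string before counting. For much larger genomes, one possible alternative is to update a `collections.Counter` line by line, as in the sketch below. It reads plain FASTA text with the standard library instead of `pyfaidx`, and it uppercases each line, which the notebook does not do; both choices are assumptions made for illustration, as are the function name `count_nucleotides` and the tiny demo sequence.

```python
import collections
import io


def count_nucleotides(handle):
    """Count sequence letters in a FASTA text stream, one line at a time."""
    counts = collections.Counter()
    for line in handle:
        if line.startswith(">"):
            # Description lines carry no sequence.
            continue
        # Assumption: fold case so soft-masked (lowercase) bases are
        # counted together with uppercase ones.
        counts.update(line.strip().upper())
    return counts


# Made-up two-record example, not real genome data.
demo = io.StringIO(">chrA\nACGT\nGGcc\n>chrB\nTTAA\n")
demo_counts = count_nucleotides(demo)
gc = (demo_counts["G"] + demo_counts["C"]) / sum(demo_counts.values())
print(dict(demo_counts), gc)
```

With a real file, the function would be called on an open handle, for example `with open("SGD_REF_mito.genome.fa") as handle: count_nucleotides(handle)`.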

----