# Basic statistics of SARS-CoV-2 virus
-   What is the size of the SARS-CoV-2 genome?
-   What is the GC content of the genome?
    -   Is it different from other viruses?
    -   What is the GC content of the human genome?
-   How many genes are in the SARS-CoV-2 genome?
    -   How many genes are recorded on the NCBI page?
    -   Can it be predicted by programming? (by searching for the start codon ATG?)
-   How many functional proteins?
    -   What are their sizes?

## What is the size of the SARS-CoV-2 genome?

In [12]:
from Bio import SeqIO

# The reference FASTA holds a single record; SeqIO.read raises an error otherwise.
record = SeqIO.read('../data/01_raw/NC_045512.fasta', 'fasta')
genome_size = len(record.seq)
print('The size of SARS-CoV-2 genome is {:,} bp'.format(genome_size))

The size of SARS-CoV-2 genome is 29,903 bp


## What is the GC content of the genome?

In [13]:
# Note: Bio.SeqUtils.GC returns a percentage (0-100); recent Biopython releases
# replace it with gc_fraction, which returns a fraction (0-1).
from Bio.SeqUtils import GC
from collections import Counter
# %d drops the fractional part of the percentage rather than rounding it.
print('The GC content of SARS-CoV-2 genome is %d%%' % GC(record.seq))

# Count each base to get the per-nucleotide composition.
nt_comp = Counter(record.seq)
print('Nucleotide composition of SARS-CoV-2: A(%d%%), T(%d%%), C(%d%%), G(%d%%)'
      % tuple(nt_comp[nt] / genome_size * 100 for nt in 'ATCG'))

The GC content of SARS-CoV-2 genome is 37%
Nucleotide composition of SARS-CoV-2: A(29%), T(32%), C(18%), G(19%)
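
The GC percentage reported above is the share of G and C bases in the sequence. As a rough, dependency-free sketch of that calculation (the `gc_percent` helper and the toy sequence are illustrative assumptions, not part of this notebook or of Biopython), the following also shows how `%d` formatting drops the fractional part of the value:

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


toy = "GCATGC"  # made-up 6-nt sequence, not real viral data
value = gc_percent(toy)  # 4 of 6 bases are G or C
print("%d%%" % value)    # prints 66% (fractional part dropped)
print("%.1f%%" % value)  # prints 66.7%
```

Unlike Biopython's helpers, this sketch ignores ambiguous nucleotide codes such as `S` or `N`, so it is only a reasonable check for sequences made of plain A, C, G, and T.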


## Is it different from other viruses?
-   Astrovirus
-   Coronavirus
-   Coxsackie virus A9
-   Dengue virus type 1
-   Dengue virus type 2
-   Dengue virus type 3
-   Dengue virus type 4
-   Enterovirus 71
-   ...

In [17]:
import os
import gzip
import numpy as np

# Collect every record from the gzipped RefSeq viral FASTA files.
DIR = r'../data/01_raw/refseq/viral/'
viral_records = []
for folder in os.listdir(DIR):
    for f in os.listdir(os.path.join(DIR, folder)):
        if f.endswith(".fna.gz"):
            with gzip.open(os.path.join(DIR, folder, f), 'rt') as handle:
                viral_records.extend(SeqIO.parse(handle, 'fasta'))

# Accession numbers of the coronavirus genomes used for the comparison.
ls_corona = ['NC_002306.3', 'NC_002645.1', 'NC_005831.2', 'NC_010437.1', 'EU420137.1',
'NC_010438.1', 'NC_003436.1', 'DQ811787.1', 'NC_009988.1', 'NC_009657.1', 'NC_038861.1',
'NC_003045.1', 'LC061274.1', 'NC_006577.2', 'NC_006213.1', 'NC_001846.1', 'DQ011855.1',
'NC_004718.3', 'NC_009694.1', 'NC_019843.3', 'NC_009021.1', 'NC_009020.1', 'NC_009019.1',
'NC_001451.1', 'NC_010646.1', 'NC_010800.1', 'NC_011547.1', 'NC_011550.1', 'NC_011549.1']

# Report the GC content of each listed coronavirus found among the records.
GCs_viral = []
for viral in viral_records:
    if viral.id in ls_corona:
        GC_viral = GC(viral.seq)
        GCs_viral.append(GC_viral)
        print('%s:' % viral.description)
        print('The GC content of genome is %d%%' % GC_viral)
print('The mean GC content of coronavirus other than SARS-CoV-2 genome is %d%%' % np.mean(GCs_viral))


NC_003436.1 Porcine epidemic diarrhea virus, complete genome:
The GC content of genome is 42%
NC_002645.1 Human coronavirus 229E, complete genome:
The GC content of genome is 38%
NC_005831.2 Human Coronavirus NL63, complete genome:
The GC content of genome is 34%
NC_002306.3 Feline infectious peritonitis virus, complete genome:
The GC content of genome is 38%
NC_006577.2 Human coronavirus HKU1, complete genome:
The GC content of genome is 32%
NC_001846.1 Murine hepatitis virus strain MHV-A59 C12 mutant, complete genome:
The GC content of genome is 41%
NC_003045.1 Bovine coronavirus isolate BCoV-ENT, complete genome:
The GC content of genome is 37%
NC_001451.1 Avian infectious bronchitis virus, complete genome:
The GC content of genome is 37%
NC_004718.3 SARS coronavirus Tor2, complete genome:
The GC content of genome is 40%
NC_009021.1 Rousettus bat coronavirus HKU9, complete genome:
The GC content of genome is 41%
NC_009019.1 Tylonycteris bat coronavirus HKU4, complete genome:
The GC 