# Sequence metrics for Trinity contigs

First, let's import the modules needed to manipulate sequence data and set the path to the FASTA file containing the Trinity assembly:

In [4]:
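from Bio import SeqIO
# GC returns a percentage (0-100). Newer Biopython releases replace it with
# gc_fraction, which returns a fraction (0-1) instead.
from Bio.SeqUtils import GC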

fastafile = "Trinity.120.fasta"

## G-C content:

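Calculate GC% for each record, then the mean GC% across the whole multi-FASTA file. This is an unweighted mean of the per-record values, so every contig counts equally regardless of its length: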

In [6]:
# Adapted from http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc298
# Generator yielding the GC% of each record
gc_values = (GC(rec.seq) for rec in SeqIO.parse(fastafile, "fasta"))

# Mean GC% over all records in the multi-FASTA
gc_sum = 0
seq_total = 0
for value in gc_values:
    gc_sum += value
    seq_total += 1
print("Mean GC% of Trinity contigs = {}%".format(round(gc_sum / seq_total, 2)))

Mean GC% of Trinity contigs = 34.31%
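
The figure above averages per-contig GC%, which is not the same as the GC% of all assembled bases pooled together. The sketch below shows the difference on made-up sequences, using plain Python strings so it runs without Biopython. The names `gc_percent` and `gc_summary` are hypothetical helpers, and this simple version counts only G and C, so it may differ slightly from `Bio.SeqUtils.GC` when sequences contain ambiguity codes.

```python
def gc_percent(seq):
    """Return GC% (0-100) of one sequence; case-insensitive, counts only G and C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_summary(seqs):
    """Return (unweighted mean of per-sequence GC%, length-weighted GC%)."""
    per_seq_sum = 0.0
    n_seqs = 0
    gc_bases = 0
    total_bases = 0
    for seq in seqs:
        s = seq.upper()
        per_seq_sum += gc_percent(s)
        n_seqs += 1
        gc_bases += s.count("G") + s.count("C")
        total_bases += len(s)
    if n_seqs == 0 or total_bases == 0:
        raise ValueError("no sequence data")
    return per_seq_sum / n_seqs, 100.0 * gc_bases / total_bases


# Made-up toy sequences, not taken from the Trinity assembly
toy = ["GGCC", "ATATATATATAT"]
mean_gc, weighted_gc = gc_summary(toy)
print(round(mean_gc, 2), round(weighted_gc, 2))
```

To try this on the real assembly, one option is to pass `(str(rec.seq) for rec in SeqIO.parse(fastafile, "fasta"))` as `seqs`.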


## N50:

We'll use [assembly-stats](https://github.com/sanger-pathogens/assembly-stats) v1.0.1 to calculate N50:

In [7]:
%%bash -s "$fastafile"
/home/gabriel/anaconda3/envs/git/bin/assembly-stats "$1"

stats for Trinity.120.fasta
sum = 56658515, n = 83424, ave = 679.16, largest = 8714
N50 = 905, n = 17781
N60 = 712, n = 24839
N70 = 551, n = 33891
N80 = 422, n = 45692
N90 = 315, n = 61201
N100 = 151, n = 83424
N_count = 0
Gaps = 0


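In short, `N50 = 905, n = 17781` means that the 17781 longest contigs, each at least 905 bp long, together account for at least 50% of the assembled bases. For a full description of this output, see https://github.com/sanger-pathogens/assembly-stats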
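
As a rough illustration of what the Nx values mean, here is a minimal sketch that computes them from a list of contig lengths using the common definition (sort lengths in descending order and accumulate until x% of the total is reached). The function `nx` is a hypothetical helper, not part of assembly-stats, and the tool may handle edge cases differently.

```python
def nx(lengths, x=50):
    """Return (Nx, count): the length of the contig at which the running sum
    of descending-sorted lengths first reaches x% of the total, and how many
    contigs were needed to get there."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * x / 100.0
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("x must be between 0 and 100")


# Made-up contig lengths, not taken from the Trinity assembly
toy_lengths = [100, 200, 300, 400]
print(nx(toy_lengths, 50))
print(nx(toy_lengths, 100))
```

With Biopython, the lengths could be collected as `[len(rec) for rec in SeqIO.parse(fastafile, "fasta")]`.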