# Lamprey Transcriptome Analysis: Genome Completeness Assessment

```
Camille Scott [camille dot scott dot w @gmail.com] [@camille_codon]

camillescott.github.io

Lab for Data Intensive Biology (DIB)
University of California Davis
```

## About

This notebook uses the many-sample de novo transcriptome assembly to assess the completeness of the genome build (and, orthogonally, the completeness of the transcriptome itself).

    assembly version: lamp10

    assembly program: Trinity
    
    genome build: 7.0.75 (Ensembl release 75)

## Libraries

In [1]:
%load_ext autoreload
%autoreload 2
%autosave 60

Autosaving every 60 seconds


In [2]:
from libs import *
%run -i common.ipy

** Using data resources found in ../resources.json
** Using config found in ../config.json


In [3]:
import pyprind

In [4]:
from blasttools import blast_to_df

In [5]:
from IPython.display import display, Image, HTML
import glob

In [6]:
%pylab inline
from matplotlib import rc_context
tall_size = (8,16)
norm_size = (12,8)
mpl_params = {'figure.autolayout': True,
               'axes.titlesize': 24,
               'axes.labelsize': 16,
               'ytick.labelsize': 14,
               'xtick.labelsize': 14
               }
sns.set(style="white", palette="Paired", rc=mpl_params)
#sns.set_palette("Paired", desat=.6)
b, g, r, p = sns.color_palette("muted", 4)

Populating the interactive namespace from numpy and matplotlib


In [28]:
from ete3 import NCBITaxa

ImportError: No module named ete3

In [8]:
%config InlineBackend.close_figures = False

## Data

We'll save our results to a dictionary and dump it to JSON for use in the paper.

In [9]:
results = {}

In [10]:
store = pd.HDFStore(wdir('{}.store.h5'.format(prefix)), complib='zlib', complevel=5)

In [11]:
import atexit

In [12]:
def dump_results(fn='../doc/petmar-genome-comp.results.json'):
    with open(fn, 'wb') as fp:
        json.dump(results, fp)

In [13]:
def exit_func():
    dump_results()
    store.close()
atexit.register(exit_func)

<function __main__.exit_func>

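The counts stored in `results` come from pandas `.sum()` calls, so they are NumPy scalars rather than plain Python integers, and depending on the Python version the standard `json` module may refuse to serialize them. A minimal sketch of one way to guard against that is below. The helper names `to_builtin` and `dumps_results` and the `FakeCount` stand-in are hypothetical and only there so the example runs without NumPy.

```python
import json


def to_builtin(obj):
    """Fallback for json: unwrap NumPy-style scalars through their .item() method."""
    if hasattr(obj, 'item'):
        return obj.item()
    raise TypeError('{!r} is not JSON serializable'.format(obj))


def dumps_results(results):
    # sort_keys keeps the output stable between runs, which makes diffs readable
    return json.dumps(results, default=to_builtin, sort_keys=True)


class FakeCount(object):
    """Hypothetical stand-in for a NumPy integer (anything exposing .item())."""

    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value


example = dumps_results({'n_novel_ortho': FakeCount(1768)})
```

Passing the same `default=to_builtin` argument to `json.dump` inside `dump_results` would have the same effect.
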
## Orthologs

How many transcripts have orthologs in other organisms, but no lamprey support? And vice versa?

In [14]:
ortho_panel = store['lamp10_ortho']

In [15]:
ortho_df = store['lamp10_ortho_filter_df']

In [16]:
blast_df = store['lamp10_blast_filter_df']

In [17]:
# Transcripts with reciprocal best hits to our main databases
has_ortho = ((ortho_df['lamp10.fasta.x.braFlo.pep.all.fa.db.tsv'] == True) | 
             (ortho_df['lamp10.fasta.x.danRer.pep.fa.db.tsv'] == True) |
             (ortho_df['lamp10.fasta.x.homSap.pep.fa.db.tsv'] == True) |
             (ortho_df['lamp10.fasta.x.musMus.pep.fa.db.tsv'] == True))
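
The ortholog table is loaded precomputed from the store, so how the reciprocal best hits were called is not shown here. As a reminder of the general idea only, the sketch below keeps a query/subject pair when each is the other's top-scoring hit. The function names, the tuple layout, and the toy IDs and scores are all assumptions made for illustration, not the pipeline's actual code or data.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Keep pairs where the forward best hit and the reverse best hit agree."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}


# Toy data with made-up IDs and scores
forward = [('tx1', 'protA', 250.0), ('tx1', 'protB', 90.0), ('tx2', 'protA', 120.0)]
reverse = [('protA', 'tx1', 245.0), ('protA', 'tx2', 110.0), ('protB', 'tx3', 80.0)]
rbh = reciprocal_best_hits(forward, reverse)
```

Here `tx2` also hits `protA`, but `protA` prefers `tx1`, so only the `tx1`/`protA` pair survives.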

In [18]:
# We want to filter out transcripts with *any* homologies to the genome,
# the lamprey proteins, and the lamprey cDNAs (lamp0)
lamp_filter = ((blast_df['lamp10.fasta.x.petMar2.fa.db.tsv'] == False) & 
               (blast_df['lamp10.fasta.x.petMar2.pep.fa.db.tsv'] == False) &
               (blast_df['lamp10.fasta.x.petMar2.cdna.fa.db.tsv'] == False))

In [19]:
ortho_df = ortho_panel.minor_xs('sseqid')

In [20]:
n_novel = (has_ortho & lamp_filter).sum()
results['n_novel_ortho'] = n_novel
print n_novel, 'have orthologies but no lamprey support'

1768 have orthologies but no lamprey support


In [21]:
n_genome_supported = (has_ortho & (blast_df['lamp10.fasta.x.petMar2.fa.db.tsv'] == True)).sum()
results['n_genome_supported'] = n_genome_supported
print n_genome_supported, 'have orthologies and genome support'

11990 have orthologies and genome support


In [22]:
n_supported = (has_ortho & (lamp_filter == False)).sum()
results['n_supported_ortho'] = n_supported
print n_supported, 'have orthologies and lamprey support'

13405 have orthologies and lamprey support


In [23]:
print 'This means {:.2f}% of orthologs are potentially novel'.format((float(n_novel) / (n_novel + n_supported)) * 100.0)

This means 11.65% of orthologs are potentially novel
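
The percentage above is the share of ortholog-bearing transcripts that have no lamprey hit at all, that is, 1768 / (1768 + 13405). The sketch below replays that arithmetic and shows the same partition on a toy table. `summarize`, `percent_novel` and the toy transcript names are hypothetical. "Lamprey support" here means a hit to any of the genome, protein, or cDNA databases, which is the negation of `lamp_filter`.

```python
def summarize(transcripts):
    """Count novel and supported orthologs.

    transcripts: dict of id -> (has_ortho, has_lamprey_hit), both booleans.
    """
    n_novel = sum(1 for ortho, lamp in transcripts.values() if ortho and not lamp)
    n_supported = sum(1 for ortho, lamp in transcripts.values() if ortho and lamp)
    return n_novel, n_supported


def percent_novel(n_novel, n_supported):
    """Percentage of ortholog-bearing transcripts with no lamprey support."""
    return 100.0 * n_novel / (n_novel + n_supported)


# Toy table with made-up transcripts
toy = {
    'tA': (True, False),   # ortholog, no lamprey hit -> novel
    'tB': (True, True),    # ortholog and lamprey hit -> supported
    'tC': (True, True),
    'tD': (False, False),  # no ortholog, so ignored by both counts
}
toy_counts = summarize(toy)

# The counts reported in this notebook
pct = percent_novel(1768, 13405)
```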


### Corresponding Genes

In [27]:
ncbi = NCBITaxa()

In [26]:
ncbi.get_fuzzy_name_translation('petromyzon marinus')

ImportError: No module named pysqlite2.dbapi2

In [211]:
GNATHOSTOMATA = 7776
CYCLOSTOMATA = 1476529
PMARINUS = 7757


In [212]:
mg = mygene.MyGeneInfo()

In [213]:
mygene.__version__

'2.2.0'

In [168]:
ortho_df['lamp10.fasta.x.Myx.pep.all.fa.db.tsv'] = ortho_df['lamp10.fasta.x.Myx.pep.all.fa.db.tsv'].apply(uniprot_str)

In [170]:
ortho_df['lamp10.fasta.x.braFlo.pep.all.fa.db.tsv'] = ortho_df['lamp10.fasta.x.braFlo.pep.all.fa.db.tsv'].apply(uniprot_str)

In [173]:
novel_df = ortho_df[has_ortho & lamp_filter]

In [218]:
mg.querymany(['C3Z5N5', 'C3YM09'], scopes='uniprot', species=7739)

Finished.


[{u'_id': u'7220667',
  u'entrezgene': 7220667,
  u'name': u'hypothetical protein',
  u'query': u'C3Z5N5',
  u'symbol': u'BRAFLDRAFT_276112',
  u'taxid': 7739},
 {u'_id': u'7230509',
  u'entrezgene': 7230509,
  u'name': u'hypothetical protein',
  u'query': u'C3YM09',
  u'symbol': u'BRAFLDRAFT_147665',
  u'taxid': 7739}]

In [220]:
mg.getgene(7220667, fields='all')

{u'_id': u'7220667',
 u'_timestamp': u'2014-06-22T00:00:00',
 u'accession': {u'genomic': [u'GG666583', u'NW_003101438'],
  u'protein': [u'EEN52148', u'XP_002596136'],
  u'rna': u'XM_002596090'},
 u'entrezgene': 7220667,
 u'name': u'hypothetical protein',
 u'refseq': {u'genomic': u'NW_003101438',
  u'protein': u'XP_002596136',
  u'rna': u'XM_002596090'},
 u'symbol': u'BRAFLDRAFT_276112',
 u'taxid': 7739,
 u'type_of_gene': u'protein-coding',
 u'uniprot': {u'TrEMBL': u'C3Z5N5'}}

In [226]:
mg.query('ENSDARP00000110794', scopes='ensemblprotein', species='zebrafish')

{u'hits': [{u'_id': u'324197',
   u'_score': 1.0501943,
   u'entrezgene': 324197,
   u'name': u'topoisomerase I binding, arginine/serine-rich a',
   u'symbol': u'toporsa',
   u'taxid': 7955}],
 u'max_score': 1.0501943,
 u'took': 112,
 u'total': 1}

In [None]:
mg.getgene(324197, )

| target_id | lamp10.fasta.x.Myx.pep.all.fa.db.tsv | lamp10.fasta.x.braFlo.pep.all.fa.db.tsv | lamp10.fasta.x.danRer.pep.fa.db.tsv | lamp10.fasta.x.homSap.pep.fa.db.tsv | lamp10.fasta.x.musMus.pep.fa.db.tsv | lamp10.fasta.x.petMar2.fa.db.tsv | lamp10.fasta.x.petMar2.cdna.fa.db.tsv | lamp10.fasta.x.petMar2.cds.fa.db.tsv | lamp10.fasta.x.petMar2.ncrna.fa.db.tsv | lamp10.fasta.x.petMar2.pep.fa.db.tsv |
|---|---|---|---|---|---|---|---|---|---|---|
| c366614_g4_i7 | | C3Z5N5 | | | | | | | | |
| c329845_g1_i1 | | | ENSDARP00000105046 | ENSP00000329735 | ENSMUSP00000040433 | | | | | |
| c363162_g3_i1 | | C3YM09 | ENSDARP00000110794 | ENSP00000369187 | ENSMUSP00000046843 | | | | | |
| c351306_g1_i1 | | | | | ENSMUSP00000136481 | | | | | |
| c351306_g1_i2 | | C3ZIU3 | ENSDARP00000075970 | ENSP00000417792 | | | | | | |
| c341340_g1_i1 | | C3YDG4 | ENSDARP00000104686 | ENSP00000225728 | ENSMUSP00000021157 | | | | | |
| c292244_g1_i1 | | | ENSDARP00000018388 | | | | | | | |
| c348522_g4_i1 | | C3ZB55 | | | | | | | | |
| c325520_g1_i2 | | C3YKU9 | ENSDARP00000059751 | ENSP00000466868 | ENSMUSP00000101017 | | | | | |
| c335139_g2_i1 | | C3Z3Y7 | ENSDARP00000035108 | ENSP00000382439 | ENSMUSP00000005394 | | | | | |


In [190]:
mg.querymany(['ENSDARP00000105046', 'ENSP00000329735'], scopes='ensemblprotein', species='7757')

Finished.
2 input query terms found no hit:
	[u'ENSDARP00000105046', u'ENSP00000329735']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.


[{u'notfound': True, u'query': u'ENSDARP00000105046'},
 {u'notfound': True, u'query': u'ENSP00000329735'}]
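
The `querymany` calls above return a list of dicts in which misses are flagged with `notfound`. A small helper for separating the two cases might look like the sketch below. `split_querymany` is a hypothetical name, and the sample records are copied from the two outputs shown above and joined into one list purely for illustration.

```python
def split_querymany(records):
    """Split querymany-style records into found symbols and missing queries.

    Returns (found, missing): found maps each query to a list of symbols,
    since one query can return several hits; missing lists unmatched queries.
    """
    found = {}
    missing = []
    for rec in records:
        if rec.get('notfound'):
            missing.append(rec['query'])
        else:
            found.setdefault(rec['query'], []).append(rec.get('symbol'))
    return found, missing


records = [
    {'query': 'C3Z5N5', 'symbol': 'BRAFLDRAFT_276112', 'entrezgene': 7220667, 'taxid': 7739},
    {'query': 'C3YM09', 'symbol': 'BRAFLDRAFT_147665', 'entrezgene': 7230509, 'taxid': 7739},
    {'notfound': True, 'query': 'ENSDARP00000105046'},
    {'notfound': True, 'query': 'ENSP00000329735'},
]
found, missing = split_querymany(records)
```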

In [191]:
mg.query('FAM212A', species='7757')

{u'hits': [], u'max_score': None, u'took': 1854, u'total': 0}

## Germline DNA Samples

We've got some nice shiny new sperm DNA samples - let's play with them some. First things first, we'll check out their FastQC results to make sure they're not crap.

In [30]:
fastqc_folders = sorted(glob.glob('*fastqc'))

In [31]:
!ls 2_ATCACG_L001_R1_001_fastqc/Images

duplication_levels.png	 per_base_sequence_content.png
kmer_profiles.png	 per_sequence_gc_content.png
per_base_gc_content.png  per_sequence_quality.png
per_base_n_content.png	 sequence_length_distribution.png
per_base_quality.png


In [32]:
# Quick utility function to generate an HTML table of FastQC images.
# Assumes the sorted folder list alternates R1 (left) and R2 (right).
def get_img_table(folders, image):
    htmlout = '<table>'
    for i in xrange(0, len(folders), 2):
        left = folders[i]
        right = folders[i+1]
        # header row naming the two folders
        htmlout += '<tr>'
        htmlout += '<td align="center">' + left + ', LEFT</td>'
        htmlout += '<td align="center">' + right + ', RIGHT</td>'
        htmlout += '</tr>'

        # image row for the pair
        htmlout += '<tr>'
        htmlout += '<td><img src="' + os.path.join(left, 'Images', image) + '"/></td>'
        htmlout += '<td><img src="' + os.path.join(right, 'Images', image) + '"/></td>'
        htmlout += '</tr>'
    htmlout += '</table>'
    return htmlout
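
`get_img_table` pairs folders purely by position in the sorted list, so a single missing R1 or R2 folder would shift every later pair. A more defensive option, sketched below, is to pair folders by name. `pair_fastqc_folders` is a hypothetical helper, and it assumes the `_R1_` / `_R2_` naming seen in the folder names in this notebook.

```python
import re


def pair_fastqc_folders(folders):
    """Return (R1, R2) folder pairs matched by name, skipping unpaired folders."""
    pairs = {}
    for name in folders:
        match = re.search(r'_R([12])_', name)
        if match is None:
            continue
        # Drop the R1/R2 tag so both mates share the same key
        key = name[:match.start()] + name[match.end() - 1:]
        pairs.setdefault(key, {})[match.group(1)] = name
    return [(p['1'], p['2']) for _, p in sorted(pairs.items())
            if '1' in p and '2' in p]


folders = [
    '2_ATCACG_L001_R1_001_fastqc',
    '2_ATCACG_L001_R2_001_fastqc',
    '2_TGACCA_L001_R2_001_fastqc',
    '2_TGACCA_L001_R1_001_fastqc',
    '4_CCGTCC_L001_R1_001_fastqc',  # mate deliberately left out of this example
]
paired = pair_fastqc_folders(folders)
```

The unpaired `4_CCGTCC` folder is dropped rather than being matched with the wrong mate.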

#### Per-Base Sequence Quality

These look pretty normal -- in fact, the right fragments are a little better than I might expect for many runs.

In [33]:
display(HTML(get_img_table(fastqc_folders, 'per_base_quality.png')))

[Image table: per_base_quality.png for each read pair (LEFT = R1, RIGHT = R2): 2_ATCACG_L001, 2_TGACCA_L001, 4_CCGTCC_L001, 4_GATCAG_L001]


#### GC Content

For the most part, these results line up with what we expect from the lamprey genome paper: ~46% GC content for the whole genome (FastQC throws a warning for this check, but we can safely ignore it based on our prior knowledge). Curiously, in the genome paper, they report that coding regions had a GC content of ~61%, but we don't really see a bump in the distribution there -- instead, we see a bump at ~84% in all samples. There isn't anything unexpected in the per-*base* figures, so we can at least assume there isn't a technical artifact common to all reads.

In [34]:
display(HTML(get_img_table(fastqc_folders, 'per_sequence_gc_content.png')))

[Image table: per_sequence_gc_content.png for each read pair (LEFT = R1, RIGHT = R2): 2_ATCACG_L001, 2_TGACCA_L001, 4_CCGTCC_L001, 4_GATCAG_L001]


In [35]:
display(HTML(get_img_table(fastqc_folders, 'per_base_gc_content.png')))

[Image table: per_base_gc_content.png for each read pair (LEFT = R1, RIGHT = R2): 2_ATCACG_L001, 2_TGACCA_L001, 4_CCGTCC_L001, 4_GATCAG_L001]
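
One way to follow up on the ~84% GC bump would be to compute per-read GC content directly and pull out the reads that land in that bin for inspection. The sketch below is a minimal illustration rather than a reimplementation of FastQC's calculation. `gc_percent`, `gc_histogram`, the 5% bin width and the toy reads are all assumptions.

```python
from collections import Counter


def gc_percent(seq):
    """Percent G+C among called bases (A, C, G, T); None if no bases are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in 'ACGT')
    if called == 0:
        return None
    return 100.0 * (seq.count('G') + seq.count('C')) / called


def gc_histogram(reads, bin_width=5):
    """Count reads per GC bin; each key is the lower edge of its bin."""
    hist = Counter()
    for read in reads:
        pct = gc_percent(read)
        if pct is not None:
            hist[int(pct // bin_width) * bin_width] += 1
    return hist


# Toy reads, not real data
reads = ['ACGT', 'GGGC', 'GGCCGA', 'NNNN', 'AATT']
hist = gc_histogram(reads)
```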


Once more, nothing too exciting -- a bit of A bias toward the end of the right reads, which coincides with the expected drop in quality.

In [36]:
display(HTML(get_img_table(fastqc_folders, 'per_base_sequence_content.png')))

[Image table: per_base_sequence_content.png for each read pair (LEFT = R1, RIGHT = R2): 2_ATCACG_L001, 2_TGACCA_L001, 4_CCGTCC_L001, 4_GATCAG_L001]


Seems a little odd that we get this slow increase in G/C homopolymers over the length of our reads. Need to investigate further.

In [37]:
display(HTML(get_img_table(fastqc_folders, 'kmer_profiles.png')))

[Image table: kmer_profiles.png for each read pair (LEFT = R1, RIGHT = R2): 2_ATCACG_L001, 2_TGACCA_L001, 4_CCGTCC_L001, 4_GATCAG_L001]
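
As a starting point for investigating the G/C homopolymer trend, one could tally where such runs begin within each read and compare the counts along the read length. This is only a sketch. `homopolymer_starts` is a hypothetical helper, the minimum run length of 5 is an arbitrary choice, and the toy reads are made up.

```python
import re
from collections import Counter


def homopolymer_starts(reads, bases='GC', min_len=5):
    """Count start positions (0-based) of homopolymer runs of the given bases."""
    # Builds a pattern such as 'G{5,}|C{5,}'
    pattern = re.compile('|'.join(re.escape(b) + '{%d,}' % min_len for b in bases))
    counts = Counter()
    for read in reads:
        for match in pattern.finditer(read.upper()):
            counts[match.start()] += 1
    return counts


# Toy reads, not real data
reads = ['ACGTGGGGGA', 'CCCCCCATGC', 'ACGTACGGGGGG', 'ACGTACGT']
starts = homopolymer_starts(reads)
```

Plotting `starts` against position for the real reads would show whether the increase is gradual along the read or concentrated near the 3' end.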


In [38]:
display(HTML(get_img_table(fastqc_folders, 'duplication_levels.png')))

[Image table: duplication_levels.png for each read pair (LEFT = R1, RIGHT = R2): 2_ATCACG_L001, 2_TGACCA_L001, 4_CCGTCC_L001, 4_GATCAG_L001]


Summary: we should probably run a trimmer on these samples before using them. Shocker!
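
For reference, the simplest form of 3' quality trimming is sketched below: bases are removed from the end of a read until one meets a quality threshold. In practice a dedicated trimming tool with a more robust algorithm would be used. `trim_3prime`, the Phred+33 encoding and the threshold of 20 are assumptions for illustration only.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim trailing bases whose Phred quality is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record;
    offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33
trimmed = trim_3prime('ACGTACGT', 'IIIII###')
all_low = trim_3prime('ACGT', '####')
```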