Create DATS encoding for mouse reference genome and gene-level annotation. #7

jonathancrabtree · 2018-06-19T13:31:37Z

Create initial DATS encoding for mouse reference genome and gene-level annotation.

jonathancrabtree · 2018-06-19T13:42:32Z

Genes (but not complete gene models) can be found in the following GFF file, along with pseudogenes and various other types of regions and features (CpG island, TSS region, snRNAs):

http://www.informatics.jax.org/downloads/reports/MGI_GTGUP.gff

The corresponding GFF3 file does have complete gene models but does not appear to have many of the additional features (CpG island, etc.) found in the plain GFF file. It also aggregates data from several different providers (MGI, miRBase, VEGA, NCBI_Gene, ENSEMBL):

http://www.informatics.jax.org/downloads/mgigff/MGI.gff3.gz

The genome build is GRCm38-C57BL/6J, although the actual reference sequence assembly doesn't appear to be on the MGD Download/Sequence Data page.

jonathancrabtree · 2018-06-20T03:33:52Z

There appear to be a couple of ways to structure this in DATS. MolecularEntity is the appropriate schema/type for genes, proteins, etc., and a DATS Dataset can link to one or more of these via isAbout. However, there's no way to represent any deeper structure within a MolecularEntity, e.g., to explicitly represent alternate splice forms of a gene as a nested data structure, as one might do in chado. We can, however, use the Dataset hasPart property to bring in a little extra structure. For a first pass on this I propose:

Dataset (representing a release of the C57BL/6 reference genome and annotation)
  -> hasPart
    Dataset for chromosome 1 + annotation
      -> isAbout
        MolecularEntity for the DNA sequence of chr1
        MolecularEntity for each gene/feature/region of interest
        (but not attempting to represent any finer structure within genes)
    Dataset for chromosome 2 + annotation
      -> isAbout
        MolecularEntity for the DNA sequence of chr2
        MolecularEntity for each gene/feature/region of interest
    etc.
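
As a rough illustration of the nesting proposed above, the structure could be built as plain Python dictionaries and serialized to JSON. This is only a sketch: hasPart and isAbout come from the discussion, but the "@type", "title", and "name" keys, the helper function names, and the placeholder feature names are assumptions for illustration and have not been validated against the DATS schema.

```python
import json


def molecular_entity(name):
    # Hypothetical minimal MolecularEntity; the "name" key is an assumption.
    return {"@type": "MolecularEntity", "name": name}


def chromosome_dataset(chrom, feature_names):
    # One Dataset per chromosome, linked via isAbout to the chromosome
    # sequence and to each gene/feature/region of interest.
    about = [molecular_entity(f"chr{chrom} DNA sequence")]
    about.extend(molecular_entity(n) for n in feature_names)
    return {
        "@type": "Dataset",
        "title": f"Chromosome {chrom} + annotation",
        "isAbout": about,
    }


def genome_dataset(title, features_by_chrom):
    # Top-level Dataset for the release; hasPart holds the per-chromosome Datasets.
    return {
        "@type": "Dataset",
        "title": title,
        "hasPart": [
            chromosome_dataset(chrom, names)
            for chrom, names in features_by_chrom.items()
        ],
    }


# Placeholder feature names, not real gene identifiers.
release = genome_dataset(
    "C57BL/6 reference genome and annotation",
    {"1": ["example_feature_a"], "2": ["example_feature_b"]},
)
print(json.dumps(release, indent=2))
```

No finer structure within a gene is represented here, matching the limitation noted above.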

jonathancrabtree · 2018-06-20T15:45:01Z

Based on this morning's discussion, we're going to skip the chromosome-level Datasets for now (which should by rights be encoded as molecular entities anyway).

jonathancrabtree · 2018-06-21T20:02:34Z

A couple of comments/notes on parsing MGI.gff3.gz:

CDS features don't have ID attributes, which the GFF3 spec says are required for features that span multiple lines (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md).
Some gene features appear more than once in the file in "distinct" locations due to the inclusion of genome patches. For example:

$ gzcat MGI.gff3.gz | egrep 'ID=NCBI_Gene:100504180' 
10|NW_012132907.1       NCBI_Gene       gene    41035   45415   .       -       .       ID=NCBI_Gene:100504180;bioType=gene
10      NCBI_Gene       gene    58222403        58226783        .       -       .       ID=NCBI_Gene:100504180;Dbxref=MGI:MGI:3034577;bioType=gene

Some gene features have unknown strand. For example:

10      MGI     gene    8860863 8861443 .       .       .       ID=MGI:MGI:1922603;Name=4930553I21Rik;Dbxref=ENSEMBL:ENSMUSG00000111165;mgiName=RIKEN cDNA 4930553I21 gene;bioType=unclassified gene

Note, however, that in this case the referenced ENSEMBL gene appears elsewhere in the file and does have a defined strand (reverse).

Also, at least initially, I'm only including the genes with source = 'MGI' in the DATS encoding.
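
The checks described in these notes could be sketched roughly as follows. This is not the script used for the actual encoding; the function names and record layout are hypothetical, and the sample rows are the three gene lines quoted above, rebuilt with tab separators as GFF3 requires.

```python
from collections import defaultdict


def parse_attributes(field):
    # Split the GFF3 column 9 "key=value;key=value" string into a dict.
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def parse_gff3(lines):
    # Yield one dict per feature line, skipping comments and malformed rows.
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
        yield {
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "attrs": parse_attributes(attrs),
        }


# The three gene lines quoted in this comment.
SAMPLE = [
    "\t".join(["10|NW_012132907.1", "NCBI_Gene", "gene", "41035", "45415", ".", "-", ".",
               "ID=NCBI_Gene:100504180;bioType=gene"]),
    "\t".join(["10", "NCBI_Gene", "gene", "58222403", "58226783", ".", "-", ".",
               "ID=NCBI_Gene:100504180;Dbxref=MGI:MGI:3034577;bioType=gene"]),
    "\t".join(["10", "MGI", "gene", "8860863", "8861443", ".", ".", ".",
               "ID=MGI:MGI:1922603;Name=4930553I21Rik;Dbxref=ENSEMBL:ENSMUSG00000111165;"
               "mgiName=RIKEN cDNA 4930553I21 gene;bioType=unclassified gene"]),
]

genes = [f for f in parse_gff3(SAMPLE) if f["type"] == "gene"]

# Gene IDs that occur at more than one location (e.g. on a genome patch).
locations = defaultdict(list)
for gene in genes:
    locations[gene["attrs"].get("ID")].append((gene["seqid"], gene["start"], gene["end"]))
duplicated = {gid: locs for gid, locs in locations.items() if len(locs) > 1}

# Genes whose strand column is neither "+" nor "-".
unknown_strand = [g["attrs"].get("ID") for g in genes if g["strand"] not in ("+", "-")]

# Only genes with source = 'MGI' go into the initial DATS encoding.
mgi_genes = [g for g in genes if g["source"] == "MGI"]

print(sorted(duplicated), unknown_strand, len(mgi_genes))
```

To run the same checks on the real file, the lines could be read with gzip.open("MGI.gff3.gz", "rt") and passed to parse_gff3 in place of SAMPLE.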

jonathancrabtree · 2018-06-29T12:16:29Z

Initial encoding released in v0.3

jonathancrabtree self-assigned this Jun 19, 2018

agbeltran mentioned this issue Jun 20, 2018

Extend the molecular entity schema to support linking to other molecular entities. datatagsuite/schema#3

Closed

jonathancrabtree closed this as completed Jun 29, 2018
