Create DATS encoding for mouse reference genome and gene-level annotation. #7

Closed
jonathancrabtree opened this issue Jun 19, 2018 · 5 comments
@jonathancrabtree

Create initial DATS encoding for mouse reference genome and gene-level annotation.

jonathancrabtree self-assigned this Jun 19, 2018
@jonathancrabtree

Genes (but not complete gene models) can be found in the following GFF file, along with pseudogenes and various other types of regions and features (CpG island, TSS region, snRNAs):

http://www.informatics.jax.org/downloads/reports/MGI_GTGUP.gff

The corresponding GFF3 file does have complete gene models but does not appear to have many of the additional features (CpG island, etc.) found in the plain GFF file. It also aggregates data from several different providers (MGI, miRBase, VEGA, NCBI_Gene, ENSEMBL):

http://www.informatics.jax.org/downloads/mgigff/MGI.gff3.gz

The genome build is GRCm38-C57BL/6J, although the actual reference sequence assembly doesn't appear to be on the MGD Download/Sequence Data page.

@jonathancrabtree

It looks like there are a couple of ways this could be structured in DATS. MolecularEntity is the appropriate schema/type for genes, proteins, etc., and a DATS Dataset can link to one or more of these via isAbout. However, there's no way to represent any deeper structure within a MolecularEntity, e.g., to explicitly represent alternate splice forms of a gene as a nested data structure, as one might do in Chado. We can, however, use the Dataset hasPart property to bring in a little extra structure. For a first pass on this I propose:

Dataset (representing a release of the C57BL/6 reference genome and annotation)
  -> hasPart
    Dataset for chromosome 1 + annotation
      -> isAbout
        MolecularEntity for the DNA sequence of chr1
        MolecularEntity for each gene/feature/region of interest
        (but not attempting to represent any finer structure within genes)
    Dataset for chromosome 2 + annotation
      -> isAbout
        MolecularEntity for the DNA sequence of chr2
        MolecularEntity for each gene/feature/region of interest
    etc.
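
The proposed nesting can be sketched as plain JSON-style dictionaries. This is a minimal illustration, not the encoding that was eventually released: the helper names (`molecular_entity`, `chromosome_dataset`, `genome_dataset`) are hypothetical, and the field layout (`@type`, `title`, `name`, `identifier`) is an assumption that should be checked against the DATS schemas. A valid DATS Dataset may require further properties that are omitted here.

```python
import json


def molecular_entity(name, identifier=None):
    """Build a minimal MolecularEntity-like dict (field names are assumptions)."""
    entity = {"@type": "MolecularEntity", "name": name}
    if identifier is not None:
        entity["identifier"] = {"identifier": identifier}
    return entity


def chromosome_dataset(chrom, genes):
    """Dataset for one chromosome: isAbout lists the sequence, then each gene."""
    about = [molecular_entity(f"chr{chrom} DNA sequence")]
    about.extend(molecular_entity(name, ident) for name, ident in genes)
    return {
        "@type": "Dataset",
        "title": f"Chromosome {chrom} + annotation",
        "isAbout": about,
    }


def genome_dataset(title, genes_by_chrom):
    """Top-level Dataset that links to per-chromosome Datasets via hasPart."""
    return {
        "@type": "Dataset",
        "title": title,
        "hasPart": [
            chromosome_dataset(chrom, genes)
            for chrom, genes in genes_by_chrom.items()
        ],
    }


# The chromosome 10 gene is taken from the GFF3 excerpt in this thread;
# the chromosome 1 gene list is left empty for brevity.
release = genome_dataset(
    "C57BL/6 reference genome and annotation",
    {
        "1": [],
        "10": [("4930553I21Rik", "MGI:MGI:1922603")],
    },
)

print(json.dumps(release, indent=2))
```

Nothing inside a MolecularEntity describes transcripts or exons, which matches the limitation noted above: the only structure comes from hasPart and isAbout.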

@jonathancrabtree

Based on this morning's discussion we're going to skip the chromosome-level Datasets for now (chromosomes should by rights be encoded as molecular entities anyway).

@jonathancrabtree

A couple of notes on parsing MGI.gff3.gz:

  • The same feature ID can appear on more than one line. For example, NCBI_Gene:100504180 is listed both on 10|NW_012132907.1 and on chromosome 10:
$ gzcat MGI.gff3.gz | egrep 'ID=NCBI_Gene:100504180'
10|NW_012132907.1       NCBI_Gene       gene    41035   45415   .       -       .       ID=NCBI_Gene:100504180;bioType=gene
10      NCBI_Gene       gene    58222403        58226783        .       -       .       ID=NCBI_Gene:100504180;Dbxref=MGI:MGI:3034577;bioType=gene
  • Some gene features have unknown strand. For example:
10      MGI     gene    8860863 8861443 .       .       .       ID=MGI:MGI:1922603;Name=4930553I21Rik;Dbxref=ENSEMBL:ENSMUSG00000111165;mgiName=RIKEN cDNA 4930553I21 gene;bioType=unclassified gene

Note, though, that in this case the referenced ENSEMBL gene appears elsewhere in the file and does have a defined strand (reverse).

Also, at least initially, I'm only including the genes with source = 'MGI' in the DATS encoding.
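
The parsing notes above can be checked with a small script. The following is a rough sketch rather than the code used for the actual encoding: the function names (`parse_gff3`, `seqids_by_id`) are hypothetical, the sample lines are the three records quoted in this thread, and it assumes the columns are tab-separated as in standard GFF3.

```python
from collections import defaultdict

# The three records quoted above, with tab-separated columns (an assumption).
SAMPLE = """\
10|NW_012132907.1\tNCBI_Gene\tgene\t41035\t45415\t.\t-\t.\tID=NCBI_Gene:100504180;bioType=gene
10\tNCBI_Gene\tgene\t58222403\t58226783\t.\t-\t.\tID=NCBI_Gene:100504180;Dbxref=MGI:MGI:3034577;bioType=gene
10\tMGI\tgene\t8860863\t8861443\t.\t.\t.\tID=MGI:MGI:1922603;Name=4930553I21Rik;Dbxref=ENSEMBL:ENSMUSG00000111165;mgiName=RIKEN cDNA 4930553I21 gene;bioType=unclassified gene
"""


def parse_gff3(text):
    """Parse GFF3 text into a list of dicts, skipping comments and bad lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        features.append({
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            # "." means the strand is unknown; store it as None.
            "strand": None if strand == "." else strand,
            "attributes": attributes,
        })
    return features


def seqids_by_id(features):
    """Return {ID: [seqid, ...]} for every ID that occurs on more than one line."""
    seen = defaultdict(list)
    for feature in features:
        feature_id = feature["attributes"].get("ID")
        if feature_id:
            seen[feature_id].append(feature["seqid"])
    return {fid: seqids for fid, seqids in seen.items() if len(seqids) > 1}


features = parse_gff3(SAMPLE)
duplicates = seqids_by_id(features)

# Keep only gene features whose source column is 'MGI'.
mgi_genes = [
    f for f in features if f["type"] == "gene" and f["source"] == "MGI"
]
unknown_strand = [f["attributes"]["ID"] for f in mgi_genes if f["strand"] is None]

print(duplicates)
print(unknown_strand)
```

For the real file, the same functions could be fed the decompressed text, for example by reading it with `gzip.open("MGI.gff3.gz", "rt")`.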

@jonathancrabtree

Initial encoding released in v0.3.
