Create DATS encoding for mouse reference genome and gene-level annotation. #7
Comments
Genes (but not complete gene models) can be found in the following GFF file, along with pseudogenes and various other types of regions and features (CpG island, TSS region, snRNAs): http://www.informatics.jax.org/downloads/reports/MGI_GTGUP.gff

The corresponding GFF3 file does have complete gene models but does not appear to have many of the additional features (CpG island, etc.) found in the plain GFF file. It also aggregates data from several different providers (MGI, miRBase, VEGA, NCBI_Gene, ENSEMBL): http://www.informatics.jax.org/downloads/mgigff/MGI.gff3.gz

The genome build is GRCm38-C57BL/6J, although the actual reference sequence assembly doesn't appear to be on the MGD Download/Sequence Data page.
It looks like there are a couple of ways that this could be structured in DATS.
Based on this morning's discussion we're going to skip the chromosome-level Datasets for now (which should by rights be encoded as molecular entities anyway).
A couple of comments/notes on parsing
Although note that in this case the referenced ENSEMBL gene appears elsewhere in the file and does have a defined strand (reverse). Also, at least initially, I'm only including the genes with source = 'MGI' in the DATS encoding.
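For illustration, the source = 'MGI' filter described above could be sketched roughly as follows. This is not the code used for the actual encoding. It assumes the standard nine-column, tab-separated GFF3 layout (source in column 2, feature type in column 3, strand in column 7) and assumes that gene rows carry the type `gene`, which has not been checked against MGI.gff3.gz. The function name `iter_mgi_genes` and the sample rows (including their IDs) are made up for the example.

```python
def iter_mgi_genes(lines, source="MGI", feature_type="gene"):
    """Yield one dict per GFF3 row whose source and type columns match.

    Assumes the standard 9-column, tab-separated GFF3 layout.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        seqid, src, ftype, start, end, _score, strand, _phase, attrs = cols
        if src != source or ftype != feature_type:
            continue
        yield {
            "seqid": seqid,
            "start": int(start),
            "end": int(end),
            # GFF3 uses "." or "?" when the strand is not defined.
            "strand": None if strand in (".", "?") else strand,
            "attributes": dict(
                kv.split("=", 1) for kv in attrs.split(";") if "=" in kv
            ),
        }


# Made-up rows for demonstration only; these are not real MGI records.
SAMPLE_ROWS = [
    ["1", "MGI", "gene", "100", "200", ".", "-", ".", "ID=EXAMPLE:1;Name=GeneA"],
    ["1", "MGI", "gene", "300", "400", ".", ".", ".", "ID=EXAMPLE:2;Name=GeneB"],
    ["1", "ENSEMBL", "gene", "300", "400", ".", "-", ".", "ID=EXAMPLE:3"],
    ["2", "MGI", "pseudogene", "500", "600", ".", "+", ".", "ID=EXAMPLE:4"],
]
SAMPLE_GFF3 = "##gff-version 3\n" + "\n".join("\t".join(r) for r in SAMPLE_ROWS)

for gene in iter_mgi_genes(SAMPLE_GFF3.splitlines()):
    print(gene["seqid"], gene["start"], gene["end"], gene["strand"],
          gene["attributes"].get("ID"))
```

Rows with an undefined strand come back with `strand` set to `None`, so a later step could look the strand up from another record for the same gene (for example the ENSEMBL one) if that turns out to be needed.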
Initial encoding released in v0.3
Create initial DATS encoding for mouse reference genome and gene-level annotation.