Supporting additional species

Brian Yee edited this page Dec 12, 2021 · 5 revisions

Clipper uses an internal database of pre-parsed annotation files, included upon installation, which supports the following species:

  • hg19
  • GRCh38
  • ce10
  • dm3
  • mm9
  • mm10

If you are using Clipper with an unlisted assembly, I hope this page may serve as a guide to creating your own annotations. Prior to installation, you will need to add <NEWSPECIES>.AS.STRUCTURE.COMPILED.gff to data/, and <NEWSPECIES>_genes.bed and <NEWSPECIES>_exons.bed to data/regions/. <NEWSPECIES> is the name of your species, which you specify when running Clipper (e.g. clipper --species hg18 would correspond to the new annotations hg18.AS.STRUCTURE.COMPILED.gff, hg18_genes.bed, and hg18_exons.bed). Below is an example for one entry, DDX3X.

SPECIES_genes.bed

This BED file contains the genomic coordinates of each gene. The name column holds the gene ID, which should match the gene identifiers in the other two required files:

chrX 41333283 41364472 ENSG00000215301.10 0 +

SPECIES_exons.bed

This BED file contains the exon coordinates of each gene. The name column holds the gene ID, which should match the gene identifiers in the other two required files. Overlapping exons from distinct transcripts should be merged to produce a non-overlapping list of representative exons per gene:

chrX    41333283        41334297        ENSG00000215301.10      0       +
chrX    41334590        41336738        ENSG00000215301.10      0       +
chrX    41337407        41339604        ENSG00000215301.10      0       +
chrX    41339909        41344128        ENSG00000215301.10      0       +
chrX    41344190        41344564        ENSG00000215301.10      0       +
chrX    41345179        41345548        ENSG00000215301.10      0       +
chrX    41346228        41346622        ENSG00000215301.10      0       +
chrX    41346858        41351668        ENSG00000215301.10      0       +
chrX    41357832        41358000        ENSG00000215301.10      0       +
chrX    41364273        41364472        ENSG00000215301.10      0       +
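The merging step described above can be sketched in Python as follows. This is a minimal illustration, not the method Clipper's authors used: the function name merge_exons and the tuple-based input are assumptions made for the example, and treating exons that merely touch (end equal to the next start) as mergeable is a choice you may want to change.

```python
from collections import defaultdict


def merge_exons(bed_rows):
    """Merge overlapping exons per gene (hypothetical helper, not part of Clipper).

    bed_rows: iterable of (chrom, start, end, gene_id, score, strand) tuples
    using 0-based, half-open BED coordinates.
    Returns merged rows sorted by chromosome and position.
    """
    # Group exon intervals by gene so exons of different genes are never merged.
    by_gene = defaultdict(list)
    for chrom, start, end, gene_id, _score, strand in bed_rows:
        by_gene[(chrom, gene_id, strand)].append((int(start), int(end)))

    merged_rows = []
    for (chrom, gene_id, strand), intervals in by_gene.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping (or touching) exon: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged_rows.append((chrom, cur_start, cur_end, gene_id, 0, strand))
                cur_start, cur_end = start, end
        merged_rows.append((chrom, cur_start, cur_end, gene_id, 0, strand))

    return sorted(merged_rows, key=lambda row: (row[0], row[1], row[2]))
```

Each returned tuple can be written out as one tab-separated line of <NEWSPECIES>_exons.bed.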

SPECIES.AS.STRUCTURE.COMPILED.gff

This file is a GFF-formatted file modified to provide an ID, mRNA length (sum of associated exon lengths) and pre-mRNA length (length of the gene) for each gene. This file can be generated using the genes and exons files from above. Note that GFF coordinates are 1-based, so the start is the BED start plus one:

chrX AS_STRUCTURE gene 41333284 41364472 . + . gene_id=ENSG00000215301.10;mrna_length=15892;premrna_length=31189
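As a rough sketch of how such a line could be derived from the two BED files, the Python function below reproduces the DDX3X example. The function name as_structure_line is hypothetical, tab separation is assumed (as is usual for GFF), and only the fields visible in the example line are emitted; this is not Clipper's own generator.

```python
def as_structure_line(gene_row, exon_rows):
    """Build one AS.STRUCTURE-style gene line (hypothetical helper, not part of Clipper).

    gene_row:  (chrom, start, end, gene_id, score, strand) from <NEWSPECIES>_genes.bed
    exon_rows: iterable of the same 6-field rows from <NEWSPECIES>_exons.bed
    """
    chrom, start, end, gene_id, _score, strand = gene_row
    start, end = int(start), int(end)

    # mrna_length: sum of the merged exon lengths belonging to this gene.
    mrna_length = sum(
        int(exon[2]) - int(exon[1]) for exon in exon_rows if exon[3] == gene_id
    )
    # premrna_length: full genomic span of the gene.
    premrna_length = end - start

    attributes = (
        f"gene_id={gene_id};"
        f"mrna_length={mrna_length};"
        f"premrna_length={premrna_length}"
    )
    # BED starts are 0-based; GFF starts are 1-based, hence start + 1.
    fields = [chrom, "AS_STRUCTURE", "gene", str(start + 1), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields)
```

Applied to the DDX3X gene and exon rows shown earlier, the exon lengths sum to 15892 and the gene span is 31189, matching the example line.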

Once these files are in their respective directories, re-install Clipper and the new species should be available to use.

Useful tools:

This helpful tool has been kindly developed by Vishal Koparde (https://github.com/kopardev) to autogenerate Clipper reference data:

https://github.com/kopardev/clipperhelper

Alternatively, the create_region_bedfiles script performs the GTF -> BED/AS.STRUCTURE transformation; its output may be used as reference files for Clipper:

https://github.com/byee4/annotator