This repository contains chromosome/contig name mappings between UCSC <-> Ensembl <-> Gencode for a variety of genomes.
Latest commit 58f75d5 Jul 4, 2016 @dpryan79 committed on GitHub Merge pull request #12 from iljungr/master
UCSC<->ensembl for the galGal4 chicken genome.
.gitignore all chromosomes from GRCh37, moved chrM after chrX because it's the o… Apr 16, 2015
BDGP6_UCSC2ensembl.txt dm6->BDGP6 and ce10->WBcel235, HT Charles Vejnar for suggesting stick… Sep 8, 2015
BDGP6_ensembl2UCSC.txt dm6->BDGP6 and ce10->WBcel235, HT Charles Vejnar for suggesting stick… Sep 8, 2015
GRCh37_UCSC2ensembl.txt Complete GRCh37 Ensembl<->UCSC<->Gencode, include all patches/haploty… Apr 19, 2015
GRCh37_UCSC2gencode.txt Complete GRCh37 Ensembl<->UCSC<->Gencode, include all patches/haploty… Apr 19, 2015
GRCh37_ensembl2UCSC.txt Complete GRCh37 Ensembl<->UCSC<->Gencode, include all patches/haploty… Apr 19, 2015
GRCh37_ensembl2gencode.txt Complete GRCh37 Ensembl<->UCSC<->Gencode, include all patches/haploty… Apr 19, 2015
GRCh37_gencode2UCSC.txt Complete GRCh37 Ensembl<->UCSC<->Gencode, include all patches/haploty… Apr 19, 2015
GRCh37_gencode2ensembl.txt Complete GRCh37 Ensembl<->UCSC<->Gencode, include all patches/haploty… Apr 19, 2015
GRCh38_UCSC2ensembl.txt Add dm3_ensembl2UCSC. All UCSC-related files are now uniformly named,… Apr 17, 2015
GRCh38_UCSC2gencode.txt Add dm3_ensembl2UCSC. All UCSC-related files are now uniformly named,… Apr 17, 2015
GRCh38_ensembl2UCSC.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCh38_ensembl2gencode.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCh38_gencode2UCSC.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCh38_gencode2ensembl.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCm37_UCSC2ensembl.txt Complete GRCm37 gencode<->ensembl<->UCSC. Mention ambiguous mappings … Apr 18, 2015
GRCm37_UCSC2gencode.txt Complete GRCm37 gencode<->ensembl<->UCSC. Mention ambiguous mappings … Apr 18, 2015
GRCm37_ensembl2UCSC.txt Complete GRCm37 gencode<->ensembl<->UCSC. Mention ambiguous mappings … Apr 18, 2015
GRCm37_ensemblgencode.txt Complete GRCm37 gencode<->ensembl<->UCSC. Mention ambiguous mappings … Apr 18, 2015
GRCm37_gencode2UCSC.txt Complete GRCm37 gencode<->ensembl<->UCSC. Mention ambiguous mappings … Apr 18, 2015
GRCm37_gencode2ensembl.txt Complete GRCm37 gencode<->ensembl<->UCSC. Mention ambiguous mappings … Apr 18, 2015
GRCm38_UCSC2ensembl.txt Complete GRCm38 gencode<->UCSC<->Ensembl Apr 18, 2015
GRCm38_UCSC2gencode.txt Complete GRCm38 gencode<->UCSC<->Ensembl Apr 18, 2015
GRCm38_ensembl2UCSC.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCm38_ensembl2gencode.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCm38_gencode2UCSC.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCm38_gencode2ensembl.txt Update GRCh38 and GRCm38, make a note about contig versions Apr 2, 2016
GRCz10_UCSC2ensembl.txt Adding UCSC2ensembl for zebrafish GRCz10 Sep 4, 2015
GRCz10_UCSC2gencode.txt Rename GRCZ10 -> GRCz10 Sep 4, 2015
GRCz10_ensembl2UCSC.txt Adding ensembl2UCSC for zebrafish GRCz10 Sep 6, 2015
GRCz10_gencode2UCSC.txt Rename GRCZ10 -> GRCz10 Sep 4, 2015
JGI_4.2_UCSC2ensembl.txt Adding Xenopus JGI_4.2 Sep 6, 2015
JGI_4.2_ensembl2UCSC.txt Adding Xenopus JGI_4.2 Sep 6, 2015
MEDAKA1_UCSC2ensembl.txt Adding Medaka MEDAKA1 Sep 7, 2015
MEDAKA1_ensembl2UCSC.txt Adding Medaka MEDAKA1 Sep 7, 2015
R64-1-1_UCSC2ensembl.txt Adding yeast R64-1-1 Sep 6, 2015
R64-1-1_ensembl2UCSC.txt Adding yeast R64-1-1 Sep 6, 2015
README.md Update the version example to use a real example. Apr 2, 2016
WBcel235_UCSC2ensembl.txt dm6->BDGP6 and ce10->WBcel235, HT Charles Vejnar for suggesting stick… Sep 8, 2015
WBcel235_ensembl2UCSC.txt dm6->BDGP6 and ce10->WBcel235, HT Charles Vejnar for suggesting stick… Sep 8, 2015
Zv9_UCSC2ensembl.txt Added Zv9 Ensembl<->UCSC Apr 17, 2015
Zv9_ensembl2UCSC.txt Added Zv9 Ensembl<->UCSC Apr 17, 2015
dm3_UCSC2ensembl.txt Update dm3 mappings, since chrM was omitted! Jun 9, 2015
dm3_ensembl2UCSC.txt Update dm3 mappings, since chrM was omitted! Jun 9, 2015
galGal4_UCSC2ensembl.txt UCSC<->ensembl for the galGal4 chicken genome. Jul 4, 2016
galGal4_ensembl2UCSC.txt UCSC<->ensembl for the galGal4 chicken genome. Jul 4, 2016
rn5_UCSC2ensembl.txt Add rn5 UCSC<->Ensembl Apr 17, 2015
rn5_ensembl2UCSC.txt Add rn5 UCSC<->Ensembl Apr 17, 2015

README.md

This repository contains chromosome/contig name mappings between UCSC <-> Ensembl <-> Gencode for a variety of genomes.

The files are named AAA_BBB2CCC.txt, where AAA is a genome and version (e.g., GRCh37) and BBB and CCC are sources (namely, ensembl, UCSC, or gencode). Each file contains two tab-separated columns: the first is the chromosome name in BBB and the second is the equivalent name in CCC. For example, suppose we want to convert gencode chromosome names to Ensembl names for GRCh37. We would then look in the GRCh37_gencode2ensembl.txt file, which contains lines such as:

chrX    X
chrY    Y
chrM    MT
GL877870.2  HG1001_PATCH
GL877872.1  HG1032_PATCH
GL383535.1  HG104_HG975_PATCH
JH159133.1  HG1063_PATCH

In this case, chrX is the gencode name and X is the equivalent Ensembl name.
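As a minimal sketch of how such a file might be consumed, the Python snippet below reads the two tab-separated columns into a dictionary. The function name `load_mapping` is hypothetical and not part of this repository, and the sample lines are copied from the excerpt above rather than read from disk.

```python
import io


def load_mapping(handle):
    """Read a two-column, tab-separated mapping into a dict.

    `handle` is any iterable of text lines, e.g. an open file.
    A missing or empty second column is stored as "".
    """
    mapping = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        source = fields[0]
        target = fields[1].strip() if len(fields) > 1 else ""
        mapping[source] = target
    return mapping


# Sample lines taken from the GRCh37_gencode2ensembl.txt excerpt above.
sample = io.StringIO(
    "chrX\tX\n"
    "chrY\tY\n"
    "chrM\tMT\n"
    "GL877870.2\tHG1001_PATCH\n"
)
gencode_to_ensembl = load_mapping(sample)
print(gencode_to_ensembl["chrM"])  # prints MT
```

With a real file, the same function could be called as `load_mapping(open("GRCh37_gencode2ensembl.txt"))`.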

Missing Chromosomes/Contigs

It's not always the case that a given chromosome/contig exists in all sources. An example of that is GRCh38_gencode2UCSC.txt, where a number of entries exist in gencode but are absent in UCSC. In such cases, the second column of the file is simply empty:

KI270937.1      chr3_KI270937v1_alt
KI270938.1      chr19_KI270938v1_alt
KN196472.1      
KN196473.1      

There is always a second tab-separated field in the lines above, but KN196472.1 and KN196473.1 simply don't exist in UCSC. A script using these files can therefore treat an empty string ("") in the second column as "missing".
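The check for a missing target can be sketched in Python as follows. The helper `convert_name` is a hypothetical name, and the dictionary is filled in by hand from the excerpt above; returning `None` for a missing target is one possible convention, not something the files themselves prescribe.

```python
def convert_name(mapping, name):
    """Return the target name, or None if there is no usable target.

    An empty second column ("") and a name absent from the mapping
    are both reported as None.
    """
    target = mapping.get(name, "")
    return target or None


# Entries copied from the GRCh38 gencode -> UCSC excerpt above.
gencode_to_ucsc = {
    "KI270937.1": "chr3_KI270937v1_alt",
    "KI270938.1": "chr19_KI270938v1_alt",
    "KN196472.1": "",
    "KN196473.1": "",
}

for contig in ("KI270937.1", "KN196472.1"):
    converted = convert_name(gencode_to_ucsc, contig)
    if converted is None:
        print(f"{contig}: no UCSC equivalent, skipping")
    else:
        print(f"{contig} -> {converted}")
```

Whether unmapped records should be skipped, kept under their original name, or treated as an error depends on the downstream analysis.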

Ambiguous/multi-way mappings

Occasionally, e.g., with mm9, UCSC will merge contigs together into an ordered *_random sequence. This means that an individual entry in UCSC can map to multiple entries in Ensembl and Gencode. Such cases are treated the same as missing entries, described above. An alternative would be to provide a comma-separated list of mapping targets and their chromosome offsets and/or ranges. As this situation tends to occur only in older UCSC reference genomes, which are decreasingly used, I would prefer to avoid this complication.

Patch versions

It's often the case that a patch will have an associated version, such as .2 in KB469738.2. While the patch itself will exist across genome updates, the version number may change. Consequently, it may be necessary to strip off these versions when performing name conversions, simply to support different versions/patches of the same genome.
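One way to strip versions, sketched in Python under the assumption that a version is a trailing dot followed only by digits. The helpers `strip_version` and `lookup` are hypothetical names, and `GL877870.1` is a made-up older version used only to illustrate the fallback. UCSC-style names that embed the version differently (e.g., the v1 in chr3_KI270937v1_alt) are not handled here.

```python
def strip_version(name):
    """Remove a trailing ".<digits>" version, if present."""
    base, dot, suffix = name.rpartition(".")
    if base and dot and suffix.isdigit():
        return base
    return name


def lookup(mapping, name):
    """Look up name exactly, then fall back to a version-less match.

    Returns "" when nothing matches, mirroring the empty second
    column used for missing entries.
    """
    if name in mapping:
        return mapping[name]
    # Rebuilt on every call for brevity; cache it for large inputs.
    versionless = {strip_version(key): value for key, value in mapping.items()}
    return versionless.get(strip_version(name), "")


# Entry copied from the GRCh37_gencode2ensembl.txt excerpt.
gencode_to_ensembl = {"GL877870.2": "HG1001_PATCH", "chrX": "X"}

print(strip_version("KB469738.2"))               # prints KB469738
print(lookup(gencode_to_ensembl, "GL877870.1"))  # prints HG1001_PATCH
```

Exact matches are tried first so that names without a version suffix, such as chrX, are never altered.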

Note

Note that some data sources are absent. For example, wormbase has not been included, since its chromosome naming system is identical to that in Ensembl.

Please submit a pull request or an issue if you find any errors!