Genome patching effect on alignment #673

claymcleod opened this issue Jun 27, 2019 · 2 comments


@claymcleod claymcleod commented Jun 27, 2019

Dr. Dobin,

I'm creating this issue not to report a problem, but to get your opinion on a question that I could not answer by searching. We use the GRCh38_no_alt analysis set as our reference genome, which, as you probably know, is derived from the original GRCh38 build. In the manual, I see that you recommend this as a best practice:

Generally, patches and alternative haplotypes should not be included in the genome.

Additionally, you specify the GENCODE gene set as one of the gene models you recommend. If you go to the most current release of the GENCODE gene model (v30 at the time of writing), you will see that the gene set is based on GRCh38.p12. To my eye, 113 fixes of widely varying size have been applied to GRCh38 since its inception.

Keeping in mind that the coordinates remain backward compatible between patches but the underlying sequence may change, do you have thoughts on what effect (if any) might occur because of the apparent mismatch between the GRCh38.p12-based gene model and the GRCh38.p0-based reference genome? Intuition would lead me to believe that expression likely wouldn't be affected very much, but splice junction detection might be, especially in areas where the reference genome was updated around these sites.

As far as I can tell, the community is not overly concerned:

  • The GDC are using GENCODE v22 against the GRCh38 no alt analysis set (sources for gene model and reference genome).
  • ENCODE is using GENCODE v24 against the GRCh38 no alt analysis set (source).

But before moving forward, I wanted to get your thoughts. We're considering using the latest GENCODE release for our pipeline (v30), so it's quite a bit more divergent from the original GRCh38 release than versions 22 and 24.



@alexdobin alexdobin commented Jul 1, 2019

Hi Clay,

If the patches substantially change the sequence in the exons of expressed genes, both gene expression and splice junction detection may be affected. This is probably not easy to quantify; the simplest way may be to map a few samples to both p0 and p12 and compare the results.
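One way to make the junction side of that comparison concrete is sketched below. It assumes each sample was mapped twice with STAR, once per genome, and that the standard `SJ.out.tab` output (tab-separated; columns 1 to 4 are chromosome, intron start, intron end, and strand, and column 7 is the number of uniquely mapping reads) is available for both runs. The function names and the `min_unique` read threshold are illustrative choices, not part of STAR. Junctions are matched by coordinate, which only makes sense on sequences whose coordinates are shared between the two genomes.

```python
import csv


def read_junctions(handle, min_unique=1):
    """Parse a STAR SJ.out.tab stream into {(chrom, start, end, strand): unique_reads}.

    min_unique is a hypothetical filter: junctions supported by fewer
    uniquely mapping reads (column 7) are dropped.
    """
    junctions = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        key = (row[0], int(row[1]), int(row[2]), row[3])
        unique_reads = int(row[6])
        if unique_reads >= min_unique:
            junctions[key] = unique_reads
    return junctions


def compare_junctions(a, b):
    """Summarize the overlap between two junction dictionaries."""
    keys_a, keys_b = set(a), set(b)
    shared = keys_a & keys_b
    union = keys_a | keys_b
    return {
        "shared": len(shared),
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
        # Jaccard index: 1.0 means both runs report identical junction sets.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }
```

With two runs on disk, this could be used as `compare_junctions(read_junctions(open(p0_path)), read_junctions(open(p12_path)))`, where `p0_path` and `p12_path` are placeholders for the two `SJ.out.tab` files. Junctions that appear in only one run are the natural starting point for checking whether they fall near patched sequence.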

For consistency's sake, I think the best approach is to use the GENCODE GTF and FASTA (i.e. p12 for v30).


@alexdobin alexdobin added the question label Jul 1, 2019


@claymcleod claymcleod commented Jul 4, 2019

Thank you for your reply, Dr. Dobin. My feeling is that it would be most consistent to use the supplied FASTA file, as you say. However, I do think most of our users would probably prefer the benefits afforded to them by the no alt analysis set.

We will do an analysis on the effect of this discordance and either (a) use the latest GENCODE gene set if the differences are small or (b) roll back to the latest gene set based on patch 0 of GRCh38 if the differences are large.
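A rough sketch of how the expression side of such an analysis might be scored follows. It assumes STAR was run with `--quantMode GeneCounts` against both genomes, producing a `ReadsPerGene.out.tab` per run (four tab-separated columns: gene ID, then unstranded, read-1-strand, and read-2-strand counts, preceded by four summary rows whose IDs start with `N_`). The `min_count` and `max_fold` cutoffs are arbitrary placeholders for whatever "small" and "large" end up meaning; they are not values taken from this thread.

```python
import csv


def read_gene_counts(handle, column=1):
    """Parse a STAR ReadsPerGene.out.tab stream into {gene_id: count}.

    column selects the count column: 1 = unstranded, 2 or 3 = stranded.
    Summary rows (N_unmapped, N_multimapping, ...) are skipped.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("N_"):
            continue
        counts[row[0]] = int(row[column])
    return counts


def discordant_genes(a, b, min_count=10, max_fold=1.5):
    """List genes whose counts differ by more than max_fold between two runs.

    Both thresholds are hypothetical. Genes with fewer than min_count reads
    in both runs are ignored, and a pseudocount of 1 avoids division by zero.
    """
    flagged = []
    for gene in sorted(set(a) & set(b)):
        x, y = a[gene], b[gene]
        if max(x, y) < min_count:
            continue
        ratio = (max(x, y) + 1) / (min(x, y) + 1)
        if ratio > max_fold:
            flagged.append((gene, x, y))
    return flagged
```

The fraction `len(discordant_genes(a, b)) / len(a)` across a few samples gives one crude number to base the (a) versus (b) decision on; raw counts are compared here only because both runs start from the same reads, so no library-size normalization is attempted.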

@claymcleod claymcleod closed this Jul 4, 2019