Support GRCh38 #817

Closed
schelhorn opened this Issue Apr 8, 2015 · 40 comments

Comments

Projects
None yet
@schelhorn

I was wondering if there are any plans to support GRCh38 as an additional assembly. The alternate loci and decoy sequences are likely to improve both variant calling and expression studies. I suppose that the availability of genomes and annotations are not much of a problem anymore, but not all tools may be compatible. Could someone detail the major roadblocks concerning this issue? Thanks.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Apr 9, 2015

Member

Sven-Eric;
We'd love to support GRCh38 and I agree that it's a much improved build. The only major blocker is that we don't currently have any reference standards to validate against so any update would be blind to the performance of the callers or our implementation. The Genome in a Bottle consortium is currently working on this so hopefully this is only a short term issue.

Beyond that, it's a matter of putting together the time and resources to work on this. The major steps we need to do are:

  • Add data preparation for the reference build and associated files (dbSNP and friends). We've been using the Broad bundles for other human builds but they haven't yet moved to GRCh38.
  • Add data preparation for GRCh38 reference genomes for RNA-seq.
  • Update bwa-mem code to be aware and take into account alternative locus mapping (https://github.com/lh3/bwa/blob/master/README-alt.md).
  • Testing to ensure callers work cleanly with new build.
  • Updating downstream tools we rely on, like GEMINI, to contain annotations for GRCh38.

We're hopeful to get a project that needs/asks for GRCh38 we could use to fund the work on this, or happy to work along slowly as we have time to add it in. Thanks for the discussion.

Member

chapmanb commented Apr 9, 2015

Sven-Eric;
We'd love to support GRCh38 and I agree that it's a much improved build. The only major blocker is that we don't currently have any reference standards to validate against so any update would be blind to the performance of the callers or our implementation. The Genome in a Bottle consortium is currently working on this so hopefully this is only a short term issue.

Beyond that, it's a matter of putting together the time and resources to work on this. The major steps we need to do are:

  • Add data preparation for the reference build and associated files (dbSNP and friends). We've been using the Broad bundles for other human builds but they haven't yet moved to GRCh38.
  • Add data preparation for GRCh38 reference genomes for RNA-seq.
  • Update bwa-mem code to be aware and take into account alternative locus mapping (https://github.com/lh3/bwa/blob/master/README-alt.md).
  • Testing to ensure callers work cleanly with new build.
  • Updating downstream tools we rely on, like GEMINI, to contain annotations for GRCh38.

We're hopeful to get a project that needs/asks for GRCh38 we could use to fund the work on this, or happy to work along slowly as we have time to add it in. Thanks for the discussion.

@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn Apr 10, 2015

Great, thanks for the comprehensive answer. It's good to be conservative about references and I for one am happy to wait for validation standards and GEMINI support if helps us to secure the validity of the results. Please feel free to close this issue. Perhaps it could make sense to copy the answer to the documentation in case anyone else is looking for it?

Great, thanks for the comprehensive answer. It's good to be conservative about references and I for one am happy to wait for validation standards and GEMINI support if helps us to secure the validity of the results. Please feel free to close this issue. Perhaps it could make sense to copy the answer to the documentation in case anyone else is looking for it?

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Apr 14, 2015

Member

Sven-Eric, definitely want to keep this moving even if slowly. To get the process going here are some notes on GRCh38 resources. We can continue to collect them here and merge into the automated data installation. ccing @arq5x and @brentp in case you've started working on resource collection/installation for GEMINI already as I'd love to coordinate with you:

  • FASTA file with decoy, ready for bwa-mem, including bwa and bowtie indices:
    ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

    With alternatives: GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna
    No alternatives alleles: GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

  • dbSNP for GRCh38:
    http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

    Full: 00-All.vcf.gz

  • Annotations from Ensembl:
    ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/

    release79: Homo_sapiens.GRCh38.79.gtf.gz
    Needs to be converted to use chr style prefixes instead of bare to maintain
    consistency with NCBI and UCSC.

  • COSMIC (licensing restrictions with v72):
    sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz
    sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicNonCodingVariants.vcf.gz

  • ClinVar:
    ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

Member

chapmanb commented Apr 14, 2015

Sven-Eric, definitely want to keep this moving even if slowly. To get the process going here are some notes on GRCh38 resources. We can continue to collect them here and merge into the automated data installation. ccing @arq5x and @brentp in case you've started working on resource collection/installation for GEMINI already as I'd love to coordinate with you:

  • FASTA file with decoy, ready for bwa-mem, including bwa and bowtie indices:
    ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

    With alternatives: GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna
    No alternatives alleles: GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

  • dbSNP for GRCh38:
    http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

    Full: 00-All.vcf.gz

  • Annotations from Ensembl:
    ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/

    release79: Homo_sapiens.GRCh38.79.gtf.gz
    Needs to be converted to use chr style prefixes instead of bare to maintain
    consistency with NCBI and UCSC.

  • COSMIC (licensing restrictions with v72):
    sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz
    sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicNonCodingVariants.vcf.gz

  • ClinVar:
    ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

@arq5x

This comment has been minimized.

Show comment
Hide comment
@arq5x

arq5x Apr 15, 2015

We haven’t yet started coordinating annotations for GRCh38, but this is a very useful collection. My big concern is how to port datasets such as ESP and ExAC, as they will almost certainly not be recalled on GRCh38!

Aaron

On Apr 14, 2015, at 2:12 PM, Brad Chapman notifications@github.com wrote:

Sven-Eric, definitely want to keep this moving even if slowly. To get the process going here are some notes on GRCh38 resources. We can continue to collect them here and merge into the automated data installation. ccing @arq5x and @brentp in case you've started working on resource collection/installation for GEMINI already as I'd love to coordinate with you:

FASTA file with decoy, ready for bwa-mem, including bwa and bowtie indices:
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

With alternatives: GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna
No alternatives alleles: GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

dbSNP for GRCh38:
http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

Full: 00-All.vcf.gz

Annotations from Ensembl:
ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/

release79: Homo_sapiens.GRCh38.79.gtf.gz
Needs to be converted to use chr style prefixes instead of bare to maintain
consistency with NCBI and UCSC.

COSMIC (licensing restrictions with v72):
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicNonCodingVariants.vcf.gz

ClinVar:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/


Reply to this email directly or view it on GitHub.

arq5x commented Apr 15, 2015

We haven’t yet started coordinating annotations for GRCh38, but this is a very useful collection. My big concern is how to port datasets such as ESP and ExAC, as they will almost certainly not be recalled on GRCh38!

Aaron

On Apr 14, 2015, at 2:12 PM, Brad Chapman notifications@github.com wrote:

Sven-Eric, definitely want to keep this moving even if slowly. To get the process going here are some notes on GRCh38 resources. We can continue to collect them here and merge into the automated data installation. ccing @arq5x and @brentp in case you've started working on resource collection/installation for GEMINI already as I'd love to coordinate with you:

FASTA file with decoy, ready for bwa-mem, including bwa and bowtie indices:
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

With alternatives: GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna
No alternatives alleles: GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

dbSNP for GRCh38:
http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

Full: 00-All.vcf.gz

Annotations from Ensembl:
ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/

release79: Homo_sapiens.GRCh38.79.gtf.gz
Needs to be converted to use chr style prefixes instead of bare to maintain
consistency with NCBI and UCSC.

COSMIC (licensing restrictions with v72):
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicNonCodingVariants.vcf.gz

ClinVar:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/


Reply to this email directly or view it on GitHub.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Apr 16, 2015

Member

Thanks Aaron -- we're going to have to hold our noses and get familiar with liftOver and friends, at least in the medium term. It looks like there are 3 different tools:

and some relevant discussion:

https://www.biostars.org/p/65558/
https://www.biostars.org/p/113184/

Does anyone have good/bad experience with any of these?

Member

chapmanb commented Apr 16, 2015

Thanks Aaron -- we're going to have to hold our noses and get familiar with liftOver and friends, at least in the medium term. It looks like there are 3 different tools:

and some relevant discussion:

https://www.biostars.org/p/65558/
https://www.biostars.org/p/113184/

Does anyone have good/bad experience with any of these?

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Apr 17, 2015

Member

From @druvus -- notes from SciLife's work on building a GATK bundle for 38:

https://wabi-wiki.scilifelab.se/display/SHGATG/gatk+bundle+in+hg38

Member

chapmanb commented Apr 17, 2015

From @druvus -- notes from SciLife's work on building a GATK bundle for 38:

https://wabi-wiki.scilifelab.se/display/SHGATG/gatk+bundle+in+hg38

@mw55309

This comment has been minimized.

Show comment
Hide comment
@mw55309

mw55309 Apr 20, 2015

Hi Guys

Bringing in Alison Meynert (@ameynert) who is one of our medical genomics experts in Edinburgh. She has been working on lift over of GRCh37 GATK resources to GRCh38.

Cheers
Mick

mw55309 commented Apr 20, 2015

Hi Guys

Bringing in Alison Meynert (@ameynert) who is one of our medical genomics experts in Edinburgh. She has been working on lift over of GRCh37 GATK resources to GRCh38.

Cheers
Mick

@ameynert

This comment has been minimized.

Show comment
Hide comment
@ameynert

ameynert Apr 20, 2015

I've been running CrossMap, and see similar results to what the SciLife post reported. Nearly all sites can be mapped to GRCh38 for all the GATK resources with the Ensembl chain file:

0.9983 1000G_omni2.5.b37.vcf
0.9974 1000G_phase1.indels.b37.vcf
0.9986 1000G_phase1.snps.high_confidence.b37.vcf
0.9961 dbsnp142.b37.vcf
0.9986 hapmap_3.3.b37.vcf
0.9976 Mills_and_1000G_gold_standard.indels.b37.vcf

I'll take a look at the breakdown of missing vs. incorrectly mapped for the dbSNP benchmark, and run dbSNP138 as well as 142.

I've been running CrossMap, and see similar results to what the SciLife post reported. Nearly all sites can be mapped to GRCh38 for all the GATK resources with the Ensembl chain file:

0.9983 1000G_omni2.5.b37.vcf
0.9974 1000G_phase1.indels.b37.vcf
0.9986 1000G_phase1.snps.high_confidence.b37.vcf
0.9961 dbsnp142.b37.vcf
0.9986 hapmap_3.3.b37.vcf
0.9976 Mills_and_1000G_gold_standard.indels.b37.vcf

I'll take a look at the breakdown of missing vs. incorrectly mapped for the dbSNP benchmark, and run dbSNP138 as well as 142.

@arq5x

This comment has been minimized.

Show comment
Hide comment
@arq5x

arq5x Apr 20, 2015

Very interesting, thanks! I expect that this approach will work equally as well for ESP, Clinvar, and the ExAC dataset. However, I wonder how appropriate lift over is for annotations such as CADD and fitCons where local genomic context factors into the scores assigned. In other words, the new location is likely not in doubt, but the question is whether the score calculated for b37 is appropriate for b38. I suspect tht there are many cases where this is not the case.

arq5x commented Apr 20, 2015

Very interesting, thanks! I expect that this approach will work equally as well for ESP, Clinvar, and the ExAC dataset. However, I wonder how appropriate lift over is for annotations such as CADD and fitCons where local genomic context factors into the scores assigned. In other words, the new location is likely not in doubt, but the question is whether the score calculated for b37 is appropriate for b38. I suspect tht there are many cases where this is not the case.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Apr 21, 2015

Member

Alison and Mick -- thanks so much for investigating the CrossMap approach. Those numbers looks great, I'd be happy to include the GATK supporting files into bcbio so we have them available. We can always iterate and update as we continue to evaluate these.

Practically, I added in downloads of hg38/GRCh38 (without alternative sequences) to bcbio. If you update to the latest version of the development code you can do:

bcbio_nextgen.py upgrade -u development
bcbio_nextgen.py upgrade --data --genomes hg38-noalt

and it will install the reference genome, bwa/bowtie2 indices, and dbSNP 142. The references and indices all come from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/README_ANALYSIS_SETS

I haven't done any runs with the hg38 genome but having the basic data installation in place is step one.

Aaron, on the logistics side I used your GGD approach to manage the installation. The separation between the download/prep code and the machinery to actually do the installations is nice -- it makes the data provenance clear. I needed to stray from your initial GGD implementation of streaming the results of one download to one file since these preps involved picking and prepping multiple files from the initial tarballs. I implemented a take on GGD which uses a temporary directory, runs all of the file prep commands as a bash script, then pulls out the final files into the output directory when updated:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes
https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/ggd.py

I'd love to try and coordinate this with what you want to do for GEMINI so we can have a single approach to get one set of downloaded data.

Thanks all for the discussion and ideas.

Member

chapmanb commented Apr 21, 2015

Alison and Mick -- thanks so much for investigating the CrossMap approach. Those numbers looks great, I'd be happy to include the GATK supporting files into bcbio so we have them available. We can always iterate and update as we continue to evaluate these.

Practically, I added in downloads of hg38/GRCh38 (without alternative sequences) to bcbio. If you update to the latest version of the development code you can do:

bcbio_nextgen.py upgrade -u development
bcbio_nextgen.py upgrade --data --genomes hg38-noalt

and it will install the reference genome, bwa/bowtie2 indices, and dbSNP 142. The references and indices all come from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/README_ANALYSIS_SETS

I haven't done any runs with the hg38 genome but having the basic data installation in place is step one.

Aaron, on the logistics side I used your GGD approach to manage the installation. The separation between the download/prep code and the machinery to actually do the installations is nice -- it makes the data provenance clear. I needed to stray from your initial GGD implementation of streaming the results of one download to one file since these preps involved picking and prepping multiple files from the initial tarballs. I implemented a take on GGD which uses a temporary directory, runs all of the file prep commands as a bash script, then pulls out the final files into the output directory when updated:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes
https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/ggd.py

I'd love to try and coordinate this with what you want to do for GEMINI so we can have a single approach to get one set of downloaded data.

Thanks all for the discussion and ideas.

@druvus

This comment has been minimized.

Show comment
Hide comment
@druvus

druvus Apr 22, 2015

Hi,
just want to bring in Pall Olason (@pallolason) who did the liftover work at SciLifeLabs into the discussion.

cheers,
Andreas

druvus commented Apr 22, 2015

Hi,
just want to bring in Pall Olason (@pallolason) who did the liftover work at SciLifeLabs into the discussion.

cheers,
Andreas

@ameynert

This comment has been minimized.

Show comment
Hide comment
@ameynert

ameynert Apr 22, 2015

The GATK bundle file of dbSNP138 excluding 1000 Genomes samples should also get lifted over for use with VQSR. There are issues with tranche calculations caused by hard-coded expectations of pre-1KG data. See the last comment on the GATK forum here: http://gatkforums.broadinstitute.org/discussion/5221/tranches-plot-issue

The presenters at this week's GATK roadshow said they were hoping to have this fixed in release 3.4.

The GATK bundle file of dbSNP138 excluding 1000 Genomes samples should also get lifted over for use with VQSR. There are issues with tranche calculations caused by hard-coded expectations of pre-1KG data. See the last comment on the GATK forum here: http://gatkforums.broadinstitute.org/discussion/5221/tranches-plot-issue

The presenters at this week's GATK roadshow said they were hoping to have this fixed in release 3.4.

@pallolason

This comment has been minimized.

Show comment
Hide comment
@pallolason

pallolason Apr 23, 2015

Thanks for the heads up @druvus.
@ameynert , did the Ensembl chain files work for you out of the box with CrossMap? When I use them to map vcf files, I have to edit the "target" genome chromosome names to "chrN" style. Otherwise CrossMap will not fetch the REF allele from the fasta file. I reproduced that with the CrossMap tool on Ensembls assembly converter page: http://www.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core

Thanks for the heads up @druvus.
@ameynert , did the Ensembl chain files work for you out of the box with CrossMap? When I use them to map vcf files, I have to edit the "target" genome chromosome names to "chrN" style. Otherwise CrossMap will not fetch the REF allele from the fasta file. I reproduced that with the CrossMap tool on Ensembls assembly converter page: http://www.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core

@ameynert

This comment has been minimized.

Show comment
Hide comment
@ameynert

ameynert Apr 28, 2015

@pallolason The chain files worked out of the box for coordinate mapping only, but not for the reference bases, as you found. I'm trying the UCSC hg19 to hg38 chain files to see how those compare to hacking the Ensembl chain files to use 'chrN' style coordinates. It's strange that CrossMap has this limitation since the VCF files I'm converting are all in 'N' chromosome style.

@pallolason The chain files worked out of the box for coordinate mapping only, but not for the reference bases, as you found. I'm trying the UCSC hg19 to hg38 chain files to see how those compare to hacking the Ensembl chain files to use 'chrN' style coordinates. It's strange that CrossMap has this limitation since the VCF files I'm converting are all in 'N' chromosome style.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Apr 28, 2015

Member

Alison -- for GATK VQSR plotting compatible dbSNPs, we could subset the existing NCBI GRCh38 dbSNP VCF. It has dbSNPBuildID for the introduced version, plus KGPhase1 and KGPhase3 to indicate 1000genomes versions. That'll help us create a subset file directly from the already lifted over version instead of doing it twice. Since it's just a plotting issue it might be worth waiting until 3.4 and seeing what the GATK folks recommend before doing this.

When you think the lifted over supplemental files are ready I'll be happy to include them in the bcbio distribution. Just let me know.

For the full GRCh38 build plus alts, we'll plan to use the distribution from 1000 genomes:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

This is the NCBI full build plus additional HLAs from @lh3 which will hopefully get us HLA typing:

https://github.com/lh3/bwa/blob/master/README-alt.md#hla-typing

Member

chapmanb commented Apr 28, 2015

Alison -- for GATK VQSR plotting compatible dbSNPs, we could subset the existing NCBI GRCh38 dbSNP VCF. It has dbSNPBuildID for the introduced version, plus KGPhase1 and KGPhase3 to indicate 1000genomes versions. That'll help us create a subset file directly from the already lifted over version instead of doing it twice. Since it's just a plotting issue it might be worth waiting until 3.4 and seeing what the GATK folks recommend before doing this.

When you think the lifted over supplemental files are ready I'll be happy to include them in the bcbio distribution. Just let me know.

For the full GRCh38 build plus alts, we'll plan to use the distribution from 1000 genomes:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

This is the NCBI full build plus additional HLAs from @lh3 which will hopefully get us HLA typing:

https://github.com/lh3/bwa/blob/master/README-alt.md#hla-typing

@pallolason

This comment has been minimized.

Show comment
Hide comment
@pallolason

pallolason Apr 28, 2015

Thanks @ameynert. The hg19tohg38 chain file from UCSC works fine for me, comparable numbers as with the Ensembl chain file after realizing that my initial benchmarks had a sorting glitch (intersectBed -sorted expects lexically sorted chromosomes and I used vcf-sort -c which numerically sorts the chromosomes). Note that once lifted over, the vcf files may contain entries where the ref and (one of the) alt alleles are the same and GATK does not like those: http://gatkforums.broadinstitute.org/discussion/4489/when-the-bundle-of-supportive-files-are-going-to-be-available-in-ftp-site-for-hg38-release.

Thanks @ameynert. The hg19tohg38 chain file from UCSC works fine for me, comparable numbers as with the Ensembl chain file after realizing that my initial benchmarks had a sorting glitch (intersectBed -sorted expects lexically sorted chromosomes and I used vcf-sort -c which numerically sorts the chromosomes). Note that once lifted over, the vcf files may contain entries where the ref and (one of the) alt alleles are the same and GATK does not like those: http://gatkforums.broadinstitute.org/discussion/4489/when-the-bundle-of-supportive-files-are-going-to-be-available-in-ftp-site-for-hg38-release.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb May 10, 2015

Member

Hi all;
We pushed a new release (0.8.9) with initial support for build 38 for variant calling. There are two builds hg38 (the 1000 genomes distribution with all the NCBI alts + HLAs, prepared by Heng Li) and hg38-noalt (like hg37, but with the new build coordinates). Both will run a standard variant calling pipeline with annotations. The additional work we still need to do includes:

  • Performing validations. Deanna Church and the GA4GH group are working on converting the Genome in a Bottle reference to 38, which we can then use as a first pass for validating methods until we have clean reference validation builds aligned to 38.
  • Adding additional annotations, including the Broad bundle and resources from GEMINI. Right now it only includes dbSNP and ClinVar. For GATK runs, this means it'll skip VQSR and instead do hard filtering.
  • Supporting downstream queries through GEMINI. I hope to work with Brent and Aaron on converting over resources to 38 and having a central approach to download/prepare them.
  • RNA-seq support. We need to spend time to prepare data, test and run.
  • HLA calling. The basic framework is in place for bwa-mem alignments using Heng's bwakit approach but we need to finalize and validate. This is being tracked in #178.

Please let us know if you run into any issues with 38 as is and thanks for all the help and support on getting it this far.

Member

chapmanb commented May 10, 2015

Hi all;
We pushed a new release (0.8.9) with initial support for build 38 for variant calling. There are two builds hg38 (the 1000 genomes distribution with all the NCBI alts + HLAs, prepared by Heng Li) and hg38-noalt (like hg37, but with the new build coordinates). Both will run a standard variant calling pipeline with annotations. The additional work we still need to do includes:

  • Performing validations. Deanna Church and the GA4GH group are working on converting the Genome in a Bottle reference to 38, which we can then use as a first pass for validating methods until we have clean reference validation builds aligned to 38.
  • Adding additional annotations, including the Broad bundle and resources from GEMINI. Right now it only includes dbSNP and ClinVar. For GATK runs, this means it'll skip VQSR and instead do hard filtering.
  • Supporting downstream queries through GEMINI. I hope to work with Brent and Aaron on converting over resources to 38 and having a central approach to download/prepare them.
  • RNA-seq support. We need to spend time to prepare data, test and run.
  • HLA calling. The basic framework is in place for bwa-mem alignments using Heng's bwakit approach but we need to finalize and validate. This is being tracked in #178.

Please let us know if you run into any issues with 38 as is and thanks for all the help and support on getting it this far.

@schelhorn schelhorn referenced this issue in alexdobin/STAR May 12, 2015

Open

Supporting GRCh38/hg38 #39

roryk added a commit to chapmanb/cloudbiolinux that referenced this issue May 13, 2015

Less strict GTF cleaning.
Keep items with a 'gene_id' but no 'transcript_id'. When loading, guess
if the file has 'gene' and 'transcript' items already, and if so, skip
inferring them.

Addresses concerns in daler/gffutils#48. Brings RNA-sew support for
GRCh38 closer, addressing bcbio/bcbio-nextgen#817. Some related
discussion in bcbio/bcbio-nextgen#849 as well.
@ameynert

This comment has been minimized.

Show comment
Hide comment
@ameynert

ameynert May 15, 2015

I've been testing a standard BWA+GATK alignment pipeline with GRCh38 resources on exome samples using:

  • GRCh38 primary assembly downloaded from Ensembl
  • dbSNP 142 in GRCh38 downloaded directly from the dbSNP site
  • Mills & 1000G Gold indels, 1000G phase 1 indels for the RealignerTargetCreator and IndelRealigner steps, lifted over with CrossMap and the UCSC chain files, filtering out ref == alt cases
  • target BED file lifted over with CrossMap & UCSC chain files, filtering out the alt and Un contigs

Test runs on truncated FASTQ files are fine. I'm running 55 high depth samples through over the weekend and will post Monday on the results. The same set of samples were run on GRCh37 using otherwise identical parameters earlier in the week, so I should be able to make some useful comparisons.

@chapmanb Filtering down the existing dbSNP142 to exclude the post-129/1KG sites is a good plan. GATK 3.4 is out today. Their blog says "Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv)." (https://www.broadinstitute.org/gatk/blog?id=5562), which is a little unclear, but that could mean the issue has been addressed. Unfortunately I don't have time to test this myself just now - I'm going to be mostly offline for the next 4 weeks - but I will certainly have a look when I get back.

I've been testing a standard BWA+GATK alignment pipeline with GRCh38 resources on exome samples using:

  • GRCh38 primary assembly downloaded from Ensembl
  • dbSNP 142 in GRCh38 downloaded directly from the dbSNP site
  • Mills & 1000G Gold indels, 1000G phase 1 indels for the RealignerTargetCreator and IndelRealigner steps, lifted over with CrossMap and the UCSC chain files, filtering out ref == alt cases
  • target BED file lifted over with CrossMap & UCSC chain files, filtering out the alt and Un contigs

Test runs on truncated FASTQ files are fine. I'm running 55 high depth samples through over the weekend and will post Monday on the results. The same set of samples were run on GRCh37 using otherwise identical parameters earlier in the week, so I should be able to make some useful comparisons.

@chapmanb Filtering down the existing dbSNP142 to exclude the post-129/1KG sites is a good plan. GATK 3.4 is out today. Their blog says "Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv)." (https://www.broadinstitute.org/gatk/blog?id=5562), which is a little unclear, but that could mean the issue has been addressed. Unfortunately I don't have time to test this myself just now - I'm going to be mostly offline for the next 4 weeks - but I will certainly have a look when I get back.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb May 15, 2015

Member

Alison;
Brilliant, thank you for testing and preparing the VQSR supplemental files. Are those publicly available so we can incorporate them into bcbio? I'm happy to host on S3. With those, I'll test VQSR and the tranche plots in 3.4 to see if we can avoid needing to support multiple dbSNPs. Thanks again for all the work and hope you have a relaxing time off.

Member

chapmanb commented May 15, 2015

Alison;
Brilliant, thank you for testing and preparing the VQSR supplemental files. Are those publicly available so we can incorporate them into bcbio? I'm happy to host on S3. With those, I'll test VQSR and the tranche plots in 3.4 to see if we can avoid needing to support multiple dbSNPs. Thanks again for all the work and hope you have a relaxing time off.

@ameynert

This comment has been minimized.

Show comment
Hide comment
@ameynert

ameynert May 15, 2015

I'll post the files once I've confirmed the high depth exome runs are ok, and put a link in this thread.

I used this file for dbSNP142: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b142_GRCh38/VCF/All_20150416.vcf.gz

I'll post the files once I've confirmed the high depth exome runs are ok, and put a link in this thread.

I used this file for dbSNP142: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b142_GRCh38/VCF/All_20150416.vcf.gz

@ameynert

This comment has been minimized.

Show comment
Hide comment
@ameynert

ameynert May 18, 2015

I've done a test of SNP and indel VQSRs with my GRCh38 GATK bundle files on four sets of lifted over VCFs, each with a few hundred exomes. No crashes, the results look fine, and the tranche stats are very close to those from the original GRCh37 VCFs. I've uploaded the files to my university's FTP server and sent the links to Brad so he can put them up and everyone else can have a crack at testing.

I've done a test of SNP and indel VQSRs with my GRCh38 GATK bundle files on four sets of lifted over VCFs, each with a few hundred exomes. No crashes, the results look fine, and the tranche stats are very close to those from the original GRCh37 VCFs. I've uploaded the files to my university's FTP server and sent the links to Brad so he can put them up and everyone else can have a crack at testing.

@mw55309

This comment has been minimized.

Show comment
Hide comment
@mw55309

mw55309 May 19, 2015

Hi

I want to bring in @lauraclarke who has been gathering GRCh38 resources for the 1KG project at EBI. Hopefully we can work together!

Laura: Alison Meynert from IGMM here in Edinburgh has been working on liftover from GRCh37

Cheers
Mick

mw55309 commented May 19, 2015

Hi

I want to bring in @lauraclarke who has been gathering GRCh38 resources for the 1KG project at EBI. Hopefully we can work together!

Laura: Alison Meynert from IGMM here in Edinburgh has been working on liftover from GRCh37

Cheers
Mick

@lauraclarke

This comment has been minimized.

Show comment
Hide comment
@lauraclarke

lauraclarke May 20, 2015

Hi Mick, Alison

I am more than happy to chat about what we are doing to remap the 1000
Genomes data.

Do let me know what issues you want to talk about.

thanks

Laura

On 19 May 2015 at 08:55, mw55309 notifications@github.com wrote:

Hi

I want to bring in @lauraclarke https://github.com/lauraclarke who has
been gathering GRCh38 resources for the 1KG project at EBI. Hopefully we
can work together!

Laura: Alison Meynert from IGMM here in Edinburgh has been working on
liftover from GRCh37

Cheers
Mick

Reply to this email directly or view it on GitHub
#817 (comment)
.

Hi Mick, Alison

I am more than happy to chat about what we are doing to remap the 1000
Genomes data.

Do let me know what issues you want to talk about.

thanks

Laura

On 19 May 2015 at 08:55, mw55309 notifications@github.com wrote:

Hi

I want to bring in @lauraclarke https://github.com/lauraclarke who has
been gathering GRCh38 resources for the 1KG project at EBI. Hopefully we
can work together!

Laura: Alison Meynert from IGMM here in Edinburgh has been working on
liftover from GRCh37

Cheers
Mick

Reply to this email directly or view it on GitHub
#817 (comment)
.

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb May 21, 2015

Member

Alison -- thanks so much for preparing the Broad bundle VQSR resource files. I uploaded these to an S3 bucket and added downloads for bcbio, so doing a bcbio_nextgen.py upgrade --data will prepare these to be ready to go.

Laura, it would be great to have you involved. Our goal has been to collect and create resources for 38 so we can start having the annotations and analyses we're used to on 37. We have a simple recipe based approach taken from what Aaron Quinlan put together with the current resources we've collected:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38

From my perspective some useful points of collaboration would be:

  • More supplemental data. We're aiming to support the large subset of annotation data available for 37 through GEMINI (http://gemini.readthedocs.org/en/latest/) so there is plenty to do. We'd be happy to use anything you're preparing.
  • Hosting for lifted over files. Right now I have some of these in an S3 bucket, but a more official repository of 38 resources would be a big help towards driving adoption.

Thank you all again for the work so far. Great to see this moving along.

Member

chapmanb commented May 21, 2015

Alison -- thanks so much for preparing the Broad bundle VQSR resource files. I uploaded these to an S3 bucket and added downloads for bcbio, so doing a bcbio_nextgen.py upgrade --data will prepare these to be ready to go.

Laura, it would be great to have you involved. Our goal has been to collect and create resources for 38 so we can start having the annotations and analyses we're used to on 37. We have a simple recipe based approach taken from what Aaron Quinlan put together with the current resources we've collected:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38

From my perspective some useful points of collaboration would be:

  • More supplemental data. We're aiming to support the large subset of annotation data available for 37 through GEMINI (http://gemini.readthedocs.org/en/latest/) so there is plenty to do. We'd be happy to use anything you're preparing.
  • Hosting for lifted over files. Right now I have some of these in an S3 bucket, but a more official repository of 38 resources would be a big help towards driving adoption.

Thank you all again for the work so far. Great to see this moving along.

@lauraclarke

This comment has been minimized.

Show comment
Hide comment
@lauraclarke

lauraclarke May 21, 2015

Hi Brad

This looks great, I would like to involve Holly from my team in this conversation but I don't seem to be able to tag her. Not sure if I am doing something wrong (her git user profile is https://github.com/HollyZB)

So we are in the process of remapping 1000Genomes data to GRCh38.

Our plan is to use the newest version of bwa with alt aware mapping. I see you are already pointing to our reference files

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

There is also a version of dbSNP 142 on GRCh38 coords and the phase 3 indels (much better than the phase 1 indels). We don't have the devine/mills variants, We might have to pinch them from you as we haven't started quite yet. Are there instructions on how we might use those yaml files to actually download the data anywhere, we haven't used ggv yet

One open question is how different mapping to just the primary ref is vesus primary plus alts, decoy and hla. We are starting to build data for this. We have the first few alignments of the high coverage PCR free 2*250 samples that we have aligned both ways.

e.g

laura@pg-trace-001[ftp]:grep GRCh38 current.tree | grep cram | grep HG00096 | cut -f1
ftp/technical/working/20150511_IGSR_highcov_cram_no_alt/HG00096/high_cov_alignment/HG00096.bwamem_GRCh38primary.20150424.high_coverage.bam.cram.crai
ftp/technical/working/20150511_IGSR_highcov_cram_no_alt/HG00096/high_cov_alignment/HG00096.bwamem_GRCh38primary.20150424.high_coverage.bam.cram
ftp/data/HG00096/high_cov_alignment/HG00096.alt_bwamem_GRCh38DH.20150424.high_coverage.bam.cram.crai
ftp/data/HG00096/high_cov_alignment/HG00096.alt_bwamem_GRCh38DH.20150424.high_coverage.bam.cram

We want to compare between the two mapping sets. If you already have any tools to do this or just suggestions on what the best metrics to compare the files on, we would be happy to hear them

thanks

Laura

Hi Brad

This looks great, I would like to involve Holly from my team in this conversation but I don't seem to be able to tag her. Not sure if I am doing something wrong (her git user profile is https://github.com/HollyZB)

So we are in the process of remapping 1000Genomes data to GRCh38.

Our plan is to use the newest version of bwa with alt aware mapping. I see you are already pointing to our reference files

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

There is also a version of dbSNP 142 on GRCh38 coords and the phase 3 indels (much better than the phase 1 indels). We don't have the devine/mills variants, We might have to pinch them from you as we haven't started quite yet. Are there instructions on how we might use those yaml files to actually download the data anywhere, we haven't used ggv yet

One open question is how different mapping to just the primary ref is vesus primary plus alts, decoy and hla. We are starting to build data for this. We have the first few alignments of the high coverage PCR free 2*250 samples that we have aligned both ways.

e.g

laura@pg-trace-001[ftp]:grep GRCh38 current.tree | grep cram | grep HG00096 | cut -f1
ftp/technical/working/20150511_IGSR_highcov_cram_no_alt/HG00096/high_cov_alignment/HG00096.bwamem_GRCh38primary.20150424.high_coverage.bam.cram.crai
ftp/technical/working/20150511_IGSR_highcov_cram_no_alt/HG00096/high_cov_alignment/HG00096.bwamem_GRCh38primary.20150424.high_coverage.bam.cram
ftp/data/HG00096/high_cov_alignment/HG00096.alt_bwamem_GRCh38DH.20150424.high_coverage.bam.cram.crai
ftp/data/HG00096/high_cov_alignment/HG00096.alt_bwamem_GRCh38DH.20150424.high_coverage.bam.cram

We want to compare between the two mapping sets. If you already have any tools to do this or just suggestions on what the best metrics to compare the files on, we would be happy to hear them

thanks

Laura

@lauraclarke

This comment has been minimized.

Show comment
Hide comment
@lauraclarke

lauraclarke May 22, 2015

I also realised I should say, if there is a good set of GRCh38 mapping resources, we are more than happy to host it on the 1000G ftp site, it is a natural place people come to look for such things anyway

I also realised I should say, if there is a good set of GRCh38 mapping resources, we are more than happy to host it on the 1000G ftp site, it is a natural place people come to look for such things anyway

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb May 22, 2015

Member

Laura -- thanks so much. You can notify/subscribe Holly to the discussion with an @ in front of her username (@HollyZB), and she'll get all future messages in the thread.

I appreciate all the help coordinating these files. It's great to be working on this as a community. Here's a pointer to all the files Alison prepared for direct downloads:

http://biodata.s3-website-us-east-1.amazonaws.com/hg38_bundle/

It would be brilliant to host these on the 1000G ftp site, if you're willing. One practical thing we've been trying to do is keep the chromosome naming to one consistent approach so we don't have the continued naming issues we have with GRCh37/hg19 (1 versus chr1). We're using the chr-prefixed style you also have in your reference distribution, so hopefully all is compatible.

For the VQSR resources you've prepared, do you have experience on how these truth inputs compare to the sets Broad make available that Alison lifted over? We're open to use anything that is tested and validated, so would definitely lean on your expertise.

For validation in general, you've probably already seen Heng's hapdip validation:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

We're planning to also do a full vs noalt comparison using NA12878, Platinum Genomes fastqs, and the Genome in a Bottle reference. Deanna Church is using remap to convert the 37 Genome in a Bottle reference calls to 38 and we'll run validations against those (in the style of http://bcb.io/2014/10/07/joint-calling/). Happy to coordinate on any and all of this.

Thanks again for all the discussion and work on 38 support.

Member

chapmanb commented May 22, 2015

Laura -- thanks so much. You can notify/subscribe Holly to the discussion with an @ in front of her username (@HollyZB), and she'll get all future messages in the thread.

I appreciate all the help coordinating these files. It's great to be working on this as a community. Here's a pointer to all the files Alison prepared for direct downloads:

http://biodata.s3-website-us-east-1.amazonaws.com/hg38_bundle/

It would be brilliant to host these on the 1000G ftp site, if you're willing. One practical thing we've been trying to do is keep the chromosome naming to one consistent approach so we don't have the continued naming issues we have with GRCh37/hg19 (1 versus chr1). We're using the chr-prefixed style you also have in your reference distribution, so hopefully all is compatible.

For the VQSR resources you've prepared, do you have experience on how these truth inputs compare to the sets Broad make available that Alison lifted over? We're open to use anything that is tested and validated, so would definitely lean on your expertise.

For validation in general, you've probably already seen Heng's hapdip validation:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

We're planning to also do a full vs noalt comparison using NA12878, Platinum Genomes fastqs, and the Genome in a Bottle reference. Deanna Church is using remap to convert the 37 Genome in a Bottle reference calls to 38 and we'll run validations against those (in the style of http://bcb.io/2014/10/07/joint-calling/). Happy to coordinate on any and all of this.

Thanks again for all the discussion and work on 38 support.

@HollyZB

This comment has been minimized.

Show comment
Hide comment
@HollyZB

HollyZB May 22, 2015

Thanks Brad.

I downloaded Mills' indels lifted by Alison. It passed our standard validation and yes indeed it needs "chr" in the start of all data lines. I will put it up at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources

HollyZB commented May 22, 2015

Thanks Brad.

I downloaded Mills' indels lifted by Alison. It passed our standard validation and yes indeed it needs "chr" in the start of all data lines. I will put it up at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources

@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn May 22, 2015

I guess this thread shows why bcbio is a community-developed pipeline -- awesome! I have two tidbits to add, both of which may be resources that are well known but I still wanted to mention them here for reference:

The Seattle Seq Annotation Server has lifted over to hg38 PolyPhen, RepeatMasker/TandemRepeat, KEGG, UCSC CpG Islands, ESP, ExAC, GERP, CADD, and three HapMap populations (African, Asian, European).

Annovar has lifted hg38 resources for ljb26 (whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, PhyloP and SiPhy scores from dbNSFP), cosmic v70, esp6500, ExAC, 1kG (October 2014), NCI60 and ClinVar.

In both cases I do not have more information than the statements referring to hg38 on the linked websites, but I am sure the developers of Seattle Seq and ANNOVAR would be open to collaboration if someone would contact them. In that manner development of high-quality canonical resources could be furthered by reconciling duplicated work earlier rather than later.

I guess this thread shows why bcbio is a community-developed pipeline -- awesome! I have two tidbits to add, both of which may be resources that are well known but I still wanted to mention them here for reference:

The Seattle Seq Annotation Server has lifted over to hg38 PolyPhen, RepeatMasker/TandemRepeat, KEGG, UCSC CpG Islands, ESP, ExAC, GERP, CADD, and three HapMap populations (African, Asian, European).

Annovar has lifted hg38 resources for ljb26 (whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, PhyloP and SiPhy scores from dbNSFP), cosmic v70, esp6500, ExAC, 1kG (October 2014), NCI60 and ClinVar.

In both cases I do not have more information than the statements referring to hg38 on the linked websites, but I am sure the developers of Seattle Seq and ANNOVAR would be open to collaboration if someone would contact them. In that manner development of high-quality canonical resources could be furthered by reconciling duplicated work earlier rather than later.

@lauraclarke

This comment has been minimized.

Show comment
Hide comment
@lauraclarke

lauraclarke May 22, 2015

On existing snp resources, the VEP has been available on GRCh38 since last year and Ensembl v80 which was released yesterday has all the 1000G data but chrX, Y and a small piece of chr12 as they werent in dbSNP 142.

On existing snp resources, the VEP has been available on GRCh38 since last year and Ensembl v80 which was released yesterday has all the 1000G data but chrX, Y and a small piece of chr12 as they werent in dbSNP 142.

@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn May 22, 2015

So I gather that most resources for GEMINI-based hg38 variant annotation should be available already, with possible exclusions of OMIM and HPRD (or are they available via UCSC tracks)? For CADD and FitCons (the latter is included in CADD 1.2) the authors may have to re-score on the new assembly due to the genomic context issue @arq5x mentioned. I can ask Martin Kircher of CADD to chime in and give an opinion, we used to be in the same group.

So I gather that most resources for GEMINI-based hg38 variant annotation should be available already, with possible exclusions of OMIM and HPRD (or are they available via UCSC tracks)? For CADD and FitCons (the latter is included in CADD 1.2) the authors may have to re-score on the new assembly due to the genomic context issue @arq5x mentioned. I can ask Martin Kircher of CADD to chime in and give an opinion, we used to be in the same group.

@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn May 22, 2015

Martin kindly answered right away and I am posting his position concerning CADD liftover here:

CADD depends on several tracks that are not yet released for GRCh38 or which we would need to recalculate (e.g. conservation metrics excluding the human reference but on human coordinates [PhastCons, PhyloP], GERP and several ENCODE super tracks). Especially for the ENCODE information, we are considering to switch to the Ensembl Regulatory Build which has been build for both genome releases now. Overall, Ensembl seems to be a great resource for GRCh38 annotations. While we would like to release a GRCh38 version of CADD in 2015, we have not actually started this effort yet. At the moment, lifting scores is the only option, but it does not give you any benefit from using the new genome build (i.e. additional >3MB of genome sequence, fixed reference errors, removed minority alleles, alignment errors in annotations derived from GRCh37). If you liftOver a score set, I would recommend the last major release (1.0) rather than the developmental ones that are more experimental. We are also aware of a problem with overlapping gene annotation in 1.1/1.2. We are still optimizing training parameters and hope to have 1.3 out in the not so far future. Regarding fitCons, I would not recommend using the version that is available in the CADD annotation files. This is not the latest available version. Also note that CADD is not using fitCons, we only included it to make comparisons easier for us.

Martin kindly answered right away and I am posting his position concerning CADD liftover here:

CADD depends on several tracks that are not yet released for GRCh38 or which we would need to recalculate (e.g. conservation metrics excluding the human reference but on human coordinates [PhastCons, PhyloP], GERP and several ENCODE super tracks). Especially for the ENCODE information, we are considering to switch to the Ensembl Regulatory Build which has been build for both genome releases now. Overall, Ensembl seems to be a great resource for GRCh38 annotations. While we would like to release a GRCh38 version of CADD in 2015, we have not actually started this effort yet. At the moment, lifting scores is the only option, but it does not give you any benefit from using the new genome build (i.e. additional >3MB of genome sequence, fixed reference errors, removed minority alleles, alignment errors in annotations derived from GRCh37). If you liftOver a score set, I would recommend the last major release (1.0) rather than the developmental ones that are more experimental. We are also aware of a problem with overlapping gene annotation in 1.1/1.2. We are still optimizing training parameters and hope to have 1.3 out in the not so far future. Regarding fitCons, I would not recommend using the version that is available in the CADD annotation files. This is not the latest available version. Also note that CADD is not using fitCons, we only included it to make comparisons easier for us.

@chapmanb chapmanb referenced this issue in chapmanb/cloudbiolinux Jun 15, 2015

Closed

Unable to 'install' star index for hg38 #197

@chapmanb

This comment has been minimized.

Show comment
Hide comment
@chapmanb

chapmanb Jun 26, 2015

Member

Nice Reddit discussion about pros and cons of moving to build 38:

http://www.reddit.com/r/genome/comments/3b3s3t/switch_from_hg19build37_to_hg20build38/

with a pointer to this effort by Shuoguo Wang to remap the Broad bundle:

https://github.com/iiiir/GRCH38_gatk_bundle

Member

chapmanb commented Jun 26, 2015

Nice Reddit discussion about pros and cons of moving to build 38:

http://www.reddit.com/r/genome/comments/3b3s3t/switch_from_hg19build37_to_hg20build38/

with a pointer to this effort by Shuoguo Wang to remap the Broad bundle:

https://github.com/iiiir/GRCH38_gatk_bundle

roryk added a commit that referenced this issue Oct 18, 2015

preliminary hisat2 support added for RNA-seq.
Currently evaluating this as a way to move to hg38 for RNA-seq.
Thanks to @mjafin, @BioInfo, @schelhorn and @chapmanb for motivating
doing this.

Closes #1029. Partially addresses #817 for RNA-seq.
@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn Oct 29, 2015

So the impossible just happened:

@dgmacarthur: OK, @KMS_Meltzy has convinced me. Firing up the terminal to remap ExAC onto GRCh38 as we speak. #gi2015

So the impossible just happened:

@dgmacarthur: OK, @KMS_Meltzy has convinced me. Firing up the terminal to remap ExAC onto GRCh38 as we speak. #gi2015

@mw55309

This comment has been minimized.

Show comment
Hide comment
@mw55309

mw55309 Oct 29, 2015

I think he may have been being sarcastic ;-)

mw55309 commented Oct 29, 2015

I think he may have been being sarcastic ;-)

@dgmacarthur

This comment has been minimized.

Show comment
Hide comment
@dgmacarthur

dgmacarthur Oct 29, 2015

Yes, that was sarcasm. One day we'll switch. Not today.


Sent from my phone.

On Oct 29, 2015, at 6:55, mw55309 notifications@github.com wrote:

I think he may have been being sarcastic ;-)


Reply to this email directly or view it on GitHub.

Yes, that was sarcasm. One day we'll switch. Not today.


Sent from my phone.

On Oct 29, 2015, at 6:55, mw55309 notifications@github.com wrote:

I think he may have been being sarcastic ;-)


Reply to this email directly or view it on GitHub.

@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn Oct 29, 2015

Sad panda 🐼. Well, perhaps next year then. @dgmacarthur should start a kickstarter for financing the CPU cycles. I'd pledge the re-analysis of 100 exomes...

Sad panda 🐼. Well, perhaps next year then. @dgmacarthur should start a kickstarter for financing the CPU cycles. I'd pledge the re-analysis of 100 exomes...

@roryk

This comment has been minimized.

Show comment
Hide comment
@roryk

roryk Nov 15, 2015

Collaborator

I've added hisat2 support so we can support this for RNA-seq but we've been having some issues with it so it isn't live yet. Transcriptome-only methods like Sailfish shouldn't have these same issues so we should be able to support this for RNA-seq officially pretty soon.

Collaborator

roryk commented Nov 15, 2015

I've added hisat2 support so we can support this for RNA-seq but we've been having some issues with it so it isn't live yet. Transcriptome-only methods like Sailfish shouldn't have these same issues so we should be able to support this for RNA-seq officially pretty soon.

@roryk

This comment has been minimized.

Show comment
Hide comment
@roryk

roryk Dec 9, 2015

Collaborator

hisat2 seems to work well for handling RNA-seq alignments, and I ran the SEQC dataset through hg38 and hg38-noalt and it seems to do well quantitating hg38 with the alts, so we're good to go for that.

Collaborator

roryk commented Dec 9, 2015

hisat2 seems to work well for handling RNA-seq alignments, and I ran the SEQC dataset through hg38 and hg38-noalt and it seems to do well quantitating hg38 with the alts, so we're good to go for that.

@arahuja arahuja referenced this issue in hammerlab/biokepi Feb 16, 2016

Open

Pass `-known` to the IndelRealigner #131

@schelhorn

This comment has been minimized.

Show comment
Hide comment
@schelhorn

schelhorn Nov 1, 2017

Thanks for all the insightful comments; since hg38 support in bcbio has since been completed for both DNA-Seq variant calling/annotation and RNA-Seq expression quantitation I am now closing this issue.

Please feel free to open a new bcbio issue in case you have an interest in robust and fully-featured pipelines on genome graphs, which represent one of the important next steps in high-volume NGS processing that we should consider taking.

In particular, the bcbio devs would be interested in obtaining publicly available test datasets (for instance for NA12878) to showcase the advantages of/benchmark genome graphs compared to flat references with the currently available toolchains.

Thanks for all the insightful comments; since hg38 support in bcbio has since been completed for both DNA-Seq variant calling/annotation and RNA-Seq expression quantitation I am now closing this issue.

Please feel free to open a new bcbio issue in case you have an interest in robust and fully-featured pipelines on genome graphs, which represent one of the important next steps in high-volume NGS processing that we should consider taking.

In particular, the bcbio devs would be interested in obtaining publicly available test datasets (for instance for NA12878) to showcase the advantages of/benchmark genome graphs compared to flat references with the currently available toolchains.

@schelhorn schelhorn closed this Nov 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment