Enable more frequent ClinVar data updates in Exomiser database #501

Closed
julesjacobsen opened this issue Jul 7, 2023 · 5 comments

julesjacobsen commented Jul 7, 2023

Monthly / quarterly frequency?
File format?

Related to #462 , #473

julesjacobsen added a commit that referenced this issue Jul 17, 2023
…ptor.toClinVarData as a public method

Minor formatting change to AbstractAnalysisRunner
Add default implementation to VariantFilterDataProvider
Delete unused commented-out code from GenomeAnalysisServiceImpl
julesjacobsen added a commit that referenced this issue Jul 17, 2023
Add new ClinVarDao
Add new ClinVarDaoMvStore
Update VariantDataServiceImpl to require ClinVarDao
Add new ClinVarWhiteListReader
Add new MvStoreUtil.clinVarMapBuilder and MvStoreUtil.openClinVarMVMap methods
julesjacobsen added a commit that referenced this issue Jul 17, 2023
Update spring-boot-autoconfigure module to include ClinVar MVStore in VariantDataServiceImpl
Add new application-default.properties set new clinvar-data-version and use-clinvar-whitelist properties
@julesjacobsen julesjacobsen added this to the 14.0.0 milestone Jul 17, 2023
@julesjacobsen julesjacobsen added this to To do in Release 14.0.0 via automation Jul 17, 2023
@julesjacobsen julesjacobsen self-assigned this Jul 17, 2023
@julesjacobsen
Contributor Author

This has been implemented as a new H2 MVStore index created from the ClinVar clinvar.vcf.gz file. Previously this file was parsed twice during the variant data build - once to include the ClinVarData in the variants.mv.db file and a second time to produce the clinvar_whitelist.tsv.gz file.

The downside of this was that users had to hack the provided clinvar_whitelist with their own data each time there was a new release. In addition, the variants.mv.db file was 99.9% identical from one data release to the next, because only the ClinVar data in it was actually updated; the rest was mostly static.

By separating out the ClinVar data (and providing easy downloads), it can be updated monthly following each ClinVar release, with users only needing to download a ~55MB clinvar.mv.db file and update the version in application.properties. Less savvy or less discerning users need take no action compared to the current workflow. The whitelist is now built by loading the user-supplied whitelist, if there is one, and merging it with a set of variants filtered dynamically from the clinvar.mv.db file. Users can disable ClinVar as a whitelist source with exomiser.hg38.use-clinvar-whitelist=false (or the hg19 equivalent). This property is hidden and defaults to true.

The latest 2302 release hg38 data directory looks like this:

2302_hg38
├── 2302_hg38_clinvar_whitelist.tsv.gz
├── 2302_hg38_clinvar_whitelist.tsv.gz.tbi
├── 2302_hg38_genome.h2.db
├── 2302_hg38_transcripts_ensembl.ser
├── 2302_hg38_transcripts_refseq.ser
├── 2302_hg38_transcripts_ucsc.ser
└── 2302_hg38_variants.mv.db

The updated, hypothetical 2307_hg38 data release directory looks like this:

2307_hg38
├── 2307_hg38_clinvar.mv.db  # just the one file now, although it's a binary blob
├── 2307_hg38_genome.h2.db
├── 2307_hg38_transcripts_ensembl.ser
├── 2307_hg38_transcripts_refseq.ser
├── 2307_hg38_transcripts_ucsc.ser
└── 2307_hg38_variants.mv.db

The next month, when a new ClinVar release is built:

2307_hg38
├── 2308_hg38_clinvar.mv.db  # replaced the 2307 with the 2308 version
├── 2307_hg38_genome.h2.db
├── 2307_hg38_transcripts_ensembl.ser
├── 2307_hg38_transcripts_refseq.ser
├── 2307_hg38_transcripts_ucsc.ser
└── 2307_hg38_variants.mv.db

Important! The application.properties file or ENV should be updated to use the new version:

exomiser.hg38.clinvar-data-version=2308
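
As a minimal sketch of the property update above, the shell function below rewrites the clinvar-data-version line in a properties file, or appends it if it is absent. The function name `set_clinvar_version` and its argument order are hypothetical, and it assumes plain `key=value` lines with no spaces around the equals sign.

```shell
#!/usr/bin/env bash
# Sketch only: set exomiser.<assembly>.clinvar-data-version in a properties file.
# The function name and argument order are illustrative assumptions.
set_clinvar_version() {
  local props_file="$1" assembly="$2" version="$3"
  local key="exomiser.${assembly}.clinvar-data-version"
  local tmp
  tmp="$(mktemp)" || return 1
  # Replace the existing line if present, otherwise append it at the end.
  awk -v key="$key" -v value="$version" '
    BEGIN { FS = "="; found = 0 }
    $1 == key { print key "=" value; found = 1; next }
    { print }
    END { if (!found) print key "=" value }
  ' "$props_file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back through cat so the original file keeps its permissions.
  cat "$tmp" > "$props_file"
  rm -f "$tmp"
}
```

For the example above this could be called as `set_clinvar_version application.properties hg38 2308`. All other lines in the file are left untouched.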

Logged on startup:

2023-07-14T17:50:01.343+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.data-directory: /data/exomiser-data
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.hg19.data-version: -
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.hg38.data-version: 2307
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.hg38.clinvar-data-version: 2308
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.phenotype.data-version: 2307
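
Before starting Exomiser it may help to confirm that the configured versions point at a file that exists. The sketch below assumes the layout inferred from the listings above, `<data-directory>/<data-version>_<assembly>/<clinvar-version>_<assembly>_clinvar.mv.db`; the function names are hypothetical and this is not part of Exomiser itself.

```shell
#!/usr/bin/env bash
# Sketch only: build the expected ClinVar MVStore path and check that it exists.
# Layout assumed from the directory listings in this thread:
#   <data-directory>/<data-version>_<assembly>/<clinvar-version>_<assembly>_clinvar.mv.db
clinvar_file_path() {
  local data_dir="$1" assembly="$2" data_version="$3" clinvar_version="$4"
  printf '%s/%s_%s/%s_%s_clinvar.mv.db\n' \
    "$data_dir" "$data_version" "$assembly" "$clinvar_version" "$assembly"
}

check_clinvar_file() {
  local path
  path="$(clinvar_file_path "$@")"
  if [ -f "$path" ]; then
    echo "found: $path"
  else
    echo "missing: $path" >&2
    return 1
  fi
}
```

With the values from the log above, `check_clinvar_file /data/exomiser-data hg38 2307 2308` would look for `/data/exomiser-data/2307_hg38/2308_hg38_clinvar.mv.db`.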

@pnrobinson note that this will require changes to LIRICAL

Release 14.0.0 automation moved this from To do to Done Jul 17, 2023
@julesjacobsen julesjacobsen reopened this Jul 17, 2023
Release 14.0.0 automation moved this from Done to In progress Jul 17, 2023
@julesjacobsen
Contributor Author

Still need to set up a cron job and provide the data.

@wsstoregene

wsstoregene commented Oct 26, 2023

Hello Jules,
Could you please provide a more detailed guideline about generating the 2308_hg38_clinvar.mv.db file from "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20231021.vcf.gz"?
To summarise what you mentioned above: after generating this file, we can just replace the whitelist file with this new one and change the following parameters in application.properties?

  • exomiser.hg38.use-clinvar-whitelist=false
  • exomiser.hg38.clinvar-data-version=2308

Thank you in advance for your help.

Kind Regards

@julesjacobsen
Contributor Author

julesjacobsen commented Nov 1, 2023

Hi @wsstoregene, this will be delivered in the next major version, so it's not ready yet unless you're building your own exomiser and running it from the development branch.

To generate the new file you'll need to use this command:

$ java -jar exomiser-data-genome-14.0.0-SNAPSHOT.jar --build-dir=. --assembly hg38 --version 2311 --clinvar

This will create a directory called 2311_hg38 containing the file 2311_hg38_clinvar.mv.db. You'll need to move this into your current exomiser hg38 data directory and then update the version to use in the application.properties:

exomiser.hg38.clinvar-data-version=2311

You should always run with exomiser.hg38.use-clinvar-whitelist=true, as otherwise you're losing the benefit of having the ClinVar annotations used for scoring known, high-quality P/LP variants.
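
The move step described above can be sketched as a small shell function. The function name `install_clinvar_build`, its argument order, and the assumption that the build output sits at `<build-dir>/<version>_<assembly>/<version>_<assembly>_clinvar.mv.db` are illustrative and inferred from this comment, not an official Exomiser script.

```shell
#!/usr/bin/env bash
# Sketch only: move a freshly built ClinVar file into an existing data directory.
# The function name and argument order are illustrative assumptions.
install_clinvar_build() {
  local build_dir="$1" data_dir="$2" assembly="$3" data_version="$4" clinvar_version="$5"
  local file="${clinvar_version}_${assembly}_clinvar.mv.db"
  local src="${build_dir}/${clinvar_version}_${assembly}/${file}"
  local dest_dir="${data_dir}/${data_version}_${assembly}"

  # Refuse to continue if either end of the move is not where we expect it.
  [ -f "$src" ] || { echo "build output not found: $src" >&2; return 1; }
  [ -d "$dest_dir" ] || { echo "data directory not found: $dest_dir" >&2; return 1; }

  # Print the final location so it can be checked or logged.
  mv "$src" "$dest_dir/" && echo "$dest_dir/$file"
}
```

The last two arguments are the version of the existing data directory and the version of the new ClinVar build, in that order. The clinvar-data-version property still needs to be updated to match the new build afterwards.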

Be aware that this is still subject to change as we're also doing work on adding more ACMG categories (#473) and it is likely that the data will need to be annotated for the variant effect as well.

julesjacobsen added a commit that referenced this issue Nov 15, 2023
…nnotations on ClinVarData at build time.
@julesjacobsen
Contributor Author

This now requires a transcript data file too, in order to annotate the variant consequence, i.e.:

$ java -jar exomiser-data-genome-14.0.0-SNAPSHOT.jar --assembly hg38 --version 2311 --clinvar path/to/2309_hg38/2309_hg38_transcripts_ensembl.ser

julesjacobsen added a commit that referenced this issue Nov 20, 2023
…nnotations on ClinVarData at build time.
Release 14.0.0 automation moved this from In progress to Done Apr 4, 2024