gene :: disease lookup. #571

brentp · 2015-10-26T14:57:32Z

via @jxchong cc @arq5x
A user may find a variant that is not itself associated with disease, but is in a gene known to be associated with the disease. There is no way to store or query this info in gemini at this time.

clinvar is an obvious choice and omim seems to be available to download for academic (http://omim.org/downloads).

Design

need a way to do this in a relatively generic fashion as we think about switching to support any organism.
need a simple query to do this even though we'd probably be storing in a table that's many-to-1 relative to gene_detailed (?)
could provide an example where user can use gemini annotate and smash all associations into a single "|" delimited string for every variant in that gene and it would be a single column in the variants table.
likely better to have a "|" delimited string associated with each gene and require a join on gene_name.

jxchong · 2015-10-26T16:58:51Z

For humans, OMIM is certainly where I check first.

For extending to other organisms, I suspect Monarch Initiative is the place to start (e.g. Exomiser uses model organism phenotype data curated by Monarch Initiative). I would guess you would want to be able to limit the "associated phenotype" to only those phenotypes that are actually known to be caused by mutations in the gene (vs GWAS signals, etc -- or at least a gene associated with a phenotype by GWAS is a different level of evidence/information than a gene in which mutations cause a Mendelian phenotype).

brentp · 2015-10-27T16:18:12Z

One way to do this currently would be to create a bed file with a pipe delimited disease list for each gene and then use gemini annotate.
We do want a more direct means, but that would be a general solution that's available in current gemini.

jxchong · 2015-10-27T19:05:34Z

Yeah I have an old file I made from OMIM that I'll use for now. Actually it occurs to me that this is the more general solution for this is also applicable to integrating pathway info -- I know pathways is currently a separate tool within Gemini, but ideally when doing variant filtration, you'd want to be able to output the pathway/KEGG/etc information for the gene that the variant is affecting.

jxchong · 2015-11-03T21:20:32Z

Oh hey! I think this appears to be implemented in VEP v82 with the VEP option
--gene_phenotype Indicates if the overlapped gene is associated with a phenotype, disease or trait. See list of phenotype sources. Not used by default

This appears to be multi-species. List of phenotype sources: http://uswest.ensembl.org/info/genome/variation/sources_phenotype_documentation.html

I have not had time to test yet though.

jxchong · 2015-11-09T19:03:32Z

According to the VEP devs, --gene_phenotype combined with --fields PHENO,GENE_PHENO should result in two columns (PHENO and GENE_PHENO) which ==1 if the variant or gene, respectively, are linked to a phenotype or ==0 if not linked to a phenotype. However there's some sort of issue in the cache files for VEPv82 that's making these lookup values always = 0 so I still can't test it yet. ;)

brentp · 2015-11-09T19:07:34Z

is there an additional column that indicates the phenotype? if so, this wont require any change to gemini.

jxchong · 2015-11-09T19:31:25Z

Apparently not... they said:

The phenotype name does not get stored in the output. Firstly, many phenotype names are long and contain odd characters that can break file format encoding. Secondly, many genes and/or variants have several phenotypes associated with them, so to list all of them (and multiple times in the case where you are reporting e.g. the same gene multiple times) would cause the output file size to increase hugely.

I'm debating whether to argue that it should be an option to output the phenotype name (though the above response makes me think they're reluctant to add it)

brentp · 2015-11-09T19:38:56Z

hrm. without that, it's not very helpful and with it, it'd be very helpful... and the more disease annotations they get, the bigger that divide in helpfulness becomes.

jxchong · 2015-11-09T19:47:32Z

I emailied the VEP dev mailing list to ask. You have a good point, right now a "1" tells me almost nothing -- for humans I'd already have to search nearly 20 databases to see which DB is the source of the "1" and it might have been wasted effort if the phenotype isn't even relevant.

jxchong · 2015-11-10T17:43:02Z

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any new code!

Create a BED file from Ensembl's database of phenotype annotations. This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38 -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""", concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from seq_region sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c phenotypes.bed.gz

Index it with tabix:

tabix -p bed phenotypes.bed.gz

Use it as a custom annotation in VEP (see http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data, or remove the structural variants for example (there are a lot of them and they tend to overlap a significant portion of the genome).

dgaston · 2015-11-10T18:02:15Z

This would also be a good method to use the resulting BED file either as a
custom annotation source with snpEff/snpSift, or just with GEMINI itself
using gemini annotate. This is very useful thanks!

On Tue, Nov 10, 2015 at 1:43 PM, Jessica Chong notifications@github.com
wrote:

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any
new code!

Create a BED file from Ensembl's database of phenotype annotations.
This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
-e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""",
concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from
seq_region sr, source s, phenotype p, phenotype_feature pf where
pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
-k3,3n | bgzip -c phenotypes.bed.gz

Index it with tabix:

tabix -p bed phenotypes.bed.gz

Use it as a custom annotation in VEP (see
http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
-custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data,
or remove the structural variants for example (there are a lot of them and
they tend to overlap a significant portion of the genome).

—
Reply to this email directly or view it on GitHub
#571 (comment).

brentp · 2015-11-10T18:09:32Z

That is useful. I couldn't figure out the invocation for 82_37 ...

arq5x · 2015-11-30T19:28:18Z

Current plan is to use VEP "annotation dump" to support this.

brentp · 2015-12-10T21:59:35Z

another couple of sources are: http://ctdbase.org/downloads/
and: http://www.disgenet.org/web/DisGeNET/menu/dbinfo#stats

both of which seem to have licenses that would allow us to use them in gemini

jxchong · 2015-12-15T00:03:29Z

FYI when using VEP (v83+) with --gene_phenotype to add information on whether a variant is in a gene that has a phenotype associated with it (resulting GEMINI column is vep_gene_pheno), or --check-existing to add whether a specific position has been observed as having a variant (resulting GEMINI column is vep_pheno), VEP leaves those fields blank (so the column is currently imported into GEMINI as NULL).

here are some examples from running VEP v83 on some known pathogenic variants in CFTR:

VEP options were:
--sift b --polyphen b --symbol --numbers --biotype --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype --regulatory --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO,GENE_PHENO,CELLL_TYPE

7   117199533   rs213950    G   A   137788  PASS    AC=10;AF=0.584;AN=24;BaseQRankSum=-2.368;ClippingRankSum=-0.081;DB;DP=7276;FS=1.206;InbreedingCoeff=0.0956;MLEAC=125;MLEAF=0.584;MQ0=0;MQ=60;MQRankSum=-0.032;POSITIVE_TRAIN_SITE;QD=22.5;ReadPosRankSum=0.197;SOR=0.817;VQSLOD=8.12;culprit=MQ;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1&1&1||,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000426809|10/26|benign(0.002)|tolerated(1)|440/1438|protein_coding|||ENST00000426809.1:c.1318G>A|ENSP00000389119.1:p.Val440Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000454343|10/26|benign(0.001)|tolerated(0.88)|409/1419|protein_coding|||ENST00000454343.1:c.1225G>A|ENSP00000403677.1:p.Val409Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000003084|11/27|benign(0.003)|tolerated(1)|470/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1408G>A|ENSP00000003084.6:p.Val470Met|1&1&1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1&1&1|1|    GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,242   0/1:9,11:20:99:292,0,229    0/0:7,0:7:21:0,21,248   1/1:0,26:26:78:865,78,0 0/1:35,24:59:99:589,0,941   0/1:23,25:48:99:688,0,675   0/0:22,0:22:66:0,66,714 0/1:21,14:35:99:367,0,590   0/0:22,0:22:62:0,62,736 1/1:0,22:22:66:732,66,0 0/1:13,15:28:99:430,0,341   0/1:15,22:37:99:642,0,395
7   117199644   rs199826652 ATCT    A   3996.41 PASS    AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1|1|,regulatory_region_variant|||||ENSR00000633245|||||open_chromatin_region|||||1||    GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135 0/0:10,0:10:24:0,24,360 0/0:3,0:3:5:0,5,86  0/1:10,12:22:99:441,0,474   0/1:9,18:27:99:729,0,424    0/0:23,0:23:63:0,63,945 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096    0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881   0/1:13,16:29:99:620,0,632   0/0:24,0:24:60:0,60,791

There's a little weirdness with them using 1&1 or 1&1&1 and so on for the PHENO field -- will update when I figure it out (I'm assuming it's indicating a positive answer for each of their data sources? but not sure).

jxchong · 2015-12-15T22:39:03Z

If PHENO==1&1&1 each ofthose correspond to an ID in VEP's Existing_variation field

Example:

Existing_variation == rs2691305&rs75062661&COSM4144171
PHENO == 0&0&1

This means that rs2691305 and rs75062661 are not associated with a phenotype, disease, or trait, but COSM4144171 is.

This is sort of like a super-set that includes clinvar_sig (since VEP includes ClinVar)

brentp · 2015-12-16T16:27:49Z

@jxchong any ideas on how to handle this gene_phenotype? I don't want to have to special-case the Existing_variation and PHENO keys. I guess if it gets added to the db as-is, a user could query as:

... AND pheno like '%1%'

but not sure I grok enough to understand how useful that is.

jxchong · 2015-12-17T00:49:34Z

I think both of these keys can be added as-is to the GEMINI database using GEMINI's current standard handling of extra fields. (except for minor issue that blank values are not being output by GEMINI as "None" mentioned #639)

--gene_phenotype is straightforward, I think. GEMINI users can just query for .. AND vep_gene_pheno == 1

Personally I don't envision filtering on the results of VEP's PHENO field (vep_existing_variation or vep_pheno) because it's just too hard to determine automatically whether the variant being present in a database is even meaningful. It would be as bad as an idea (for gene discovery purposes) as filtering on in_dbsnp :) Instead I think I will just include those columns as part of the output for later manual/custom parsing (because having it there is still helpful and faster than searching each variant to see if it's present in any of the databases parsed by VEP).

brentp closed this as completed Nov 9, 2015

brentp reopened this Nov 9, 2015

arq5x added this to the 0.20.0 milestone Nov 30, 2015

This was referenced Dec 8, 2015

add clinvar_disease column #635

Closed

Consider integrating Phenolyzer. #638

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gene :: disease lookup. #571

gene :: disease lookup. #571

brentp commented Oct 26, 2015

jxchong commented Oct 26, 2015

brentp commented Oct 27, 2015

jxchong commented Oct 27, 2015

jxchong commented Nov 3, 2015

jxchong commented Nov 9, 2015

brentp commented Nov 9, 2015

jxchong commented Nov 9, 2015

brentp commented Nov 9, 2015

jxchong commented Nov 9, 2015

jxchong commented Nov 10, 2015

dgaston commented Nov 10, 2015

brentp commented Nov 10, 2015

arq5x commented Nov 30, 2015

brentp commented Dec 10, 2015

jxchong commented Dec 15, 2015

jxchong commented Dec 15, 2015

brentp commented Dec 16, 2015

jxchong commented Dec 17, 2015

gene :: disease lookup. #571

gene :: disease lookup. #571

Comments

brentp commented Oct 26, 2015

Design

jxchong commented Oct 26, 2015

brentp commented Oct 27, 2015

jxchong commented Oct 27, 2015

jxchong commented Nov 3, 2015

jxchong commented Nov 9, 2015

brentp commented Nov 9, 2015

jxchong commented Nov 9, 2015

brentp commented Nov 9, 2015

jxchong commented Nov 9, 2015

jxchong commented Nov 10, 2015

dgaston commented Nov 10, 2015

brentp commented Nov 10, 2015

arq5x commented Nov 30, 2015

brentp commented Dec 10, 2015

jxchong commented Dec 15, 2015

jxchong commented Dec 15, 2015

brentp commented Dec 16, 2015

jxchong commented Dec 17, 2015