Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gene :: disease lookup. #571

Open
brentp opened this issue Oct 26, 2015 · 18 comments
Open

gene :: disease lookup. #571

brentp opened this issue Oct 26, 2015 · 18 comments
Milestone

Comments

@brentp
Copy link
Collaborator

brentp commented Oct 26, 2015

via @jxchong cc @arq5x
A user may find a variant that is not itself associated with disease, but is in a gene known to be associated with the disease. There is no way to store or query this info in gemini at this time.

clinvar is an obvious choice and omim seems to be available to download for academic (http://omim.org/downloads).

Design

  • need a way to do this in a relatively generic fashion as we think about switching to support any organism.
  • need a simple query to do this even though we'd probably be storing in a table that's many-to-1 relative to gene_detailed (?)
  • could provide an example where user can use gemini annotate and smash all associations into a single "|" delimited string for every variant in that gene and it would be a single column in the variants table.
  • likely better to have a "|" delimited string associated with each gene and require a join on gene_name.
@jxchong
Copy link
Contributor

jxchong commented Oct 26, 2015

For humans, OMIM is certainly where I check first.

For extending to other organisms, I suspect Monarch Initiative is the place to start (e.g. Exomiser uses model organism phenotype data curated by Monarch Initiative). I would guess you would want to be able to limit the "associated phenotype" to only those phenotypes that are actually known to be caused by mutations in the gene (vs GWAS signals, etc -- or at least a gene associated with a phenotype by GWAS is a different level of evidence/information than a gene in which mutations cause a Mendelian phenotype).

@brentp
Copy link
Collaborator Author

brentp commented Oct 27, 2015

One way to do this currently would be to create a bed file with a pipe delimited disease list for each gene and then use gemini annotate.
We do want a more direct means, but that would be a general solution that's available in current gemini.

@jxchong
Copy link
Contributor

jxchong commented Oct 27, 2015

Yeah I have an old file I made from OMIM that I'll use for now. Actually it occurs to me that this is the more general solution for this is also applicable to integrating pathway info -- I know pathways is currently a separate tool within Gemini, but ideally when doing variant filtration, you'd want to be able to output the pathway/KEGG/etc information for the gene that the variant is affecting.

@jxchong
Copy link
Contributor

jxchong commented Nov 3, 2015

Oh hey! I think this appears to be implemented in VEP v82 with the VEP option
--gene_phenotype Indicates if the overlapped gene is associated with a phenotype, disease or trait. See list of phenotype sources. Not used by default

This appears to be multi-species. List of phenotype sources: http://uswest.ensembl.org/info/genome/variation/sources_phenotype_documentation.html

I have not had time to test yet though.

@jxchong
Copy link
Contributor

jxchong commented Nov 9, 2015

According to the VEP devs, --gene_phenotype combined with --fields PHENO,GENE_PHENO should result in two columns (PHENO and GENE_PHENO) which ==1 if the variant or gene, respectively, are linked to a phenotype or ==0 if not linked to a phenotype. However there's some sort of issue in the cache files for VEPv82 that's making these lookup values always = 0 so I still can't test it yet. ;)

@brentp
Copy link
Collaborator Author

brentp commented Nov 9, 2015

is there an additional column that indicates the phenotype? if so, this wont require any change to gemini.

@jxchong
Copy link
Contributor

jxchong commented Nov 9, 2015

Apparently not... they said:

The phenotype name does not get stored in the output. Firstly, many phenotype names are long and contain odd characters that can break file format encoding. Secondly, many genes and/or variants have several phenotypes associated with them, so to list all of them (and multiple times in the case where you are reporting e.g. the same gene multiple times) would cause the output file size to increase hugely.

I'm debating whether to argue that it should be an option to output the phenotype name (though the above response makes me think they're reluctant to add it)

@brentp
Copy link
Collaborator Author

brentp commented Nov 9, 2015

hrm. without that, it's not very helpful and with it, it'd be very helpful... and the more disease annotations they get, the bigger that divide in helpfulness becomes.

@jxchong
Copy link
Contributor

jxchong commented Nov 9, 2015

I emailied the VEP dev mailing list to ask. You have a good point, right now a "1" tells me almost nothing -- for humans I'd already have to search nearly 20 databases to see which DB is the source of the "1" and it might have been wasted effort if the phenotype isn't even relevant.

@brentp brentp closed this as completed Nov 9, 2015
@brentp brentp reopened this Nov 9, 2015
@jxchong
Copy link
Contributor

jxchong commented Nov 10, 2015

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any new code!

  1. Create a BED file from Ensembl's database of phenotype annotations. This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38 -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""", concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from seq_region sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c phenotypes.bed.gz

  1. Index it with tabix:

tabix -p bed phenotypes.bed.gz

  1. Use it as a custom annotation in VEP (see http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data, or remove the structural variants for example (there are a lot of them and they tend to overlap a significant portion of the genome).

@dgaston
Copy link
Contributor

dgaston commented Nov 10, 2015

This would also be a good method to use the resulting BED file either as a
custom annotation source with snpEff/snpSift, or just with GEMINI itself
using gemini annotate. This is very useful thanks!

On Tue, Nov 10, 2015 at 1:43 PM, Jessica Chong notifications@github.com
wrote:

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any
new code!

  1. Create a BED file from Ensembl's database of phenotype annotations.
    This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
-e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""",
concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from
seq_region sr, source s, phenotype p, phenotype_feature pf where
pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
-k3,3n | bgzip -c phenotypes.bed.gz

  1. Index it with tabix:

tabix -p bed phenotypes.bed.gz

  1. Use it as a custom annotation in VEP (see
    http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
-custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data,
or remove the structural variants for example (there are a lot of them and
they tend to overlap a significant portion of the genome).


Reply to this email directly or view it on GitHub
#571 (comment).

@brentp
Copy link
Collaborator Author

brentp commented Nov 10, 2015

That is useful. I couldn't figure out the invocation for 82_37 ...

@arq5x arq5x added this to the 0.20.0 milestone Nov 30, 2015
@arq5x
Copy link
Owner

arq5x commented Nov 30, 2015

Current plan is to use VEP "annotation dump" to support this.

@brentp
Copy link
Collaborator Author

brentp commented Dec 10, 2015

another couple of sources are: http://ctdbase.org/downloads/
and: http://www.disgenet.org/web/DisGeNET/menu/dbinfo#stats

both of which seem to have licenses that would allow us to use them in gemini

@jxchong
Copy link
Contributor

jxchong commented Dec 15, 2015

FYI when using VEP (v83+) with --gene_phenotype to add information on whether a variant is in a gene that has a phenotype associated with it (resulting GEMINI column is vep_gene_pheno), or --check-existing to add whether a specific position has been observed as having a variant (resulting GEMINI column is vep_pheno), VEP leaves those fields blank (so the column is currently imported into GEMINI as NULL).

here are some examples from running VEP v83 on some known pathogenic variants in CFTR:

VEP options were:
--sift b --polyphen b --symbol --numbers --biotype --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype --regulatory --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO,GENE_PHENO,CELLL_TYPE

7   117199533   rs213950    G   A   137788  PASS    AC=10;AF=0.584;AN=24;BaseQRankSum=-2.368;ClippingRankSum=-0.081;DB;DP=7276;FS=1.206;InbreedingCoeff=0.0956;MLEAC=125;MLEAF=0.584;MQ0=0;MQ=60;MQRankSum=-0.032;POSITIVE_TRAIN_SITE;QD=22.5;ReadPosRankSum=0.197;SOR=0.817;VQSLOD=8.12;culprit=MQ;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1&1&1||,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000426809|10/26|benign(0.002)|tolerated(1)|440/1438|protein_coding|||ENST00000426809.1:c.1318G>A|ENSP00000389119.1:p.Val440Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000454343|10/26|benign(0.001)|tolerated(0.88)|409/1419|protein_coding|||ENST00000454343.1:c.1225G>A|ENSP00000403677.1:p.Val409Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000003084|11/27|benign(0.003)|tolerated(1)|470/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1408G>A|ENSP00000003084.6:p.Val470Met|1&1&1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1&1&1|1|    GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,242   0/1:9,11:20:99:292,0,229    0/0:7,0:7:21:0,21,248   1/1:0,26:26:78:865,78,0 0/1:35,24:59:99:589,0,941   0/1:23,25:48:99:688,0,675   0/0:22,0:22:66:0,66,714 0/1:21,14:35:99:367,0,590   0/0:22,0:22:62:0,62,736 1/1:0,22:22:66:732,66,0 0/1:13,15:28:99:430,0,341   0/1:15,22:37:99:642,0,395
7   117199644   rs199826652 ATCT    A   3996.41 PASS    AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1|1|,regulatory_region_variant|||||ENSR00000633245|||||open_chromatin_region|||||1||    GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135 0/0:10,0:10:24:0,24,360 0/0:3,0:3:5:0,5,86  0/1:10,12:22:99:441,0,474   0/1:9,18:27:99:729,0,424    0/0:23,0:23:63:0,63,945 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096    0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881   0/1:13,16:29:99:620,0,632   0/0:24,0:24:60:0,60,791

There's a little weirdness with them using 1&1 or 1&1&1 and so on for the PHENO field -- will update when I figure it out (I'm assuming it's indicating a positive answer for each of their data sources? but not sure).

@jxchong
Copy link
Contributor

jxchong commented Dec 15, 2015

If PHENO==1&1&1 each ofthose correspond to an ID in VEP's Existing_variation field

Example:

Existing_variation == rs2691305&rs75062661&COSM4144171
PHENO == 0&0&1

This means that rs2691305 and rs75062661 are not associated with a phenotype, disease, or trait, but COSM4144171 is.

This is sort of like a super-set that includes clinvar_sig (since VEP includes ClinVar)

@brentp
Copy link
Collaborator Author

brentp commented Dec 16, 2015

@jxchong any ideas on how to handle this gene_phenotype? I don't want to have to special-case the Existing_variation and PHENO keys. I guess if it gets added to the db as-is, a user could query as:

... AND pheno like '%1%'

but not sure I grok enough to understand how useful that is.

@jxchong
Copy link
Contributor

jxchong commented Dec 17, 2015

I think both of these keys can be added as-is to the GEMINI database using GEMINI's current standard handling of extra fields. (except for minor issue that blank values are not being output by GEMINI as "None" mentioned #639)

--gene_phenotype is straightforward, I think. GEMINI users can just query for .. AND vep_gene_pheno == 1

Personally I don't envision filtering on the results of VEP's PHENO field (vep_existing_variation or vep_pheno) because it's just too hard to determine automatically whether the variant being present in a database is even meaningful. It would be as bad as an idea (for gene discovery purposes) as filtering on in_dbsnp :) Instead I think I will just include those columns as part of the output for later manual/custom parsing (because having it there is still helpful and faster than searching each variant to see if it's present in any of the databases parsed by VEP).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants