gene :: disease lookup. #571

Open
brentp opened this Issue Oct 26, 2015 · 18 comments

Comments

Projects
None yet
4 participants
@brentp
Collaborator

brentp commented Oct 26, 2015

via @jxchong cc @arq5x
A user may find a variant that is not itself associated with disease, but is in a gene known to be associated with the disease. There is no way to store or query this info in gemini at this time.

clinvar is an obvious choice and omim seems to be available to download for academic (http://omim.org/downloads).

Design

  • need a way to do this in a relatively generic fashion as we think about switching to support any organism.
  • need a simple query to do this even though we'd probably be storing in a table that's many-to-1 relative to gene_detailed (?)
  • could provide an example where user can use gemini annotate and smash all associations into a single "|" delimited string for every variant in that gene and it would be a single column in the variants table.
  • likely better to have a "|" delimited string associated with each gene and require a join on gene_name.
@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Oct 26, 2015

Contributor

For humans, OMIM is certainly where I check first.

For extending to other organisms, I suspect Monarch Initiative is the place to start (e.g. Exomiser uses model organism phenotype data curated by Monarch Initiative). I would guess you would want to be able to limit the "associated phenotype" to only those phenotypes that are actually known to be caused by mutations in the gene (vs GWAS signals, etc -- or at least a gene associated with a phenotype by GWAS is a different level of evidence/information than a gene in which mutations cause a Mendelian phenotype).

Contributor

jxchong commented Oct 26, 2015

For humans, OMIM is certainly where I check first.

For extending to other organisms, I suspect Monarch Initiative is the place to start (e.g. Exomiser uses model organism phenotype data curated by Monarch Initiative). I would guess you would want to be able to limit the "associated phenotype" to only those phenotypes that are actually known to be caused by mutations in the gene (vs GWAS signals, etc -- or at least a gene associated with a phenotype by GWAS is a different level of evidence/information than a gene in which mutations cause a Mendelian phenotype).

@brentp

This comment has been minimized.

Show comment
Hide comment
@brentp

brentp Oct 27, 2015

Collaborator

One way to do this currently would be to create a bed file with a pipe delimited disease list for each gene and then use gemini annotate.
We do want a more direct means, but that would be a general solution that's available in current gemini.

Collaborator

brentp commented Oct 27, 2015

One way to do this currently would be to create a bed file with a pipe delimited disease list for each gene and then use gemini annotate.
We do want a more direct means, but that would be a general solution that's available in current gemini.

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Oct 27, 2015

Contributor

Yeah I have an old file I made from OMIM that I'll use for now. Actually it occurs to me that this is the more general solution for this is also applicable to integrating pathway info -- I know pathways is currently a separate tool within Gemini, but ideally when doing variant filtration, you'd want to be able to output the pathway/KEGG/etc information for the gene that the variant is affecting.

Contributor

jxchong commented Oct 27, 2015

Yeah I have an old file I made from OMIM that I'll use for now. Actually it occurs to me that this is the more general solution for this is also applicable to integrating pathway info -- I know pathways is currently a separate tool within Gemini, but ideally when doing variant filtration, you'd want to be able to output the pathway/KEGG/etc information for the gene that the variant is affecting.

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Nov 3, 2015

Contributor

Oh hey! I think this appears to be implemented in VEP v82 with the VEP option
--gene_phenotype Indicates if the overlapped gene is associated with a phenotype, disease or trait. See list of phenotype sources. Not used by default

This appears to be multi-species. List of phenotype sources: http://uswest.ensembl.org/info/genome/variation/sources_phenotype_documentation.html

I have not had time to test yet though.

Contributor

jxchong commented Nov 3, 2015

Oh hey! I think this appears to be implemented in VEP v82 with the VEP option
--gene_phenotype Indicates if the overlapped gene is associated with a phenotype, disease or trait. See list of phenotype sources. Not used by default

This appears to be multi-species. List of phenotype sources: http://uswest.ensembl.org/info/genome/variation/sources_phenotype_documentation.html

I have not had time to test yet though.

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Nov 9, 2015

Contributor

According to the VEP devs, --gene_phenotype combined with --fields PHENO,GENE_PHENO should result in two columns (PHENO and GENE_PHENO) which ==1 if the variant or gene, respectively, are linked to a phenotype or ==0 if not linked to a phenotype. However there's some sort of issue in the cache files for VEPv82 that's making these lookup values always = 0 so I still can't test it yet. ;)

Contributor

jxchong commented Nov 9, 2015

According to the VEP devs, --gene_phenotype combined with --fields PHENO,GENE_PHENO should result in two columns (PHENO and GENE_PHENO) which ==1 if the variant or gene, respectively, are linked to a phenotype or ==0 if not linked to a phenotype. However there's some sort of issue in the cache files for VEPv82 that's making these lookup values always = 0 so I still can't test it yet. ;)

@brentp

This comment has been minimized.

Show comment
Hide comment
@brentp

brentp Nov 9, 2015

Collaborator

is there an additional column that indicates the phenotype? if so, this wont require any change to gemini.

Collaborator

brentp commented Nov 9, 2015

is there an additional column that indicates the phenotype? if so, this wont require any change to gemini.

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Nov 9, 2015

Contributor

Apparently not... they said:

The phenotype name does not get stored in the output. Firstly, many phenotype names are long and contain odd characters that can break file format encoding. Secondly, many genes and/or variants have several phenotypes associated with them, so to list all of them (and multiple times in the case where you are reporting e.g. the same gene multiple times) would cause the output file size to increase hugely.

I'm debating whether to argue that it should be an option to output the phenotype name (though the above response makes me think they're reluctant to add it)

Contributor

jxchong commented Nov 9, 2015

Apparently not... they said:

The phenotype name does not get stored in the output. Firstly, many phenotype names are long and contain odd characters that can break file format encoding. Secondly, many genes and/or variants have several phenotypes associated with them, so to list all of them (and multiple times in the case where you are reporting e.g. the same gene multiple times) would cause the output file size to increase hugely.

I'm debating whether to argue that it should be an option to output the phenotype name (though the above response makes me think they're reluctant to add it)

@brentp

This comment has been minimized.

Show comment
Hide comment
@brentp

brentp Nov 9, 2015

Collaborator

hrm. without that, it's not very helpful and with it, it'd be very helpful... and the more disease annotations they get, the bigger that divide in helpfulness becomes.

Collaborator

brentp commented Nov 9, 2015

hrm. without that, it's not very helpful and with it, it'd be very helpful... and the more disease annotations they get, the bigger that divide in helpfulness becomes.

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Nov 9, 2015

Contributor

I emailied the VEP dev mailing list to ask. You have a good point, right now a "1" tells me almost nothing -- for humans I'd already have to search nearly 20 databases to see which DB is the source of the "1" and it might have been wasted effort if the phenotype isn't even relevant.

Contributor

jxchong commented Nov 9, 2015

I emailied the VEP dev mailing list to ask. You have a good point, right now a "1" tells me almost nothing -- for humans I'd already have to search nearly 20 databases to see which DB is the source of the "1" and it might have been wasted effort if the phenotype isn't even relevant.

@brentp brentp closed this Nov 9, 2015

@brentp brentp reopened this Nov 9, 2015

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Nov 10, 2015

Contributor

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any new code!

  1. Create a BED file from Ensembl's database of phenotype annotations. This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38 -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""", concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from seq_region sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c phenotypes.bed.gz

  1. Index it with tabix:

tabix -p bed phenotypes.bed.gz

  1. Use it as a custom annotation in VEP (see http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data, or remove the structural variants for example (there are a lot of them and they tend to overlap a significant portion of the genome).

Contributor

jxchong commented Nov 10, 2015

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any new code!

  1. Create a BED file from Ensembl's database of phenotype annotations. This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38 -e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""", concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from seq_region sr, source s, phenotype p, phenotype_feature pf where pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n -k3,3n | bgzip -c phenotypes.bed.gz

  1. Index it with tabix:

tabix -p bed phenotypes.bed.gz

  1. Use it as a custom annotation in VEP (see http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache -custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data, or remove the structural variants for example (there are a lot of them and they tend to overlap a significant portion of the genome).

@dgaston

This comment has been minimized.

Show comment
Hide comment
@dgaston

dgaston Nov 10, 2015

Contributor

This would also be a good method to use the resulting BED file either as a
custom annotation source with snpEff/snpSift, or just with GEMINI itself
using gemini annotate. This is very useful thanks!

On Tue, Nov 10, 2015 at 1:43 PM, Jessica Chong notifications@github.com
wrote:

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any
new code!

  1. Create a BED file from Ensembl's database of phenotype annotations.
    This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
-e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""",
concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from
seq_region sr, source s, phenotype p, phenotype_feature pf where
pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
-k3,3n | bgzip -c phenotypes.bed.gz

  1. Index it with tabix:

tabix -p bed phenotypes.bed.gz

  1. Use it as a custom annotation in VEP (see
    http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
-custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data,
or remove the structural variants for example (there are a lot of them and
they tend to overlap a significant portion of the genome).


Reply to this email directly or view it on GitHub
#571 (comment).

Contributor

dgaston commented Nov 10, 2015

This would also be a good method to use the resulting BED file either as a
custom annotation source with snpEff/snpSift, or just with GEMINI itself
using gemini annotate. This is very useful thanks!

On Tue, Nov 10, 2015 at 1:43 PM, Jessica Chong notifications@github.com
wrote:

Will McLaren on the ENSEMBL dev list replied:

Hi Jessica,

Here's a way you can get the phenotype names in VEP without me writing any
new code!

  1. Create a BED file from Ensembl's database of phenotype annotations.
    This will take a few minutes:

mysql -hensembldb.ensembl.org -uanonymous -Dhomo_sapiens_variation_82_38
-e'select sr.name, pf.seq_region_start-1,pf.seq_region_end, concat(""",
concat_ws(":", pf.type, s.name, pf.object_id, p.description), """) from
seq_region sr, source s, phenotype p, phenotype_feature pf where
pf.seq_region_id = sr.seq_region_id and pf.source_id = s.source_id and
pf.phenotype_id = p.phenotype_id' | grep -v concat_ws | sort -k1,1 -k2,2n
-k3,3n | bgzip -c phenotypes.bed.gz

  1. Index it with tabix:

tabix -p bed phenotypes.bed.gz

  1. Use it as a custom annotation in VEP (see
    http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html):

echo "rs533747784" | perl variant_effect_predictor.pl -force -cache
-custom phenotypes.bed.gz,phenotypes,bed,overlap

Then have fun parsing the output :-)

You could change the MySQL query above to limit it to one source of data,
or remove the structural variants for example (there are a lot of them and
they tend to overlap a significant portion of the genome).


Reply to this email directly or view it on GitHub
#571 (comment).

@brentp

This comment has been minimized.

Show comment
Hide comment
@brentp

brentp Nov 10, 2015

Collaborator

That is useful. I couldn't figure out the invocation for 82_37 ...

Collaborator

brentp commented Nov 10, 2015

That is useful. I couldn't figure out the invocation for 82_37 ...

@arq5x arq5x added this to the 0.20.0 milestone Nov 30, 2015

@arq5x

This comment has been minimized.

Show comment
Hide comment
@arq5x

arq5x Nov 30, 2015

Owner

Current plan is to use VEP "annotation dump" to support this.

Owner

arq5x commented Nov 30, 2015

Current plan is to use VEP "annotation dump" to support this.

@brentp

This comment has been minimized.

Show comment
Hide comment
@brentp

brentp Dec 10, 2015

Collaborator

another couple of sources are: http://ctdbase.org/downloads/
and: http://www.disgenet.org/web/DisGeNET/menu/dbinfo#stats

both of which seem to have licenses that would allow us to use them in gemini

Collaborator

brentp commented Dec 10, 2015

another couple of sources are: http://ctdbase.org/downloads/
and: http://www.disgenet.org/web/DisGeNET/menu/dbinfo#stats

both of which seem to have licenses that would allow us to use them in gemini

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Dec 15, 2015

Contributor

FYI when using VEP (v83+) with --gene_phenotype to add information on whether a variant is in a gene that has a phenotype associated with it (resulting GEMINI column is vep_gene_pheno), or --check-existing to add whether a specific position has been observed as having a variant (resulting GEMINI column is vep_pheno), VEP leaves those fields blank (so the column is currently imported into GEMINI as NULL).

here are some examples from running VEP v83 on some known pathogenic variants in CFTR:

VEP options were:
--sift b --polyphen b --symbol --numbers --biotype --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype --regulatory --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO,GENE_PHENO,CELLL_TYPE

7   117199533   rs213950    G   A   137788  PASS    AC=10;AF=0.584;AN=24;BaseQRankSum=-2.368;ClippingRankSum=-0.081;DB;DP=7276;FS=1.206;InbreedingCoeff=0.0956;MLEAC=125;MLEAF=0.584;MQ0=0;MQ=60;MQRankSum=-0.032;POSITIVE_TRAIN_SITE;QD=22.5;ReadPosRankSum=0.197;SOR=0.817;VQSLOD=8.12;culprit=MQ;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1&1&1||,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000426809|10/26|benign(0.002)|tolerated(1)|440/1438|protein_coding|||ENST00000426809.1:c.1318G>A|ENSP00000389119.1:p.Val440Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000454343|10/26|benign(0.001)|tolerated(0.88)|409/1419|protein_coding|||ENST00000454343.1:c.1225G>A|ENSP00000403677.1:p.Val409Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000003084|11/27|benign(0.003)|tolerated(1)|470/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1408G>A|ENSP00000003084.6:p.Val470Met|1&1&1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1&1&1|1|    GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,242   0/1:9,11:20:99:292,0,229    0/0:7,0:7:21:0,21,248   1/1:0,26:26:78:865,78,0 0/1:35,24:59:99:589,0,941   0/1:23,25:48:99:688,0,675   0/0:22,0:22:66:0,66,714 0/1:21,14:35:99:367,0,590   0/0:22,0:22:62:0,62,736 1/1:0,22:22:66:732,66,0 0/1:13,15:28:99:430,0,341   0/1:15,22:37:99:642,0,395
7   117199644   rs199826652 ATCT    A   3996.41 PASS    AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1|1|,regulatory_region_variant|||||ENSR00000633245|||||open_chromatin_region|||||1||    GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135 0/0:10,0:10:24:0,24,360 0/0:3,0:3:5:0,5,86  0/1:10,12:22:99:441,0,474   0/1:9,18:27:99:729,0,424    0/0:23,0:23:63:0,63,945 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096    0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881   0/1:13,16:29:99:620,0,632   0/0:24,0:24:60:0,60,791

There's a little weirdness with them using 1&1 or 1&1&1 and so on for the PHENO field -- will update when I figure it out (I'm assuming it's indicating a positive answer for each of their data sources? but not sure).

Contributor

jxchong commented Dec 15, 2015

FYI when using VEP (v83+) with --gene_phenotype to add information on whether a variant is in a gene that has a phenotype associated with it (resulting GEMINI column is vep_gene_pheno), or --check-existing to add whether a specific position has been observed as having a variant (resulting GEMINI column is vep_pheno), VEP leaves those fields blank (so the column is currently imported into GEMINI as NULL).

here are some examples from running VEP v83 on some known pathogenic variants in CFTR:

VEP options were:
--sift b --polyphen b --symbol --numbers --biotype --total_length --canonical --ccds --hgvs --shift_hgvs 1 --gene_phenotype --regulatory --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,HGVSc,HGVSp,PHENO,GENE_PHENO,CELLL_TYPE

7   117199533   rs213950    G   A   137788  PASS    AC=10;AF=0.584;AN=24;BaseQRankSum=-2.368;ClippingRankSum=-0.081;DB;DP=7276;FS=1.206;InbreedingCoeff=0.0956;MLEAC=125;MLEAF=0.584;MQ0=0;MQ=60;MQRankSum=-0.032;POSITIVE_TRAIN_SITE;QD=22.5;ReadPosRankSum=0.197;SOR=0.817;VQSLOD=8.12;culprit=MQ;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1&1&1||,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000426809|10/26|benign(0.002)|tolerated(1)|440/1438|protein_coding|||ENST00000426809.1:c.1318G>A|ENSP00000389119.1:p.Val440Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000454343|10/26|benign(0.001)|tolerated(0.88)|409/1419|protein_coding|||ENST00000454343.1:c.1225G>A|ENSP00000403677.1:p.Val409Met|1&1&1|1|,missense_variant|Gtg/Atg|V/M|ENSG00000001626|CFTR|ENST00000003084|11/27|benign(0.003)|tolerated(1)|470/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1408G>A|ENSP00000003084.6:p.Val470Met|1&1&1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1&1&1|1|    GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,242   0/1:9,11:20:99:292,0,229    0/0:7,0:7:21:0,21,248   1/1:0,26:26:78:865,78,0 0/1:35,24:59:99:589,0,941   0/1:23,25:48:99:688,0,675   0/0:22,0:22:66:0,66,714 0/1:21,14:35:99:367,0,590   0/0:22,0:22:62:0,62,736 1/1:0,22:22:66:732,66,0 0/1:13,15:28:99:430,0,341   0/1:15,22:37:99:642,0,395
7   117199644   rs199826652 ATCT    A   3996.41 PASS    AC=4;AF=0.023;AN=24;BaseQRankSum=-0.941;ClippingRankSum=-0.033;DB;DP=7203;FS=2.556;InbreedingCoeff=-0.0257;MLEAC=5;MLEAF=0.023;MQ0=0;MQ=60.36;MQRankSum=0.129;QD=21.37;ReadPosRankSum=0.673;SOR=0.921;VQSLOD=1.51;culprit=SOR;CSQ=downstream_gene_variant|||ENSG00000232661|AC000111.3|ENST00000441019|||||antisense|YES||||1||,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000426809|10/26|||477-478/1438|protein_coding|||ENST00000426809.1:c.1431_1433delCTT|ENSP00000389119.1:p.Phe478del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000454343|10/26|||446-447/1419|protein_coding|||ENST00000454343.1:c.1338_1340delCTT|ENSP00000403677.1:p.Phe447del|1|1|,inframe_deletion|aTCTtt/att|IF/I|ENSG00000001626|CFTR|ENST00000003084|11/27|||507-508/1480|protein_coding|YES|CCDS5773.1|ENST00000003084.6:c.1521_1523delCTT|ENSP00000003084.6:p.Phe508del|1|1|,upstream_gene_variant|||ENSG00000001626|CFTR|ENST00000472848|||||processed_transcript|||||1|1|,regulatory_region_variant|||||ENSR00000633245|||||open_chromatin_region|||||1||    GT:AD:DP:GQ:PL  0/0:4,0:4:9:0,9,135 0/0:10,0:10:24:0,24,360 0/0:3,0:3:5:0,5,86  0/1:10,12:22:99:441,0,474   0/1:9,18:27:99:729,0,424    0/0:23,0:23:63:0,63,945 0/0:22,0:22:66:0,66,714 0/0:33,0:33:84:0,84,1096    0/0:22,0:22:62:0,62,736 0/1:17,14:31:99:537,0,881   0/1:13,16:29:99:620,0,632   0/0:24,0:24:60:0,60,791

There's a little weirdness with them using 1&1 or 1&1&1 and so on for the PHENO field -- will update when I figure it out (I'm assuming it's indicating a positive answer for each of their data sources? but not sure).

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Dec 15, 2015

Contributor

If PHENO==1&1&1 each ofthose correspond to an ID in VEP's Existing_variation field

Example:

Existing_variation == rs2691305&rs75062661&COSM4144171
PHENO == 0&0&1

This means that rs2691305 and rs75062661 are not associated with a phenotype, disease, or trait, but COSM4144171 is.

This is sort of like a super-set that includes clinvar_sig (since VEP includes ClinVar)

Contributor

jxchong commented Dec 15, 2015

If PHENO==1&1&1 each ofthose correspond to an ID in VEP's Existing_variation field

Example:

Existing_variation == rs2691305&rs75062661&COSM4144171
PHENO == 0&0&1

This means that rs2691305 and rs75062661 are not associated with a phenotype, disease, or trait, but COSM4144171 is.

This is sort of like a super-set that includes clinvar_sig (since VEP includes ClinVar)

@brentp

This comment has been minimized.

Show comment
Hide comment
@brentp

brentp Dec 16, 2015

Collaborator

@jxchong any ideas on how to handle this gene_phenotype? I don't want to have to special-case the Existing_variation and PHENO keys. I guess if it gets added to the db as-is, a user could query as:

... AND pheno like '%1%'

but not sure I grok enough to understand how useful that is.

Collaborator

brentp commented Dec 16, 2015

@jxchong any ideas on how to handle this gene_phenotype? I don't want to have to special-case the Existing_variation and PHENO keys. I guess if it gets added to the db as-is, a user could query as:

... AND pheno like '%1%'

but not sure I grok enough to understand how useful that is.

@jxchong

This comment has been minimized.

Show comment
Hide comment
@jxchong

jxchong Dec 17, 2015

Contributor

I think both of these keys can be added as-is to the GEMINI database using GEMINI's current standard handling of extra fields. (except for minor issue that blank values are not being output by GEMINI as "None" mentioned #639)

--gene_phenotype is straightforward, I think. GEMINI users can just query for .. AND vep_gene_pheno == 1

Personally I don't envision filtering on the results of VEP's PHENO field (vep_existing_variation or vep_pheno) because it's just too hard to determine automatically whether the variant being present in a database is even meaningful. It would be as bad as an idea (for gene discovery purposes) as filtering on in_dbsnp :) Instead I think I will just include those columns as part of the output for later manual/custom parsing (because having it there is still helpful and faster than searching each variant to see if it's present in any of the databases parsed by VEP).

Contributor

jxchong commented Dec 17, 2015

I think both of these keys can be added as-is to the GEMINI database using GEMINI's current standard handling of extra fields. (except for minor issue that blank values are not being output by GEMINI as "None" mentioned #639)

--gene_phenotype is straightforward, I think. GEMINI users can just query for .. AND vep_gene_pheno == 1

Personally I don't envision filtering on the results of VEP's PHENO field (vep_existing_variation or vep_pheno) because it's just too hard to determine automatically whether the variant being present in a database is even meaningful. It would be as bad as an idea (for gene discovery purposes) as filtering on in_dbsnp :) Instead I think I will just include those columns as part of the output for later manual/custom parsing (because having it there is still helpful and faster than searching each variant to see if it's present in any of the databases parsed by VEP).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment