Could a general mutation-load pattern confound mutation-specific signals? #8

dhimmel · 2016-07-20T22:41:50Z

I think it's likely that there is a general expression pattern for how mutated a tumor is. For example, super mutated tumors may have wacky gene expression, solely because they're super mutated and not specifically because of which exact mutations they contain.

For a given gene, tumors with mutations are more likely to be highly mutated overall. This could cause confounding. It may appear that a mutation is associated with a specific expression pattern, although the signal is actually driven by general mutation load.

So we may need to end up including a mutation-load covariate. In the meantime, someone should see whether it's possible to use gene expression to predict the mutation-load of each sample (labeling this a task and looking for a volunteer).

cgreene · 2016-07-21T01:16:42Z

This is an interesting question. From some quick searching of the academic literature, I've dug up mutational signatures of heavy mutation load cancers (the types of mutations that occur in these cancers seem to be different). I didn't find anything on a gene expression pattern common to them. It may be important to control for confounding by cancer type (maybe you pick the most mutated 10% within each cancer type as positive and the least mutated 10% as negative). I think this is an interesting question that may have just created another use case!
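The within-cancer-type labeling suggested above could be sketched roughly as follows. This is only an illustration: the function name, the `(sample_id, cancer_type, n_mutations)` tuple layout, and the toy values are assumptions, not anything from the project's data files.

```python
from collections import defaultdict


def label_extremes_by_type(samples, fraction=0.10):
    """Label the most mutated `fraction` of each cancer type 1 and the least mutated 0.

    `samples` is a list of (sample_id, cancer_type, n_mutations) tuples
    (a hypothetical layout). Samples in the middle are left unlabeled.
    """
    by_type = defaultdict(list)
    for sample_id, cancer_type, n_mutations in samples:
        by_type[cancer_type].append((n_mutations, sample_id))

    labels = {}
    for rows in by_type.values():
        rows.sort()  # ascending mutation count within this cancer type
        k = int(len(rows) * fraction)
        if k == 0:
            continue  # too few samples in this type to take a slice
        for _, sample_id in rows[:k]:
            labels[sample_id] = 0
        for _, sample_id in rows[-k:]:
            labels[sample_id] = 1
    return labels


# Toy data: two cancer types with very different overall mutation counts.
samples = [(f"A{i}", "SKCM", 100 * (i + 1)) for i in range(10)]
samples += [(f"B{i}", "THCA", i + 1) for i in range(10)]
labels = label_extremes_by_type(samples)
```

Because the cutoffs are computed per cancer type, a highly mutated type does not supply all of the positives.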

gwaybio · 2016-07-29T16:44:38Z

on a call now and this issue was mentioned. Really, the issue is mainly that these hyper-mutated tumors have a ton of passenger mutations and would contaminate gold standards. The solution proposed involved subsetting mutations using Cancer Hotspots as defined by Chang et al.

Essentially what the group is doing is only considering a sample to have a mutation in a given gene if the mutation is found in this database. I don't necessarily know what to do with this info - or if it even makes sense to use at all but generally, using it would increase the percentage of true positives but simultaneously increase false negatives.
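As a rough sketch of what hotspot-restricted labeling looks like: a sample only counts as mutated in a gene if one of its mutations falls on a listed residue. The `(sample_id, gene, residue)` layout, the function name, and the toy residues are all assumptions for illustration and do not reflect the Cancer Hotspots file format.

```python
def mutated_samples(mutations, gene, hotspots=None):
    """Return the set of sample ids considered mutated in `gene`.

    `mutations` is an iterable of (sample_id, gene, residue) tuples.
    If `hotspots` (a set of (gene, residue) pairs) is given, only
    mutations at those residues count.
    """
    positives = set()
    for sample_id, mut_gene, residue in mutations:
        if mut_gene != gene:
            continue
        if hotspots is not None and (mut_gene, residue) not in hotspots:
            continue
        positives.add(sample_id)
    return positives


# Toy example: s2's mutation is not at a listed hotspot.
mutations = [("s1", "KRAS", "G12"), ("s2", "KRAS", "A59"), ("s3", "TP53", "R175")]
hotspots = {("KRAS", "G12"), ("TP53", "R175")}

all_calls = mutated_samples(mutations, "KRAS")
hotspot_calls = mutated_samples(mutations, "KRAS", hotspots)
```

Here `s2` flips from positive to negative under the hotspot filter, which is the tradeoff described above: fewer passenger mutations among the positives, but some real hits may be relabeled as negatives.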

dhimmel · 2016-08-01T14:58:01Z

> it would increase the percentage of true positives but simultaneously increase false negatives

What do you mean by true positives and false negatives?

From Chang et al.:

> Here, we developed a statistical algorithm to identify recurrently mutated residues in tumor samples. We applied the algorithm to 11,119 human tumors, spanning 41 cancer types, and identified 470 somatic substitution hotspots in 275 genes.

So if we were to only count mutations that were in recurrently mutated residues (cancer hotspots), we would only be able to offer our users a choice between 275 genes — not good? Additionally, I'm not sure I see:

- that restricting to hotspots will be able to fully eliminate mutation load confounding
- that restricting to hotspots makes sense given that we run a supervised algorithm that learns whether there's signal. Let it learn.

However, I still think a covariate is the way to go and can address most of the problem. A good first analysis to see the extent of this problem would be to measure the AUROC between TP53 mutation status versus total mutation count.
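A minimal sketch of that first analysis, using the total mutation count directly as the score for a binary mutation status. The AUROC is computed from its pairwise definition so the example has no dependencies; in practice `sklearn.metrics.roc_auc_score` does the same job. The variable names and toy numbers are made up for illustration.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. `labels` holds 0/1 values.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: mutation status of one gene and total mutation count per sample.
gene_mutated = [0, 0, 1, 0, 1, 1]
n_mutations = [12, 40, 35, 8, 300, 90]

score = auroc(gene_mutated, n_mutations)
```

An AUROC near 0.5 would suggest mutation load carries little information about the gene's status, while a value well above 0.5 would indicate the confounding discussed here.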

gwaybio · 2016-08-01T17:41:58Z

> What do you mean by true positives and false negatives?

True positives meaning samples that actually have a deleterious mutation in the given gene (either an activating or inactivating mutation) that leads to a gene-expression-based signature representative of the normal gene activity being lost. False negatives meaning samples that actually do have the irregular gene expression signature but are incorrectly considered a "0" or "not mutated". Either will decrease the classifier's performance. We can get a false negative from either:

- Assay technology or variant caller missed the mutation
- The gene (or mRNA, or protein) being misregulated downstream from the DNA level

> we would only be able to offer our users a choice between 275 genes - not good?

Probably not good, I agree.

> that restricting to hotspots will be able to fully eliminate mutation load confounding

Aside from removing samples with high mutation load, I don't think anything we do will fully eliminate this confounding. Restricting to hotspots for these samples will remove many passenger mutations, which are less likely to alter the gene expression signatures associated with mutation of the input genes. Adjusting for them when building a model could work nicely too.

> that restricting to hotspots makes sense given that we run a supervised algorithm that learns whether there's signal. Let it learn.

The 'let it learn' argument makes much more sense in an unsupervised setting. For a supervised algorithm we are severely impacted by false labeling information and the first question when troubleshooting performance should always be: "is my data good?"

> A good first analysis to see the extent of this problem would be to measure the AUROC between TP53 mutation status versus total mutation count.

I think this is a great idea! Although we probably should approach it using a gene other than TP53. Since TP53 is crucial for DNA repair, tumors with the defective protein are likely to have more mutations than tumors with wildtype TP53. I would recommend building a new classifier for RAS or NF1, or we can even try using genes in a pathway (e.g. the Hippo signalling pathway) to test this hypothesis.

In general, I would be in favor of sticking with our filtered mutation calls as a gold standard for now (at least until cleaner data comes in 😃) and testing to see how much of an impact mutation load has on predictions.

* Evaluate performance of covariates on TP53

  Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to #8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses #21: Covariates are extracted from samples.tsv.

* Evaluate more covariate/mutation combinations

  Evaluate covariate-only classifiers for the interesting mutations compiled in cognoma/cancer-data#22 (comment). Switches to an expand grid system for evaluating all possible covariate combinations. Plot performance of all covariates on each mutation. Switches to `covariates.tsv` created in cognoma/cancer-data#24 for encoded covariates.

* Export clean notebook to script

* Address review comments

dhimmel · 2016-09-27T13:58:10Z

Reproducing a comment by @gwaygenomics here:

> I was at a talk by Olivier Elemento - he was building models for a different purpose (predict immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his stuff and adjusting for burden too

I found the Elemento Lab's GitHub organization but I couldn't find the handle for the doctor himself. However, I did find his Twitter, so I'll tweet him the link to this question:

Q: We're creating models to predict mutation status at a specific gene using gene expression on TCGA samples. We'd like to add a mutation load covariate and have explored adding n_mutations_log1p (the log of 1 plus the number of mutations per sample) to the model. Do you have any advice or can you point us to models you've created with a mutation load covariate?

Update: link to Tweet
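For reference, the `n_mutations_log1p` covariate mentioned in the question is just `log(1 + n_mutations)` appended as an extra feature alongside the expression features. A minimal sketch, where the function name and the list-of-lists feature layout are assumptions for illustration:

```python
import math


def add_mutation_load_covariate(expression_features, n_mutations):
    """Append log1p(n_mutations) as an extra column to each sample's feature row.

    `expression_features` is a list of per-sample feature lists and
    `n_mutations` the matching per-sample mutation counts.
    """
    return [
        row + [math.log1p(count)]
        for row, count in zip(expression_features, n_mutations)
    ]


# Toy data: two samples with two expression features each.
features = [[0.5, 1.2], [0.1, 3.4]]
counts = [0, 999]
with_covariate = add_mutation_load_covariate(features, counts)
```

The `log1p` transform keeps samples with zero mutations well defined and compresses the long right tail of hypermutated samples.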

gwaybio · 2017-01-31T14:45:54Z

This issue has come up once again. It appears to be something the field is keenly aware of but does not have a "best" solution for. It also appears to be extremely important when trying to predict the gene expression signature of samples that have DNA damage repair response defects.

Some of the solutions I have seen so far:

- Remove hyper mutated tumors
  - microsatellite instability (CESC, COAD)
  - POLE-associated (UCEC, OV)
- Add mutation burden to model

I have also seen a number of different ways mutation burden is added to the model. I plan on looking into this today at the meetup and exploring some of the solutions
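The first option above (removing hypermutated tumors) might look something like the sketch below. The cutoff is a placeholder, since no threshold is specified in this thread, and the function name and dict layout are assumptions.

```python
def split_hypermutated(n_mutations_by_sample, threshold):
    """Split samples into those kept (count <= threshold) and those removed.

    `threshold` is a hypothetical cutoff chosen by the analyst.
    """
    kept = {s: n for s, n in n_mutations_by_sample.items() if n <= threshold}
    removed = sorted(set(n_mutations_by_sample) - set(kept))
    return kept, removed


# Toy data with one clearly hypermutated sample.
counts = {"s1": 40, "s2": 75, "s3": 12000}
kept, removed = split_hypermutated(counts, threshold=1000)
```

Removing samples discards data, whereas adding mutation burden as a covariate keeps every sample and lets the model account for load.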

dhimmel added the task label Jul 20, 2016

This was referenced Jul 27, 2016

July 19–26 Project Cognoma Acknowledgements cognoma/cognoma#23

Closed

What covariates should we include as features? #21

Open

dhimmel mentioned this issue Sep 7, 2016

Add exploratory analyses of mutation data cognoma/cancer-data#22

Merged

dhimmel mentioned this issue Sep 15, 2016

Evaluate performance of covariates at predicting various mutations #47

Merged

dhimmel mentioned this issue Oct 26, 2016

TP53 mutation prediction from metadata #66

Closed

dhimmel pushed a commit that referenced this issue Feb 13, 2017

Marginal gain of gene expression data over covariates (#67)

4e02964

See issue #8

dhimmel mentioned this issue May 25, 2017

Add covariates-only model for comparison in the main notebook #93

Merged

dhimmel pushed a commit that referenced this issue Jun 6, 2017

Add covariate-only & combined models to main notebook (#93)

2b07eed

Refs #8
