Could a general mutation-load pattern confound mutation-specific signals? #8

Open
dhimmel opened this issue Jul 20, 2016 · 6 comments

dhimmel commented Jul 20, 2016

I think it's likely that there is a general expression pattern for how mutated a tumor is. For example, super mutated tumors may have wacky gene expression, solely because they're super mutated and not specifically because of which exact mutations they contain.

For a given gene, tumors with a mutation in that gene are more likely to be highly mutated overall. This could cause confounding: it may appear that a mutation is associated with a specific expression pattern when the signal is actually driven by general mutation load.

So we may end up needing to include a mutation-load covariate. In the meantime, someone should see whether it's possible to use gene expression to predict the mutation load of each sample (labeling this a task and looking for a volunteer).

@dhimmel dhimmel added the task label Jul 20, 2016

cgreene commented Jul 21, 2016

This is an interesting question. From some quick searching of the academic literature, I've dug up mutational signatures of heavy mutation load cancers (the types of mutations that occur in these cancers seem to be different). I didn't find anything on a gene expression pattern common to them. It may be important to control for confounding by cancer type (maybe you pick the most mutated 10% within each cancer type as positive and the least mutated 10% as negative). I think this is an interesting question that may have just created another use case!
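
One way to read the within-cancer-type suggestion above is as a labeling step: rank samples by mutation count inside each cancer type, then mark the top 10% as positive and the bottom 10% as negative. The sketch below is a minimal illustration in plain Python; the function name, the tuple layout, and the toy samples are all assumptions, not code from the project.

```python
from collections import defaultdict


def label_mutation_load_extremes(samples, fraction=0.10):
    """Label the most and least mutated samples within each cancer type.

    `samples` is a list of (sample_id, cancer_type, n_mutations) tuples.
    Returns a dict mapping sample_id to 1 (top `fraction` of its cancer
    type) or 0 (bottom `fraction`); samples in between are omitted.
    """
    by_type = defaultdict(list)
    for sample_id, cancer_type, n_mutations in samples:
        by_type[cancer_type].append((n_mutations, sample_id))

    labels = {}
    for rows in by_type.values():
        rows.sort()  # ascending by mutation count
        k = max(1, int(len(rows) * fraction))  # at least one sample per tail
        if 2 * k > len(rows):
            continue  # too few samples to form two disjoint tails
        for _, sample_id in rows[:k]:
            labels[sample_id] = 0  # least mutated within this cancer type
        for _, sample_id in rows[-k:]:
            labels[sample_id] = 1  # most mutated within this cancer type
    return labels


# Hypothetical toy data: cancer type "B" is far more mutated than "A" overall.
samples = [("a%d" % i, "A", i) for i in range(1, 11)]
samples += [("b%d" % i, "B", 100 * i) for i in range(1, 11)]
labels = label_mutation_load_extremes(samples)
```

Note that `b1` is labeled negative even though it carries more mutations than any type "A" sample, which is the point of ranking within cancer type rather than across the whole cohort.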


gwaybio commented Jul 29, 2016

On a call now, and this issue was mentioned. Really, the issue is mainly that these hyper-mutated tumors have a ton of passenger mutations that would contaminate the gold standards. The proposed solution involved subsetting mutations using Cancer Hotspots, as defined by Chang et al.

Essentially, the group only considers a sample to have a mutation in a given gene if the mutation is found in this database. I don't necessarily know what to do with this info, or whether it even makes sense to use it at all, but in general, using it would increase the percentage of true positives while simultaneously increasing false negatives.
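
As a rough sketch of the hotspot filter described above: a (sample, gene) pair counts as mutated only when the mutated residue appears in a hotspot list. The gene names, residues, and the `mutated_pairs` helper below are hypothetical placeholders, not entries from the Cancer Hotspots resource or code from the project.

```python
def mutated_pairs(mutations, hotspots=None):
    """Return the set of (sample_id, gene) pairs counted as mutated.

    `mutations` is an iterable of (sample_id, gene, residue) tuples.
    If `hotspots` (a set of (gene, residue) pairs) is given, only mutations
    at a listed residue count; otherwise every mutation counts.
    """
    return {
        (sample_id, gene)
        for sample_id, gene, residue in mutations
        if hotspots is None or (gene, residue) in hotspots
    }


# Hypothetical toy data; names and residues are placeholders.
hotspots = {("GENE1", "R10"), ("GENE2", "K25")}
mutations = [
    ("s1", "GENE1", "R10"),  # hotspot hit
    ("s2", "GENE1", "A59"),  # GENE1 mutated, but not at a listed residue
    ("s2", "GENE2", "K25"),  # hotspot hit
    ("s3", "GENE3", "P100"),  # gene absent from the hotspot list
]

all_calls = mutated_pairs(mutations)
hotspot_calls = mutated_pairs(mutations, hotspots)
dropped = all_calls - hotspot_calls  # positives that become negatives
```

The `dropped` set makes the trade-off concrete: those labels flip from mutated to not mutated, which removes likely passengers but also discards any real driver outside the hotspot list.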


dhimmel commented Aug 1, 2016

> it would increase the percentage of true positives but simultaneously increase false negatives

What do you mean by true positives and false negatives?

From Chang et al.:

> Here, we developed a statistical algorithm to identify recurrently mutated residues in tumor samples. We applied the algorithm to 11,119 human tumors, spanning 41 cancer types, and identified 470 somatic substitution hotspots in 275 genes.

So if we were to only count mutations in recurrently mutated residues (cancer hotspots), we would only be able to offer our users a choice between 275 genes (not good?). Additionally, I'm not sure I see:

  1. that restricting to hotspots will be able to fully eliminate mutation load confounding
  2. that restricting to hotspots makes sense given that we run a supervised algorithm that learns whether there's signal. Let it learn.

However, I still think a covariate is the way to go and can address most of the problem. A good first analysis to gauge the extent of this problem would be to measure the AUROC obtained when total mutation count alone is used to predict TP53 mutation status.
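
The AUROC check proposed above can be sketched without any model fitting, since a single numeric score (total mutation count) is being compared against a binary label (TP53 status). The version below uses the pairwise (Mann-Whitney) definition of AUROC in plain Python; the `auroc` helper and the toy numbers are assumptions made for illustration, not project data.

```python
def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: total mutation count per sample and TP53 status.
n_mutations = [12, 40, 55, 300, 80, 1500, 25, 60]
tp53_mutated = [0, 0, 1, 1, 0, 1, 0, 1]
score = auroc(n_mutations, tp53_mutated)
```

A value near 0.5 would suggest mutation load carries little information about the label, while a value well above 0.5 would indicate room for confounding. On real data, `sklearn.metrics.roc_auc_score(y_true, y_score)` computes the same quantity more efficiently; the quadratic loop here only makes the definition explicit.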


gwaybio commented Aug 1, 2016

> What do you mean by true positives and false negatives?

True positives are samples that actually have a deleterious mutation in the given gene (either activating or inactivating) that leads to a gene expression signature reflecting the loss of normal gene activity. False negatives are samples that do have the irregular gene expression signature but are incorrectly labeled "0", or "not mutated". Either will decrease the classifier's performance. We can get a false negative from either:

  1. Assay technology or variant caller missed the mutation
  2. The gene (or mRNA, or protein) being misregulated downstream from the DNA level

> we would only be able to offer our users a choice between 275 genes (not good?)

Probably not good, I agree.

> that restricting to hotspots will be able to fully eliminate mutation load confounding

Aside from removing samples with high mutation load, I don't think anything we do will fully eliminate this confounding. Restricting to hotspots for these samples will remove many passenger mutations that are less likely to alter the gene expression signature associated with mutation of the input gene. Adjusting for mutation load when building a model could work nicely too.

> that restricting to hotspots makes sense given that we run a supervised algorithm that learns whether there's signal. Let it learn.

The 'let it learn' argument makes much more sense in an unsupervised setting. For a supervised algorithm we are severely impacted by false labeling information and the first question when troubleshooting performance should always be: "is my data good?"

> A good first analysis to see the extent of this problem would be to measure the AUROC between TP53 mutation status versus total mutation count.

I think this is a great idea! Although we should probably approach it using a gene other than TP53. Since TP53 is crucial for DNA repair, tumors with the defective protein are likely to have more mutations than tumors with wildtype TP53. I would recommend building a new classifier for RAS or NF1, or we could even try using the genes in a pathway (e.g. the Hippo signaling pathway) to test this hypothesis.

In general, I would be in favor of sticking with our filtered mutation calls as a gold standard for now (at least until cleaner data comes in 😃) and testing to see how much of an impact mutation load has on predictions.

dhimmel added a commit to dhimmel/machine-learning that referenced this issue Sep 14, 2016
See how well covariates (non-expression features) predict TP53 mutation.

Related to cognoma#8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses cognoma#21:
Covariates are extracted from samples.tsv.
dhimmel added a commit to dhimmel/machine-learning that referenced this issue Sep 15, 2016
Creates an explore directory and README for this type of exploratory notebook.

See how well covariates (non-expression features) predict TP53 mutation.

Related to cognoma#8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses cognoma#21:
Covariates are extracted from samples.tsv.
dhimmel added a commit that referenced this issue Sep 22, 2016
* Evaluate performance of covariates on TP53

Creates an explore directory and README for this type of exploratory notebook.

See how well covariates (non-expression features) predict TP53 mutation.

Related to #8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses #21:
Covariates are extracted from samples.tsv.

* Evaluate more covariate/mutation combinations

Evaluate covariate-only classifiers for the interesting mutations compiled in
cognoma/cancer-data#22 (comment).

Switches to an expand grid system for evaluating all possible covariate
combinations.

Plot performance of all covariates on each mutation.

Switches to `covariates.tsv` created in
cognoma/cancer-data#24 for encoded covariates.

* Export clean notebook to script

* Address review comments

dhimmel commented Sep 27, 2016

Reproducing a comment by @gwaygenomics here:

> I was at a talk by Olivier Elemento. He was building models for a different purpose (predicting immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his work and adjusting for burden too.

I did find the Elemento Lab's GitHub organization, but I couldn't find the handle for the doctor himself. However, I did find his Twitter, so I'll tweet him the link to this question:

> Q: We're creating models to predict mutation status at a specific gene using gene expression on TCGA samples. We'd like to add a mutation load covariate and have explored adding `n_mutations_log1p` (the log of 1 plus the number of mutations per sample) to the model. Do you have any advice or can you point us to models you've created with a mutation load covariate?

Update: link to Tweet
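
For reference, the `n_mutations_log1p` covariate mentioned in the question is just log(1 + n) of each sample's mutation count, appended as one extra feature next to the expression values. The following is a minimal sketch; the helper name and the toy feature rows are assumptions, and the real pipeline reads covariates from `covariates.tsv` rather than building them this way.

```python
import math


def add_mutation_load_covariate(expression_rows, n_mutations):
    """Append log1p(mutation count) as an extra feature to each sample's row.

    `expression_rows` is a list of per-sample expression feature lists and
    `n_mutations` the matching list of per-sample mutation counts.
    """
    return [
        list(row) + [math.log1p(count)]
        for row, count in zip(expression_rows, n_mutations)
    ]


# Hypothetical toy data: two samples, two expression features each.
expression_rows = [[0.5, 1.2], [0.1, 0.3]]
n_mutations = [0, 99]
features = add_mutation_load_covariate(expression_rows, n_mutations)
```

The log transform keeps hyper-mutated samples from dominating the covariate's scale, and using `log1p` rather than `log` keeps samples with zero mutations well defined.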


gwaybio commented Jan 31, 2017

This issue has come up once again. The field appears to be keenly aware of it but does not know of a "best" solution. It also appears to be extremely important when trying to predict the gene expression signature of samples that have defects in the DNA damage repair response.

Some of the solutions I have seen so far:

I have also seen a number of different ways mutation burden is added to the model. I plan on looking into this today at the meetup and exploring some of the solutions.
