Process the clinical matrix to extract sample attributes #10

Closed
ypar opened this issue Jul 26, 2016 · 10 comments

ypar commented Jul 26, 2016

An issue has been raised in today's meeting.

The clinical matrix should be carefully analyzed to select a specific covariate or a set of covariates we can use for analyses.

The relevant notebook is the tcga notebook for data download, and the dataset is named PANCAN-clinicalMatrix.

@dhimmel dhimmel changed the title selecting covariates from the clinical matrix prior to ML applications Process the clinical matrix dataset and extract sample attributes Jul 28, 2016
@dhimmel dhimmel changed the title Process the clinical matrix dataset and extract sample attributes Process the clinical matrix to extract sample attributes Jul 28, 2016
Member

dhimmel commented Jul 28, 2016

We would like to extract sample information for two purposes:

  1. Enabling sample selection by frontend users (see Identify the types of clinical data fields for the django team #13)
  2. Covariates to prevent confounding of our classifiers (see What covariates should we include as features? machine-learning#21)

Member

gwaybio commented Jul 29, 2016

> 1. Enabling sample selection by frontend users (see Identify the types of clinical data fields for the django team #13)

To begin building a sample selector I don't think we need more info beyond mapping sample ID to tissue. Mappings are in the clinical matrix. Also, here is a text file holding tissue and TCGA acronym info: tcga_dictionary.txt.
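A minimal sketch of that sample ID to acronym mapping, using only the Python standard library. The column names `sampleID` and `acronym` and the sample IDs are assumptions made for illustration; the real clinical matrix may name these fields differently.

```python
import csv
import io

# Hypothetical excerpt of a clinical matrix. The column names and the
# sample IDs below are illustrative assumptions, not real data.
CLINICAL_TSV = (
    "sampleID\tacronym\n"
    "TCGA-AA-0001-01\tLUAD\n"
    "TCGA-BB-0002-01\tLUSC\n"
    "TCGA-CC-0003-01\t\n"
)


def map_sample_to_acronym(handle, id_col="sampleID", acronym_col="acronym"):
    """Return {sample_id: acronym}, skipping rows with an empty acronym."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[acronym_col] for row in reader if row[acronym_col]}


sample_to_acronym = map_sample_to_acronym(io.StringIO(CLINICAL_TSV))
print(sample_to_acronym)
```

Passing an open file handle instead of the `io.StringIO` wrapper would work the same way on the downloaded file.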

The more I think about it, the more I like the idea of scrapping the sample selector altogether. In this scenario the gene mutation selector (aka status selector) communicates with a backend process that curates the tissues that have enough mutations in the specified gene list (I have been using tissues with >= 10 mutations for inclusion). Then, the X matrix is subset to only those sample IDs belonging to tissues that have enough mutation positives. We can then report classifier performance stratified by tissue.
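As a rough sketch of that backend step: count mutation-positive samples per tissue, keep tissues that reach the threshold, and subset the sample IDs accordingly. The function names and toy data are hypothetical, and reading ">= 10 mutations" as ">= 10 mutation-positive samples" is an assumption.

```python
from collections import Counter

MIN_POSITIVES = 10  # inclusion threshold mentioned above


def eligible_tissues(sample_to_tissue, mutated_samples, min_positives=MIN_POSITIVES):
    """Return tissues with at least `min_positives` mutation-positive samples."""
    counts = Counter(
        sample_to_tissue[s] for s in mutated_samples if s in sample_to_tissue
    )
    return {tissue for tissue, n in counts.items() if n >= min_positives}


def subset_samples(sample_to_tissue, tissues):
    """Return the sorted sample IDs that belong to any of the given tissues."""
    return sorted(s for s, t in sample_to_tissue.items() if t in tissues)


# Toy data: 20 LUAD samples (12 mutated) and 20 LUSC samples (3 mutated).
sample_to_tissue = {f"S{i}": "LUAD" for i in range(20)}
sample_to_tissue.update({f"T{i}": "LUSC" for i in range(20)})
mutated = {f"S{i}" for i in range(12)} | {f"T{i}" for i in range(3)}

kept_tissues = eligible_tissues(sample_to_tissue, mutated)
kept_samples = subset_samples(sample_to_tissue, kept_tissues)
print(kept_tissues, len(kept_samples))
```

The rows of X would then be restricted to `kept_samples` before fitting.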

I think having a service that describes the mutations across tissues/genders/age/etc. would be great, but we have to be careful not to reinvent the wheel here since many other services already do this. See COSMIC, NCI GDC, Broad Firehose, or cBioPortal.

Member

dhimmel commented Jul 29, 2016

@gwaygenomics, provenance of tcga_dictionary.txt?

Member

gwaybio commented Jul 29, 2016

@dhimmel my keyboard!

Author

ypar commented Jul 29, 2016

A few questions.

re: tissue dictionary
The attached tcga_dictionary.txt seems to be a dict for the primary disease and not the tissue. Also, I noticed that the primary sites and sample collection include both normal and tumor tissues. Are these treated indiscriminately as far as feature selection goes?

re: mutations
Another thing to consider is what we want to convey by having a mutation selector. Are we providing some sort of a risk score? Are we simply counting? Is it used just as a QC threshold? Would these be compared to known databases such as ExAC, ClinVar, or HGMD?

re: covariates
Do we have plans for how to handle missing data? The ClinicalMatrix has less missing data than most other clinical data sheets I've seen, but the amount is not trivial. Are we considering specific variables for covariates or meta-variables selected by, e.g., PCA? Either way, if we are considering including such covariates for analysis (i.e. beyond their usage for sample selection), I think it should be explicitly stated. E.g. we can exclude all samples without sex in the dict if we deem sex to be a crucial confounder in our analysis.
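To make the missing-data question concrete, here is a small sketch that reports the fraction of missing values for a covariate and drops samples lacking a required one. The column names (`gender`, `age`), the toy rows, and the set of markers treated as missing are all assumptions.

```python
# Values treated as missing; an assumption, since the real encoding may differ.
MISSING = {"", "NA", None}

# Toy rows standing in for the clinical matrix.
samples = [
    {"sample_id": "A", "gender": "FEMALE", "age": "61"},
    {"sample_id": "B", "gender": "", "age": "47"},
    {"sample_id": "C", "gender": "MALE", "age": "NA"},
]


def missing_fraction(rows, column):
    """Fraction of rows whose value in `column` is missing."""
    if not rows:
        return 0.0
    return sum(row.get(column) in MISSING for row in rows) / len(rows)


def require(rows, columns):
    """Keep only rows where every listed column is present and not missing."""
    return [row for row in rows if all(row.get(c) not in MISSING for c in columns)]


kept = require(samples, ["gender"])
print(missing_fraction(samples, "gender"), [row["sample_id"] for row in kept])
```

Stating the required columns explicitly in one place, as in `require(samples, ["gender"])`, is one way to document which covariates drive sample exclusion.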

Member

gwaybio commented Jul 30, 2016

@ypar thanks for these questions!

re: tissue dictionary

The TCGA acronym is how they identify "tissue source site" but you're right, they're not strictly "tissues" and "diseases" would be more appropriate. E.g. LUAD is "lung adenocarcinoma" and LUSC is "lung squamous cell carcinoma". TCGA has adopted this broad terminology, however, and to stay consistent, so will we. Your point about tumor vs. normal is definitely something we should consider in the final model. We'll need to filter out "normal", which is really "adjacent normal" - normal tissue from the same individual, taken close to the tumor during the debulking surgery. We will also probably want to filter out "metastasis" and patients measured twice. Much of this sample curation is performed before the data is made public - but a lot is left in intentionally, or sneaks past the filters. We can use a combination of the representative columns and official TCGA Barcodes to create an official sample list. For unsupervised feature construction however, I think it is important to leave all the samples in!
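A sketch of the barcode-based part of that curation. In a TCGA barcode the first three hyphen-separated fields identify the patient, and the first two characters of the fourth field are the sample type code (`01` primary solid tumor, `06` metastatic, `11` solid tissue normal). The barcodes below are made up, and keeping the first barcode in sorted order when a patient has several is an arbitrary assumption, not a decision from this thread.

```python
def parse_barcode(barcode):
    """Split a TCGA sample barcode into (patient_id, sample_type_code)."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    patient = "-".join(parts[:3])
    sample_type = parts[3][:2]  # e.g. "01" from "01A"
    return patient, sample_type


def curate(barcodes, keep_types=("01",)):
    """Keep one barcode per patient among the allowed sample types."""
    chosen = {}
    for barcode in sorted(barcodes):
        patient, sample_type = parse_barcode(barcode)
        # Arbitrary tie-break: the first barcode in sorted order wins.
        if sample_type in keep_types and patient not in chosen:
            chosen[patient] = barcode
    return sorted(chosen.values())


# Made-up barcodes in the TCGA format.
barcodes = [
    "TCGA-AA-0001-01",   # primary tumor
    "TCGA-AA-0001-11",   # adjacent normal from the same patient
    "TCGA-BB-0002-06",   # metastatic
    "TCGA-CC-0003-01A",  # same patient measured twice
    "TCGA-CC-0003-01B",
    "TCGA-DD-0004-01",
]

official_samples = curate(barcodes)
print(official_samples)
```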

re: mutations

Right now the mutation selector works as follows: the user selects a gene or genes, and cognoma builds a Y matrix of 1's and 0's corresponding to samples in the expression matrix (X), indicating presence or absence of mutation. I think this is the minimal case and we should focus on getting it implemented before we try to get fancy. How we define impactful mutations is another story. We will use the official mutation calls from the mutation matrix to determine if the sample has a mutation in a given input gene. Currently, the plan is to filter only silent mutations but we could also have a stronger threshold which would reference a database. I think referencing a database would strongly benefit certain genes (like oncogenes, where there are known activating mutations) but also limit the power for other genes (like tumor suppressor genes, where there are several known and unknown inactivating mutations along the gene body).
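The minimal case could look something like the sketch below. The `(sample_id, gene)` pair representation of mutation calls and the rule that a sample is positive when any selected gene is mutated are assumptions for illustration.

```python
def build_y(sample_ids, mutations, genes):
    """Return a 0/1 list aligned with `sample_ids`.

    `mutations` is an iterable of (sample_id, gene) pairs. A sample is
    labeled 1 if it has a mutation in ANY selected gene (an assumption).
    """
    genes = set(genes)
    positives = {sample for sample, gene in mutations if gene in genes}
    return [1 if sample in positives else 0 for sample in sample_ids]


# Toy data: sample order follows the rows of the expression matrix X.
sample_ids = ["S1", "S2", "S3", "S4"]
mutations = [("S1", "TP53"), ("S2", "KRAS"), ("S4", "TP53"), ("S4", "KRAS")]

y_single = build_y(sample_ids, mutations, ["TP53"])
y_multi = build_y(sample_ids, mutations, ["TP53", "KRAS"])
print(y_single, y_multi)
```

Any filtering of mutation classes would happen upstream, before the pairs reach `build_y`.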

re: covariates

I am not sure how to handle covariates at the moment... I think some sort of adjustment should be discussed, but I don't know the optimal solution. Right now I'm thinking it would be best to include performance of the model across different covariates in the results viewer.
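One possible shape for that stratified report is sketched below. Accuracy is used only as a simple stand-in metric, and the labels, predictions, and covariate values are toy data.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for predictions stratified by a covariate."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


# Toy labels, predictions, and a covariate (here: disease acronym).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]
groups = ["LUAD", "LUAD", "LUAD", "LUSC", "LUSC", "LUSC"]

report = accuracy_by_group(y_true, y_pred, groups)
print(report)
```

The same grouping would work for other covariates, such as sex or binned age, by swapping the `groups` list.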

Member

dhimmel commented Aug 1, 2016

> For unsupervised feature construction however, I think it is important to leave all the samples in!

@gwaygenomics in the case of multiple samples per individual are you sure we want to leave those in? I think some unsupervised approaches will assume independent observations.

> Currently, the plan is to filter only silent mutations but we could also have a stronger threshold which would reference a database.

That's not our current implementation. We ignore all code orange and code green mutations, based on a classification system developed by the Xena Browser team. See #2 (comment) for more information and a table of mutation counts by classification.

Member

gwaybio commented Aug 1, 2016

> I think some unsupervised approaches will assume independent observations.

Good point. Yeah, we should remove those

Author

ypar commented Aug 1, 2016

> > I think some unsupervised approaches will assume independent observations.
>
> Good point. Yeah, we should remove those

IMO, it is particularly important for unsupervised methods to have the cleanest possible data, although one could argue that it is equally important for supervised methods.
E.g. if confounders and missing values are not handled properly, the first cluster will merely pick out precisely that information and that information only.

Member

dhimmel commented Aug 1, 2016

Discussion on this issue has become off topic. So if we want to keep discussing issues that are not related to processing the PANCAN-clinicalMatrix dataset to extract sample information, let's make new discussions or find an existing discussion that is topical.

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Aug 24, 2016
Keeps only samples with type equal to "Primary Tumor". This filters multiple
samples from the same patient, which could cause an issue for machine learning
due to dependent observations (discussed in cognoma#10). This filter reduced the
number of samples with expression and mutation from 7,705 to 7,306.

Closes cognoma#10: all variables that could help with sample selection or covariates,
that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the
Xena Browser team in cognoma#14.

Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are
output to `data/`.
clairemcleod pushed a commit that referenced this issue Aug 25, 2016
* Extract sample info from PANCAN_clinicalMatrix

Keeps only samples with type equal to "Primary Tumor". This filters multiple
samples from the same patient, which could cause an issue for machine learning
due to dependent observations (discussed in #10). This filter reduced the
number of samples with expression and mutation from 7,705 to 7,306.

Closes #10: all variables that could help with sample selection or covariates,
that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the
Xena Browser team in #14.

Closes #17: only sample_ids with expression, mutation, and clinical data are
output to `data/`.

* Retain primary blood cancers

Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood".
See #20 (comment)
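
The selection described in these commit messages can be sketched roughly as follows: keep samples whose type is one of the two retained values, then intersect with the samples that have expression and mutation data. This is an illustration under assumptions, not the code from the referenced commits; the function name and toy IDs are hypothetical.

```python
# Sample types retained according to the commit messages above.
KEEP_TYPES = {
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}


def select_samples(clinical, expression_ids, mutation_ids, keep_types=KEEP_TYPES):
    """Return sorted sample IDs with an allowed type and all three data kinds.

    `clinical` maps sample_id -> sample type string (hypothetical layout).
    """
    clinical_ids = {sid for sid, stype in clinical.items() if stype in keep_types}
    return sorted(clinical_ids & set(expression_ids) & set(mutation_ids))


# Toy inputs with made-up sample IDs.
clinical = {
    "S1": "Primary Tumor",
    "S2": "Solid Tissue Normal",
    "S3": "Primary Blood Derived Cancer - Peripheral Blood",
    "S4": "Primary Tumor",
    "S5": "Metastatic",
}
expression_ids = ["S1", "S2", "S3", "S5"]
mutation_ids = ["S1", "S3", "S4", "S5"]

selected = select_samples(clinical, expression_ids, mutation_ids)
print(selected)
```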