Process the clinical matrix to extract sample attributes #10

ypar · 2016-07-26T23:17:09Z

An issue has been raised in today's meeting.

The clinical matrix should be carefully analyzed to select a specific covariate or a set of covariates we can use for analyses.

The relevant notebook is here: tcga notebook for data download, and the dataset is named PANCAN-clinicalMatrix.

dhimmel · 2016-07-28T20:54:31Z

We would like to extract sample information for two purposes:

Enabling sample selection by frontend users (see Identify the types of clinical data fields for the django team #13)
Covariates to prevent confounding of our classifiers (see What covariates should we include as features? machine-learning#21)

gwaybio · 2016-07-29T20:07:37Z

Enabling sample selection by frontend users (see Identify the types of clinical data fields for the django team #13)

To begin building a sample selector I don't think we need more info beyond mapping sample ID to tissue. Mappings are in the clinical matrix. Also, here is a text file holding tissue and TCGA acronym info: tcga_dictionary.txt.

The more I think about it, the more I like the idea of scrapping the sample selector altogether. In this scenario the gene mutation selector (aka status selector) communicates with a backend process that curates the tissues that have enough mutations in the specified gene list (I have been using tissues with >= 10 mutations for inclusion). Then, the X matrix is subset to only those sample IDs belonging to tissues that have enough mutation positives. We can then report classifier performance stratified by tissue.
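A minimal sketch of that tissue curation step, using only the standard library. The function names, the toy sample IDs, and the dictionary layout are assumptions for illustration, not the cognoma implementation; only the >= 10 threshold comes from the comment above.

```python
from collections import Counter

MIN_POSITIVES = 10  # inclusion threshold mentioned above; adjust as needed


def select_diseases(sample_to_disease, mutated_samples, min_positives=MIN_POSITIVES):
    """Return the diseases with at least `min_positives` mutated samples."""
    counts = Counter(
        sample_to_disease[s] for s in mutated_samples if s in sample_to_disease
    )
    return {disease for disease, n in counts.items() if n >= min_positives}


def subset_samples(sample_to_disease, diseases):
    """Return the sorted sample IDs that belong to the retained diseases."""
    return sorted(s for s, d in sample_to_disease.items() if d in diseases)


# Toy data (made up): 12 LUAD samples, 10 mutated; 5 LUSC samples, 2 mutated.
sample_to_disease = {f"LUAD-{i}": "LUAD" for i in range(12)}
sample_to_disease.update({f"LUSC-{i}": "LUSC" for i in range(5)})
mutated = {f"LUAD-{i}" for i in range(10)} | {"LUSC-0", "LUSC-1"}

kept = select_diseases(sample_to_disease, mutated)
rows = subset_samples(sample_to_disease, kept)  # row IDs to keep in X
```

Here only LUAD passes the threshold, so X would be subset to the 12 LUAD rows, including the two mutation-negative ones.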

I think having a service that describes the mutations across tissues/genders/age/etc. would be great, but we have to be careful not to reinvent the wheel here since many other services already do this. See COSMIC, NCI GDC, Broad Firehose, or cBioPortal.

dhimmel · 2016-07-29T20:18:51Z

@gwaygenomics, provenance of tcga_dictionary.txt?

gwaybio · 2016-07-29T20:20:05Z

@dhimmel my keyboard!

ypar · 2016-07-29T21:31:30Z

A few questions.

re: tissue dictionary
The attached tcga_dictionary.txt seems to be a dict for the primary disease and not the tissue. Also, I noticed that the primary sites and sample collection include both normal and tumor tissues. Are these treated indiscriminately as far as feature selection goes?

re: mutations
Another thing to consider is what we want to convey by having a mutation selector. Are we providing some sort of a risk score? Are we simply counting? Is it used just as a QC threshold? Would these be compared to known databases such as ExAC or ClinVar or HGMD?

re: covariates
Do we have plans for how to handle missing data? The ClinicalMatrix has less missing data than most other clinical data sheets I've seen, but the amount is not trivial. Are we considering specific variables for covariates or meta-variables selected by, e.g., PCA? Either way, if we are considering including such covariates for analysis (i.e. beyond their usage for sample selection), I think it should be explicitly stated. For example, we can exclude all samples without sex in the dict if we deem sex to be a crucial confounder in our analysis.
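To make the last suggestion concrete, here is a rough sketch of dropping samples that lack a required covariate. The column names (`sex`, `age`), the sample IDs, and the set of strings treated as missing are all assumptions; the real clinical matrix may name and encode these differently.

```python
# Strings treated as "missing" are an assumption; adjust to the real encoding.
MISSING_MARKERS = {None, "", "NA", "nan"}


def drop_incomplete(samples, required):
    """Keep samples whose required covariates are all present and non-empty."""
    return {
        sample_id: attrs
        for sample_id, attrs in samples.items()
        if all(attrs.get(col) not in MISSING_MARKERS for col in required)
    }


# Toy records (made up) keyed by sample ID.
samples = {
    "S1": {"sex": "FEMALE", "age": "61"},
    "S2": {"sex": "", "age": "47"},
    "S3": {"age": "58"},
    "S4": {"sex": "MALE", "age": "NA"},
}

complete_sex = drop_incomplete(samples, ["sex"])
complete_both = drop_incomplete(samples, ["sex", "age"])
```

Stating the `required` list explicitly is one way to document which covariates were deemed crucial enough to justify excluding samples.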

gwaybio · 2016-07-30T13:50:20Z

@ypar thanks for these questions!

re: tissue dictionary

The TCGA acronym is how they identify "tissue source site" but you're right, they're not strictly "tissues" and "diseases" would be more appropriate. E.g. LUAD is "lung adenocarcinoma" and LUSC is "lung squamous cell carcinoma". However, TCGA has adopted this broad terminology, and to stay consistent, so will we. Your point about tumor vs. normal is definitely something we should consider in the final model. We'll need to filter out "normal", which is really "adjacent normal" - normal tissue from the same individual taken from close proximity to the tumor during the actual debulking surgery. We will also probably want to filter out "metastasis" and patients measured twice. Much of this sample curation is performed before the data is made public - but a lot is left in intentionally, or sneaks past the filters. We can use a combination of the representative columns and official TCGA Barcodes to create an official sample list. For unsupervised feature construction however, I think it is important to leave all the samples in!
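As a sketch of the barcode part: in a TCGA barcode the fourth hyphen-separated field starts with a two-digit sample type code (01 primary solid tumor, 06 metastatic, 11 solid tissue normal, among others), and the first three fields identify the patient. The helper names and the example barcodes below are made up for illustration, and keeping the alphabetically first primary tumor per patient is an arbitrary assumption rather than an agreed rule.

```python
# Subset of the TCGA sample type codes; anything else is reported as "Other".
SAMPLE_TYPE_CODES = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "06": "Metastatic",
    "11": "Solid Tissue Normal",
}


def sample_type(barcode):
    """Decode the sample type from the fourth barcode field (e.g. '01A')."""
    code = barcode.split("-")[3][:2]
    return SAMPLE_TYPE_CODES.get(code, "Other")


def patient_id(barcode):
    """Return the patient-level prefix, e.g. 'TCGA-AA-0001'."""
    return "-".join(barcode.split("-")[:3])


def one_primary_tumor_per_patient(barcodes):
    """Keep a single primary tumor sample for each patient."""
    chosen = {}
    for barcode in sorted(barcodes):
        if sample_type(barcode) == "Primary Solid Tumor":
            chosen.setdefault(patient_id(barcode), barcode)
    return sorted(chosen.values())


# Made-up barcodes: one patient with tumor, normal, and metastasis samples,
# and one patient whose primary tumor was measured twice.
barcodes = [
    "TCGA-AA-0001-01",
    "TCGA-AA-0001-11",
    "TCGA-AA-0001-06",
    "TCGA-BB-0002-01A",
    "TCGA-BB-0002-01B",
]
sample_list = one_primary_tumor_per_patient(barcodes)
```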

re: mutations

Right now the mutation selector works as follows: the user selects a gene or genes, and cognoma builds a Y matrix of 1's and 0's corresponding to samples in the expression matrix (X), indicating presence or absence of mutation. I think this is the minimum case example and we should focus on getting it implemented before we try to get fancy. How we define impactful mutations is another story. We will use the official mutation calls from the mutation matrix to determine if the sample has a mutation in a given input gene. Currently, the plan is to filter only silent mutations but we could also have a stronger threshold which would reference a database. I think referencing a database would strongly benefit certain genes (like oncogenes where there are known activating mutations) but also limit the power for other genes (like tumor suppressor genes where there are several known and unknown inactivating mutations along the gene body).
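A minimal sketch of that minimum case, assuming the mutation calls have already been filtered and are available as (sample ID, gene) pairs. The function name and the input layout are hypothetical; a sample counts as positive if any selected gene is mutated.

```python
def build_y(expression_samples, mutation_calls, genes):
    """Return {sample_id: 0 or 1} aligned to the samples in X.

    mutation_calls: iterable of (sample_id, gene) pairs for retained mutations.
    """
    genes = set(genes)
    positives = {sample for sample, gene in mutation_calls if gene in genes}
    return {sample: int(sample in positives) for sample in expression_samples}


# Toy inputs (made up). S4 has a call but no expression data, so it is ignored.
expression_samples = ["S1", "S2", "S3"]
mutation_calls = [("S1", "TP53"), ("S2", "KRAS"), ("S4", "TP53")]

y_tp53 = build_y(expression_samples, mutation_calls, ["TP53"])
y_either = build_y(expression_samples, mutation_calls, ["TP53", "KRAS"])
```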

re: covariates

I am not sure how to handle covariates at the moment... I think some sort of adjustment should be discussed, but I don't know the optimal solution. Right now I'm thinking it would be best to include performance of the model across different covariates in the results viewer.
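One rough way to report performance across a covariate is to compute the metric separately within each covariate level. The sketch below uses plain accuracy as a stand-in; the metric, the function name, and the toy labels are assumptions, and the results viewer may well use something else.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: fraction of correct predictions within that group}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


# Toy labels and predictions (made up), stratified by a sex covariate.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
sex = ["F", "F", "M", "M"]

per_sex = accuracy_by_group(y_true, y_pred, sex)
```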

dhimmel · 2016-08-01T15:37:36Z

For unsupervised feature construction however, I think it is important to leave all the samples in!

@gwaygenomics in the case of multiple samples per individual are you sure we want to leave those in? I think some unsupervised approaches will assume independent observations.

Currently, the plan is to filter only silent mutations but we could also have a stronger threshold which would reference a database.

That's not our current implementation. We ignore all code orange and code green mutations, based on a classification system developed by the Xena Browser team. See #2 (comment) for more information and a table of mutation counts by classification.

gwaybio · 2016-08-01T16:56:50Z

I think some unsupervised approaches will assume independent observations.

Good point. Yeah, we should remove those

ypar · 2016-08-01T17:31:26Z

I think some unsupervised approaches will assume independent observations.

Good point. Yeah, we should remove those

IMO, it is particularly important for unsupervised methods to have the cleanest possible data, although one could argue that it is equally important for supervised methods.
E.g. if confounders and missing values are not treated properly, the first cluster will merely pick out precisely that information and nothing else.

dhimmel · 2016-08-01T18:25:15Z

Discussion on this issue has become off topic. So if we want to keep discussing issues that are not related to processing the PANCAN-clinicalMatrix dataset to extract sample information, let's make new discussions or find an existing discussion that is topical.

* Extract sample info from PANCAN_clinicalMatrix
  Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.
  Closes #10: all variables that could help with sample selection or covariates, that are in PANCAN_clinicalMatrix, are extracted to `data/samples.tsv`.
  Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.
  Closes #17: only sample_ids with expression, mutation, and clinical data are output to `data/`.
* Retain primary blood cancers
  Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood". See #20 (comment)
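The selection described in these commits can be sketched roughly as follows. This is an illustration under assumptions, not the merged notebook code: the `sample_type` column name, the function name, and the dictionary layout are hypothetical, while the two retained type labels are the ones quoted above.

```python
# Sample types retained according to the two commits above.
KEEP_TYPES = {
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}


def select_sample_ids(clinical, expression_ids, mutation_ids, keep_types=KEEP_TYPES):
    """Return sorted sample IDs with a retained type and all three data sources.

    clinical: {sample_id: {"sample_type": ...}} (column name is an assumption).
    """
    eligible = {
        sample_id
        for sample_id, attrs in clinical.items()
        if attrs.get("sample_type") in keep_types
    }
    return sorted(eligible & set(expression_ids) & set(mutation_ids))


# Toy inputs (made up).
clinical = {
    "S1": {"sample_type": "Primary Tumor"},
    "S2": {"sample_type": "Solid Tissue Normal"},
    "S3": {"sample_type": "Primary Blood Derived Cancer - Peripheral Blood"},
    "S4": {"sample_type": "Primary Tumor"},
}
expression_ids = ["S1", "S2", "S3", "S4"]
mutation_ids = ["S1", "S2", "S3"]  # S4 lacks mutation data

selected = select_sample_ids(clinical, expression_ids, mutation_ids)
```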

dhimmel changed the title ~~selecting covariates from the clinical matrix prior to ML applications~~ Process the clinical matrix dataset and extract sample attributes Jul 28, 2016

dhimmel changed the title ~~Process the clinical matrix dataset and extract sample attributes~~ Process the clinical matrix to extract sample attributes Jul 28, 2016

dhimmel added the task label Jul 28, 2016

dhimmel mentioned this issue Jul 28, 2016

Identify the types of clinical data fields for the django team #13

Open

dhimmel mentioned this issue Aug 24, 2016

Extract sample info from PANCAN_clinicalMatrix #20

Merged

clairemcleod closed this as completed in #20 Aug 25, 2016

dhimmel mentioned this issue Aug 26, 2016

Converting Xena datasets to standard identifiers rather than gene symbols #6

Closed

gwaybio mentioned this issue Sep 7, 2016

Visualizing pre-classifier data cognoma/frontend#13

Closed

dhimmel mentioned this issue Sep 26, 2016

Acronyms for diseases #26

Closed
