Loading CPTAC data generates many "gene" records #40

pieterlukasse · 2017-04-18T15:39:22Z

When loading the brca study with CPTAC mass spectrometry data, the portal generates a large number of new "gene" records to store the data reported for each separate isoform(?) in the CPTAC files (72,159 new records in the gene table!).

Here are some concerns:

The query page becomes slow when typing "PHOSPHOPROTEIN" in the Genes box (each new protein "gene" also gets this alias). The resulting drop-down is very slow to render.
Depending on what each symbol means in the CPTAC data file, this solution might not be scalable. For example: do SORBS1_pT72 or SORBS1_pT82_S89 encode modifications to the canonical protein sequence known for gene SORBS1, rather than symbols of well-known isoforms? If so, we risk an explosion in the number of records in the gene table as each study finds new modifications.

Another question I had when looking at the data (see data sample below) is:

How is the entry SORBS1|SORBS1 made? Is it an aggregation of all the other SORBS1|* items? If so, how is this aggregation done?

Data sample from file:

SORBS1|SORBS1   0.545571655184  1.31369690336   1.20131762167   1.1320980343    0.54739111875   1.19041192239   2.73163154855   0.948705044244  1.33867851356   2.12510951076   1.01727605533   1.3008073214
SORBS1|SORBS1_pT72      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.319789927879  0.325725496261  0.453594164798  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
SORBS1|SORBS1_pS76      0.0     0.0     0.0     0.0     0.0     0.0     0.802830082054  1.10324826511   1.43253238093   0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
SORBS1|SORBS1_pS77      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
SORBS1|SORBS1_pS78      0.0312601610742 0.247415810145  1.62051540467   1.11476716945   0.849155398463  2.08833175784   0.0     0.0     0.0     0.0     0.0     0.0     1.32989448951   0.754650122826  0.78
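To make the question about the identifiers concrete, here is a minimal sketch of how the first column could be split into a parent gene and a list of sites. It assumes a reading of the sample that is not confirmed anywhere in this thread: that the part after `|` is either the gene symbol itself (the protein-level row) or the gene symbol followed by `_p` and one or more `<residue letter><position>` tokens joined by underscores. The names `CptacId` and `parse_cptac_id` are hypothetical and not part of the portal code.

```python
import re
from typing import NamedTuple, Tuple


class CptacId(NamedTuple):
    gene: str
    sites: Tuple[str, ...]  # empty tuple for the protein-level row


# Assumed site format: one uppercase residue letter plus a position, e.g. "T72".
_SITE = re.compile(r"^[A-Z]\d+$")


def parse_cptac_id(identifier: str) -> CptacId:
    """Split 'GENE|ENTITY' into the gene and its (assumed) phosphosites."""
    gene, sep, entity = identifier.partition("|")
    if not sep:
        raise ValueError(f"missing '|' in {identifier!r}")
    if entity == gene:
        return CptacId(gene, ())  # e.g. SORBS1|SORBS1
    prefix = gene + "_p"
    if not entity.startswith(prefix):
        raise ValueError(f"unexpected entity {entity!r} for gene {gene!r}")
    sites = tuple(entity[len(prefix):].split("_"))
    for site in sites:
        if not _SITE.match(site):
            raise ValueError(f"unexpected site token {site!r} in {identifier!r}")
    return CptacId(gene, sites)


if __name__ == "__main__":
    for raw in ("SORBS1|SORBS1", "SORBS1|SORBS1_pT72", "SORBS1|SORBS1_pT82_S89"):
        print(raw, "->", parse_cptac_id(raw))
```

If this reading holds, every row already carries its parent gene, so the site-level rows would not need gene records of their own.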


pieterlukasse · 2017-04-18T15:40:11Z

@jjgao @zheins @sheridancbio

pieterlukasse · 2017-04-18T15:56:34Z

Just for the record:

Here is an example of a "protein" vs one of its isoforms(?):

Here is how the isoform(?) is displayed in oncoprint, as "protein upregulation":

jjgao · 2017-04-18T22:48:39Z

cc'ing @pambot on this.

PHOSPHOPROTEIN and SORBS1_pT72 are definitely hacks we should solve. Ideally, we should be able to pull out SORBS1_pT72 when querying SORBS1. Now we have the generic_entity table. Maybe we can do it now.

SORBS1|SORBS1 is the general protein level. It's detected independent of the phospho-levels. Theoretically, it should include all the phosphoproteins. @pambot please correct me if I am wrong.

pieterlukasse · 2017-04-19T08:50:15Z

@jjgao thanks for the comments. I also think it would be nice to have SORBS1_pT72 (and others) appear as options when filling in SORBS1. This could probably be done by tweaking the way the gene_alias table is used now in the query page. The only problem could be that you get a long list of options for each and every gene you fill in the query box, so maybe we need a new feature for this scenario?
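As a rough illustration of the "long list" concern, the sketch below groups entities under their parent gene and caps how many are offered per gene. It is plain Python over in-memory data, not the portal's query-page code or its gene_alias lookup. The function names `build_index` and `suggest` and the `limit` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(identifiers: Iterable[str]) -> Dict[str, List[str]]:
    """Map each parent gene to the entities listed under it ('GENE|ENTITY')."""
    index: Dict[str, List[str]] = defaultdict(list)
    for identifier in identifiers:
        gene, _, entity = identifier.partition("|")
        index[gene].append(entity)
    return index


def suggest(index: Dict[str, List[str]], gene: str, limit: int = 5) -> Tuple[List[str], int]:
    """Return up to `limit` entities for `gene`, plus the number left out."""
    matches = sorted(index.get(gene, []))
    shown = matches[:limit]
    return shown, len(matches) - len(shown)


if __name__ == "__main__":
    rows = [
        "SORBS1|SORBS1",
        "SORBS1|SORBS1_pT72",
        "SORBS1|SORBS1_pS76",
        "SORBS1|SORBS1_pS77",
    ]
    shown, hidden = suggest(build_index(rows), "SORBS1", limit=2)
    print(shown, f"(+{hidden} more)")
```

Capping the list and reporting how many entries were left out keeps the drop-down short, at the cost of an extra step for users who want a specific site.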

pieterlukasse · 2019-06-21T13:03:59Z

@jjgao will phosphoproteins move to the generic assay feature as well?

jjgao · 2019-06-21T20:41:18Z

@pieterlukasse yes, that's the plan.

jjgao · 2019-06-21T20:45:57Z

https://github.com/cBioPortal/cbioportal/issues/6309

yichaoS added the backend label Jan 31, 2019

jjgao closed this as completed Jun 21, 2019
