Very sparse documentation on converting files to GCT format. #3

Closed
draperjames opened this issue Apr 2, 2019 · 3 comments

Comments

@draperjames

I would like to use PTM-SEA on a phosphoproteomic data set that was exported as a CSV file from Proteome Discoverer 2.3, but I'm having trouble figuring out how to format the data appropriately into GCT format. When I looked at the example file for clarity I noticed that it contains 50 columns:

id, id.uniprot, id.flanking, adj.P.Val.Drug_06h, adj.P.Val.Drug_24h, AveExpr.Drug_06h, AveExpr.Drug_24h, Log.P.Value.Drug_06h, Log.P.Value.Drug_24h, logFC.Drug_06h, logFC.Drug_24h, P.Value.Drug_06h, P.Value.Drug_24h, t.Drug_06h, t.Drug_24h, id.query, id.mapped, id.concat, accessionNumber_VMsites_numVMsitesPresent_numVMsitesLocalizedBest_earliestVMsiteAA_latestVMsiteAA, accession_number, accession_numbers, entry_name, bestFilename, bestScore, bestDeltaForwardReverseScore, Best_scoreVML, Best_numPossibleVMsites_STY, Best_numActualVMSites_sty, Best_numLocalizedVMsites_sty, Best_numAmbiguousVMsites_sty, best_parent_m_over_z, best_matched_parent_mass, variableSites, nterm, StartAA, previous_aa, sequence, next_aa, sequenceVML, modifications, geneSymbol, protein_mw, numSpectraVMsiteObserved, peptide_num, protein_group_num, protein_score, species, VMsitesAll, Log.P.Value.Drug_06h.1, Log.P.Value.Drug_24h.1

This format is very different from the format that is exported by default from Proteome Discoverer. I'm sure that there is quite a bit of overlapping information and that it is most likely possible to reformat my data so that it contains the essential information to run through Morpheus and finally PTM-SEA.

Could you please explain which columns are necessary for running PTM-SEA? Does the format of the column name matter to the program?

@draperjames changed the title from "Very sparse documentation converting files to GCT format." to "Very sparse documentation on converting files to GCT format." on Apr 2, 2019
@karstenkrug
Contributor

karstenkrug commented Apr 2, 2019

Please take a look at https://clue.io/connectopedia/gct_format for more information about GCT.

The GCT format is actually not very different from any other matrix-type data format, but it enables you to store meta data alongside the actual data in a structured manner. A GCT file is divided into different sections:

  • data matrix
  • row identifier
  • column identifier
  • row meta data (optional)
  • column meta data (optional)

The second line in a GCT file specifies which rows and columns belong to which section. In the example dataset the second line looks like this:

23936	2	47	13

The first two numbers specify the dimensions of the data matrix (23,936 rows/p-sites and 2 columns/experiments). The third number specifies the number of row meta data columns, which contain annotations for each p-site such as gene name, peptide sequence, PSM score, etc. In the example dataset there are 47 such columns (+ 2 data columns + 1 id column = 50 columns in total). These columns are not required to run PTM-SEA. The fourth number specifies the number of column meta data rows, which contain annotations for each data column such as the name of the perturbagen used, perturbation time, cell line, etc. Again, these additional rows are not required to run PTM-SEA.
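As a quick sanity check of the arithmetic above, here is a minimal Python sketch that splits the second line into its four counts and derives the total number of columns. The function name `parse_gct_dims` and the dictionary keys are made up for illustration; they are not part of cmapPy or any other package.

```python
def parse_gct_dims(line):
    """Split the second line of a GCT v1.3 file into its four counts.

    Hypothetical helper for illustration only (not a cmapPy function).
    """
    n_rows, n_cols, n_row_meta, n_col_meta = (int(x) for x in line.split())
    return {
        "data_rows": n_rows,          # rows of the data matrix (p-sites)
        "data_cols": n_cols,          # columns of the data matrix (experiments)
        "row_meta_cols": n_row_meta,  # annotation columns per row
        "col_meta_rows": n_col_meta,  # annotation rows per data column
        # one id column + row meta data columns + data columns
        "total_columns": 1 + n_row_meta + n_cols,
    }


dims = parse_gct_dims("23936\t2\t47\t13")
print(dims["total_columns"])  # 50
```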

I highly recommend using the cmapR or cmapPy packages to read/write GCT files if you are a programmer, or Morpheus (https://software.broadinstitute.org/morpheus/) if programming is not your thing. There you can easily drag and drop a file and export it as GCT, and you don't have to worry about figuring out the correct numbers for the second line.

If you just have a p x n data matrix without any meta data your GCT file would look something like this:

#1.3
p	n	0	0
id	cid_1	cid_2	...	cid_n
rid_1	x_11	x_12	...	x_1n
rid_2	x_21	x_22	...	x_2n
.	.	.		.
.	.	.		.
.	.	.		.
rid_p	x_p1	x_p2	...	x_pn

This file only contains data and the appropriate row and column identifiers, and the only difference from a regular tab-delimited text file is the two additional lines on top.
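The layout above can be produced with a few lines of plain Python. This is only a sketch under the assumption that there is no row or column meta data; `write_minimal_gct` is a hypothetical helper, and for real work cmapR, cmapPy or Morpheus remain the safer route. The ids and values in the example are placeholders.

```python
import io


def write_minimal_gct(handle, row_ids, col_ids, matrix):
    """Write a p x n matrix as GCT v1.3 without any meta data.

    Hypothetical helper for illustration; not part of cmapPy.
    """
    # rid and cid must be unique
    if len(set(row_ids)) != len(row_ids):
        raise ValueError("row identifiers (rid) must be unique")
    if len(set(col_ids)) != len(col_ids):
        raise ValueError("column identifiers (cid) must be unique")
    if len(matrix) != len(row_ids):
        raise ValueError("matrix must have one row per rid")

    handle.write("#1.3\n")                                       # version line
    handle.write(f"{len(row_ids)}\t{len(col_ids)}\t0\t0\n")      # p, n, 0, 0
    handle.write("\t".join(["id"] + list(col_ids)) + "\n")       # header row
    for rid, values in zip(row_ids, matrix):
        if len(values) != len(col_ids):
            raise ValueError(f"row {rid!r} has the wrong number of values")
        handle.write("\t".join([rid] + [str(v) for v in values]) + "\n")


# Placeholder ids and values, written to an in-memory buffer
buf = io.StringIO()
write_minimal_gct(buf, ["rid_1", "rid_2"], ["cid_1", "cid_2"],
                  [[1.5, -0.5], [0.25, 2.0]])
print(buf.getvalue())
```

To write to disk instead, pass a file opened with `open("my_data.gct", "w")` (the file name is just an example).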

Row identifiers (rid) and column identifiers (cid) are always required and need to be unique. There are no specific requirements for cid other than being an unambiguous identifier.

However, to run PTM-SEA the format of rid matters a lot, since it determines which version of PTMsigDB one has to use. Unambiguous identifiers for PTM sites are not trivial, as we describe in the paragraph "Modified Amino Acid Residues in PTMsigDB Are Robustly Represented Across Databases and Organisms" in the results section of our manuscript. PTMsigDB supports three types of PTM site identifiers; one of these should be used as rid in your GCT file:

Format              Example
Flanking sequence   ETRICKIYDSPCLPE-p
UniProt-centric     Q06609;Y315-p
Site-Group-ID       448324-p

I highly recommend using the sequence flanking the modified residue (7 amino acids in each direction) as the identifier, since these sequences are more stable over time than protein accessions and residue numbers.
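For illustration, a flanking-sequence identifier of this shape could be derived from a protein sequence and a 1-based site position roughly as follows. This is a sketch, not the code used to build PTMsigDB: the function name is hypothetical, and padding with `_` when the site is close to a protein terminus is an assumption that should be checked against the identifiers in the PTMsigDB file you plan to use. The toy sequence is invented so that the window matches the example identifier above.

```python
def flanking_site_id(protein_seq, position, flank=7, pad="_", suffix="-p"):
    """Build a flanking-sequence site id such as 'ETRICKIYDSPCLPE-p'.

    position is 1-based. The pad character for sites near the termini
    is an ASSUMPTION; verify it against PTMsigDB before use.
    """
    if not 1 <= position <= len(protein_seq):
        raise ValueError("position lies outside the protein sequence")
    idx = position - 1
    left = protein_seq[max(0, idx - flank):idx]    # up to 7 residues before
    right = protein_seq[idx + 1:idx + 1 + flank]   # up to 7 residues after
    window = left.rjust(flank, pad) + protein_seq[idx] + right.ljust(flank, pad)
    return window.upper() + suffix


# Invented toy sequence; the modified Y sits at position 11
toy_seq = "AAAETRICKIYDSPCLPEGGG"
print(flanking_site_id(toy_seq, 11))  # ETRICKIYDSPCLPE-p
```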

Finally, each row in the GCT file is supposed to be a single modified site, i.e. multiple modifications on the same peptide have to be resolved and listed as separate rows in the data matrix (single-site-centric table). I am not sure at what level Proteome Discoverer exports phosphosite reports (peptide-level, site-level or single-site-level), but the site tables generated by MaxQuant are single-site-centric (e.g. Phospho (STY)Sites.txt). For other software packages additional pre-processing steps might be required; for details on how we convert site-centric into single-site-centric reports, please refer to the methods section in our manuscript.
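To make the idea of a single-site-centric table concrete, here is a deliberately simple sketch that copies the values of a multi-site row to one row per site. It is not the procedure from the manuscript; the function name and the `P00000` identifiers are invented, and what to do when the same site is covered by several rows (average, keep the best-scoring one, ...) is left open on purpose, so the sketch just raises an error in that case.

```python
def to_single_site_rows(peptide_rows):
    """Expand rows carrying several sites into one row per site.

    peptide_rows: iterable of (list_of_site_ids, list_of_values).
    Illustrative only; NOT the conversion described in the manuscript.
    """
    single_site = {}
    for site_ids, values in peptide_rows:
        for site_id in site_ids:
            if site_id in single_site:
                # rid must be unique; resolving collisions is up to the analyst
                raise ValueError(f"site {site_id!r} is covered by more than one row")
            single_site[site_id] = list(values)
    return single_site


# Invented example: one doubly and one singly modified peptide
peptide_rows = [
    (["P00000;S10-p", "P00000;T14-p"], [1.2, -0.4]),
    (["P00000;Y99-p"], [0.3, 2.1]),
]
for rid, values in to_single_site_rows(peptide_rows).items():
    print(rid, values)
```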

I hope this helps a bit.

Cheers,
K

@cindyli9888

Hi, I also have a problem with creating a GCT file. Is the required data input just the fields "Log.P.Value.Drug_06h.1, Log.P.Value.Drug_24h.1"?
My phosphoproteomic data contains "normalized intensities", which are different from the numbers in the example data. As mentioned in the paper, should we use the limma R package to calculate the log P value? Does that log P value refer to the same one that is required in the final data input?
Thank you very much!

@karstenkrug
Contributor

karstenkrug commented Dec 6, 2019

Hi! You are right, these are the data columns used in the example dataset. The header in the GCT file tells you which columns are data columns and which are annotations. Please take a look at my previous post, in which I explain the GCT format.

You can apply PTM-SEA/ssGSEA to any type of numeric data that can be used as a ranking. Using "normalized intensities" would be comparable to the analysis that we present in Figure 3 in our paper.

If the goal of your experiment is to compare two conditions, let's say A and B, we recommend first performing a statistical analysis comparing A over B at the phosphosite level and then using the signed (by log fold change), -1*log(p)-transformed p-values as input to PTM-SEA. That way the normalized enrichment scores (NES) and associated p-values calculated by PTM-SEA can be readily interpreted: signatures with positive NES and e.g. FDR < 0.01 are enriched in condition A; those with negative NES and FDR < 0.01 are enriched in condition B.

We typically use limma for these kinds of analyses but that's not a requirement.
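Whichever tool produces the p-values and log fold changes, the transformation itself is small. Here is a sketch in Python, assuming a base-10 logarithm (the base is not specified above) and a small floor to avoid taking the log of zero; both the function name and the floor value are made up for illustration.

```python
import math


def signed_log_p(p_value, log_fc, floor=1e-300):
    """Return -log10(p), carrying the sign of the log fold change.

    Base 10 and the floor for p = 0 are ASSUMPTIONS of this sketch.
    """
    if log_fc == 0:
        return 0.0                        # no direction, no signed score
    score = -math.log10(max(p_value, floor))
    return math.copysign(score, log_fc)   # positive: up in A; negative: up in B


# A site that is up in A and one that is down in A (A over B)
print(signed_log_p(0.01, 1.5))
print(signed_log_p(0.01, -1.5))
```

The resulting values would then go into the data columns of the GCT file, one column per comparison.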

Cheers,
K
