Seurat and genefu #22

Closed
ksaunders73 opened this issue Dec 3, 2021 · 6 comments

@ksaunders73

Hello!

Thank you for the excellent package! I would like to use genefu's molecular.subtyping() function (using the pam50.robust model) on my Seurat object, and was wondering whether the Seurat object should be:

  1. only normalized beforehand with NormalizeData()
  2. additionally scaled after normalization using ScaleData()

Thank you for reading!

@ChristopherEeles
Contributor

Hi @ksaunders73,

This is not a straightforward question to answer.

All of the cluster centroids in the genefu package were derived from RNA microarray data of their respective publications. Because the units of a microarray (fluorescence intensity or intensity ratio) differ from those of RNA sequencing (counts, FPKM, or TPM), it is not clear-cut how your counts/FPKM/TPM values should be processed to be comparable with the array-based cluster centroids.

I recommend reading the PAM50 subtype paper, specifically the Methods section:

van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., & Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530–536. https://doi.org/10.1038/415530a

My understanding is that they used log2-transformed expression ratios to conduct their clustering analysis. Therefore the centroids of their clusters will also be expressed in these units. From the supplementary Methods section of the aforementioned paper, the expression ratios were calculated as:

the logarithmic transcriptional expression level measured relative to a baseline condition

I was unable to find the definition of the baseline condition in the paper; maybe you can find it? Without knowing what the baseline for the expression ratios was, it is hard to say how to make an analogous metric from counts/TPM.

My instinct would be to divide the TPM by the average or median for each gene across your patient cohort, but whether this is scientifically valid or not is a call you will need to make. It is possible they used a normal sample for their baseline.
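As a rough illustration of the median-ratio idea, here is a minimal base R sketch. The toy matrix, the genes-by-samples layout, the gene and sample names, and the pseudocount of 1 are all assumptions made for the example; none of them come from the paper or from genefu.

```r
# Toy genes-by-samples matrix of TPM-like values (hypothetical data).
tpm <- matrix(
  c(10, 20, 40,
     5,  5, 20,
     0,  3, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("s1", "s2", "s3"))
)

# Assumption: a pseudocount of 1 to avoid log2(0); the value is arbitrary.
pseudocount <- 1

# Per-gene baseline: the median across the cohort (one value per row).
baseline <- apply(tpm, 1, median)

# log2 ratio of each sample to its gene's baseline.
# A vector of length nrow(tpm) is recycled down the columns,
# so row i is divided by baseline[i].
log_ratio <- log2((tpm + pseudocount) / (baseline + pseudocount))

print(round(log_ratio, 2))
```

With a median baseline, each gene's log ratios are centred on zero across the cohort, so the result depends on which samples are in the cohort. Swapping in a normal-sample baseline would only change how `baseline` is computed.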

Once you decide on how to get a log expression ratio from your Seurat data, you should apply the genefu::rescale function to the expression matrix since this is what has been done for the pam50.robust cluster centroids. It is also worth noting that the molecular.subtyping function always uses the robust variant of the cluster centroid data.
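To make the rescaling step concrete, the sketch below uses a hypothetical stand-in, `rescale_quantile`, that maps each gene's values onto a common range using quantiles. It is not genefu's code: the formula, the default `q = 0.05`, and the per-gene (row-wise) application on a genes-by-samples matrix are assumptions, so check `?rescale` in genefu for the actual behaviour before relying on it.

```r
# Hypothetical stand-in for a quantile-based rescale (NOT genefu's implementation).
# Values at the q/2 quantile map to 0 and values at the 1 - q/2 quantile map to 1.
rescale_quantile <- function(x, q = 0.05) {
  lo <- quantile(x, probs = q / 2, na.rm = TRUE, names = FALSE)
  hi <- quantile(x, probs = 1 - q / 2, na.rm = TRUE, names = FALSE)
  (x - lo) / (hi - lo)
}

# Toy genes-by-samples matrix of log ratios (random, for illustration only).
set.seed(1)
log_ratio <- matrix(
  rnorm(40), nrow = 4,
  dimnames = list(paste0("GENE_", 1:4), paste0("s", 1:10))
)

# apply() over rows returns a samples-by-genes matrix, so transpose back.
scaled <- t(apply(log_ratio, 1, rescale_quantile))

print(round(scaled[, 1:3], 2))
```

Using quantiles rather than the raw minimum and maximum limits the influence of a few extreme values, which matters for sparse single-cell data.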

Information about different centroids can be found in the genefu package help, e.g. using ?pam50. This will include a reference to the publication from which the centroid data was retrieved.

Given that this package was designed for classifying data from Affymetrix microarrays, I am not sure it is optimal to adapt it for use on RNA sequencing data. Because of the technical considerations above, you may want to consider an RNA-seq-based clustering algorithm instead.

Hopefully that helps.

Best,
Christopher Eeles
Software Developer
BHK Lab | PM-Research | UHN

@ksaunders73
Author

Thank you very much @ChristopherEeles!

@ChristopherEeles
Contributor

Hi @ksaunders73,

I am going to close this issue. If you have further questions feel free to re-open this thread or file a new issue.

Best,
Christopher Eeles
Software Developer
BHK Lab | PM-Research | UHN

@zhangjl-work

Excuse me, how can single-cell data be used for PAM50 analysis? What should the input expression matrix look like, and which normalization method should be used?

@zhangjl-work

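Hello!

Thank you for the excellent package! I would like to use genefu's molecular.subtyping() function (using the pam50.robust model) on my Seurat object, and was wondering whether the Seurat object should be:

  1. only normalized beforehand with NormalizeData()
  2. additionally scaled after normalization using ScaleData()

Thank you for reading!

Excuse me, how can single-cell data be used for PAM50 analysis? What should the input expression matrix look like, and which normalization method should be used?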

@ChristopherEeles
Contributor

It has come to my attention that the paper I cited above is not the original PAM50 publication. However, the discussion still applies.
