Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected spectra factors and some misc questions #24

Closed
ccruizm opened this issue Jun 20, 2023 · 6 comments
Closed

Unexpected spectra factors and some misc questions #24

ccruizm opened this issue Jun 20, 2023 · 6 comments

Comments

@ccruizm
Copy link

ccruizm commented Jun 20, 2023

Good day,

I have run Spectra successfully using both CPU and GPU. However, I have noticed an unexpected behavior in both approaches. I used a downsampled version of my data (120K cells), normalized using scran, and grouped in major cell types (malignant, myeloid, lymphoid, and vascular). In addition to the global gene sets, I added some specific ones for each cell group, and in total, the dictionary contains 465 gene sets.

When I examined and plot the output stored in adata.uns['SPECTRA_overlap'] and their respective scores adata.obsm['SPECTRA_cell_scores'], I noticed that the order and naming of the gene sets changed. In some cases, gene sets appear in global and other categories more than one time and show different results. Also, some categories are called only global, or malignant or myeloid. I thought they could be a summary of all gene sets in that category, but once again, there are several with the same name and different scores/distributions. See the example below (In this case, the gene set called ‘jessa22_M2’ is in the dictionary only as an entry for the malignant category and contains a signature of 50 genes).

Screenshot 2023-06-20 at 13 56 19

Screenshot 2023-06-20 at 13 56 39

I am using

model = spc.est_spectra(adata = adata, gene_set_dictionary = annotations, 
                        use_highly_variable = True, cell_type_key = "cell_type_annotations", 
                        use_weights = True, 
                        lam = 0.1, 
                        delta=0.001,kappa = 0.00001, rho = 0.00001, 
                        use_cell_types = True, n_top_vals = 50, 
                        label_factors = True,
                        overlap_threshold = 0.2,
                        num_epochs=10000,
                        verbose = True
                       )

Why do you think this is happening? Should I reduce lam (e.g., 0.01) to make it more strict?

I also wonder how can you discriminate potentially novel programs as you described in the manuscript for the therapy-induced macrophages? You mentioned in the methods that you defined “new factors as factors with a graph dependency parameter η < 0.25 and modified factors as factors with a graph dependency parameter η ≥ 0.25”. Which of the outputs can I use to filter potentially novel factors?

I also read in the methods that the gene set dictionary is optional. In that case, will Spectra perform an unsupervised gene program detection?

About the SPECTRA_cell_scores, I saw that most of the scores are not higher than 0.02. Do you know if this is normal? I see these scores even for well-known gene sets that define, for instance, the cell cycle. (I see the scales presented in your manuscript are much higher)

Finally, as I mentioned in #21, GPU can reach a lower LR with fewer epochs than the CPU implementation. Does this suggest the GPU version can provide more accurate results in less time?

Thanks in advance! Sorry for the questions, but I am excited to get this working on my dataset (and understanding better the outputs)!

I'm looking forward to hearing your thoughts and suggestions.

@kvshams
Copy link

kvshams commented Jun 20, 2023

@ccruizm I also seen a similar observation. And got the lam saturation in drastic diffrent on the same data set same machine and same seed (#23) on multiple run. This is much inconsistent in GPU implementation than in CPU.

@wallet-maker
Copy link
Member

Hi Cristian,

thanks for your feedback.

Regarding your first question:

These behaviors is expected.

Three aspects: a) Multiple factors getting named by the same gene set and b) cell type specific factors getting named by factors from a different cell type c) a factor does not get a gene set name assigned

a) There is by design no 1:1 mapping between gene sets and factors. The factors are named by their overlap coefficient with all input gene sets, so also cell type specific gene sets will be used for naming all of the factors. But don't worry, Spectra only used them to fit the factors for the cell types you indicated. Maybe I will constrain the factor naming to the gene sets that were used for fitting the factors for the respective cell type. Personally, I felt it was helpful to perform the naming by computing overlap of all gene sets vs all factors because when you set a gene set as cell type specific, Spectra may still discover a similar factor outside of that cell type in an unsupervised way that is without using the gene set. Let me know if that makes sense.

b) Because of this lack of 1:1 mapping, gene sets which are not very coherent (maybe they are a mix of several processes) can be split into several factors mapping to a same gene. You may want to look at the marker genes for each of these factors. I suspect they will use different genes from the gene set. We believe this behavior is desired as the gene sets cannot be regarded as ground truth and may contain several ground truth processes which should then be split into distinct factors by Spectra. I think this behavior will be reduced if you reduce lambda, however this will also lead to stronger supervision by the gene sets so you learn less from the data. If you have a lot more factors than gene sets you can try reducing the number of factors first. Another approach could also be to take a closer look at the marker genes and see if there are perhaps some important subprocesses in your gene set and whether the gene set can be split and renamed or made more coherent. Of course there are also other ways to name the factors e.g. by running GSEA on the marker genes/ gene scores/loadings.

c) If there is an gene set vs factor marker genes overlap coefficient lower than the threshold in spectra_est (default of overlap_threshold = 0.2 ), the factor does not get a gene set assigned, so it is just called 'factor-index-X-celltype-X-factor-index' with cell type being the cell type of that factor and the factor index just the index in the factors x gene loadings/scores matrix. The factor names should be unique though and this is also what I see from the data you posted correct?

Regarding eta:
You can find eta in the model file outputted by est_spectra by calling:
model.return_eta_diag()

Regarding the cell scores:
Yes, this is expected behavior. The absolute value of the cell score is not super helpful / not easy to interpret. You are correct that in the preprint we have generally higher cell scores which is probably just because we used a lower number of factors. We have added information and importance scores in the revised manuscript. These quantify the contribution of a factor to explaining the observed expression data and in cell type variation, respectively. You can find these in spectra_util . I will try to add these to the tutorial soon. Also if you have suggestions for better naming conventions we are happy to consider them.

Regarding the gene set dictionary:
Yes, if you run Spectra without a gene set dictionary or without a gene set dictionary and without cell types it will only be supervised by cell type or completely unsupervised, respectively.

Regarding the GPU version:
This is interesting, I do not have a very confident answer and would refer this to @russellkune .

Let me know if that helps.

Thanks,
Thomas

@ccruizm
Copy link
Author

ccruizm commented Jun 23, 2023

Hello Thomas,

I appreciate your thorough explanation. Now things are much clearer. I will need to invest some time to digest all the output I got from Spectra taking into account all the remarks you made :)

Will also test different lambda values and try to refine my gene sets so there are not 'promiuscous' gene sets that cover many cell processes.

Regarding your comment on unassigned factors to a specific gene set, you are right that all entries are unique and named with no duplicates. However I am curious why the final number of factors is 'limited/sticked' to the number of gene sets. I would assume if there are factors that are not fitting that well or less 'relevant', then the number of final factors should change. Because I have the feeling now that from my 465 gene sets, Spectra will try to fit 465 factors even if some of them are not specific for a particular cell type (such as the example with jessa22_M2 (which appears at least 8 times and abrogates other gene sets). What happened with the gene sets that now, talking again about this particular gene set (jessa22_M2), has 'replaced'? Does it mean that the gene set 17 that in my dictionary belongs to 'all_selenoamino-acid metabolism' has an overlap with jessa22_M2 but is better defined by the last one? (although I have noticed that factors try to keep the same order as in the dictionary, sometimes they are shuffled and there is no direct 1:1 replacement with less fitting factors).

About the cell scores, I had the feeling they are more 'arbitrary' units and not necessarily can be comparable across factors. Looking forward to read more about it once the revised version is publically available (or in a updated tutorial🙂).

Lastly, if I run Spectra in an unsupervised way, how does it determines the number of factors? do I need to set and expected number of factors to be recovered? or will Spectra determine the best and more stable value where no more overlapping factors and more relevant ones are reached?

Hope @russellkune can tell us a bit more about the GPU implementation of Spectra.

Thanks in advance for the time and detailed answers.

@ccruizm
Copy link
Author

ccruizm commented Jun 28, 2023

Hi there,

I was trying to run spectra fully unsupervised (no dictionary nor cell types), but I am unsure how to set the model besides setting use_cell_types=False. Since gene_set_dictionary must be provided, should I create an empty dictionary with global as a unique key? Which of the other options should be modified when running in an unsupervised way (e.g., lam, delta, etc)

Thanks!

@wallet-maker
Copy link
Member

Hi Cristian,
We have never used the method this way. I think setting gene_set_dictionary = {'global':{}} sounds good. The choice of lambda should not matter because it controls the weight of the data vs gene sets. I think you can use default settings for the parameters. There might be bugs running the method without gene sets and cell types. It was not really designed for that.
best,
Thomas

@wallet-maker
Copy link
Member

Perhaps to add to that: If you use Spectra without cell types and without gene sets it should essentially behave like NMF. So probably it's better to run NMF for this use case (mind that in the paper we show that NMF is actually performing very poorly so I would maybe rather use something like scHPF if you want to go fully unsupervised).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants