Too many subpopulations? #10
Comments
Hi,
Thank you for using our software! The clustering algorithm that
CellRouter currently implements aims to identify many clusters, to
allow reconstruction of trajectories between specific locations in the
dimension reduction plot. I have also noticed that for some datasets this
might not be ideal. For now, you can increase K to obtain fewer
subpopulations, until the number of clusters is similar to the one that you
obtain with SC3. I am working on optimizing this step in CellRouter, and I am
also adding an option to use previously identified clusters as input, so you
could use your SC3 clusters as input to CellRouter. Unfortunately, this
will take me about 2 weeks to finish, so the quickest solution is to
increase K.
Please let me know if that helps! I am working on improving CellRouter,
and comments/suggestions are very welcome!
Thanks a lot!
2018-05-02 9:13 GMT-04:00 MingBit <notifications@github.com>:
… Hi,
Thanks again for the great work. :)
I'm testing CellRouter with our own data (two conditions at day 3). About 20
cell sub-populations are identified with K = 12, the value returned by the
findK function.
SC3 (https://github.com/hemberg-lab/SC3) identifies relatively few clusters,
which seems close to our expectation. So I'm wondering whether CellRouter
tends to give many sub-populations, even though the input data were collected
from two conditions at a single timepoint.
Looking forward to your response.
--
Edroaldo
|
Thanks for your explanation. :) |
You are exactly right. It is fine to use different values of K for
clustering and trajectory identification. For example, the clusters identified
with K=5 will be the ones used for trajectory analysis between
subpopulations. You can choose another value of K for trajectory analysis
when your kNN graph is not fully connected, or when the subpopulations in
the transition that you want to study are not connected in the kNN graph.
Regardless of the value of K that you choose for the trajectory analysis
step, the subpopulations used will be the ones identified in the first
step. In the first tutorial on GitHub, in the section "Starting trajectory
analysis", the first figure shows the connections between the
subpopulations, that is, the edges of the kNN graph
visualized in the tSNE space. In that figure, if you want to study the
transition from 24 to 2, you will need to increase K so that clusters 3
or 4 become connected to cluster 2.
Please let me know if this is clear.
Thanks!
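The point about connectivity can be illustrated with a toy example. The sketch below is not CellRouter code: it is plain Python on made-up one-dimensional points, and the helper names `knn_edges` and `count_components` are hypothetical. It builds a kNN graph for two well-separated groups of points and counts the connected components for a few values of k, assuming absolute difference as the distance.

```python
def knn_edges(points, k):
    """Return the undirected edge set of a k-nearest-neighbour graph."""
    edges = set()
    for i, p in enumerate(points):
        # Rank every other point by distance to p; ties are broken by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (abs(points[j] - p), j),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def count_components(n, edges):
    """Count connected components with a small union-find structure."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})


# Two made-up groups of four "cells" each, far apart from one another.
points = [0, 1, 2, 3, 10, 11, 12, 13]

for k in (2, 3, 4):
    n_components = count_components(len(points), knn_edges(points, k))
    print(f"k={k}: {n_components} connected component(s)")
```

With k=2 or k=3 every neighbour lies inside the same group, so the two groups stay disconnected and no path between them exists. With k=4 each point must reach outside its own group, and the graph becomes a single component.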
2018-05-03 6:46 GMT-04:00 MingBit <notifications@github.com>:
… Thanks for your explanation. :)
In terms of K values, as I understand it, a KNN graph is generated and cell
sub-populations are then detected on it with the Louvain community
detection method. These sub-populations are then used for trajectory
analysis.
In your tutorial example
<https://github.com/edroaldo/cellrouter/blob/master/stemid/StemID_BM_CellRouter.md>,
K=5 was used for cell clustering and K=10 was used for trajectory
analysis. So I'm wondering whether the K values should be identical in the
two parts of the analysis. Otherwise, the cell sub-populations would be different.
Thanks.
--
Edroaldo
|
Yesss... It's clear enough for me about choosing K values. :D So for those single-time-point datasets, I'm wondering whether it might be better not to use graph-based methods for clustering if I just want to do cell sub-population identification and differential gene expression analysis. Please correct me if I'm mistaken. :) Looking forward to your suggestion. |
Ah! .. one more question about gene markers. Sry for my frequent posting... >_<... CellRouter is very interesting for us and I'll be presenting this method in our group.. :D |
Hi,
I think you are correct in your observations. It usually requires some
iterations to identify clusters and signatures that make sense
biologically.
I am now actively working to improve the clustering part and also the
differential expression component of CellRouter. I hope to update the
GitHub page at some point late next week. I will be glad to hear your feedback
on it! I am trying to extend CellRouter further to be a more complete
tool...
Thanks a lot!
…On Fri, May 4, 2018, 8:37 AM MingBit ***@***.***> wrote:
Yesss... It's clear enough for me about choosing K values. :D
Since I currently only have scRNA-seq datasets at a limited number of time
points, I'm more interested in the cell clustering part. ^_^ As I noticed, some
trajectory analysis tools (e.g. Wanderlust, Monocle2
<https://github.com/cole-trapnell-lab/monocle-release> and Scanpy
<https://github.com/theislab/scanpy>) tend to use graph-based
methods (e.g. KNN + Louvain clustering) for sub-population identification.
Some other tools, including SC3 <https://github.com/hemberg-lab/SC3>,
CIDR <https://github.com/VCCRI/CIDR/blob/master/R/CIDR.R> and RaceID
<https://github.com/dgrun/RaceID/blob/master/RaceID_class.R>, use either
K-means or hierarchical clustering to cluster cells. I've played with the
above packages a little bit and found that, for datasets with two or
three conditions at one time point, graph-based methods tend to give more
clusters than K-means/hierarchical clustering.
So for those single-time-point datasets, I'm wondering whether it
might be better not to use graph-based methods for clustering if I just
want to do *cell sub-population identification* and *differential gene
expression analysis*. Please correct me if I'm mistaken. :)
Looking forward to your suggestion.
|
The current findK function should not be used. I am also including another
clustering algorithm in CellRouter, based on model-based
clustering, and I will also make available an option to provide as input
clusters identified by other tools.
I hope you can wait for the next release of CellRouter next week to try that
out.
I have also noticed that the analysis looks much better when data imputation
methods such as MAGIC or scImpute are used.
Hope it helps! I am actively working to release the new version of
CellRouter next week.
Thank you very much for your interest in our work!
…On Fri, May 4, 2018, 9:49 AM MingBit ***@***.***> wrote:
Ah! .. one more question about gene markers.
So I increased K a bit to get fewer sub-populations. I
learned that differential gene expression analysis in CellRouter is
performed on mean expression values. I tried to create feature plots
with *plotDRExpression()* for the top differentially expressed genes of each
sub-population.
I used two normalisation methods (log, z-score); unfortunately, the expression
changes between clusters look relatively gradual. So it seems that the
identified clusters are not optimal. I'm wondering whether there is any way to
estimate K if findK() cannot give an optimal value. Sry for my frequent posting...
>_<...
CellRouter is very interesting for us and I'll be presenting this method
in our group.. :D
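For reference, the two normalisations mentioned above can be compared on a single gene with a few lines of code. This is a minimal sketch with made-up expression values and cluster labels, not CellRouter's implementation; the function names `log_normalise`, `z_score` and `mean_by_cluster` are hypothetical, and the log transform is assumed to be log(1 + x).

```python
import math
import statistics


def log_normalise(values):
    """Log-transform non-negative expression values as log(1 + x)."""
    return [math.log1p(v) for v in values]


def z_score(values):
    """Centre values on their mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant gene carries no contrast between cells.
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def mean_by_cluster(values, labels):
    """Average the values within each cluster label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: statistics.fmean(vals) for label, vals in groups.items()}


# Made-up expression of one gene in six cells assigned to two clusters.
expression = [0.0, 1.0, 2.0, 20.0, 25.0, 30.0]
clusters = ["A", "A", "A", "B", "B", "B"]

for name, transform in (("log", log_normalise), ("z-score", z_score)):
    print(name, mean_by_cluster(transform(expression), clusters))
```

A marker gene that separates clusters well should show a clear gap between the per-cluster means under either transform; similar means across neighbouring clusters would match the gradual changes described above.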
|
That is a typo. It should be m_i,j. I will check with the journal how we
could publish a correction for this.
Thanks!
2018-05-06 7:10 GMT-04:00 MingBit <notifications@github.com>:
… Hey edroaldo,
I'm very much looking forward to the next version of CellRouter. 👍
Hmm.. Concerning the GRN score, I'm a little bit confused by m_t,j and
m_i,j. In this formula:
[image: screen shot 2018-05-06 at 13 05 35]
<https://user-images.githubusercontent.com/22442392/39672625-32c3be28-512e-11e8-8a41-fc21577bbca6.png>
*m_t,j is the mean correlation of predicted targets of gene i regulated
along trajectory j*
And it was mentioned again here:
*Moreover, if its predicted target genes are also well correlated with the
differentiation trajectory, it is more likely that the regulator is
important (parameter m_i,j)*
I'm wondering whether they are actually the same. If not, what is t in m_t,j?
Thank you, and looking forward to your reply. :D
--
Edroaldo
|
Hi,
Thanks again for the great work. :)
I'm testing CellRouter with our own data (two conditions at day 3). About 20 cell sub-populations are identified with K = 12, the value returned by the findK function. I did try other K values as well.
SC3 identifies relatively few clusters, which seems close to our expectation. So I'm wondering whether CellRouter tends to give many sub-populations, even though the input data were collected from two conditions at a single timepoint.
Looking forward to your response.