-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate pyFlowSOM
with Pixie
#863
Conversation
… pyflowsom_add
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@ngreenwald tested this on both pixel and cell clustering, verified that both run seamlessly through. We'll need to merge in #856 before we merge this one into main. Also, this one will require rebuilding the Docker image. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Will you see if the issue here with large image files is still an issue with our python port? I'm guessing it is, since Zak didn't change the C code, which I believe is where the memory error came from.
If so, I think it should be straightforward to update the logic to process large chunks at a time rather than the entire image
Block-based seems slightly better on average. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay great, once Candace is happy this should be good to go. I assume you'll want to merge in the cleanup PR first?
We can make a new minor release afterwards to deal with the docker issues.
Sounds good. I'll merge the cleanup PR in first, then this one after approval. |
ark/phenotyping/cluster_helpers.py
Outdated
som_weights = som( | ||
data=data.values, xdim=self.xdim, ydim=self.ydim, | ||
alpha_range=(self.lr_start, self.lr_end), | ||
deterministic=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, so this deterministic=True
thing is bothering me slightly. In Zak's implementation, he has these lines: https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/cyFlowSOM.pyx#L106-L107
So if deterministic=True
, that means the nodes are not randomly initialized (it's always the first 100 data points). Perhaps with a lot of iterations, it doesn't matter how the nodes are initialized, but we never tested this. In the og implementation of FlowSOM, it's always random (unless the user specifies the initial nodes): https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/R/2_buildSOM.R#L190.
I think it would be better to always have a true random initialization of nodes. I think Zak probably made this deterministic=True
feature for testing. I think in practice, it would still be better to set a seed, and have that be something the users can change if they wish. This would require changing the pyFlowSOM code. I don't think we need this deterministic
parameter - instead, we should replace it with a seed
parameter, and just set the seed before initializing the nodes (before this line: https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/cyFlowSOM.pyx#L109).
Here, if deterministic=True
, Zak sets a random seed: https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/cyFlowSOM.pyx#L146. We could probably just replace that with whatever seed the user chooses.
I don't think this is that complex (I think @alex-l-kong or @srivarra could do it?). Just not sure how annoying it is to rebuild pyFlowSOM. @ngreenwald thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or alternatively, set the seed before the call to pyFlowSOM and then set deterministic=False
in the function call. We'd have to confirm that this actually does make pyFlowSOM reproducible though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliu72 There are a couple of ways we have to implement this:
- In Python / Cython
- We might need to use the Numpy C - API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliu72 let me test this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliu72 I just tested it out. Setting np.random.seed
in Python prior to the call to pyFlowSOM.som
(with deterministic=False
) does not randomly initialize the weights. I think this random seed needs to be explicitly set in C.
Using deterministic=True
does ensure the weights are randomly initialized because even though it takes the first N rows of data, the subsetting process is set to a random seed, meaning we'll always be getting the first N rows of data each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking at this a bit more, Zak changed the c code from the original package. In the original c code, https://github.com/SofieVG/FlowSOM/blob/master/src/som.c, there are these lines:
https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/src/som.c#L14-L15
https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/src/som.c#L93
https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/src/som.c#L144
It seems like these were removed by Zak. I think that's what was allowing the c code to get the random state from R. I think we would need an equivalent for python (I think GetRNGstate as in the original c code is specific for R? But I'm not sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my initial analysis I think the only reason GetRNGstate
and PutRNGstate
are called in SofieVG's implementation is because they make an external call to R_CheckUserInterrupt
every 1024 updates, as well as R's internal unif_rand
function for random selections. These functions are needed to preserve the original default seed since C is actually calling R-based packages (see https://www.rdocumentation.org/packages/httpuv/versions/1.6.0/topics/getRNGState). However, Zak implemented this logic using just C (https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/flowsom.c#L6-L8), which shouldn't cause any such issues to my knowledge.
It also seems like the deterministic
param is a recreation of init
in the SOM
function of FlowSOM
, however the deterministic creation of the weights in FlowSOM
is a lot more sophisticated (https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/R/2_buildSOM.R#L279-L296), meaning we avoid the issue you brought up of sampling non-diverse pixels. I'm not sure why Zak didn't choose to implement this version and defaulted to a basic top-N sampling for deterministic=True
.
In either case, I think the main requirement here is ensuring that pyFlowSOM.som
can take an explicit random seed to initialize the weights in all cases. I'd argue for dropping the deterministic
parameter and instead pass a seed
parameter that is then set towards the beginning. In this way, we should be able to use something like:
if nodes is None:
np.random.seed(seed) # Sri does make a good point about using np.random.default_rng instead per https://albertcthomas.github.io/good-practices-random-number-generators/, this can be a PR on its own
nodes = data[np.random.choice(data.shape[0], xdim * ydim, replace=False), :]
to generate the same starting nodes each time on the same dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I told Zak that we never use the init
parameter and to leave it out of his implementation. In the original version of FlowSOM, the default is init=FALSE
and we never change that, so the complex initialization function you linked isn't relevant for us (we never use that function).
What we do use is a random initialization of nodes (https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/R/2_buildSOM.R#L190-191), which is what is currently implemented in Zak's version when deterministic=False
. I think Zak only implemented deterministic=True
for testing.
And yes as you outlined and as I talk about above, I think setting a random seed before nodes is the way to go and removing deterministic
. Your code outline looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay great. Is there anything else besides this issue that still needs to be addressed from this PR from you @cliu72 ? We discussed today that it will be easier to merge this one in using the existing release of pyFlowSOM, and then have a separate PR that fixes the randomization issue in a different pyFlowSOM release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few more small comments
…notebook tests, remove mock file generation
…ysis into pyflowsom_add
Everything except the randomization issue (which will be handled in a separate PR) has been addressed and tested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a question about the docker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
What is the purpose of this PR?
Closes #811. Closes #833. Closes #688. This PR addresses integration of
pyFlowSOM
. See the issues for more details.How did you implement your changes
See #833.