Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate pyFlowSOM with Pixie #863

Merged
merged 33 commits into from
Jan 5, 2023
Merged

Integrate pyFlowSOM with Pixie #863

merged 33 commits into from
Jan 5, 2023

Conversation

alex-l-kong
Copy link
Contributor

@alex-l-kong alex-l-kong commented Dec 14, 2022

What is the purpose of this PR?

Closes #811. Closes #833. Closes #688. This PR addresses integration of pyFlowSOM. See the issues for more details.

How did you implement your changes

See #833.

@alex-l-kong alex-l-kong self-assigned this Dec 14, 2022
@alex-l-kong alex-l-kong marked this pull request as draft December 14, 2022 05:21
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@alex-l-kong
Copy link
Contributor Author

alex-l-kong commented Dec 15, 2022

@ngreenwald tested this on both pixel and cell clustering, verified that both run seamlessly through.

We'll need to merge in #856 before we merge this one into main. Also, this one will require rebuilding the Docker image.

@alex-l-kong alex-l-kong marked this pull request as ready for review December 15, 2022 08:16
Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Will you see if the issue here with large image files is still an issue with our python port? I'm guessing it is, since Zak didn't change the C code, which I believe is where the memory error came from.

If so, I think it should be straightforward to update the logic to process large chunks at a time rather than the entire image

#688

@alex-l-kong
Copy link
Contributor Author

Okay, and the issues about performance? How much of a difference does this make for predicting a single image worth of pixels?

Block-based seems slightly better on average.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay great, once Candace is happy this should be good to go. I assume you'll want to merge in the cleanup PR first?

We can make a new minor release afterwards to deal with the docker issues.

@alex-l-kong
Copy link
Contributor Author

Okay great, once Candace is happy this should be good to go. I assume you'll want to merge in the cleanup PR first?

We can make a new minor release afterwards to deal with the docker issues.

Sounds good. I'll merge the cleanup PR in first, then this one after approval.

som_weights = som(
data=data.values, xdim=self.xdim, ydim=self.ydim,
alpha_range=(self.lr_start, self.lr_end),
deterministic=True
Copy link
Contributor

@cliu72 cliu72 Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so this deterministic=True thing is bothering me slightly. In Zak's implementation, he has these lines: https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/cyFlowSOM.pyx#L106-L107

So if deterministic=True, that means the nodes are not randomly initialized (it's always the first 100 data points). Perhaps with a lot of iterations, it doesn't matter how the nodes are initialized, but we never tested this. In the og implementation of FlowSOM, it's always random (unless the user specifies the initial nodes): https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/R/2_buildSOM.R#L190.

I think it would be better to always have a true random initialization of nodes. I think Zak probably made this deterministic=True feature for testing. I think in practice, it would still be better to set a seed, and have that be something the users can change if they wish. This would require changing the pyFlowSOM code. I don't think we need this deterministic parameter - instead, we should replace it with a seed parameter, and just set the seed before initializing the nodes (before this line: https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/cyFlowSOM.pyx#L109).

Here, if deterministic=True, Zak sets a random seed: https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/cyFlowSOM.pyx#L146. We could probably just replace that with whatever seed the user chooses.

I don't think this is that complex (I think @alex-l-kong or @srivarra could do it?). Just not sure how annoying it is to rebuild pyFlowSOM. @ngreenwald thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively, set the seed before the call to pyFlowSOM and then set deterministic=False in the function call. We'd have to confirm that this actually does make pyFlowSOM reproducible though.

Copy link
Contributor

@srivarra srivarra Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cliu72 There are a couple of ways we have to implement this:

  1. In Python / Cython
  2. We might need to use the Numpy C - API

Copy link
Contributor Author

@alex-l-kong alex-l-kong Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cliu72 let me test this out.

Copy link
Contributor Author

@alex-l-kong alex-l-kong Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cliu72 I just tested it out. Setting np.random.seed in Python prior to the call to pyFlowSOM.som (with deterministic=False) does not randomly initialize the weights. I think this random seed needs to be explicitly set in C.

Using deterministic=True does ensure the weights are randomly initialized because even though it takes the first N rows of data, the subsetting process is set to a random seed, meaning we'll always be getting the first N rows of data each time.

Copy link
Contributor

@cliu72 cliu72 Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking at this a bit more, Zak changed the c code from the original package. In the original c code, https://github.com/SofieVG/FlowSOM/blob/master/src/som.c, there are these lines:

https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/src/som.c#L14-L15
https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/src/som.c#L93
https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/src/som.c#L144

It seems like these were removed by Zak. I think that's what was allowing the c code to get the random state from R. I think we would need an equivalent for python (I think GetRNGstate as in the original c code is specific for R? But I'm not sure).

Copy link
Contributor Author

@alex-l-kong alex-l-kong Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my initial analysis I think the only reason GetRNGstate and PutRNGstate are called in SofieVG's implementation is because they make an external call to R_CheckUserInterrupt every 1024 updates, as well as R's internal unif_rand function for random selections. These functions are needed to preserve the original default seed since C is actually calling R-based packages (see https://www.rdocumentation.org/packages/httpuv/versions/1.6.0/topics/getRNGState). However, Zak implemented this logic using just C (https://github.com/angelolab/pyFlowSOM/blob/c7989bdaaaa2f3d23c936df4c0b64c2769198353/pyFlowSOM/flowsom.c#L6-L8), which shouldn't cause any such issues to my knowledge.

It also seems like the deterministic param is a recreation of init in the SOM function of FlowSOM, however the deterministic creation of the weights in FlowSOM is a lot more sophisticated (https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/R/2_buildSOM.R#L279-L296), meaning we avoid the issue you brought up of sampling non-diverse pixels. I'm not sure why Zak didn't choose to implement this version and defaulted to a basic top-N sampling for deterministic=True.

In either case, I think the main requirement here is ensuring that pyFlowSOM.som can take an explicit random seed to initialize the weights in all cases. I'd argue for dropping the deterministic parameter and instead pass a seed parameter that is then set towards the beginning. In this way, we should be able to use something like:

if nodes is None:
    np.random.seed(seed)  # Sri does make a good point about using np.random.default_rng instead per https://albertcthomas.github.io/good-practices-random-number-generators/, this can be a PR on its own
    nodes = data[np.random.choice(data.shape[0], xdim * ydim, replace=False), :]

to generate the same starting nodes each time on the same dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I told Zak that we never use the init parameter and to leave it out of his implementation. In the original version of FlowSOM, the default is init=FALSE and we never change that, so the complex initialization function you linked isn't relevant for us (we never use that function).

What we do use is a random initialization of nodes (https://github.com/SofieVG/FlowSOM/blob/d5fc07fd81ff1840a4f8791e0e7526d86864cf1a/R/2_buildSOM.R#L190-191), which is what is currently implemented in Zak's version when deterministic=False. I think Zak only implemented deterministic=True for testing.

And yes as you outlined and as I talk about above, I think setting a random seed before nodes is the way to go and removing deterministic. Your code outline looks good to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay great. Is there anything else besides this issue that still needs to be addressed from this PR from you @cliu72 ? We discussed today that it will be easier to merge this one in using the existing release of pyFlowSOM, and then have a separate PR that fixes the randomization issue in a different pyFlowSOM release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few more small comments

@alex-l-kong
Copy link
Contributor Author

Everything except the randomization issue (which will be handled in a separate PR) has been addressed and tested.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a question about the docker

Dockerfile Outdated Show resolved Hide resolved
Copy link
Contributor

@cliu72 cliu72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

@ngreenwald ngreenwald merged commit e9dfcc5 into main Jan 5, 2023
@ngreenwald ngreenwald deleted the pyflowsom_add branch January 5, 2023 22:23
@srivarra srivarra added the enhancement New feature or request label Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants