Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing sign stability in incremental PCA #742

Merged
merged 5 commits into from
Sep 24, 2020

Conversation

eric-czech
Copy link
Contributor

@eric-czech eric-czech commented Sep 21, 2020

Hey @TomAugspurger, this includes some small changes to address the wrinkle I mentioned at #737 (comment). The "u vs v" choice in svd_flip does depend on whether or not you plan on using the right or left singular vectors and there is no way (I can find) to correct them both simultaneously. Ideally, all PCA functions use a "v-based" correction since the principal components are equivalent to the right singular vectors. I chose the "u-based" correction as the default in dask though, since that's what sklearn does, which leads to the somewhat awkward scenario at this level where you have to skip the correction in dask and then apply a similar one.

It would be most convenient for this project (and what we're trying to do) if a "v-based" correction was the default. I added that in dask/dask#6658 and wanted to point that out here. That would obviate the need for any kind of svd-related sign correction in dask-ml after the next dask release.

FYI: The code before my changes was using sklearn.svd_flip.u_based_decision=False only for incremental pca to get the stability in signs that all PCA functions should probably have. I can't see any reason why that shouldn't be what we do across the board.

@eric-czech
Copy link
Contributor Author

eric-czech commented Sep 21, 2020

Also FYI, all the build failures appear to be unrelated to these changes.

@TomAugspurger
Copy link
Member

dask/dask#6658 is in now. Would this PR be a good place to start using it for Dask>2.27?

@eric-czech
Copy link
Contributor Author

Would this PR be a good place to start using it for Dask>2.27?

Yep, I'll add the extra logic for it.

Do you have a guideline in your head for how long you like to support older releases of dask before it would become appropriate to increment the current dask version bound at 2.4.0 to something more recent? I can see the tests and code for the different PCA implementations cleaning up nicely after that, but I wasn't sure if increasing that bound is deceptively complicated given https://github.com/dask/dask-ml/blob/master/ci/environment-3.6.yaml#L8. Perhaps py3.6 support is difficult beyond dask 2.4?

@TomAugspurger
Copy link
Member

TomAugspurger commented Sep 22, 2020 via email

@eric-czech
Copy link
Contributor Author

Hey @TomAugspurger, it isn't pretty but this should take care of it once dask 2.28 is out: d3c8b5f.

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pushing a commit to change the comparison a bit, so that our job testing against Dask master hits this (right now the version on master is 2.27.0+7..., which is strictly larger than 2.27, but less than 2.28.

dask_ml/_compat.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Member

Just the unrelated hyperband timeouts. Thanks!

I'll write up a changelog and get a release out today.

@TomAugspurger TomAugspurger merged commit bef33d6 into dask:master Sep 24, 2020
@eric-czech
Copy link
Contributor Author

Thanks again @TomAugspurger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants