[WIP] TruncatedSVD/PCA #78

Merged (1 commit) Nov 8, 2017
Conversation

TomAugspurger
Member

Closes #22

```python
             explained_variance_, total_var,
             explained_variance_ratio_)
        components_ = V
        singular_values_ = S.copy()
```
Member

Is the compute call above safe? Are the U and V parts of that computation likely to be large?

Do we ever use the computed U matrix explicitly?

Member Author

  • components_ (V) is (min(n_components, n_features), n_features)
  • S and U are both (min(n_components, n_features),)

So as long as the array is not too wide, this will be OK. I've mainly been assuming this up till now, though decomposition is probably the area where we're most likely to run into extremely wide arrays.
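For anyone wanting to sanity-check these shapes themselves, here is a small sketch using dask's public `da.linalg.svd_compressed` (sizes are made up for illustration; this is not code from the PR):

```python
import dask.array as da

# Hypothetical sizes, chosen for illustration only.
n_samples, n_features, n_components = 10_000, 50, 3
X = da.random.random((n_samples, n_features), chunks=(1_000, n_features))

# Compressed (randomized) SVD, the solver family discussed in this PR.
u, s, v = da.linalg.svd_compressed(X, k=n_components)

print(u.shape)  # (10000, 3): tall-skinny, grows with n_samples
print(s.shape)  # (3,): tiny, cheap to materialize
print(v.shape)  # (3, 50): small as long as n_features is modest
```

Note that `u` still scales with `n_samples`, which is why the question above about materializing U via `compute` is worth asking.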

@TomAugspurger
Member Author

Gist showing some comparisons on PCA: https://gist.github.com/dced18b10aa28cba434771a37a3576f7

Dask is slower on small datasets (1,000 x 500), but quite a bit faster on multi-GB datasets. For a 100,000 x 5,000 array of doubles (4 GB), dask took 11s vs. 35s for scikit-learn, using the 'randomized' solvers (svd_compressed for dask, Halko et al. for scikit-learn).
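The kind of head-to-head timing in the gist can be sketched as follows (array sizes shrunk for illustration; the parameters here are assumptions, not the gist's exact settings):

```python
import time
import numpy as np
import dask.array as da
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
Xnp = rng.standard_normal((20_000, 500))
Xda = da.from_array(Xnp, chunks=(5_000, 500))

# scikit-learn's randomized solver (Halko et al.)
t0 = time.time()
PCA(n_components=10, svd_solver='randomized', random_state=0).fit(Xnp)
t_sklearn = time.time() - t0

# dask's compressed (randomized) SVD
t0 = time.time()
u, s, v = da.linalg.svd_compressed(Xda, k=10, seed=0)
u, s, v = da.compute(u, s, v)
t_dask = time.time() - t0

print(f"scikit-learn: {t_sklearn:.2f}s, dask: {t_dask:.2f}s")
```

At this small size scikit-learn will usually win, consistent with the numbers above; the crossover favoring dask only shows up on multi-GB inputs.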

Merging.

@TomAugspurger TomAugspurger merged commit d5fd60f into dask:master Nov 8, 2017
@mrocklin
Member

mrocklin commented Nov 8, 2017

Nice results. Do the two methods have the same accuracy?

cc @ogrisel @marianotepper

@TomAugspurger
Member Author

> Do the two methods have the same accuracy?

Scikit-learn's solvers seem to have a higher degree of precision. In most of the tests, I needed to loosen the tolerance from something like 1e-5 to 1e-3.

I wish I had kept better notes on the patches I made to the scikit-learn test suite as I went through. I have another branch that redid everything more carefully, but it was taking a while. TomAugspurger@07bfcb2 shows what the accuracy-related changes looked like. The other commits in that branch do things like remove arpack-specific solvers and remove tests where n_samples < n_features, which the tsqr algorithm doesn't like.
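To illustrate the tolerance loosening described above (hypothetical values, not taken from the actual test suite changes):

```python
import numpy as np

exact = np.array([1.234567, 2.345678, 3.456789])
approx = exact * (1 + 1e-4)  # simulate a slightly less precise solver

# Passes at the looser tolerance used for the dask tests...
np.testing.assert_allclose(approx, exact, rtol=1e-3)

# ...but fails at scikit-learn's tighter one.
try:
    np.testing.assert_allclose(approx, exact, rtol=1e-5)
except AssertionError:
    print("fails at rtol=1e-5, as expected")
```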

TomAugspurger added a commit to dsevero/dask-ml that referenced this pull request Jan 15, 2018