
[WIP] TruncatedSVD/PCA #78

Merged

Merged 1 commit into dask:master on Nov 8, 2017

Conversation

@TomAugspurger
Member

TomAugspurger commented Oct 30, 2017

Closes #22

        explained_variance_, total_var,
        explained_variance_ratio_)
    components_ = V
    singular_values_ = S.copy()


@mrocklin

mrocklin Oct 31, 2017

Member

Is the compute call above safe? Are the U and V parts of that computation likely to be large?

Do we ever use the computed U matrix explicitly?


@TomAugspurger

TomAugspurger Oct 31, 2017

Member
  • components_ (V) is (min(n_components, n_features), n_features)
  • S and U are both (min(n_components, n_features),)

So as long as the array is not too wide, this will be OK. I've mainly been assuming this up till now, though decomposition is probably the area where we're most likely to run into extremely wide arrays.
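A small numpy sketch of those shapes (hypothetical sizes, and numpy's thin SVD standing in for dask's tsqr-based one): the factors kept on the client scale with n_features, not n_samples, which is why tall-and-skinny arrays are the safe case.

```python
import numpy as np

# Hypothetical sizes: a tall-and-skinny array, the case tsqr-style SVD assumes.
n_samples, n_features, n_components = 1000, 50, 5
X = np.random.default_rng(0).standard_normal((n_samples, n_features))

# Thin SVD: S is (k,) and V is (k, n_features) with k = min(n_samples, n_features),
# so the truncated pieces materialized on the client stay small.
U, S, V = np.linalg.svd(X, full_matrices=False)

components_ = V[:n_components]       # (n_components, n_features)
singular_values_ = S[:n_components]  # (n_components,)
```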

@TomAugspurger


Member

TomAugspurger commented Nov 8, 2017

Gist showing some comparisons on PCA: https://gist.github.com/dced18b10aa28cba434771a37a3576f7

Dask is slower on small datasets (1,000 x 500), but quite a bit faster on multi-GB datasets. For a 100,000 x 5,000 array of doubles (4GB), dask took 11s vs. 35s for scikit-learn, using the 'randomized' solvers (svd_compressed for dask, Halko et al. for scikit-learn).
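For reference, the 'randomized' solver family being compared can be sketched in a few lines of numpy. This is a minimal Halko-style randomized SVD, not dask's svd_compressed or scikit-learn's implementation; all sizes and the synthetic test matrix are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def randomized_svd(X, k, n_oversamples=10, n_iter=4):
    """Minimal Halko-style randomized SVD (a sketch, not dask's svd_compressed)."""
    m, n = X.shape
    # Range finder: orthonormal basis for the action of X on random vectors.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((n, k + n_oversamples)))
    # Power iterations sharpen the approximation when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    # SVD of the small (k + p, n) projection, then lift U back to m rows.
    B = Q.T @ X
    Ub, S, V = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], V[:k]

# Synthetic matrix with a known, fast-decaying spectrum: s_i = 2**-i.
m, n, k = 300, 80, 10
Uo, _ = np.linalg.qr(rng.standard_normal((m, n)))
Vo, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = (Uo * 2.0 ** -np.arange(n)) @ Vo.T

U, S, V = randomized_svd(X, k)
```

The payoff is that the expensive SVD runs on the small (k + p, n) matrix B rather than on X itself, which is what makes the approach attractive for the multi-GB arrays above.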

Merging.

@TomAugspurger TomAugspurger merged commit d5fd60f into dask:master Nov 8, 2017

2 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
ci/circleci: py36 Your tests passed on CircleCI!

@TomAugspurger TomAugspurger deleted the TomAugspurger:svd branch Nov 8, 2017

@mrocklin


Member

mrocklin commented Nov 8, 2017

Nice results. Do the two methods have the same accuracy?

cc @ogrisel @marianotepper

@TomAugspurger


Member

TomAugspurger commented Nov 8, 2017

Do the two methods have the same accuracy?

Scikit-learn's solvers seem to have a higher degree of precision. In most of the tests, I needed to loosen the tolerance from something like 1e-5 to 1e-3.
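The tolerance change can be illustrated with a toy comparison (synthetic numbers, not the actual test-suite values): a result carrying roughly 1e-4 absolute error passes at atol=1e-3 but fails at atol=1e-5.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))

S = np.linalg.svd(X, compute_uv=False)
# Stand-in for a less precise solver: singular values off by ~1e-4.
S_approx = S + 1e-4

ok_loose = np.allclose(S, S_approx, rtol=0, atol=1e-3)  # loosened tolerance passes
ok_tight = np.allclose(S, S_approx, rtol=0, atol=1e-5)  # original tolerance fails
```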

I wish I had kept better notes on the patches I made to the scikit-learn test suite as I went through. I have another branch that redid everything more carefully, but it was taking a while to finish. TomAugspurger@07bfcb2 shows what the required accuracy differences looked like. The other commits in that branch do things like removing arpack-specific solvers and removing tests where n_samples < n_features, which the tsqr algorithm doesn't support.

TomAugspurger added a commit to daniel-severo/dask-ml that referenced this pull request Jan 15, 2018
