[WIP] TruncatedSVD/PCA #78

Merged (1 commit) Nov 8, 2017
Conversation

TomAugspurger
Member

Closes #22

```python
             explained_variance_, total_var,
             explained_variance_ratio_)
        components_ = V
        singular_values_ = S.copy()
```
Member

Is the compute call above safe? Are the U and V parts of that computation likely to be large?

Do we ever use the computed U matrix explicitly?

Member Author

  • components_ (V) is (min(n_components, n_features), n_features)
  • S and U are both (min(n_components, n_features),)

So as long as the array is not too wide, this will be OK. I've mainly been assuming this up till now, though decomposition is probably the area where we're most likely to run into extremely wide arrays.
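For anyone wanting to sanity-check these shapes themselves, here is a small sketch using dask's public `da.linalg.svd_compressed` (sizes are made up for illustration; this is not code from the PR):

```python
import dask.array as da

# Hypothetical sizes, chosen for illustration only.
n_samples, n_features, n_components = 10_000, 50, 3
X = da.random.random((n_samples, n_features), chunks=(1_000, n_features))

# Compressed (randomized) SVD, the solver family discussed in this PR.
u, s, v = da.linalg.svd_compressed(X, k=n_components)

print(u.shape)  # (10000, 3): tall-skinny, grows with n_samples
print(s.shape)  # (3,): tiny, cheap to materialize
print(v.shape)  # (3, 50): small as long as n_features is modest
```

Note that `u` still scales with `n_samples`, which is why the question above about materializing U via `compute` is worth asking.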

@TomAugspurger
Member Author

Gist showing some comparisons on PCA: https://gist.github.com/dced18b10aa28cba434771a37a3576f7

Dask is slower on small datasets (1,000 x 500), but quite a bit faster on multi-GB datasets. For a 100,000 x 5,000 array of doubles (4 GB), dask took 11s vs. 35s for scikit-learn, using the 'randomized' solvers (svd_compressed for dask, Halko et al. for scikit-learn).
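The kind of head-to-head timing in the gist can be sketched as follows (array sizes shrunk for illustration; the parameters here are assumptions, not the gist's exact settings):

```python
import time
import numpy as np
import dask.array as da
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
Xnp = rng.standard_normal((20_000, 500))
Xda = da.from_array(Xnp, chunks=(5_000, 500))

# scikit-learn's randomized solver (Halko et al.)
t0 = time.time()
PCA(n_components=10, svd_solver='randomized', random_state=0).fit(Xnp)
t_sklearn = time.time() - t0

# dask's compressed (randomized) SVD
t0 = time.time()
u, s, v = da.linalg.svd_compressed(Xda, k=10, seed=0)
u, s, v = da.compute(u, s, v)
t_dask = time.time() - t0

print(f"scikit-learn: {t_sklearn:.2f}s, dask: {t_dask:.2f}s")
```

At this small size scikit-learn will usually win, consistent with the numbers above; the crossover favoring dask only shows up on multi-GB inputs.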

Merging.

@TomAugspurger TomAugspurger merged commit d5fd60f into dask:master Nov 8, 2017
@mrocklin
Member

mrocklin commented Nov 8, 2017

Nice results. Do the two methods have the same accuracy?

cc @ogrisel @marianotepper

@TomAugspurger
Member Author

> Do the two methods have the same accuracy?

Scikit-learn's solvers seem to have a higher degree of precision. In most of the tests, I needed to loosen the tolerance from something like 1e-5 to 1e-3.

I wish I had kept better notes on the patches I made to the scikit-learn test suite as I went through. I have another branch that redid everything more carefully, but it was taking a while. TomAugspurger@07bfcb2 shows what the accuracy-related changes looked like. The other commits in that branch do things like remove arpack-specific solvers and remove tests where n_samples < n_features, which the tsqr algorithm doesn't like.
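To illustrate the tolerance loosening described above (hypothetical values, not taken from the actual test suite changes):

```python
import numpy as np

exact = np.array([1.234567, 2.345678, 3.456789])
approx = exact * (1 + 1e-4)  # simulate a slightly less precise solver

# Passes at the looser tolerance used for the dask tests...
np.testing.assert_allclose(approx, exact, rtol=1e-3)

# ...but fails at scikit-learn's tighter one.
try:
    np.testing.assert_allclose(approx, exact, rtol=1e-5)
except AssertionError:
    print("fails at rtol=1e-5, as expected")
```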

TomAugspurger added a commit to dsevero/dask-ml that referenced this pull request Jan 15, 2018