[WIP] TruncatedSVD/PCA #78
Conversation
dask_ml/decomposition.py (Outdated)

            explained_variance_, total_var,
            explained_variance_ratio_)
        components_ = V
        singular_values_ = S.copy()
Is the compute call above safe? Are the U and V parts of that computation likely to be large? Do we ever use the computed U matrix explicitly?
components_ (V) has shape (min(n_components, n_features), n_features); S and U are both (min(n_components, n_features),).

So as long as the array is not too wide, this will be OK. I've mainly been assuming this up till now, though decomposition is probably the area where we're most likely to run into extremely wide arrays.
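To make the shape discussion above concrete, here is a small illustration using plain NumPy's SVD (the PR itself uses dask's SVD routines; the sizes below are made up for demonstration). It shows why computing U eagerly is the risky part: U scales with n_samples, while S and V stay small as long as n_features is modest.

```python
import numpy as np

# Illustrative sketch with plain NumPy (not the dask-ml implementation).
# Truncated SVD keeping k components of an (n_samples, n_features) array:
n_samples, n_features, k = 1000, 50, 10
X = np.random.default_rng(0).standard_normal((n_samples, n_features))

U, S, V = np.linalg.svd(X, full_matrices=False)
U, S, V = U[:, :k], S[:k], V[:k]  # truncate to k components

print(U.shape)  # (1000, 10) -- grows with n_samples, so U can be large
print(S.shape)  # (10,)      -- tiny
print(V.shape)  # (10, 50)   -- small as long as the array is not too wide
```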
Gist showing some comparisons on PCA: https://gist.github.com/dced18b10aa28cba434771a37a3576f7. Dask is slower on small datasets (1,000 x 500), but quite a bit faster on multi-GB datasets. For a 100,000 x 5,000 array of doubles (4 GB), dask took 11s vs. 35s for scikit-learn, using the 'randomized' solvers (svd_compressed for dask, Halko et al. for scikit-learn). Merging.
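For readers unfamiliar with the 'randomized' solvers mentioned above, here is a minimal NumPy sketch of a randomized SVD in the spirit of Halko et al. It is not the implementation used by dask or scikit-learn, and the function name and parameters are my own; it only illustrates the range-sampling idea that makes these solvers fast on large arrays.

```python
import numpy as np

def randomized_svd(X, k, n_oversample=10, n_iter=4, seed=0):
    """Minimal randomized-SVD sketch (after Halko et al.); illustrative only."""
    rng = np.random.default_rng(seed)
    # Sample the range of X with a random Gaussian test matrix.
    Q = X @ rng.standard_normal((X.shape[1], k + n_oversample))
    # Power iterations sharpen the spectrum when singular values decay slowly.
    for _ in range(n_iter):
        Q = X @ (X.T @ Q)
    Q, _ = np.linalg.qr(Q)  # orthonormal basis for the sampled range
    # Exact SVD of the small projected matrix.
    Uh, S, V = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ Uh)[:, :k], S[:k], V[:k]

# Low-rank signal plus small noise: the regime where this works well.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 100))
X += 0.01 * rng.standard_normal((2000, 100))

U, S, V = randomized_svd(X, k=5)
_, S_exact, _ = np.linalg.svd(X, full_matrices=False)
print(np.max(np.abs(S - S_exact[:5]) / S_exact[:5]))  # small relative error
```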
Nice results. Do the two methods have the same accuracy?
Scikit-learn's solvers seem to have a higher degree of precision. In most of the tests, I needed to loosen the tolerance. I wish I had kept better notes on the patches I made to the scikit-learn test suite as I was going through. I have another branch that redid everything in a more careful way, but it was taking a while to re-do everything. TomAugspurger@07bfcb2 shows what the required differences for accuracy were like. The other commits in that branch do things like remove arpack-specific solvers, and remove tests where
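The kind of tolerance adjustment described above can be illustrated with a toy example (the numbers here are hypothetical, not the actual values from the test-suite patches): a comparison that fails at a strict relative tolerance can pass once the tolerance is relaxed to accommodate a randomized solver's precision.

```python
import numpy as np

# Hypothetical illustration: randomized solvers typically need looser
# tolerances than exact ones when comparing across implementations.
exact = np.array([3.0, 2.0, 1.0])
approx = exact * (1 + 1e-5)  # stand-in for a randomized solver's output

# A strict tolerance rejects the result...
assert not np.allclose(approx, exact, rtol=1e-7)
# ...while a relaxed one accepts it.
np.testing.assert_allclose(approx, exact, rtol=1e-4)
```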
Closes #22