[ENH] Implement better randomized PCA #3532
Conversation
At the moment, this PR does not override the usages of PCA across Orange. The culprit is the
Codecov Report

@@            Coverage Diff            @@
##             master    #3532   +/-  ##
=========================================
  Coverage          ?   83.66%
=========================================
  Files             ?      370
  Lines             ?    66280
  Branches          ?        0
=========================================
  Hits              ?    55456
  Misses            ?    10824
  Partials          ?        0
=========================================
@mstrazar I don't think that's right. I did override
scikit-learn now supports PCA on sparse matrices without densifying them first. This PR basically reverts biolab#3532. At the time this was implemented, scikit-learn did not support this yet.
Issue
Scikit-learn's PCA does not support sparse data. This PR implements randomized PCA, which handles sparse matrices efficiently without ever centering the original matrix. This means that running PCA is now feasible (and quite fast) on very large, sparse matrices that would not fit into memory if densified.
This implementation should be removed once I get it merged into scikit-learn and once that's released.
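The key trick described above (PCA on a sparse matrix without ever forming the centered matrix) can be sketched in plain NumPy/SciPy. This is an illustrative sketch, not the PR's actual code: the function name, parameter defaults, and structure are assumptions, but they follow the standard Halko et al. randomized SVD recipe with implicit mean-centering.

```python
import numpy as np
import scipy.sparse as sp


def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Randomized PCA of a (sparse) matrix without densifying it.

    The centered matrix X - 1*mean^T is never formed explicitly; products
    with it are expanded as  (X - 1*mean^T) V = X V - 1 (mean^T V).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()  # dense vector of column means

    def mul(V):   # (X - 1*mean^T) @ V, without materializing the dense matrix
        return X @ V - np.outer(np.ones(n), mean @ V)

    def rmul(U):  # (X - 1*mean^T).T @ U
        return X.T @ U - np.outer(mean, U.sum(axis=0))

    k = n_components + n_oversamples
    # Range finder: orthonormal basis for the range of the centered matrix.
    Q, _ = np.linalg.qr(mul(rng.standard_normal((d, k))))
    for _ in range(n_iter):  # power iterations improve accuracy
        Q, _ = np.linalg.qr(rmul(Q))
        Q, _ = np.linalg.qr(mul(Q))
    # Project onto the subspace and take an exact SVD of the small matrix.
    B = rmul(Q).T  # shape (k, d)
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:n_components], s[:n_components]
```

The point of the `mul`/`rmul` expansion is that only sparse-times-dense products and small dense corrections are ever computed, so memory stays proportional to the sparse data plus a few thin dense matrices.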
There are two reference implementations I'm aware of, which were used to put together this code:
Description of changes

- Changed `var` and `std` to be consistent with scikit-learn's covariance. I made the minimal changes possible, and we still use scikit-learn's exact PCA - I've just replaced the `randomized` solver. This means that `ImprovedPCA` is mostly copied from scikit-learn's PCA, with slight changes in order to support sparse matrices.
- `randomized_pca` computes PCA. I've used parameter values from scikit-learn, but these seem to be on the safe side in terms of accuracy, i.e. they prefer accuracy over speed. In practice, it's likely we could reduce the number of power iterations and the number of oversamples and still get decent results.
- I didn't need to touch `OWPCA`, but it already supports sparse matrices now.
- I've left truncated SVD inside for the time being. If anybody really wants to keep it, fine, but IMO there's no reason it should stay in the PCA widget or anywhere at all. Why would you ever run SVD when PCA is available?
- Includes
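On the accuracy-vs-speed trade-off for the number of power iterations: a toy experiment makes it concrete. This is a pure-NumPy illustration (not the PR's implementation), using a matrix with a nearly flat spectrum, which is the hard case where randomized SVD needs power iterations to stay accurate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct a 300x300 matrix with known, slowly decaying singular values.
s_true = np.linspace(1.0, 0.5, 300)
U, _ = np.linalg.qr(rng.standard_normal((300, 300)))
V, _ = np.linalg.qr(rng.standard_normal((300, 300)))
X = (U * s_true) @ V.T  # SVD of X is U diag(s_true) V^T


def topk_singular_values(X, k, n_iter, n_oversamples=10, seed=0):
    """Randomized estimate of the top-k singular values of X."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(X @ rng.standard_normal((X.shape[1], k + n_oversamples)))
    for _ in range(n_iter):  # each power iteration sharpens the spectrum
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    return np.linalg.svd(Q.T @ X, compute_uv=False)[:k]


def err(n_iter):
    """Worst-case error of the top-5 singular value estimates."""
    return np.abs(topk_singular_values(X, 5, n_iter) - s_true[:5]).max()
```

Comparing `err(0)` against `err(7)` shows the error shrinking as power iterations increase, at the cost of two extra sparse-matrix products and QR factorizations per iteration; that is exactly the knob the PR leaves at scikit-learn's conservative defaults.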