[ENH] Implement better randomized PCA #3532

pavlin-policar · 2019-01-11T19:42:23Z

Issue

Scikit-learn's PCA does not support sparse data. This PR implements randomized PCA which efficiently handles sparse matrices, without ever centering the original matrix. This means that running PCA is now feasible (and quite fast) for very large, sparse matrices, which would not fit into memory if densified.
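To make the "without ever centering" idea concrete, here is a minimal sketch, not the code from this PR. It relies on the identity (X - 1mᵀ)B = XB - 1(mᵀB), so the column means are subtracted from the small dense products instead of from X itself. The function name `randomized_pca_sketch` and its default parameter values are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp


def randomized_pca_sketch(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate PCA of a sparse matrix X without densifying or centering it.

    Hypothetical helper for illustration; returns (components, singular
    values, column means).
    """
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (n_features,)
    k = n_components + n_oversamples

    def centered_dot(B):
        # (X - 1 mean^T) @ B, computed without forming the centered matrix
        return X @ B - (mean @ B)[np.newaxis, :]

    def centered_t_dot(B):
        # (X - 1 mean^T)^T @ B, computed without forming the centered matrix
        return X.T @ B - np.outer(mean, B.sum(axis=0))

    # Random range finder with QR-stabilized power iterations.
    Q = rng.normal(size=(n_features, k))
    Q, _ = np.linalg.qr(centered_dot(Q))
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(centered_t_dot(Q))
        Q, _ = np.linalg.qr(centered_dot(Q))

    # Project the (implicitly) centered data onto the small subspace
    # and run an exact SVD there.
    B = centered_t_dot(Q).T  # shape (k, n_features)
    _, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:n_components], S[:n_components], mean
```

Only products of X with thin dense matrices are ever formed, so memory use stays proportional to the number of non-zeros plus a few dense arrays with k columns.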

This implementation should be removed once I get it merged into scikit-learn and once that's released.

I am aware of two reference implementations, both of which were used to put together this code:

https://github.com/KlugerLab/pcafast
https://github.com/facebook/fbpca/ (I would just use this, but it's not being maintained and doesn't support random seeds)

Description of changes

I added a degrees-of-freedom (ddof) parameter to var and std to be consistent with scikit-learn's covariance. I made the smallest changes possible, and we still use scikit-learn's exact PCA; I've only replaced the randomized solver. This means that ImprovedPCA is mostly copied from scikit-learn's PCA, with slight changes to support sparse matrices. randomized_pca computes the PCA itself. I've used parameter values from scikit-learn, but these seem to be on the safe side, i.e., they favor accuracy over speed. In practice, we could likely reduce the number of power iterations and the number of oversamples and still get decent results.
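As a rough illustration of what a ddof-aware variance on sparse data can look like, here is a sketch under my own assumptions; it is not the code added to the statistics utilities in this PR, and the name `sparse_var` is hypothetical. It uses E[x²] - E[x]² and then rescales by n / (n - ddof), so the matrix is never densified.

```python
import numpy as np
import scipy.sparse as sp


def sparse_var(X, axis=0, ddof=0):
    """Variance along `axis` of a sparse matrix, without densifying it.

    Hypothetical helper for illustration only.
    """
    n = X.shape[axis]
    mean = np.asarray(X.mean(axis=axis)).ravel()
    mean_sq = np.asarray(X.multiply(X).mean(axis=axis)).ravel()
    # E[x^2] - E[x]^2 is the population variance (ddof=0);
    # multiply by n / (n - ddof) to apply the requested correction.
    return (mean_sq - mean ** 2) * n / (n - ddof)


def sparse_std(X, axis=0, ddof=0):
    """Standard deviation counterpart of sparse_var (also hypothetical)."""
    return np.sqrt(sparse_var(X, axis=axis, ddof=ddof))
```

With ddof=1 this should agree with `np.var(X.toarray(), axis=0, ddof=1)` up to floating-point error, although the E[x²] - E[x]² form can lose precision when the mean is large relative to the spread.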

I didn't need to touch OWPCA, and it now supports sparse matrices. I've left truncated SVD in for the time being. If anybody really wants to keep it, fine, but IMO there's no reason for it to stay in the PCA widget, or anywhere at all. Why would you ever run SVD when PCA is available?

Includes

Code changes
Tests
Documentation

mstrazar · 2019-01-21T15:50:14Z

At the moment, this PR does not override the usages of PCA across Orange.

The culprit is the _fit method of ImprovedPCA. Because ImprovedPCA extends skl_decomposition.PCA, it doesn't override its fit method. Therefore, the call proj = proj.fit(X, Y) calls the latter instead of ImprovedPCA._fit.

class PCA(SklProjector, _FeatureScorerMixin):
    __wraps__ = ImprovedPCA
    name = 'PCA'
    supports_sparse = True

    def __init__(self, n_components=None, copy=True, whiten=False,
                 svd_solver='auto', tol=0.0, iterated_power='auto',
                 random_state=None, preprocessors=None):
        super().__init__(preprocessors=preprocessors)
        self.params = vars()

    def fit(self, X, Y=None):
        params = self.params.copy()
        if params["n_components"] is not None:
            params["n_components"] = min(min(X.shape), params["n_components"])
        proj = self.__wraps__(**params)
        proj = proj.fit(X, Y)    # <---- problem here
        return PCAModel(proj, self.domain, len(proj.components_))

codecov · 2019-01-25T08:04:20Z

Codecov Report

No coverage uploaded for pull request base (master@e6c793a).
The diff coverage is 87.17%.

@@            Coverage Diff            @@
##             master    #3532   +/-   ##
=========================================
  Coverage          ?   83.66%           
=========================================
  Files             ?      370           
  Lines             ?    66280           
  Branches          ?        0           
=========================================
  Hits              ?    55456           
  Misses            ?    10824           
  Partials          ?        0

pavlin-policar · 2019-01-25T08:08:50Z

@mstrazar I don't think that's right. skl_decomposition.PCA.fit directly calls skl_decomposition.PCA._fit, which in turn delegates fitting to skl_decomposition.PCA._fit_exact or skl_decomposition.PCA._fit_truncated, depending on the solver.

I did override skl_decomposition.PCA._fit and skl_decomposition.PCA._fit_truncated, and as far as I can tell, any call to the Orange PCA wrapper also calls my randomized method. I even added a test for this just now. Unless the widgets are calling sklearn's PCA directly, this should work properly.
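The dispatch behavior described here can be shown with a toy example. The classes `Base` and `Improved` below are hypothetical stand-ins for skl_decomposition.PCA and ImprovedPCA; they only mimic the fit/_fit relationship and nothing else about either class.

```python
class Base:
    def fit(self, X):
        # Public entry point; delegates to the private hook via `self`.
        self._fit(X)
        return self

    def _fit(self, X):
        self.solver_used = "base"


class Improved(Base):
    # fit() is inherited unchanged, but inside it `self._fit`
    # resolves to this override because lookup starts at the instance's class.
    def _fit(self, X):
        self.solver_used = "improved"


model = Improved().fit([[0.0, 1.0]])
print(model.solver_used)  # improved
```

So a subclass does not need to redefine fit for its own _fit to run, as long as the inherited fit calls `self._fit`.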

scikit-learn now supports PCA on sparse matrices without densifying them first. This PR basically reverts biolab#3532. At the time this was implemented, scikit-learn did not support this yet.

Statistics.utils: Add ddof parameter to var and std functions

95608c0

pavlin-policar changed the title ~~[ENH] Implement PCA on sparse data~~ [ENH] Implement better randomized PCA Jan 11, 2019

PCA: Implement randomized PCA which supports sparse data

5be6088

pavlin-policar force-pushed the pca-for-sparse branch from 4669952 to 5be6088 Compare January 11, 2019 20:02

janezd assigned mstrazar Jan 18, 2019

PCA: Add test to ensure ImprovedPCA is properly called

cc54454

mstrazar merged commit 78776d9 into biolab:master Jan 25, 2019

pavlin-policar deleted the pca-for-sparse branch January 25, 2019 08:57

pavlin-policar mentioned this pull request Feb 1, 2019

[ENH] t-SNE: Add Normalize data checkbox #3570

Merged

3 tasks

pavlin-policar mentioned this pull request May 31, 2024

Remove Orange implementation of randomized PCA #6815

Merged

3 tasks
