[ENH] Implement better randomized PCA #3532

Merged (3 commits) on Jan 25, 2019

Conversation

pavlin-policar
Collaborator

Issue

Scikit-learn's PCA does not support sparse data. This PR implements randomized PCA which efficiently handles sparse matrices, without ever centering the original matrix. This means that running PCA is now feasible (and quite fast) for very large, sparse matrices, which would not fit into memory if densified.
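As a rough illustration of the "never centering the original matrix" idea, here is a minimal sketch and not the code from this PR. It relies on the identity (X - 1·mean^T)·B = X·B - 1·(mean^T·B), so the mean is subtracted from the small dense products instead of from X. The function name `randomized_pca_sketch` and its parameters are hypothetical and chosen only for this example.

```python
import numpy as np
import scipy.sparse as sp


def randomized_pca_sketch(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Hypothetical helper: randomized PCA of a sparse X with implicit centering."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (n_features,)
    ones = np.ones(n_samples)

    def centered_dot(B):
        # (X - 1 mean^T) @ B, without ever forming the dense centered matrix
        return X @ B - np.outer(ones, mean @ B)

    def centered_t_dot(B):
        # (X - 1 mean^T)^T @ B
        return X.T @ B - np.outer(mean, ones @ B)

    # Size of the random subspace, capped by the matrix dimensions
    k = min(n_components + n_oversamples, min(X.shape))
    Q = centered_dot(rng.normal(size=(n_features, k)))

    # Power iterations, re-orthonormalized at every step for stability
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Q)
        Q, _ = np.linalg.qr(centered_t_dot(Q))
        Q = centered_dot(Q)
    Q, _ = np.linalg.qr(Q)

    # Project the (implicitly) centered matrix onto the subspace, then run a small SVD
    B = centered_t_dot(Q).T  # equals Q^T (X - 1 mean^T), shape (k, n_features)
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :n_components], S[:n_components], Vt[:n_components]


X = sp.random(50, 12, density=0.2, format="csr", random_state=0)
U, S, Vt = randomized_pca_sketch(X, n_components=3)
```

Only products of X with thin dense matrices are ever computed, so memory use stays proportional to the number of non-zeros plus a few (n_samples × k) and (n_features × k) buffers.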

This implementation should be removed once I get it merged into scikit-learn and once that's released.

There are two reference implementations that I'm aware of, both of which were used to put together this code:

Description of changes
  • I added degrees of freedom to var and std to be consistent with scikit-learn's covariance.
  • I made the minimal changes possible, and we still use scikit-learn's exact PCA; I've only replaced the randomized solver. This means that ImprovedPCA is mostly copied from scikit-learn's PCA, with slight changes to support sparse matrices.
  • randomized_pca computes the PCA itself. I've used parameter values from scikit-learn, but these seem to be on the safe side, i.e. they prefer accuracy over speed. In practice, we could likely reduce the number of power iterations and the number of oversamples and still get decent results.
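To illustrate the degrees-of-freedom point in the first bullet, here is a hedged sketch of a column-wise variance on a sparse matrix with a ddof argument. The helper name `sparse_var_sketch` is hypothetical and this is not Orange's implementation; it only shows that var = (E[x²] - E[x]²) · n / (n - ddof) can be evaluated without densifying.

```python
import numpy as np
import scipy.sparse as sp


def sparse_var_sketch(X, ddof=0):
    """Hypothetical helper: per-column variance of a sparse matrix."""
    n = X.shape[0]
    mean = np.asarray(X.mean(axis=0)).ravel()
    # Element-wise square stays sparse, so zeros are never materialized
    mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()
    # Rescale the population variance by n / (n - ddof)
    return (mean_sq - mean ** 2) * n / (n - ddof)


X = sp.random(40, 6, density=0.3, format="csr", random_state=1)
var_unbiased = sparse_var_sketch(X, ddof=1)
```

With ddof=1 this should match the normalization of a sample covariance matrix. Note that the E[x²] - E[x]² form can lose precision when the mean is large relative to the spread.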

I didn't need to touch OWPCA; it now supports sparse matrices as is. I've left truncated SVD inside for the time being. If anybody really wants to keep it, fine, but IMO there's no reason it should stay in the PCA widget or anywhere at all. Why would you ever run SVD when PCA is available?

Includes
  • Code changes
  • Tests
  • Documentation

@pavlin-policar pavlin-policar changed the title [ENH] Implement PCA on sparse data [ENH] Implement better randomized PCA Jan 11, 2019
@mstrazar
Contributor

mstrazar commented Jan 21, 2019

At the moment, this PR does not override the usages of PCA across Orange.

The culprit is the _fit method of ImprovedPCA. ImprovedPCA extends skl_decomposition.PCA but doesn't override its fit method. Therefore, the call proj = proj.fit(X, Y) calls skl_decomposition.PCA.fit instead of ImprovedPCA._fit.

class PCA(SklProjector, _FeatureScorerMixin):
    __wraps__ = ImprovedPCA
    name = 'PCA'
    supports_sparse = True

    def __init__(self, n_components=None, copy=True, whiten=False,
                 svd_solver='auto', tol=0.0, iterated_power='auto',
                 random_state=None, preprocessors=None):
        super().__init__(preprocessors=preprocessors)
        self.params = vars()

    def fit(self, X, Y=None):
        params = self.params.copy()
        if params["n_components"] is not None:
            params["n_components"] = min(min(X.shape), params["n_components"])
        proj = self.__wraps__(**params)
        proj = proj.fit(X, Y)    # <---- problem here
        return PCAModel(proj, self.domain, len(proj.components_))

@codecov

codecov bot commented Jan 25, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@e6c793a).
The diff coverage is 87.17%.

@@            Coverage Diff            @@
##             master    #3532   +/-   ##
=========================================
  Coverage          ?   83.66%           
=========================================
  Files             ?      370           
  Lines             ?    66280           
  Branches          ?        0           
=========================================
  Hits              ?    55456           
  Misses            ?    10824           
  Partials          ?        0

@pavlin-policar
Collaborator Author

@mstrazar I don't think that's right. skl_decomposition.PCA.fit directly calls skl_decomposition.PCA._fit, which in turn delegates fitting to skl_decomposition.PCA._fit_exact or skl_decomposition.PCA._fit_truncated, depending on the solver.

I did override skl_decomposition.PCA._fit and skl_decomposition.PCA._fit_truncated, and as far as I can tell, any call to the Orange PCA wrapper also calls my randomized method. I even added a test for this just now. Unless the widgets are calling sklearn's PCA directly, this should work properly.
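The dispatch argument above can be checked with a toy example. This is a minimal sketch with stand-in classes (`BasePCA` and `ImprovedPCASketch` are hypothetical names, not the scikit-learn or Orange classes): an inherited public fit() that delegates to self._fit() picks up the subclass override, because the lookup goes through the instance.

```python
class BasePCA:
    """Stand-in for a library class whose public fit() delegates to _fit()."""

    def fit(self, X):
        # self._fit is resolved on the instance's class, not on BasePCA
        self._fit(X)
        return self

    def _fit(self, X):
        self.solver_used_ = "base"


class ImprovedPCASketch(BasePCA):
    """Hypothetical subclass: overrides only the private hook, not fit()."""

    def _fit(self, X):
        self.solver_used_ = "improved"


model = ImprovedPCASketch().fit([[0.0, 1.0]])
print(model.solver_used_)  # improved
```

So a wrapper that instantiates the subclass and calls the inherited fit() still ends up in the overridden _fit.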

@mstrazar mstrazar merged commit 78776d9 into biolab:master Jan 25, 2019
@pavlin-policar pavlin-policar deleted the pca-for-sparse branch January 25, 2019 08:57
pavlin-policar added a commit to pavlin-policar/orange3 that referenced this pull request May 31, 2024
scikit-learn now supports PCA on sparse matrices without densifying them first. This PR basically reverts biolab#3532. At the time this was implemented, scikit-learn did not support this yet.