Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Orange implementation of randomized PCA #6815

Merged
merged 5 commits into from
May 31, 2024

Conversation

pavlin-policar
Copy link
Collaborator

This PR basically reverts #3532.

At the time this was implemented, scikit-learn did not support PCA on sparse matrices without first densifying them. It has since added this support.

Issue

PCA failing on scikit-learn 1.5 due to PCA internal API changes.

Description of changes

The reason for our own custom PCA implementation (proper sparse support) is now invalid, since scikit-learn properly handles sprase data now. Therefore, I just removed our implementation of PCA and we now rely purely on scikit-learn. Our implentation was always meant to be temporary.

Includes
  • Code changes
  • Tests
  • Documentation

scikit-learn now supports PCA on sparse matrices without densifying them first. This PR basically reverts biolab#3532. At the time this was implemented, scikit-learn did not support this yet.
@pavlin-policar
Copy link
Collaborator Author

@markotoplak Was this the only thing blocking scikit-learn=1.5.0? If so, should I also change the dependency?

@pavlin-policar
Copy link
Collaborator Author

Better sparse support was added to scikit-learn in 1.4.0 (https://scikit-learn.org/dev/auto_examples/release_highlights/plot_release_highlights_1_4_0.html#improved-memory-and-runtime-efficiency-for-pca-on-sparse-data), hence the new minimum version bump.

@pavlin-policar
Copy link
Collaborator Author

I've sifted through some of the failing tests.

On macos, we now have the following curious error. scikit-learn now supports sparse data, but uses the arpack solver (here), which requires that the number of components is less than the number of rows/columns in the data. This is now a problem, because in Orange, we allow the user to request as many principal components as there are columns, hence the failure.

Some windows and ubuntu failures seem unrelated to the changes here.

@pavlin-policar
Copy link
Collaborator Author

It seems to me that most of the tests are now failing for different reasons.

@markotoplak markotoplak merged commit 97490a7 into biolab:master May 31, 2024
9 of 28 checks passed
@pavlin-policar pavlin-policar deleted the remove-randomized-pca branch May 31, 2024 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants