
Add PCA component #1270

Merged
merged 14 commits into from Oct 12, 2020

Conversation

@eccabay (Contributor) commented Oct 7, 2020

Closes #1262 by introducing a PCA component for dimensionality reduction.

As an initial sanity test, I ran AutoML on the MNIST dataset with and without the PCA component. The original dataset had 784 sparse features, after running PCA this was reduced to 154 features.

Training with current pipelines:
[Screenshot: AutoML training results, Oct 7 2020, 10:10 AM]

Training with the same pipelines but including the PCA component with default parameters:
[Screenshot: AutoML training results, Oct 7 2020, 8:48 AM]
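The 784 → 154 reduction described above is what scikit-learn's variance-threshold behavior produces, which this kind of component typically wraps. A minimal sketch using sklearn directly (the evalml component's exact API and default parameters are assumptions here, so this stays at the sklearn level):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-in for the MNIST run described above: random
# data with 784 features (the real run used MNIST itself).
rng = np.random.default_rng(0)
X = rng.random((200, 784))

# Passing a float to n_components keeps the smallest number of
# components whose cumulative explained variance reaches that
# fraction; on MNIST this is how 784 features can shrink to ~154.
pca = PCA(n_components=0.95)
X_t = pca.fit_transform(X)
print(X_t.shape[1])  # fewer than the original 784 features
```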

@codecov bot commented Oct 7, 2020

Codecov Report

Merging #1270 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main    #1270    +/-   ##
========================================
  Coverage   99.93%   99.94%            
========================================
  Files         210      213     +3     
  Lines       13247    13357   +110     
========================================
+ Hits        13239    13349   +110     
  Misses          8        8            
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
.../transformers/dimensionality_reduction/__init__.py 100.00% <100.00%> (ø)
...nents/transformers/dimensionality_reduction/pca.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_pca.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 100.00% <100.00%> (ø)
evalml/utils/gen_utils.py 100.00% <100.00%> (ø)
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6ed418e...3605e8a.

@eccabay eccabay marked this pull request as ready for review October 7, 2020 18:42
@freddyaboulton (Contributor) left a comment
@eccabay I think this is really cool!

Since this component changes the input data fed to the estimator, it affects our model understanding methods.

As it stands now, graph_partial_dependence will error if the pipeline has a PCA component, and the explain_predictions* functions won't show the original column names (not a bug, but maybe we can look into making them display the original columns).

Once this gets merged, I'll file an issue to track that!
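For context on the column-name issue: because PCA is a linear map, component-space data can in principle be projected back onto the original feature axes. A hedged sketch with scikit-learn (this is just the underlying linear algebra, not evalml's model-understanding code):

```python
import numpy as np
from sklearn.decomposition import PCA

# PCA is linear, so inverse_transform maps component-space data back
# into the original feature space, where the original column names
# apply again. Data here is illustrative only.
rng = np.random.default_rng(1)
X = rng.random((50, 10))
pca = PCA(n_components=3).fit(X)
X_t = pca.transform(X)               # shape (50, 3): component space
X_back = pca.inverse_transform(X_t)  # shape (50, 10): original feature space
```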

@angela97lin (Contributor) left a comment

This is fantastic! Just left minor comments / things I'm curious about but LGTM!

[10, 2, 2, 5],
[6, 2, 2, 5]])
pca = PCA()
expected_X_t = pd.DataFrame([[3.176246, 1.282616],
Contributor:
Just curious, how did you get these expected values? Wondering if there's a way to write / store these values in a way that will make more sense than just a 3d list of floats :O

Contributor Author (@eccabay):
Hm, I'm not sure there's a better way to write these numbers! PCA centers the features and projects them into a smaller vector space, so there's no simple or easy-to-anticipate relationship between the input and output features. Very open to suggestions, though!
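One hedged alternative (the test data below is made up for illustration, not the PR's actual fixture): since the component wraps scikit-learn, a test could derive the expected frame from sklearn at test time rather than hard-coding floats:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical test data; the point is that the expected output is
# computed from scikit-learn rather than stored as literal floats.
X = np.array([[3.0, 0.0, 1.5],
              [1.0, 2.0, 2.5],
              [7.0, 2.0, 3.5]])
expected = PCA(n_components=2).fit_transform(X)

# In the real test, the component's output would be compared with
# something like:
# np.testing.assert_allclose(component.fit_transform(X), expected)
print(expected.shape)
```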

evalml/tests/component_tests/test_components.py (outdated; resolved)
@eccabay (Contributor Author) commented Oct 12, 2020

Wrote up a slightly more in-depth summary of my testing with MNIST here

@eccabay eccabay merged commit c4f0c5a into main Oct 12, 2020
2 checks passed
@eccabay eccabay deleted the 1262_pca_component branch October 22, 2020 14:21
@dsherry dsherry mentioned this pull request Oct 29, 2020