Add PCA component #1270
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main    #1270    +/-   ##
========================================
  Coverage   99.93%   99.94%
========================================
  Files         210      213     +3
  Lines       13247    13357   +110
========================================
+ Hits        13239    13349   +110
  Misses          8        8
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay I think this is really cool!
Since this component changes the input data fed to the estimator, it has an impact on our model understanding methods.
As it stands now, graph_partial_dependence will error if the pipeline has a PCA component, and the explain_predictions* functions don't show the original column names (which is not a bug, but maybe we can look into making them display the original columns).
Once this gets merged, I'll file an issue to track that!
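For concreteness, here is a minimal sketch of the partial-dependence interaction described above. The pipeline definition and exact signatures are assumptions based on evalml's public interface, not code from this PR:

```python
import numpy as np
import pandas as pd
from evalml.pipelines import BinaryClassificationPipeline
from evalml.model_understanding import graph_partial_dependence

# Toy data with named columns so the failure mode is visible.
X = pd.DataFrame(np.random.rand(50, 4), columns=["a", "b", "c", "d"])
y = pd.Series(np.random.randint(0, 2, 50))

# Hypothetical pipeline: PCA feeding a random forest.
pipeline = BinaryClassificationPipeline(["PCA", "Random Forest Classifier"])
pipeline.fit(X, y)

# PCA replaces the original columns with principal components, so asking
# for partial dependence on an original column is expected to error here.
graph_partial_dependence(pipeline, X, features="a")
```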
This is fantastic! Just left minor comments / things I'm curious about but LGTM!
[10, 2, 2, 5],
 [6, 2, 2, 5]])
pca = PCA()
expected_X_t = pd.DataFrame([[3.176246, 1.282616],
Just curious, how did you get these expected values? Wondering if there's a way to write/store these values that makes more sense than just a 2D list of floats :O
Hm, I'm not sure there's a better way to write these numbers! PCA centers the features and projects them down to a smaller vector space, so there's no simple or easy-to-anticipate relationship between the input and output features. Very open to any suggestions, though!
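For reference, here's a sketch of how such expected values can be generated, assuming the component wraps sklearn.decomposition.PCA; the first two data rows below are made-up placeholders to complete the fragment quoted above:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical 4x4 input; only the last two rows come from the quoted test.
X = pd.DataFrame([[3, 0, 1, 6],
                  [1, 2, 1, 6],
                  [10, 2, 2, 5],
                  [6, 2, 2, 5]])

# fit_transform centers each column and projects onto the top principal
# components; printing the result is how literals like 3.176246 would be
# captured for the test's expected DataFrame.
pca = PCA(n_components=2)
print(pd.DataFrame(pca.fit_transform(X)).round(6))
```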
Wrote up a slightly more in-depth summary of my testing with MNIST here.
Closes #1262 by introducing a PCA component for dimensionality reduction.
As an initial sanity test, I ran AutoML on the MNIST dataset with and without the PCA component. The original dataset had 784 sparse features; after running PCA, this was reduced to 154.
Training with the current pipelines: (results screenshot)
Training with the same pipelines plus the PCA component with default parameters: (results screenshot)
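For reproducibility, here's a sketch of the standalone MNIST check described above. The import path and the variance-threshold parameter name are assumptions about the new component's API, not confirmed from this diff:

```python
from sklearn.datasets import fetch_openml
from evalml.pipelines.components import PCA  # assumed import path

# Load MNIST (784 pixel features).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=True)

# Assumed parameterization: keep enough components to explain 95% of variance.
pca = PCA(variance=0.95)
X_t = pca.fit_transform(X, y)

print(X.shape[1], "->", X_t.shape[1])  # per the PR: 784 -> 154
```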