
Add Linear Discriminant Analysis component #1331

Merged
merged 14 commits into main from 1314_lda_component on Dec 14, 2020

Conversation

@eccabay (Contributor) commented Oct 21, 2020

Closes #1314

The number of features remaining after transform is called is required to be <= min(n_classes - 1, n_features). Because of this, an LDA component would only add value for multiclass classification problems, especially those with many target classes. This leads me to wonder whether it's worth including in evalml, but I'm still leaning towards doing so because it has the potential to significantly decrease training time for those cases.
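
As a quick illustration of that cap, here is a minimal sketch using scikit-learn's LinearDiscriminantAnalysis directly (not the new evalml component); the data is made up:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up data: 1000 rows, 50 numeric features, 10 target classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = rng.integers(0, 10, size=1000)

lda = LinearDiscriminantAnalysis()  # n_components defaults to min(n_classes - 1, n_features)
X_t = lda.fit_transform(X, y)

# Even with 50 input features, at most n_classes - 1 = 9 components survive.
print(X_t.shape)  # (1000, 9)
```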

Since the only real benefit is for multiclass problems, the performance tests as they stand now are not a good measure of LDA's benefits. In order to get some sort of results, I performed the same test using MNIST as I did for the PCA component.

Current AutoML search ran with these results:
[Screenshot: AutoML search results]

Custom Pipelines that added the LDA component ran with these results:
[Screenshot: custom pipeline results with the LDA component]

The 4-fold decrease in training time is spectacular to see. The performance does decrease noticeably, but the number of features is reduced from 784 down to 9. For comparison, the PCA component in its equivalent test had 152 features and reduced the performance to 95% accuracy.
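
For a rough, self-contained analogue of that comparison, here is a sketch on scikit-learn's small 8x8 digits dataset rather than MNIST itself, so the exact numbers differ from the screenshots above:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)  # 64 features, 10 classes

# PCA keeps however many components are needed to explain 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

# LDA is capped at n_classes - 1 = 9 components regardless of n_features.
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)

print(X.shape[1], X_pca.shape[1], X_lda.shape[1])
# 64 original features; PCA keeps a few dozen, LDA keeps exactly 9.
```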

I'm very interested in others' thoughts as to whether adding this component is worth it.

codecov bot commented Oct 21, 2020

Codecov Report

Merging #1331 (2463828) into main (8a4e763) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1331     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         232      234      +2     
  Lines       16639    16742    +103     
=========================================
+ Hits        16631    16734    +103     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <100.0%> (ø)
.../transformers/dimensionality_reduction/__init__.py 100.0% <100.0%> (ø)
...nents/transformers/dimensionality_reduction/lda.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_lda.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_utils.py 100.0% <100.0%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a4e763...2463828.

CLAassistant commented Nov 4, 2020

CLA assistant check
All committers have signed the CLA.

@eccabay marked this pull request as ready for review November 10, 2020 16:39
@eccabay self-assigned this Nov 10, 2020
@freddyaboulton (Contributor) left a comment
@eccabay I think this implementation is great!

I'd vote to keep this if we make it an estimator instead of a transformer for the following reason:

  • In the spectrum of interpretable to black box, LDA is on the interpretable side. It has a closed-form solution that rests on well-understood (maybe not by me 😆) statistical theory. Users may want to compare better-performing black-box estimators against simple estimators like LDA to gauge if the performance is worth the penalty in speed. This was one of the reasons we decided to add single decision trees even though they rarely outperform black-box estimators.

If we decide to go this route, we can treat the transform method that projects the data onto lower dimensions as a "model understanding" util much like #1239 would do for decision trees.

In short, we would need to performance test this to get a better sense of the trade-offs but I think there's precedent for adding estimators that sacrifice performance in favor of simplicity and speed.
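
For concreteness, here is a sketch of the distinction being drawn (using scikit-learn directly rather than evalml's component classes): the same fitted LDA object can serve as a simple, interpretable classifier via predict/score, or as a dimensionality reducer via transform.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Estimator view: closed-form, interpretable class predictions.
print(lda.score(X_test, y_test))

# Transformer / "model understanding" view: project onto the
# n_classes - 1 = 2 discriminant axes learned by the same fit.
print(lda.transform(X_test).shape)  # (n_test_rows, 2)
```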

@bchen1116 (Contributor) left a comment
Looks good! I left a few comments, but nothing blocking

if not is_all_numeric(X):
    raise ValueError("LDA input must be all numeric")

self._component_obj.fit(X, y)
Contributor (inline comment): nitpick, but can we not do super().fit(X, y)?

Contributor (inline comment): and vice versa for transform
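
To make that suggestion concrete, here is a sketch of what it could look like. The Transformer base class, is_all_numeric helper, and component wiring below are simplified stand-ins for evalml's actual classes, written only to show the super().fit / super().transform delegation the reviewers are asking about:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def is_all_numeric(X):
    # Simplified stand-in for evalml's numeric check.
    return all(pd.api.types.is_numeric_dtype(X[col]) for col in X.columns)


class Transformer:
    # Simplified stand-in base class: fit/transform delegate to the
    # wrapped scikit-learn object stored on self._component_obj.
    def __init__(self, component_obj):
        self._component_obj = component_obj

    def fit(self, X, y=None):
        self._component_obj.fit(X, y)
        return self

    def transform(self, X, y=None):
        return self._component_obj.transform(X)


class LinearDiscriminantAnalysisComponent(Transformer):
    def __init__(self):
        super().__init__(LinearDiscriminantAnalysis())

    def fit(self, X, y=None):
        if not is_all_numeric(X):
            raise ValueError("LDA input must be all numeric")
        # The reviewers' suggestion: reuse the base class delegation
        # rather than calling self._component_obj.fit(X, y) directly.
        return super().fit(X, y)

    def transform(self, X, y=None):
        if not is_all_numeric(X):
            raise ValueError("LDA input must be all numeric")
        return super().transform(X, y)
```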

@dsherry (Collaborator) commented Dec 7, 2020

> I'd vote to keep this if we make it an estimator instead of a transformer for the following reason:
>
>   • In the spectrum of interpretable to black box, LDA is on the interpretable side. It has a closed-form solution that rests on well-understood (maybe not by me 😆) statistical theory. Users may want to compare better-performing black-box estimators against simple estimators like LDA to gauge if the performance is worth the penalty in speed. This was one of the reasons we decided to add single decision trees even though they rarely outperform black-box estimators.
>
> If we decide to go this route, we can treat the transform method that projects the data onto lower dimensions as a "model understanding" util much like #1239 would do for decision trees.
>
> In short, we would need to performance test this to get a better sense of the trade-offs but I think there's precedent for adding estimators that sacrifice performance in favor of simplicity and speed.

@freddyaboulton this is a really great point. Both approaches can be useful. My suggestion: merge this PR to get the transformer in (not added to automl just yet), then file an issue to build an LDA classifier.

@eccabay mentioned this pull request Dec 14, 2020
@eccabay merged commit 0f1f1c6 into main Dec 14, 2020
1 check passed
@dsherry mentioned this pull request Dec 29, 2020
@eccabay deleted the 1314_lda_component branch March 10, 2022 15:34
Successfully merging this pull request may close these issues: Add LDA Component

5 participants