Add Linear Discriminant Analysis component #1331
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #1331     +/-  ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         232      234       +2
  Lines       16639    16742     +103
=========================================
+ Hits        16631    16734     +103
  Misses          8        8
Continue to review full report at Codecov.
@eccabay I think this implementation is great!
I'd vote to keep this if we make it an estimator instead of a transformer for the following reason:
- In the spectrum of interpretable to black box, LDA is on the interpretable side. It has a closed-form solution that rests on well-understood (maybe not by me 😆) statistical theory. Users may want to compare better-performing black-box estimators against simple estimators like LDA to gauge whether the performance gain is worth the penalty in speed. This was one of the reasons we decided to add single decision trees even though they rarely outperform black-box estimators.
If we decide to go this route, we can treat the transform method that projects the data onto lower dimensions as a "model understanding" util much like #1239 would do for decision trees.
In short, we would need to performance test this to get a better sense of the trade-offs but I think there's precedent for adding estimators that sacrifice performance in favor of simplicity and speed.
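For concreteness, here is a minimal sketch of this dual role using plain scikit-learn rather than the evalml component wrapper; `load_digits` is just a convenient 10-class stand-in dataset:

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 10 classes, 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# As an estimator: a closed-form baseline to compare black-box models against.
print("accuracy:", lda.score(X_test, y_test))

# As a transformer: the same fitted model projects the data onto at most
# n_classes - 1 discriminant axes -- the "model understanding" view.
print("projected shape:", lda.transform(X_test).shape)  # (450, 9)
```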
Looks good! I left a few comments, but nothing blocking
if not is_all_numeric(X):
    raise ValueError("LDA input must be all numeric")

self._component_obj.fit(X, y)
nitpick, but can we not do `super().fit(X, y)`?
And vice versa for `transform`.
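A self-contained sketch of what this suggestion amounts to; `BaseTransformer` and `is_all_numeric` below are stand-ins for evalml's actual `Transformer` base class and numeric-check helper, on the assumption (implied by the nitpick) that the base class `fit` already delegates to the wrapped scikit-learn object in `self._component_obj`:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as SkLDA


def is_all_numeric(X):
    """Stand-in for evalml's is_all_numeric helper."""
    df = pd.DataFrame(X)
    return df.shape[1] == df.select_dtypes(include="number").shape[1]


class BaseTransformer:
    """Stand-in for evalml's Transformer base class."""

    def fit(self, X, y=None):
        # The base class owns the delegation to the wrapped sklearn object.
        self._component_obj.fit(X, y)
        return self


class LDA(BaseTransformer):
    def __init__(self):
        self._component_obj = SkLDA()

    def fit(self, X, y=None):
        if not is_all_numeric(X):
            raise ValueError("LDA input must be all numeric")
        # Validate, then reuse the base class delegation rather than
        # calling self._component_obj.fit(X, y) directly.
        return super().fit(X, y)
```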
@freddyaboulton this is a really great point. Both approaches can be useful. My suggestion: merge this PR to get the transformer in (not added to AutoML just yet), then file an issue to build an LDA classifier.
Closes #1314
The number of features remaining after `transform` is called is required to be `<= min(n_classes - 1, n_features)`. Because of this, an LDA component would only add value for multiclass classification problems, especially those with many target classes. That makes me wonder whether it's worth including in evalml, but I'm still leaning towards doing so because it has the potential to significantly decrease training time in those cases.

Since the only real benefit is for multiclass problems, the performance tests as they stand now are not a good measure of LDA's benefits. To get some results, I ran the same test on MNIST as I did for the PCA component.
Current AutoML search ran with these results:

Custom Pipelines that added the LDA component ran with these results:

The 4-fold decrease in training time is spectacular to see. Performance does decrease noticeably, but the number of features drops from 784 to 9. For comparison, the PCA component in its equivalent test kept 152 features and reduced accuracy to 95%.
I'm very interested in others' thoughts as to whether adding this component is worth it.