Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##            main   #2487    +/-   ##
=======================================
- Coverage    99.9%   99.9%    -0.0%
=======================================
  Files         287     291       +4
  Lines       26398   26587     +189
=======================================
+ Hits        26359   26544     +185
- Misses         39      43       +4
```
Continue to review full report at Codecov.
```diff
 numpy>=1.20.0
 pandas>=1.2.1
-scipy>=1.3.3
+scipy>=1.5.0
```
Added because scipy changed how they return values for the Shapiro test (as a named tuple)
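As context for the version bump: newer scipy releases return the Shapiro-Wilk result as a named tuple rather than a plain tuple. The snippet below mimics that shape with a stdlib `namedtuple` (the `ShapiroResult` name and its `statistic`/`pvalue` fields mirror scipy's result type, but scipy itself is not imported here), showing that positional unpacking keeps working while attribute access becomes available:

```python
from collections import namedtuple

# scipy >= 1.5 returns the Shapiro-Wilk result as a named tuple with
# `statistic` and `pvalue` fields; this stand-in mimics that shape.
ShapiroResult = namedtuple("ShapiroResult", ["statistic", "pvalue"])

result = ShapiroResult(statistic=0.987, pvalue=0.42)

# Old-style positional unpacking still works...
statistic, pvalue = result

# ...but callers can now also read the fields by name.
print(result.statistic, result.pvalue)
```

Because named tuples are tuple subclasses, older code that indexes or unpacks the result is unaffected by the change; only code that relied on the exact return type would notice.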
version numbers don't care about your feelings
bchen1116
left a comment
The implementation looks pretty good! Left some comments on things that I think we could fix, as well as clarifying questions for my own understanding! I don't think we need to include the `problem_type` as an arg for this data check, but I'm open to discussing it!
| "objective": objective, | ||
| } | ||
| }, | ||
| "TargetDistributionDataCheck": {"problem_type": problem_type}, |
Do we need this `problem_type` arg for this data check? We don't do it for `ClassImbalanceDataCheck`, where we assume that the problem type is classification. I think we can probably make a similar assumption for this data check, where we add it only for regression problems here.
So these were the reasons I felt it was necessary:

- I had originally added it in case open source users try leveraging it against their data outside of `DefaultDataChecks`. A binary or multiclass problem that consists of a few unique numbers in the target data could actually be interpreted by the `TargetDistributionDataCheck` as having a lognormal distribution, although that data check message wouldn't make much sense for non-regression problems, and a user without much data science experience might not know what to do with it.
- I don't think most DataChecks need `problem_type` in their implementation, so we leave it out. However, I do feel like `ClassImbalanceDataCheck` not having it is more of an argument in favour of including it for DataChecks that rely on `problem_type`, like `UniquenessDataCheck`, rather than against including it here.

I don't mind removing it at all, since adding it here makes me feel like we should add it for any DataChecks that rely on the exclusion of a problem type. What do you think?
@ParthivNaresh This makes sense. I think this does set a new precedent of having to add the problem_type arg to DataChecks that rely on specific problem types, but that isn't necessarily a bad thing. If this seems right to you, we can move forward with it!
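A minimal sketch of the pattern being agreed on here (hypothetical class and heuristic for illustration only; the real `TargetDistributionDataCheck` in evalml runs an actual statistical test): a data check that accepts `problem_type` and short-circuits for non-regression problems, so classification targets never receive a "lognormal distribution" message.

```python
# Hypothetical, simplified sketch of a problem_type-aware data check;
# the real evalml implementation differs in both API and statistics.
class TargetDistributionCheckSketch:
    def __init__(self, problem_type):
        self.problem_type = problem_type

    def validate(self, y):
        messages = []
        # Skip entirely for classification problems, where a "lognormal
        # target" warning would not make sense to the user.
        if self.problem_type != "regression":
            return messages
        # Stand-in heuristic: flag targets that are strictly positive and
        # right-skewed (mean well above the median), which is a rough
        # signature of lognormal-looking data.
        positive = [v for v in y if v > 0]
        if y and len(positive) == len(y):
            mean = sum(y) / len(y)
            median = sorted(y)[len(y) // 2]
            if mean > 1.5 * median:
                messages.append("Target may be lognormally distributed")
        return messages
```

With this shape, a binary target like `[0, 1, 1, 0]` under `problem_type="binary"` returns no messages at all, which is exactly the behavior being discussed above.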
evalml/tests/data_checks_tests/test_target_distribution_data_check.py (outdated comment, resolved)
chukarsten
left a comment
Even though I'm requesting changes, they're super minor. I think that instead of doing that list addition, you might as well just add the datacheck to the defaults, unless there's a reason I'm missing. If there's a good reason for doing it that way, let me know and I'll approve. But this looks good - good work.
```diff
 if is_time_series:
     X, y = ts_data
 else:
     X, y = X_y_regression
```
Yea, you've probably seen this but you might need to change this branch to follow the "non_time_series" check.
```
# Conflicts:
#	core-requirements.txt
#	docs/source/api_reference.rst
#	docs/source/release_notes.rst
#	evalml/tests/component_tests/test_components.py
#	evalml/tests/dependency_update_check/minimum_core_requirements.txt
```
chukarsten
left a comment
Once CodeCov is good, I'm good
bchen1116
left a comment
Nice, LGTM! Thanks for addressing all the comments. Left one lil nit, but great work on this one!
```diff
 super().__init__(parameters={}, component_obj=None, random_seed=random_seed)


 def fit(self, X, y=None):
     """Fits the LogTransform.
```
ultra-nit: Fits the LogTransformer?
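For reference, the component being nitpicked fits and applies a log transform to the target. A stripped-down, stdlib-only sketch of the idea (the `LogTransformerSketch` name and list-based API are illustrative assumptions; evalml's real `LogTransformer` operates on pandas/Woodwork structures and handles more edge cases):

```python
import math

# Hypothetical, simplified take on a log transformer; the real evalml
# component works on pandas/Woodwork data, not plain lists.
class LogTransformerSketch:
    def fit(self, X, y=None):
        """Fits the LogTransformerSketch (no state to learn)."""
        return self

    def transform(self, X, y=None):
        """Applies the natural log to the target values."""
        if y is None:
            return X, None
        return X, [math.log(v) for v in y]

    def inverse_transform(self, y):
        """Undoes the log transform via exponentiation."""
        return [math.exp(v) for v in y]
```

Since `fit` learns nothing, the transform round-trips cleanly: `inverse_transform` applied after `transform` recovers the original target values (up to floating-point error).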
Fixes #914
Documentation and notes are here
Perf tests are here