Log transformer data check #2487

Merged
ParthivNaresh merged 31 commits into main from LogTransformerDataCheck on Jul 30, 2021

@ParthivNaresh ParthivNaresh commented Jul 9, 2021

Fixes #914

Documentation and notes are here

Perf tests are here

codecov bot commented Jul 9, 2021

Codecov Report

Merging #2487 (f516f92) into main (66dcf8f) will decrease coverage by 0.1%.
The diff coverage is 99.1%.

@@           Coverage Diff           @@
##            main   #2487     +/-   ##
=======================================
- Coverage   99.9%   99.9%   -0.0%     
=======================================
  Files        287     291      +4     
  Lines      26398   26587    +189     
=======================================
+ Hits       26359   26544    +185     
- Misses        39      43      +4     
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 97.5% <50.0%> (ø)
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_action_code.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/default_data_checks.py 100.0% <100.0%> (ø)
...alml/data_checks/target_distribution_data_check.py 100.0% <100.0%> (ø)
.../components/transformers/preprocessing/__init__.py 100.0% <100.0%> (ø)
...ents/transformers/preprocessing/log_transformer.py 100.0% <100.0%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ParthivNaresh ParthivNaresh self-assigned this Jul 19, 2021
@ParthivNaresh ParthivNaresh marked this pull request as ready for review July 19, 2021 14:29
 numpy>=1.20.0
 pandas>=1.2.1
-scipy>=1.3.3
+scipy>=1.5.0
Contributor Author

Added because SciPy changed the Shapiro test to return its values as a named tuple.
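
For context on the named-tuple change mentioned above, here is a minimal sketch of how a Shapiro result can be read. It assumes NumPy and SciPy >= 1.5.0 are installed; the synthetic lognormal target and all variable names are illustrative and are not taken from this PR.

```python
import numpy as np
from scipy import stats

# Illustrative data: a synthetic, strictly positive, right-skewed target.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# Run the Shapiro-Wilk normality test on the raw and log-transformed target.
raw = stats.shapiro(y)
logged = stats.shapiro(np.log(y))

# Named access relies on the named-tuple return value discussed above.
print("raw:   ", raw.statistic, raw.pvalue)
print("logged:", logged.statistic, logged.pvalue)

# Positional unpacking also works, since a named tuple is still a tuple.
w_stat, p_value = logged
```

A higher p-value after the log transform suggests the logged target looks closer to normal, which is the kind of comparison a target distribution check could make.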

Contributor

version numbers don't care about your feelings

@bchen1116 bchen1116 left a comment

The implementation looks pretty good! Left some comments on things that I think we could fix, as well as clarifying questions for my own understanding! I don't think we need to include the problem_type as an arg for this datacheck, but I'm open to discussing it!

"objective": objective,
}
},
"TargetDistributionDataCheck": {"problem_type": problem_type},
Contributor

Do we need this problem_type arg for this data check? We don't do it for ClassImbalanceDataCheck, where we assume that the problem type is classification. I think we can probably make a similar assumption for this datacheck, where we add it only for regression problems here

Contributor Author

So these were the reasons I felt it was necessary:

  • I originally added it in case open source users run this check against their data outside of DefaultDataChecks. A binary or multiclass target that consists of a few unique numbers could be interpreted by the TargetDistributionDataCheck as having a lognormal distribution. That data check message wouldn't make much sense for non-regression problems, and a user without much data science experience might not know what to do with it.
  • I don't think most DataChecks need problem_type in their implementation, so we leave it out. However, I do feel that ClassImbalanceDataCheck not having it is more an argument in favour of adding it to DataChecks that rely on problem_type, like UniquenessDataCheck, than an argument against including it here.

I don't mind removing it at all, since adding it here makes me feel like we should add it for any DataChecks that rely on the exclusion of a problem type. What do you think?

Contributor

@ParthivNaresh This makes sense. I think this does set a new precedent of having to add the problem_type arg to DataChecks that rely on specific problem types, but that isn't necessarily a bad thing. If this seems right to you, we can move forward with it!
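
To make the trade-off in this thread concrete, here is a rough sketch of a data check that takes a problem_type argument and stays silent for non-regression problems. The class name, the validate signature, the message text, and the mean-versus-median skew heuristic are all hypothetical; this is not evalml's TargetDistributionDataCheck, and it does not use the Shapiro test.

```python
from statistics import mean, median


class TargetSkewCheckSketch:
    """Hypothetical stand-in for a problem-type-aware data check."""

    def __init__(self, problem_type):
        self.problem_type = problem_type

    def validate(self, y):
        messages = []
        # Skip entirely for non-regression problems, so a classification
        # target with a few unique values never triggers a skew warning.
        if self.problem_type != "regression":
            return messages
        # Crude, assumed heuristic: a positive target whose mean sits far
        # above its median is treated as right-skewed.
        if min(y) > 0 and mean(y) > 1.5 * median(y):
            messages.append("Target may be right-skewed; consider a log transform.")
        return messages


skewed_target = [1, 2, 2, 3, 3, 4, 100]
regression_messages = TargetSkewCheckSketch("regression").validate(skewed_target)
binary_messages = TargetSkewCheckSketch("binary").validate(skewed_target)
print(regression_messages)
print(binary_messages)
```

With the argument in place, the same check can be called directly by a user without first deciding whether it applies to their problem type.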

@chukarsten chukarsten left a comment

Even though I'm requesting changes, they're super minor. I think that instead of doing that list addition, you might as well just add the datacheck to the defaults, unless there's a reason I'm missing. If there's a good reason for doing it that way, let me know and I'll approve. But this looks good - good work.

if is_time_series:
    X, y = ts_data
else:
    X, y = X_y_regression
Contributor

Yea, you've probably seen this but you might need to change this branch to follow the "non_time_series" check.

# Conflicts:
#	core-requirements.txt
#	docs/source/api_reference.rst
#	docs/source/release_notes.rst
#	evalml/tests/component_tests/test_components.py
#	evalml/tests/dependency_update_check/minimum_core_requirements.txt
@chukarsten chukarsten left a comment

Once CodeCov is good, I'm good

@ParthivNaresh ParthivNaresh requested a review from bchen1116 July 27, 2021 13:59
@bchen1116 bchen1116 left a comment

Nice, LGTM! Thanks for addressing all the comments. Left one lil nit, but great work on this one!

        super().__init__(parameters={}, component_obj=None, random_seed=random_seed)

    def fit(self, X, y=None):
        """Fits the LogTransform.
Contributor

ultra-nit: Fits the LogTransformer?
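
For readers who want the shape of such a component, here is a minimal standalone sketch of a target log transformer with a no-op fit. The class name, the shift used for non-positive targets, and the return shapes are assumptions made for illustration; this is not the LogTransformer code added in this PR.

```python
import math


class LogTransformerSketch:
    """Illustrative stand-in; not evalml's LogTransformer."""

    def __init__(self):
        self.shift = 0.0

    def fit(self, X, y=None):
        """Fits the LogTransformerSketch (a no-op: nothing is learned)."""
        return self

    def transform(self, X, y=None):
        """Returns X unchanged and the natural log of the target."""
        if y is None:
            return X, None
        lowest = min(y)
        # Assumption: shift non-positive targets so the log is defined.
        self.shift = 1.0 - lowest if lowest <= 0 else 0.0
        return X, [math.log(value + self.shift) for value in y]

    def inverse_transform(self, y):
        """Undoes the log and the shift applied in transform."""
        return [math.exp(value) - self.shift for value in y]


X = [[0], [1], [2]]
y = [0.0, 9.0, 99.0]
transformer = LogTransformerSketch().fit(X, y)
_, y_log = transformer.transform(X, y)
y_back = transformer.inverse_transform(y_log)
print(y_log)
print(y_back)
```

Keeping fit as a no-op mirrors the empty parameters in the fragment above: the transform itself needs no learned state beyond the shift recorded for the inverse step.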

@ParthivNaresh ParthivNaresh merged commit 2e0551b into main Jul 30, 2021
@chukarsten chukarsten mentioned this pull request Aug 3, 2021
@freddyaboulton freddyaboulton deleted the LogTransformerDataCheck branch May 13, 2022 15:02

Development

Successfully merging this pull request may close these issues.

AutoML: apply nonlinear transformation to regression target during search

3 participants