Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical Stop Condition Options #808

Conversation

micdavis
Copy link
Contributor

@micdavis micdavis commented May 3, 2023

No description provided.

@taylorfturner taylorfturner added the Work In Progress Solution is being developed label May 3, 2023
@taylorfturner taylorfturner enabled auto-merge (squash) May 3, 2023 19:32
@taylorfturner taylorfturner removed the Work In Progress Solution is being developed label May 3, 2023
Comment on lines 873 to 874
max_sample_size_to_check_stop_condition: int = None,
stop_condition_unique_value_ratio: float = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typing should have the | None
max_sample_size_to_check_stop_condition: int | None = None
same for both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed

dataprofiler/profilers/profiler_options.py Show resolved Hide resolved
dataprofiler/profilers/profiler_options.py Show resolved Hide resolved
dataprofiler/profilers/profiler_options.py Outdated Show resolved Hide resolved
dataprofiler/profilers/profiler_options.py Show resolved Hide resolved
dataprofiler/profilers/profiler_options.py Outdated Show resolved Hide resolved
dataprofiler/profilers/profiler_options.py Show resolved Hide resolved
@@ -11,15 +11,66 @@ class TestCategoricalOptions(TestBaseInspectorOptions):
def test_init(self):
option = self.get_options()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another PR, but this is naming I should have caught before:
get_options -> set_and_retrieve_option_values or something. Needs to be an issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_options is confusing when we set a value in it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -32,6 +83,8 @@ def test_set(self):
params_to_check = [
dict(prop="is_enabled", value_list=[False, True]),
dict(prop="top_k_categories", value_list=[None, 3]),
dict(prop="max_sample_size_to_check_stop_condition", value_list=[None, 20]),
dict(prop="stop_condition_unique_value_ratio", value_list=[None, 2]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest adding a float here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

[expected_max_sample_size_type_error], options._validate_helper()
)

# These ones should throw a max_sample_size_type_error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment should be for unique value I think

Comment on lines 209 to 212
options = self.get_options(
max_sample_size_to_check_stop_condition=0,
stop_condition_unique_value_ratio=0.0,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the condition we discussed before, it might be good to make one of these a value. and one 0 to test that condition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do this as
stop_condition_unique_value_ratio=0.5,

and in check above
set stop_condition_unique_value_ratio=0.0

so we don't need more tests to do the same checks

Comment on lines 918 to 921
if self.stop_condition_unique_value_ratio is not None and (
not isinstance(self.stop_condition_unique_value_ratio, float)
or self.stop_condition_unique_value_ratio < 0
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is a ratio we need to: 0 <= stop_condition_unique_value_ratio <= 1.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I knew it didn't makes sense for a hard core < 0

@@ -32,6 +83,8 @@ def test_set(self):
params_to_check = [
dict(prop="is_enabled", value_list=[False, True]),
dict(prop="top_k_categories", value_list=[None, 3]),
dict(prop="max_sample_size_to_check_stop_condition", value_list=[None, 20]),
dict(prop="stop_condition_unique_value_ratio", value_list=[None, 2]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

taylorfturner
taylorfturner previously approved these changes May 3, 2023
@taylorfturner taylorfturner merged commit 8c0cf87 into capitalone:feature/memory-optimization May 3, 2023
5 checks passed
ksneab7 pushed a commit that referenced this pull request May 23, 2023
JGSweets added a commit that referenced this pull request May 23, 2023
* [WIP] Part 1 fix for categorical mem opt issue (#795)

* part_1 of fix for mem optimization for categoical dict creation issue

* precommit fix

* Separated the update from the check in stop conditions for categoical columns

* added tests and accounted for different varaibles affected by the change made to categories attribute

* Modifications to code based on test findings

* Fixes for logic and tests to match requirements from PR

* Fix for rebase carry over issue

* fixes for tests because of changes to variable names in categorical column object

* precommit fixes and improvement of code based on testing

* added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808)

* implementation of setting stop conds via options for cat column profiler (#810)

* Space time analysis improvement (#809)

* Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate)

* Modified md file to account for new variable in space time analysis code

* fix: cat bug (#816)

* hotfix for more conservatitive stop condition in categorical columns (#817)

* [WIP] Fix for histogram merging (#815)

* rough draft of merge fix for histograms

* final fixes for passing of existing tests

* Added option to remove calculations for updating row statistics (#827)

* Fix to doc strings (#829)

* Preset Option Fix: presets docsstring added (#830)

* presets docsstring added

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

---------

Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants