Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Part 1 fix for categorical mem opt issue #795

Conversation

ksneab7
Copy link
Contributor

@ksneab7 ksneab7 commented Apr 26, 2023

Dataset used is a single column ordered dataset
Conditions set:
self.max_sample_size_to_check_stop_condition = 100
self.stop_condition_unique_value_ratio = .90

  • 0
    • Profile
      • Pre change
        • Total space allocated: N/A
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.02222418785095215
      • Post change
        • Total space allocated: N/A
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.022262096405029297
    • Merge
      • Pre change
        • Total space allocated: N/A
        • Line specific space allocated (Line number/s):N/A
        • Merge runtime: 1.4066696166992188e-05
      • Post change
        • Total space allocated: N/A
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 1.0967254638671875e-05
  • 100
    • Profile
      • Pre change
        • Total space allocated: 449.9 KiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.11128807067871094
      • Post change
        • Total space allocated: 449.9 KiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.10366582870483398
    • Merge
      • Pre change
        • Total space allocated: 160.8 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.02127695083618164
      • Post change
        • Total space allocated: 160.8 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.02112579345703125
  • 1000
    • Profile
      • Pre change
        • Total space allocated: 269.5 KiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.07865023612976074
      • Post change
        • Total space allocated: 269.5 KiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.07574105262756348
    • Merge
      • Pre change
        • Total space allocated: 220.8 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.025424957275390625
      • Post change
        • Total space allocated: 220.8 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.022446870803833008
  • 5000
    • Profile
      • Pre change
        • Total space allocated: 841.1 KiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.141251802444458
      • Post change
        • Total space allocated: 841.1 KiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.13668394088745117
    • Merge
      • Pre change
        • Total space allocated: 653.0 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.03322911262512207
      • Post change
        • Total space allocated: 653.0 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.03174471855163574
  • 7500
    • Profile
      • Pre change
        • Total space allocated: 1.3 MiB
        • Line specific space allocated (Line number/s): 288.0 KiB
        • Profile runtime: 0.18341469764709473
      • Post change
        • Total space allocated: 1.2 MiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 0.1740732192993164
    • Merge
      • Pre change
        • Total space allocated: 1002.1 KiB
        • Line specific space allocated (Line number/s):
        • Merge runtime: 0.03673124313354492
      • Post change
        • Total space allocated: 1002.1 KiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.034219980239868164
  • 100000
    • Profile
      • Pre change
        • Total space allocated: 22.6 MiB
        • Line specific space allocated (Line number/s): 5.0 MiB
        • Profile runtime: 1.42696213722229
      • Post change
        • Total space allocated: 17.6 MiB
        • Line specific space allocated (Line number/s): N/A
        • Profile runtime: 1.4447557926177979
    • Merge
      • Pre change
        • Total space allocated: 13.0 MiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.21359801292419434
      • Post change
        • Total space allocated: 13.0 MiB
        • Line specific space allocated (Line number/s): N/A
        • Merge runtime: 0.15265393257141113

Separate measure for two different profiles to indicate merge differences:

 - Pre change
    - Total space allocated: 25.0 MiB
    - Line specific space allocated (Line number/s): N/A
- Post change
    - Total space allocated: 17.7 MiB
    - Line specific space allocated (Line number/s): N/A

@ksneab7 ksneab7 added the Work In Progress Solution is being developed label Apr 26, 2023
@taylorfturner taylorfturner enabled auto-merge (squash) April 26, 2023 20:39
@ksneab7 ksneab7 force-pushed the part_1_fix_for_categorical_mem_opt_issue branch 2 times, most recently from 409f16c to a50a7e0 Compare April 27, 2023 14:46
self.max_sample_size_to_check_stop_condition is not None
and len(data) >= self.max_sample_size_to_check_stop_condition
and self.stop_condition_unique_value_ratio is not None
and len(self._categories) / len(data)
Copy link
Collaborator

@JGSweets JGSweets Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged_unique_count = len(self._categories)
merged_sample_size = (self.sample_size + len(data))
merged_unique_ratio =  merged_unique_count / merged_sample_size


if (
    self.max_sample_size_to_check_stop_condition is not None
    and  self. stop_condition_unique_value_ratio is not None
    and merged_sample_size >= self.max_sample_size_to_check_stop_condition
    and merged_unique_ratio >= stop_condition_unique_value_ratio
):
    self._stop_condition_is_met = True
    self._stopped_at_unique_ratio = merged_unique_ratio
    self._stopped_at_unique_count = merged_unique_count

I think this keeps it clear and then sets the values.

An option would be if we want this function to just be a check, we can calculate in_update_categories :
merged_unique_count = len(self._categories)
merged_sample_size = (self.sample_size + len(data))
merged_unique_ratio = merged_unique_count / merged_sample_size

and pass it into the check function:
```python
def _check_stop_condition_is_met(self, sample_count, sample_size, unqiue_ratio):
    """Return value stop_condition_is_met given stop conditions.
    
    :return: boolean for stop conditions
    """
    if (
          self.max_sample_size_to_check_stop_condition is not None
          and  self. stop_condition_unique_value_ratio is not None
          and sample_size >= self.max_sample_size_to_check_stop_condition
          and unique_ratio >= stop_condition_unique_value_ratio
    ):
        return True
    return False

then in _update_categories:

if self._check_stop_condition_is_met(merged.....):
    self._stop_condition_is_met = True
    self._categories = {}
    self._stopped_at_unique_ratio = merged_unique_ratio
    self._stopped_at_unique_count = merged_unique_count

^^ similar logic can then be used in the __add__ where we calc the:

if not profile1._stop_condition_is_met and not profile2._stop_condition_is_met:
    # merge dicts and then check the values for the merging with the stop condition, etc.
else:
    # merged categories is empty
    # unique ratio and unique count can be take from the profile with the largest sample size
    # if we avg, will be problem for streams.
 
 
# also transfer stop conditions: enforce the most expensive /lenient case for now (open to either case)

between the two profies

@taylorfturner taylorfturner deleted the branch capitalone:feature/memory-optimization April 28, 2023 19:22
auto-merge was automatically disabled April 28, 2023 19:22

Pull request was closed

@ksneab7 ksneab7 reopened this May 1, 2023
Comment on lines 107 to 108
merged_profile._stopped_at_unique_ratio = self._stopped_at_unique_ratio
merged_profile._stopped_at_unique_count = self._stopped_at_unique_count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, default should suffice in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Comment on lines +351 to +352
if self._stop_condition_is_met:
self._categories = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we already set:

self._stopped_at_unique_ratio = merged_unique_ratio
self._stopped_at_unique_count = merged_unique_count

below, we can join it with:

self._categories = {}

However, let's consider the function description and the action of the function.

When we say update_stop_condition, to me it would only update _stop_condition_is_met boolean. Do we wnat to rename it to be more explicit about it doing more? or do we need update_stop_condition?

Could we just have _update_categories and _check_stop_condition_is_met?

Copy link
Collaborator

@JGSweets JGSweets May 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to note these are code readability / quality comments, not functional as I think the code is in a functional state.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a feature branch, we can refactor in subsequent PR.

return True
return False

def update_stop_condition(self, data: DataFrame):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want this to continue to exist, I suggest making it private as we wouldn't want an external user to use this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldnt we want that?

Comment on lines 121 to 126
merged_profile._stopped_at_unique_ratio = max(
self.unique_ratio, other.unique_ratio
)
merged_profile._stopped_at_unique_count = max(
self.unique_count, other.unique_count
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be purely based on max?

Consider:

  • Profile 1 has 10 samples and a unique_ratio of 1.00
  • Profile 2 has 10000 samples and a unique ratio of 0.50.

Do these values make sense?

Also, do we want to potentially choose different values from each, could that be problematic?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option of not choosing one profiles values vs the other's complicates the sample_size below I believe.

Copy link
Collaborator

@JGSweets JGSweets May 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also have tests which ensure that if self has the values that get assigned vs swapping and ensuring that other's values can be chosen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to talk about this more thoroughly, I feel that not matter what we choose here, in the current state, there is going to be a problem and there isnt an alternative suggested here, so not sure how to move forward.

Comment on lines 127 to 131
merged_profile.times = (
self.times
if self.times["categories"] >= other.times["categories"]
else other.times
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't later merged_profile times. This should already be handled via the _add_helper

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make sure our tests validates this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we should not add the times together because that is not consistent with a stop condition being hit

if self.times["categories"] >= other.times["categories"]
else other.times
)
merged_profile.sample_size = self.sample_size + other.sample_size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this is true if we are taking one value over the other above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont agree with this statement (mentioned above)

@@ -578,6 +614,44 @@ def test_categorical_merge(self):
}
self.assertCountEqual(report_count, expected_dict)

# Setting up of profile with stop condition not yet met
profile_w_stop_cond_1 = CategoricalColumn("merge_stop_condition_test")
profile_w_stop_cond_1.max_sample_size_to_check_stop_condition = 12
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should ensure times are correct, but looks like we don't properly do that in this file rn.

The way we've done previously is mocking timeit as a generator.
Example:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add issue for this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added:
#806

Copy link
Collaborator

@JGSweets JGSweets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerns about merge and suggestions for the update portion.

@ksneab7 ksneab7 force-pushed the part_1_fix_for_categorical_mem_opt_issue branch from 86e4681 to d4d07e2 Compare May 2, 2023 12:32
Comment on lines 82 to 84
self._merge_calculations(
merged_profile.__calculations, self.__calculations, other.__calculations
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be conducted above. right now we don't have it populated, but it should be done in either case.

Copy link
Collaborator

@JGSweets JGSweets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming refactors after this in subsequent PRs

@taylorfturner
Copy link
Contributor

taylorfturner commented May 3, 2023

Assuming refactors after this in subsequent PRs

need these refactors to be notated as follow-up issues @ksneab7

@JGSweets JGSweets merged commit ebb3995 into capitalone:feature/memory-optimization May 3, 2023
5 checks passed
ksneab7 added a commit that referenced this pull request May 23, 2023
* part_1 of fix for mem optimization for categoical dict creation issue

* precommit fix

* Separated the update from the check in stop conditions for categoical columns

* added tests and accounted for different varaibles affected by the change made to categories attribute

* Modifications to code based on test findings

* Fixes for logic and tests to match requirements from PR

* Fix for rebase carry over issue

* fixes for tests because of changes to variable names in categorical column object

* precommit fixes and improvement of code based on testing
JGSweets added a commit that referenced this pull request May 23, 2023
* [WIP] Part 1 fix for categorical mem opt issue (#795)

* part_1 of fix for mem optimization for categoical dict creation issue

* precommit fix

* Separated the update from the check in stop conditions for categoical columns

* added tests and accounted for different varaibles affected by the change made to categories attribute

* Modifications to code based on test findings

* Fixes for logic and tests to match requirements from PR

* Fix for rebase carry over issue

* fixes for tests because of changes to variable names in categorical column object

* precommit fixes and improvement of code based on testing

* added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808)

* implementation of setting stop conds via options for cat column profiler (#810)

* Space time analysis improvement (#809)

* Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate)

* Modified md file to account for new variable in space time analysis code

* fix: cat bug (#816)

* hotfix for more conservatitive stop condition in categorical columns (#817)

* [WIP] Fix for histogram merging (#815)

* rough draft of merge fix for histograms

* final fixes for passing of existing tests

* Added option to remove calculations for updating row statistics (#827)

* Fix to doc strings (#829)

* Preset Option Fix: presets docsstring added (#830)

* presets docsstring added

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

---------

Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants