Part 1 fix for categorical mem opt issue #795

ksneab7 · 2023-04-26T20:30:30Z

The dataset used is a single-column, ordered dataset.
Conditions set:
self.max_sample_size_to_check_stop_condition = 100
self.stop_condition_unique_value_ratio = .90

0
- Profile
  - Pre change
    - Total space allocated: N/A
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.02222418785095215
  - Post change
    - Total space allocated: N/A
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.022262096405029297
- Merge
  - Pre change
    - Total space allocated: N/A
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 1.4066696166992188e-05
  - Post change
    - Total space allocated: N/A
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 1.0967254638671875e-05
100
- Profile
  - Pre change
    - Total space allocated: 449.9 KiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.11128807067871094
  - Post change
    - Total space allocated: 449.9 KiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.10366582870483398
- Merge
  - Pre change
    - Total space allocated: 160.8 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.02127695083618164
  - Post change
    - Total space allocated: 160.8 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.02112579345703125
1000
- Profile
  - Pre change
    - Total space allocated: 269.5 KiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.07865023612976074
  - Post change
    - Total space allocated: 269.5 KiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.07574105262756348
- Merge
  - Pre change
    - Total space allocated: 220.8 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.025424957275390625
  - Post change
    - Total space allocated: 220.8 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.022446870803833008
5000
- Profile
  - Pre change
    - Total space allocated: 841.1 KiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.141251802444458
  - Post change
    - Total space allocated: 841.1 KiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.13668394088745117
- Merge
  - Pre change
    - Total space allocated: 653.0 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.03322911262512207
  - Post change
    - Total space allocated: 653.0 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.03174471855163574
7500
- Profile
  - Pre change
    - Total space allocated: 1.3 MiB
    - Line specific space allocated (Line number/s): 288.0 KiB
    - Profile runtime: 0.18341469764709473
  - Post change
    - Total space allocated: 1.2 MiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 0.1740732192993164
- Merge
  - Pre change
    - Total space allocated: 1002.1 KiB
    - Line specific space allocated (Line number/s):
    - Merge runtime: 0.03673124313354492
  - Post change
    - Total space allocated: 1002.1 KiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.034219980239868164
100000
- Profile
  - Pre change
    - Total space allocated: 22.6 MiB
    - Line specific space allocated (Line number/s): 5.0 MiB
    - Profile runtime: 1.42696213722229
  - Post change
    - Total space allocated: 17.6 MiB
    - Line specific space allocated (Line number/s): N/A
    - Profile runtime: 1.4447557926177979
- Merge
  - Pre change
    - Total space allocated: 13.0 MiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.21359801292419434
  - Post change
    - Total space allocated: 13.0 MiB
    - Line specific space allocated (Line number/s): N/A
    - Merge runtime: 0.15265393257141113

Separate measure for two different profiles to indicate merge differences:

- Pre change
  - Total space allocated: 25.0 MiB
  - Line specific space allocated (Line number/s): N/A
- Post change
  - Total space allocated: 17.7 MiB
  - Line specific space allocated (Line number/s): N/A
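
The description does not say which tool produced the space and runtime figures above. As a rough way to gather comparable numbers, the sketch below uses the standard-library `tracemalloc` and `time.perf_counter`; note that it reports peak traced memory rather than total allocations, and that `measure` and `build_category_counts` are hypothetical names standing in for the profile and merge calls.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, peak_bytes, seconds)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak, elapsed


def build_category_counts(values):
    """Stand-in workload (assumption): count each value in a dict."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


# An ordered, all-unique single column, similar in spirit to the test dataset.
data = [str(i) for i in range(10_000)]
counts, peak, elapsed = measure(build_category_counts, data)
print(f"peak: {peak / 1024:.1f} KiB, runtime: {elapsed:.4f} s")
```

Running the same helper before and after a change, on the same input, gives a pre/post pair like the ones listed in the description.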

JGSweets · 2023-04-27T20:31:00Z

dataprofiler/profilers/categorical_column_profile.py

+            self.max_sample_size_to_check_stop_condition is not None
+            and len(data) >= self.max_sample_size_to_check_stop_condition
+            and self.stop_condition_unique_value_ratio is not None
+            and len(self._categories) / len(data)


    merged_unique_count = len(self._categories)
    merged_sample_size = self.sample_size + len(data)
    merged_unique_ratio = merged_unique_count / merged_sample_size
    if (
        self.max_sample_size_to_check_stop_condition is not None
        and self.stop_condition_unique_value_ratio is not None
        and merged_sample_size >= self.max_sample_size_to_check_stop_condition
        and merged_unique_ratio >= self.stop_condition_unique_value_ratio
    ):
        self._stop_condition_is_met = True
        self._stopped_at_unique_ratio = merged_unique_ratio
        self._stopped_at_unique_count = merged_unique_count

I think this keeps it clear and then sets the values.

Another option, if we want this function to be just a check: we can calculate the following in _update_categories:
merged_unique_count = len(self._categories)
merged_sample_size = (self.sample_size + len(data))
merged_unique_ratio = merged_unique_count / merged_sample_size

and pass it into the check function:

```python
def _check_stop_condition_is_met(self, sample_count, sample_size, unique_ratio):
    """Return value stop_condition_is_met given stop conditions.

    :return: boolean for stop conditions
    """
    if (
        self.max_sample_size_to_check_stop_condition is not None
        and self.stop_condition_unique_value_ratio is not None
        and sample_size >= self.max_sample_size_to_check_stop_condition
        and unique_ratio >= self.stop_condition_unique_value_ratio
    ):
        return True
    return False
```

then in _update_categories:

    if self._check_stop_condition_is_met(merged.....):
        self._stop_condition_is_met = True
        self._categories = {}
        self._stopped_at_unique_ratio = merged_unique_ratio
        self._stopped_at_unique_count = merged_unique_count

^^ Similar logic can then be used in __add__, where we calculate the following between the two profiles:

    if not profile1._stop_condition_is_met and not profile2._stop_condition_is_met:
        # merge dicts and then check the merged values against the stop condition, etc.
    else:
        # merged categories is empty
        # unique ratio and unique count can be taken from the profile with the largest sample size
        # if we avg, it will be a problem for streams.
        # also transfer stop conditions: enforce the most expensive / lenient case for now (open to either case)
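
To make the suggested split between checking and updating concrete, here is a minimal, self-contained sketch. `ToyCategoricalColumn` is a hypothetical stand-in, not the DataProfiler class; it borrows the attribute names from the comment above, and its behavior after the stop condition is met (only `sample_size` keeps growing) is an assumption for illustration.

```python
class ToyCategoricalColumn:
    """Hypothetical stand-in for the profiler class (not DataProfiler code)."""

    def __init__(self, max_sample_size=None, unique_value_ratio=None):
        self.max_sample_size_to_check_stop_condition = max_sample_size
        self.stop_condition_unique_value_ratio = unique_value_ratio
        self.sample_size = 0
        self._categories = {}
        self._stop_condition_is_met = False
        self._stopped_at_unique_ratio = None
        self._stopped_at_unique_count = None

    def _check_stop_condition_is_met(self, sample_size, unique_ratio):
        """Pure check: reads thresholds, mutates nothing."""
        return (
            self.max_sample_size_to_check_stop_condition is not None
            and self.stop_condition_unique_value_ratio is not None
            and sample_size >= self.max_sample_size_to_check_stop_condition
            and unique_ratio >= self.stop_condition_unique_value_ratio
        )

    def update(self, data):
        """Count categories, then apply the stop condition in one place."""
        if self._stop_condition_is_met:
            # Assumption: once stopped, only the sample size is tracked.
            self.sample_size += len(data)
            return
        for value in data:
            self._categories[value] = self._categories.get(value, 0) + 1
        merged_unique_count = len(self._categories)
        merged_sample_size = self.sample_size + len(data)
        merged_unique_ratio = (
            merged_unique_count / merged_sample_size if merged_sample_size else 0.0
        )
        if self._check_stop_condition_is_met(merged_sample_size, merged_unique_ratio):
            self._stop_condition_is_met = True
            self._categories = {}  # free the memory held by the dict
            self._stopped_at_unique_ratio = merged_unique_ratio
            self._stopped_at_unique_count = merged_unique_count
        self.sample_size = merged_sample_size


high_card = ToyCategoricalColumn(max_sample_size=100, unique_value_ratio=0.90)
high_card.update([str(i) for i in range(150)])  # every value is unique

low_card = ToyCategoricalColumn(max_sample_size=100, unique_value_ratio=0.90)
low_card.update(["a", "b"] * 100)  # two categories in 200 samples

print(high_card._stop_condition_is_met, low_card._stop_condition_is_met)
```

Keeping the check free of side effects means the same method could be reused from a merge path without accidentally clearing state.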

JGSweets · 2023-05-01T21:15:07Z

dataprofiler/profilers/categorical_column_profile.py

+            merged_profile._stopped_at_unique_ratio = self._stopped_at_unique_ratio
+            merged_profile._stopped_at_unique_count = self._stopped_at_unique_count


as above, default should suffice in this case.

same comment as above

JGSweets · 2023-05-01T21:24:22Z

dataprofiler/profilers/categorical_column_profile.py

+        if self._stop_condition_is_met:
+            self._categories = {}


if we already set:

    self._stopped_at_unique_ratio = merged_unique_ratio
    self._stopped_at_unique_count = merged_unique_count

below, we can join it with:

self._categories = {}

However, let's consider the function description and the action of the function.

When we say update_stop_condition, to me it would only update the _stop_condition_is_met boolean. Do we want to rename it to be more explicit about it doing more? Or do we need update_stop_condition at all?

Could we just have _update_categories and _check_stop_condition_is_met?

Want to note these are code readability / quality comments, not functional as I think the code is in a functional state.

Since this is a feature branch, we can refactor in subsequent PR.

JGSweets · 2023-05-01T21:24:52Z

dataprofiler/profilers/categorical_column_profile.py

+            return True
+        return False
+
+    def update_stop_condition(self, data: DataFrame):


if we want this to continue to exist, I suggest making it private as we wouldn't want an external user to use this method.

Why wouldn't we want that?

JGSweets · 2023-05-01T21:28:54Z

dataprofiler/profilers/categorical_column_profile.py

+            merged_profile._stopped_at_unique_ratio = max(
+                self.unique_ratio, other.unique_ratio
+            )
+            merged_profile._stopped_at_unique_count = max(
+                self.unique_count, other.unique_count
+            )


Should it be purely based on max?

Consider:

Profile 1 has 10 samples and a unique_ratio of 1.00

Profile 2 has 10000 samples and a unique ratio of 0.50.

Do these values make sense?

Also, do we want to potentially choose different values from each, could that be problematic?

The option of not choosing one profile's values over the other's complicates the sample_size below, I believe.

We should also have tests which ensure that self's values get assigned and, with the profiles swapped, that other's values can be chosen.

We need to talk about this more thoroughly. I feel that no matter what we choose here, in the current state there is going to be a problem, and there isn't an alternative suggested here, so I'm not sure how to move forward.
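
The concern with a pure max can be checked with quick arithmetic on the example from this thread. The sketch below assumes each profile's unique count is its sample size times its unique ratio; because categories are dropped once a stop condition is met, the overlap between the two profiles is unknown, so only bounds on the true merged ratio are available. The dictionary layout and helper name are hypothetical.

```python
# Values taken from the example in the review comment above.
profile_1 = {"sample_size": 10, "unique_ratio": 1.00}
profile_2 = {"sample_size": 10_000, "unique_ratio": 0.50}


def unique_count(profile):
    """Unique values implied by a profile's size and ratio (assumption)."""
    return round(profile["sample_size"] * profile["unique_ratio"])


merged_size = profile_1["sample_size"] + profile_2["sample_size"]
count_1, count_2 = unique_count(profile_1), unique_count(profile_2)

# Full overlap: the smaller set of categories is contained in the larger one.
lowest_ratio = max(count_1, count_2) / merged_size
# No overlap: every category is distinct across the two profiles.
highest_ratio = min(count_1 + count_2, merged_size) / merged_size

max_based_ratio = max(profile_1["unique_ratio"], profile_2["unique_ratio"])

print(f"max-based ratio:  {max_based_ratio:.4f}")
print(f"achievable range: {lowest_ratio:.4f} to {highest_ratio:.4f}")
```

With these inputs the max-based ratio falls well outside the achievable range, which is the mismatch the question points at; taking both values from the profile with the larger sample size lands much closer to it.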

JGSweets · 2023-05-01T21:29:42Z

dataprofiler/profilers/categorical_column_profile.py

+            merged_profile.times = (
+                self.times
+                if self.times["categories"] >= other.times["categories"]
+                else other.times
+            )


We shouldn't alter merged_profile times. This should already be handled via the _add_helper.

We should make sure our tests validate this.

No, we should not add the times together, because that is not consistent with a stop condition being hit.

JGSweets · 2023-05-01T21:30:23Z

dataprofiler/profilers/categorical_column_profile.py

+                if self.times["categories"] >= other.times["categories"]
+                else other.times
+            )
+            merged_profile.sample_size = self.sample_size + other.sample_size


I don't believe this is true if we are taking one value over the other above

I don't agree with this statement (mentioned above).

JGSweets · 2023-05-01T21:38:29Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

@@ -578,6 +614,44 @@ def test_categorical_merge(self):
        }
        self.assertCountEqual(report_count, expected_dict)

+        # Setting up of profile with stop condition not yet met
+        profile_w_stop_cond_1 = CategoricalColumn("merge_stop_condition_test")
+        profile_w_stop_cond_1.max_sample_size_to_check_stop_condition = 12


should ensure times are correct, but looks like we don't properly do that in this file rn.

The way we've done it previously is by mocking timeit as a generator.
Example:

DataProfiler/dataprofiler/tests/profilers/test_profile_builder.py

Line 106 in f206af2

with test_utils.mock_timeit():

Should add issue for this

added:
#806
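
The implementation of `test_utils.mock_timeit` is not shown in this thread, so the following is only a generic sketch of the idea of mocking a timer with a generator, using `unittest.mock` from the standard library. `timed_update` is a hypothetical helper that records elapsed time under a key, loosely mirroring a `times["categories"]` entry.

```python
import itertools
import time
from unittest import mock


def timed_update(times, key, func, *args):
    """Hypothetical helper: add the wall-clock duration of func to times[key]."""
    start = time.time()
    result = func(*args)
    times[key] = times.get(key, 0.0) + (time.time() - start)
    return result


times = {}
# itertools.count() yields 0, 1, 2, ...; each mocked time.time() call returns
# the next value, so every timed call appears to take exactly one second.
with mock.patch("time.time", side_effect=itertools.count()):
    timed_update(times, "categories", sum, [1, 2, 3])
    timed_update(times, "categories", sum, [4, 5, 6])

print(times)
```

Because the clock is deterministic, a test can assert exact values for the recorded times, including after a merge.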

JGSweets

Concerns about merge and suggestions for the update portion.

JGSweets · 2023-05-02T19:56:30Z

dataprofiler/profilers/categorical_column_profile.py

+            self._merge_calculations(
+                merged_profile.__calculations, self.__calculations, other.__calculations
+            )


This still needs to be conducted above. right now we don't have it populated, but it should be done in either case.

JGSweets

Assuming refactors after this in subsequent PRs

taylorfturner · 2023-05-03T16:47:16Z

Assuming refactors after this in subsequent PRs

need these refactors to be notated as follow-up issues @ksneab7

* part_1 of fix for mem optimization for categoical dict creation issue
* precommit fix
* Separated the update from the check in stop conditions for categoical columns
* added tests and accounted for different varaibles affected by the change made to categories attribute
* Modifications to code based on test findings
* Fixes for logic and tests to match requirements from PR
* Fix for rebase carry over issue
* fixes for tests because of changes to variable names in categorical column object
* precommit fixes and improvement of code based on testing

* [WIP] Part 1 fix for categorical mem opt issue (#795)
  * part_1 of fix for mem optimization for categoical dict creation issue
  * precommit fix
  * Separated the update from the check in stop conditions for categoical columns
  * added tests and accounted for different varaibles affected by the change made to categories attribute
  * Modifications to code based on test findings
  * Fixes for logic and tests to match requirements from PR
  * Fix for rebase carry over issue
  * fixes for tests because of changes to variable names in categorical column object
  * precommit fixes and improvement of code based on testing
* added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808)
* implementation of setting stop conds via options for cat column profiler (#810)
* Space time analysis improvement (#809)
  * Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate)
  * Modified md file to account for new variable in space time analysis code
* fix: cat bug (#816)
* hotfix for more conservatitive stop condition in categorical columns (#817)
* [WIP] Fix for histogram merging (#815)
  * rough draft of merge fix for histograms
  * final fixes for passing of existing tests
* Added option to remove calculations for updating row statistics (#827)
* Fix to doc strings (#829)
* Preset Option Fix: presets docsstring added (#830)
  * presets docsstring added
  * Update dataprofiler/profilers/profiler_options.py
  * Update dataprofiler/profilers/profiler_options.py

    Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
  * Update dataprofiler/profilers/profiler_options.py
  * Update dataprofiler/profilers/profiler_options.py

  ---------

  Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

---------

Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>

ksneab7 added the Work In Progress label Apr 26, 2023

ksneab7 requested review from JGSweets, taylorfturner, micdavis and tyfarnan as code owners April 26, 2023 20:30

JGSweets reviewed Apr 26, 2023

dataprofiler/profilers/categorical_column_profile.py

taylorfturner enabled auto-merge (squash) April 26, 2023 20:39

JGSweets reviewed Apr 26, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 26, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 26, 2023

dataprofiler/profilers/categorical_column_profile.py

taylorfturner assigned ksneab7 Apr 26, 2023

ksneab7 force-pushed the part_1_fix_for_categorical_mem_opt_issue branch 2 times, most recently from 409f16c to a50a7e0 Compare April 27, 2023 14:46

JGSweets reviewed Apr 27, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/tests/profilers/test_categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/tests/profilers/test_categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/tests/profilers/test_categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/tests/profilers/test_categorical_column_profile.py

JGSweets reviewed Apr 27, 2023

dataprofiler/tests/profilers/test_categorical_column_profile.py

taylorfturner deleted the branch capitalone:feature/memory-optimization April 28, 2023 19:22

taylorfturner closed this Apr 28, 2023

auto-merge was automatically disabled April 28, 2023 19:22
Pull request was closed

ksneab7 reopened this May 1, 2023

ksneab7 added 2 commits May 1, 2023 08:14

part_1 of fix for mem optimization for categoical dict creation issue

9ee774a

precommit fix

e5ccfb4

JGSweets reviewed May 1, 2023

dataprofiler/tests/profilers/test_categorical_column_profile.py

ksneab7 commented May 1, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed May 1, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed May 1, 2023

dataprofiler/profilers/categorical_column_profile.py

JGSweets reviewed May 1, 2023

JGSweets suggested changes May 1, 2023

fixed transfer of stop condition variables for profile

d4d07e2

ksneab7 force-pushed the part_1_fix_for_categorical_mem_opt_issue branch from 86e4681 to d4d07e2 Compare May 2, 2023 12:32

modified logic for merging of profiles. Added tests to more thoroughly test merging

a014ac4

JGSweets reviewed May 2, 2023

ksneab7 and others added 6 commits May 2, 2023 17:42

Changes to fix mypy issue

5bf6a47

fixing test values

71d096d

test

f7b22d8

empty test

c633a8f

test empty

1f8ce09

test empty

a7ed486

JGSweets approved these changes May 3, 2023

taylorfturner added 2 commits May 3, 2023 12:32

test empty

16a72e0

test empty

777ca09

micdavis approved these changes May 3, 2023

JGSweets merged commit ebb3995 into capitalone:feature/memory-optimization May 3, 2023
5 checks passed
