
Fix for histogram merging #815

Conversation

Contributor

@ksneab7 commented May 9, 2023

No description provided.

@ksneab7 added the Work In Progress label May 9, 2023
@JGSweets changed the base branch from main to feature/memory-optimization May 9, 2023 20:22
@@ -14,6 +14,218 @@
)


def rework_ptp(maximum, minimum):
Collaborator

once we fix naming, we should make these private as well.
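
For context, ptp is shorthand for peak-to-peak range, as in numpy's np.ptp. A minimal sketch of what this helper presumably does, shown with an illustrative privatized name (the PR's function is rework_ptp):

def _ptp_from_profile(maximum, minimum):
    """Peak-to-peak range (max - min), computed from profile-stored
    extrema rather than from the raw data."""
    return maximum - minimum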

-------
h : An estimate of the optimal bin width for the given data.
"""
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
Collaborator

We can instead use the profile.sampling_size.

Contributor Author

AttributeError: 'IntColumn' object has no attribute 'sampling_size'
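
A minimal sketch of the fallback being discussed: sampling_size is not available on column profilers such as IntColumn, so the size either comes from an attribute the mixin does define (e.g. match_count) or from the stored histogram itself. The helper name is hypothetical:

def _dataset_size_from_profile(profile):
    """Hypothetical helper: prefer an explicit count attribute when the
    profile exposes one, otherwise sum the stored histogram's bin counts
    (which is what the diff above does)."""
    size = getattr(profile, "match_count", None)
    if size:
        return size
    return sum(profile._stored_histogram["histogram"]["bin_counts"])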

Comment on lines 46 to 47
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

Should be able to use profile.min etc.

Comment on lines 78 to 81
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
return (rework_ptp(maximum, minimum) /
Collaborator

as above

Comment on lines 104 to 106
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

as above

Comment on lines 127 to 129
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

as above

h : An estimate of the optimal bin width for the given data.
"""
iqr = np.subtract(profile._get_percentile([75]), profile._get_percentile([25]))
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
Collaborator

as above
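
The IQR-based snippet above corresponds to a Freedman-Diaconis style selector. Computed from profile state rather than raw data, the full rule could look roughly like this; the function name follows the _calc_*_bin_width_from_profile pattern visible later in the diff and is otherwise an assumption:

import numpy as np

def _calc_fd_bin_width_from_profile(profile):
    # Freedman-Diaconis rule: h = 2 * IQR * n ** (-1/3), with IQR and n
    # pulled from the profile's percentile estimates and stored histogram.
    iqr = np.subtract(
        profile._get_percentile([75]), profile._get_percentile([25])
    )
    dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
    return 2.0 * iqr * dataset_size ** (-1.0 / 3.0)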

-------
h : An estimate of the optimal bin width for the given data.
"""
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
Collaborator

as above

Comment on lines 326 to 328
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

as above

n_equal_bins = 1
else:
# Do not call selectors on empty arrays
width = rework_hist_bin_selectors[bin_name](profile)
Collaborator

I'm wondering if it is better to take in the profile or a dict of properties. This does work and is future-flexible.

Collaborator
@JGSweets May 9, 2023

I'm curious if there are ways to reduce overall calcs by generating props before this loop and passing that in instead of the profile.
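
A rough sketch of that suggestion, assuming the selectors only need the size and extrema: build a small props dict once, before the loop, and hand it to each selector instead of the profile (names are illustrative):

def _collect_histogram_props(profile):
    # Gather the shared inputs once so each bin-width selector does not
    # recompute them from the stored histogram.
    hist = profile._stored_histogram["histogram"]
    return {
        "dataset_size": sum(hist["bin_counts"]),
        "minimum": hist["bin_edges"][0],
        "maximum": hist["bin_edges"][-1],
    }

# props = _collect_histogram_props(profile)
# width = rework_hist_bin_selectors[bin_name](props)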

@@ -76,7 +76,6 @@ def __init__(self, options: NumericalOptions = None) -> None:
self.histogram_selection: str | None = None
self.user_set_histogram_bin: int | None = None
self.bias_correction: bool = True # By default, we correct for bias
self._mode_is_enabled: bool = True
Collaborator

Why is this removed?

Contributor Author

Weird, I think this may solve some of my issue above. This should not be deleted.

@taylorfturner enabled auto-merge (squash) May 11, 2023 18:03
@ksneab7 changed the title from [WIP] Fix for histogram merging to Fix for histogram merging May 12, 2023
@ksneab7 removed the Work In Progress label May 12, 2023
"""
# parse the overloaded bins argument

rework_hist_bin_selectors = {
Collaborator
@JGSweets May 12, 2023

Instead of re-initializing this internally every time, maybe make it a global in this file. Probably should privatize the variable too, and update the name to suggest "from profile" rather than "rework".
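
A sketch of the renaming and scoping being suggested, keeping only the entry visible in this diff (the remaining selectors would follow the same _calc_<rule>_bin_width_from_profile pattern):

# Module-level and privatized, so the mapping is built once at import time
# instead of on every call.
_hist_bin_width_selectors_from_profile = {
    # ... other selectors elided ...
    "sturges": _calc_sturges_bin_width_from_profile,
}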

@taylorfturner added the High Priority label May 15, 2023
taylorfturner previously approved these changes May 15, 2023
Comment on lines +35 to +50
class TestColumn(NumericStatsMixin):
def __init__(self):
NumericStatsMixin.__init__(self)
self.times = defaultdict(float)
self.match_count = 5
self.min = 1
self.max = 5
self._biased_skewness = 1.0
self._stored_histogram["histogram"]["bin_counts"] = [1, 1, 1, 1, 1, 1]
self._stored_histogram["histogram"]["bin_edges"] = [0, 1, 2, 3, 4, 5, 6]

def update(self, df_series):
pass

def _filter_properties_w_options(self, calculations, options):
pass
Contributor

❤️


def _assimilate_histogram(
Contributor

Yeah, it doesn't look like self is even utilized outside of L1359 in the function, so maybe we add this as an issue @ksneab7 for follow-up work; for now we focus on getting the working update into the feature branch.
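
A minimal stand-in for that follow-up idea, not the PR's actual signature: a helper that merges counts into a destination binning can operate purely on arrays, dropping each source bin's count into the destination bin containing its midpoint.

import numpy as np

def _rebin_counts(src_counts, src_edges, dest_edges):
    # Illustrative only: place each source bin's count into the destination
    # bin that contains the source bin's midpoint.
    dest_counts = np.zeros(len(dest_edges) - 1, dtype=float)
    midpoints = (np.asarray(src_edges[:-1]) + np.asarray(src_edges[1:])) / 2.0
    idx = np.searchsorted(dest_edges, midpoints, side="right") - 1
    idx = np.clip(idx, 0, len(dest_counts) - 1)
    np.add.at(dest_counts, idx, src_counts)
    return dest_counts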

taylorfturner previously approved these changes May 15, 2023
Comment on lines 246 to 256
try:
calculated_loss = (
(histogram_loss_1 * other1.sample_size)
+ (histogram_loss_2 * other2.sample_size)
) / (other1.sample_size + other2.sample_size)
except AttributeError:
sample_size_1 = sum(other1._stored_histogram["histogram"]["bin_counts"])
sample_size_2 = sum(other2._stored_histogram["histogram"]["bin_counts"])
calculated_loss = (
(histogram_loss_1 * sample_size_1) + (histogram_loss_2 * sample_size_2)
) / (sample_size_1 + sample_size_2)
Collaborator

Is sample_size the correct value or should it be match_count? Do we have tests evaluating this?

Collaborator

Also, I think loss should be the sum, right? It's the aggregate loss of both being inserted into the _stored_histogram.

Contributor Author

Sure, match_count is probably right.
In this calculation we are taking the loss of each histogram and using the size of the dataset associated with that loss as a weight factor. So yes, it is a sum; in this case I am doing a weighted sum.
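
A compact sketch of the weighting described here, written with match_count as the weight per the suggestion above (the diff itself uses sample_size with a bin-count fallback):

def _combined_histogram_loss(loss_1, weight_1, loss_2, weight_2):
    # Each histogram's loss is weighted by the amount of data behind it,
    # then normalized by the total weight.
    return (loss_1 * weight_1 + loss_2 * weight_2) / (weight_1 + weight_2)

# calculated_loss = _combined_histogram_loss(
#     histogram_loss_1, other1.match_count, histogram_loss_2, other2.match_count
# )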

dest_hist_bin_edges=ideal_bin_edges,
dest_hist_num_bin=ideal_count_of_bins,
)
try:
Collaborator

We could separate out getting the sizes from the actual calc so the calc is not within the try itself; that is, if sizes are needed.

Collaborator

see below
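
A sketch of that restructuring: resolve the sizes first, with the fallback isolated in its own small helper, so the loss arithmetic sits outside the try block (the helper name is illustrative):

def _resolve_size(other):
    # Fallback mirrors the diff above: use sample_size when present,
    # otherwise sum the stored histogram's bin counts.
    try:
        return other.sample_size
    except AttributeError:
        return sum(other._stored_histogram["histogram"]["bin_counts"])

# size_1 = _resolve_size(other1)
# size_2 = _resolve_size(other2)
# calculated_loss = (
#     histogram_loss_1 * size_1 + histogram_loss_2 * size_2
# ) / (size_1 + size_2)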

hist_loss += (
((new_bin_edge[2] - new_bin_edge[1]) - (bin_edge[1] - bin_edge[0]))
((new_bin_edge[2] + new_bin_edge[1]) - (bin_edge[1] + bin_edge[0]))
Contributor Author

This was the last bug to be fixed.
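
For context on the fix: summing two edges gives twice their midpoint, so the corrected term tracks how far each original bin's center moved during re-binning, rather than how much its width changed. A toy example with made-up edges:

# Toy numbers only: an original bin [0, 2] whose contents land in the
# destination bin [0.5, 2.5] (index 0 of new_bin_edge is unused here).
bin_edge = (0.0, 2.0)
new_bin_edge = (0.0, 0.5, 2.5)

# Old term compared bin widths: (2.5 - 0.5) - (2.0 - 0.0) = 0.0,
# so a shifted bin registered no loss.
width_delta = (new_bin_edge[2] - new_bin_edge[1]) - (bin_edge[1] - bin_edge[0])

# Fixed term compares (twice) the midpoints: (2.5 + 0.5) - (2.0 + 0.0) = 1.0,
# i.e. 2 * (1.5 - 1.0).
midpoint_delta = (new_bin_edge[2] + new_bin_edge[1]) - (bin_edge[1] + bin_edge[0])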

@taylorfturner added the Bug label May 16, 2023
"sturges": _calc_sturges_bin_width_from_profile,
}


def _get_bin_edges(
a: np.ndarray,
Contributor

Think we might like a more descriptive variable name here?

Collaborator
@JGSweets May 16, 2023

It's not in code we changed, and it is actually how numpy implements it as well. My suggestion is to table this for now.

Contributor

good call

@taylorfturner merged commit 2f768f4 into capitalone:feature/memory-optimization May 16, 2023
5 checks passed
ksneab7 added a commit that referenced this pull request May 23, 2023
* rough draft of merge fix for histograms

* final fixes for passing of existing tests
JGSweets added a commit that referenced this pull request May 23, 2023
* [WIP] Part 1 fix for categorical mem opt issue (#795)

* part_1 of fix for mem optimization for categorical dict creation issue

* precommit fix

* Separated the update from the check in stop conditions for categorical columns

* added tests and accounted for different variables affected by the change made to categories attribute

* Modifications to code based on test findings

* Fixes for logic and tests to match requirements from PR

* Fix for rebase carry over issue

* fixes for tests because of changes to variable names in categorical column object

* precommit fixes and improvement of code based on testing

* added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808)

* implementation of setting stop conds via options for cat column profiler (#810)

* Space time analysis improvement (#809)

* Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate)

* Modified md file to account for new variable in space time analysis code

* fix: cat bug (#816)

* hotfix for more conservative stop condition in categorical columns (#817)

* [WIP] Fix for histogram merging (#815)

* rough draft of merge fix for histograms

* final fixes for passing of existing tests

* Added option to remove calculations for updating row statistics (#827)

* Fix to doc strings (#829)

* Preset Option Fix: presets docstring added (#830)

* presets docstring added

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

---------

Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>