
Fix for histogram merging #815

Conversation

Contributor

@ksneab7 commented May 9, 2023

No description provided.

@ksneab7 added the Work In Progress label May 9, 2023
@JGSweets changed the base branch from main to feature/memory-optimization May 9, 2023 20:22
@@ -14,6 +14,218 @@
)


def rework_ptp(maximum, minimum):
Collaborator

once we fix naming, we should make these private as well.
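
For context, ptp is shorthand for peak-to-peak range, as in numpy's np.ptp. A minimal sketch of what this helper presumably does, shown with an illustrative privatized name (the PR's function is rework_ptp):

def _ptp_from_profile(maximum, minimum):
    """Peak-to-peak range (max - min), computed from profile-stored
    extrema rather than from the raw data."""
    return maximum - minimum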

-------
h : An estimate of the optimal bin width for the given data.
"""
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
Collaborator

We can instead use the profile.sampling_size.

Contributor Author

AttributeError: 'IntColumn' object has no attribute 'sampling_size'
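
A minimal sketch of the fallback being discussed: sampling_size is not available on column profilers such as IntColumn, so the size either comes from an attribute the mixin does define (e.g. match_count) or from the stored histogram itself. The helper name is hypothetical:

def _dataset_size_from_profile(profile):
    """Hypothetical helper: prefer an explicit count attribute when the
    profile exposes one, otherwise sum the stored histogram's bin counts
    (which is what the diff above does)."""
    size = getattr(profile, "match_count", None)
    if size:
        return size
    return sum(profile._stored_histogram["histogram"]["bin_counts"])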

Comment on lines 46 to 47
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

Should be able to use profile.min etc.

Comment on lines 78 to 81
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
return (rework_ptp(maximum, minimum) /
Collaborator

as above

Comment on lines 104 to 106
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

as above

Comment on lines 127 to 129
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

as above

h : An estimate of the optimal bin width for the given data.
"""
iqr = np.subtract(profile._get_percentile([75]), profile._get_percentile([25]))
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
Collaborator

as above
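
The IQR-based snippet above corresponds to a Freedman-Diaconis style selector. Computed from profile state rather than raw data, the full rule could look roughly like this; the function name follows the _calc_*_bin_width_from_profile pattern visible later in the diff and is otherwise an assumption:

import numpy as np

def _calc_fd_bin_width_from_profile(profile):
    # Freedman-Diaconis rule: h = 2 * IQR * n ** (-1/3), with IQR and n
    # pulled from the profile's percentile estimates and stored histogram.
    iqr = np.subtract(
        profile._get_percentile([75]), profile._get_percentile([25])
    )
    dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
    return 2.0 * iqr * dataset_size ** (-1.0 / 3.0)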

-------
h : An estimate of the optimal bin width for the given data.
"""
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
Collaborator

as above

Comment on lines 326 to 328
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Collaborator

as above

n_equal_bins = 1
else:
# Do not call selectors on empty arrays
width = rework_hist_bin_selectors[bin_name](profile)
Collaborator

I'm wondering if it is better to take in the profile or a dict of properties. This does work and is future-flexible.

Collaborator
@JGSweets May 9, 2023

I'm curious if there are ways to reduce overall calcs by generating props before this loop and passing that in instead of the profile.
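
A rough sketch of that suggestion, assuming the selectors only need the size and extrema: build a small props dict once, before the loop, and hand it to each selector instead of the profile (names are illustrative):

def _collect_histogram_props(profile):
    # Gather the shared inputs once so each bin-width selector does not
    # recompute them from the stored histogram.
    hist = profile._stored_histogram["histogram"]
    return {
        "dataset_size": sum(hist["bin_counts"]),
        "minimum": hist["bin_edges"][0],
        "maximum": hist["bin_edges"][-1],
    }

# props = _collect_histogram_props(profile)
# width = rework_hist_bin_selectors[bin_name](props)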

@@ -76,7 +76,6 @@ def __init__(self, options: NumericalOptions = None) -> None:
self.histogram_selection: str | None = None
self.user_set_histogram_bin: int | None = None
self.bias_correction: bool = True # By default, we correct for bias
self._mode_is_enabled: bool = True
Collaborator

Why is this removed?

Contributor Author

Weird, I think this may solve some of my issue above. This should not be deleted.

@taylorfturner enabled auto-merge (squash) May 11, 2023 18:03
@ksneab7 changed the title from [WIP] Fix for histogram merging to Fix for histogram merging May 12, 2023
@ksneab7 removed the Work In Progress label May 12, 2023
"""
# parse the overloaded bins argument

rework_hist_bin_selectors = {
Collaborator
@JGSweets May 12, 2023

Instead of re-initializing this internally every time, maybe make it a global in this file. Probably should privatize the variable too, and update the name to suggest "from profile" rather than "rework".
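
A sketch of the renaming and scoping being suggested, keeping only the entry visible in this diff (the remaining selectors would follow the same _calc_<rule>_bin_width_from_profile pattern):

# Module-level and privatized, so the mapping is built once at import time
# instead of on every call.
_hist_bin_width_selectors_from_profile = {
    # ... other selectors elided ...
    "sturges": _calc_sturges_bin_width_from_profile,
}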

@taylorfturner added the High Priority label May 15, 2023
taylorfturner previously approved these changes May 15, 2023
Comment on lines +35 to +50
class TestColumn(NumericStatsMixin):
def __init__(self):
NumericStatsMixin.__init__(self)
self.times = defaultdict(float)
self.match_count = 5
self.min = 1
self.max = 5
self._biased_skewness = 1.0
self._stored_histogram["histogram"]["bin_counts"] = [1, 1, 1, 1, 1, 1]
self._stored_histogram["histogram"]["bin_edges"] = [0, 1, 2, 3, 4, 5, 6]

def update(self, df_series):
pass

def _filter_properties_w_options(self, calculations, options):
pass
Contributor

❤️


def _assimilate_histogram(
Contributor

Yeah, it doesn't look like self is even utilized outside of L1359 in the function, so maybe we add this as an issue @ksneab7 for follow-up work; for now we focus on getting the working update into the feature branch.
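
A minimal stand-in for that follow-up idea, not the PR's actual signature: a helper that merges counts into a destination binning can operate purely on arrays, dropping each source bin's count into the destination bin containing its midpoint.

import numpy as np

def _rebin_counts(src_counts, src_edges, dest_edges):
    # Illustrative only: place each source bin's count into the destination
    # bin that contains the source bin's midpoint.
    dest_counts = np.zeros(len(dest_edges) - 1, dtype=float)
    midpoints = (np.asarray(src_edges[:-1]) + np.asarray(src_edges[1:])) / 2.0
    idx = np.searchsorted(dest_edges, midpoints, side="right") - 1
    idx = np.clip(idx, 0, len(dest_counts) - 1)
    np.add.at(dest_counts, idx, src_counts)
    return dest_counts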

taylorfturner previously approved these changes May 15, 2023
Comment on lines 246 to 256
try:
calculated_loss = (
(histogram_loss_1 * other1.sample_size)
+ (histogram_loss_2 * other2.sample_size)
) / (other1.sample_size + other2.sample_size)
except AttributeError:
sample_size_1 = sum(other1._stored_histogram["histogram"]["bin_counts"])
sample_size_2 = sum(other2._stored_histogram["histogram"]["bin_counts"])
calculated_loss = (
(histogram_loss_1 * sample_size_1) + (histogram_loss_2 * sample_size_2)
) / (sample_size_1 + sample_size_2)
Collaborator

Is sample_size the correct value or should it be match_count? Do we have tests evaluating this?

Collaborator

Also, I think loss should be the sum, right? It's the aggregate loss of both being inserted into the _stored_histogram.

Contributor Author

Sure, match_count is probably right.
In this calculation we are taking the loss of each histogram and using the size of the dataset associated with that loss as a weight factor. So yes, it is a sum; in this case I am doing a weighted sum.
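
A compact sketch of the weighting described here, written with match_count as the weight per the suggestion above (the diff itself uses sample_size with a bin-count fallback):

def _combined_histogram_loss(loss_1, weight_1, loss_2, weight_2):
    # Each histogram's loss is weighted by the amount of data behind it,
    # then normalized by the total weight.
    return (loss_1 * weight_1 + loss_2 * weight_2) / (weight_1 + weight_2)

# calculated_loss = _combined_histogram_loss(
#     histogram_loss_1, other1.match_count, histogram_loss_2, other2.match_count
# )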

dest_hist_bin_edges=ideal_bin_edges,
dest_hist_num_bin=ideal_count_of_bins,
)
try:
Collaborator

We could separate out getting the sizes from the actual calc so the calc is not within the try itself; that is, if sizes are needed.

Collaborator

see below
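
A sketch of that restructuring: resolve the sizes first, with the fallback isolated in its own small helper, so the loss arithmetic sits outside the try block (the helper name is illustrative):

def _resolve_size(other):
    # Fallback mirrors the diff above: use sample_size when present,
    # otherwise sum the stored histogram's bin counts.
    try:
        return other.sample_size
    except AttributeError:
        return sum(other._stored_histogram["histogram"]["bin_counts"])

# size_1 = _resolve_size(other1)
# size_2 = _resolve_size(other2)
# calculated_loss = (
#     histogram_loss_1 * size_1 + histogram_loss_2 * size_2
# ) / (size_1 + size_2)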

hist_loss += (
((new_bin_edge[2] - new_bin_edge[1]) - (bin_edge[1] - bin_edge[0]))
((new_bin_edge[2] + new_bin_edge[1]) - (bin_edge[1] + bin_edge[0]))
Contributor Author

This was the last bug to be fixed.
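
For context on the fix: summing two edges gives twice their midpoint, so the corrected term tracks how far each original bin's center moved during re-binning, rather than how much its width changed. A toy example with made-up edges:

# Toy numbers only: an original bin [0, 2] whose contents land in the
# destination bin [0.5, 2.5] (index 0 of new_bin_edge is unused here).
bin_edge = (0.0, 2.0)
new_bin_edge = (0.0, 0.5, 2.5)

# Old term compared bin widths: (2.5 - 0.5) - (2.0 - 0.0) = 0.0,
# so a shifted bin registered no loss.
width_delta = (new_bin_edge[2] - new_bin_edge[1]) - (bin_edge[1] - bin_edge[0])

# Fixed term compares (twice) the midpoints: (2.5 + 0.5) - (2.0 + 0.0) = 1.0,
# i.e. 2 * (1.5 - 1.0).
midpoint_delta = (new_bin_edge[2] + new_bin_edge[1]) - (bin_edge[1] + bin_edge[0])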

@taylorfturner added the Bug label May 16, 2023
"sturges": _calc_sturges_bin_width_from_profile,
}


def _get_bin_edges(
a: np.ndarray,
Contributor

Think we might like a more descriptive variable name here?

Collaborator
@JGSweets May 16, 2023

It's not in code we changed, and it is actually how numpy implements it as well. My suggestion is to table this for now.

Contributor

good call

@taylorfturner merged commit 2f768f4 into capitalone:feature/memory-optimization May 16, 2023
5 checks passed
ksneab7 added a commit that referenced this pull request May 23, 2023
* rough draft of merge fix for histograms

* final fixes for passing of existing tests
JGSweets added a commit that referenced this pull request May 23, 2023
* [WIP] Part 1 fix for categorical mem opt issue (#795)

* part_1 of fix for mem optimization for categorical dict creation issue

* precommit fix

* Separated the update from the check in stop conditions for categorical columns

* added tests and accounted for different variables affected by the change made to categories attribute

* Modifications to code based on test findings

* Fixes for logic and tests to match requirements from PR

* Fix for rebase carry over issue

* fixes for tests because of changes to variable names in categorical column object

* precommit fixes and improvement of code based on testing

* added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808)

* implementation of setting stop conds via options for cat column profiler (#810)

* Space time analysis improvement (#809)

* Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate)

* Modified md file to account for new variable in space time analysis code

* fix: cat bug (#816)

* hotfix for more conservative stop condition in categorical columns (#817)

* [WIP] Fix for histogram merging (#815)

* rough draft of merge fix for histograms

* final fixes for passing of existing tests

* Added option to remove calculations for updating row statistics (#827)

* Fix to doc strings (#829)

* Preset Option Fix: presets docstring added (#830)

* presets docstring added

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profiler_options.py

* Update dataprofiler/profilers/profiler_options.py

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

---------

Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>