-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SyntheticDataOptions and metrics calculations for NaN replication #571
Conversation
report["synthetic_data"] = { | ||
"nan_replication": self._nan_replication_metrics | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would put this in report["data_stats"][<column_index>]["nan_replication_metrics"]
@@ -27,7 +27,7 @@ def test_default_profiler_options(self, *mocks): | |||
self.assertTrue(profile.options.data_labeler.is_enabled) | |||
for column in profile.options.properties: | |||
# TODO: remove the check for correlation option once it's updated to True | |||
if column == "correlation": | |||
if column == "correlation" or column == "synthetic_data": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change option name ... nan_replication_metrics
data = pd.DataFrame(clean_samples).apply(pd.to_numeric, errors="coerce") | ||
|
||
for col_id in range(len(self._profile)): | ||
null_count = getattr(self._profile[col_id], "null_count") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add something like compiler = self._profile[id]
so you don't have to keep repeating the self._profile[col_id]
call
|
||
# Calculate class priors | ||
# i.e probability that a value in target col is or is not NaN | ||
sample_size = getattr(self._profile[col_id], "sample_size") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
prev_mean_nan, mean_nan, prev_null_count, added_null_count | ||
) | ||
|
||
col_name = getattr(self._profile[col_id], "name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
|
||
merged_properties = {} | ||
for col_id in range(len(self._profile)): | ||
self_null_count = getattr(self._profile[col_id], "null_count") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
continue | ||
|
||
merged_properties[col_id] = {} | ||
self_sample_size = getattr(self._profile[col_id], "sample_size") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
|
||
merged_properties[col_id]["class_prior"] = [prior_not_nan, prior_nan] | ||
|
||
col_name = getattr(self._profile[col_id], "name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
null_count = getattr(self._profile[col_id], "null_count", 0) | ||
sample_size = getattr(self._profile[col_id], "sample_size", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
|
||
merged_properties[col_id] = {} | ||
self_sample_size = getattr(self._profile[col_id], "sample_size") | ||
other_sample_size = getattr(other._profile[col_id], "sample_size") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2164 comment
merged_properties[col_id]["class_mean"] = other_mean | ||
|
||
elif col_id not in other._nan_replication_metrics: | ||
self_mean = self._nan_replication_metrics[col_id]["class_mean"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._nan_replication_metrics[col_id]["class_mean"]
repated...
self._nan_replication_metrics[col_id]["class_mean"]
--> class_mean = self._nan_replication_metrics[col_id]["class_mean"]
and then just use class_mean
in the rest of the function
other_true_count = other_sample_size - other_null_count | ||
|
||
self_mean = self._nan_replication_metrics[col_id]["class_mean"] | ||
other_mean = other._nan_replication_metrics[col_id]["class_mean"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment on 2270 ... same comment for other_mean
class SyntheticDataOptions(BaseInspectorOptions): | ||
"""For configuring options for synthetic data generation""" | ||
|
||
def __init__(self): | ||
""" | ||
Initialize options for synthetic data generation | ||
:ivar is_enabled: boolean option to enable/disable synthetic data generation options | ||
:vartype is_enabled: BooleanOption | ||
:ivar replicate_nan: boolean option to enable/disable calculating metrics needed for | ||
replicating nan values in synthetic data generator | ||
:vartype replicate_nan: BooleanOption | ||
""" | ||
|
||
self.replicate_nan = BooleanOption(is_enabled=True) | ||
super().__init__(is_enabled=False) | ||
|
||
def _validate_helper(self, variable_path="SyntheticDataOptions"): | ||
""" | ||
Validate the options do not conflict and cause errors. | ||
|
||
:param variable_path: current path to variable set. | ||
:type variable_path: str | ||
:return: list of errors (if raise_error is false) | ||
:rtype: list(str) | ||
""" | ||
return super()._validate_helper(variable_path) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed, remove and put nan_replication_metrics
as self.nan_replication_metrics
in the class StructuredOptions(BaseOption)
in profiler_options.py
@@ -1170,6 +1200,7 @@ def __init__(self, null_values=None): | |||
self.data_labeler = DataLabelerOptions() | |||
self.correlation = CorrelationOptions() | |||
self.chi2_homogeneity = BooleanOption(is_enabled=True) | |||
self.synthetic_data = SyntheticDataOptions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update to StructuredOptions
I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This options is an attribute of/ part of the StructuredOptions
class (the code highlighted is in the def __init__
of the StructuredOptions
class) so it wouldn’t really make sense to use StructuredOptions
here. Was thinking of a simple BooleanOptions
.
dataprofiler/profilers/utils.py
Outdated
:return: Combined mean vector | ||
:rtype: np.ndarray | ||
""" | ||
return (mean1 * n1 + mean2 * n2) / (n1 + n2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not return (mean1 + mean2) / 2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That equation would only work if n1 == n2
. If sample sizes the means were calculated from are different then we need to use (mean1 * n1 + mean2 * n2) / (n1 + n2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not weighted properly return (mean1 + mean2) / 2
Updates in the recent commit include:
|
self_compiler = self._profile[col_id] | ||
other_compiler = other._profile[col_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have done self_profile
and other_profile
but won't hold the PR up over this
Head branch was pushed to by a user without write access
Head branch was pushed to by a user without write access
Changes made in latest commit:
|
@@ -1438,6 +1438,9 @@ def __init__(self, data, samples_per_update=None, min_true_samples=0, options=No | |||
self.correlation_matrix = None | |||
self.chi2_matrix = None | |||
|
|||
# capitalone/synthetic-data specific metrics | |||
self._null_replication_metrics = dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instantiating class variables with None
is best practice. Interesting we do it on line 1435.
https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do it on line 1438 and 1439
# Array index serves as class label | ||
# 0 indicates not null, 1 indicates null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit picky and not going to hold the PR up over it... just move the comments to above the dict and not inside the dict.
Head branch was pushed to by a user without write access
Head branch was pushed to by a user without write access
:param clean_samples: input cleaned dataset | ||
:type clean_samples: dict | ||
""" | ||
data = pd.DataFrame(clean_samples).apply(pd.to_numeric, errors="coerce") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this call still needed? I assume yes for the conversion to numeric? Do we need to apply the entire array? is this expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, not too sure about that. Doing it because that is what correlation calculation does and I need the same type of data for null replication as correlation calculation uses.
def test_null_replication_metrics_calculation(self): | ||
data = pd.DataFrame( | ||
{ | ||
"a": [3, 2, np.nan, 7, np.nan], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we validate that the clean, etc. works for nulls that aren't just caught by np.nan
. e.g. blank spaces, or None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the DataProfiler successfully detects null values the calculation shouldn't be affected. Could definitely add that to the tests though.
} | ||
) | ||
|
||
profiler.update_profile(data_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the values after this and the values after summing two profiles, e.g. merge
, be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially profiler
has data
, then update profiler
with data_2
(so profiler
has data
and data_2
). Then initialize profiler2
with data
then merge it with profiler
, therefore merged_profile
contains 2*data+data_2
whereas profile
after update contains data+data_2
so it shouldn't be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I thought it was going to be data + data2
I might suggest doing that as we should validate that merge / update should provide the same results.
one way would be to do merge prior to update to validate.
{ | ||
"data_labeler.is_enabled": False, | ||
"structured_options.multiprocess.is_enabled": False, | ||
"null_replication_metrics.is_enabled": True, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Head branch was pushed to by a user without write access
Added
SyntheticDataOptions
which can be set usingprofile_options.set({"synthetic_data.is_enabled": True})
. This enables calculation of metrics specifically needed forcapitalone/synthetic-data
.Added calculation of metrics needed by
capitalone/synthetic-data
for replicating NaN values in the generated dataset.