Add natural language nan data check #2122
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #2122     +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         291      293       +2
  Lines       23809    23915     +106
=========================================
+ Hits        23799    23905     +106
  Misses         10       10
Continue to review full report at Codecov.
Awesome work! Left a couple comments on testing, but LGTM.
evalml/tests/data_checks_tests/test_natural_language_nan_data_check.py
assert nl_nan_check.validate([nl_col, nl_col_without_nan]) == expected

# test np.array
assert nl_nan_check.validate(np.array([nl_col, nl_col_without_nan])) == expected
Can we also add a test of this behavior with an empty string in a natural language column?
cols = ["", "string_that_is_long_enough_for_natural_language", "string_that_is_long_enough_for_natural_language"]
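As a rough sketch (names and the expected validate() payload follow the surrounding test code; that an empty string passes is exactly the assumption being tested):

# Hypothetical test sketch: assumes an empty string is a real document,
# not a missing value, so validate() should report nothing for it.
import pandas as pd
from evalml.data_checks import NaturalLanguageNaNDataCheck

def test_nl_nan_check_ignores_empty_strings():
    cols = pd.DataFrame({
        "nl_col": [
            "",
            "string_that_is_long_enough_for_natural_language",
            "string_that_is_long_enough_for_natural_language",
        ]
    })
    nl_nan_check = NaturalLanguageNaNDataCheck()
    assert nl_nan_check.validate(cols) == {"warnings": [], "errors": []}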
@bchen1116 I believe an empty string should pass this check, since the LSA component was only bugging out on NaN:
np.nan is an invalid document, expected byte or unicode string.
from #2000
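For reference, that error is reproducible outside evalml; a minimal sketch, assuming the LSA failure originates in scikit-learn's text vectorizer, which rejects np.nan documents:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["string_that_is_long_enough_for_natural_language", np.nan]
try:
    TfidfVectorizer().fit_transform(docs)
except ValueError as err:
    # "np.nan is an invalid document, expected byte or unicode string."
    print(err)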
Looks good @jeremyliweishih! I agree with @bchen1116's test suggestions.
nan_columns = nl_cols.columns[nl_cols.isna().any()].tolist()
if len(nan_columns) > 0:
    nan_columns = [str(col) for col in nan_columns]
    cols_str = ', '.join(nan_columns) if len(nan_columns) > 1 else nan_columns[0]
I don't think we need the if/else here? ', '.join(nan_columns) is already nan_columns[0] when nan_columns only has one element.
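e.g.:

>>> ', '.join(['col_a'])
'col_a'
>>> ', '.join(['col_a', 'col_b'])
'col_a, col_b'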
nice! good catch 😄
}

X = infer_feature_types(X)
nl_cols = _convert_woodwork_types_wrapper(X.select("natural_language").to_dataframe())
I think it's possible to do this with X.describe()['nan_count'] to save us the conversion.
cool, will add! I'm going to use describe_dict() since it's more straightforward for me to use a dict instead of a DT.
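Roughly, that suggestion looks like the sketch below; it assumes woodwork's describe_dict() returns a per-column dict of stats that includes a 'nan_count' entry:

# Sketch only: hinges on describe_dict() exposing 'nan_count' per column.
X = infer_feature_types(X)
nl_stats = X.select("natural_language").describe_dict()
nan_columns = [str(col) for col, stats in nl_stats.items()
               if stats.get("nan_count", 0) > 0]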
LGTM! Just left some nitpicks hehe
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
Hehe, nit-pick (I don't think we did a good job of this before either), but maybe just explicitly set some cols as nat lang cols? (See the sketch after this thread.) Just thinking that if WW ever changes its type inference, this might break and no longer be considered nat lang.
good call - I'll add it to the docstring and the explicit WW tests!
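For the record, pinning the types could look like this; the sketch assumes infer_feature_types (from evalml.utils) accepts a feature_types mapping of column names to logical-type names:

import pandas as pd
from evalml.utils import infer_feature_types

data = pd.DataFrame()
data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
data['B'] = ["string_that_is_long_enough_for_natural_language"] * 2
# Explicit logical types: robust to future changes in WW's inference.
X = infer_feature_types(data, feature_types={'A': 'natural_language', 'B': 'natural_language'})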
    DataCheckMessageCode,
    NaturalLanguageNaNDataCheck
)
General comment: we check using None here. While it shouldn't make a difference in behavior, can we also check with np.nan?
Can't use np.nan here since there will be conflicting types, but I will go ahead and try with an empty string.
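FWIW, pandas already treats None and np.nan identically for missing-value checks, which is why the behavior shouldn't differ:

import numpy as np
import pandas as pd

s_none = pd.Series([None, "text"])
s_nan = pd.Series([np.nan, "text"])
# isna() flags both as missing, so the data check sees them the same way.
assert s_none.isna().tolist() == s_nan.isna().tolist() == [True, False]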
…skEngine`` #1975.

- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta #2118
- Fixed documentation demos #2139
- Updated test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148

.. warning::

    - Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassificationSampler``, and Undersampler #2113
    - Deleted the "errors" key from automl results #1975
    - Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
    - Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
Fixes #2000