
Add natural language nan data check #2122

Merged: 13 commits merged into main from js_2000_nlp_nan on Apr 14, 2021

Conversation

Contributor

@jeremyliweishih jeremyliweishih commented Apr 9, 2021

Fixes #2000

  • Add to default_data_checks


codecov bot commented Apr 12, 2021

Codecov Report

Merging #2122 (4a97ad7) into main (8f30b08) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #2122     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         291      293      +2     
  Lines       23809    23915    +106     
=========================================
+ Hits        23799    23905    +106     
  Misses         10       10             
Impacted Files Coverage Δ
evalml/data_checks/invalid_targets_data_check.py 100.0% <ø> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
...ta_checks_tests/test_invalid_targets_data_check.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/default_data_checks.py 100.0% <100.0%> (ø)
...lml/data_checks/natural_language_nan_data_check.py 100.0% <100.0%> (ø)
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8f30b08...4a97ad7.

@jeremyliweishih jeremyliweishih marked this pull request as ready for review April 12, 2021 17:30
Contributor

@bchen1116 bchen1116 left a comment


Awesome work! Left a couple comments on testing, but LGTM.

assert nl_nan_check.validate([nl_col, nl_col_without_nan]) == expected

# test np.array
assert nl_nan_check.validate(np.array([nl_col, nl_col_without_nan])) == expected
Contributor

Can we also add a test of this behavior with an empty string in a natural language column?

cols = ["", "string_that_is_long_enough_for_natural_language", "string_that_is_long_enough_for_natural_language"]
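A minimal sketch of what the suggested test could check, using plain pandas to illustrate that an empty string is not treated as missing (names here are illustrative, not the PR's actual API):

```python
import pandas as pd

# An empty string is a valid (if short) document; only real NaN/None
# entries should trigger the natural language NaN data check.
cols = ["", "string_that_is_long_enough_for_natural_language",
        "string_that_is_long_enough_for_natural_language"]
series = pd.Series(cols)

# pandas does not consider "" missing, so no NaN warning should be raised
assert not series.isna().any()
```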

Contributor Author

@bchen1116 I believe an empty string should pass this check, since the LSA component was only bugging out on NaN:

`np.nan is an invalid document, expected byte or unicode string.` (from #2000)

Contributor

@freddyaboulton freddyaboulton left a comment


Looks good @jeremyliweishih! I agree with @bchen1116's test suggestions.

nan_columns = nl_cols.columns[nl_cols.isna().any()].tolist()
if len(nan_columns) > 0:
nan_columns = [str(col) for col in nan_columns]
cols_str = ', '.join(nan_columns) if len(nan_columns) > 1 else nan_columns[0]
Contributor

I don't think we need the if/else here: `', '.join(nan_columns)` already evaluates to `nan_columns[0]` when the list has a single element.
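The reviewer's observation is easy to confirm: joining a single-element list returns that element unchanged, so the branch is redundant. A quick illustration:

```python
# ', '.join on a one-element list is a no-op concatenation
assert ", ".join(["col_a"]) == "col_a"

# and it behaves as expected for the multi-column case
assert ", ".join(["col_a", "col_b"]) == "col_a, col_b"
```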

Contributor Author

nice! good catch 😄

}

X = infer_feature_types(X)
nl_cols = _convert_woodwork_types_wrapper(X.select("natural_language").to_dataframe())
Contributor

I think it's possible to do this with X.describe()['nan_count'] to save us the conversion.

Contributor Author

Cool, will add! I'm going to use describe_dict() since it's more straightforward for me to use a dict instead of a DT.
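Whichever Woodwork accessor ends up being used, the core of the check reduces to finding columns with any missing values and joining their names into the warning message. A simplified pandas-only sketch (column names are hypothetical, not from the PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text_a": [np.nan, "string_that_is_long_enough_for_natural_language"],
    "text_b": ["string_that_is_long_enough_for_natural_language",
               "another_sufficiently_long_string"],
})

# Columns containing at least one missing value
nan_columns = df.columns[df.isna().any()].tolist()
cols_str = ", ".join(str(col) for col in nan_columns)
print(cols_str)  # prints "text_a"
```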

Contributor

@angela97lin angela97lin left a comment


LGTM! Just left some nitpicks hehe

Comment on lines +32 to +33
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
Contributor

Hehe, nit-pick (I don't think we did a good job of this before either): maybe explicitly set some columns as natural language columns? If WW ever changes its type inference, these might break and no longer be inferred as natural language.

Contributor Author

good call - I'll add it to the docstring and the explicit WW tests!

DataCheckMessageCode,
NaturalLanguageNaNDataCheck
)

Contributor

General comment, we check using None here. While it shouldn't make a difference in behavior, can we also check with np.nan?

Contributor Author

Can't use np.nan here since there will be conflicting types, but I will go ahead and try with an empty string.
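For context on why `None` is a reasonable stand-in: in an object-dtype pandas column, both `None` and `np.nan` register as missing, so the check should behave identically either way (a small illustration, not code from the PR):

```python
import numpy as np
import pandas as pd

with_none = pd.Series(["some text", None])
with_nan = pd.Series(["some text", np.nan])

# Both missing-value markers are detected the same way
assert with_none.isna().tolist() == [False, True]
assert with_nan.isna().tolist() == [False, True]
```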

@jeremyliweishih jeremyliweishih merged commit 40e8bf2 into main Apr 14, 2021
chukarsten added a commit that referenced this pull request Apr 20, 2021
…skEngine`` #1975.

- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta. #2118
- Fixed documentation demos #2139
- Updated test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148
.. warning::
- Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassificationSampler``, and Undersampler #2113
- Deleted the "errors" key from automl results #1975
- Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
@chukarsten chukarsten mentioned this pull request Apr 20, 2021
@freddyaboulton freddyaboulton deleted the js_2000_nlp_nan branch May 13, 2022 15:02
Development

Successfully merging this pull request may close these issues.

LSA component errors due to nans in natural language feature