Fix delimiter inconsistency in Natural Language primitives #2423

Merged
gsheni merged 27 commits into main from fix-delimiter-inconsistency
Jan 3, 2023

@sbadithe (Contributor) commented Dec 20, 2022

closes #2419
closes #2425
closes #2426

Changes:

After this change, the default delimiter for every primitive is whitespace only, unless manually overridden. Primitives therefore keep tokens such as "alteryx.com" and "1,000" as single words unless explicitly told otherwise.

As a result, the test cases for some primitives were adjusted. NumberOfCommonWords also changes: words are now split on whitespace and then stripped of punctuation before being checked for membership in the common word set.
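
The NumberOfCommonWords behavior described above can be sketched in plain Python. This is only an illustration, not the primitive's actual implementation: the function name `count_common_words`, the tiny `COMMON_WORDS` set, and the lowercasing step are all assumptions made for the example.

```python
import string

# Hypothetical stand-in for the library's common-word set.
COMMON_WORDS = {"the", "is", "a", "this", "and", "of"}


def count_common_words(text, common_words=COMMON_WORDS):
    """Count tokens that are common words after whitespace splitting."""
    if text is None:
        return None
    count = 0
    # str.split() with no arguments splits on any run of whitespace,
    # including spaces, tabs, and newlines.
    for token in text.split():
        # Strip leading and trailing punctuation only; punctuation inside
        # a token (as in "alteryx.com") is left untouched.
        # Lowercasing is an assumption of this sketch.
        word = token.strip(string.punctuation).lower()
        if word in common_words:
            count += 1
    return count
```

Because punctuation is stripped only from the ends of each token, "test," matches "test" while "alteryx.com" stays a single token that is not in the set.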

@gsheni changed the title from "fix delimiter inconsistency in NatLang primitives" to "Fix delimiter inconsistency in Natural Language primitives" Dec 20, 2022
The default delimiters include [-.!?]\n\t

Examples:
>>> x = ['This is a test file', 'This is second line', 'third line $1,000', None]
@sbadithe (Contributor Author) commented Dec 22, 2022

The previous primitive behavior was to break up "$1,000" into "$1" and "000". After this change, it stays as one word.
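
A rough sketch of the difference on the example string from the docstring. The `OLD_STYLE_DELIMITERS` pattern is a hypothetical punctuation-aware regex chosen to reproduce the split described in the comment; it is not copied from the library's constants.

```python
import re

# Hypothetical pattern that treats punctuation as a delimiter,
# illustrating the kind of splitting the comment describes.
OLD_STYLE_DELIMITERS = r"[\s,.!?;\-]+"


def split_old(text):
    """Split on whitespace and punctuation, dropping empty pieces."""
    return [word for word in re.split(OLD_STYLE_DELIMITERS, text) if word]


def split_new(text):
    """Split on whitespace only."""
    return text.split()


line = "third line $1,000"
print(split_old(line))  # ['third', 'line', '$1', '000']
print(split_new(line))  # ['third', 'line', '$1,000']
```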

codecov bot commented Dec 22, 2022

Codecov Report

Merging #2423 (17c1527) into main (6f20cb3) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2423   +/-   ##
=======================================
  Coverage   99.44%   99.44%           
=======================================
  Files         340      340           
  Lines       20870    20870           
=======================================
  Hits        20755    20755           
  Misses        115      115           
Impacted Files Coverage Δ
...s/standard/transform/natural_language/constants.py 100.00% <100.00%> (ø)
...d/transform/natural_language/median_word_length.py 100.00% <100.00%> (ø)
...ansform/natural_language/number_of_common_words.py 100.00% <100.00%> (ø)
...ansform/natural_language/number_of_unique_words.py 100.00% <100.00%> (ø)
...form/natural_language/number_of_words_in_quotes.py 100.00% <100.00%> (ø)
...rd/transform/natural_language/total_word_length.py 100.00% <100.00%> (ø)
...ge_primitives_tests/test_number_of_common_words.py 100.00% <100.00%> (ø)
...ge_primitives_tests/test_number_of_unique_words.py 100.00% <100.00%> (ø)
...anguage_primitives_tests/test_total_word_length.py 100.00% <100.00%> (ø)
...ests/primitive_tests/test_feature_serialization.py 100.00% <100.00%> (ø)
... and 1 more

@sbadithe sbadithe marked this pull request as ready for review December 22, 2022 17:11
@sbadithe sbadithe requested a review from a team December 22, 2022 17:16
@rwedge (Contributor) left a comment

For NumberOfUniqueWords should we test for \t somewhere?
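
One possible shape for such a test, sketched against a stand-in function rather than the real primitive. `count_unique_words` is a hypothetical helper that assumes whitespace-only splitting with no case normalization; the actual NumberOfUniqueWords API and options are not shown here.

```python
def count_unique_words(text):
    """Count distinct whitespace-separated tokens (a stand-in, not the library primitive)."""
    if text is None:
        return None
    return len(set(text.split()))


def test_tab_is_a_delimiter():
    # A tab separates words just like a space does.
    assert count_unique_words("red\tblue\tred") == 2
    # Mixed tab, space, and newline delimiters in one string.
    assert count_unique_words("red\tblue green\nred") == 3
```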

@sbadithe sbadithe force-pushed the fix-delimiter-inconsistency branch from b0af6b1 to 031daee Compare December 23, 2022 05:13
@gsheni gsheni merged commit ede3671 into main Jan 3, 2023
@gsheni gsheni deleted the fix-delimiter-inconsistency branch January 3, 2023 17:05
@thehomebrewnerd thehomebrewnerd mentioned this pull request Jan 5, 2023

3 participants