Skip to content

Conversation

@DeanEby
Copy link
Collaborator

@DeanEby DeanEby commented Feb 14, 2025

A very simple change to have our ngram tokenizer split on whitespace instead of punctuation

@KristijanArmeni KristijanArmeni self-assigned this Feb 27, 2025
@KristijanArmeni KristijanArmeni self-requested a review February 27, 2025 15:10
Copy link
Collaborator

@KristijanArmeni KristijanArmeni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DeanEby! This looks straightforward. Just a small clarification (since it was not explained in the PR/issue):

Splitting on whitespace will keep punctuation symbols to tag along with tokens/words if there's no whitespace. See:

>>> import re
>>> re.split(" +", "example, with punctuation.")
['example,', 'with', 'punctuation.']

Is that OK/expected in the context of analysis, i.e. are we fine with example and example, being treated as distinct tokens? Or is that taken care of elsewhere?

Cf:

>>> re.split(r"\W+", "example, with punctuation.")
['example', 'with', 'punctuation', '']

@KristijanArmeni KristijanArmeni added enhancement Enhancement of the code, not introducing new features. domain: datasci Affects the analyzer's data science logic labels Feb 27, 2025
@DeanEby
Copy link
Collaborator Author

DeanEby commented Feb 27, 2025

@KristijanArmeni I think that falls under expected for now since we are looking for explicitly copy-pasted text. It is definitely something to think about and potentially implement as a future feature, but I think splitting on whitespace is better than the existing tokenization we are doing.

Copy link
Collaborator

@KristijanArmeni KristijanArmeni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. Thanks @DeanEby. Fine to merge on my end.

@DeanEby DeanEby merged commit 167790c into civictechdc:main Feb 27, 2025
1 check passed
VEDA95 added a commit that referenced this pull request Mar 30, 2025
DeanEby added a commit that referenced this pull request Mar 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: datasci Affects the analyzer's data science logic enhancement Enhancement of the code, not introducing new features.

Projects

Development

Successfully merging this pull request may close these issues.

2 participants