support tokenize while keeping common censor chars #767

haileyok · 2024-09-29T09:28:12Z

In some cases it's useful to tokenize a string without splitting on certain non-letter chars such as #, *, -, or _. Right now Tokenize splits on all of these, which makes some matching difficult.

This just adds a second function, TokenizeTextSkippingCensorChars, for those particular use cases. It also adds TokenizeTextWithRegex, so that other cases can be covered easily in the future if they arise.

TUNISIA-user · 2024-11-05T07:55:43Z

hey

haileyok added 3 commits September 29, 2024 02:24

add tokenize while keeping common censor chars

0a82c6c

rename

4da4e79

clean

c8238e1

bnewbold approved these changes Nov 5, 2024

bnewbold merged commit 25c72f5 into main Nov 5, 2024
7 checks passed

bnewbold deleted the hailey/tokenize-skip-censor-chars branch November 5, 2024 05:50
