Add Japanese kana characters to default tokenizer exceptions (fix #9693) #9742

polm · 2021-11-25T04:57:36Z

This adds the main kana, the phonetic characters used in Japanese, to the default tokenizer exceptions. Without them in the exception list, kana that appear in non-Japanese text behave strangely and can combine with punctuation.
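To make the failure mode concrete, here is a minimal standard-library sketch. It is not spaCy's actual suffix rule: the `split_trailing_period` helper and the `LATIN` and `KANA` character classes are hypothetical, and the kana ranges assume that "main kana" means the Hiragana and Katakana blocks. It only illustrates how a rule that splits punctuation after a known letter class leaves punctuation attached when the preceding character is not in that class.

```python
import re

# Hypothetical character classes, written as regex range fragments.
LATIN = "A-Za-z"
# Assumption: "main kana" = Hiragana (U+3040-U+309F) + Katakana (U+30A0-U+30FF).
KANA = "\u3040-\u309F\u30A0-\u30FF"


def split_trailing_period(token, alpha):
    """Split a final period off `token`, but only when the character
    before it belongs to the `alpha` character class."""
    match = re.search(rf"(?<=[{alpha}])\.$", token)
    if match:
        return [token[: match.start()], "."]
    return [token]


print(split_trailing_period("sushi.", LATIN))         # ['sushi', '.']
print(split_trailing_period("すし.", LATIN))           # ['すし.']
print(split_trailing_period("すし.", LATIN + KANA))    # ['すし', '.']
```

With only Latin letters in the class the period stays glued to the kana; once the kana ranges are part of the class, the same rule splits it off.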

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included. Because they are rarely used in practice I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
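For reference, the block boundaries can be sketched as a small lookup. The code point ranges below come from the Unicode block definitions rather than from this PR, and treating Hiragana and Katakana as the "main" blocks is an assumption; the `kana_block` helper is hypothetical.

```python
# Kana blocks inside the Basic Multilingual Plane (assumed to be the "main" kana).
BMP_KANA = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
}

# Supplemental kana blocks, all above U+FFFF (outside the BMP).
SUPPLEMENTARY_KANA = {
    "Kana Extended-B": (0x1AFF0, 0x1AFFF),
    "Kana Supplement": (0x1B000, 0x1B0FF),
    "Kana Extended-A": (0x1B100, 0x1B12F),
    "Small Kana Extension": (0x1B130, 0x1B16F),
}


def kana_block(ch):
    """Return the name of the kana block containing `ch`, or None."""
    code_point = ord(ch)
    for name, (start, end) in {**BMP_KANA, **SUPPLEMENTARY_KANA}.items():
        if start <= code_point <= end:
            return name
    return None


print(kana_block("あ"))          # Hiragana
print(kana_block("カ"))          # Katakana
print(kana_block(chr(0x1B000)))  # Kana Supplement
```

Because every supplemental range starts above U+FFFF, a regex character class for them would need `\U0001B000`-style escapes rather than four-digit `\u` escapes.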

I didn't add a test; I'm not sure the speed tradeoff is worth it, since other similar classes don't seem to be tested.

Types of change

tokenizer exception adjustment

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd · 2021-11-25T07:27:10Z

I think this should go in develop. For the history, we can probably just wait until develop gets updated from master and do the base switch back and forth again. I think it's okay without additional tests.

polm added the feat / tokenizer and lang / ja labels Nov 25, 2021

polm linked an issue Nov 25, 2021 that may be closed by this pull request

Kana is not included in the default suffix tokenization rules for English models #9693

Closed

adrianeboyd changed the base branch from master to develop November 25, 2021 07:25

polm added the v3.3 label Nov 28, 2021

svlandeg changed the base branch from develop to master November 30, 2021 13:01

svlandeg changed the base branch from master to develop November 30, 2021 13:01

svlandeg merged commit b4d526c into explosion:develop Nov 30, 2021

danieldk mentioned this pull request Jan 13, 2022

Update develop with master #10045

Merged
