Edited Slovenian stop words list #9707

Merged
merged 1 commit into from Nov 22, 2021

Conversation

richardpaulhudson
Contributor

Tidied up the Slovenian stop words list, which contained various words that are not normally considered stop words, such as the names of months.
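
To illustrate why the month names matter, here is a minimal sketch of stop word filtering before and after such a cleanup. The stop word sets, the token list, and the helper `remove_stop_words` are hypothetical and chosen only for illustration; they are not the contents of the actual Slovenian list or part of spaCy's API.

```python
# Hypothetical sample data: these small sets are illustrative only and are
# NOT the contents of spaCy's actual Slovenian stop word list.
STOP_WORDS_BEFORE = {"in", "je", "na", "za", "marec", "oktober"}

# Slovenian month names, used here to spot content words in the stop list.
MONTH_NAMES = {
    "januar", "februar", "marec", "april", "maj", "junij",
    "julij", "avgust", "september", "oktober", "november", "december",
}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


# Tidying step: drop the month names from the stop list.
STOP_WORDS_AFTER = STOP_WORDS_BEFORE - MONTH_NAMES

tokens = ["Marec", "je", "na", "seznamu"]  # illustrative tokens
before = remove_stop_words(tokens, STOP_WORDS_BEFORE)
after = remove_stop_words(tokens, STOP_WORDS_AFTER)

print(before)  # the month name is silently dropped
print(after)   # the month name survives filtering
```

With the untidied list, the month name disappears along with the function words; with the tidied list, only the function words are removed.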

polm added the "v3.3" (Related to v3.3) and "lang / sk" (Slovak language data and models) labels on Nov 20, 2021
richardpaulhudson added the "lang / sl" (Slovenian language data and models) label and removed the "lang / sk" (Slovak language data and models) label on Nov 20, 2021
svlandeg added the "enhancement" (Feature requests and improvements) label on Nov 22, 2021
svlandeg merged commit a1f2541 into develop on Nov 22, 2021
svlandeg deleted the slovenian-stop-words branch on November 22, 2021 08:46
danieldk pushed a commit to danieldk/spaCy that referenced this pull request Dec 16, 2021
@danieldk danieldk mentioned this pull request Jan 13, 2022
3 tasks
adrianeboyd pushed a commit that referenced this pull request Jan 17, 2022
* Edited Slovenian stop words list (#9707)

* Noun chunks for Italian (#9662)

* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed test case

* fixed determiner gender

* cleaned left over

* added example with apostrophe

* French NP review (#9667)

* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix

* Add Japanese kana characters to default exceptions (fix #9693) (#9742)

This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included. Because their actual use is rare I omitted
them for now, though maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension

* Remove NER words from stop words in Norwegian (#9820)

The default stop words for Norwegian Bokmål (nb) in spaCy contain important entities (e.g. France, Germany, Russia, Sweden and USA, as well as police district), important units of time (e.g. months and days of the week), and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in #3052 (comment) and comment #3052 (comment)

* Bump sudachipy version

* Update sudachipy versions

* Bump versions

Bumping to the most recent dictionary just to keep things current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
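
The kana commit above separates the main kana from supplemental blocks outside the BMP. As a rough sketch of that distinction, the snippet below maps a character to its Unicode kana block and checks whether it lies in the BMP. It assumes "main kana" refers to the Hiragana and Katakana blocks, which the commit message does not spell out; the names `KANA_BLOCKS`, `kana_block` and `in_bmp` are hypothetical helpers, not spaCy code.

```python
# Unicode block ranges (inclusive). Treating Hiragana and Katakana as the
# "main kana" is an assumption; the commit message does not list ranges.
KANA_BLOCKS = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "Kana Extended-B": (0x1AFF0, 0x1AFFF),
    "Kana Supplement": (0x1B000, 0x1B0FF),
    "Kana Extended-A": (0x1B100, 0x1B12F),
    "Small Kana Extension": (0x1B130, 0x1B16F),
}


def kana_block(char):
    """Return the name of the kana block containing char, or None."""
    cp = ord(char)
    for name, (lo, hi) in KANA_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def in_bmp(char):
    """True if char is in the Basic Multilingual Plane (U+0000 to U+FFFF)."""
    return ord(char) <= 0xFFFF


for ch in ["あ", "ア", chr(0x1B001), chr(0x1B150)]:
    print(f"U+{ord(ch):04X}", kana_block(ch), "BMP" if in_bmp(ch) else "non-BMP")
```

In this sketch the Hiragana and Katakana characters fall inside the BMP, while the characters from the omitted blocks do not.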
Labels
enhancement (Feature requests and improvements), lang / sl (Slovenian language data and models), v3.3 (Related to v3.3)

3 participants