Create chinese.stop #108838
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. Before a member of our team reviews your PR, I have some potential action items for you:
I was unable to automatically find a reviewer. You can try CCing one of the following members:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Thank you for your contribution! We appreciate it! If you could please apply the following diff to your PR, it should help pass CI:
|
@michae2 Please keep the list sorted, i.e., |
Thank you for the reminder. Do I need to withdraw the PR and resubmit? |
When dealing with Chinese, if sorted by first letter, is it sorted by the first letter of pinyin pronunciation? |
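On the sorting question above, a minimal sketch may help. It assumes a plain code-point sort, which is what Python's built-in `sorted` does for strings; which sort order the reviewers had in mind for the list is not stated in this thread. The three sample words and the pinyin readings in the comments are illustrative only.

```python
# Sample words with their pinyin readings (illustrative only):
#   的 = "de" (U+7684), 一 = "yi" (U+4E00), 是 = "shi" (U+662F)
words = ["的", "一", "是"]

# Python compares strings by Unicode code point, not by pronunciation.
by_code_point = sorted(words)

# Show each word next to its code point to make the ordering visible.
for w in by_code_point:
    print(w, hex(ord(w)))

# Ordering by pinyin would instead give: 的 (de), 是 (shi), 一 (yi).
```

A code-point sort is deterministic and needs no language-specific data, so it gives a stable order for a word list even though it does not follow pinyin.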
@zhenruyan Looks like your PR does not update that file. |
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Can you please squash these 3 commits into one commit? We should be good to go after that.
I am curious about which words should be considered "stopwords". I saw that our current stopwords are copied from Postgres (this commit). Looking at the Chinese stopwords proposed in this PR, I wonder whether they are insignificant enough to be considered stopwords. |
@zhenruyan do you mind providing some reasoning behind this? |
I am more than happy to do so. |
On why the Chinese stop words were included: I'm building a general-purpose search engine platform using pg/cockroach. I was pleasantly surprised to find that it's much more flexible than Elasticsearch, which is great for changing real-world business needs. But the experience isn't completely out-of-the-box, so I'm hoping to patch that up. On the definition of Chinese stop words: when PostgreSQL uses the plug-in mechanism to add a Chinese word-splitting tool, a similar list of Chinese stop words is added. These stop words have been verified on tens of millions of Chinese blog posts. |
Add Chinese stop words, and compile dependency paths.
5d2bec4 to 9c28b04
9c28b04 to 5d33a4a
Can you give me a link to where those Chinese stop words were added? Is it in a commit or some web page? Thank you very much! |
https://www.oschina.net/project We are China's open source community. Both our open source software index directory and our blog tags use these Chinese stop words in the word segmentation tool. In the near future, we will bring search online based on cockroachdb or pg (the choice will depend on a later ab stress test). |
Thanks for the reply. My only concern/question is that the list of stopwords you provided does not really match my intuitive understanding of Chinese stopwords. A colleague found this article, https://towardsdatascience.com/chinese-natural-language-pre-processing-an-introduction-995d16c2705f, which talks a bit about Chinese stopwords:
The Chinese stopwords provided there (https://pypi.org/project/stopwordsiso/) match my intuitive understanding of Chinese stopwords more closely. I suggest we use that as the source for the Chinese stopwords list, and you can modify it locally to best suit your own application and use case. What do you think? |
It's good to have a standard that can be evaluated, and I think it's great; it doesn't conflict with my wish for a better Chinese search experience. I will resubmit the chinese.stop commit. |
5d33a4a to 384d4a5
Note: The list of stopwords is copied from https://github.com/stopwords-iso/stopwords-zh/blob/master/stopwords-zh.txt @zhenruyan Please add a Update: |
Yes, please update the commit to add |
(replace chinese stop words mirror : https://github.com/stopwords-iso/stopwords-zh/blob/master/stopwords-zh.txt ) Release note: None Epic: None
384d4a5 to 04ae35f
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
@Xiang-Gu @rickystewart Thank you very much for your patience. I resubmitted the commit. |
Thanks! bors r=rickystewart,Xiang-Gu |
Build succeeded: |
add tsearch chinese stopwords
I'm considering using cockroachdb for Chinese searches, so I hope I can contribute something.
Release note: None
Epic: None
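
For readers unfamiliar with what a stop-word file such as chinese.stop is for, the following is a minimal sketch, not CockroachDB's actual implementation. It assumes the Postgres-style file format of one word per line, and it assumes the text has already been split into tokens by some external word-segmentation tool. The three-word sample and the helper names `load_stop_words` and `remove_stop_words` are hypothetical and are not taken from the real file or from any library.

```python
def load_stop_words(text):
    """Parse stop-word file contents: one word per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set, keeping order."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical three-word excerpt; a real stop-word file is much longer.
sample_stop_file = "的\n了\n是\n"
stop_words = load_stop_words(sample_stop_file)

# Tokens are assumed to be pre-segmented; segmentation itself is out of scope.
tokens = ["我", "的", "数据库", "是", "分布式", "的"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Holding the words in a set keeps each membership check cheap, which matters when the list has hundreds of entries and every token of every document is checked against it.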