Create chinese.stop #108838

Merged
merged 1 commit into from
Aug 25, 2023

Conversation

zhenruyan
Contributor

@zhenruyan zhenruyan commented Aug 16, 2023

add tsearch chinese stopwords

I'm considering using cockroachdb for Chinese searches, so I hope I can contribute something.

Release note: None
Epic: None

@blathers-crl

blathers-crl bot commented Aug 16, 2023

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

  • A person you worked with closely on this PR.
  • The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
  • Join our community slack channel and ask on #contributors.
  • Try to find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

cockroach-teamcity commented Aug 16, 2023

CLA assistant check
All committers have signed the CLA.

@blathers-crl blathers-crl bot added the O-community (Originated from the community) and X-blathers-untriaged (blathers was unable to find an owner) labels Aug 16, 2023
@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich removed the X-blathers-untriaged (blathers was unable to find an owner) label Aug 17, 2023
@yuzefovich yuzefovich requested review from a team and michae2 and removed request for a team August 17, 2023 05:25
@michae2
Collaborator

michae2 commented Aug 17, 2023

Thank you for your contribution! We appreciate it!

If you could please apply the following diff to your PR, it should help pass CI:

diff --git a/pkg/util/tsearch/BUILD.bazel b/pkg/util/tsearch/BUILD.bazel
index b0fde092e8..6adb1cd2e4 100644
--- a/pkg/util/tsearch/BUILD.bazel
+++ b/pkg/util/tsearch/BUILD.bazel
@@ -30,6 +30,7 @@ go_library(
         "stopwords/spanish.stop",
         "stopwords/swedish.stop",
         "stopwords/turkish.stop",
+        "stopwords/chinese.stop",
     ],

@rickystewart
Collaborator

@michae2 Please keep the list sorted, i.e., chinese.stop should be in alphabetical order and not at the end of the list.
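
To illustrate the ordering being asked for, here is a minimal Go sketch that places a new file name at its alphabetical position in an already sorted list. The `insertSorted` helper and the three-entry list are hypothetical and only mirror the diff context above; the real list lives in `pkg/util/tsearch/BUILD.bazel` and is edited by hand.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted returns a copy of names with name added at its alphabetical
// position. It assumes names is already sorted and does not add duplicates.
// This helper is hypothetical and is not part of CockroachDB.
func insertSorted(names []string, name string) []string {
	// Binary search for the first index whose entry is >= name.
	i := sort.SearchStrings(names, name)
	if i < len(names) && names[i] == name {
		return append([]string(nil), names...)
	}
	out := make([]string, 0, len(names)+1)
	out = append(out, names[:i]...)
	out = append(out, name)
	out = append(out, names[i:]...)
	return out
}

func main() {
	// Hypothetical subset of the file list; only the entries visible in the
	// diff context are included here.
	files := []string{
		"stopwords/spanish.stop",
		"stopwords/swedish.stop",
		"stopwords/turkish.stop",
	}
	files = insertSorted(files, "stopwords/chinese.stop")
	for _, f := range files {
		fmt.Println(f)
	}
}
```

With these inputs, `stopwords/chinese.stop` lands before `stopwords/spanish.stop` rather than after `stopwords/turkish.stop`.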

@zhenruyan
Contributor Author

Thank you for your contribution! We appreciate it!

If you could please apply the following diff to your PR, it should help pass CI:

diff --git a/pkg/util/tsearch/BUILD.bazel b/pkg/util/tsearch/BUILD.bazel
index b0fde092e8..6adb1cd2e4 100644
--- a/pkg/util/tsearch/BUILD.bazel
+++ b/pkg/util/tsearch/BUILD.bazel
@@ -30,6 +30,7 @@ go_library(
         "stopwords/spanish.stop",
         "stopwords/swedish.stop",
         "stopwords/turkish.stop",
+        "stopwords/chinese.stop",
     ],

Thank you for the reminder.

Do I need to withdraw the PR and resubmit?

@zhenruyan
Contributor Author

@michae2 Please keep the list sorted, i.e., chinese.stop should be in alphabetical order and not at the end of the list.

When dealing with Chinese, if sorted by first letter, is it sorted by the first letter of pinyin pronunciation?

@zhenruyan
Contributor Author

@michae2 #108983

I submitted it. I think I understand the meaning. The file names should be sorted alphabetically.

At first, I thought it was the internal content of the file.

@rickystewart
Collaborator

@zhenruyan Looks like your PR does not update that file.

@blathers-crl

blathers-crl bot commented Aug 21, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review comments" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Collaborator

@rickystewart rickystewart left a comment

Can you please squash these 3 commits into one commit? We should be good to go after that.

@Xiang-Gu
Contributor

I am curious about what words should be considered as "stopwords"? I saw that our current stopwords are copied from Postgres (this commit).

I looked at the Chinese stopwords proposed in this PR and wonder "ahh, are they insignificant enough to be considered 'stopwords'"?

@Xiang-Gu
Contributor

@zhenruyan do you mind providing some reasoning behind this?

@zhenruyan
Contributor Author

Can you please squash these 3 commits into one commit? We should be good to go after that.

I am more than happy to do so.

@zhenruyan
Contributor Author

@zhenruyan do you mind providing some reasoning behind this?

On why the Chinese stop words were included:

I'm building a general-purpose search engine platform using pg/cockroach. I was pleasantly surprised to find that it's much more flexible than Elasticsearch, which matters a lot when real-world business requirements keep changing. But the experience isn't completely out-of-the-box, so I'm hoping to patch that up.

On the definition of Chinese stop words:

When PostgreSQL is extended through its plug-in mechanism with a Chinese word-segmentation tool, a similar Chinese stop word list is added. These stop words have been validated against tens of millions of Chinese blog posts.

Contributor Author

@zhenruyan zhenruyan left a comment

Add Chinese stop words, and compile dependency paths.

@blathers-crl

blathers-crl bot commented Aug 22, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review comments" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl

blathers-crl bot commented Aug 22, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@Xiang-Gu
Contributor

When PostgreSQL is extended through its plug-in mechanism with a Chinese word-segmentation tool, a similar Chinese stop word list is added. These stop words have been validated against tens of millions of Chinese blog posts.

Can you give me a link to where those Chinese stop words were added? Is it in a commit or some web page? Thank you very much!

@zhenruyan
Contributor Author

When PostgreSQL is extended through its plug-in mechanism with a Chinese word-segmentation tool, a similar Chinese stop word list is added. These stop words have been validated against tens of millions of Chinese blog posts.

Can you give me a link to where those Chinese stop words were added? Is it in a commit or some web page? Thank you very much!

https://www.oschina.net/project
and
https://www.oschina.net/blog

We are China's open source community. Both the open source software index directory and the blog tags rely on word segmentation, and both use the Chinese stop words that ship with the word segmentation tool. In the near future, search based on cockroachdb or pg will go online (which of the two will depend on a later ab stress test).

@Xiang-Gu
Contributor

Thanks for the reply. My only concern/question is that the list of stopwords you provided does not really match my intuitive understanding of Chinese stopwords.

A colleague found this website, https://towardsdatascience.com/chinese-natural-language-pre-processing-an-introduction-995d16c2705f, which talks a bit about Chinese stopwords:

In NLP, stop words are “meaningless” words that make the data too noisy or ambiguous. In our example sentence, the stop words are 是, 在 and 的. We could manually filter them out, but that’s also very tedious. Just like with English, there are pre-set lists of stop words out there. There are about 119 official stop words in Chinese, and they can be viewed on this website. Instead of manually removing them, could import the stopwordsiso package for a full list of Chinese stop words. More information can be found here. And with this, we can easily create code to filter out any stop words in large text data.

Looking at the Chinese stopwords provided there (https://pypi.org/project/stopwordsiso/), they match my intuitive understanding of Chinese stopwords more closely. I suggest we use that as the source for the Chinese stopwords list, and you can modify it locally to best suit your own application and use case. What do you think?
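
For readers unfamiliar with the mechanism, here is a minimal Go sketch of how a one-word-per-line stop word file can be used to filter tokens. It is not CockroachDB's tsearch implementation: `loadStopWords` and `filterStopWords` are hypothetical helpers, the three stop words are the ones named in the quoted article, and the input is assumed to be already segmented into words, which for Chinese needs a separate segmentation step.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// loadStopWords parses stop-file contents (one word per line) into a set.
// Blank lines and surrounding whitespace are ignored.
func loadStopWords(contents string) map[string]struct{} {
	set := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		w := strings.TrimSpace(sc.Text())
		if w != "" {
			set[w] = struct{}{}
		}
	}
	return set
}

// filterStopWords returns the tokens that are not in the stop word set,
// preserving their original order.
func filterStopWords(tokens []string, stop map[string]struct{}) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, isStop := stop[t]; !isStop {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Assumption: a tiny stop list for illustration; a real list such as
	// stopwords-zh.txt is far longer.
	stop := loadStopWords("是\n在\n的\n")

	// Assumption: the sentence has already been split into words.
	tokens := []string{"我", "是", "在", "北京", "的", "学生"}

	fmt.Println(filterStopWords(tokens, stop)) // prints [我 北京 学生]
}
```

Swapping in a different stop list only changes the string passed to `loadStopWords`, which is why the choice of source list can be revisited without touching the filtering logic.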

@zhenruyan
Contributor Author

Thanks for the reply. My only concern/question is that the list of stopwords you provided does not really match my intuitive understanding of Chinese stopwords.

A colleague found this website, https://towardsdatascience.com/chinese-natural-language-pre-processing-an-introduction-995d16c2705f, which talks a bit about Chinese stopwords:

In NLP, stop words are “meaningless” words that make the data too noisy or ambiguous. In our example sentence, the stop words are 是, 在 and 的. We could manually filter them out, but that’s also very tedious. Just like with English, there are pre-set lists of stop words out there. There are about 119 official stop words in Chinese, and they can be viewed on this website. Instead of manually removing them, could import the stopwordsiso package for a full list of Chinese stop words. More information can be found here. And with this, we can easily create code to filter out any stop words in large text data.

Looking at the Chinese stopwords provided there (https://pypi.org/project/stopwordsiso/), they match my intuitive understanding of Chinese stopwords more closely. I suggest we use that as the source for the Chinese stopwords list, and you can modify it locally to best suit your own application and use case. What do you think?

It's good to have a standard that can be evaluated, and I think it's great. It doesn't conflict with my goal of a better Chinese search experience. I will resubmit the chinese.stop commit.

@blathers-crl

blathers-crl bot commented Aug 24, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@Xiang-Gu
Contributor

Xiang-Gu commented Aug 24, 2023

Note: The list of stopwords is copied from https://github.com/stopwords-iso/stopwords-zh/blob/master/stopwords-zh.txt

@zhenruyan Please add a Release note: None in your commit and you are good to merge! Thanks again for contributing.

Update: Epic: None as well

@rickystewart
Collaborator

Yes, please update the commit to add Release note: None and Epic: None.

@blathers-crl

blathers-crl bot commented Aug 25, 2023

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@zhenruyan
Contributor Author

Note: The list of stopwords is copied from https://github.com/stopwords-iso/stopwords-zh/blob/master/stopwords-zh.txt

@zhenruyan Please add a Release note: None in your commit and you are good to merge! Thanks again for contributing.

Update: Epic: None as well

Yes, please update the commit to add Release note: None and Epic: None.

@Xiang-Gu @rickystewart Thank you very much for your patience. I resubmitted the commit.

@rickystewart
Collaborator

Thanks!

bors r=rickystewart,Xiang-Gu

@craig
Contributor

craig bot commented Aug 25, 2023

Build succeeded:

@craig craig bot merged commit 87fc44f into cockroachdb:master Aug 25, 2023
7 checks passed
Labels
O-community Originated from the community
6 participants