Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tags and mute words #2968

Merged
merged 52 commits into from
Feb 27, 2024
Merged

Add tags and mute words #2968

merged 52 commits into from
Feb 27, 2024

Conversation

estrattonbailey
Copy link
Member

No description provided.

kisaragi-hiu and others added 30 commits February 22, 2024 17:36
* Add bare minimum hashtags support

As atproto/api already parses hashtags, this is as simple as hooking it
up like link segments.

This is "bare minimum" because:

- Opening hashtag "#foo" is actually just a search for "foo" right now
  to work around #2491.
- There is no integration in the composer. This hasn't stopped people
  from using hashtags already, and can be added later.
- This change itself only had to hook things up - thank you for having
  already put the hashtag parsing in place.

* Remove workaround for hash search not working now that it's fixed

if (mutedWordNoPunc === wordNoPunc) return true
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enable support for phrases, I think what we can do is this:

  • Instead of splitting the text into words, run your punctuation strip on it
  • Convert the muteword into a regex like this (havent tested this regex): new RegExp('[\s\t\n\r\f\,\.\;\!\?\_\-\"\(\)]' + stripPunctuation(muteWord) + '[\s\t\n\r\f\,\.\;\!\?\_\-\"\(\)]')
  • Just run the muteword regexes against the punctuation-stripped post text

Copy link
Contributor

@mary-ext mary-ext Feb 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link! Will have a look and compare 🙏

Copy link
Collaborator

@pfrazee pfrazee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work Eric. Really just into the tweaking stage.

Copy link
Contributor

@haileyok haileyok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything here looks solid. Test cases look good, giving it a spin for the new few days will probably turn up something but I say let's :shipit:

@pfrazee pfrazee merged commit 58aaad7 into main Feb 27, 2024
4 checks passed
@pfrazee pfrazee deleted the eric/tags branch February 27, 2024 04:33
@pfrazee
Copy link
Collaborator

pfrazee commented Feb 27, 2024

Hot diggity dog

@dolciss
Copy link
Contributor

dolciss commented Feb 28, 2024

Glad to see this PR merged🎉, but forgive me for commenting when it is already closed.

Mute words appear to split the body text at the following locations,

const words = postText.split(REGEX.WORD_BOUNDARY)

However, this often fails to mute words in languages that are not separated by spaces, such as Japanese.
it(`match: single character 希`, () => {
/**
* @see https://bsky.app/profile/mukuuji.bsky.social/post/3klji4fvsdk2c
*/
const rt = new RichText({
text: `改善希望です`,
})
rt.detectFacetsWithoutResolution()
const match = hasMutedWord(
[{value: '希', targets: ['content']}],
rt.text,
rt.facets,
[],
)
expect(match).toBe(true)
})

There are testing mute of a single kanji character here, but we often mute Japanese characters on a word-by-word basis as well. For example, "ハッシュタグ"(means Hashtags) in "次のバージョンでハッシュタグが実装される!"
It would be great if you could consider a non word-by-word word mute option!

@estrattonbailey
Copy link
Member Author

@dolciss no problem at all! Appreciate you bringing this to our attention. We've considered adding options for allowing users to select which type of matching they'd like to do, and we may continue that investigation.

For now, we pushed a new change just now that checks for the post language, and if it's a language that doesn't use spaces — or uses spaces in ways other than separating words — then we check for the string directly instead of word-by-word.

In English, this direct matching would return false-positives e.g. AI would mute the word brain.

If you don't mind, may I ask: with this new change, do you happen to know if this type of partial-match issue will be a concern in Japanese as well?

@dolciss
Copy link
Contributor

dolciss commented Feb 29, 2024

@estrattonbailey Thank you very much for your improve at #3018 !

If you don't mind, may I ask: with this new change, do you happen to know if this type of partial-match issue will be a concern in Japanese as well?

Yes, partial-match problems are a concern in Japanese as well. (e.g. ネット would mute the word インターネット ネットワーク)
However, solving this problem is distressing because it requires complex processing, such as morphological analysis.
(I am sorry, but I am not that familiar with morphological analysis.)

// rest of the checks are for `content` only
if (!mute.targets.includes('content')) continue
// single character, has to use includes
if (mutedWord.length === 1 && postText.includes(mutedWord)) return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks the utf16 code-unit length, which may give an unexpected result for certain international scripts and/or emojis (maybe it's fine? I'm not sure I totally understand the purpose of this check)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout. Yeah the original intention was to handle international scripts with this. In response to the comments above I added a check a couple weeks ago for a few languages that commonly don't use spaces as well.

But what you've called out here is still an issue: trying to match a single character (potentially with length > 1) in a language not part of our exceptions may fail unless we can split the text at the word boundary. Interesting...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a practical test-case if someone mutes "💩", then it it'll match when it's on its own, but not as part of a "word"
image

Muting an emoji is probably quite a common thing to want to do, and I think counting graphemes would give the expected behavior in most cases (which is normally an annoying thing to do, but I know the app already has grapheme counting logic in it elsewhere)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants