Fix: Content Resizing: The "shorten" action does not detect the character length of Japanese text#581
Fix: Content Resizing: The "shorten" action does not detect the character length of Japanese text#581hbhalodia wants to merge 6 commits into
Conversation
…aracters and words
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message. To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
There was a problem hiding this comment.
Hi Team, Do we need to update the copy here to use characters? based on wordCountType and also on similar places?
IMO we should update that, waiting for feedback before applying the change.
Thanks,
There was a problem hiding this comment.
So a couple thoughts here.
We have a few places that use the @wordpress/wordcount package and now two separate places that use the wordCountType to determine if we should look at words or characters. Wondering if we extract all that out into a shared helper so we're not duplicating code?
In addition, we're inconsistent right now on if we look at words or characters (not just based on locale but just in general). For instance, we default to looking at words here and for Content Classification but I believe Content Summarization defaults to looking at characters.
I'm thinking it might be nice to standardize this across the plugin to avoid these inconsistencies. This may also simplify the code a bit if we decide to standardize on characters instead of words, avoiding the need to change things depending on locale.
Any thoughts on these?
There was a problem hiding this comment.
Also noting we have #545 open that adds a minimum content length to Editorial Notes and it uses characters.
There was a problem hiding this comment.
In practice and testing, I've seen AI reviews and feedback much more useful when there's a LOT more content available for it to utilize. However, I don't know that setting something like 500 words is realistic for all use cases for the AI plugin so perhaps something much lower like 100 words or 100 characters (noting neither are super specific so much as a round number). To some extent here, I don't care too much about the minimum length so much as we set something and use it across the plugin and allow that to be filterable. So let's just YOLO into something and can iterate if feedback shows we need to raise or lower that minimum threshold.
There was a problem hiding this comment.
As one approach, how about considering both words and characters? In other words, the label displays both words and characters.
_n(
'+%1$d word (+%2$d characters)',
'+%1$d words (+%2$d characters)',
magnitude,
'ai'
),
magnitude,
charMagnitudeFurthermore, the threshold check might be able to consider both of those.
if ( action === 'shorten' ) {
const wordCount = count( blockContent, 'words', {} );
const charCount = count( blockContent, 'characters_excluding_spaces', {} );
if ( wordCount < SHORTEN_MIN_WORDS && charCount < SHORTEN_MIN_CHARS ) {
noticesDispatch.createErrorNotice();
return;
}
}There was a problem hiding this comment.
We have a few places that use the @wordpress/wordcount package and now two separate places that use the wordCountType to determine if we should look at words or characters. Wondering if we extract all that out into a shared helper so we're not duplicating code?
Sure I would make a helper function, that checks for type.
This may also simplify the code a bit if we decide to standardize on characters instead of words, avoiding the need to change things depending on locale.
Not sure on this, but what as @t-hamano said, if looking at English it is mostly used words, while some languages such as Japanese uses characters (in that some languages have 2 parts, to include space or to not include space). So to make it consistent, we should go with characters.
But worth noting the suggestion about checking for both words and characters. But still, what if words are less and characters satisfies teh condition, then above code will still throw the notice.
As per me we should follow the gutenberg approach, define both words and characters min length individually. Based on the locale use either words or characters for counting. And this should be conistent across all the exoeriments, this, Content Classification and Content Summarization?
There was a problem hiding this comment.
I'm open to any approach here, what I'm wanting is to ensure we approach this in a way that works across all of our features. Right now we are inconsistent with what features require a minimum content length and which ones don't. And the ones that do require it do it differently.
I would like to see a PR that standardizes this across all features and does it in a reusable way to avoid code duplication. At a high-level I think this means a client-side helper that all features can use. A server-side helper (with a filter around the content length) that all features can use. And then integrating that within all features, so a feature remains in a disabled state until the minimum length is hit.
As far as what the minimum length should be and if that should be characters, words or both, I'm open to suggestions there.
There was a problem hiding this comment.
Hi @dkotter @t-hamano, I have updated the PR with changes to make it first reusable accross all the features.
I am closing this PR - #577, We can focus on to improve all this feature in this same PR.
Currently checking on if we can optimize to make use of both characters and words and display both as a suggestion shared by @t-hamano.
Thanks,
There was a problem hiding this comment.
Hi Team, I need the suggestion here. Based on current state of PR, everything is standarized, duplication removed and used helper functions.
All the features will either use characters if user's locale is the language that uses characters (Eg: Japanese), and will use the words if the users local is the language that use words to count (Eg: English).
Now, we have a question to what to show the user, should we update the text to show characters or words depending on the user's locale, or Should we introduce both minWords and minCharacter count and show user both and disable experiments if both of them does not satisfies the condition?
Currently, I am updating the PR with the first approach, to update text based on user's locale, will update it based on the feedback.
There was a problem hiding this comment.
Now, we have a question to what to show the user, should we update the text to show characters or words depending on the user's locale
I'm saying this as someone that only speaks English so definitely want other opinions here but if the decision is to use either words or characters depending on locale (and not standardize on characters only or standardize on checking both) I think the right approach is to update the message to show either characters or words, not both.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #581 +/- ##
=============================================
- Coverage 73.18% 73.15% -0.03%
Complexity 1731 1731
=============================================
Files 85 85
Lines 7473 7476 +3
=============================================
Hits 5469 5469
- Misses 2004 2007 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
What?
Closes #578
Why?
wordCountTypefor counting words or characters based on the users localeHow?
_x()language pack.Use of AI Tools
Testing Instructions
あああああああああああScreenshots or screencast
Screen.Recording.2026-05-19.at.3.52.39.PM.mov
Changelog Entry