Counting words for Asian languages #761

Open
atimmer opened this Issue Jul 4, 2016 · 81 comments

@atimmer
Member

atimmer commented Jul 4, 2016

WordPress has built code to count words in Asian languages. We should take this as inspiration. WordPress actually ships the word-counting code only in the Chinese language pack; the third link shows where to find it.

Patch for word-count.js
https://core.trac.wordpress.org/ticket/20738
https://core.trac.wordpress.org/ticket/30966

Where to find the language pack:
https://core.trac.wordpress.org/ticket/33454

@monbauza

monbauza commented Dec 6, 2016

Please inform the customer of conversation # 99154 when this conversation has been closed.

@monbauza

monbauza commented Dec 6, 2016

Please inform the customer of conversation # 148759 when this conversation has been closed.

@paulovsky

paulovsky commented Jan 3, 2017

Any developments/ideas on this issue? Not being able to use Yoast with Chinese content is keeping the plugin away from 450 million internet users.

@monbauza

monbauza commented Jan 17, 2017

Please inform the customer of conversation # 172656 when this conversation has been closed.

@a4jp-com

a4jp-com commented Jan 19, 2017

Maybe a character count is better for Japanese/Chinese. One kanji character usually takes up the space of 2 regular characters.

@terw-dan

Member

terw-dan commented Jan 23, 2017

@a4jp-com thanks for your suggestion. So each kanji character can be seen as 1 word? Or is there more to it?

@paulovsky

paulovsky commented Jan 24, 2017

Japanese (also Chinese, Korean, Vietnamese) is a logographic language from the Han family; that means that not all characters represent morphemes: some morphemes are composed of more than one character (see more here).

@terw-dan The approach is to take the word count as character count, mostly because each character counts as one "word" and there are no spaces between characters (at least in Chinese).

As for keyword analysis, if the user inputs a combination of two or more characters, that must be seen as one word.

@a4jp-com

a4jp-com commented Jan 25, 2017

Individual kanji characters sometimes have a meaning but in some situations they are combined with a few other hiragana characters to make different words with different readings.

Kanji only:
下 (shita) down

Kanji with hiragana:
下さい (kudasai) please

Each character in hiragana, katakana or kanji takes up the space of 2 English characters.
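
The "space of 2 characters" point corresponds to what Unicode calls East Asian Width. Here is a minimal Python sketch, assuming display width is the quantity of interest (for example when judging whether a title is too long). `display_width` is a hypothetical helper for illustration, not something taken from Yoast or WordPress.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate display width of text in half-width columns."""
    width = 0
    for ch in text:
        # 'W' (wide) and 'F' (fullwidth) characters occupy two columns;
        # everything else is treated as one column in this sketch.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


print(display_width("下さい"))  # 6
print(display_width("abc"))     # 3
```

Combining marks and emoji are not handled specially here; the sketch only illustrates the wide versus narrow distinction.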

@terw-dan

Member

terw-dan commented Jan 25, 2017

Thanks both for the explanation. This will come in handy when we start implementing and testing this.

@a4jp-com

a4jp-com commented Feb 1, 2017

Is there anything I can do to help?

@IreneStr

Contributor

IreneStr commented Feb 2, 2017

@a4jp-com Thank you for your eagerness to help! At the moment, our main problem is the absence of spaces in Asian languages. Could you confirm that in Japanese there are no spaces between individual words either?

@a4jp-com

a4jp-com commented Feb 2, 2017

Sometimes a regular space is used but in other situations a Japanese space is used.

Here is the encoding for the Unicode character 'IDEOGRAPHIC SPACE' (U+3000)
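
To illustrate why the Japanese space matters for splitting, here is a small Python sketch. It assumes nothing about how Yoast tokenizes text; it only shows that U+3000 is a distinct character that a split on the ASCII space alone will miss, while a Unicode-aware whitespace split catches it.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"

print(unicodedata.name(IDEOGRAPHIC_SPACE))  # IDEOGRAPHIC SPACE
print(IDEOGRAPHIC_SPACE.isspace())          # True

name = "山田\u3000たろ"

# str.split() with no argument splits on any Unicode whitespace,
# so the ideographic space is treated as a separator.
print(name.split())            # ['山田', 'たろ']

# Splitting on the ASCII space only leaves the string in one piece.
print(len(name.split(" ")))    # 1
```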

@IreneStr

Contributor

IreneStr commented Feb 3, 2017

@a4jp-com Thank you for your reply. From googling some Japanese websites, I got the impression that spaces are generally speaking only used between sentences, but not between words.

In the following sentences, for example, there appears to be whitespace after 、 and 。:

日本の文化を美術や音楽、演劇、映画からファッションやデザインまで幅広く世界に紹介しています。また、言葉を超えた共感の場をつくり出し、ともに創造する喜びをわかちあって、人と人との交流を深めていきます。

However, these white spaces are part of the 、(U+3001) and 。(U+3002) characters (so the space is not a separate character).

In what situations do people use the regular space or Japanese space?

@a4jp-com

a4jp-com commented Feb 3, 2017

Regular spaces are used between the surname and given name.
山田 たろ
but sometimes no space is used or a Japanese space is used.
山田たろ
山田 たろ

It's kind of a design choice.

Regular spaces are also used when romaji or English phrases are used in advertising.

@idpokute

idpokute commented Feb 12, 2017

Most Japanese text doesn't use spaces between words. Maybe Yoast could offer an option that turns off some deduction rules for Japanese.

@a4jp-com

a4jp-com commented Feb 12, 2017

I've been living in Japan for 16 years and have worked as a system engineer in 3 Japanese companies. Spaces are used in sites here.

Especially in pages mixed with English words.

For example:
https://www.toshiba-newenergy.com/
OECD加盟国34ヵ国中33位
IEA Energy Balance of OECD Countries 2013

@IreneStr

Contributor

IreneStr commented Feb 13, 2017

@idpokute @a4jp-com Thank you both for the information. It'll be very helpful when we want to implement this feature in the future.

@iamazik

Member

iamazik commented Feb 16, 2017

Please inform the customer of conversation # 179998 when this conversation has been closed.

@a4jp-com

a4jp-com commented Feb 17, 2017

Thank you very much @IreneStr.

@monbauza

monbauza commented Mar 13, 2017

Please inform the customer of conversation # 184521 when this conversation has been closed.

@a4jp-com

a4jp-com commented Mar 16, 2017

I have been following the rules set out in the plugin but I have lost 75% of my views on one Japanese site. I've gone from about 200 views a day down to about 50 views a day.

The only other change I made was changing the site to HTTPS. I thought that was meant to increase the ranking. Any ideas what could be causing the problem? https://agreatdream.com/word-lists/

Is this somehow linked to the count being off?

@ullivr

ullivr commented Mar 17, 2017

I write my blog in Chinese and really, really, really need this function.

@terw-dan

Member

terw-dan commented Mar 20, 2017

@a4jp-com The word count is only shown as an indication. It is not something we (can) save to your post that has influence on your rankings. So it has to be something else that caused the decrease.

@a4jp-com

a4jp-com commented Mar 20, 2017

I was just thinking that, as the numbers are wrong, we might be adding titles that are too long, etc., when we make pages.

@saitai0802

saitai0802 commented Apr 7, 2017

I write my blog in Chinese and Japanese. Here are some examples of my post titles for you to test.
本願寺,錦市場,八坂神社 <==Chinese only
京阪神八日之旅 <==Chinese only
伏見稻荷大社,けんどん屋,奈良公園 <==Chinese mixed with Japanese

Please help to fix it; we really need this amazing function! It keeps telling us our posts are poor, which makes us sad.
Thank you so, so much

@iamazik

Member

iamazik commented Apr 21, 2017

Please inform the customer of conversation # 192085 when this conversation has been closed.

@suascat

suascat commented Jul 5, 2018

Please inform the customer of conversation # 399452 when this conversation has been closed.

@michaelbriantina

michaelbriantina commented Aug 2, 2018

Please inform the customer of conversation # 411495 when this conversation has been closed.

@a4jp-com

a4jp-com commented Aug 2, 2018

A simple character count option would be nice at some point. The code for the word count is there, so it shouldn't be too hard to count characters.

@pyericz

pyericz commented Aug 3, 2018

Jieba is good for Chinese segmentation. It has support for many languages. Check this out:
https://github.com/search?q=jieba

@michaelbriantina

michaelbriantina commented Aug 23, 2018

Please inform the customer of conversation # 418099 when this conversation has been closed.

@Pcosta88

Pcosta88 commented Sep 13, 2018

Please inform the customer of conversation # 425222 when this conversation has been closed.

@a4jp-com

a4jp-com commented Sep 14, 2018

Can someone make a patch with just a character count?

@tatitanizaki

tatitanizaki commented Sep 25, 2018

The character count would help a lot already.
This issue has been open for more than 2 years. Is this going to be fixed or not?

@namster2k

namster2k commented Nov 6, 2018

Microsoft Word counts every Japanese character as a word. This includes hiragana, katakana, and kanji characters. For example, the kanji for "cat" is "猫" and this is counted as one word. The hiragana for "cat" is "ねこ" and this is counted as 2 words. It's not the greatest system, but in my opinion it's better than what Yoast is doing at the moment, which is counting entire paragraphs as single words.
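
The counting rule described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Microsoft Word's or Yoast's actual implementation: the character ranges (CJK Unified Ideographs plus Extension A, hiragana, katakana) and the treatment of Latin letter/digit runs as one word each are choices made for this example, and `count_words` is a hypothetical name.

```python
import re

# Assumed ranges: kanji (CJK Unified Ideographs + Extension A),
# hiragana letters, katakana letters.
CJK_CHAR = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3041-\u3096\u30a1-\u30fa]")

# Assumption: a run of ASCII letters or digits counts as a single word.
LATIN_WORD = re.compile(r"[A-Za-z0-9]+")


def count_words(text: str) -> int:
    """Count each kanji/kana character as one word, plus Latin-script words."""
    return len(CJK_CHAR.findall(text)) + len(LATIN_WORD.findall(text))


print(count_words("猫"))    # 1
print(count_words("ねこ"))  # 2
print(count_words("IEA Energy Balance of OECD Countries 2013"))  # 7
```

Punctuation and spaces (including U+3000) match neither pattern, so they are not counted.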

@7creo

7creo commented Nov 6, 2018

moorscode added the needs-spec label and removed the idea label Nov 9, 2018

@moorscode

Member

moorscode commented Nov 9, 2018

Needs spec from a linguistics team perspective; can be put in the queue if the scope is clear.

@gersangwiki

gersangwiki commented Nov 25, 2018

In Korean, we attach a suffix to a lot of words, but Yoast treats the word and its suffix together as one keyword.

Example: 나는 기분이 좋다.
나=I 는=am 기분=feeling 이=is 좋다=good

Let's say my keyword is 기분 (which means "feeling"). Yoast SEO thinks the keyword is 기분이 because the two are written together. In English terms, Yoast is recognizing "feeling is" as a keyword instead of "feeling".

Recognizing just 6 suffixes separately from the word would improve this a lot.
The most used Korean suffixes: 을 or 를, 이 or 가, 은 or 는.
If possible, make it work like this: when one of 을, 를, 이, 가, 은, 는 is at the end of a word, it is being used as a suffix, so Yoast should match the keyword without that last character. When it is at the front or in the middle, it is part of the keyword.
Fixing this will be crucial for Korean users.
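
The suffix rule suggested above could look roughly like this in Python. It is a naive sketch: `strip_particle` and `keyword_in_text` are hypothetical names, and the rule will wrongly strip the last character of words that genuinely end in one of these syllables (for example 아이, "child"), so a real implementation would need more linguistic care.

```python
# The six suffixes listed in the comment above.
PARTICLES = ("을", "를", "이", "가", "은", "는")


def strip_particle(word: str) -> str:
    """Drop a trailing particle; leave it alone at the front or in the middle."""
    if len(word) > 1 and word.endswith(PARTICLES):
        return word[:-1]
    return word


def keyword_in_text(keyword: str, text: str) -> bool:
    """Check for the keyword among space-separated words, ignoring particles."""
    tokens = [strip_particle(t.strip(".,!?。、")) for t in text.split()]
    return keyword in tokens


print(strip_particle("기분이"))                       # 기분
print(strip_particle("이름"))                         # 이름 (이 is at the front)
print(keyword_in_text("기분", "나는 기분이 좋다."))    # True
```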

@a4jp-com

a4jp-com commented Nov 25, 2018

Is just having a character count good for the Korean language?

@topout

topout commented Dec 9, 2018

@moorscode said

Needs spec from a linguistics team perspective

There are many Asian languages. They are similar but have some differences, so I suggest we start with Chinese because it is the most spoken language (and I use it :D ) and seems easier than some of the other languages.

The suggested spec below can be applied to both Simplified Chinese (learned by people living in Mainland China) and Traditional Chinese (learned by people who learned Chinese outside Mainland China).

My suggested Spec for Chinese (Simplified and Traditional Chinese):

  • We need to count each character, whether there's a space or not, as an individual word.
  • If there is a space between characters, still treat it as if there's no space.
  • New lines can be treated as a new sentence (just like we would with English).
  • If there are special characters, such as periods, parentheses, etc., treat them as you would in English text. Special characters break apart the words or the flow of the words just like in English. Chinese uses parentheses, commas, etc. like English grammar does, and it is exactly the same in most cases.
  • In case Yoast handles periods to end a sentence differently, Chinese uses "。" in place of a ". " to end a sentence.

If we can do that, at least targeting keywords, keyphrases, and synonyms will work, which will make Yoast much more useful than it is right now, when it does not work at all. Things like Flesch Reading, passive voice, and transition words won't work, but I think many people, including myself, will still be grateful that we can at least focus on keywords.
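
A rough Python sketch of the spec suggested above, to make the rules concrete. All function names are hypothetical, the Han character ranges are an assumption, and splitting on every "." is a simplification (it would also split decimals such as 3.14).

```python
import re

# Sentence enders: Chinese and English terminators, plus new lines.
SENTENCE_END = re.compile(r"[。！？.!?\n]+")

# Assumed Han ranges: CJK Unified Ideographs + Extension A.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation and new lines."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def word_count(text: str) -> int:
    """Each Han character counts as one word; spaces do not matter."""
    return len(HAN.findall(text))


def contains_keyphrase(text: str, keyphrase: str) -> bool:
    """Match while ignoring spaces; punctuation still breaks a phrase."""
    def squash(s: str) -> str:
        return re.sub(r"\s+", "", s)

    return squash(keyphrase) in squash(text)


print(split_sentences("京阪神八日之旅。本願寺,錦市場,八坂神社"))
print(word_count("京阪神八日之旅"))                    # 7
print(contains_keyphrase("京阪神 八日之旅", "八日之旅"))  # True
print(contains_keyphrase("本願寺,錦市場", "寺錦"))       # False
```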

@7creo

7creo commented Dec 9, 2018

@laurasacco

laurasacco commented Dec 11, 2018

Please inform the customer of conversation # 453293 when this conversation has been closed.

@iamazik

Member

iamazik commented Dec 11, 2018

Please inform the customer of conversation # 453634 when this conversation has been closed.

@iamazik

Member

iamazik commented Dec 25, 2018

Please inform the customer of conversation # 457569 when this conversation has been closed.

IreneStr added the next lingo label and removed the next label Jan 30, 2019
