Finding CJK text is failed #544

Closed
YukiChiba opened this Issue Dec 18, 2014 · 11 comments

YukiChiba commented Dec 18, 2014

I set the environment variable XAPIAN_CJK_NGRAM to 1 so that mu can find messages containing Japanese text.
http://xapian.org/docs/sourcedoc/html/namespaceCJK.html
http://trac.xapian.org/ticket/180

However, mu fails to find Japanese text when no field is specified. The following is example output for a message containing "手続き" in both the subject and the body:

% mu find s:手続き date:today..now
2014年12月18日 14時40分12秒 Yuki Chiba <chiba@******> 手続き
2014年12月18日 14時40分12秒 Yuki Chiba <chiba@******> 手続き
% mu find 手続き date:today..now
mu: no matches for search expression (4)

I think this is related to this closed issue:
#123

That issue was closed because nobody replied for a long time, but the bug still remains.

I would highly appreciate it if mu supported finding messages by Japanese text.
Thank you.

xuchunyang Dec 23, 2014

Contributor

Finding Chinese text doesn't work correctly either. Here is my case.

First, match all messages:

➜  ~  mu find ""
Tue Dec 23 05:51:49 2014 Emmanuele Bassi <ebassi@gmail.com> Calculator is now pinned in GNOME Continuous
Tue Dec 23 05:51:49 2014 Emmanuele Bassi <ebassi@gmail.com> Calculator is now pinned in GNOME Continuous
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》
Wed Dec 24 01:34:24 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> How to set tags for org nodes
Wed Dec 24 01:34:24 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> How to set tags for org nodes
Wed Dec 24 01:44:39 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> Re: How to set tags for org nodes
Wed Dec 24 01:44:39 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> Re: How to set tags for org nodes

Then search for some Chinese text:

➜  ~  mu find "诚邀您参与FT中文网"
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》

That works fine, but searching for a substring of the same text fails:

➜  ~  mu find "诚邀您参与FT中文"
mu: no matches for search expression (4)

Any tips?

panjie Dec 24, 2014

Contributor

This is because CJK word segmentation is not supported in Xapian.
Xunsearch (http://xunsearch.com/) has made a patched Xapian that should support Chinese word segmentation, but I found it doesn't work with mu. ;(

@YukiChiba YukiChiba changed the title from Finding Japanese text is failed to Finding CJK text is failed Mar 5, 2015

YukiChiba Mar 5, 2015

I changed the title because this affects not only Japanese but also Chinese.
I guess Korean is affected, too.

declanqian Nov 4, 2015

Contributor

+1. I am having problems finding mail with Chinese text.

djcb Dec 2, 2015

Owner

Sadly, there is not much mu/mu4e can do about this... perhaps ask the Xapian people?

@djcb djcb closed this Dec 2, 2015

YukiChiba Dec 11, 2015

I don't think this is a Xapian issue,
because notmuch, another mail client that uses Xapian for indexing, can search for Japanese words in messages when
the environment variable XAPIAN_CJK_NGRAM is set to 1.

@djcb Could you please re-open this issue?

djcb Dec 11, 2015

Owner

@YukiChiba: oh, perhaps you can create a new one? And can you attach a raw email file with some body text and a subject in Japanese, so we can test it? Thanks.

@djcb djcb reopened this Dec 11, 2015

djcb Dec 11, 2015

Owner

Oh actually, just reopened this one -- anyway, having an example message would be great.

YukiChiba Dec 11, 2015

@djcb Thank you for reopening this issue.

The following is an example message in Japanese:

Return-Path: <******>
Received: from ****** ([******])
        by smtp.gmail.com with ESMTPSA id r26sm22467237pfa.45.2015.12.10.22.30.53
        for <******>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Thu, 10 Dec 2015 22:30:54 -0800 (PST)
User-agent: mu4e 0.9.15; emacs 24.5.1
From: Yuki Chiba <******>
To: Yuki Chiba <******>
Subject: =?utf-8?B?5ryi5a2X44KS5ZCr44KAbXU0ZeOBruODhuOCueODiOODoeODvA==?=
 =?utf-8?B?44Or?=
Date: Fri, 11 Dec 2015 15:30:43 +0900
Message-ID: <muzh9jp4hv0.fsf@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

これは漢字を含むテストメールです.

I also uploaded this raw message here:

https://www.dropbox.com/s/pqpxd88lvo7ra3h/message?dl=0
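
For readers who want to check what the encoded Subject header above contains: it consists of two RFC 2047 encoded-words in UTF-8 with Base64 ("B") encoding. The sketch below decodes just the two Base64 payloads copied from that header. The helper name decode_b64_words is made up for illustration, and it assumes a base64 tool that accepts -d (as GNU coreutils does); it is not a general RFC 2047 parser.

```shell
# Hypothetical helper: decode each Base64 payload in order and concatenate.
# Assumes `base64 -d` is available (GNU coreutils; some systems use -D instead).
decode_b64_words() {
    for word in "$@"; do
        printf '%s' "$word" | base64 -d
    done
    printf '\n'
}

# Payloads copied from the two =?utf-8?B?...?= words in the Subject header.
subject=$(decode_b64_words \
    '5ryi5a2X44KS5ZCr44KAbXU0ZeOBruODhuOCueODiOODoeODvA==' \
    '44Or')

printf '%s\n' "$subject"
# prints: 漢字を含むmu4eのテストメール
```

The decoded text matches the subject that mu prints for this message, so the subject really does contain "漢字".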

YukiChiba Dec 11, 2015

The following query succeeds:

% mu find s:漢字
2015年12月11日 15時30分43秒 Yuki Chiba <yuki.from.akita@gmail.com> 漢字を含むmu4eのテストメール

But the following fails, even though "漢字" appears in the body:

% mu find 漢字 date:today  
mu: no matches for search expression (4)

djcb added a commit that referenced this issue Jan 17, 2016

mu4e: add note about searching CJK chars to doc
As discussed in issue #544, it's possible to search for CJK text, as
long as you set the environment variable XAPIAN_CJK_NGRAM to non-empty
with Xapian >= 1.2.8.
djcb Jan 17, 2016

Owner

Thanks, I tried this, and mu find 漢字 seems to work just fine with that env variable.

mu find 漢字 date:today does not find anything because the message is not from today; e.g., mu find 漢字 date:20151211 works.

So, closing this once more.
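
To summarize the working recipe from this thread as a rough sketch: export XAPIAN_CJK_NGRAM and query with an explicit date rather than date:today. The helper name find_on_day is hypothetical, and the sketch assumes mu is on PATH. It is also an assumption here (the thread does not confirm it) that the variable needs to be set when indexing as well, so messages indexed earlier may need to be reindexed.

```shell
# Any non-empty value should do according to the commit note; 1 is what the reporter used.
export XAPIAN_CJK_NGRAM=1

# Hypothetical helper: search for a term on one explicit day (YYYYMMDD),
# using the same term-plus-date query form shown in the thread.
find_on_day() {
    term=$1
    day=$2
    if command -v mu >/dev/null 2>&1; then
        mu find "$term" "date:$day"
    else
        echo "mu not found in PATH" >&2
        return 127
    fi
}

# Usage (left commented out so the sketch is safe to source):
# find_on_day 漢字 20151211
```

Using a fixed date avoids the confusion above, where date:today excluded a message that was sent on 2015-12-11.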

@djcb djcb closed this Jan 17, 2016
