Finding CJK text is failed #544

Closed
YukiChiba opened this issue Dec 18, 2014 · 11 comments

Comments

@YukiChiba

I set the environment variable XAPIAN_CJK_NGRAM to 1 so that mu can find messages containing Japanese text.
http://xapian.org/docs/sourcedoc/html/namespaceCJK.html
http://trac.xapian.org/ticket/180

However, mu fails to find Japanese text when no field is specified. The following is example output for a message containing "手続き" in both the subject and the body:

% mu find s:手続き date:today..now
2014年12月18日 14時40分12秒 Yuki Chiba <chiba@******> 手続き
2014年12月18日 14時40分12秒 Yuki Chiba <chiba@******> 手続き
% mu find 手続き date:today..now
mu: no matches for search expression (4)

I think this is related to this closed issue:
#123

That issue was closed because there was no reply for a long time, but the bug still remains.

I would greatly appreciate it if mu supported searching for Japanese text in messages.
Thank you.

@xuchunyang
Contributor

Searching for Chinese text doesn't work correctly either. Here is my case.

First, match all messages:

➜  ~  mu find ""
Tue Dec 23 05:51:49 2014 Emmanuele Bassi <ebassi@gmail.com> Calculator is now pinned in GNOME Continuous
Tue Dec 23 05:51:49 2014 Emmanuele Bassi <ebassi@gmail.com> Calculator is now pinned in GNOME Continuous
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》
Wed Dec 24 01:34:24 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> How to set tags for org nodes
Wed Dec 24 01:34:24 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> How to set tags for org nodes
Wed Dec 24 01:44:39 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> Re: How to set tags for org nodes
Wed Dec 24 01:44:39 2014 "jenia.ivlev" <jenia.ivlev@gmail.com> Re: How to set tags for org nodes

Then search for some Chinese text:

➜  ~  mu find "诚邀您参与FT中文网"
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》
Tue Dec 23 13:41:19 2014 "FT中文网 - FTChinese.com" <partner@newsletter.ftchinese.com> 诚邀您参与FT中文网《中国消费品牌偏好调查》

That works fine, but searching for a substring of the same Chinese text fails:

➜  ~  mu find "诚邀您参与FT中文"
mu: no matches for search expression (4)

Any tips?

@panjie
Contributor

panjie commented Dec 24, 2014

This is because CJK word segmentation is not supported in Xapian.
Xunsearch (http://xunsearch.com/) has made a patched Xapian which should support Chinese word segmentation, but I found it doesn't work with mu. ;(
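
To illustrate the segmentation point, here is a minimal Python sketch contrasting two ways of turning text into index terms: treating each unbroken run of word characters as one term, and splitting CJK runs into overlapping unigrams and bigrams. It is a simplified model, not Xapian's actual code. The Unicode ranges in `is_cjk` and the function names `whole_run_terms` and `ngram_terms` are assumptions made for illustration.

```python
import re
from itertools import groupby


def is_cjk(ch):
    """Rough CJK test (assumed ranges; a real indexer covers more blocks)."""
    cp = ord(ch)
    return (0x3040 <= cp <= 0x30FF       # Hiragana and Katakana
            or 0x4E00 <= cp <= 0x9FFF    # CJK Unified Ideographs
            or 0xAC00 <= cp <= 0xD7AF)   # Hangul syllables


def whole_run_terms(text):
    """Each unbroken run of word characters becomes a single term."""
    return set(re.findall(r"\w+", text))


def ngram_terms(text):
    """CJK runs become unigrams plus bigrams; other text is split into words."""
    terms = set()
    for cjk, run in groupby(text, key=is_cjk):
        chunk = "".join(run)
        if cjk:
            terms.update(chunk)                                        # unigrams
            terms.update(chunk[i:i + 2] for i in range(len(chunk) - 1))  # bigrams
        else:
            terms.update(re.findall(r"\w+", chunk))
    return terms


subject = "诚邀您参与FT中文网《中国消费品牌偏好调查》"
query = "诚邀您参与FT中文"

# Whole-run terms: the substring is not itself a term, so it cannot match.
print(query in whole_run_terms(subject))
# N-gram terms: every term of the query also occurs in the subject.
print(ngram_terms(query) <= ngram_terms(subject))
```

Under this model the substring query fails against whole-run terms but succeeds against n-gram terms, with no dictionary-based word segmentation involved.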

@YukiChiba YukiChiba changed the title from "Finding Japanese text is failed" to "Finding CJK text is failed" Mar 5, 2015
@YukiChiba
Author

I changed the title because this affects Chinese as well as Japanese.
I guess Korean is affected, too.

@ghost

ghost commented Nov 4, 2015

+1. I am having problems finding mail with Chinese text.

@djcb
Owner

djcb commented Dec 2, 2015

Sadly, there's not much mu/mu4e can do about this... perhaps ask the Xapian people?

@djcb djcb closed this as completed Dec 2, 2015
@YukiChiba
Author

I don't think this is a Xapian issue,
because notmuch, another mail client that uses Xapian for indexing, can search for Japanese words in messages when
the environment variable XAPIAN_CJK_NGRAM is set to 1.

@djcb Could you please re-open this issue?

@djcb
Owner

djcb commented Dec 11, 2015

@YukiChiba: oh, perhaps you can create a new one? And can you attach a raw email file with some body text and a subject in Japanese, so we can test it? Thanks.

@djcb djcb reopened this Dec 11, 2015
@djcb
Owner

djcb commented Dec 11, 2015

Oh actually, just reopened this one -- anyway, having an example message would be great.

@YukiChiba
Author

@djcb Thank you for reopening this issue.

The following is an example message in Japanese:

Return-Path: <******>
Received: from ****** ([******])
        by smtp.gmail.com with ESMTPSA id r26sm22467237pfa.45.2015.12.10.22.30.53
        for <******>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Thu, 10 Dec 2015 22:30:54 -0800 (PST)
User-agent: mu4e 0.9.15; emacs 24.5.1
From: Yuki Chiba <******>
To: Yuki Chiba <******>
Subject: =?utf-8?B?5ryi5a2X44KS5ZCr44KAbXU0ZeOBruODhuOCueODiOODoeODvA==?=
 =?utf-8?B?44Or?=
Date: Fri, 11 Dec 2015 15:30:43 +0900
Message-ID: <muzh9jp4hv0.fsf@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

これは漢字を含むテストメールです.

I also uploaded this raw message to the following:

https://www.dropbox.com/s/pqpxd88lvo7ra3h/message?dl=0
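
For anyone reproducing this, note that the Subject header above is RFC 2047 encoded as two base64 UTF-8 encoded-words. A small Python sketch using the standard `email.header` module decodes it, so a search term can be checked against the readable subject. `decode_subject` is a hypothetical helper name, not part of mu.

```python
from email.header import decode_header, make_header

# The folded Subject header value from the example message above.
RAW_SUBJECT = (
    "=?utf-8?B?5ryi5a2X44KS5ZCr44KAbXU0ZeOBruODhuOCueODiOODoeODvA==?=\n"
    " =?utf-8?B?44Or?="
)


def decode_subject(raw):
    """Decode RFC 2047 encoded-words into a plain Unicode string."""
    return str(make_header(decode_header(raw)))


subject = decode_subject(RAW_SUBJECT)
print(subject)            # 漢字を含むmu4eのテストメール
print("漢字" in subject)  # True
```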

@YukiChiba
Author

The following query succeeds:

% mu find s:漢字
2015年12月11日 15時30分43秒 Yuki Chiba <yuki.from.akita@gmail.com> 漢字を含むmu4eのテストメール

But the following fails even though "漢字" appears in the body:

% mu find 漢字 date:today  
mu: no matches for search expression (4)

djcb added a commit that referenced this issue Jan 17, 2016
As discussed in issue #544, it's possible to search for CJK text, as
long as you set the environment variable XAPIAN_CJK_NGRAM to non-empty
with Xapian >= 1.2.8.
@djcb
Owner

djcb commented Jan 17, 2016

Thanks, I tried this, and mu find 漢字 seems to work just fine with that env variable.

mu find 漢字 date:today does not find anything because the message is not from today; e.g., mu find 漢字 date:20151211 works.

So closing this once more.
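
A minimal sketch of running the same search from Python with the variable set only for the child process. It assumes `mu` is on `PATH`. It may also be necessary to set the variable when indexing; that is an assumption, since the thread only confirms it for `mu find`. `cjk_env` and `mu_find` are hypothetical helper names.

```python
import os
import subprocess


def cjk_env(base=None):
    """Return a copy of the environment with XAPIAN_CJK_NGRAM set to "1"."""
    env = dict(os.environ if base is None else base)
    env["XAPIAN_CJK_NGRAM"] = "1"
    return env


def mu_find(*terms):
    """Run `mu find` with the given search terms (assumes mu is installed)."""
    return subprocess.run(
        ["mu", "find", *terms],
        env=cjk_env(),
        capture_output=True,
        text=True,
    )


# Example usage, with the query from the comment above:
#   result = mu_find("漢字", "date:20151211")
#   print(result.stdout or result.stderr)
```

Setting the variable per call keeps it out of the parent shell, which makes it easy to compare results with and without n-gram handling.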

@djcb djcb closed this as completed Jan 17, 2016