Indexing and searching cannot treat non ASCII identifiers (Origin: bugzilla #705910) #5263

Open
doxygen opened this Issue Jul 2, 2018 · 0 comments

doxygen commented Jul 2, 2018

status REOPENED severity normal in component general for ---
Reported in version 1.8.6-GIT on platform Other
Assigned to: Dimitri van Heesch

Original attachment names and IDs:

On 2013-08-13 13:24:02 +0000, Suzumizaki-Kimitaka wrote:

Created attachment 251489
fix indexing and built-in searching for non ASCII identifiers

I have already made a patch; please apply it.
Details and notes about the regression tests are below.

Details:
a1) Because the index groups entries by the first byte (octet) of their UTF-8 encoding, entries whose names start with a non-ASCII character are grouped incorrectly. For example, U+0080 - U+00BF all go to group 0xC2, and U+0800 - U+0FFF all go to group 0xE0. Of course, each character should get its own group, just as ASCII characters do ('A' under 'A', 'B' under 'B').

a2) The index group headers ("- A -", "- B -", "- C -", ...) are shown correctly for ASCII ONLY. For non-ASCII characters, every header is shown as "- <?> -", because a byte in 0xC0-0xFF that is NOT followed by a byte in 0x80-0xBF is an invalid UTF-8 sequence.

b1) The built-in JavaScript search does not work with non-ASCII entries. The entries in the database are escaped as UTF-8, but the words entered in the search box are escaped as (broken) UTF-16.

JavaScript/ECMAScript can now handle Unicode directly, so we do not have to escape anything except when naming the files in the "search" folder. Each file name has a hexadecimal tail that represents the common first character of its entries. My patch makes that tail depend on the Unicode code point instead of the UTF-8 lead byte.

b2) The built-in PHP search deletes non-ASCII characters from the search box every time.
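
To illustrate a1 and a2, here is a minimal C++ sketch (not taken from the patch) of grouping by the first code point instead of the first byte. The names `utf8SequenceLength`, `firstCodePoint` and `firstCharacter` are hypothetical, and the decoder is deliberately simplified: it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstdint>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by lead byte `b`,
// or 0 if `b` cannot start a sequence (continuation or invalid byte).
static int utf8SequenceLength(unsigned char b)
{
  if (b < 0x80) return 1;               // ASCII
  if (b >= 0xC2 && b <= 0xDF) return 2; // U+0080 .. U+07FF
  if (b >= 0xE0 && b <= 0xEF) return 3; // U+0800 .. U+FFFF
  if (b >= 0xF0 && b <= 0xF4) return 4; // U+10000 .. U+10FFFF
  return 0;
}

// First code point of the UTF-8 string `s`; U+FFFD if `s` is empty,
// truncated, or starts with a malformed sequence.
std::uint32_t firstCodePoint(const std::string &s)
{
  if (s.empty()) return 0xFFFD;
  unsigned char lead = static_cast<unsigned char>(s[0]);
  int len = utf8SequenceLength(lead);
  if (len == 0 || s.size() < static_cast<std::size_t>(len)) return 0xFFFD;
  if (len == 1) return lead;
  // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
  std::uint32_t cp = lead & (0xFF >> (len + 1));
  for (int i = 1; i < len; i++)
  {
    unsigned char c = static_cast<unsigned char>(s[i]);
    if ((c & 0xC0) != 0x80) return 0xFFFD; // not a continuation byte
    cp = (cp << 6) | (c & 0x3F);
  }
  return cp;
}

// Bytes of the first whole character, usable as a "- X -" header label.
// Returns an empty string if the lead byte is invalid or `s` is truncated.
std::string firstCharacter(const std::string &s)
{
  if (s.empty()) return std::string();
  int len = utf8SequenceLength(static_cast<unsigned char>(s[0]));
  if (len == 0 || s.size() < static_cast<std::size_t>(len)) return std::string();
  return s.substr(0, len);
}
```

Using `firstCodePoint` as the group key gives every character its own group, and emitting `firstCharacter` in the header avoids printing a lone lead byte, which is what produces the "- <?> -" headers.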

To fix these problems:
1) Some new functions are added to utils.h/cpp.
2) For indexing, index.cpp is fixed.
3) For searching, search_js.h, search.js, search_functions.php, search_functions_php.h and searchindex.cpp are fixed.
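
As an illustration of the search-file naming change described under b1, the following C++ sketch contrasts a tail derived from the UTF-8 lead byte with one derived from the code point. It is an assumption-laden example, not code from the patch: the function names `toHex`, `tailFromLeadByte` and `tailFromCodePoint` are hypothetical, and the exact file-name format doxygen uses is not shown in this thread.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Lowercase hexadecimal text for `value`, without any prefix.
std::string toHex(std::uint32_t value)
{
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%lx", static_cast<unsigned long>(value));
  return std::string(buf);
}

// Byte-based scheme: the tail comes from the first UTF-8 byte only,
// so many different characters share one tail.
std::string tailFromLeadByte(const std::string &utf8Name)
{
  if (utf8Name.empty()) return std::string();
  return toHex(static_cast<unsigned char>(utf8Name[0]));
}

// Code-point-based scheme: the tail comes from the decoded first
// code point (decoding is assumed to happen elsewhere).
std::string tailFromCodePoint(std::uint32_t codePoint)
{
  return toHex(codePoint);
}
```

For example, U+3042 (bytes E3 81 82) and U+3044 (bytes E3 81 84) both get the tail `e3` under the byte-based scheme, but `3042` and `3044` under the code-point-based one.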

Note on regression tests on Microsoft Windows:
I am sorry to say that the patch I posted before (Bug 705219) did not pass the regression tests.

I could not run the tests before, because I could not easily run xmllint.
The binary distribution of xmllint does not work; it requires an old (and seemingly 'correct') version of iconv.dll. I had to build it from the libxml2 source code. Even now I can build only part of libxml2, but that is enough to get xmllint.exe.

As of today, Git SHA-1 83fc120e5575446b1161e9ffb8168d55c423f7ac fails test 12. I believe my patch here does not make any other test fail.

Regards,
Suzumizaki-Kimitaka

On 2013-08-22 14:29:43 +0000, Dimitri van Heesch wrote:

Thanks for your patch, but I think it requires more thought.

I now see some loops like these in the code:

  for (p=0;p<=MAX_UNICODE_CODEPOINT;p++)

where MAX_UNICODE_CODEPOINT is 0x10FFFF

Performance-wise, this is not good, especially since in 99.9% of the iterations nothing is done other than checking whether something needs to be done. If you already use a hash/map, it is better to just iterate over it.

Do you want to make an improved patch, or do you want me to improve it myself?

On 2013-08-22 16:16:26 +0000, Suzumizaki-Kimitaka wrote:

I'm sorry, but I would like you to improve it, because I don't know which qtools class I should use.

As you say, in some of the cases we should simply use an iterator, but in the others we seem to need to guarantee iteration in code-point order.

Regards,
Suzumizaki-Kimitaka
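
The trade-off discussed in the two comments above can be sketched in C++ as follows. This is only an illustration under assumptions: it uses `std::map` rather than a qtools class, and `LetterIndex` and `usedLetters` are hypothetical names, not doxygen code. An ordered map keyed by code point visits only the letters actually in use and still yields them in code-point order, so no loop over 0 .. 0x10FFFF is needed.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Index entries grouped by the code point of their first character.
// std::map keeps its keys sorted, so iteration is in code-point order.
using LetterIndex = std::map<std::uint32_t, std::vector<std::string>>;

// Collects the code points that have at least one entry, in ascending
// order. The cost is proportional to the number of stored keys, not to
// the size of the Unicode range.
std::vector<std::uint32_t> usedLetters(const LetterIndex &index)
{
  std::vector<std::uint32_t> result;
  for (const auto &entry : index)
  {
    if (!entry.second.empty()) result.push_back(entry.first);
  }
  return result;
}
```

A hash-based container would also avoid the full-range scan, but it would need an explicit sort of its keys wherever code-point order matters.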

On 2013-08-31 12:31:32 +0000, Suzumizaki-Kimitaka wrote:

Hello, have you started converting the loops to iterators?
If not, I'll try to do it.

Please tell me whether I should try or just wait for your work.

(I want to make a patch for another issue, but it seems better to resolve this problem first.)

Regards,
Suzumizaki-Kimitaka

On 2013-09-10 10:49:48 +0000, Suzumizaki-Kimitaka wrote:

Created attachment 254582
The updated and fixed patch

Hello.
I found a bug like Bug 707278 in the previous patch, and
I have fixed the iterator problem pointed out here.

I made the new patch against SHA-1: SHA: 1e373422387e8c1131f887efb47cf3da6459e2ac.
The previous one is obsolete.

Please apply the new one.

Regards,
Suzumizaki-Kimitaka

On 2013-09-15 18:15:38 +0000, Dimitri van Heesch wrote:

Thanks, I've just pushed a somewhat reworked version of your patch to GitHub.

On 2013-10-22 02:26:52 +0000, Suzumizaki-Kimitaka wrote:

Created attachment 257809
A pair of HTML documents showing that the official fix (in Git) does not solve the problem.

Sorry to say, Dimitri, your workaround (as you described in comment 5) breaks some functionality.
Please read the HTML documents contained in the attachment to this comment, and tell me how you plan to proceed.
failed_html was generated with your version, and correct_html with mine.

Regards,
Suzumizaki-Kimitaka

On 2013-10-26 15:07:02 +0000, Suzumizaki-Kimitaka wrote:

Created attachment 258178
The new patch against current origin/HEAD

Okay, I made a new patch. You have another choice now.
The new patch targets SHA-1: SHA: 74815268dd88f2cfb4473462cef3c33eebd5516a

Note that I found one more bug and fixed it in this patch as well:
doxygen at the current origin/HEAD distinguishes upper- and lowercase in identifiers.
I'll make a new sample project zip like the one I posted before.

Regards,
Suzumizaki-Kimitaka

On 2013-10-29 01:42:54 +0000, Suzumizaki-Kimitaka wrote:

Created attachment 258383
Updated sample project

The updated version of the pair of HTML documents, showing that the official fix (in Git) does not solve the problem (see comments 6 and 7).

On 2013-12-24 18:59:58 +0000, Dimitri van Heesch wrote:

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.8.6. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information 
that you think can be relevant (preferably in the form of a self-contained example).

On 2013-12-27 05:40:48 +0000, Suzumizaki-Kimitaka wrote:

Created attachment 264918
Updated patch for 1.8.6 release

As I said before, the work on this issue is not finished.
(Note again that this is NOT my fault! The rework mentioned in comment 5 failed.)

Here is the updated patch, but in fact only the line positions in the target files are adjusted.

Regards,
Suzumizaki-Kimitaka