Lowercase search does not find non-ASCII uppercase pages and vice versa #8375

apolukhin · 2021-02-09T10:30:48Z

Describe the bug
Searching for "привет" does not find the page titled "Привет".

Expected behavior
For ASCII everything works fine: searching for "faq" finds the page "FAQ". The same behavior is expected for non-ASCII pages.

Screenshots

To Reproduce

Make some markdown pages starting with # Привет, # Основные сведения, # прочее and # Введение.
Generate docs from those pages with SEARCHENGINE=YES and SERVER_BASED_SEARCH=NO.
Search for при, ПРИВ, основ, ПРОЧЕЕ, вВеден.

Version
1.9.1 and trunk.

albert-github · 2021-02-09T10:55:03Z

Recently I've seen a fix for a similar issue, #5263, that might also fix your problem (although you mentioned that trunk has the problem as well), though it is hard to tell without an example (I don't have a good example at hand with Cyrillic search terms).

Please specify the trunk version you used (complete information of doxygen -v).
Can you please attach a small, self-contained example (source + configuration file in a tar or zip) that allows us to reproduce the problem? Please don't add external links as they might not be persistent.

apolukhin · 2021-02-09T16:12:03Z

doxygen -v
1.9.2 (46ffd77e8d3c15244732128caace858f2aa38d73)

Searching for "ОСН", "ПРИ", "При" and "проч" works (was fixed in #5263, thanks!)
Searching for "при", "вве", "ПРОЧ" and "осн" does not work.

Attaching a self-contained example; all the info is duplicated inside, in the Doxygen config:
doxygen_sample.zip

albert-github · 2021-02-09T17:07:03Z

It was at first a bit confusing as the output language was still English, but the problem here lies with the names of the sections in the related pages.

albert-github · 2021-02-09T17:47:13Z

This looks like a more fundamental problem: when writing the search files (among others search/all*.js, search/pages*.js, search/searchdata.js), the translation to lowercase is not done the way it is done for the Latin alphabet / ASCII.
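To make the symptom concrete, here is a minimal Python sketch (not doxygen's code) of what happens when only ASCII letters are lowercased before a search term is compared with a page title. The helper names `ascii_lower` and `title_matches` are hypothetical, and the prefix comparison is an assumption about how the JavaScript search matches terms.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only the ASCII letters A-Z; every other character is left alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def title_matches(term: str, title: str, lower) -> bool:
    """Case-insensitive prefix match using the supplied lowercasing function."""
    return lower(title).startswith(lower(term))


if __name__ == "__main__":
    samples = [("faq", "FAQ"), ("при", "Привет"), ("ПРОЧ", "прочее")]
    for term, title in samples:
        print(
            term,
            title,
            "ascii-only:", title_matches(term, title, ascii_lower),
            "unicode-aware:", title_matches(term, title, str.lower),
        )
```

With the ASCII-only function the Cyrillic pairs do not match, while Python's Unicode-aware `str.lower` makes them match, which is the behavior requested in this issue.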

… pages and vice versa

Implementation of an uppercase / lowercase conversion as needed by doxygen. The standard tolower / toupper functions don't really work as they need a "locale", which in general is not necessary for Unicode / UTF-8 conversions.

- caseconvert.cpp / caseconvert.h: generated code based on the table from https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt with some small modifications regarding uppercase values that shouldn't have a lowercase representation (Kelvin sign) or combined characters where there is no 100% one-to-one relation between uppercase and lowercase due to some mix (e.g. DZ, Dz and dz).
- util.cpp / searchengine.cpp: using the new functions
- search.js: the old "workaround" is not necessary anymore (see issue doxygen#5263)
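The table-driven idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real caseconvert.cpp is generated C++ code working on UTF-8, whereas this sketch works on code points and covers only ASCII and the basic Cyrillic ranges. `LOWER_TABLE` and `table_lower` are hypothetical names.

```python
# Hand-written excerpt of an uppercase -> lowercase table (code point to code point).
LOWER_TABLE = {}
LOWER_TABLE.update({cp: cp + 0x20 for cp in range(0x0041, 0x005B)})  # A..Z -> a..z
LOWER_TABLE.update({cp: cp + 0x50 for cp in range(0x0400, 0x0410)})  # Ѐ..Џ -> ѐ..џ
LOWER_TABLE.update({cp: cp + 0x20 for cp in range(0x0410, 0x0430)})  # А..Я -> а..я

# U+212A KELVIN SIGN is deliberately left out of the table, so it passes
# through unchanged (Python's own str.lower() would map it to "k").


def table_lower(text: str) -> str:
    """Lowercase via table lookup; no locale is consulted at any point."""
    return "".join(chr(LOWER_TABLE.get(ord(c), ord(c))) for c in text)


if __name__ == "__main__":
    for title in ["FAQ", "Привет", "Основные сведения", "\u212a"]:
        print(repr(title), "->", repr(table_lower(title)))
```

Because the mapping is plain data, the result does not depend on the platform's locale settings, and exceptions such as the Kelvin sign are handled simply by omitting or adjusting table entries.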

albert-github · 2021-03-04T13:50:56Z

I've just pushed a proposed patch, pull request #8409

doxygen · 2021-03-22T19:36:50Z

@apolukhin Please verify if commit a4ecbee fixes the problem for you.

albert-github · 2021-03-23T10:16:06Z

As far as I can see it does not work.

Example: example.tar.gz

Here we have the source to generate the html pages and the directories:

html_mine: the approach I proposed in pull request #8409 (issue #8375 Lowercase search does not find non-ASCII uppercase pages and vice versa)
html_c: a run on Cygwin
html_w: a run on Windows

When going to the related pages (for easy cut and paste), cutting the text and pasting it into the search bar:

Основные сведения: my approach shows the reference, whereas the Cygwin and Windows versions give "No Matches". When the search term is stripped down to the part before the space, Cygwin shows a match but Windows still does not.
See the subtle difference (space versus _20):

diff -r html_c/search/pages_1.js html_mine/search/pages_1.js
3c3
<   ['основные сведения_5',['Основные сведения',['../md_test2.html',1,'']]]
---
>   ['основные_20сведения_5',['Основные сведения',['../md_test2.html',1,'']]]

прочее: in my approach and on Cygwin this looks OK, but on Windows the results window even shows a "file not found" reference.
See:

diff -b -w -r html_mine/search/searchdata.js html_w/search/searchdata.js
3,4c3,4
<   0: "воп",
<   1: "воп"
---
>   0: "Ð²Ð¾Ð¿п",
>   1: "Ð²Ð¾Ð¿п"

doxygen · 2021-03-24T19:42:16Z

@albert-github Fixed two issues:

Visual Studio interprets u8"..." string literals in the locale encoding (so not UTF-8!) which caused the scrambled output. Can be fixed by using the /utf-8 compiler option, but since other compilers may have similar issues I decided to put the exact byte encoding in the caseconvert.h file instead like you did.
I missed the escaping of non-identifier characters in function searchId()

Let me know if you see more issues.
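The scrambled strings from the Windows run are consistent with UTF-8 bytes being read in a single-byte locale encoding. The Python sketch below only mimics that effect; it is not what the compiler does internally, and the choice of `cp1252` as the locale encoding is an assumption. `misread_utf8` is a hypothetical helper.

```python
def misread_utf8(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes as if they were a one-byte encoding."""
    raw = text.encode("utf-8")
    return raw.decode(wrong_encoding)


# Spelling out the exact bytes avoids any dependence on how the source file
# itself is interpreted: these six bytes are the UTF-8 encoding of "воп".
VOP_UTF8 = b"\xd0\xb2\xd0\xbe\xd0\xbf"

if __name__ == "__main__":
    print(misread_utf8("воп"))       # each two-byte Cyrillic letter turns into two characters
    print(VOP_UTF8.decode("utf-8"))  # decoding the explicit bytes as UTF-8 restores the text
```

Each Cyrillic letter occupies two bytes in UTF-8, so a misread three-letter string shows up as six Latin characters, similar to the `searchdata.js` diff earlier in this thread.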

albert-github · 2021-03-26T10:42:26Z

The good old locale, always giving problems ....
Looks to be working now. I tested it with a Japanese string as well.
Escaping of non-identifier characters
Looks to be working now.

…s correctly

The problem is that "_" is seen as an Id character and is not escaped for JS search. This is a regression on:

```
Commit: a4ecbee [a4ecbee]
Date: Monday, March 22, 2021 8:02:06 PM

issue doxygen#8375: Lowercase search does not find non-ASCII uppercase pages and vice versa
```

and

```
Commit: 3a365ab [3a365ab]
Date: Wednesday, March 24, 2021 8:34:50 PM

issue doxygen#8375 Lowercase search does not find non-ASCII uppercase pages and vice versa (part 2)
```
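One plausible way to see why the underscore itself needs escaping is sketched below in Python. The escaping scheme (an underscore followed by two lowercase hex digits, as in the `_20` seen for a space in the `pages_1.js` diff) is inferred from the output in this thread; the function `escape_id` and its `escape_underscore` flag are hypothetical and do not reproduce doxygen's `searchId()` exactly.

```python
def escape_id(text: str, escape_underscore: bool = True) -> str:
    """Escape ASCII non-alphanumerics as '_' plus two hex digits; keep everything else."""
    out = []
    for ch in text:
        keep = (
            not ch.isascii()                          # non-ASCII letters pass through
            or ch.isalnum()                           # ASCII letters and digits pass through
            or (ch == "_" and not escape_underscore)  # pre-fix behavior: '_' treated as an id character
        )
        out.append(ch if keep else "_%02x" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    print(escape_id("основные сведения"))             # space becomes _20
    print(escape_id("a b", escape_underscore=False))   # a_20b
    print(escape_id("a_20b", escape_underscore=False)) # also a_20b: ambiguous
    print(escape_id("a_20b"))                          # a_5f20b: unambiguous
```

If the underscore is kept as-is, a literal `a_20b` and the escaped form of `a b` produce the same id; escaping the underscore as well keeps the two apart.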

apolukhin · 2021-05-05T16:18:41Z

> @apolukhin Please verify if commit a4ecbee fixes the problem for you.

@doxygen yep, latest master works like a charm.

Many thanks!

doxygen · 2021-08-18T20:29:42Z

This issue was previously marked 'fixed but not released',
which means it should be fixed in doxygen version 1.9.2.
Please verify if this is indeed the case. Reopen the
issue if you think it is not fixed and please include any additional information
that you think can be relevant (preferably in the form of a self-contained example).

This is a regression on doxygen#8375: the `substr` function requires a length, not an end position. The problem was found when looking at doxygen#3244.
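The length-versus-end-position mix-up is easy to reproduce with any (position, length) substring function. The Python helper below, `substr`, is a hypothetical stand-in that mimics those semantics; it is not the function used in doxygen.

```python
def substr(s: str, pos: int, length: int) -> str:
    """Return `length` characters of `s` starting at index `pos`."""
    return s[pos:pos + length]


text = "abcdefgh"
start, end = 2, 5  # intended range: indices 2, 3 and 4

wrong = substr(text, start, end)          # end position passed as a length
right = substr(text, start, end - start)  # length computed from the two positions

if __name__ == "__main__":
    print(wrong)  # too long: runs past the intended end
    print(right)  # exactly the intended range
```

Passing the end index where a length is expected silently returns too many characters whenever the start position is non-zero, which makes this kind of bug easy to miss when most substrings start at index 0.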

albert-github added bug needinfo reported bug is incomplete, please add additional info labels Feb 9, 2021

albert-github removed the needinfo reported bug is incomplete, please add additional info label Feb 9, 2021

albert-github added the HTML HTML / XHTML output label Mar 4, 2021

doxygen added a commit that referenced this issue Mar 22, 2021

issue #8375: Lowercase search does not find non-ASCII uppercase pages…

a4ecbee

… and vice versa

doxygen added the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Mar 22, 2021

albert-github removed the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Mar 23, 2021

doxygen added a commit that referenced this issue Mar 24, 2021

issue #8375 Lowercase search does not find non-ASCII uppercase pages …

3a365ab

…and vice versa (part 2)

doxygen added the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Mar 25, 2021

doxygen removed the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Aug 18, 2021

doxygen closed this as completed Aug 18, 2021

albert-github mentioned this issue Nov 22, 2022

Incorrect return of getUTF8CharAt giving wrong alphabetical index #9688

Merged
