Lowercase search does not find non-ASCII uppercase pages and vice versa #8375

Closed
apolukhin opened this issue Feb 9, 2021 · 11 comments
Labels
bug HTML HTML / XHTML output

Comments

@apolukhin
Contributor

Describe the bug
Searching for "привет" does not find a page whose title is "Привет".

Expected behavior
For ASCII everything works fine: searching for "faq" finds the page "FAQ". The same behavior is expected for non-ASCII pages.

Screenshots
seach_lowercase_ru

To Reproduce

  1. Make some markdown pages starting with # Привет, # Основные сведения, # прочее and # Введение.
  2. Generate docs from those pages with SEARCHENGINE=YES and SERVER_BASED_SEARCH=NO.
  3. Search for при, ПРИВ, основ, ПРОЧЕЕ, вВеден

Version
1.9.1 and trunk.

@albert-github albert-github added bug needinfo reported bug is incomplete, please add additional info labels Feb 9, 2021
@albert-github
Collaborator

I recently saw a fix for a similar issue, #5263, that might also fix your problem (although you mention that trunk has the problem too). It is hard to tell without an example, and I don't have a good example with Cyrillic search terms at hand.

  • Please specify the trunk version you used (complete information of doxygen -v).
  • Can you please attach a small, self-contained example (source + configuration file in a tar or zip) that allows us to reproduce the problem? Please don't add external links as they might not be persistent.

@apolukhin
Contributor Author

doxygen -v
1.9.2 (46ffd77e8d3c15244732128caace858f2aa38d73)

Searching for "ОСН", "ПРИ", "При" and "проч" works (this was fixed in #5263, thanks!).
Searching for "при", "вве", "ПРОЧ" and "осн" does not work.

Attaching a self-contained example; all the info above is duplicated inside the Doxygen config:
doxygen_sample.zip

@albert-github albert-github removed the needinfo reported bug is incomplete, please add additional info label Feb 9, 2021
@albert-github
Collaborator

It was at first a bit confusing as the output language was still English, but the problem here lies with the names of the sections in the related pages.

@albert-github
Collaborator

This looks like a more fundamental problem. When writing the search files (among others search/all*.js, search/pages*.js and search/searchdata.js), the conversion to lowercase is not done for these characters the way it is done for the Latin alphabet / ASCII.

albert-github added a commit to albert-github/doxygen that referenced this issue Mar 4, 2021
… pages and vice versa

Implementation of an uppercase / lowercase conversion as needed by doxygen.
The standard tolower / toupper functions don't really work as they need a "locale" which in general is not necessary for Unicode / UTF8 conversions.
- caseconvert.cpp / caseconvert.h generated code based on the table from https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt with some small modifications regarding uppercase values that shouldn't have a lowercase representation (Kelvin sign) or combined characters where there is no 100% one to one relation between uppercase and lowercase due to some mix (e.g.  DZ,  Dz and  dz).
- util.cpp / searchengine.cpp using the new functions
- search.js: the old "workaround" is no longer necessary (see issue doxygen#5263)
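
The locale-independent conversion described in this commit message can be illustrated with a minimal sketch. This is not the generated caseconvert.cpp code: the names `lowerCodePoint` and `toLowerUtf8Sketch` are hypothetical, and only ASCII plus the Cyrillic range U+0400..U+042F is mapped here, whereas the real table is derived from UnicodeData.txt.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: lowercase a single code point without any locale.
// Only ASCII and part of the Cyrillic block are covered in this sketch.
static std::uint32_t lowerCodePoint(std::uint32_t cp)
{
  if (cp >= 'A' && cp <= 'Z')       return cp + 0x20; // A-Z -> a-z
  if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20; // U+0410..U+042F -> U+0430..U+044F
  if (cp >= 0x0400 && cp <= 0x040F) return cp + 0x50; // U+0400..U+040F -> U+0450..U+045F
  return cp;
}

// Lowercase a UTF-8 string by decoding, mapping and re-encoding.
// Only 1-byte and 2-byte sequences are converted; anything else is
// copied through unchanged (an assumption to keep the sketch short).
std::string toLowerUtf8Sketch(const std::string &in)
{
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); )
  {
    unsigned char b0 = static_cast<unsigned char>(in[i]);
    if (b0 < 0x80) // 1-byte sequence (ASCII)
    {
      out += static_cast<char>(lowerCodePoint(b0));
      i += 1;
    }
    else if ((b0 & 0xE0) == 0xC0 && i + 1 < in.size()) // 2-byte sequence
    {
      unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
      std::uint32_t cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
      cp = lowerCodePoint(cp);
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
      i += 2;
    }
    else // longer sequences or stray bytes: copy as-is
    {
      out += in[i];
      i += 1;
    }
  }
  return out;
}
```

With this sketch the UTF-8 bytes of "Привет" map to those of "привет", which a byte-wise `tolower` in the default "C" locale would not achieve, because it treats each byte of a multi-byte sequence separately.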
@albert-github albert-github added the HTML HTML / XHTML output label Mar 4, 2021
@albert-github
Collaborator

I've just pushed a proposed patch: pull request #8409.

@doxygen doxygen added the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Mar 22, 2021
@doxygen
Owner

doxygen commented Mar 22, 2021

@apolukhin Please verify if commit a4ecbee fixes the problem for you.

@albert-github
Collaborator

As far as I can see it does not work.

Example: example.tar.gz

Here we have the source to generate the html pages and the directories:

When going to the related pages (for easy cut and paste), cutting the text and pasting it into the search bar gives:

  • Основные сведения: my approach shows the reference, whereas the Cygwin and Windows versions give "No Matches". Only after trimming the search text down to the space does Cygwin show a match; Windows still does not.
    See the subtle difference (space versus _20):
    diff -r html_c/search/pages_1.js html_mine/search/pages_1.js
    3c3
    <   ['основные сведения_5',['Основные сведения',['../md_test2.html',1,'']]]
    ---
    >   ['основные_20сведения_5',['Основные сведения',['../md_test2.html',1,'']]]
    
  • прочее: with my approach and on Cygwin this looks OK; on Windows the results window even shows a "file not found" reference.
    See:
    diff -b -w -r html_mine/search/searchdata.js html_w/search/searchdata.js
    3,4c3,4
    <   0: "воп",
    <   1: "воп"
    ---
    >   0: "вопп",
    >   1: "вопп"
    

@albert-github albert-github removed the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Mar 23, 2021
doxygen added a commit that referenced this issue Mar 24, 2021
@doxygen
Owner

doxygen commented Mar 24, 2021

@albert-github Fixed two issues:

  • Visual Studio interprets u8"..." string literals in the locale encoding (so not UTF-8!) which caused the scrambled output. Can be fixed by using the /utf-8 compiler option, but since other compilers may have similar issues I decided to put the exact byte encoding in the caseconvert.h file instead like you did.
  • I missed the escaping of non-identifier characters in function searchId()

Let me know if you see more issues.
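
The first point, spelling out the exact bytes instead of relying on how the compiler decodes the source file, might look like the sketch below. The struct `CasePairSketch`, the table `kCaseSample` and the function `lowerOfSketch` are hypothetical and do not reflect the actual layout of caseconvert.h; the two table entries were chosen only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical table entry: an uppercase character and its lowercase
// counterpart, both written as explicit UTF-8 bytes so that the result
// does not depend on the source or execution character set.
struct CasePairSketch
{
  const char *upper;
  const char *lower;
};

static const CasePairSketch kCaseSample[] = {
  { "\xD0\x9F", "\xD0\xBF" }, // U+041F -> U+043F (Cyrillic PE)
  { "\xD0\x9E", "\xD0\xBE" }, // U+041E -> U+043E (Cyrillic O)
};

// Return the lowercase bytes for `upper`, or `upper` itself when the
// character is not in the (tiny) sample table.
std::string lowerOfSketch(const std::string &upper)
{
  for (const CasePairSketch &p : kCaseSample)
  {
    if (upper == p.upper) return p.lower;
  }
  return upper;
}
```

Because every byte is given as a hex escape, the compiled table should be identical regardless of whether the compiler is told that the source file is UTF-8.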

@doxygen doxygen added the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Mar 25, 2021
@albert-github
Collaborator

  • The good old locale, always giving problems ....
    Looks to be working now. I tested it with a Japanese string as well.
  • Escaping of non-identifier characters
    Looks to be working now.

albert-github added a commit to albert-github/doxygen that referenced this issue Apr 8, 2021
…s correctly

The problem is that "_" is seen as an Id character and is not escaped for the JS search.

This is a regression on:
```
Commit: a4ecbee [a4ecbee]
Date: Monday, March 22, 2021 8:02:06 PM
issue doxygen#8375: Lowercase search does not find non-ASCII uppercase pages and vice versa
```
and
```
Commit: 3a365ab [3a365ab]
Date: Wednesday, March 24, 2021 8:34:50 PM
issue doxygen#8375 Lowercase search does not find non-ASCII uppercase pages and vice versa (part 2)
```
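
Why "_" itself has to be escaped can be seen in a minimal sketch of such an escaping scheme. The function name `escapeSearchIdSketch` is hypothetical and this is not doxygen's `searchId()`; the `_20` form for a space is taken from the diff earlier in this thread, while the lowercase hex digits and the pass-through of bytes >= 0x80 are assumptions.

```cpp
#include <string>

// Escape every ASCII character that is not a letter or digit as '_'
// followed by two hex digits. Bytes >= 0x80 (parts of multi-byte UTF-8
// characters) are kept as-is in this sketch.
std::string escapeSearchIdSketch(const std::string &s)
{
  static const char hex[] = "0123456789abcdef";
  std::string out;
  for (unsigned char c : s)
  {
    bool isAsciiAlnum = (c >= '0' && c <= '9') ||
                        (c >= 'a' && c <= 'z') ||
                        (c >= 'A' && c <= 'Z');
    if (isAsciiAlnum || c >= 0x80)
    {
      out += static_cast<char>(c);
    }
    else // includes '_' itself, so escaped output stays unambiguous
    {
      out += '_';
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    }
  }
  return out;
}
```

If "_" were passed through unescaped, the inputs `a b` and `a_20b` would both produce `a_20b`; escaping it as well keeps the two apart (`a_20b` versus `a_5f20b`).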
@apolukhin
Contributor Author

> @apolukhin Please verify if commit a4ecbee fixes the problem for you.

@doxygen Yep, the latest master works like a charm.

Many thanks!

@doxygen
Owner

doxygen commented Aug 18, 2021

This issue was previously marked 'fixed but not released',
which means it should be fixed in doxygen version 1.9.2.
Please verify if this is indeed the case. Reopen the
issue if you think it is not fixed and please include any additional information
that you think can be relevant (preferably in the form of a self-contained example).

@doxygen doxygen removed the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Aug 18, 2021
@doxygen doxygen closed this as completed Aug 18, 2021
albert-github added a commit to albert-github/doxygen that referenced this issue Nov 22, 2022
This is a regression on doxygen#8375, the `substr` function requires a length and not an end position.
Problem was found when looking at doxygen#3244
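
The commit message does not say which `substr` is meant; both C++ `std::string::substr` and JavaScript `String.prototype.substr` take a start position and a length. A minimal C++ sketch of the pitfall, with a hypothetical helper `sliceByEnd` that converts an end position into the length `substr` expects:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the characters in the half-open range [begin, end) of s.
// std::string::substr takes (position, count), so the count has to be
// computed as end - begin rather than passing `end` directly.
std::string sliceByEnd(const std::string &s, std::size_t begin, std::size_t end)
{
  assert(begin <= end && end <= s.size()); // precondition of this sketch
  return s.substr(begin, end - begin);
}
```

For the string `abcdef`, `sliceByEnd(s, 2, 4)` yields `cd`, whereas the mistaken call `s.substr(2, 4)` yields `cdef`.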