Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V8: Replace icudtl.dat with full ICU data file 64.2 #11046

Closed
wants to merge 6 commits into from

Conversation

Simran-B
Copy link
Contributor

@Simran-B Simran-B commented Feb 3, 2020

Taken from https://github.com/unicode-org/icu/releases/tag/release-64-2
(icu4c-64_2-src.zip/icu/source/data/in/icudt64l.dat)

This is for little endian systems only (l). If needed, it seems like icupkg -tb could be used to convert it to big endianess.

Start server with new data folder and --default-language is, then run the following AQL query:

RETURN CONCAT(
    FOR letter IN SPLIT("YKÁÉVÝLIÓFEÖPNÐRXGÍJÚOSÞBDUHÆTMA", "") /* random order */
    SORT letter
    RETURN letter /* expected is:    AÁBDÐEÉFGHIÍJKLMNOÓPRSTUÚVXYÝÞÆÖ
                     wrong would be: AÁÆBDÐEÉFGHIÍJKLMNOÓÖPRSTUÚVXYÝÞ */
)

With the icudtl.dat replaced in 3rdParty\V8\v7.9.317\third_party\icu\common (10 MB -> 26 MB) I get the correct result. Might need to add specific tests to ensure that it's actually working and to raise flags if we accidentally overwrite the data file with a subset version some time later. Could test Icelandic and German phone book sorting de-DE-u-co-phonebk.

Related:

@Simran-B Simran-B added this to the 3.7 milestone Feb 3, 2020
@Simran-B Simran-B self-assigned this Feb 3, 2020
@Simran-B
Copy link
Contributor Author

Simran-B commented Feb 4, 2020

Something isn't right with ArangoSearch... I created a collection coll and ran:

FOR letter IN SPLIT("YKÁÉVÝLIÓFEÖPNÐRXGÍJÚOSÞBDUHÆTMA", "")
INSERT { letter } INTO coll

I also created two Analyzers:

var analyzers = require("@arangodb/analyzers");
analyzers.save("text_is","text",{ locale: "is.utf-8", stopwords:[] },["frequency","norm","position"]);
analyzers.save("text_is2","text",{ locale:"is.utf-8", accent: true, stopwords:[]},["frequency","norm","position"])

And a View:

  "links": {
    "coll": {
      "fields": {
        "letter": {
          "analyzers": [
            "identity",
            "text_en",
            "text_is",
            "text_is2"
          ]
        }
      },

Sorting the collection or View is okay:

FOR doc IN view // or: IN coll
SORT doc.letter
RETURN doc

Both return the documents in the expected order if the server was started with --default-language is.

However, limiting it to a sub-range of letters doesn't work correctly in the View case:

FOR doc IN coll
FILTER doc.letter >= "X" AND doc.letter <= "Þ"
SORT doc.letter
RETURN doc.letter // [ "X", "Y", "Ý", "Þ" ]
FOR doc IN coll
SEARCH ANALYZER(doc.letter >= "X" AND doc.letter <= "Þ", <analyzer>)
SORT doc.letter
RETURN doc.letter
/* different Analyzers:
identity = [ "Á", "Ð", "É", "Í", "Ó", "Ú", "Y", "Ý", "Þ", "Æ", "Ö"]
text_en = [ "A", "Á", "B", "D", "E", "É", "F", "G", "H", "I", "Í", "J", "K", "L", "M", "N", "O", "Ó", "P", "R", "S", "T", "U", "Ú", "V", "X", "Y", "Ý", "Ö" ]
text_is = [ "A", "Á", "B", "D", "E", "É", "F", "G", "H", "I", "Í", "J", "K", "L", "M", "N", "O", "Ó", "P", "R", "S", "T", "U", "Ú", "V", "X", "Y", "Ý", "Ö" ]
text_is2 = [ "A", "B", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "R", "S", "T", "U", "V", "X", "Y" ]
*/

BTW. it is allowed to do SORT ANALYZER(doc.letter, "text_is"), but I'm not sure if it's supposed to do anything nor if it actually does.

@dothebart
Copy link
Contributor

@Simran-B
Copy link
Contributor Author

On a related note, can we get rid of this warning that shows up on every server startup?

WARNING [581d1] failed to locate 'icudtl.dat' at 'C:\Windows\icudtl.dat'

We ship our own data file and it's uncommon to have one in your system I suppose. As long as either exists, no warning should be raised IMO. An INFO log would be okay though, to tell the user which icudtl.dat it picked, should that be of relevance (or what was the reason to add above warning in the first place?)

@dothebart
Copy link
Contributor

huh? do you have an environment variable configured that points there? We don't search there by default.

@Simran-B
Copy link
Contributor Author

@dothebart Heureka!

image

Looks like I created this env var? Probably because it was mentioned in https://www.arangodb.com/docs/stable/installation-compiling-windows.html#building-arangodb. I do have icudt54l.dat in C:\Windows, but recent ArangoDB versions will be looking for other versions.

Is the information from the docs still relevant, or does a locally compiled ArangoDB find the ICU data file from the repository anyway?

@dothebart
Copy link
Contributor

yes, this used to be a part of our original windows debug installation instructions.

@Simran-B Simran-B removed the 9 WIP label Mar 19, 2020
@Simran-B Simran-B marked this pull request as ready for review March 19, 2020 12:00
@Simran-B Simran-B added 3 Core 3 Search IResearch / Fulltext index / Analyzers labels Mar 19, 2020
@Simran-B
Copy link
Contributor Author

@dothebart According to Andrey it is an (internally) known issue that the SEARCH operation does not take --default-language nor the Analyzer locale into account for the relevant operators: arangodb/docs#374 (review)

For SORT as well as regular FILTER it appears to work just fine, so nothing should stop us from merging the full ICU data file, as the ArangoSearch related issues existed before.

@srl295
Copy link

srl295 commented Mar 24, 2020

Hi! You might want to see the progress on nodejs/full-icu-npm#36 — I have not started publishing the data binaries as described in ICU-20600 yet. How far back would you need them?

Also note that for Node.js in https://github.com/nodejs/node/pull/29522/files#diff-947d0de2be4d9fa41699237a83d77ac5 I used a .bz2 compression of the data file. Only 9.3Mb

Also by the way, ICU has a data build tool, https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md that can be used to customize a subset of the data. If you only need collation (or collation plus something) you could make a more compact data file.

@Simran-B
Copy link
Contributor Author

Hi @srl295,
thanks for getting in touch and sharing some links!

TL;DR - we don't need anything older than ICU v64.2 (and we already have that).
An easy way to get the full ICU data file in the future is all we want.

I don't know all the details but here some background information:

  • V8 is embedded into our system, but we rarely upgrade it:
    • v4.1.0.27 in v2.5 (Mar 2015)
    • v4.3.61 in v2.7 (Oct 2015)
    • v5.0.71.39 in v3.0 (Jul 2016)
    • v5.7.492.77 in v3.2 (Jul 2017)
    • v7.1.302.28 in v3.5 (Aug 2019)
    • v7.9.317 in v3.7 (~May 2020)
  • Upgrading is a manual process and mostly about dealing with API breaking changes
  • We don't use the latest available version on upgrades, but whatever Node.js uses, because of their existing build scripts etc. AFAIK
  • We use a single ICU data file for multiple components (V8, IResearch/ArangoSearch, server core)
  • A user asked us about support for Icelandic a while ago. It turned out that our ICU data file was only a subset. No code changes necessary to support the language, just ship the full file.
  • We would like to ship full ICU but don't want to compile it ourselves.
  • The size of the ICU data file does not really matter to us, unless it would be >50 MB or so.
  • Node.js 13.0.0 (released 2019-10-22) was the first version to default to full-icu and used V8 v7.8. Our v7.9 has the smaller 10 MB ICU data file however, not sure why.
  • We want to replace it by the full 25 MB file (this PR). It's fine for us to download, extract and replace the file whenever we upgrade V8 by hand, but it would be convenient if it was the full version out of the box.
  • We target x86 64-bit little-endian systems. A big-endian data file isn't needed at the moment.

@srl295
Copy link

srl295 commented Mar 26, 2020

Then you'll find the full little endian data file within icu-src .tgz or .zip in the ICU download.

@srl295
Copy link

srl295 commented Apr 10, 2020

please note ICU67rc has the new data file zips https://github.com/unicode-org/icu/releases/tag/release-67-rc

@Simran-B
Copy link
Contributor Author

Simran-B commented May 6, 2020

@KVS85 Meanwhile, I added several of these warnings to the docs:

image

While the larger ICU data file adds support for more languages, the lexicographic order is still not taken into account. There is no regression because of this PR, so it should be safe to merge.

@KVS85 KVS85 requested review from gnusi and jsteemann May 13, 2020 15:55
@KVS85
Copy link
Contributor

KVS85 commented May 13, 2020

@KVS85
Copy link
Contributor

KVS85 commented May 13, 2020

@srl295
Copy link

srl295 commented May 13, 2020

@KVS85 FWIW are the Jenkins URLs supposed to be broken? I get host not found…

@Simran-B
Copy link
Contributor Author

Simran-B commented May 13, 2020

@srl295 The links work within our company network only. Our Jenkins is not public.

@Simran-B
Copy link
Contributor Author

@srl295 I added instructions for future updates of the ICU data file. The separate distribution of just the data file is really handy, thanks a lot!

@Simran-B
Copy link
Contributor Author

In the previous PR run, the following error occurred:

* Test "catch"
    [FAILED]  basics

      "test" failed: exit code was 1

    [FAILED]  IResearchQueryLateMaterializationTest

      "test" failed: /work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true

A test run using the latest devel state did not show this error:

http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11687/

Merged devel into this PR once more. Let's see if the problem persists. If it does, then the ICU data file is to blame - even though I can't think of a good reason why that should be the case:

http://jenkins01.arangodb.biz:8080/job/arangodb-matrix-pr-linux/11689/

@Simran-B
Copy link
Contributor Author

It seems to be caused by the ICU data file after:

* Test "catch"
    [FAILED]  basics

      "test" failed: exit code was 1

    [FAILED]  IResearchQueryLateMaterializationTest

      "test" failed: /work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true/work/ArangoDB/tests/IResearch/IResearchQueryLateMaterialization-test.cpp:219
Value of: 0 == arangodb::basics::VelocyPackHelper::compare( arangodb::velocypack::Slice(*expectedDoc), resolved, true)
  Actual: false
Expected: true

    [FAILED]  IResearchQueryNoMaterializationTest

      "test" failed: /work/ArangoDB/tests/IResearch/IResearchQueryNoMaterialization-test.cpp:240
Value of: memcmp(expectedValue->getCharPtr(), resStr, length) == 0
  Actual: false
Expected: true

   Suites failed: 3 Tests Failed: 0

The tests IResearchQueryNoMaterialization-test.cpp and IResearchQueryLateMaterialization-test.cpp don't seem to deal with non-ASCII characters, so I'm puzzled about these failures.

@Simran-B
Copy link
Contributor Author

With the ICU file applied on top of #11603 it still produces above errors 😕

@srl295
Copy link

srl295 commented May 14, 2020 via email

@dothebart
Copy link
Contributor

dothebart commented May 15, 2020 via email

@Simran-B Simran-B marked this pull request as draft May 22, 2020 13:30
@Simran-B
Copy link
Contributor Author

It seems like the official ICU data file and what we currently use is somehow incompatible, and it's unclear if we want to put in the effort of building Chromium with tweaks to get a compatible, complete file.

https://arangodb.atlassian.net/browse/AR-115 (internal)

@Simran-B Simran-B closed this Dec 11, 2020
@Simran-B Simran-B deleted the feature/full-icu-64-2 branch December 11, 2020 15:08
@srl295
Copy link

srl295 commented Dec 11, 2020 via email

@Simran-B
Copy link
Contributor Author

@srl295 If I substitute our Chromium/V8 ICU data file (which is a subset) with the matching version of the official full ICU data file, then suddenly bizarre errors occur in some of our unit tests, which don't even deal with non-ASCII characters. Furthermore, some collations seem to be broken. I can't remember which language exactly, but I believe Russian or Ukrainian did not sort correctly anymore (the first letter of the alphabet ended up at the very end, the rest looked okay however). None of these issues occur with the current ICU data file 😩

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 Feature 3 Core 3 Search IResearch / Fulltext index / Analyzers 9 WIP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants