How does SORT work for string properties #19526

Open
HongmanKim opened this issue Aug 4, 2023 · 1 comment

@HongmanKim

My Environment

  • ArangoDB Version: 3.9.3
  • Deployment Mode: Single Server
  • Deployment Strategy: Manual Start in Docker
  • Configuration:
  • Infrastructure: AWS, RHEL VM
  • Operating System: Ubuntu 20.04, RHEL 8
  • Total RAM in your machine: 64GB
  • Disks in use: SSD
  • Used Package:

Component, Query & Data

AQL, arangojs

AQL query (if applicable):

FOR doc IN dummy
	SORT doc.r
	RETURN doc

AQL explain and/or profile (if applicable):

Query results are:

[
  {
	"_key": "441941",
	"_id": "dummy/441941",
	"_rev": "_gZktyIC---",
	"r": "_"
  },
  {
	"_key": "441650",
	"_id": "dummy/441650",
	"_rev": "_gZktkLe---",
	"r": "-"
  },
  {
	"_key": "450584",
	"_id": "dummy/450584",
	"_rev": "_gZvbPKu---",
	"r": "0"
  },
  {
	"_key": "450600",
	"_id": "dummy/450600",
	"_rev": "_gZx0HOm---",
	"r": "1",
	"a": "whatever"
  },
  {
	"_key": "450616",
	"_id": "dummy/450616",
	"_rev": "_gZvbtu2---",
	"r": "5"
  },
  {
	"_key": "441680",
	"_id": "dummy/441680",
	"_rev": "_gZktIaC---",
	"r": "A"
  },
  {
	"_key": "441771",
	"_id": "dummy/441771",
	"_rev": "_gZktTta---",
	"r": "a"
  }
]

The sorted strings are effectively:

'_', '-', '0', '1', '5', 'A', 'a'

The sorted results differ from those of Array.sort() and localeCompare():

unifr@melon:~$ node
Welcome to Node.js v16.19.0.
Type ".help" for more information.
> ['_', '-', '0', '1', '5', 'A', 'a'].sort()
[
  '-', '0', '1',
  '5', 'A', '_',
  'a'
]

> ['_', '-', '0', '1', '5', 'A', 'a'].sort((a, b)=>a.localeCompare(b))
[
  '_', '-', '0',
  '1', '5', 'a',
  'A'
]

> ['_', '-', '0', '1', '5', 'A', 'a'].sort((a, b)=>a.localeCompare(b, 'en-US'))
[
  '_', '-', '0',
  '1', '5', 'a',
  'A'
]
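As a side note on the first result: Array.prototype.sort() without a comparator compares strings by UTF-16 code unit, not by locale rules, which is why '_' lands between 'A' and 'a' there. A minimal sketch that makes this visible (the variable names are only for illustration):

```javascript
// Sample values taken from the query result above.
const sample = ['_', '-', '0', '1', '5', 'A', 'a'];

// Pair each single-character string with its UTF-16 code unit.
const codeUnits = sample.map((s) => [s, s.charCodeAt(0)]);

// Default sort: plain code unit comparison, no locale involved.
const byCodeUnit = [...sample].sort();

console.log(codeUnits);
console.log(byCodeUnit); // [ '-', '0', '1', '5', 'A', '_', 'a' ]
```

The locale-aware variants (localeCompare, Intl.Collator) do not use these numeric values, so they are the relevant ones to compare against AQL's SORT.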

In the ArangoDB container, the default LANGUAGE is set to en_US.

/var/lib/arangodb3 # cat LANGUAGE
{"default":"en_US"}

According to the documentation (https://www.arangodb.com/docs/stable/aql/fundamentals-type-value-order.html):

string: string values are ordered using a localized comparison, using the configured server language for sorting according to the alphabetical order rules of that language

But I'm not sure which comparison function SORT in AQL actually uses in my ArangoDB instance.

Dataset:
The documents above are from a dummy document collection.

Size of your Dataset on disk:
Very small

Replication Factor & Number of Shards (Cluster only):
N/A

Steps to reproduce

  1. Add documents to collection (with the content above)
  2. Run the AQL query

Problem:

The sort order is not what I expected.

Expected result:

I would expect the same results as using a.localeCompare(b, 'en-US').

@Simran-B
Contributor

String comparisons are handled by ICU, with exceptions for ArangoSearch-related features.

The difference between JavaScript and AQL seems to be that JS sorts lowercase characters before uppercase characters, whereas in AQL it is upper before lower. This is explicitly set for ICU by the server here, but only in the fallback case:

coll->setAttribute(UCOL_CASE_FIRST, UCOL_UPPER_FIRST, status); // A < a

If you explicitly specify --icu-language en_US, this attribute does not get set, and you get the default ICU behavior AFAICT.

I'm not sure if JavaScript explicitly sorts lowercase before uppercase or if it uses the default, which seems to be "off" according to the ICU Collation Demo - but I don't know whether it means "undefined order". It rather looks like the default is lowercase before uppercase (see the diff strengths in the collation demo).
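To compare the two behaviors from the JavaScript side, here is a rough sketch using Intl.Collator, which is also backed by ICU in Node.js. The caseFirst: 'upper' option is assumed here to be the JavaScript counterpart of UCOL_UPPER_FIRST; this only illustrates the collation setting and is not a statement about what the server does internally.

```javascript
// Sample values taken from the query result in the issue.
const input = ['_', '-', '0', '1', '5', 'A', 'a'];

// Collator with the locale defaults, and one that forces uppercase first.
const defaultCollator = new Intl.Collator('en-US');
const upperFirst = new Intl.Collator('en-US', { caseFirst: 'upper' });

// Shows which caseFirst setting the default collator resolved to.
console.log(defaultCollator.resolvedOptions().caseFirst);

// Sort copies so the input array stays untouched.
const defaultOrder = [...input].sort(defaultCollator.compare);
const upperFirstOrder = [...input].sort(upperFirst.compare);

console.log(defaultOrder);    // 'a' before 'A', as with localeCompare
console.log(upperFirstOrder); // 'A' before 'a', as in the AQL result
```

If that assumption holds, the AQL result above matches ICU collation with uppercase first, and the JavaScript result matches ICU's default, where case is only a tie-breaker and lowercase happens to sort first.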
