Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for collating function (sequence)? #767

Closed
zingchen opened this issue Jul 27, 2021 · 21 comments
Closed

Support for collating function (sequence)? #767

zingchen opened this issue Jul 27, 2021 · 21 comments

Comments

@zingchen
Copy link

zingchen commented Jul 27, 2021

The current version (v1.6) represents collating functions as an enum of 3 values: binary, nocase and rtrim, which corresponds to the 3 pre-built collating functions (also known as collating sequences in SQlite terminology). For many languages, the pre-built collating functions don't work.

Let's consider for example Czech. Two Czech words are compared lexicographically, i.e. letter-by-letter. However, the letter ch (pronounced kh) is represented as a sequence of two characters (Unicode codepoints) c and h, and takes place between h and i in the Latin alphabet. Thus Christina (Czech version of Christine, Khristina) is an eight-letter word, and takes place after Hana and before Klara.

Another example is Vietnamese. A diacriticized letter, e.g. é, ế, is represented as either a single pre-composed character or a sequence of two characters, the un-diacriticized letter (e, ê) followed by the diacritical mark. (Note that ê may be considered diacriticized letter in other contexts.) Diacritical marks represent tone of words. Every word has zero or one diacritical mark. For comparison between two words, the words are first translated into un-diacriticized form, possibly with extra traling spaces so as they have the same length, followed by the diacritical mark. The two words, in translated form, are then compared lexicographically. (In other words, a diacritical mark is always compared against a diacritical mark and such comparison takes place if and only if the un-diacriticized version of the words equal.)

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

Ok you've described the problem and what solution do you offer? How would you solve this issue with raw SQLite? Why don't you use custom collations instead of built-in?

@zingchen
Copy link
Author

zingchen commented Jul 27, 2021

With raw SQLite I'd register custom collating function myCompare() under name "MYCOMPARE" by a C API call to sqlite3_create_collation(myDbConnection, "MYCOMPARE", SQLITE_UTF8, NULL, myCompare), then write COLLATE MYCOMPARE in queries.

So, I think that SQLite-ORM should provide a wrapper for sqlite3_create_collation() which, besides others, maps the collating function (myCompare) to its name ("MYCOMPARE") and a version of collate_xxxx() family, e.g. collate_custom() that uses the mapping to translate the function's address to name for constructing the COLLATE clause.

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

there is custom collation support. Just call storage.create_collation(...)

@fnc12 fnc12 added the question label Jul 27, 2021
@zingchen
Copy link
Author

zingchen commented Jul 27, 2021

OOPS, I must have mis-read the source code. I'm sorry.

Any hint on how to specify the collating_function myCompare to construct a query? I've searched the examples but didn't find one using collating_function or create_collation.

Thanks.

@zingchen zingchen changed the title Incomplete support for collating function (sequence) Support for collating function (sequence)? Jul 27, 2021
@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

auto rows = storage.select(&User::name, where(is_equal(&User::name, "Mercury").collate("ototo")));

@zingchen
Copy link
Author

Thank you! Out of curiosity, why "ototo" and not &ototo or simply ototo?

The later seem to be more consistent with select(&User::name,...) and collate(nocase) syntax, which is one of the very principles of SQLite-ORM.

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

@zingchen several reasons:

  1. custom collation is not very often used feature so it is not bad if you can inject custom string using .collate(...) member function
  2. I was young and silly 😅

If you want I can add strong typed API just like user defined functions have

@zingchen
Copy link
Author

Please please do it. As ototo is treated equally to nocase and rtrim in the SQLite language, it'd better be so in C++. After all, that's why we love SQLite-ORM. Right?

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

Probably I did not understand you quite right. Do you mean that custom collation feature must have a strong typed API just like user defined functions feature has to eliminate the opportunity of typing collation name as string at many places? Like that:

struct MyUnicodeCollation {
    int operator()(int leftLength, const void *lhs, int rightLength, const void *rhs) const {
        //  cute comparison code here..
    }

    static std::string_view name() {
        return "MY_UNICODE_COLLATION";
    }
};
storage.create_collation<MyUnicodeCollation>();
auto rows = storage.select(&User::name, where(is_equal(&User::name, "Mercury").collate<MyUnicodeCollation>()));

?

@zingchen
Copy link
Author

Yes. The syntax collate<MyUnicodeCollation>() would be the most intuitive one from user's perspective, as it resembles the pattern collate_binary(), collate_nocase() and collate_rtrim().

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

Ok I'll add it to the TODO list

@zingchen
Copy link
Author

Thank you very much!

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

You are welcome! https://github.com/fnc12/sqlite_orm/blob/dev/TODO.md please review the latest line in the list. Is the issue actual?

@zingchen
Copy link
Author

Not so actual, compared to other items in the list. I can live with collate("MY_UNICODE_COLLATION") for now.

Some years ago, a student, who took an intro Ada course from me, told me that he prefers C++ to Ada because operator + (int x, int y) "is better" than function "+" (x, y : integer). The difference is + vs "+". I disagree with him, but since then I understand why Ada failed.

@fnc12
Copy link
Owner

fnc12 commented Jul 27, 2021

Cool story!

@fnc12
Copy link
Owner

fnc12 commented Aug 7, 2021

@zingchen new template API for custom collations is here #771

@fnc12
Copy link
Owner

fnc12 commented Aug 8, 2021

@zingchen merged. Please check it out

@zingchen
Copy link
Author

zingchen commented Aug 8, 2021

GCC encountered errors trying to compile the example collate.cpp with the new sqlite_orm.h (from the dev branch). Here are some errors.

C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h:5529:15: error: '::strncpy' has not been declared
 5529 |             ::strncpy(dataCopy, stringChars, stringDataLength + 1);
      |               ^~~~~~~
C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h:5431:1: note: 'strncpy' is defined in header '<cstring>'; did you forget to '#include <cstring>'?
 5430 | #include <locale>  //  std::wstring_convert
  +++ |+#include <cstring>
 5431 | 

C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h: In member function 'std::tuple<const char*, int> sqlite_orm::statement_binder<V, typename std::enable_if<(std::is_same<V, std::__cxx11::basic_string<char> >::value || std::is_same<V, const char*>::value), void>::type>::string_data(const char*) const':
C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h:5542:31: error: there are no arguments to 'strlen' that depend on a template parameter, so a declaration of 'strlen' must be available [-fpermissive]
 5542 |             auto length = int(strlen(s));
      |                               ^~~~~~
C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h:5542:31: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)

C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h:5542:37: error: 'strlen' was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
 5542 |             auto length = int(strlen(s));
      |                               ~~~~~~^~~
In file included from C:/msys64/mingw64/include/c++/10.3.0/cstring:42,
                 from C:\C-lib\sqlite_orm\include/sqlite_orm/sqlite_orm.h:5741,
                 from D:\duong\testCpp\test1\sqlite_orm_examples\collate.cpp:6:
C:/msys64/mingw64/x86_64-w64-mingw32/include/string.h:64:18: note: 'size_t strlen(const char*)' declared here, later in the translation unit
   64 |   size_t __cdecl strlen(const char *_Str);
      |                  ^~~~~~
Process terminated with status 1 (0 minute(s), 1 second(s))
4 error(s), 12 warning(s) (0 minute(s), 1 second(s))

@fnc12
Copy link
Owner

fnc12 commented Aug 8, 2021

@zingchen oh damn! The fix is here #772

@zingchen
Copy link
Author

zingchen commented Aug 9, 2021

It works flawlessly now.

(A variation of collate.cpp with added example for the new construct collate<CustomCol>().)

#include <ctime>

#include <string>
#include <iostream>

#include <sqlite_orm/sqlite_orm.h>

using std::cout;
using std::endl;

struct User {
    int id;
    std::string name;
    time_t createdAt;
};

struct Foo {
    std::string text;
    int baz;
};


int main(int, char **) {

    using namespace sqlite_orm;

    // custom collation
    struct CustomCol {
        int operator()(int leftLength, const void* lhs, int rightLength, const void* rhs) const {
            std::cout << static_cast<const char*>(lhs) << ":" << static_cast<const char*>(rhs) << std::endl;
            return 0;
        }

        static const char* name() {
            return "CUSTOMCOL";
        }
    };

    auto storage = make_storage(
        "collate_custom.sqlite",
        make_table("users",
                   make_column("id", &User::id, primary_key()),
                   make_column("name", &User::name),
                   make_column("created_at", &User::createdAt)),
        make_table("foo", make_column("text", &Foo::text, collate_nocase()), make_column("baz", &Foo::baz)));
    storage.sync_schema();
    storage.remove_all<User>();
    storage.remove_all<Foo>();

    storage.insert(User{0, "Lil Kim", std::time(nullptr)});
    storage.insert(User{0, "lil kim", std::time(nullptr)});
    storage.insert(User{0, "Nicki Minaj", std::time(nullptr)});

    storage.create_collation<CustomCol>();

    //  SELECT COUNT(*)
    //  FROM users
    //  WHERE name = 'lil kim'
    auto preciseLilKimsCount = storage.count<User>(where(is_equal(&User::name, "lil kim")));
    cout << "preciseLilKimsCount = " << preciseLilKimsCount << endl;

    //  SELECT COUNT(*) FROM users WHERE name = 'lil kim' COLLATE NOCASE
    auto nocaseCount = storage.count<User>(where(is_equal(&User::name, "lil kim").collate_nocase()));
    cout << "nocaseCount = " << nocaseCount << endl;

    //  SELECT COUNT(*) FROM users WHERE name = 'lil kim' COLLATE CUSTOMCOL
    auto customCount = storage.count<User>(where(is_equal(&User::name, "lil kim").collate<CustomCol>()));
    cout << "customCount = " << customCount << endl;


    //  SELECT COUNT(*) FROM users
    cout << "total users count = " << storage.count<User>() << endl;

    storage.insert(Foo{"Touch", 10});
    storage.insert(Foo{"touch", 20});

    cout << "foo count = " << storage.count<Foo>(where(c(&Foo::text) == "touch")) << endl;

    //  SELECT id
    //  FROM users
    //  ORDER BY name COLLATE RTRIM ASC
    auto rows = storage.select(&User::id, order_by(&User::name).collate_rtrim().asc());
    cout << "rows count = " << rows.size() << endl;

    return 0;
}

Thank you very much.

@fnc12
Copy link
Owner

fnc12 commented Aug 9, 2021

@zingchen you are welcome!

@fnc12 fnc12 closed this as completed Aug 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants