Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Using in SQLite #8

Closed
honzajavorek opened this issue Jul 8, 2023 · 3 comments
Closed

Question: Using in SQLite #8

honzajavorek opened this issue Jul 8, 2023 · 3 comments

Comments

@honzajavorek
Copy link
Contributor

SQLite offers a way to easily add custom functions written in Python. I wondered if I could register czech_sort.key as a custom function, because then I could have Czech sort on any field in my database with ease and portability. However, SQLite doesn't seem to be happy with tuples, raising an exception:

sqlite3.ProgrammingError: User-defined functions cannot return 'tuple' values to SQLite

I looked at the tuple if I can somehow convert it to one of the supported types, but I got scared, the tuple is enormous and undocumented.

Any ideas how this could be done, do I overlook something, is this out of scope, or better done different way?

@encukou
Copy link
Owner

encukou commented Jul 9, 2023

Yup, this isn't trivial: the way tuples sort matches the sorting rules nicely, and czech_sort takes advantage of that.

The tuple is documented in the code.

You could use something like this, but maybe key could be adapted to make things easier.

def bytes_key(string):
    def _recurse(key):
        match key:
            case tuple():
                # Turn individual items into bytes keys, and join them.
                # After each item, put a `1` byte if there are more items,
                # and a `0` byte if not.
                return b'\x01'.join(_recurse(e) for e in key) + b'\0'
            case str():
                # Encode to UTF-8, add a zero marker at the end.
                # The marker needs to be "smaller" than anything in the string,
                # including embedded `0` bytes.
                # So, the marker is a doubled `0` byte, and any `0` bytes in
                # the string are "escaped" to `0`-`1`.
                return key.encode().replace(b'\0', b'\0\x01') + b'\0\0'
            case int():
                if key < 0:
                    # Negative numbers are smallest, start with a `0` byte.
                    return bytes([0]) + key.to_bytes(8, signed=True)
                if key >= 254:
                    # Large numbers start with a full byte.
                    return bytes([255]) + key.to_bytes(8)
                # Small non-negative numbers use the remaining byte values.
                return (key + 1).to_bytes(1)
            case _:
                raise TypeError(key)
    return _recurse(czech_sort.key(string))

PR with tests welcome :)

honzajavorek pushed a commit to honzajavorek/czech-sort that referenced this issue Jul 9, 2023
@encukou
Copy link
Owner

encukou commented Jul 11, 2023

Thanks! Fixed in #9, out in 1.1.0.

WARNING: Do not store the results of bytes_key in the database. The format can change in future versions of czech_sort.

@encukou encukou closed this as completed Jul 11, 2023
@honzajavorek
Copy link
Contributor Author

Amazing! Thanks! This will make something convoluted and diffucult very simple and portable.

Btw, I see you dropped Python v2. I think the README will need an update in the compatibility section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants