Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

support emoji encoding for Flux jobids #5174

Merged
merged 11 commits into from May 17, 2023
Merged

Conversation

grondo
Copy link
Contributor

@grondo grondo commented May 16, 2023

This PR implements the emoji encoding for FLUIDs as proposed in the pending PR flux-framework/rfc#381.
It is a WIP pending acceptance of that RFC PR.

Instead of trying to implement a generic binary to emoji encoding, the "basemoji" implementation here simply converts uint64_t to base 576 using the same pattern as the F58 implementation. This allows high-order zero bits to be dropped since the result is a "number" with emoji as the base 576 digits.

A named emoji encoding is then added to the fluid.h with support for detection and parsing in fluid_parse(3), along with an id.emoji property for the JobID class and flux-jobs output formats.

$ flux job id --to=emoji 0 1 512 65535 123456789
馃槂
馃槃
馃敒
馃拋馃摎
馃幇馃挆馃摚

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea: update the cute format to use id.emoji? However, getting things to align is hard. I haven't been able to figure anything out yet that looks ok.

Edit: I think the emojis are double width in my terminal, so I need to "space over" the JOBID header. We need like a "header offset' or something.

image

Edit: heh, this fixed the alignment :P (12 spaces)

diff --git a/src/bindings/python/flux/job/info.py b/src/bindings/python/flux/job/info.py
index 393d40a..12f0ccb 100644
--- a/src/bindings/python/flux/job/info.py
+++ b/src/bindings/python/flux/job/info.py
@@ -633,7 +633,7 @@ class JobInfoFormat(flux.util.OutputFormat):
         "id.dec": "JOBID",
         "id.hex": "JOBID",
         "id.f58": "JOBID",
-        "id.emoji": "JOBID",
+        "id.emoji": "           JOBID",
         "id.kvs": "JOBID",
         "id.words": "JOBID",
         "id.dothex": "JOBID",


/* Maximum number of emoji "digits" in a basemoji string is
*
* ciel (ln (2^64-1)/ln (576)) = 7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ceil

}

/* Check for overflow of provided buffer:
* Need space for prefix + count bytes for emoji + NUL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no prefix?

/* Check for overflow of provided buffer:
* Need space for prefix + count bytes for emoji + NUL
*/
if (count + 1 > buflen) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parens around (count + 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but operator precedence puts + before > so what is the need here? Just preference?

Copy link
Member

@garlick garlick May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I would not add parens there (precedence is clear).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok no prob, I tend to add parens when there's more than "1 thingie" on one side of the gt or lt.

Comment on lines 111 to 122
/* Check for expected length of a basemoji string, and if the
* first two bytes match the expected UTF-8 encoding.
* This doesn't guarantee that `s` is a valid basemoji string,
* but this will catch most obvious cases and other invalid strings
* are left to be detected in decode.
*/
if (len >= BASEMOJI_MINLEN
&& len <= BASEMOJI_MAXLEN
&& (uint8_t)s[0] == 0xf0
&& (uint8_t)s[1] == 0x9f)
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should check strlen (s) % 4 == 0? I'm waffling on the need for it, would be good to catch obvious errors, but code below would handle it ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, good point. It is a simple check so might as well add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we could go through and ensure ever other two bytes start with F0 9F, but that seemed a bit heavyweight to me. This isn't validating the emoji string, but just detecting if we should try to decode it as one..

@@ -236,7 +245,7 @@ void test_basic (void)
ok (fluid_encode (buf, sizeof (buf), id, FLUID_STRING_DOTHEX) == 0,
"fluid_encode type=DOTHEX works");
ok (fluid_decode (buf, &id2, FLUID_STRING_DOTHEX) == 0 && id == id2,
"fluid_decode type=MNEMONIC works");
"fluid_decode type=DOTHEX works");
Copy link
Member

@chu11 chu11 May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be in another commit? another one below too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, those were cut-and-paste fixes I missed adding to a different commit :-(

grondo added 11 commits May 16, 2023 16:16
Problem: There is no implementation for encoding FLUIDs to emoji
as described in RFC 19.

Add basemoji, an encoding which uses a table of 576 standard Unicode
emoji to represent 64 bit unsigned integers in base 576.
Problem: There are no unit tests for the basemoji encoding for
64 bit unsigned integers.

Add a set of simple unit tess for basemoji to libutil/test.
Problem: Encoding FLUIDs to a string of emoji is not supported.

Use the libutil/basemoji implementation to add a FLUID_STRING_EMOJI
string type to fluid_string_type_t. Make sure FLUID_STRING_EMOJI is
supported with fluid_parse(3).
Problem: There are a couple cut-and-paste errors in the fluid unit test
that print type=MNEMONIC when they mean type=DOTHEX.

Fix the typos.
Problem: There are no unit tests for the FLUID_STRING_EMOJI encoding.

Amend the existing fluid unit tests to exercise this encoding.
Problem: libjob does not support encoding flux_jobid_t to emoji.

Add an "emoji" encoding that uses FLUX_STRING_EMOJI to encode
flux_jobid_t as a string of emoji using the basemoji implementation.
Problem: No tests in the libjob unit testsuite ensure the "emoji"
jobid encoding works as intended.

Add some expected emoji output to the libjob unit tests.
Problem: A Python JobID object can't return the emoji encoding of a
jobid since there's no corresponding class property.

Add an `emoji` property to the JobID class which returns the emoji
encoding of the jobid.
Problem: The Python job tests do not test the emoji encoding for JobIDs.

Add a tests for expected 'emoji' encodings of JobIDs.
Problem: The Python job formatting class does not support id.emoji,
even though emoji is valid jobid encoding.

Add `id.emoji` to various dictionaries as required to support this
encoding in output formats.
Problem: `id.emoji` is missing from the list of available field names
in flux-jobs(1).

Add id.emoji to the list of valid field names in the flux-jobs(1)
manual.
@chu11
Copy link
Member

chu11 commented May 16, 2023

I was completely messing around to see if I could get the emoji alignment to work and this ended up working:

@@ -443,6 +444,17 @@ class UtilFormatter(Formatter):
             basecases = ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch)
             value = "-" if str(value) in basecases else str(value)
             spec = spec[:-1] + "s"
+
+        if isinstance (value, str):
+            widecount = 0
+            for chr in value:
+                if unicodedata.east_asian_width(chr) == 'W':
+                    widecount += 1
+
+            if widecount and spec == ">12":
+                count = 12 - widecount
+                spec = f">{count}"
+
         retval = super().format_field(value, spec)

image

Obviously I hard coded it based on what I was expecting for the spec, but perhaps the basics of something to align wide chars is doable / reasonable ... I think there's a lot of corner cases on emoji chars and some are half-chars, but the output would atleast be sensible in most cases, vs the super shifted one I showed above.

@grondo
Copy link
Contributor Author

grondo commented May 16, 2023

Yeah, that might work! I wonder if we should have a special conversion so this doesn't have to be done for every string that is being formatted, or maybe there's some other way to better detect a possible "wide" string (or maybe it is fast enough that it doesn't matter)

@grondo
Copy link
Contributor Author

grondo commented May 17, 2023

BTW I stumbled across this library https://gitlab.com/fgallaire/cjkwrap/-/blob/master/cjkwrap.py
That has a ckjlen() function (len of a possible wide string) that we could maybe pilfer:

def cjklen(text):
     """cjklen(object) -> integer
         Return the real width of an unicode text, the len of any other type.
     """
     if not isinstance(text, text_type):
         return len(text)
     return sum(2 if is_wide(char) else 1 for char in text) 

@grondo grondo changed the title WIP: support emoji encoding for Flux jobids support emoji encoding for Flux jobids May 17, 2023
@grondo
Copy link
Contributor Author

grondo commented May 17, 2023

Just removed WIP since the RFC 19 changes went in.

@codecov
Copy link

codecov bot commented May 17, 2023

Codecov Report

Merging #5174 (631e983) into master (c0e977c) will increase coverage by 0.02%.
The diff coverage is 90.66%.

@@            Coverage Diff             @@
##           master    #5174      +/-   ##
==========================================
+ Coverage   83.13%   83.15%   +0.02%     
==========================================
  Files         454      455       +1     
  Lines       77961    78035      +74     
==========================================
+ Hits        64814    64892      +78     
+ Misses      13147    13143       -4     
Impacted Files Coverage 螖
src/bindings/python/flux/job/info.py 92.85% <酶> (酶)
src/common/libutil/fluid.c 93.17% <85.71%> (-0.30%) 猬囷笍
src/common/libutil/basemoji.c 90.62% <90.62%> (酶)
src/bindings/python/flux/job/JobID.py 100.00% <100.00%> (酶)
src/common/libjob/id.c 90.69% <100.00%> (+0.22%) 猬嗭笍

... and 5 files with indirect coverage changes

@chu11
Copy link
Member

chu11 commented May 17, 2023

I wonder if we should have a special conversion so this doesn't have to be done for every string that is being formatted

Oh that's a good idea. And honestly, we could just limit it to certain formatting in that case. like >\d+ and <\d+ or something.

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chu11
Copy link
Member

chu11 commented May 17, 2023

ehhh, this wasn't too awful, maybe could propose in a follow up PR (probably can cleanup some logic here).

diff --git a/src/bindings/python/flux/util.py b/src/bindings/python/flux/util.py
index f753f5e..1c2f82e 100644
--- a/src/bindings/python/flux/util.py
+++ b/src/bindings/python/flux/util.py
@@ -29,6 +29,7 @@ from string import Formatter
 from typing import Mapping
 
 import yaml
+import unicodedata
 
 # tomllib added to standard library in Python 3.11
 # flux-core minimum is Python 3.6.
@@ -443,6 +444,28 @@ class UtilFormatter(Formatter):
             basecases = ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch)
             value = "-" if str(value) in basecases else str(value)
             spec = spec[:-1] + "s"
+
+        if spec.endswith("W"):
+            if isinstance (value, str):
+                match = re.search(r"^([<>])(\d+)W", spec)
+                if match:
+                    align = match[1]
+                    width = int(match[2])
+                    widecount = 0
+                    for chr in value:
+                        if unicodedata.east_asian_width(chr) == 'W':
+                            widecount += 1
+
+                    if width > widecount:
+                        width -= widecount;
+                        spec = f"{align}{width}s"
+                    else:
+                        spec = ""
+                else:
+                    spec = spec[:-1]
+            else:
+                spec = spec[:-1]
+
         retval = super().format_field(value, spec)
 
         if denote_truncation and len(retval) < len(str(value)):
diff --git a/src/cmd/flux-jobs.py b/src/cmd/flux-jobs.py
index 5211fb6..a4a14bd 100755
--- a/src/cmd/flux-jobs.py
+++ b/src/cmd/flux-jobs.py
@@ -39,7 +39,7 @@ class FluxJobsConfig(UtilConfig):
         "cute": {
             "description": "Cute flux-jobs format string (default with emojis)",
             "format": (
-                "{id.f58:>12} ?:{queue:<8.8} {username:<8.8} {name:<10.10+} "
+                "{id.emoji:>12W} ?:{queue:<8.8} {username:<8.8} {name:<10.10+} "
                 "{status_emoji:>5.5} {ntasks:>6} {nnodes:>6h} "
                 "{contextual_time!F:>8h} {contextual_info}"
             ),

image

I did not add W to the status_emoji output though. I think there is some combo of wide width, total width, and width of the header that is coming into play here for alignment that I haven't quite figured out. I think this is a decent enough hack to get us something more sensible to update the cute format though. Edit: oh duh ... the header JOBID is always going to be shorter than the output of a basemoji, so all of those extra chars mess up alignment. But with STATUS, there is only one emoji. So it'll always be shorter than the header, so using W could mess up alignment.

@grondo
Copy link
Contributor Author

grondo commented May 17, 2023

Ok, I guess I'll set MWP then.. 馃槅

@mergify mergify bot merged commit 11d4362 into flux-framework:master May 17, 2023
32 checks passed
@grondo grondo deleted the f576 branch May 17, 2023 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants