Proposal: New lexicon `richtext` core-type #2830

pfrazee · 2024-09-25T17:48:09Z

pfrazee
Sep 25, 2024
Maintainer

This proposal is not resolved within the Bluesky team and is being shared here to solicit feedback. There's no firm commitment to this direction at this stage.

Motivation

Lexicon's core types can be found here.

We had a vigorous debate over how to represent richtext when Bluesky’s current schemas were being finalized (#621). Among the options we evaluated was this:

It's not entirely off the table that we switch to arrays of objects that represent spans of text and indicate their active marks/facets. This is how most richtext editors represent their data. The upside is that this resolves the indexing question entirely. The downside -- and why I didn't do this originally -- is it makes the maxLength calculation impossible unless we make this richtext representation a Lexicon primitive. I'm open to that too.

We ultimately decided to stick with offsets pointing into strings (utf8 codeunits). This has surfaced some downsides:

Problem 1. The utf8 codeunit indexes are difficult for engineers to understand.
Problem 2. The utf8 codeunit indexes are difficult to accomplish in some languages (eg javascript).
Problem 3. The slices can’t be authored manually, meaning that constructing richtext by hand is a pain. You typically have to invent a syntax which can be parsed.
Problem 4. There are some poorly-defined behaviors around how to interpret overlapping slices.
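
To make Problems 1 and 2 concrete, here is a minimal sketch of what a JavaScript/TypeScript client has to do to turn a native string index (UTF-16 code units) into a UTF-8 byte offset. The helper name `utf16IndexToUtf8Offset` and the sample text are made up for illustration; this is not taken from any SDK.

```typescript
// Hypothetical helper: convert a JavaScript string index (UTF-16 code units)
// into a UTF-8 byte offset by encoding the prefix and counting its bytes.
// Assumes the index does not fall in the middle of a surrogate pair.
const encoder = new TextEncoder()

function utf16IndexToUtf8Offset(text: string, utf16Index: number): number {
  return encoder.encode(text.slice(0, utf16Index)).length
}

const sampleText = 'héllo 👋 @bob.com'
const mention = '@bob.com'

// Native string index of the mention (UTF-16 code units): 9
const mentionStart = sampleText.indexOf(mention)

// The same positions expressed as UTF-8 byte offsets: 12 and 20
const byteStart = utf16IndexToUtf8Offset(sampleText, mentionStart)
const byteEnd = utf16IndexToUtf8Offset(sampleText, mentionStart + mention.length)
```

The two index spaces diverge as soon as the text contains anything outside ASCII, which is why hand-computed offsets tend to be wrong.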

Proposal

Introduce a new core lexicon type called richtext.

These richtext values are essentially the parsed form of the facets model:

[
  {text: 'Hello '},
  {text: 'world! ', facets: [{$type: 'com.atproto.richtext.bold'}]},
  {text: 'This is an '},
  {text: 'example link', facets: [{$type: 'com.atproto.richtext.link', uri: 'https://example.com/'}]},
  {text: ' and here is a '},
  {text: 'custom facet', facets: [{$type: 'com.example.spicytext', spice: 100}]}
]

This can be defined as follows:

type RichText = RichTextSpan[]

interface RichTextSpan {
  text: string
  facets?: RichTextFacet[]
}

interface RichTextFacet {
  $type: string
  [key: string]: any
}

richtext will be a concrete Lexicon type. It will support the following attributes in the Lexicon:

maxLength (integer, optional): maximum length of value, in UTF-8 bytes
minLength (integer, optional): minimum length of value, in UTF-8 bytes
maxGraphemes (integer, optional): maximum length of value, counted as Unicode Grapheme Clusters
minGraphemes (integer, optional): minimum length of value, counted as Unicode Grapheme Clusters
default (string, optional): a default value for this field
supportedFacets (string[], optional): a list of facets that are expected to be supported. More of a hint (akin to knownValues). Gives schema authors a way to constrain what’s expected; for instance, it can signal that Bluesky posts do support link and mention but don't support bold and italics.

The length of a richtext will be calculated by concatenating the ‘text’ values of the objects and then measuring the resulting string.
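
As a rough sketch of that length rule (not a reference implementation), the check could look like the following. The function names `richTextToPlain`, `richTextByteLength` and `withinMaxLength` are assumptions made for this example; the types mirror the definition above.

```typescript
interface RichTextFacet {
  $type: string
  [key: string]: unknown
}

interface RichTextSpan {
  text: string
  facets?: RichTextFacet[]
}

// Concatenate the `text` of every span to recover the plaintext.
function richTextToPlain(spans: RichTextSpan[]): string {
  return spans.map((span) => span.text).join('')
}

// Length in UTF-8 bytes, which is what maxLength/minLength would measure.
function richTextByteLength(spans: RichTextSpan[]): number {
  return new TextEncoder().encode(richTextToPlain(spans)).length
}

// Hypothetical validator for the maxLength attribute.
function withinMaxLength(spans: RichTextSpan[], maxLength: number): boolean {
  return richTextByteLength(spans) <= maxLength
}

const greeting: RichTextSpan[] = [
  { text: 'Hello ' },
  { text: 'wörld', facets: [{ $type: 'com.atproto.richtext.bold' }] },
]

const greetingPlain = richTextToPlain(greeting) // 'Hello wörld'
const greetingBytes = richTextByteLength(greeting) // 12 ('ö' takes two bytes)
```

Facets never contribute to the measured length here; only the `text` values do. Checking maxGraphemes/minGraphemes would additionally need a grapheme segmenter (for example `Intl.Segmenter` where the runtime provides it).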

The richtext string may only be interpreted as a single “block” comprised of “inline spans”. This means that newlines may be observed, but it is not possible to introduce facets such as “unordered lists” or “quotes” as those are block-level elements. (Block-level elements may be introduced using other lexicon core-types or in userland.)

Etc

It’s tempting to allow strings to be mixed into the array of objects, but that kind of irregularity is difficult for more strictly-typed languages like Go.
It’s also tempting to define toplevel facet types like “bold” as direct tokens instead of using NSIDs. My thinking is that we don’t have a great framework for that at the moment, but I'd like to explore it.

Concerns

It's a substantial change that will require new record types and various migrations. Is it enough of an improvement to merit the work?
Reading post plaintext is slightly more complicated now (it requires a string concatenation). This isn't hard; it's just slightly less convenient than dumping the .text.

tom-sherman · 2024-09-25T18:00:46Z

tom-sherman
Sep 25, 2024

Block-level elements may be introduced using other lexicon core-types or in userland.

Any ideas what this would look like?

9 replies

pfrazee Sep 25, 2024
Maintainer Author

First to be clear: the scope of this proposal is specifically span-level richtext.

At this stage I don't know if we'd want/need the block-level richtext to become a part of lexicon. The only concrete reason something needs to become a lexicon core-type is if lexicon needs to be able to express constraints on it, like a length limit.

pfrazee Sep 25, 2024
Maintainer Author

So to answer your question

ah so we'd be back to using an untyped field and not the richtext type? Or are you saying that example is valid richtext just not renderable in this current proposal?

My example presupposes that the block structure is getting defined by userland lexicon definitions.

tom-sherman Sep 25, 2024

Oh yeah sorry, I'm being slow!

I can define the block level stuff with the current lexicon types and use richtext for the content fields.

pfrazee Sep 25, 2024
Maintainer Author

Yep that's the idea

tom-sherman Sep 26, 2024

I guess the only issue with that is that we'd have to set two constraints: the max number of blocks, and then the max length of the richtext within each block.

Slightly more complex but not so bad.

ChicagoDave · 2024-09-25T18:01:08Z

ChicagoDave
Sep 25, 2024

This is a perfectly sound approach, but why not use html/css or something similar to how XAML works?

2 replies

MikeBeas Sep 25, 2024

Nobody wants to parse that, and HTML and CSS aren't the be-all end-all of formatting options. There may be some sort of facet you want to apply that can't be accomplished with HTML or CSS.

(Mastodon/ActivityPub does this IIRC, and I've seen it go very silly when apps rendered the HTML directly.)

pfrazee Sep 25, 2024
Maintainer Author

See https://www.pfrazee.com/blog/why-facets for some general observations about this question

ungoldman · 2024-09-25T18:08:22Z

ungoldman
Sep 25, 2024

my only recommendation would be to keep it terse and easily parsed, annotated richtext can bloat quickly, especially if structured prose is a possible extension down the line

0 replies

brianolson · 2024-09-25T18:13:12Z

brianolson
Sep 25, 2024
Collaborator

I think the weird thing is defining a new rich text format rather than using something existing like HTML+CSS, markdown, or even ye olde microsoft RTF.

1 reply

pfrazee Sep 25, 2024
Maintainer Author

Check out https://www.pfrazee.com/blog/why-facets

MikeBeas · 2024-09-25T18:16:50Z

MikeBeas
Sep 25, 2024

Very happy to see this proposal. I think there probably needs to be an optional plaintext fallback for situations where facets that can impact actual functionality (vs just formatting) are unsupported. Say a richtext object comes into my application with a link-type facet. Maybe it's a Bluesky post with a richtext link to a blog post: "Check out this link" where "this link" is a link to the URL.

If my client doesn't recognize the link type (hasn't been updated to support it yet, or whatever), it would be nice to be able to have a very safe plaintext fallback that would always work.

So in the above example, your Bluesky app might publish:

[
  {text: 'Check out '},
  {text: 'this link', facets: [{$type: 'com.atproto.richtext.link', uri: 'https://github.com/', fallback: 'this link (https://github.com)' }]}
]

And instead of "Check out this link" (where "this link" is plaintext and not linked because my client doesn't support links yet), it would be rendered as "Check out this link (https://github.com)".

This isn't really applicable to all facet types. Bold or italics will (usually, excepting things like implied sarcasm italics) read the same way regardless of whether or not my app supports those facet types.

But by providing a plaintext fallback option inside RichTextFacet, apps writing these facets can provide a fallback value where appropriate, and clients reading the values can use logic like:

if (!knownFacetTypes.includes(facet.$type)) {
  return facet.fallback ?? block.text;
}

So the type would be:

interface RichTextFacet {
  $type: string
  fallback?: string
  [key: string]: any
}
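
Putting those pieces together, a minimal sketch of the fallback logic might look like this. The function names and the `knownTypes` parameter are assumptions for illustration; the `fallback` field is the one suggested in this comment, not part of the original proposal.

```typescript
interface RichTextFacet {
  $type: string
  fallback?: string
  [key: string]: unknown
}

interface RichTextSpan {
  text: string
  facets?: RichTextFacet[]
}

// Plaintext for one span: if the span carries a facet this client does not
// recognize and that facet provides a fallback, use the fallback instead.
function spanToPlainText(span: RichTextSpan, knownTypes: string[]): string {
  for (const facet of span.facets ?? []) {
    if (!knownTypes.includes(facet.$type) && facet.fallback !== undefined) {
      return facet.fallback
    }
  }
  return span.text
}

function renderPlainText(spans: RichTextSpan[], knownTypes: string[]): string {
  return spans.map((span) => spanToPlainText(span, knownTypes)).join('')
}

const linkPost: RichTextSpan[] = [
  { text: 'Check out ' },
  {
    text: 'this link',
    facets: [
      {
        $type: 'com.atproto.richtext.link',
        uri: 'https://github.com/',
        fallback: 'this link (https://github.com)',
      },
    ],
  },
]

// A client that does not know the link facet shows the fallback text.
const withoutLinkSupport = renderPlainText(linkPost, [])
// A client that knows the link facet keeps the original span text.
const withLinkSupport = renderPlainText(linkPost, ['com.atproto.richtext.link'])
```

One design question this sketch leaves open is what to do when several unknown facets on the same span each carry a different fallback; here the first one wins.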

7 replies

pfrazee Sep 25, 2024
Maintainer Author

This is worth exploring. It might be useful to enumerate some scenarios. Of course the current fallback would just be to ignore the facet, and the question is just whether that leaves some challenges.

stuartpb Sep 25, 2024

Yeah, I was about to say, even without explicit fallback values, this still risks becoming the de-facto semantics of the text property, with clients using properties from the facet(s) to render the text's "actual" value.

(Example of a horrible misuse I could still realistically see happening due to real-world dev-team dysfunction: a client wants to implement something like "shiny mentions", but their frontend team keeps changing what they tell backend the presentation is going to be, so every mention just has a text value of "[mention]", and the username/handle/DID are all in the facet values, so all the posts made from that client just read "Thank you to [mention] for the help!" on Bluesky.)

str4d Sep 25, 2024

There are three options wrt fallbacks:

| Approach | Pro | Con |
| --- | --- | --- |
| Ignore unknown facets. | Easy, current approach. | Existence of the facet is invisible to users. |
| Apply a provided fallback. | Doesn't impact presentation for users whose clients support it. | Existence of the facet (and its use of the fallback) is invisible to users whose clients don't support it. |
| Add a tooltip / hover / replacement symbol after the facet. | Doesn't impact presentation for users whose clients support it. Very clear to users that there is content here their client can't render. | Annoying UI work. |

MikeBeas Sep 25, 2024

I think there are a few issues in that table.

The pro for ignoring unknown facets says that this is the current approach, but that isn't technically true. There is no "current" approach because these rich text facets don't exist. No app is rendering these. In order to support this new primitive, apps would need to do some work regardless of whether or not a fallback is going to be included in the facet. Otherwise they're not going to render the message at all.

Because there will be work needed to display messages using this primitive, it would be relatively easy for developers to implement the fallback value while implementing support for this primitive. For that reason, I'm not sure we would actually ever run into a situation where a client doesn't support the fallback (the con on row 2). Developers implementing this primitive would know they should handle the fallback value, too.

I think option 3 (tooltip, etc) would be client-specific and something developers could add if they want to make it clear something is missing/modified. That could work with or without the fallback option. They might show the fallback with an icon indicating this is not the "actual" value here but a fallback that was inserted due to an unsupported facet.

I think the use of such an indicator doesn't preclude the inclusion of a safe fallback value, but would enhance it. It actually solves for the "facet is invisible to users whose clients don't support it" con on the other rows. If I was building a client, I would definitely consider adding a little ! icon to posts where a fallback had been used, so that users could see the full post payload or something that might give them more context, if they wanted.

str4d Sep 25, 2024

The pro for ignoring unknown facets says that this is the current approach, but that isn't technically true.

Behaviorally, it is true. This is how the current post Lexicon works with unknown facet types. While yes a move to a new base Lexicon is an opportunity to change the default, it doesn't make it false that the way clients treat unknown facets in their UI is to ignore them and not tell their user that there is an unsupported facet.

For that reason, I'm not sure we would actually ever run into a situation where a client doesn't support the fallback (the con on row 2).

That's not what I wrote. It's not that the fallback is unsupported, but that the facet is unsupported. In the sentence:

Existence of the facet (and its use of the fallback) is invisible to users whose clients don't support it.

the "it" in both places refers to "the facet", not "the fallback".

ghost · 2024-09-25T18:19:36Z

ghost
Sep 25, 2024

I'm not sure what it all means, but if it makes it easier to increase character length in the future I'm all for it 🙏🏻

1 reply

pfrazee Sep 25, 2024
Maintainer Author

Heh. It won't have any bearing on that, unfortunately. That's still just a matter of social consensus.

Signez · 2024-09-25T18:20:55Z

Signez
Sep 25, 2024

It's a substantial change that will require new record types and various migrations. Is it enough of an improvement to merit the work?

Frankly, on reflection, I'm not sure. Even though this proposal seems to make it easier to write payloads by hand, it makes messages much less readable at a glance. I would argue that it is far more common for some posts to be "read by hand" (in logs, console, firehose, etc.) than "written by hand".

This kind of array of objects would show up like that in Chrome's console (and it's the same in Firefox):

…way less readable than a simple string property.

I have to concede though that it would solve the ambiguity in overlapping slices, but is it a recurring problem in practice?

So yeah, not convinced it's a net positive 😅

6 replies

Signez Sep 25, 2024

Oh, once you actually work with the payload, those concerns disappear, for sure. But when you are quickly looking for a string in logs, reading stuff in the Network tab of your browser console, or quickly grepping around for a string? It may be shattered in those text span objects.

Again, not the worst thing, and I am sure other APIs have chosen to do so without a problem. But I see the appeal of a simple UTF-8 string, with rich facets attached to it "out-of-band" (well, it's more "out-of-string" haha).

MikeBeas Sep 25, 2024

I'm not super opposed to including the full message as a plaintext value alongside the chunked-up message, but I worry you'll end up in situations where malicious actors put different values in the full message vs the chunks. That can lead to problems with validation, or result in completely different messages rendered in different clients, which makes moderation harder (please forgive the insult below haha):

{
  message: "Signez is a loser!",
  chunks: [
    {type: "text", text: "Signez is cool!"}
  ]
}

Different apps and services might use the message to validate length or other aspects of the message, while others would use the chunks. Some apps that don't recognize chunked facets would render an insult, while others that recognize the chunks would render a compliment. This complicates moderation, too, since moderation tools may need to understand the facets in order to properly display the offending message.

So even though I'm not super opposed, I am a little opposed haha.

Signez Sep 25, 2024

(No offence taken - I am sure that somewhere deep inside me there are two parts that would parse this payload in a way that would please it.)

Yeah, I don't think that providing both the string version and the spans version is a good idea; there is the maliciously inconsistent payload problem you pointed out, but there is also the massive duplication that would occur.

That's why the current way was good enough in my opinion: the utf8-slice version is not perfect by any means, but it is readable at a glance and can provide all the features of the span-objects version; it is just a little bit harder to write.

It's basically a mirror of your first argument tho: once written or provided through a library, it's not really a problem any more. Should the Bluesky team prefer making payloads more easily readable, or more easily hand-craftable? (I suspect it's the former, but honestly, it's unclear!)

MikeBeas Sep 25, 2024

I think the payloads are mostly read by machines, so making them easy to create would be my personal preference. It’s not even necessarily a matter of crafting them by hand, even an SDK (writing or rendering) in a language with wonky utf8 support could have difficulty with the code points approach. I think solving those problems at the expense of human-readability for API payloads is a worthwhile trade.

ericvolp12 Sep 26, 2024
Collaborator

This is basically where I sit on this discussion. I like having post text as an easily indexed field that you don't need to write code to piece together and is at the top level of the post record. It makes a lot of automations and scripting more convenient, makes it easier to see what's going on in logs/traces, and makes reading the firehose a lot more approachable. I don't mind if we have plaintext as a fallback for rich text but I think that introduces trust issues that are hard to reconcile (i.e. if the richtext version of your post marshalls into different textual content than the plaintext version). I think I'm on the side of "it's not quite enough of an improvement to be worth the hassle" personally.

ungoldman · 2024-09-25T18:26:52Z

ungoldman
Sep 25, 2024

an alternative approach just to throw something at the wall, adding an annotations type that could be layered on top of plaintext, using positional spans to denote where they start and end, e.g.

{
  text: 'xyz abc',
  annotations: [
    { type: 'bold', span: [0,2] },
    { type: 'comment', span: [4,6], text: 'foo' }
  ]
}

5 replies

brianolson Sep 25, 2024
Collaborator

this is differently tedious but worth considering (e.g. in some languages strings default to byte indexed even if they have UTF-8 inside and you have to do extra work to glyph-index them)

MikeBeas Sep 25, 2024

This is basically how the current rich text system works. The problems created by this system are around how to count the span indices (consider an emoji which is made up of several different characters smushed together in a ligature). Those are the problems the new proposal is trying to solve by not using offsets.

ungoldman Sep 25, 2024

OK I see it now 👀

{
  text: "Hello @bob.com",
  facets: [
    {feature: "mention", index: {start: 6, end: 14}}
  ]
}
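
For comparison, here is a rough sketch of how a text-plus-offsets value like the one above could be converted into the span array from the proposal. It uses the simplified `feature`/`index` shape from this example rather than the real Bluesky facet schema, treats `start`/`end` as UTF-8 byte offsets (an assumption), and simply skips overlapping or out-of-range facets, since overlap behavior is exactly the part that is not well defined.

```typescript
interface OffsetFacet {
  feature: string
  index: { start: number; end: number } // assumed to be UTF-8 byte offsets
}

interface Span {
  text: string
  facets?: { $type: string }[]
}

// Hypothetical converter: split the text at facet boundaries and emit spans.
function offsetsToSpans(text: string, facets: OffsetFacet[]): Span[] {
  const bytes = new TextEncoder().encode(text)
  const decoder = new TextDecoder()
  const sorted = [...facets].sort((a, b) => a.index.start - b.index.start)
  const spans: Span[] = []
  let cursor = 0

  for (const facet of sorted) {
    const { start, end } = facet.index
    // Skip facets that overlap an earlier one or fall outside the text.
    if (start < cursor || end > bytes.length || end <= start) continue
    if (start > cursor) {
      spans.push({ text: decoder.decode(bytes.subarray(cursor, start)) })
    }
    spans.push({
      text: decoder.decode(bytes.subarray(start, end)),
      // The feature name stands in for an NSID in this sketch.
      facets: [{ $type: facet.feature }],
    })
    cursor = end
  }

  if (cursor < bytes.length) {
    spans.push({ text: decoder.decode(bytes.subarray(cursor)) })
  }
  return spans
}

const converted = offsetsToSpans('Hello @bob.com', [
  { feature: 'mention', index: { start: 6, end: 14 } },
])
// converted: [{text: 'Hello '}, {text: '@bob.com', facets: [{$type: 'mention'}]}]
```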

https://www.pfrazee.com/blog/why-facets

dead-claudia Dec 7, 2024

The problems created by this system are around how to count the span indices (consider an emoji which is made up of several different characters smushed together in a ligature).

HTML (indirectly per Unicode) requires connecting ligatures across simple text formatting boundaries, including text color, when a zero-width joiner is present, and this is well-documented and has been the case for well over a decade. Implementations have long been inconsistent, with Safari violating the Unicode spec, Chrome violating both Unicode and the HTML spec, and Firefox using a pretty blatant kludge to get it almost right. (Here's one of the relevant web platform tests for you to run yourself.) So, in theory, it's doable. Just expect a lot of bugs around edge cases because text is way harder than it sounds like it should be. (This isn't even the worst problem - mixing English and Arabic very famously results in renderer and text selection bugs all the time, despite being flat out unavoidable in anything potentially dealing with multilingual content.)

There are two other concerns, though: surrogate pairs and multi-character grapheme clusters (flag emojis being a very common type of grapheme cluster). These could be addressed by simply clamping offsets to the closest grapheme cluster boundary to that offset that isn't after it. And this is very straightforward with the necessary data tables: in JS, it's just str.slice(start, end).search(/\p{Grapheme_Cluster_Break}|$/u) + start.

dead-claudia Dec 7, 2024

Oh, I forgot to mention: this is precisely how Android represents its text styling natively. iOS does not, but it does allow you to do similar by setting the selected range and then just doing the related operations. And many of the operations can just accept ranges directly.

gildaswise · 2024-09-25T18:36:22Z

gildaswise
Sep 25, 2024

Is a plainText (or just the existing text) field as it is today not an option? That way the migration would be essentially a new richText field that's the array described above, and it wouldn't break previous behaviors. It would be costly as each post record would have double the data, but losing json readability is a huge minus.

1 reply

stuartpb Sep 25, 2024

Yeah, this is something I was about to propose: I don't know if "this plainText value must strictly equal the value of every text value concatenated" would be a reasonable (or even possible) validation constraint, but that'd definitely be part of what I'd want if a redundant field like this were implemented.
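
A minimal sketch of that consistency rule, assuming a record shape with a redundant `plainText` field next to the spans (both the field name and the function name are hypothetical):

```typescript
interface TextSpan {
  text: string
}

// Hypothetical validation: the redundant plaintext must be exactly the
// concatenation of every span's text, compared code unit by code unit.
function plainTextMatchesSpans(plainText: string, spans: TextSpan[]): boolean {
  return plainText === spans.map((span) => span.text).join('')
}

const consistent = plainTextMatchesSpans('Hello world!', [
  { text: 'Hello ' },
  { text: 'world!' },
])
const inconsistent = plainTextMatchesSpans('Hello world!', [
  { text: 'Goodbye ' },
  { text: 'world!' },
])
```

The check itself is cheap, but it would have to run everywhere records are validated, and it does nothing about the duplicated data.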

CuriouslyCory · 2024-09-25T18:37:54Z

CuriouslyCory
Sep 25, 2024

Would markdown be able to solve this problem without inventing something new? There's already markdown libraries in pretty much every major language and well defined syntax.

@pfrazee has identified some challenges outlined here https://www.pfrazee.com/blog/why-facets

My thoughts
1. Syntax barfing
You're going to have this problem with rich text too. Whatever "default" behavior you're going to have for unknown syntax you can have for markdown too. I'd suggest a standardized way of extending markdown to make it easy for parsers to pick up and set to default. Instead of having wildly different markers like {color=hex}text{/color} vs ||spoiler|| we can suggest that custom types should always follow a {tag}something{/tag}. Could even be something like {type=spicytext}This is spicy{/type} similar to the richtext types.

2. Parsers suck
I don't disagree, but you're going to author and maintain new ones for richtext, or you could contribute to the markdown libs. I would argue that engaging with existing projects means you have a head-start and existing maintainers. You sacrifice a level of control, but you get a ready to go parser in most languages and you can offer defaults or maintain a style guide where you have the suggested styles for each type. A remark plugin could be maintained for react-markdown for example.

3. Character counting
Probably not a big deal

My thoughts as of now are...
Pro Markdown:

Existing library ecosystem
Existing library maintainers
No need to invent new syntax
Smaller payloads, **bold** vs {text: 'world! ', facets: [{$type: 'com.atproto.richtext.bold'}]},

Pro richtext:

Clear easy to read syntax
High level of control over libs and sw ecosystem
Get to invent a new syntax
Don't need to worry about raw html injection

4 replies

tom-sherman Sep 25, 2024

You can't just skim over character counting like that, it's a massive deal breaker.

Counting characters needs to be a cheap operation to do at scale in order to uphold the lexicon constraints. Parsing markdown to then iterate the nodes feels like a big performance regression and a lot of work! A piece of infrastructure that before only had to do a length check (and with this proposal is just a concatenation beforehand) now has to include an entire markdown parser.

CuriouslyCory Sep 25, 2024

I could just be thinking from a user perspective where I'm only character counting on writing posts. Is there character counting outside of write operations? The complexity isn't a big deal if it's only happening client side during onChange events and once more server side to verify.

str4d Sep 25, 2024

It's not just "server-side" as in "your PDS", it's "server-side" as in "every single server that validates data from PDSs against the corresponding Lexicons". That's the relays (which are doing this against every single post live), AppViews, feeds, labelers, etc.

CuriouslyCory Sep 25, 2024

I appreciate the context. Given the decentralized nature I understand that trust is somewhat limited so I'll probably need to catch up on some of the interactions before I could offer anything better but it does leave one question; would richtext solve this better? The obvious answer seems to be that it's more simple, but there's a lot more data flowing through the pipe with the suggested syntax even if it ultimately ends up being ignored.

ungoldman · 2024-09-25T18:40:47Z

ungoldman
Sep 25, 2024

bring back BBCode

0 replies

qazmlp · 2024-09-25T20:39:30Z

qazmlp
Sep 25, 2024

This would certainly help with developer UX. I think it's important to emphasise that the usual exchange format here would be CBOR rather than JSON - JSON is somewhat worse to parse than HTML, but CBOR supports skipping.

One concern I have, however, is that this lacks support for nesting of fragments. Consider the following HTML (with whitespace for readability, and to be clear I am not suggesting to use HTML above CBOR):

<a href=https://example.com/This-is-a-very-long-link-that-is-common-for-example-when-acticle-titles-are-involved>
  Hello! <i>This</i> is a <b>single</b> <u>link</u> with <b>many <u>nested <i>fragments</i></u></b>.
</a>

With this proposal here, the link URL would be stated… 11 times in this relatively simple example.
The scheme would also require relatively expensive matching and joining of spans for rendering or semantic analysis of the content.
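
To illustrate the kind of joining a renderer would need, here is a minimal sketch that groups consecutive spans carrying the same link URI, so that one anchor element can wrap the whole run. The names `linkUri` and `groupByLink` are made up for this example, and matching on the `uri` field alone is an assumption about how link facets would be compared.

```typescript
interface Facet {
  $type: string
  uri?: string
  [key: string]: unknown
}

interface Span {
  text: string
  facets?: Facet[]
}

interface LinkGroup {
  uri: string | null // null for runs of spans that are not linked
  spans: Span[]
}

const LINK_TYPE = 'com.atproto.richtext.link'

// URI of the span's link facet, or null if it has none.
function linkUri(span: Span): string | null {
  const link = (span.facets ?? []).find((facet) => facet.$type === LINK_TYPE)
  return link && typeof link.uri === 'string' ? link.uri : null
}

// Group adjacent spans that share the same link URI.
function groupByLink(spans: Span[]): LinkGroup[] {
  const groups: LinkGroup[] = []
  for (const span of spans) {
    const uri = linkUri(span)
    const last = groups[groups.length - 1]
    if (last && last.uri === uri) {
      last.spans.push(span)
    } else {
      groups.push({ uri, spans: [span] })
    }
  }
  return groups
}

const link: Facet = { $type: LINK_TYPE, uri: 'https://example.com/' }
const grouped = groupByLink([
  { text: 'Hello! ', facets: [link] },
  { text: 'This', facets: [link, { $type: 'com.atproto.richtext.bold' }] },
  { text: ' is not linked.' },
])
// grouped has two entries: one link run with two spans, one unlinked run.
```

Even in this small sketch the URI string is compared once per span, and a real renderer would have to repeat the exercise for every facet type that is meant to wrap several spans.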

You have to deal with the exact same ambiguities whether you encounter overlapping or partially-matching-adjacent spans. You cannot just render them out split even for formatting because that messes up the accessibility tree. (Some or even most screen reader software may be able to mitigate that – I don't have experience in this regard – but I would not expect such a mitigation to be universal.)

I get that Go specifically has pretty bad support for alternatives, but I'm not sure that avoiding that is worth the increased complexity in rendering. But then again, if the representation is always not readily usable, that would encourage good renderers which would mitigate the problem of occasional bad editors creating unnecessarily split fragments or fragments with semantically counterintuitive nesting 🤔💭

As a data point, the ActivityPub network is using a nested representation without much issue, but simultaneously generally doesn't have WYSIWYG editors (which I suspect would be expected by a more general audience).

6 replies

qazmlp Sep 26, 2024

One concern I have, however, is that this lacks support for nesting of fragments.

I don't think it lacks that support? #2830 (reply in thread) suggests otherwise (giving an example equivalent to <ol>).

[see below]

I get that Go specifically has pretty bad support for alternatives, but I'm not sure that avoiding that is worth the increased complexity in rendering.

I believe the Go point was about mixing Data Model types, not Lexicon types:

It’s tempting to allow strings to be mixed into the array of objects, but that kind of irregularity is difficult for more strictly-typed languages like Go.

Hm… fair. It's likely easier to parse objects with alternative optional fields using an off-the-shelf parser than full alternatives (unless Go has something equivalent to Serde).

That does make for a worse parsing experience in something like Rust, though.

~~AFAICT as long as the richtext contains an array of just objects, those objects can have types that imply nesting.~~ I misread; the intention is that richtext would be at the leaf level of a nested hierarchy, akin to a block element containing <span>s. The Go point still AFAICT doesn't prohibit nesting (via another type), it just prohibits the equivalent of:
<div>
    <span>Some spanned text</span>
    Some un-spanned text.
    <span>Some more spanned text</span>
</div>

That's irrelevant regarding the issue of fragment duplication that I mentioned.

If anything it makes it a little worse, since e.g. links are often expected to encompass multiple block elements. If an app introduces links-as-containers to get around this difficulty, there are suddenly two incompatible ways of inserting a link into the post.

tom-sherman Sep 26, 2024

I don't think it lacks that support? #2830 (reply in thread) suggests otherwise (giving an example equivalent to <ol>).

That comment thread is thinking forward to block rich text which is outside the scope of this specific proposal

tom-sherman Sep 26, 2024

The scheme would also require relatively expensive matching and joining of spans for rendering or semantic analysis of the content.

I wouldn't describe it as expensive, but I would describe it as complex and something that needs specification.

I think this is a critical hit for this proposal tbh, as it basically re-introduces a problem with a similar level of complexity to the one overlapping facet ranges have.

I'm not sure that span merging is somehow less complex, easier to specify, or easier to implement than simply solving the problems around offset overlaps.

qazmlp Sep 26, 2024

Not expensive in absolute terms, yes, but compared to resolving overlaps it seems less optimisable to me.
You'd have to do some potentially long value comparisons in many cases.

Though you're right, it may not matter that much even at scale, and that the spans are ordered would help.
The real problem is facet priority (which in my eyes is something that should be controlled by the presentation layer, so outside of ATProto. It would still be great to have guidelines for it and what to do if e.g. two link facets are nested/overlap/span-identical).

pfrazee Oct 1, 2024
Maintainer Author

@qazmlp you're correct w/all observations here. If we continue to use facet slice-indices, we'll need to start speccing expected behavior for overlaps, which is perhaps tedious but not a real issue. The new proposal, as it stands, includes no facilities for wrapping or overlapping spans, and that might represent a critical blocker for some usecases (such as links).

TimBurga · 2024-09-25T20:43:29Z

TimBurga
Sep 25, 2024

Well-written summary. I'll play devil's advocate here: don't fix what ain't broke. The first three established problems seem like they could be addressed with better tooling. The 4th problem is where the focus belongs.

3 replies

qazmlp Sep 25, 2024

Yes, and the fourth appears to not be addressed by this proposal.

pfrazee Oct 1, 2024
Maintainer Author

The fourth issue is addressed by the proposal, but it does so by eliminating overlaps, which @qazmlp you correctly identified as a potential issue in another reply chain here.

qazmlp Oct 3, 2024

@pfrazee Sorry for the unclear wording. That was meant as a "doesn't really help with the fundamental issue".

(I'm not a native English speaker, so this may be a case of me getting the meaning of a word subtly wrong.)

matthieusieben · 2024-09-25T21:39:04Z

matthieusieben
Sep 25, 2024
Collaborator

This seems to be more of a micro-blogging concern than an Atproto one, so I would rather see these facets defined in the app.bsky namespace. WDYT?

10 replies

Hoid Sep 26, 2024

@MikeBeas Including this new definition as an atproto primitive graces it as the "default" way of doing things, which apps will naturally build on top of. We should make sure all atproto lexicons are fundamentally applicable to most anything that can be built on the protocol imo, so if this proposal ends up only being used by Bluesky I think that would be a problem.

If this lexicon does end up being developed, maybe it should first be created under an app.bsky.proposed namespace, so that others could play around with it before the Bluesky peeps decide it definitely belongs in the core lexicons. 🤷

MikeBeas Sep 26, 2024

I very much doubt that rich text would end up being used only by one app in the network. There are already apps in the network that could also benefit from this, like WhiteWind.

Hoid Sep 26, 2024

But would they want to implement richtext in the same way? Would WhiteWind look at the richtext lexicon in com.atproto and say "nah, I'm gonna write my own because I don't like the fact that when debugging I can't easily see the full string" or for some other reason? If enough people do then it's not a good primitive.

MikeBeas Sep 26, 2024

The debugging concern is so minor that I'm not sure why it keeps coming up. If it's that big of a concern, write one function to assemble the message and you're done. Problem solved.
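A minimal sketch of such a function, assuming the span shape from the proposal (objects with a `text` string and optional `facets`). The names `RichTextSpan` and `toPlainText` are made up for illustration and are not part of any existing library:

```typescript
// Hypothetical span shape, modelled on the proposal's {text, facets} objects.
interface RichTextSpan {
  text: string;
  facets?: { $type: string; [key: string]: unknown }[];
}

// Concatenate the text of every span to recover the plain message.
function toPlainText(spans: RichTextSpan[]): string {
  let out = "";
  for (let i = 0; i < spans.length; i++) {
    out += spans[i].text;
  }
  return out;
}
```

Facets are ignored entirely here, so the result is only the readable string, which is all the debugging case needs.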

If WhiteWind wants to implement their own rich text they can, that's up to them. But there are plenty of developers out there who aren't interested in reinventing the wheel and would be happy to have a rich text primitive they could rely on.

It's a worthwhile addition. If you don't like the primitive you're free not to use it in your app.

pfrazee Oct 1, 2024
Maintainer Author

There is one reason why this proposal elevates richtext into lexicon core types: if the pattern in this proposal were done in userland, it would not be possible to express some varieties of constraints on the text content (eg length).

mary-ext · 2024-09-25T22:41:05Z

mary-ext
Sep 25, 2024

I've only read through the proposal itself and not the comments, but given that this proposal was posted right after a question about block-level markup like paragraphs I'm not exactly sure if this would serve that use case.

Sure, it's possible to introduce block-level markup by wrapping said richtext in a block interface, however that negates the one big point of this proposal: Having proper character limit restrictions.

I'm wary that this would encourage trying to make said block-level markup work in a sort of hacky way when it's really not supposed to, now that it's a "core part" of the protocol and all.

I'd rather have everyone experiment with their own ideas of a rich text before implementing this as a primitive type.

3 replies

mary-ext Sep 26, 2024

Now, it's not that this can't be made to work, there are two ways we can go about this:

Introduce a richtext-block primitive in conjunction to richtext, or,
Alter richtext to always have blocks, and introduce minBlocks/maxBlocks validations so that microblogs like Bluesky can only have 1 block present in their richtext.

However, the question remains, is this worth it? Rich text is strictly a concern of the services implementing it, not AT Protocol, and at this point in time there doesn't seem to be any benefits to making this a primitive, short of the character limit validation, especially when right now everyone is unsure about rich text formatting.

tom-sherman Sep 26, 2024

See #2830 (comment)

pfrazee Oct 1, 2024
Maintainer Author

If there's no clear consensus on this (which seems likely) then we'll continue to let the ecosystem experiment with it.

mackuba · 2024-09-26T00:58:37Z

mackuba
Sep 26, 2024

mfw:

0 replies

stuartpb · 2024-09-26T08:28:35Z

stuartpb
Sep 26, 2024

One important constraint that just came to mind reading this comment is that (unless there is only one element with no facets) the empty string must not be a valid text value, meaning that every element contributes toward maxLength constraints (thus making that a de-facto constraint on the maximum number of elements, potentially one per character but no more).
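To make that rule concrete, here is a rough sketch of a validator. It assumes the proposal's span shape, and it measures length in UTF-16 code units purely to keep the example short; a real Lexicon constraint would more likely count bytes or graphemes. `SpanLike` and `validateSpans` are hypothetical names:

```typescript
// Hypothetical span shape; facets are left opaque for this check.
interface SpanLike {
  text: string;
  facets?: object[];
}

// Returns an error message, or null when the spans pass both checks.
function validateSpans(spans: SpanLike[], maxLength: number): string | null {
  let total = 0;
  for (let i = 0; i < spans.length; i++) {
    const span = spans[i];
    const hasFacets = span.facets !== undefined && span.facets.length > 0;
    // The only permitted empty string is a single span with no facets.
    const soleUnfaceted = spans.length === 1 && !hasFacets;
    if (span.text.length === 0 && !soleUnfaceted) {
      return "span " + i + " has empty text";
    }
    // Assumption: length is counted in UTF-16 code units for brevity.
    total += span.text.length;
  }
  if (total > maxLength) {
    return "text exceeds maxLength";
  }
  return null;
}
```

Because every accepted span contributes at least one unit, `maxLength` also bounds the number of spans.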

2 replies

stuartpb Sep 26, 2024

I'm wondering if it should also be possible for lexicons to put restrictions on what text is allowed on a per-element basis: in the context of Bluesky, I thought about proposing the constraint that no element consisting purely of whitespace may have attached facets (purely for UX reasons), but on top of the issues around what codepoints Unicode does and does not consider "whitespace", that would also needlessly preclude some possible uses of facets that could be used in a non-Bluesky context (eg. tracking highlighted spans in a collaborative editor).

stuartpb Sep 26, 2024

Yeah, the more I think about this, the more I agree with Jaz's position: even without considering this from a point of view where every change starts at minus 100 points, there are just as many edge cases (if not more) that can encounter some form of breakage or misuse under this proposed facet model (while it may prohibit invalid UTF-8 sequences at the string level, it does nothing about the potential for split graphemes across spans, which would become even more likely to be misprocessed across different implementations), and it comes at the expense of a very useful property (the ability to process a plaintext message without having to concatenate or process facets at all).

stuartpb · 2024-09-26T09:20:07Z

stuartpb
Sep 26, 2024

To build upon the mostly-recap self-reply I just posted a little further with a point I haven't seen raised yet, the loss of the potential to have overlapping spans in this model strikes me as a net negative as well: while, as noted in the OP, the semantics around processing overlaps does need to be more tightly defined (in terms of per-app presentation specifications, as @qazmlp notes), the ability to have them is significantly preferable from an extensibility point of view - a span of text that one schema may represent with one facet (eg. a "vote button", to pull loosely from poll.blue as an abstract example) may also be representable in another application's schema as multiple facets (eg. an app that has discrete facets for "Icon" and "Label", possibly even nesting within an overall "Button" span that may have been defined by an earlier/later version of the schema supported by a different set of clients).

2 replies

stuartpb Sep 26, 2024

Before taking this into consideration, I was ambivalent about this proposal, but now that I think about it, the loss of this interpretation-focused extensibility in the proposed model is enough to take me from +0 to -1 on it: in other words, if the proposal in question were how facets worked currently, and the way it works currently were the new model being proposed, I would be arguing in favor of switching to the way it works now.

stuartpb Sep 27, 2024

To give an even more patent example of why the current model is preferable to this one: picture an app like Google Docs, where there's both rich formatting of text and a mechanism to highlight text for collaborative editing (including highlighting text spans to add annotations, or propose changes). Under the current model, these facets, which cover entirely separate concerns, would be entirely independent: under the new model, the app's best option would essentially amount to introducing a new field that reimplements the exact same facet model we have now.

devinivy · 2024-09-27T04:23:22Z

devinivy
Sep 27, 2024
Maintainer

Lexicon types that don't fit into the transport format are identified in a self-describing way, including complex types such as blobs. So I might expect us to do the same thing here. An example use case would be to find and index richtext in a generic way, similar to how we use this property to identify and process blobs generically.

Following that, is it possible that a richtext field might end-up looking something like this?

{
  "$type": "richtext",
  "value": [
    { "text": "Hello, "},
    { "text": "world!", "facets": [{ "$type": "com.atproto.richtext.bold" }] }
  ]
}

Some examples of other concrete Lexicon types:

// A cid-link in JSON
{ "$link": "bafycid" }

// Bytes in JSON
{ "$bytes": "aGVsbG8gcGF1bA" }

// A blob in JSON
{
  "$type": "blob",
  "cid": { "$link": "bafycid" },
  "mimeType": "image/jpeg",
  "size": 3976
}
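
The generic indexing idea could be sketched roughly as follows. This assumes the hypothetical `{ "$type": "richtext", "value": [...] }` wrapper floated above, which is not an existing Lexicon type, and `findRichText` is an illustrative name only:

```typescript
// Walk an arbitrary JSON value and collect the path of every object that
// identifies itself as richtext via the hypothetical "$type" marker.
function findRichText(value: unknown, path: string, found: string[]): string[] {
  if (Array.isArray(value)) {
    for (let i = 0; i < value.length; i++) {
      findRichText(value[i], path + "[" + i + "]", found);
    }
  } else if (typeof value === "object" && value !== null) {
    const obj = value as { [key: string]: unknown };
    if (obj["$type"] === "richtext") {
      // Self-describing value found; no need to know the record's schema.
      found.push(path);
      return found;
    }
    const keys = Object.keys(obj);
    for (let i = 0; i < keys.length; i++) {
      findRichText(obj[keys[i]], path + "." + keys[i], found);
    }
  }
  return found;
}
```

Called as `findRichText(record, "$", [])`, it returns the paths of all richtext fields without consulting the record's lexicon, which is the same property that lets blobs be processed generically.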

0 replies

yamarten · 2024-09-27T10:03:51Z

yamarten
Sep 27, 2024

I think a safer richtext is meaningful because there have been experiences in the past where overlapping facets have made it impossible to display anything on social-apps.

It is difficult for me to decide whether to make destructive changes to the existing post records, but it is desirable to add a richtext type to the lexicon, regardless of whether or not to abolish the current format.

However, applications that want to use character decoration may avoid richtext because they cannot be nested. This is what is being discussed in another thread. If the nest is inherently difficult, I will take a cautious stance on the introduction.

1 reply

yamarten Sep 27, 2024

In my opinion, if the type of the text field can be specified from the facets definition, the nesting problem can be solved. However, this is probably inappropriate as a type definition for lexicon.

The advantage of this format is that each decoration can choose from a variety of options. You can specify richtext to allow nesting, specify string to prevent strange hacks like @ everyone mention, or specify a string with format to limit it so that only URLs can be decorated.

However, the details of this implementation method can probably be considered after introducing richtext, so in this discussion, it is enough for me to agree with the new richtext if we only know the possibility of future nesting.

usolmz · 2024-09-27T16:07:31Z

usolmz
Sep 27, 2024

1 - rich text editor
2 - To be able to write at least 200 thousand words

→ will be enough for now,

0 replies

bnewbold · 2024-10-03T01:12:19Z

bnewbold
Oct 3, 2024
Maintainer

Chiming in for The Record:

High-level, I think this has some advantages and some disadvantages compared to the current facet system. I don't think there is enough of an improvement to make it worth the change and churn at this point in time. Opportunity costs and dev churn/friction are not insurmountable, and "sooner is better than later", but the downsides are real and I don't think this is worth it. If somebody found a solution which was clearly superior to both, i'd be open to switching.

This introduces what feels to me like pretty application-layer validation and structure as part of Lexicon. I'd rather leave this to application-level schemas, even if it means we don't get Lexicon-layer data validation (aka, validation of overall string length). I don't think we should be totally afraid of application-layer constraints, though Lexicon-level is stronger and should be preferred most of the time.

I think that the easiest thing to do near term to improve dev experience around this is to have small stand-alone parser/helper libraries for parsing strings in to facets; and to add a parsePostText endpoint to the bsky appview which takes a string (and some params?) and returns a parsed/resolved post record (including social card, resolving handle mentions to DIDs, etc). Having those in place would make it easier to migrate to new schemas/formats in the future, because updating the libs/API will cover a good chunk of deployed code.
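
As a rough illustration of what such a helper might do, the sketch below detects URLs in a string and emits link facets with UTF-8 byte offsets in the current `app.bsky.richtext.facet` shape. The function names and the deliberately naive URL regex are assumptions for the example, not an existing library API; mentions, hashtags, and trailing punctuation are not handled:

```typescript
interface LinkFacet {
  index: { byteStart: number; byteEnd: number };
  features: { $type: string; uri: string }[];
}

// UTF-8 byte length of a JS (UTF-16) string, computed by hand.
function utf8Length(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1;
    } else if (c < 0x800) {
      bytes += 2;
    } else if (
      c >= 0xd800 && c <= 0xdbff && i + 1 < s.length &&
      s.charCodeAt(i + 1) >= 0xdc00 && s.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // a surrogate pair encodes one 4-byte code point
      i++;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Naive link detection: any run of non-whitespace starting with http(s)://.
function detectLinks(text: string): LinkFacet[] {
  const facets: LinkFacet[] = [];
  const re = /https?:\/\/\S+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    // Convert the UTF-16 match index into a UTF-8 byte offset.
    const byteStart = utf8Length(text.slice(0, m.index));
    const byteEnd = byteStart + utf8Length(m[0]);
    facets.push({
      index: { byteStart: byteStart, byteEnd: byteEnd },
      features: [{ $type: "app.bsky.richtext.facet#link", uri: m[0] }],
    });
  }
  return facets;
}
```

For the input `"héllo 👋 https://example.com ok"` this yields one facet with `byteStart` 12 and `byteEnd` 31, since `é` takes two bytes and the emoji takes four.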

0 replies

dead-claudia · 2024-12-07T22:11:30Z

dead-claudia
Dec 7, 2024

We ultimately decided to stick with offsets pointing into strings (utf8 codeunits). This has surfaced some downsides:

Problem 1. The utf8 codeunit indexes are difficult for engineers to understand.

Problem 2. The utf8 codeunit indexes are difficult to accomplish in some languages (eg javascript).

I wouldn't say difficult, just tedious.

Problem 3. The slices can’t be authored manually, meaning that constructing richtext by hand is a pain. You typically have to invent a syntax which can be parsed.

Problem 4. There are some poorly-defined behaviors around how to interpret overlapping slices.

I have a suggestion for this: make indices index grapheme cluster boundary positions. It's more complicated, but carries a few upsides:

It's fully language-independent. Neither JS nor Swift nor Rust/C++ will have compatibility issues. It's just a matter of splitting graphemes. This can be hand-coded, but there's an npm module for this in JS, a crate for it in Rust, among probably others.
It avoids the worst edge cases like italics applying to accents and not the base character. This is especially important for Korean, where each grapheme is composed of multiple subcomponents, and styling them is relatively difficult to do consistently.
An ordinary human can look at a string, look at a span number, and know right away where it starts, even if accents are decomposed.

As a concrete interface, I propose this:

interface RichText {
    text: string
    facet: Facet[]
}

interface Facet {
    $type: string
    start: number
    end: number
    // type-specific properties...
}

This also allows for image facets and such, and ensures every facet can be traced to a position correctly.

To split a string into styled spans, you can (stably) sort facets by start and then return an array of {text, facets} where each facet applies to the whole text. This is useful for web. Native mobile can just process the facets directly and in order, since both Android and iOS expose such an API.
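
Here is one way that splitting step could look, as a sketch only. It assumes the text has already been segmented into an array of units (grapheme clusters under this suggestion, but any unit works) so that `start` and `end` index into that array. `FacetRange`, `StyledSpan`, and `splitIntoSpans` are illustrative names:

```typescript
interface FacetRange {
  $type: string;
  start: number;
  end: number;
}

interface StyledSpan {
  text: string;
  facets: FacetRange[];
}

// units: the text pre-split into indexable units (assumed done by the caller).
function splitIntoSpans(units: string[], facets: FacetRange[]): StyledSpan[] {
  const clamp = (n: number) => Math.max(0, Math.min(units.length, n));
  // Sort a copy by start so facets appear in a predictable order per span.
  const ordered = facets.slice().sort((a, b) => a.start - b.start);
  // Every facet edge is a point where the set of active facets may change.
  const cuts: number[] = [0, units.length];
  for (let i = 0; i < ordered.length; i++) {
    cuts.push(clamp(ordered[i].start), clamp(ordered[i].end));
  }
  cuts.sort((a, b) => a - b);
  const spans: StyledSpan[] = [];
  for (let i = 0; i + 1 < cuts.length; i++) {
    const from = cuts[i];
    const to = cuts[i + 1];
    if (from === to) continue; // skip duplicate cut points
    // A facet is active when it covers the whole interval.
    const active = ordered.filter((f) => f.start <= from && f.end >= to);
    spans.push({ text: units.slice(from, to).join(""), facets: active });
  }
  return spans;
}
```

Overlapping facets fall out naturally: a bold facet over 0..5 and a link facet over 3..11 on "Hello world" produce "Hel" (bold), "lo" (bold and link), and " world" (link). Zero-width facets, such as an inline image at a single position, are dropped by this sketch and would need separate handling.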

There is a risk in HTML of unwanted glyph breaking when coloring stuff, but it's fixable with zero-width joiners inside words whenever it'd split a ligature. This is one of the places famously riddled with browser bugs, though, with both Chrome and Safari failing web platform tests for this (and at least Safari rendering incorrect Unicode in it), but the browser bugs are isolated to mid-word changes spanning ligatures and erroneously splitting them.

It's harder, but it's extremely doable. And edge cases are actually no less well-defined than they are with simple Latin letter rendering.* People just aren't fully implementing the relevant specs.

* A few font specs are horribly imprecise with everything, and some font renderers were incredibly sloppy with shaping. For a while, this meant Times New Roman was even rendered noticeably differently by Mac vs by Windows, and it's precisely why SVG fonts were pushed so hard by Mozilla.

3 replies

DavidBuchanan314 Dec 8, 2024

I also thought about grapheme cluster boundaries, but the clustering rules can change from one version of the unicode standard to the next (and even by locale #3079 ). I don't think the complexity is worth it.

(also, have you read the grapheme clustering rules? They're well-defined but very difficult to actually understand, in my opinion. Not something I'd ever want to have to implement myself)

dead-claudia Dec 8, 2024

@DavidBuchanan314 I read all of the relevant part, too. In another comment in this discussion, I mentioned it's (very) tedious to implement - it's why I recommended use of pre-made libraries. And of course Graphemer could also be used.

As for locale, it may seem correct to use something like exemplar-based grapheme cluster boundaries, especially since Bluesky posts commonly have a language tag. Problem is there's a lot of mixed content out there (ICU/etc can't cope with that), and as Unicode's own specification points out, grapheme cluster breaks for some Indic scripts "may need to be script-, language-, font-, or context-specific to be useful." So for this, it really makes more sense to use the default extended grapheme cluster boundary specification.

You do have a point about stability, though. But at the same time, the rules haven't really changed much in over 20 years. They did add some changes years ago for emoji, but those have long since been removed. And the sparse commit history of this shows how little it's changed.

dead-claudia Dec 8, 2024

Chiming in to add on that I'm on board with using code point indices instead. I just wanted to be bold first to make sure grapheme clusters were fully thought out.

Obviously, code points are easy to count:

If the string was decoded to UTF-8 (ex: Rust), boundaries are bytes whose high 2 bits aren't 10 (as in, they aren't continuation bytes, formula: (ubyte >> 6) != 2).
If the string was decoded to UTF-16 (ex: JS, Java, sometimes C/C++), boundaries are words whose high 6 bits aren't 110111 (as in, they aren't low surrogates, formula: (uword >> 10) != 0x37).
If the string was decoded to UTF-32 (I've seen a few do this), boundaries are just array element boundaries.
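
A small sketch of the counting rules above, for the UTF-8 and UTF-16 cases. The function names are illustrative, and the UTF-8 variant takes a plain array of byte values so the example has no runtime dependencies. A unit starts a code point when it is neither a UTF-8 continuation byte nor a UTF-16 low (trailing) surrogate:

```typescript
// Count code points in a UTF-8 byte sequence: every byte that is not a
// continuation byte (top two bits 10) starts a new code point.
function countCodePointsUtf8(bytes: number[]): number {
  let n = 0;
  for (let i = 0; i < bytes.length; i++) {
    if ((bytes[i] >> 6) !== 2) n++;
  }
  return n;
}

// Count code points in a JS string (UTF-16): every code unit that is not a
// low surrogate (top six bits 110111, i.e. 0xDC00..0xDFFF) starts one.
function countCodePointsUtf16(s: string): number {
  let n = 0;
  for (let i = 0; i < s.length; i++) {
    if ((s.charCodeAt(i) >> 10) !== 0x37) n++;
  }
  return n;
}
```

Both assume well-formed input; a lone surrogate or a stray continuation byte would be miscounted, so a real implementation would want to validate first.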
