Proposal: New lexicon richtext core-type #2830
Replies: 23 comments 66 replies
-
Any ideas what this would look like?
-
This is a perfectly sound approach, but why not use HTML/CSS or something similar to how XAML works?
-
My only recommendation would be to keep it terse and easily parsed. Annotated richtext can bloat quickly, especially if structured prose is a possible extension down the line.
-
I think the weird thing is defining a new rich text format rather than using something existing like HTML+CSS, Markdown, or even ye olde Microsoft RTF.
-
Very happy to see this proposal. I think there probably needs to be an optional plaintext fallback for situations where facets that can impact actual functionality (vs just formatting) are unsupported. Say a If my client doesn't recognize the link type (hasn't been updated to support it yet, or whatever), it would be nice to be able to have a very safe plaintext fallback that would always work. So in the above example, your Bluesky app might publish:
And instead of This isn't really applicable to all facet types. Bold or italics will (usually, excepting things like implied sarcasm italics) read the same way regardless of whether or not my app supports those facet types. But by providing a plaintext
So the type would be:
-
I'm not sure what it all means, but if it makes it easier to increase character length in the future I'm all for it 🙏🏻
-
an alternative approach just to throw something at the wall, adding an
{
  text: 'xyz abc',
  annotations: [
    { type: 'bold', span: [0,2] },
    { type: 'comment', span: [4,6], text: 'foo' }
  ]
}
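A minimal sketch of how a client might read the shape above. The comment does not say how spans are indexed, so this assumes inclusive [start, end] indices into UTF-16 code units (which makes [0,2] cover 'xyz'). The SpanAnnotation and AnnotatedText type names and the textFor helper are hypothetical.

```typescript
// Hypothetical types for the shape suggested above.
interface SpanAnnotation {
  type: string;
  span: [number, number]; // ASSUMPTION: inclusive start and end indices
  text?: string;          // extra payload, e.g. the body of a comment
}

interface AnnotatedText {
  text: string;
  annotations: SpanAnnotation[];
}

// Return the substring an annotation covers (inclusive end, UTF-16 code units).
function textFor(doc: AnnotatedText, a: SpanAnnotation): string {
  const start = a.span[0];
  const end = a.span[1];
  if (start < 0 || end < start || end >= doc.text.length) {
    throw new RangeError("span out of bounds: [" + start + "," + end + "]");
  }
  return doc.text.slice(start, end + 1);
}

const example: AnnotatedText = {
  text: "xyz abc",
  annotations: [
    { type: "bold", span: [0, 2] },
    { type: "comment", span: [4, 6], text: "foo" },
  ],
};

// Pair each annotation type with the text it covers.
const covered = example.annotations.map((a) => a.type + ":" + textFor(example, a));
console.log(covered);
```

Because the text stays a single plain string, a client that ignores the annotations can still display it unchanged.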
-
Is a
-
Would markdown be able to solve this problem without inventing something new? There are already markdown libraries in pretty much every major language, and the syntax is well defined. @pfrazee has identified some challenges, outlined here: https://www.pfrazee.com/blog/why-facets
My thoughts
2. Parsers suck
3. Character counting
My thoughts as of now are...
Pro richtext:
-
bring back BBCode
-
This would certainly help with developer UX. I think it's important to emphasise that the usual exchange format here would be CBOR rather than JSON - JSON is somewhat worse to parse than HTML, but CBOR supports skipping.
One concern I have, however, is that this lacks support for nesting of fragments. Consider the following HTML (with whitespace for readability, and to be clear I am not suggesting to use HTML above CBOR):
<a href=https://example.com/This-is-a-very-long-link-that-is-common-for-example-when-acticle-titles-are-involved>
  Hello! <i>This</i> is a <b>single</b> <u>link</u> with <b>many <u>nested <i>fragments</i></u></b>.
</a>
With this proposal here, the link URL would be stated… 11 times in this relatively simple example. You have to deal with the exact same ambiguities whether you encounter overlapping or partially-matching-adjacent spans. You cannot just render them out split even for formatting because that messes up the accessibility tree. (Some or even most screen reader software may be able to mitigate that – I don't have experience in this regard – but I would not expect such a mitigation to be universal.)
I get that Go specifically has pretty bad support for alternatives, but I'm not sure that avoiding that is worth the increased complexity in rendering. But then again, if the representation is always not readily usable, that would encourage good renderers which would mitigate the problem of occasional bad editors creating unnecessarily split fragments or fragments with semantically counterintuitive nesting 🤔💭
As a data point, the ActivityPub network is using a nested representation without much issue, but simultaneously generally doesn't have WYSIWYG editors (which I suspect would be expected by a more general audience).
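To make the repetition concrete, here is a sketch that writes the HTML example in a flat segment form and then regroups adjacent segments that share a link, roughly what a renderer would need to do to emit a single anchor element. The FlatSegment shape, its link and marks fields, the shortened placeholder URL, and groupByLink are all hypothetical and not part of the proposal.

```typescript
// Hypothetical flat form: every segment repeats every facet that applies to it.
interface FlatSegment {
  text: string;
  link?: string;   // ASSUMPTION: the link target is stored on each segment
  marks: string[]; // e.g. "i", "b", "u"
}

interface LinkGroup {
  link: string | undefined;
  segments: FlatSegment[];
}

// Merge runs of adjacent segments that carry the same link, so a renderer
// can emit one anchor per run instead of one per segment.
function groupByLink(segments: FlatSegment[]): LinkGroup[] {
  const groups: LinkGroup[] = [];
  for (const seg of segments) {
    const last = groups.length > 0 ? groups[groups.length - 1] : undefined;
    if (last !== undefined && last.link === seg.link) {
      last.segments.push(seg);
    } else {
      groups.push({ link: seg.link, segments: [seg] });
    }
  }
  return groups;
}

// Placeholder standing in for the long URL in the HTML example.
const linkTarget = "https://example.com/long-link";

const flat: FlatSegment[] = [
  { text: "Hello! ", link: linkTarget, marks: [] },
  { text: "This", link: linkTarget, marks: ["i"] },
  { text: " is a ", link: linkTarget, marks: [] },
  { text: "single", link: linkTarget, marks: ["b"] },
  { text: " ", link: linkTarget, marks: [] },
  { text: "link", link: linkTarget, marks: ["u"] },
  { text: " with ", link: linkTarget, marks: [] },
  { text: "many ", link: linkTarget, marks: ["b"] },
  { text: "nested ", link: linkTarget, marks: ["b", "u"] },
  { text: "fragments", link: linkTarget, marks: ["b", "u", "i"] },
  { text: ".", link: linkTarget, marks: [] },
];

const linkGroups = groupByLink(flat);
console.log(flat.length, linkGroups.length);
```

The flat list carries the URL once per segment, while the regrouped form carries it once per run; the regrouping step is the extra rendering work the comment is concerned about.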
-
Well-written summary. I'll play devil's advocate here: don't fix what ain't broke. The first three established problems seem like they could be addressed with better tooling. The 4th problem is where the focus belongs.
-
This seems to be more of a micro-blogging concern than an Atproto one, so I would rather see these facets defined in the
-
I've only read through the proposal itself and not the comments, but given that this proposal was posted right after a question about block-level markup like paragraphs I'm not exactly sure if this would serve that use case. Sure, it's possible to introduce block-level markup by wrapping said richtext in a block interface, however that negates the one big point of this proposal: Having proper character limit restrictions. I'm wary that this would encourage trying to make said block-level markup work in a sort of hacky way when it's really not supposed to, now that it's a "core part" of the protocol and all. I'd rather have everyone experiment with their own ideas of a rich text before implementing this as a primitive type.
-
One important constraint that just came to mind reading this comment is that (unless there is only one element with no
-
To build a little further on the mostly-recap self-reply I just posted, with a point I haven't seen raised yet: the loss of the potential to have overlapping spans in this model strikes me as a net negative as well. While, as noted in the OP, the semantics around processing overlaps do need to be more tightly defined (in terms of per-app presentation specifications, as @qazmlp notes), the ability to have them is significantly preferable from an extensibility point of view. A span of text that one schema may represent with one facet (e.g. a "vote button", to pull loosely from poll.blue as an abstract example) may also be representable in another application's schema as multiple facets (e.g. an app that has discrete facets for "Icon" and "Label", possibly even nesting within an overall "Button" span that may have been defined by an earlier/later version of the schema supported by a different set of clients).
-
Lexicon types that don't fit into the transport format are identified in a self-describing way, including complex types such as blobs. So I might expect us to do the same thing here. An example use case would be to find and index richtext in a generic way, similar to how we use this property to identify and process blobs generically.
Following that, is it possible that a richtext field might end up looking something like this?
{
  "$type": "richtext",
  "value": [
    { "text": "Hello, " },
    { "text": "world!", "facets": [{ "$type": "com.atproto.richtext.bold" }] }
  ]
}
Some examples of other concrete Lexicon types:
// A cid-link in JSON
{ "$link": "bafycid" }
// Bytes in JSON
{ "$bytes": "aGVsbG8gcGF1bA" }
// A blob in JSON
{
  "$type": "blob",
  "cid": { "$link": "bafycid" },
  "mimeType": "image/jpeg",
  "size": 3976
}
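A rough sketch of the generic "find and index" use case, assuming richtext values are tagged with "$type": "richtext" as in the example in this comment. The RichTextValue and RichTextSegment names, the findRichText helper, and the app.example.post record type are hypothetical; nothing here is an existing atproto API.

```typescript
// Hypothetical shape, following the "$type": "richtext" example above.
interface RichTextSegment {
  text: string;
  facets?: Array<{ $type: string }>;
}

interface RichTextValue {
  $type: "richtext";
  value: RichTextSegment[];
}

// Recognise a richtext value purely from its self-describing "$type" tag.
function isRichText(v: unknown): v is RichTextValue {
  if (typeof v !== "object" || v === null) return false;
  const obj = v as { [key: string]: unknown };
  return obj["$type"] === "richtext" && Array.isArray(obj["value"]);
}

// Recursively collect every richtext value in a record without knowing its schema.
function findRichText(node: unknown, found: RichTextValue[] = []): RichTextValue[] {
  if (isRichText(node)) {
    found.push(node);
  } else if (Array.isArray(node)) {
    for (const item of node) findRichText(item, found);
  } else if (typeof node === "object" && node !== null) {
    const obj = node as { [key: string]: unknown };
    for (const key of Object.keys(obj)) findRichText(obj[key], found);
  }
  return found;
}

// Hypothetical record mixing a richtext field with a blob.
const post = {
  $type: "app.example.post",
  body: {
    $type: "richtext",
    value: [
      { text: "Hello, " },
      { text: "world!", facets: [{ $type: "com.atproto.richtext.bold" }] },
    ],
  },
  embed: { $type: "blob", mimeType: "image/jpeg", size: 3976 },
};

const hits = findRichText(post);
// Plain text for indexing: concatenate each value's segment text.
const plain = hits.map((rt) => rt.value.map((s) => s.text).join(""));
console.log(plain);
```

The walker never consults the record's own Lexicon, which is the property that makes blobs indexable generically today.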
-
I think a safer richtext is meaningful because there have been experiences in the past where overlapping It is difficult for me to decide whether to make destructive changes to the existing However, applications that want to use character decoration may avoid
-
1 - rich text editor → will be enough for now,
-
Chiming in for The Record:
High-level, I think this has some advantages and some disadvantages compared to the current facet system. I don't think there is enough of an improvement to make it worth the change and churn at this point in time. Opportunity costs and dev churn/friction are not insurmountable, and "sooner is better than later", but the downsides are real and I don't think this is worth it. If somebody found a solution which was clearly superior to both, I'd be open to switching.
This introduces what feels to me like pretty application-layer validation and structure as part of Lexicon. I'd rather leave this to application-level schemas, even if it means we don't get Lexicon-layer data validation (aka, validation of overall string length). I don't think we should be totally afraid of application-layer constraints, though Lexicon-level is stronger and should be preferred most of the time.
I think that the easiest thing to do near term to improve dev experience around this is to have small stand-alone parser/helper libraries for parsing strings into facets; and to add a
-
I wouldn't say difficult, just tedious.
I have a suggestion for this: make indices index grapheme cluster boundary positions. It's more complicated, but carries a few upsides:
As a concrete interface, I propose this:
interface RichText {
  text: string
  facet: Facet[]
}
interface Facet {
  $type: string
  start: number
  end: number
  // type-specific properties...
}
This also allows for image facets and such, and ensures every facet can be traced to a position correctly. To split a string into styled spans, you can (stably) sort facets by
There is a risk in HTML of unwanted glyph breaking when coloring stuff, but it's fixable with zero-width joiners inside words whenever it'd split a ligature. This is one of the places famously riddled with browser bugs, though, with both Chrome and Safari failing web platform tests for this (and at least Safari rendering incorrect Unicode in it), but the browser bugs are isolated to mid-word changes spanning ligatures and erroneously splitting them. It's harder, but it's extremely doable. And edge cases are actually no less well-defined than they are with simple Latin letter rendering.* People just aren't fully implementing the relevant specs.
* A few font specs are horribly imprecise with everything, and some font renderers were incredibly sloppy with shaping. For a while, this meant Times New Roman was even rendered noticeably differently by Mac vs by Windows, and it's precisely why SVG fonts were pushed so hard by Mozilla.
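One possible way to split text into styled spans with the Facet interface proposed in this comment, sketched under two assumptions: start and end are grapheme cluster boundary positions (0 is before the first cluster), and the text has already been segmented into clusters by some Unicode segmentation routine. This uses a boundary sweep rather than whatever sort order the comment had in mind, and the StyledSpan type, splitIntoSpans helper, and the "bold"/"italic" type names are hypothetical.

```typescript
// Facet as in the interface proposed above, minus type-specific properties.
interface Facet {
  $type: string;
  start: number; // ASSUMPTION: grapheme cluster boundary position
  end: number;   // ASSUMPTION: exclusive end boundary
}

interface StyledSpan {
  text: string;
  types: string[]; // $type of every facet covering this span
}

// `graphemes` is the text already split into grapheme clusters
// (ASSUMPTION: produced elsewhere by a Unicode segmentation library).
function splitIntoSpans(graphemes: string[], facets: Facet[]): StyledSpan[] {
  // Collect every position where the set of active facets can change.
  const cuts: number[] = [0, graphemes.length];
  for (const f of facets) {
    if (f.start < 0 || f.end > graphemes.length || f.start > f.end) {
      throw new RangeError("facet out of bounds: " + f.$type);
    }
    cuts.push(f.start, f.end);
  }
  cuts.sort((a, b) => a - b);

  const spans: StyledSpan[] = [];
  for (let i = 0; i + 1 < cuts.length; i++) {
    const from = cuts[i];
    const to = cuts[i + 1];
    if (from === to) continue; // duplicate boundary, nothing in between
    const types = facets
      .filter((f) => f.start <= from && f.end >= to)
      .map((f) => f.$type);
    spans.push({ text: graphemes.slice(from, to).join(""), types: types });
  }
  return spans;
}

// "café!" where the accented letter is one cluster made of two code points.
const clusters = ["c", "a", "f", "e\u0301", "!"];
const styled = splitIntoSpans(clusters, [
  { $type: "bold", start: 0, end: 4 },
  { $type: "italic", start: 3, end: 5 },
]);
console.log(styled);
```

Because indices count cluster boundaries, a facet can never start or end inside the combining sequence, and overlapping facets simply yield a span that lists both types. Zero-width facets (start equal to end) produce no span in this sketch and would need separate handling.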
-
This proposal is not resolved within the Bluesky team and is being shared here to solicit feedback. There's no firm commitment to this direction at this stage.
Motivation
Lexicon's core types can be found here.
We had a vigorous debate over how to represent richtext when Bluesky’s current schemas were being finalized (#621). Among the options we evaluated was this:
We ultimately decided to stick with offsets pointing into strings (utf8 codeunits). This has surfaced some downsides:
Proposal
Introduce a new core lexicon type called richtext. These richtext values are essentially the parsed form of the facets model:
This can be defined as follows:
richtext will be a concrete Lexicon type. It will support the following attributes in the Lexicon:
The length of a richtext will be calculated by concatenating the ‘text’ values of the objects and then measuring the resulting string.
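The length rule can be sketched as follows. The text above does not say which unit "measuring" uses, so this sketch assumes UTF-8 bytes; the TextSegment type and the concatText, utf8Length, and richtextLength helpers are hypothetical names, not part of the proposal.

```typescript
// Hypothetical segment shape: only the `text` field matters for length.
interface TextSegment {
  text: string;
}

// Concatenate the `text` of every segment, as the proposal describes.
function concatText(segments: TextSegment[]): string {
  return segments.map((s) => s.text).join("");
}

// UTF-8 byte length computed from UTF-16 code units, with no platform APIs.
function utf8Length(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1;
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one code point, four bytes
        i++;
      } else {
        bytes += 3; // lone surrogate, counted like a replacement character
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// ASSUMPTION: richtext length is the UTF-8 byte length of the joined text.
function richtextLength(segments: TextSegment[]): number {
  return utf8Length(concatText(segments));
}

const parts: TextSegment[] = [
  { text: "Hello, " },
  { text: "w\u00f6rld" },       // "wörld"
  { text: " \ud83d\udc4b" },    // space plus a waving-hand emoji
];
const joined = concatText(parts);
console.log(joined.length, richtextLength(parts));
```

Facets never contribute to the measured length in this reading, so a limit on the field constrains only what the reader sees as text.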
The richtext string may only be interpreted as a single “block” comprised of “inline spans”. This means that newlines may be observed, but it is not possible to introduce facets such as “unordered lists” or “quotes” as those are block-level elements. (Block-level elements may be introduced using other lexicon core-types or in userland.)
Etc
Concerns
.text