Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving UTF-8 Support #203

Open
4 of 6 tasks
tankerkiller125 opened this issue Jul 10, 2020 · 25 comments
Open
4 of 6 tasks

Improving UTF-8 Support #203

tankerkiller125 opened this issue Jul 10, 2020 · 25 comments
Assignees
Labels

Comments

@tankerkiller125
Copy link

tankerkiller125 commented Jul 10, 2020

The following kinds of bugs regarding UTF-8 are known and require the following kinds of changes/steps:

Problems:

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Jul 12, 2020

UTF-8 mentions: Convert current <@{username}> mention to <@U:{UserID}> similar to discord, also allows group mention in the future

I really like this, but my only concern is that it wouldn't be understandable while in the text editor. Perhaps we could save a mapping of usernames <=> ids somewhere, still show the usernames in the text editor, and swap it out on save? Essentially, just cuz we store it that way shouldn't mean that we display it in the text editor that way. Also, ref #254

Unfortunately this one is a limitation of the database engine itself often. The best solution around this is to support different search drivers that extension developers can build/tie into.

I agree, this should resolve itself naturally with #286

And that brings us to slugs. I've given this a bit of thought, and would like to propose the following structure:

We have an extender, with use:

(new Extend\Model)->slugDriver(Flarum\User\User::class, DRIVER_CLASS); // User can be any model here that implements SluggableModelInterface

The driver interface looks like this:

interface SlugDriver {
    public function to (AbstractModel model): string
    public function from (string slug): AbstractModel
}

The sluggable model interface looks like this:

interface SluggableModelInterface {
    toSlug(string driverId = DEFAULT): string
    fromSlug(string slug, string driverId): AbstractModel
}

Alternatively, we could allow registration of sluggable models via an extender, or bring the SluggableModelInterface functionality out into some form of service class.

We would provide a UI in admin for site admins to choose which slug drivers to use for each supported model (if more than one is registered for that model)

In all sluggable models, we would also include a computed slug for the model in the serialization so that we can keep slug generation logic completely to the backend.

For the API, we can use a parameter sent in get requests to single-item endpoints to determine if the model instance should be retrieved via a slug driver, otherwise defaulting to an ID (to keep a bit of compliance with JSON API? This should also resolve flarum/framework#358 and flarum/framework#1356

@dsevillamartin
Copy link
Member

@askvortsov1 For this we could also do something like Discord, but that'd require the editor to be WYSIWYG then (at least for some functions). I agree that just showing @datitisev and replacing that in the backend with the <@U:ID> would be easier and better for both users & API requests that create posts with mentions.

@tankerkiller125
Copy link
Author

tankerkiller125 commented Jul 12, 2020

I agree that @username in the editor makes the most sense (for now) but I feel like at some point WYSIWG will be required for some functionaility in order to resolve a comment/issue @luceos brought up about not displaying usernames to general users (because we allow username logins) personally I think that issue would be better resolved by simply disabling username logins altogether though (I can't think of any sites that allow username logins anymore).

Also @askvortsov1 I'm pretty much happy with this proposals for slugs, and it's pretty much what I had in mind.

@tankerkiller125
Copy link
Author

tankerkiller125 commented Aug 4, 2020

I've updated the list based on some community feedback. However I believe that what's now listed is pretty finalized at this point. I plan to start at least some work here in the coming days to resolve some of these issues.

Edit: I'll start work on this once Mithril 2 rewrite is near completion/complete. This will be my focus for next beta.

@tankerkiller125
Copy link
Author

Alright, I'm just going to write down my vision of the way that his would work. Looking for feedback on this idea, it mostly addresses the slugging problems.

Overall the basic idea is that each model that needs slug would use a trait or implement something that would let core know that it's a sluggable model. Further there would be a function (e.g. slugging()) that would inform the slugging handler which database column will be used for slugging purposes (so for discussions that would be the title, users the display name). We may also want to allow extension developers to override model default slugging column in some cases. (So the CJK community might want to use the ID rather than the title for example).

Each slug handler would be defined as a driver (like the mail system) and extension developers would be able to write their own driver that the admin would then be able to select on the dashboard. Admins should be able to select a driver on a model by model basis (so they could do something like discussions use php-intl slugging, users use no conversion slugging, etc.).

Once the admins have changed the slugging method they should be able to use a "Update slug" button that would update all stored slugs. However that button should warn them if they have more than say 1K discussions/users and show them the proper command to use in SSH to do it from there. We should also warn admins that there may be SEO impacts if they update the slugs.

Important Note: Slugs would be stored in the database and JS would pull slugs from this. We would need to find a semi-compatible way of pulling data from the JSON-API using the slugs rather than id.

@tankerkiller125 tankerkiller125 pinned this issue Oct 2, 2020
@maicol07
Copy link

maicol07 commented Oct 5, 2020

I would like to leave a suggestion about this feature. At the moment username can only follow strict rules and there are no ways at the moment (other than editing Flarum core) to change them to something more permissive.
Since usernames includes also special characters in other websites, Flarum should be able to manage them in future betas

@tankerkiller125
Copy link
Author

tankerkiller125 commented Oct 5, 2020

@maicol07 We do support special characters in Display Names which in the default version of Flarum isn't used much. If you look on twitter for example the actual username does not support UTF-8 but the display name does. One of the things that we in theory will be fixing as part of this UTF-8 project will be display/username support and other things that need fixing. Also of note your concern is listed in the checklist.

@askvortsov1
Copy link
Sponsor Member

For changing usernames, I believe that's possible as an admin side action, and the fof username change request extension is a great package for enabling user side requests. Tbh I don't think it's needed as a core feature given those 2

@zerosonesfun
Copy link

I can't think of any sites that allow username logins anymore

Twitter? Or am I thinking about that wrong? Maybe that’s not technically a username and/or, I’m sure being Twitter they find other ways to harden security? I get trying to keep usernames more private for security. But, I do think some sites, even some big ones, still have username logins and display usernames. You just need strong passwords. Maybe 2 factor eventually.

Anyway... just saw that comment and thought I’d foolishly for whatever reason go to bat for sites that display usernames and allow username logins. 🙃

I’m actually a fan of going with email only logins or whatever is decided. Or, let the user decide how secure they want to be. A switch in the admin to allow or not allow username logins. I guess that could be an extension. I also guess I’m getting a little off topic.

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Oct 25, 2020

@tankerkiller125 I've been thinking about this, and I'm not sure if we want to over engineer this. Essentially, I'm not convinced that in 2020, transliteration is still necessary.

What if we offered the following (independent, boolean) options:

  • Allow UTF-8, periods in usernames
  • Allow UTF-8, periods in discussion slugs (seems to already be allowed in titles) (if disabled, the usual transliteration system would be used)
  • Use numerical ID only for discussion pages
  • Use numerical ID only for user pages (if disabled, username only would be used for user pages)

This should cover most use cases, and wouldn't be too complex. It would also be fully up to site admins to enable/disable allowing direct unicode in slugs (with the knowledge that those settings would only apply to future changes). It seems to me like the only issue with unicode in usernames is for slugs, and over the past years, browsers support this better and better (as they should, IMO). Other forum softwares allow this as well (https://community.nodebb.org/topic/417/unicode-slugs-yes-please, https://meta.discourse.org/t/unicode-usernames-and-group-names/117737). We'd then just need to fix the mentions regex (or substitute the aforementioned solution), which needs to be done anyway.

The only issue that remains with this aproach is what to do for extensions that define their own models that need slugs. For instance, tags. With the above framework, I don't think we need an "extender" per-se, but it would be nice to have a unified place in admin for configuring all these at once. But we could keep the same switches: Whether unicode can be used in tag slugs, and whether the numerical ID should be used instead of unicode in tag slugs.

@tankerkiller125
Copy link
Author

@askvortsov1 although browsers now display UTF-8 when I suggested that in the original discussion on discuss (or possibly a discord conversation) users with UTF-8 languages where against it do to the way that copy and paste gets handled (very poorly) making sharing of the full URL very ugly.

@askvortsov1
Copy link
Sponsor Member

I'm still unsure that we need custom drivers. Practically speaking, one of 3 slugging schemes would be used:

  • Just numerical ID (optimal for CJK)
  • Transliteration (optimal for english and latin-based character schemes)
  • Full Unicode (Best display, poor copying)

With the 2 switches per-model strategy described above, all these are supported simply and out of the box. We could broaden it to 3 switches if we allow ASCII only (plus periods, I think those should be added), ASCII + accented characters (easily transliteratable), or full Unicode.

@tankerkiller125
Copy link
Author

After further discussion during our meeting last night there are plans to merge two ideas namely:

  • Transliteration Drivers
  • User URLs similar to discussions

We will likely need another meeting to hash out the exact plan here and determine the best way forward.

@askvortsov1
Copy link
Sponsor Member

As a follow-up, we need to decide what kind of drivers we're going with. I'm not sure that we should force a dedicated column for slugs, because for the most part, those can be generated dynamically. Realistically, there are going to be 4 drivers:

  • Numerical ID column, which doesn't need to store slugs back OR forth
  • Numerical ID column with transliterated title / username/etc (like for discussions). This could use a DB column, but that's not really necessary since the latter portion isn't used in going back or forth
  • Numerical ID column with raw title/username. This doesn't need a DB column since we're using the exact value, but as with the above, that portion isn't used to tie a slug to a certain model
  • Just raw title/username (for applicable models, this wouldn't work that well for discussions cuz spaces). This also doesn't need a DB column as the value is accessible immediately.
  • Note that just transliterated name/username isn't an option, since it's not bijective (Daniel and Daniël both map to Daniel)

If someone wants to use a slug beyond that, they might need a DB column, but that's extension territory IMO.

Another concern is how we'd want these to be handled in the frontend. We could generate and serialize slugs when sending content across, but that doesn't address the question of deciphering the slug so the frontend knows which endpoint to send API requests to. We could either:

After that, we can rework mentions, and that should allow for utf-8 usernames!

@tankerkiller125
Copy link
Author

We could actually re-work mentions regardless of slugging drivers or changes. The only hard part about mentions is getting the URL for the user mentions (which will be tied to drivers)

@askvortsov1
Copy link
Sponsor Member

True. We could even do both at once... For the mentions bit, I think s9e/TextFormatter#23 (comment) might just be our solution: We're already retrieving the users tagged by a post from the DB to insert their display names; the same could be done with the slug (and this can even be dynamically) in https://github.com/flarum/mentions/blob/8bcb66f7d728fc92c92c69333c40175c759eba9e/src/Listener/FormatUserMentions.php#L21-L29.

So that leaves us with adjusting the frontend regex so it can recognize dots and utf-8 characters I suppose (although actually no, that leaves us with adjusting the frontend display code so it has a WYSIWYG effect in the text editor so we can store IDs. Which frankly I'm not 100% sure how to approach.

How do you feel about not tying sluggability to DB columns?

@tankerkiller125
Copy link
Author

tankerkiller125 commented Oct 29, 2020

I see the reason for not tying it to slugs in the db, I also have some concerns around it but their fairly minimal concerns so I don't think it's worth arguing over it. Not having it tied to the DB does making changing slug drivers easier, but we'll still need a mass update function regardless since discussions already have a slug column.

@luceos
Copy link
Member

luceos commented Oct 29, 2020

I'm worried that removing a slug column in the database will impact performance too much. Needing to loop over all discussions from the index through a Sluggerable class in contrast to saving the slug attribute on save makes a huge difference..

@askvortsov1
Copy link
Sponsor Member

I don't think anything should be removed, but I think that whether or not a column in the database is used to store slugs is more of an implementation detail for a given driver than a part of the API. We loop through the discussions anyway when serializing, so a database column could be used in implementation to (slightly) speed up the toSlug portion of the driver.

For fromSlug, as noted in https://github.com/flarum/core/issues/2230#issuecomment-718199749, the only case where any variation of the title/username/etc would be used to get an instance from the slug is if we're just using the raw title/username, in which case we can do a direct database lookup without needing to run through all discussion instances to get the slug.

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Nov 6, 2020

A bit more info from a meeting today for what's needed:

  • Bundled display name extension
    • Add display name column, UI for setting display name, driver for display name, and replace User fulltext gambit to search by display name. Possibly could also include a slugging driver for User that would display numerical URL instead of username.
  • Implement slugging drivers for users and discussions. Use these in API endpoints IFF a "bySlug" GET parameter is present.
  • Make min search length a static constant that can be overriden by extensions
  • Fix hardcoded date in scrubber (we'll likely need a custom dayjs plugin since the packaged localized formats don't include what we need)

@askvortsov1
Copy link
Sponsor Member

Search improvements will be a part of #286, so the remaining concrete task here is the day js plugin for scrubber date translations.

@askvortsov1 askvortsov1 unpinned this issue Jan 13, 2021
@tankerkiller125
Copy link
Author

A note on that Day js plugin is that we need a localized format for Long Month, Year which is not built in. So all we're adding is that format.

@askvortsov1 askvortsov1 changed the title Future UTF-8 Support DayJS plugin for scrubber formatting Mar 16, 2021
@askvortsov1 askvortsov1 changed the title DayJS plugin for scrubber formatting Improving UTF-8 Support May 16, 2021
@sayuri-gi
Copy link

sayuri-gi commented Jun 15, 2021

@tankerkiller125 i can confirm that this doesn't work with sinhala (si) language as well.
https://discuss.flarum.org/d/27469-search-not-functioning-for-sinhala-unicode --> https://discuss.flarum.org/d/27076-sinhala-language-pack/2 (community report)

We also interest to see flarum support to display every scripts slugs. e.g. when user started a discussion with particular script title
(Additionally it'll be better if discussion owner able to edit the title and slug as possible to edit description and comments because sometimes annoying to see discussions with too long titles or with typos)

@sayuri-gi
Copy link

sayuri-gi commented Jul 14, 2021

The following kinds of bugs regarding UTF-8 are known and require the following kinds of changes/steps:
Problems:

  • UTF-8 search is awful (Can't search anything in chinese or japanese #2003)
    • In some languages a single character is a single word or even sentences (min search length is now a constant that extensions can override)
  • Scrubber date info is hard coded format, will need to localize that.

@tankerkiller125 @luceos search facility works perfectly with 0.1.0-beta.8, 0.1.0-beta.9 versions for sinhalese (si), chinese (zh) languages but not for japanese (ja), korean (ko) etc. somedays ago community user @thimiraonline reported it and i checked on test instance; see below

screenshot a:

screenshot b:

@askvortsov1 askvortsov1 transferred this issue from flarum/framework Mar 10, 2022
@rongcuid
Copy link

Search function for Chinese works only when there is a perfect phrase match, but a partial search is very needed as CJK languages don't use space to separate words.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants