Slug transliteration #194

Open
tobscure opened this Issue Jul 28, 2015 · 65 comments

Member

tobscure commented Jul 28, 2015

Example: https://chanphom.com/forums/luat-choi-chan-pro.29/, generated from the title "Luật chơi Chắn Pro".

dcsjapan Jul 28, 2015

Member

Transliteration is possible for many languages, but very difficult or impossible for a few languages (like Japanese). It would be best if there were a way to enable/disable this function; or barring that, percent encoding of unicode might be preferable as a more universally applicable solution.

tobscure Aug 27, 2015

Member

Currently slugs are generated using only alphanumeric characters, replacing anything else with a hyphen. However we should support some degree of transliteration so non-Latin languages still get slugs. This is an area where I don't have much knowledge, and help would be appreciated.

What needs to be done:

  • Work out a transliteration strategy (e.g. a library, or is there anything in PHP's standard library?) that supports a wide range of alphabets.
  • Discuss the possibility of leaving Unicode characters in slugs, for languages where transliteration is impossible. What are the problems with this, if any?
  • Depending on the strategy we decide upon, consider implementing a mechanism that allows language packs to turn transliteration on/off.
  • While we're here, we should also truncate long slugs to a maximum of 50 or so characters.
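
A minimal sketch of the behaviour described above (alphanumerics kept, everything else replaced with a hyphen) together with the proposed truncation. The function name `ascii_slug` and the default of 50 characters are illustrative assumptions, not Flarum's actual code; the point is only to show why non-Latin titles end up with little or nothing in their slugs.

```php
<?php
// Hypothetical helper, not Flarum's implementation.
function ascii_slug(string $title, int $maxLength = 50): string
{
    // Every run of bytes that is not an ASCII letter or digit becomes one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/i', '-', $title);
    $slug = strtolower(trim($slug, '-'));
    // The slug is pure ASCII here, so a byte-based cut cannot split a character.
    $slug = substr($slug, 0, $maxLength);
    // Avoid ending on a hyphen after the cut.
    return rtrim($slug, '-');
}

echo ascii_slug('Slug transliteration'), "\n"; // slug-transliteration
echo ascii_slug('español'), "\n";              // espa-ol
echo ascii_slug('რა კაი ფორუმი წამოვჭიმეთ!'), "\n"; // prints an empty line
```

Cutting after the cleanup step keeps the truncation simple, at the cost of sometimes ending mid-word.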

tobscure referenced this issue on Aug 28, 2015: v0.1.0 roadmap (old) #74 (closed, 19 of 53 tasks complete)

justjavac referenced this issue on Sep 7, 2015: Flarum v0.1.0 开发路线图 (development roadmap) #3 (open, 18 of 53 tasks complete)

tobscure referenced this issue on Sep 7, 2015: SEO URLs Faulty #433 (closed)

wielski Sep 10, 2015

Maybe you could use a library like this one?
https://github.com/ashtokalo/php-translit

Buhito72 Sep 29, 2015

In Spanish, slug generation replaces accented Latin characters such as ñ and á with a hyphen. For better SEO, it would be preferable to substitute the unaccented equivalents, for example: español ---> espanol (instead of espa-ol), corazón ---> corazon (instead of coraz-n). It can be done with a simple replacement of characters.

]/', '/[-]+/', '/<[^>]*>/');
$repl = array('', '-', '');
$url = preg_replace($find, $repl, $url);
return $url;
}
?>
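
Since the start of the snippet above did not survive, here is a separate, minimal sketch of the "simple replacement of characters" idea. The names `ACCENT_MAP`, `replace_accents` and `simple_slug` are made up for illustration, and the table only covers a handful of Spanish characters; it is not a reconstruction of the original function.

```php
<?php
// Illustrative table only: enough for the Spanish examples, not a full alphabet.
const ACCENT_MAP = [
    'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u', 'ü' => 'u', 'ñ' => 'n',
    'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U', 'Ü' => 'U', 'Ñ' => 'N',
];

function replace_accents(string $text): string
{
    // strtr() with an array replaces each listed substring in a single pass.
    return strtr($text, ACCENT_MAP);
}

function simple_slug(string $text): string
{
    $text = replace_accents($text);
    // Whatever is still not an ASCII letter or digit becomes a hyphen.
    $text = preg_replace('/[^a-z0-9]+/i', '-', $text);
    return strtolower(trim($text, '-'));
}

echo simple_slug('español'), "\n"; // espanol
echo simple_slug('Corazón'), "\n"; // corazon
```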

ISilvaPT Oct 4, 2015

The same could be said for Portuguese:
ã | â | á | à > a
ê | é | è > e
í | ì > i
õ | ô | ó | ò > o
ú | ù > u
ç > c

dcsjapan Oct 8, 2015

Member

As I mentioned above and in #557, transliteration isn't a complete solution. There are some languages that can't be transliterated very easily, or at all.

In the case of Japanese, as I mentioned in Stumbling block 6, it would take a lot of rather sophisticated processing to come up with reliable transliterations of words spelled using Chinese characters. And even the most sophisticated program will be reduced to guessing when it comes to things like names, which can use Chinese characters in nonstandard ways.

Japanese is clearly an extreme case, but even where the relationship between pronunciation and spelling tends to be more stable, there are still difficulties. To transliterate Chinese reliably, for example, you would need to provide a glossary of at least several thousand characters. So it's not always a matter of applying a few well-defined rules.

In regions where transliteration is impractical, there is a strong trend toward the use of unicode in URLs. Flarum will have to support that, or it will simply be irrelevant in those regions. At the same time, however, Flarum also needs to offer transliteration for regions that have adopted that approach.

My suggestion is:

Admins should be allowed to specify whether URLs should be transliterated or encoded. This could be implemented as an administrator setting, though it might be better still to have the question asked and answered during the installation process.

When an admin chooses the former, a library such as this one suggested by @FirestarterUA could be used to transliterate all slugs, including thread titles, tag names, and usernames. (Flarum may need to check all these items and return an error whenever any non-transliteratable text is entered. Or we could leave it up to admins to tell their users: "Don't use any Chinese characters ... or else!")

When an admin chooses the latter, all URLs are encoded appropriately, with only an absolute minimum of character replacement (e.g. hyphens in place of spaces) being performed.
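
The two modes suggested above could be sketched roughly as follows. The setting values `'transliterate'` and `'encode'`, the function name `build_slug`, and the callable parameter are all assumptions made for this illustration; the stand-in transliterator handles just two characters.

```php
<?php
// Hypothetical: $mode would come from an admin setting.
function build_slug(string $title, string $mode, callable $transliterate): string
{
    if ($mode === 'transliterate') {
        // Hand the title to whatever transliteration routine is configured,
        // then reduce the result to ASCII letters, digits and hyphens.
        $ascii = $transliterate($title);
        $slug = preg_replace('/[^a-z0-9]+/i', '-', $ascii);
        return strtolower(trim($slug, '-'));
    }

    // 'encode': minimal replacement only (whitespace becomes a hyphen);
    // the original characters stay and are percent-encoded when the URL is emitted.
    return (string) preg_replace('/\s+/u', '-', trim($title));
}

// Stand-in transliterator for the demo; a real one would cover far more characters.
$demoTransliterate = function (string $s): string {
    return strtr($s, ['ñ' => 'n', 'ó' => 'o']);
};

echo build_slug('Corazón español', 'transliterate', $demoTransliterate), "\n"; // corazon-espanol
echo build_slug('新聞 動態', 'encode', $demoTransliterate), "\n";               // 新聞-動態
```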

johannsa Dec 26, 2015

Contributor

Why not use the same approach as Wikipedia and allow Unicode in slugs, which is supported by modern browsers and also by part of Flarum's frontend? This way many character sets would be available.

Also, slugs for discussions are currently generated on the client, which is not ideal. They should be generated on the server (and stored in the database, like tag slugs are).

dcsjapan Jan 25, 2016

Member

Why not use the same approach as Wikipedia and allow Unicode in slugs, which is supported by modern browsers and also by part of Flarum's frontend?

I think that would be a great solution ... I'd just like to be sure there aren't any SEO implications for admins in regions where transliteration is the accepted approach.

franzliedke Feb 6, 2016

Member

As discussed in #646, we can use Stringy which gives us slugging functionality for free.

franzliedke Feb 6, 2016

Member

We might also want to truncate the slug after a certain length.

thecotne Feb 29, 2016

I want to mention that for the Georgian language, slugs are not generated at all: from "რა კაი ფორუმი წამოვჭიმეთ!" I got the slug "--".
Also, the Wikipedia approach is the best one for slugs.

akalongman Feb 29, 2016

+1
@tobscure We need unicode slugs

thecotne referenced a pull request that will close this issue on Feb 29, 2016: add unicode support in slugs #836 (open)

datitisev Mar 27, 2016

Member

This looks good for different languages: Cocur/Slugify. The only problem is that it needs the language name fully spelled out (english instead of en), although that is probably an easy fix.
The other one I found, which doesn't need a language, is Jbroadway/urlfix, although that one is more basic, I think.
Whichever is better ;)

dcsjapan Apr 4, 2016

Member

Of the transliteration options mentioned, Slugify strikes me as the most worthy of consideration. It covers a wide range of languages out of the box, can easily be customized to cover more, and is flexible when it comes to integration.

As @franzliedke said, Stringy may also be an option, especially if it can also be employed for tasks other than transliteration. One cause for concern is that it only does slugification, not true transliteration; that is, it seems to work on a fixed ruleset:

Converts the string into an URL slug. This includes replacing non-ASCII characters with their closest ASCII equivalents, removing remaining non-ASCII and non-alphanumeric characters, and replacing whitespace with $replacement.

This may not provide the best transliterations for all languages; converting ä to a would not work in a language where ae is the more commonly used transliteration. A more language-specific solution would give better results vis-a-vis both SEFiness and human readability.

I'm wondering whether it would be possible to use Stringy, but insert language-specific rulesets (like the ones used by Slugify) when available. We could put the ruleset file right in the language pack, as we've done with Moment.js translations. When the admin sets the forum's slugification style to "transliteration" (as opposed to "UTF-8") Flarum would grab the ruleset for the forum's default language and slugify based on that. If the language pack is lacking a ruleset, it could fall back to standard Stringy slugification.

Would something like this be possible?

EDIT: It would be best to have Stringy treat the language-specific ruleset as overrides, so it can default to its own slugification rules when it encounters a character that's not covered in the ruleset being used. That would allow it to cope with situations involving characters not included in the ruleset for the default language ... such as a topic about Søren Kierkegaard in a French forum.

This solution would be best suited to single-language forums. Handling of thread titles (etc.) in more than one language would tend to be hit-and-miss. And in cases where a forum includes languages requiring different slugification methods ... Russian and Japanese, for example ... the admin will be forced to use UTF-8 slugs. The only way around that would be to make Flarum truly multilingual, i.e. assign a locale value to each thread.
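
The "ruleset as overrides" idea can be illustrated with plain arrays. This is only a sketch: the tables, the language codes and the function name `transliterate_with_overrides` are invented for the example and do not reflect how Stringy or Slugify store their rules.

```php
<?php
// Hypothetical default rules, standing in for a library's built-in table.
$defaultRules = ['ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ø' => 'o', 'é' => 'e'];

// Hypothetical per-language rulesets, as a language pack might ship them.
$languageRules = [
    'de' => ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'],
    'fr' => ['œ' => 'oe'],
];

function transliterate_with_overrides(string $text, array $overrides, array $defaults): string
{
    // The array union operator keeps the left-hand value when a key exists
    // in both arrays, so language rules win and the defaults fill the gaps.
    return strtr($text, $overrides + $defaults);
}

// German forum: ä follows the language-specific rule.
echo transliterate_with_overrides('Käse', $languageRules['de'], $defaultRules), "\n"; // Kaese

// French forum: ø is not in the French ruleset, so the default rule applies.
echo transliterate_with_overrides('Søren Kierkegaard', $languageRules['fr'], $defaultRules), "\n"; // Soren Kierkegaard
```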

franzliedke modified the milestone: 0.1.0 (Apr 7, 2016)

yihui Apr 29, 2016

As a Chinese speaker, I'd just want a simple option to disable slugs for posts. I don't want either transliteration or Unicode characters in the URLs. Personally I also prefer shorter URLs like example.com/d/12345 instead of example.com/d/12345-hello-world. Having Unicode Chinese characters in the URL will make it horribly long and messy like https://zh.wikipedia.org/wiki/Portal:%E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B when you copy the URL from the address bar of the browser (e.g. Chrome). That is not human readable, so such slugs will be useless. I think disabling transliteration is much easier to implement and more useful to Chinese users.

dcsjapan Apr 29, 2016

Member

Safari and Firefox are able to copy the URL in human-readable format. When I open the URL you linked above and copy it from the Safari address bar, I get this:

https://zh.wikipedia.org/wiki/Portal:新聞動態

So this should probably be considered a deficiency of Chrome ... or of your OS, perhaps. That said, a third option to disable slugs altogether shouldn't be too hard to implement, and may be wanted by enough site admins that it would be worth adding.
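
For reference, the two forms of the URL discussed above are the same bytes in different notation. A short sketch using PHP's built-in `rawurlencode` and `rawurldecode` (the variable names are just for illustration):

```php
<?php
$title = '新聞動態';

// Percent-encode the UTF-8 bytes of the title, as they appear in a copied URL.
$encoded = rawurlencode($title);
echo 'https://zh.wikipedia.org/wiki/Portal:' . $encoded, "\n";
// https://zh.wikipedia.org/wiki/Portal:%E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B

// Decoding gives back the human-readable form that some browsers display and copy.
echo 'https://zh.wikipedia.org/wiki/Portal:' . rawurldecode($encoded), "\n";
// https://zh.wikipedia.org/wiki/Portal:新聞動態
```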

believer-ufa May 15, 2016

Hello guys :) Have you heard about the Transliterator in PHP's Intl extension?

For example, you can use this snippet to transliterate any string to Latin characters (even Japanese characters, as far as I know):

<?php
$rules = 'Any-Latin; Latin-ASCII; [\u0080-\uffff] remove';

echo transliterator_transliterate($rules,'Какая-то строка, которая нуждается в транслитерации');
// Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii

echo transliterator_transliterate($rules,'新聞動態');
// xin wen dong tai

echo transliterator_transliterate($rules,'რა კაი ფორუმი წამოვჭიმეთ');
// ra kai porumi tsamovchimet

You can find more about these transliterator functions in the source of the Yii 2 framework, for example.

believer-ufa May 15, 2016

Also, on the documentation page for the Intl extension, there is a comment from one of the PHP developers describing one possible way to turn a string into a properly transliterated URL slug:

<?php
function slugify($string) {
    $string = transliterator_transliterate("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();", $string);
    $string = preg_replace('/[-\s]+/', '-', $string);
    return trim($string, '-');
}

echo slugify("Я люблю PHP!"); // a-lublu-php
echo slugify('რა კაი ფორუმი წამოვჭიმეთ'); // ra-kʼai-porumi-tsʼamovchʼimet
echo slugify('新聞動態'); // xin-wen-dong-tai
?>

I think we need to test this on a number of strings to choose the better method :)

franzliedke May 15, 2016

Member

@believer-ufa Thanks for pointing it out, we'll take a look.

However, since this requires the intl extension, we probably have to use another approach (library).

believer-ufa May 15, 2016

@franzliedke, you already use the gd and mysql extensions. Why is using this extension a problem? On a Debian-based Linux system, for example, it is solved with a single command like sudo apt install php7.0-intl.

You most likely won't get transliteration this good from any other library, since the majority of these libraries are intended only for certain languages.

franzliedke May 15, 2016

Member

Well, you will probably agree that we can be reasonably certain that MySQL is installed everywhere. (And even if not, Flarum can not function without it.)

But yeah, I'm open to the idea. Does anybody know some place with PHP extension installation stats?

believer-ufa May 15, 2016

I don't quite understand you. The Flarum installation guide tells the user that SSH access is needed, along with PHP 5.5+ and the following extensions: mbstring, pdo_mysql, openssl, json, gd, dom, fileinfo. It's a common situation: you install some PHP extensions to be able to run a framework. You would just need to install one more extension to get correct transliteration in your forum :)

dcsjapan May 15, 2016

Member

@believer-ufa Not every Flarum admin will have the access necessary to install the extension. One of the devs' goals is to keep Flarum easy to install on shared hosting plans. Every extension added can limit the number of providers that will be able to support Flarum. I think that's why @franzliedke is asking about extension installation stats; it's a decision that can't be made too casually.

believer-ufa May 16, 2016

Okay, but it's a really nice extension :) Look at the discussion on the Flarum forums; one of the participants is already convinced by this approach.

You could also write the code so that it does not require the Intl extension but uses it when it is available. I think that would be the right solution: it avoids problems with bad hosting and still solves this problem.

jordanjay29 May 16, 2016

Member

Maybe @believer-ufa's method would be better as an extension, regardless of who makes it. Then Composer can check whether the required PHP extension is available and refuse to install if not. Being so dependent on an additional PHP module, if it's not widely installed, may hurt Flarum's ability to spread more than lacking this feature would.

believer-ufa May 16, 2016

jordanjay29, you can write code that uses Intl if it exists; if it doesn't, Flarum can still work, just without nice, full-language URL transliteration. Read my comment above.

franzliedke May 16, 2016

Member

Well, not using the Intl extension does not mean we can't implement transliteration. There are enough libraries out there.

Still, I kinda like the idea of using Intl when it's available, and only falling back to another implementation if not.
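
One way the "use Intl when it's available, fall back otherwise" idea might look is sketched below. The function name `slugify_with_fallback` is made up, and the fallback here is simply the ASCII-only rule rather than a transliteration library, so treat it as an outline of the control flow and not a proposed implementation.

```php
<?php
function slugify_with_fallback(string $title): string
{
    if (function_exists('transliterator_transliterate')) {
        // Intl is loaded: let ICU romanize the text and strip accents.
        $converted = transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $title);
        if ($converted !== false) {
            $title = $converted;
        }
    }

    // Without Intl (and as a final clean-up with it): keep only ASCII
    // letters and digits, and turn everything else into hyphens.
    $slug = preg_replace('/[^a-z0-9]+/i', '-', $title);
    return strtolower(trim($slug, '-'));
}

echo slugify_with_fallback('Hello World'), "\n"; // hello-world
// Output for non-ASCII titles depends on whether Intl is installed.
echo slugify_with_fallback('español'), "\n";
```

A consequence of this design is that the same title can produce different slugs on different hosts, so stored slugs would need to stay stable if the extension is added or removed later.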

dcsjapan May 16, 2016

Member

Still, I kinda like the idea of using Intl when it's available, and only falling back to another implementation if not.

That sounds promising. 😀

firegurafiku Nov 3, 2016

Let me support the idea which was proposed by @yihui: there should be an option to either disable slugs completely, or set them manually. Or, better, both of them.

Forcing everyone to use machine-transliterated slugs would hurt a lot, as many languages just cannot be romanized well enough or, at least, unambiguously. For them the result is just a confusing, meaningless mess of letters.

@believer-ufa

The library you proposed seems to do only the simplest table-based substitutions. Let me comment on your examples:

Какая-то строка, которая нуждается в транслитерации
Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii

Or maybe: kakaya, kotoraya, nuzhdaetsya. Judging by your nickname, you should know that Russian has a bunch of different transliteration schemes. Even the government cannot decide which one to use.

新聞動態
xin wen dong tai

But what about reading this in Japanese: shinbun dotai? Or maybe the Korean reading? Unicode does not distinguish between Chinese, Japanese and Korean graphemes.

Even Latin-based scripts cannot be reliably transliterated.
Moreover, what if a user wants a title translation, not a transliteration, in their URLs?
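
The ambiguity of table-based substitution can be sketched as follows. Both tables are illustrative partial mappings that cover only the letters in the sample words; they are assumptions made for this example, not complete implementations of any official scheme, and `transliterate` is a hypothetical helper.

```python
# Letters rendered identically by both illustrative schemes.
COMMON = {
    "а": "a", "в": "v", "д": "d", "е": "e", "и": "i", "к": "k", "л": "l",
    "н": "n", "о": "o", "р": "r", "с": "s", "т": "t", "у": "u",
}

# The schemes disagree on just three letters, yet the output differs visibly.
SCHEME_A = {**COMMON, "ж": "z", "ц": "c", "я": "a"}
SCHEME_B = {**COMMON, "ж": "zh", "ц": "ts", "я": "ya"}


def transliterate(text: str, table: dict) -> str:
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text.lower())


for word in ("какая", "которая", "нуждается"):
    print(transliterate(word, SCHEME_A), transliterate(word, SCHEME_B))
# kakaa kakaya
# kotoraa kotoraya
# nuzdaetsa nuzhdaetsya
```

Neither output is wrong by its own table, which is why a single built-in scheme cannot suit every forum.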

dcsjapan Nov 4, 2016

Member

> Unicode does not distinguish between Chinese, Japanese and Korean graphemes.
>
> Even Latin-based scripts cannot be reliably transliterated.

Just so!

> Moreover, what if a user wants a title translation, not a transliteration, in their URLs?

That might be worth investigating as an idea for a third-party extension. For now, I think it would be sufficient if Flarum could offer a robust system to provide for both transliteration and unicode, with enough configuration options to allow admins in any region to tweak its behavior to their liking.

yihui Nov 4, 2016

> a robust system to provide for both transliteration and unicode

plus an option to disable slugs completely please... :)

dcsjapan Nov 4, 2016

Member

> plus an option to disable slugs completely please... :)

I don't see why that couldn't be added. Compared to everything else, it would be _easy._ 😄

Incidentally,

> Having Unicode Chinese characters in the URL will make it horribly long and messy like when you copy the URL from the address bar of the browser (e.g. Chrome).

I don't experience this sort of thing when using Safari (though I have seen it when using Firefox). One would hope that the other browsers could get with the program and make it possible to copy and paste properly encoded URLs so that the result would be human-readable ... 🙄

EDIT: See my comment below.

believer-ufa Nov 4, 2016

> Forcing everyone to use machine-transliterated slugs is a huge pain, as many languages just cannot be romanized well enough, or at least not unambiguously. For them, the result is just a confusing, meaningless mess of letters.

Interesting logic, but I believe you are making too much of an issue out of this. We just need URLs that carry some information about the conversation. After all, nothing terrible will happen if the URL is slightly incorrect. It is better to have at least something: it gives search engines additional information about the page, which helps SEO.

franzliedke Nov 4, 2016

Member

On the other hand I'm not sure what search engines do with nonsensical information (such as from a wrong transliteration) in the URL. Thanks for bringing it up, @yihui and @firegurafiku!

dcsjapan Nov 11, 2016

Member

Scratch that ... I just copied and pasted a Google URL with Safari and ended up with a string of very non-human-readable percent encodings in it. I had been thinking that Safari fixes percent-encoded URLs when copying to the clipboard, but that doesn't seem to be the case after all.

So the issue raised by @yihui is definitely something we need to think about.
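
To illustrate why such copied URLs get long: percent-encoding writes every UTF-8 byte as three ASCII characters, and each of these CJK characters takes three bytes. A small Python sketch using only the standard library, with the sample title taken from earlier in this thread:

```python
from urllib.parse import quote, unquote

title = "新聞動態"
encoded = quote(title)  # each UTF-8 byte becomes %XX

print(len(title), len(encoded))   # 4 36
print(unquote(encoded) == title)  # True
```

A four-character title grows ninefold once encoded, although it decodes back losslessly.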

aethior Feb 28, 2017

I'm not a developer, but I want to share my opinion as a user and webmaster. Why not copy the slug method of WordPress (the most widely used CMS)?

WordPress uses lowercase Latin letters, without symbols or marks, and you have the option of using characters from other alphabets. I also find interesting the possibility of a short URL without the post title (as an option in the admin panel).

In any case, I want to voice my opposition to a method similar to Wikipedia's. I'm Spanish, and my language uses a lot of symbols and marks, and Wikipedia URLs are annoying when you want to share Wikipedia links.

I think the URL method should be simple, with complex transliteration added by extensions (WordPress has different plugins for that).

tobscure referenced this issue Mar 4, 2017: Url slug #1142 (Closed)

sijad Mar 6, 2017

Contributor

Neither transliterator_transliterate nor Slugify is suitable for the Persian language.

believer-ufa Mar 6, 2017

@sijad, if we are talking about Slugify, you can easily add your own rules for your language.

thecotne Mar 8, 2017

What if we use GitHub-issue-like URLs (ID only, no slug, no transliteration), and then let plugins change the URLs?

aethior Mar 9, 2017

> What if we use GitHub-issue-like URLs (ID only, no slug, no transliteration), and then let plugins change the URLs?

Those URLs are neither SEO-friendly nor human-friendly. Your suggestion was discussed here: #1140 (comment)

firegurafiku Mar 9, 2017

@believer-ufa

> if we are talking about Slugify, you can easily add your own rules for your language.

How about easily adding support for Chinese or Japanese?
Languages are hard, and nobody should rely on automatic romanization. Instead, there should be options to disable slugs altogether or to set them manually.

sijad Mar 9, 2017

Contributor

@believer-ufa In Persian, people usually do not use diacritics in text, so Slugify is not an option. For Persian (and Arabic?), using Unicode plus a few filters (removing diacritics, non-alphanumerics, spaces, etc.) is the best option.
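
A rough sketch of the "Unicode plus a few filters" approach, in Python for illustration. The name `unicode_slug` and the exact filters are assumptions. It targets the Persian/Arabic case described above; dropping nonspacing marks this way could damage scripts in which such marks are essential.

```python
import re
import unicodedata


def unicode_slug(title: str) -> str:
    """Keep letters and digits of any script, drop marks, hyphenate the rest."""
    # Drop nonspacing marks (category Mn), e.g. Arabic-script vowel signs.
    no_marks = "".join(ch for ch in title if unicodedata.category(ch) != "Mn")
    # \W is Unicode-aware in Python 3, so non-Latin letters and digits survive.
    return re.sub(r"[\W_]+", "-", no_marks.lower()).strip("-")


# The first word carries a vowel sign (U+064E) that the filter removes.
print(unicode_slug("س\u064eلام دنیا!"))
print(unicode_slug("Hello,  World!"))  # hello-world
```

No transliteration table is involved, so the slug stays in the original script and would be percent-encoded when the URL is copied.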

yagobski Mar 20, 2017

I just made an improvement for this issue. I used something like the WordPress slug function. It can handle UTF-8 and more ;)

yagobski added a commit to yagobski/core that referenced this issue Mar 20, 2017

Slug transliteration #194
Currently slugs are generated using only alphanumeric characters, replacing anything else with a hyphen. I improved the slug function to support some degree of transliteration so that non-Latin languages still get slugs.

luceos Mar 20, 2017

Member

@franzliedke Personally, I use the intl extension a lot. It also makes handling monetary values easy, and it is easy to install.

johannsa Apr 27, 2017

Contributor

Putting implementation aside, has any decision been made on whether to allow Unicode characters in slugs (à la Wikipedia)?

tobscure Apr 27, 2017

Member

@johannsa I think the consensus was to make it an option?

Zeokat May 5, 2017

Contributor

I'm one of those who think it would be better to use the same approach as WordPress because, in my experience (more than 6 years of using WordPress), it is close to perfect.

The problem with using the intl extension may be shared hosting: some hosts may not have that extension enabled by default.

yagobski May 6, 2017

Zeokat May 6, 2017

Contributor

@yagobski your commit is largely incomplete:

  • First, you copy-pasted WordPress code while ignoring functions like seems_utf8(), get_locale(), etc.
  • WordPress code is under the GPL license: #1148 (comment)

yagobski May 6, 2017

jordanjay29 May 6, 2017

Member

@yagobski I believe the concern is that borrowing code from a GPL project will force Flarum under the GPL, and that's not a desired outcome.

Make sure any code you borrow is licensed freely (public domain) or under something compatible with MIT. The GPL and other share-alike/copyleft licenses (including any Creative Commons license with 'SA') are not compatible and will be rejected for inclusion in Flarum.

Zeokat May 14, 2017

Contributor

What would be the right choice when the generated slug is empty because the title uses only unsupported characters?

  • Option 1: Use a random string, so the slug will be like 16-7s8eds5e68gd6se7d
  • Option 2: Use only the discussion ID (as @aethior suggested), so the slug will be like 16- (note that the trailing - is added by the current code for empty titles).

It would be nice to know which option is right, because each needs a different code solution.

franzliedke May 14, 2017

Member

@Zeokat I would prefer Option 2.

Zeokat May 14, 2017

Contributor

Yes @franzliedke, that option will be our best choice. The problem is that Flarum always adds a - character after the discussion ID. I'm trying to locate which files are also involved in adding that trailing -, but without much luck.

At the moment I have only located this line involved in the trailing dash:

Handling the JavaScript part is the problem for me.

franzliedke May 14, 2017

Member

Hmm, the URL without slug is already understood: https://discuss.flarum.org/d/187.

That means that only the URL generation code has to be adapted.
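
A minimal sketch of URL generation that omits the dash when there is no slug, in Python for illustration. `discussion_path` is a hypothetical helper; the `/d/<id>` shape is taken from the URLs quoted in this thread, and this is not Flarum's actual code.

```python
def discussion_path(discussion_id: int, slug: str = "") -> str:
    """Return /d/<id>-<slug>, or just /d/<id> when the slug is empty."""
    slug = slug.strip("-")
    if slug:
        return f"/d/{discussion_id}-{slug}"
    return f"/d/{discussion_id}"


print(discussion_path(187, "slug-transliteration"))  # /d/187-slug-transliteration
print(discussion_path(5772))                         # /d/5772
```

Since the ID-only form is already routed, only the code that builds links would need a check of this kind.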

Zeokat May 15, 2017

Contributor

@franzliedke Yes, URLs with only the discussion ID are already understood, and they also give us some duplicate content, because both URLs return HTTP status 200 without any redirect (301) that search engines can understand. Anyway, that's another story.

I'm talking about the lines of code that add the dash after the discussion ID (for example, for an empty slug the autogenerated URL is https://discuss.flarum.org/d/5772-- , which looks a little ugly).

Anyway here we go: #1183

luceos referenced this issue Jun 20, 2017: Supporting UTF8 #34 (Closed)

buiductuan182 commented Jul 11, 2017

Maybe you can use library like this one? https://www.quangminhhanoi.com/dieu-hoa-daikin
