Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Underscores in slugs get converted to hyphen-minus #768

Open
chris-morgan opened this issue Aug 8, 2019 · 27 comments

Comments

@chris-morgan
Copy link
Contributor

commented Aug 8, 2019

I have a content file named rust-cfg_attr.md on my Zola 0.8.0 site.

Zola ignores what I deliberately typed as the slug, and turns the _ into -, yielding rust-cfg-attr. This makes me unhappy.

I expect Zola to make no modifications at all to the slug that I specify, so that I can do things how I want to, e.g. Do_things_like_Wikipedia if I really want to.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 8, 2019

It looks like the "slug everything" approach is too strong. It should instead only remove some characters like & or / from the slug. This will give us UTF-8 slugs as well, which are transformed into ASCII right now.

Anyone interested in doing that?

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 8, 2019

Why change anything? & is perfectly fine in URLs, and you won’t encounter / in filenames.

I’m interested in fixing this.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 8, 2019

From https://en.wikipedia.org/wiki/Percent-encoding I would prefer to stick with the unreserved characters + UTF8 since otherwise things like & needs to be encoded for the sitemap. I had a quick test with Hugo and the resulting slugs are exactly what I expected so it might be as simple as porting their slug code from Go to Rust.

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 9, 2019

Yes, ampersand will need to be encoded everywhere, just as would already happen; I fail to see why that would be a problem.

I see no reason why we should alter what the user has deliberately chosen.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 9, 2019

It will fail to work in some tools and users might not be aware of encoding, no reason not to remove reserved characters imo: generated URLs by Zola should work everywhere out of the box.

The Go code for Hugo is https://github.com/gohugoio/hugo/blob/master/helpers/path.go#L136 which raised the issue of what to do with accents in paths. I think we want to remove them (at least I wouldn't want them in a French page for example) but need to be careful to not remove other languages significant "accents" like in Japanese and replace it with the correct ASCII version, ä in German should become ae, not a.
I don't know whether an option like RemovePathAccents in Hugo is required though, it feels like the default should be true and I'm pretty sure the option is only there for backwards compatibility.

A related issue: #570

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 9, 2019

What tools do you have in mind where things might fail?

I maintain that these aren’t things that Zola chose. These are things that the user chose, and Zola should emphatically be leaving them alone. Removing things like accents would be wrong. Messing with what the user provided at all is wrong.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 9, 2019

What tools do you have in mind where things might fail?

Latex for the first thing that comes to mind. Try to put an accent in a URL like https://fr.wikipedia.org/wiki/Wikipédia and it's going to fail, the user will need to manually encode the accents. Many tools in various places do assume ASCII only URLs. I've seen forums where a link to a Wikipedia article with an article would stop before an accent, linking to a 404 for example.

Removing things like accents would be wrong. Messing with what the user provided at all is wrong.

I would disagree on that. I guess it's more common with taxonomies, but I would want to write the proper name for one, like cinéma without wanting to have the é in the URL. I think it's a pretty common need, as seen in https://discourse.gohugo.io/t/special-characters-in-taxonomy/1247/14

Another thing that sucks with non-ASCII urls is the sharing, eg the Japanese homepage of Wikipedia is https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 when you copy the link which is unreadable and gets long very very easily :/ This is the main reason I (and others from the thread above I guess) do not want accents.

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 9, 2019

But I repeat: this is something that the user chose. If I put an accent in the filename, I expect it to stay there. If I wrote my URLs in Japanese, I expect them to be in Japanese.

It is not Zola’s place to interfere in these matters.

For comparison, consider IDNs.

Changing these things at all is wrong.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 9, 2019

Right but as I said, I might want to have the correct spelling for a word as a taxonomy but not have any of its accents in the URL (only for ASCII accents, Japanese will not get escaped). So there needs to be an option to at least remove accents (the RemovePathAccents in Hugo I guess).

I still need to think for all the URL reserved characters like & as I still think they should be removed.

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 10, 2019

I had actually completely forgotten about the slugification of taxonomies, because I don’t use it that way. I will allow that taxonomy terms should have optional (enabled by default) slugification, but that nothing that is already a file name should be touched.

Here’s an excerpt of how I do tags:

[[taxonomies]]
name = "tags"
rss = true

[extra.tags]
	[extra.tags.rust]
	name = "Rust"
	title = "The Rust programming language"
	description = "I write about Rust things routinely. Here they all are."

My pages then have the "rust" tag, and templates render it all to <a href="/tags/rust/" title="The Rust programming language">Rust</a> or whatever, by looking up values in the config.

chris-morgan added a commit to chris-morgan/zola that referenced this issue Aug 11, 2019
Stop slugifying the page slug a second time
It’s already a slug because the user wrote it so—or if it doesn’t
conform to what we expect of slugs, the user did so *deliberately* by
naming the file thus. e.g. I have an article on my website with slug
`rust-cfg_attr`, and I definitely don’t want it turned into
`rust-cfg-attr`.

So there’s no need, or desire, to slugify it again.

You may notice I also dropped the `.trim()` here; that’s also
deliberate, because if you were mad enough to put whitespace at the
start or end of the slug, you must have meant it.

This restores consistency with sections, which were already not
slugified. Taxonomy terms are still slugified, for better or for worse
(worse, in my strong opinion, but I’ll address that another time).

Fixes getzola#768.
@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 12, 2019

I would like to have some feedback on the file slugification from other users but if you have time to make a patch it would be much appreciated. With or without the taxonomies option.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 15, 2019

taxonomy terms should have optional (enabled by default) slugification

I asked a French speaker on another issue for a tag named Décentralisation and they wanted to keep the é if possible so it seems that maybe the default should be to not slugify.

@southerntofu

This comment has been minimized.

Copy link

commented Aug 17, 2019

Hi, i had this problem so i proposed a patch in #781

When slugify is set to true in config, it does the slugification like before. Otherwise simple sanitation, removing & and / from the path as mentioned in this discussion. Removing / is not useful when parsing from filename but is when parsing from slug: foo/bar frontmatter. Maybe remove # too?

For the moment the setting defaults to false but in my opinion respecting user-supplied filenames out-of-the-box is less confusing to new users (especially non-english speakers).

In this patch the slugify config affects both page and taxonomy paths. Heading URI fragments (within the Markdown) are left untouched because i have no idea what should be done with those.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 17, 2019

I think it needs a few more things. To put all the requirements in a list:

  1. Filenames are never changed (except for 3)
  2. Taxonomies names are never changed unless config.slugify (or whatever name) is set to true (except 3). This config value defaults to false.
  3. Always remove all / and any other characters that can cause issues (#?, ??) for filenames/taxonomies/hardcoded slugs

@chris-morgan is that accurate?

One thing to ensure is that we get all the cases correctly since that's a big breaking change: some external links to people Zola site would break by default if they had non ascii characters.
I'm on holiday next week but if @southerntofu and @chris-morgan and anyone else reading this issue can work together on a patch that would be great.

@southerntofu

This comment has been minimized.

Copy link

commented Aug 19, 2019

some external links to people Zola site would break by default if they had non ascii characters

In order to avoid this, i'm proposing:

  1. Filenames are slugified if config.slugified is true, otherwise never changed (except for 3)
  2. slugs passed through a slug frontmatter are never changed (except for 3)
  3. Taxonomies names are slugified if config.slugify is set to true, otherwise never changed (except for 3).
  4. Always remove all / and any other characters that can cause issues (#?) for filenames/taxonomies/hardcoded slugs
  5. config.slugify defaults to false so UTF-8 URIs work by default

So turning on slugify in the config preserves previously-generated URLs, except for those generated from the slug frontmatter which i think should not have happened in the first place. Should those be slugified as well when the config is turned on? (that would enable 100% backwards-compatibility with the flip of a config setting)

The CHANGELOG could then contain a breaking change as such:

  • Slugification of page paths is now optional. By default, non-ASCII characters (international characters i.e. UTF8) will be retained in the generated pages. If you are currently using non-ASCII characters in your content file paths and want to keep the URLs Zola previously generated, set slugify = true in your config.toml. NOTE: even with slugify = true, slugs passed via the slug frontmatter key will not be slugified.

any other characters that can cause issues (#?, ??)

I have no idea about the #, but i'm confident we should leave the ? in place. It is an inoffensive character, and it's a common pattern to simulate GET queries via a ? within a static file's name, for example for cache-busting purposes.

I'm not sure about # because according to HTTP RFC the client never sends this part of the URI to the server so i don't think allowing it as part of a generated URL is a clever idea.

I'm on holiday next week

Enjoy! I'll be happy to prepare the patch for when you return. If @chris-morgan or others wants to contribute to the patch and docs, i'll take PRs here.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 22, 2019

but i'm confident we should leave the ? in place. It is an inoffensive character, and it's a common pattern to simulate GET queries via a ? within a static file's name, for example for cache-busting purposes.

I'm not sure if it can be in a path though. ? starts the query component so I don't think https://some-site.com/blog?/hello/ is a valid URL for Zola since the query would be ?/hello/ but this isn't what people using Zola would want.
I think we need to remove all of ?, /, # from paths.

As for the config name, I think slugify_paths might be better since slugify is a bit too generic.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 26, 2019

Another thing that will need to change is the slugification of anchors. Probably just a matter of replacing spaces with _ like Wikipedia does and leaving everything else the same?

@southerntofu

This comment has been minimized.

Copy link

commented Aug 26, 2019

I'm not sure if it can be in a path though. ? starts the query component

These "query components" are passed into the REQUEST_URI by the client. Some servers interpret that as GET parameters (as does PHP) but static web servers just don't care. Even typical dynamic web server setups will react to a index.php?foo=bar REQUEST_URI by looking up whether there's a file named exactly like that, otherwise pass the request to index.php.

So my opinion is if the web clients/servers allow it, and it is an obvious choice by the website maintainer to introduce such characters in their paths, why prevent it? It's in my opinion more confusing to users to just silently drop characters.

I think we need to remove all of ?, /, # from paths.

After thinking it through, i think only newlines and / should be removed from paths. Newlines because they cannot be spelled out in web browsers. / because it won't be extracted from the filesystem (it is forbidden there) and within the slug frontmatter it makes no sense (there is already path frontmatter for that).

As for the config name, I think slugify_paths might be better since slugify is a bit too generic. (...) slugification of anchors

OK i'll use a single slugify_paths for all that.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 26, 2019

link = document.createElement('a');
link.href = "http://some-site.com/hello?/ho";
link.pathname
> "/hello"

Same thing happens with # and with both in Python urlparse:

>>> parse.urlparse("http://some-site.com/hello?/ho")
ParseResult(scheme='http', netloc='some-site.com', path='/hello', params='', query='/ho', fragment='')
>>> parse.urlparse("http://some-site.com/hello#/ho")
ParseResult(scheme='http', netloc='some-site.com', path='/hello', params='', query='', fragment='/ho')

To me that's evidence that they need to be removed since even browsers and all programming languages will not understand them correctly. They would be valid urls but not in the form expected by the user so it seems dangerous to me to allow them.

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

For a project like this, I think it’s reasonable to hope that anyone that uses ? and # knows what they’re doing, and are using the proper access techniques, which must ensure that the URLs are encoded as necessary (I thereby exclude the gratuitous encode-anything-but-ASCII technique), so that ? becomes %3f and # becomes %23—something which must already be done for %, encoding it as %25.

Blocking / outright seems technically reasonable, though I could easily argue in the other direction too.

There are two main schools of slugification: the lower-kebab-case approach, most commonly paired with reduction to ASCII; and the Just_turn_spaces_into_underscores_and_leave_the_rest_alone approach as seen in things like Wikipedia. There is a place for each of these, and I would consider it reasonable to allow you to select slugification scheme for taxonomies and anchors.

For my part, I’m content with what I did in chris-morgan@b9fda1f.

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2019

Just_turn_spaces_into_underscores_and_leave_the_rest_alone approach as seen in things like Wikipedia

I think that's what generated id/anchors are going to get.

I'm pretty happy with the state of @southerntofu PR with the detailed changes in https://github.com/getzola/zola/pull/781/files#diff-eb8d99ad43cf3761c80af5ab717d516dR10

@chris-morgan

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2019

I expect all slugification to match the one scheme: either paths and anchors should all be lower-kebab-case, or paths and anchors should all be Just_space_changed. Changing the slugification scheme is of course a breaking change.

Maybe a config option like slug_scheme = "lower-kebab-case" or "Space_to_underscore", defaulting to "lower-kebab-case".

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Aug 29, 2019

If slugify_path is true, the slugification is the same as currently: remove all non ascii chars from paths and anchors.
If it's false, we only remove characters causing issues in URLs and on NTFS so no slugification there. I wanted to make it look like Wikipedia but I've just noticed they replaced their spaces by _ in the URLs as well, which is where I was getting the idea to do that for anchors.
I don't want to have 2 different slugifications process so maybe just not touch anchors at all when slugify_path is false.

@southerntofu

This comment has been minimized.

Copy link

commented Sep 3, 2019

OK so in regards to #781 i think we still have to:

  • maybe_slugify anchors
  • URL encode permalinks and anchors

Am i correct?

@Keats

This comment has been minimized.

Copy link
Collaborator

commented Sep 3, 2019

I think so, unless I forgot something.

@southerntofu

This comment has been minimized.

Copy link

commented Sep 3, 2019

OK.

As a reminder: docs should make it clear that path frontmatter is not URL-encoded (it is a valid NTFS path name) but permalink is.

@Keats Keats referenced this issue Sep 9, 2019
@Keats

This comment has been minimized.

Copy link
Collaborator

commented Sep 16, 2019

@southerntofu FYI I'll push that to 0.10 rather than 0.9 since it is a huge change that I want to get 100% right and deserves its own release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.