Docutils: @.* gets converted to & in static filenames #2646

Closed
gergelypolonkai opened this issue Nov 2, 2019 · 13 comments · Fixed by #2812

@gergelypolonkai

I’m currently migrating my blog from Jekyll to Pelican and trying hard to preserve old filenames. One of these is my PGP key, called gergely@polonkai.eu.asc. In my about page I added:

You can download it `here <{static}../gergely@polonkai.eu.asc>`_

.. include:: ../gergely@polonkai.eu.asc
   :code: text

The file gets included (the content appears on the page) and the link on the generated page references the correct filename, but the file does not exist in the output, and I get this warning:

WARNING: Cannot get modification stamp for /home/polesz/Projects/blog/content/gergely&
  | 	FileNotFoundError: [Errno 2] No such file or directory: '/home/polesz/Projects/blog/content/gergely&'

If I escape the @ character as %40 or with a backslash (\@), the problem remains.

@gergelypolonkai
Author

The problem happens in Content._link_replacer, where m.group('value') becomes ../gergely&#64;polonkai.eu.asc which gets URL parsed to

ParseResult(scheme='', netloc='', path='../gergely&', params='', query='', fragment='64;polonkai.eu.asc')
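The parse result above can be reproduced with the standard library alone. This is a minimal sketch, assuming the docutils-rendered value reaches `urllib.parse.urlparse` unchanged; the variable names are illustrative and not taken from Pelican's code:

```python
from urllib.parse import urlparse

# Value as it looks after docutils has rendered "@" as the entity "&#64;".
value = "../gergely&#64;polonkai.eu.asc"

parsed = urlparse(value)

# The "#" inside the entity is treated as the start of a URL fragment,
# so the path is cut off right after the ampersand.
print(parsed.path)      # ../gergely&
print(parsed.fragment)  # 64;polonkai.eu.asc
```

This explains why the warning mentions a file named `gergely&`: only the path component is used for the lookup.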

As a workaround, if I add gergely@polonkai.eu.asc to STATIC_PATHS, it gets copied to the output directory (thus, the generated link works), but the warning persists.
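A sketch of that workaround in `pelicanconf.py`, assuming the key sits directly in the content directory (as the path in the warning suggests). The `images` entry is only an assumed example of a pre-existing static path:

```python
# pelicanconf.py (fragment)
# Paths are relative to the content directory.
STATIC_PATHS = [
    "images",                   # assumed existing entry, for illustration only
    "gergely@polonkai.eu.asc",  # copied verbatim to the output directory
]
```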

@oulenz
Contributor

oulenz commented Nov 15, 2019

I traced this back to docutils. I.e., the @ is replaced by &#64; when the about page is parsed.

What I don't know is whether &#64; is generally the best way to render @ in html. If yes, we could probably handle @ in filenames similarly to how we already handle spaces. If no, then there might be a way to force docutils to leave @ intact.

@avaris
Member

avaris commented Nov 15, 2019

Well, that's a bit odd behavior of docutils. &#64; would be the HTML encoding of @, but in a link you should URL encode if you encode it at all. That would be %40 instead.

That aside, I don't know why this does naive replacement instead of urllib.parse.unquote, which could handle %40 as well. However, it won't solve this issue since docutils is doing a different encoding. We could also do html.unescape, but that just feels wrong.
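To make the difference between the two encodings concrete, here is a small sketch using only the standard library. It shows what each decoder does with each form of the filename; it is not how Pelican handles the value:

```python
import html
from urllib.parse import unquote

url_encoded = "../gergely%40polonkai.eu.asc"     # what a URL-encoded link would contain
html_encoded = "../gergely&#64;polonkai.eu.asc"  # what docutils actually emits

# unquote() decodes percent-escapes but leaves HTML entities untouched.
print(unquote(url_encoded))   # ../gergely@polonkai.eu.asc
print(unquote(html_encoded))  # ../gergely&#64;polonkai.eu.asc

# html.unescape() decodes the numeric character reference.
print(html.unescape(html_encoded))  # ../gergely@polonkai.eu.asc
```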

@oulenz
Contributor

oulenz commented Nov 16, 2019

Do you know whether docutils does URL encoding generally? I've tried to look for it but couldn't find it.

@stale

stale bot commented Jan 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your participation and understanding.

@stale stale bot added the stale Marked for closure due to inactivity label Jan 15, 2020
@gergelypolonkai
Author

@avaris in that case, should I report this as a docutils bug?

@stale stale bot removed the stale Marked for closure due to inactivity label Jan 15, 2020
@stale

stale bot commented Mar 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your participation and understanding.

@stale stale bot added the stale Marked for closure due to inactivity label Mar 15, 2020
@gergelypolonkai
Author

Can/should this be fixed on Pelican's side, or should it be reported upstream in docutils instead?

@stale stale bot removed the stale Marked for closure due to inactivity label Mar 15, 2020
@justinmayer justinmayer changed the title @.* gets converted to & in static filenames Docutils: @.* gets converted to & in static filenames Apr 30, 2020
@justinmayer
Member

@gergelypolonkai: I suggest reporting this to Docutils.

For whatever value it may have, I found several references to Docutils encoding the @ symbol to &#64;:

https://chromium.googlesource.com/infra/third_party/docutils/+/refs/heads/master/docutils/docutils/writers/_html_base.py#290

https://docutils.sourceforge.io/docs/user/config.html#id137

From the changelog, back in 2003:

Added @ to &#64; encoding to thwart address harvesters.

@gergelypolonkai
Author

It seems that this mangling of email addresses is done only if the cloak_email_addresses setting is enabled (however that is done) and the text is inside a mailto: link:

    def visit_Text(self, node):
        text = node.astext()
        encoded = self.encode(text)
        if self.in_mailto and self.settings.cloak_email_addresses:
            encoded = self.cloak_email(encoded)
        self.body.append(encoded)

Off to the docutils issue board, then.

@avaris avaris self-assigned this May 3, 2020
@avaris
Member

avaris commented May 3, 2020

Based on the response, it seems like it's a feature, not a bug ™. But I agree with the comment that we can be a bit more robust. Extending the search alternatives to URL-decoded versions was probably required anyway, and we can add the HTML-unescaped versions as well.
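The idea of searching several decoded alternatives could look roughly like the sketch below. `candidate_paths` is a hypothetical helper written for illustration; it is not the code from the linked PR and makes no claim about how Pelican implements the lookup:

```python
import html
from urllib.parse import unquote, urlparse


def candidate_paths(raw_value):
    """Yield de-duplicated lookup candidates for a link target.

    Hypothetical helper: tries the value as-is and HTML-unescaped,
    each both with and without URL decoding of the path.
    """
    seen = set()
    # Unescape before URL parsing, otherwise the "#" in "&#64;"
    # is mistaken for the start of a fragment.
    for value in (raw_value, html.unescape(raw_value)):
        path = urlparse(value).path
        for candidate in (path, unquote(path)):
            if candidate not in seen:
                seen.add(candidate)
                yield candidate


from_html = list(candidate_paths("../gergely&#64;polonkai.eu.asc"))
from_url = list(candidate_paths("../gergely%40polonkai.eu.asc"))
print(from_html)
print(from_url)
```

With both inputs, the real filename `../gergely@polonkai.eu.asc` ends up among the candidates, so a lookup that tries each candidate in turn would find the file.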

@justinmayer
Member

Thanks to @avaris working on the above-linked PR, this issue should now be addressed. Feel free to test latest master and post a comment here if any follow-up changes are deemed to be warranted.

@justinmayer
Member

Fix for this issue is included in the just-released Pelican 4.5.1. ✨
