Header/id deduplication #2763

Merged
merged 11 commits into from May 21, 2017
2 changes: 2 additions & 0 deletions CHANGES.txt
@@ -4,6 +4,8 @@ New in master
Features
--------

* New ``deduplicate_ids``, for preventing duplication of HTML id
attributes (Issue #2570)
* New ``add_header_permalinks`` filter, for Sphinx-style header links
(Issue #2636)

7 changes: 6 additions & 1 deletion docs/manual.txt
@@ -1919,7 +1919,7 @@ add_header_permalinks
.headerlink { opacity: 0.1; margin-left: 0.2em; }
.headerlink:hover { opacity: 1; text-decoration: none; }

-Additionally, you can provide a custom list of XPath expressions which should be used for finding headers (``{hx}}`` is replaced by headers h1 through h6).
+Additionally, you can provide a custom list of XPath expressions which should be used for finding headers (``{hx}`` is replaced by headers h1 through h6).
This is required if you use a custom theme that does not use ``"e-content entry-content"`` as a class for post and page contents.

.. code:: python
@@ -1928,6 +1928,11 @@ add_header_permalinks
# Include *every* header (not recommended):
# HEADER_PERMALINKS_XPATH_LIST = ['*//{hx}']

deduplicate_ids
Prevent duplicated IDs in HTML output. An incrementing counter is added to
offending IDs. If used alongside ``add_header_permalinks``, it will fix
those links (it must run **after** that filter)

You can apply filters to specific posts or pages by using the ``filters`` metadata field:

.. code:: restructuredtext
2 changes: 1 addition & 1 deletion nikola/conf.py.in
@@ -587,7 +587,7 @@ GITHUB_COMMIT_SOURCE = True
# HTML_TIDY_EXECUTABLE = 'tidy5'

# List of XPath expressions which should be used for finding headers
-# ({hx}} is replaced by headers h1 through h6).
+# ({hx} is replaced by headers h1 through h6).
# You must change this if you use a custom theme that does not use
# "e-content entry-content" as a class for post and page contents.

35 changes: 35 additions & 0 deletions nikola/filters.py
@@ -437,3 +437,38 @@ def add_header_permalinks(data, xpath_list=None):
new_node = lxml.html.fragment_fromstring('<a href="#{0}" class="headerlink" title="Permalink to this heading">¶</a>'.format(hid))
node.append(new_node)
return lxml.html.tostring(doc, encoding="unicode")


@apply_to_text_file
def deduplicate_ids(data):
    """Post-process HTML via lxml to deduplicate IDs."""
    doc = lxml.html.document_fromstring(data)
    elements = doc.xpath('//*')
    all_ids = [element.attrib.get('id') for element in elements]
    seen_ids = set()
    duplicated_ids = set()
    for i in all_ids:
        if i is not None and i in seen_ids:
            duplicated_ids.add(i)
        else:
            seen_ids.add(i)

    if duplicated_ids:
        # Well, that sucks.
        for i in duplicated_ids:
            # Results are ordered the same way they are ordered in document
            offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
            counter = 2
            for e in offending_elements[1::-1]:
Contributor

Shouldn't this be offending_elements[::-1] without the first 1?

Member Author

Notice counter = 2? There will be foo, foo-2, foo-3

Contributor

But this only enumerates elements 1 and 0 of the list, not any other element. ['a', 'b', 'c', 'd', 'e'][1::-1] == ['b', 'a'].

You probably want offending_elements[-2::-1].
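
To make the slicing discussion above concrete, here is a small sketch using a plain list as a stand-in for `offending_elements` (the list contents and variable names are illustrative only). It shows what each slice yields. If the intent is to leave the first element untouched and number the rest in document order (`foo`, `foo-2`, `foo-3`, ...), a plain `[1:]` would do that; this is one reading of the thread, not necessarily what was merged.

```python
# Stand-in for offending_elements; the real list holds lxml elements.
items = ['a', 'b', 'c', 'd', 'e']

# Slice from the diff: starts at index 1 and walks backwards to index 0.
first_two_reversed = items[1::-1]      # ['b', 'a']

# Whole list reversed.
all_reversed = items[::-1]             # ['e', 'd', 'c', 'b', 'a']

# Starts at the second-to-last element and walks backwards.
all_but_last_reversed = items[-2::-1]  # ['d', 'c', 'b', 'a']

# Everything except the first element, in document order.
all_but_first = items[1:]              # ['b', 'c', 'd', 'e']

# Pairing the counter (starting at 2) with the all-but-first slice.
renamed = ['{0}-{1}'.format('foo', n)
           for n, _ in enumerate(all_but_first, start=2)]
# ['foo-2', 'foo-3', 'foo-4', 'foo-5']
```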

                new_id = '{0}-{1}'.format(i, counter)
Contributor

What happens if this ID is in use as well?

Member Author

Will fix.

Contributor

But trying until you find a free one will be a problem if you want to make permalinks (as for the Sphinx permalinks filter). Assume the two oldest entries have headers with IDs a and a. If there's nothing else around, one will get a and the other a-2. But now assume a new post is added which uses a-2 (for whatever reason). Then the old a-2 will end up as a-3, because a-2 is already taken, and links pointing to a in the second oldest post are suddenly pointing to the wrong place.

Member Author

This is one of those cases which we can’t reliably fix, at least as a filter. To make it work in every scenario, we’d need to add post names to the IDs, or otherwise do unusual stuff. The thing is, most people are unlikely to use those permalinks on indexes, and this plugin’s aims are (a) to fix HTML validation issues on indexes, and (b) to fix clashing IDs, for Sphinx permalinks and other uses, on a single page. You can’t protect against changing permalinks if those are maintained by code, not humans, whilst allowing said humans to edit the page contents (an edit to a post/page could trigger a deduplication).

Contributor

True. You'd need state to track the permalinks. We should mention that somewhere in the documentation, though. Otherwise we'll sooner or later get bug reports for that :)
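
The trade-off discussed in this thread can be illustrated with a stateless sketch that works on a plain list of IDs in document order. The function name `assign_unique_ids` is hypothetical and this is not the code from the PR; it assumes the "try suffixes until one is free" strategy mentioned above. The first occurrence of an ID is kept, and each later duplicate receives the lowest `-N` suffix (starting at 2) that is not used anywhere in the document.

```python
def assign_unique_ids(ids):
    """Return a copy of ids (document order) with duplicates renamed.

    Hypothetical helper for illustration only.
    """
    taken = set(ids)  # every ID present anywhere in the document
    seen = set()      # IDs already emitted
    result = []
    for current in ids:
        if current not in seen:
            # First occurrence keeps its ID.
            seen.add(current)
            result.append(current)
            continue
        # Duplicate: try -2, -3, ... until a free ID is found.
        counter = 2
        candidate = '{0}-{1}'.format(current, counter)
        while candidate in taken:
            counter += 1
            candidate = '{0}-{1}'.format(current, counter)
        taken.add(candidate)
        seen.add(candidate)
        result.append(candidate)
    return result


# Two old posts with the same header ID:
print(assign_unique_ids(['a', 'a']))         # ['a', 'a-2']

# A newer post that already uses 'a-2' appears first on the index:
print(assign_unique_ids(['a-2', 'a', 'a']))  # ['a-2', 'a', 'a-3']
```

The second call shows the instability described above: the header that used to be `a-2` becomes `a-3` once another element claims `a-2`, so an existing link to `#a-2` now lands elsewhere. Avoiding that would need persistent state or IDs derived from something stable, such as the post name.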

                e.attrib['id'] = new_id
                counter += 1
                # Find headerlinks that we can fix.
                headerlinks = e.find_class('headerlink')
                for hl in headerlinks:
                    # We might get headerlinks of child elements
Contributor

If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

Member Author

How do you suggest fixing that? A break?
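
One possible approach, sketched here under assumptions rather than taken from the PR: when collecting headerlinks for a renamed element, do not descend into child elements that carry their own `id`, since links inside those subtrees belong to the child. The helper names `own_headerlinks` and `retarget_headerlinks` are hypothetical. The example uses the standard library's `xml.etree.ElementTree` to stay self-contained; it relies only on child iteration and `.get()`/`.set()`, so it is expected, but not verified here, to carry over to lxml elements.

```python
import xml.etree.ElementTree as ET


def own_headerlinks(element):
    """Yield headerlink anchors under element, skipping subtrees with their own id."""
    for child in element:
        if child.get('id') is not None:
            # This subtree belongs to a different (possibly identical) ID.
            continue
        if 'headerlink' in child.get('class', '').split():
            yield child
        yield from own_headerlinks(child)


def retarget_headerlinks(element, old_id, new_id):
    """Point the element's own headerlinks at new_id instead of old_id."""
    for link in own_headerlinks(element):
        if link.get('href') == '#' + old_id:
            link.set('href', '#' + new_id)


# An outer element and a nested child sharing the ID "foo".
html = ('<div id="foo">'
        '<h1>Outer<a class="headerlink" href="#foo">¶</a></h1>'
        '<div id="foo"><h2>Inner<a class="headerlink" href="#foo">¶</a></h2></div>'
        '</div>')
root = ET.fromstring(html)

# Rename the outer element only; the nested link must stay untouched.
root.set('id', 'foo-2')
retarget_headerlinks(root, 'foo', 'foo-2')
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)  # ['#foo-2', '#foo']
```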

                        if hl.attrib['href'] == '#' + i:
                            hl.attrib['href'] = '#' + new_id
        return lxml.html.tostring(doc, encoding='unicode')
    else:
        return data