New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Header/id deduplication #2763
Header/id deduplication #2763
Changes from 3 commits
0c1dd74
15ffedb
81c337c
79a3914
18c76b1
91204c8
83354b7
7c68969
e3eaca0
0439676
5237119
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -437,3 +437,38 @@ def add_header_permalinks(data, xpath_list=None): | |
new_node = lxml.html.fragment_fromstring('<a href="#{0}" class="headerlink" title="Permalink to this heading">¶</a>'.format(hid)) | ||
node.append(new_node) | ||
return lxml.html.tostring(doc, encoding="unicode") | ||
|
||
|
||
@apply_to_text_file | ||
def deduplicate_ids(data): | ||
"""Post-process HTML via lxml to deduplicate IDs.""" | ||
doc = lxml.html.document_fromstring(data) | ||
elements = doc.xpath('//*') | ||
all_ids = [element.attrib.get('id') for element in elements] | ||
seen_ids = set() | ||
duplicated_ids = set() | ||
for i in all_ids: | ||
if i is not None and i in seen_ids: | ||
duplicated_ids.add(i) | ||
else: | ||
seen_ids.add(i) | ||
|
||
if duplicated_ids: | ||
# Well, that sucks. | ||
for i in duplicated_ids: | ||
# Results are ordered the same way they are ordered in document | ||
offending_elements = doc.xpath('//*[@id="{}"]'.format(i)) | ||
counter = 2 | ||
for e in offending_elements[1::-1]: | ||
new_id = '{0}-{1}'.format(i, counter) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens if this is ID is in use as well? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will fix. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But trying until you find a free one will be a problem if you want to make permalinks (as for the Sphinx permalinks filter). Assume the two oldest entries have headers with IDs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is one of those cases which we can’t reliably fix, at least as a filter. To make it work in every scenario, we’d need to add post names to the IDs, or otherwise do unusual stuff. The thing is, most people are unlikely to use those permalinks on indexes, and this plugin’s aims are (a) to fix HTML validation issues on indexes, (b) to fix IDs for Sphinx permalinks and other uses clashing, on a single page. You can’t protect against changing permalinks if those are maintained by code, not humans, whilst allowing said humans to edit the page contents (an edit to a post/page could trigger a deduplication) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True. You'd need state to track the permalinks. We should mention that somewhere in the documenation, though. Otherwise we'll sooner or later get bug reports for that :) |
||
e.attrib['id'] = new_id | ||
counter += 1 | ||
# Find headerlinks that we can fix. | ||
headerlinks = e.find_class('headerlink') | ||
for hl in headerlinks: | ||
# We might get headerlinks of child elements | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If one of the header links belongs to a child element with the same ID, you change the link to something wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do you suggest to fix that? A |
||
if hl.attrib['href'] == '#' + i: | ||
hl.attrib['href'] = '#' + new_id | ||
return lxml.html.tostring(doc, encoding='unicode') | ||
else: | ||
return data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be
offending_elements[::-1]
without the first1
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice
counter = 2
? There will befoo
,foo-2
,foo-3
…There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this only enumerates elements 1 and 0 of the list, not any other element.
['a', 'b', 'c', 'd', 'e'][1::-1] == ['b', 'a']
.You probably want
offending_elements[-2::-1]
.