Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header/id deduplication #2763

Merged
merged 11 commits into from May 21, 2017

Conversation

Projects
None yet
3 participants
@Kwpolska
Copy link
Member

commented May 14, 2017

Fixes #2570. Also contains a doc fix for #2743, because I didn’t notice which branch I was committing to.

Should I bother with <a name=""> as well?

cc @wichtounet

Kwpolska added some commits May 14, 2017

Fix #2570 -- new deduplicate_ids filter
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Deduplicate headers bottom-up
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Fix braces in docs
Signed-off-by: Chris Warrick <kwpolska@gmail.com>

@Kwpolska Kwpolska requested review from ralsina and felixfontein May 14, 2017

@Kwpolska Kwpolska added this to the v7.8.6 milestone May 14, 2017

# Results are ordered the same way they are ordered in document
offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
counter = 2
for e in offending_elements[1::-1]:

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 14, 2017

Contributor

Shouldn't this be offending_elements[::-1] without the first 1?

This comment has been minimized.

Copy link
@Kwpolska

Kwpolska May 14, 2017

Author Member

Notice counter = 2? There will be foo, foo-2, foo-3

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 14, 2017

Contributor

But this only enumerates elements 1 and 0 of the list, not any other element. ['a', 'b', 'c', 'd', 'e'][1::-1] == ['b', 'a'].

You probably want offending_elements[-2::-1].

offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
counter = 2
for e in offending_elements[1::-1]:
new_id = '{0}-{1}'.format(i, counter)

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 14, 2017

Contributor

What happens if this is ID is in use as well?

This comment has been minimized.

Copy link
@Kwpolska

Kwpolska May 14, 2017

Author Member

Will fix.

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 14, 2017

Contributor

But trying until you find a free one will be a problem if you want to make permalinks (as for the Sphinx permalinks filter). Assume the two oldest entries have headers with IDs a and a. If there's nothing else around, one will get a and the other a-2. But now assume a new post is added which uses a-2 (for whatever reason). Then the old a-2 will end up as a-3, because a-2 is already taken, and links pointing to a in the second oldest post are suddenly pointing to the wrong place.

This comment has been minimized.

Copy link
@Kwpolska

Kwpolska May 14, 2017

Author Member

This is one of those cases which we can’t reliably fix, at least as a filter. To make it work in every scenario, we’d need to add post names to the IDs, or otherwise do unusual stuff. The thing is, most people are unlikely to use those permalinks on indexes, and this plugin’s aims are (a) to fix HTML validation issues on indexes, (b) to fix IDs for Sphinx permalinks and other uses clashing, on a single page. You can’t protect against changing permalinks if those are maintained by code, not humans, whilst allowing said humans to edit the page contents (an edit to a post/page could trigger a deduplication)

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 14, 2017

Contributor

True. You'd need state to track the permalinks. We should mention that somewhere in the documenation, though. Otherwise we'll sooner or later get bug reports for that :)

Try new IDs until a free one is found
Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Kwpolska added some commits May 14, 2017

Testing slices with 3 elements isn’t always right
h/t @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Smarter ID rewrite ordering
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska

This comment has been minimized.

Copy link
Member Author

commented May 15, 2017

I updated the documentation and used a different approach: going bottom→top in indexes, and top→bottom in posts/pages. This way, we can get a sensible ordering for both use cases.

@ralsina

This comment has been minimized.

Copy link
Member

commented May 17, 2017

@Kwpolska I resolved the conflicts, hope I did not mess it, feel free to push merge

@felixfontein
Copy link
Contributor

left a comment

LGTM except... :)

off = offending_elements[-2::-1]
for e in off:
new_id = i
while doc.xpath('//*[@id="{}"]'.format(new_id)):

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 17, 2017

Contributor

Why not use (and later update) seen_ids instead of hoping that the query runs as fast as possible?

This comment has been minimized.

Copy link
@Kwpolska

Kwpolska May 17, 2017

Author Member

Makes sense, fixed.

# Find headerlinks that we can fix.
headerlinks = e.find_class('headerlink')
for hl in headerlinks:
# We might get headerlinks of child elements

This comment has been minimized.

Copy link
@felixfontein

felixfontein May 17, 2017

Contributor

If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

This comment has been minimized.

Copy link
@Kwpolska

Kwpolska May 17, 2017

Author Member

How do you suggest to fix that? A break?

Use seen_ids instead of XPath to find existing IDs
h/t @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska

This comment has been minimized.

Copy link
Member Author

commented May 20, 2017

@felixfontein, please explain, and perhaps suggest a solution to:

If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

@felixfontein

This comment has been minimized.

Copy link
Contributor

commented May 20, 2017

Maybe something like:

<div id="the-content">
  [...]
  <h1 id="the-content">The Content</h1><a class="headerlink" href="#the-content">¶</a>
</div>

(And there's another the-content ID somewhere so that both these end up in off.) If the code processes them from top to bottom, it will change it to the following:

<div id="the-content-2">
  [...]
  <h1 id="the-content-3">The Content</h1><a class="headerlink" href="#the-content-2">¶</a>
</div>

because the <div> is processed first.

To avoid this, you probably have to check if there's another the-content ID between the headerlink element and the element whose ID we're currently processing.

Kwpolska added some commits May 20, 2017

Update at most 1 headerlink
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska

This comment has been minimized.

Copy link
Member Author

commented May 20, 2017

It sounds as if doing a break when we find the first headerlink will work, because those are supposed to be the first headerlink in a given header. (Using headerlinks[0] would work in 99.9% of cases, I believe). @felixfontein, please review again.

@felixfontein

This comment has been minimized.

Copy link
Contributor

commented May 20, 2017

In the case I sketched above it won't work, because the <div> has no headerlink (because it is no header), so still the wrong headerlink will be "fixed". The problem here is that the header in the post happens end up with the same ID (because of it's title) than something generated by the theme (the <div id="the-content">).

@Kwpolska

This comment has been minimized.

Copy link
Member Author

commented May 20, 2017

Please propose a better solution then? I’m inclined to say “edge case not worth fixing”.

@felixfontein

This comment has been minimized.

Copy link
Contributor

commented May 20, 2017

Is there an efficient way to find the first parent object of hl whose ID is id? If yes, you could check whether it is e.

Undo glitch in add_header_permalinks
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska

This comment has been minimized.

Copy link
Member Author

commented May 20, 2017

I’m afraid this might not be efficient enough, and that would be needed only for an edge case. This happening in real input is very unlikely. (reST won’t ever generate it, for example)

@felixfontein

This comment has been minimized.

Copy link
Contributor

commented May 20, 2017

Just manually walking up the DOM tree shouldn't be very slow. Usually DOM trees aren't excessively deep that you really get a performance hit. And since this search has only to be done for very few elements, it shouldn't really matter.

And yes, while this is an edge case, it is a bug and really shitty for a user if this happens. And if we ever fix this, it might break some existing links where someone might have worked around this bug.

In my opinion, we should get this filter into a form that should never require any changes, because every change might break existing links, which are supposed to be permalinks. But then, I probably won't be using it (in particular the headerlink feature), so I won't really mind if you decide to ignore this edge case :)

Edit: I just noticed that this only screws up the header links, but not the IDs themselves. So this won't destroy permalinks later on, it will just generate incorrect header links. So I'm really fine with leaving this as-is. On the other hand, walking up the DOM tree isn't that hard and inefficient either.

@felixfontein

This comment has been minimized.

Copy link
Contributor

commented May 20, 2017

Ok, just thought of one reason where this can happen more realistically in real life: if someone decides to name a subsection the same as a section. Some people tend to do that. And if you don't use a compiler which prevents colliding IDs, you have a problem.

Edit: ah, I forgot about the break, that'll of course prevent this.

@Kwpolska Kwpolska merged commit 26af25c into master May 21, 2017

5 checks passed

codacy/pr Good work! A positive pull request.
Details
continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@Kwpolska Kwpolska deleted the header-deduplication branch May 21, 2017

@Kwpolska

This comment has been minimized.

Copy link
Member Author

commented May 21, 2017

Thanks, merged. Will rebuild getnikola.com to use this in a second.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.