Missing index.html within link generated to designate the canonical URL #483

Closed
amotl opened this issue Apr 29, 2024 · 2 comments


amotl commented Apr 29, 2024

Problem

On a page rendered from an index.rst or index.md file, like this one about built-in functions, the index.html page name is omitted from the rendered <link rel="canonical"> element:

<link rel="canonical" href="https://cratedb.com/docs/crate/reference/en/latest/general/builtins/" />

This flaw causes downstream problems, most visibly pages being reported as not indexed, as detailed below.
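To check a rendered page for this symptom, a minimal sketch using only the Python standard library could look like the following. The class and function names are hypothetical helpers made up for illustration; they are not part of Sphinx, RTD, or the theme.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_canonical = tag == "link" and attributes.get("rel") == "canonical"
        if is_canonical and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


def omits_index_page(canonical_url):
    """True if the canonical URL names a directory instead of a file."""
    return canonical_url.endswith("/")


snippet = (
    '<link rel="canonical" '
    'href="https://cratedb.com/docs/crate/reference/en/latest/general/builtins/" />'
)
url = find_canonical(snippet)
print(url, omits_index_page(url))
```

For the snippet quoted above, `omits_index_page` reports that the canonical URL stops at the directory and lacks the `index.html` file name.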

Details

@msbt is outlining more details about the problem. Thanks!

The large number of non-indexed pages is a result of our docs setup. The top 2 non-Google-issues (Alternate page with proper canonical tag and Page with redirect) are mostly caused by the redirect chains and versioning that we have in place.

If you take this URL as an example:
https://cratedb.com/docs/crate/reference/en/master/general/builtins/subquery-expressions.html

The page also exists in these (and probably some more) versions:
https://cratedb.com/docs/crate/reference/en/5.6/general/builtins/subquery-expressions.html
https://cratedb.com/docs/crate/reference/en/5.5/general/builtins/subquery-expressions.html

Both links above have this URL set as canonical:
https://cratedb.com/docs/crate/reference/en/latest/general/builtins/subquery-expressions.html

This will obviously result in a lot of unindexed pages. Not sure if this can be fixed, since it's not really broken.
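The version-to-latest mapping described here can be sketched roughly as follows. This is only an illustration: the regular expression assumes the version segment directly follows an `/en/` language segment, as in the URLs above, and `to_latest` is a hypothetical helper, not something RTD provides.

```python
import re

# Assumption: the version is the path segment right after "/en/".
VERSION_SEGMENT = re.compile(r"/en/[^/]+/")


def to_latest(url):
    """Rewrite the version segment of a docs URL to "latest"."""
    return VERSION_SEGMENT.sub("/en/latest/", url, count=1)


versioned = [
    "https://cratedb.com/docs/crate/reference/en/master/general/builtins/subquery-expressions.html",
    "https://cratedb.com/docs/crate/reference/en/5.6/general/builtins/subquery-expressions.html",
    "https://cratedb.com/docs/crate/reference/en/5.5/general/builtins/subquery-expressions.html",
]

# All versioned variants collapse onto a single canonical URL.
canonicals = {to_latest(url) for url in versioned}
print(canonicals)
```

Every versioned page points at the same canonical target, so only that one target is a candidate for indexing and the versioned copies are reported as alternates.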

To show an example of your links, this URL shows as not-indexed because of "Alternate page with proper canonical tag":
https://cratedb.com/docs/guide/install/cloud/aws/index.html

If you inspect that URL, you can see the "User-declared canonical" is:
https://cratedb.com/docs/guide/install/cloud/aws/ (which is indexed)

So the index.html gets omitted by RTD and every docs-page ending with index.html gets a not-indexed issue attached to it. Can we maybe add that to the canonical URL to avoid that @amotl?
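The difference between the two canonical forms can be illustrated with a small sketch. Both functions are hypothetical and only model the behaviour described above; the base URL is inferred from the example link, and none of this is taken from the RTD or theme source code.

```python
def canonical_with_index(base_url, pagename, suffix=".html"):
    """Build a canonical URL that always carries the file name."""
    return base_url.rstrip("/") + "/" + pagename + suffix


def canonical_without_index(base_url, pagename, suffix=".html"):
    """Model the reported behaviour: drop "index.html" on index pages."""
    url = canonical_with_index(base_url, pagename, suffix)
    index_name = "index" + suffix
    if url.endswith("/" + index_name):
        return url[: -len(index_name)]
    return url


base = "https://cratedb.com/docs/guide/"  # assumed base URL
page = "install/cloud/aws/index"

print(canonical_with_index(base, page))
print(canonical_without_index(base, page))
```

With the first form, the canonical URL matches the URL that is actually served, so the page would no longer be classified as an alternate of a different URL.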

References

/cc @matkuliak, @michaelkremmel


amotl commented Apr 29, 2024

Observations

We did a few orientation flights on this topic together with @msbt, and came to the conclusion that RTD might have already deprecated the "canonical_url" setting, as it may only have been required for Sphinx versions before 1.8 and the RTD releases of that time.

Today, it is advised to use html_baseurl:

For sphinx >=1.8 we can use html_baseurl to set the canonical URL.

-- https://github.com/readthedocs/readthedocs.org/pull/7540/files

... but not to define it yourself:

If you are using Sphinx, Read the Docs will automatically add a default value of the html_baseurl setting matching your canonical domain.

If you are using a custom html_baseurl in your conf.py, you have to ensure that the value is correct. This can be complex, supporting pull request builds (which are published on a separate domain), special branches or if you are using subprojects or translations. We recommend not including a html_baseurl in your conf.py, and letting Read the Docs define it.

-- https://docs.readthedocs.io/en/stable/guides/canonical-urls.html
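For a project that nevertheless has to set html_baseurl itself, a conf.py fragment might look like the sketch below. It assumes that the RTD build environment exports a READTHEDOCS_CANONICAL_URL variable; please verify this against the current RTD documentation before relying on it. Outside of RTD, the fragment falls back to an empty string, which is the Sphinx default for html_baseurl.

```python
# conf.py (fragment)
import os

# Assumption: READTHEDOCS_CANONICAL_URL is provided by the Read the Docs
# build environment. If it is absent (e.g. on a local build), keep the
# Sphinx default, which is an empty string.
html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "")
```

Reading the value from the environment avoids hard-coding a domain, which is the part the guide above describes as difficult to get right for pull request builds, subprojects, and translations.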

Thoughts

In this case, the section in readthedocs-insert.html.tmpl might actually be a backward-compatibility thing?

References I

We are not sure if each one of them is relevant. However, all are about fixing or improving the situation wrt. canonical links, in one way or another. In this spirit, I am enumerating them here, because there is a chance we missed something on the upgrade path since Sphinx 1.8 (released in 2018).

References II

Also discovered those, from 2023.


amotl commented May 6, 2024

Through some cleanups and refactorings, we removed some configuration overhead, and fixed the issue described above, still using a few Crate-specific workarounds.

The improvements have been released with version 0.31.2. Thanks for your excellent support, @msbt! 💯

amotl closed this as completed May 6, 2024