
Make 'latest' remain in URL instead of 9_0 #846

Merged

Conversation

janhoy
Contributor

@janhoy janhoy commented May 9, 2022

Spinoff from #77 - do not rewrite 'latest' to '9_0', but the opposite, so that people are encouraged to share 'latest' links.
Note that it will still be possible to share an explicit link to the 9.0 version, which will be sure to route to the 9.0 guide. It's just that the default will be 'latest' when working with the latest version.

This may perhaps also help boost the PageRank of 'latest' links in search engines?
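For reference, a minimal sketch of what this amounts to in antora-playbook.yml, assuming the keys and values documented by Antora (the exact values used are in this PR's diff):

```yaml
urls:
  # use 'latest' as the URL segment for the newest version ...
  latest_version_segment: latest
  # ... and redirect the versioned URLs (e.g. 9_0) to the 'latest' URLs,
  # rather than the other way around
  latest_version_segment_strategy: redirect:from
```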

@janhoy janhoy added the documentation label May 9, 2022
@uschindler
Contributor

uschindler commented May 9, 2022

I can't really give a review on this PR as I don't know Antora.

If you want my comment: I don't think this will change Google's ranking; there are pros and cons:

  • Having a stable "latest" version link is good because longer-living pages may appear at the top of search results, so people will use them. But there may also be the problem of older links disappearing from Google because they are no longer referenced: while they are live they are redirected to "latest", and once they are no longer alive they are invisible, unless we link to them explicitly.
  • Always redirecting to "latest" seems bad to me, as it makes it impossible to add permalinks. Or is there a possibility to get some "permalink" button on each page, so somebody citing a specific page can make a persistent link?

So I have mixed feelings. Coming from "science", where persistent URLs for each version are important, I tend to think that redirecting to "latest" is not the best idea. From a business person's perspective, of course, linking to latest is fine.

If both work - no redirect, and both pages visible next to each other with separate URLs - I would be happy. But the page should have a "canonical URL" meta header to inform Google about the duplicates and which version is the one to "bookmark" (the versioned one).
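For context, such a canonical URL meta header is a single link element in each page's head; a sketch with an illustrative versioned URL:

```html
<link rel="canonical" href="https://solr.apache.org/guide/9_0/index.html">
```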

@janhoy
Contributor Author

janhoy commented May 10, 2022

Ideally I'd also want both '9_0' and 'latest' to work, but I cannot see that as a choice at https://docs.antora.org/antora/latest/playbook/urls-latest-version-segment-strategy/#key?

Antora supports a canonical URL header: https://docs.antora.org/antora/latest/playbook/site-url/#canonical-url which is good news. So if we redirect to "latest" but the canonical remains "9_0", then we could be good?
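Per that page, Antora derives the canonical link from the playbook's site url key; a minimal sketch (note that Antora points the canonical at the latest version of each page, as confirmed further down in this thread):

```yaml
site:
  # when site.url is set, Antora emits a <link rel="canonical"> on every page,
  # pointing at the latest version of that page
  url: https://solr.apache.org/guide
```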

@uschindler
Contributor

uschindler commented May 11, 2022

Antora supports a canonical URL header: https://docs.antora.org/antora/latest/playbook/site-url/#canonical-url which is good news. So if we redirect to "latest" but the canonical remains "9_0", then we could be good?

Maybe do it the other way round: if the canonical URL is always "latest" (as described in the documentation), then Google would forget all old versions and only show links to latest. That's actually a good thing and would solve our problems.

We should maybe think of patching all old pages with a canonical link, too. Or, much better than patching, we could add an HTTP "Link:" header (see the Google docs above) to the .htaccess, where we link all 8.x pages to the latest 8.11 ref guide at the HTTP level. Same for 7.x and 6.x. This would at least remove all variants from Google except the latest version of each major release.
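A hypothetical sketch of that HTTP-level canonical. Note that <LocationMatch> is only valid in server config, not .htaccess; it relies on Apache 2.4.8+ exposing named regex captures as MATCH_* environment variables, and assumes page names are stable across 8.x:

```apache
# point every 8.x guide page at its 8_11 counterpart via a rel="canonical" Link header
<LocationMatch "^/guide/8_[0-9]+/(?<page>.+\.html)$">
  Header set Link '<https://solr.apache.org/guide/8_11/%{MATCH_PAGE}e>; rel="canonical"'
</LocationMatch>
```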

@HoustonPutman
Contributor

Maybe do it the other way round: if the canonical URL is always "latest" (as described in the documentation), then Google would forget all old versions and only show links to latest. That's actually a good thing and would solve our problems.

+1, we want latest to be in the Google results.

I have tested this, and the current logic will set the canonical link to latest, not 9_0. The version selection tool also links to the correct name of the page in each version, if the page has been renamed. So this is exactly the logic that we want.

  • Always redirecting to "latest" seems bad to me, as it makes it impossible to add permalinks. Or is there a possibility to get some "permalink" button on each page, so somebody citing a specific page can make a persistent link?

We should certainly add this, and it shouldn't be hard to do. I am often annoyed with the AWS docs when trying to link to the specific latest version.

We should maybe think of patching all old pages with a canonical link, too. Or, much better than patching, we could add an HTTP "Link:" header (see the Google docs above) to the .htaccess, where we link all 8.x pages to the latest 8.11 ref guide at the HTTP level. Same for 7.x and 6.x. This would at least remove all variants from Google except the latest version of each major release.

We definitely need to do something about the Solr 6-8 releases. Generally, only the pages that don't exist in latest (after redirects) should be indexed in Google. I think that will be hard to do in general, though we can go back and do it. In my opinion, we can probably do a blanket robots file that says don't index the old ref-guide pages. It will make some information not searchable, but there should be very few pages that were removed before 9.0.

Somewhat sane suggestion: we make a robots.txt that disallows scraping all 6-8 ref guides. Then we create exceptions for the pages that have been removed in a certain version. So if a page (like autoscaling) was removed in 9.0, we create an exception that allows scraping old-guide/8_11/autoscaling.html; if a page was removed in 8.5, we allow old-guide/8_4/page.html. There shouldn't be too many of these pages, so we can go through and add the exceptions manually.
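A minimal robots.txt sketch of that scheme (paths illustrative, following the examples above; Google honors Allow, and the more specific rule wins over a matching Disallow):

```txt
User-agent: *
Disallow: /guide/6_
Disallow: /guide/7_
Disallow: /guide/8_
# exceptions for pages that no longer exist in 9.0
Allow: /guide/8_11/autoscaling.html
```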

@janhoy
Contributor Author

janhoy commented May 11, 2022

I have a list of pages that once existed but no longer do in 9.0:
https://github.com/apache/solr/pull/596/files#diff-ebf3a521b24b4139995e9e70b7aeffc202df3152e84f4dd46d17d6649f343834R97

@uschindler
Contributor

uschindler commented May 11, 2022

A robots.txt to hide the old releases looks like a good idea. We can just list all URL prefixes and we're done. Explicitly allowing some older pages could also be done.

Instead of an old-style robots.txt, we may also add a <LocationMatch "^/guide/(6|7|8)_"> to the .htaccess with a Header add "X-Robots-Tag: noindex,nofollow,noarchive" (noindex should be enough; we can still allow Google to follow links or archive).
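For an .htaccess-compatible variant (where <LocationMatch> is not available), SetEnvIf plus Header gives the same effect; a sketch using the regex above:

```apache
# tag all 6.x-8.x guide pages as noindex at the HTTP level
SetEnvIf Request_URI "^/guide/(6|7|8)_" old_guide
Header set X-Robots-Tag "noindex" env=old_guide
```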

@HoustonPutman
Contributor

Instead of an old-style robots.txt, we may also add a <LocationMatch "^/guide/(6|7|8)_"> to the .htaccess with a Header add "X-Robots-Tag: noindex,nofollow,noarchive" (noindex should be enough; we can still allow Google to follow links or archive).

Yes, I was actually about to start implementing this. I think it's the way to go.

@HoustonPutman
Contributor

Ok, I have this: 78ecec9

It'll be an absolute pain to test, and I'm sure it doesn't work out of the box. But it's not required for the 9.0 release, so we can tinker with it.

@uschindler
Contributor

uschindler commented May 11, 2022

I think you should be able to do some tests with curl -I testurl on the staging site and check for the (non-)existence of the header.

@janhoy
Contributor Author

janhoy commented May 11, 2022

I think you should be able to do some tests with curl -I testurl on the staging site and check for the (non-)existence of the header.

I was able to test .htaccess locally during my PR effort with Docker, like this:

docker run --rm --name httpd -p 8000:80 -v /Users/janhoy/git/solr-site/output:/usr/local/apache2/htdocs/ -v $(pwd)/my-httpd.conf:/usr/local/apache2/conf/httpd.conf httpd
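With that container running, the header check suggested above would look something like this (path illustrative):

```sh
curl -I http://localhost:8000/guide/8_11/index.html | grep -i x-robots-tag
```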

@janhoy janhoy merged commit 4896773 into apache:main May 11, 2022
@janhoy janhoy deleted the latest_version_segment_strategy_redirect_to branch May 11, 2022 23:00
janhoy added a commit that referenced this pull request May 11, 2022
janhoy added a commit that referenced this pull request May 11, 2022
janhoy added a commit to janhoy/solr that referenced this pull request May 12, 2022
@magibney
Contributor

I realize this PR is merged (and thanks!), but I have a couple of questions that follow logically from the conversation here, so:

  1. Noticing that we don't have (and haven't historically had, I think) an old-style robots.txt, I wonder: is the old-style robots.txt completely obviated? I.e., should we not bother having one?
  2. A sitemap.xml is being generated, I think, and is present in nightlies. But it doesn't appear to be accessible on the main site. I think it was previously present. Is there still a purpose served by sitemap.xml? Currently the nightlies version looks like it points to all (Antora) versions -- I'm not sure whether we'd want to pare down the referenced pages to make sitemap.xml a proper complement to the "canonical, no-index/no-follow/no-archive" approach taken by this PR?
  3. If sitemap.xml is still relevant and we want to make it accessible, I think the sitemap spec calls for sitemap.xml to be referenced from an old-style robots.txt ... I'm not aware of other/newer approaches to referencing sitemap.xml.
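For reference on point 3: the sitemap protocol does define exactly that hook, a single Sitemap directive in robots.txt (URL illustrative):

```txt
Sitemap: https://solr.apache.org/guide/sitemap.xml
```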
