Fix google rankings of docs by using weighting / google sitemaps #79

Open
ukd1 opened this issue Aug 8, 2019 · 19 comments

@ukd1 ukd1 commented Aug 8, 2019

For me, I get a search result (screenshot not preserved) linking to https://crystal-lang.org/api/0.20.1/HTTP/Client.html

I actually want https://crystal-lang.org/api/0.30.0/HTTP/Client.html, or "latest" (https://crystal-lang.org/api/latest/HTTP/Client.html).

This can be done with a sitemap (https://en.wikipedia.org/wiki/Sitemaps) using weighting, aka priority:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>https://crystal-lang.org/api/0.20.1/HTTP/Client.html</loc>
        <lastmod>2019-07-10</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.1</priority>
    </url>
    <url>
        <loc>https://crystal-lang.org/api/0.30.0/HTTP/Client.html</loc>
        <lastmod>2019-07-10</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>

etc...

@ukd1 ukd1 changed the title Fix google rankings of docs by using weighting Fix google rankings of docs by using weighting / google sitemaps Aug 8, 2019

@straight-shoota straight-shoota commented Aug 8, 2019

This is essentially discussed in crystal-lang/crystal#5952.
The idea was to use canonical references to the latest URL. But that also has some issues. Most importantly, it currently doesn't cover doc pages < 0.25.0, so these still show up in the search results (but there are none between 0.25.0 and 0.30.0 because they all point to latest).

Using a sitemap seems like a smart alternative to solve this issue. 👍

@bcardiff bcardiff commented Aug 9, 2019

Does someone know if there is a tool for generating the sitemap that does some recursive checks over directories? Unless it can be automated it won't happen.

Doing a pass over old docs to set the canonical seems more likely if that fixes the issue.

@straight-shoota straight-shoota commented Aug 9, 2019

This should be fairly simple to automate. Essentially you only need to extract all local links from each version's index.html in the API docs. Those are already the URLs for the sitemaps. All links except those from the most recent version get a lower priority.
Each release's API docs should be in a distinct sitemap file, and all of them can be combined in a sitemap index.
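
A rough sketch of that extraction, assuming each version's index.html lists its pages as relative `href="….html"` links. The function name and the link format are assumptions for illustration, not taken from the actual generator:

```shell
# Hypothetical helper: turn the relative .html links found in one version's
# index.html into a sitemap. Assumes links look like href="Some/Page.html".
# Usage: sitemap_from_index <index.html> <base-url> <priority>
sitemap_from_index() {
  index_file=$1
  base_url=$2
  priority=$3

  echo '<?xml version="1.0" encoding="utf-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # Pull out every href ending in .html, drop absolute URLs (scheme:...),
  # and deduplicate before emitting one <url> entry per page.
  grep -o 'href="[^"]*\.html"' "$index_file" |
    sed -e 's/^href="//' -e 's/"$//' |
    grep -v '^[a-z][a-z]*:' |
    sort -u |
    while IFS= read -r page; do
      printf '  <url><loc>%s/%s</loc><priority>%s</priority></url>\n' \
        "$base_url" "$page" "$priority"
    done
  echo '</urlset>'
}
```

For example, `sitemap_from_index api/0.30.0/index.html https://crystal-lang.org/api/0.30.0 1.0 > sitemap.xml` would write one sitemap for that version (the local path is made up).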

I can try to put this together if nobody else is interested.

@ukd1 ukd1 commented Aug 12, 2019

@straight-shoota nice, didn't think to check for the issue on the main repo somehow...lol. I'd be down for helping on this, but tbh, I have no idea how the docs or old versions are built - so it might be easier if you do it. If you'd like a hand / pair / lmk.

@RX14 RX14 commented Aug 25, 2019

> Doing a pass over old docs to set the canonical seems more likely if that fixes the issue.

This can be done with a simple sed script to adjust the header for old versions. The doc pages don't need to be regenerated.
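
A minimal sketch of such a sed pass, assuming the old pages contain a plain `<head>` tag without attributes and no existing canonical link. The function name is hypothetical and the real page markup may differ:

```shell
# Hypothetical helper: print an HTML page with a canonical link added right
# after the opening <head> tag. Assumes a plain "<head>" tag and a URL that
# contains neither "|" nor "&" (both are special in the sed expression).
# Usage: add_canonical <canonical-url> < page.html > page.new.html
add_canonical() {
  canonical_url=$1
  # "|" is the sed delimiter so the slashes in the URL need no escaping.
  sed "s|<head>|<head><link rel=\"canonical\" href=\"$canonical_url\">|"
}
```

Run over every page of an old version, this would point each page at its counterpart under /api/latest/ without regenerating the docs.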

@RX14 RX14 commented Aug 25, 2019

I'd suggest redoing these old pages before working on a sitemap.

@straight-shoota straight-shoota commented Sep 5, 2019

@RX14 The problem with canonical links is that docs for older versions vanish from the search results because the search algorithm treats them as duplicate content. But in fact, they're not duplicate and users might have a need to find documentation for older versions as well. For example, when upgrading to a new version, you may need to read the documentation of deprecated features not available in the current version in order to find a suitable replacement.

Using a sitemap is a superior solution because it allows assigning priorities to individual pages, and outdated pages don't vanish completely; they just won't be as prominent as more recent ones. I think it should eventually replace the canonical link.

@straight-shoota straight-shoota commented Sep 5, 2019

I've put together a simple program to automatically generate sitemaps for https://crystal-lang.org/api

It's available at https://github.com/straight-shoota/crystal_docs_sitemap

Generated output: output.tar.gz

The output contents should be published at https://crystal-lang.org/api/ and search engines need to be informed about the sitemap (see https://www.sitemaps.org/protocol.html#informing).

@straight-shoota straight-shoota commented Oct 18, 2019

@bcardiff WDYT?

@bcardiff bcardiff commented Oct 18, 2019

I agree the sitemap is worth having and is a good solution. I don't think having the sitemap checked in the repo is the right thing.

From a workflow point of view, what would make sense is a tool that changes an existing sitemap with some operations:

  1. Add dir content as it will be reached from specific url prefix
  2. Set the priority for all routes matching a specific prefix

At least that workflow will play well with the release process, where we have a local dir with the new api documentation to upload.

I am unsure how to keep the sitemap up to date with respect to the content from Jekyll itself regarding lastmod of existing pages and new posts. Maybe there is a 3rd action:

  3. Update lastmod for a subset of dirs as it will be reached from specific url prefix.

That way we can update those params without iterating the whole content.

So, in essence, is having an approach to update rather than create a sitemap.
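
As an illustration of operation 2, here is a sketch that rewrites priorities in an existing sitemap. It assumes one XML element per line and that `<loc>` precedes `<priority>` inside each `<url>`, as in the example sitemap at the top of this issue. The function name is made up:

```shell
# Hypothetical helper for operation 2: rewrite <priority> for every <url>
# whose <loc> starts with a given prefix. Other entries pass through as-is.
# Usage: set_priority_for_prefix <url-prefix> <priority> < sitemap.xml
set_priority_for_prefix() {
  awk -v prefix="$1" -v prio="$2" '
    # Remember whether the current <url> entry matches the prefix.
    /<loc>/ { hit = (index($0, "<loc>" prefix) > 0) }
    hit && /<priority>/ {
      sub(/<priority>[^<]*<\/priority>/, "<priority>" prio "</priority>")
    }
    { print }
  '
}
```

Because it works as a filter over the existing file, it never needs the documentation content itself, only the sitemap.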

@straight-shoota straight-shoota commented Oct 18, 2019

> I don't think having the sitemap checked in the repo is the right thing.

Agreed. It just needs to be generated and put into an S3 bucket. Ideally, a rebuild should be triggered after the nightly API docs have been updated from master.

> I am unsure how to keep the sitemap up to date with respect to the content from jekyll itself regarding lastmod of existing pages and new posts.

Currently, these sitemaps are only for /api, so Jekyll is not even involved. This is perfectly fine, the sitemap doesn't need to incorporate all pages on the domain. Getting the priorities right for the API versions is the main issue here, and I'd like to get that fixed before considering other parts of the website. They can be tackled individually (for example, Jekyll can simply build its own sitemap), we just need to reference all sitemaps from the sitemapindex.
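
A small sketch of what such a sitemap index could look like when generated from a list of sitemap URLs. The function name is hypothetical, and any URL for a Jekyll-built sitemap would be an assumption:

```shell
# Hypothetical helper: print a sitemap index that references one sitemap per
# argument. Each argument is the full URL of a sitemap file.
# Usage: sitemap_index <sitemap-url>...
sitemap_index() {
  echo '<?xml version="1.0" encoding="utf-8"?>'
  echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  for sitemap_url in "$@"; do
    printf '  <sitemap><loc>%s</loc></sitemap>\n' "$sitemap_url"
  done
  echo '</sitemapindex>'
}
```

The per-version API sitemaps and a separate site-wide sitemap can then be listed side by side without either generator knowing about the other.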

> So, in essence, is having an approach to update rather than create a sitemap.

Sure, we can do that. I just figured it would be easier to simply run the generator and push the result to S3 without having to synchronize first.

In practice, there are two events that would require an update to the sitemaps:

  1. Every day the updated nightly API docs are published for master. This only needs a rebuild of the sitemap for /api/master.
  2. When a new Crystal version is released, we need to build the sitemap for the new release and rebuild for the last x releases in order to update the priority. x is currently 3: The last 2 versions get priorities (0.5, 0.3) and the one after that needs to be set to the default (0.1)
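
The priority schedule described in point 2 can be written down as a small lookup. This is only a sketch: the function name is made up, and the 1.0 for the newest release is taken from the default priority mentioned elsewhere in this thread:

```shell
# Hypothetical helper: map a release's age (0 = newest release, 1 = the one
# before it, ...) to the sitemap priority discussed above.
# Usage: priority_for_age <age>
priority_for_age() {
  case "$1" in
    0) echo 1.0 ;;  # newest release
    1) echo 0.5 ;;
    2) echo 0.3 ;;
    *) echo 0.1 ;;  # default for everything older
  esac
}
```

With x = 3, a new release only changes the result for ages 1 through 3, which is why only the last three sitemaps need touching.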

The priority adjustments could actually just be implemented with a simple search-and-replace (e.g. sed). The contents don't change when a release ages, thus there is no need to actually rebuild the sitemap.

Considering all this, it might actually be the best solution to integrate the sitemap generation into the doc generator. This problem is not specific to the stdlib and this way all shards API docs could benefit.
This is really trivial to implement: it just spits out another file. And it won't require additional configuration: there is already --canonical-base-url, and priority could just be 1.0 by default. Maybe a --sitemap-priority option could be useful, but it's not necessary.

To build the sitemaps for legacy releases, we can just use https://github.com/straight-shoota/crystal_docs_sitemap. That's a one-time thing.

With this, updates to master sitemap don't need any additional action because the updated sitemap is already provided by the doc generator.
When a new release is added, we need to add it to the sitemapindex and update the sitemaps for the last releases, but this could just be s|<priority>1.0</priority>|<priority>0.5</priority>| etc.

@bcardiff bcardiff commented Oct 18, 2019

> Currently, these sitemaps are only for /api, so Jekyll is not even involved

Wouldn't that prevent indexing other pages?

> Every day the updated nightly API docs are published for master.

I thought we didn't want a sitemap for master. It is mostly used for preview (edit: sorry, you mention it at the end)

> When a new Crystal version is released,

I'm OK with downloading all the docs for a first-time generation (edit: or using the proposed script), but upon a Crystal version release I don't have the whole bucket of docs locally, and I don't want to be required to download it. What I do have is the -doc.tar.gz artifact that is pushed. I was thinking of injecting the new paths there, without actually retrieving them from HTTP or the bucket. Hence the proposed transformations 1 and 2.
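
A sketch of that first transformation, working purely from a local directory such as the unpacked docs artifact. The function name and directory layout are assumptions:

```shell
# Hypothetical helper for transformation 1: print <url> entries for every
# .html file in a local directory, as it will be reachable under a URL prefix.
# Usage: urls_from_dir <local-dir> <url-prefix> <priority>
urls_from_dir() {
  docs_dir=$1
  url_prefix=$2
  priority=$3
  # cd in a subshell so find prints paths relative to the docs directory.
  (cd "$docs_dir" && find . -type f -name '*.html') |
    sed 's|^\./||' |
    sort |
    while IFS= read -r page; do
      printf '  <url><loc>%s/%s</loc><priority>%s</priority></url>\n' \
        "$url_prefix" "$page" "$priority"
    done
}
```

The output is only the list of entries; splicing it into an existing `<urlset>` is left out of this sketch.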

@straight-shoota straight-shoota commented Oct 18, 2019

> Wouldn't that prevent indexing other pages?

No, sitemaps are not used as an exclusive source. Search engines still employ their regular crawling. They just augment the results or help discover pages that would otherwise not be discovered. See https://webmasters.stackexchange.com/questions/114425/if-i-remove-urls-from-an-xml-sitemap-will-google-still-index-them

@straight-shoota straight-shoota commented Oct 18, 2019

> I thought we didn't want a sitemap for master. It is mostly used for preview

I guess it's not strictly necessary, but when the doc generator puts out the sitemap anyway, this requires no extra effort at all.

> What I do have is the -doc.tar.gz artifact that is pushed.

My suggestion is that the sitemap is generated directly by the docs generator, thus it would already be included in the doc.tar.gz.
Each API version has its own sitemap (sitemap.xml), which would be located at /api/{{version}}/sitemap.xml.

When publishing a new release, you would just push the contents of doc.tar.gz and the new sitemap is online. It needs to be referenced in the sitemap index, so that's adding one line to that file:

<sitemap><loc>https://crystal-lang.org/api/{{version}}/sitemap.xml</loc><lastmod>{{`date --rfc-3339=date`}}</lastmod></sitemap>

And you would need to grab /api/{{version-1}}/sitemap.xml, /api/{{version-2}}/sitemap.xml, /api/{{version-3}}/sitemap.xml, replace the priorities and push them back up.

This could all be placed in a simple shell script which could automatically retrieve the files from S3, apply the changes and push them back up. I haven't tested this but the general idea looks like this:

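#!/usr/bin/env bash
# Usage: <script> CURRENT_VERSION [PREVIOUS_VERSION...]  (previous versions newest first)
# Expects S3_BUCKET to be set in the environment. Requires GNU sed and GNU date.
CURRENT_VERSION=$1

aws s3 cp "$S3_BUCKET/sitemapindex.xml" sitemapindex.xml

# Insert the new entry before the last line (the closing </sitemapindex> tag).
# Double quotes are needed so that $CURRENT_VERSION and $(date ...) expand.
sed -i "\$ i\\  <sitemap><loc>https://crystal-lang.org/api/$CURRENT_VERSION/sitemap.xml</loc><lastmod>$(date --rfc-3339=date)</lastmod></sitemap>" sitemapindex.xml

aws s3 cp sitemapindex.xml "$S3_BUCKET/sitemapindex.xml"

ARGV=("$@")

# ARGV[0] is the current version; the older releases start at index 1.
for (( i=1; i < $#; i++ )); do
  version=${ARGV[$i]}
  case $i in
    1)
      priority=0.5
      ;;
    2)
      priority=0.3
      ;;
    *)
      priority=0.1
      ;;
  esac

  aws s3 cp "$S3_BUCKET/$version/sitemap.xml" "sitemap-$version.xml"

  # <priority> is an element in the sitemap format; sed has no \d, so match [0-9.].
  sed -i -E "s|<priority>[0-9.]+</priority>|<priority>$priority</priority>|" "sitemap-$version.xml"

  aws s3 cp "sitemap-$version.xml" "$S3_BUCKET/$version/sitemap.xml"
done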

@bcardiff bcardiff commented Oct 18, 2019

OK, let's make the doc tool generate the sitemap if instructed to do so. But it will need to know about the base URL: $ crystal docs --sitemap-base-url=https://crystal-lang.org/api/VERSION/ or something similar.

Then the maintenance of the root sitemap is more scriptable, as proposed.

@straight-shoota straight-shoota commented Oct 18, 2019

We don't need another CLI option for this. That's exactly the same intent as --canonical-base-url.

@bcardiff bcardiff commented Oct 18, 2019

The canonical-base-url is /latest always.
If there is no need to have a canonical-base then that setting might go away.
And they are different concerns.

@straight-shoota straight-shoota commented Oct 18, 2019

Oh yes, I mixed that up, sorry. It should go, because using canonical completely hides all older versions. So we can simply replace it.

@straight-shoota straight-shoota commented Nov 20, 2019

The compiler supports generating a sitemap now. We can proceed to get this integrated into the docs generation process.

  • Add DOCS_OPTIONS to distribution-scripts:
    • For nightly: --sitemap-base-url=https://crystal-lang.org/api/master --sitemap-changefreq=daily --sitemap-priority=0.3
    • For latest release: --sitemap-base-url=https://crystal-lang.org/api/$(version) --sitemap-changefreq=never --sitemap-priority=1.0
  • Pass the build type from .circle/config.yml to the distribution-script's workflow.
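
One way the build type could select the options, sketched as a shell function. The flags are the ones listed above; the function name and the build type names "nightly" and "release" are assumptions, not what distribution-scripts actually uses:

```shell
# Hypothetical helper: choose the doc generator's sitemap flags per build type.
# Usage: docs_options <build-type> [version]
docs_options() {
  case "$1" in
    nightly)
      echo "--sitemap-base-url=https://crystal-lang.org/api/master --sitemap-changefreq=daily --sitemap-priority=0.3"
      ;;
    release)
      echo "--sitemap-base-url=https://crystal-lang.org/api/$2 --sitemap-changefreq=never --sitemap-priority=1.0"
      ;;
    *)
      # Fail loudly on an unexpected build type instead of guessing.
      echo "unknown build type: $1" >&2
      return 1
      ;;
  esac
}
```

A build script could then export the result, for example `DOCS_OPTIONS=$(docs_options release 0.31.0)`.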