Add toggle for blocking search indexing of Grav URL parameters #162

Closed

nbusseneau

When enabled, a noindex <meta> tag will be added to pages with Grav URL parameters (e.g. /blog/foo:bar).

Rationale: avoid page duplicates being indexed by search engines.

I noticed this was happening when parameterized URLs were crawlable by search engines, which is the case by default on Quark when the sidebar uses the Taxonomy List and Archives plugins.

It seems that at least Google treats them as separate pages, even when properly using canonical link elements and sitemaps.

As specified in Google's documentation (https://support.google.com/webmasters/answer/139066?hl=en), the indexer may choose to ignore content-provided indicators:

Google chooses the canonical page based on a number of factors (or signals), such as whether the page is served via http or https; page quality; presence of the URL in a sitemap; and any "rel=canonical" labeling. You can indicate your preference to Google using these techniques, but Google may choose a different page as canonical than you do, for various reasons.

This behavior might be undesirable. In this case, the only recourse is to block indexing altogether using a noindex <meta> tag (https://support.google.com/webmasters/answer/93710?hl=en):

<meta name="robots" content="noindex">

Obviously, we'd only want to block indexing for Grav parameterized URLs. Including the <meta> tag in a base template is not suitable: it must be conditionally inserted.
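
To make the conditional insertion concrete, here is a minimal sketch in Python. It is not the code from this PR and not Grav's own API: the function names `has_grav_params` and `robots_meta`, the `toggle_enabled` flag, and the assumption that parameters use the `:` separator (as in `/blog/foo:bar`) are all hypothetical and chosen for illustration.

```python
from urllib.parse import urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex">'


def has_grav_params(url: str, param_sep: str = ":") -> bool:
    """Return True if any path segment looks like a key:value parameter.

    Assumption: parameters appear as path segments of the form
    key<sep>value, with ':' as the separator.
    """
    path = urlsplit(url).path
    for segment in path.split("/"):
        key, sep, value = segment.partition(param_sep)
        # Require a non-empty key and value on both sides of the separator.
        if sep and key and value:
            return True
    return False


def robots_meta(url: str, toggle_enabled: bool) -> str:
    """Return the noindex tag only for parameterized URLs when the toggle is on."""
    if toggle_enabled and has_grav_params(url):
        return NOINDEX_TAG
    return ""
```

With the toggle on, `robots_meta("/blog/foo:bar", True)` yields the tag, while a plain page such as `/blog/my-post` yields an empty string, so regular pages stay indexable.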

Hence this toggle.

@nbusseneau
Author

@rhukster Revisiting old PRs -- what do you think about this? I have been running a custom Quark theme on my end with this switch for a while, and find it really useful to not have Google index Grav parameterized URLs on my website.

@rhukster
Member

Rather than this being a toggle in every theme, it makes more sense to me if this logic was rewritten as a plugin. I think that would be more useful and not theme dependent. What do you think?

@nbusseneau
Author

Oooh that is a nice idea, I did not think about this. I agree!

@nbusseneau nbusseneau closed this Oct 19, 2021
@NicoHood
Contributor

Isn't the canonical URL meant for such use cases?

@nbusseneau
Author

nbusseneau commented Oct 20, 2021

@NicoHood As said in the OP:

It seems that at least Google treats them as separate pages, even when properly using canonical link elements and sitemaps.

As specified in Google's documentation, the indexer may choose to ignore content-provided indicators:

Google chooses the canonical page based on a number of factors (or signals), such as whether the page is served via http or https; page quality; presence of the URL in a sitemap; and any "rel=canonical" labeling. You can indicate your preference to Google using these techniques, but Google may choose a different page as canonical than you do, for various reasons.

I have verified that this happens: Google indexes my website's parameterized URLs as separate pages even with the proper canonical URL specified.
