Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Page extensions return duplicated content using 200 statuscode instead of 301 redirect #3130

Closed
NicoHood opened this issue Jan 6, 2021 · 11 comments
Assignees

Comments

@NicoHood
Copy link
Contributor

NicoHood commented Jan 6, 2021

I found som issues with page extensions, eg https://yourdomain.de/contact.html:

  1. Any characters appended to the specified config.system.pages.types will also work. E.g. https://yourdomain.de/contact.htmlxxx
  2. The website return a 200 instead of a 301 to the original url. This is important for SEO to avoid duplicated content

Solution

1.: The error comes from the Uri class function isValidExtension(). The regex should be an exact match using ^exact$ syntax:
NicoHood@4131e9c

2.: Uri class function init() removes the postfix from the url which causes the issue. Note that html and htm will always return a 200, as those are set as default, no matter what config.system.pages.types contains. The reason why I even found this, is the weird behavior for the sitemap plugin, which made the sitemap available via xml and "normal" url. It causes lots of trouble for search engines, if there is duplicated content. In this case it would be best to only serve the xml file.

Page processing order:

  1. InitializeProcessor -> trailing slash logic happens here
  2. Uri.init() -> 200 logic currently happens here
  3. PagesProcessor
  4. onPageInitialized -> sitemap gets added here
  5. onPageNotFound
  6. onPageNotFound -> error plugin serves 404
  7. PagesProcessor ends

It would be best to only do the redirect if no page was found (or served by any plugin) and serve the sitemap only at /sitemap.xml (or whatever is configured). In order to still allow plugins to serve special .xml files we must place the redirect after the onPageInitialized event but before onPageNotFound. A possible solution is to move the code from Uri to the PagesProcessor:

NicoHood@ecbcdbd

The code solves the following:

  • 301 redirect instead of 200 duplicated content (SEO improvement)
  • Allow plugins to server special files like sitemap.xml on the extension uri only
  • Still allow redirects of .html etc to the "real" page
  • But also add an option to turn it off: redirect_file_extensions
  • Backwards compatibility for the sitemap plugin to server the sitemap at sitemap and redirect at sitemap.xml
  • Option to serve the sitemap now at sitemap.xml only, without any .htaccess rule

Fix provided in PR #3131

@mahagr
Copy link
Member

mahagr commented Jan 15, 2021

I agree that Grav should probably be more strict on file extensions. Try using .json, it will also return html page if page doesn't support the format. This will however break backwards compatibility.

@mahagr mahagr self-assigned this Jan 15, 2021
@mahagr
Copy link
Member

mahagr commented Mar 11, 2021

Have you tried to set system.pages.redirect_default_route to true?

@mahagr
Copy link
Member

mahagr commented Mar 11, 2021

OK, that didn't work. I changed the configuration to accept redirect codes instead of just true/false.

The fix also needs updated form plugin as the select field had issues with boolean values.

mahagr added a commit that referenced this issue Mar 11, 2021
@mahagr
Copy link
Member

mahagr commented Mar 11, 2021

Oh, and you also need to use 301 in both system.pages.redirect_default_route and system.pages.redirect_trailing_slash.

@NicoHood
Copy link
Contributor Author

When opening a page with the .html postfix the redirect happens (good) but still with 302 redirect code. I've set the mentioned settings to 301. The default should still be 302, as not all links should be 301 by default. The rest is working fine from my observations.

@mahagr
Copy link
Member

mahagr commented Mar 15, 2021

I'm using 302 redirect code as during testing I ended up destroying my development site with neverending redirect loops by just changing a setting. Only clearing all cache from the browser helped on that as it didn't even reach the site anymore.

The change should obey the new redirect setting, though? Did you try with system.pages.redirect_default_route: 301?

@NicoHood
Copy link
Contributor Author

I tried that and it does not work for the html postfix. Just try it out.

@mahagr
Copy link
Member

mahagr commented Mar 19, 2021

Yeah, because it was missing logic for that. Develop has the latest version, it should redirect .htm, .html and no extension properly and also works with system.pages.append_url_extension setting so redirects also work from no extension to .html for example.

@mahagr
Copy link
Member

mahagr commented Mar 20, 2021

@NicoHood Please test it again, the implementation should be good now. If there are still issues in sitemap, I think they are specific to the plugin, but we can talk about them, too. What I've fixed here are all the duplicate pages in Grav content, such as:

  • trailing slash
  • duplicate extensions such as .html and .htm
  • redirects and page aliases

Let's create a new issue for the remaining issues, please. :)

@mahagr mahagr closed this as completed Mar 20, 2021
@NicoHood
Copy link
Contributor Author

NicoHood commented Mar 20, 2021

Checkout https://bikefittingfinder.de/karte.xml it still does not redirect. (Grav 1.7.9)

@mahagr
Copy link
Member

mahagr commented Mar 20, 2021

It's not supposed to. But it should be showing 404 as xml version of the page doesn't exist. Create a new issue about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants