Sitemap.xml has a strange URL, Google will not read it #3067

Closed
harrywesterman opened this issue Feb 7, 2020 · 14 comments

Comments

@harrywesterman

https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xml is the URL of my sitemap (I manually add the workaround you made a couple of days ago: [https://github.com//issues/3065])

This %2F in the URL doesn't sound right? Google Search won't read the sitemap, though I can see it in a browser.

@fisharebest
Owner

The URL looks OK to me. If I click it, then I get a valid sitemap file.

> This %2F in the URL doesn't sound right?

This is a "url-encoded" slash character. It's perfectly normal.
You can see them elsewhere in webtrees. For example, if you search for foo/bar, you'll
end up on a page with a URL like this:
https://dev.webtrees.net/demo-dev/tree/demo/search-general?query=foo%2Fbar
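
To illustrate the point about URL encoding, here is a minimal Python sketch (not part of webtrees; it only uses the standard library) showing that `%2F` is simply the encoded form of `/`, and that a query-string parser turns it back into the original route:

```python
from urllib.parse import quote, urlsplit, parse_qs

# Percent-encode the route. safe="" means "/" is escaped too.
route = "/sitemap.xml"
encoded = quote(route, safe="")
print(encoded)  # %2Fsitemap.xml

# Build the same kind of URL that appears in this issue and decode it again.
url = "https://www.stamboomwesterman.net/index.php?route=" + encoded
query = parse_qs(urlsplit(url).query)
print(query["route"][0])  # /sitemap.xml
```

Any client that follows the usual URL rules, including a search engine crawler, should decode the parameter the same way.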

> Google Search won't read the sitemap

Can you give more details? Is there an error message?

@harrywesterman
Author

I found it: the XML was all right, but Cloudflare was blocking access to the sitemap for requests with a Googlebot user agent. I have disabled Cloudflare for the time being, so there is nothing wrong with Webtrees. Thanks for your help!

@harrywesterman
Author

harrywesterman commented Feb 10, 2020

No, there is still something fishy (no offence), Google still does not find the pages in the Sitemap. The sitemap.xml can be read now, but according to https://www.xmlvalidation.com/index.php?id=1&L=0 it is not valid XML.

1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <?xml-stylesheet type="text/xsl" href="https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xsl"?>
3 | <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

Error in the XML document:
3: | 67 | cvc-elt.1: Cannot find the declaration of element 'sitemapindex'.

@fisharebest
Owner

fisharebest commented Feb 10, 2020

That validator tool requires that you upload the sitemap schema first.
If you upload the file https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd
and then upload your sitemapindex file, it reports no errors.

Here is a validator that is designed to test sitemap files. It says that both your sitemapindex and sitemap files are valid.

https://www.xml-sitemaps.com/validate-xml-sitemap.html?op=validate-xml-sitemap&go=1&sitemapurl=https%3A%2F%2Fwww.stamboomwesterman.net%2Findex.php%3Froute%3D%252Fsitemap.xml&submit=Validate+Sitemap

https://www.xml-sitemaps.com/validate-xml-sitemap.html?op=validate-xml-sitemap&go=1&sitemapurl=https%3A%2F%2Fwww.stamboomwesterman.net%2Findex.php%3Froute%3D%252Fsitemap-tree1-INDI-0.xml&submit=Validate+Sitemap
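
The `%252F` in the validator links above is the same slash encoded twice, because the sitemap URL is itself passed as a query parameter. A small Python sketch (standard library only, shown purely as an illustration) that peels off the two layers:

```python
from urllib.parse import urlsplit, parse_qs

validator_url = (
    "https://www.xml-sitemaps.com/validate-xml-sitemap.html"
    "?op=validate-xml-sitemap&go=1"
    "&sitemapurl=https%3A%2F%2Fwww.stamboomwesterman.net%2Findex.php"
    "%3Froute%3D%252Fsitemap.xml"
    "&submit=Validate+Sitemap"
)

# First decoding step: recover the sitemap URL handed to the validator.
sitemap_url = parse_qs(urlsplit(validator_url).query)["sitemapurl"][0]
print(sitemap_url)
# https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xml

# Second decoding step: recover the route inside that URL.
route = parse_qs(urlsplit(sitemap_url).query)["route"][0]
print(route)  # /sitemap.xml
```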

> Google still does not find the pages in the Sitemap.

Can you give more details?

@harrywesterman
Author

I found that site too :-) But another site says the first link is not right:
https://www.websiteplanet.com/nl/webtools/sitemap-validator/?page=https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xml

So I fixed the XML manually by deleting all the extra spaces in the file, and uploaded it to my site as sitemapfixed.xml. Now the validator thinks it is better:
https://www.websiteplanet.com/nl/webtools/sitemap-validator/?page=https://www.stamboomwesterman.net/sitemapfixed.xml

Google thinks it is better now too, but it finds no URLs. Then I fixed https://www.stamboomwesterman.net/index.php?route=%2Fsitemap-tree1-INDI-0.xml as well, just by aligning the XML tags. I called it sitemap-tree1-INDI-0-fixed.xml (I left only one INDI in there).

image

You can see it likes my fixed XML files, and finds the URL! So it is about spaces and aligning!

@fisharebest
Owner

> So it is about spaces and aligning!

I am not sure that it is. XML allows whitespace.
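
As a rough check of the whitespace claim, the following Python sketch parses an indented and a compact version of the same sitemap index and extracts identical URLs from both. The `example.com` URL and the helper name `sitemap_locs` are made up for the illustration, and the XML declaration is left out of the samples for brevity:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_locs(xml_text):
    """Return the <loc> values of a sitemap index, with surrounding whitespace trimmed."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]

indented = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://example.com/sitemap-1.xml</loc>
    </sitemap>
</sitemapindex>"""

compact = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>"
    "</sitemapindex>"
)

# Indentation between elements does not change what a parser extracts.
print(sitemap_locs(indented) == sitemap_locs(compact))  # True
```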

Google is happy to read the sitemap files on my site, and many others.

Here's the google console for the sitemap on the demo server:

Screenshot 2020-02-10 at 12 17 53

Screenshot 2020-02-10 at 12 18 02

@harrywesterman
Author

Yes, I understand. So it must be something with PHP on my shared host, right? I asked the DreamHost guys, but they cannot find anything.

The logic seems to be:
https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xml
Sometimes gives HTTP/1.1 500 Internal Server Error, especially with the Googlebot user agent. Normal browsers show the page anyway, but Google Search Console and test sites just report a "connection error".

https://www.stamboomwesterman.net/sitemapfixed.xml
HTTP/1.1 200 OK always.
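
One way to compare the two URLs without a browser is to request them with a chosen User-Agent and look at the status code. This is a minimal Python sketch, not a webtrees tool: the function name `fetch_status` and its `opener` parameter (there only so the function can be exercised without network access) are assumptions made for the example.

```python
import urllib.error
import urllib.request

# The User-Agent string commonly sent by Googlebot.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def fetch_status(url, user_agent, opener=urllib.request.urlopen, timeout=10):
    """Return (status_code, body_length) for a GET request sent with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, len(response.read())
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses raise HTTPError, but the body can still be read.
        return error.code, len(error.read())
```

Calling `fetch_status(url, GOOGLEBOT_UA)` and `fetch_status(url, "Mozilla/5.0")` for the same URL would show whether the server answers differently depending on the user agent. A 500 status together with a non-empty body would match what a browser hides: the page renders, but a crawler sees a failure.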

@harrywesterman
Author

Continuing the journey, I am starting to like this :-)

Using the Google Advanced REST Client, I could get the 500 error from my machine. But after that, it started working, returning only 200s. I have seen that before: a new browser gets a 500 error first and works after a refresh.

I enabled php logging, and got a decent error now:

[10-Feb-2020 06:59:19 America/Los_Angeles] PHP Fatal error: Uncaught ErrorException: preg_match(): Allocation of JIT memory failed, PCRE JIT will be disabled. This is likely caused by security restrictions. Either grant PHP permission to allocate executable memory, or set pcre.jit=0 in /home/harrywesterman/stamboomwesterman.net/vendor/nyholm/psr7-server/src/ServerRequestCreator.php:260
Stack trace:
#0 [internal function]: Fisharebest\Webtrees\Webtrees::Fisharebest\Webtrees{closure}(2, 'preg_match(): A...', '/home/harrywest...', 260, Array)
#1 /home/harrywesterman/stamboomwesterman.net/vendor/nyholm/psr7-server/src/ServerRequestCreator.php(260): preg_match('/^(.+)\:(\d+)$/', 'www.stamboomwes...', Array)
#2 /home/harrywesterman/stamboomwesterman.net/vendor/nyholm/psr7-server/src/ServerRequestCreator.php(141): Nyholm\Psr7Server\ServerRequestCreator->createUriFromArray(Array)
#3 /home/harrywesterman/stamboomwesterman.net/vendor/nyholm/psr7-server/src/ServerRequestCreator.php(63): Nyholm\Psr7Server\ServerRequestCreator->getUriFromEnvWithHTTP(Array)
#4 /home/harrywesterm in /home/harrywesterman/stamboomwesterman.net/vendor/nyholm/psr7-server/src/ServerRequestCreator.php on line 260

So PHP disables the PCRE JIT after it goes wrong, and that is why it works the second time!

So I put pcre.jit=0 in my php.ini and now I got another error:

[10-Feb-2020 15:07:50 UTC] PHP Fatal error: Uncaught ErrorException: Trying to get property 'name' of non-object in /home/harrywesterman/stamboomwesterman.net/resources/views/layouts/default.phtml:76
Stack trace:
#0 /home/harrywesterman/stamboomwesterman.net/resources/views/layouts/default.phtml(76): Fisharebest\Webtrees\Webtrees::Fisharebest\Webtrees{closure}(8, 'Trying to get p...', '/home/harrywest...', 76, Array)
#1 /home/harrywesterman/stamboomwesterman.net/app/View.php(186): include('/home/harrywest...')
#2 /home/harrywesterman/stamboomwesterman.net/app/View.php(282): Fisharebest\Webtrees\View->render()
#3 /home/harrywesterman/stamboomwesterman.net/app/Helpers/functions.php(203): Fisharebest\Webtrees\View::make('layouts/default', Array)
#4 /home/harrywesterman/stamboomwesterman.net/app/Http/ViewResponseTrait.php(58): view('layouts/default', Array)
#5 /home/harrywesterman/stamboomwesterman.net/app/Http/Middleware/HandleExceptions.php(150): Fisharebest\Webtrees\Http\Middleware\HandleExceptions->viewResponse('components/aler...', Array, 404)
#6 /ho in /home/harrywesterman/stamboomwesterman.net/resources/views/layouts/default.phtml on line 76

Is this something Webtrees-related, or should I tune PHP some more?

Greetings,
Harry

@fisharebest
Owner

> So I put pcre.jit=0 in my php.ini

According to the documentation, pcre.jit=1 only works if the PCRE library on your server is compiled with JIT support.

I read that the default is 0 on most linux distributions.

> and now I got another error

You get this error when? Every page? The sitemap page?

@harrywesterman
Author

> > So I put pcre.jit=0 in my php.ini
>
> According to the documentation, pcre.jit=1 only works if the PCRE library on your server is compiled with JIT support.
>
> I read that the default is 0 on most linux distributions.

Just a shared webhoster that tries to limit their users... I asked for an upgrade to a VPS already.

> > and now I got another error
>
> You get this error when? Every page? The sitemap page?

Only when asking for the sitemap pages. The rest of Webtrees works great. The error is only in the php.log, you get HTTP error 500 returncode, but the browser renders the sitemap OK.

@fisharebest
Owner

> Only when asking for the sitemap pages.

That error is in the default page template - which isn't used for the sitemap files.

So, this sounds odd.

> you get HTTP error 500 returncode, but the browser renders the sitemap OK.

Are you 100% certain that the 500 response is for the sitemap?

You should either get a valid sitemap with a 200 response - or an error page with a 500.

@harrywesterman
Author

Yes, I can trigger it 100% with https://botsimulator.com with the url https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xml

But I have a workaround now: I downloaded all the sitemaps, saved them as local XML files on my web server, and pointed my robots.txt and Google Search to https://stamboomwesterman.net/sitemaplocal.xml. At least my website can be found with Google and Bing :-)
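
A workaround like this could be partly scripted. The Python sketch below is only an illustration under assumptions: the helper names `local_name` and `localize_index` and the file-naming scheme (the `route` parameter without its leading slash) are invented here, and downloading the child sitemaps themselves is left out. It rewrites every `<loc>` in a sitemap index so that it points at a static copy.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def local_name(sitemap_url):
    """Derive a flat file name from the route parameter (naming scheme is an assumption)."""
    parts = urlsplit(sitemap_url)
    route = parse_qs(parts.query).get("route", [""])[0]
    # Fall back to the last path segment when there is no route parameter.
    return route.lstrip("/") or parts.path.rsplit("/", 1)[-1]

def localize_index(index_xml, base_url):
    """Point every <loc> at a static copy under base_url.

    Returns (new_xml, mapping) where mapping is {original_url: local_file_name}.
    """
    ET.register_namespace("", SITEMAP_NS)  # keep the default namespace on output
    root = ET.fromstring(index_xml)
    mapping = {}
    for loc in root.iter("{%s}loc" % SITEMAP_NS):
        original = loc.text.strip()
        mapping[original] = local_name(original)
        loc.text = base_url.rstrip("/") + "/" + mapping[original]
    return ET.tostring(root, encoding="unicode"), mapping
```

The returned mapping lists which dynamic URL would need to be saved under which file name. Static copies go stale as the tree changes, so they would have to be regenerated from time to time.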

Let's just say that my webhoster did something funky.

@fisharebest
Owner

I can set the User Agent using a browser plugin.
I can pretend to be a googlebot, and I can fetch https://www.stamboomwesterman.net/index.php?route=%2Fsitemap.xml OK

The botsimulator uses the same User Agent.
It shows that it has fetched the page contents correctly.
But the status code is 500 instead of 200.

> Let's just say that my webhoster did something funky.

There is something strange happening on your server...

@harrywesterman
Author

Let's close it, way too much energy in this now :-D Thanks for all your help!!
