Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance fix for Yoast sitemaps #12161

Open
dhilditch opened this issue Feb 5, 2019 · 16 comments

Comments

@dhilditch
Copy link

@dhilditch dhilditch commented Feb 5, 2019

  • I've read and understood the contribution guidelines.
  • I've searched for any related issues and avoided creating a duplicate issue.

Please give us a description of what happened.

The performance of Yoast sitemaps is hurting servers on larger websites.

With a client of mine, they have 1.4 million products, and Yoast is causing table-scans to happen on every segment of data loaded. It's good that Yoast builds the sitemap in chunks, but it's bad that table scans happen for each of these chunks.

The table scans are causing sitemaps to not work on large websites, and worse, they're hurting other user queries.

Please describe what you expected to happen and why.

I expect index-seeks and good performance regardless of how large websites get.

I expect the sitemaps to work.

How can we reproduce this behavior?

  1. Enable XML sitemaps on Yoast settings page on million+ item store
  2. View sitemap.xml page
  3. View slow query log

Technical info

The problem is two-fold:

  1. There is a post_date != '00-00-00 00:00:00' query added to the where clause. This prevents use of a sorted index column which is required for the pagination.

  2. There is no supporting index to avoid sorting operations.

I've posted a full analysis here:

https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/

But basically:

  1. Remove this post_date != '00-00-00 00:00:00' part of the query from this file: https://github.com/Yoast/wordpress-seo/blob/a1628d1570a8e8cca896312509ac986182836c30/inc/sitemaps/class-post-type-sitemap-provider.php

  2. Add a supporting index to wp_posts so sorts are not required on every windowed-page request:

create index wpi_scalability_pro_sitemaps on wp_posts (post_status, post_password, post_type, post_modified)

  • If relevant, which editor is affected (or editors):
  • Classic Editor
  • Gutenberg
  • Classic Editor plugin
  • Which browser is affected (or browsers):
  • Chrome
  • Firefox
  • Safari
  • Other

Used versions

  • WordPress version: 5.02
  • Yoast SEO version: Premium 7.9 and Free 9.5
  • Gutenberg plugin version:
  • Classic Editor plugin version:
  • Relevant plugins in case of a bug:
  • Tested with theme: Flatmarket (but not relevant)
@Rarst

This comment has been minimized.

Copy link
Contributor

@Rarst Rarst commented Feb 5, 2019

Continuing our discussion from your post, in the past I worked on refactoring of sitemaps module and specifically paid close attention to feedback on performance at the scale you are talking about.

For the record I am not a DB specialist, so my experience with this is combination of my own observation, extensive feedback from plugin's users, and some reaching out to people who are specialists.

To reiterate your claim from the post is that main query for the sitemap post page is unoptimized and doesn't scale for large amounts of posts.

The query in question is (you chose to run a subquery part of it in your post):

SELECT l.ID, post_title, post_content, post_name, post_parent, post_author, post_modified_gmt, post_date, post_date_gmt
FROM (
	SELECT wp_posts.ID
	FROM wp_posts
	WHERE wp_posts.post_status = 'publish' AND wp_posts.post_type = 'test' AND wp_posts.post_password = '' AND wp_posts.post_date != '0000-00-00 00:00:00'
	ORDER BY wp_posts.post_modified ASC
	LIMIT 100 OFFSET 0
) o
JOIN wp_posts l ON l.ID = o.ID

The main challenge of SQL for sitemaps is that MySQL doesn't scale well for OFFSET operations. The more rows you have the more rows it walks through to reach your offset. The queries in the plugin had been very specifically written and audited to account for that.

I didn't have cool 1M posts on hand, but I generated 92K which is more than sufficient to illustrate the rate of performance drop off:

  • at offset 0 the query executes for me in ~1.290s
  • at offset 90,000 the query executes for me in ~1.340s

So with a considerable offset performance dropped of by 50 milliseconds, while your post claims performance drop off of multiple minutes (at larger sizes, but certainly not consistent with the pace I observe).

Further, removing AND wp_posts.post_date != '0000-00-00 00:00:00' part of query, you claim to be problematic and have major impact on performance of the query, makes zero difference to the run time for me.

Do I believe that you observe poor performance at a large site? Sure, that is certainly possible.

Do I believe that this query is inherently problematic? No, not from my experience or any feedback I've gotten before.

I suspect the question to explore here is why that specific server runs this query slowly, not what's wrong with the query.

@dhilditch

This comment has been minimized.

Copy link
Author

@dhilditch dhilditch commented Feb 6, 2019

Hi again - you keep using the word 'claim'. Should I record a video for you so you can see these are not claims, they are facts?

@dhilditch

This comment has been minimized.

Copy link
Author

@dhilditch dhilditch commented Feb 6, 2019

For the removal of that part of the query - you need to remove that AND add the index. If you only add the index, you won't see the boost, if you only remove that part of the query you won't see the boost. You need to do both.

@dhilditch

This comment has been minimized.

Copy link
Author

@dhilditch dhilditch commented Feb 6, 2019

And as for the 'dropoff' - I think you'll find both those queries are just doing full table scans and hence no difference in speed.

@dhilditch

This comment has been minimized.

Copy link
Author

@dhilditch dhilditch commented Feb 6, 2019

Finally -

I suspect the question to explore here is why that specific server runs this query slowly, not what's wrong with the query.

I optimised the server. There's nothing now wrong with the server stack. There were issues with it before, but not now.

@Rarst

This comment has been minimized.

Copy link
Contributor

@Rarst Rarst commented Feb 6, 2019

Hi again - you keep using the word 'claim'. Should I record a video for you so you can see these are not claims, they are facts?

I didn't mean this dismissively, I apologize for it coming across like that. I believe your observed results, though I disagree on conclusions so far (the need to change the query and add custom index).

For the removal of that part of the query - you need to remove that AND add the index. If you only add the index, you won't see the boost, if you only remove that part of the query you won't see the boost. You need to do both.

Gotcha. From point of view of plugin in general that would have two problems:

  1. The query is no longer functionally equivalent (which would need to be addressed otherwise).
  2. When I was working on this the priority was using native WP indexes for queries. Custom index would be challenging to reliably ship for general use case.

I optimised the server. There's nothing now wrong with the server stack.

I can't speak for your server stack, I can only say that there is extensive feedback on record from large sites on sitemaps to the contrary and it performing sufficiently well. As such I would suspect individual issue with environment or otherwise, not a systematic one with implementation.

Since I am no longer involved with development, I think I'll shut up now. :) Thank you for taking the time to research and report and I (always) hope there are opportunities to improve sitemap performance for general case, it was my pet module. :)

@matrixpoland

This comment has been minimized.

Copy link

@matrixpoland matrixpoland commented Feb 27, 2019

I have problem too with high CPU usage. Maybe because of sitemaps. I have aproxx 250 pages

@joelkarunungan

This comment has been minimized.

Copy link

@joelkarunungan joelkarunungan commented Mar 26, 2019

This issue occurs for me and can be observed in large sites with several hundred thousand pages simply by visiting the sitemap_index.xml page. It is loading very slow. Its trivial to insert 1M+ records in the wp_posts table or using the WP API to replicate this. There is a workaround I found: https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/

@redvivi

This comment has been minimized.

Copy link

@redvivi redvivi commented May 18, 2019

Hello @joelkarunungan,

I have the same issue and highly penalising regarding loading time (15 minutes per sitemap).

Have you recently tested https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/ solution ?

I tried to implement it, however the command line is failing, executing:
cd /srv/www/example.com && /usr/local/bin/wp eval '$sm = new WPSEO_Sitemaps;$sm->build_root_map();$sm->output();' > /srv/www/example.com/wp-content/uploads/sitemap_index.xml
Throws
PHP Notice: Undefined index: HTTP_X_FORWARDED_PROTO in phar:///home/wp/vendor/wp-cli/wp-cli/php/WP_CLI/Runner.php(1169) : eval()'d code on line 72 PHP Warning: Invalid argument supplied for foreach() in /var/www/wordpress/wp-content/plugins/wordpress-seo/inc/sitemaps/class-sitemaps.php on line 388 Warning: Invalid argument supplied for foreach() in /var/www/wordpress/wp-content/plugins/wordpress-seo/inc/sitemaps/class-sitemaps.php on line 388

Am I missing something?

@joelkarunungan

This comment has been minimized.

Copy link

@joelkarunungan joelkarunungan commented May 21, 2019

@redvivi first solution I ended up using is @dhilditch solution:

https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/ > Yoast SEO XML Sitemaps performance (this is same as what he mentioned above in this issue thread)

I eventually dropped Yoast's sitemap and used Google XML Sitemaps as mentioned in the link as well. This was the best solution for me despite the bugs with the 3 year old plugin.

Hoping though that someone would adopt the Google XML Sitemaps plugin through WP official so the plugin gets updated and its bugs fixed since its performance really outperforms Yoast's sitemap generator.

@redvivi

This comment has been minimized.

Copy link

@redvivi redvivi commented May 21, 2019

@joelkarunungan, thanks for your feedback.
Unfortunately, @dhilditch does not change the performance issue I am facing, I guess the data I want to sitemap does not apply in my case.

It is quite disappointing that there is not a solution from Yoast.

@Uranbold

This comment has been minimized.

Copy link

@Uranbold Uranbold commented May 27, 2019

Also got this problem with 500k+ posts. I think it's Google fetching the posts and related to XML sitemap 100%.

SELECT l.ID, post_title, post_content, post_name, post_parent, post_author, post_modified_gmt, post_date, post_date_gmt
FROM (
SELECT wp_posts.ID
FROM wp_posts
WHERE wp_posts.post_status = 'publish'
AND wp_posts.post_type = 'post'
AND wp_posts.post_password = ''
AND wp_posts.post_date != '0000-00-00 00:00:00'
ORDER BY wp_posts.post_modified ASC LIMIT 100 OFFSET 381500
)
o JOIN wp_posts l ON l.ID = o.ID

@SergeyBiryukov

This comment has been minimized.

Copy link
Member

@SergeyBiryukov SergeyBiryukov commented Jun 25, 2019

There is a post_date != '00-00-00 00:00:00' query added to the where clause.

For reference, post_date != '0000-00-00 00:00:00' and post_author != 0 were added in 3cdcfc9.

post_author != 0 was later removed in 0845fc9.

@im1981

This comment has been minimized.

Copy link

@im1981 im1981 commented Sep 10, 2019

Hello, i also had massive performance issues with the yoast sitemaps (site over 350.000 posts), adding the index and removing the post_date != '0000-00-00 00:00:00' improved performance a lot.
Anyway i just build my own static file cache.

@roryroryro

This comment has been minimized.

Copy link

@roryroryro roryroryro commented Oct 7, 2019

Hello
Me too adding index in database and removing post_date improved speed a lot.
However i still have the problem for custom taxonomy. I try to find the post_date in inc/sitemaps/class-taxonomy-sitemap-provider.php but cannot find.
Any idea of how I can fix that in a similar way for custom taxonomy?
Thank you

@amboutwe

This comment has been minimized.

Copy link
Member

@amboutwe amboutwe commented Nov 6, 2019

Please inform the customer of conversation # 555596 when this conversation has been closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.