
Crawl sitemaps #632

Closed
john-shaffer opened this issue Aug 13, 2020 · 9 comments
@john-shaffer
Contributor

WP 5.5 includes a sitemap by default. It should be pretty easy to parse and crawl the sitemaps, but we need some changes to allow adding new URLs during a crawl.

@leonstafford
Contributor

@fromcouch - this may be a nice issue to work on for you? If you want it, feel free to assign yourself and let me know if any questions

@fromcouch
Contributor

Yes, I will review it.

@fromcouch
Contributor

@leonstafford It seems that implementing the default WordPress sitemap isn't a good idea; there are a lot of problems with multisite and multilanguage setups. You can see the comments here:
https://make.wordpress.org/core/2020/07/22/new-xml-sitemaps-functionality-in-wordpress-5-5/

Instead, I could detect whether sitemap.xml exists and parse it (or ask Yoast or another plugin directly for the URL list).
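The detect-and-parse step can be quite small, since a sitemap is plain XML in the sitemaps.org namespace. A minimal Python sketch of the idea (illustrative only; the plugin itself is PHP, and `parse_sitemap_urls` is a hypothetical name, not WP2Static code):

```python
import xml.etree.ElementTree as ET

# All sitemap elements live in this XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap_urls(xml_text):
    """Extract the <loc> values from a flat <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

print(parse_sitemap_urls(sample))
```

In the plugin, the XML text would come from an HTTP fetch of the site's sitemap URL; if that fetch 404s, the detector simply skips sitemap-based URL detection.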

@leonstafford
Contributor

Yes, detecting and parsing the sitemap if it exists sounds right.

I used to have some code in place to detect Yoast sitemaps specifically.

We can add those to a list of "detect if exists and parse" files, including:

  • ads.txt
  • robots.txt
  • all common sitemaps

@thegulshankumar

thegulshankumar commented Aug 29, 2020

Only /sitemap.xml is getting crawled; the paths below are missed:

RankMath / Yoast

/main-sitemap.xsl
/sitemap_index.xml
/post-sitemap.xml
/post-sitemap2.xml
/page-sitemap.xml

Ideally, I would expect something like this:
Crawl /sitemap.xml.
If a redirect is detected, follow that path.
Then crawl whatever sitemap files are detected.

EDIT: I reported this issue in a different thread. This finding was for Static HTML Output, not wp2static.
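The crawl order described above falls out naturally if you branch on the sitemap's root element: Yoast's /sitemap_index.xml has a `<sitemapindex>` root whose `<loc>` entries point at child sitemaps such as /post-sitemap.xml, while a flat `<urlset>` lists page URLs directly. A hedged Python sketch of that recursion (`collect_page_urls` and the injected `fetch` callback are hypothetical names, not WP2Static code):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_page_urls(xml_text, fetch):
    """Return page URLs from a sitemap, recursing into child
    sitemaps when the root element is a <sitemapindex>."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    if root.tag == NS + "sitemapindex":
        # Each <loc> is itself a sitemap: fetch and recurse.
        pages = []
        for child_url in locs:
            pages.extend(collect_page_urls(fetch(child_url), fetch))
        return pages
    # Flat <urlset>: the <loc> values are the page URLs.
    return locs

index = """<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""

child = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/hello-world/</loc></url>
</urlset>"""

# A dict lookup stands in for an HTTP fetch in this sketch.
fetch = {"https://example.com/post-sitemap.xml": child}.get
print(collect_page_urls(index, fetch))
```

Redirects (e.g. /sitemap.xml redirecting to /sitemap_index.xml) would be handled by the HTTP client behind `fetch`, not by the parser.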

@fromcouch
Contributor

Maybe it will be easier if we read the sitemap URL from robots.txt and, if none is declared there, search for:

  • sitemap.xml
  • sitemap_index.xml
  • wp-sitemap.xml

And then crawl ...
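That discovery order can be sketched as: collect `Sitemap:` directives from robots.txt first, then fall back to probing the common default paths listed above (WordPress 5.5 core serves its sitemap at /wp-sitemap.xml). A minimal Python sketch, assuming hypothetical names (`discover_sitemaps` is not WP2Static's API):

```python
# Common fallback locations when robots.txt declares no sitemap.
DEFAULT_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
]

def discover_sitemaps(robots_txt, base_url):
    """Return candidate sitemap URLs: Sitemap: directives from
    robots.txt if present, otherwise common default locations."""
    declared = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]
    if declared:
        return declared
    return [base_url.rstrip("/") + path for path in DEFAULT_SITEMAP_PATHS]

robots = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap_index.xml\n"
print(discover_sitemaps(robots, "https://example.com"))
```

`Sitemap:` directives are independent of any `User-agent:` group and may appear anywhere in robots.txt, which is why the sketch scans every line rather than parsing groups.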

@fromcouch
Contributor

@leonstafford I have a problem here. When I read sitemaps, I get a list of all URLs. This means I can't respect the checkboxes that ask to detect posts, etc.

Maybe we could add a configuration checkbox called "Use Sitemaps" that deactivates:

Detect Custom Post Types
Detect Pages
Detect Posts
Detect Uploads

Let me know ...

@leonstafford
Contributor

@fromcouch - ah, good point!

I'd like to see a toggleable "Use Sitemaps" option, which is on by default. We can allow users to check both sitemaps and any other detection options, adding a warning in the Export Log that:

You've selected Sitemaps and other detection options, which may result in more URLs detected than you expected

Detecting too much is usually preferable to not detecting enough, especially when users can go back and adjust settings to limit detection.

fromcouch mentioned this issue Sep 7, 2020
@leonstafford
Contributor

This functionality should already be merged in.


4 participants