
Crawl sitemaps #632

Closed
john-shaffer opened this issue Aug 13, 2020 · 9 comments
@john-shaffer
Contributor

WP 5.5 includes a sitemap by default. It should be pretty easy to parse and crawl the sitemaps, but we need some changes to allow adding new URLs during a crawl.

@leonstafford
Contributor

@fromcouch - this may be a nice issue to work on for you? If you want it, feel free to assign yourself and let me know if any questions

@fromcouch
Contributor

Yes, I will review it.

@fromcouch
Contributor

@leonstafford It seems that implementing the default WordPress sitemap isn't a good idea; there are a lot of problems with multisite and multilanguage setups. You can see the comments here:
https://make.wordpress.org/core/2020/07/22/new-xml-sitemaps-functionality-in-wordpress-5-5/

Instead, I could detect whether sitemap.xml exists and parse it (or ask Yoast or another plugin directly for the URL list).
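The detect-and-parse step can be quite small, since a sitemap is plain XML in the sitemaps.org namespace. A minimal Python sketch of the idea (illustrative only; the plugin itself is PHP, and `parse_sitemap_urls` is a hypothetical name, not WP2Static code):

```python
import xml.etree.ElementTree as ET

# All sitemap elements live in this XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap_urls(xml_text):
    """Extract the <loc> values from a flat <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

print(parse_sitemap_urls(sample))
```

In the plugin, the XML text would come from an HTTP fetch of the site's sitemap URL; if that fetch 404s, the detector simply skips sitemap-based URL detection.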

@leonstafford
Contributor

Yes, detecting and parsing the sitemap if it exists sounds right.

I used to have some code in place to detect Yoast sitemaps specifically.

We can add those to a list of "detect if exists and parse" files, including:

  • ads.txt
  • robots.txt
  • all common sitemaps

@thegulshankumar

thegulshankumar commented Aug 29, 2020

Only /sitemap.xml is getting crawled; the paths below are missed:

RankMath / Yoast

/main-sitemap.xsl
/sitemap_index.xml
/post-sitemap.xml
/post-sitemap2.xml
/page-sitemap.xml

Ideally, I would expect something like this:
Crawl /sitemap.xml.
If a redirect is detected, follow that path.
Then crawl whatever sitemap files are detected.

EDIT: I reported this issue in a different thread. This finding was for Static HTML Output, not wp2static.
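The crawl order described above falls out naturally if you branch on the sitemap's root element: Yoast's /sitemap_index.xml has a `<sitemapindex>` root whose `<loc>` entries point at child sitemaps such as /post-sitemap.xml, while a flat `<urlset>` lists page URLs directly. A hedged Python sketch of that recursion (`collect_page_urls` and the injected `fetch` callback are hypothetical names, not WP2Static code):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_page_urls(xml_text, fetch):
    """Return page URLs from a sitemap, recursing into child
    sitemaps when the root element is a <sitemapindex>."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    if root.tag == NS + "sitemapindex":
        # Each <loc> is itself a sitemap: fetch and recurse.
        pages = []
        for child_url in locs:
            pages.extend(collect_page_urls(fetch(child_url), fetch))
        return pages
    # Flat <urlset>: the <loc> values are the page URLs.
    return locs

index = """<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""

child = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/hello-world/</loc></url>
</urlset>"""

# A dict lookup stands in for an HTTP fetch in this sketch.
fetch = {"https://example.com/post-sitemap.xml": child}.get
print(collect_page_urls(index, fetch))
```

Redirects (e.g. /sitemap.xml redirecting to /sitemap_index.xml) would be handled by the HTTP client behind `fetch`, not by the parser.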

@fromcouch
Contributor

Maybe it will be easier if we read the sitemap URL from robots.txt and, if none is declared there, search for:

  • sitemap.xml
  • sitemap_index.xml
  • wp-sitemap.xml

And then crawl ...
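That discovery order can be sketched as: collect `Sitemap:` directives from robots.txt first, then fall back to probing the common default paths listed above (WordPress 5.5 core serves its sitemap at /wp-sitemap.xml). A minimal Python sketch, assuming hypothetical names (`discover_sitemaps` is not WP2Static's API):

```python
# Common fallback locations when robots.txt declares no sitemap.
DEFAULT_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
]

def discover_sitemaps(robots_txt, base_url):
    """Return candidate sitemap URLs: Sitemap: directives from
    robots.txt if present, otherwise common default locations."""
    declared = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]
    if declared:
        return declared
    return [base_url.rstrip("/") + path for path in DEFAULT_SITEMAP_PATHS]

robots = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap_index.xml\n"
print(discover_sitemaps(robots, "https://example.com"))
```

`Sitemap:` directives are independent of any `User-agent:` group and may appear anywhere in robots.txt, which is why the sketch scans every line rather than parsing groups.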

@fromcouch
Contributor

@leonstafford I have a problem here. When I read sitemaps, I get a list of all URLs. This means I can't respect the checkboxes that ask to detect posts, etc.

Maybe we could add a configuration checkbox called "Use Sitemaps" that deactivates:

Detect Custom Post Types
Detect Pages
Detect Posts
Detect Uploads

Let me know ...

@leonstafford
Contributor

@fromcouch - ah, good point!

I'd like to see a toggleable "Use Sitemaps" option, which is on by default. We can allow users to check both sitemaps and any other detection options, adding a warning in the Export Log that:

You've selected Sitemaps and other detection options, which may result in more URLs detected than you expected

Detecting too much is usually preferable to not detecting enough, especially when users can go back and adjust settings to limit detection.

fromcouch mentioned this issue Sep 7, 2020
@leonstafford
Contributor

This functionality should already be merged in.


4 participants