
Handle recursive sitemaps #7

Open
GokulNC opened this issue Oct 24, 2019 · 4 comments

GokulNC commented Oct 24, 2019

Some sitemaps recursively contain other sitemaps. For instance:
https://www.dailythanthi.com/Sitemap/Sitemap.xml

But these nested sitemaps may or may not comply with the sitemap format.
An example of a recursive sitemap that does comply with the format:
https://hindi.news18.com/sitemap.xml

Todo:
We should extract the page URLs (http) from these nested sitemaps, as sketched below.
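
A minimal sketch of what that could look like with just the Python standard library (collect_page_urls is a hypothetical helper, and the regex fallback for non-compliant files is only a rough approximation):

import re
import urllib.request
import xml.etree.ElementTree as ET

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def collect_page_urls(sitemap_url, depth=0, max_depth=5):
    # Guard against pathologically deep or cyclic sitemap nesting
    if depth > max_depth:
        return []
    with urllib.request.urlopen(sitemap_url) as resp:
        body = resp.read()
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Non-compliant "sitemap": fall back to scraping raw http(s) links
        return re.findall(r'https?://[^\s<>"\']+', body.decode('utf-8', 'replace'))
    if root.tag == NS + 'sitemapindex':
        # A sitemap index: each <loc> points to another (nested) sitemap
        urls = []
        for loc in root.iter(NS + 'loc'):
            urls.extend(collect_page_urls(loc.text.strip(), depth + 1, max_depth))
        return urls
    # A plain <urlset>: each <loc> is a page URL
    return [loc.text.strip() for loc in root.iter(NS + 'loc')]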


GokulNC commented Oct 24, 2019

Found this interesting package: ultimate_sitemap_parser.

My code snippet to get the list of all URLs from all sitemaps on a website:

from usp.tree import sitemap_tree_for_homepage

def get_all_urls_from_all_sitemaps(website):
    # Discovers the site's sitemaps (via robots.txt and known locations)
    # and walks the whole tree, including nested sitemap indexes
    tree = sitemap_tree_for_homepage(website)
    return [page.url for page in tree.all_pages()]

urls = get_all_urls_from_all_sitemaps('https://www.dailythanthi.com/')
print(len(urls))

It gave me around 7000 article links. Pretty cool, actually!
We can just use all these URLs to crawl all the articles.


GokulNC commented Oct 24, 2019

The above code first looks for the website's robots.txt, so this might also be a possible solution for issue #6.
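
For reference, the robots.txt lookup can also be done with the standard library alone (Python 3.8+); this is just an illustration of the mechanism, not necessarily how ultimate_sitemap_parser does it internally:

from urllib.robotparser import RobotFileParser

# site_maps() (Python 3.8+) returns the Sitemap: entries declared in
# robots.txt, or None if there are none
rp = RobotFileParser('https://www.dailythanthi.com/robots.txt')
rp.read()
print(rp.site_maps())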

divkakwani (Owner) commented:

This seems like a useful thing, but I was wondering in which cases it will be useful, since we have decided to use recursive crawls now. I have not seen many sources where the first-level sitemap is fine but the deeper levels are broken. Do you reckon this can be useful?


GokulNC commented Dec 23, 2019

Sure, we'll close it then. But we'll also do the following:

  • After the recursive crawls are complete/exhausted, we'll augment our data with the articles from the sitemaps (if they're not already covered by the recursive crawls). We may use the above library for that, along the lines of the sketch below.
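
A minimal sketch of that augmentation step, assuming the recursive crawl's visited URLs are available as a set (load_crawled_urls is a hypothetical placeholder for however the crawler stores them):

from usp.tree import sitemap_tree_for_homepage

def load_crawled_urls():
    # Hypothetical placeholder: read the recursive crawl's visited-URL store
    return set()

def sitemap_articles_not_yet_crawled(website):
    tree = sitemap_tree_for_homepage(website)
    # Set difference keeps only the sitemap articles the crawl missed
    return sorted({page.url for page in tree.all_pages()} - load_crawled_urls())

missing = sitemap_articles_not_yet_crawled('https://www.dailythanthi.com/')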
