WIP: docs(academy-advanced-crawling): commit my unfinished first articles #490

Draft · wants to merge 1 commit into base: master
18 changes: 14 additions & 4 deletions content/academy/advanced_web_scraping.md
@@ -1,6 +1,6 @@
---
title: Advanced web scraping
description: Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
menuWeight: 6
category: web scraping & automation
paths:
@@ -9,11 +9,21 @@ paths:

# Advanced web scraping

In this course, we'll be tackling some of the most challenging and advanced web-scraping cases, such as mobile-app scraping, scraping sites with limited pagination, and handling large-scale cases where millions of items are scraped. Are **you** ready to take your scrapers to the next level?
In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, we learned the necessary basics required to create a scraper. In the following courses, we enhanced our scraping toolbox by scraping APIs, using browsers, scraping dynamic websites, understanding website anti-scraping protection and making our code more maintainable by moving to TypeScript.

If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎
In this course, we will take all of that knowledge, add a few more advanced concepts and apply them to learn how to build a production-ready web scraper.

## [](#what-does-production-ready-mean) What does production-ready mean?

Of course, there is no single universal definition of what a production-ready system is. Different companies and use cases will place different priorities on the project. But in general, a production-ready system is stable, reliable, scalable, performant, observable and maintainable.

<!-- Just like the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, this course is divided into two main sections: **Data collection** and **Crawling**. -->
The following sections will cover the core concepts that will ensure that your scraper is production-ready:
- The advanced crawling section will cover how to ensure that we find all pages or products on the website.
- The advanced data extraction section will cover how to efficiently extract data from a particular page or API.

Both of these sections will include guides for monitoring, performance, anti-scraping protections and debugging.

If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎

## [](#first-up) First up

@@ -0,0 +1,59 @@
---
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
paths:
- advanced-web-scraping/crawling/crawling-sitemaps
---

In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will take an in-depth look at how to crawl sitemaps.

We will look at the following topics:
- How to find sitemap URLs
- How to set up HTTP requests to download sitemaps
- How to parse URLs from sitemaps

## [](#how-to-find-sitemap-urls) How to find sitemap URLs
Sitemaps are commonly restricted to a maximum of 50,000 URLs, so larger websites usually have a whole list of them. There can be a master sitemap (a sitemap index) containing the URLs of all other sitemaps, or the sitemaps might simply be listed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
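
For illustration, a sitemap index looks roughly like this (the URLs below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://example.com/sitemap1.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://example.com/sitemap2.xml</loc>
    </sitemap>
</sitemapindex>
```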

### [](#google) Google
You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on whether the website lets Google index the sitemap file itself, which is rather uncommon.

### [](#robots-txt) robots.txt
If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.
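
As a rough sketch, you can download robots.txt and pull out the `Sitemap:` lines yourself. This assumes Node.js 18+ with the built-in `fetch`; the domain is just a placeholder:

```js
// Download robots.txt and extract the sitemap URLs listed
// under the Sitemap: directive. The domain is a placeholder.
const response = await fetch('https://example.com/robots.txt');
const robotsTxt = await response.text();

const sitemapUrls = robotsTxt
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('sitemap:'))
    .map((line) => line.slice('sitemap:'.length).trim());

console.log(sitemapUrls);
```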

### [](#common-url-paths) Common URL paths
You can try to iterate over common URL paths like:
```
/sitemap.xml
/product_index.xml
/product_template.xml
/sitemap_index.xml
/sitemaps/sitemap_index.xml
/sitemap/product_index.xml
/media/sitemap.xml
/media/sitemap/sitemap.xml
/media/sitemap/index.xml
```

Also make sure to test the list with `.gz`, `.tar.gz` and `.tgz` extensions and with capitalized variants of the words (e.g. `/Sitemap_index.xml.tar.gz`).

Some websites also provide an HTML version to help indexing bots find new content. Those include:

```
/sitemap
/category-sitemap
/sitemap.html
/sitemap_index
```
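
As a rough sketch, you can probe these candidate paths with plain HTTP requests and keep the ones that respond successfully. The path list and domain below are just placeholders, and some sites return a 200 page even for missing files, so treat the results as hints:

```js
// Probe a list of candidate sitemap paths and return the ones
// that respond with a successful status code. Placeholders only.
const CANDIDATE_PATHS = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemaps/sitemap_index.xml',
    '/sitemap.xml.gz',
    '/Sitemap_index.xml',
];

const findSitemaps = async (baseUrl) => {
    const found = [];
    for (const path of CANDIDATE_PATHS) {
        const url = new URL(path, baseUrl).toString();
        try {
            const response = await fetch(url);
            if (response.ok) found.push(url);
        } catch (error) {
            // A network error just means this candidate does not exist.
        }
    }
    return found;
};

console.log(await findSitemaps('https://example.com'));
```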

Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans these URL variations automatically for you so that you don't have to check them manually.

## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps
For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. [This article]({{@link node-js/parsing_compressed_sitemaps}}) describes a step-by-step guide on how to handle that.
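
As a minimal sketch of the simple (non-streamed) case, assuming Node.js 18+ `fetch` and the built-in `zlib` module, a gzipped sitemap could be downloaded and decompressed like this (the URL is a placeholder):

```js
import { gunzipSync } from 'node:zlib';

// Download a gzipped sitemap and decompress it into plain XML text.
// Very large sitemaps are better handled with streams instead.
const response = await fetch('https://example.com/sitemap.xml.gz');
const compressed = Buffer.from(await response.arrayBuffer());
const xml = gunzipSync(compressed).toString('utf-8');

console.log(`Downloaded ${xml.length} characters of XML`);
```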

## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps
The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact` or various special category sections). [This article]({{@link node-js/scraping-from-sitemaps}}) provides code examples for parsing sitemaps.
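
Here is a minimal sketch of that approach. It assumes Cheerio is installed, and the sitemap URL and skip patterns are placeholders:

```js
import * as cheerio from 'cheerio';

// Download a plain (non-compressed) sitemap and collect the URLs
// from its <loc> tags, skipping pages we don't want to crawl.
const response = await fetch('https://example.com/sitemap.xml');
const xml = await response.text();

const $ = cheerio.load(xml, { xmlMode: true });

const skipPatterns = ['/about', '/contact'];
const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    .filter((url) => !skipPatterns.some((pattern) => url.includes(pattern)));

console.log(`Found ${urls.length} URLs to crawl`);
```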

## [](#next) Next up
That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic: search, filters and pagination.
@@ -1,12 +1,14 @@
---
title: Scraping paginated sites
title: Crawling with search I
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 1
menuWeight: 3
paths:
- advanced-web-scraping/scraping-paginated-sites
- advanced-web-scraping/crawling/crawling-with-search-i
---

# Scraping websites with limited pagination
# Scraping websites with search I

In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination.

Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

@@ -18,7 +20,7 @@ Limited pagination is a common practice on e-commerce sites and is becoming more

Websites usually limit the pagination of a single (sub)category to somewhere between 1,000 to 20,000 listings. The site might have over a million listings in total. Without a proven algorithm, it will be very manual and almost impossible to scrape all listings.

We will first look at a couple ideas that don't work so well and then present the [final robust solution](#using-filter-ranges).
We will first look at a couple of ideas that might cross our mind but don't work so well, and then present the [most robust solution](#using-filter-ranges).

### [](#going-deeper-into-subcategories) Going deeper into subcategories

@@ -278,9 +280,10 @@ for (const filter of newFilters) {
await crawler.addRequests(requestsToEnqueue);
```

Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).

## [](#summary) Summary

And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data]({{@link expert_scraping_with_apify/saving_useful_stats.md}}). This will let you know what filters you went through and how many products each of them had.
And that's it. We have an elegant and simple solution for a complicated problem. In the next lesson, we will explore how to refine this algorithm and apply it to bigger use cases like scraping APIs.

Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).
