WIP: docs(academy-advanced-crawling): commit my unfinished first articles #490

Draft · wants to merge 1 commit into base: master
18 changes: 14 additions & 4 deletions content/academy/advanced_web_scraping.md
@@ -1,6 +1,6 @@
---
title: Advanced web scraping
description: Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
menuWeight: 6
category: web scraping & automation
paths:
@@ -9,11 +9,21 @@ paths:

# Advanced web scraping

In this course, we'll be tackling some of the most challenging and advanced web-scraping cases, such as mobile-app scraping, scraping sites with limited pagination, and handling large-scale cases where millions of items are scraped. Are **you** ready to take your scrapers to the next level?
In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, we learned the necessary basics required to create a scraper. In the following courses, we enhanced our scraping toolbox by scraping APIs, using browsers, scraping dynamic websites, understanding website anti-scraping protection and making our code more maintainable by moving to TypeScript.

If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎
In this course, we will take all of that knowledge, add a few more advanced concepts and apply them to learn how to build a production-ready web scraper.

## [](#what-does-production-ready-mean) What does production-ready mean?

Of course, there is no single universal definition of what a production-ready system is. Different companies and use cases will place different priorities on the project. But in general, a production-ready system is stable, reliable, scalable, performant, observable and maintainable.

<!-- Just like the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, this course is divided into two main sections: **Data collection** and **Crawling**. -->
The following sections will cover the core concepts that will ensure that your scraper is production-ready:
- The advanced crawling section will cover how to ensure that we find all pages or products on the website.
- The advanced data extraction section will cover how to efficiently extract data from a particular page or API.

Both of these sections will include guides for monitoring, performance, anti-scraping protections and debugging.

If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎

## [](#first-up) First up

@@ -0,0 +1,59 @@
---
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
paths:
- advanced-web-scraping/crawling/crawling-sitemaps
---

In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will take an in-depth look at how to crawl sitemaps.

We will look at the following topics:
- How to find sitemap URLs
- How to set up HTTP requests to download sitemaps
- How to parse URLs from sitemaps

## [](#how-to-find-sitemap-urls) How to find sitemap URLs
Sitemaps are commonly restricted to a maximum of 50,000 URLs, so larger websites usually have a whole list of them. There can be a master sitemap (a sitemap index) containing the URLs of all other sitemaps, or the sitemaps might simply be listed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
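
For illustration, a sitemap index looks roughly like this (the URLs below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://example.com/sitemap1.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://example.com/sitemap2.xml</loc>
    </sitemap>
</sitemapindex>
```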

### [](#google) Google
You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on whether the website lets Google index the sitemap file itself, which is rather uncommon.

### [](#robots-txt) robots.txt
If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.
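
As a rough sketch, you can download robots.txt and pull out the `Sitemap:` lines yourself. This assumes Node.js 18+ with the built-in `fetch`; the domain is just a placeholder:

```js
// Download robots.txt and extract the sitemap URLs listed
// under the Sitemap: directive. The domain is a placeholder.
const response = await fetch('https://example.com/robots.txt');
const robotsTxt = await response.text();

const sitemapUrls = robotsTxt
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('sitemap:'))
    .map((line) => line.slice('sitemap:'.length).trim());

console.log(sitemapUrls);
```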

### [](#common-url-paths) Common URL paths
You can try to iterate over common URL paths like:
```
/sitemap.xml
/product_index.xml
/product_template.xml
/sitemap_index.xml
/sitemaps/sitemap_index.xml
/sitemap/product_index.xml
/media/sitemap.xml
/media/sitemap/sitemap.xml
/media/sitemap/index.xml
```

Also make sure to test the list with `.gz`, `.tar.gz` and `.tgz` extensions and with capitalized variants of the words (e.g. `/Sitemap_index.xml.tar.gz`).

Some websites also provide an HTML version to help indexing bots find new content. Those include:

```
/sitemap
/category-sitemap
/sitemap.html
/sitemap_index
```
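
As a rough sketch, you can probe these candidate paths with plain HTTP requests and keep the ones that respond successfully. The path list and domain below are just placeholders, and some sites return a 200 page even for missing files, so treat the results as hints:

```js
// Probe a list of candidate sitemap paths and return the ones
// that respond with a successful status code. Placeholders only.
const CANDIDATE_PATHS = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemaps/sitemap_index.xml',
    '/sitemap.xml.gz',
    '/Sitemap_index.xml',
];

const findSitemaps = async (baseUrl) => {
    const found = [];
    for (const path of CANDIDATE_PATHS) {
        const url = new URL(path, baseUrl).toString();
        try {
            const response = await fetch(url);
            if (response.ok) found.push(url);
        } catch (error) {
            // A network error just means this candidate does not exist.
        }
    }
    return found;
};

console.log(await findSitemaps('https://example.com'));
```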

Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans these URL variations automatically for you so that you don't have to check them manually.

## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps
For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. [This article]({{@link node-js/parsing_compressed_sitemaps}}) describes a step-by-step guide on how to handle that.
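
As a minimal sketch of the simple (non-streamed) case, assuming Node.js 18+ `fetch` and the built-in `zlib` module, a gzipped sitemap could be downloaded and decompressed like this (the URL is a placeholder):

```js
import { gunzipSync } from 'node:zlib';

// Download a gzipped sitemap and decompress it into plain XML text.
// Very large sitemaps are better handled with streams instead.
const response = await fetch('https://example.com/sitemap.xml.gz');
const compressed = Buffer.from(await response.arrayBuffer());
const xml = gunzipSync(compressed).toString('utf-8');

console.log(`Downloaded ${xml.length} characters of XML`);
```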

## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps
The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact` or various special category sections). [This article]({{@link node-js/scraping-from-sitemaps}}) provides code examples for parsing sitemaps.
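
Here is a minimal sketch of that approach. It assumes Cheerio is installed, and the sitemap URL and skip patterns are placeholders:

```js
import * as cheerio from 'cheerio';

// Download a plain (non-compressed) sitemap and collect the URLs
// from its <loc> tags, skipping pages we don't want to crawl.
const response = await fetch('https://example.com/sitemap.xml');
const xml = await response.text();

const $ = cheerio.load(xml, { xmlMode: true });

const skipPatterns = ['/about', '/contact'];
const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    .filter((url) => !skipPatterns.some((pattern) => url.includes(pattern)));

console.log(`Found ${urls.length} URLs to crawl`);
```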

## [](#next) Next up
That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic: search, filters and pagination.
@@ -1,12 +1,14 @@
---
title: Scraping paginated sites
title: Crawling with search I
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 1
menuWeight: 3
paths:
- advanced-web-scraping/scraping-paginated-sites
- advanced-web-scraping/crawling/crawling-with-search-i
---

# Scraping websites with limited pagination
# Scraping websites with search I

In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination.

Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

@@ -18,7 +20,7 @@ Limited pagination is a common practice on e-commerce sites and is becoming more

Websites usually limit the pagination of a single (sub)category to somewhere between 1,000 to 20,000 listings. The site might have over a million listings in total. Without a proven algorithm, it will be very manual and almost impossible to scrape all listings.

We will first look at a couple ideas that don't work so well and then present the [final robust solution](#using-filter-ranges).
We will first look at a couple of ideas that might cross our mind but don't work so well, and then present the [most robust solution](#using-filter-ranges).

### [](#going-deeper-into-subcategories) Going deeper into subcategories

@@ -278,9 +280,10 @@ for (const filter of newFilters) {
await crawler.addRequests(requestsToEnqueue);
```

Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).

## [](#summary) Summary

And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data]({{@link expert_scraping_with_apify/saving_useful_stats.md}}). This will let you know what filters you went through and how many products each of them had.
And that's it. We have an elegant and simple solution for a complicated problem. In the next lesson, we will explore how to refine this algorithm and apply it to bigger use cases like scraping APIs.

Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).
