Commit

Merge branch 'master' into headless-new
B4nan committed May 22, 2023
2 parents 9723ae4 + b1ba108 commit a45e1dc
Showing 61 changed files with 1,375 additions and 912 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test-ci.yml
@@ -182,7 +182,7 @@ jobs:
      run: |
        git config --global user.name 'Apify Release Bot'
        git config --global user.email 'noreply@apify.com'
-       yarn turbo --force copy -- --canary --preid=beta
+       yarn turbo copy --force -- --canary --preid=beta
        git commit -am "chore: bump canary versions [skip ci]"
        echo "access=public" > .npmrc
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,27 @@
All notable changes to this project will be documented in this file.
See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.

## [3.3.2](https://github.com/apify/crawlee/compare/v3.3.1...v3.3.2) (2023-05-11)


### Bug Fixes

* **MemoryStorage:** cache requests in `RequestQueue` ([#1899](https://github.com/apify/crawlee/issues/1899)) ([063dcd1](https://github.com/apify/crawlee/commit/063dcd1c9e6652cd316cc0e8c4f4e4bbb70c246e))
* respect config object when creating `SessionPool` ([#1881](https://github.com/apify/crawlee/issues/1881)) ([db069df](https://github.com/apify/crawlee/commit/db069df80bc183c6b861c9ac82f1e278e57ea92b))


### Features

* allow running single crawler instance multiple times ([#1844](https://github.com/apify/crawlee/issues/1844)) ([9e6eb1e](https://github.com/apify/crawlee/commit/9e6eb1e32f582a8837311aac12cc1d657432f3fa)), closes [#765](https://github.com/apify/crawlee/issues/765)
* **HttpCrawler:** add `parseWithCheerio` helper to `HttpCrawler` ([#1906](https://github.com/apify/crawlee/issues/1906)) ([ff5f76f](https://github.com/apify/crawlee/commit/ff5f76f9336c47c555c28038cdc72dc650bb5065))
* **router:** allow inline router definition ([#1877](https://github.com/apify/crawlee/issues/1877)) ([2d241c9](https://github.com/apify/crawlee/commit/2d241c9f88964ebd41a181069c378b6b7b5bf262))
* RQv2 memory storage support ([#1874](https://github.com/apify/crawlee/issues/1874)) ([049486b](https://github.com/apify/crawlee/commit/049486b772cc2accd2d2d226d8c8726e5ab933a9))
* support alternate storage clients when opening storages ([#1901](https://github.com/apify/crawlee/issues/1901)) ([661e550](https://github.com/apify/crawlee/commit/661e550dcf3609b75e2d7bc225c2f6914f45c93e))





## [3.3.1](https://github.com/apify/crawlee/compare/v3.3.0...v3.3.1) (2023-04-11)


4 changes: 2 additions & 2 deletions docs/introduction/04-pw-w-cheerio.ts
@@ -5,13 +5,13 @@ import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, parseWithCheerio }) => {
        // Wait for the actor cards to render.
-       await page.waitForSelector('.ActorStoreItem');
+       await page.waitForSelector('div[data-test="actorCard"]');
        // Extract the page's HTML from browser
        // and parse it with Cheerio.
        const $ = await parseWithCheerio();
        // Use familiar Cheerio syntax to
        // select all the actor cards.
-       $('.ActorStoreItem').each((i, el) => {
+       $('div[data-test="actorCard"]').each((i, el) => {
            const text = $(el).text();
            console.log(`ACTOR_${i + 1}: ${text}\n`);
        });
4 changes: 2 additions & 2 deletions docs/introduction/04-pw.ts
@@ -5,10 +5,10 @@ import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait for the actor cards to render.
-       await page.waitForSelector('.ActorStoreItem');
+       await page.waitForSelector('div[data-test="actorCard"]');
        // Execute a function in the browser which targets
        // the actor card elements and allows their manipulation.
-       const actorTexts = await page.$$eval('.ActorStoreItem', (els) => {
+       const actorTexts = await page.$$eval('div[data-test="actorCard"]', (els) => {
            // Extract text content from the actor cards
            return els.map((el) => el.textContent);
        });
38 changes: 19 additions & 19 deletions docs/introduction/04-real-world-project.mdx
@@ -19,7 +19,7 @@ We hear you, young padawan! First, learn how to crawl, you must. Only then, walk

## Making a production-grade crawler

-Making a production-grade crawler is not difficult, but there are many pitfalls of scraping that can catch you off guard. So for the real world project you'll learn how to scrape [Apify Store](https://apify.com/store) instead of the Crawlee website. It is a library of scrapers and automations called **actors** that anyone can grab and use for free.
+Making a production-grade crawler is not difficult, but there are many pitfalls of scraping that can catch you off guard. So for the real-world project, you'll learn how to scrape [Apify Store](https://apify.com/store) instead of the Crawlee website. It is a library of scrapers and automations called **Actors** that anyone can grab and use for free.

The website requires JavaScript rendering, which allows us to showcase more features of Crawlee. We've also added some helpful tips that prepare you for the real-world issues that you will surely encounter when scraping at scale.

@@ -42,7 +42,7 @@ For the purposes of this tutorial, let's assume that the website cannot be scrap

## Choosing the data you need

-A good first step is to figure out what data you want to scrape and where to find it. For the time being, let's just agree that we want to scrape all actors from [Apify Store](https://apify.com/store) and for each actor we want to get its:
+A good first step is to figure out what data you want to scrape and where to find it. For the time being, let's just agree that we want to scrape all Actors from [Apify Store](https://apify.com/store) and for each Actor we want to get its:

- URL
- Owner
@@ -52,7 +52,7 @@ A good first step is to figure out what data you want to scrape and where to fin
- Last modification date
- Number of runs

-You will notice that some information is available directly on the list page, but for details such as "Last modification date" or "Number of runs" we'll also need to open the actor detail pages.
+You will notice that some information is available directly on the list page, but for details such as "Last modification date" or "Number of runs" we'll also need to open the Actor detail pages.
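
Put as code, the target record could look something like the sketch below. This is our illustration, not part of the tutorial: the field names are made up, and some fields from the list above sit in the collapsed part of the diff.

```ts
// An illustrative shape for one scraped record; field names are ours,
// not the tutorial's. Some listed fields are collapsed in this diff.
interface ActorStoreRecord {
    url: string;          // URL of the Actor detail page
    owner: string;        // account that published the Actor
    lastModified: string; // "Last modification date" from the detail page
    runCount: number;     // "Number of runs" from the detail page
}
```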

![data to scrape](/img/getting-started/scraping-practice.jpg 'Overview of data to be scraped.')

@@ -66,7 +66,7 @@ Let's take a look at the `apify.com/store` page more carefully. There are some *

### Categories and sorting

-When you click the categories, you'll see that they filter the results. If you remove the category, you're back to the original number of results. By going through a few categories and observing the behavior, we can quite safely assume that the default setting - with **no category selected** - shows us **all the actors available** in the store and that's the setting we'll use to scrape. The same applies to sorting. We don't need that now.
+When you click the categories, you'll see that they filter the results. If you remove the category, you're back to the original number of results. By going through a few categories and observing the behavior, we can quite safely assume that the default setting - with **no category selected** - shows us **all the Actors available** in the store and that's the setting we'll use to scrape. The same applies to sorting. We don't need that now.

:::caution

@@ -90,37 +90,37 @@ Similarly to the issue with filters explained above, the existence of pagination

:::

-At the time of writing the Store results counter showed 1047 results - actors. Quick count of actors on one page of results makes 24. 8 rows times 3 actors. This means that there should be 44 pages of results. 1047 divided by 24. Try going to page number 44 (or the number your own calculation produced).
+At the time of writing, the Store results counter showed 1047 results - Actors. A quick count of Actors on one page of results gives 24: 8 rows times 3 Actors. This means there should be 44 pages of results (1047 divided by 24, rounded up). Try going to page number 44 (or the number your own calculation produced).

```
https://apify.com/store?page=44
```

-It's empty. 🤯 Wrong calculation? Not really. This is an example of another common issue in web scraping. The result count presented by websites very rarely matches the actual number of available results. In our case, it's simply because certain actors were hidden for some reason, but the count does not reflect it.
+It's empty. 🤯 Wrong calculation? Not really. This is an example of another common issue in web scraping. The result count presented by websites very rarely matches the actual number of available results. In our case, it's simply because certain Actors were hidden for some reason, but the count does not reflect it.

At the time of writing, the last results were actually on page 42. But that's ok. What's important is that on this page 42, the pagination links at the bottom are still the same as on page one, two or six. This makes it fairly certain that you can keep following those links until you scrape all the results. Good 👍

If you're not convinced, you can visit a page somewhere in the middle, like `https://apify.com/store?page=20` and see how the pagination looks there.
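
For the record, the page-count estimate above is plain ceiling division; here is a quick sketch of the arithmetic (ours, not the tutorial's):

```ts
// Sanity-checking the estimate: 1047 results at 24 per page.
const expectedPages = Math.ceil(1047 / 24);
console.log(expectedPages); // 44 - yet the real listing ends on page 42
```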

## The crawling strategy

-Now that you know where to start and how to find all the actor details, let's look at the crawling process.
+Now that you know where to start and how to find all the Actor details, let's look at the crawling process.

-1. Visit the store page containing the list of actors (our start URL).
-2. Enqueue all links to actor details.
+1. Visit the store page containing the list of Actors (our start URL).
+2. Enqueue all links to Actor details.
3. Enqueue links to next pages of results.
4. Open the next page in queue.
- When it's a results list page, go to 2.
-- When it's an actor detail page, scrape the data.
-5. Repeat until all results pages and all actor details have been processed.
+- When it's an Actor detail page, scrape the data.
+5. Repeat until all results pages and all Actor details have been processed.

`PlaywrightCrawler` will make sure to visit the pages for you if you provide the correct requests, and you already know how to enqueue pages, so this should be fairly easy. Nevertheless, there are a few more tricks that we'd like to showcase.
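
The strategy maps naturally onto labeled requests. The skeleton below is a minimal sketch of ours, not the lesson's final code (that comes in the next lesson), using the selectors established above:

```ts
import { PlaywrightCrawler } from 'crawlee';

// A minimal sketch of the strategy above: request labels tell
// listing pages apart from Actor detail pages.
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        if (request.label === 'DETAIL') {
            // Step 4b: scrape the data (covered in the following lessons).
        } else {
            // Steps 2 + 3: enqueue detail links and next-page links.
            // (Waiting for the selectors is omitted here for brevity.)
            await enqueueLinks({ selector: 'div[data-test="actorCard"] a', label: 'DETAIL' });
            await enqueueLinks({ selector: '.ActorStorePagination-buttons a', label: 'LIST' });
        }
    },
});

await crawler.run(['https://apify.com/store']); // Step 1: the start URL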

## Sanity check

Let's check that everything is set up correctly before writing the scraping logic itself. You might realize that something in your previous analysis doesn't quite add up, or the website might not behave exactly as you expected.

-The example below creates a new crawler that visits the start URL and prints the text content of all the actor cards on that page. When you run the code, you will see the _very badly formatted_ content of the individual actor cards.
+The example below creates a new crawler that visits the start URL and prints the text content of all the Actor cards on that page. When you run the code, you will see the _very badly formatted_ content of the individual Actor cards.

<Tabs groupId="sanity-check">
<TabItem value="playwright" label="Playwright" default>
@@ -131,9 +131,9 @@ The example below creates a new crawler that visits the start URL and prints the
</TabItem>
</Tabs>

-If you're wondering how to get that `.ActorStoreItem` selector. We'll explain it in the next chapter on DevTools.
+If you're wondering how to get that `div[data-test="actorCard"]` selector, we'll explain it in the next chapter on DevTools.

-## DevTools - the scrapers toolbox
+## DevTools - the scraper's toolbox

:::info

@@ -145,7 +145,7 @@ Let's open DevTools by going to https://apify.com/store in Chrome and then right

## Selecting elements

-In the DevTools, choose the **Select an element** tool and try hovering over one of the actor cards.
+In the DevTools, choose the **Select an element** tool and try hovering over one of the Actor cards.

![select an element](/img/getting-started/select-an-element.png 'Finding the select an element tool.')

@@ -155,15 +155,15 @@ You'll see that you can select different elements inside the card. Instead, sele

Selecting an element will highlight it in the DevTools HTML inspector. When you look carefully at the elements, you'll see that there are some **classes** attached to the different HTML elements. Those are called **CSS classes**, and we can make use of them in scraping.

-Conversely, by hovering over elements in the HTML inspector, you will see them highlight on the page. Inspect the page's structure around the actor cards. You'll see that all the card's data is displayed in an `<a>` element with two classes, one of which is **ActorStoreItem**. It should now make sense how we got that `.ActorStoreItem` selector. It's just a way to find all elements that are annotated with the `ActorStoreItem`.
+Conversely, by hovering over elements in the HTML inspector, you will see them highlight on the page. Inspect the page's structure around the Actor cards. You'll see that all the card's data is displayed in an `<a>` element with a `data-test` attribute set to **actorCard**. It should now make sense how we got that `div[data-test="actorCard"]` selector. It's just a way to find all elements annotated with the `actorCard` value.

It's always a good idea to double-check that you're not getting any unwanted elements with this class. To do that, go into the **Console** tab of DevTools and run:

```ts
-document.querySelectorAll('.ActorStoreItem');
+document.querySelectorAll('div[data-test="actorCard"]');
```

-You will see that only the 24 actor cards will be returned, and nothing else.
+You will see that only the 24 Actor cards are returned, and nothing else.
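
And if you only care about the count, a one-line variant of the same console check (ours, for illustration):

```ts
// Counting the matches directly in the DevTools Console.
document.querySelectorAll('div[data-test="actorCard"]').length; // 24 at the time of writing
```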

:::tip

Expand All @@ -173,4 +173,4 @@ CSS selectors and DevTools are quite a big topic. If you want to learn more, vis

## Next lesson

-In the next lesson we will crawl the whole store, including all the listing pages and all the actor detail pages.
+In the next lesson, we will crawl the whole store, including all the listing pages and all the Actor detail pages.
20 changes: 10 additions & 10 deletions docs/introduction/05-crawling.mdx
@@ -5,7 +5,7 @@ sidebar_label: "Crawling"
description: Your first steps into the world of scraping with Crawlee
---

-To crawl the whole [Apify Store](https://apify.com/store) and find all the data, you first need to visit all the pages with actors - listing pages and also all the actor detail pages.
+To crawl the whole [Apify Store](https://apify.com/store) and find all the data, you first need to visit all the pages with Actors - listing pages and also all the Actor detail pages.

## Crawling the listing pages

@@ -23,14 +23,14 @@ import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`)
-       // Wait for the actor cards to render,
+       // Wait for the Actor cards to render,
        // otherwise enqueueLinks wouldn't enqueue anything.
-       await page.waitForSelector('.ActorStorePagination-pages a');
+       await page.waitForSelector('.ActorStorePagination-buttons a');

        // Add links to the queue, but only from
        // elements matching the provided selector.
        await enqueueLinks({
-           selector: '.ActorStorePagination-pages > a',
+           selector: '.ActorStorePagination-buttons a',
            label: 'LIST',
        })
    },
@@ -43,15 +43,15 @@ The code should look pretty familiar to you. It's a very simple `requestHandler`

### The `selector` parameter of `enqueueLinks()`

-When you previously used `enqueueLinks()`, you were not providing any `selector` parameter, and it was fine, because you wanted to use the default value, which is `a` - finds all `<a>` elements. But now, you need to be more specific. There are multiple `<a>` links on the Store results page, and you're only interested in those that will take your crawler to the next page of results. Using the DevTools, you'll find that you can select the links you need using the `.ActorStorePagination-pages a` selector, which selects all the `<a>` elements that are direct children of an element with `class=ActorStorePagination-pages`.
+When you previously used `enqueueLinks()`, you were not providing any `selector` parameter, and that was fine, because you wanted to use the default value, `a`, which matches all `<a>` elements. But now you need to be more specific. There are multiple `<a>` links on the Store results page, and you're only interested in those that will take your crawler to the next page of results. Using the DevTools, you'll find that you can select the links you need using the `.ActorStorePagination-buttons a` selector, which selects all the `<a>` elements that are children of an element with `class=ActorStorePagination-buttons`.
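
To make the contrast concrete, here's a small comparison of the two calls (an illustration of ours, not part of the lesson), as they would appear inside a `requestHandler`:

```ts
// Default: enqueues every <a> element found on the page.
await enqueueLinks();

// Specific: enqueues only the pagination links.
await enqueueLinks({ selector: '.ActorStorePagination-buttons a' });
```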

### The `label` of `enqueueLinks()`

You will see `label` used often throughout Crawlee, as it's a convenient way of labelling a `Request` instance for quick identification later. You can access it with `request.label` and it's a `string`. You can name your requests any way you want. Here, we used the label `LIST` to note that we're enqueueing pages with lists of results. The `enqueueLinks()` function will add this label to all requests before enqueueing them to the `RequestQueue`. Why this is useful will become obvious in a minute.
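
As a quick sketch of ours showing where that pays off - inside a `requestHandler`, the label is just a string you can branch on (the full version appears in the next snippet):

```ts
// request.label was set by enqueueLinks({ label: 'LIST' }).
if (request.label === 'LIST') {
    // a page with a list of results
} else {
    // the start URL, which was enqueued without a label
}
```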

## Crawling the detail pages

-In a similar fashion, you need to collect all the URLs to the actor detail pages, because only from there you can scrape all the data you need. The following code only repeats the concepts you already know for another set of links.
+In a similar fashion, you need to collect all the URLs to the Actor detail pages, because only from there can you scrape all the data you need. The following code only repeats the concepts you already know for another set of links.

```ts
import { PlaywrightCrawler } from 'crawlee';
@@ -65,17 +65,17 @@ const crawler = new PlaywrightCrawler({
        // This means we're either on the start page, with no label,
        // or on a list page, with LIST label.

-       await page.waitForSelector('.ActorStorePagination-pages a');
+       await page.waitForSelector('.ActorStorePagination-buttons a');
        await enqueueLinks({
-           selector: '.ActorStorePagination-pages > a',
+           selector: '.ActorStorePagination-buttons a',
            label: 'LIST',
        })

        // In addition to adding the listing URLs, we now also
        // add the detail URLs from all the listing pages.
-       await page.waitForSelector('.ActorStoreItem');
+       await page.waitForSelector('div[data-test="actorCard"] a');
        await enqueueLinks({
-           selector: '.ActorStoreItem',
+           selector: 'div[data-test="actorCard"] a',
            label: 'DETAIL', // <= note the different label
        })
    }
