
Commit efbfc44

feat: update the scrapers (#70)
- add globs to input
- update input schema
- update READMEs
- update packages
1 parent 6723165 commit efbfc44

File tree

23 files changed: +1084 −2343 lines


package-lock.json

Lines changed: 691 additions & 2047 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 3 additions & 3 deletions
@@ -55,7 +55,7 @@
     "@types/content-type": "^1.1.5",
     "@types/fs-extra": "^9.0.13",
     "@types/jest": "^28.1.5",
-    "@types/node": "^18.0.3",
+    "@types/node": "^18.7.18",
     "@types/rimraf": "^3.0.2",
     "@types/semver": "^7.3.10",
     "@types/tough-cookie": "^4.0.2",
@@ -66,7 +66,7 @@
     "crawlee": "^3.0.0",
     "playwright": "^1.25.0",
     "puppeteer": "~17.0.0",
-    "eslint": "^8.19.0",
+    "eslint": "~8.22.0",
     "fs-extra": "^10.1.0",
     "gen-esm-wrapper": "^1.1.3",
     "husky": "^8.0.1",
@@ -77,7 +77,7 @@
     "ts-jest": "^28.0.5",
     "ts-node": "^10.8.2",
     "turbo": "1.4.6",
-    "typescript": "~4.8.2"
+    "typescript": "~4.8.3"
   },
   "packageManager": "npm@8.19.1"
 }
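The dependency changes above switch `eslint` from a caret range (`^8.19.0`) to a tilde range (`~8.22.0`). As a rough sketch of the difference between the two range operators (hypothetical, simplified logic; npm itself resolves ranges with the `semver` package):

```javascript
// Simplified illustration of '^' vs '~' semver ranges.
function parse(v) {
    return v.split('.').map(Number);
}

// ^8.19.0 allows any 8.x.y that is >= 8.19.0 (minor and patch updates).
function satisfiesCaret(version, base) {
    const [vMaj, vMin, vPat] = parse(version);
    const [bMaj, bMin, bPat] = parse(base);
    return vMaj === bMaj && (vMin > bMin || (vMin === bMin && vPat >= bPat));
}

// ~8.22.0 allows only 8.22.x (patch updates).
function satisfiesTilde(version, base) {
    const [vMaj, vMin, vPat] = parse(version);
    const [bMaj, bMin, bPat] = parse(base);
    return vMaj === bMaj && vMin === bMin && vPat >= bPat;
}

console.log(satisfiesCaret('8.23.1', '8.19.0')); // true: minor bump allowed
console.log(satisfiesTilde('8.23.1', '8.22.0')); // false: tilde pins the minor
```

The tilde range therefore pins `eslint` to patch releases of 8.22, a tighter constraint than the previous caret range.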

packages/actor-scraper/cheerio-scraper/README.md

Lines changed: 3 additions & 2 deletions
@@ -93,7 +93,7 @@ This is useful for determining which start URL is currently loaded, in order to
 
 ### Link selector
 
-The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute. On every page loaded, the scraper looks for all links matching the **Link selector**. It checks that the target URL matches one of the [**Pseudo-URLs**](#pseudo-urls)/[**Glob Patterns**](#glob-patterns), and if so then adds the URL to the request queue, to be loaded by the scraper later.
+The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute. On every page loaded, the scraper looks for all links matching the **Link selector**. It checks that the target URL matches one of the [**Glob Patterns**](#glob-patterns)/[**Pseudo-URLs**](#pseudo-urls), and if so then adds the URL to the request queue, to be loaded by the scraper later.
 
 By default, new scrapers are created with the following selector that matches all links:
 
@@ -585,7 +585,8 @@ You might also want to see these other resources:
 A similar web scraping actor to Puppeteer Scraper, but using the [Playwright](https://github.com/microsoft/playwright) library instead.
 - [Actors documentation](https://docs.apify.com/actors) -
 Documentation for the Apify Actors cloud computing platform.
-- [Crawlee](https://crawlee.dev) - Learn how to build a new web scraper from scratch using the world's most popular web crawling and scraping library for Node.js.
+- [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify actors.
+- [Crawlee documentation](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
 
 ## Upgrading
 

packages/actor-scraper/cheerio-scraper/package.json

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
   },
   "devDependencies": {
     "@apify/tsconfig": "^0.1.0",
-    "@types/node": "^18.7.16",
+    "@types/node": "^18.7.18",
     "ts-node": "^10.9.1",
     "typescript": "^4.8.3"
   },

packages/actor-scraper/playwright-scraper/README.md

Lines changed: 85 additions & 22 deletions
@@ -54,7 +54,7 @@ Optionally, each URL can be associated with custom user data - a JSON object tha
 
 The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages (items with `href` attributes, e.g. `<div class="my-class" href="...">`).
 
-On every page loaded, the scraper looks for all links matching **Link selector**, and checks that the target URL matches one of the [**Pseudo-URLs**](#pseudo-urls)/[**Glob Patterns**](#glob-patterns). If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.
+On every page loaded, the scraper looks for all links matching **Link selector**, and checks that the target URL matches one of the [**Glob Patterns**](#glob-patterns)/[**Pseudo-URLs**](#pseudo-urls). If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.
 
 By default, new scrapers are created with the following selector that matches all links on any page:
 
@@ -106,8 +106,7 @@ will match the URL:
 http://www.example.com/search?do[load]=1
 ```
 
-Optionally, each pseudo-URL can be associated with user data that can be referenced from your **[Page function](#page-function)**
-using `context.request.label` to determine which kind of page is currently loaded in the browser.
+Optionally, each pseudo-URL can be associated with user data that can be referenced from your **[Page function](#page-function)** using `context.request.label` to determine which kind of page is currently loaded in the browser.
 
 Note that you don't need to use the **Pseudo-URLs** setting at all,
 because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()`
@@ -134,12 +133,13 @@ const context = {
 
     // EXPOSED OBJECTS
     page, // Playwright.Page object.
-    request, // Request object.
+    request, // Crawlee.Request object.
     response, // Response object holding the status code and headers.
+    session, // Reference to the currently used session.
+    proxyInfo, // Object holding the URL and other information about the currently used proxy.
     crawler, // Reference to the crawler object, with access to `browserPool`, `autoscaledPool`, and more.
     globalStore, // Represents an in memory store that can be used to share data across pageFunction invocations.
-    log, // Reference to log object.
-    playwrightUtils, // Reference to playwrightUtils namespace, containing various utilities for Playwright.
+    log, // Reference to Crawlee.utils.log.
     Actor, // Reference to the Actor class of Apify SDK.
     Apify, // Alias to the Actor class for back compatibility.
 
@@ -149,6 +149,11 @@ const context = {
     saveSnapshot, // Saves a screenshot and full HTML of the current page to the key value store.
     skipLinks, // Prevents enqueueing more links via glob patterns/Pseudo URLs on the current page.
     enqueueRequest, // Adds a page to the request queue.
+
+    // PLAYWRIGHT CONTEXT-AWARE UTILITY FUNCTIONS
+    injectJQuery, // Injects the jQuery library into a Playwright page.
+    sendRequest, // Sends a request using got-scraping.
+    parseWithCheerio, // Returns a Cheerio handle for page.content(), allowing you to work with the data the same way as with CheerioCrawler.
 };
 ```
 
@@ -186,8 +191,8 @@ This is a reference to the Playwright Page object, which enables you to use the
 
 #### **`request`**
 
-| Type | Arguments | Returns |
-| ------ | --------- | ------- |
+| Type | Arguments | Returns |
+| ------ | --------- | ------- |
 | Object | - | [Request](https://crawlee.dev/api/core/class/Request) object |
 
 An object with metadata about the currently crawled page, such as its URL, headers, and the number of retries.
@@ -219,6 +224,22 @@ See the [Request class](https://crawlee.dev/api/core/class/Request) for a previe
 
 The response object is produced by Playwright. Currently, we only pass the response's HTTP status code and headers to the `response` object.
 
+#### **`session`**
+
+| Type | Arguments | Returns |
+| ------ | --------- | ------- |
+| Object | - | [Session](https://crawlee.dev/api/core/class/Session) object |
+
+Reference to the currently used session. See the [official documentation](https://crawlee.dev/api/core/class/Session) for more information.
+
+#### **`proxyInfo`**
+
+| Type | Arguments | Returns |
+| ------ | --------- | ------- |
+| Object | - | [ProxyInfo](https://crawlee.dev/api/core/interface/ProxyInfo) object |
+
+An object holding the URL and other information about the currently used proxy. See the [official documentation](https://crawlee.dev/api/core/interface/ProxyInfo) for more information.
+
 #### **`crawler`**
 
 | Type | Arguments | Returns |
@@ -251,9 +272,9 @@ Refer to the [official documentation](https://crawlee.dev/api/playwright-crawler
 
 #### **`log`**
 
-| Type | Arguments | Returns |
-| ------ | --------- | ------- |
-| Object | - | [log](https://crawlee.dev/api/core/class/Log) object |
+| Type | Arguments | Returns |
+| ------ | --------- | ------- |
+| Object | - | [Crawlee.utils.log](https://crawlee.dev/api/core/class/Log) object |
 
 This should be used instead of JavaScript's built in `console.log` when logging in the Node.js context, as it automatically color-tags your logs, as well as allows the toggling of the visibility of log messages using options such as [Debug log](#debug-log) in [Advanced configuration](#advanced-configuration).
 
@@ -265,14 +286,6 @@ The most common `log` methods include:
 - `context.log.error()`
 - `context.log.exception()`
 
-#### **`playwrightUtils`**
-
-| Type | Arguments | Returns |
-| ------ | --------- | ------- |
-| Object | - | [playwrightUtils](https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils) object |
-
-This is a namespace containing various utility functions for Playwright. Refer to the [official documentation](https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils) for more information.
-
 #### **`Actor`**
 
 | Type | Arguments | Returns |
@@ -379,6 +392,55 @@ This method is a nice shorthand for
 await context.crawler.requestQueue.addRequest({ url: 'https://foo.bar/baz' })
 ```
 
+#### **`injectJQuery`**
+
+| Type | Arguments | Returns |
+| -------- | --------- | ------- |
+| Function | () | _Promise\<void>_ |
+
+> This function is async! Don't forget the `await` keyword!
+
+Injects the [jQuery](https://jquery.com/) library into a Playwright page. The injected jQuery will be set to the `window.$` variable, and will survive page navigations and reloads. Note that `injectJQuery()` does not affect the Playwright [`page.$()`](https://playwright.dev/docs/api/class-page#page-query-selector) function in any way.
+
+Usage:
+
+```JavaScript
+await context.injectJQuery();
+```
+
+#### **`sendRequest`**
+
+| Type | Arguments | Returns |
+| -------- | -------------------------------------------- | ------- |
+| Function | (overrideOptions?: Partial\<GotOptionsInit>) | _Promise\<void>_ |
+
+> This function is async! Don't forget the `await` keyword!
+
+This is a helper function that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping). Some options, such as `url` or `method`, can be overridden by providing `overrideOptions`. See the [official documentation](https://crawlee.dev/docs/guides/got-scraping#sendrequest-api) for the full list of possible `overrideOptions` and more information.
+
+Usage:
+
+```JavaScript
+// Without overrideOptions
+await context.sendRequest();
+// With overrideOptions.url
+await context.sendRequest({ url: 'https://www.example.com' });
+```
+
+#### **`parseWithCheerio`**
+
+| Type | Arguments | Returns |
+| -------- | --------- | ----------------------- |
+| Function | () | _Promise\<CheerioRoot>_ |
+
+Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with CheerioCrawler.
+
+Usage:
+
+```JavaScript
+const $ = await context.parseWithCheerio();
+```
+
 ## Proxy Configuration
 
 The **Proxy configuration** (`proxyConfiguration`) option enables you to set proxies
@@ -421,7 +483,7 @@ The proxy configuration can be set programmatically when calling the actor using
 
 ## Browser Configuration
 
-### Browser Type
+### `launcher`
 
 The actor will use a Chromium browser by default. Alternatively, you can set it to use a Firefox browser instead.
 
@@ -516,7 +578,7 @@ The full object stored in the dataset would look as follows (in JSON format, inc
 }
 ```
 
-To download the results, call the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection) API endpoint:
+To download the results, call the [Get dataset items](https://docs.apify.com/api/v2#/reference/datasets/item-collection) API endpoint:
 
 ```
 https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json
@@ -534,7 +596,8 @@ For more information, see [Datasets](https://apify.com/docs/storage#dataset) in
 That's it! You might also want to check out these other resources:
 
 - [Actors documentation](https://apify.com/docs/actor) - Documentation for the Apify Actors cloud computing platform.
-- [Crawlee](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
+- [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify actors.
+- [Crawlee documentation](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
 - [Cheerio Scraper](https://apify.com/apify/cheerio-scraper) - Another web scraping actor that downloads and processes pages in raw HTML for much higher performance.
 - [Puppeteer Scraper](https://apify.com/apify/puppeteer-scraper) - A similar web scraping actor to Playwright Scraper, but using the [Puppeteer](https://github.com/puppeteer/puppeteer) library instead.
 - [Web Scraper](https://apify.com/apify/web-scraper) - A similar web scraping actor to Playwright Scraper, but is simpler to use and only runs in the context of the browser. Uses the [Puppeteer](https://github.com/puppeteer/puppeteer) library.
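The Pseudo-URL paragraphs kept in the README above rely on the purl syntax, in which text inside `[...]` is treated as a regular-expression fragment while everything else matches literally. A minimal sketch of that conversion (hypothetical helper names and a simplified parser that does not handle nested brackets; Apify's real implementation lives in its `PseudoUrl` class):

```javascript
// Escape regex metacharacters in the literal (non-bracketed) parts of a purl.
function escapeLiteral(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Turn a Pseudo-URL into a RegExp: text outside [...] is matched literally,
// text inside [...] is kept as a regex fragment. (Simplified sketch only.)
function purlToRegExp(purl) {
    let pattern = '';
    let i = 0;
    while (i < purl.length) {
        const open = purl.indexOf('[', i);
        if (open === -1) {
            pattern += escapeLiteral(purl.slice(i));
            break;
        }
        const close = purl.indexOf(']', open);
        pattern += escapeLiteral(purl.slice(i, open)) + purl.slice(open + 1, close);
        i = close + 1;
    }
    return new RegExp(`^${pattern}$`);
}

const re = purlToRegExp('https://example.com/[(\\w|-)+]');
console.log(re.test('https://example.com/some-page')); // true
console.log(re.test('https://other.com/some-page'));   // false
```

This also illustrates why globs are now the preferred option: a glob expresses the same single-segment match without embedding raw regex fragments in the input.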

packages/actor-scraper/playwright-scraper/package.json

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
   },
   "devDependencies": {
     "@apify/tsconfig": "^0.1.0",
-    "@types/node": "^18.7.16",
+    "@types/node": "^18.7.18",
     "ts-node": "^10.9.1",
     "typescript": "^4.8.3"
   },

packages/actor-scraper/playwright-scraper/src/internals/crawler_setup.ts

Lines changed: 2 additions & 7 deletions
@@ -12,7 +12,6 @@ import {
     PlaywrightCrawlerOptions,
     PlaywrightLaunchContext,
     BrowserCrawlerEnqueueLinksOptions,
-    playwrightUtils,
     log,
 } from '@crawlee/playwright';
 import { Awaitable, Dictionary } from '@crawlee/utils';
@@ -42,7 +41,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
     requestQueue: RequestQueue;
     keyValueStore: KeyValueStore;
     customData: unknown;
-    playwrightUtils = playwrightUtils;
     input: Input;
     maxSessionUsageCount: number;
     evaledPageFunction: (...args: unknown[]) => unknown;
@@ -216,15 +214,13 @@ export class CrawlerSetup implements CrawlerSetupOptions {
     }
 
     private _createNavigationHooks(options: PlaywrightCrawlerOptions) {
-        options.preNavigationHooks!.push(async ({ request, page, session }, gotoOptions) => {
+        options.preNavigationHooks!.push(async ({ request, page, session, blockRequests }, gotoOptions) => {
             // Attach a console listener to get all logs from Browser context.
             if (this.input.browserLog) browserTools.dumpConsole(page);
 
             // Prevent download of stylesheets and media, unless selected otherwise
             if (this.blockedUrlPatterns.length) {
-                await playwrightUtils.blockRequests(page, {
-                    urlPatterns: this.blockedUrlPatterns,
-                });
+                await blockRequests({ urlPatterns: this.blockedUrlPatterns });
             }
 
             // Add initial cookies, if any.
@@ -307,7 +303,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
             requestQueue: this.requestQueue,
             keyValueStore: this.keyValueStore,
             customData: this.input.customData,
-            playwrightUtils: this.playwrightUtils,
         },
         pageFunctionArguments,
     };
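The pattern behind this change: instead of importing the `playwrightUtils` namespace and passing the page to it explicitly, the hook destructures a helper that is already bound to its crawling context. A toy sketch of that shape (illustrative names only, no Crawlee dependency):

```javascript
// Sketch of a context object exposing a pre-bound helper, in the spirit of
// Crawlee's PlaywrightCrawlingContext. Names here are hypothetical.
function createCrawlingContext(page) {
    return {
        page,
        // Stand-in for playwrightUtils.blockRequests(page, options),
        // already closed over this context's page.
        blockRequests: (options) => `${page.id}: blocking ${options.urlPatterns.join(', ')}`,
    };
}

// Before: await playwrightUtils.blockRequests(page, { urlPatterns });
// After: destructure the bound helper straight from the hook's context.
const { blockRequests } = createCrawlingContext({ id: 'page-1' });
console.log(blockRequests({ urlPatterns: ['.css', '.png'] }));
// 'page-1: blocking .css, .png'
```

The bound form removes the namespace import and makes it impossible to call the helper against the wrong page.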

packages/actor-scraper/puppeteer-scraper/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-FROM apify/actor-node-playwright:16 AS builder
+FROM apify/actor-node-puppeteer-chrome:16 AS builder
 
 COPY --chown=myuser package*.json ./
 
packages/actor-scraper/puppeteer-scraper/INPUT_SCHEMA.json

Lines changed: 11 additions & 11 deletions
@@ -9,24 +9,24 @@
       "title": "Start URLs",
       "type": "array",
       "description": "URLs to start with",
-      "prefill": [
-        {
-          "url": "https://apify.com"
-        }
-      ],
+      "prefill": [{ "url": "https://crawlee.dev" }],
       "editor": "requestListSources"
     },
+    "globs": {
+      "title": "Glob Patterns",
+      "type": "array",
+      "description": "Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.",
+      "editor": "stringList",
+      "default": [],
+      "prefill": ["https://crawlee.dev/*/*"]
+    },
     "pseudoUrls": {
-      "title": "Pseudo-URLs",
+      "title": "Pseudo-URLs (deprecated)",
       "type": "array",
      "description": "Pseudo-URLs to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Pseudo-URLs will cause the scraper to enqueue all links matched by the Link selector.",
       "editor": "pseudoUrls",
       "default": [],
-      "prefill": [
-        {
-          "purl": "https://apify.com[(/[\\w-]+)?]"
-        }
-      ]
+      "prefill": []
     },
     "linkSelector": {
       "title": "Link selector",
0 commit comments
