packages/actor-scraper/cheerio-scraper/README.md (+3 −2)
````diff
@@ -93,7 +93,7 @@ This is useful for determining which start URL is currently loaded, in order to
 ### Link selector
 
-The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute. On every page loaded, the scraper looks for all links matching the **Link selector**. It checks that the target URL matches one of the [**Pseudo-URLs**](#pseudo-urls)/[**Glob Patterns**](#glob-patterns), and if so then adds the URL to the request queue, to be loaded by the scraper later.
+The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute. On every page loaded, the scraper looks for all links matching the **Link selector**. It checks that the target URL matches one of the [**Glob Patterns**](#glob-patterns)/[**Pseudo-URLs**](#pseudo-urls), and if so then adds the URL to the request queue, to be loaded by the scraper later.
 
 By default, new scrapers are created with the following selector that matches all links:
````
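The matching step described in this hunk — collect links with the Link selector, then enqueue only those whose URL matches a glob pattern — can be sketched in plain Node.js. This is an illustrative sketch, not Crawlee's actual implementation (the real scraper does this via its internal `enqueueLinks` machinery); `globToRegExp` and `shouldEnqueue` are hypothetical helper names.

```javascript
// Convert a simple glob ("https://crawlee.dev/*/*") to a RegExp:
// "*" matches any run of characters except "/".
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '[^/]*') + '$');
}

// Decide whether a discovered link should go to the request queue.
function shouldEnqueue(url, globs) {
  return globs.some((g) => globToRegExp(g).test(url));
}

console.log(shouldEnqueue('https://crawlee.dev/docs/introduction', ['https://crawlee.dev/*/*'])); // true
console.log(shouldEnqueue('https://example.com/', ['https://crawlee.dev/*/*'])); // false
```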
````diff
@@ -585,7 +585,8 @@ You might also want to see these other resources:
 A similar web scraping actor to Puppeteer Scraper, but using the [Playwright](https://github.com/microsoft/playwright) library instead.
 Documentation for the Apify Actors cloud computing platform.
-- [Crawlee](https://crawlee.dev) - Learn how to build a new web scraper from scratch using the world's most popular web crawling and scraping library for Node.js.
+- [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify actors.
+- [Crawlee documentation](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
````
packages/actor-scraper/playwright-scraper/README.md (+85 −22)
````diff
@@ -54,7 +54,7 @@ Optionally, each URL can be associated with custom user data - a JSON object tha
 The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages (items with `href` attributes, e.g. `<div class="my-class" href="...">`).
 
-On every page loaded, the scraper looks for all links matching **Link selector**, and checks that the target URL matches one of the [**Pseudo-URLs**](#pseudo-urls)/[**Glob Patterns**](#glob-patterns). If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.
+On every page loaded, the scraper looks for all links matching **Link selector**, and checks that the target URL matches one of the [**Glob Patterns**](#glob-patterns)/[**Pseudo-URLs**](#pseudo-urls). If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.
 
 By default, new scrapers are created with the following selector that matches all links on any page:
````
````diff
@@ -106,8 +106,7 @@ will match the URL:
 http://www.example.com/search?do[load]=1
 ```
 
-Optionally, each pseudo-URL can be associated with user data that can be referenced from your **[Page function](#page-function)**
-using `context.request.label` to determine which kind of page is currently loaded in the browser.
+Optionally, each pseudo-URL can be associated with user data that can be referenced from your **[Page function](#page-function)** using `context.request.label` to determine which kind of page is currently loaded in the browser.
 
 Note that you don't need to use the **Pseudo-URLs** setting at all,
 because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()`
````
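The pseudo-URL mechanism discussed above — text outside `[...]` matched literally, text inside `[...]` treated as a regular expression — can be illustrated with a small standalone sketch. This is a simplification of Crawlee's `PseudoUrl` matching, and `pseudoUrlToRegExp` is a hypothetical helper name:

```javascript
// Simplified pseudo-URL matching: literal text outside [...] is escaped,
// text inside [...] is kept as a raw regular-expression fragment.
function pseudoUrlToRegExp(purl) {
  let pattern = '^';
  for (const part of purl.split(/(\[.*?\])/)) {
    if (part.startsWith('[') && part.endsWith(']')) {
      pattern += part.slice(1, -1); // keep the regex fragment as-is
    } else {
      pattern += part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // escape literals
    }
  }
  return new RegExp(pattern + '$');
}

// A pseudo-URL whose query part is left flexible via a regex section:
const matcher = pseudoUrlToRegExp('http://www.example.com/search?do[.+]');
console.log(matcher.test('http://www.example.com/search?do[load]=1')); // true
console.log(matcher.test('http://www.example.com/other')); // false
```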
````diff
@@ -134,12 +133,13 @@ const context = {
 
     // EXPOSED OBJECTS
     page, // Playwright.Page object.
-    request, // Request object.
+    request, // Crawlee.Request object.
     response, // Response object holding the status code and headers.
+    session, // Reference to the currently used session.
+    proxyInfo, // Object holding the url and other information about the currently used proxy.
     crawler, // Reference to the crawler object, with access to `browserPool`, `autoscaledPool`, and more.
     globalStore, // Represents an in-memory store that can be used to share data across pageFunction invocations.
-    log, // Reference to log object.
-    playwrightUtils, // Reference to playwrightUtils namespace, containing various utilities for Playwright.
+    log, // Reference to Crawlee.utils.log.
     Actor, // Reference to the Actor class of Apify SDK.
     Apify, // Alias to the Actor class for back compatibility.
@@ -149,6 +149,11 @@ const context = {
     saveSnapshot, // Saves a screenshot and full HTML of the current page to the key-value store.
     skipLinks, // Prevents enqueueing more links via glob patterns/pseudo-URLs on the current page.
     enqueueRequest, // Adds a page to the request queue.
+
+    // PLAYWRIGHT CONTEXT-AWARE UTILITY FUNCTIONS
+    injectJQuery, // Injects the jQuery library into a Playwright page.
+    sendRequest, // Sends a request using got-scraping.
+    parseWithCheerio, // Returns a Cheerio handle for page.content(), allowing you to work with the data the same way as with CheerioCrawler.
 };
 ```
````
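To show how the context properties above fit together, here is a hedged sketch of a page function that routes on `context.request.label` and uses `skipLinks`. The context object is stubbed so the snippet runs standalone; on the platform it is supplied by the scraper, and the `DETAIL` label is a hypothetical value attached when the request was enqueued.

```javascript
// Sketch of a pageFunction using a few of the exposed context properties.
async function pageFunction(context) {
  const { request, log, skipLinks } = context;
  if (request.label === 'DETAIL') {
    log.info(`Detail page: ${request.url}`);
    await skipLinks(); // do not enqueue further links from detail pages
    return { url: request.url, label: request.label };
  }
  log.info(`Listing page, links will be enqueued: ${request.url}`);
  return null;
}

// Stub standing in for the platform-provided context, for illustration only:
const stubContext = {
  request: { url: 'https://example.com/product/1', label: 'DETAIL' },
  log: { info: () => {} },
  skipLinks: async () => {},
};
pageFunction(stubContext).then((result) => console.log(result));
// → { url: 'https://example.com/product/1', label: 'DETAIL' }
```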
````diff
@@ -186,8 +191,8 @@ This is a reference to the Playwright Page object, which enables you to use the
+Object holding the url and other information about the currently used proxy. See the [official documentation](https://crawlee.dev/api/core/interface/ProxyInfo) for more information.
+
 #### **`crawler`**
 
 | Type | Arguments | Returns |
@@ -251,9 +272,9 @@ Refer to the [official documentation](https://crawlee.dev/api/playwright-crawler
 This should be used instead of JavaScript's built-in `console.log` when logging in the Node.js context, as it automatically color-tags your logs, as well as allows the toggling of the visibility of log messages using options such as [Debug log](#debug-log) in [Advanced configuration](#advanced-configuration).
@@ -265,14 +286,6 @@ The most common `log` methods include:
-This is a namespace containing various utility functions for Playwright. Refer to the [official documentation](https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils) for more information.
 
 #### **`Actor`**
 
 | Type | Arguments | Returns |
````
````diff
@@ -379,6 +392,55 @@ This method is a nice shorthand for
 > This function is async! Don't forget the `await` keyword!
+
+Injects the [jQuery](https://jquery.com/) library into a Playwright page. The injected jQuery will be set to the `window.$` variable, and will survive page navigations and reloads. Note that `injectJQuery()` does not affect the Playwright [`page.$()`](https://playwright.dev/docs/api/class-page#page-query-selector) function in any way.
+
+| Function | (overrideOptions?: Partial\<GotOptionsInit>) | _Promise\<void>_ |
+
+> This function is async! Don't forget the `await` keyword!
+
+This is a helper function that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping). Some options, such as `url` or `method`, can be overridden by providing `overrideOptions`. See the [official documentation](https://crawlee.dev/docs/guides/got-scraping#sendrequest-api) for the full list of possible `overrideOptions` and more information.
````
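A hedged sketch of how `sendRequest` with `overrideOptions` might be used inside a page function. Here `context.sendRequest` is stubbed so the snippet runs standalone, and the API endpoint URL is hypothetical; on the platform the real implementation goes through got-scraping with the crawler's session and proxy settings.

```javascript
// Fetch a sibling JSON endpoint through the context-bound request machinery,
// overriding the URL while keeping the rest of the current request's options.
async function fetchItems(context) {
  const response = await context.sendRequest({
    url: 'https://example.com/api/items', // hypothetical API endpoint
    method: 'GET',
  });
  return JSON.parse(response.body);
}

// Stub standing in for the platform-provided context:
const ctx = {
  sendRequest: async (overrideOptions) => ({ body: '{"items":[1,2,3]}' }),
};
fetchItems(ctx).then((data) => console.log(data.items)); // → [ 1, 2, 3 ]
```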
````diff
@@ -534,7 +596,8 @@ For more information, see [Datasets](https://apify.com/docs/storage#dataset) in
 That's it! You might also want to check out these other resources:
 
 - [Actors documentation](https://apify.com/docs/actor) - Documentation for the Apify Actors cloud computing platform.
-- [Crawlee](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
+- [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify actors.
+- [Crawlee documentation](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
 - [Cheerio Scraper](https://apify.com/apify/cheerio-scraper) - Another web scraping actor that downloads and processes pages in raw HTML for much higher performance.
 - [Puppeteer Scraper](https://apify.com/apify/puppeteer-scraper) - A similar web scraping actor to Playwright Scraper, but using the [Puppeteer](https://github.com/puppeteer/puppeteer) library instead.
 - [Web Scraper](https://apify.com/apify/web-scraper) - A similar web scraping actor to Playwright Scraper, but simpler to use and running only in the context of the browser. Uses the [Puppeteer](https://github.com/puppeteer/puppeteer) library.
````
packages/actor-scraper/puppeteer-scraper/INPUT_SCHEMA.json (+11 −11)
````diff
@@ -9,24 +9,24 @@
     "title": "Start URLs",
     "type": "array",
     "description": "URLs to start with",
-    "prefill": [
-        {
-            "url": "https://apify.com"
-        }
-    ],
+    "prefill": [{ "url": "https://crawlee.dev" }],
     "editor": "requestListSources"
 },
+"globs": {
+    "title": "Glob Patterns",
+    "type": "array",
+    "description": "Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.",
+    "editor": "stringList",
+    "default": [],
+    "prefill": ["https://crawlee.dev/*/*"]
+},
 "pseudoUrls": {
-    "title": "Pseudo-URLs",
+    "title": "Pseudo-URLs (deprecated)",
     "type": "array",
     "description": "Pseudo-URLs to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Pseudo-URLs will cause the scraper to enqueue all links matched by the Link selector.",
````