feat: implement ErrorSnapshotter for error context capture #2332

HamzaAlwan · 2024-02-12T17:36:12Z

This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.

This functionality is opt-in, and can be enabled via the crawler options:

const crawler = new BasicCrawler({
  // ...
  statisticsOptions: {
    saveErrorSnapshots: true,
  },
});

Closes #2280

…280) This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.

HamzaAlwan · 2024-02-12T17:48:12Z

Hey @B4nan,

I have completed the pull request that resolves the issue with saving Save screenshot/HTML on first occurrence of error in error statistics.

However, when I was trying things out, I ran into a few hiccups. I used the debugger and set up a quick test file with Cheerio and the browser to check if everything was working fine. The good news is that the object passed the test, but I couldn't get around to testing the file saved in the key-value store properly.

Normal tests ran fine, but seems there are errors but I don't believe they are related to my changes.

Unfortunately, I was unable to run the e2e tests or build it due to some errors in the terminal, and I wasn't able to figure out what caused them.

I'm using node v18.17.1 and yarn v4.0.2

yarn test Error:

 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
TargetCloseError: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
 ❯ CdpCDPSession.send node_modules/puppeteer-core/src/cdp/CDPSession.ts:106:6
 ❯ ExecutionContext.#evaluate node_modules/puppeteer-core/src/cdp/ExecutionContext.ts:292:6
 ❯ ExecutionContext.evaluate node_modules/puppeteer-core/lib/esm/puppeteer/cdp/ExecutionContext.js:124:36
    125|      * @example
    126|      *
    127|      * ```ts
       |      ^
    128|      * const context = await page.mainFrame().executionContext();
    129|      * const handle: JSHandle<typeof globalThis> = await context.evaluateHandle(
 ❯ IsolatedWorld.evaluate node_modules/puppeteer-core/src/cdp/IsolatedWorld.ts:153:14
 ❯ CdpFrame.evaluate node_modules/puppeteer-core/src/api/Frame.ts:513:30
 ❯ CdpFrame.<anonymous> node_modules/puppeteer-core/src/util/decorators.ts:64:2
 ❯ CdpFrame.content node_modules/puppeteer-core/src/api/Frame.ts:788:10
 ❯ CdpFrame.<anonymous> node_modules/puppeteer-core/src/util/decorators.ts:64:2
 ❯ CdpPage.content node_modules/puppeteer-core/lib/esm/puppeteer/api/Page.js:516:43
 ❯ ErrorSnapshotter.captureSnapshot packages/utils/src/internals/error_snapshotter.ts:38:38

yarn build Error:

@crawlee/utils#build: command (/Users/hamza/Desktop/apify/crawlee/packages/utils) /private/var/folders/4z/qgr81fxx5695006fwrzq6k7r0000gn/T/xfs-504fbf48/yarn run build exited (2)

 Tasks:    3 successful, 5 total
Cached:    3 cached, 5 total
  Time:    3.881s 
Failed:    @crawlee/utils#build

 ERROR  run failed: command  exited (2)

yarn test:e2e Error:

node:internal/errors:496
    ErrorCaptureStackTrace(err);
    ^

Error [ERR_MODULE_NOT_FOUND]: Cannot find module '/Users/hamza/Desktop/apify/crawlee/packages/utils/dist/index.mjs' imported from /Users/hamza/Desktop/apify/crawlee/test/e2e/tools.mjs
    at new NodeError (node:internal/errors:405:5)
    at finalizeResolution (node:internal/modules/esm/resolve:324:11)
    at moduleResolve (node:internal/modules/esm/resolve:943:10)
    at defaultResolve (node:internal/modules/esm/resolve:1129:11)
    at nextResolve (node:internal/modules/esm/loader:163:28)
    at ESMLoader.resolve (node:internal/modules/esm/loader:835:30)
    at ESMLoader.getModuleJob (node:internal/modules/esm/loader:424:18)
    at ModuleWrap.<anonymous> (node:internal/modules/esm/module_job:77:40)
    at link (node:internal/modules/esm/module_job:76:36) {
  code: 'ERR_MODULE_NOT_FOUND'
}

Node.js v18.17.1

vladfrangu · 2024-02-12T18:01:44Z

Can you send the full build log you have from the build command? There should be more information 👀

packages/utils/src/internals/error_snapshotter.ts

packages/utils/src/internals/error_tracker.ts

HamzaAlwan · 2024-02-12T18:08:11Z

@vladfrangu

Can you send the full build log you have from the build command? There should be more information 👀

Full build log:

yarn build 
╭───────────────────────────────────────────────────────────────────────╮
│                                                                       │
│                  Update available v1.11.3 ≫ v1.12.3                   │
│    Changelog: https://github.com/vercel/turbo/releases/tag/v1.12.3    │
│           Run "npx @turbo/codemod@latest update" to update            │
│                                                                       │
│        Follow @turborepo for updates: https://x.com/turborepo         │
╰───────────────────────────────────────────────────────────────────────╯
• Packages in scope: @crawlee/basic, @crawlee/browser, @crawlee/browser-pool, @crawlee/cheerio, @crawlee/cli, @crawlee/core, @crawlee/http, @crawlee/jsdom, @crawlee/linkedom, @crawlee/memory-storage, @crawlee/playwright, @crawlee/puppeteer, @crawlee/templates, @crawlee/types, @crawlee/utils, crawlee
• Running build in 16 packages
• Remote caching disabled
@crawlee/types:build: cache hit (outputs already on disk), replaying logs d5696cf6c1be5593
@crawlee/templates:build: cache hit (outputs already on disk), replaying logs 5de49aba4a4af3e3
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Validating 8 templates
@crawlee/templates:build: Validating template getting-started-ts
@crawlee/templates:build: Finished validating getting-started-ts
@crawlee/templates:build: Validating template getting-started-js
@crawlee/templates:build: Finished validating getting-started-js
@crawlee/templates:build: Validating template cheerio-ts
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Finished validating cheerio-ts
@crawlee/templates:build: Validating template playwright-ts
@crawlee/templates:build: Finished validating playwright-ts
@crawlee/templates:build: Validating template puppeteer-ts
@crawlee/templates:build: Finished validating puppeteer-ts
@crawlee/templates:build: Validating template cheerio-js
@crawlee/templates:build: Finished validating cheerio-js
@crawlee/templates:build: Validating template playwright-js
@crawlee/templates:build: Finished validating playwright-js
@crawlee/templates:build: Validating template puppeteer-js
@crawlee/templates:build: Finished validating puppeteer-js
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/utils:build: cache miss, executing 4ba9529396006646
@crawlee/cli:build: cache hit (outputs already on disk), replaying logs 05dc6b290eb9c6bb
@crawlee/memory-storage:build: cache miss, executing a3934740eaa87094
@crawlee/utils:build: error TS2209: The project root is ambiguous, but is required to resolve export map entry '.' in file '/Users/hamza/Desktop/apify/crawlee/packages/utils/package.json'. Supply the `rootDir` compiler option to disambiguate.
@crawlee/utils:build: 
@crawlee/utils:build: 
@crawlee/utils:build: Found 1 error.
@crawlee/utils:build: 
@crawlee/utils:build: ERROR: command finished with error: command (/Users/hamza/Desktop/apify/crawlee/packages/utils) /private/var/folders/4z/qgr81fxx5695006fwrzq6k7r0000gn/T/xfs-fada239d/yarn run build exited (2)
@crawlee/utils#build: command (/Users/hamza/Desktop/apify/crawlee/packages/utils) /private/var/folders/4z/qgr81fxx5695006fwrzq6k7r0000gn/T/xfs-fada239d/yarn run build exited (2)

 Tasks:    4 successful, 5 total
Cached:    3 cached, 5 total
  Time:    5.055s 
Failed:    @crawlee/utils#build

 ERROR  run failed: command  exited (2)

vladfrangu · 2024-02-12T18:09:53Z

Just checked CI logs too, and yeah.. probably updating imports with what I sent might fix it

HamzaAlwan · 2024-02-12T18:13:19Z

Just checked CI logs too, and yeah.. probably updating imports with what I sent might fix it

Yes, it's fixed now.

HamzaAlwan · 2024-02-12T23:39:41Z

A couple of things:

1- I need to handle saving the local file path to the screenshot and HTML, currently, I only save the Apify path which will not work locally.
2- Please verify the naming or the path of the screenshot and HTML properties, and the method that generates the file names.
3- Not sure if the way I handled saving the error is the correct way, as we have errorTracker and errorTrackerRetry.
4- Should we save the screenshot and HTML inside the default key-value store?
5- Regarding this:

Some errors happen before any page is created or opened, before navigation happens, or after the page is already closed (maybe then a response object is still available to store HTML?)

It's not possible to save the HTML or take a screenshot if an error happens before navigation, and I have not yet looked at the case for post-navigation

B4nan · 2024-02-13T07:27:05Z

Please add some tests for the new functionality.

Also, note that what you did (making the add method async) is in fact breaking change, but I guess we can afford to do that here as the class is rather internal (but it's not marked as such and is publically documented). Maybe it would be better to add another method, e.g. addAsync.

metalwarrior665 · 2024-02-13T08:30:13Z

Thanks. I will check it a bit later this week

packages/basic-crawler/src/internals/basic-crawler.ts

packages/utils/src/internals/error_snapshotter.ts

metalwarrior665

Do you have some example stats output file that this now generates?

For local dev, we could use a full local filesystem path (like /Users/lukas/apify/public-actors/facebook/storage/key_value_stores/${kvName}/${filename} but not sure what is the best practice in dev world
Seems good enough but I want to see example files
Not sure too, you should test it manually with a scrape but also do unit tests for this
For now to the configured default store, the same place where the stats are stored. We can add some configuration later, up to @B4nan how much of it we should have
We should be able to capture snapshots of e.g. pages that navigated but then were rejected as blocked by status code. Those should already have a body and perhaps the page object. But I'm not sure if we can get hold of it in time.

packages/utils/src/internals/error_snapshotter.ts

packages/utils/src/internals/error_tracker.ts

HamzaAlwan · 2024-02-20T09:15:02Z

@metalwarrior665 ,

Here is an example, clicking on the (path + cmd) opens the file:

"retryErrors": {
    "file:///Users/hamza/Desktop/playwright-test/main.js:18:19": {
      "missing error code": {
        "Error": {
          "This is an error": {
            "count": 3,
            "firstErrorScreenshot": "file:///Users/hamza/Desktop/playwright-test/storage/key_value_stores/default/SNAPSHOT_de2cd33186d663537f2b441be5c372_This-is-an-error.jpeg",
            "firstErrorHtml": "file:///Users/hamza/Desktop/playwright-test/storage/key_value_stores/default/SNAPSHOT_de2cd33186d663537f2b441be5c372_This-is-an-error.html"
          }
        }
      }
    }
  }

… local snapshots paths

packages/utils/src/internals/error_tracker.ts

packages/utils/src/internals/error_snapshotter.ts

metalwarrior665 · 2024-02-21T08:39:51Z

@HamzaAlwan When adding tests, let's add a few real-world errors so we can properly trim very long error messages. I wonder if the 30 chars error length could be longer but let's see how the real world errors will look like. Thanks.

…iles names

metalwarrior665

@B4nan @vladfrangu Could you guys review the latest version? Thanks

packages/core/src/crawlers/error_tracker.ts

test/core/error_tracker.test.ts

metalwarrior665 · 2024-04-11T11:49:39Z

@B4nan @vladfrangu @barjin Hey guys, let's get this merged pls. It is really useful for us and we believe for the community as well. We are ready to refactor it if needed. Thanks

B4nan

left few comments, nothing huge, overall it looks pretty good

i have two points outside of that:

the use of APIFY_IS_AT_HOME: I pointed out the recent PR (fix: Fixed double extension for screenshots #2419) which might resolve this so we don't need it. overwise this would be literally the only place in the whole crawlee codebase we would rely on apify env vars (for anything else than improved logging). lets try to find a way around that and not use this env var.
the whole functionality seems to be hardcoded, this needs to be configurable - and i am not entirely sure it should be enabled by default

B4nan · 2024-02-21T09:26:16Z

packages/utils/src/internals/error_tracker.ts

        this.result = Object.create(null);
        this.total = 0;
    }

-    add(error: ErrnoException) {
+    async add(error: ErrnoException, context?: CrawlingContext) {


i am still not very keen of this breaking change either (making sync methods async)

and looking at the implementation, i'd say it can stay sync, you dont really need to wait for the screenshots to resolve, right? and we can still return the promise to allow optional awaiting

Yeah, not awaiting the snapshot actually sounds like a good idea, just make sure it is try/caught inside so we don't get a promise rejection crash, then the add can return promise or error?

and we can still return the promise to allow optional awaiting

I'm not sure how to implement this

packages/utils/src/internals/error_tracker.ts

packages/utils/src/internals/error_snapshotter.ts

packages/playwright-crawler/src/internals/utils/playwright-utils.ts

packages/utils/src/index.ts

packages/core/src/crawlers/error_snapshotter.ts

metalwarrior665 · 2024-05-09T10:52:28Z

@HamzaAlwan Please look at @B4nan 's remarks. Let's prioritize this so we can finish it finally. cc @AndreyBykov

…changes

…d revert changes to add function

B4nan · 2024-05-15T06:29:05Z

Great that you addressed the comments just now @HamzaAlwan, we want to ship v3.10 in the afternoon most probably, so let's try to squeeze this in. Don't forget to rebase and resolve the conflicts.

B4nan

another round of comments. its looking good, but i really want to get rid of the apify related hardcoded stuff with KVS and paths. i can try to deal with this myself if you don't know how

B4nan · 2024-05-15T07:01:20Z

packages/core/src/crawlers/error_snapshotter.ts

+        // v3.0.0 used `crawlee_storage` as the default, we changed this in v3.0.1 to just `storage`,
+        // this function handles it without making BC breaks - it respects existing `crawlee_storage`
+        // directories, and uses the `storage` only if it's not there.
+        const defaultStorageDir = () => {
+            if (pathExistsSync(resolve('./crawlee_storage'))) {
+                return 'crawlee_storage';
+            }
+
+            return 'storage';
+        };
+
+        this.KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores';
+        this.KEY_VALUE_STORE_LOCAL_PATH = `file://${process.env.PWD}/${defaultStorageDir()}/key_value_stores`;


lets remove this, there is no point in supporting crawlee 3.0.0 when this feature will land in 3.10.0

Suggested change

// v3.0.0 used `crawlee_storage` as the default, we changed this in v3.0.1 to just `storage`,

// this function handles it without making BC breaks - it respects existing `crawlee_storage`

// directories, and uses the `storage` only if it's not there.

const defaultStorageDir = () => {

if (pathExistsSync(resolve('./crawlee_storage'))) {

return 'crawlee_storage';

}

return 'storage';

};

this.KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores';

this.KEY_VALUE_STORE_LOCAL_PATH = `file://${process.env.PWD}/${defaultStorageDir()}/key_value_stores`;

this.KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores';

this.KEY_VALUE_STORE_LOCAL_PATH = `file://${process.env.PWD}/storage/key_value_stores`;

i also very much dislike the fact that we hardcode some apify URL here (i dislike its here, its not relevant whether you get the value from env vars or hardcode it). why do you need to handle this manually? we already have abstractions in place, you should just use KVS and it will do something locally and something else on the platform, why is that not enough and you need to deal with different paths explicitly?

I guess Hamza would do that if he would be more aware of how the storage classes work behind the scenes :) Can you just suggest what to use?

i would remove this manual URL handling completely and just call KVS.setValue (and deal with the URLs elsewhere, e.g. via that getPublicUrl method in SDK (which we can add to crawlee too if we want it to provide URLs to the locally stored files)

i can try to improve this myself if needed, would very much like to have this ready by the end of the day or tomorrow morning. also its probably not a blocker for merging, @HamzaAlwan please focus on the other parts if this one is not clear (but let's remove the crawlee_storage conditional handling as suggested).

edit: maybe it will be better of left for myself, as it might require more internal adjustments... lets focus on the other comments mainly

packages/core/src/crawlers/error_snapshotter.ts

B4nan · 2024-05-15T07:04:09Z

packages/core/src/crawlers/error_snapshotter.ts

+            if (process.env.APIFY_IS_AT_HOME) {
+                const platformPath = `${this.KEY_VALUE_PLATFORM_PATH}/${keyValueStore.id}/records`;
+                return {
+                    screenshotFileUrl: screenshotFilename ? `${platformPath}/${screenshotFilename}` : undefined,
+                    htmlFileUrl: htmlFileName ? `${platformPath}/${htmlFileName}` : undefined,
+                };
+            }
+
+            const localPath = `${this.KEY_VALUE_STORE_LOCAL_PATH}/${keyValueStore.name || 'default'}`;
+            return {
+                screenshotFileUrl: screenshotFilename ? `${localPath}/${screenshotFilename}` : undefined,
+                htmlFileUrl: htmlFileName ? `${localPath}/${htmlFileName}` : undefined,
+            };


as mentioned above, this is bad, it hardcodes handling for platform vs local, which is completely abstracted away from user (based on the Actor.init call, and different configuration options), you should just use a KVS and let it do its job.

FYI we have this KVS.getPublicUrl() method defined in the SDK variant of the class, i would rather see an approach in this direction instead of hardcoding things on wrong level.

https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L12

i don't understand why would you need to care about local paths, but we could define the same method on the crawlee variant of the KVS class to resolve to that, sounds like a much cleaner solution

packages/core/src/crawlers/error_snapshotter.ts

B4nan · 2024-05-15T07:15:10Z

packages/core/src/crawlers/error_tracker.ts

+    errno?: number | undefined;
+    code?: string | number | undefined;
+    path?: string | undefined;
+    syscall?: string | undefined;
+    cause?: any;


remove the | undefined, its not doing anything since you already define the props as optional via ?

also i have doubts about the cause being required, that should be already defined on the base Error class

looks like this was there already, but please lets fix it

It was added there because of this:

this.add(error.cause);

this.add expects ErrnoException type and it shows an error, as cause type is unknown in Error class

packages/core/src/typedefs.ts

packages/utils/src/index.ts

B4nan · 2024-05-15T07:25:36Z

test/core/error_tracker.test.ts

@@ -157,7 +155,7 @@ test('no code is null code', () => {
    expect(tracker.getUniqueErrorCount()).toBe(1);
 });

-test('can hide error code', () => {
+test('can hide error code', async () => {


i mentioned one place, and you only fixed only one place. all of those changes in this file should be removed (except the import)

Yeah sorry I forgot the rest

test/e2e/puppeteer-error-snapshot/actor/Dockerfile

B4nan · 2024-05-15T07:27:40Z

@HamzaAlwan please try to work on this asap if you want to see this in v3.10.0

janbuchar · 2024-05-15T09:50:46Z

I might be late to the party, but why ErrorSnapshotter and not ErrorScreenshotter or something? "Snapshot" is pretty ambiguous.

B4nan · 2024-05-15T09:52:28Z

it does not store just screenshots, but also html snapshots, right?

edit: in fact it does not store any screenshots when used with non-browser crawlers

metalwarrior665 · 2024-05-15T09:55:52Z

We use the term snapshot as "store representation of the page" like screenshot, HTML (potentially HAR , JS etc. if we would want that). I agree the term is confusing but we didn't find better one. We already have https://crawlee.dev/api/3.7/playwright-crawler/namespace/playwrightUtils#saveSnapshot

…rty names for snapshot files

@B4nan

* chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * ci: test on node 22 (apify#2438) * chore: use node 20 in templates * chore(deps): update yarn to v4.2.1 * chore(deps): lock file maintenance * fix: return true when robots.isAllowed returns undefined (apify#2439) `undefined` means that there is no explicit rule for the requested route. No rules means no disallow, therefore it's allowed. Fixes apify#2437 --------- Co-authored-by: Jan Buchar <Teyras@gmail.com> * chore(deps): update patch/minor dependencies to v3.3.0 * chore(deps): update patch/minor dependencies to v3.3.2 * chore(deps): lock file maintenance * chore(deps): lock file maintenance * docs: Should be "Same Domain" not "Same Subdomain" (apify#2445) The docs appear to be a bit misleading. If people want "Same Subdomain" they should actually use "Same Hostname". ![image](https://github.com/apify/crawlee/assets/10026538/2b5452c5-e313-404b-812d-811e0764bd2d) * chore(docker): update docker state [skip ci] * docs: fix two typos (array or requests -> array of requests, no much -> not much) (apify#2451) * fix: sitemap `content-type` check breaks on `content-type` parameters (apify#2442) According to the [RFC1341](https://www.w3.org/Protocols/rfc1341/4_Content-Type.html), the Content-type header can contain additional string parameters. * chore(docker): update docker state [skip ci] * chore(deps): lock file maintenance * fix(core): fire local `SystemInfo` events every second (apify#2454) During local development, we are firing events for the AutoscaledPool about current system resources like memory or CPU. We were firing them once a minute by default, but we remove those snapshots older than 30s, so we never had anything to compare and always used only the very last piece of information. This PR changes the interval to 1s, aligning this with how the Apify platform fires events. * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): update dependency linkedom to ^0.18.0 (apify#2457) * chore(docker): update docker state [skip ci] * perf: optimize adding large amount of requests via `crawler.addRequests()` (apify#2456) This PR resolves three main issues with adding large amount of requests into the queue: - Every requests added to the queue was automatically added to the LRU requests cache, which has a size of 1 million items. this makes sense for enqueuing a few items, but if we try to add more than the limit, we end up with overloading the LRU cache for no reason. Now we only add the first 1000 requests to the cache (plus any requests added via separate calls, e.g. when doing `enqueueLinks` from inside a request handler, again with a limit of the first 1000 links). - We used to validate the whole requests array via `ow`, and since the shape can vary, it was very slow (e.g. 20s just for the `ow` validation). Now we use a tailored validation for the array that does the same but resolves within 100ms or so. - We always created the `Request` objects out of everything, which had a significant impact on memory usage. Now we skip this completely and let the objects be created later when needed (when calling `RQ.addRequests()` which only receives the actual batch and not the whole array) Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079 * perf: improve scaling based on memory (apify#2459) We only allowed to use 70% of the available memory, this PR changes the limit to 90%. Tested with a low memory options and it did not have any effect, while it allows to use more memory on the large memory setups - where the 30% could mean 2gb or so, we dont need such a huge buffer. Also increases the scaling steps to 10% instead of 5% so speed up the scaling. Related: [apify.slack.com/archives/C0L33UM7Z/p1715109984834079](https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079) * feat: make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) (apify#2390) Closes apify#2388 --------- Co-authored-by: drobnikj <drobnik.j@gmail.com> Co-authored-by: Martin Adámek <banan23@gmail.com> * fix: do not drop statistics on migration/resurrection/resume (apify#2462) This fixes a bug that was introduced with apify#1844 and apify#2083 - we reset the persisted state for statistics and session pool each time a crawler is started, which prevents their restoration. --------- Co-authored-by: Martin Adámek <banan23@gmail.com> * chore(deps): update patch/minor dependencies (apify#2450) * chore(docker): update docker state [skip ci] * fix: double tier decrement in tiered proxy (apify#2468) * docs: scrapy-vs-crawlee blog (apify#2431) Co-authored-by: Saurav Jain <sauain@SauravApify.local> Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com> * perf: optimize `RequestList` memory footprint (apify#2466) The request list now delays the conversion of the source items into the `Request` objects, resulting in a significantly less memory footprint. Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079 * fix: `EnqueueStrategy.All` erroring with links using unsupported protocols (apify#2389) This changes `EnqueueStrategy.All` to filter out non-http and non-https URLs (`mailto:` links were causing the crawler to error). Let me know if there's a better fix or if you want me to change something. Thanks! ``` Request failed and reached maximum retries. Error: Received one or more errors at _ArrayValidator.handle (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17) at _ArrayValidator.parse (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2) at RequestQueueClient.batchAddRequests (/path/to/project/node_modules/@crawlee/src/resource-clients/request-queue.ts:340:36) at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:238:46) at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:304:22) at attemptToAddToQueueAndAddAnyUnprocessed (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:302:42) at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:319:37) at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:309:22) at enqueueLinks (/path/to/project/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:384:2) at browserCrawlerEnqueueLinks (/path/to/project/node_modules/@crawlee/src/internals/browser-crawler.ts:777:21) ``` * fix(core): use createSessionFunction when loading Session from persisted state (apify#2444) Changes SessionPool's new Session loading behavior in the core module to utilize the configured createSessionFunction if specified. This ensures that new Sessions are instantiated using the custom session creation logic provided by the user, improving flexibility and adherence to user configurations. * fix(core): conversion between tough cookies and browser pool cookies (apify#2443) Fixes the conversion from tough cookies to browser pool cookies and vice versa, by correctly handling cookies where the domain has a leading dot versus when it doesn't. * test: fix e2e tests for zero concurrency * chore(deps): update dependency puppeteer to v22.8.2 * chore(docker): update docker state [skip ci] * docs: fixes (apify#2469) @B4nan minor fixes * chore(deps): update dependency puppeteer to v22.9.0 * feat: implement ErrorSnapshotter for error context capture (apify#2332) This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling. This functionality is opt-in, and can be enabled via the crawler options: ```ts const crawler = new BasicCrawler({ // ... statisticsOptions: { saveErrorSnapshots: true, }, }); ``` Closes apify#2280 --------- Co-authored-by: Martin Adámek <banan23@gmail.com> * test: fix e2e tests for error snapshotter * feat: add `FileDownload` "crawler" (apify#2435) Adds a new package `@crawlee/file-download`, which overrides the `HttpCrawler`'s MIME type limitations and allows the users to download arbitrary files. Aside from the regular `requestHandler`, this crawler introduces `streamHandler`, which passes a `ReadableStream` with the downloaded data to the user handler. --------- Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: Jan Buchar <jan.buchar@apify.com> * chore(release): v3.10.0 * chore(release): update internal dependencies [skip ci] * chore(docker): update docker state [skip ci] * docs: add v3.10 snapshot * docs: fix broken link for a moved content * chore(deps): lock file maintenance * docs: improve crawlee seo ranking (apify#2472) * chore(deps): lock file maintenance * refactor: Remove redundant fields from `StatisticsPersistedState` (apify#2475) Those fields are duplicated in the base class anyway. * chore(deps): lock file maintenance * fix: provide URLs to the error snapshot (apify#2482) This will respect the Actor SDK override automatically since importing the SDK will fire this side effect: https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L25 * docs: update keywords (apify#2481) Co-authored-by: Saurav Jain <sauain@SauravApify.local> * docs: add feedback from community. (apify#2478) Co-authored-by: Saurav Jain <sauain@SauravApify.local> Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com> * chore: use biome for code formatting (apify#2301) This takes ~50ms on my machine 🤯 - closes apify#2366 - Replacing spaces with tabs won't be done right here, right now. - eslint and biome are reconciled - ~biome check fails because of typescript errors - we can either fix those or find a way to ignore it~ * chore(docker): update docker state [skip ci] * test: Check if the proxy tier drops after an amount of successful requests (apify#2490) * chore: ignore docker state when checking formatting (apify#2491) * chore: remove unused eslint ignore directives * chore: fix formatting * chore: run biome as a pre-commit hook (apify#2493) * fix: adjust `URL_NO_COMMAS_REGEX` regexp to allow single character hostnames (apify#2492) Closes apify#2487 * fix: investigate and temp fix for possible 0-concurrency bug in RQv2 (apify#2494) * test: add e2e test for zero concurrency with RQ v2 * chore: update biome * chore(docker): update docker state [skip ci] * chore(deps): lock file maintenance (apify#2495) * chore(release): v3.10.1 * chore(release): update internal dependencies [skip ci] * chore(docker): update docker state [skip ci] * chore: add undeclared dependency * chore(deps): update patch/minor dependencies to v1.44.1 * chore(deps): lock file maintenance * chore(docker): update docker state [skip ci] * feat: Loading sitemaps from string (apify#2496) - closes apify#2460 * docs: fix homepage gradients (apify#2500) * fix: Autodetect sitemap filetype from content (apify#2497) - closes apify#2461 * chore(deps): update dependency puppeteer to v22.10.0 * chore(deps): lock file maintenance --------- Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: Gigino Chianese <Sajito@users.noreply.github.com> Co-authored-by: Jan Buchar <Teyras@gmail.com> Co-authored-by: Connor Adams <connorads@users.noreply.github.com> Co-authored-by: Apify Release Bot <noreply@apify.com> Co-authored-by: Jiří Spilka <jiri.spilka@apify.com> Co-authored-by: Jindřich Bär <jindrichbar@gmail.com> Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com> Co-authored-by: drobnikj <drobnik.j@gmail.com> Co-authored-by: Jan Buchar <jan.buchar@apify.com> Co-authored-by: Saurav Jain <souravjain540@gmail.com> Co-authored-by: Saurav Jain <sauain@SauravApify.local> Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com> Co-authored-by: Stefan Sundin <git@stefansundin.com> Co-authored-by: Gustavo Silva <silva95gustavo@gmail.com> Co-authored-by: Hamza Alwan <ihamzaalwan@gmail.com>

feat: Implement ErrorSnapshotter for error context capture (crawlee#2…

1c6b730

…280) This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.

HamzaAlwan requested review from B4nan and metalwarrior665 February 12, 2024 17:36

HamzaAlwan self-assigned this Feb 12, 2024

B4nan changed the title ~~feat: Implement ErrorSnapshotter for error context capture (crawlee#2280)~~ feat: implement ErrorSnapshotter for error context capture Feb 12, 2024

vladfrangu requested changes Feb 12, 2024

View reviewed changes

HamzaAlwan added 2 commits February 12, 2024 21:45

fix: update imports in error_snapshotter.ts and error_tracker.ts

994965d

refactor: error handling and snapshot capture

5f50e30

metalwarrior665 reviewed Feb 13, 2024

View reviewed changes

packages/basic-crawler/src/internals/basic-crawler.ts Show resolved Hide resolved

metalwarrior665 reviewed Feb 13, 2024

View reviewed changes

packages/utils/src/internals/error_snapshotter.ts Outdated Show resolved Hide resolved

metalwarrior665 reviewed Feb 13, 2024

View reviewed changes

packages/utils/src/internals/error_snapshotter.ts Outdated Show resolved Hide resolved

metalwarrior665 requested changes Feb 13, 2024

View reviewed changes

packages/utils/src/internals/error_snapshotter.ts Outdated Show resolved Hide resolved

B4nan reviewed Feb 13, 2024

View reviewed changes

packages/utils/src/internals/error_snapshotter.ts Outdated Show resolved Hide resolved

B4nan reviewed Feb 13, 2024

View reviewed changes

packages/utils/src/internals/error_tracker.ts Outdated Show resolved Hide resolved

HamzaAlwan added 2 commits February 21, 2024 10:33

fix: update import statement in error_tracker.ts

4061fed

refactor: updated ErrorSnapshotter class snapshot methods and handled…

512b3ff

… local snapshots paths

metalwarrior665 reviewed Feb 21, 2024

View reviewed changes

packages/utils/src/internals/error_tracker.ts Outdated Show resolved Hide resolved

metalwarrior665 reviewed Feb 21, 2024

View reviewed changes

packages/utils/src/internals/error_snapshotter.ts Outdated Show resolved Hide resolved

HamzaAlwan added 2 commits February 21, 2024 13:02

refactor: update file paths for error snapshots

3301944

refactor: shorten the code in error_snapshotter.ts and update error f…

00c7d1a

…iles names

HamzaAlwan requested review from vladfrangu and metalwarrior665 February 23, 2024 15:22

metalwarrior665 reviewed Mar 6, 2024

View reviewed changes

packages/core/src/crawlers/error_tracker.ts Outdated Show resolved Hide resolved

test/core/error_tracker.test.ts Show resolved Hide resolved

refactor: error_tracker add methods handling and error_snapshotter tests

e56e1cb

B4nan requested changes May 3, 2024

View reviewed changes

janbuchar mentioned this pull request May 13, 2024

Implement ErrorTracker apify/crawlee-python#151

Open

HamzaAlwan added 5 commits May 14, 2024 23:49

fix: readd error-tracker.ts file to packages/utils to avoid breaking …

d2b2c31

…changes

refactor: remove the extension logic when saving screenshot

a509de6

refactor: Update error_tracker to remove the new addAsync function an…

967def2

…d revert changes to add function

chore: update node version in Dockerfile for error_tracker tests

224d512

refactor: Update ErrorSnapshotter class

deed57b

B4nan requested changes May 15, 2024

View reviewed changes

HamzaAlwan and others added 8 commits May 15, 2024 22:41

refactor: Update ErrorSnapshotter class

084cb0c

revert: Dockerfile node version

f239415

refactor: Update ErrorSnapshotter class and removed hardcoded paths

912c2a3

refactor: Update ErrorSnapshotter class to use more descriptive prope…

70c104d

…rty names for snapshot files

Merge branch 'master' into feat/save-screenshot-and-html-on-first-error

7fd25a7

fix tests

bdcd0dc

revert to addAsync method

3cf793c

make it configurable and opt-in

4a59c07

B4nan force-pushed the feat/save-screenshot-and-html-on-first-error branch from 317a5a5 to 4a59c07 Compare May 16, 2024 10:58

B4nan merged commit e861dfd into apify:master May 16, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement ErrorSnapshotter for error context capture #2332

feat: implement ErrorSnapshotter for error context capture #2332

HamzaAlwan commented Feb 12, 2024 •

edited by B4nan

Loading

HamzaAlwan commented Feb 12, 2024 •

edited

Loading

vladfrangu commented Feb 12, 2024

HamzaAlwan commented Feb 12, 2024

vladfrangu commented Feb 12, 2024

HamzaAlwan commented Feb 12, 2024

HamzaAlwan commented Feb 12, 2024 •

edited

Loading

B4nan commented Feb 13, 2024

metalwarrior665 commented Feb 13, 2024

metalwarrior665 left a comment •

edited

Loading

HamzaAlwan commented Feb 20, 2024

metalwarrior665 commented Feb 21, 2024

metalwarrior665 left a comment

metalwarrior665 commented Apr 11, 2024

B4nan left a comment •

edited

Loading

B4nan Feb 21, 2024

metalwarrior665 May 9, 2024

HamzaAlwan May 14, 2024

metalwarrior665 commented May 9, 2024

B4nan commented May 15, 2024

B4nan left a comment

B4nan May 15, 2024

metalwarrior665 May 15, 2024

B4nan May 15, 2024

B4nan May 15, 2024 •

edited

Loading

B4nan May 15, 2024

B4nan May 15, 2024

B4nan May 15, 2024

HamzaAlwan May 15, 2024 •

edited

Loading

B4nan May 15, 2024

HamzaAlwan May 15, 2024

B4nan commented May 15, 2024

janbuchar commented May 15, 2024

B4nan commented May 15, 2024 •

edited

Loading

metalwarrior665 commented May 15, 2024

feat: implement ErrorSnapshotter for error context capture #2332

feat: implement ErrorSnapshotter for error context capture #2332

Conversation

HamzaAlwan commented Feb 12, 2024 • edited by B4nan Loading

HamzaAlwan commented Feb 12, 2024 • edited Loading

vladfrangu commented Feb 12, 2024

HamzaAlwan commented Feb 12, 2024

vladfrangu commented Feb 12, 2024

HamzaAlwan commented Feb 12, 2024

HamzaAlwan commented Feb 12, 2024 • edited Loading

B4nan commented Feb 13, 2024

metalwarrior665 commented Feb 13, 2024

metalwarrior665 left a comment • edited Loading

Choose a reason for hiding this comment

HamzaAlwan commented Feb 20, 2024

metalwarrior665 commented Feb 21, 2024

metalwarrior665 left a comment

Choose a reason for hiding this comment

metalwarrior665 commented Apr 11, 2024

B4nan left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

metalwarrior665 commented May 9, 2024

B4nan commented May 15, 2024

B4nan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

B4nan May 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HamzaAlwan May 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

B4nan commented May 15, 2024

janbuchar commented May 15, 2024

B4nan commented May 15, 2024 • edited Loading

metalwarrior665 commented May 15, 2024

HamzaAlwan commented Feb 12, 2024 •

edited by B4nan

Loading

HamzaAlwan commented Feb 12, 2024 •

edited

Loading

HamzaAlwan commented Feb 12, 2024 •

edited

Loading

metalwarrior665 left a comment •

edited

Loading

B4nan left a comment •

edited

Loading

B4nan May 15, 2024 •

edited

Loading

HamzaAlwan May 15, 2024 •

edited

Loading

B4nan commented May 15, 2024 •

edited

Loading