Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement ErrorSnapshotter for error context capture #2332

Merged
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
1c6b730
feat: Implement ErrorSnapshotter for error context capture (crawlee#2…
HamzaAlwan Feb 12, 2024
994965d
fix: update imports in error_snapshotter.ts and error_tracker.ts
HamzaAlwan Feb 12, 2024
5f50e30
refactor: error handling and snapshot capture
HamzaAlwan Feb 12, 2024
4061fed
fix: update import statement in error_tracker.ts
HamzaAlwan Feb 21, 2024
512b3ff
refactor: updated ErrorSnapshotter class snapshot methods and handled…
HamzaAlwan Feb 21, 2024
3301944
refactor: update file paths for error snapshots
HamzaAlwan Feb 21, 2024
00c7d1a
refactor: shorten the code in error_snapshotter.ts and update error f…
HamzaAlwan Feb 22, 2024
3205465
chore: add tests for ErrorSnapshotter
HamzaAlwan Feb 22, 2024
778a2bc
fix: conditionally generate screenshot and html file names based on A…
HamzaAlwan Feb 22, 2024
efa5437
refactor: move error_snapshotter.ts and error_tracker.ts to core package
HamzaAlwan Feb 22, 2024
865454e
refactor: update actor and package names for error snapshot tests
HamzaAlwan Feb 22, 2024
6982e03
refactor: add a new function `addAsync`in error_tracker and revert th…
HamzaAlwan Feb 22, 2024
c79e3b1
fix: add partial types in typedefs.ts
HamzaAlwan Feb 22, 2024
9eec3ee
fix: update puppeteer and playwright utils test
HamzaAlwan Feb 22, 2024
e08a3cf
fix: lint issue
HamzaAlwan Feb 23, 2024
e56e1cb
refactor: error_tracker add methods handling and error_snapshotter tests
HamzaAlwan Mar 11, 2024
d2b2c31
fix: readd error-tracker.ts file to packages/utils to avoid breaking …
HamzaAlwan May 14, 2024
a509de6
refactor: remove the extension logic when saving screenshot
HamzaAlwan May 14, 2024
967def2
refactor: Update error_tracker to remove the new addAsync function an…
HamzaAlwan May 14, 2024
224d512
chore: update node version in Dockerfile for error_tracker tests
HamzaAlwan May 14, 2024
deed57b
refactor: Update ErrorSnapshotter class
HamzaAlwan May 14, 2024
084cb0c
refactor: Update ErrorSnapshotter class
HamzaAlwan May 15, 2024
f239415
revert: Dockerfile node version
HamzaAlwan May 15, 2024
912c2a3
refactor: Update ErrorSnapshotter class and removed hardcoded paths
HamzaAlwan May 16, 2024
70c104d
refactor: Update ErrorSnapshotter class to use more descriptive prope…
HamzaAlwan May 16, 2024
7fd25a7
Merge branch 'master' into feat/save-screenshot-and-html-on-first-error
B4nan May 16, 2024
bdcd0dc
fix tests
B4nan May 16, 2024
3cf793c
revert to addAsync method
B4nan May 16, 2024
4a59c07
make it configurable and opt-in
B4nan May 16, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
12 changes: 10 additions & 2 deletions packages/basic-crawler/src/internals/basic-crawler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -1363,7 +1363,7 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
const shouldRetryRequest = this._canRequestBeRetried(request, error);

if (shouldRetryRequest) {
this.stats.errorTrackerRetry.add(error);
await this.stats.errorTrackerRetry.addAsync(error, crawlingContext);

if (error instanceof SessionError) {
await this._rotateSession(crawlingContext);
Expand All @@ -1389,7 +1389,15 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
}
}

this.stats.errorTracker.add(error);
// If the request is non-retryable, the error and snapshot aren't saved in the errorTrackerRetry object.
// Therefore, we pass the crawlingContext to the errorTracker.add method, enabling snapshot capture.
// This is to make sure the error snapshot is not duplicated in the errorTrackerRetry and errorTracker objects.
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
const { noRetry, maxRetries } = request;
if (noRetry || !maxRetries) {
await this.stats.errorTracker.addAsync(error, crawlingContext);
} else {
this.stats.errorTracker.add(error);
}

// If we get here, the request is either not retryable
// or failed more than retryCount times and will not be retried anymore.
Expand Down
135 changes: 135 additions & 0 deletions packages/core/src/crawlers/error_snapshotter.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
import crypto from 'node:crypto';

import type { ErrnoException } from './error_tracker';
import type { CrawlingContext } from '../crawlers/crawler_commons';
import type { KeyValueStore } from '../storages';
import type { PlaywrightCrawlingContext, PuppeteerCrawlingContext, PlaywrightPage, PuppeteerPage } from '../typedefs';

const { PWD, CRAWLEE_STORAGE_DIR, APIFY_IS_AT_HOME } = process.env;
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved

/**
* ErrorSnapshotter class is used to capture a screenshot of the page and a snapshot of the HTML when an error occur during web crawling.
*/
export class ErrorSnapshotter {
static readonly MAX_ERROR_CHARACTERS = 30;
static readonly MAX_HASH_LENGTH = 30;
static readonly MAX_FILENAME_LENGTH = 250;
static readonly BASE_MESSAGE = 'An error occurred';
static readonly SNAPSHOT_PREFIX = 'ERROR_SNAPSHOT';
static readonly KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores';
static readonly KEY_VALUE_STORE_LOCAL_PATH = `file://${PWD}/${CRAWLEE_STORAGE_DIR || 'storage'}/key_value_stores`;
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved

/**
* Capture a snapshot of the error context.
*/
async captureSnapshot(error: ErrnoException, context: CrawlingContext): Promise<{ screenshotFileUrl?: string; htmlFileUrl?: string }> {
const { KEY_VALUE_PLATFORM_PATH, KEY_VALUE_STORE_LOCAL_PATH } = ErrorSnapshotter;
const page = context?.page as PuppeteerPage | PlaywrightPage | undefined;
const body = context?.body;

const keyValueStore = await context?.getKeyValueStore();
// If the key-value store is not available, or the body and page are not available, return empty filenames
if (!keyValueStore || (!body && !page)) {
return {};
}

const filename = this.generateFilename(error);

let screenshotFilename: string | undefined;
let htmlFileName: string | undefined;

if (page) {
const capturedFiles = await this.captureSnapShot(
context as unknown as
| PlaywrightCrawlingContext
| PuppeteerCrawlingContext,
filename,
);

if (capturedFiles) {
screenshotFilename = capturedFiles.screenshotFilename;
htmlFileName = capturedFiles.htmlFileName;
}

// If the snapshot for browsers failed to capture the HTML, try to capture it from the page content
if (!htmlFileName) {
const html = await page?.content() || undefined;
htmlFileName = html ? await this.saveHTMLSnapshot(html, keyValueStore, filename) : undefined;
}
} else if (typeof body === 'string') { // for non-browser contexts
htmlFileName = await this.saveHTMLSnapshot(body, keyValueStore, filename);
}

if (APIFY_IS_AT_HOME) {
const platformPath = `${KEY_VALUE_PLATFORM_PATH}/${keyValueStore.id}/records`;
return {
screenshotFileUrl: screenshotFilename ? `${platformPath}/${screenshotFilename}` : undefined,
htmlFileUrl: htmlFileName ? `${platformPath}/${htmlFileName}` : undefined,
};
}

const localPath = `${KEY_VALUE_STORE_LOCAL_PATH}/${keyValueStore.name || 'default'}`;
return {
screenshotFileUrl: screenshotFilename ? `${localPath}/${screenshotFilename}` : undefined,
htmlFileUrl: htmlFileName ? `${localPath}/${htmlFileName}` : undefined,
};
}

/**
* Capture a screenshot and HTML of the page (For Browser only), and return the filename with the extension.
*/
async captureSnapShot(
context: PlaywrightCrawlingContext | PuppeteerCrawlingContext,
filename: string): Promise<{
screenshotFilename?: string;
htmlFileName?: string;
} | undefined> {
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
try {
await context.saveSnapshot({ key: filename });
return {
// The screenshot file extension is different for Apify and local environments
screenshotFilename: `${filename}${APIFY_IS_AT_HOME ? '.jpg' : '.jpeg'}`,
htmlFileName: `${filename}.html`,
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
};
} catch (e) {
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
return undefined;
}
}

/**
* Save the HTML snapshot of the page, and return the filename with the extension.
*/
async saveHTMLSnapshot(html: string, keyValueStore: KeyValueStore, filename: string): Promise<string | undefined> {
try {
await keyValueStore.setValue(filename, html, { contentType: 'text/html' });
return `${filename}.html`;
} catch (e) {
return undefined;
}
}

/**
* Remove non-word characters from the start and end of a string.
*/
sanitizeString(str: string): string {
return str.replace(/^\W+|\W+$/g, '');
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
}

/**
* Generate a unique filename for each error snapshot.
*/
generateFilename(error: ErrnoException): string {
const { SNAPSHOT_PREFIX, BASE_MESSAGE, MAX_HASH_LENGTH, MAX_ERROR_CHARACTERS, MAX_FILENAME_LENGTH } = ErrorSnapshotter;
const { sanitizeString } = this;
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
// Create a hash of the error stack trace
const errorStackHash = crypto.createHash('sha1').update(error.stack || error.message || '').digest('hex').slice(0, MAX_HASH_LENGTH);
const errorMessagePrefix = (error.message || BASE_MESSAGE).slice(0, MAX_ERROR_CHARACTERS).trim();

// Generate filename and remove disallowed characters
const filename = `${SNAPSHOT_PREFIX}_${sanitizeString(errorStackHash)}_${sanitizeString(errorMessagePrefix)}`
.replace(/\W+/g, '-') // Replace non-word characters with a dash
.slice(0, MAX_FILENAME_LENGTH);

return filename;
}
}
Original file line number Diff line number Diff line change
@@ -1,9 +1,12 @@
import { inspect } from 'node:util';

import { ErrorSnapshotter } from './error_snapshotter';
import type { CrawlingContext } from '../crawlers/crawler_commons';

/**
* Node.js Error interface
*/
interface ErrnoException extends Error {
export interface ErrnoException extends Error {
errno?: number | undefined;
code?: string | number | undefined;
path?: string | undefined;
Expand Down Expand Up @@ -283,6 +286,8 @@ export class ErrorTracker {

total: number;

errorSnapshotter: ErrorSnapshotter;

constructor(options: Partial<ErrorTrackerOptions> = {}) {
this.#options = {
showErrorCode: true,
Expand All @@ -294,6 +299,8 @@ export class ErrorTracker {
...options,
};

this.errorSnapshotter = new ErrorSnapshotter();

this.result = Object.create(null);
this.total = 0;
}
Expand Down Expand Up @@ -326,6 +333,43 @@ export class ErrorTracker {
}
}

/**
* This method is async, because it captures a snapshot of the error context.
* We added this new method to avoid breaking changes.
*/
async addAsync(error: ErrnoException, context?: CrawlingContext) {
this.total++;

let group = this.result;

if (this.#options.showStackTrace) {
group = getStackTraceGroup(error, group, this.#options.showFullStack);
}

if (this.#options.showErrorCode) {
group = getErrorCodeGroup(error, group);
}

if (this.#options.showErrorName) {
group = getErrorNameGroup(error, group);
}

if (this.#options.showErrorMessage) {
group = getErrorMessageGroup(error, group, this.#options.showFullMessage);
}
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved

increaseCount(group as { count: number });

// Capture a snapshot (screenshot and HTML) on the first occurrence of an error
if (group.count === 1 && context) {
await this.captureSnapshot(group, error, context).catch(() => { });
}

if (typeof error.cause === 'object' && error.cause !== null) {
await this.addAsync(error.cause);
}
}

getUniqueErrorCount() {
let count = 0;

Expand Down Expand Up @@ -366,6 +410,13 @@ export class ErrorTracker {
return result.sort((a, b) => b[0] - a[0]).slice(0, count);
}

async captureSnapshot(storage: Record<string, unknown>, error: ErrnoException, context: CrawlingContext) {
const { screenshotFileUrl, htmlFileUrl } = await this.errorSnapshotter.captureSnapshot(error, context);

storage.firstErrorScreenshotUrl = screenshotFileUrl;
storage.firstErrorHtmlUrl = htmlFileUrl;
}

reset() {
// This actually safe, since we Object.create(null) so no prototype pollution can happen.
// eslint-disable-next-line no-restricted-syntax, guard-for-in
Expand Down
2 changes: 2 additions & 0 deletions packages/core/src/crawlers/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,5 @@ export * from './crawler_commons';
export * from './crawler_extension';
export * from './crawler_utils';
export * from './statistics';
export * from './error_tracker';
export * from './error_snapshotter';
2 changes: 1 addition & 1 deletion packages/core/src/crawlers/statistics.ts
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
import { ErrorTracker } from '@crawlee/utils';
import ow from 'ow';

import { ErrorTracker } from './error_tracker';
import { Configuration } from '../configuration';
import type { EventManager } from '../events/event_manager';
import { EventType } from '../events/event_manager';
Expand Down
14 changes: 14 additions & 0 deletions packages/core/src/typedefs.ts
Original file line number Diff line number Diff line change
Expand Up @@ -17,3 +17,17 @@ export function keys<T extends {}>(obj: T) {
}

export declare type AllowedHttpMethods = 'GET' | 'HEAD' | 'POST' | 'PUT' | 'DELETE' | 'TRACE' | 'OPTIONS' | 'CONNECT' | 'PATCH';

// Define the following types as we cannot import the complete types from the respective packages
export interface PlaywrightCrawlingContext {
saveSnapshot: (options: { key: string }) => Promise<void>;
}
export interface PuppeteerCrawlingContext {
saveSnapshot: (options: { key: string }) => Promise<void>;
}
export interface PlaywrightPage {
content: () => Promise<string>;
}
export interface PuppeteerPage {
content: () => Promise<string>;
}
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
Expand Up @@ -534,14 +534,16 @@ export async function saveSnapshot(page: Page, options: SaveSnapshotOptions = {}
try {
const store = await KeyValueStore.open(keyValueStoreName, { config: config ?? Configuration.getGlobalConfig() });

const { APIFY_IS_AT_HOME } = process.env;

if (saveScreenshot) {
const screenshotName = `${key}.jpg`;
const screenshotName = APIFY_IS_AT_HOME ? `${key}.jpg` : key;
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
const screenshotBuffer = await page.screenshot({ fullPage: true, quality: screenshotQuality, type: 'jpeg', animations: 'disabled' });
await store.setValue(screenshotName, screenshotBuffer, { contentType: 'image/jpeg' });
}

if (saveHtml) {
const htmlName = `${key}.html`;
const htmlName = APIFY_IS_AT_HOME ? `${key}.html` : key;
const html = await page.content();
await store.setValue(htmlName, html, { contentType: 'text/html' });
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -666,14 +666,16 @@ export async function saveSnapshot(page: Page, options: SaveSnapshotOptions = {}
try {
const store = await KeyValueStore.open(keyValueStoreName, { config: config ?? Configuration.getGlobalConfig() });

const { APIFY_IS_AT_HOME } = process.env;

if (saveScreenshot) {
const screenshotName = `${key}.jpg`;
const screenshotName = APIFY_IS_AT_HOME ? `${key}.jpg` : key;
const screenshotBuffer = await page.screenshot({ fullPage: true, quality: screenshotQuality, type: 'jpeg' });
await store.setValue(screenshotName, screenshotBuffer, { contentType: 'image/jpeg' });
}

if (saveHtml) {
const htmlName = `${key}.html`;
const htmlName = APIFY_IS_AT_HOME ? `${key}.html` : key;
const html = await page.content();
await store.setValue(htmlName, html, { contentType: 'text/html' });
}
Expand Down
1 change: 0 additions & 1 deletion packages/utils/src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@ export * from './internals/memory-info';
export * from './internals/debug';
export * as social from './internals/social';
export * from './internals/typedefs';
export * from './internals/error_tracker';
HamzaAlwan marked this conversation as resolved.
Show resolved Hide resolved
export * from './internals/open_graph_parser';
export * from './internals/gotScraping';
export * from './internals/robots';
Expand Down
2 changes: 1 addition & 1 deletion packages/utils/test/non-error-objects-working.test.ts
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
import { ErrorTracker } from '../src/internals/error_tracker';
import { ErrorTracker } from '../../core/src/crawlers/error_tracker';

describe('ErrorTracker', () => {
test('processing a non-error error should not crash', () => {
Expand Down