feat: add support for sameDomainDelay #2003

Merged 21 commits into master from sameDomainDelay on Jul 31, 2023
Showing changes from 17 of the 21 commits

Commits
242b17b
Added support for sameDomainDelay
Dineshhardasani Jul 20, 2023
c260cd3
Fixed minor bug
Dineshhardasani Jul 20, 2023
50c5e27
Extracted domain from url and Check if it can be crawled or not
Dineshhardasani Jul 20, 2023
7c0c216
Added same domain delay support for both list and queue
Dineshhardasani Jul 20, 2023
5b7c96e
refactor code
Dineshhardasani Jul 21, 2023
37e52d4
Refactor code
Dineshhardasani Jul 21, 2023
dbfb905
refactor code
Dineshhardasani Jul 21, 2023
0e64f2f
Minor fixes
Dineshhardasani Jul 21, 2023
c3f3da0
Merge remote-tracking branch 'origin/master' into sameDomainDelay
Dineshhardasani Jul 21, 2023
9f3ea5f
Fixed minor bug and Removed lastAccessTime from options
Dineshhardasani Jul 21, 2023
0359c80
Added tldts library in package.json
Dineshhardasani Jul 25, 2023
b1eadf3
Minor update
Dineshhardasani Jul 25, 2023
2da71f6
update yarn lock file
Dineshhardasani Jul 25, 2023
86703fc
Merge remote-tracking branch 'origin/master' into sameDomainDelay
Dineshhardasani Jul 25, 2023
ee75c0b
Merge remote-tracking branch 'origin/master' into sameDomainDelay
Dineshhardasani Jul 27, 2023
a9b2334
Generated yarn lock file
Dineshhardasani Jul 27, 2023
c5380db
Apply suggestions from code review
B4nan Jul 27, 2023
5ef2295
Apply suggestions from code review
B4nan Jul 27, 2023
cca52da
Update packages/basic-crawler/src/internals/basic-crawler.ts
B4nan Jul 27, 2023
34cd848
Update packages/basic-crawler/src/internals/basic-crawler.ts
B4nan Jul 28, 2023
34098b4
rename option to `sameDomainDelaySecs`
B4nan Jul 28, 2023
1 change: 1 addition & 0 deletions packages/basic-crawler/package.json
@@ -53,6 +53,7 @@
"@crawlee/utils": "^3.4.2",
"got-scraping": "^3.2.9",
"ow": "^0.28.1",
"tldts": "^6.0.0",
"tslib": "^2.4.0",
"type-fest": "^4.0.0"
}
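The only dependency added here is tldts; its getDomain() is what the crawler code below uses to key the per-domain delay. A quick sketch of what it returns, assuming tldts' default options (the hosts are illustrative):

```ts
import { getDomain } from 'tldts';

// getDomain() resolves the registrable domain (eTLD+1), so requests to
// different subdomains of the same site share one delay bucket.
console.log(getDomain('https://blog.example.com/post/1')); // 'example.com'
console.log(getDomain('https://shop.example.co.uk/cart')); // 'example.co.uk'
console.log(getDomain('http://localhost:3000/'));          // null, so the guard in the diff skips the delay
```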
42 changes: 41 additions & 1 deletion packages/basic-crawler/src/internals/basic-crawler.ts
@@ -46,6 +46,7 @@
import type { Method, OptionsInit } from 'got-scraping';
import { gotScraping } from 'got-scraping';
import ow, { ArgumentError } from 'ow';
import { getDomain } from 'tldts';
import type { SetRequired } from 'type-fest';

export interface BasicCrawlingContext<
@@ -220,6 +221,12 @@
maxRequestRetries?: number;

/**
* Indicates how much time (in milliseconds) to wait before crawling another request to the same domain
* @default 0
*/
sameDomainDelay?: number;

/**
* Maximum number of session rotations per request.
* The crawler will automatically rotate the session in case of a proxy error or if it gets blocked by the website.
*
@@ -434,6 +441,8 @@
protected requestHandlerTimeoutMillis!: number;
protected internalTimeoutMillis: number;
protected maxRequestRetries: number;
protected sameDomainDelay: number;
protected domainAccessedTime: Map<string, number>;
protected maxSessionRotations: number;
protected handledRequestsCount: number;
protected statusMessageLoggingInterval: number;
@@ -463,6 +472,7 @@
// TODO: remove in a future release
handleFailedRequestFunction: ow.optional.function,
maxRequestRetries: ow.optional.number,
sameDomainDelay: ow.optional.number,
maxSessionRotations: ow.optional.number,
maxRequestsPerCrawl: ow.optional.number,
autoscaledPoolOptions: ow.optional.object,
@@ -494,6 +504,7 @@
requestList,
requestQueue,
maxRequestRetries = 3,
sameDomainDelay = 0,
maxSessionRotations = 10,
maxRequestsPerCrawl,
autoscaledPoolOptions = {},
@@ -533,6 +544,7 @@
this.statusMessageLoggingInterval = statusMessageLoggingInterval;
this.statusMessageCallback = statusMessageCallback as StatusMessageCallback;
this.events = config.getEventManager();
this.domainAccessedTime = new Map();

this._handlePropertyNameChange({
newName: 'requestHandler',
@@ -589,6 +601,7 @@
}

this.maxRequestRetries = maxRequestRetries;
this.sameDomainDelay = sameDomainDelay;
this.maxSessionRotations = maxSessionRotations;
this.handledRequestsCount = 0;
this.stats = new Statistics({ logMessage: `${log.getOptions().prefix} request statistics:`, config });
@@ -964,6 +977,31 @@
_crawlingContext: Context,
) {}

protected _handleRequestWithDelay(request: Request, source: RequestQueue | RequestList) {
const domain = getDomain(request.url);

if (!domain || !request) {
return false;
}

const now = Date.now();
const lastAccessTime = this.domainAccessedTime.get(domain);

if (!lastAccessTime || (now - lastAccessTime) >= this.sameDomainDelay) {
this.domainAccessedTime.set(domain, now);
return false;
}

const delay = lastAccessTime + this.sameDomainDelay - now;
this.log.debug(`Request ${request.url} (${request.id}) will be reclaimed after ${delay} milliseconds due to same domain delay`);
setTimeout(async () => {
this.log.debug(`Adding request ${request.url} (${request.id}) back to the queue`);
await source?.reclaimRequest(request);
}, delay);

return true;
}

/**
* Wrapper around requestHandler that fetches requests from RequestList/RequestQueue
* then retries them in a case of an error, etc.
@@ -996,7 +1034,9 @@

tryCancel();

if (!request) return;
if (!request || this._handleRequestWithDelay(request, source)) {
return;
}

// Reset loadedUrl so an old one is not carried over to retries.
request.loadedUrl = undefined;
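Stripped of the crawler plumbing, the throttling arithmetic in _handleRequestWithDelay above comes down to the following simplified sketch (the function name and the bare Map are mine, not part of the PR):

```ts
const lastAccessTime = new Map<string, number>();

/**
 * Returns 0 when the domain may be crawled right away, otherwise the number
 * of milliseconds to wait before reclaiming the request (same arithmetic as
 * the diff: lastAccessTime + sameDomainDelay - now).
 */
function msUntilAllowed(domain: string, sameDomainDelay: number): number {
    const now = Date.now();
    const last = lastAccessTime.get(domain);
    if (last === undefined || now - last >= sameDomainDelay) {
        lastAccessTime.set(domain, now);
        return 0;
    }
    return last + sameDomainDelay - now;
}
```

For example, with sameDomainDelay = 10_000 and a domain last touched 4 seconds ago, the request is reclaimed after roughly 6 seconds.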
1 change: 1 addition & 0 deletions yarn.lock
@@ -750,6 +750,7 @@ __metadata:
"@crawlee/utils": ^3.4.2
got-scraping: ^3.2.9
ow: ^0.28.1
tldts: ^6.0.0
tslib: ^2.4.0
type-fest: ^4.0.0
languageName: unknown
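For completeness, a sketch of how the option can be configured from user code once released. It assumes the final option name sameDomainDelaySecs from commit 34098b4 and uses CheerioCrawler; the URLs and handler body are placeholders:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Wait at least 5 seconds between two requests whose URLs resolve to the
    // same registrable domain; other domains are crawled at full speed.
    sameDomainDelaySecs: 5,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run([
    'https://crawlee.dev/docs',
    'https://crawlee.dev/api',
]);
```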