add puppeteer js engine #850
Conversation
browser_settings["args"] = ['--no-sandbox', "--proxy-server=" + fetch.proxy];
}

const browser = await puppeteer.launch(browser_settings);
It will have performance issues if a new browser is launched for each request. The browser and its tabs should be pooled and reused.
You should add a Node.js process in run.py and run puppeteer there. Also, do you open and close a browser with each request? That is too resource intensive. I think you should open only one browser for all requests, and open a new page per request rather than a new browser.
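To illustrate the reuse pattern the reviewers describe, here is a minimal sketch (not part of the PR; the helpers getBrowser and handleRequest are made up for this example) that launches a single browser once and only opens a page per request:

const puppeteer = require('puppeteer');

// Launch the browser once and keep it for the lifetime of the process.
let browser = null;
async function getBrowser() {
    if (!browser) {
        browser = await puppeteer.launch({ args: ['--no-sandbox'] });
    }
    return browser;
}

// Each request only opens and closes a page (tab), never the browser itself.
async function handleRequest(url) {
    const page = await (await getBrowser()).newPage();
    try {
        await page.goto(url);
        return await page.content();
    } finally {
        await page.close();   // close the tab, keep the browser running
    }
}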
await page.close();
await browser.close();
return;
As spider is an async function, it will not exit after make_result; it will exit immediately after page.goto. Your code may still work because it runs on the event loop, but mixing an async function with callback-style code is not a good structure. Since an async function is already used here, you don't need to keep it like this: just wait for the page state to change after page.goto, then execute the js, and so on, then make the result (Promise.race may help).
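As a rough sketch of that suggestion (url and timeout are placeholders, and this is not the PR's actual code), the navigation and a timeout can be raced inside the same async function, with the follow-up steps written as plain awaits instead of callbacks:

// Wait for the DOM to be ready, or give up after `timeout` ms, in a single await.
await Promise.race([
    page.goto(url, { waitUntil: 'domcontentloaded' }),
    new Promise((resolve, reject) =>
        setTimeout(() => reject(new Error('navigation timeout')), timeout)),
]);
// Only then execute the injected script and build the result.
const result = await page.evaluate(() => document.title);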
Thank you very much! I need it!
Sample code using async/await would be:

const browser = new Browser();              // placeholder: a shared, already-launched browser
const pool = new Pool(n, browser.newPage);  // placeholder: a pool of pre-opened pages (tabs)

async function fetch(options) {
    const page = pool.get();
    try {
        await _fetch(page, options);
        return await _make_result(page);
    } catch (error) {
        console.error(error);
        return await _make_result(page, error);
    } finally {
        // clean up
        page.removeAllListeners();
    }
}

async function _fetch(page, options) {
    const { url, timeout } = options;
    // setups
    await page.setUserAgent(xxx);
    // install listeners
    page.on('console', msg => { /* forward page console output */ });
    let error = null;
    page.on('error', e => error = e);
    page.on('close', () => error = new Error('page closed'));
    page.on('pageerror', e => error = e);
    // to monitor the network
    let last_request = 0;
    const pending_requests = new Set();
    page.on('request', r => { last_request = Date.now(); pending_requests.add(r); });
    page.on('requestfinished', r => pending_requests.delete(r));
    // real goto goes here
    await page.goto(url);
    // wait till everything is loaded, or timeout
    await Promise.race([
        // it's better to use domcontentloaded than load here, because load is only
        // triggered when all images are loaded, but the page can actually be
        // accessed after domcontentloaded
        new Promise(r => page.once('domcontentloaded', r)),
        // I don't know the behavior of page.goto; if it doesn't throw errors on
        // not-exists-domain.org / connection reset / 404 / 304 / unknown protocol /
        // crashes, you should capture that.
        new Promise((r, j) => page.once('error', j)),
        new Promise((r, j) => page.once('pageerror', j)),
        new Promise((r, j) => page.once('close', j)),
        new Promise((r, j) => setTimeout(() => j(new Error('timeout')), timeout)),
    ]);
    if (error) throw error;
    // wait for the network to go quiet: no pending requests, or `timeout` ms since the last one
    while (pending_requests.size && last_request + timeout > Date.now()) {
        await new Promise(r => setTimeout(r, 10));
        if (error) throw error;
    }
    // execute js or other actions
    await page.evaluate(your_js);
    // you could do another wait after executing js
    while (pending_requests.size && last_request + timeout > Date.now()) {
        await new Promise(r => setTimeout(r, 10));
        if (error) throw error;
    }
    // other processing ...
}

async function _make_result(page, error) {
    // build result object here
}
Thank you very much for your suggestion. I'll refer to it and revise the code.