Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add puppeteer js engine #850

Closed
wants to merge 5 commits into from
Closed

add puppeteer js engine #850

wants to merge 5 commits into from

Conversation

qshine
Copy link

@qshine qshine commented Jan 3, 2019

add puppeteer js engineg.

browser_settings["args"] = ['--no-sandbox', "--proxy-server=" + fetch.proxy];
}

const browser = await puppeteer.launch(browser_settings);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will have performance issue if new browser launched for each request. Browser and tabs should be pooled and reused.

@djytwy
Copy link

djytwy commented Jan 4, 2019

You should add nodejs process in run.py and run puppeteer.Other,you open and close a browser with each request ?This is too resource intensive. I think you should open only one browser with all of requests,open page with each request not browser.


await page.close();
await browser.close();
return;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spider as an async function, it will not exists after make_result, it will exit immediately after page.goto.
Your code may still work as it will run over the event loop, but it's not a good structure to combine async function and callback style code.

As async function is used here, you don't need to keep it like this. Just wait for page state change after page.goto then execute js, and so on then make result. (Promise.race may help)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much !I need it !

@binux
Copy link
Owner

binux commented Jan 4, 2019

A sample code using async/await would be:

const browser = new Browser();
const pool = new Pool(n, browser.newPage);

async function fetch(options) {
  const page = pool.get();
  try {
    await _fetch(page, options);
    return await _make_result(page);
  } catch (error) {
    console.error(error);
    return await _make_result(page, error);
  } finally {
    // clean up
    await page.removeAllListeners();
  }
}

async function _fetch(page, options) {
  // setups
  page.userAgent(xxx);

  // install listeners
  page.on('console');
  let error = null;
  page.on('error', e => error = e);
  page.on('close', e => error = e);
  page.on('pageerror', e => error = e);

  // to monitoring network
  let last_request = null;
  const pending_requests = {};
  page.on('request', r => last_request = pending_requests[r.id] = r);
  page.on('requestfinished', r => delete pending_requests[r.id]);

  // real goto goes here
  await page.goto(url);

  // wait till everything loaded or timeout
  await Promise.race([
    // it's better to use domcontentloaded than load here, because load only triggered when all images loaded, but page can actually be accessed after domcontentloaded 
    new Promise(r => page.once('domcontentloaded', r)),
    // I don't know the behavior of page.goto, if it wouldn't throw errors when not-exists-domain.org / connection reset / 404 / 304 / unknown protocol / crashes stuff, you should capture that.
    new Promise((r, j) => page.once('error', j)),
    new Promise((r, j) => page.once('pageerror', j)),
    new Promise((r, j) => page.once('close', j)),
    new Promise((r, j) => setTimeout(j, timeout)),
  ]);
  if (error) throw error;
  while(!timeout && pending_requests.length && last_request + timeout > now()) {
    await Promise.delay(10);
    if (error) throw error;
  }

  // execute js or other actions
  await page.evaluate(your_js);
  // you could do another wait after execute js
  while(!timeout && pending_requests.length && last_request + timeout > now()) {
    await Promise.delay(10);
    if (error) throw error;
  }

  // other processing ...
}

async function _make_result(page, error) {
  // build result object here
}

@qshine qshine closed this Jan 4, 2019
@qshine qshine deleted the puppeteer branch January 4, 2019 07:37
@qshine
Copy link
Author

qshine commented Jan 4, 2019

Thank you very much for your suggestion. I'll refer to it and revise it

@qshine qshine restored the puppeteer branch January 4, 2019 07:47
@qshine qshine reopened this Jan 4, 2019
@qshine qshine closed this Jan 5, 2019
@ccl0326 ccl0326 mentioned this pull request Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants