[Question] Using evasion code in vanilla Puppeteer #386

Closed
liweixi100 opened this issue Dec 5, 2020 · 4 comments


liweixi100 commented Dec 5, 2020

Kudos for creating such a versatile and high-quality framework.

We have an existing Puppeteer-based scraper that targets thousands of sites. Because of our stability and compatibility requirements, we're a bit hesitant to introduce puppeteer-extra into our code base.

One issue we are trying to resolve is a single site that employs bot detection: https://www.agl.com.au/about-agl/media-centre

puppeteer-extra-plugin-stealth worked beautifully for this site, and we were able to pinpoint the evasions needed: 'navigator.webdriver' and 'user-agent-override'. Working code:

const puppeteer = require('puppeteer-extra');

const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const stealth = StealthPlugin();

stealth.enabledEvasions.delete('chrome.app');
stealth.enabledEvasions.delete('chrome.csi');
stealth.enabledEvasions.delete('chrome.loadTimes');
stealth.enabledEvasions.delete('chrome.runtime');
stealth.enabledEvasions.delete('iframe.contentWindow');
stealth.enabledEvasions.delete('media.codecs');
stealth.enabledEvasions.delete('navigator.hardwareConcurrency');
stealth.enabledEvasions.delete('navigator.languages');
stealth.enabledEvasions.delete('navigator.permissions');
stealth.enabledEvasions.delete('navigator.plugins');
stealth.enabledEvasions.delete('sourceurl');
stealth.enabledEvasions.delete('webgl.vendor');
stealth.enabledEvasions.delete('window.outerdimensions');

// needed by AGL
// stealth.enabledEvasions.delete('navigator.webdriver');  
// stealth.enabledEvasions.delete('user-agent-override');

console.info('Stealth Plugins Enabled:', stealth.enabledEvasions) 
puppeteer.use(stealth);

puppeteer.launch({ headless: true, 
    args: ['--no-sandbox',
          '--proxy-server=myproxy:80' ] }).then(async browser => {
  console.log('Running tests..');
  const page = await browser.newPage();
  await page.goto('https://www.agl.com.au/about-agl/media-centre');

  console.info('Waiting for selector...');
  await page.waitForSelector('a.article-header>span.title', {visible: true});

  // Somehow the site needed this extra wait for screenshot to work
  await page.waitFor(1000);
  await page.screenshot({ path: 'testresult.png', fullPage: true });
  await browser.close();
  console.log(`All done, check the screenshot. #`)
})
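The long run of `delete` calls above can also be expressed as a filter over the same `enabledEvasions` set. This is only a sketch: `keepOnly` and `NEEDED` are hypothetical names introduced here, and the only thing assumed about the plugin is what the snippet above already shows, namely that `stealth.enabledEvasions` is a `Set` of evasion names.

```javascript
// Evasions the AGL site was found to need (names taken from the post above).
const NEEDED = new Set(['navigator.webdriver', 'user-agent-override']);

// Hypothetical helper: removes every entry of `enabledEvasions` that is not
// listed in `wanted`. Mutates the set in place and returns it.
function keepOnly(enabledEvasions, wanted) {
  // Copy to an array first so the set is not modified while being iterated.
  for (const name of [...enabledEvasions]) {
    if (!wanted.has(name)) {
      enabledEvasions.delete(name);
    }
  }
  return enabledEvasions;
}

// Usage with the plugin instance from the snippet above:
//   const stealth = StealthPlugin();
//   keepOnly(stealth.enabledEvasions, NEEDED);
//   puppeteer.use(stealth);
```

Compared with the explicit list, this keeps working if a later plugin release adds evasions, since anything not named in `NEEDED` is dropped.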

But when we implemented these techniques in vanilla Puppeteer, the code failed to work:

const puppeteer = require('puppeteer');

puppeteer.launch({ headless: true, 
    args: ['--no-sandbox',
           '--proxy-server=myproxy:80',
           ] }).then(async browser => {
  console.log('Running tests..');
  const page = await browser.newPage();

  await page.evaluateOnNewDocument(() => {
      delete Object.getPrototypeOf(navigator).webdriver;

      const override = {
         userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
         acceptLanguage: 'en-US,en',
         locale: 'en-US,en',
         platform: 'Win32'
      }

      page._client.send('Network.setUserAgentOverride', override);
  });

  await page.goto('https://www.agl.com.au/about-agl/media-centre');

  console.info('Waiting for selector...');

  // Will fail (timeout) here...
  await page.waitForSelector('a.article-header>span.title', {visible: true});

  await page.waitFor(1000);
  await page.screenshot({ path: 'testresult.png', fullPage: true });
  await browser.close();
  console.log(`All done, check the screenshot. #`)
})

My questions are:
Did I make any mistakes in the non-working code?
What was the difference between the two pieces of code above?
What internal mechanism does the puppeteer-extra framework employ that could produce this difference?

Any insight will be greatly appreciated!

@berstend
Owner

berstend commented Dec 5, 2020

@liweixi100 Thanks! I do understand the concern to introduce a new project into an existing system, especially if it's used in production. puppeteer-extra is a pretty slim wrapper around puppeteer though and we haven't had any broken builds in all these years. :-)

One of the reasons the extra project became quite popular is that we've been continuously updating it over the past 3 years (and removing the voodoo aspect of this stuff by adding before & after tests for every single evasion). Hard-coding a specific solution might solve your immediate problem, but it's only a matter of time until a site improves its antibot and you'd be back to square one. :-)

Regarding your code-related issue: code inside page.evaluateOnNewDocument runs in the browser page context, whereas page._client.send is meant to be run in the Node.js context. Moving that call out of page.evaluateOnNewDocument should do the trick.
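A minimal sketch of that change, under a few stated assumptions. `applyEvasions`, `UA_OVERRIDE`, and `main` are hypothetical names introduced for illustration. The sketch sends the CDP command through a session opened with the public `page.target().createCDPSession()` call instead of the private `page._client`, on the assumption that the two behave the same for this command; newer Puppeteer releases may expose session creation differently. The `locale` field from the question is left out because it does not appear among the documented parameters of `Network.setUserAgentOverride`.

```javascript
// Values copied from the question; adjust them to match the Chrome build in use.
const UA_OVERRIDE = {
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
  acceptLanguage: 'en-US,en',
  platform: 'Win32',
};

// Hypothetical helper (not part of Puppeteer): applies both evasions to a page.
async function applyEvasions(page) {
  // Node.js context: the CDP command is sent from here, not from inside the page.
  // Assumption: a session from createCDPSession() can stand in for page._client.
  const client = await page.target().createCDPSession();
  await client.send('Network.setUserAgentOverride', UA_OVERRIDE);

  // Browser context: this callback is serialized and run in each new document,
  // so it can only touch browser globals such as `navigator`.
  await page.evaluateOnNewDocument(() => {
    delete Object.getPrototypeOf(navigator).webdriver;
  });
}

// Not invoked automatically; call main() to try it against a real browser.
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox'] });
  try {
    const page = await browser.newPage();
    await applyEvasions(page); // must run before page.goto()
    await page.goto('https://www.agl.com.au/about-agl/media-centre');
    await page.waitForSelector('a.article-header>span.title', { visible: true });
    await page.screenshot({ path: 'testresult.png', fullPage: true });
  } finally {
    await browser.close();
  }
}
```

Whether these two evasions alone are enough for a given site is not something the sketch can guarantee; it only illustrates which call belongs in which context.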

berstend closed this as completed Dec 5, 2020
@liweixi100
Author

Really appreciate the quick reply! I am very interested in switching over to puppeteer-extra completely. Will run a comparison test on all the websites that we support and see if any regression occurs.

Best regards!

@berstend
Owner

berstend commented Dec 6, 2020

@liweixi100 There's one evasion known to potentially cause problems on ad-heavy sites. If you want to play it safe, you can proactively disable it: #137 (comment)

@liweixi100
Author

Noted. Thanks a lot for the heads-up!
