Impossible to scrape this website #147

Closed
simjoeweb opened this issue Jan 29, 2020 · 4 comments

Comments

@simjoeweb

I get blocked on the first or second page. If I complete the initial verification then it works fine after that. I tried using proxies but I still get blocked after 1 page. Any ideas what it could be?

const puppeteerExtra = require("puppeteer-extra");
const pluginStealth = require("puppeteer-extra-plugin-stealth");

puppeteerExtra.use(pluginStealth());

(async () => {
  const browser = await puppeteerExtra.launch({
    headless: false,
    defaultViewport: {
      width: 1200,
      height: 800,
    },
  });
  const page = await browser.newPage();
  await page.waitFor(5000);
  // Visit listing pages 1 to 4, pausing 60 seconds between navigations.
  await page.goto("https://www.fnac.com/Telephone-d-occasion/Telephonie-et-Objets-Connectes-d-occasion/Fnac-Darty-Occasion/nsh411995/w-4?PageIndex=1", { waitUntil: "networkidle2" });
  await page.waitFor(60000);
  await page.goto("https://www.fnac.com/Telephone-d-occasion/Telephonie-et-Objets-Connectes-d-occasion/Fnac-Darty-Occasion/nsh411995/w-4?PageIndex=2", { waitUntil: "networkidle2" });
  await page.waitFor(60000);
  await page.goto("https://www.fnac.com/Telephone-d-occasion/Telephonie-et-Objets-Connectes-d-occasion/Fnac-Darty-Occasion/nsh411995/w-4?PageIndex=3", { waitUntil: "networkidle2" });
  await page.waitFor(60000);
  await page.goto("https://www.fnac.com/Telephone-d-occasion/Telephonie-et-Objets-Connectes-d-occasion/Fnac-Darty-Occasion/nsh411995/w-4?PageIndex=4", { waitUntil: "networkidle2" });
})();


@JohnDotOwl

@simjoeweb, try this: https://bot.sannysoft.com/
await page.screenshot({ path: 'screenshot.jpg', type: 'jpeg', fullPage: true });

See if it's set up properly; settings such as the viewport can affect the result. Otherwise, check whether you're on a VPS: most sites block well-known VPS providers such as DO or AWS.

To check this, just run from your home IP address or a mobile phone connection. It's a good way to see whether those VPS IPs are blocked.
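
The check described above can be sketched as a small helper. This is only an illustration: it assumes a Puppeteer `page` created as in the original post, and the name `captureDetectionReport` and the default output path are made up for this example.

```javascript
// Hypothetical helper (not part of puppeteer-extra): loads the detection
// test page mentioned above and saves a full-page screenshot of the results.
const DETECTION_TEST_URL = "https://bot.sannysoft.com/";

async function captureDetectionReport(page, outputPath = "screenshot.jpg") {
  // Wait until the network is mostly idle so the result table has rendered.
  await page.goto(DETECTION_TEST_URL, { waitUntil: "networkidle2" });
  await page.screenshot({ path: outputPath, type: "jpeg", fullPage: true });
  return outputPath;
}
```

Calling `await captureDetectionReport(page)` right after `browser.newPage()` shows how the browser fingerprint looks to a detection script. Running the scraper itself from a home connection then indicates whether the remaining blocks come from the server's IP range.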

@brunogaspar
Collaborator

On top of what @Rainbowhat mentioned, I would also recommend sending a more realistic user agent; you can use the user-agents library for that.

Also try sending headers such as Accept, Accept-Encoding, Accept-Language, etc.

For example:

await page.setExtraHTTPHeaders({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'no-cache',
});

You can inspect what the browser is sending and choose which headers to set.
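
As a rough sketch of the user agent and header advice, the helpers below set both on a page. The names `buildAcceptLanguage` and `applyBrowserIdentity` are hypothetical, and the weighting scheme (each later language drops by 0.1) is an assumption that mimics what desktop browsers commonly send, not something the site is known to require.

```javascript
// Builds an Accept-Language value from an ordered list of language tags.
// The first tag carries the implicit weight 1; each later tag drops by 0.1,
// never going below 0.1.
function buildAcceptLanguage(languages) {
  return languages
    .map((lang, i) => {
      if (i === 0) return lang;
      const weight = Math.max(0.1, 1 - i * 0.1).toFixed(1);
      return `${lang};q=${weight}`;
    })
    .join(",");
}

// Hypothetical helper: applies a user agent string and a matching
// Accept-Language header to a Puppeteer page.
async function applyBrowserIdentity(page, userAgent, languages = ["en-US", "en"]) {
  await page.setUserAgent(userAgent);
  await page.setExtraHTTPHeaders({
    "Accept-Language": buildAcceptLanguage(languages),
  });
}
```

With the user-agents library installed, the user agent string could come from something like `new UserAgent({ deviceCategory: "desktop" }).toString()`, passed as the second argument before the first `page.goto`.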

@yfe404

yfe404 commented May 20, 2020

https://datadome.co/fr/bot-management-protection/une-detection-efficace-cote-client-est-essentielle/ They are talking about you

@brunogaspar
Collaborator

Since your bot is being detected, I would use proxies and clear cookies on each run (between runs, if you don't close the browser).

There is ongoing work that will hopefully make detection less likely, but it's impossible to guarantee that.
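
A minimal sketch of the proxy and cookie suggestion follows. Both function names are hypothetical, the proxy address is a placeholder, and the cookie reset assumes `page.target().createCDPSession()` is available in the installed Puppeteer release. None of this is guaranteed to get past the site's protection.

```javascript
// Hypothetical helper: launch options that route traffic through a proxy.
// `proxyServer` is a placeholder such as "http://127.0.0.1:8080".
function buildLaunchOptions(proxyServer) {
  const args = proxyServer ? [`--proxy-server=${proxyServer}`] : [];
  return { headless: false, args };
}

// Clears all browser cookies through the DevTools protocol, which is useful
// between runs when the browser is kept open.
async function clearAllCookies(page) {
  const client = await page.target().createCDPSession();
  try {
    await client.send("Network.clearBrowserCookies");
  } finally {
    // Always release the session, even if the command fails.
    await client.detach();
  }
}
```

The options object would be passed to `puppeteerExtra.launch(...)`, and `await clearAllCookies(page)` called before each new run.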

Closing for now.
