Beginner: I'm new to scraping and being blocked

berstend̡̲̫̹̠̖͚͓̔̄̓̐̄͛̀͘ edited this page Apr 26, 2021 · 7 revisions

Problem: Your scraper is being blocked

This wiki page aims to be a beginner-friendly entry point for understanding why this happens and how to mitigate it.

Note: This document is only relevant if you are running into issues. If your custom shell script looping over curl runs fine, that's great.

Most common issues

You're using a non-browser-based scraper (curl, requests, scrapy, etc.)

  • The days when this was sufficient are long gone 😄
  • It's easy for a site to use JS to gather or calculate some data and require it in their backend (sent as cookies, headers, or POST data)
  • In addition, most sites are built with dynamic JS nowadays, so scraping static HTML won't get you far
  • Solution: Switch to a scraping framework which uses a real browser (puppeteer, playwright)
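
To make the "JS-calculated data required by the backend" point concrete, here is a minimal sketch of what such a check could look like on the server side. Everything in it is hypothetical and for illustration only: the `js_token` cookie name, the `SITE_SECRET` value, and both function names are assumptions, not taken from any real site or library. Real sites use far more elaborate (and obfuscated) schemes.

```python
import hashlib
import hmac
from typing import Mapping

# Hypothetical secret; a real site would hide the equivalent logic in obfuscated JS.
SITE_SECRET = b"example-secret"


def expected_token(session_id: str) -> str:
    """Token that the site's in-page JS would compute and store in a cookie."""
    return hmac.new(SITE_SECRET, session_id.encode(), hashlib.sha256).hexdigest()


def is_request_allowed(cookies: Mapping[str, str], session_id: str) -> bool:
    """Backend check: reject any request that lacks the JS-computed cookie."""
    token = cookies.get("js_token", "")
    return hmac.compare_digest(token, expected_token(session_id))


# A curl-style client never ran the JS, so it has no token and is rejected.
print(is_request_allowed({}, "abc123"))
# A real browser ran the JS, set the cookie, and passes.
print(is_request_allowed({"js_token": expected_token("abc123")}, "abc123"))
```

A plain HTTP client only passes a check like this if you reverse-engineer the token logic yourself, which is why driving a real browser is usually the easier route.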

You're using Selenium

  • Selenium is the grandfather of browser-based scraping frameworks and leaks its presence in too many ways
  • The same applies to anything that is not a real browser: Scrapy's Splash, PhantomJS, Electron, CasperJS, etc.
  • Solution: Don't use Selenium; use puppeteer or playwright

You're using puppeteer without stealth

You're using nonsensical data

  • Don't try to emulate another browser engine or device type (e.g. mobile) when using a desktop browser
  • Don't use data that doesn't make sense (e.g. macOS platform with an Nvidia RTX 3080 GPU)
  • Don't pretend to be the latest Chrome version (e.g. User-Agent) if you're not
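
As a rough illustration of how mismatched data can give a scraper away, the sketch below cross-checks a few reported values against each other. The function name, the chosen rules, and the sample strings are assumptions made for this example. Real detection systems look at many more signals than these three.

```python
from typing import List


def find_inconsistencies(user_agent: str, platform: str, gpu_renderer: str) -> List[str]:
    """Return human-readable reasons why the reported values contradict each other."""
    ua = user_agent.lower()
    plat = platform.lower()
    gpu = gpu_renderer.lower()
    is_mobile_ua = "iphone" in ua or "android" in ua
    problems = []

    # A mobile User-Agent combined with a desktop platform value.
    if is_mobile_ua and plat in ("win32", "macintel"):
        problems.append("mobile User-Agent on a desktop platform")

    # The operating system in the User-Agent should match the platform value.
    if "windows nt" in ua and plat != "win32":
        problems.append("Windows User-Agent but non-Windows platform")
    if "mac os x" in ua and not is_mobile_ua and plat != "macintel":
        problems.append("macOS User-Agent but non-macOS platform")

    # The example from the list above: macOS reporting an Nvidia RTX GPU.
    if plat == "macintel" and "rtx" in gpu:
        problems.append("macOS platform with an Nvidia RTX GPU")

    return problems


mac_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
print(find_inconsistencies(mac_ua, "MacIntel", "NVIDIA GeForce RTX 3080"))
```

An empty list only means that none of these few rules fired. It does not mean the fingerprint would pass a real check.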

Your IP address is bad

  • Don't use free proxies from the internet; they are easily detected as such
  • Don't use Tor: all exit nodes are public, and the network is meant for people in need
  • Don't use your home internet too often or you might experience rate-limiting or bans
  • Don't use datacenter IPs or proxies, they can be detected as not being "residential"
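
The list above boils down to IP reputation lookups on the site's side. The following is a minimal sketch of such a lookup, assuming the site holds a list of known datacenter ranges and public Tor exit nodes. The addresses are placeholders from the RFC 5737 documentation ranges, and `classify_ip` is a hypothetical name. Real services rely on large, regularly updated databases.

```python
import ipaddress

# Placeholder data for illustration only (RFC 5737 documentation networks).
DATACENTER_RANGES = [ipaddress.ip_network("192.0.2.0/24")]
TOR_EXIT_NODES = {ipaddress.ip_address("198.51.100.7")}


def classify_ip(ip: str) -> str:
    """Classify a client IP as 'tor', 'datacenter', or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    # Tor exit nodes are published, so an exact-match lookup is enough.
    if addr in TOR_EXIT_NODES:
        return "tor"
    # Hosting providers own whole ranges, so check network membership.
    if any(addr in network for network in DATACENTER_RANGES):
        return "datacenter"
    return "unknown"


for ip in ("198.51.100.7", "192.0.2.55", "203.0.113.9"):
    print(ip, classify_ip(ip))
```

An address that lands in neither list is not automatically trusted. It can still be rate-limited or banned based on how it behaves.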

How bot detection works

(TODO: Add more content here)