# CAPTCHA-Aware Web Scraping Pipeline

This notebook demonstrates how CAPTCHA handling fits into an automated web scraping workflow.

The goals are to:
- Detect CAPTCHA challenges.
- Understand the decision flow in a scraping pipeline.
- Integrate third-party CAPTCHA solving services (2Captcha).
- Retry requests after CAPTCHA resolution.
- Continue scraping in a controlled and ethical manner.

## Scraping Flow Implemented

Request page  
↓  
Check response  
↓  
Is CAPTCHA present?  
↓  
Yes → Send CAPTCHA to solver API (No → skip to "Continue scraping")  
↓  
Receive solution token  
↓  
Retry request with token  
↓  
Continue scraping

## Import required libraries

In [13]:
import requests                    # For sending HTTP requests
from bs4 import BeautifulSoup      # For parsing HTML content
import time                        # For adding delays (used while polling CAPTCHA solver)

## Step 1: Request the Target Web Page

The scraper starts by sending a standard HTTP request to the target URL.
At this stage, no assumptions are made about whether the page is accessible or protected.

In [14]:
def fetch_page(url, headers=None):
    """
    Sends an HTTP GET request to the given URL.
    Returns the response object.
    """
    return requests.get(url, headers=headers, timeout=10)   # Timeout is added to avoid hanging indefinitely

## Step 2: Detect CAPTCHA Presence

After receiving the response, the HTML is checked for common CAPTCHA indicators such as:
- Keywords like 'captcha' or 'verify you are human'
- reCAPTCHA HTML elements

In [15]:
def is_captcha_present(response):
    """
    Checks whether the response contains CAPTCHA indicators.
    Returns True if CAPTCHA is detected, else False.
    """
    # Convert HTML text to lowercase for keyword-based checks
    html_text = response.text.lower()

    # Simple keyword-based CAPTCHA detection (broad: "captcha" also matches the "g-recaptcha" class name)
    if "captcha" in html_text or "verify you are human" in html_text:
        return True

    # HTML-based detection for reCAPTCHA elements
    soup = BeautifulSoup(response.text, "html.parser")
    if soup.find("div", class_="g-recaptcha"):
        return True

    return False
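
The keyword check above reports only that some CAPTCHA marker was found, while the later steps can solve only reCAPTCHA. As a minimal sketch, the detection could also report which kind of marker matched. The function name `classify_captcha`, the marker list, and the return labels are assumptions made for illustration, and it works on a plain HTML string so it can be tried without a network request.

```python
import re

# Hypothetical markers; real sites vary, so treat these as assumptions.
RECAPTCHA_CLASS = re.compile(r'class="[^"]*\bg-recaptcha\b[^"]*"', re.IGNORECASE)
TEXT_MARKERS = ("verify you are human", "captcha")


def classify_captcha(html):
    """Return "recaptcha", "generic", or None for a page's HTML string."""
    # A g-recaptcha class is the most specific signal, so check it first
    if RECAPTCHA_CLASS.search(html):
        return "recaptcha"

    # Fall back to the broad keyword check on lowercased text
    lowered = html.lower()
    if any(marker in lowered for marker in TEXT_MARKERS):
        return "generic"

    return None
```

A caller could then stop early on `"generic"`, since no site-key will be available for the solver in that case.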

## Step 3: Extract reCAPTCHA Site-Key

For reCAPTCHA challenges, a site-key is embedded in the HTML.
This key is required when submitting the CAPTCHA to a solver service.

In [16]:
def extract_site_key(response):
    """
    Extracts the reCAPTCHA site-key from the page HTML.
    Returns the site-key string or None.
    """
    soup = BeautifulSoup(response.text, "html.parser")

    # reCAPTCHA site-key is usually stored in a div attribute
    captcha_div = soup.find("div", class_="g-recaptcha")

    if captcha_div:
        return captcha_div.get("data-sitekey")

    return None
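
The site-key is read from a `data-sitekey` attribute, and some pages may place that attribute on an element other than a `div` with the `g-recaptcha` class. The sketch below is one possible fallback, not the notebook's method: it uses only the standard library's `html.parser` and returns the first `data-sitekey` value it sees on any tag. `SiteKeyParser` and `find_site_key` are hypothetical names.

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Remembers the first data-sitekey attribute seen on any tag."""

    def __init__(self):
        super().__init__()
        self.site_key = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first key found; later ones are ignored
        if self.site_key is not None:
            return
        for name, value in attrs:
            if name == "data-sitekey" and value:
                self.site_key = value
                return


def find_site_key(html):
    """Return the first data-sitekey value in the HTML, or None."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.site_key
```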

## Step 4: Submit CAPTCHA to 2Captcha

The site-key and page URL are sent to the 2Captcha API.
The API responds with a CAPTCHA ID, which is later used to retrieve the solution.

In [17]:
def submit_captcha_to_2captcha(api_key, site_key, page_url):
    """
    Submits CAPTCHA details to 2Captcha.
    Returns a captcha_id if successful.
    """
    # Payload required by 2Captcha API
    payload = {
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
        "json": 1
    }

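    # Send CAPTCHA details to 2Captcha (HTTPS keeps the API key out of clear text;
    # the timeout avoids hanging indefinitely)
    response = requests.post("https://2captcha.com/in.php", data=payload, timeout=30)
    result = response.json()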

    # If submission is successful, return CAPTCHA ID
    if result.get("status") == 1:
        return result.get("request")

    # Raise error if submission fails
    raise RuntimeError("Failed to submit CAPTCHA to 2Captcha")
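
With `json=1`, the reply is a small JSON object whose `status` field is `1` on success, with the CAPTCHA ID in `request`. On failure, `request` carries an error code string instead. A minimal sketch of surfacing that code in the exception, so a failed submission is easier to diagnose, might look like this. `parse_submit_response` is a hypothetical helper that takes the already-decoded dictionary.

```python
def parse_submit_response(result):
    """Return the CAPTCHA ID from a decoded in.php reply, or raise with the error code."""
    if result.get("status") == 1:
        return result["request"]

    # On failure the "request" field holds an error code string
    error_code = result.get("request", "UNKNOWN_ERROR")
    raise RuntimeError(f"2Captcha rejected the submission: {error_code}")
```

Because it accepts a plain dictionary, the helper can be exercised with hand-written replies and no API credits.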

## Step 5: Poll for CAPTCHA Solution

CAPTCHA solving is asynchronous.
The scraper waits and polls the API until a solution token is returned.

In [18]:
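def get_captcha_solution(api_key, captcha_id, wait_time=20, poll_interval=5, max_attempts=24):
    """
    Polls the 2Captcha API until the CAPTCHA solution is ready.
    Returns the solution token.
    Raises RuntimeError on an API error or if no solution arrives in time.
    """
    # Initial wait before polling (CAPTCHA solving is asynchronous)
    time.sleep(wait_time)

    # Parameters required to fetch CAPTCHA solution
    params = {
        "key": api_key,
        "action": "get",
        "id": captcha_id,
        "json": 1
    }

    for _ in range(max_attempts):
        # Poll the 2Captcha result endpoint
        response = requests.get("https://2captcha.com/res.php", params=params, timeout=30)
        result = response.json()

        # If solution is ready, return the token
        if result.get("status") == 1:
            return result.get("request")

        # Any reply other than "not ready" is a real error, so stop polling
        # ("CAPCHA_NOT_READY" is the API's own spelling)
        if result.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"2Captcha error: {result.get('request')}")

        # Solution not ready yet: wait before the next poll
        time.sleep(poll_interval)

    # Give up once the attempt budget is exhausted
    raise RuntimeError("CAPTCHA solution not available yet")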
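
Polling logic is awkward to try out when every call costs solver credits. One option, sketched here under the assumption that the result endpoint answers `CAPCHA_NOT_READY` (the API's own spelling) while a solution is pending, is to pass the fetch and sleep functions in as arguments. `poll_for_token` and its parameters are hypothetical names, not part of the notebook.

```python
import time


def poll_for_token(fetch_result, max_attempts=5, interval=5.0, sleep=time.sleep):
    """
    Call fetch_result() until it reports a solved CAPTCHA.
    fetch_result is any zero-argument callable returning a res.php-style dict.
    """
    for attempt in range(max_attempts):
        result = fetch_result()

        # Solved: the token is in the "request" field
        if result.get("status") == 1:
            return result["request"]

        # Anything other than "not ready" is treated as a hard error
        if result.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"Solver error: {result.get('request')}")

        # Wait between attempts, but not after the last one
        if attempt < max_attempts - 1:
            sleep(interval)

    raise TimeoutError("No solution after the configured number of attempts")


# Usage with a fake solver: two "not ready" replies, then a token
replies = iter([
    {"status": 0, "request": "CAPCHA_NOT_READY"},
    {"status": 0, "request": "CAPCHA_NOT_READY"},
    {"status": 1, "request": "TOKEN-123"},
])
recorded_waits = []
token = poll_for_token(lambda: next(replies), sleep=recorded_waits.append)
```

Passing `recorded_waits.append` in place of `time.sleep` records the waits instead of performing them, so the example runs instantly.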

## Step 6: Retry Request with CAPTCHA Token

Once a valid token is received, the original request is retried with the CAPTCHA solution attached.

In [19]:
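def retry_request_with_token(url, captcha_token):
    """
    Retries the HTTP request using the solved CAPTCHA token.
    """
    headers = {"User-Agent": "Mozilla/5.0"}

    # reCAPTCHA tokens are normally submitted as the "g-recaptcha-response"
    # form field, not as a request header. The endpoint and any other required
    # form fields are site-specific, so posting to the page URL is an
    # assumption to adapt per target.
    form_data = {"g-recaptcha-response": captcha_token}

    return requests.post(url, data=form_data, headers=headers, timeout=10)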
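
Pages that protect a form with reCAPTCHA usually expect the token in a `g-recaptcha-response` form field, submitted together with the form's other inputs such as hidden CSRF tokens. The sketch below shows one way to assemble that payload from the page HTML using only the standard library. `FormFieldCollector` and `build_form_payload` are hypothetical helpers, and which fields a given site actually requires is an assumption to verify per target.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collects name/value pairs from <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are sent as empty strings
            self.fields[name] = attributes.get("value") or ""


def build_form_payload(html, captcha_token):
    """Return the page's input fields plus the solved CAPTCHA token."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["g-recaptcha-response"] = captcha_token
    return payload
```

The resulting dictionary could be passed as the `data` argument of `requests.post`, aimed at the form's own `action` URL.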

## Step 7: Continue Scraping

After CAPTCHA resolution, the page is treated like a normal response
and scraping logic can proceed.

In [20]:
def scrape_content(response):
    """
    Example content extraction after CAPTCHA is solved.
    """
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract page title as a simple demonstration
    return soup.title.text if soup.title else "No title found"

## Step 8: Full CAPTCHA-Aware Scraping Pipeline

This function orchestrates the entire workflow:
- Fetch page
- Detect CAPTCHA
- Solve CAPTCHA (if present)
- Retry request
- Continue scraping

In [21]:
def captcha_aware_scraper(url, api_key=None):
    """
    Full scraping pipeline with CAPTCHA detection and handling.
    """
    # Initial page request
    response = fetch_page(url)

    # Check if CAPTCHA is present
    if is_captcha_present(response):
        print("CAPTCHA detected.")

        # Stop execution if API key is not provided
        if not api_key:
            print("No API key provided. Stopping execution.")
            return None

        # Extract site-key required for CAPTCHA solving
        site_key = extract_site_key(response)
        if not site_key:
            print("Site-key not found.")
            return None

        # Submit CAPTCHA and retrieve solution token
        captcha_id = submit_captcha_to_2captcha(api_key, site_key, url)
        captcha_token = get_captcha_solution(api_key, captcha_id)

        # Retry the request after CAPTCHA resolution
        response = retry_request_with_token(url, captcha_token)
        print("Request retried after CAPTCHA resolution.")

    # Proceed with scraping after CAPTCHA (or if none was present)
    return scrape_content(response)
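
The branching in the pipeline can also be read as a small pure function, which makes the decision flow easy to check without any network calls. This is only an illustrative sketch: `next_action` and its string labels are hypothetical and mirror the order of the checks in `captcha_aware_scraper`.

```python
def next_action(captcha_detected, api_key, site_key):
    """Return the pipeline's next step as a label, given what is known so far."""
    # No challenge: go straight to content extraction
    if not captcha_detected:
        return "scrape"

    # A challenge cannot be solved without solver credentials
    if not api_key:
        return "stop_no_api_key"

    # The solver needs the reCAPTCHA site-key from the page
    if not site_key:
        return "stop_no_site_key"

    return "solve_and_retry"
```

For instance, `next_action(True, None, "KEY")` returns `"stop_no_api_key"`, matching the early return in the pipeline when no API key is supplied.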

## Example Usage

The following example demonstrates how the pipeline is invoked.

In [22]:
TARGET_URL = "https://quotes.toscrape.com"
API_KEY = None  # Provide only if testing with valid 2Captcha credits

result = captcha_aware_scraper(TARGET_URL, API_KEY)
print("Scraping result:", result)

Scraping result: Quotes to Scrape
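
The goals at the top mention scraping in a controlled and ethical manner, which the pipeline itself does not enforce. One possible pre-flight check, sketched here with the standard library's `urllib.robotparser`, is to consult a site's robots.txt rules before fetching. The sample rules, the `demo-scraper` user agent, and the `example.com` URLs are made up for illustration; in practice the rules text would come from the target site's `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
allowed = rules.can_fetch("demo-scraper", "https://example.com/page")
blocked = rules.can_fetch("demo-scraper", "https://example.com/private/data")
delay = rules.crawl_delay("demo-scraper")
```

A scraper could skip disallowed URLs and sleep for the crawl delay between requests before calling `captcha_aware_scraper`.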


## Key Takeaways

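- CAPTCHA handling is a decision step in the pipeline: detect, then either solve and retry or stop.
- Third-party solvers such as 2Captcha are integrated over HTTP: submit the site-key and page URL, then poll for the token.
- The scraper branches on what it detects instead of failing on the first challenge.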