
**Milestone 2: Web scraping and data aggregation.**

In [1]:
from google.colab import output
output.enable_custom_widget_manager()

In [2]:
!pip install transformers torch --quiet

In [3]:
# Import the pipeline utility from the Hugging Face transformers library
# pipeline provides a simple high-level API for common NLP tasks
from transformers import pipeline

# Load a pre-trained Question Answering (QA) model
# "question-answering" specifies the task type
# "distilbert-base-cased-distilled-squad" is a lightweight DistilBERT model
# fine-tuned on the SQuAD dataset for extracting answers from text
qa_model = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

# Context paragraph on which the model will search for the answer
# This acts as the knowledge base for the QA system
context = """
Machine learning is a field of artificial intelligence that uses statistical techniques
to give computer systems the ability to learn from data. It focuses on developing
algorithms that improve automatically through experience. Popular ML methods include
supervised learning, unsupervised learning, and reinforcement learning.
"""

# Define the question that will be answered using the context above
# The model will look for the most relevant text span as the answer
question = "What are popular machine learning methods?"

# Run the QA model by passing the question and context
# The model returns:
# - 'answer': extracted text span
# - 'score': confidence score for the predicted answer
# - 'start' and 'end': character positions of the answer in the context
result = qa_model(question=question, context=context)

# Print the question to the console
print("Question:", question)

# Print the extracted answer from the context
print("Answer:", result['answer'])

# Print the confidence score indicating how sure the model is
print("Score:", result['score'])

Device set to use cpu


Question: What are popular machine learning methods?
Answer: supervised learning, unsupervised learning, and reinforcement learning
Score: 0.7955703735351562


In [4]:
# Import the pipeline function from Hugging Face's transformers library
# pipeline provides a simple interface to use pre-trained NLP models
from transformers import pipeline

# Load a pre-trained Question Answering (QA) transformer model
# "question-answering" specifies the task type
# "distilbert-base-cased-distilled-squad" is a DistilBERT model
# fine-tuned on the SQuAD dataset for extracting answers from text
qa_model = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

# Context paragraph that acts as the knowledge base
# The model will only search for answers within this text
context = """
The Internet has become an essential part of modern life, connecting millions of devices
around the world. It allows people to communicate instantly, access information, and
perform online transactions. Over the years, internet technology has evolved from simple
web pages to advanced cloud computing services. With the rise of smartphones, people
now use the internet daily for navigation, social media, entertainment, and remote work.
Despite its benefits, internet users must be cautious about cybersecurity threats such
as phishing, malware, and data breaches.
"""

# ---- QUESTIONS ----

# Question whose answer is explicitly present in the context
q1 = "What are some common cybersecurity threats mentioned in the context?"

# Question whose answer is NOT present in the context
# The model will still try to guess an answer based on closest text
q2 = "What is the capital of India?"

# Another out-of-context question
# Demonstrates how QA models behave when the answer is missing
q3 = "What is the Unit of Internet?"

# Store all questions in a list for batch processing
questions = [q1, q2, q3]

# ---- RUN ALL QUESTIONS ----

# Loop through each question with an index starting from 1
for i, q in enumerate(questions, 1):

    # Pass the question and context to the QA model
    # The model returns a dictionary containing:
    # - 'answer': extracted text span
    # - 'score': confidence score of the prediction
    # - 'start' and 'end': character positions in the context
    result = qa_model(question=q, context=context)

    # Print the question number and text
    print(f"\nQuestion {i}: {q}")

    # Print the model's extracted answer
    print("Answer:", result["answer"])

    # Print the confidence score for the predicted answer
    print("Score:", result["score"])

Device set to use cpu



Question 1: What are some common cybersecurity threats mentioned in the context?
Answer: phishing, malware, and data breaches
Score: 0.922915115748765

Question 2: What is the capital of India?
Answer: connecting millions of devices
around the world
Score: 0.0007901930948719382

Question 3: What is the Unit of Internet?
Answer: allows people to communicate instantly, access information, and
perform online transactions
Score: 0.2077416628599167


**OBSERVATION FOR THE ABOVE CODE**

1. Questions answered explicitly in the context work correctly.
  * The first question (“What are some common cybersecurity threats…”) is directly answerable from the given context.
  * The model returns the correct span with a high confidence score (0.92).

2. Questions unrelated to the context fail.
  * The answer to the second question (“What is the capital of India?”) is not in the context, but an extractive model must still return some span from it.
  * It returns an irrelevant span with a very low score (0.00079), a sign that the answer should not be trusted.

3. Vague or ill-posed questions also give incorrect results.
  * The third question (“What is the Unit of Internet?”) has no clear meaning and no answer in the context.
  * The model extracts an unrelated sentence fragment with a score of 0.21, noticeably higher than for the second question even though the answer is just as wrong.

**CONCLUSION**

The QA model performs well only when the answer is stated explicitly in the given context. For unrelated or unclear questions it still returns a span, but the span is incorrect and its score is low.
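
Because the model is extractive, every answer is a substring of the context, which is why an unanswerable question still yields some span. The sketch below illustrates what the `start` and `end` fields mean, assuming they are character offsets into the context string as the code comments describe. The offsets are computed with `str.find` as a stand-in for the model and are not taken from a real run.

```python
# Illustrative only: emulate the 'start'/'end' fields of an extractive QA result.
context = (
    "Despite its benefits, internet users must be cautious about "
    "cybersecurity threats such as phishing, malware, and data breaches."
)
answer = "phishing, malware, and data breaches"

# str.find stands in for the model; a real pipeline predicts these offsets.
start = context.find(answer)
end = start + len(answer)

# Slicing the context with the offsets recovers the answer text exactly.
span = context[start:end]
print(start, end, repr(span))
```

The same slicing applies to the out-of-context questions: the returned span is still a valid slice of the context, it just has nothing to do with the question.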


In [5]:
# Import the pipeline utility from the Hugging Face transformers library
# pipeline provides an easy-to-use interface for running pre-trained NLP models
from transformers import pipeline

# Load a pre-trained Question Answering (QA) model
# "question-answering" specifies the NLP task
# "distilbert-base-cased-distilled-squad" is a lightweight DistilBERT model
# trained on the SQuAD dataset for extractive question answering
qa_model = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

# Context paragraph that acts as the knowledge source
# The model will extract answers only from this text
context = """
The Internet has become an essential part of modern life, connecting millions of devices
around the world. It allows people to communicate instantly, access information, and
perform online transactions. Over the years, internet technology has evolved from simple
web pages to advanced cloud computing services. With the rise of smartphones, people
now use the internet daily for navigation, social media, entertainment, and remote work.
Despite its benefits, internet users must be cautious about cybersecurity threats such
as phishing, malware, and data breaches.
"""

# ---- QUESTIONS ----

# Explicit question: the answer clearly exists in the context
q1 = "What are some common cybersecurity threats mentioned in the context?"

# Out-of-context question
# The answer is not present in the context
q2 = "What is the capital of India?"

# Another out-of-context question
q3 = "What is the Unit of Internet?"

# Store all questions in a list for iteration
questions = [q1, q2, q3]

# ---- RUN ALL QUESTIONS ----

# Loop through each question with an index starting from 1
for i, q in enumerate(questions, 1):

    # Pass the question and context to the QA model
    # The model returns a dictionary containing:
    # - 'answer': extracted text span
    # - 'score': confidence score of the prediction
    result = qa_model(question=q, context=context)

    # Extract the confidence score from the result
    score = result["score"]

    # Extract the predicted answer from the result
    answer = result["answer"]

    # Print the question number and text
    print(f"\nQuestion {i}: {q}")

    # Print the confidence score
    print("Score:", score)

    # ---- IF-ELSE CHECK ----

    # If confidence score is high (>= 0.8),
    # treat the answer as reliable and display it
    if score >= 0.8:
        print("Answer:", answer)

    # If confidence score is low,
    # indicate that the answer is not relevant
    else:
        print("Answer is not relevant to the question asked.")

Device set to use cpu



Question 1: What are some common cybersecurity threats mentioned in the context?
Score: 0.922915115748765
Answer: phishing, malware, and data breaches

Question 2: What is the capital of India?
Score: 0.0007901930948719382
Answer is not relevant to the question asked.

Question 3: What is the Unit of Internet?
Score: 0.2077416628599167
Answer is not relevant to the question asked.


**OBSERVATION FOR THE ABOVE CODE**

The program uses the model's confidence score to decide whether to show an answer for each type of question.

1. For the explicit question (Q1), the context directly contains the answer.

  * The model gives a high score (0.92).

  * The code prints the answer because the score passes the 0.8 confidence threshold.

2. For the unrelated questions (Q2 and Q3), the context does not contain the answers.

  * The model produces low scores (0.0008 and 0.21).

  * Since both scores are below 0.8, the code outputs:
  “Answer is not relevant to the question asked.”

**CONCLUSION**

The experiment shows that:

  * The QA model answers accurately only when the context contains the answer.

  * For questions unrelated to the context, the model gives low confidence scores, and the 0.8 threshold filters them out as irrelevant.
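
The threshold check can be factored into a small helper so it is testable without loading the model. This is a minimal sketch: the function name `accept_answer` is hypothetical, the scores are copied from the output above, and the answers have their line breaks normalized to spaces.

```python
from typing import Optional

# Scores copied from the notebook output above; answer whitespace normalized.
recorded_results = [
    {"question": "What are some common cybersecurity threats mentioned in the context?",
     "answer": "phishing, malware, and data breaches",
     "score": 0.922915115748765},
    {"question": "What is the capital of India?",
     "answer": "connecting millions of devices around the world",
     "score": 0.0007901930948719382},
    {"question": "What is the Unit of Internet?",
     "answer": "allows people to communicate instantly, access information, "
               "and perform online transactions",
     "score": 0.2077416628599167},
]


def accept_answer(result: dict, threshold: float = 0.8) -> Optional[str]:
    """Return the answer if its score clears the threshold, else None."""
    return result["answer"] if result["score"] >= threshold else None


accepted = [accept_answer(r) for r in recorded_results]
for r, a in zip(recorded_results, accepted):
    print(r["question"], "->", a if a is not None else "rejected")
```

The 0.8 cutoff is the value chosen in this notebook, not a property of the model. Since the third question scored about 0.21 with no answer in the context, a fixed cutoff is best treated as a heuristic to tune on representative questions.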


**WEB SCRAPING**

In [6]:
!pip install playwright nest_asyncio
!playwright install chromium
!apt-get install libatk1.0-0 libatk-bridge2.0-0 libatspi2.0-0 libxcomposite1


In [7]:
# ===============================
# COMMON IMPORTS & EVENT LOOP FIX
# ===============================

import asyncio, json, csv, time
# asyncio → handles asynchronous execution
# json → save data in JSON format
# csv → save data in CSV format
# time → (optional) used for delays if needed

from pathlib import Path
# Path → cleaner and OS-independent file path handling

import nest_asyncio
# nest_asyncio → allows running asyncio inside environments like Google Colab

nest_asyncio.apply()
# Fixes "event loop already running" error in Colab/Jupyter

from playwright.async_api import async_playwright
# Playwright async API → used for browser automation & scraping JS/AJAX websites


# ===============================
# TARGET URL
# ===============================

# AJAX_URL = "https://webscraper.io/test-sites/e-commerce/ajax"
# AJAX_URL = "https://webscraper.io/test-sites"

AJAX_URL = "https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=1"
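# Target page containing laptop products
# Note: despite the variable name, this is the static (server-rendered) variant of
# the test site; the AJAX variant is commented out above. Playwright handles both.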


# ===============================
# ASYNC SCRAPING FUNCTION
# ===============================

async def scrape_ajax_site():
    # Create Playwright context
    async with async_playwright() as p:

        # Launch Chromium browser in headless mode
        browser = await p.chromium.launch(headless=True)
        # headless=True → browser runs in background (no UI)

        # Create a fresh browser context
        ctx = await browser.new_context()

        # Open a new browser page
        page = await ctx.new_page()

        # Navigate to target URL
        await page.goto(AJAX_URL, timeout=60000)
        # timeout=60000 → wait up to 60 seconds if page loads slowly

        # Wait until product cards are visible
        await page.wait_for_selector(".thumbnail", timeout=30000)
        # Ensures products are fully loaded before scraping


        # ===============================
        # HANDLE "LOAD MORE" BUTTON
        # ===============================

        while True:
            try:
                # Try to locate the "Load more" button
                load_more = await page.query_selector("button:has-text('Load more')")

                # If button not found → stop loop
                if not load_more:
                    break

                # Check if button is disabled
                # (get_attribute("disabled") returns "" for a bare boolean
                # attribute, which is falsy, so use is_disabled() instead)
                if await load_more.is_disabled():
                    break

                # Count products before clicking
                before_count = len(await page.query_selector_all(".thumbnail"))

                # Click "Load more"
                await load_more.click()

                # Wait until network requests finish
                await page.wait_for_load_state("networkidle")

                # Soft wait loop to detect newly loaded products
                for _ in range(30):
                    after_count = len(await page.query_selector_all(".thumbnail"))
                    if after_count > before_count:
                        break
                    await asyncio.sleep(0.2)

                # If no new products appeared → exit loop
                if after_count <= before_count:
                    break

            except Exception:
                # Any error (button missing, timeout, etc.) → exit loop
                break


        # ===============================
        # EXTRACT PRODUCT DATA
        # ===============================

        # Get all product cards
        cards = await page.query_selector_all(".thumbnail")

        rows = []  # list to store scraped product data

        for card in cards:

            # Extract product title and link
            title_el = await card.query_selector(".title")
            title = (await title_el.text_content()).strip() if title_el else None
            url = await title_el.get_attribute("href") if title_el else None

            # Extract price
            price_el = await card.query_selector(".price")
            price = (await price_el.text_content()).strip() if price_el else None

            # Extract rating (count of star icons)
            stars = await card.query_selector_all(".ratings .glyphicon-star")
            rating = len(stars) if stars else 0

            # Extract product image URL
            img_el = await card.query_selector("img")
            img_src = await img_el.get_attribute("src") if img_el else None

            # Store extracted data in dictionary
            rows.append({
                "title": title,
                "price": price,
                "rating_stars": rating,
                "product_url": url,
                "image_url": img_src
            })

        # Close browser after scraping
        await browser.close()

        # Return scraped data
        return rows


# ===============================
# RUN ASYNC FUNCTION
# ===============================

# Execute async scraping function
data = asyncio.get_event_loop().run_until_complete(scrape_ajax_site())

# Print number of products scraped
print(f"Collected {len(data)} products")


# ===============================
# SAVE DATA TO FILES
# ===============================

# Create output directory if it doesn't exist
Path("output").mkdir(exist_ok=True)

csv_path = Path("output/products_ajax.csv")
json_path = Path("output/products_ajax.json")

if data:
    # Save data to CSV
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    # Save data to JSON
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    # Print confirmation messages
    print(f"Saved CSV → {csv_path}")
    print(f"Saved JSON → {json_path}")
else:
    # data[0] would raise IndexError on an empty result, so skip saving
    print("No data scraped, nothing saved.")

Collected 6 products
Saved CSV → output/products_ajax.csv
Saved JSON → output/products_ajax.json


**OBSERVATION FOR THE ABOVE CODE**
* The program uses **asynchronous web scraping** with **Playwright** to extract product data from an e-commerce test page. The URL actually scraped is the **static** variant of the site; the AJAX variant is commented out in the code.
* The **event loop issue** commonly seen in Jupyter/Google Colab environments is handled correctly using `nest_asyncio`, allowing the async Playwright code to execute without errors.
* The script navigates to the target laptop listing page and **waits explicitly for product cards (`.thumbnail`) to load**, ensuring data is scraped only after the page content becomes visible.
* The **“Load more” button logic** is implemented safely using a loop that:

  * Checks if the button exists,
  * Verifies whether it is disabled,
  * Confirms that new products are actually loaded after each click.
* On the given page, **no additional products were loaded** (this variant paginates via `?page=N` in the URL), so the loop exited without errors and only the 6 products on page 1 were collected.
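
The "confirm that new products loaded" step is a bounded polling loop. It can be sketched on its own without a browser. Here `wait_for_growth` and `FakePage` are hypothetical names, and `FakePage.count_cards` stands in for `len(await page.query_selector_all(".thumbnail"))`.

```python
import asyncio


async def wait_for_growth(get_count, before: int,
                          attempts: int = 30, delay: float = 0.2) -> bool:
    """Poll get_count() until it exceeds `before`; return False if it never does."""
    for _ in range(attempts):
        if await get_count() > before:
            return True
        await asyncio.sleep(delay)
    return False


class FakePage:
    """Hypothetical stand-in for a page whose card count changes over time."""

    def __init__(self, counts):
        self._counts = iter(counts)
        self._last = 0

    async def count_cards(self) -> int:
        # Once the scripted counts run out, keep returning the last value.
        self._last = next(self._counts, self._last)
        return self._last


async def demo():
    growing = FakePage([6, 6, 12])   # new cards appear on the third poll
    static = FakePage([6])           # count never changes
    grew = await wait_for_growth(growing.count_cards, before=6, delay=0.01)
    stalled = await wait_for_growth(static.count_cards, before=6,
                                    attempts=5, delay=0.01)
    return grew, stalled


grew, stalled = asyncio.run(demo())
print(grew, stalled)
```

Returning a boolean keeps the caller's exit condition explicit. In a notebook cell with a running event loop and without `nest_asyncio`, `await demo()` would replace `asyncio.run(demo())`.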

---

**Output Analysis**

* **Total products collected:** `6`

* Each product record includes:

  * `title` – name of the laptop
  * `price` – listed price
  * `rating_stars` – number of star icons found
  * `product_url` – link to the product page
  * `image_url` – URL of the product image

* The scraped data was **successfully persisted in two formats**:

  * **CSV file:** `output/products_ajax.csv`
  * **JSON file:** `output/products_ajax.json`

* Saving the data in **both structured (CSV)** and **semi-structured (JSON)** formats makes it suitable for:

  * Data analysis (Excel, Pandas)
  * APIs or dashboards
  * Machine learning or pricing analytics pipelines
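
Before analysis, the scraped rows usually need light cleaning, because `price` is stored as text and `product_url` is whatever the `href` attribute contained. The sketch below assumes prices look like `$1,295.99` and that hrefs are site-relative; both are assumptions about the page, and the sample row is made up.

```python
from urllib.parse import urljoin

# Assumed base URL for resolving relative hrefs; adjust if the site differs.
BASE = "https://webscraper.io"


def normalize(row: dict) -> dict:
    """Return a copy of a scraped row with a numeric price and absolute URL."""
    cleaned = dict(row)
    # Assumes a leading "$" and optional thousands separators.
    cleaned["price"] = float(row["price"].lstrip("$").replace(",", ""))
    cleaned["product_url"] = urljoin(BASE, row["product_url"])
    return cleaned


# Made-up sample row in the shape produced by the scraper above.
sample = {
    "title": "Example Laptop",
    "price": "$1,295.99",
    "rating_stars": 3,
    "product_url": "/test-sites/e-commerce/static/product/0",
    "image_url": "/images/example.png",
}
clean = normalize(sample)
print(clean["price"], clean["product_url"])
```

Working on a copy leaves the raw scraped values untouched, so the original CSV and JSON files remain a faithful record of the page.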



In [8]:
# ===============================
# COMMON IMPORTS & COLAB EVENT LOOP FIX
# ===============================

import asyncio, json, csv, time
# asyncio → to run asynchronous Playwright code
# json → to save scraped data as JSON
# csv → to save scraped data as CSV
# time → utility module (not heavily used here)

from pathlib import Path
# Path → platform-independent file and folder handling

import nest_asyncio
# nest_asyncio → required to run asyncio inside Jupyter/Colab

nest_asyncio.apply()
# Fixes "event loop is already running" error in Colab

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
# async_playwright → Playwright async browser controller
# PlaywrightTimeoutError → used to safely stop pagination when no page loads


# ===============================
# BASE URL TEMPLATE
# ===============================

# Base URL with page number placeholder
BASE_URL = "https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={page}"
# We will dynamically replace {page} with page numbers (1, 2, 3, ...)


# ===============================
# ASYNC SCRAPING FUNCTION
# ===============================

async def scrape_all_pages_auto():
    # Start Playwright
    async with async_playwright() as p:

        # Launch Chromium browser in headless mode
        browser = await p.chromium.launch(headless=True)
        # headless=True → browser runs without GUI (faster, suited to automation)

        # Create a fresh browser context
        ctx = await browser.new_context()

        # Open a new page/tab
        page = await ctx.new_page()

        all_rows = []
        # all_rows → stores products collected from all pages

        page_num = 1
        # Start scraping from page 1


        # ===============================
        # PAGINATION LOOP
        # ===============================

        while True:
            # Create page-specific URL
            url = BASE_URL.format(page=page_num)

            print(f"Visiting page {page_num}: {url}")

            # Navigate to the page
            await page.goto(url, timeout=60000)
            # timeout=60000 → wait up to 60 seconds for page load


            # ===============================
            # WAIT FOR PRODUCTS
            # ===============================

            try:
                # Wait until product cards appear
                await page.wait_for_selector(".thumbnail", timeout=15000)
            except PlaywrightTimeoutError:
                # If no products load within time → stop pagination
                print(f"No products found (timeout) on page {page_num}, stopping.")
                break


            # ===============================
            # EXTRACT PRODUCT CARDS
            # ===============================

            cards = await page.query_selector_all(".thumbnail")
            # Select all product blocks on the page

            if not cards:
                # Safety check: if no products found → stop
                print(f"No .thumbnail elements on page {page_num}, stopping.")
                break

            page_rows = []
            # Stores products from the current page only


            # ===============================
            # LOOP THROUGH PRODUCTS
            # ===============================

            for card in cards:

                # Extract product title and link
                title_el = await card.query_selector(".title")
                title = (await title_el.text_content()).strip() if title_el else None
                url = await title_el.get_attribute("href") if title_el else None

                # Extract price
                price_el = await card.query_selector(".price")
                price = (await price_el.text_content()).strip() if price_el else None

                # Extract rating by counting star icons
                stars = await card.query_selector_all(".ratings .glyphicon-star")
                rating = len(stars) if stars else 0

                # Extract product image URL
                img_el = await card.query_selector("img")
                img_src = await img_el.get_attribute("src") if img_el else None

                # Store extracted data in dictionary
                page_rows.append({
                    "title": title,
                    "price": price,
                    "rating_stars": rating,
                    "product_url": url,
                    "image_url": img_src,
                    "page": page_num  # helps identify source page
                })


            # Log how many products were scraped from this page
            print(f"Page {page_num}: collected {len(page_rows)} products")

            # Add current page data to global list
            all_rows.extend(page_rows)

            # Move to next page
            page_num += 1


        # Close browser after scraping all pages
        await browser.close()

        # Return all collected products
        return all_rows


# ===============================
# RUN ASYNC FUNCTION
# ===============================

# Execute the async scraping function
data = asyncio.get_event_loop().run_until_complete(scrape_all_pages_auto())

# Print total number of products scraped
print(f"Total collected products from all pages: {len(data)}")


# ===============================
# SAVE OUTPUT FILES
# ===============================

# Create output directory if it does not exist
Path("output").mkdir(exist_ok=True)

csv_path = Path("output/products_all_pages.csv")
json_path = Path("output/products_all_pages.json")

if data:
    # Save data to CSV
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    # Save data to JSON
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    print(f"Saved CSV → {csv_path}")
    print(f"Saved JSON → {json_path}")
else:
    # If nothing was scraped
    print("No data scraped, nothing saved.")

Visiting page 1: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=1
Page 1: collected 6 products
Visiting page 2: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=2
Page 2: collected 6 products
Visiting page 3: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=3
Page 3: collected 6 products
Visiting page 4: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=4
Page 4: collected 6 products
Visiting page 5: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=5
Page 5: collected 6 products
Visiting page 6: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=6
Page 6: collected 6 products
Visiting page 7: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=7
Page 7: collected 6 products
Visiting page 8: https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=8
Page 8: collected 6 products
Visiting page 9: https:/

**OBSERVATION FOR THE ABOVE CODE**

* The program successfully implements **automated multi-page web scraping** using **asynchronous Playwright** on a **paginated e-commerce website**.
* The **event-loop conflict in Jupyter/Google Colab** is correctly handled using `nest_asyncio`, allowing async browser automation to run smoothly without runtime errors.
* The scraper dynamically constructs URLs using a **page number template**, enabling it to visit pages sequentially (`page=1` to `page=20`) without hard-coding page links.
* For each page:

  * The script waits explicitly for `.thumbnail` product cards to load, ensuring reliable extraction even on slow networks.
  * It extracts **structured product information**, including:

    * Product title
    * Price
    * Rating (computed by counting star icons)
    * Product URL
    * Image URL
    * Page number (source traceability)
* The **pagination stopping condition** is robust:

  * When page 21 fails to load products within the timeout period, a `PlaywrightTimeoutError` is triggered.
  * This is safely caught and used as a **natural termination signal**, preventing infinite looping.
* The scraper maintains **data integrity** by:

  * Collecting products page-wise
  * Merging all records into a single dataset after pagination ends
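
The stop-on-empty pagination pattern can be checked without a browser. In this sketch `paginate` and `fake_fetch` are hypothetical names, an empty page plays the role of the timeout, and the `max_pages` cap is an added safety limit that the scraper above does not have. The fake page sizes mirror the counts reported for this run.

```python
def paginate(fetch_page, max_pages: int = 1000):
    """Collect rows page by page until a page comes back empty."""
    all_rows = []
    for page_num in range(1, max_pages + 1):
        rows = fetch_page(page_num)
        if not rows:  # an empty page plays the role of the timeout
            break
        # Tag each row with its source page, as the scraper does.
        all_rows.extend({**row, "page": page_num} for row in rows)
    return all_rows


def fake_fetch(page_num: int):
    """Hypothetical fetcher: 19 pages of 6 items, a 20th with 3, then nothing."""
    if page_num <= 19:
        size = 6
    elif page_num == 20:
        size = 3
    else:
        size = 0
    return [{"title": f"item-{page_num}-{i}"} for i in range(size)]


rows = paginate(fake_fetch)
print(len(rows), rows[-1]["page"])
```

A hard cap on the page count guards against sites that serve a non-empty page for every page number, where a timeout-based stop would never fire.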

---

**Output Analysis**

* **Pages successfully scraped:** `1 – 20`

* **Products per page:**

  * Pages 1–19 → 6 products each
  * Page 20 → 3 products

* **Total products collected:**
  **117 laptop products**

* The final dataset was saved in two commonly used formats:

  * **CSV:** `output/products_all_pages.csv` (ideal for Excel, Pandas analysis)
  * **JSON:** `output/products_all_pages.json` (ideal for APIs, web apps, ML pipelines)

* The consistent logging confirms:

  * Correct pagination handling
  * No sign of repeated pages (individual records are not checked for duplicates)
  * Graceful termination after the last valid page
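
The log shows page-level counts only, so a record-level duplicate check is a useful follow-up. Here is a minimal sketch, assuming `product_url` identifies a product uniquely; `find_duplicates` is a hypothetical helper and the rows are made up.

```python
from collections import Counter


def find_duplicates(rows, key="product_url"):
    """Return the key values that occur more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows; in the notebook this would be the scraped `data` list.
sample_rows = [
    {"product_url": "/product/1", "page": 1},
    {"product_url": "/product/2", "page": 1},
    {"product_url": "/product/1", "page": 2},
]
dupes = find_duplicates(sample_rows)
print(dupes)
```

An empty result from `find_duplicates(data)` would back up the claim that no product was scraped twice.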


In [9]:
# ===============================
# COMMON IMPORTS & COLAB EVENT LOOP FIX
# ===============================

import asyncio, json, csv, time
# asyncio → run asynchronous Playwright code
# json → save scraped data in JSON format
# csv → save scraped data in CSV format
# time → utility module (optional, for delays if needed)

from pathlib import Path
# Path → OS-independent way to handle files and folders

import nest_asyncio
# nest_asyncio → required when running asyncio in Jupyter / Google Colab

nest_asyncio.apply()
# Fixes "event loop is already running" error in Colab

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
# async_playwright → async browser automation library
# PlaywrightTimeoutError → used to stop scraping when pages no longer load


# ===============================
# BASE URL FOR TABLETS (PAGINATED)
# ===============================

# Base URL for tablet products
# {page} will be replaced with page numbers (1, 2, 3, ...)
BASE_URL = "https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page={page}"


# ===============================
# ASYNC SCRAPING FUNCTION
# ===============================

async def scrape_tablets_all_pages():
    # Start Playwright context
    async with async_playwright() as p:

        # Launch Chromium browser in headless mode
        browser = await p.chromium.launch(headless=True)
        # headless=True → browser runs without UI (faster & automation-friendly)

        # Create a new browser context (isolated session)
        ctx = await browser.new_context()

        # Open a new browser page
        page = await ctx.new_page()

        all_rows = []
        # all_rows → stores products scraped from ALL pages

        page_num = 1
        # Start scraping from page 1


        # ===============================
        # PAGINATION LOOP
        # ===============================

        while True:
            # Generate URL for current page number
            url = BASE_URL.format(page=page_num)

            print(f"Visiting page {page_num}: {url}")

            # Navigate to the page
            await page.goto(url, timeout=60000)
            # timeout=60000 → wait up to 60 seconds for page load


            # ===============================
            # WAIT FOR PRODUCTS TO LOAD
            # ===============================

            try:
                # Wait until product cards appear on the page
                await page.wait_for_selector(".thumbnail", timeout=15000)
            except PlaywrightTimeoutError:
                # If products do not load → assume no more pages
                print(f"No products found on page {page_num} (timeout), stopping.")
                break


            # ===============================
            # EXTRACT PRODUCT CARDS
            # ===============================

            # Select all product blocks
            cards = await page.query_selector_all(".thumbnail")

            if not cards:
                # Safety check: stop if no products are found
                print(f"No .thumbnail elements on page {page_num}, stopping.")
                break

            page_rows = []
            # page_rows → stores products from the CURRENT page only


            # ===============================
            # LOOP THROUGH EACH PRODUCT
            # ===============================

            for card in cards:

                # Extract product title and product link
                title_el = await card.query_selector(".title")
                title = (await title_el.text_content()).strip() if title_el else None
                # Use a separate name so the page URL held in `url` is not overwritten
                product_url = await title_el.get_attribute("href") if title_el else None

                # Extract product price
                price_el = await card.query_selector(".price")
                price = (await price_el.text_content()).strip() if price_el else None

                # Extract rating by counting star icons (an empty list gives 0)
                stars = await card.query_selector_all(".ratings .glyphicon-star")
                rating = len(stars)

                # Extract product image URL
                img_el = await card.query_selector("img")
                img_src = await img_el.get_attribute("src") if img_el else None

                # Store extracted data in dictionary format
                page_rows.append({
                    "title": title,
                    "price": price,
                    "rating_stars": rating,
                    "product_url": product_url,
                    "image_url": img_src,
                    "page": page_num  # helps track source page
                })


            # Log number of products scraped from this page
            print(f"Page {page_num}: collected {len(page_rows)} products")

            # Add current page data to the global list
            all_rows.extend(page_rows)

            # Move to the next page
            page_num += 1


        # Close the browser after scraping all pages
        await browser.close()

        # Return all scraped tablet products
        return all_rows


# ===============================
# RUN ASYNC FUNCTION
# ===============================

# Execute the async scraping function
data = asyncio.get_event_loop().run_until_complete(scrape_tablets_all_pages())

# Print total number of products scraped
print(f"Total collected products from all tablet pages: {len(data)}")


# ===============================
# SAVE OUTPUT FILES
# ===============================

# Create output directory if it doesn't exist
Path("output").mkdir(exist_ok=True)

csv_path = Path("output/tablets_all_pages.csv")
json_path = Path("output/tablets_all_pages.json")

if data:
    # Save scraped data to CSV file
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    # Save scraped data to JSON file
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    print(f"Saved CSV → {csv_path}")
    print(f"Saved JSON → {json_path}")
else:
    # If no data was scraped
    print("No data scraped, nothing saved.")

Visiting page 1: https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page=1
Page 1: collected 6 products
Visiting page 2: https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page=2
Page 2: collected 6 products
Visiting page 3: https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page=3
Page 3: collected 6 products
Visiting page 4: https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page=4
Page 4: collected 3 products
Visiting page 5: https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page=5
No products found on page 5 (timeout), stopping.
Total collected products from all tablet pages: 21
Saved CSV → output/tablets_all_pages.csv
Saved JSON → output/tablets_all_pages.json


**OBSERVATION FOR THE ABOVE CODE**
* The program scrapes **tablet products** across multiple pages using **asynchronous Playwright**.
* `nest_asyncio` patches the event loop that Jupyter Notebook and Google Colab already run, so `run_until_complete` can be called from a cell without an "event loop already running" error.
* A **paginated URL template** is filled in with `BASE_URL.format(page=page_num)`, so the scraper moves from page to page without manual intervention.
* For each page:

  * The script explicitly waits for the `.thumbnail` selector, so extraction starts only after a product card has appeared on the page.
  * It extracts structured product information including:

    * Product title
    * Price
    * Rating (calculated by counting star icons)
    * Product URL (the raw `href` value, not converted to an absolute URL)
    * Image URL (the raw `src` value)
    * Page number (for source traceability)
* The scraper uses a **timeout-based stopping condition**:

  * Page 5 shows no product cards, so `wait_for_selector` raises a `PlaywrightTimeoutError` once its 15-second timeout expires.
  * The exception is caught and used to **end the pagination loop**, so the `while True` loop cannot run forever. The cost is one extra request and a 15-second wait on the first page past the end.
* Rows are collected page by page and merged into a single list. The code does not deduplicate explicitly; each page is visited exactly once, and the per-page counts in the log (6 + 6 + 6 + 3) add up to the 21 products reported.
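
The absence of duplicates can be checked directly on the collected rows rather than assumed. The sketch below is one possible check; the helper name `find_duplicate_urls` and the sample rows are hypothetical and only mimic the shape of the scraper's dictionaries (they are not real scraped data).

```python
from collections import Counter


def find_duplicate_urls(rows):
    """Return product URLs that appear in more than one scraped row."""
    counts = Counter(row["product_url"] for row in rows if row.get("product_url"))
    return sorted(url for url, n in counts.items() if n > 1)


# Hypothetical rows shaped like the scraper's output (illustration only)
sample_rows = [
    {"title": "Tablet A", "product_url": "/product/1", "page": 1},
    {"title": "Tablet B", "product_url": "/product/2", "page": 1},
    {"title": "Tablet A", "product_url": "/product/1", "page": 2},
]

duplicates = find_duplicate_urls(sample_rows)
print(duplicates)  # a non-empty list means the same product was collected twice
```

Calling `find_duplicate_urls(data)` on the real result would return an empty list if every product URL is unique.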

---

**Output Analysis**

* **Pages successfully scraped:** `1 – 4`

* **Products per page:**

  * Pages 1–3 → 6 products each
  * Page 4 → 3 products

* **Total tablet products collected:**
  **21 products**

* The extracted data was successfully saved in two widely used formats:

  * **CSV:** `output/tablets_all_pages.csv` – suitable for spreadsheets and data analysis
  * **JSON:** `output/tablets_all_pages.json` – suitable for APIs, dashboards, and machine learning workflows

* Console logs confirm:

  * Correct pagination traversal
  * Accurate product counts per page
  * Proper termination after the last available page
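
The per-page counts can also be recomputed from the `page` field stored in each row, instead of being read off the console log. This is a minimal sketch; `products_per_page` is a hypothetical helper, and the sample rows are placeholders built to mirror the logged counts rather than real scraped records.

```python
from collections import Counter


def products_per_page(rows):
    """Count how many rows were collected from each source page."""
    counts = Counter(row["page"] for row in rows)
    return dict(sorted(counts.items()))


# Placeholder rows that mirror the logged counts: 6, 6, 6 and 3 products
sample_rows = [
    {"page": page}
    for page, n_products in [(1, 6), (2, 6), (3, 6), (4, 3)]
    for _ in range(n_products)
]

summary = products_per_page(sample_rows)
print(summary)                # per-page counts keyed by page number
print(sum(summary.values()))  # total number of rows
```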

In [10]:
# -----------------------------
# SIMPLE PHONES SCRAPER (FINAL)
# -----------------------------
# This script scrapes phone products from a single-page e-commerce site
# using Playwright (async browser automation)

import asyncio, json, csv
# asyncio → to run asynchronous Playwright code
# json → to save scraped data in JSON format
# csv → to save scraped data in CSV format

from pathlib import Path
# Path → OS-independent file and directory handling

import nest_asyncio
# nest_asyncio → allows asyncio to run inside Jupyter / Google Colab

nest_asyncio.apply()
# Fixes "event loop already running" error in Colab

from urllib.parse import urljoin
# urljoin → safely combine base URL with relative URLs

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
# async_playwright → async Playwright API
# PlaywrightTimeoutError → handles timeout when elements do not load


# -----------------------------
# WEBSITE CONFIGURATION
# -----------------------------

# Phones category has ONLY ONE PAGE → no pagination required
BASE_URL = "https://webscraper.io/test-sites/e-commerce/static/phones"

# Root URL used to convert relative URLs into absolute URLs
ROOT = "https://webscraper.io"


# -----------------------------
# ASYNC SCRAPING FUNCTION
# -----------------------------

async def scrape_phones():
    # Start Playwright context
    async with async_playwright() as p:

        # Launch Chromium browser in headless mode
        browser = await p.chromium.launch(headless=True)
        # headless=True → browser runs without UI (faster, automation-friendly)

        # Create a new isolated browser session
        ctx = await browser.new_context()

        # Open a new browser page
        page = await ctx.new_page()

        print("Visiting:", BASE_URL)

        # Navigate to the phones page
        await page.goto(BASE_URL, timeout=60000)
        # timeout=60000 → wait up to 60 seconds for page load


        # -----------------------------
        # WAIT FOR PRODUCTS TO LOAD
        # -----------------------------

        try:
            # Wait until product cards appear
            await page.wait_for_selector(".thumbnail", timeout=6000)
        except PlaywrightTimeoutError:
            # If products do not load, close the browser and return an empty list
            print("No products found.")
            await browser.close()
            return []


        # Select all product cards
        cards = await page.query_selector_all(".thumbnail")

        all_rows = []
        # all_rows → list to store scraped product data


        # -----------------------------
        # EXTRACT DATA FROM EACH PRODUCT
        # -----------------------------

        for card in cards:

            # Extract product title and relative URL (None if the element is missing)
            title_el = await card.query_selector(".title")
            title = (await title_el.text_content()).strip() if title_el else None
            href = await title_el.get_attribute("href") if title_el else None

            # Convert relative product URL to absolute URL
            full_product_url = urljoin(ROOT, href) if href else None

            # Extract product price
            price_el = await card.query_selector(".price")
            price = (await price_el.text_content()).strip() if price_el else None

            # Extract rating by counting star icons
            stars = await card.query_selector_all(".glyphicon-star")
            rating = len(stars)

            # Extract image URL and convert to absolute URL
            img_el = await card.query_selector("img")
            img_src = await img_el.get_attribute("src") if img_el else None
            img_url = urljoin(ROOT, img_src) if img_src else None

            # Store extracted data in dictionary
            all_rows.append({
                "title": title,
                "price": price,
                "rating_stars": rating,
                "product_url": full_product_url,
                "image_url": img_url
            })


        # Close the browser after scraping
        await browser.close()

        # Return all scraped phone products
        return all_rows


# -------------------------------------------------
# RUN THE SCRAPER
# -------------------------------------------------

# Execute the async scraping function
data = asyncio.get_event_loop().run_until_complete(scrape_phones())

# Print total number of products scraped
print(f"Total products scraped: {len(data)}")


# -------------------------------------------------
# SAVE DATA TO CSV + JSON
# -------------------------------------------------

# Create output directory if it does not exist
Path("output").mkdir(exist_ok=True)


if data:
    # Save data as CSV
    csv_path = Path("output/phones_simple.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    print("Saved CSV:", csv_path)

    # Save data as JSON
    json_path = Path("output/phones_simple.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    print("Saved JSON:", json_path)
else:
    # scrape_phones() returns [] on timeout, and data[0] would raise IndexError
    print("No data scraped, nothing saved.")

Visiting: https://webscraper.io/test-sites/e-commerce/static/phones
Total products scraped: 3
Saved CSV: output/phones_simple.csv
Saved JSON: output/phones_simple.json


**OBSERVATION FOR THE ABOVE CODE**
* The script implements a **simple asynchronous web scraper** using **Playwright** to extract phone product data from a **single-page category** of the e-commerce test site.
* Since the phones category contains **only one page**, no pagination logic is required.
* `nest_asyncio` patches the already-running event loop in Jupyter Notebook / Google Colab, so the asynchronous code can be run from a cell.
* The program waits explicitly for the `.thumbnail` selector, so extraction starts only after a product card has appeared on the page.
* For each phone product, the scraper extracts:

  * Product title
  * Price
  * Rating (calculated by counting star icons)
  * Product URL (converted from relative to absolute using `urljoin`)
  * Image URL (converted to absolute URL)
* The browser is closed explicitly after scraping, and the `async with async_playwright()` block releases Playwright's resources when the function exits.
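
The relative-to-absolute conversion can be tried in isolation. The following sketch wraps `urljoin` in a hypothetical helper, `to_absolute`, that also passes `None` through; the example paths are made up for illustration and are not taken from the scraped output.

```python
from urllib.parse import urljoin

ROOT = "https://webscraper.io"


def to_absolute(href, root=ROOT):
    """Resolve a scraped href against the site root; pass None through."""
    return urljoin(root, href) if href else None


# Hypothetical paths, for illustration only
print(to_absolute("/test-sites/e-commerce/static/product/1"))
# → https://webscraper.io/test-sites/e-commerce/static/product/1

print(to_absolute("https://example.com/img.png"))
# → https://example.com/img.png (already absolute, returned unchanged)

print(to_absolute(None))
# → None
```

Because `urljoin` leaves absolute URLs untouched, the helper is safe to apply to every scraped link regardless of how the site writes it.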

---

**Output Analysis**

* **Total products scraped:** `3 phone products`
* The page is visited once and each `.thumbnail` card on it yields one row, so the 3 rows correspond to the 3 listings found on the page.
* The extracted data was saved in two formats:

  * **CSV:** `output/phones_simple.csv` – suitable for spreadsheets and data analysis
  * **JSON:** `output/phones_simple.json` – suitable for APIs, dashboards, and further processing
* Console output confirms:

  * Successful page access
  * Correct product count
  * Successful file generation
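
The CSV and JSON saving steps are repeated in each scraper cell, so they could be collected into one helper. The sketch below is one way to do that; `save_rows`, its parameters, and the sample row are hypothetical, and it writes to a temporary directory so it can be run without touching `output/`.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_rows(rows, out_dir, stem):
    """Write rows to <stem>.csv and <stem>.json; return both paths, or None if rows is empty."""
    if not rows:
        return None  # nothing to write, and rows[0] would raise IndexError
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

    return csv_path, json_path


# Hypothetical row, for illustration only
sample_rows = [{"title": "Phone A", "price": "$1.00", "rating_stars": 3}]

with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = save_rows(sample_rows, tmp, "phones_demo")
    with open(json_path, encoding="utf-8") as f:
        reloaded = json.load(f)
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_rows = list(csv.DictReader(f))

print(reloaded)  # JSON keeps rating_stars as an integer
print(csv_rows)  # CSV reads every value back as a string
```

The difference on reload is worth keeping in mind: JSON preserves the integer rating, while `csv.DictReader` returns it as text that needs converting before analysis.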

In [11]:
# ===============================
# COMMON IMPORTS & COLAB EVENT LOOP FIX
# ===============================

import asyncio, json, csv
# asyncio → run asynchronous Playwright code
# json → save scraped data in JSON format
# csv → save scraped data in CSV format

from pathlib import Path
# Path → OS-independent way to handle files and folders

import nest_asyncio
# nest_asyncio → required to run asyncio inside Jupyter / Google Colab

nest_asyncio.apply()
# Fixes "event loop already running" error in Colab

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
# async_playwright → async browser automation using Playwright
# PlaywrightTimeoutError → used to safely stop scraping when pages don’t load


# ===============================
# BASE URL FOR TOUCH PHONES
# ===============================

# Base URL for touch phones category
# {page} will be dynamically replaced with page numbers (1, 2, 3, ...)
BASE_URL = "https://webscraper.io/test-sites/e-commerce/static/phones/touch?page={page}"


# ===============================
# ASYNC SCRAPING FUNCTION
# ===============================

async def scrape_touch_phones_all_pages():
    # Start Playwright engine
    async with async_playwright() as p:

        # Launch Chromium browser in headless mode
        browser = await p.chromium.launch(headless=True)
        # headless=True → runs browser in background (no UI)

        # Create a fresh browser context (isolated session)
        ctx = await browser.new_context()

        # Open a new browser page/tab
        page = await ctx.new_page()

        all_rows = []
        # all_rows → stores products from ALL pages

        page_num = 1
        # Start scraping from page 1


        # ===============================
        # PAGINATION LOOP
        # ===============================

        while True:
            # Generate URL for the current page
            url = BASE_URL.format(page=page_num)

            print(f"Visiting page {page_num}: {url}")

            # Navigate to the page
            await page.goto(url, timeout=60000)
            # timeout=60000 → wait up to 60 seconds for page to load


            # ===============================
            # WAIT FOR PRODUCTS TO LOAD
            # ===============================

            try:
                # Wait until product cards appear on the page
                await page.wait_for_selector(".thumbnail", timeout=15000)
            except PlaywrightTimeoutError:
                # If products don’t load, assume no more pages
                print(f"No products found on page {page_num} (timeout), stopping.")
                break


            # ===============================
            # EXTRACT PRODUCT CARDS
            # ===============================

            # Select all product containers on the page
            cards = await page.query_selector_all(".thumbnail")

            if not cards:
                # Safety check: stop if no product elements found
                print(f"No .thumbnail elements on page {page_num}, stopping.")
                break

            page_rows = []
            # page_rows → stores products from the CURRENT page only


            # ===============================
            # LOOP THROUGH EACH PRODUCT
            # ===============================

            for card in cards:

                # Extract product title and relative URL
                title_el = await card.query_selector(".title")
                title = (await title_el.text_content()).strip() if title_el else None
                # Use a separate name so the page URL held in `url` is not overwritten
                product_url = await title_el.get_attribute("href") if title_el else None

                # Extract product price
                price_el = await card.query_selector(".price")
                price = (await price_el.text_content()).strip() if price_el else None

                # Extract rating by counting star icons (an empty list gives 0)
                stars = await card.query_selector_all(".ratings .glyphicon-star")
                rating = len(stars)

                # Extract image URL
                img_el = await card.query_selector("img")
                img_src = await img_el.get_attribute("src") if img_el else None

                # Store extracted data in dictionary
                page_rows.append({
                    "title": title,
                    "price": price,
                    "rating_stars": rating,
                    "product_url": product_url,
                    "image_url": img_src,
                    "page": page_num  # helps track source page
                })


            # Log how many products were scraped from this page
            print(f"Page {page_num}: collected {len(page_rows)} products")

            # Add current page products to final list
            all_rows.extend(page_rows)

            # Move to the next page
            page_num += 1


        # Close browser after scraping all pages
        await browser.close()

        # Return all scraped touch phone products
        return all_rows


# ===============================
# RUN ASYNC FUNCTION
# ===============================

# Execute the async scraping function
data = asyncio.get_event_loop().run_until_complete(scrape_touch_phones_all_pages())

# Print total number of products scraped
print(f"Total collected products from all touch phone pages: {len(data)}")


# ===============================
# SAVE OUTPUT FILES
# ===============================

# Create output directory if it doesn't exist
Path("output").mkdir(exist_ok=True)

csv_path = Path("output/touch_phones_all_pages.csv")
json_path = Path("output/touch_phones_all_pages.json")

if data:
    # Save data to CSV
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    # Save data to JSON
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    print(f"Saved CSV → {csv_path}")
    print(f"Saved JSON → {json_path}")
else:
    # If no data was scraped
    print("No data scraped, nothing saved.")

Visiting page 1: https://webscraper.io/test-sites/e-commerce/static/phones/touch?page=1
Page 1: collected 6 products
Visiting page 2: https://webscraper.io/test-sites/e-commerce/static/phones/touch?page=2
Page 2: collected 3 products
Visiting page 3: https://webscraper.io/test-sites/e-commerce/static/phones/touch?page=3
No products found on page 3 (timeout), stopping.
Total collected products from all touch phone pages: 9
Saved CSV → output/touch_phones_all_pages.csv
Saved JSON → output/touch_phones_all_pages.json


**OBSERVATION FOR THE ABOVE CODE**
* The program scrapes **touch phone products** across multiple pages using **asynchronous Playwright browser automation**.
* `nest_asyncio` patches the already-running event loop in Jupyter Notebook / Google Colab, so the asynchronous code can be run from a cell without an "event loop already running" error.
* A **paginated URL template** (`?page={page}`) generates each page URL, so the scraper moves through the touch phone listings page by page.
* For each page:

  * The script explicitly waits for the `.thumbnail` selector, so extraction starts only after a product card has appeared on the page.
  * It extracts structured product information including:

    * Product title
    * Price
    * Rating (computed by counting star icons)
    * Product URL (the raw `href` value, not converted to an absolute URL)
    * Image URL (the raw `src` value)
    * Page number (to track the source page)
* The scraper employs a **timeout-based stopping condition**:

  * Page 3 shows no product cards, so `wait_for_selector` raises a `PlaywrightTimeoutError` once its 15-second timeout expires.
  * The exception is caught to **end the pagination loop**, so the `while True` loop cannot run forever. The cost is one extra request and a 15-second wait on the first page past the end.
* All scraped rows are accumulated into a single list. The code does not deduplicate explicitly; each page is visited exactly once, and the per-page counts in the log (6 + 3) add up to the 9 products reported.
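
The stopping rule described above can be separated from Playwright and exercised on its own. In this sketch, `collect_all_pages` and `fake_fetch` are hypothetical names, the fetch function is a stand-in for a real page visit, and the `max_pages` cap is an added safety assumption that the notebook's loop does not have.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page returns no rows."""
    all_rows = []
    for page_num in range(1, max_pages + 1):
        page_rows = fetch_page(page_num)
        if not page_rows:
            break  # the first empty page marks the end of the listing
        all_rows.extend(page_rows)
    return all_rows


# Hypothetical stand-in for the site: two pages holding 6 and 3 items
FAKE_SITE = {1: ["item"] * 6, 2: ["item"] * 3}


def fake_fetch(page_num):
    """Return the rows for a page, or an empty list past the last page."""
    return FAKE_SITE.get(page_num, [])


rows = collect_all_pages(fake_fetch)
print(len(rows))  # → 9
```

Keeping the loop independent of the browser makes the termination logic easy to test, and the cap guards against a site that never returns an empty page.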

---

**Output Analysis**

* **Pages successfully scraped:** `1 – 2`

* **Products per page:**

  * Page 1 → 6 products
  * Page 2 → 3 products

* **Total touch phone products collected:**
  **9 products**

* The extracted data was successfully stored in two formats:

  * **CSV:** `output/touch_phones_all_pages.csv` – suitable for spreadsheets and data analysis
  * **JSON:** `output/touch_phones_all_pages.json` – suitable for APIs, dashboards, and further processing

* Console logs confirm:

  * Correct pagination handling
  * Accurate product counts per page
  * Proper termination after the final available page