# Chapter 10 Lab

You are given a list of GORUCK products. Your goal is to build a small scraping workflow that produces a DataFrame containing the product URL, page title, and price for each item.

You will complete this lab in four steps. Each step corresponds to one code block you will write.

### Question 1: Build the input DataFrame

Create a DataFrame named `df` that contains one row per product with these columns:

- `brand_name`
- `product_name`
- `product_url`

Use the 15 products listed in the chapter:

- GR1 26L  
- GR2 34L  
- GR2 40L  
- Rucker 4.0 20L  
- Rucker 4.0 25L  
- Bullet Ruck 15L  
- GR3 45L  
- Radio Ruck  
- Ruck Plate 10lb  
- Ruck Plate 20lb  
- Ruck Plate 30lb  
- Ruck Plate 45lb  
- Expert Ruck Plate 20lb  
- Expert Ruck Plate 30lb  
- Expert Ruck Plate 45lb  

**Instructions**

1. Create a Python list of dictionaries named `products`.
2. Each dictionary must include all three keys: `brand_name`, `product_name`, `product_url`.
3. Fill in working GORUCK URLs so the DataFrame is runnable.
4. Convert the list to a DataFrame named `df`.
5. Display `df` to confirm the structure.

**Checkpoint**

- `df` should have 15 rows.
- `df.columns` should be exactly: `brand_name`, `product_name`, `product_url`.
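
The list-of-dictionaries approach described in the instructions can be sketched as follows. This is a minimal illustration with only two rows, using URLs that appear in this lab's sample output; the full lab needs all 15 products, and the remaining URLs are left for you to fill in.

```python
import pandas as pd

# Two rows only, to show the shape; the lab itself needs all 15 products.
# Both URLs are taken from the sample output shown in this lab.
products = [
    {
        "brand_name": "GORUCK",
        "product_name": "GR1 26L",
        "product_url": "https://www.goruck.com/products/gr1-usa",
    },
    {
        "brand_name": "GORUCK",
        "product_name": "GR2 34L",
        "product_url": "https://www.goruck.com/products/gr2",
    },
]

# Passing `columns` fixes the column order regardless of dictionary key order.
df = pd.DataFrame(products, columns=["brand_name", "product_name", "product_url"])

# Checkpoint-style checks (2 rows here, 15 in the full lab).
print(df.shape)          # → (2, 3)
print(list(df.columns))  # → ['brand_name', 'product_name', 'product_url']
```

Building the frame from a list of dictionaries keeps each product's fields together, which makes a missing key easy to spot before any scraping starts.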

In [8]:
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# -----------------------------
# Input data (URLs included)
# -----------------------------
df = pd.read_csv("ch10lab.csv")


df

| | brand_name | product_name | product_url |
|---|---|---|---|
| 0 | GORUCK | GR1 26L | https://www.goruck.com/products/gr1-usa |
| 1 | GORUCK | GR2 34L | https://www.goruck.com/products/gr2 |
| 2 | GORUCK | GR2 40L | https://www.goruck.com/products/gr2 |
| 3 | GORUCK | Rucker 4.0 20L | https://www.goruck.com/products/rucker |
| 4 | GORUCK | Rucker 4.0 25L | https://www.goruck.com/products/rucker |
| 5 | GORUCK | Bullet Ruck 15L | https://www.goruck.com/products/bullet-ruck-15l |
| 6 | GORUCK | GR3 45L | https://www.goruck.com/products/gr3 |
| 7 | GORUCK | Radio Ruck | https://www.goruck.com/products/radio-ruck-thr... |
| 8 | GORUCK | Ruck Plate 10lb | https://www.goruck.com/products/ruck-plates |
| 9 | GORUCK | Ruck Plate 20lb | https://www.goruck.com/products/ruck-plates |


### Question 2: Create scraping helper functions

Write the scraping harness and extraction helpers. Your goal is to keep these functions small, but reliable enough to work across the URLs in this lab.

You must create:

1. A reusable `requests.Session`
2. `fetch_html(url)`
3. `extract_title(soup)`
4. `extract_price_generic(soup)`
5. `extract_ruck_plate_variant_price(soup, product_name)`

#### 2.1 Create a requests Session

- Create a `requests.Session()` object named `session`.
- Add a browser-like User-Agent header to `session.headers`.

#### 2.2 Write `fetch_html(url)`

Write a function that:

- Takes a URL string.
- Uses `session.get()` to fetch the page.
- Uses a timeout so the notebook does not hang.
- Calls `raise_for_status()` so HTTP errors are surfaced.
- Returns `response.text`.

#### 2.3 Write `extract_title(soup)`

Write a function that:

- Takes a BeautifulSoup object.
- Returns the visible page title using the `h1` element.
- Returns `None` if no title exists.

#### 2.4 Write `extract_price_generic(soup)`

Write a function that attempts to extract the price from standard GORUCK product pages (rucksacks and similar products).

Your function must:

1. Look for a Shopify-style price block using BeautifulSoup `select_one`.
2. Check price selectors in this priority order:
   - Sale price (if present)
   - Regular price
3. If a price element is found, return the cleaned text.
4. If not found, fall back to a regex search over the full page text and return the first dollar amount.
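
The regex fallback in step 4 is worth trying in isolation, because "the first dollar amount" is not always the product price. The sketch below uses only the standard library; the helper name `first_dollar_amount` and the sample strings are hypothetical and are not copied from a real GORUCK page.

```python
import re
from typing import Optional

# A dollar sign, digits with optional thousands separators, and optional cents.
PRICE_RE = re.compile(r"\$([\d,]+(?:\.\d{2})?)")


def first_dollar_amount(text: str) -> Optional[str]:
    """Return the first dollar amount found in `text`, or None."""
    m = PRICE_RE.search(text)
    return f"${m.group(1)}" if m else None


# Hypothetical page text for illustration only.
print(first_dollar_amount("GR1 USA - Cordura $335.00 Add to cart"))    # → $335.00
print(first_dollar_amount("Free shipping over $150 GR1 USA $335.00"))  # → $150
print(first_dollar_amount("Sold out"))                                 # → None
```

The second call shows why the selector-based lookup comes first in the priority order: if any other dollar amount appears earlier in the page text, the fallback returns that amount instead of the price.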

#### 2.5 Write `extract_ruck_plate_variant_price(soup, product_name)`

Ruck Plates are the key gotcha in this lab. They appear as multiple weights on a single product page, which means the page contains multiple prices.

Write a function that:

1. Takes the parsed page (`soup`) and the row’s `product_name`.
2. Determines which weight variant the row represents based on the product name:
   - 10lb maps to `10LB (Standard)`
   - 20lb maps to `20LB (Standard)`
   - 30lb maps to `30LB (Standard)`
   - 45lb maps to `45LB (Long)`
3. Extracts the page text with line breaks (for example, using `soup.get_text("\n", strip=True)`).
4. Uses a regex pattern to find a line that looks like:

   `20LB (Standard) - $XX.XX`

5. Returns the matching price string (`$XX.XX`) if found, otherwise `None`.
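
The mapping and line-matching steps above can be sketched without any network access. To stay self-contained, this version takes the already-extracted page text as a string rather than a `soup` object; the name `find_variant_price` and the sample text layout are assumptions for illustration. The two prices are the ones shown in this lab's sample output, and the sketch does not distinguish Expert plates from standard ones.

```python
import re
from typing import Optional

# Weight in the product name -> variant label, as listed in the lab text.
VARIANT_LABELS = {
    "10": "10LB (Standard)",
    "20": "20LB (Standard)",
    "30": "30LB (Standard)",
    "45": "45LB (Long)",
}


def find_variant_price(page_text: str, product_name: str) -> Optional[str]:
    """Find the price on the line matching the row's weight variant."""
    # Pull the weight out of names such as "Ruck Plate 20lb".
    m = re.search(r"\b(10|20|30|45)\s*lb\b", product_name, flags=re.IGNORECASE)
    if not m:
        return None
    label = VARIANT_LABELS[m.group(1)]
    # re.escape keeps the parentheses in the label literal.
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r"\s*-\s*(\$[\d,]+(?:\.\d{2})?)\s*$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    hit = pattern.search(page_text)
    return hit.group(1) if hit else None


# Hypothetical text, shaped like the output of soup.get_text("\n", strip=True).
sample_text = "Ruck Plates\n10LB (Standard) - $59.00\n20LB (Standard) - $79.00"

print(find_variant_price(sample_text, "Ruck Plate 20lb"))  # → $79.00
print(find_variant_price(sample_text, "Ruck Plate 45lb"))  # → None
```

Returning `None` when no line matches lets the calling code record a missing field instead of silently storing the wrong variant's price.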

**Checkpoint**

By the end of Question 2, you should be able to:

- Fetch HTML from a URL using `fetch_html`.
- Parse it with BeautifulSoup.
- Extract a title using `extract_title`.
- Extract a price using either:
  - `extract_price_generic`, or
  - `extract_ruck_plate_variant_price`

In [9]:
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# -----------------------------
# Scraping helpers
# -----------------------------
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/121.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
})

def fetch_html(url: str, timeout: int = 20) -> str:
    resp = session.get(url, timeout=timeout, allow_redirects=True)
    resp.raise_for_status()
    return resp.text

def fetch_shopify_product_json(url: str, timeout: int = 20) -> dict:
    """
    Shopify product JSON endpoint.
    Example: https://site.com/products/foo -> https://site.com/products/foo.js
    """
    js_url = url.split("?", 1)[0].rstrip("/") + ".js"
    resp = session.get(js_url, timeout=timeout, allow_redirects=True, headers={"Accept": "application/json"})
    resp.raise_for_status()
    return resp.json()

def extract_title(soup: BeautifulSoup) -> str | None:
    h1 = soup.find("h1")
    return h1.get_text(" ", strip=True) if h1 else None

def extract_price_generic(soup: BeautifulSoup) -> str | None:
    """
    Generic Shopify-style price extraction.
    Works for many GORUCK product pages.
    """
    price_el = (
        soup.select_one("div.product-block__price span.price-item--sale.price-item--last")
        or soup.select_one("div.product-block__price span.price-item--sale")
        or soup.select_one("div.product-block__price span.price-item--regular")
    )
    if price_el:
        return price_el.get_text(" ", strip=True)

    # Fallback: take the first dollar amount in the page text (this may be a banner or unrelated price)
    text = soup.get_text(" ", strip=True)
    m = re.search(r"\$([\d,]+(?:\.\d{2})?)", text)
    return f"${m.group(1)}" if m else None

def extract_ruck_plate_variant_price(product_url: str, product_name: str) -> str | None:
    """
    Ruck Plates page is variant-driven and may not render variant prices in static HTML.
    Pull variant prices from the Shopify .js endpoint and select the best match.
    """
    data = fetch_shopify_product_json(product_url)
    variants = data.get("variants", [])

    name_lower = product_name.lower()
    want_expert = "expert" in name_lower

    # Extract desired weight from the product name (10/20/30/45), with or without an "lb" suffix
    m = re.search(r"\b(10|20|30|45)\s*(?:lb)?\b", name_lower)
    if not m:
        return None
    weight = m.group(1)

    def score_variant(v: dict) -> int:
        title = str(v.get("title", "")).lower()
        score = 0
        if weight in title:
            score += 50
        if "lb" in title:
            score += 5
        if want_expert and "expert" in title:
            score += 20
        if (not want_expert) and ("standard" in title):
            score += 10
        if "long" in title:
            score += 3
        return score

    best = None
    best_score = -1
    for v in variants:
        sc = score_variant(v)
        if sc > best_score:
            best_score = sc
            best = v

    if not best or best_score < 50:
        return None

    # Shopify returns cents
    cents = best.get("price")
    if cents is None:
        return None
    return f"${cents/100:.2f}"

### Question 3: Scrape all products and store results

Write code that loops through every row in `df` and produces a new DataFrame named `out`.

**Instructions**

For each row in `df`:

1. Create a `record` dictionary with these keys:
   - `brand_name`
   - `product_name`
   - `product_url`
   - `title` (default `None`)
   - `price` (default `None`)
   - `status` (default `"error"`)
2. Fetch HTML from the row’s URL using `fetch_html`.
3. Create `soup = BeautifulSoup(html, "html.parser")`.
4. Extract title with `extract_title(soup)`.
5. Extract price using:
   - `extract_ruck_plate_variant_price` if the product name contains `"ruck plate"` (case-insensitive)
   - otherwise use `extract_price_generic`
6. Set:
   - `status = "success"` if both title and price exist
   - `status = "missing_field"` if the page was fetched but one of the fields is missing
7. Wrap scraping logic in `try/except`.
   - On exceptions, store the exception type in `status` (for example `error: HTTPError`).
8. Append each record to a list named `results`.
9. Add a short `time.sleep(1)` delay inside the loop.

Finally, convert `results` into a DataFrame named `out` and display it.
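
The status rules in steps 6 and 7 can be checked offline before any real page is scraped. The sketch below is one possible arrangement, not the required one: `build_record` and the three stub scrapers are hypothetical names, and the scraping step is passed in as a function so that success, a missing field, and an exception can each be simulated.

```python
from typing import Callable, Optional, Tuple

ScrapeFn = Callable[[str, str], Tuple[Optional[str], Optional[str]]]


def build_record(row: dict, scrape: ScrapeFn) -> dict:
    """Build one result record; `scrape` returns (title, price) or raises."""
    record = {
        "brand_name": row["brand_name"],
        "product_name": row["product_name"],
        "product_url": row["product_url"],
        "title": None,
        "price": None,
        "status": "error",
    }
    try:
        record["title"], record["price"] = scrape(row["product_url"], row["product_name"])
        if record["title"] and record["price"]:
            record["status"] = "success"
        else:
            record["status"] = "missing_field"
    except Exception as exc:
        # Keep the loop going and record which kind of failure occurred.
        record["status"] = f"error: {type(exc).__name__}"
    return record


# Stub scrapers that simulate the three outcomes without touching the network.
def ok_scrape(url: str, name: str):
    return ("GR1 USA - Cordura", "$335.00")


def no_price_scrape(url: str, name: str):
    return ("Ruck Plates", None)


def failing_scrape(url: str, name: str):
    raise TimeoutError("simulated timeout")


row = {
    "brand_name": "GORUCK",
    "product_name": "GR1 26L",
    "product_url": "https://www.goruck.com/products/gr1-usa",
}
for fn in (ok_scrape, no_price_scrape, failing_scrape):
    print(build_record(row, fn)["status"])
# → success
# → missing_field
# → error: TimeoutError
```

Because the default status is `"error"` and the fields default to `None`, every row produces a complete record even when the fetch fails.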

In [10]:
# -----------------------------
# Run the scrape
# -----------------------------
results = []

for _, row in df.iterrows():
    record = row.to_dict()
    record.update({"title": None, "price": None, "status": "error"})

    try:
        html = fetch_html(record["product_url"])
        soup = BeautifulSoup(html, "html.parser")

        record["title"] = extract_title(soup)

        if "ruck plate" in record["product_name"].lower():
            # plates live on one product page, pull the requested variant price via .js endpoint
            record["price"] = extract_ruck_plate_variant_price(record["product_url"], record["product_name"])
        else:
            record["price"] = extract_price_generic(soup)

        if record["title"] and record["price"]:
            record["status"] = "success"
        else:
            record["status"] = "missing_field"

    except Exception as e:
        record["status"] = f"error: {type(e).__name__}"

    results.append(record)
    time.sleep(1)

out = pd.DataFrame(results)
out

| | brand_name | product_name | product_url | title | price | status |
|---|---|---|---|---|---|---|
| 0 | GORUCK | GR1 26L | https://www.goruck.com/products/gr1-usa | GR1 USA - Cordura | $335.00 | success |
| 1 | GORUCK | GR2 34L | https://www.goruck.com/products/gr2 | GR2 - Cordura | $385.00 | success |
| 2 | GORUCK | GR2 40L | https://www.goruck.com/products/gr2 | GR2 - Cordura | $385.00 | success |
| 3 | GORUCK | Rucker 4.0 20L | https://www.goruck.com/products/rucker | Rucker 4.0 | $265.00 | success |
| 4 | GORUCK | Rucker 4.0 25L | https://www.goruck.com/products/rucker | Rucker 4.0 | $265.00 | success |
| 5 | GORUCK | Bullet Ruck 15L | https://www.goruck.com/products/bullet-ruck-15l | Bullet Ruck Classic - Ballistic Nylon Cordura | $160.00 | success |
| 6 | GORUCK | GR3 45L | https://www.goruck.com/products/gr3 | GR3 - Cordura | $395.00 | success |
| 7 | GORUCK | Radio Ruck | https://www.goruck.com/products/radio-ruck-thr... | Radio Ruck USA Throwback | $295.00 | success |
| 8 | GORUCK | Ruck Plate 10lb | https://www.goruck.com/products/ruck-plates | Ruck Plates | $59.00 | success |
| 9 | GORUCK | Ruck Plate 20lb | https://www.goruck.com/products/ruck-plates | Ruck Plates | $79.00 | success |


### Question 4: Summarize and validate results

Write a short report that answers:

- How many rows did you attempt?
- How many rows succeeded?
- How many rows failed?
- What is the success rate?

**Instructions**

1. Count total rows in `out`.
2. Count rows where `status == "success"`.
3. Count rows where `status != "success"`.
4. Print a success rate as a percentage.
5. Display a final view of:

- `product_name`
- `product_url`
- `title`
- `price`
- `status`
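
The counting steps can be sketched on a plain list of status strings, which is what `out["status"].tolist()` would give you. The helper name `summarize` and the sample statuses are hypothetical; the sketch treats every status other than `"success"` as a failure, as step 3 describes, and guards against an empty result set.

```python
from typing import Dict, List, Union


def summarize(statuses: List[str]) -> Dict[str, Union[int, float]]:
    """Count attempted, succeeded, and failed rows, plus the success rate."""
    total = len(statuses)
    succeeded = sum(1 for s in statuses if s == "success")
    failed = total - succeeded  # every row whose status is not "success"
    rate = succeeded / total if total else 0.0  # avoid dividing by zero
    return {"total": total, "succeeded": succeeded, "failed": failed, "success_rate": rate}


# Made-up statuses for illustration; real values come from the `status` column.
report = summarize(["success", "success", "missing_field", "error: HTTPError"])
print(report["total"], report["succeeded"], report["failed"])  # → 4 2 2
print(f"Success rate: {report['success_rate']:.1%}")           # → Success rate: 50.0%
```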

In [11]:
# -----------------------------
# Summary report
# -----------------------------
total = len(out)
success = (out["status"] == "success").sum()
missing = (out["status"] == "missing_field").sum()
failed = total - success - missing

print(f"Total rows: {total}")
print(f"Success: {success}")
print(f"Missing field: {missing}")
print(f"Failed: {failed}")
print(f"Success rate: {success / total:.1%}" if total else "Success rate: n/a (no rows)")

out[["product_name", "product_url", "title", "price", "status"]]


Total rows: 15
Success: 15
Missing field: 0
Failed: 0
Success rate: 100.0%


| | product_name | product_url | title | price | status |
|---|---|---|---|---|---|
| 0 | GR1 26L | https://www.goruck.com/products/gr1-usa | GR1 USA - Cordura | $335.00 | success |
| 1 | GR2 34L | https://www.goruck.com/products/gr2 | GR2 - Cordura | $385.00 | success |
| 2 | GR2 40L | https://www.goruck.com/products/gr2 | GR2 - Cordura | $385.00 | success |
| 3 | Rucker 4.0 20L | https://www.goruck.com/products/rucker | Rucker 4.0 | $265.00 | success |
| 4 | Rucker 4.0 25L | https://www.goruck.com/products/rucker | Rucker 4.0 | $265.00 | success |
| 5 | Bullet Ruck 15L | https://www.goruck.com/products/bullet-ruck-15l | Bullet Ruck Classic - Ballistic Nylon Cordura | $160.00 | success |
| 6 | GR3 45L | https://www.goruck.com/products/gr3 | GR3 - Cordura | $395.00 | success |
| 7 | Radio Ruck | https://www.goruck.com/products/radio-ruck-thr... | Radio Ruck USA Throwback | $295.00 | success |
| 8 | Ruck Plate 10lb | https://www.goruck.com/products/ruck-plates | Ruck Plates | $59.00 | success |
| 9 | Ruck Plate 20lb | https://www.goruck.com/products/ruck-plates | Ruck Plates | $79.00 | success |
