# Topic 7 — Web Automation with Selenium

## 1. Goals of This Notebook
In this notebook you will learn:
- **Why Selenium exists** and when it is needed for scraping or automation.  
  Selenium controls a real browser and can render dynamic JavaScript-based content that normal `requests` cannot retrieve.
- **How to install and configure Selenium on Windows**, including notes on drivers, PATH issues, and running inside Jupyter.
- **How to create browser sessions**, navigate, extract data, wait for elements, click buttons, scroll pages, and automate tasks.
- **A full scraping example** using a safe demo website.

---

## 2. Comparison With Requests + BeautifulSoup (Video 6 Recap)

### When Requests + BS4 *is enough*:
- Static HTML pages  
- Simple extraction  
- No interactions  
- No JavaScript rendering  
- Very fast and lightweight  

### When Requests + BS4 *fails*:
- Page loads content **only after JavaScript runs**
- Websites with pagination that loads dynamically
- Login flows that depend on JavaScript (a plain form with a CSRF token can still be handled with a `requests.Session`)
- Infinite scrolling / lazy loading
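
To make the first failure mode concrete, here is a small offline sketch of what a non-JavaScript client can see. Both HTML snippets are invented for illustration (loosely modeled on a quotes page, not real responses), and `QuoteCounter` and `count_quotes` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Invented sample pages (assumptions for illustration, not real responses).
STATIC_HTML = """
<html><body>
  <div class="quote"><span class="text">First quote</span></div>
  <div class="quote"><span class="text">Second quote</span></div>
</body></html>
"""

JS_RENDERED_HTML = """
<html><body>
  <script>
    var data = [{"text": "First quote"}, {"text": "Second quote"}];
    // A browser would run code here that builds the quote elements.
  </script>
</body></html>
"""


class QuoteCounter(HTMLParser):
    """Counts <div class="quote"> tags, the way a static parser sees them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


def count_quotes(html):
    parser = QuoteCounter()
    parser.feed(html)
    return parser.count


print(count_quotes(STATIC_HTML))       # 2: the quotes are in the markup
print(count_quotes(JS_RENDERED_HTML))  # 0: the quotes exist only as script data
```

The second page does contain the data, but only inside a script. A parser that does not execute JavaScript never sees the rendered elements.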

### Why Selenium:
- Executes JavaScript  
- Can click, type, scroll, and interact like a real user  
- Good for:
  - dynamic sites  
  - testing workflows  
  - scraping where no clean API exists  

---

## 3. Installing Selenium on Windows

### Step 1 — Install Selenium Python package
Run in terminal or inside Jupyter:

```
pip install selenium
```

### Step 2 — Browser Driver (Modern Approach: Selenium Manager)
Since Selenium 4.6, the bundled Selenium Manager **automatically** downloads the correct browser driver for:
- Chrome  
- Edge  
- Firefox  

**No need to separately download chromedriver.exe.**

### Step 3 — Common Windows Pitfalls
- **PATH issues**: older Selenium tutorials ask you to download a driver manually and add it to PATH. With Selenium Manager you can skip those steps.
- **Antivirus blocking**: some antivirus tools block automated browser control.
- **Browser version mismatch**: Selenium Manager picks a driver that matches the installed browser, which avoids most mismatch errors.
- **Running inside Jupyter**:
  - The browser launches in a separate system window, not inside the notebook
  - Some interactive window features may not work when driven from a notebook

### Step 4 — Verify Installation
We’ll run a minimal script below.
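
Before launching a browser, it can help to confirm which Selenium version is installed. The sketch below uses only the standard library. The helper names `installed_selenium_version` and `has_selenium_manager` are hypothetical, and the 4.6 threshold reflects the release in which Selenium Manager first shipped.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_selenium_version():
    """Return the installed selenium version string, or None if missing."""
    try:
        return version("selenium")
    except PackageNotFoundError:
        return None


def has_selenium_manager(version_string):
    """True if the version is 4.6 or newer."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (4, 6)


found = installed_selenium_version()
if found is None:
    print("selenium is not installed; run: pip install selenium")
elif has_selenium_manager(found):
    print(f"selenium {found}: driver download is automatic")
else:
    print(f"selenium {found}: upgrade with: pip install --upgrade selenium")
```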

---

## 4. Choosing a Practice Website

We will use:

### **Quotes to Scrape (JavaScript version)**  
https://quotes.toscrape.com/js

The quotes on this page are rendered by JavaScript, so the HTML returned by a plain `requests` call contains no quote elements. That makes it a good fit for Selenium demos.

---

## 5. Selenium Basics

### 5.1 Creating a Browser Instance
We use Chrome by default. Selenium will handle driver installation.

### 5.2 Opening a Page
We navigate using `.get(url)`.

### 5.3 Locating Elements
Use:
- `By.CSS_SELECTOR`  
- `By.XPATH`  
- `By.CLASS_NAME`  
- `By.TAG_NAME`

### 5.4 Extracting Content
Every Selenium element supports:
- `.text`  
- `.get_attribute("href")`  
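
As a rough sketch of locating and extracting together, the helper below turns located elements into plain dictionaries using `.text` and `.get_attribute("href")`. The names `link_records` and `print_page_links` are hypothetical. The helper only assumes objects that behave like Selenium elements, so it can be tried without a browser.

```python
def link_records(elements):
    """Turn WebElement-like objects into plain dicts of text and href."""
    records = []
    for el in elements:
        href = el.get_attribute("href")  # None when the attribute is absent
        if href:
            records.append({"text": el.text.strip(), "href": href})
    return records


def print_page_links(url="https://quotes.toscrape.com/js"):
    # Imported here so link_records can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # By.CSS_SELECTOR is one of the strategies listed in 5.3;
        # By.XPATH with "//a[@href]" would select the same elements.
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href]")
        for record in link_records(anchors):
            print(record)
    finally:
        driver.quit()
```

Calling `print_page_links()` opens a Chrome window, so it is left uncalled here.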

### 5.5 Waiting for Dynamic Content
Use:  
`WebDriverWait(driver, timeout).until(condition)`
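
The `condition` passed to `until` can be one of Selenium's built-in expected conditions or any callable that takes the driver and returns a truthy value when ready. The sketch below shows both. `at_least` and `wait_for_quotes` are hypothetical names, and the threshold of 5 quotes is an arbitrary choice for illustration.

```python
def at_least(locator, minimum):
    """Build a wait condition: truthy once `minimum` elements match `locator`."""

    def condition(driver):
        found = driver.find_elements(*locator)
        # WebDriverWait keeps polling while the condition returns a falsy value.
        return found if len(found) >= minimum else False

    return condition


def wait_for_quotes(driver, timeout=10):
    # Imported here so at_least stays usable without selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Built-in condition: at least one matching element exists in the DOM.
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))
    # Custom condition: until() returns whatever truthy value it produces.
    return wait.until(at_least((By.CLASS_NAME, "quote"), 5))
```

If the condition never becomes truthy within `timeout` seconds, `until` raises `TimeoutException`, which is usually easier to debug than a fixed `time.sleep` that silently runs too short.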

### 5.6 Interactions
Selenium supports:
- `element.click()`
- `element.send_keys()`
- `driver.execute_script()` to run JavaScript in the page, for example to scroll
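
A minimal sketch of typing and clicking, under stated assumptions: the practice site is assumed to have a login form at `/login` with fields whose ids are `username` and `password`, and the credentials are placeholders. `fill_and_submit` and `demo_login` are hypothetical names.

```python
def fill_and_submit(driver, fields, submit_locator):
    """Type a value into each located field, then click the submit control."""
    for locator, value in fields:
        box = driver.find_element(*locator)
        box.clear()  # remove any pre-filled text first
        box.send_keys(value)
    driver.find_element(*submit_locator).click()


def demo_login():
    # Imported here so fill_and_submit has no hard dependency on selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Assumed URL and field ids; adjust them to the form you automate.
        driver.get("https://quotes.toscrape.com/login")
        fill_and_submit(
            driver,
            fields=[((By.ID, "username"), "demo"), ((By.ID, "password"), "demo")],
            submit_locator=(By.CSS_SELECTOR, "input[type='submit']"),
        )
        print("Now at:", driver.current_url)
    finally:
        driver.quit()
```

Calling `demo_login()` launches Chrome, so it is left uncalled here.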


---


```python
from selenium import webdriver

# Minimal test script: open python.org and print the page title
driver = webdriver.Chrome()  # Selenium Manager resolves the driver automatically
try:
    driver.get("https://www.python.org")
    print("Title:", driver.title)
finally:
    driver.quit()  # always close the browser, even if the page fails to load
```


---

## 6. Practical Selenium Demonstration: Scraping Quotes

We target **https://quotes.toscrape.com/js**  
This version renders the quotes with JavaScript, so a real browser is needed to see them.

Process:
1. Open page  
2. Wait for quotes to load  
3. Extract quote text + author  
4. Click “Next”  
5. Repeat for 3 pages  
6. Save results to CSV  

---


```python
import csv

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
all_quotes = []

try:
    driver.get("https://quotes.toscrape.com/js")
    wait = WebDriverWait(driver, 10)

    for page in range(3):  # scrape the first 3 pages
        print(f"Scraping page {page + 1}")

        # Wait until the JavaScript-rendered quote elements are in the DOM
        quotes = wait.until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "quote"))
        )

        for q in quotes:
            text = q.find_element(By.CLASS_NAME, "text").text
            author = q.find_element(By.CLASS_NAME, "author").text
            all_quotes.append([text, author])

        # Click "Next" if it exists; stop when there are no more pages
        try:
            next_btn = driver.find_element(By.CSS_SELECTOR, "li.next > a")
        except NoSuchElementException:
            print("No next page.")
            break
        next_btn.click()
        # Wait for the old page to go away so stale elements are not re-read
        wait.until(EC.staleness_of(quotes[0]))
finally:
    driver.quit()  # close the browser even if a step above fails

# Save results
with open("quotes_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author"])
    writer.writerows(all_quotes)

all_quotes[:5]  # show sample
```


---

## 7. Scrolling & Screenshot Example

Useful for:
- Infinite scroll pages
- Triggering lazy-loaded content
- Capturing evidence of state
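
For infinite-scroll pages, a single scroll is often not enough. One common pattern, sketched here under the assumption that new content makes the page taller, is to keep scrolling until `document.body.scrollHeight` stops changing. `scroll_until_stable` is a hypothetical helper name, and it accepts any object with an `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that loaded more content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended
        last_height = new_height
        rounds += 1
    return rounds
```

With a real driver, call `scroll_until_stable(driver)` after `driver.get(...)`. The `max_rounds` cap keeps the loop from running forever on pages that never stop loading.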

---


```python
import time

from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get("https://quotes.toscrape.com/js")

    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # fixed pause so lazy-loaded content has time to appear

    # Save a PNG of the current browser view
    driver.save_screenshot("selenium_screenshot.png")
finally:
    driver.quit()
```


---

## 8. Cleanup, Tips & Future Work

- Selenium is slower than Requests/BS4 — use only when necessary.  
- Combine both approaches:
  - Selenium to log in / load dynamic HTML
  - Requests to fetch data behind the authenticated session, reusing the browser's cookies
- For production, run Selenium in:
  - **Headless mode**
  - **Docker containers**
  - **Selenium Grid** for parallel browsing  
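
As a minimal sketch of the headless option, the snippet below builds a Chrome session that runs without a visible window. `headless_arguments` and `make_headless_chrome` are hypothetical names, the window size is an arbitrary default, and the `--headless=new` flag assumes a reasonably recent Chrome (109 or later).

```python
def headless_arguments(width=1920, height=1080):
    """Command-line flags for running Chrome without a visible window."""
    return [
        "--headless=new",                   # Chrome's newer headless mode
        f"--window-size={width},{height}",  # fixed viewport for stable layouts
    ]


def make_headless_chrome():
    # Imported here so the flag list can be inspected without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in headless_arguments():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

The returned driver is used like any other session and should still be closed with `driver.quit()`.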

End of Notebook.
