# Web Scraping Tutorial - Python Basics
**Dasha Ageikina, Ph.D.**

Glynmoran

[LinkedIn](https://www.linkedin.com/in/dariaageikina/)

## We’ll cover:
- The basics of web scraping  
- 3 web scraping examples in Python  
  - You can do web scraping in R too, but I recommend learning Python because it is:  
    - a universal language for most industry data jobs  
    - easy to debug and troubleshoot and has many good resources online  
    - more efficient than R  
- Cookies and other considerations in web scraping  
- Challenges  

> **Disclaimer:** This is not a comprehensive tutorial on web scraping since it’s a very big topic. I focus on what was useful for me. There may be better/more efficient solutions for some projects – please share if you find those!


## Definitions

- **Web scraping** – automated process of extracting data from websites.  
- **HTTP (HyperText Transfer Protocol)** – protocol used by the web to transfer data between clients (your browser or Python program) and servers (website hosts).  
- **URL (Uniform Resource Locator)** – web address (link to the website).  
- **HTML (HyperText Markup Language)** – standard language used to create websites.  
- **CSS (Cascading Style Sheets)** – language for formatting HTML elements (colors, fonts, etc.)
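
To make the URL definition concrete, the sketch below splits the pesticide-label URL used later in this tutorial into its parts with Python's standard `urllib.parse` module. The variable names are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

url = "https://apps.cdpr.ca.gov/cgi-bin/label/labrep.pl?fmt=1&63069=on"
parts = urlparse(url)

print(parts.scheme)  # protocol used for the request
print(parts.netloc)  # host name of the server
print(parts.path)    # location of the resource on the server
print(parts.query)   # parameters sent along with the request

# Turn the query string into a dictionary of parameter -> list of values
params = parse_qs(parts.query)
print(params)
```

Changing a query parameter (here, the product number `63069`) is often all it takes to request a different page from the same site.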

> To see a website in HTML using Google Chrome:  
> - Mac: `View → Developer → Developer Tools`  
> - Windows: `Three dots (⋮) → More tools → Developer Tools`


## Some HTML Tags

```html
<html>   <!-- The root element wrapping the entire HTML document -->
<head>   <!-- Meta-information like title, links to scripts/styles -->
<body>   <!-- All visible content of the web page -->
<nav>    <!-- Navigation menus -->
<div>    <!-- Generic container -->
<table>  <!-- Tabular data -->
<tr>     <!-- Table row -->
<td>     <!-- Table cell -->
<a>      <!-- Hyperlink -->
```
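
These tags nest inside one another to form a tree, and scraping mostly means finding the right branch of that tree. As a rough illustration, the sketch below uses Python's built-in `html.parser` (not one of the libraries used later in this tutorial) to print the tag hierarchy of a small made-up page. The class name `TagTreePrinter` and the sample page are invented for this example.

```python
from html.parser import HTMLParser

# A tiny made-up page that uses the tags listed above
SAMPLE_PAGE = """
<html>
  <head><title>Demo</title></head>
  <body>
    <nav><a href="/home">Home</a></nav>
    <div>
      <table>
        <tr><td>A</td><td>B</td></tr>
      </table>
    </div>
  </body>
</html>
"""

class TagTreePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

printer = TagTreePrinter()
printer.feed(SAMPLE_PAGE)

# Print one tag per line, indented by its depth in the tree
for depth, tag in printer.outline:
    print("  " * depth + f"<{tag}>")
```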


## The Main Web Scraping Steps in Python

1. Send an HTTP request (GET) to the URL and save the response.  
2. The response object contains the HTML content of the site, which usually has a complex nested structure.  
3. Parse the HTML using libraries like:
   - `lxml`
   - `BeautifulSoup`
   - `html5lib`
4. Sometimes, you need to interact with the website using **Selenium**.


## ✅ Problem 1 – Download Data from Files with Links

Sometimes we can download data directly through file links, like [here](https://files.cdpr.ca.gov/pub/outgoing/pur_archives/).

Let's write a program that downloads the files **pur1974.zip** and **pur1975.zip** for us.


## 💡 Problem 1 – Python Solution

1. If you haven't installed the `requests` library yet, do that first by running `pip install requests`. Do the same for the other third-party libraries used below: `lxml`, `beautifulsoup4` (imported as `bs4`), and `selenium`. The `re` and `time` modules ship with Python and need no installation.
2. Set `PATH_TO_YOUR_FOLDER` to the path of the folder where you want the files to be downloaded.
3. Run the code and check that the files are in your folder.

```python
import requests

PATH_TO_YOUR_FOLDER = "/Users/dariaageikina/Downloads"

# Loop over years (range excludes the upper bound, so this covers 1974 and 1975)
for year in range(1974, 1976):
    url = f"https://files.cdpr.ca.gov/pub/outgoing/pur_archives/pur{year}.zip"
    filename = f"{PATH_TO_YOUR_FOLDER}/pur{year}.zip"

    print(f"Downloading {url}...")

    response = requests.get(url)  # connect to the link
    # Write the content of the link into a new file.
    # "wb" opens the file for writing in binary mode, which is needed for non-text files such as ZIP archives.
    with open(filename, "wb") as f:
        f.write(response.content)
```

```
Downloading https://files.cdpr.ca.gov/pub/outgoing/pur_archives/pur1974.zip...
Downloading https://files.cdpr.ca.gov/pub/outgoing/pur_archives/pur1975.zip...
```
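
A download can finish without any Python error and still not be a usable archive, for example if the server sends back an HTML page in place of the file. One possible check, sketched here with the standard `zipfile` module, confirms that a saved file is a real ZIP archive and lists its contents. The helper name `inspect_zip` is hypothetical, and the demo files are created locally, so nothing is downloaded.

```python
import os
import tempfile
import zipfile

def inspect_zip(path):
    """Return the file names inside a ZIP archive, or None if the file is not a valid ZIP."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

# Demo on two local files: a real archive and an HTML page saved with a .zip name
demo_dir = tempfile.mkdtemp()

good_path = os.path.join(demo_dir, "good.zip")
with zipfile.ZipFile(good_path, "w") as archive:
    archive.writestr("data.txt", "example contents")

bad_path = os.path.join(demo_dir, "bad.zip")
with open(bad_path, "wb") as f:
    f.write(b"<html>404 Not Found</html>")

print(inspect_zip(good_path))  # ['data.txt']
print(inspect_zip(bad_path))   # None
```

In the download loop, calling `inspect_zip(filename)` right after each file is written would flag failed downloads early.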


## ✅ Problem 2 – Parse No-Link Elements from a Website

We want to collect the pesticide type of [this pesticide](https://apps.cdpr.ca.gov/cgi-bin/label/labrep.pl?fmt=1&63069=on).

Our program should report that it is a miticide and an insecticide.

Steps in Chrome:
1. Open the link in Chrome  
2. Right-click the element → Inspect  
3. Right-click the code → Copy → ...


## 💡 Problem 2 – Python Solution 1 (XPath + lxml)
- Copy the **full XPath** (the address of the element).
- Best when extracting a single element (one table in our case) with a clear XPath.

```python
import re

import requests
from lxml import html

url = 'https://apps.cdpr.ca.gov/cgi-bin/label/labrep.pl?fmt=1&63069=on'
response = requests.get(url)

# Convert the raw HTML content into a tree-like structure that we can parse
tree = html.fromstring(response.content)
# Paste the XPath that you copied manually from Chrome Developer Tools
xpath = '/html/body/div/main/div/div[2]/div[1]/table/tbody'
# Get the element that we need (xpath() returns a list of matches)
element = tree.xpath(xpath)

# Convert the extracted element into text
pesticide_types = element[0].text_content().strip()
# Keep only words made of three or more uppercase letters
print(re.findall(r'\b[A-Z]{3,}\b', pesticide_types))
```

```
['MITICIDE', 'INSECTICIDE']
```
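
The regular expression in the last step does the filtering: `\b[A-Z]{3,}\b` matches whole words made of three or more uppercase letters, which keeps the type names and drops the short codes next to them. Here is a small sketch on a made-up string; the text of the real table may contain different whitespace and extra entries.

```python
import re

# Invented text that resembles the table contents: short codes followed by type names
sample_text = "O0  MITICIDE \n N0  INSECTICIDE"

pattern = r"\b[A-Z]{3,}\b"  # \b = word boundary, [A-Z]{3,} = at least 3 uppercase letters
matches = re.findall(pattern, sample_text)
print(matches)  # ['MITICIDE', 'INSECTICIDE']
```

Note the limits of this pattern: a mixed-case word such as "Insecticide" would be skipped, and any unrelated uppercase word of three or more letters would be picked up.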

## 💡 Problem 2 – Python Solution 2 (Selectors + BeautifulSoup)

- Use tags/selectors when extracting many elements from different parts of HTML.
- Requires more HTML structure investigation.

In our case, we happen to need the first table on the page:
- `soup.find("table")` returns the first table on the page. Similarly, `soup.find("a")` would return the first hyperlink on the page.
- Extract all cells of that table and pick the relevant ones: the cells at indexes 1 and 3. Python starts counting from 0, and the cells at indexes 0 and 2 hold the codes "O0" and "N0".

If we needed to extract all tables from the website, we would run `soup.find_all("table")`.


```python
import requests
from bs4 import BeautifulSoup

url = 'https://apps.cdpr.ca.gov/cgi-bin/label/labrep.pl?fmt=1&63069=on'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

table = soup.find('table')  # we are lucky because we need the first table
entries = table.find_all('td')  # all cells of that table, in document order
print(entries[1].text.strip())
print(entries[3].text.strip())
```

```
MITICIDE
INSECTICIDE
```
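
To illustrate the difference between `find`, `find_all`, and CSS selectors, here is a short sketch on a made-up HTML snippet. The table contents and the `id` attributes are invented to resemble the page above; they are not copied from it.

```python
from bs4 import BeautifulSoup

# Invented snippet with two tables
SAMPLE_HTML = """
<div>
  <table id="types">
    <tr><td>O0</td><td>MITICIDE</td></tr>
    <tr><td>N0</td><td>INSECTICIDE</td></tr>
  </table>
  <table id="other">
    <tr><td>ignored</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_table = soup.find("table")     # first match only
all_tables = soup.find_all("table")  # list of every match
print(len(all_tables))

# Text of every cell in the first table
cells = [td.get_text(strip=True) for td in first_table.find_all("td")]
print(cells)
print(cells[1::2])  # every second cell, starting at index 1

# The same cells, located with a CSS selector instead of find/find_all
selected = [td.get_text(strip=True) for td in soup.select("table#types td")]
print(selected == cells)
```

Selecting by a stable attribute such as an `id`, when the page has one, avoids depending on the table being the first one on the page.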


## ✅ Problem 3 – Interact with Elements on the Website

We want the program to:
- Enter `63069` into a search bar [here](https://apps.cdpr.ca.gov/docs/label/epanum.cfm)
- Press the “Submit” button and get us to the next webpage.


## 💡 Problem 3 – Python Solution with Selenium

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the browser and the website
driver = webdriver.Chrome()
driver.get("https://apps.cdpr.ca.gov/docs/label/epanum.cfm")

# Wait for the page to fully load
time.sleep(5)

# Locate the input field (its name was found through Chrome inspection)
epa_input = driver.find_element(By.NAME, "p_epas")
epa_input.clear()
epa_input.send_keys("63069")

time.sleep(2)

# Press the submit button
submit_button = driver.find_element(By.XPATH, "/html/body/div/main/div/div[2]/form/input[3]")
submit_button.click()

# Keep the results page open for a few seconds before closing the browser
time.sleep(10)

driver.quit()
```

## Selenium

- Mainly used for web testing  
- Can find elements by: ID, name, XPath, link text, partial link text, tag name, class name, CSS selector  
- You can:
  - Press buttons
  - Fill in forms with text
  - Scroll, drag and drop, navigate forward and back
  - Wait for elements to load  

A good [resource](https://selenium-python.readthedocs.io/index.html) to learn more.


## Additional Considerations

The examples above are very basic. In practice, we would also need to:
- Add clauses to handle errors
- Add a browser user agent to mimic regular user behavior in a browser  
  - Check out the [list of user-agent strings](https://deviceatlas.com/blog/list-of-user-agent-strings)
- Add pauses between requests to avoid being blocked for flooding the website
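
One way these three points could fit together is sketched below: a helper that sends a User-Agent header, retries failed requests, and pauses between attempts. The function name `polite_get`, the retry count, and the pause length are illustrative assumptions, not recommended values.

```python
import time

import requests

# Example User-Agent string; replace it with one that matches your own browser
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
                  "(KHTML, like Gecko) Version/18.3.1 Safari/605.1.15"
}

def polite_get(session, url, retries=3, pause=2.0, timeout=30):
    """Return the response for `url`, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()  # raises an error on 4xx/5xx status codes
            return response
        except requests.RequestException as error:
            print(f"Attempt {attempt} of {retries} failed: {error}")
            if attempt < retries:
                time.sleep(pause)  # wait before trying again
    return None
```

When looping over many URLs, a call such as `polite_get(requests.Session(), url)` followed by a `time.sleep(...)` between pages keeps the request rate low.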

## Cookies

- Store session info and user preferences  
- Websites use them to track sessions and recognize bots  
- Without proper handling, you may be logged out or see different content than expected
- Setting proper cookies can help mimic a real browser
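
As a small illustration of the first point, a `requests.Session` keeps a cookie jar whose contents are sent along with later requests made through that session. The sketch below sets a cookie by hand to show where cookies live. The cookie name and value are made up; in real use, cookies are normally set by the server's responses.

```python
import requests

session = requests.Session()

# Hypothetical cookie, set manually for demonstration only
session.cookies.set("sessionid", "abc123", domain="apps.cdpr.ca.gov", path="/")

# Inspect what the session currently stores
print(session.cookies.get_dict())
for cookie in session.cookies:
    print(cookie.name, cookie.value, cookie.domain)
```

After a real request, `response.cookies` holds the cookies set by that response, and `session.cookies` accumulates them across requests.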

📖 [Learn more about cookies](https://scrape.do/blog/web-scraping-cookies/)


## 💡 Problem 2 – Python Solution 1 (XPath + lxml) - UPDATED
If I were to account for some of the considerations above, my script would change to:

```python
import re

import requests
from lxml import html

url = 'https://apps.cdpr.ca.gov/cgi-bin/label/labrep.pl?fmt=1&63069=on'

# Custom headers with a User-Agent; change yours to one compatible with your device.
# Mine is a MacBook M3, but its user agent is the same as for an Intel Mac, so I use this one.
# The two string literals are joined automatically; note the space at the end of the first one.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/18.3.1 Safari/605.1.15'
}

# Use a session to handle cookies automatically
session = requests.Session()
session.headers.update(headers)

response = session.get(url)

# Check if the HTTP response is OK
if response.status_code == 200:
    tree = html.fromstring(response.content)
    xpath = '/html/body/div/main/div/div[2]/div[1]/table/tbody'
    element = tree.xpath(xpath)

    # xpath() returns an empty list when nothing matches
    if element:
        pesticide_types = element[0].text_content().strip()
        pesticide_types = re.findall(r'\b[A-Z]{3,}\b', pesticide_types)
        print(pesticide_types)
    else:
        print("Could not find the target element using XPath.")
else:
    print(f"Request failed with status code: {response.status_code}")
```

```
['MITICIDE', 'INSECTICIDE']
```


## Challenges

- **Anti-bot systems** may block your IP address, usually temporarily
  - Don't scrape too much too quickly
  - Adjust request parameters (headers, cookies, timing) to mimic browser behavior
  - Limit the number of requests per day
  - Consider using this [library](https://github.com/ultrafunkamsterdam/undetected-chromedriver) (first make sure you do not violate any laws)

- **Web scraping can be painfully slow**
  - But it can still run in the background

- **Websites change frequently**
  - You may need to update scripts

- **Different websites have different structures**
  - A script written for one website may look very different from a script for another

- **Requires thorough HTML inspection**
  - Try to find patterns to locate elements
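
On the last points: a full XPath copied from the browser can break as soon as the page layout shifts, while a shorter pattern anchored on a stable feature often survives. The sketch below contrasts the two on a made-up page using `lxml`. The `id='types'` attribute is an assumption for illustration, so inspect the real page for whatever stable attribute it offers.

```python
from lxml import html

# Invented page, before and after the site wraps the table in an extra <div>
OLD_PAGE = "<html><body><div><table id='types'><tr><td>MITICIDE</td></tr></table></div></body></html>"
NEW_PAGE = "<html><body><div><div><table id='types'><tr><td>MITICIDE</td></tr></table></div></div></body></html>"

FULL_XPATH = "/html/body/div/table"      # exact address, copied-from-browser style
PATTERN_XPATH = "//table[@id='types']"   # any table with this id, wherever it sits

results = {}
for name, page in [("old", OLD_PAGE), ("new", NEW_PAGE)]:
    tree = html.fromstring(page)
    # Store how many elements each expression finds on this version of the page
    results[name] = (len(tree.xpath(FULL_XPATH)), len(tree.xpath(PATTERN_XPATH)))

print(results)  # {'old': (1, 1), 'new': (0, 1)}
```

The full path stops matching after the layout change, while the pattern still finds the table.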
