# Web Scraping

## Retrieving and Parsing HTML

In cases where a web API is not available, we may have to manually retrieve HTML pages and extract the relevant data. This process is called **web scraping**.

**Important considerations:**
- Always check the website's `robots.txt` file (e.g., `website.com/robots.txt`) to see if scraping is allowed
- Respect rate limits: add a delay between requests rather than sending them in quick succession
- Consider the website's terms of service
- Be respectful of server resources
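- For production applications, consider using the third-party `requests` library instead of `urllib`: it offers session management and simpler handling of timeouts and headers. Neither library throttles requests for you, so delays still have to be added explicitly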
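
The first two considerations can be sketched with the standard library alone. The helper names `can_fetch_url` and `iter_with_delay`, the sample rules, and the `example.com` URLs below are hypothetical and chosen only for illustration; for a real site, `RobotFileParser.set_url()` followed by `read()` would download the live `robots.txt` instead of parsing a string.

```python
import time
import urllib.robotparser


def can_fetch_url(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file contents as a list of lines (no network access)
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def iter_with_delay(urls, delay_seconds=1.0):
    """Yield each URL in turn, sleeping between consecutive URLs."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        yield url


# hypothetical rules: everything is allowed except paths under /private/
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(can_fetch_url(sample_rules, "http://example.com/customers.html"))     # True
print(can_fetch_url(sample_rules, "http://example.com/private/data.html"))  # False
```

Looping over `iter_with_delay(list_of_urls)` instead of the list itself is one simple way to space out requests; a suitable delay depends on the site.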

For this example, we will use the *Beautiful Soup* package version 4, which is designed to make it easier to extract information from HTML pages using Python.

https://beautiful-soup-4.readthedocs.io/

If the package is not already installed on your machine, you can install it with *pip* from the terminal:

```bash
pip install beautifulsoup4
```

Once the package is installed, we can import it using:

In [1]:
import bs4

First, we will download the HTML source code for our target web page from the URL below. This web page contains personal details for a number of customers:

In [2]:
url = "http://mlg.ucd.ie/modules/python/customers.html"

In [None]:
import urllib.request
import urllib.error

try:
    # attempt to open the specified URL, giving up after 10 seconds
    with urllib.request.urlopen(url, timeout=10) as response:
        # read the response data (bytes) and decode it into a string
        html = response.read().decode("utf-8")
    # display the HTML string
    print(html)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: {e.reason}")
    html = None
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
    html = None
except Exception as e:
    print(f"Unexpected error: {e}")
    html = None
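
The download step assumes that the page is encoded as UTF-8. As a minimal sketch, the hypothetical helper `fetch_html` below reads the character set from the response's `Content-Type` header and falls back to UTF-8 only when none is declared. So that the example runs without network access, it is demonstrated on a `data:` URL, which `urllib` also accepts; with a real page you would pass its `http://` or `https://` address instead.

```python
import urllib.parse
import urllib.request


def fetch_html(url, timeout=10.0, fallback_encoding="utf-8"):
    """Download a page and decode it using the charset declared by the server."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # charset from the Content-Type header, or None if it is not declared
        charset = response.headers.get_content_charset() or fallback_encoding
        # replace undecodable bytes rather than raising UnicodeDecodeError
        return response.read().decode(charset, errors="replace")


# offline stand-in for a real web page (a data: URL, used for illustration only)
demo_url = "data:text/html;charset=utf-8," + urllib.parse.quote("<p>hello</p>")
print(fetch_html(demo_url))
```

The helper still raises `urllib.error.HTTPError` and `urllib.error.URLError` on failure, so it can be wrapped in the same `try`/`except` structure used above.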

Next, we will identify all of the \<div\> tags whose "class" attribute is equal to "person":

In [None]:
# create a BeautifulSoup object which contains the structure of the HTML page
# (html is None if the download failed, so fall back to an empty document)
soup = bs4.BeautifulSoup(html or "", "html.parser")
# now find all the <div> tags we want
person_matches = soup.find_all("div", {"class": "person"})
# were there any matches?
if person_matches:
    print(f"Found {len(person_matches)} person records:")
    for match in person_matches:
        print(match)
else:
    print("No person records found with class 'person'")
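
Beautiful Soup offers more than one way to express the same class filter. The sketch below uses a small made-up HTML fragment (not the real customer page) to compare the dictionary form used above with the `class_` keyword argument and with a CSS selector passed to `select()`. Note that a tag carrying several classes, such as `class="person vip"`, matches `person` in all three forms.

```python
import bs4

# hypothetical fragment, for illustration only
sample_html = """
<div class="person">first</div>
<div class="person vip">second</div>
<div class="other">third</div>
"""

soup_sample = bs4.BeautifulSoup(sample_html, "html.parser")

# three equivalent ways of asking for <div> tags with the class "person"
by_dict = soup_sample.find_all("div", {"class": "person"})
by_keyword = soup_sample.find_all("div", class_="person")
by_css = soup_sample.select("div.person")

print(len(by_dict), len(by_keyword), len(by_css))
```

Which form to use is largely a matter of taste; CSS selectors tend to be more compact when the filter involves nesting or several attributes.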

The matches above are complete \<div\> tags, but what we really want are the individual fields stored inside each one.

We can extract these by finding the \<span\> tags inside each matched \<div\> tag.

In [None]:
# create a BeautifulSoup object which contains the structure of the HTML page
# (html is None if the download failed, so fall back to an empty document)
soup = bs4.BeautifulSoup(html or "", "html.parser")

# list to store our results
customers = []

# find all the div tags we want
person_matches = soup.find_all("div", {"class": "person"})

# any results?
if not person_matches:
    print("No person records found")
else:
    print(f"Processing {len(person_matches)} person records...")
    for i, match in enumerate(person_matches, 1):
        # within each person match, find span tags and extract the text from each one
        firstname_elem = match.find("span", {"class": "firstname"})
        lastname_elem = match.find("span", {"class": "lastname"})
        dob_elem = match.find("span", {"class": "dob"})
        
        # check if all required elements were found
        if not all([firstname_elem, lastname_elem, dob_elem]):
            print(f"Warning: Person record {i} is missing required fields")
            continue
            
        # extract text from elements
        firstname = firstname_elem.text.strip()
        lastname = lastname_elem.text.strip()
        dob = dob_elem.text.strip()
        
        # clean the date of birth string
        dob = dob.replace("DOB: ", "")
        
        # validate that we have non-empty data
        if not all([firstname, lastname, dob]):
            print(f"Warning: Person record {i} contains empty fields")
            continue
        
        # add all of the information to a dictionary
        customer = {"firstname": firstname, "lastname": lastname, "dob": dob}
        customers.append(customer)
            
    print(f"Successfully extracted {len(customers)} complete customer records")
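
Wrapping the extraction logic in a function makes it possible to try it on a small HTML string without downloading anything. The sketch below is one possible arrangement, not the only one: `parse_customers` and the sample markup are hypothetical, and the sample simply reuses the class names and the "DOB: " prefix seen above. The date format in the sample is made up.

```python
import bs4

# hypothetical markup: one complete record and one with missing fields
SAMPLE_HTML = """
<div class="person">
  <span class="firstname">Alice</span>
  <span class="lastname">Example</span>
  <span class="dob">DOB: 1990-01-31</span>
</div>
<div class="person">
  <span class="firstname">Bob</span>
</div>
"""


def parse_customers(html):
    """Return a list of customer dictionaries, skipping incomplete records."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    customers = []
    for person in soup.find_all("div", class_="person"):
        fields = {}
        for name in ("firstname", "lastname", "dob"):
            elem = person.find("span", class_=name)
            # a missing <span> is recorded as an empty string
            fields[name] = elem.get_text(strip=True) if elem else ""
        fields["dob"] = fields["dob"].replace("DOB: ", "")
        # keep the record only if every field is non-empty
        if all(fields.values()):
            customers.append(fields)
    return customers


print(parse_customers(SAMPLE_HTML))
```

Only the first sample record is returned, because the second one lacks a last name and a date of birth.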

We can now print the data that we extracted from the HTML:

In [None]:
for i, customer in enumerate(customers, 1):
    print(f"{i}. {customer}")

We can use f-string format specifiers, such as `:<12` to left-align a value in a 12-character field, to display this data as an aligned table:

In [None]:
# print table header
print(f"{'No.':<4} {'First Name':<12} {'Last Name':<12} {'Date of Birth'}")
# the header is 44 characters wide: 4 + 12 + 12 + 13, plus 3 separating spaces
print("-" * 44)

# print each customer's details in a formatted row
for i, customer in enumerate(customers, 1):
    firstname = customer["firstname"]
    lastname = customer["lastname"]
    dob = customer["dob"]
    print(f"{i:<4} {firstname:<12} {lastname:<12} {dob}")