# Web Scraping for Data

## Part 1: Scraping Auction Catalog PDFs
_____________________

In this stage, I extract auction and painting information from multiple auction catalog PDFs. I have collected hundreds of samples across categories such as old masters and modern/contemporary art. PDF extraction is the preferred method because auction platforms often serve highly dynamic content, which makes direct web scraping impractical.


### 1. Import Libraries
Import the libraries needed to extract data from PDFs and format it.

In [None]:
import os
import re # Regular expressions for locating the strings to extract
import pandas as pd
import fitz  # PyMuPDF for extracting text from PDFs
import unidecode # Transliterate accented text to plain ASCII

In [None]:
# Define folder path where PDFs are stored
pdf_folder = "../art_paintings/"
data_list = []  # Store extracted data

### 2. Review Extracted Data from PDF
Use PyMuPDF to print the raw text of every page of each PDF in the local folder, to see what can be extracted.

In [None]:
# Loop through each PDF in the folder
# os.listdir() returns a list of the names of the entries in the directory
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"): # Only process PDFs
        file_path = os.path.join(pdf_folder, filename) # Create path name by combining folder location and filename
        
        # Open PDF
        doc = fitz.open(file_path)
        print(f"Extracting from: {filename}") # Show which file is being processed
        
        # page_num gives the page number (starting from 0)
        # enumerate(doc): Loops through each page in the PDF
        for page_num, page in enumerate(doc):
            text = page.get_text("text") # Extracts text from the page
            print(f"Page {page_num + 1}:\n{text}\n") # Display the page number and its text

        doc.close() # Close the PDF before moving to the next file
        print("-" * 60) # Print a separator line after each PDF
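
The loop above keeps only names ending in a lowercase `.pdf`, and `os.listdir()` returns entries in arbitrary order. If some catalogs carry an uppercase extension, or a reproducible processing order matters, a helper along the following lines could be used. This is a minimal sketch: `list_pdfs` and the file names in the demo are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return PDF paths in `folder`, matched case-insensitively and sorted by name."""
    pdfs = [
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    ]
    return sorted(pdfs, key=lambda path: path.name.lower())


# Demo in a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_sale.PDF", "a_sale.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [path.name for path in list_pdfs(tmp)]

print(found)
```

Each returned `Path` can be passed to `fitz.open()` after converting it with `str(path)`.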

### 3. Extracting and Structuring Auction Data from PDFs

This step involves extracting auction information from catalog PDFs using PyMuPDF and regular expressions (regex). Specific regex patterns are crafted to locate key data points such as the artist's name, birth year, title, and other relevant details. For each piece of information, the script searches the text using targeted patterns to accommodate the diverse formatting found across different PDFs. Once the data is extracted, it is compiled into a dictionary and then saved as a CSV file for further analysis.
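
To make the pattern matching concrete, the sketch below applies a few of the patterns from this step to a made-up page. The sample text, artist name, and prices are invented for illustration, and real catalog layouts will vary, so treat it as a way to test patterns rather than as a description of any particular catalog.

```python
import re

# Invented page text that mimics a "lot number / artist / title" layout
sample_page = (
    "12\n"
    "Jane Example\n"
    "Untitled (Blue Study)\n"
    "1901 - 1980\n"
    "Executed in 1955\n"
    "oil on canvas\n"
    "100.5 by 73 cm\n"
    "Estimate: 80,000 - 120,000 USD\n"
)

# Field name -> (pattern, flags); each pattern captures its value in group 1
patterns = {
    "lot_number": (r"^\s*(\d+)\n[^\n]+\n", re.MULTILINE),
    "artist": (r"^\d+\n([^\n]+)", re.MULTILINE),
    "title": (r"^\d+\n[^\n]+\n([^\n]+)", re.MULTILINE),
    "year_created": (r"Executed (?:in|circa) (\d{4})", 0),
    "dimensions_cm": (r"([\d\.]+ ?(?:by|x) ?[\d\.]+ cm)", 0),
    "estimate": (r"Estimate:\s*(.*?)\n", 0),
}

fields = {}
for name, (pattern, flags) in patterns.items():
    match = re.search(pattern, sample_page, flags)
    # A field that is not found is stored as None instead of raising
    fields[name] = match.group(1) if match else None

for name, value in fields.items():
    print(f"{name}: {value}")
```

Because the artist and title patterns both anchor on a line that contains only digits, they depend on the lot number appearing directly above the artist's name.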

### 3a. Define the Function for Extracting Information

In [None]:
# Extract structured information from the text using regex
def extract_info(text):
    # Set up regex patterns to locate information within the PDF text
    artist_match = re.search(r"^\d+\n([^\n]+)", text, re.MULTILINE)
    birth_year_match = re.search(r"\n^b\.\s*(\d{4})\b", text, re.MULTILINE)
    birth_death_year_match = re.search(r"^(\d{4} - \d{4})$", text, re.MULTILINE)
    title_match = re.search(r"^\d+\n[^\n]+\n([^\n]+)", text, re.MULTILINE)
    year_match = re.search(r"Executed (?:in|circa) (\d{4})", text)
    medium_match = re.search(r"(\b(?:oil|ink|acrylic|tempera|watercolor|charcoal|graphite|mixed media|print|photograph)[^.\n]+)", text, re.IGNORECASE)
    dimensions_in_match = re.search(r"([\d\.]+ ?(?:by|x) ?[\d\.]+ in)", text)
    dimensions_cm_match = re.search(r"([\d\.]+ ?(?:by|x) ?[\d\.]+ cm)", text)
    lot_number_match = re.search(r"(?m)^\s*(\d+)\n[^\n]+\n", text)
    estimate_match = re.search(r"Estimate:\s*(.*?)\n", text)
    sold_price_match = re.search(r"Lot Sold:\s*(.*?)\n", text)
    condition_matches = re.findall(r"\b(?:Condition Report|Condition:)\s*\n([^.\n]+)", text, re.IGNORECASE)

    # Convert the artist's name and title to plain ASCII characters (removing accents)
    artist_name = unidecode.unidecode(artist_match.group(1)) if artist_match else None
    title = unidecode.unidecode(title_match.group(1)) if title_match else None

    # Determine birth and death years:
    # If a birth-death range is found, split it into birth and death years
    # If only the birth year is found, use it and leave death year empty
    if birth_death_year_match:
        birth_year, death_year = birth_death_year_match.group(1).split(' - ')
    elif birth_year_match:
        birth_year = birth_year_match.group(1).strip()
        death_year = None
    else:
        birth_year, death_year = None, None

    # Combine multiple condition details into a single string, separated by commas
    condition_report = ", ".join(condition_matches) if condition_matches else None

    # Return all the extracted information as a dictionary
    return {
        "Artist": artist_name,
        "Birth_Year": birth_year,
        "Death_Year": death_year,
        "Title": title,
        "Year_Created": year_match.group(1) if year_match else None,
        "Medium": medium_match.group(1) if medium_match else None,
        "Dimensions(in)": dimensions_in_match.group(1) if dimensions_in_match else None,
        "Dimensions(cm)": dimensions_cm_match.group(1) if dimensions_cm_match else None,
        "lot_number": lot_number_match.group(1) if lot_number_match else None,
        "Estimate Price": estimate_match.group(1) if estimate_match else None,
        "Final Sold Price": sold_price_match.group(1) if sold_price_match else None,
        "Condition Report": condition_report,
    }
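
The birth and death year branching in `extract_info` can be exercised on its own. The sketch below copies the two year patterns into a hypothetical helper, `parse_life_years`, and runs it on invented snippets covering each branch.

```python
import re


def parse_life_years(text):
    """Return (birth_year, death_year); either value may be None."""
    # A full "YYYY - YYYY" line takes priority over a "b. YYYY" line
    span = re.search(r"^(\d{4} - \d{4})$", text, re.MULTILINE)
    if span:
        birth_year, death_year = span.group(1).split(" - ")
        return birth_year, death_year
    born = re.search(r"\n^b\.\s*(\d{4})\b", text, re.MULTILINE)
    if born:
        return born.group(1), None
    return None, None


# Invented snippets, one per branch
print(parse_life_years("Jane Example\n1901 - 1980\n"))
print(parse_life_years("John Sample\nb. 1975\n"))
print(parse_life_years("Anonymous\nUntitled\n"))
```

Note that the birth-only pattern starts with `\n`, so a `b. 1975` line at the very start of the text would not match. That should rarely matter here, because the line normally follows the artist's name.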

### 3b. Process Each PDF

In [None]:
# Loop through each file in the PDF folder
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):  # Only process PDFs
        file_path = os.path.join(pdf_folder, filename)  #  Create path name by combining folder location and filename
        doc = fitz.open(file_path)  # Open the PDF document
        
        # Process pages in pairs (each painting's info spans two pages)
        for page_num in range(0, len(doc), 2):  
            page_text = ""

            # Get text from the current page
            page_text += doc[page_num].get_text("text")

            # If the next page exists, add its text too
            if page_num + 1 < len(doc):
                page_text += " " + doc[page_num + 1].get_text("text")

            # Extract structured information from the combined text by 'extract_info' function
            extracted_data = extract_info(page_text)

            # Save the filename with the extracted data for tracking
            extracted_data["File"] = filename  

            # Append the extracted information to the data list
            data_list.append(extracted_data)

        doc.close()  # Close the PDF after processing
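
The pairing logic can be checked without any PDF by running it on a plain list of strings. In this sketch, `pair_pages` is a hypothetical helper that mirrors the loop above, and the page texts are placeholders.

```python
def pair_pages(page_texts):
    """Join consecutive pages two at a time, as the loop above does."""
    combined = []
    for page_num in range(0, len(page_texts), 2):
        text = page_texts[page_num]
        # Only add the second page of the pair if it exists
        if page_num + 1 < len(page_texts):
            text += " " + page_texts[page_num + 1]
        combined.append(text)
    return combined


pairs = pair_pages(["page1", "page2", "page3", "page4", "page5"])
print(pairs)
```

With an odd page count the last page is processed alone. The approach also assumes every catalog starts a new lot on an odd-numbered page, so a cover page or an index would shift all the pairs by one.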

### 3c. Convert Data to DataFrame and Save

In [None]:
df = pd.DataFrame(data_list)

# Save dataframe to CSV
df.to_csv("auction_data.csv", index=False, encoding="utf-8")
print("Data saved to auction_data.csv") # notification to indicate csv save
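
Part 2 reads an artists.csv file with an `artist` column, built from the artist names collected here. The notebook does not show that step, so the following is only one possible sketch, using the standard-library `csv` module on in-memory text. The helper names and sample rows are hypothetical.

```python
import csv
import io


def unique_artists(auction_csv_text):
    """Return distinct, non-empty values of the "Artist" column, in first-seen order."""
    reader = csv.DictReader(io.StringIO(auction_csv_text))
    seen = set()
    names = []
    for row in reader:
        name = (row.get("Artist") or "").strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def artists_csv_text(names):
    """Render the names as CSV text with a single "artist" column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["artist"])
    for name in names:
        writer.writerow([name])
    return buffer.getvalue()


# Invented rows; the second row has no artist because the regex found none
sample = (
    "Artist,Title\n"
    "Jane Example,Untitled\n"
    ",Unknown Work\n"
    "Jane Example,Study\n"
    "John Sample,Harbor\n"
)
names = unique_artists(sample)
print(artists_csv_text(names))
```

For real files, the same functions could be fed the contents of auction_data.csv and the result written to artists.csv.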

## Part 2: Scraping with BeautifulSoup and Selenium
_____________________

In this stage, I collect artist performance metrics from an online auction platform by combining the strengths of Selenium and BeautifulSoup. Selenium is used to dynamically load the website, ensuring that all JavaScript-rendered content—including critical performance metrics—is fully captured. Once the page is rendered, BeautifulSoup is employed to parse the HTML structure. Artist-specific URLs are reconstructed, key metrics are extracted, and the data is organized into a CSV file for subsequent analysis.

### 1. Install and Import Libraries
Install and import Selenium to render JavaScript, BeautifulSoup to parse the HTML, and the standard-library `time` module to pause between page loads.

In [None]:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

### 2. Define Function to Construct URL
Each artist’s information page URL is defined by their name appended to the platform’s base URL. Artist names collected from Part 1 were saved into a new artists.csv file, which is then used to construct URLs for automated data extraction.

In [None]:
def construct_url(artist_name):
    
    # Convert the artist's name into a URL-friendly format
    # For example, "Yoshitomo Nara" becomes "yoshitomo-nara"
    if not artist_name or not isinstance(artist_name, str):
        raise ValueError("Artist name must be a non-empty string.")
    
    # Remove extra spaces and change to lowercase
    artist_name = artist_name.strip().lower()
    
    # Remove special characters, keeping only letters, numbers, underscores, spaces, and hyphens
    artist_slug = re.sub(r"[^\w\s-]", "", artist_name)
    # Replace spaces (and extra hyphens) with a single hyphen
    artist_slug = re.sub(r"[\s-]+", "-", artist_slug)
    
    # Build the URL using the formatted artist name
    url = f"https://www.example.com/{artist_slug}/results"
    return url
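
A few inputs show how the slug rules behave. The sketch below repeats the two substitutions from `construct_url` in a hypothetical helper, `slugify_artist`, and adds an optional accent-folding step based on `unicodedata`. That step is an assumption on my part: the names from Part 1 have already passed through `unidecode`, so it would only matter for names from other sources.

```python
import re
import unicodedata


def slugify_artist(name, fold_ascii=False):
    """Apply the slug rules of construct_url; optionally strip accents first."""
    if fold_ascii:
        # NFKD splits an accented letter into base letter + combining mark;
        # encoding to ASCII with errors ignored then drops the marks
        name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    name = name.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", name)  # Drop punctuation such as apostrophes
    return re.sub(r"[\s-]+", "-", slug)   # Collapse spaces and hyphens into one hyphen


examples = ["Yoshitomo Nara", "  Jean-Michel   Basquiat ", "Georgia O'Keeffe", "Joan Miró"]
for artist in examples:
    print(repr(artist), "->", slugify_artist(artist), "|", slugify_artist(artist, fold_ascii=True))
```

In Python 3, `\w` matches accented letters, so without folding an accented name keeps its accents in the slug. Whether the platform expects accented or plain slugs is something to verify against a known artist page.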

### 3. Function to Scrape Metrics

In [None]:
# Scrape metrics from an artist's results page using Selenium to render JavaScript
def get_artist_metrics(artist_url):
    
    # Set up Selenium with headless Chrome (runs without opening a browser window)
    chrome_options = Options()
    chrome_options.add_argument("--headless") # Run Chrome in the background
    chrome_options.add_argument("--disable-gpu") # Disable GPU use to avoid potential issues in headless mode
    chrome_options.add_argument("--no-sandbox") # Turn off Chrome's sandbox for easier setup
    driver = webdriver.Chrome(options=chrome_options) # load Chrome webdriver
    
    # Open the artist's page and wait for all the content to load
    try:
        driver.get(artist_url)
        time.sleep(5) # Wait 5 seconds for dynamic content to load fully

        # Get the full HTML of the loaded page
        html = driver.page_source
    finally:
        driver.quit()  # Always close the browser, even if loading fails

    soup = BeautifulSoup(html, "html.parser")

    # Set default values for the metrics
    metrics = {
        "critically_acclaimed": "N/A",
        "followers": "N/A",
        "values": "N/A",
    }
    
    # Look for buttons that say "Critically acclaimed"
    crit_buttons = soup.find_all("button", class_="Clickable-sc-10cr82y-0 fbnHxf")
    metrics["critically_acclaimed"] = "No"
    # If any button shows "Critically acclaimed", mark it as "Yes"
    for btn in crit_buttons:
        if "Critically acclaimed" in btn.get_text(strip=True):
            metrics["critically_acclaimed"] = "Yes"
            break

    # Look for the followers count in the designated div
    follower_div = soup.find("div", class_="Box-sc-15se88d-0 Text-sc-18gcpao-0 cZekcQ gviZDz")
    # If found, extract its value and save it as the followers metric.
    if follower_div:
        metrics["followers"] = follower_div.get_text(strip=True)
    
    # Find the section that holds other metric values
    value_elements = soup.find("div", class_="Box-sc-15se88d-0 CSSGrid-sc-1q8w5xn-0 GridColumns-sc-1g9p6xx-0 jdZUdM gRoBRz fwdhTL")
    # If found, extract its text and save it as the values metric
    if value_elements:
        metrics["values"] = value_elements.get_text(strip=True)
    
    return metrics
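
The selectors above rely on class names such as `Clickable-sc-10cr82y-0 fbnHxf`, which look machine-generated and may change when the site is rebuilt. The "Critically acclaimed" check itself depends only on the button text. The sketch below isolates that text-based check on static HTML so it can be tried without a browser. It uses the standard-library `html.parser` as a stand-in for BeautifulSoup, and `ButtonTextCollector`, `is_critically_acclaimed`, and the HTML snippets are all hypothetical.

```python
from html.parser import HTMLParser


class ButtonTextCollector(HTMLParser):
    """Collect the visible text of every <button> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # Greater than 0 while inside a <button>
        self._chunks = []     # Text pieces of the button being read
        self.button_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            if self._depth == 0:
                self._chunks = []
            self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.button_texts.append("".join(self._chunks).strip())


def is_critically_acclaimed(html):
    """Return "Yes" if any button text contains the badge label, else "No"."""
    collector = ButtonTextCollector()
    collector.feed(html)
    collector.close()
    found = any("Critically acclaimed" in text for text in collector.button_texts)
    return "Yes" if found else "No"


# Invented snippets; the class values are placeholders, not the site's real markup
page_with_badge = (
    '<div><button class="x1 y2"><span>Critically acclaimed</span></button>'
    "<button>Follow</button></div>"
)
page_without_badge = '<div><button class="x1 y2">Follow</button></div>'

print(is_critically_acclaimed(page_with_badge))
print(is_critically_acclaimed(page_without_badge))
```

With BeautifulSoup, the equivalent would be to call `soup.find_all("button")` without a class filter and apply the same text test to each result.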

### 4. Load CSV File of Artists

In [None]:
# Try to load the CSV file that contains a list of artists
try:
    artists_df = pd.read_csv("artists.csv")
    # Check if the CSV has an "artist" column. If not, throw an error
    if "artist" not in artists_df.columns:
        raise ValueError("CSV file must contain an 'artist' column.")
except Exception as e:
    # Report any error that occurs during loading and stop the script
    raise SystemExit(f"Error loading CSV file: {e}")

### 5. Scrape Data for Each Artist

An artists.csv file is loaded into Python to extract artist names. These names are then used to construct URLs for scraping data.

In [None]:
# Prepare an empty list to store data that we will scrape
data_list = []

# Loop over each row (artist) in the CSV
for index, row in artists_df.iterrows():
    artist_name = row["artist"]
    # Build the URL for the artist page
    url = construct_url(artist_name)
    print(f"Scraping data for {artist_name} from {url}")

    # Get the artist's metrics by scraping the page
    metrics = get_artist_metrics(url)
    if metrics:
        # Add the artist's name and URL to the metrics data
        metrics["artist"] = artist_name
        metrics["url"] = url
        # Save this data into our list
        data_list.append(metrics)
    
    time.sleep(1)  # Wait for 1 second between requests to avoid overwhelming the server
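
As written, the loop stops at the first artist whose page raises an exception, and the `if metrics:` check is always true because `get_artist_metrics` returns a non-empty dictionary. One way to keep going after a failure is sketched below. `scrape_all`, its parameters, and the fake fetcher used in the demo are hypothetical; in the notebook, `construct_url` and `get_artist_metrics` would be passed in instead.

```python
import time


def scrape_all(names, build_url, fetch_metrics, delay=1.0, retries=1):
    """Scrape every artist, retrying failures and recording the ones that never succeed."""
    rows, failures = [], []
    for name in names:
        url = build_url(name)
        for attempt in range(retries + 1):
            try:
                metrics = dict(fetch_metrics(url))
            except Exception as exc:
                if attempt == retries:
                    failures.append((name, str(exc)))  # Give up on this artist
                else:
                    time.sleep(delay)  # Pause before retrying
            else:
                metrics["artist"] = name
                metrics["url"] = url
                rows.append(metrics)
                break
        time.sleep(delay)  # Pause between artists to avoid overwhelming the server
    return rows, failures


def fake_fetch(url):
    """Stand-in for get_artist_metrics that fails for one URL."""
    if "broken" in url:
        raise RuntimeError("page did not load")
    return {"followers": "1.2k"}


rows, failures = scrape_all(
    ["good artist", "broken artist"],
    build_url=lambda name: "https://www.example.com/" + name.replace(" ", "-"),
    fetch_metrics=fake_fetch,
    delay=0,
    retries=1,
)
print(rows)
print(failures)
```

Returning the failures separately makes it possible to rerun only the artists that could not be scraped.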

### 6. Save Scraped Data to CSV

In [None]:
# Create a DataFrame from the scraped data
scraped_df = pd.DataFrame(data_list)

# Save the scraped metrics to a CSV file
scraped_df.to_csv("artists_scraped_metrics.csv", index=False)
print("Scraping complete. Data saved to artists_scraped_metrics.csv")