# Web scraping

This notebook scrapes data from the Property Sales webpages and saves it to CSV files.

## Import necessary libraries.

In [1]:
import requests
import bs4
import pandas as pd
import csv
import pprint

## Create helper functions.
Let's create a number of helper functions to scrape the web data. The following functions are defined below:
- fetch_html(url) - Takes a URL string as input and returns a BeautifulSoup object if the request succeeds, otherwise None.
- extract_sale_date(soup) - Takes a BeautifulSoup object and returns the "Sale Date" text, or None if it is not found.
- parse_property_details(value, current_property) - Parses the "Property Details" field (bedrooms, bathrooms, etc.) and adds the results to the current_property dictionary.
- parse_property_rows(soup, sale_date) - Takes a BeautifulSoup object and a sale_date string as input parameters. Iterates over all property sale entries and creates a list of dictionaries, one per sale, where each key is a property feature (e.g. bedrooms, bathrooms) and each value is that feature's value for the sale.

### Create a helper function to fetch the HTML content of a given URL.

In [2]:
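def fetch_html(url):
    # Takes a URL string and returns a BeautifulSoup object if successful, None if not.
    try:
        # A timeout stops the notebook hanging on an unresponsive server.
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return None
    if response.status_code != 200:
        print(f"Failed to fetch {url} (status code {response.status_code})")
        return None
    return bs4.BeautifulSoup(response.text, "html.parser")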

### Create a helper function to extract property sale date.

In [3]:
def extract_sale_date(soup):
    # This function will extract the "Sale Date" from <span class='sold'> tag.
    sold_span = soup.find("span", class_="sold")
    return sold_span.get_text(strip=True) if sold_span else None
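
As a quick offline check, the helper can be run against a small inline HTML string instead of a live page. This is a minimal sketch: the sample markup and the date text are made up for illustration, and only the `<span class='sold'>` tag is taken from the function above.

```python
import bs4


def extract_sale_date(soup):
    # Same logic as the helper above: read the text of <span class="sold">.
    sold_span = soup.find("span", class_="sold")
    return sold_span.get_text(strip=True) if sold_span else None


# Hypothetical markup; the real pages may wrap the span differently.
sample_html = '<html><body><p><span class="sold"> 14 March 2021 </span></p></body></html>'
found = extract_sale_date(bs4.BeautifulSoup(sample_html, "html.parser"))

# A page without the span yields None rather than raising an error.
missing = extract_sale_date(bs4.BeautifulSoup("<p>No sales</p>", "html.parser"))

print(found)
print(missing)
```

Because the function returns None when the span is absent, callers can decide whether to skip the "Sale Date" field rather than handle an exception.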
    

### Create a helper function to parse property details.

In [4]:
def parse_property_details(value, current_property):
    # Parses the property details field for Bedrooms, Bathrooms etc.
    details = value.split(";")
    for item in details:
        item = item.strip()
        if ":" in item:
            # Split on the first colon only, so values that contain ':' stay intact.
            sub_key, sub_value = item.split(":", 1)
            # str.strip() returns a new string, so the result must be reassigned.
            sub_key = sub_key.strip()
            sub_value = sub_value.strip()
            # Map 'Style' to 'Stories' for consistent naming
            if sub_key == "Style":
                current_property["Stories"] = sub_value
            else:
                current_property[sub_key] = sub_value
        else:
            # Handle cases like "3 Bathrooms" or "2 Bedrooms"
            parts = item.split(" ")
            if len(parts) >= 2:
                number, label = parts[0], parts[1]
                if label.startswith("Bathroom"):
                    current_property["Bathrooms"] = number
                elif label.startswith("Bedroom"):
                    current_property["Bedrooms"] = number
    return current_property
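
To make the parsing rules concrete, the sketch below runs the same logic on one sample string. The function is restated with whitespace stripped from keys and values so that the block runs on its own. The sample string, the location name and the exact field layout are assumptions for illustration, not text copied from the real pages.

```python
def parse_property_details(value, current_property):
    # Split "key: value" items and "<number> <label>" items separated by ';'.
    for item in value.split(";"):
        item = item.strip()
        if ":" in item:
            sub_key, sub_value = (part.strip() for part in item.split(":", 1))
            # 'Style' is stored under 'Stories' for consistent naming.
            target = "Stories" if sub_key == "Style" else sub_key
            current_property[target] = sub_value
        else:
            parts = item.split(" ")
            if len(parts) >= 2:
                number, label = parts[0], parts[1]
                if label.startswith("Bathroom"):
                    current_property["Bathrooms"] = number
                elif label.startswith("Bedroom"):
                    current_property["Bedrooms"] = number
    return current_property


# Hypothetical input in the assumed "a: b; c: d; 3 Bedrooms" layout.
sample = "Type: Detached; Style: 2-Storey; 3 Bedrooms; 2 Bathrooms"
result = parse_property_details(sample, {"Location": "Brookville"})
print(result)
```

Note that the numbers stay as strings here; converting them to integers is left to the cleaning step in the analysis notebook.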
            

### Create helper function to parse all property rows in the table.

In [5]:
def parse_property_rows(soup, sale_date):
    # Iterate over the table rows to build a list of dictionaries
    rows = soup.find_all("tr")
    data = []
    current_property = {}

    for row in rows:
        tds = row.find_all("td")
        if len(tds) != 2:
            continue  # Ignore malformed rows

        key = tds[0].get_text(strip=True).replace(":", "")
        value = tds[1].get_text(strip=True)

        # Expand the multi-part 'Property Details' field into separate keys
        if key == "Property Details":
            current_property = parse_property_details(value, current_property)
        else:
            current_property[key] = value

        # Each property ends when we reach 'First Time Buyer'
        if key == "First Time Buyer":
            if sale_date:
                current_property["Sale Date"] = sale_date
            data.append(current_property)
            current_property = {}

    return data
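
The grouping rule used above (a property record is complete once the 'First Time Buyer' row is reached) can be shown without any HTML. The sketch below applies the same idea to a flat list of (key, value) pairs. The helper name `group_records` and all sample values are hypothetical.

```python
def group_records(pairs, end_key="First Time Buyer"):
    """Group a flat stream of (key, value) pairs into one dict per record."""
    records = []
    current = {}
    for key, value in pairs:
        current[key] = value
        # The end key marks the last row of a record, so close it here.
        if key == end_key:
            records.append(current)
            current = {}
    return records


# Hypothetical rows: two complete records followed by an incomplete one.
pairs = [
    ("Sale Price", "250000"),
    ("Location", "Brookville"),
    ("First Time Buyer", "Yes"),
    ("Sale Price", "310000"),
    ("Location", "Summerhill"),
    ("First Time Buyer", "No"),
    ("Sale Price", "199000"),
]
records = group_records(pairs)
print(len(records))
```

One consequence of this design is that rows after the last 'First Time Buyer' entry never form a record, so a truncated page loses its final, incomplete property without any warning.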

### Master scraping function

Let's now use all the helper functions to fetch a URL's content and load it into a list of dictionaries.

In [6]:
# A master scrape function to scrape all property sales data from a given URL.
def scrape_data(url):
    soup = fetch_html(url)
    # fetch_html returns None when the page could not be retrieved.
    if soup is None:
        return []
    sale_date = extract_sale_date(soup)
    property_data = parse_property_rows(soup, sale_date)
    return property_data

## Formatting and Saving

Now that we have scraped the web data, let's convert the list for each year to a pandas DataFrame and save it to a CSV file, so that we can clean and analyse the datasets in a different notebook.

### Scrape 2021 Data

In [7]:
# Create a base url for property sales data in 2021 and a list to store the dictionaries.
base_url = "http://mlg.ucd.ie/modules/python/assign1/property/2021-page{:02d}.html"
data_2021 = []

# iterate through all the 2021 property sales urls and scrape their data.
for page_num in range(1, 18):
    # Grab current page
    url = base_url.format(page_num)
    # Scrape data from that URL address.
    page_data = scrape_data(url)
    # Add that page's list of dictionaries to the data_2021 list.
    data_2021.extend(page_data)

### Convert to Dataframe and Save to csv Format

In [8]:
# Define the columns for our data frame.
columns = ["Sale Date", "Sale Price", "Location", "Year Built", "Garden", "Garage","Type", "Stories", "Bedrooms", "Bathrooms" , "First Time Buyer"]
# Convert the 2021 sales data into a pandas dataframe.
df = pd.DataFrame(data_2021, columns=columns)
# convert the data to a csv file and save it to the data directory.
df.to_csv("data/2021_property_sales_data.csv")
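
Two details of this cell are easy to miss: passing `columns=` to `pd.DataFrame` keeps only the listed keys (filling absent ones with NaN), and `to_csv` writes the row index as an unnamed first column by default. The sketch below illustrates both on made-up rows, writing to a string instead of a file. The row values and the shortened column list are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical scraped rows: one has an unexpected key, one lacks a field.
rows = [
    {"Sale Price": "250000", "Location": "Brookville", "Extra": "ignored"},
    {"Sale Price": "310000"},
]
columns = ["Sale Price", "Location", "Bedrooms"]

# Only the listed columns are kept; missing values become NaN.
df = pd.DataFrame(rows, columns=columns)

# With no path, to_csv returns the CSV text; the index is the first column.
csv_text = df.to_csv()
header = csv_text.splitlines()[0]

# Reading back with index_col=0 restores the original column layout.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(header)
```

When loading the saved files in another notebook, `index_col=0` avoids an extra "Unnamed: 0" column; alternatively, `index=False` could be passed to `to_csv` when saving.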

### Scrape 2022 Data

In [9]:
# Create a base url for property sales data in 2022 and a list to store the dictionaries.
base_url = "http://mlg.ucd.ie/modules/python/assign1/property/2022-page{:02d}.html"
data_2022 = []

# iterate through all the 2022 property sales urls and scrape their data.
for page_num in range(1, 17):
    # Grab current page
    url = base_url.format(page_num)
    # Scrape data from that URL address.
    page_data = scrape_data(url)
    # Add that page's list of dictionaries to the data_2022 list.
    data_2022.extend(page_data)

### Convert to Dataframe and Save to csv Format.

In [10]:
# Define the columns for our data frame.
columns = ["Sale Date", "Sale Price", "Location", "Year Built", "Garden", "Garage","Type", "Stories", "Bedrooms", "Bathrooms" , "First Time Buyer"]
# Convert the 2022 sales data into a pandas dataframe.
df = pd.DataFrame(data_2022, columns=columns)
# Convert the data to a csv file and save it to the data directory.
df.to_csv("data/2022_property_sales_data.csv")

### Scrape 2023 Data

In [11]:
# Create a base url for property sales data in 2023 and a list to store the dictionaries.
base_url = "http://mlg.ucd.ie/modules/python/assign1/property/2023-page{:02d}.html"
data_2023 = []

# iterate through all the 2023 property sales urls and scrape their data.
for page_num in range(1, 19):
    # Grab current page
    url = base_url.format(page_num)
    # Scrape data from that URL address.
    page_data = scrape_data(url)
    # Add that page's list of dictionaries to the data_2023 list.
    data_2023.extend(page_data)

### Convert to Dataframe and Save to csv Format

In [12]:
# Define the columns for our data frame.
columns = ["Sale Date", "Sale Price", "Location", "Year Built", "Garden", "Garage","Type", "Stories", "Bedrooms", "Bathrooms" , "First Time Buyer"]
# Convert the 2023 sales data into a pandas dataframe.
df = pd.DataFrame(data_2023, columns=columns)
# Convert the data to a csv file and save it to the data directory.
df.to_csv("data/2023_property_sales_data.csv")

### Scrape 2024 Data

In [13]:
# Create a base url for property sales data in 2024 and a list to store the dictionaries.
base_url = "http://mlg.ucd.ie/modules/python/assign1/property/2024-page{:02d}.html"
data_2024 = []

# iterate through all the 2024 property sales urls and scrape their data.
for page_num in range(1, 24):
    # Grab current page
    url = base_url.format(page_num)
    # Scrape data from that URL address.
    page_data = scrape_data(url)
    # Add that page's list of dictionaries to the data_2024 list.
    data_2024.extend(page_data)

### Convert to Dataframe and Save to csv Format.

In [14]:
# Define the columns for our data frame.
columns = ["Sale Date", "Sale Price", "Location", "Year Built", "Garden", "Garage","Type", "Stories", "Bedrooms", "Bathrooms" , "First Time Buyer"]
# Convert the 2024 sales data into a pandas dataframe.
df = pd.DataFrame(data_2024, columns=columns)
# Convert the data to a csv file and save it to the data directory.
df.to_csv("data/2024_property_sales_data.csv")
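
The four scrape-and-save steps differ only in the year and the number of pages, so they could be driven from a single table. The sketch below shows one possible way to generate the page URLs. The page counts are read off the `range(...)` calls above, while the names `BASE_URL`, `PAGES_PER_YEAR` and `page_urls` are hypothetical. No network requests are made here.

```python
# URL pattern taken from the cells above, with the year made a placeholder.
BASE_URL = "http://mlg.ucd.ie/modules/python/assign1/property/{year}-page{page:02d}.html"

# Number of pages per year, matching range(1, 18), range(1, 17), etc.
PAGES_PER_YEAR = {2021: 17, 2022: 16, 2023: 18, 2024: 23}


def page_urls(year, n_pages):
    """Return the list of page URLs for one year, numbered from 01."""
    return [BASE_URL.format(year=year, page=page) for page in range(1, n_pages + 1)]


urls_2021 = page_urls(2021, PAGES_PER_YEAR[2021])
print(urls_2021[0])

# In the notebook, each URL would then be passed to scrape_data, for example:
# for year, n_pages in PAGES_PER_YEAR.items():
#     rows = [row for url in page_urls(year, n_pages) for row in scrape_data(url)]
#     pd.DataFrame(rows, columns=columns).to_csv(f"data/{year}_property_sales_data.csv")
```

Keeping the page counts in one dictionary means a change to the site only needs one edit, at the cost of slightly less explicit per-year cells.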

## Conclusion

We now have four CSV files with raw property sales data for the years 2021 to 2024, saved in the data directory. The analysis notebook will load these files, clean them and analyse the data.