## Rental Scraper Notebook

**Student ID:** 24204265  
**Contact:** yash.pathania@ucdconnect.ie

## Introduction

This notebook demonstrates how to scrape rental data for different quarters (Q1..Q4), parse specific fields (price, location, number of bedrooms, bathrooms, etc.), and store the scraped information in a CSV file.

## Overview

- **Libraries:**  
  We use libraries such as `requests` and `BeautifulSoup` for web scraping.

- **Helper Functions:**  
  Functions like `parse_price`, `parse_integer`, `parse_boolean`, and `parse_lease_length` assist in parsing the required fields.

- **Scraping Function:**  
  The main function, `scrape_rentals`, iterates through pages for each quarter until a 404 is encountered or until a preset limit is reached in test mode.

- **Execution Modes:**  
  - **Test Mode:** Scrapes only the first 5 pages per quarter for quicker testing.  
  - **Full Mode:** Continues scraping until a 404 is returned for each quarter.

- **Saving Data:**  
  The scraped data is saved into a CSV file.

## Key Notes

- **Test Mode:**  
  Designed for testing; only the first 5 pages per quarter are processed.

- **Page Iteration:**  
  Pages are scraped sequentially until a page returns a 404 error, indicating no more data for that quarter.

- **Data Extraction:**  
  BeautifulSoup locates each field in the page, and regular expressions then pull the values out of the raw text. This works for the simple, consistent markup of this site, but it may break if the field formats change.

- **Comments:**  
  Every function has a standard Python docstring that describes what goes in and what comes out.
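
The caveat in the Data Extraction note can be made concrete. The sketch below uses made-up input strings (not taken from the site) to show how a bare `\d+` search stops at the first non-digit character. The more tolerant pattern, called `NUMBER_PATTERN` here purely for illustration, is an assumption and not something the notebook uses.

```python
import re

# Hypothetical pattern: digits with optional thousands separators and decimals.
NUMBER_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = NUMBER_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

# A bare \d+ stops at the comma, so only the leading "1" is captured.
naive = re.search(r"\d+", "€ 1,920 per month").group()

# The tolerant pattern keeps the thousands group together.
tolerant = first_number("€ 1,920 per month")

# Placeholder text with no digits yields None rather than raising.
missing = first_number("???")
```

This is why `parse_price` strips the comma before converting, while `parse_integer` can safely rely on `\d+` for small counts such as bedrooms.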

## Contact

For any clarifications or further details, please feel free to contact me at:  
**yash.pathania@ucdconnect.ie**

In [77]:
import requests  # To make HTTP requests
from bs4 import BeautifulSoup  # To parse the HTML
import csv  # To write the CSV file
import re  # To extract data from HTML text with regular expressions

### Helper Functions: regex-based parsers for the unformatted data

In [78]:
def parse_price(price_str):
    """
    Extracts the numeric value from strings like '€ 1,920 per month'
    and returns (float, 'EUR').
    """
    currency = "EUR"
    cleaned = price_str.replace("€", "").replace("per month", "").replace(",", "").strip()
    try:
        numeric_price = float(cleaned)
    except ValueError:
        numeric_price = None
    return numeric_price, currency

def parse_integer(value_str):
    """
    For '2 Bedrooms' or '1 Bathroom', return just the integer (2 or 1 respectively).
    Returns None if no integer is found.
    """
    match = re.search(r'\d+', value_str)
    if match:
        return int(match.group())
    return None

def parse_boolean(value_str):
    """
    Interpret 'Yes'/'Y' (case-insensitive) as 'Yes'.
    If '???' or missing, return '' (blank).
    Otherwise -> 'No'.
    """
    if not value_str:
        return ""
    val_str = value_str.strip().lower()
    if val_str in ("yes", "y"):
        return "Yes"
    elif "???" in val_str:
        return ""  # leave it blank
    return "No"

def parse_lease_length(value_str):
    """
    Convert lease length to a numeric number of months:
      '3 months' -> 3
      '1 year'   -> 12
      '???'      -> None
    """
    match = re.search(r'\d+', value_str)
    if not match:
        return None
    num = int(match.group())
    if 'year' in value_str.lower():
        return num * 12
    return num

def parse_location(location_str):
    """
    Splits the 'Location' field into (region, dublin_area).
      e.g. "Dublin City South - Dublin 16" -> ("Dublin City South", "16")
           "North Co Dublin"             -> ("North Co Dublin", "")
    """
    parts = location_str.split('-')
    region = parts[0].strip() if parts else location_str
    dublin_area = ""
    if len(parts) > 1:
        # Look for "Dublin 2", "Dublin 8", etc. in the second part
        match = re.search(r'Dublin\s*(\d+[A-Za-z0-9]*)', parts[1])
        if match:
            dublin_area = match.group(1)  # e.g. "16", "2", "6W", etc.
    return region, dublin_area

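def fetch_page(url):
    """Fetches the content of the given URL (gives up after 30 seconds)."""
    # A timeout prevents the scrape from hanging forever on an unresponsive server.
    return requests.get(url, timeout=30)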

def get_soup(response):
    """Converts a requests response to a BeautifulSoup object."""
    return BeautifulSoup(response.text, 'html.parser')

def get_listings_container(soup):
    """
    Finds the primary <ol> element that holds the listings.
    Returns the <ol> or None if not found.
    """
    return soup.find('ol')

def extract_start_value(ol_element):
    """
    Extracts the starting listing number from the <ol>'s 'start' attribute.
    Defaults to 1 if not found or invalid.
    """
    start_val = 1
    if ol_element and 'start' in ol_element.attrs:
        try:
            start_val = int(ol_element['start'])
        except ValueError:
            pass
    return start_val
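
The `'Yes'`/`'No'`/`''` strings returned by `parse_boolean` suit the CSV output, but some downstream tools work more naturally with real booleans. As a sketch only, a hypothetical variant named `parse_tristate` (not part of the notebook) could map the same inputs onto `True`, `False`, and `None`, assuming the same 'Yes'/'Y' and '???' conventions described in the docstring above.

```python
from typing import Optional

def parse_tristate(value_str: Optional[str]) -> Optional[bool]:
    """Hypothetical variant of parse_boolean returning True, False, or None."""
    if not value_str:
        return None  # missing value
    val = value_str.strip().lower()
    if val in ("yes", "y"):
        return True
    if "???" in val:
        return None  # the unknown marker stays unknown
    return False

# Made-up sample inputs covering each branch.
samples = ["Yes", "y", "No", "???", ""]
parsed = [parse_tristate(s) for s in samples]
```

Keeping "unknown" distinct from "No" matters either way: collapsing `'???'` into `False` would overstate how many listings lack parking or a garden.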

### Main Scraping Function (splits each page into listing tables, extracts the fields, and preprocesses them with regex to get the right data)

In [79]:

def process_listing(li_tag, listing_number, quarter, page_num):
    """
    Processes a single <li> (one listing) and returns a dictionary of listing data.
    """
    record_data = {
        'ListingID': listing_number,
        'Month': 'Unknown',
        'Quarter': quarter,
        'Page': page_num,
        'Price': None,
        'Currency': 'EUR',
        'Property Type': '',
        'Location': '',
        'Bedrooms': None,
        'Bathrooms': None,
        'Parking': '',
        'Garden': '',
        'Lease Length (months)': None,
        'Contact': ''
    }
    
    # Extract month/year text (e.g. "October 2024")
    month_span = li_tag.find('span', class_='record')
    if month_span:
        month_text = month_span.get_text(strip=True)
        parts = month_text.split()
        if parts:
            record_data['Month'] = parts[0]  # e.g. "October" (we're ignoring the year here)
    
    # The table with class="rental" has the property details
    rental_table = li_tag.find('table', class_='rental')
    if rental_table:
        rows = rental_table.find_all('tr')
        for row in rows:
            cells = row.find_all('td')
            if len(cells) != 2:
                continue
            field_name = cells[0].get_text(strip=True).replace(':', '')
            field_value = cells[1].get_text(strip=True)
            
            if field_name == 'Price':
                numeric_price, currency = parse_price(field_value)
                record_data['Price'] = numeric_price
                record_data['Currency'] = currency
            elif field_name == 'Property Type':
                record_data['Property Type'] = field_value
            elif field_name == 'Location':
                record_data['Location'] = field_value
            elif field_name == 'Bedrooms':
                record_data['Bedrooms'] = parse_integer(field_value)
            elif field_name == 'Bathrooms':
                record_data['Bathrooms'] = parse_integer(field_value)
            elif field_name == 'Parking':
                record_data['Parking'] = parse_boolean(field_value)
            elif field_name == 'Garden':
                record_data['Garden'] = parse_boolean(field_value)
            elif field_name == 'Lease Length':
                record_data['Lease Length (months)'] = parse_lease_length(field_value)
            elif field_name == 'Contact':
                record_data['Contact'] = field_value
    
    return record_data

def process_page(base_url, quarter, page_num):
    """
    Fetches and processes one page for a given quarter.
    Returns (list_of_listings, count).
    If a 404 occurs, returns (None, 0) to indicate no more pages.
    """
    url = f"{base_url}Q{quarter}-page{page_num:02d}.html"
    response = fetch_page(url)
    
    if response.status_code == 404:
        print(f"Got 404. Finished Q{quarter}.")
        return None, 0
    
    soup = get_soup(response)
    listings_ol = get_listings_container(soup)
    if not listings_ol:
        print("No <ol> found on this page; skipping.")
        return [], 0
    
    start_val = extract_start_value(listings_ol)
    li_tags = listings_ol.find_all('li', recursive=False)
    
    listings = []
    for offset, li_tag in enumerate(li_tags):
        listing_number = start_val + offset
        record = process_listing(li_tag, listing_number, quarter, page_num)
        listings.append(record)
    
    print(f"Scraped Q{quarter}-Page {page_num}, found {len(li_tags)} listings.")
    return listings, len(li_tags)

def scrape_rentals(base_url, test_mode=True):
    """
    Scrapes rental data for Q1..Q4, stopping when a 404 is encountered.
    If test_mode=True, only the first 5 pages per quarter are scraped.
    Returns a list of dicts (all_records).
    """
    all_records = []
    for quarter in range(1, 5):
        page_num = 1
        while True:
            listings, count = process_page(base_url, quarter, page_num)
            if listings is None:  # 404 encountered
                break
            all_records.extend(listings)
            page_num += 1
            if test_mode and page_num > 5:
                print(f"Test mode active; stopping early for Q{quarter}.")
                break
    return all_records
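
The stopping logic in `scrape_rentals` can be checked without touching the network. The sketch below isolates that loop in a hypothetical generator, `iter_pages`, and drives it with a stand-in status function, `fake_status`, that pretends a quarter has exactly 7 pages. Both names and the page count are assumptions made for illustration.

```python
from typing import Callable, Iterator, Optional

def iter_pages(get_status: Callable[[int], int],
               max_pages: Optional[int] = None) -> Iterator[int]:
    """Yield page numbers 1, 2, ... until a 404 or until max_pages is reached."""
    page_num = 1
    while max_pages is None or page_num <= max_pages:
        if get_status(page_num) == 404:
            return  # no more pages for this quarter
        yield page_num
        page_num += 1

def fake_status(page_num: int) -> int:
    """Stand-in for an HTTP request: pretend only pages 1..7 exist."""
    return 200 if page_num <= 7 else 404

# Full mode: runs until the first 404.
full_run = list(iter_pages(fake_status))

# Test mode: capped at 5 pages, so the sixth page is never requested.
test_run = list(iter_pages(fake_status, max_pages=5))
```

Passing the status check in as a function is one possible way to make the loop testable; the notebook itself calls `process_page` directly.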

### Additional Useful Columns

In [80]:
def augment_data(records):
    """
    Enriches or transforms the final data before saving.
    - Parse region & Dublin area from 'Location'
    - Compute PricePerBedroom
    - Add a 'LeaseCategory' column
    etc.
    """
    new_records = []
    for rec in records:
        # Copy the original so we don't mutate in-place
        r = dict(rec)
        
        # Parse region + area
        region, dublin_area = parse_location(r["Location"])
        r["Region"] = region
        r["DublinArea"] = dublin_area
        
        # Example: price per bedroom
        if r["Price"] and r["Bedrooms"]:
            r["PricePerBedroom"] = round(r["Price"] / r["Bedrooms"], 2)
        else:
            r["PricePerBedroom"] = None
        
        # Example: short vs long lease category
        length_months = r["Lease Length (months)"]
        if length_months is not None:
            r["LeaseCategory"] = "ShortTerm" if length_months < 12 else "LongTerm"
        else:
            r["LeaseCategory"] = "Unknown"
        
        new_records.append(r)
    return new_records
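
Two edge cases in `augment_data` are easy to overlook: the truthiness check treats a bedroom count of 0 the same as a missing value, and a lease of exactly 12 months counts as long-term. The sketch below restates both rules as small standalone functions (`price_per_bedroom` and `lease_category` are hypothetical names, and the sample values are invented) so the behaviour can be checked in isolation.

```python
from typing import Optional

def price_per_bedroom(price: Optional[float],
                      bedrooms: Optional[int]) -> Optional[float]:
    """Mirror the truthiness check in augment_data: None or 0 yields None."""
    if price and bedrooms:
        return round(price / bedrooms, 2)
    return None

def lease_category(months: Optional[int]) -> str:
    """Mirror the lease bucketing: anything under 12 months is short-term."""
    if months is None:
        return "Unknown"
    return "ShortTerm" if months < 12 else "LongTerm"

# Invented sample values covering each branch.
results = {
    "two_bed": price_per_bedroom(1920.0, 2),
    "zero_bed": price_per_bedroom(1500.0, 0),
    "no_price": price_per_bedroom(None, 3),
    "exactly_12": lease_category(12),
    "three_months": lease_category(3),
    "unknown": lease_category(None),
}
```

Returning `None` for zero bedrooms avoids a division by zero, at the cost of leaving `PricePerBedroom` blank for any such listing.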

### Field Names: the extractable fields, kept in one function for ease of use

In [81]:
def get_fieldnames():
    """
    Returns the list of field names for the rental data CSV.
    """
    return [
        'ListingID',
        'Month',
        'Quarter',
        'Page',
        'Price',
        'Currency',
        'Property Type',
        'Location',
        'Bedrooms',
        'Bathrooms',
        'Parking',
        'Garden',
        'Lease Length (months)',
        'Contact',
        # New/enriched columns
        'Region',
        'DublinArea',
        'PricePerBedroom',
        'LeaseCategory'
    ]
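
Because `csv.DictWriter` raises a `ValueError` by default when a row contains a key that is not in `fieldnames`, it can help to check the records before writing. The following is a minimal sketch of such a check; the helper name `unexpected_keys` and the sample records (including the deliberately misspelled key) are assumptions for illustration.

```python
def unexpected_keys(records, fieldnames):
    """Return the sorted list of record keys that are missing from fieldnames."""
    allowed = set(fieldnames)
    extras = set()
    for record in records:
        extras.update(key for key in record if key not in allowed)
    return sorted(extras)

# Shortened, invented field list and records.
fieldnames = ["ListingID", "Price", "Region"]
records = [
    {"ListingID": 1, "Price": 1920.0, "Region": "Dublin City South"},
    {"ListingID": 2, "Price": None, "Regoin": "North Co Dublin"},  # deliberate typo
]

# Lists any key that would make DictWriter raise.
problems = unexpected_keys(records, fieldnames)
```

An empty result means every record can be written with the given header.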

### Initial Scraper: entry point that starts the scrape

In [82]:
def scrape_data(url, test_mode):
    """
    Scrapes rental data from the provided URL.

    Parameters:
        url (str): The URL to scrape data from.
        test_mode (bool): If True, only processes the first 5 pages per quarter.

    Returns:
        list: A list of rental records.
    """
    print(f"Starting scrape from {url}. Test mode is set to {test_mode}.")
    results = scrape_rentals(url, test_mode=test_mode)
    return results


### Saving Scraped Data to a CSV

In [83]:
def write_results_to_csv(csv_filename, results, fieldnames):
    """
    Writes the rental data to a CSV file.

    Parameters:
        csv_filename (str): The name of the CSV file.
        results (list): The list of rental records.
        fieldnames (list): The CSV header fields.
    """
    with open(csv_filename, mode='w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for record in results:
            writer.writerow(record)
    print(f"\nDone! Wrote {len(results)} records to {csv_filename}")

def scrape_and_save(url, csv_filename, test_mode=False):
    """
    Scrapes rental data from the given URL and writes the results to a CSV file.

    Parameters:
        url (str): The URL to scrape data from.
        csv_filename (str): The name of the CSV file to write.
        test_mode (bool): If True, only processes the first 5 pages per quarter.
    """
    results = scrape_data(url, test_mode)
    results = augment_data(results)
    
    fieldnames = get_fieldnames()
    write_results_to_csv(csv_filename, results, fieldnames)
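
To see what the CSV output looks like for missing values without creating a file, the same `csv.DictWriter` calls can be pointed at an in-memory buffer. This is a sketch under stated assumptions: `records_to_csv_text` is a hypothetical helper, and the two records are invented rather than scraped.

```python
import csv
import io

def records_to_csv_text(records, fieldnames):
    """Write records to an in-memory buffer and return the CSV as a string."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Shortened, invented field list and records.
fieldnames = ["ListingID", "Price", "Bedrooms"]
records = [
    {"ListingID": 1, "Price": 1920.0, "Bedrooms": 2},
    {"ListingID": 2, "Price": None, "Bedrooms": None},
]

csv_text = records_to_csv_text(records, fieldnames)

# Read the text back to confirm how None values are stored.
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

The `csv` module writes `None` as an empty field, so missing prices and bedroom counts come back as empty strings when the file is read again.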

### Using The Functions 

In [86]:
BASE_URL = "http://mlg.ucd.ie/modules/python/assignment1/rental/"
CSV_FILENAME = "rentals.csv"
# test_mode=False scrapes every page; test_mode=True scrapes only the first
# 5 pages per quarter (this was for testing purposes).
scrape_and_save(BASE_URL, CSV_FILENAME, test_mode=False)

Starting scrape from http://mlg.ucd.ie/modules/python/assignment1/rental/. Test mode is set to False.
Scraped Q1-Page 1, found 20 listings.
Scraped Q1-Page 2, found 20 listings.
Scraped Q1-Page 3, found 20 listings.
Scraped Q1-Page 4, found 20 listings.
Scraped Q1-Page 5, found 20 listings.
Scraped Q1-Page 6, found 20 listings.
Scraped Q1-Page 7, found 20 listings.
Scraped Q1-Page 8, found 20 listings.
Scraped Q1-Page 9, found 20 listings.
Scraped Q1-Page 10, found 20 listings.
Scraped Q1-Page 11, found 20 listings.
Scraped Q1-Page 12, found 20 listings.
Scraped Q1-Page 13, found 20 listings.
Scraped Q1-Page 14, found 20 listings.
Scraped Q1-Page 15, found 20 listings.
Scraped Q1-Page 16, found 20 listings.
Scraped Q1-Page 17, found 20 listings.
Scraped Q1-Page 18, found 20 listings.
Scraped Q1-Page 19, found 20 listings.
Scraped Q1-Page 20, found 20 listings.
Scraped Q1-Page 21, found 20 listings.
Scraped Q1-Page 22, found 20 listings.
Scraped Q1-Page 23, found 20 listings.
Scraped Q1