# Data Collection

student_number: 24284209
data_source: property sale register


## Importing Libraries
This cell imports the libraries needed for data collection:
- pandas for data manipulation and building the DataFrame
- bs4 (BeautifulSoup) for parsing the scraped HTML
- requests for fetching each page URL

In [14]:
import pandas as pd
import bs4
import requests

After inspecting the site, it is clear that the description field needs to be parsed by a separate function, because it packs the property type, style, bedroom count and bathroom count into a single string.

In [15]:
def parse_description(desc):
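    """Split a listing's description string into its separate fields.

    Args:
        desc: description text from the site, with fields separated by '; '

    Returns:
        type_: the type of the property
        style: the style of the property
        bedrooms: the number of bedrooms (as a string)
        bathrooms: the number of bathrooms (as a string)
        Any field not found in the description is returned as None.
    """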
    parts = desc.split('; ')
    type_ = None
    style = None
    bedrooms = None
    bathrooms = None
    for part in parts:
        if part.startswith('Type:'):
            type_ = part.replace('Type: ', '').strip()
        elif part.startswith('Style:'):
            style = part.replace('Style: ', '').strip()
        elif part.endswith('Bedroom') or part.endswith('Bedrooms'):
            bedrooms = part.split(' ')[0].strip()
        elif part.endswith('Bathroom') or part.endswith('Bathrooms'):
            bathrooms = part.split(' ')[0].strip()
    
    return type_, style, bedrooms, bathrooms
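
As a quick check of the splitting logic, the sketch below walks one description string through the same steps by hand. The sample string is an assumption about how the site formats descriptions (fields separated by `'; '`, inferred from the parser), not text copied from the site.

```python
# Hypothetical description string; the wording is assumed, not copied from the site.
sample = "Type: Detached; Style: 1-Storey; 3 Bedrooms; 1 Bathroom"

# Step 1: split on '; ' so each field can be inspected on its own.
parts = sample.split('; ')
print(parts)

# Step 2: labelled fields drop their 'Type: ' / 'Style: ' prefix.
type_ = parts[0].replace('Type: ', '').strip()
style = parts[1].replace('Style: ', '').strip()

# Step 3: counts are the first token of the 'N Bedrooms' / 'N Bathroom' parts.
bedrooms = parts[2].split(' ')[0].strip()
bathrooms = parts[3].split(' ')[0].strip()

print(type_, style, bedrooms, bathrooms)
```

The counts come back as strings (`'3'`, not `3`), so they still need converting to integers during preparation.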


## Data Collection Block
This block performs a basic scrape of the property sales website.
1. It iterates over each sale year and that year's page count.
2. It requests every page available for the year.
3. It finds each listing on the page and reads its sale date and sale table.
4. It goes through the table rows, storing each label with its value, or calling the description function when the label is Description.
5. It collects the records into a DataFrame, which is then written to an unprepared CSV file to be cleaned in the next task.
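
Steps 1 and 2 can be previewed without any network access by building the list of page URLs first. This is only a sketch: the page counts and URL pattern are taken from the scraping cell, while the names `base` and `urls` are illustrative.

```python
# Page counts per sale year, as used in the scraping cell.
years = {2021: 15, 2022: 17, 2023: 17, 2024: 23}

# Illustrative name for the common URL prefix.
base = "http://mlg.ucd.ie/modules/python/assignment1/property"

# Pages are numbered from 1 and zero-padded to two digits (page01, page02, ...).
urls = [
    f"{base}/{year}-page{page:02d}.html"
    for year, total_pages in years.items()
    for page in range(1, total_pages + 1)
]

print(len(urls))   # total number of pages to request
print(urls[0])     # first page of the first year
print(urls[-1])    # last page of the last year
```

Listing the URLs up front makes it easy to confirm the page count per year before any requests are sent.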

In [16]:
all_data = []

# years and page counts
years = {
    2021: 15,
    2022: 17,
    2023: 17,
    2024: 23
}

for year, total_pages in years.items():
    for page in range(1, total_pages+1):
        url = f"http://mlg.ucd.ie/modules/python/assignment1/property/{year}-page{page:02d}.html"
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.content, 'html.parser')
        content_div = soup.find('div', id='content')
        if content_div:
            ol = content_div.find('ol')
            if ol:
                listings = ol.find_all('li')
                for li in listings:
                    # guard against a listing with no 'sold' span
                    sold_span = li.find('span', class_='sold')
                    sale_date = sold_span.text.strip() if sold_span else None
                    table = li.find('table', class_='sale')
                    if table:
                        rows = table.find_all('tr')
                        data = {'Year': year, 'Sale Date': sale_date}
                        for row in rows:
                            cols = row.find_all('td')
                            if len(cols) == 2:
                                label = cols[0].text.strip().replace(':', '').strip()
                                value = cols[1].text.strip()
                                if label == 'Description':
                                    type_, style, bedrooms, bathrooms = parse_description(value)
                                    data['Type'] = type_
                                    data['Style'] = style
                                    data['Bedrooms'] = bedrooms
                                    data['Bathrooms'] = bathrooms
                                else:
                                    data[label] = value
                        all_data.append(data)  # one record per listing, added after all its rows are read

df = pd.DataFrame(all_data)

df.head()

| | Year | Sale Date | Sale Price | Property Location | Year Built | Garden | Garage | Type | Style | Bedrooms | Bathrooms | First Time Buyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | Sold 2021-01-10 | €381,302.00 | Broomhouse | 1967 | Yes | Yes | Detached | 1.5-Storey | 3 | 1 | No |
| 1 | 2021 | Sold 2021-01-10 | €325,898.00 | Broomhouse | 1978 | Yes | ??? | Detached | 1-Storey | 3 | 1 | Yes |
| 2 | 2021 | Sold 18 January 2021 | € 370,354 | Oak Park | 1961 | Yes | No | Detached | 1-Storey | 3 | 2 | No |
| 3 | 2021 | Sold 2021-01-23 | €92,480.00 | Beacon Hill | 1958 | Yes | No | Bungalow | 1-Storey | 1 | 1 | Yes |
| 4 | 2021 | Sold 2021-01-25 | €312,030.00 | Brookville | 1987 | Yes | Yes | Detached | 1-Storey | 3 | 1 | No |


In [17]:
df.to_csv('./unprepared_df.csv')