## Scraping Property Access PH through Selenium
This notebook was prepared by Adem Inovejas, Christopher Lim, Czarina Tiu, and Uriel Grace Magtibay (students of DATA102 S11 Y2022-2023).  

In your notebook, make sure you have the following details outlined:
- The website scraped
- Date and time when the data was collected
- What were the challenges encountered? You may narrate or illustrate this in the notebook.
- Do you think the collected data contains any personally identifiable information (PII)?
- Conclude with your key learnings and findings.

## Importing Libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import os
import datetime
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

## Setup

In [2]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [3]:
url = "https://propertyaccess.ph/offer/sale"
driver.get(url)

## Data Collection

### Collecting the URLs of each property

In [4]:
has_pages = True
pages = 1
urls = []
total_time = 0
while has_pages:
    start = time.perf_counter()
    
    products_container = driver.find_element(by="xpath", value='//div[@class="list-product page-content"]')
    
    # gets the list of products
    items = products_container.find_elements(by="xpath", value='.//div[@class="product-wrapper"]')
    for item in items:
        # get the href of the anchor of "product-img"
        url = item.find_element(by="xpath", value='.//div[@class="product-img"]/a').get_attribute('href')
        date_published = item.find_element(by="xpath", value='.//div[@class="date"]').text
        urls.append((url, date_published))
    
    end = time.perf_counter()
    
    total_time += end - start
    print('Extracted Page %d | %d items found | %.2fs' % (pages, len(items), end - start))
    try:
        # look for the "next page" link; find_element raises if it is missing
        next_page = driver.find_element(by="xpath", value='//ul[@class="pargination"]/li[@class="next flex-end"]/a').get_attribute("href")
        driver.get(next_page)
        pages += 1
    except Exception:
        # no next-page link on the last page, so stop the loop
        has_pages = False

print('Number of pages:', pages)
print('Number of products:', len(urls))
print('Total Time:', total_time)

Extracted Page 1 | 20 items found | 0.57s
Extracted Page 2 | 20 items found | 0.54s
Extracted Page 3 | 20 items found | 0.60s
Extracted Page 4 | 20 items found | 0.63s
Extracted Page 5 | 20 items found | 0.62s
Extracted Page 6 | 20 items found | 0.66s
Extracted Page 7 | 20 items found | 0.66s
Extracted Page 8 | 20 items found | 0.69s
Extracted Page 9 | 20 items found | 0.68s
Extracted Page 10 | 20 items found | 0.66s
Extracted Page 11 | 20 items found | 0.66s
Extracted Page 12 | 20 items found | 0.67s
Extracted Page 13 | 20 items found | 0.66s
Extracted Page 14 | 20 items found | 0.69s
Extracted Page 15 | 20 items found | 0.67s
Extracted Page 16 | 20 items found | 0.75s
Extracted Page 17 | 20 items found | 0.66s
Extracted Page 18 | 20 items found | 0.66s
Extracted Page 19 | 20 items found | 0.71s
Extracted Page 20 | 20 items found | 0.74s
Extracted Page 21 | 20 items found | 0.74s
Extracted Page 22 | 20 items found | 0.64s
Extracted Page 23 | 20 items found | 0.69s
Extracted Page 24 | 

Exporting the URLs to a CSV file through a pandas DataFrame

In [5]:
df = pd.DataFrame(urls, columns=['URL', 'Date Published'])
df.head()

|  | URL | Date Published |
|---|---|---|
| 0 | https://propertyaccess.ph/property/3-br-condo-... | Published on: 30/03/2022 |
| 1 | https://propertyaccess.ph/property/1br-condo-i... | Published on: 30/03/2022 |
| 2 | https://propertyaccess.ph/property/3-bedroom-c... | Published on: 17/07/2022 |
| 3 | https://propertyaccess.ph/property/3-bedroom-c... | Published on: 17/07/2022 |
| 4 | https://propertyaccess.ph/property/2br-condo-i... | Published on: 1/06/2022 |


In [6]:
df.to_csv('property_urls.csv', index=False)

### Going through each property page to get the details

In [4]:
df = pd.read_csv('property_urls.csv')
df.head()

|  | URL | Date Published |
|---|---|---|
| 0 | https://propertyaccess.ph/property/3-br-condo-... | Published on: 30/03/2022 |
| 1 | https://propertyaccess.ph/property/1br-condo-i... | Published on: 30/03/2022 |
| 2 | https://propertyaccess.ph/property/3-bedroom-c... | Published on: 17/07/2022 |
| 3 | https://propertyaccess.ph/property/3-bedroom-c... | Published on: 17/07/2022 |
| 4 | https://propertyaccess.ph/property/2br-condo-i... | Published on: 1/06/2022 |


In [None]:
data = []
total_time = 0
for i, item in enumerate(list(zip(df['URL'], df['Date Published']))):
    try:
        start = time.perf_counter()
        url, date_published = item
        driver.get(url)

        # get the container which contains the property details
        property_details = driver.find_element(by="xpath", value='//div[@class="product-detail-wraper"]')

        # find the name
        name = property_details.find_element(by="xpath", value='.//h1[@class="basic-info__name"]').text
        address = property_details.find_element(by="xpath", value='.//div[@class="basic-info__street"]').text
        price = property_details.find_element(by="xpath", value='.//div[@class="basic-info__price"]').text
        author = property_details.find_element(by="xpath", value='.//div[@class="agent-name line-clamp lc-2"]').text

        # get the bedrooms, showers, parking, furnish type, total developed and lot area through utility info
        utility_info = property_details.find_element(by="xpath", value='.//div[@class="basic-info__utilities"]')

        # each utility is located by its icon; the leading "." keeps the search inside utility_info
        try:
            bedrooms = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/f24d0f0d48276caeca2060fd60abf269.svg"]/..').text
        except Exception:
            bedrooms = None

        try:
            showers = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/a6e500c94a1eaacdd59fde045fba89f0.svg"]/..').text
        except Exception:
            showers = None

        try:
            furnish = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/9fb04d2eb34997da34e3dcfd140f51b6.svg"]/..').text
        except Exception:
            furnish = None

        try:
            parking = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/33d487e03cfed78b821458ab3c39b574.svg"]/..').text
        except Exception:
            parking = None

        try:
            total_developed = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/57cebe2c8bcb0aa7f604503ddb740621.svg"]/..').text
        except Exception:
            total_developed = None

        try:
            lot_area = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/da2ffe9d4238be9a3d697b4195b64fe7.svg"]/..').text
        except Exception:
            lot_area = None

        # if section title is Property Features or Amenities
        features = [feature.text for feature in property_details.find_elements(by="xpath", value='.//section[@class="product-feature"]//div[text()="Property Features" or text()="Amenities"]/..//div[@class="item-listing readmore-target"]//p')]

        # if section title is Facilities
        facilities = [facility.text for facility in property_details.find_elements(by="xpath", value='.//section[@class="product-feature"]//div[text()="Facilities"]/..//div[@class="item-listing readmore-target"]//p')]

        # if section title is Nearby Places
        nearby_places = [place.text for place in property_details.find_elements(by="xpath", value='.//section[@class="product-nearby"]//div[@class="item-title"]')]

        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        end = time.perf_counter()

        total_time += end - start
        
        print('[%4d] %s | %.2fs' % (i, name, end - start))

        data.append((name, address, author, price, bedrooms, showers, parking, furnish, total_developed, lot_area, features, facilities, nearby_places, url, timestamp))
    except Exception:
        # the page is unavailable or an expected element is missing
        print('Error occurred on', url)

[   0] 3 BR Condo in The Grand Midori Ortigas, Pasig
[   1] 1 Bedroom Condo in The Grand Midori Ortigas, Pasig
[   2] 3 Bedroom Condo in Aurelia Residences, Taguig
[   3] 3 Bedroom Condo in Shang Residences at Wack Wack, Mandaluyong
[   4] 2BR Condo in Residences at The Galleon, Pasig
[   5] Penthouse in Residences at The Galleon, Pasig
[   6] 3 Bedroom Condo in Aurelia Residences, Taguig
[   7] 3 Bedroom Condo in Valencia Hills Tower E, Quezon City
[   8] 2 Bedroom Condo in Valencia Hills Tower E, Quezon City
[   9] 1 Bedroom Condo in Valencia Hills Tower E, Quezon City
[  10] 3 Bedroom Condo in Aurelia Residences, Taguig
[  11] 2 Bedroom Condo in The Rise Makati, Makati
[  12] 3 Bedroom Condo in Aurelia Residences, Taguig
[  13] 2 Bedroom Condo in Shang Residences at Wack Wack, Mandaluyong
[  14] 1 Bedroom Condo in Shang Residences at Wack Wack, Mandaluyong
[  15] 1BR Condo in Residences at The Galleon, Pasig
[  16] 1 Bedroom Condo in The Rise, Makati
[  17] Studio Condo in The Rise,
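
The six near-identical `try`/`except` blocks in the cell above could be folded into one lookup table and a small helper. The following is a minimal sketch rather than the notebook's original code: `ICON_BASE`, `UTILITY_ICONS`, and `read_utilities` are hypothetical names, the icon file names are copied from the cell above, and the sketch assumes the icons sit inside the `utility_info` container, so a relative `.//` XPath is used.

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:
    # Fallback (assumption) so the sketch can be exercised without Selenium installed.
    class NoSuchElementException(Exception):
        pass

# Common prefix of the icon URLs used in the scraping cell.
ICON_BASE = "https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/"

# Field name -> icon file that marks the field on the property page.
UTILITY_ICONS = {
    "bedrooms": "f24d0f0d48276caeca2060fd60abf269.svg",
    "showers": "a6e500c94a1eaacdd59fde045fba89f0.svg",
    "furnish": "9fb04d2eb34997da34e3dcfd140f51b6.svg",
    "parking": "33d487e03cfed78b821458ab3c39b574.svg",
    "total_developed": "57cebe2c8bcb0aa7f604503ddb740621.svg",
    "lot_area": "da2ffe9d4238be9a3d697b4195b64fe7.svg",
}


def read_utilities(utility_info):
    """Return a dict of utility texts; a field with no matching icon maps to None."""
    values = {}
    for field, icon in UTILITY_ICONS.items():
        # The text lives on the parent of the icon, hence the trailing "/..".
        xpath = './/img[@data-src="%s%s"]/..' % (ICON_BASE, icon)
        try:
            values[field] = utility_info.find_element(by="xpath", value=xpath).text
        except NoSuchElementException:
            values[field] = None
    return values
```

Inside the loop, `utilities = read_utilities(utility_info)` would then supply `utilities["bedrooms"]`, `utilities["showers"]`, and so on. Catching only `NoSuchElementException` also lets unrelated errors surface instead of being silently turned into `None`.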

In [None]:
column_names = ['Name', 'Address', 'Author', 'Price', 'Bedrooms', 'Showers', 'Parking', 'Furnish', 'Total Developed', 'Lot Area', 'Features', 'Facilities', 'Nearby Places', 'URL', 'Timestamp']
df = pd.DataFrame(data, columns = column_names)
df.head()

In [15]:
df.describe()

|  | Name | Address | Author | Price | Bedrooms | Showers | Parking | Furnish | Total Developed | Lot Area | Features | Facilities | Nearby Places | URL | Timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100 | 100 | 100 | 100 | 93 | 93 | 93 | 93 | 100 | 28 | 100 | 100 | 100 | 100 | 100 |
| unique | 83 | 44 | 27 | 96 | 6 | 8 | 5 | 3 | 81 | 26 | 55 | 36 | 5 | 100 | 100 |
| top | 3 Bedroom Condo in Aurelia Residences, Taguig | Quezon City, Metro Manila | Henry Sedeño | Sale: ₱ 35,000,000 | 3 | 1 | 0 | Unfurnished | from 242 sqm | 330 sqm | [Fiber ready] | [] | [School, Hospital, Mall, Transportation hub] | https://propertyaccess.ph/property/3-bedroom-c... | 2022-09-27 00:57:33 |
| freq | 4 | 10 | 15 | 2 | 27 | 38 | 34 | 65 | 4 | 2 | 13 | 31 | 47 | 1 | 1 |


In [16]:
df.to_json('unclean.json', orient='records', indent=2)

In [17]:
driver.quit()  # end the WebDriver session and close every browser window

## Data Cleaning
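
The scraped values are still raw strings such as `Sale: ₱ 35,000,000`, `Published on: 30/03/2022`, and `from 242 sqm`. The following is a minimal sketch of how they might be parsed. It is not part of the original notebook: the helper names `parse_price`, `parse_published`, and `parse_area` are hypothetical, and the day/month/year order is an assumption based on sample values such as `30/03/2022`.

```python
import re
from datetime import date


def parse_price(text):
    """'Sale: ₱ 35,000,000' -> 35000000 (whole pesos); None if no number is found."""
    if not isinstance(text, str):
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group(0).replace(",", "")) if match else None


def parse_published(text):
    """'Published on: 30/03/2022' -> date(2022, 3, 30), assuming day/month/year."""
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


def parse_area(text):
    """'from 242 sqm' -> 242.0; None if the text has no '<number> sqm' part."""
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*sqm", text)
    return float(match.group(1).replace(",", "")) if match else None
```

Each helper can be applied column-wise, for example `df['Price'].map(parse_price)`. Missing values (`None` or `NaN`) are passed through as `None` rather than raising. Note that `Date Published` is stored in `property_urls.csv`, not in the detail DataFrame.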

## Challenges
- Each property's details are on its own page, so the scraper has to load every listing individually, which makes data collection slow.
- 

## Conclusion
- [Does it have personally identifiable information?]
- 
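
One input to the PII question: the `Author` column stores agent names (the `describe()` output lists a named person as its most frequent value). If those names are judged to be PII, one option is to replace them with a salted hash before sharing the data. The following is a minimal sketch, not part of the original notebook: the `pseudonymize` helper, the `agent_` prefix, and the salt value are all hypothetical.

```python
import hashlib


def pseudonymize(name, salt):
    """Map a name to a stable token; the same name and salt always give the same token."""
    if not isinstance(name, str) or not name.strip():
        return None
    # Normalise spacing and case so trivially different spellings share a token.
    normalised = name.strip().lower()
    digest = hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()
    return "agent_" + digest[:12]
```

With a salt kept out of the published notebook, `df['Author'].map(lambda n: pseudonymize(n, salt))` would keep per-agent counts intact while hiding the names. This is pseudonymization rather than full anonymization, since anyone holding the salt can re-derive the tokens.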