## Scraping Property Access PH through Selenium
This notebook was prepared by Adem Inovejas, Christopher Lim, Czarina Tiu, and Uriel Grace Magtibay (students of DATA102 S11 Y2022-2023).  

In your notebook, make sure you have the following details outlined:
- The website scraped
- Date and time when the data was collected
- What were the challenges encountered? You may narrate or illustrate this in the notebook.
- Do you think the collected data contains any personally identifiable information (PII)?
- Conclude with your key learnings and findings.

## Importing Libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import os
import datetime
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

## Setup

In [2]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [3]:
url = "https://propertyaccess.ph/offer/sale"
driver.get(url)

## Data Collection

### Collecting the URLs of each property

In [None]:
has_pages = True
pages = 1
urls = []
total_time = 0
while has_pages:
    start = time.perf_counter()
    
    products_container = driver.find_element(by="xpath", value='//div[@class="list-product page-content"]')
    
    # gets the list of products
    items = products_container.find_elements(by="xpath", value='.//div[@class="product-wrapper"]')
    for item in items:
        # get the href of the anchor of "product-img"
        url = item.find_element(by="xpath", value='.//div[@class="product-img"]/a').get_attribute('href')
        date_published = item.find_element(by="xpath", value='.//div[@class="date"]').text
        urls.append((url, date_published))
    
    end = time.perf_counter()
    
    total_time += end - start
    print('Extracted Page %d | %d items found | %.2fs' % (pages, len(items), end - start))
    try:
        # look for the next-page link; find_element raises if it is missing
        next_page = driver.find_element(by="xpath", value='//ul[@class="pargination"]/li[@class="next flex-end"]/a').get_attribute("href")
        driver.get(next_page)
        pages += 1
    except Exception:
        # no next-page link was found, so stop the loop
        has_pages = False

print('Number of pages:', pages)
print('Number of products:', len(urls))
print('Average Time per page:', total_time / pages)
print('Total Time:', total_time)

Extracted Page 1 | 20 items found | 0.57s


Exporting the URLs to a .csv file through a pandas DataFrame

In [None]:
df = pd.DataFrame(urls, columns=['URL', 'Date Published'])
df.head()

In [None]:
df.to_csv('property_urls.csv', index=False)

### Going through each property page to get the details

In [None]:
df = pd.read_csv('property_urls.csv')
df.head()

In [None]:
data = []
total_time = 0
for i, item in enumerate(list(zip(df['URL'], df['Date Published']))):
    try:
        start = time.perf_counter()
        url, date_published = item
        driver.get(url)

        # get the container which contains the property details
        property_details = driver.find_element(by="xpath", value='//div[@class="product-detail-wraper"]')

        # find the name
        name = property_details.find_element(by="xpath", value='.//h1[@class="basic-info__name"]').text
        address = property_details.find_element(by="xpath", value='.//div[@class="basic-info__street"]').text
        price = property_details.find_element(by="xpath", value='.//div[@class="basic-info__price"]').text
        author = property_details.find_element(by="xpath", value='.//div[@class="agent-name line-clamp lc-2"]').text

        # get the bedrooms, showers, parking, furnish type, total developed and lot area through utility info
        utility_info = property_details.find_element(by="xpath", value='.//div[@class="basic-info__utilities"]')

        # each utility is identified by its icon; the leading "." keeps the search inside utility_info
        try:
            bedrooms = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/f24d0f0d48276caeca2060fd60abf269.svg"]/..').text
        except Exception:
            bedrooms = None

        try:
            showers = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/a6e500c94a1eaacdd59fde045fba89f0.svg"]/..').text
        except Exception:
            showers = None

        try:
            furnish = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/9fb04d2eb34997da34e3dcfd140f51b6.svg"]/..').text
        except Exception:
            furnish = None

        try:
            parking = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/33d487e03cfed78b821458ab3c39b574.svg"]/..').text
        except Exception:
            parking = None

        try:
            total_developed = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/57cebe2c8bcb0aa7f604503ddb740621.svg"]/..').text
        except Exception:
            total_developed = None

        try:
            lot_area = utility_info.find_element(by="xpath", value='.//img[@data-src="https://cdn.propertyaccess.ph/prod/v1.2.5-fis-hotfix-2/da2ffe9d4238be9a3d697b4195b64fe7.svg"]/..').text
        except Exception:
            lot_area = None

        # if section title is Property Features or Amenities
        features = [feature.text for feature in property_details.find_elements(by="xpath", value='.//section[@class="product-feature"]//div[text()="Property Features" or text()="Amenities"]/..//div[@class="item-listing readmore-target"]//p')]

        # if section title is Facilities
        facilities = [facility.text for facility in property_details.find_elements(by="xpath", value='.//section[@class="product-feature"]//div[text()="Facilities"]/..//div[@class="item-listing readmore-target"]//p')]

        # if section title is Nearby Places
        nearby_places = [place.text for place in property_details.find_elements(by="xpath", value='.//section[@class="product-nearby"]//div[@class="item-title"]')]

        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        end = time.perf_counter()

        total_time += end - start
        
        print('[%4d] %s | %.2fs' % (i, name, end - start))

        data.append((name, address, author, price, bedrooms, showers, parking, furnish, total_developed, lot_area, features, facilities, nearby_places, url, timestamp))
    except Exception:
        # the page is unavailable or a required element is missing
        print('Error occurred on', url)
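
The six optional fields in the cell above all repeat the same try/except pattern. A minimal sketch of how that pattern could be factored into one helper follows. `text_or_none` and `FakeElement` are hypothetical names that do not appear in the original notebook, and the stand-in element exists only so the snippet runs without a browser. With Selenium, `lookup` would be something like `lambda: utility_info.find_element(by="xpath", value=...)`, and `missing` would be `(NoSuchElementException,)` imported from `selenium.common.exceptions`.

```python
def text_or_none(lookup, missing=(LookupError,)):
    """Return lookup().text, or None when the lookup raises a 'missing' error.

    `lookup` is any zero-argument callable returning an object with `.text`.
    `missing` lists the exception types that mean "element not on the page".
    """
    try:
        return lookup().text
    except missing:
        return None


class FakeElement:
    """Stand-in for a Selenium WebElement (assumption, for demonstration only)."""
    text = "3"


def find_present():
    # simulates a successful find_element call
    return FakeElement()


def find_absent():
    # simulates find_element failing because the element is not on the page
    raise LookupError("no such element")


print(text_or_none(find_present))
print(text_or_none(find_absent))
```

Catching only the "element missing" exception, rather than everything, lets unrelated failures such as a lost browser session surface instead of being silently recorded as `None`.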

In [None]:
column_names = ['Name', 'Address', 'Author', 'Price', 'Bedrooms', 'Showers', 'Parking', 'Furnish', 'Total Developed', 'Lot Area', 'Features', 'Facilities', 'Nearby Places', 'URL', 'Timestamp']
df = pd.DataFrame(data, columns = column_names)
df.head()

In [None]:
df.describe()

In [None]:
df.to_json('unclean.json', orient='records', indent=2)

In [17]:
driver.quit()  # ends the WebDriver session and closes every browser window

## Data Cleaning

In [2]:
data = pd.read_json('unclean.json')
data

Unnamed: 0,Name,Address,Author,Price,Bedrooms,Showers,Parking,Furnish,Total Developed,Lot Area,Features,Facilities,Nearby Places,URL,Timestamp
0,"3 Bedroom Condo in Aurelia Residences, Taguig","McKinley Parkway, Taguig, Metro Manila",Shang Properties,"Sale: from ₱ 107,300,000",3,4,2,Unfurnished,from 242 sqm,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Entertainment Area, Fitness C...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 00:57:33
1,3 Bedroom Condo in Shang Residences at Wack Wa...,"Wack Wack Road, Mandaluyong, Metro Manila",Shang Properties,"Sale: from ₱ 54,500,000",3,4,3,Unfurnished,from 231 sqm,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Club House, Entertainment Are...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 00:57:34
2,"2BR Condo in Residences at The Galleon, Pasig",,Ortigas Land,"Sale: from ₱ 41,500,000",2,2,2,Unfurnished,from 109 sqm,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Clubhouse, Connected to mall,...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/2br-condo-i...,2022-09-27 00:57:36
3,"Penthouse in Residences at The Galleon, Pasig","ADB Avenue, Pasig, Metro Manila",Ortigas Land,"Sale: from ₱ 111,500,000",3,4,4+,Unfurnished,from 268 sqm,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Clubhouse, Connected to mall,...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/penthouse-i...,2022-09-27 00:57:38
4,"3 Bedroom Condo in Aurelia Residences, Taguig","McKinley Parkway, Taguig, Metro Manila",Shang Properties,"Sale: from ₱ 181,302,240",3,4,3,Unfurnished,from 337 sqm,,[Central air conditioning],"[Security, CCTV, Entertainment Area, Fitness C...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 00:57:47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,"Residential Lot, Calabarzon",Calabarzon,Alyssa Barroso,"Sale: ₱ 7,700,000",,,,,from 242 sqm,275 sqm,[],[],[],https://propertyaccess.ph/property/residential...,2022-09-27 01:00:06
96,"3 Bedroom Condo in Avida Cityflex Towers, Taguig","Taguig, Metro Manila",Christine Li,"Sale: ₱ 15,000,000",3,1,0,Furnished,76 sqm,,"[Central air conditioning, Maid’s Room, Range ...","[Fitness Center, Garden/Lanai, Lobby, Play Are...","[School, Hospital, Mall]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 01:00:08
97,"Office Space, Makati","V.A. Rufino Street, Makati, Metro Manila",RBK Property Consultants Inc.,"Sale: ₱ 35,000,000",,,,,219 sqm,,[],"[Security, Building Reception Area, CCTV, Dini...",[],https://propertyaccess.ph/property/office-spac...,2022-09-27 01:00:10
98,"Office Space, Makati","V.A. Rufino Street, Makati, Metro Manila",RBK Property Consultants Inc.,"Sale: ₱ 58,000,000",,,,,213 sqm,,[],"[Security, Building Reception Area, CCTV, Dini...",[],https://propertyaccess.ph/property/office-spac...,2022-09-27 01:00:12


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Name             100 non-null    object        
 1   Address          100 non-null    object        
 2   Author           100 non-null    object        
 3   Price            100 non-null    object        
 4   Bedrooms         93 non-null     object        
 5   Showers          93 non-null     object        
 6   Parking          93 non-null     object        
 7   Furnish          93 non-null     object        
 8   Total Developed  100 non-null    object        
 9   Lot Area         28 non-null     object        
 10  Features         100 non-null    object        
 11  Facilities       100 non-null    object        
 12  Nearby Places    100 non-null    object        
 13  URL              100 non-null    object        
 14  Timestamp        100 non-null    datetime64

In [4]:
data['Price'] = data['Price'].str.replace('Sale: from ₱ ','')
data['Price'] = data['Price'].str.replace('Sale: ₱ ','')

In [5]:
data['Price'] = data['Price'].str.replace(',','').astype(int)

In [6]:
data['Price']

0     107300000
1      54500000
2      41500000
3     111500000
4     181302240
        ...    
95      7700000
96     15000000
97     35000000
98     58000000
99      3024000
Name: Price, Length: 100, dtype: int64
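
The two `str.replace` calls above depend on the exact prefixes `Sale: from ₱ ` and `Sale: ₱ `. As an alternative sketch, a regular expression can pull out the digits after the peso sign regardless of the prefix. `parse_price` is a hypothetical helper that is not part of the original notebook, and it assumes every price contains `₱` followed by a comma-grouped whole number.

```python
import re


def parse_price(raw):
    """Return the peso amount in a listing price string as an int, or None."""
    # capture the digits (with thousands separators) that follow the peso sign
    match = re.search(r"₱\s*(\d[\d,]*)", raw)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_price("Sale: from ₱ 107,300,000"))
print(parse_price("Sale: ₱ 7,700,000"))
```

It could be applied to the whole column with `data['Price'].map(parse_price)`.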

In [7]:
data['Total Developed'] = data['Total Developed'].str.replace('from','')
data['Total Developed'] = data['Total Developed'].str.replace('sqm','').astype(int)

In [8]:
data['Total Developed']

0     242
1     231
2     109
3     268
4     337
     ... 
95    242
96     76
97    219
98    213
99     52
Name: Total Developed, Length: 100, dtype: int64
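
The same idea works for the area columns. The sketch below extracts the number in front of `sqm` and returns `None` for anything that is not a string, so a missing lot area stays distinguishable from a lot area of 0. `parse_area` is a hypothetical helper. Support for decimals and thousands separators is an assumption, since the sample rows only show whole numbers.

```python
import re


def parse_area(raw):
    """Return the area in square metres as a float, or None if it cannot be read."""
    if not isinstance(raw, str):
        return None  # missing values (None, NaN, 0) are left as None
    # number, optionally with thousands separators and a decimal part, before "sqm"
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*sqm", raw)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(parse_area("from 242 sqm"))
print(parse_area("275 sqm"))
print(parse_area(None))
```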

In [9]:
data.head()

Unnamed: 0,Name,Address,Author,Price,Bedrooms,Showers,Parking,Furnish,Total Developed,Lot Area,Features,Facilities,Nearby Places,URL,Timestamp
0,"3 Bedroom Condo in Aurelia Residences, Taguig","McKinley Parkway, Taguig, Metro Manila",Shang Properties,107300000,3,4,2,Unfurnished,242,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Entertainment Area, Fitness C...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 00:57:33
1,3 Bedroom Condo in Shang Residences at Wack Wa...,"Wack Wack Road, Mandaluyong, Metro Manila",Shang Properties,54500000,3,4,3,Unfurnished,231,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Club House, Entertainment Are...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 00:57:34
2,"2BR Condo in Residences at The Galleon, Pasig",,Ortigas Land,41500000,2,2,2,Unfurnished,109,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Clubhouse, Connected to mall,...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/2br-condo-i...,2022-09-27 00:57:36
3,"Penthouse in Residences at The Galleon, Pasig","ADB Avenue, Pasig, Metro Manila",Ortigas Land,111500000,3,4,4+,Unfurnished,268,,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Clubhouse, Connected to mall,...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/penthouse-i...,2022-09-27 00:57:38
4,"3 Bedroom Condo in Aurelia Residences, Taguig","McKinley Parkway, Taguig, Metro Manila",Shang Properties,181302240,3,4,3,Unfurnished,337,,[Central air conditioning],"[Security, CCTV, Entertainment Area, Fitness C...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27 00:57:47


In [10]:
data['Timestamp'] = pd.to_datetime(data['Timestamp']).dt.date

In [11]:
data['Timestamp']

0     2022-09-27
1     2022-09-27
2     2022-09-27
3     2022-09-27
4     2022-09-27
         ...    
95    2022-09-27
96    2022-09-27
97    2022-09-27
98    2022-09-27
99    2022-09-27
Name: Timestamp, Length: 100, dtype: object

In [12]:
# Checking null values per column
data.isna().sum()

Name                0
Address             0
Author              0
Price               0
Bedrooms            7
Showers             7
Parking             7
Furnish             7
Total Developed     0
Lot Area           72
Features            0
Facilities          0
Nearby Places       0
URL                 0
Timestamp           0
dtype: int64

In [13]:
data=data.fillna(0)

In [14]:
data.isna().sum()

Name               0
Address            0
Author             0
Price              0
Bedrooms           0
Showers            0
Parking            0
Furnish            0
Total Developed    0
Lot Area           0
Features           0
Facilities         0
Nearby Places      0
URL                0
Timestamp          0
dtype: int64

In [15]:
data['Lot Area'] = data['Lot Area'].str.replace(' sqm','')

In [16]:
data['Lot Area'] = data['Lot Area'].fillna(0)

In [17]:
data['Lot Area'] = data['Lot Area'].astype(int)

In [18]:
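# regex=False treats '+' as a literal character instead of a regular-expression quantifier
data['Showers'] = data['Showers'].str.replace('+', '', regex=False)
data['Parking'] = data['Parking'].str.replace('+', '', regex=False)
data['Bedrooms'] = data['Bedrooms'].str.replace('+', '', regex=False)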



In [19]:
data['Showers'] = data['Showers'].astype(float)
data['Parking'] = data['Parking'].astype(float)
data['Bedrooms'] = data['Bedrooms'].astype(float)
# NOTE: int cannot be used because some values have decimals (0.5 increments).
# Listings with more than 5 of an item show the count with a trailing '+',
# which the cell above removes so that the values can be cast to float.
# The cast still fails on non-numeric labels; see the error message below.

ValueError: could not convert string to float: 'Studio'
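
The `ValueError` above comes from listings whose bedroom count is the label `Studio` rather than a number. One possible way to handle it is sketched here. `parse_count` is a hypothetical helper, and treating a studio as 0 bedrooms is an assumption that should be checked against how the data will be analysed.

```python
def parse_count(raw, studio_value=0.0):
    """Convert a scraped count such as '3', '1.5', '4+' or 'Studio' to a float.

    Returns None when the value is missing or is not a recognisable number.
    `studio_value` is the number used for 'Studio' (assumed to be 0 here).
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() == "studio":
        return studio_value
    try:
        # drop the trailing '+' used for open-ended counts before converting
        return float(text.rstrip("+"))
    except ValueError:
        return None


for value in ["3", "1.5", "4+", "Studio", None]:
    print(repr(value), "->", parse_count(value))
```

It could be applied with `data['Bedrooms'].map(parse_count)`, and likewise for `Showers` and `Parking`. Note that stripping the `+` loses the fact that the count was open-ended.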

In [48]:
# encoding the furnish status as integer codes: 0 = Unfurnished, 1 = Furnished, 2 = Semi-furnished
data['Furnish'] = data['Furnish'].replace({'Unfurnished':0,'Furnished':1,'Semi-furnished':2})
data['Furnish'] = data['Furnish'].astype(int)
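
Because `fillna(0)` ran earlier, listings with no furnish value already hold 0 and become indistinguishable from `Unfurnished` after this mapping. The following is a small sketch of an encoding that keeps them apart, using the same three codes plus a sentinel of -1 for missing values. `encode_furnish`, `FURNISH_CODES` and `MISSING_CODE` are illustrative names that are not in the original notebook, and the choice of -1 is an assumption.

```python
FURNISH_CODES = {"Unfurnished": 0, "Furnished": 1, "Semi-furnished": 2}
MISSING_CODE = -1  # hypothetical sentinel for listings without a furnish value


def encode_furnish(raw):
    """Map a furnish label to its integer code, or MISSING_CODE if unknown."""
    return FURNISH_CODES.get(raw, MISSING_CODE)


for label in ["Unfurnished", "Furnished", "Semi-furnished", None, 0]:
    print(repr(label), "->", encode_furnish(label))
```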

In [49]:
data.dtypes

Name                object
Address             object
Author              object
Price                int64
Bedrooms           float64
Showers            float64
Parking            float64
Furnish              int64
Total Developed      int64
Lot Area             int64
Features            object
Facilities          object
Nearby Places       object
URL                 object
Timestamp           object
dtype: object

In [50]:
data.head()

Unnamed: 0,Name,Address,Author,Price,Bedrooms,Showers,Parking,Furnish,Total Developed,Lot Area,Features,Facilities,Nearby Places,URL,Timestamp
0,"3 Bedroom Condo in Aurelia Residences, Taguig","McKinley Parkway, Taguig, Metro Manila",Shang Properties,107300000,4.0,4.0,4.0,0,242,0,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Entertainment Area, Fitness C...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27
1,3 Bedroom Condo in Shang Residences at Wack Wa...,"Wack Wack Road, Mandaluyong, Metro Manila",Shang Properties,54500000,4.0,4.0,4.0,0,231,0,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Club House, Entertainment Are...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27
2,"2BR Condo in Residences at The Galleon, Pasig",,Ortigas Land,41500000,2.0,2.0,2.0,0,109,0,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Clubhouse, Connected to mall,...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/2br-condo-i...,2022-09-27
3,"Penthouse in Residences at The Galleon, Pasig","ADB Avenue, Pasig, Metro Manila",Ortigas Land,111500000,4.0,4.0,4.0,0,268,0,"[Central air conditioning, Balcony, Built-in w...","[Security, CCTV, Clubhouse, Connected to mall,...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/penthouse-i...,2022-09-27
4,"3 Bedroom Condo in Aurelia Residences, Taguig","McKinley Parkway, Taguig, Metro Manila",Shang Properties,181302240,4.0,4.0,4.0,0,337,0,[Central air conditioning],"[Security, CCTV, Entertainment Area, Fitness C...","[School, Hospital, Mall, Transportation hub]",https://propertyaccess.ph/property/3-bedroom-c...,2022-09-27


In [51]:
data.to_json('cleaned data.json',orient='columns')

## Challenges
- Each property page had to be visited individually to collect its details, which made the scraping slow.
- Showers, parking, and bedrooms have values in 0.5 increments, and counts above 5 are shown only as `5+`.

## Conclusion
- [Does it have personal identifiable information?]