## Daft Scraping

### Get individual rental ad URLs
Pagination is done using the 'offset' parameter in the search results URL, so we can use it to browse through the results pages.  
Daft displays 20 results per page, hence the step of '20' in this line:
       `for offset in range(0, number_of_adds, 20):`
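
As a minimal sketch of how the offsets map to result pages, the snippet below builds one URL per page. The helper name `page_urls` and the shortened URL root are hypothetical and used only for illustration; the real search URL carries extra query parameters before `offset=`.

```python
def page_urls(url_root, number_of_ads, page_size=20):
    """Return one search-results URL per page (hypothetical helper)."""
    return [url_root + str(offset) for offset in range(0, number_of_ads, page_size)]

# Shortened, illustrative URL root; not the full query string used in the notebook
urls = page_urls('http://www.daft.ie/dublin/apartments-for-rent/?offset=', 700)
print(len(urls))   # 35 pages for an upper bound of 700 ads at 20 per page
print(urls[1])     # the second page starts at offset 20
```

The last offset requested is 680, so an upper bound of 700 covers up to 700 ads.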
    

In [13]:
from bs4 import BeautifulSoup
import urllib.request

daftresults_urlroot = 'http://www.daft.ie/dublin/apartments-for-rent/?s%5Bignored_agents%5D%5B0%5D=5732&s%5Bignored_agents%5D%5B1%5D=428&s%5Bignored_agents%5D%5B2%5D=1551&offset='
allAdUrls = []
number_of_adds = 700

def getAdUrls(pageresults):
    
    adURLs = []
    for result in pageresults:
        adURLs.append("http://www.daft.ie" + result.a["href"])
    return adURLs

for offset in range(0, number_of_adds, 20):
    results_html = urllib.request.urlopen(daftresults_urlroot + str(offset)).read()
    soup = BeautifulSoup(results_html, "html5lib")
    results = soup.find_all("div", class_="search_result_title_box")
    allAdUrls = allAdUrls + getAdUrls(results)

#print(allAdUrls)
print('Current number of Dublin rental ads: ' + str(len(allAdUrls)))


Current number of Dublin rental ads: 639
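
The hard-coded `number_of_adds = 700` is only an upper bound (639 ads were found). One possible alternative, sketched below, is to keep requesting pages until one comes back empty. This assumes that a results page past the last ad contains no result boxes, which was not verified against Daft. `collect_ad_urls`, `fetch_page_urls` and `fake_fetch` are hypothetical names; in the notebook, `fetch_page_urls` would wrap the `urlopen`, `BeautifulSoup` and `getAdUrls` calls from the cell above.

```python
def collect_ad_urls(fetch_page_urls, page_size=20, max_pages=100):
    """Collect ad URLs page by page until a page comes back empty.

    fetch_page_urls(offset) is a hypothetical callable returning the list
    of ad URLs found on the results page at that offset.
    """
    all_urls = []
    for page in range(max_pages):          # hard cap as a safety net
        page_urls = fetch_page_urls(page * page_size)
        if not page_urls:                  # empty page: no more results
            break
        all_urls.extend(page_urls)
    return all_urls

# Stand-in fetcher that simulates a site with 45 ads, for illustration only
def fake_fetch(offset, total=45, page_size=20):
    return ['http://www.daft.ie/ad/' + str(i)
            for i in range(offset, min(offset + page_size, total))]

ad_urls = collect_ad_urls(fake_fetch)
print(len(ad_urls))
```

The `max_pages` cap keeps the loop from running forever if the site never returns an empty page.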


### Download individual Ad pages

This section loops through the list of individual rental ad URLs, and downloads them into a 'daftpages' directory.  

In [14]:
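# Rental ad URLs are now in this list: allAdUrls
# Loop through and download each page

import os

os.makedirs('daftpages', exist_ok=True)  # urlretrieve fails if the directory is missing
for idx, adUrl in enumerate(allAdUrls):
    urllib.request.urlretrieve(adUrl, 'daftpages/daft_ad_' + str(idx) + '.html')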
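
A single failed request aborts the loop above. The sketch below shows one way to keep going and record which downloads failed. The function name `download_pages`, the `delay` between requests and the injectable `download` callable are illustrative assumptions, not part of the original notebook.

```python
import os
import time
import urllib.request

def download_pages(urls, target_dir='daftpages', delay=1.0,
                   download=urllib.request.urlretrieve):
    """Download each URL into target_dir and return the indices that failed."""
    os.makedirs(target_dir, exist_ok=True)
    failed = []
    for idx, url in enumerate(urls):
        path = os.path.join(target_dir, 'daft_ad_' + str(idx) + '.html')
        try:
            download(url, path)
        except OSError as err:
            # urllib's URLError and HTTPError are subclasses of OSError
            print('could not download ' + url + ': ' + str(err))
            failed.append(idx)
        time.sleep(delay)  # pause between requests to go easy on the server
    return failed
```

In the notebook this could be called as `failed = download_pages(allAdUrls)`. The file names keep the same `daft_ad_<idx>.html` pattern, so the later scraping cell would still find them.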

### Zip up pages

In [21]:
import tarfile
import datetime
import os

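current_date = datetime.datetime.now().isoformat()
tar_file_name = 'daftpages_' + current_date + '.tar.gz'  # "w:gz" writes a gzip-compressed archive
source_dir = 'daftpages/'
with tarfile.open(tar_file_name, "w:gz") as tar:
    # normpath strips the trailing slash; without it, basename() returns ''
    tar.add(source_dir, arcname=os.path.basename(os.path.normpath(source_dir)))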
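
The ISO timestamp contains colons, which some file systems reject in file names. A minimal sketch of a variant that uses a colon-free timestamp and lists the archive members as a sanity check follows. `archive_dir` and `list_archive` are hypothetical helper names, and the timestamp format is just one reasonable choice.

```python
import datetime
import os
import tarfile

def archive_dir(source_dir, prefix='daftpages_'):
    """Write source_dir to a gzip-compressed tar and return the archive name."""
    # A strftime pattern without ':' keeps the name valid where colons are rejected
    stamp = datetime.datetime.now().strftime('%Y%m%dT%H%M%S')
    tar_file_name = prefix + stamp + '.tar.gz'
    # normpath drops a trailing slash so basename() returns the directory name
    arcname = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_file_name, 'w:gz') as tar:
        tar.add(source_dir, arcname=arcname)
    return tar_file_name

def list_archive(tar_file_name):
    """Return the member names stored in the archive, to verify its contents."""
    with tarfile.open(tar_file_name, 'r:gz') as tar:
        return tar.getnames()
```

Calling `list_archive(archive_dir('daftpages/'))` should return the directory name followed by one entry per downloaded page.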

### Start scraping

In [144]:
import pandas as pd
import re
import json
import csv


num_of_rows = 639 #number_of_adds
data_csv = 'data/scraped_data.csv'
all_field_names = [
    'property_id',
    'property_category',
    'property_title',
    'property_type',
    'seller_name',
    'seller_id',
    'seller_type',
    'open_viewing',
    'no_of_photos',
    'available_from',
    'lease_units',
    'available_for',    
    'area',
    'county',
    'latitude',
    'longitude',    
    'furnished',
    'bathrooms',   
    'beds',   
    'facility',    
    'environment',
    'published_date',
    'page_name',
    'platform',
    'currency',
    'price_frequency',
    'price'
]

with open(data_csv, 'w', newline='') as csvfile:
    # Create the writer once and write the header up front, so the header
    # is present even if the first page cannot be read
    writer = csv.DictWriter(csvfile, fieldnames=all_field_names)
    writer.writeheader()

    for idx in range(num_of_rows):
        try:
            with open('daftpages/daft_ad_' + str(idx) + '.html') as adpage:
                soup = BeautifulSoup(adpage.read(), "html5lib")
        except Exception:
            # seems like some pages have encoding issues?
            print('issue reading in page daftpages/daft_ad_' + str(idx) + '.html. Skipping this Ad.')
            continue

        #print(soup)
        # There is a handy JavaScript JSON dictionary on those Daft pages, listing key features of the ad.
        # To get this data, find all script tags, then take the contents of the one at index 10.
        # This is brittle: it relies on the tag always being at the same position on the page
        # and needs better targeting, but it is enough for now.
        scriptdata = soup.find_all('script', type='text/javascript')
        trackingparams = scriptdata[10].get_text()
        trackingparams = trackingparams.replace('\u20ac','')

        try:
            feature_str = "{" + str(re.search('\\{(.+?)\\}', trackingparams).group(1)) + "}"
        except AttributeError:
            feature_str = "{}"

        ad_data = json.loads(feature_str)
        # check for missing fields (mostly seller_id and seller_name), and add them with empty vals if required
        missing_fields = set(all_field_names) - set(ad_data)
        for missing in missing_fields:
            ad_data[missing] = ""

        writer.writerow(ad_data)


issue reading in page daftpages/daft_ad_26.html
issue reading in page daftpages/daft_ad_102.html
issue reading in page daftpages/daft_ad_164.html
issue reading in page daftpages/daft_ad_168.html
issue reading in page daftpages/daft_ad_173.html
issue reading in page daftpages/daft_ad_180.html
issue reading in page daftpages/daft_ad_181.html
issue reading in page daftpages/daft_ad_182.html
issue reading in page daftpages/daft_ad_259.html
issue reading in page daftpages/daft_ad_353.html
issue reading in page daftpages/daft_ad_400.html
issue reading in page daftpages/daft_ad_508.html
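
Two weak spots in the cell above are the pages that fail to decode and the reliance on a fixed script position. The sketch below, using only the standard library, reads the page as bytes, replaces undecodable characters instead of raising, and picks the script by content rather than by index. It assumes the pages are mostly UTF-8 and that the tracking script mentions the `property_id` key; both are assumptions, and `ScriptCollector`, `extract_ad_data` and the sample fragment are made up for illustration.

```python
import json
import re
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""
    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data

def extract_ad_data(raw_bytes, marker='property_id'):
    """Return the first {...} dictionary from the script that mentions marker."""
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising
    html = raw_bytes.decode('utf-8', errors='replace')
    collector = ScriptCollector()
    collector.feed(html)
    for script in collector.scripts:
        if marker in script:
            match = re.search(r'\{.+?\}', script, re.DOTALL)
            if match:
                return json.loads(match.group(0))
    return {}

# Made-up page fragment with one invalid byte (0xff), for illustration only
sample = (b'<html><script>var a = 1;</script>'
          b'<p>caf\xff</p>'
          b'<script>var trackingParam = {"property_id": 1, "beds": 2};</script></html>')
print(extract_ad_data(sample))
```

In the notebook, the bytes would come from `open(path, 'rb').read()`. Like the original regular expression, the non-greedy match stops at the first closing brace, so it would not handle nested dictionaries.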


### Create Pandas Dataframe from CSV

In [151]:


data_csv = 'data/scraped_data.csv'
df = pd.read_csv(data_csv)

# drop columns that are not useful for the analysis
df = df.drop(columns=['environment', 'page_name', 'platform', 'property_category'])
df.head()

| | property_id | property_title | property_type | seller_name | seller_id | seller_type | open_viewing | no_of_photos | available_from | lease_units | ... | latitude | longitude | furnished | bathrooms | beds | facility | published_date | currency | price_frequency | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1770917 | River Park, Conyngham Road, Dublin 8, Dublin 8 | apartment | | | private | no | 7 | 2017-09-11 | months | ... | 53.348092 | -6.302529 | yes | 1 | 1 | Parking,Central Heating,Cable Television,Washi... | 2017-09-10 | € | monthly | 1575 |
| 1 | 1771948 | Temple Hill, Terenure Road West, Terenure, Dub... | apartment | | | private | no | 7 | 2017-09-13 | months | ... | 53.311233 | -6.291145 | yes | 2 | 2 | Parking,Central Heating,House Alarm,Cable Tele... | 2017-09-10 | € | monthly | 1950 |
| 2 | 1770493 | rosebank view, Clondalkin, Dublin 22 | apartment | | | private | no | 10 | 2017-09-09 | months | ... | 53.328208 | -6.395234 | yes | 1 | 1 | Parking,Central Heating,House Alarm,Cable Tele... | 2017-09-10 | € | monthly | 1200 |
| 3 | 1762956 | Wavendon, 69 Northumberland Road, Ballsbridge,... | apartment | MD Property Management | 6534.0 | agent | no | 23 | 2017-09-08 | months | ... | 53.333177 | -6.236485 | either | 2 | 2 | Parking,Central Heating,Washing Machine,Dishwa... | 2017-09-10 | € | monthly | 3500 |
| 4 | 1762953 | New Bancroft, Greenhills Road, Tallaght, Dubli... | apartment | MD Property Management | 6534.0 | agent | no | 13 | 2017-09-29 | months | ... | 53.287775 | -6.357701 | yes | 2 | 2 | Central Heating,Washing Machine,Dryer,Dishwash... | 2017-09-10 | € | monthly | 1600 |


In [152]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 627 entries, 0 to 626
Data columns (total 23 columns):
property_id        627 non-null int64
property_title     627 non-null object
property_type      627 non-null object
seller_name        473 non-null object
seller_id          473 non-null float64
seller_type        627 non-null object
open_viewing       627 non-null object
no_of_photos       627 non-null int64
available_from     627 non-null object
lease_units        627 non-null object
available_for      627 non-null int64
area               627 non-null object
county             627 non-null object
latitude           627 non-null float64
longitude          627 non-null float64
furnished          627 non-null object
bathrooms          627 non-null int64
beds               627 non-null int64
facility           587 non-null object
published_date     627 non-null object
currency           627 non-null object
price_frequency    627 non-null object
price              627 non-null int64
d