# Australian General Practice location scraper

5 October 2021

---

**Description**

It is difficult to find existing datasets for general practice locations in Australia (e.g., from data.gov.au or data.vic.gov.au). Healthdirect appears to be the best official source of health provider locations. 

This notebook contains basic code that scrapes healthdirect.gov.au for the names and locations of all general practice sites in Victoria. It can be modified to cover any other state, the whole country, or just a subset of postcodes.


**Instructions**
- Make sure that pandas, requests and lxml are installed.
- Ensure that the path to the reference postcode data is specified correctly.
- Run each of the cells in order.
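
One way to confirm the first instruction before running the notebook is to check that each package can be found. This is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names of top-level packages that cannot be found."""
    return [name for name in names if find_spec(name) is None]


required = ["pandas", "requests", "lxml"]
missing = missing_packages(required)
if missing:
    print("Install these packages first:", ", ".join(missing))
else:
    print("All required packages are available.")
```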

In [89]:
import requests
import pandas as pd
from lxml import html
from time import sleep

import json

from requests.adapters import HTTPAdapter
# import Retry from urllib3 directly; requests.packages.urllib3 is only a legacy alias
from urllib3.util.retry import Retry

## Define paths and utility functions

- Specify the required data elements from the website
- Create utility function to extract data elements from web page
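
To illustrate how an XPath expression pulls attribute values out of a page, here is a small sketch using lxml. The markup and the class name `result` are made up for illustration; the real healthdirect page structure may differ.

```python
from lxml import html

# Hypothetical markup, loosely modelled on the attribute layout that the
# XPath expressions in this notebook assume; the real page may differ.
sample = """
<html><body>
  <div class="result">
    <a href="/clinic/example-one" data-address="1 Example St"
       data-lat="-37.80" data-long="144.90">One</a>
  </div>
  <div class="result">
    <a href="/clinic/example-two" data-address="2 Example Rd"
       data-lat="-37.90" data-long="145.00">Two</a>
  </div>
</body></html>
"""

tree = html.fromstring(sample)

# '/@attr' at the end of a path returns the attribute values as strings
links = tree.xpath('.//div[@class="result"]/a/@href')
lats = tree.xpath('.//div[@class="result"]/a/@data-lat')

print(links)
print(lats)
```

Note that `@class="..."` compares the whole attribute string exactly, so any trailing space in a class name has to appear in the path as well.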

In [206]:
# these define the elements within the webpage where name and location information is recorded
path_name = './/div[@class="veyron-hsf-page "]/a/@href'
path_address1 = './/div[@class="veyron-hsf-page "]/a/@data-address'
path_lat = './/div[@class="veyron-hsf-page "]/a/@data-lat'
path_long = './/div[@class="veyron-hsf-page "]/a/@data-long'

In [207]:
# for the pages of the individual practices, these are the paths with relevant information
path_address2 = './/p[@class="hsf-service_details-data-address veyron-hsf-full-address"]/@data-full-address'
path_opening_hours_day = './/div[@class="hsf-service_details-data-hours-group"]/ul/li/span[@class="hsf-service_details-data-hours-weekday"]/text()'
path_opening_hours_time = './/div[@class="hsf-service_details-data-hours-group"]/ul/li/span[@class="hsf-service_details-data-hours-times"]/text()'
path_fees = './/p[@class="hsf-service_details-data-billing"]/text()'
path_details = './/p[@class="hsf-service_details-data-description"]/span/text()'

path_javascript = './/script[@type="text/javascript"]/text()'

In [208]:
# the data elements that we are interested in
data_scheme = {'Name': path_name,
               'Address': path_address1,
               'Latitude': path_lat,
               'Longitude': path_long}

In [209]:
def extract_js(js_text, start_marker='[{"', end_marker='}],\n', offset=2):
    """
    Slice the JSON array out of the javascript text and parse it into a list of dicts.
    This contains the latitude, longitude etc.
    """
    start_index, end_index = js_text.find(start_marker), js_text.find(end_marker)
    output = js_text[start_index:end_index+offset]
    return json.loads(output)

def get_page_info(url):
    """
    Extract all the info we are interested in for a given URL
    """
    page = requests.get(url)
    tree = html.fromstring(page.text)
    address = tree.xpath(path_address2)  # path_address2 is the full-address path for individual practice pages
    opening_hours_day = tree.xpath(path_opening_hours_day)
    opening_hours_time = tree.xpath(path_opening_hours_time)
    fees = tree.xpath(path_fees)
    js_text = tree.xpath(path_javascript)
    js_data = extract_js(js_text[9])
    
    details = ' '.join(tree.xpath(path_details)).strip()
    
    latitude = float(js_data[0]['location']['physicalLocation']['geocode']['latitude'])
    longitude = float(js_data[0]['location']['physicalLocation']['geocode']['longitude'])
    
    # create output dataframe
    df_temp = pd.DataFrame(data={'url': url, 'address': address, 'fees': ' '.join(fees).strip(), 
                                 'details': details,
                                 'latitude': latitude, 'longitude': longitude})
    
    return df_temp
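
To see how the marker-based slicing in `extract_js` behaves, the following sketch runs a copy of the function on a made-up script string. The sample payload is an assumption for illustration; the real script content on healthdirect is larger and may be laid out differently.

```python
import json


def extract_js(js_text, start_marker='[{"', end_marker='}],\n', offset=2):
    """Slice the JSON array out of a script string and parse it."""
    start_index, end_index = js_text.find(start_marker), js_text.find(end_marker)
    # offset=2 keeps the closing '}]' of the array in the slice
    return json.loads(js_text[start_index:end_index + offset])


# Made-up script content for illustration only
js_text = (
    'var config = {\n'
    '  "services": [{"name": "Example Clinic", "location": {"physicalLocation": '
    '{"geocode": {"latitude": "-37.8", "longitude": "144.9"}}}}],\n'
    '  "count": 1\n'
    '};'
)

js_data = extract_js(js_text)
geocode = js_data[0]['location']['physicalLocation']['geocode']
print(float(geocode['latitude']), float(geocode['longitude']))
```

If either marker is missing, `str.find` returns -1 and the slice will usually not be valid JSON, so `json.loads` raises an error rather than returning partial data.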

In [210]:
def get_locations(text, data_scheme):
    """
    Retrieve location information in web page
    
    Args
    ----
    text (str): the scraped data from the webpage containing
    required information
    data_scheme (dict): specifies paths for each data element
    
    Returns
    -------
    pd.DataFrame with the parsed data
    """
    tree = html.fromstring(text)
    names = data_scheme.keys()
    output_data = dict()
    
    for name in names:
        path = data_scheme[name]
        item_data = tree.xpath(path)
        output_data[name] = item_data
        
    return pd.DataFrame(output_data)

## Create list of suburb-postcode pairs

Using the reference postcode data, create the suburb-postcode pairs for which location data will be scraped. The location of the reference file is specified in the ```folder``` and ```filename``` variables.
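
As a rough sketch of the target format, the helper below builds one suburb-postcode key in plain Python. The function name `make_suburb_postcode` is hypothetical; it mirrors the column operations in the cells that follow (spaces to underscores, lower case, then a hyphen and the postcode).

```python
def make_suburb_postcode(suburb, postcode):
    """Format a suburb name and postcode the way the results URLs expect."""
    return f"{suburb.replace(' ', '_').lower()}-{postcode}"


print(make_suburb_postcode("Aire Valley", 3237))   # aire_valley-3237
print(make_suburb_postcode("ABBOTSFORD", "3067"))  # abbotsford-3067
```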

In [211]:
folder = '/home/alex/Desktop/Data/reference_data'
filename = f'{folder}/postcodes_scraped_221021.csv'
df = pd.read_csv(filename)

In [212]:
# if we are only interested in the VIC postcodes (assuming that the file has a 'state' column)
# .copy() gives an independent DataFrame, so the column assignments in the
# following cells do not raise pandas' SettingWithCopyWarning
df_vic = df.query('state == "VIC"').copy()

In [213]:
df_vic['suburb_formatted'] = df_vic['suburb'].apply(lambda s: s.replace(' ', '_'))

In [214]:
df_vic['suburb_postcode'] = df_vic['suburb_formatted'].str.lower() + '-' + df_vic['postcode'].astype(str)


In [215]:
suburb_postcodes = list(df_vic['suburb_postcode'].values)

pages = list(range(1,20)) # candidate page range per suburb-postcode pair; note that the URL list below uses range(1, 5) instead

## URL list to scrape

Create the list of URLs to scrape: one URL per results page for each required suburb-postcode pair.
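
The sketch below shows how the URLs for a single suburb-postcode pair are assembled. The URL parts are copied from the cells that follow; the helper name `build_urls` is hypothetical.

```python
url_base = 'https://www.healthdirect.gov.au/australian-health-services/results/'
url_middle = '/tihcs-aht-11222/gp-general-practice?pageIndex='
url_end = '&tab=SITE_VISIT'


def build_urls(suburb_postcode, pages):
    """Return one results-page URL per page number for a suburb-postcode pair."""
    return [f'{url_base}{suburb_postcode}{url_middle}{n}{url_end}' for n in pages]


urls_example = build_urls('abbotsford-3067', range(1, 5))
for u in urls_example:
    print(u)
```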

In [233]:
# the basic url
url_base = 'https://www.healthdirect.gov.au/australian-health-services/results/'
url_base2 = 'https://www.healthdirect.gov.au'
url_middle = '/tihcs-aht-11222/gp-general-practice?pageIndex='
url_end = '&tab=SITE_VISIT'

In [217]:
# create the list of URLs to scrape
urls = [f'{url_base}{suburb_postcode}{url_middle}{n}{url_end}' 
        for suburb_postcode in suburb_postcodes 
        for n in range(1, 5)]

In [218]:
len(suburb_postcodes), len(urls)

(3183, 12732)

## Scrape the data

For each URL in the list ```urls```, request the page, parse it and append the result to the list of dataframes ```df_list```. The loop below only processes the first 100 URLs (```urls[:100]```).
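
The request-parse-append pattern can be sketched independently of the network as follows. This is an illustration under stated assumptions: `scrape_all`, its `fetch` and `parse` parameters, and the pause between requests are all hypothetical additions, and the stand-in fetcher never contacts a server.

```python
from time import sleep


def scrape_all(urls, fetch, parse, delay_seconds=1.0):
    """
    Fetch and parse each URL, pausing between requests.

    fetch(url) must return a (status_code, text) tuple and parse(text)
    returns the parsed result. Returns (results, failed_urls).
    """
    results, failed = [], []
    for url in urls:
        status_code, text = fetch(url)
        if status_code == 200:
            results.append(parse(text))
        else:
            # keep track of failures so they can be retried later
            failed.append(url)
        sleep(delay_seconds)  # pause so the server is not hammered
    return results, failed


# Stand-in fetcher for illustration only; no network access is involved.
def fake_fetch(url):
    return (200, f'<html>{url}</html>') if 'good' in url else (503, '')


results, failed = scrape_all(['good-1', 'bad-1', 'good-2'], fake_fetch, len,
                             delay_seconds=0)
print(len(results), failed)
```

In the notebook, `fetch` would wrap the session's `get` call and return the response's `status_code` and `text`, while `parse` would call `get_locations` with `data_scheme`.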

In [261]:
print(f'There are {len(urls)} URLs.')

There are 12732 URLs.


In [262]:
df_list = []
urls_bad = []

# retry failed requests up to 3 times on common transient errors
# this makes the scraper more robust
retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"]  # this argument was named method_whitelist before urllib3 1.26
)

adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

# loop through the URLs, scrape each one and parse the data
for n, url in enumerate(urls[:100]):
    print(f'{n}: Scraping URL: {url}')
    page = http.get(url)
    print(f'status code: {page.status_code}')
    if page.status_code == 200:
        # get_locations already returns a DataFrame
        df_list.append(get_locations(page.text, data_scheme))
    else:
        # record the failure; nothing is appended for this URL
        print('Unsuccessful request')
        urls_bad.append(url)



0: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/abbeyard-3737/tihcs-aht-11222/gp-general-practice?pageIndex=1&tab=SITE_VISIT
status code: 200
1: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/abbeyard-3737/tihcs-aht-11222/gp-general-practice?pageIndex=2&tab=SITE_VISIT
status code: 200
2: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/abbeyard-3737/tihcs-aht-11222/gp-general-practice?pageIndex=3&tab=SITE_VISIT
status code: 200
3: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/abbeyard-3737/tihcs-aht-11222/gp-general-practice?pageIndex=4&tab=SITE_VISIT
status code: 200
4: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/abbotsford-3067/tihcs-aht-11222/gp-general-practice?pageIndex=1&tab=SITE_VISIT
status code: 200
5: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/abbotsford-3067/tihcs-ah

status code: 200
46: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/aintree-3336/tihcs-aht-11222/gp-general-practice?pageIndex=3&tab=SITE_VISIT
status code: 200
47: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/aintree-3336/tihcs-aht-11222/gp-general-practice?pageIndex=4&tab=SITE_VISIT
status code: 200
48: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/aire_valley-3237/tihcs-aht-11222/gp-general-practice?pageIndex=1&tab=SITE_VISIT
status code: 200
49: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/aire_valley-3237/tihcs-aht-11222/gp-general-practice?pageIndex=2&tab=SITE_VISIT
status code: 200
50: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/aire_valley-3237/tihcs-aht-11222/gp-general-practice?pageIndex=3&tab=SITE_VISIT
status code: 200
51: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/resu

status code: 200
92: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/port_albert-3971/tihcs-aht-11222/gp-general-practice?pageIndex=1&tab=SITE_VISIT
status code: 200
93: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/port_albert-3971/tihcs-aht-11222/gp-general-practice?pageIndex=2&tab=SITE_VISIT
status code: 200
94: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/port_albert-3971/tihcs-aht-11222/gp-general-practice?pageIndex=3&tab=SITE_VISIT
status code: 200
95: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/port_albert-3971/tihcs-aht-11222/gp-general-practice?pageIndex=4&tab=SITE_VISIT
status code: 200
96: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/results/albion-3020/tihcs-aht-11222/gp-general-practice?pageIndex=1&tab=SITE_VISIT
status code: 200
97: Scraping URL: https://www.healthdirect.gov.au/australian-health-services/r

## Concatenate all the datasets for each suburb-postcode

In [263]:
df_final = pd.concat(df_list).drop_duplicates()

## Tidy up the data

- Remove unnecessary whitespace
- Drop duplicates
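
The effect of these two steps can be seen on a small made-up DataFrame. The values below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Invented rows: the first two differ only in surrounding whitespace
df_example = pd.DataFrame({
    'Name': [' /clinic/a ', '/clinic/a', '/clinic/b'],
    'Address': ['1 Example St ', '1 Example St', ' 2 Example Rd'],
})

# strip leading and trailing whitespace from every (string) column
for column in df_example.columns:
    df_example[column] = df_example[column].str.strip()

# rows that were only different because of whitespace now collapse into one
df_example = df_example.drop_duplicates().reset_index(drop=True)
print(df_example)
```

Stripping before dropping duplicates matters: rows that differ only by stray spaces are otherwise treated as distinct.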

In [264]:
for column in df_final.columns:
    df_final[column] = df_final[column].str.strip()

In [266]:
output_folder = '/home/alex/Desktop/Data/scraped/gp_locations'

df_final = df_final.drop_duplicates().reset_index(drop=True)
df_final.to_csv(f'{output_folder}/gp_scraped_020922_partial.csv')

## If scraping in batches: Collate datasets

Since we have to scrape in batches (to avoid being kicked off by the website), we collate all the exported datasets at the end.
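
A simple way to split the URL list into batches is sketched below. The helper name `chunked` and the batch size are assumptions for illustration; the original notebook does not state how its batches were chosen.

```python
def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


example_urls = [f'url-{i}' for i in range(7)]
for batch_number, batch in enumerate(chunked(example_urls, 3)):
    # each batch could be scraped and exported to its own CSV file
    print(batch_number, batch)
```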

In [91]:
filenames = ['gp_scraped_191121.csv', 'gp_scraped_201121_new1.csv', 'gp_scraped_201121_new2.csv', 
             'gp_scraped_201121_new3.csv', 'gp_scraped_201121_new4.csv', 'gp_scraped_201121_new5.csv',
            'gp_scraped_201121.csv', 'gp_scraped_201121n.csv']

In [92]:
source_folder = '/Users/alexanderjameslee/Desktop/Data/locations/'

In [94]:
dfs = []

for filename in filenames:
    df_partial = pd.read_csv(f'{source_folder}/{filename}')
    dfs.append(df_partial)

In [100]:
df_all = pd.concat(dfs).drop_duplicates()

In [115]:
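# the practice name is the fourth '/'-separated part of the URL path (e.g. 'bright-medical-centre')
df_all['Title'] = df_all['Name'].apply(lambda l: l.split('/')[3].replace('-', ' '))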

In [116]:
df_all.reset_index()[['Name', 'Title', 'Address', 
                      'Latitude', 'Longitude']].to_csv(f'{source_folder}/gp_clinics_scraped_all_201121.csv',
                                                      index=False)

## Additional info from each of the practices

- Fees
- Opening hours
- Other details
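
The notebook extracts weekday and time lists for opening hours, but the code shown here does not combine them. One possible way to pair them is sketched below. The helper name `pair_opening_hours` and the sample values are hypothetical, and the sketch assumes the two lists are parallel (same order, same length).

```python
def pair_opening_hours(days, times):
    """Combine parallel weekday/time lists into a {weekday: hours} dict."""
    # zip stops at the shorter list, so unmatched entries are silently dropped
    return {day.strip(): time.strip() for day, time in zip(days, times)}


# Invented sample values for illustration
days = ['Monday ', 'Tuesday ', 'Saturday ']
times = [' 9am - 5pm', ' 9am - 5pm', ' Closed']
hours = pair_opening_hours(days, times)
print(hours)
```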

In [267]:
# for each of the practices, scrape the additional detailed info
df_all_details_temp = []

for url in df_final['Name'].values:
    print(f'\nRetrieving detailed info for {url}')
    url_name = f"{url_base2}{url}"
    df_temp = get_page_info(url_name)
    df_all_details_temp.append(df_temp)
    
df_all_details = pd.concat(df_all_details_temp).drop_duplicates().reset_index(drop=True)


Retrieving detailed info for /australian-health-services/343d215f-9e76-4abb-9067-4191c8ed1e95/mount-buller-medical-centre/services/mount-buller-3723-summit#b43bf448-d56f-4ff8-b203-351d3c415bc3

Retrieving detailed info for /australian-health-services/23011106/mount-hotham-medical-centre/services/hotham-heights-3741-great-alpine#9e313dde-c07f-1c57-fb43-b7eb74e18adb

Retrieving detailed info for /australian-health-services/20037189/bright-medical-centre/services/bright-3741-gavan#eefec9b4-f234-9b80-aecf-afb04194c670

Retrieving detailed info for /australian-health-services/20046060/mount-beauty-medical-centre/services/mount-beauty-3699-tawonga#2fe34e36-92fc-ed14-53de-c9c6efad965e

Retrieving detailed info for /australian-health-services/20045528/falls-creek-medical-centre/services/falls-creek-3699-bogong-high-plains#f0077b15-58a2-f0f0-c1e5-63bef7f0cb6c

Retrieving detailed info for /australian-health-services/20040902/ysas-abbotsford/services/abbotsford-3067-johnston#1e734316-53b6-25db-


Retrieving detailed info for /australian-health-services/d919e83d-aa0e-42e5-8162-cdd495c123ba/four-corners-medical/services/cobblebank-3338-ferris-rd#488b53b9-ffbb-480e-e788-5357cae46cc1

Retrieving detailed info for /australian-health-services/23022902/aspire-medical-and-skin-centre/services/hillside-3037-sanctuary#134627e7-7652-191a-193d-dd296b863d06

Retrieving detailed info for /australian-health-services/20053126/taylors-hill-medical-clinic/services/taylors-hill-3037-corner-gourlay-road-and-hume#dea8de2d-fbde-51cd-cad5-10d623f078d8

Retrieving detailed info for /australian-health-services/2d47cc9a-6978-493a-91a1-a1e511207065/redicare-medical-centre/services/caroline-springs-3023-caroline-springs#5b4307d6-c681-4eb8-8606-52f4c56cb7bb

Retrieving detailed info for /australian-health-services/20046960/apollo-bay-medical-centre/services/apollo-bay-3233-mclachlan#0cb0dae6-27bb-0bd5-1bb9-0b1cf599a26c

Retrieving detailed info for /australian-health-services/20035805/corangamite-clinic/s


Retrieving detailed info for /australian-health-services/07129685-19ca-4287-9c1c-4eeb86d448de/greythorn-doctors-clinic/services/balwyn-north-3104-greythorn#46b421c7-cf38-4c82-af68-640f88f55f4a

Retrieving detailed info for /australian-health-services/23022428/sunshine-marketplace-medical-centre/services/sunshine-3020-harvester#43850003-5802-54e1-6425-258a63fe6084

Retrieving detailed info for /australian-health-services/704a10f2-4a8c-4d32-a700-83771265e05e/hume-medical-centre/services/sunshine-west-3020-glengala#2b4deb17-7d5c-4864-a1d2-34cde210ee9b

Retrieving detailed info for /australian-health-services/22000199/the-medical-corner/services/sunshine-west-3020-glengala#bfdc0a67-c261-47d7-c091-9842fc57813e

Retrieving detailed info for /australian-health-services/20053293/dhl-medical-clinic/services/sunshine-3020-durham#0f0bc62a-6671-7b21-94da-0bf227812b24

Retrieving detailed info for /australian-health-services/20050185/medical-one-sunshine/services/sunshine-3020-hampshire#fb29dc82-b