# Australian General Practice location scraper

5 October 2021

---

**Description**

It is difficult to find existing datasets for general practice locations in Australia (e.g., from data.gov.au or data.vic.gov.au). Healthdirect appears to be the best official source of health provider locations. 

This notebook contains basic code for scraping healthdirect.gov.au to obtain the names and locations of all general practice sites in Victoria. It can be modified to cover any other state, the whole country, or a subset of postcodes.


**Instructions**
- Make sure that pandas, requests and lxml are installed
- Ensure that the reference postcode data is specified correctly
- Run each of the cells in order. 

In [2]:
import requests
import pandas as pd
from lxml import html
from time import sleep

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

## Define paths and utility functions

- Specify the required data elements from the website
- Create utility function to extract data elements from web page

In [3]:
# these define the elements within the web page where name and location information is recorded
path_name = './/div[@class="veyron-hsf-page "]/a/@href'
path_address = './/div[@class="veyron-hsf-page "]/a/@data-address'
path_lat = './/div[@class="veyron-hsf-page "]/a/@data-lat'
path_long = './/div[@class="veyron-hsf-page "]/a/@data-long'

In [4]:
# the data elements that we are interested in
data_scheme = {'Name': path_name,
               'Address': path_address,
               'Latitude': path_lat,
               'Longitude': path_long}

In [5]:
def get_locations(text, data_scheme):
    """
    Retrieve location information from a web page
    
    Args
    ----
    text (str): the scraped HTML of the web page containing
    the required information
    data_scheme (dict): maps each output column name to the XPath of its data element
    
    Returns
    -------
    pd.DataFrame with the parsed data
    """
    tree = html.fromstring(text)
    names = data_scheme.keys()
    output_data = dict()
    
    for name in names:
        path = data_scheme[name]
        item_data = tree.xpath(path)
        output_data[name] = item_data
        
    return pd.DataFrame(output_data)
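
As a quick check of `get_locations`, the sketch below runs the same logic against a small hand-written HTML snippet. The snippet and its attribute values are made up for illustration and only mimic the structure implied by the XPaths above; the live page may be structured differently.

```python
import pandas as pd
from lxml import html

# Hypothetical HTML that mimics the structure targeted by the XPaths above
sample_html = """
<html><body>
  <div class="veyron-hsf-page "><a href="/australian-health-services/1/example-clinic-a"
     data-address="1 Example Street, BRIGHT 3741" data-lat="-36.70" data-long="146.90">A</a></div>
  <div class="veyron-hsf-page "><a href="/australian-health-services/2/example-clinic-b"
     data-address="2 Example Road, MYRTLEFORD 3737" data-lat="-36.50" data-long="146.70">B</a></div>
</body></html>
"""

# Same scheme as in the notebook: column name -> XPath of the attribute
data_scheme = {
    'Name': './/div[@class="veyron-hsf-page "]/a/@href',
    'Address': './/div[@class="veyron-hsf-page "]/a/@data-address',
    'Latitude': './/div[@class="veyron-hsf-page "]/a/@data-lat',
    'Longitude': './/div[@class="veyron-hsf-page "]/a/@data-long',
}


def get_locations(text, data_scheme):
    """Compact version of the notebook function: one column per XPath."""
    tree = html.fromstring(text)
    return pd.DataFrame({name: tree.xpath(path) for name, path in data_scheme.items()})


locations = get_locations(sample_html, data_scheme)
print(locations)
```

Note that `pd.DataFrame` requires every column to have the same length, so if an attribute is missing from some entries on a page, this approach raises a `ValueError` rather than silently misaligning rows.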

## Create list of suburb-postcode pairs

Using the reference postcode data, create the suburb-postcode pairs for which location data will be scraped. The location of the reference file is set by the ```folder``` and ```filename``` variables.

In [6]:
folder = 'reference_data'
filename = f'{folder}/postcodes_scraped_221021.csv'
df = pd.read_csv(filename)

In [7]:
# if we are only interested in the VIC postcodes (assuming that the file has a 'state' column)
# .copy() makes df_vic an independent DataFrame, so adding columns below does not trigger SettingWithCopyWarning
df_vic = df.query('state == "VIC"').copy()

In [8]:
df_vic['suburb_formatted'] = df_vic['suburb'].apply(lambda s: s.replace(' ', '_'))

In [9]:
df_vic['suburb_postcode'] = df_vic['suburb_formatted'].str.lower() + '-' + df_vic['postcode'].astype(str)


In [10]:
suburb_postcodes = list(df_vic['suburb_postcode'].values)

pages = list(range(1, 5)) # the result pages (1 to 4) to scrape for each suburb-postcode pair

## URL list to scrape

Create the list of URLs to scrape from the required suburb-postcode pairs and page numbers.

In [11]:
# the components of the results URL
url_base = 'https://www.healthdirect.gov.au/australian-health-services/results/'
url_middle = '/tihcs-aht-11222/gp-general-practice?pageIndex='
url_end = '&tab=SITE_VISIT'

In [12]:
# create the list of URLs to scrape
urls = [f'{url_base}{suburb_postcode}{url_middle}{n}{url_end}'
        for suburb_postcode in suburb_postcodes
        for n in pages]

In [13]:
len(suburb_postcodes)

3183
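
To make the URL pattern concrete, here is a minimal sketch that builds the URLs for a single suburb-postcode pair. The helper name `build_urls` is hypothetical and not part of the notebook; the URL components and the formatting rules (spaces to underscores, lower case, hyphen before the postcode) are taken from the cells above.

```python
URL_BASE = 'https://www.healthdirect.gov.au/australian-health-services/results/'
URL_MIDDLE = '/tihcs-aht-11222/gp-general-practice?pageIndex='
URL_END = '&tab=SITE_VISIT'


def build_urls(suburb, postcode, pages):
    """Hypothetical helper: one results URL per page for a suburb-postcode pair."""
    # same formatting as the notebook: spaces -> underscores, lower case, '-' + postcode
    slug = f"{suburb.replace(' ', '_').lower()}-{postcode}"
    return [f'{URL_BASE}{slug}{URL_MIDDLE}{n}{URL_END}' for n in pages]


example_urls = build_urls('Mount Beauty', 3699, range(1, 5))
print(len(example_urls))
print(example_urls[0])
```

With four pages per pair, the 3183 suburb-postcode pairs give 3183 × 4 = 12732 URLs in total.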

## Scrape the data

For each URL in the list ```urls```, request the page, parse it and append the result to the list of dataframes ```df_list```.

In [14]:
print(f'There are {len(urls)} URLs.')

There are 12732 URLs.


In [None]:
df_list = []
urls_bad = []

# in case the request returns an error, try again
# this makes the scraper more robust
retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"]  # this argument was named method_whitelist before urllib3 1.26
)

adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

# loop through the URLs, scrape each one and parse the data
for n, url in enumerate(urls):
    print(f'{n}: Scraping URL: {url}')
    page = http.get(url)
    print(f'status code: {page.status_code}')
    if page.status_code == 200:
        # only append on success, so that a failed request does not
        # re-append the data from the previous page
        df_list.append(get_locations(page.text, data_scheme))
    else:
        print('Unsuccessful request')
        urls_bad.append(url)

In [None]:
# scrape URLs that could not be scraped initially.
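
The cell above is left empty in the notebook. One possible way to fill it in is sketched below, assuming a single extra pass over ```urls_bad``` is enough. The helper `retry_failed`, its parameters and the `FakePage` stand-in are hypothetical and exist only so that the sketch runs offline.

```python
from time import sleep


def retry_failed(urls_bad, fetch, parse, delay=1.0):
    """
    Hypothetical helper: make one more pass over the URLs that failed.

    fetch(url) must return an object with .status_code and .text
    parse(text) must return the parsed data for one page
    Returns (list of parsed results, list of URLs that failed again)
    """
    recovered, still_bad = [], []
    for url in urls_bad:
        page = fetch(url)
        if page.status_code == 200:
            recovered.append(parse(page.text))
        else:
            still_bad.append(url)
        sleep(delay)  # pause between requests to avoid overloading the server
    return recovered, still_bad


# Demonstration with a stand-in for http.get, so the sketch runs without network access
class FakePage:
    def __init__(self, status_code, text=''):
        self.status_code = status_code
        self.text = text


def fake_fetch(url):
    # pretend that URLs ending in 'good' succeed and all others fail
    return FakePage(200, 'ok') if url.endswith('good') else FakePage(503)


recovered, still_bad = retry_failed(['x/good', 'x/bad'], fake_fetch, str.upper, delay=0)
print(recovered, still_bad)
```

In the notebook this could be called as `retry_failed(urls_bad, http.get, lambda text: get_locations(text, data_scheme))`, with the recovered dataframes then added to ```df_list``` via `df_list.extend(...)`.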

## Concatenate all the datasets for each suburb-postcode

In [18]:
df_final = pd.concat(df_list).drop_duplicates()

In [19]:
df_final

| | Name | Address | Latitude | Longitude |
|---|---|---|---|---|
| 0 | /australian-health-services/23011106/mount-hot... | 32 Great Alpine Road, HOTHAM HEIGHTS 3741 | -36.9833984375 | 147.142288208007812 |
| 1 | /australian-health-services/20037189/bright-me... | 115 Gavan Street, BRIGHT 3741 | -36.7268142700195312 | 146.960708618164062 |
| 2 | /australian-health-services/20046060/mount-bea... | 1 Tawonga Crescent, MOUNT BEAUTY 3699 | -36.7435798645019531 | 147.1715087890625 |
| 3 | /australian-health-services/20045528/falls-cre... | 1 Bogong High Plains Road, FALLS CREEK 3699 | -36.8613243103027344 | 147.276687622070312 |
| 4 | /australian-health-services/20027328/standish-... | 107 Standish Street, MYRTLEFORD 3737 | -36.5577354431152344 | 146.72650146484375 |
| ... | ... | ... | ... | ... |
| 2 | /australian-health-services/644bbf79-15bd-4f59... | 12 STANLEY STREET, WEST WODONGA 3690 | -36.122049720 | 146.886751470 |
| 0 | /australian-health-services/23026930/aurora-vi... | Shop 8, 315 Harvest Home Road, EPPING 3076 | -37.6214637756347656 | 145.0064697265625 |
| 1 | /australian-health-services/fe9c83fc-a64b-4b5f... | 878 EDGARS ROAD, EPPING 3076 | -37.622501480 | 145.005549120 |
| 3 | /australian-health-services/9fceb613-6dfc-4320... | Epping North Shopping Centre, Shop 10, 2 Lynda... | -37.6276764 | 145.02700439 |


## Tidy up the data

- Remove unnecessary whitespace
- Drop duplicates

In [20]:
for column in df_final.columns:
    df_final[column] = df_final[column].str.strip()
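
The order of the two tidy-up steps matters: rows that differ only in leading or trailing whitespace are not recognised as duplicates until they have been stripped. The toy data below is made up to illustrate this and is not taken from the scraped results.

```python
import pandas as pd

# Two made-up rows that are identical apart from stray whitespace
df_a = pd.DataFrame({'Name': ['/example-clinic '], 'Address': ['1 Example Street']})
df_b = pd.DataFrame({'Name': ['/example-clinic'], 'Address': ['1 Example Street ']})
combined = pd.concat([df_a, df_b], ignore_index=True)

# Before stripping, both rows are kept
before = len(combined.drop_duplicates())

# Strip whitespace in every column (all columns hold strings here)
for column in combined.columns:
    combined[column] = combined[column].str.strip()

# After stripping, the rows are identical and one is dropped
after = len(combined.drop_duplicates())
print(before, after)
```

The `.str` accessor only works on string columns, which holds here because every scraped value is read from an HTML attribute as text.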

In [21]:
output_folder = '/home/user/Desktop/Data/scraped/gp_locations'

df_final.drop_duplicates().to_csv(f'{output_folder}/gp_scraped_050322_all.csv')