# Webscraping to access hospital information for South Africa

This Jupyter Notebook walks through the steps used to access a website programmatically and download information about hospitals in South Africa for the COVID-19 response. The extracted data will be shared with the Data for Social Impact group at the University of Pretoria, which is leading a project to collate open data and develop a [COVID-19 dashboard](https://datastudio.google.com/u/0/reporting/1b60bdc7-bec7-44c9-ba29-be0e043d8534/page/hrUIB).

It will also be shared with the [_afrimapr_](http://afrimapr.org) project which aims to make Open Data in Africa more accessible through the development of various R building blocks. Read more in our recent [blog post](https://www.lstmed.ac.uk/news-events/news/open-data-and-software-to-support-the-covid-19-response-in-africa).

## Source data

We'll be scraping data from the [South African Doctors](http://www.sadoctors.co.za) website which contains the following pages for each province:

- State Hospitals & Clinics: 

  - [Eastern Cape](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_eastern_cape_south_africa/)
  - [Free State](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_free_state_south_africa)
  - [Gauteng](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_gauteng_south_africa)
  - [KwaZulu-Natal](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_kwazulu_natal_south_africa)
  - [Limpopo](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_limpopo_south_africa)
  - [Mpumalanga](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_mpumalanga_south_africa)
  - [North West](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_north_west_province_south_africa)
  - [Northern Cape](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_northern_cape_south_africa/)
  - [Western Cape](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_western_cape_south_africa/)

- Private Hospitals & Clinics:

  - [Eastern Cape](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_eastern_cape_south_africa)
  - [Free State](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_free_state_south_africa)
  - [Gauteng](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_gauteng_south_africa)
  - [KwaZulu-Natal](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_kwazulu_natal_south_africa)
  - [Limpopo](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_limpopo_south_africa)
  - [Mpumalanga](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_mpumalanga_south_africa)
  - [North West](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_north_west_province_south_africa)
  - [Northern Cape](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_northern_cape_south_africa)
  - [Western Cape](http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_private_hospitals/private_hospitals_clinics_western_cape_south_africa)

In [24]:
# Install rvest and tidyverse only if they are not already available
if (!requireNamespace('rvest', quietly = TRUE)) {
  install.packages('rvest')
}
if (!requireNamespace('tidyverse', quietly = TRUE)) {
  install.packages('tidyverse')
}
library(rvest)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.3
✔ tibble  3.0.0     ✔ stringr 1.4.0
✔ tidyr   1.0.2     ✔ forcats 0.5.0
✔ readr   1.3.1     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
✖ purrr::pluck()          masks rvest::pluck()



## First we'll create lists containing a URL for each province

On closer inspection, the private hospital pages turn out to be sparsely populated, so we'll focus on getting data for the public hospitals.

In [25]:
# Create base_url for base page because individual hospitals have relative urls
base_url <- 'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net'

# Create a list containing a URL for each province

public_province_urls  <- c('http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_eastern_cape_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_free_state_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_gauteng_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_kwazulu_natal_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_limpopo_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_mpumalanga_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_north_west_province_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_northern_cape_south_africa',
                           'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_western_cape_south_africa'
)
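
Because the nine URLs differ only in the province slug, the same vector could also be generated rather than typed out. The following is a minimal sketch, assuming the slugs below (copied from the URLs above); `province_slugs` and `generated_urls` are illustrative names that are not part of the original notebook.

```r
# Base page, as defined in the cell above
base_url <- 'http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net'

# Province slugs as they appear in the hand-written URLs (note 'north_west_province')
province_slugs <- c('eastern_cape', 'free_state', 'gauteng', 'kwazulu_natal',
                    'limpopo', 'mpumalanga', 'north_west_province',
                    'northern_cape', 'western_cape')

# Assemble one URL per province from the shared prefix and suffix
generated_urls <- paste0(base_url,
                         '/hospitals_clinics_state_hospitals/',
                         'state_public_hospitals_clinics_',
                         province_slugs,
                         '_south_africa')
```

Running `identical(generated_urls, public_province_urls)` is a quick way to check that the generated vector matches the hand-written one.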

### What do our URLs look like?

In [28]:
print(public_province_urls)

[1] "http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_eastern_cape_south_africa"       
[2] "http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_free_state_south_africa"         
[3] "http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_gauteng_south_africa"            
[4] "http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_kwazulu_natal_south_africa"      
[5] "http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_hospitals_clinics_limpopo_south_africa"            
[6] "http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net/hospitals_clinics_state_hospitals/state_public_ho

## Now we can create the empty tibble that will contain our data


In [26]:
public_hosp_details <- tibble(province = factor(),
                              name = character(),
                              lat = character(),
                              long = character(),
                              phys_address = character(),
                              post_address = character(),
                              phone = character(),
                              fax = character(),
                              cell = character(),                              
                              website = character(),
                              services = character(),
                              info = character()
                              )

### And check that the datatable (or tibble) indeed has the structure we expect

In [32]:
public_hosp_details

province,name,lat,long,phys_address,post_address,phone,fax,cell,website,services,info
<fct>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>


## Let's run through every hospital found on every province's page

### Errors

Not every hospital page carries the same fields. This script works for the Eastern Cape, Gauteng, and Free State, but breaks on the KwaZulu-Natal hospitals because the page layout changes.
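
One way to keep a long scraping run going when a page does not match the expected layout is to wrap the per-hospital work in `tryCatch()` and record the failure instead of stopping. This is a minimal sketch, assuming a hypothetical `scrape_hospital()` function that stands in for the extraction steps in the loop below. The URLs are placeholders, and no real pages are requested.

```r
# Hypothetical stand-in for the per-hospital extraction steps;
# it fails on purpose for one URL to mimic a page with a different layout
scrape_hospital <- function(url) {
  if (grepl('kwazulu_natal', url)) {
    stop('expected table not found')
  }
  list(url = url, name = 'Example Hospital')
}

# Placeholder URLs for illustration only
example_urls <- c('/state/eastern_cape/hospital_a',
                  '/state/kwazulu_natal/hospital_b',
                  '/state/gauteng/hospital_c')

results <- list()
failed_urls <- character()

for (each_url in example_urls) {
  outcome <- tryCatch(
    scrape_hospital(each_url),
    error = function(e) {
      # Report the problem and signal failure with NULL instead of aborting the loop
      message('Skipping ', each_url, ': ', conditionMessage(e))
      NULL
    }
  )

  if (is.null(outcome)) {
    failed_urls <- c(failed_urls, each_url)
  } else {
    results[[length(results) + 1]] <- outcome
  }
}
```

Keeping `failed_urls` makes it possible to revisit the pages that need a different parsing approach once the rest of the run has finished.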

In [34]:
# Only the first province (Eastern Cape) is selected by indicating public_province_urls[1]

for (each_province in public_province_urls[1]){
  province_html  <- read_html(each_province)
  
  # Get URLs for each hospital in province
  hospital_urls <- province_html %>% html_nodes('.media-heading a') %>% html_attr('href') 
  
  for (each_hospital in hospital_urls){
    

    # Download and parse the hospital page once, then reuse it for every extraction below
    hosp_page <- read_html(paste0(base_url, each_hospital))

    # Extract information for each hospital in every province from the provided table
    hosp_info <- hosp_page %>% 
      html_node('table') %>% 
      html_table(fill = TRUE)

    # Extract hospital name from header line
    hosp_name <- hosp_page %>% 
      html_nodes('h1') %>% 
      html_text() 
  
    hosp_name <- unlist(strsplit(hosp_name, ' - '))[1]
    
    # If this script is run in RStudio, printing the hospital name will help to gauge progress
    # print(hosp_name)

    # Extract coordinates from embedded Google Maps URL
    hosp_coords <- hosp_page %>% 
      html_nodes('iframe') %>% 
      html_attr('src')
  
    hosp_coords <- str_extract(hosp_coords,'-\\d+\\.\\d+\\,\\d+\\.\\d+') %>% str_split(',') %>% unlist()
  
    # Extract services from datatable
    # Split string in two separating leading text from services offered + trailing \r\n - use second element in resulting list 
    services_str <- unlist(strsplit(hosp_info$X2[8], 'Services offered:...'))[2] 
  
    # Extract each service from substring above and remove leading and trailing whitespace
    services_list <- str_c(trimws(unlist(strsplit(services_str, '\r\n..'))), collapse = ', ')

    # Build data table
    
    public_hosp_details <- public_hosp_details %>% 
      add_row(province = ifelse('Region:' %in% hosp_info$X1, hosp_info$X2[hosp_info$X1 == 'Region:'], NA),
              name = hosp_name,
              lat = hosp_coords[1],
              long = hosp_coords[2],
              phys_address = ifelse('Physical Address:' %in% hosp_info$X1, hosp_info$X2[hosp_info$X1 == 'Physical Address:'], NA),
              post_address = ifelse('Postal Address:' %in% hosp_info$X1, hosp_info$X2[hosp_info$X1 == 'Postal Address:'], NA),
              phone = ifelse('Phone:' %in% hosp_info$X1, hosp_info$X2[hosp_info$X1 == 'Phone:'],NA),
              fax = ifelse('Fax:' %in% hosp_info$X1, hosp_info$X2[hosp_info$X1 == 'Fax:'], NA),
              website = ifelse('Web:' %in% hosp_info$X1, unlist(strsplit(hosp_info$X2[hosp_info$X1 == 'Web:'], ' '))[1], NA),
              services = ifelse(length(services_list) > 0, services_list, NA)
      )
  } 
}


[1] "Aberdeen Provincial Aided Hospital"
[1] "Aberdeen Provincial Aided Hospital"
[1] "Adelaide Provincial Aided Hospital"
[1] "Adelaide Provincial Aided Hospital"
[1] "Aliwal North Hospital"
[1] "Aliwal North Hospital"
[1] "All Saints Hospital"
[1] "All Saints Hospital"
[1] "Andries Vosloo Hospital"
[1] "Andries Vosloo Hospital"
[1] "B.J. Vorster Hospital"
[1] "B.J. Vorster Hospital"
[1] "Bedford Hospital"
[1] "Bedford Hospital"
[1] "Bedford Orthopaedic Hospital"
[1] "Bedford Orthopaedic Hospital"
[1] "Bhisho Hospital"
[1] "Bhisho Hospital"
[1] "Burgersdorp Hospital"
[1] "Burgersdorp Hospital"
[1] "Butterworth Hospital"
[1] "Butterworth Hospital"
[1] "Cala Hospital"
[1] "Cala Hospital"
[1] "Canzibe Hospital"
[1] "Canzibe Hospital"
[1] "Cathcart Hospital"
[1] "Cathcart Hospital"
[1] "Cecilia Makiwane Hospital"
[1] "Cecilia Makiwane Hospital"
[1] "Cloete Joubert Hospital"
[1] "Cloete Joubert Hospital"
[1] "Cofimvaba Hospital"
[1] "Cofimvaba Hospital"
[1] "Cradock Hospital"
[1] "Cradock 
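
The coordinate step in the loop relies on the latitude and longitude appearing as a `-lat,long` pair inside the map iframe's `src` attribute. Below is a minimal sketch of that extraction on its own, using base R and a made-up `src` value (the real embed URLs on the site may look different).

```r
# Made-up iframe src for illustration; not taken from the website
example_src <- 'https://maps.google.com/maps?q=-32.769741,26.629668&output=embed'

# Same idea as the pattern in the loop: a negative decimal, a comma, then a positive decimal
coord_pattern <- '-\\d+\\.\\d+,\\d+\\.\\d+'

# Pull out the first match and split it into latitude and longitude
coord_match <- regmatches(example_src, regexpr(coord_pattern, example_src))
coord_parts <- unlist(strsplit(coord_match, ','))

lat <- as.numeric(coord_parts[1])
long <- as.numeric(coord_parts[2])
```

The pattern assumes a southern latitude (negative) and an eastern longitude (positive), which holds for South Africa but would need adjusting elsewhere.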

### Let's see what our resulting table looks like


In [35]:
head(public_hosp_details)

province,name,lat,long,phys_address,post_address,phone,fax,cell,website,services,info
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Eastern Cape,Aberdeen Provincial Aided Hospital,-32.769741,26.629668,"35 Hope Street, Aberdeen, Eastern Cape, South Africa","PO Box 1723, Aberdeen, 6270, South Africa",+27 (0)49 846 0497,+27 (0)49 846 0176,,www.echealth.gov.za,"Emergency Services, Maternity Services, Medical Services, O.P.D. Services, Paediatrics, Surgical Services, pyright 2005 - 2011. Department of Health, Eastern Cape, South Africa",
Eastern Cape,Aberdeen Provincial Aided Hospital,-32.769741,26.629668,"35 Hope Street, Aberdeen, Eastern Cape, South Africa","PO Box 1723, Aberdeen, 6270, South Africa",+27 (0)49 846 0497,+27 (0)49 846 0176,,www.echealth.gov.za,"Emergency Services, Maternity Services, Medical Services, O.P.D. Services, Paediatrics, Surgical Services, pyright 2005 - 2011. Department of Health, Eastern Cape, South Africa",
Eastern Cape,Adelaide Provincial Aided Hospital,-32.703876,26.296742,"Piet Retief Drive, Adelaide, Eastern Cape, South Africa","PO Box 128, Adelaide, 5760, South Africa",+27 (0)46 684 0066 / 274,+27 (0)46 6840417,,www.echealth.gov.za,,
Eastern Cape,Adelaide Provincial Aided Hospital,-32.703876,26.296742,"Piet Retief Drive, Adelaide, Eastern Cape, South Africa","PO Box 128, Adelaide, 5760, South Africa",+27 (0)46 684 0066 / 274,+27 (0)46 6840417,,www.echealth.gov.za,,
Eastern Cape,Aliwal North Hospital,-30.717526,26.715274,"Parklane Street, Aliwal North, Eastern Cape, South Africa","Private Bag X1004, Aliwal North, 9757, South Africa",+27 (0)51 634 2381 /2382 /2383 /2384,+27 (0)51 634 1604,,www.echealth.gov.za,,
Eastern Cape,Aliwal North Hospital,-30.717526,26.715274,"Parklane Street, Aliwal North, Eastern Cape, South Africa","Private Bag X1004, Aliwal North, 9757, South Africa",+27 (0)51 634 2381 /2382 /2383 /2384,+27 (0)51 634 1604,,www.echealth.gov.za,,
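
The preview shows each hospital twice, and `lat` and `long` are stored as text. Here is a minimal sketch of one possible clean-up, using base R on a small stand-in data frame (`example_rows` is illustrative, with values abridged from the preview above). `dplyr::distinct()` would be the tidyverse equivalent of the de-duplication step.

```r
# Small stand-in for public_hosp_details, with one duplicated row
example_rows <- data.frame(
  name = c('Aberdeen Provincial Aided Hospital',
           'Aberdeen Provincial Aided Hospital',
           'Adelaide Provincial Aided Hospital'),
  lat = c('-32.769741', '-32.769741', '-32.703876'),
  long = c('26.629668', '26.629668', '26.296742'),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each fully identical row
cleaned_rows <- example_rows[!duplicated(example_rows), ]

# Convert the coordinate columns from character to numeric
cleaned_rows$lat <- as.numeric(cleaned_rows$lat)
cleaned_rows$long <- as.numeric(cleaned_rows$long)
```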


## We have to add source information and copyright information for the table

Copyright for each province's data belongs to the relevant provincial Department of Health.

In [36]:
public_hosp_details_source  <- public_hosp_details  %>% 
  mutate(source = base_url,
         copyright = case_when(province == 'Eastern Cape' ~ 'Eastern Cape Department of Health',
                               province == 'Gauteng' ~ 'Gauteng Department of Health',
                               province == 'Free State' ~ 'Free State Department of Health',
                               province == 'Northern Cape' ~ 'Northern Cape Department of Health',
                               province == 'North West' ~ 'North West Department of Health',
                               province == 'Limpopo' ~ 'Limpopo Department of Health',
                               province == 'Mpumulanga' ~ 'Mpumulanga Department of Health',
                               province == 'Kwazulu Natal' ~ 'Kwazulu Natal Department of Health', 
                               province == 'Western Cape' ~ 'Western Cape Department of Health'
                              )
        )
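
Since every label follows the pattern "province name, then Department of Health", the same column could be derived without listing each case. This is a minimal sketch in base R, assuming the province names as spelled in the cell above; `known_provinces` and `copyright_for()` are illustrative names. Like `case_when()`, it returns `NA` for any province that is not in the list.

```r
# Province names as spelled in the case_when() call above
known_provinces <- c('Eastern Cape', 'Gauteng', 'Free State', 'Northern Cape',
                     'North West', 'Limpopo', 'Mpumulanga', 'Kwazulu Natal',
                     'Western Cape')

# Build the copyright label, leaving NA where the province is not recognised
copyright_for <- function(province) {
  ifelse(province %in% known_provinces,
         paste(province, 'Department of Health'),
         NA_character_)
}

example_copyright <- copyright_for(c('Eastern Cape', 'Unknown', NA))
```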

In [37]:
head(public_hosp_details_source)

province,name,lat,long,phys_address,post_address,phone,fax,cell,website,services,info,source,copyright
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Eastern Cape,Aberdeen Provincial Aided Hospital,-32.769741,26.629668,"35 Hope Street, Aberdeen, Eastern Cape, South Africa","PO Box 1723, Aberdeen, 6270, South Africa",+27 (0)49 846 0497,+27 (0)49 846 0176,,www.echealth.gov.za,"Emergency Services, Maternity Services, Medical Services, O.P.D. Services, Paediatrics, Surgical Services, pyright 2005 - 2011. Department of Health, Eastern Cape, South Africa",,http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net,Eastern Cape Department of Health
Eastern Cape,Aberdeen Provincial Aided Hospital,-32.769741,26.629668,"35 Hope Street, Aberdeen, Eastern Cape, South Africa","PO Box 1723, Aberdeen, 6270, South Africa",+27 (0)49 846 0497,+27 (0)49 846 0176,,www.echealth.gov.za,"Emergency Services, Maternity Services, Medical Services, O.P.D. Services, Paediatrics, Surgical Services, pyright 2005 - 2011. Department of Health, Eastern Cape, South Africa",,http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net,Eastern Cape Department of Health
Eastern Cape,Adelaide Provincial Aided Hospital,-32.703876,26.296742,"Piet Retief Drive, Adelaide, Eastern Cape, South Africa","PO Box 128, Adelaide, 5760, South Africa",+27 (0)46 684 0066 / 274,+27 (0)46 6840417,,www.echealth.gov.za,,,http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net,Eastern Cape Department of Health
Eastern Cape,Adelaide Provincial Aided Hospital,-32.703876,26.296742,"Piet Retief Drive, Adelaide, Eastern Cape, South Africa","PO Box 128, Adelaide, 5760, South Africa",+27 (0)46 684 0066 / 274,+27 (0)46 6840417,,www.echealth.gov.za,,,http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net,Eastern Cape Department of Health
Eastern Cape,Aliwal North Hospital,-30.717526,26.715274,"Parklane Street, Aliwal North, Eastern Cape, South Africa","Private Bag X1004, Aliwal North, 9757, South Africa",+27 (0)51 634 2381 /2382 /2383 /2384,+27 (0)51 634 1604,,www.echealth.gov.za,,,http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net,Eastern Cape Department of Health
Eastern Cape,Aliwal North Hospital,-30.717526,26.715274,"Parklane Street, Aliwal North, Eastern Cape, South Africa","Private Bag X1004, Aliwal North, 9757, South Africa",+27 (0)51 634 2381 /2382 /2383 /2384,+27 (0)51 634 1604,,www.echealth.gov.za,,,http://doctors-hospitals-medical-cape-town-south-africa.blaauwberg.net,Eastern Cape Department of Health
