#### zerotoanalyst-project1

# Creating a Contacts List for the UK's NHS using Web Scraping

### Project Objectives

The general aim of this research project is to demonstrate the extent to which a contacts list can be created using web scraping.  The project uses Python with the Requests and Beautiful Soup 4 (BS4) libraries.

The project uses the National Health Service (NHS) in the UK to explore this objective.  The NHS is a public sector organisation comprising over 6000 organisations, from large hospitals to local general health practices (GPs).  Overall it has 1.6 million employees and volunteers and provides healthcare services to the 67 million people in the UK with a budget of over 130 billion pounds (GBP).  The project demonstrates the approach using the NHS in England, covering the hospital trusts and GP surgeries.  To become comprehensive, the exercise would need to extend to NHS Scotland, Wales and Northern Ireland, which should be straightforward.

The objective is framed within a wider research project exploring the nature and structure of "business ecosystems" located at the Wave Lab within the University of the Aegean Business School in Greece.  This listing will later be used to model the NHS as a business ecosystem.

### Relevant Use Cases 

A listing of this type has a very wide set of potential use cases:

1. Research studies to find survey respondents or case study sites or web-sites to gather information and data.
2. Market research activities for mailing lists, newsletters, lead generation.
3. Contacts for requests for information for a variety of purposes such as journalism, government advice, etc.
4. Political activism or citizen investigations into issues of relevance to citizens or political agendas.

Although the NHS is the subject of this demonstrator, web scraping contact details is applicable to most organisations and sectors and countries.

### Economic Advantages

Contact listings of this sort can often be purchased, but usually at a cost, and how up-to-date the databases are is an issue.  The economic advantages are therefore:

1. Cost savings over a commercial data provider.
2. Being able to re-run at any time for the latest contact information.
3. Having more options for missing data.
4. Potentially having more control over the data gathering.

### Web Site Used

A first stage involved exploring a number of web sites to find the best one for this purpose.  The site www.nhs.uk was chosen for this exercise.

<img src="https://i.imgur.com/NStkGiV.gif">

### Legal & Information Protection Conformance

The legality of using the website for web scraping of this type was checked.  Generally, in the UK it is accepted as legal for UK citizens to scrape UK websites for research purposes.  This legal principle applies to this project.  Additionally, the website's Terms and Conditions were reviewed at https://www.nhs.uk/our-policies/terms-and-conditions/.  Section 3.4 advises that "you can use NHS Website Content, including copying it, adapting it, and using it for any purpose, including commercially, provided you follow these terms and conditions and the terms of the OGL" (i.e. the Open Government Licence, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).  The OGL allows the use and adaptation of information from the site, and its combination with other data, for both commercial and non-commercial purposes, provided that the source is acknowledged and the OGL is stated to apply.

### Evaluation Criteria

The project is a requirement within the Jovian Data Analyst Bootcamp and has the following evaluation criteria:

- The Jupyter notebook should run end-to-end without any errors or exceptions
- The Jupyter notebook should contain execution outputs for all the code cells
- The Jupyter notebook should contain proper explanations in Markdown cells
- Your project should involve web scraping of at least two web pages
- Your project should use the appropriate libraries for web scraping
- Your submission should include the CSV file generated by scraping
- The submitted CSV file should contain at least 3 columns and 100 rows of data
- The Jupyter notebook should be publicly accessible (not "Private" or "Secret")


## General Approach to Webscraping to Create the Contact List

### Website Content

The NHS website contains a lot of information on NHS organisations.  The site is structured for a user (citizen) to drill down by type of organisation and select the individual organisation they are looking for. That selection leads to pages of information on the organisation, which include its contact information. The site is therefore not designed to be a source of contact lists; it is designed for finding one organisation and its services.

### Data Quality

A manual inspection of the site and the contact information showed that there are some common structures.  It also revealed that the pages are not identical and that some contact information is missing.  On the plus side, the site has comprehensive and up-to-date information on all the NHS organisations by name.

### Approach to Web Scraping

A simple "nested approach" is proposed of:

1. Choose a type of NHS organisation (hospital, GP, etc.).
2. Load the page listing the organisations of that type using the Python Requests library.
3. Scrape that page using BS4 to create a dataset for that type of organisation comprising the name and web page URL.
4. For each organisation in the dataset, load its page using Requests and use BS4 to find the contact details.
5. Add the organisation, type and contact details to another dataset.
6. When the exercise is complete, output the contacts dataset to a CSV file.
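
The nested approach listed above can be sketched as a small driver function. This is only an illustration, not the notebook's own code: `scrape_contacts`, `contacts_to_csv`, `list_organisations` and `extract_contact` are hypothetical names, and the page-loading and parsing work is passed in as callables so the skeleton can be read (and tried) without network access.

```python
import csv
import io


def scrape_contacts(org_types, list_organisations, extract_contact, limit=None):
    """Nested scrape: for each organisation type, list its organisations,
    then visit each organisation page and collect one contact record."""
    contacts = []
    for role, listing_url in org_types.items():
        # Outer level: one listing page per organisation type -> (name, url) pairs
        organisations = list_organisations(listing_url)
        # Inner level: one contact page per organisation (optionally limited)
        for name, page_url in organisations[:limit]:
            record = {"Name": name, "Role": role}
            record.update(extract_contact(page_url))
            contacts.append(record)
    return contacts


def contacts_to_csv(contacts, fieldnames):
    """Serialise the collected records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(contacts)
    return buffer.getvalue()
```

In the notebook itself the two callables correspond to the Requests and BS4 steps, and Pandas plays the role of `contacts_to_csv`.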

### Keeping the Project Small & Avoiding Overloading the Targeted Website

It is not necessary to scrape every type of organisation or all of the NHS organisation pages to prove the viability of the approach with this demonstrator, nor is it needed to fulfil the evaluation criteria of the Jovian Bootcamp.  Also, scraping 16,000 pages or so from the site may trigger security or load management software within the site's servers.  To keep all this manageable, the following constraining tactics are being adopted:

1. The project will be constrained to the two major types of NHS organisations: hospitals and GP surgeries.  Extending the exercise to all the other types (e.g. pharmacies, mental health trusts, dentists, etc.) is simply achieved by running the generic approach across the web pages for those types of organisations.
2. Within those two types of organisations, only the first 50 hospitals (alphabetically) and 100 GP surgeries (again alphabetically) will be selected.  Extending to all hospitals and all GP surgeries is simply achieved by removing this limiter and allowing the functions to run until all have been selected.
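
A complementary tactic, not used in this notebook, is to pause between requests so that the pages are not fetched in a rapid burst. The sketch below assumes a one-second pause, which is an arbitrary illustrative value; `fetch_politely` is a hypothetical helper and `fetch` stands for any page-loading callable such as `requests.get`.

```python
import time


def fetch_politely(urls, fetch, max_pages=50, delay_seconds=1.0):
    """Fetch at most max_pages URLs, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls[:max_pages]):
        if index > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

For example, `fetch_politely(urls, requests.get, max_pages=50)` would load the first 50 pages with a pause between each pair of requests.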

### Example of web page for a type of NHS organisation (hospitals in this case)
<img src="https://i.imgur.com/Y1gJ2xy.gif">



### Example of web page with contact details for one hospital (Ashford & St Peter's Hospital Trust)

<img src="https://i.imgur.com/QyIz7uD.gif">

### Installing the Python Libraries

The following are required by the project:

- Jovian to allow the notebook to be stored and submitted on the Jovian platform.
- Requests which allows the notebook to load the web pages.
- Beautiful Soup 4 which provides functionality to scrape specific fields from the web pages.
- Pandas which provides functions to manage the dataset and output it to a CSV file.
- The standard-library modules re, which is used to work with regular expressions, and os, which is used to list the files in the working directory.


In [1]:
!pip install jovian --upgrade --quiet
import jovian

In [2]:
!pip install requests --upgrade --quiet
import requests

In [3]:
!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

In [4]:
!pip install pandas --upgrade --quiet
import pandas as pd

Note: The project also uses the modules re and os, which are part of the Python standard library, so they do not need to be installed, only imported.

In [5]:
import re
import os

### Creating a Pandas DataFrame to Hold the Scraped Contacts Data

A repository is needed for the contacts data that will be scraped from the pages on the NHS website. 

In [6]:
NHS_Contacts_Dataset = pd.DataFrame(columns = ['Name','Role','Address','Phone','Email','Website'])
NHS_Contacts_Dataset

Unnamed: 0,Name,Role,Address,Phone,Email,Website


### Identifying the Pages for Types of NHS Organisations

The types of NHS organisation are available on the "Services" page at url https://www.nhs.uk/Services/

<img src="https://i.imgur.com/jj3N1DR.gif">

From this the page for hospitals can be selected.

### Selecting the Page for NHS Hospitals

A listing of NHS hospitals is at url https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=Acute

<img src="https://i.imgur.com/Y1gJ2xy.gif">

In [7]:
# Setting the url as variable
nhs_hospitals_listing_url = 'https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=Acute'

In [8]:
# Using the Requests library to get the page & check it was captured (code 200)
nhs_hospitals_listing = requests.get(nhs_hospitals_listing_url)
nhs_hospitals_listing

<Response [200]>

### Parse the Page & Extract the Hospital urls

In [9]:
# Parse the html page with Beautiful Soup
nhs_hospitals_doc = BeautifulSoup(nhs_hospitals_listing.text, 'html.parser')
nhs_hospitals_doc


<!DOCTYPE html>

<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if IE 9]><html class="ie9" lang="en"><![endif]--><!--[if gt IE 9]><!--><html lang="en"><!--<![endif]-->
<head>
<title></title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, shrink-to-fit=no" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/css/base.css" media="" rel="stylesheet" type="text/css"/>
<link href="/css/print.css" media="print" rel="stylesheet" type="text/css"/>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="-1" http-equiv="Expires"/>
<!--[if !IE]>-->
<style type="text/css">
@import url('/css/reset.css') screen;
@import url('/css/screen.css') screen;
@import url('/css/emergency-alert.css') screen;
@import url('/css/find-services.css')screen;
@import url('/css/pims/pims.css')screen;

  @import url('/css/pims/pims-responsive.css');

 </style>
<!--<![endif]-->
<!--[if I

In [10]:
# Finding the hospitals and the urls of their pages
nhs_hospital_listing_table = nhs_hospitals_doc.find(class_ = "notranslate o-listing" )
nhs_hospital_listing_table = nhs_hospital_listing_table.find_all('a')
nhs_hospitals_list = []

# Placing the hospital names and urls into a list of [name, url] pairs.
# href[26:] drops the leading path of the listing link, keeping 'DefaultView.aspx?id=...', so the url points to the Contacts page
for lines in nhs_hospital_listing_table:
    hospital = lines.contents[0]
    link = 'https://www.nhs.uk/Services/Trusts/ContactDetails/' + lines.get('href')[26:]
    new_line = [hospital, link]
    nhs_hospitals_list.append(new_line)

# Printing  a sample of 10 items from the list.
nhs_hospitals_list[:10]

[['Airedale NHS Foundation Trust',
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=103'],
 ["Alder Hey Children's NHS Foundation Trust",
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=827'],
 ["Ashford and St Peter's Hospitals NHS Foundation Trust",
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=1396'],
 ['Barking, Havering and Redbridge University Hospitals NHS Trust',
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=27'],
 ['Barnsley Hospital NHS Foundation Trust',
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=81'],
 ['Barts Health NHS Trust',
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=34604'],
 ['Bedford Hospital NHS Trust',
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=148'],
 ['BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST',
  'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx?id=177'
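
The fixed slice `[26:]` depends on the leading path of each link having exactly that length. A less brittle alternative, sketched here as an option rather than as the notebook's method, is to read the `id` query parameter from the link and rebuild the contact page URL from it. `build_contact_url` and `CONTACT_PAGE` are hypothetical names, and the sample `href` in the usage note is an assumed form of the links on the listing page.

```python
from urllib.parse import urlparse, parse_qs

# Contact page address taken from the urls printed above
CONTACT_PAGE = 'https://www.nhs.uk/Services/Trusts/ContactDetails/DefaultView.aspx'


def build_contact_url(href):
    """Build the contact page url from a listing link that carries an ?id=... parameter."""
    query = parse_qs(urlparse(href).query)
    organisation_id = query['id'][0]  # raises KeyError if the link has no id parameter
    return CONTACT_PAGE + '?id=' + organisation_id
```

For instance, `build_contact_url('/Services/Trusts/Overview/DefaultView.aspx?id=103')` gives the first url in the list above.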

### Creating Python Functions to Find the Data Elements

Inspection of the NHS site being scraped reveals that whilst the page structures are consistent within each "type" of NHS organisation, the page structures are not consistent between different types of NHS organisations.  This means that functions to find the data elements within a page need to be defined for each type of organisation.

Notice that different data elements are identified using different kinds of identifiers in the page's HTML markup: an element id, a CSS class, a property attribute, or the pattern of a link's href.

#### Function for Scraping the Hospital Name from a Page

In [11]:
def find_hospital_name(nhs_hospital_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_hospital_name = nhs_hospital_page_content.find(id = "org-title").text
    except AttributeError:  # find() returned None because the element is missing
        nhs_hospital_name = ""    
    return nhs_hospital_name

#### Function for Scraping the Hospital Address

In [12]:
def find_hospital_address(nhs_hospital_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        # Join the four address parts into one comma-separated string
        address_properties = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]
        nhs_hospital_address = ", ".join(
            nhs_hospital_page_content.find(property = prop).text for prop in address_properties
        )
    except AttributeError:
        # find() returned None because one of the address parts is missing
        nhs_hospital_address = ""
    return nhs_hospital_address

#### Function for Scraping the Hospital Phone Number (Landline)

In [13]:
def find_nhs_hospital_phone(nhs_hospital_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_hospital_phone = nhs_hospital_page_content.find(class_ = "tel-no").text
    except AttributeError:  # find() returned None because the element is missing
        nhs_hospital_phone = ""
    return nhs_hospital_phone

#### Function for Scraping the Hospital Email Address

In [14]:
def find_nhs_hospital_email(nhs_hospital_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_hospital_email_raw = nhs_hospital_page_content.find(href=re.compile("mailto:"))
        nhs_hospital_email = nhs_hospital_email_raw.get('href')[7:]
    except AttributeError:  # find() returned None because no mailto: link exists
        nhs_hospital_email = ""
    return nhs_hospital_email

#### Function for Scraping the Hospital Website url

In [15]:
def find_nhs_hospital_website(nhs_hospital_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_hospital_website = nhs_hospital_page_content.find(property = "url").text
    except AttributeError:  # find() returned None because the element is missing
        nhs_hospital_website = ""
    return nhs_hospital_website
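
The functions above share one pattern: look up an element, read a value from it, and fall back to an empty string when the element is absent. As a sketch only, that pattern could be factored into a single helper. `first_text` is a hypothetical name; the only assumption about the page object is that it offers `find(**filters)` returning either `None` or an object with a `.text` attribute, as Beautiful Soup does. Unlike the functions above, this version also strips surrounding whitespace.

```python
def first_text(page, default="", **filters):
    """Return the stripped text of the first element matching the filters,
    or the default value when no such element exists on the page."""
    element = page.find(**filters)
    if element is None:
        return default
    return element.text.strip()
```

With this helper, the name and phone lookups would read `first_text(nhs_hospital_page_content, id="org-title")` and `first_text(nhs_hospital_page_content, class_="tel-no")`.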

#### Generic Function for Appending Contact Details to the Contacts Dataset

This function can be used for any type of NHS organisation, so is common to all pages scraped.

In [16]:
def Add_To_NHS_Contacts_Dataset(name, role, address, phone, email, website):
    # Builds one record keyed by the dataset's column names; the caller adds it to the dataset
    nhs_contact_list_info = {'Name' : name,
                          'Role' : role, 
                          'Address': address, 
                          'Phone' : phone, 
                          'Email' : email, 
                          'Website' : website}     
    return nhs_contact_list_info

### Working Through Each Hospital url, Loading the Page & Extracting the Contact Data Elements

In [17]:
# Setting the Role as being "Hospital"
nhs_role = "Hospital"
# Looping through each hospital page (restricted to first 50). Removing this constraint will loop through all NHS hospitals!
for hospital in nhs_hospitals_list[0:50]:
    nhs_hospital_url = hospital[1]
    nhs_hospital_page = requests.get(nhs_hospital_url)
    
    # extracting the fields for "Name", "Address", "Phone", "Email", "Website" using Beautiful Soup 4
    nhs_hospital_page_content = BeautifulSoup(nhs_hospital_page.text, 'html.parser')
    nhs_hospital_name = find_hospital_name(nhs_hospital_page_content)
    nhs_hospital_address = find_hospital_address(nhs_hospital_page_content)
    nhs_hospital_phone = find_nhs_hospital_phone(nhs_hospital_page_content)
    nhs_hospital_email = find_nhs_hospital_email(nhs_hospital_page_content)
    nhs_hospital_website = find_nhs_hospital_website(nhs_hospital_page_content)
    
    # The contact details are added to the dataset as a new row
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    nhs_contact_list_info = Add_To_NHS_Contacts_Dataset(nhs_hospital_name, nhs_role, nhs_hospital_address, nhs_hospital_phone, nhs_hospital_email, nhs_hospital_website)
    NHS_Contacts_Dataset = pd.concat([NHS_Contacts_Dataset, pd.DataFrame([nhs_contact_list_info])], ignore_index=True)

    #Printing to screen to check
    print(nhs_contact_list_info)

{'Name': 'Airedale NHS Foundation Trust', 'Role': 'Hospital', 'Address': 'Airedale General Hospital, Skipton Road, Steeton, Keighley, West Yorkshire, BD20 6TD', 'Phone': '01535 652511', 'Email': 'personnel.dept@anhst.nhs.uk', 'Website': 'http://www.airedale-trust.nhs.uk/'}
{'Name': "Alder Hey Children's NHS Foundation Trust", 'Role': 'Hospital', 'Address': "Alder Hey Children's Hospital, Eaton Road, West Derby, Liverpool, Merseyside, L12 2AP", 'Phone': '0151 228 4811', 'Email': 'pals@alderhey.nhs.uk', 'Website': 'http://www.alderhey.nhs.uk'}
{'Name': "Ashford and St Peter's Hospitals NHS Foundation Trust", 'Role': 'Hospital', 'Address': 'St Peters Hospital, Guildford Road, Chertsey, Surrey, KT16 0PZ', 'Phone': '01932 872000', 'Email': 'asp-tr.patient.advice@nhs.net', 'Website': 'http://www.ashfordstpeters.nhs.uk/'}
{'Name': 'Barking, Havering and Redbridge University Hospitals NHS Trust', 'Role': 'Hospital', 'Address': "Queen's Hospital, Rom Valley Way, Romford, Essex, RM7 0AG", 'Phone

{'Name': 'Frimley Health NHS Foundation Trust', 'Role': 'Hospital', 'Address': 'Portsmouth Road, Frimley, Camberley, Surrey, GU16 7UJ', 'Phone': '0300 6145000', 'Email': 'palsusers@fhft.nhs.uk', 'Website': 'http://www.fhft.nhs.uk/'}
{'Name': 'Gateshead Health NHS Foundation Trust', 'Role': 'Hospital', 'Address': 'Queen Elizabeth Hospital, Sheriff Hill, Gateshead, Tyne and Wear, NE9 6SX', 'Phone': '0191 482 0000', 'Email': 'pals@ghnt.nhs.uk', 'Website': 'http://www.qegateshead.nhs.uk'}
{'Name': 'George Eliot Hospital NHS Trust', 'Role': 'Hospital', 'Address': 'Lewes House, College Street, Nuneaton, Warwickshire, CV10 7DJ', 'Phone': '024 76 351 351', 'Email': 'enquiries@geh.nhs.uk', 'Website': 'http://www.geh.nhs.uk/'}
{'Name': 'Gloucestershire Hospitals NHS Foundation Trust', 'Role': 'Hospital', 'Address': 'Trust Headquarters, Alexandra House, Sandford Road, Cheltenham, Gloucestershire, GL53 7AN', 'Phone': '0300 422 2222', 'Email': 'ghn-tr.membership@nhs.net', 'Website': 'http://www.glo

In [18]:
# Dataset with first 50 hospitals by alphabetic order.
NHS_Contacts_Dataset

Unnamed: 0,Name,Role,Address,Phone,Email,Website
0,Airedale NHS Foundation Trust,Hospital,"Airedale General Hospital, Skipton Road, Steet...",01535 652511,personnel.dept@anhst.nhs.uk,http://www.airedale-trust.nhs.uk/
1,Alder Hey Children's NHS Foundation Trust,Hospital,"Alder Hey Children's Hospital, Eaton Road, Wes...",0151 228 4811,pals@alderhey.nhs.uk,http://www.alderhey.nhs.uk
2,Ashford and St Peter's Hospitals NHS Foundatio...,Hospital,"St Peters Hospital, Guildford Road, Chertsey, ...",01932 872000,asp-tr.patient.advice@nhs.net,http://www.ashfordstpeters.nhs.uk/
3,"Barking, Havering and Redbridge University Hos...",Hospital,"Queen's Hospital, Rom Valley Way, Romford, Ess...",01708 435000,bhrut.pals@nhs.net,http://www.bhrhospitals.nhs.uk/
4,Barnsley Hospital NHS Foundation Trust,Hospital,"Gawber Road, Barnsley, South Yorkshire, S75 2EP",01226 730000,barnsleypals@nhs.net,http://www.barnsleyhospital.nhs.uk
5,Barts Health NHS Trust,Hospital,,020 7377 7000,,http://www.bartshealth.nhs.uk
6,Bedford Hospital NHS Trust,Hospital,"South Wing, Kempston Road, Bedford, Bedfordshi...",01234 355122,pals@bedfordhospital.nhs.uk,http://www.bedfordhospital.nhs.uk
7,BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,Hospital,"Lewsey Road, Luton, Bedfordshire, LU4 0DZ",01582 491166,info@ldh.nhs.uk,https://www.ldh.nhs.uk/
8,Birmingham Women's and Children's NHS Foundati...,Hospital,"Steelhouse Lane, Birmingham, West Midlands, B4...",0121 333 9999,,http://www.bwc.nhs.uk
9,Blackpool Teaching Hospitals NHS Foundation Trust,Hospital,"Victoria Hospital, Whinney Heys Road, Blackpoo...",01253 300000,bfwh.trustcommunications@nhs.net,https://www.bfwh.nhs.uk/


### Selecting the Page for GP Surgeries

A listing of GP Surgeries is at url https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=GpBranch

<img src="https://i.imgur.com/xrDpfrv.gif">

Visual inspection of this page shows some duplications, and following the links shows that these duplicates point to the same GP web page. So we will need to check for these duplications before selecting the surgeries.
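
One way to perform that check, once the list of `[name, url]` pairs has been built, is to keep only the first entry for each URL. The sketch below is an illustration under that assumption; `drop_duplicate_links` is a hypothetical helper that is not part of the notebook.

```python
def drop_duplicate_links(organisations):
    """Keep the first [name, url] entry for each distinct url, preserving order."""
    seen_urls = set()
    unique = []
    for name, url in organisations:
        if url in seen_urls:
            continue  # this page is already in the list
        seen_urls.add(url)
        unique.append([name, url])
    return unique
```

Matching on the URL rather than the name matters here, because different surgeries can share a name while having different pages.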

In [19]:
# Setting the url as variable
nhs_gps_listing_url = 'https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=GpBranch'
nhs_gps_listing_url

'https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=GpBranch'

In [20]:
# Using the Requests library to get the page & check it was captured (code 200)
nhs_gps_listing = requests.get(nhs_gps_listing_url)
nhs_gps_listing

<Response [200]>

### Parse the Page & Extract the GPs urls

In [21]:
# Parse the html page with Beautiful Soup
nhs_gps_doc = BeautifulSoup(nhs_gps_listing.text, 'html.parser')

# Showing a sample output of 10 lines
result = nhs_gps_doc.prettify().splitlines()
print('\n'.join(result[:10]))

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]-->
<!--[if IE 9]><html class="ie9" lang="en"><![endif]-->
<!--[if gt IE 9]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <title>
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>


In [22]:
# Finding the GP surgeries and the urls of their pages
nhs_gps_listing_table = nhs_gps_doc.find(class_ = "notranslate o-listing" )
nhs_gps_listing_table = nhs_gps_listing_table.find_all('a')
nhs_gps_list = []

# Placing the GP surgery names and urls into a list of [name, url] pairs
for lines in nhs_gps_listing_table:
    gp = lines.contents[0]
    link = 'https://www.nhs.uk/' + lines.get('href')
    new_line = [gp, link]
    nhs_gps_list.append(new_line)

# Printing the list to screen to check (first 10 GP surgeries) 
nhs_gps_list[:10]

[[' Christchurch Medical Practice',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=42097'],
 [' Dr Alalade and Dr Klemenz',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=39199'],
 [' Dr P Weston & Dr T Helbitz',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=39906'],
 [' EDMG - The Surgery',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=152101'],
 [' Pinfold Surgery',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=39054'],
 ['102 The Avenue Surgery',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=43750'],
 ['168 Medical Group',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=37781'],
 ['19 Beaumont Street Surgery',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=42962'],
 ['27 Beaumont Street Medical Practice',
  'https://www.nhs.uk//Services/GP/Overview/DefaultView.aspx?id=41440'],
 ['28 Beaumont Street',
  'https://www.nhs.uk//Services/GP/Overview/De

### Creating Python Functions to Find the Data Elements for GP Surgeries

Because the GP pages are created with a slightly different HTML (asp.net) structure, the hospital functions cannot be reused. 

#### Function for Scraping the GP Surgery Name from a Page

In [23]:
def find_gp_name(nhs_gp_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_gp_name = nhs_gp_page_content.find(id = "page-heading-org-name").text
    except AttributeError:  # find() returned None because the element is missing
        nhs_gp_name = ""   
    return nhs_gp_name

#### Function for Scraping the GP Surgery Address

In [24]:
def find_gp_address(nhs_gp_page_content): 
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_gp_address = nhs_gp_page_content.find(id = "address_panel_address").text
        # There are some formatting characters to remove, which luckily are identical for all data elements in GP Surgery pages
        nhs_gp_address = nhs_gp_address[2:]
        nhs_gp_address = nhs_gp_address.replace("\r\n\r\n",", ")
        nhs_gp_address = nhs_gp_address.replace("\r\n",", ")       
    except AttributeError:  # find() returned None because the element is missing
        nhs_gp_address = ""
    return nhs_gp_address
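
The fixed slice and the two `replace` calls rely on every address having the same line-break layout, and the printed records show that trailing spaces are left behind. A more tolerant clean-up, sketched here as an alternative, splits the raw text on line breaks, trims each part and drops the empty ones. `tidy_address` is a hypothetical helper, and the sample string in the usage note is a made-up illustration of the layout rather than text copied from the site.

```python
def tidy_address(raw_address):
    """Turn a multi-line address into one comma-separated line without stray whitespace."""
    parts = [part.strip() for part in raw_address.splitlines()]
    return ", ".join(part for part in parts if part)
```

For example, `tidy_address("\r\nDunelm Road\r\n\r\nThornley\r\nDurham\r\nDH6 3HW  ")` returns `'Dunelm Road, Thornley, Durham, DH6 3HW'`.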

#### Function for Scraping the GP Surgery Phone Number (Landline)

In [25]:
def find_gp_phone(nhs_gp_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_gp_phone = nhs_gp_page_content.find(id = "contact_info_panel_phone_text").text
    except AttributeError:  # find() returned None because the element is missing
        nhs_gp_phone = ""
    return nhs_gp_phone

#### Function for Scraping the GP Surgery Email Address

In [26]:
def find_nhs_gp_email(nhs_gp_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_gp_email_raw = nhs_gp_page_content.find(href=re.compile("mailto:"))
        nhs_gp_email = nhs_gp_email_raw.get('href')[7:]
    except AttributeError:  # find() returned None because no mailto: link exists
        nhs_gp_email = ""
    return nhs_gp_email

#### Function for Scraping the GP Surgery Website url

In [27]:
def find_nhs_gp_website(nhs_gp_page_content):
    # Some pages may be missing the information so exception capturing is needed
    try:
        nhs_gp_website_raw = nhs_gp_page_content.find(id = "contact_info_panel_website_link")
        nhs_gp_website = nhs_gp_website_raw.get('href')
    except AttributeError:  # find() returned None because the element is missing
        nhs_gp_website = ""
    return nhs_gp_website

In [28]:
# Setting the Role as being "GP Surgery"
nhs_role = "GP Surgery"
# Looping through each GP Surgery page (restricted to first 100). Removing this constraint will loop through all GP Surgeries!
for gp in nhs_gps_list[0:100]:
    nhs_gp_url = gp[1]
    nhs_gp_page = requests.get(nhs_gp_url)
    # extracting the fields for "Name", "Address", "Phone", "Email", "Website" using Beautiful Soup 4
    nhs_gp_page_content = BeautifulSoup(nhs_gp_page.text, 'html.parser')   
    nhs_gp_name = find_gp_name(nhs_gp_page_content)
    nhs_gp_address = find_gp_address(nhs_gp_page_content)
    nhs_gp_phone = find_gp_phone(nhs_gp_page_content)
    nhs_gp_email = find_nhs_gp_email(nhs_gp_page_content)
    nhs_gp_website = find_nhs_gp_website(nhs_gp_page_content)

    # The contact details are added to the dataset as a new row
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    nhs_contact_list_info = Add_To_NHS_Contacts_Dataset(nhs_gp_name, nhs_role, nhs_gp_address, nhs_gp_phone, nhs_gp_email, nhs_gp_website)
    NHS_Contacts_Dataset = pd.concat([NHS_Contacts_Dataset, pd.DataFrame([nhs_contact_list_info])], ignore_index=True)
    #Printing to screen to check
    print(nhs_contact_list_info)

{'Name': ' Christchurch Medical Practice', 'Role': 'GP Surgery', 'Address': 'The Orchard Surgery, 1 Purewell Cross Road, Christchurch, Dorset, BH23 3AF  ', 'Phone': '01202 474311', 'Email': '', 'Website': 'http://www.christchurchmedicalpractice.co.uk'}
{'Name': ' Dr Alalade and Dr Klemenz', 'Role': 'GP Surgery', 'Address': "The Nuffield Centre, St Michael's Road, Portsmouth, Hampshire, PO1 2BH  ", 'Phone': '02392736006', 'Email': 'pccg.universitysurgery@nhs.net', 'Website': 'http://www.universitysurgery.com'}
{'Name': '', 'Role': 'GP Surgery', 'Address': '', 'Phone': '', 'Email': '', 'Website': ''}
{'Name': ' EDMG - The Surgery', 'Role': 'GP Surgery', 'Address': 'Dunelm Road, Thornley, Durham, DH6 3HW  ', 'Phone': '', 'Email': '', 'Website': ''}
{'Name': ' Pinfold Surgery', 'Role': 'GP Surgery', 'Address': 'Health Care First Partnership, Pinfold Surgery, Pinfold Lane, LS26 9AA  ', 'Phone': '01977 664141', 'Email': '', 'Website': 'https://www.healthcarefirst.co.uk'}
{'Name': '102 The Av

{'Name': 'Abbey Medical Practice', 'Role': 'GP Surgery', 'Address': '95 Monks Road, Lincoln, Lincolnshire, LN2 5HR  ', 'Phone': '01522530334', 'Email': '', 'Website': 'http://www.abbeymedicalpractice.co.uk'}
{'Name': 'Abbey Medical Practice', 'Role': 'GP Surgery', 'Address': 'Mannock Medical Centre, Irthlingborough Road, Wellingborough, Northamptonshire, NN8 1LT  ', 'Phone': '01933233200', 'Email': '', 'Website': 'http://www.abbeymedicalpractice.uk.com/'}
{'Name': 'Abbey Medical Practice', 'Role': 'GP Surgery', 'Address': 'Evesham Medical Centre, Abbey Lane Evesham, Evesham, Worcestershire, WR11 4BS  ', 'Phone': '01386761111', 'Email': 'abbey.evesham@nhs.net', 'Website': 'http://www.abbeymedical.com'}
{'Name': '', 'Role': 'GP Surgery', 'Address': '', 'Phone': '', 'Email': '', 'Website': ''}
{'Name': 'Abbey Road Medical Practice', 'Role': 'GP Surgery', 'Address': '28A Abbey Road, South West Newham, Stratford, London, Greater London, E15 3LT  ', 'Phone': '02085342515', 'Email': '', 'Webs

{'Name': 'Addington Medical Centre', 'Role': 'GP Surgery', 'Address': '46 Station Road, Barnet, Hertfordshire, EN5 1QH  ', 'Phone': '02084414425', 'Email': '', 'Website': 'https://www.addingtonmedicalcentre.co.uk'}
{'Name': 'Addington Medical Practice', 'Role': 'GP Surgery', 'Address': 'Parkway, New Addington, Croydon, Surrey, CR0 0JA  ', 'Phone': '01689849993', 'Email': '', 'Website': ''}
{'Name': 'Addington Road Surgery', 'Role': 'GP Surgery', 'Address': '77 Addington Road, West Wickham, Kent, BR4 9BG  ', 'Phone': '02084625771', 'Email': '', 'Website': 'http://www.addingtonroadsurgery.co.uk'}
{'Name': 'Addison House - Haque Practice', 'Role': 'GP Surgery', 'Address': 'Hamstel Road, Harlow, Essex, CM20 1DS  ', 'Phone': '01279621900', 'Email': '', 'Website': 'http://www.addison-surgery.nhs.uk'}
{'Name': 'Addison Road Medical Practice', 'Role': 'GP Surgery', 'Address': 'Comely Bank Clinic, 46 Ravenswood Road, London, Greater London, E17 9LY  ', 'Phone': '02084307171', 'Email': '', 'Webs

### Reviewing the Dataset and Saving Data to a CSV file

In [29]:
# Dataset with first 50 hospitals and first 100 GP Surgeries by alphabetic order.
NHS_Contacts_Dataset

Unnamed: 0,Name,Role,Address,Phone,Email,Website
0,Airedale NHS Foundation Trust,Hospital,"Airedale General Hospital, Skipton Road, Steet...",01535 652511,personnel.dept@anhst.nhs.uk,http://www.airedale-trust.nhs.uk/
1,Alder Hey Children's NHS Foundation Trust,Hospital,"Alder Hey Children's Hospital, Eaton Road, Wes...",0151 228 4811,pals@alderhey.nhs.uk,http://www.alderhey.nhs.uk
2,Ashford and St Peter's Hospitals NHS Foundatio...,Hospital,"St Peters Hospital, Guildford Road, Chertsey, ...",01932 872000,asp-tr.patient.advice@nhs.net,http://www.ashfordstpeters.nhs.uk/
3,"Barking, Havering and Redbridge University Hos...",Hospital,"Queen's Hospital, Rom Valley Way, Romford, Ess...",01708 435000,bhrut.pals@nhs.net,http://www.bhrhospitals.nhs.uk/
4,Barnsley Hospital NHS Foundation Trust,Hospital,"Gawber Road, Barnsley, South Yorkshire, S75 2EP",01226 730000,barnsleypals@nhs.net,http://www.barnsleyhospital.nhs.uk
...,...,...,...,...,...,...
145,Ahmed N Queens Park Health Centre,GP Surgery,"Dart Street, London, Greater London, W10 4LD",02089649990,,
146,Ailsa Craig Medical Centre,GP Surgery,"270 Dickenson Road, Longsight, Manchester, Gre...",01612245555,cmccg.ailsacraig@nhs.net,http://www.ailsacraigmedicalpractice.co.uk
147,Ailsworth Medical Centre,GP Surgery,"32 Main Street, Ailsworth, Peterborough, Cambr...",01733 380686,,http://www.ailsworthsurgery.co.uk
148,Ainsdale Medical Centre,GP Surgery,"66 Station Road, Ainsdale, Southport, Merseysi...",01704575133,,http://www.ainsdalemedicalcentre.nhs.net


In [30]:
# Saving to CSV file
NHS_Contacts_Dataset.to_csv('NHS_Contacts_Dataset.csv', index=False)
os.listdir('/home/jovyan')

['.profile',
 '.bashrc',
 '.bash_logout',
 '.local',
 '.cache',
 '.jupyter',
 'NHS_Contacts_Dataset.csv',
 '.jovian',
 'zerotoanalyst-project1.ipynb',
 '.ipython',
 '.ipynb_checkpoints',
 '.jovianrc',
 '.empty',
 'work',
 '.config',
 '.conda',
 '.git',
 '.yarn']
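One point to watch when the saved file is loaded again is that several phone numbers in the dataset are stored without spaces (for example `02089649990`). The following is a minimal sketch, using a hypothetical two-row sample rather than the real `NHS_Contacts_Dataset`, of how pandas type inference can drop the leading zero on read-back and how reading the `Phone` column as text avoids this. The file name `sample_contacts.csv` is an assumption for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample standing in for NHS_Contacts_Dataset.
sample = pd.DataFrame(
    {
        "Name": ["Ahmed N Queens Park Health Centre", "Ailsa Craig Medical Centre"],
        "Role": ["GP Surgery", "GP Surgery"],
        "Phone": ["02089649990", "01612245555"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample_contacts.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    sample.to_csv(csv_path, index=False)

    # Default type inference parses an all-digit column as integers,
    # which silently drops the leading zero.
    inferred = pd.read_csv(csv_path)

    # Reading the Phone column as text preserves the numbers exactly.
    as_text = pd.read_csv(csv_path, dtype={"Phone": str})

inferred_first = str(inferred.loc[0, "Phone"])
text_first = as_text.loc[0, "Phone"]
print(inferred_first, text_first)
```

In the full dataset some phone numbers contain spaces, so the column would probably be read as text anyway, but passing `dtype` explicitly removes the dependence on that.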

### Extending the Project to All NHS Organisations

The project is a demonstrator and so has been constrained to the first 50 hospitals and the first 100 GP surgeries.

The approach should extend readily to all NHS organisations, as listing pages for every type of organisation are accessible within the NHS site (https://www.nhs.uk).  The starting pages for each type of NHS organisation would be:

- Area Teams at https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=LocalAreaTeam
- Clinical Commissioning Groups at https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=ClinicalCommissionGroup
- Health & Care Trusts at https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=HealthAndCare
- Independent Providers at https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=Independent
- Mental Health Trusts at https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=MentalHealth
- Care Organisations at https://www.nhs.uk/Services/Pages/AcuteTrustList.aspx?trustType=SocialCare
- Dentists at https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=Dentist
- Opticians at https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=Optician
- Pharmacies at https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=Pharmacy
- NHS Clinics at https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=Clinic
- Care Providers at https://www.nhs.uk/Services/Pages/HospitalList.aspx?chorg=CareProvider

If these extensions were implemented then the contact details of all 16,000+ NHS organisations would be captured in the NHS_Contacts_Dataset.
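The starting pages above follow two URL patterns, so they can be held in a single lookup table and generated in a loop. The sketch below is one possible way to organise this; the names `START_PAGES` and `start_url` are hypothetical, and it assumes (without having checked) that the listing pages for the other organisation types can be parsed with the same scraping functions used for hospitals and GP surgeries.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nhs.uk/Services/Pages/"

# (listing page, query parameter, value) for each organisation type,
# copied from the starting pages listed above.
START_PAGES = {
    "Area Team": ("AcuteTrustList.aspx", "trustType", "LocalAreaTeam"),
    "Clinical Commissioning Group": ("AcuteTrustList.aspx", "trustType", "ClinicalCommissionGroup"),
    "Health & Care Trust": ("AcuteTrustList.aspx", "trustType", "HealthAndCare"),
    "Independent Provider": ("AcuteTrustList.aspx", "trustType", "Independent"),
    "Mental Health Trust": ("AcuteTrustList.aspx", "trustType", "MentalHealth"),
    "Care Organisation": ("AcuteTrustList.aspx", "trustType", "SocialCare"),
    "Dentist": ("HospitalList.aspx", "chorg", "Dentist"),
    "Optician": ("HospitalList.aspx", "chorg", "Optician"),
    "Pharmacy": ("HospitalList.aspx", "chorg", "Pharmacy"),
    "NHS Clinic": ("HospitalList.aspx", "chorg", "Clinic"),
    "Care Provider": ("HospitalList.aspx", "chorg", "CareProvider"),
}


def start_url(org_type):
    """Build the starting listing URL for one organisation type."""
    page, param, value = START_PAGES[org_type]
    return f"{BASE_URL}{page}?{urlencode({param: value})}"


for org_type in START_PAGES:
    # Each URL would be handed to the existing scraping code, with
    # org_type stored in the Role column of the resulting rows.
    print(org_type, start_url(org_type))
```

Keeping the organisation type as the dictionary key means the same value can be reused for the `Role` column, in the way "Hospital" and "GP Surgery" are used in the current dataset.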

<img src="https://i.imgur.com/pkpmnI5.gif">

### Issues & Solutions

The following issues need investigation:

1. Some organisations do not publish all of the data elements on their pages within the NHS website.  This is particularly the case for the e-mail addresses of GP surgeries.  A possible solution is to find another web-site that does publish them, scrape just the e-mail addresses, and match them into the dataset.
2. At 150 organisations the code takes a couple of minutes to run.  At the same rate, a run over all 16,000+ NHS organisations would take roughly three to four hours.  It is therefore worthwhile investigating whether the code could be optimised, and whether most of the time is spent waiting for pages to be delivered, as seems likely.
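If most of the run time is indeed spent waiting for pages, one option worth testing is to download several pages at once. The following is a minimal sketch of that idea using a thread pool from the standard library. It is not the project's code: `fetch_all` and `fake_fetch` are hypothetical names, the `example.org` URLs are placeholders, and `fake_fetch` only sleeps to imitate network latency so that the sketch runs without touching the NHS site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL using a pool of threads.

    Results are returned in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a real page download: sleep briefly, return dummy HTML."""
    time.sleep(0.05)
    return f"<html>{url}</html>"


# Placeholder URLs, used only to exercise the timing comparison.
urls = [f"https://example.org/page/{i}" for i in range(16)]

start = time.perf_counter()
sequential = [fake_fetch(u) for u in urls]
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
concurrent = fetch_all(urls, fake_fetch)
concurrent_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

In a real run `fake_fetch` would be replaced by a function that wraps `requests.get` and returns the page text, and the number of workers would be kept modest so that the site is not overloaded. Timing the two versions on a small sample of real pages would show whether waiting for pages really is the bottleneck.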

### Summary

The project has demonstrated that the NHS website can be web scraped with Beautiful Soup 4 to create a contact dataset, and that the same approach could be extended to cover all NHS organisations.  A number of challenges meant it was not completely straightforward, but with careful investigation and some "trial and error" with BS4 functions the contacts dataset was completed for a sample of hospitals and GP surgeries.

### Future Work

There is much more that could be done with this project:

1. The extension of this work to include all types of NHS organisations could now be undertaken to create the full dataset.
2. A comparison of the dataset against commercially available datasets would provide an indication of its value and show whether a dataset created by web scraping is significantly more up-to-date.
3. This work has identified the web-site of each organisation.  If there is some standardisation of structure across these web-sites, then web scraping could, for example, capture the names of key executives, which are usually listed on the pages describing the Board membership.
4. Other valuable data elements could be added to the dataset from the NHS website such as the different services provided.

This work could also be continued to explore further ways in which contact information can be captured from web-sites for research, sales leads, etc.

### Saving the notebook to Jovian

In [31]:
# Execute this to save new versions of the notebook
jovian.commit(project="zerotoanalyst-project1", privacy='public')

[jovian] Attempting to save notebook..
[jovian] Updating notebook "metanoialondon/zerotoanalyst-project1" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/metanoialondon/zerotoanalyst-project1


'https://jovian.ai/metanoialondon/zerotoanalyst-project1'

In [None]:
jovian.submit(assignment="zerotoanalyst-project1")

[jovian] Attempting to save notebook..
