## **Web Scraping**

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup
from tqdm import tqdm

print("> Libraries Imported")

> Libraries Imported


## **Conditions data**
We obtain data on conditions from the NHS inform webpage [*Illnesses and conditions*](https://www.nhsinform.scot/illnesses-and-conditions/a-to-z), as suggested in the assignment. To do this we need web scraping, for which the `BeautifulSoup` library is useful.

#### **Obtain the web page**
First, we obtain the main content of the web page by means of the `requests` library:

In [2]:
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
cond_url = "https://www.nhsinform.scot/illnesses-and-conditions/a-to-z"

r_cond = requests.get(cond_url,headers=header)

# check the response code (it should be 200)
print(f"> Status Code: {r_cond.status_code}")

# and obtain the content
soup_cond = BeautifulSoup(r_cond.text, 'html.parser')
print("> Site obtained")

> Status Code: 200
> Site obtained


#### **Obtain each condition detail page**

Now we create the logic to obtain the information we need. We are interested in the name and the type of each condition. Each name is the text of an `<a>` tag in the list, while the type appears on the page that the same tag links to through its `href` attribute.

In [3]:
# the info is stored in the 'a' tags, with class 'nhs-uk__az-link'
links_cond = soup_cond.find_all('a',attrs={"class":"nhs-uk__az-link"})

# setup placeholders and variables
BASE_LINK_C = 'https://www.nhsinform.scot/' # useful to complete the href link
name_cond_list = []
href_cond_list = []

# iterate over obtained tags
for link in links_cond:

    # preprocess name
    clean_name = link.text.replace('\r','').replace('\n','').replace('\t','')
    
    # save it in the list
    name_cond_list.append(clean_name)

    # obtain href link (for the type page)
    href_cond_list.append(BASE_LINK_C + link.get('href'))

# show results
print(f"> Total Conditions Found: {len(name_cond_list)}")
for name, link in zip(name_cond_list[0:5], href_cond_list[0:5]):
    print(f" * '{name}' | Link: '{link}'")

> Total Conditions Found: 322
 * 'Abdominal aortic aneurysm' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/abdominal-aortic-aneurysm'
 * 'Acne' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acne'
 * 'Acute cholecystitis' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acute-cholecystitis'
 * 'Acute lymphoblastic leukaemia' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acute-lymphoblastic-leukaemia'
 * 'Acute lymphoblastic leukaemia: Children' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acute-lymphoblastic-leukaemia-children'
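
The links printed above contain a double slash after the domain, because the base link ends with `/` and each `href` starts with `/`. The site served most of these URLs anyway, but a minimal sketch of a more robust way to build them, assuming the standard-library `urljoin`, is shown below. The helper names `build_link` and `clean_name` are hypothetical, and `clean_name` normalises whitespace slightly more aggressively than the chained `replace` calls in the cell above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nhsinform.scot/"  # same site root as in the notebook


def build_link(href: str, base: str = BASE_URL) -> str:
    """Resolve an href against the site root without doubling slashes."""
    return urljoin(base, href)


def clean_name(text: str) -> str:
    """Collapse line breaks, tabs and repeated spaces into single spaces."""
    return " ".join(text.split())


print(build_link("/illnesses-and-conditions/a-to-z/a/acne"))
print(clean_name("\r\n\t Acne \n"))
```

`urljoin` also leaves absolute links untouched, which matters if the site ever mixes relative and absolute `href` values.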


#### **Obtain the information about type**
In particular, the type information is contained in an `<a>` tag in the breadcrumbs at the top of each page:

![condition type](notebooks_images\cond_type.png "Title")

We obtain this information for all conditions whose link does not give an error. In particular, two kinds of error can occur:
- the web page no longer exists (404 error)
- the condition page does not have the type in the breadcrumbs (list index out of range error).

In [4]:
# initialize placeholder
types_cond_list = []

# iterate over conditions pages links
for link in tqdm(href_cond_list, desc="Scraping Sites"):
    
    # connect
    r = requests.get(link,headers=header)

    # check for status code
    if r.status_code == 200:

        try:
            # obtain content 
            soup = BeautifulSoup(r.text, 'html.parser')

            # obtain 'a' with class 'nhsuk-breadcrumb__link'
            links = soup.find_all('a',attrs={"class":"nhsuk-breadcrumb__link"})

            # save it in the list
            types_cond_list.append(links[2].text)

        except Exception as e:
            types_cond_list.append("ERROR")
            print(f"> Error while retrieving the page '{link}'.")
            print(f"  Error Details: {str(e)}")

    # if the page could not be reached, print a warning
    else:
        types_cond_list.append("ERROR")
        print(f"> Error while retrieving the page '{link}'.")
        print(f"  Status Code: {r.status_code}")


Scraping Sites:  19%|█▉        | 61/322 [00:37<02:17,  1.90it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/c/chronic-fatigue-syndrome'.
  Status Code: 404


Scraping Sites:  28%|██▊       | 91/322 [00:53<02:17,  1.67it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/d/diabetes'.
  Error Details: list index out of range


Scraping Sites:  53%|█████▎    | 172/322 [01:36<01:12,  2.08it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/l/langerhans-cell-histiocytosis'.
  Status Code: 404


Scraping Sites:  58%|█████▊    | 188/322 [01:43<00:58,  2.29it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/m/malnutrition'.
  Status Code: 404


Scraping Sites:  59%|█████▉    | 191/322 [01:45<00:53,  2.47it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/m/menopause'.
  Status Code: 404


Scraping Sites:  66%|██████▌   | 211/322 [01:56<00:53,  2.06it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/n/norovirus'.
  Status Code: 404


Scraping Sites:  75%|███████▍  | 240/322 [02:12<00:41,  2.00it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/p/pregnancy-and-baby'.
  Error Details: list index out of range


Scraping Sites:  86%|████████▋ | 278/322 [02:31<00:22,  1.97it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/s/stress-anxiety-and-low-mood'.
  Error Details: list index out of range


Scraping Sites:  87%|████████▋ | 279/322 [02:32<00:20,  2.10it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/s/stroke'.
  Status Code: 404


Scraping Sites:  87%|████████▋ | 281/322 [02:32<00:17,  2.32it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/s/suicide'.
  Status Code: 404


Scraping Sites:  90%|█████████ | 291/322 [02:38<00:16,  1.85it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/t/thrush-in-men'.
  Status Code: 404


Scraping Sites:  93%|█████████▎| 298/322 [02:41<00:10,  2.21it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/t/transient-ischaemic-attack-tia'.
  Status Code: 404


Scraping Sites:  97%|█████████▋| 311/322 [02:49<00:05,  1.98it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/v/vaginal-thrush'.
  Status Code: 404


Scraping Sites: 100%|██████████| 322/322 [02:56<00:00,  1.82it/s]
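
The two failure modes seen in the log (a non-200 response, and a breadcrumb with fewer than three links) can also be told apart without a broad `except`. The sketch below is one possible way to do that, not the notebook's method. `classify_page` and the outcome labels are hypothetical names, and the breadcrumb texts in the usage lines are made up for illustration. Unlike the cell above, it also strips surrounding whitespace from the type.

```python
from typing import Optional, Sequence, Tuple

TYPE_POSITION = 2  # index of the type link in the breadcrumb, as in links[2]


def classify_page(
    status_code: int, breadcrumb_texts: Sequence[str]
) -> Tuple[Optional[str], str]:
    """Return (type, outcome) for one scraped page.

    outcome is 'ok', 'http_error' (page not reachable) or
    'no_type' (breadcrumb too short to contain a type).
    """
    if status_code != 200:
        return None, "http_error"
    if len(breadcrumb_texts) <= TYPE_POSITION:
        return None, "no_type"
    return breadcrumb_texts[TYPE_POSITION].strip(), "ok"


# illustrative inputs only; real breadcrumb texts may differ
print(classify_page(200, ["Home", "Illnesses and conditions", "Cancer"]))
print(classify_page(404, []))
print(classify_page(200, ["Home", "Illnesses and conditions"]))
```

In the loop, `breadcrumb_texts` would be `[a.text for a in links]`, and storing the outcome next to the type makes it easy to count each kind of failure afterwards.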


#### **Create the complete dataset of conditions**

We merge all the scraped data to make them easier to inspect, saving them in a `pandas` DataFrame. In addition, we generate an id for each condition.

In [5]:
# create the df from lists
conditions_df = pd.DataFrame(
    data=zip(name_cond_list, types_cond_list),
    columns=["condition_name","condition_type"]
)

# remove error types
print(f"> Number of Conditions         {conditions_df.shape[0]}")
conditions_df = conditions_df.loc[conditions_df["condition_type"] != "ERROR"].reset_index(drop=True)
print(f"> Number of Conditions (clean) {conditions_df.shape[0]}")

# add column with conditions ids
conditions_ids = [f"cond_{n}" for n in range(len(conditions_df))]
conditions_df["condition_id"] = conditions_ids

# reorder columns
conditions_df = conditions_df[["condition_id", "condition_name", "condition_type"]]

# show results
conditions_df

> Number of Conditions         322
> Number of Conditions (clean) 309


| | condition_id | condition_name | condition_type |
|---|---|---|---|
| 0 | cond_0 | Abdominal aortic aneurysm | Heart and blood vessels |
| 1 | cond_1 | Acne | Skin, hair and nails |
| 2 | cond_2 | Acute cholecystitis | Stomach, liver and gastrointestinal tract |
| 3 | cond_3 | Acute lymphoblastic leukaemia | Cancer |
| 4 | cond_4 | Acute lymphoblastic leukaemia: Children | Cancer |
| ... | ... | ... | ... |
| 304 | cond_304 | Warts and verrucas | Skin, hair and nails |
| 305 | cond_305 | Whooping cough | Infections and poisoning |
| 306 | cond_306 | Wilms’ tumour | Cancer |
| 307 | cond_307 | Womb (uterus) cancer | Cancer |


#### **How many different types do we have?**

We can check the number of condition types we are dealing with:

In [6]:
# obtain unique types
types_cond_set = conditions_df[["condition_type"]].drop_duplicates().reset_index(drop=True)
types_cond_set

| | condition_type |
|---|---|
| 0 | Heart and blood vessels |
| 1 | Skin, hair and nails |
| 2 | Stomach, liver and gastrointestinal tract |
| 3 | Cancer |
| 4 | Glands |
| 5 | Ears, nose and throat |
| 6 | Immune system |
| 7 | Brain, nerves and spinal cord |
| 8 | Muscle, bone and joints |
| 9 | Mental health |


We have 25 condition types. Let us now check the frequency of these types:

In [7]:
type_cond_freq = pd.DataFrame(conditions_df.groupby(['condition_type'])['condition_name'].count()).sort_values(by=['condition_name'], ascending=False)
type_cond_freq

| condition_type | condition_name |
|---|---|
| Cancer | 73 |
| Stomach, liver and gastrointestinal tract | 33 |
| Infections and poisoning | 27 |
| Skin, hair and nails | 22 |
| Brain, nerves and spinal cord | 18 |
| Muscle, bone and joints | 16 |
| Sexual and reproductive | 16 |
| Mental health | 16 |
| Ears, nose and throat | 15 |
| Lungs and airways | 14 |


Some types are far more frequent than others: Cancer alone accounts for 73 conditions, more than twice as many as the next most common type.
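
The number of distinct types and their frequencies can also be read off a single column with `nunique()` and `value_counts()`. The sketch below illustrates this on a three-row toy frame whose rows are copied from the table above. The names `toy_df`, `n_types` and `type_counts` are introduced here for illustration only.

```python
import pandas as pd

# three rows taken from the conditions table, for illustration only
toy_df = pd.DataFrame(
    {
        "condition_name": ["Acne", "Warts and verrucas", "Womb (uterus) cancer"],
        "condition_type": ["Skin, hair and nails", "Skin, hair and nails", "Cancer"],
    }
)

# number of distinct types
n_types = toy_df["condition_type"].nunique()

# rows per type, sorted from most to least frequent by default
type_counts = toy_df["condition_type"].value_counts()

print(n_types)
print(type_counts)
```

Applied to `conditions_df`, the same two calls would give the type count and the frequency table without the intermediate `groupby` and `sort_values`.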

---

## **Therapies data**

We obtain data on therapies from the NHS inform webpage [*Tests and treatments*](https://www.nhsinform.scot/tests-and-treatments/a-to-z). Since the page structure is the same as in the previous case, we can proceed with the same operations.
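
Because both A-to-Z pages share one structure, the link extraction could be wrapped in a single reusable function instead of being repeated. The sketch below is one way to do that under stated assumptions. It uses the standard-library `HTMLParser` as a stand-in for `BeautifulSoup`, only so that the example runs without third-party packages. `AzLinkParser`, `extract_az_links` and the `sample` HTML snippet are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AzLinkParser(HTMLParser):
    """Collect (name, href) pairs from <a> tags carrying a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.items = []      # finished (name, href) pairs
        self._href = None    # href of the link currently being read
        self._chunks = []    # text pieces of the link currently being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = attr.get("href") or ""
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # collapse line breaks, tabs and repeated spaces in the link text
            name = " ".join("".join(self._chunks).split())
            self.items.append((name, self._href))
            self._href = None


def extract_az_links(html, base_url, css_class="nhs-uk__az-link"):
    """Return a list of (name, absolute_url) for every matching link."""
    parser = AzLinkParser(css_class)
    parser.feed(html)
    parser.close()
    return [(name, urljoin(base_url, href)) for name, href in parser.items]


# made-up snippet that mimics the structure of the A-to-Z list
sample = (
    '<ul><li><a class="nhs-uk__az-link" '
    'href="/tests-and-treatments/a-to-z/a/abortion">\n Abortion\n</a></li>'
    '<li><a href="/other">Skip me</a></li></ul>'
)
print(extract_az_links(sample, "https://www.nhsinform.scot/"))
```

With `BeautifulSoup` already imported, the same function body would reduce to a `find_all` call followed by a list comprehension. The point of the sketch is only the shared signature for conditions and therapies.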

#### **Obtain the web page**

In [9]:
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
base_url_ther = "https://www.nhsinform.scot/tests-and-treatments/a-to-z"

r_ther = requests.get(base_url_ther,headers=header)

# check the response code (it should be 200)
print(f"> Status Code: {r_ther.status_code}")

# and obtain the content
soup_ther = BeautifulSoup(r_ther.text, 'html.parser')
print("> Site obtained")

> Status Code: 200
> Site obtained


#### **Obtain each therapy detail page**

In [10]:
# the info is stored in the 'a' tags, with class 'nhs-uk__az-link'
links_ther = soup_ther.find_all('a',attrs={"class":"nhs-uk__az-link"})

# setup placeholders and variables
BASE_LINK_T = 'https://www.nhsinform.scot/'
name_ther_list = []
href_ther_list = []

# iterate over obtained tags
for link in links_ther:

    # preprocess name
    clean_name = link.text.replace('\r','').replace('\n','').replace('\t','')
    
    # save it in the list
    name_ther_list.append(clean_name)

    # obtain href
    href_ther_list.append(BASE_LINK_T + link.get('href'))

# show results
print(f"> Total Therapies Found: {len(name_ther_list)}")
for name, link in zip(name_ther_list[0:5], href_ther_list[0:5]):
    print(f" * '{name}' | Link: '{link}'")

> Total Therapies Found: 69
 * 'Abortion' | Link: 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/a/abortion'
 * 'Amniocentesis' | Link: 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/a/amniocentesis'
 * 'Amputation' | Link: 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/a/amputation'
 * 'Anaesthesia' | Link: 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/a/anaesthesia'
 * 'Antibiotics' | Link: 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/a/antibiotics'


#### **Obtain the information about type**


In [11]:
# setup placeholder
types_ther_list = []

for link in tqdm(href_ther_list, desc="Scraping Sites"):
    
    # connect
    r = requests.get(link,headers=header)

    # check for status code
    if r.status_code == 200:

        try:
            # obtain content 
            soup = BeautifulSoup(r.text, 'html.parser')

            # obtain 'a' with class 'nhsuk-breadcrumb__link'
            links = soup.find_all('a',attrs={"class":"nhsuk-breadcrumb__link"})

            # save it in the list
            types_ther_list.append(links[2].text)

        except Exception as e:
            types_ther_list.append("ERROR")
            print(f"> Error while retrieving the page '{link}'.")
            print(f"  Error Details: {str(e)}")

    # if the page could not be reached, print a warning
    else:
        types_ther_list.append("ERROR")
        print(f"> Error while retrieving the page '{link}'.")
        print(f"  Status Code: {r.status_code}")

Scraping Sites:  32%|███▏      | 22/69 [00:11<00:23,  1.98it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/c/contraception'.
  Error Details: list index out of range


Scraping Sites:  88%|████████▊ | 61/69 [00:31<00:04,  1.83it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//tests-and-treatments/a-to-z/s/screening'.
  Error Details: list index out of range


Scraping Sites: 100%|██████████| 69/69 [00:35<00:00,  1.94it/s]


#### **Create the complete dataset of therapies**

In [12]:
# create the df from lists
therapies_df = pd.DataFrame(
    data=zip(name_ther_list, types_ther_list),
    columns=["therapy_name","therapy_type"]
)

# remove error types
print(f"> Number of Therapy         {therapies_df.shape[0]}")
therapies_df = therapies_df.loc[therapies_df["therapy_type"] != "ERROR"].reset_index(drop=True)
print(f"> Number of Therapy (clean) {therapies_df.shape[0]}")

# add column with therapy ids
therapies_ids = [f"ther_{n}" for n in range(len(therapies_df))]
therapies_df["therapy_id"] = therapies_ids

# reorder columns
therapies_df = therapies_df[["therapy_id", "therapy_name", "therapy_type"]]

# show results
therapies_df

> Number of Therapy         69
> Number of Therapy (clean) 67


| | therapy_id | therapy_name | therapy_type |
|---|---|---|---|
| 0 | ther_0 | Abortion | Surgical procedures |
| 1 | ther_1 | Amniocentesis | Biopsies |
| 2 | ther_2 | Amputation | Surgical procedures |
| 3 | ther_3 | Anaesthesia | Medicines and medical aids |
| 4 | ther_4 | Antibiotics | Medicines and medical aids |
| ... | ... | ... | ... |
| 62 | ther_62 | Ultrasound scan | Scans and X-rays |
| 63 | ther_63 | Urinary catheterisation | Medicines and medical aids |
| 64 | ther_64 | Warfarin | Medicines and medical aids |
| 65 | ther_65 | Wisdom tooth removal | Dental treatments |


We end up with 67 therapies.

#### **How many different types do we have and what are their frequencies?**

In [13]:
types_set = therapies_df[["therapy_type"]].drop_duplicates().reset_index(drop=True)
types_set

| | therapy_type |
|---|---|
| 0 | Surgical procedures |
| 1 | Biopsies |
| 2 | Medicines and medical aids |
| 3 | Non-surgical procedures |
| 4 | Blood tests |
| 5 | Routine tests and examinations |
| 6 | Scans and X-rays |
| 7 | Dental treatments |
| 8 | Counselling and therapies |


There are 9 therapy types.

In [14]:
type_freq = pd.DataFrame(therapies_df.groupby(['therapy_type'])['therapy_name'].count()).sort_values(by=['therapy_name'], ascending=False)
type_freq

| therapy_type | therapy_name |
|---|---|
| Medicines and medical aids | 24 |
| Surgical procedures | 16 |
| Non-surgical procedures | 9 |
| Scans and X-rays | 7 |
| Biopsies | 3 |
| Dental treatments | 3 |
| Counselling and therapies | 2 |
| Routine tests and examinations | 2 |
| Blood tests | 1 |


#### **Create dictionaries and store them into a .json file**

We can finally create the dictionaries and store them in a `.json` file, as requested in the assignment.


In [25]:
# create the dict
RESULT_DICT = {"conditions":[], "therapies":[]}
# conditions
for idx,row in conditions_df.iterrows():
    RESULT_DICT["conditions"].append(
        {
            "id":row["condition_id"],
            "name":row["condition_name"],
            "type":row["condition_type"],
        }
    )
# therapies
for idx,row in therapies_df.iterrows():
    RESULT_DICT["therapies"].append(
        {
            "id":row["therapy_id"],
            "name":row["therapy_name"],
            "type":row["therapy_type"],
        }
    )

# save it as .json
with open('../data/data.json', 'w') as fp:
    json.dump(RESULT_DICT, fp, indent=4)
print("> JSON stored correctly")

> JSON stored correctly
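
The two loops above do the same column-to-key mapping, so they could share one helper. The sketch below shows this on two rows copied from the tables in this notebook. `to_records` and the variable names are hypothetical, and the row dictionaries stand in for what `conditions_df.to_dict("records")` would return. It also passes `ensure_ascii=False`, which is a choice made here and not in the original cell, so that names such as "Wilms’ tumour" are written as-is instead of as `\u2019` escapes.

```python
import json


def to_records(rows, id_key, name_key, type_key):
    """Map rows with prefixed column names to the common id/name/type schema."""
    return [
        {"id": row[id_key], "name": row[name_key], "type": row[type_key]}
        for row in rows
    ]


# stand-ins for df.to_dict("records"); values copied from the tables above
condition_rows = [
    {"condition_id": "cond_306", "condition_name": "Wilms’ tumour", "condition_type": "Cancer"},
]
therapy_rows = [
    {"therapy_id": "ther_64", "therapy_name": "Warfarin", "therapy_type": "Medicines and medical aids"},
]

result = {
    "conditions": to_records(condition_rows, "condition_id", "condition_name", "condition_type"),
    "therapies": to_records(therapy_rows, "therapy_id", "therapy_name", "therapy_type"),
}

# ensure_ascii=False keeps non-ASCII characters readable in the output
text = json.dumps(result, indent=4, ensure_ascii=False)
print(text)
```

When writing such text to disk, opening the file with `encoding="utf-8"` keeps the result independent of the platform's default encoding.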
