## **Web Scraping**

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup
from tqdm import tqdm

print("> Libraries Imported")

> Libraries Imported


## **Conditions**

#### **Obtain the main page**

First, we fetch the content of the index page *https://www.nhsinform.scot/illnesses-and-conditions/a-to-z*.

In [2]:
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
base_url = "https://www.nhsinform.scot/illnesses-and-conditions/a-to-z"

r = requests.get(base_url,headers=header)

# check the response code (it should be 200)
print(f"> Status Code: {r.status_code}")

# and obtain the content
soup = BeautifulSoup(r.text, 'html.parser')
print("> Site obtained")

> Status Code: 200
> Site obtained


Next, we extract the name and link of every condition listed on the page.

In [3]:
# the info is stored in the 'a' tags, with class 'nhs-uk__az-link'
links = soup.find_all('a',attrs={"class":"nhs-uk__az-link"})

# setup placeholders and variables
BASE_LINK = 'https://www.nhsinform.scot/'
name_list = []
href_list = []

# iterate over the obtained links
for link in links:

    # preprocess name
    clean_name = link.text.replace('\r','').replace('\n','').replace('\t','')
    
    # save it in the list
    name_list.append(clean_name)

    # obtain href
    href_list.append(BASE_LINK + link.get('href'))

# show results
print(f"> Total Conditions Found: {len(name_list)}")
for name, link in zip(name_list[0:5], href_list[0:5]):
    print(f" * '{name}' | Link: '{link}'")

> Total Conditions Found: 322
 * 'Abdominal aortic aneurysm' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/abdominal-aortic-aneurysm'
 * 'Acne' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acne'
 * 'Acute cholecystitis' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acute-cholecystitis'
 * 'Acute lymphoblastic leukaemia' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acute-lymphoblastic-leukaemia'
 * 'Acute lymphoblastic leukaemia: Children' | Link: 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/a/acute-lymphoblastic-leukaemia-children'
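
The links printed above contain a double slash after the domain, because `BASE_LINK` ends in `/` and each `href` starts with `/`. One way to avoid this, assuming the `href` values are site-relative paths as in the output above, is to build the URL with `urllib.parse.urljoin` from the standard library. The sketch below is minimal, and the helper names `build_url` and `clean_text` are hypothetical and not part of the notebook. Note that `clean_text` also trims surrounding spaces, which is slightly more than the chained `replace` calls do.

```python
from urllib.parse import urljoin

BASE_LINK = "https://www.nhsinform.scot/"


def build_url(href: str, base: str = BASE_LINK) -> str:
    """Join a site-relative href onto the base URL without doubling slashes."""
    return urljoin(base, href)


def clean_text(raw: str) -> str:
    """Collapse carriage returns, newlines, tabs and repeated spaces into single spaces."""
    return " ".join(raw.split())


print(build_url("/illnesses-and-conditions/a-to-z/a/acne"))
print(clean_text("\r\n\tAcute cholecystitis\r\n"))
```

With these helpers, the loop body could append `clean_text(link.text)` and `build_url(link.get('href'))` instead of concatenating strings.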


#### **Obtain the information about type**

In [4]:
# setup placeholder
types_list = []

for link in tqdm(href_list, desc="Scraping Sites"):
    
    # connect
    r = requests.get(link,headers=header)

    # check for status code
    if r.status_code == 200:

        try:
            # obtain content 
            soup = BeautifulSoup(r.text, 'html.parser')

            # obtain 'a' with class 'nhsuk-breadcrumb__link'
            links = soup.find_all('a',attrs={"class":"nhsuk-breadcrumb__link"})

            # save it in the list
            types_list.append(links[2].text)

        except Exception as e:
            types_list.append("ERROR")
            print(f"> Error while retrieving the page '{link}'.")
            print(f"  Error Details: {str(e)}")

    # if the page could not be reached, print a warning
    else:
        types_list.append("ERROR")
        print(f"> Error while retrieving the page '{link}'.")
        print(f"  Status Code: {r.status_code}")


Scraping Sites:  19%|█▉        | 61/322 [00:51<03:10,  1.37it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/c/chronic-fatigue-syndrome'.
  Status Code: 404


Scraping Sites:  28%|██▊       | 91/322 [01:15<03:09,  1.22it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/d/diabetes'.
  Error Details: list index out of range


Scraping Sites:  53%|█████▎    | 172/322 [02:16<01:23,  1.79it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/l/langerhans-cell-histiocytosis'.
  Status Code: 404


Scraping Sites:  58%|█████▊    | 188/322 [02:26<01:18,  1.70it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/m/malnutrition'.
  Status Code: 404


Scraping Sites:  59%|█████▉    | 191/322 [02:27<01:11,  1.84it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/m/menopause'.
  Status Code: 404


Scraping Sites:  66%|██████▌   | 211/322 [02:43<01:11,  1.54it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/n/norovirus'.
  Status Code: 404


Scraping Sites:  75%|███████▍  | 240/322 [03:01<00:51,  1.60it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/p/pregnancy-and-baby'.
  Error Details: list index out of range


Scraping Sites:  86%|████████▋ | 278/322 [03:27<00:28,  1.57it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/s/stress-anxiety-and-low-mood'.
  Error Details: list index out of range


Scraping Sites:  87%|████████▋ | 279/322 [03:27<00:27,  1.59it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/s/stroke'.
  Status Code: 404


Scraping Sites:  87%|████████▋ | 281/322 [03:28<00:22,  1.81it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/s/suicide'.
  Status Code: 404


Scraping Sites:  90%|█████████ | 291/322 [03:34<00:16,  1.90it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/t/thrush-in-men'.
  Status Code: 404


Scraping Sites:  93%|█████████▎| 298/322 [03:38<00:13,  1.79it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/t/transient-ischaemic-attack-tia'.
  Status Code: 404


Scraping Sites:  97%|█████████▋| 311/322 [03:46<00:05,  1.89it/s]

> Error while retrieving the page 'https://www.nhsinform.scot//illnesses-and-conditions/a-to-z/v/vaginal-thrush'.
  Status Code: 404


Scraping Sites: 100%|██████████| 322/322 [03:54<00:00,  1.37it/s]
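
Three of the failures above are `list index out of range` errors, which occur when a page has fewer than three breadcrumb links, so `links[2]` does not exist. A minimal sketch of a guard follows. It assumes the breadcrumb texts have already been collected into a list, for example with `[a.text for a in links]`. The helper `pick_type` and the sample trails are hypothetical and not part of the notebook.

```python
from typing import Optional, Sequence


def pick_type(breadcrumbs: Sequence[str], position: int = 2) -> Optional[str]:
    """Return the breadcrumb at `position`, or None if the trail is too short."""
    if len(breadcrumbs) <= position:
        return None
    return breadcrumbs[position].strip()


# hypothetical breadcrumb trails, for illustration only
full_trail = ["Home", "Illnesses and conditions", "Cancer"]
short_trail = ["Home", "Illnesses and conditions"]

print(pick_type(full_trail))
print(pick_type(short_trail))
```

Returning `None` rather than the string `"ERROR"` would let the failed rows be dropped later with `dropna`.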


#### **Create the complete dataset of conditions**

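We combine the scraped names and types into a single DataFrame, drop the conditions whose type could not be retrieved, and assign each remaining condition an id.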

In [5]:
# create the df from lists
conditions_df = pd.DataFrame(
    data=zip(name_list, types_list),
    columns=["condition_name","condition_type"]
)

# remove error types
print(f"> Number of Conditions         {conditions_df.shape[0]}")
conditions_df = conditions_df.loc[conditions_df["condition_type"] != "ERROR"].reset_index(drop=True)
print(f"> Number of Conditions (clean) {conditions_df.shape[0]}")

# add column with conditions ids
conditions_ids = [f"cond_{n}" for n in range(len(conditions_df))]
conditions_df["condition_id"] = conditions_ids

# reorder columns
conditions_df = conditions_df[["condition_id", "condition_name", "condition_type"]]

# show results
conditions_df

> Number of Conditions         322
> Number of Conditions (clean) 309


| | condition_id | condition_name | condition_type |
|---|---|---|---|
| 0 | cond_0 | Abdominal aortic aneurysm | Heart and blood vessels |
| 1 | cond_1 | Acne | Skin, hair and nails |
| 2 | cond_2 | Acute cholecystitis | Stomach, liver and gastrointestinal tract |
| 3 | cond_3 | Acute lymphoblastic leukaemia | Cancer |
| 4 | cond_4 | Acute lymphoblastic leukaemia: Children | Cancer |
| ... | ... | ... | ... |
| 304 | cond_304 | Warts and verrucas | Skin, hair and nails |
| 305 | cond_305 | Whooping cough | Infections and poisoning |
| 306 | cond_306 | Wilms’ tumour | Cancer |
| 307 | cond_307 | Womb (uterus) cancer | Cancer |


*How many different types do we have?*

In [6]:
types_set = conditions_df[["condition_type"]].drop_duplicates().reset_index(drop=True)
types_set

| | condition_type |
|---|---|
| 0 | Heart and blood vessels |
| 1 | Skin, hair and nails |
| 2 | Stomach, liver and gastrointestinal tract |
| 3 | Cancer |
| 4 | Glands |
| 5 | Ears, nose and throat |
| 6 | Immune system |
| 7 | Brain, nerves and spinal cord |
| 8 | Muscle, bone and joints |
| 9 | Mental health |
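
The question above asks for a count, while the cell lists the distinct values. A small sketch of getting the number directly with `Series.nunique` follows. It uses a three-row sample (`sample_df`, a hypothetical stand-in for `conditions_df`) rather than the full scraped data.

```python
import pandas as pd

# small sample standing in for conditions_df
sample_df = pd.DataFrame(
    {
        "condition_name": ["Acne", "Warts and verrucas", "Whooping cough"],
        "condition_type": [
            "Skin, hair and nails",
            "Skin, hair and nails",
            "Infections and poisoning",
        ],
    }
)

# number of distinct condition types in the sample
n_types = sample_df["condition_type"].nunique()
print(f"> Number of distinct types: {n_types}")
```

On the real data the same call would be `conditions_df["condition_type"].nunique()`.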


In [7]:
type_freq = pd.DataFrame(
    conditions_df.groupby(['condition_type'])['condition_name'].count()
).sort_values(by=['condition_name'], ascending=False)
type_freq

| condition_type | condition_name |
|---|---|
| Cancer | 73 |
| Stomach, liver and gastrointestinal tract | 33 |
| Infections and poisoning | 27 |
| Skin, hair and nails | 22 |
| Brain, nerves and spinal cord | 18 |
| Muscle, bone and joints | 16 |
| Sexual and reproductive | 16 |
| Mental health | 16 |
| Ears, nose and throat | 15 |
| Lungs and airways | 14 |


#### **Create a dictionary**

We convert the DataFrame into a dictionary and store it as a .json file.

In [8]:
# create the dict
RESULT_DICT = {"conditions":[]}
for idx,row in conditions_df.iterrows():
    RESULT_DICT["conditions"].append(
        {
            "id":row["condition_id"],
            "name":row["condition_name"],
            "type":row["condition_type"],
        }
    )

# save it as .json
with open('conditions.json', 'w') as fp:
    json.dump(RESULT_DICT, fp, indent=4)
print("> JSON stored correctly")

> JSON stored correctly
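
Some names contain non-ASCII characters (for example the apostrophe in "Wilms’ tumour"), which `json.dump` writes as `\uXXXX` escape sequences by default. If a human-readable file is preferred, one option is to write UTF-8 with `ensure_ascii=False`. The minimal sketch below does that and reads the file back to check it. The temporary path and the one-entry dictionary are illustrative only.

```python
import json
import os
import tempfile

# one-entry sample in the same shape as RESULT_DICT
sample = {
    "conditions": [
        {"id": "cond_306", "name": "Wilms’ tumour", "type": "Cancer"}
    ]
}

# write to a temporary file so the real conditions.json is not touched
path = os.path.join(tempfile.mkdtemp(), "conditions_sample.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(sample, fp, indent=4, ensure_ascii=False)

# read the file back and confirm that nothing was lost
with open(path, encoding="utf-8") as fp:
    loaded = json.load(fp)

print(loaded == sample)
```

Either setting produces a file that `json.load` reads back to the same dictionary. The difference is only in how the characters appear on disk.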


---
## **Therapies**

In [9]:
url_ther = "https://en.wikipedia.org/wiki/List_of_therapies"

r = requests.get(url_ther,headers=header)

# check the response code (it should be 200)
print(f"> Status Code: {r.status_code}")

# and obtain the content
soup = BeautifulSoup(r.text, 'html.parser')
print("> Site obtained")

> Status Code: 200
> Site obtained


In [12]:
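# the therapy links are stored in the 'a' tags with class 'mw-redirect'
links = soup.find_all('a',attrs={"class":"mw-redirect"})

links

# find_all() returns a ResultSet (a list of tags), so .text has to be read
# from each element inside a loop; reading it from the whole set fails
clean_name = link.text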

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [14]:
# the info is stored in the 'a' tags, with class 'mw-redirect'
links = soup.find_all('a',attrs={"class":"mw-redirect"})

# setup placeholders and variables
BASE_LINK = 'https://en.wikipedia.org'
name_list = []
href_list = []

# iterate over the obtained links
for link in links:

    # preprocess name
    clean_name = link.text.replace('\r','').replace('\n','').replace('\t','')
    
    # save it in the list
    name_list.append(clean_name)

    # obtain href
    href_list.append(BASE_LINK + link.get('href'))

# show results
print(f"> Total Therapies Found: {len(name_list)}")
for name, link in zip(name_list[0:5], href_list[0:5]):
    print(f" * '{name}' | Link: '{link}'")

> Total Therapies Found: 51
 * 'abortive therapy' | Link: 'https://en.wikipedia.org/wiki/Abortive_therapy'
 * 'antibody therapy' | Link: 'https://en.wikipedia.org/wiki/Antibody_therapy_(disambiguation)'
 * 'aquarium therapy' | Link: 'https://en.wikipedia.org/wiki/Aquarium_therapy'
 * 'aurotherapy' | Link: 'https://en.wikipedia.org/wiki/Chrysotherapy'
 * 'cell transfer therapy' | Link: 'https://en.wikipedia.org/wiki/Cell_transfer_therapy'
