# Data Collection

## Part 1: Web Scraping

In this section, I scrape the Mayo Clinic site for the URLs of the Symptoms and Causes pages of all of its indexed diseases and conditions.

In [1]:
# import the necessary libraries
import sys
import requests
from bs4 import BeautifulSoup
from string import ascii_uppercase as upp
import re
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

The first step is to collect the symptoms-causes URLs that are linked from the Mayo Clinic's Diseases and Conditions index. I begin by saving the index URL up to and including the letter query parameter, but not the letter itself. I also save the root of the URL for later use.

In [2]:
# get URLs: diseases-conditions pages organized in an alphabetical index, with one # entry
url = 'https://www.mayoclinic.org/diseases-conditions/index?letter='
root = 'https://www.mayoclinic.org'

Next, I write a function to extract the URLs of the pages I want from the letter index pages.

In [3]:
def extract(letter, addresses):
    """Append the symptoms-causes URLs listed on one letter's index page to addresses."""
    index = url + letter  # URL of the index page for this letter
    html = requests.get(index).content
    soup = BeautifulSoup(html, "lxml")
    # containers holding the list of articles on the index page
    within = soup.find_all(class_="index content-within")
    for elm in within:
        # relative links to this letter's articles, as a list of strings
        # (raw string so that \s is read as a regex escape, not a string escape)
        links = re.findall(r'(?<=a\shref=").*?(?=">)', str(elm))
        for link in links:
            full = root + link  # absolute symptoms-causes URL
            if full not in addresses:  # skip URLs that were already collected
                addresses.append(full)
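
The lookbehind in the regular expression above only matches anchors written exactly as a href="...">, so a link with another attribute before or after href would be skipped. As a hedged alternative, the sketch below collects href values with an HTML parser instead. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without a network connection; HrefCollector, collect_links, and the sample HTML are hypothetical names and data made up for illustration, not part of the original notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ROOT = "https://www.mayoclinic.org"  # same root as in the notebook


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html, seen):
    """Append each absolute link found in html to seen, skipping repeats."""
    parser = HrefCollector()
    parser.feed(html)
    for href in parser.hrefs:
        full = urljoin(ROOT, href)  # relative path -> absolute URL
        if full not in seen:
            seen.append(full)
    return seen


# made-up index fragment: the first link has an extra attribute before href
sample = """
<div class="index content-within">
  <a class="x" href="/diseases-conditions/example-a/symptoms-causes/syc-00000001">A</a>
  <a href="/diseases-conditions/example-b/symptoms-causes/syc-00000002">B</a>
  <a href="/diseases-conditions/example-a/symptoms-causes/syc-00000001">A again</a>
</div>
"""
links = collect_links(sample, [])
print(links)
```

The repeated link is stored once, and the link with the extra class attribute is still found.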

I first run the function on the # index page (requested with letter=0), the only index page that isn't listed under a capital letter of the alphabet. It holds a single entry, so it lets me check that the function behaves correctly without scraping too much.

In [4]:
addresses = []
extract("0", addresses)
addresses

['https://www.mayoclinic.org/diseases-conditions/digeorge-syndrome/symptoms-causes/syc-20353543']

Now that I know it works, I can extract the other URLs:

In [5]:
for letter in upp:
    extract(letter, addresses)

Now it's time to put these URLs into an initial dataframe that I will merge with the relevant Spider and Moz data.

In [6]:
mayo_data = pd.DataFrame(addresses, columns=["url"])
mayo_data.head()

Unnamed: 0,url
0,https://www.mayoclinic.org/diseases-conditions...
1,https://www.mayoclinic.org/diseases-conditions...
2,https://www.mayoclinic.org/diseases-conditions...
3,https://www.mayoclinic.org/diseases-conditions...
4,https://www.mayoclinic.org/diseases-conditions...


In [7]:
mayo_data.shape

(1181, 1)

## Part 2: SEO Spider Data

#### Raw Source Data: https://drive.google.com/open?id=1vlTTVOf3L2TnJxRma4TyJPCsgKpVvM19

Below are all of the columns provided by SEO Spider, not all of which will be useful for my purposes.

In [8]:
raw = pd.read_csv("symptoms-causes.csv")
raw.head(1)

Unnamed: 0,Address,Content,Status Code,Status,Indexability,Indexability Status,Title 1,Title 1 Length,Title 1 Pixel Width,Meta Description 1,...,Outlinks,Unique Outlinks,External Outlinks,Unique External Outlinks,Hash,Response Time,Last Modified,Redirect URL,Redirect Type,URL Encoded Address
0,https://www.mayoclinic.org/diseases-conditions...,text/html; charset=utf-8,200,OK,,,Congenital heart disease in adults - Symptoms ...,70,643,Learn about treatments and complications of he...,...,112,83,70,48,,,,,,https://www.mayoclinic.org/diseases-conditions...


In [9]:
raw.columns

Index(['Address', 'Content', 'Status Code', 'Status', 'Indexability',
       'Indexability Status', 'Title 1', 'Title 1 Length',
       'Title 1 Pixel Width', 'Meta Description 1',
       'Meta Description 1 Length', 'Meta Description 1 Pixel Width',
       'Meta Keyword 1', 'Meta Keywords 1 Length', 'H1-1', 'H1-1 length',
       'H1-2', 'H1-2 length', 'H2-1', 'H2-1 length', 'H2-2', 'H2-2 length',
       'Meta Robots 1', 'X-Robots-Tag 1', 'Meta Refresh 1',
       'Canonical Link Element 1', 'rel="next" 1', 'rel="prev" 1',
       'HTTP rel="next" 1', 'HTTP rel="prev" 1', 'Size (bytes)', 'Word Count',
       'Text Ratio', 'Crawl Depth', 'Link Score', 'Inlinks', 'Unique Inlinks',
       '% of Total', 'Outlinks', 'Unique Outlinks', 'External Outlinks',
       'Unique External Outlinks', 'Hash', 'Response Time', 'Last Modified',
       'Redirect URL', 'Redirect Type', 'URL Encoded Address'],
      dtype='object')

I select which columns I want to keep:

In [10]:
trimmed = raw[["URL Encoded Address", 'H1-1', 'H1-1 length', "Meta Description 1", 'Meta Description 1 Length', 'Size (bytes)', "Word Count", "Inlinks", "Unique Inlinks", 'Outlinks', 'Unique Outlinks', 'External Outlinks',
       'Unique External Outlinks']]
trimmed.head()

Unnamed: 0,URL Encoded Address,H1-1,H1-1 length,Meta Description 1,Meta Description 1 Length,Size (bytes),Word Count,Inlinks,Unique Inlinks,Outlinks,Unique Outlinks,External Outlinks,Unique External Outlinks
0,https://www.mayoclinic.org/diseases-conditions...,Congenital heart disease in adults,34,Learn about treatments and complications of he...,114,60740,2005,53,29,112,83,70,48
1,https://www.mayoclinic.org/diseases-conditions...,Pulmonary fibrosis,18,"Pulmonary fibrosis — Learn about the symptoms,...",154,54587,2083,36,18,83,53,67,45
2,https://www.mayoclinic.org/diseases-conditions...,Epilepsy,8,"Learn about epilepsy symptoms, possible causes...",125,63278,2749,40,21,90,60,73,51
3,https://www.mayoclinic.org/diseases-conditions...,Cirrhosis,9,Cirrhosis is an advanced stage of scarring and...,147,55146,2008,35,17,82,53,67,45
4,https://www.mayoclinic.org/diseases-conditions...,Heart arrhythmia,16,Learn about common heart disorders that can ca...,103,68502,3397,54,34,94,65,67,45


...and convert them into more coding-friendly formats:

In [11]:
# lowercase, spaces to underscores
new_colnames = [x.lower().replace(' ', '_') for x in trimmed.columns]

# work on a copy so the assignments below don't write into a slice of raw
cleaned = trimmed.copy()
cleaned.columns = new_colnames

# rename the individual columns that need modifying
cleaned = cleaned.rename(columns={
    'url_encoded_address': 'url',
    'h1-1': 'header',
    'h1-1_length': 'header_len',
    'meta_description_1': 'meta',
    'meta_description_1_length': 'meta_len',
    'size_(bytes)': 'bytes',
    'unique_inlinks': 'unique_in',
    'unique_outlinks': 'unique_out',
    'external_outlinks': 'ext_links',
    'unique_external_outlinks': 'unique_ext',
})

# lowercase the header text
cleaned["header"] = cleaned["header"].str.lower()

cleaned.head(3)

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext
0,https://www.mayoclinic.org/diseases-conditions...,congenital heart disease in adults,34,Learn about treatments and complications of he...,114,60740,2005,53,29,112,83,70,48
1,https://www.mayoclinic.org/diseases-conditions...,pulmonary fibrosis,18,"Pulmonary fibrosis — Learn about the symptoms,...",154,54587,2083,36,18,83,53,67,45
2,https://www.mayoclinic.org/diseases-conditions...,epilepsy,8,"Learn about epilepsy symptoms, possible causes...",125,63278,2749,40,21,90,60,73,51


In [12]:
cleaned.shape

(1199, 13)

## Part 3: Merge Datasets

In [13]:
# inner merge to get all URLs scraped from Mayo Clinic site that were included in the SEO Spider crawl
combo = pd.merge(mayo_data, cleaned, on='url', how='inner')
combo.head(3)

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext
0,https://www.mayoclinic.org/diseases-conditions...,digeorge syndrome (22q11.2 deletion syndrome),45,DiGeorge syndrome (22q11.2 deletion syndrome) ...,151,57561,2212,13,7,70,44,67,45
1,https://www.mayoclinic.org/diseases-conditions...,atrial fibrillation,19,"Find out about atrial fibrillation, a heart co...",152,68770,2732,31,19,100,71,75,52
2,https://www.mayoclinic.org/diseases-conditions...,abdominal aortic aneurysm,25,An abdominal aortic aneurysm can grow slowly a...,128,48752,1530,26,15,78,49,67,45


In [14]:
# lost 56 rows (1199 - 1143) that were accessed in the crawl but weren't scraped from the Mayo Clinic site
# and lost a further 38 URLs (1181 - 1143) that were scraped but weren't included in the crawl
combo.shape

(1143, 13)
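
One way to see where rows are lost is to run an outer merge with pandas' indicator option, which labels each row as left_only, right_only, or both. The sketch below is a minimal illustration on toy data; the scraped and crawled frames and their values are made up and only stand in for mayo_data and cleaned.

```python
import pandas as pd

# toy stand-ins for the scraped URL list and the crawl export
scraped = pd.DataFrame({"url": ["u1", "u2", "u3"]})
crawled = pd.DataFrame({"url": ["u2", "u3", "u4", "u5"],
                        "word_count": [10, 20, 30, 40]})

# the outer merge keeps every URL; the _merge column records which side it came from
audit = pd.merge(scraped, crawled, on="url", how="outer", indicator=True)
counts = audit["_merge"].value_counts()

only_scraped = int(counts["left_only"])   # scraped but not crawled
only_crawled = int(counts["right_only"])  # crawled but not scraped
matched = int(counts["both"])             # rows an inner merge would keep
print(only_scraped, only_crawled, matched)
```

Applied to the real frames, the both count should equal the row count of the inner merge, and the other two counts show how many URLs each source contributes on its own.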

In [15]:
# drop any duplicate URLs, keeping the first occurrence
combo = combo[~combo["url"].duplicated()]

In [16]:
# no duplicates
combo.shape

(1143, 13)

In [17]:
#check for any NaNs in the dataset
combo.isnull().values.any()

True

In [18]:
#check where
combo.isnull().any()

url           False
header        False
header_len    False
meta           True
meta_len      False
bytes         False
word_count    False
inlinks       False
unique_in     False
outlinks      False
unique_out    False
ext_links     False
unique_ext    False
dtype: bool

In [19]:
# check which it is
combo[combo["meta"].isnull()]

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext
1084,https://www.mayoclinic.org/diseases-conditions...,tapeworm infection,18,,0,47202,1731,15,6,71,45,67,45


In [20]:
# replace the null meta description with a single-space string (in place, without making a copy)
combo.loc[combo['meta'].isnull(), 'meta'] = " "

In [21]:
combo[combo["header"]=="tapeworm infection"]

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext
1084,https://www.mayoclinic.org/diseases-conditions...,tapeworm infection,18,,0,47202,1731,15,6,71,45,67,45


In [22]:
# make sure that was it
combo.isnull().values.any()

False

## Part 4: Moz Data
### Top 500 ranking pages on the Mayo Clinic Domain

In [230]:
moz = pd.read_csv("moz-top-pages.csv")
moz.head()

Unnamed: 0,URL,Title,Total Links,PA,Linking Domains to Page,HTTP Status Code,Outbound Domains from Page,Outbound Links from Page
0,www.mayoclinic.org/,\r\n Mayo Clinic - Mayo Clinic,1008399,74,34849,200.0,10,31
1,www.mayoclinic.org/healthy-lifestyle/nutrition...,\r\n\tWater: How much should you drink every d...,11319,68,3523,200.0,10,37
2,www.mayoclinic.org/healthy-lifestyle/fitness/i...,\r\n\tExercise: 7 benefits of regular physical...,7961,67,2779,200.0,10,37
3,www.mayoclinic.org/healthy-lifestyle/nutrition...,\r\n\tMediterranean diet for heart health - Ma...,8152,67,2675,200.0,10,37
4,www.mayoclinic.org/healthy-lifestyle/stress-ma...,\r\n\tStress symptoms: Effects on your body an...,5819,67,2365,200.0,10,36


In [231]:
# lowercase, spaces to underscores
colnames = [x.lower() for x in moz.columns]
colnames = [x.replace(' ', '_') for x in colnames]

# replace in original dataframe
moz.columns = colnames

In [232]:
moz.head()

Unnamed: 0,url,title,total_links,pa,linking_domains_to_page,http_status_code,outbound_domains_from_page,outbound_links_from_page
0,www.mayoclinic.org/,\r\n Mayo Clinic - Mayo Clinic,1008399,74,34849,200.0,10,31
1,www.mayoclinic.org/healthy-lifestyle/nutrition...,\r\n\tWater: How much should you drink every d...,11319,68,3523,200.0,10,37
2,www.mayoclinic.org/healthy-lifestyle/fitness/i...,\r\n\tExercise: 7 benefits of regular physical...,7961,67,2779,200.0,10,37
3,www.mayoclinic.org/healthy-lifestyle/nutrition...,\r\n\tMediterranean diet for heart health - Ma...,8152,67,2675,200.0,10,37
4,www.mayoclinic.org/healthy-lifestyle/stress-ma...,\r\n\tStress symptoms: Effects on your body an...,5819,67,2365,200.0,10,36


In [233]:
# make sure url format matches: prepend the scheme that the Moz export leaves out
moz["url"] = "https://" + moz["url"]

moz.head()

Unnamed: 0,url,title,total_links,pa,linking_domains_to_page,http_status_code,outbound_domains_from_page,outbound_links_from_page
0,https://www.mayoclinic.org/,\r\n Mayo Clinic - Mayo Clinic,1008399,74,34849,200.0,10,31
1,https://www.mayoclinic.org/healthy-lifestyle/n...,\r\n\tWater: How much should you drink every d...,11319,68,3523,200.0,10,37
2,https://www.mayoclinic.org/healthy-lifestyle/f...,\r\n\tExercise: 7 benefits of regular physical...,7961,67,2779,200.0,10,37
3,https://www.mayoclinic.org/healthy-lifestyle/n...,\r\n\tMediterranean diet for heart health - Ma...,8152,67,2675,200.0,10,37
4,https://www.mayoclinic.org/healthy-lifestyle/s...,\r\n\tStress symptoms: Effects on your body an...,5819,67,2365,200.0,10,36
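
Because the merges match on the exact URL string, small differences such as a missing scheme or an uppercase host name would silently drop rows. The helper below is a sketch of a slightly more defensive normalization using the standard library's urllib.parse; normalize_url and its default scheme are hypothetical and were not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw, scheme="https"):
    """Return raw with a scheme, a lowercase host, and no fragment."""
    raw = raw.strip()
    if "://" not in raw:
        raw = f"{scheme}://{raw}"  # the Moz export omits the scheme
    parts = urlsplit(raw)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


print(normalize_url("www.mayoclinic.org/"))
print(normalize_url("https://WWW.MayoClinic.org/a#top"))
```

Applying the same function to the URL column of every dataset before merging keeps the keys comparable, and running it twice gives the same result as running it once.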


In [234]:
# just want page authority from this dataset
moz = moz[['url','pa']]

In [253]:
data = pd.merge(combo, moz, on='url', how='left')
data.shape

(1149, 14)

In [254]:
# get rid of duplicate URLs (the left merge adds a row for each repeated URL in the Moz list)
data = data[~data["url"].duplicated()]
data.shape

(1143, 14)

In [255]:
data.pa.min()

60.0

In [256]:
data.pa.max()

67.0

In [257]:
len(data[data.pa.notnull()])

216

In [258]:
data.pa.var()

1.7860249784668392

Only 216 of the 500 top-ranking pages on the Mayo Clinic website (43%) are Symptoms and Causes pages in my dataset. Their Page Authority scores, out of 100, range only from 60 to 67, with a variance of about 1.8. This is a small sample with little variance, so I will use the Page Authority data from Moz in a different way. What I want to know is which pages in my dataset are in the top 500. Since the Page Authority scores of those pages are so close together, I will mark them as top-ranking with a 1 (True) and mark the rest of the pages in the dataset with a 0 (False). This way I lose very little information, and I keep my sample size much closer to the population size.

Ideally, I would have Moz's Page Authority score for every Mayo Clinic page I am interested in, but unfortunately Moz only provides the top 500 pages for the domain.
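
The idea of turning a sparse score into a membership flag can be sketched on toy data as follows. The frames pages and top and their values are made up for illustration. Removing duplicates from the top-pages list before the left merge is an assumption about how to keep the row count unchanged; it is not the order of operations used in the notebook.

```python
import pandas as pd

# toy stand-ins: all pages of interest, and a top-pages list with a repeated URL
pages = pd.DataFrame({"url": ["a", "b", "c"]})
top = pd.DataFrame({"url": ["b", "b", "z"], "pa": [67, 67, 60]})

# one row per URL on the right side, so the left merge cannot add rows
top_unique = top.drop_duplicates(subset="url")
merged = pages.merge(top_unique, on="url", how="left")

# a page is flagged 1 if it received a score from the top list, else 0
merged["pa"] = merged["pa"].notna().astype(int)
print(merged)
```

This sketch stores the flag as an integer; casting to float instead would reproduce the 0.0 and 1.0 values shown in the notebook's output.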

In [262]:
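# make page authority column into binary data: 1.0 if the page is in the Moz top 500, else 0.0
# (a single assignment avoids chained indexing, which may not write back to the dataframe)
data = data.assign(pa=data["pa"].notnull().astype(float))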

In [263]:
data.head()

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext,pa
0,https://www.mayoclinic.org/diseases-conditions...,digeorge syndrome (22q11.2 deletion syndrome),45,DiGeorge syndrome (22q11.2 deletion syndrome) ...,151,57561,2212,13,7,70,44,67,45,0.0
1,https://www.mayoclinic.org/diseases-conditions...,atrial fibrillation,19,"Find out about atrial fibrillation, a heart co...",152,68770,2732,31,19,100,71,75,52,1.0
2,https://www.mayoclinic.org/diseases-conditions...,abdominal aortic aneurysm,25,An abdominal aortic aneurysm can grow slowly a...,128,48752,1530,26,15,78,49,67,45,0.0
3,https://www.mayoclinic.org/diseases-conditions...,hyperhidrosis,13,"Learn more about causes, symptoms, treatment a...",153,45026,1385,16,11,75,46,67,45,0.0
4,https://www.mayoclinic.org/diseases-conditions...,bartholin's cyst,16,A Bartholin's cyst is a fluid-filled lump near...,126,43037,1202,12,6,70,44,67,45,0.0


In [265]:
data.columns

Index(['url', 'header', 'header_len', 'meta', 'meta_len', 'bytes',
       'word_count', 'inlinks', 'unique_in', 'outlinks', 'unique_out',
       'ext_links', 'unique_ext', 'pa'],
      dtype='object')

## Part 5: More Web Scraping (publication date)

In [267]:
dates = []

for page in data["url"]:
    content = requests.get(page).content # page content
    file = BeautifulSoup(content, "lxml") # parsed with lxml
    date = file.find("div", class_='pubdate') # element holding the publication date
    if date is not None:
        match = re.findall("(?<=\\r\\n).*?(?=\\r\\n)", str(date.get_text()))[0].strip() # get rid of \r\n and spaces
        dates.append(match) # add to column list
    else:
        dates.append(None) # no date found on this request

# add new column to dataframe
data['pub_date'] = dates
data.head()

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext,pa,pub_date
0,https://www.mayoclinic.org/diseases-conditions...,digeorge syndrome (22q11.2 deletion syndrome),45,DiGeorge syndrome (22q11.2 deletion syndrome) ...,151,57561,2212,13,7,70,44,67,45,0.0,"July 18, 2017"
1,https://www.mayoclinic.org/diseases-conditions...,atrial fibrillation,19,"Find out about atrial fibrillation, a heart co...",152,68770,2732,31,19,100,71,75,52,1.0,"June 20, 2019"
2,https://www.mayoclinic.org/diseases-conditions...,abdominal aortic aneurysm,25,An abdominal aortic aneurysm can grow slowly a...,128,48752,1530,26,15,78,49,67,45,0.0,"March 15, 2019"
3,https://www.mayoclinic.org/diseases-conditions...,hyperhidrosis,13,"Learn more about causes, symptoms, treatment a...",153,45026,1385,16,11,75,46,67,45,0.0,"Oct. 27, 2017"
4,https://www.mayoclinic.org/diseases-conditions...,bartholin's cyst,16,A Bartholin's cyst is a fluid-filled lump near...,126,43037,1202,12,6,70,44,67,45,0.0,"April 24, 2020"


In [268]:
# check if any didn't get scraped
data.isnull().values.any()
data.isnull().any()

url           False
header        False
header_len    False
meta          False
meta_len      False
bytes         False
word_count    False
inlinks       False
unique_in     False
outlinks      False
unique_out    False
ext_links     False
unique_ext    False
pa            False
pub_date       True
dtype: bool

In [269]:
# check how many
len(data.loc[data.pub_date.isnull()])

66

In [276]:
# get url(s) that don't have a publication date
nulls = data[data.pub_date.isnull()]

# request them again
# (the loop variable is page_url so it doesn't overwrite the global url used by extract)
for page_url in nulls["url"]:
    content = requests.get(page_url).content # page content
    file = BeautifulSoup(content, "lxml") # parsed with lxml
    date = file.find("div", class_='pubdate')
    if date is not None:
        match = re.findall("(?<=\\r\\n).*?(?=\\r\\n)", str(date.get_text()))[0].strip() # get rid of \r\n and spaces
        data.loc[data['url']==page_url, 'pub_date'] = match

# check again...
len(data.loc[data.pub_date.isnull()])

0
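
Re-requesting the pages that came back without a date can also be wrapped in a small retry helper. The sketch below is a hypothetical illustration: fetch_with_retries, its parameters, and the flaky stub are made up, and the fetch argument stands in for whatever function requests a page and returns the date string or None.

```python
import time


def fetch_with_retries(url, fetch, attempts=3, delay=1.0):
    """Call fetch(url) until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay * (attempt + 1))  # wait a little longer each time
    return None


# stub that fails twice and then succeeds, standing in for a real request
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    return "July 18, 2017" if calls["n"] >= 3 else None


found = fetch_with_retries("https://example.org/page", flaky, attempts=3, delay=0)
print(found, calls["n"])
```

Passing the fetching function as an argument keeps the retry logic testable without touching the network.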

In [271]:
# convert strings to comparable datetime objects
import datetime
# months that appear abbreviated with a period: Jan., Feb., Aug., Sept., Oct., Nov., Dec.
dat = []
for date in data["pub_date"]:
    if isinstance(date, str): # if it isn't null
        if "." in date:
            # special case: datetime recognizes "Sep" not "Sept"
            date = date.replace("Sept.", "Sep.")
            datetime_ob = datetime.datetime.strptime(date, '%b. %d, %Y')
        else:
            datetime_ob = datetime.datetime.strptime(date, '%B %d, %Y')
        dat.append(datetime_ob)
    else:
        dat.append(None)

data["pub_date"] = dat
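
The same conversion can be written as a standalone function that tries each format in turn, which makes it easy to check on a few strings before applying it to the whole column. This is a sketch: parse_pub_date is a hypothetical helper, and it assumes an English locale, since %b and %B match month names in the current locale.

```python
from datetime import datetime


def parse_pub_date(text):
    """Parse dates such as 'Sept. 5, 2019', 'Oct. 27, 2017' or 'July 18, 2017'."""
    if not isinstance(text, str):
        return None  # null values stay null
    cleaned = text.strip().replace("Sept.", "Sep.")  # strptime knows "Sep", not "Sept"
    for fmt in ("%b. %d, %Y", "%B %d, %Y"):  # abbreviated month first, then full name
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None  # unrecognized format


for example in ["Sept. 5, 2019", "Oct. 27, 2017", "July 18, 2017", "May 5, 2019", None]:
    print(example, "->", parse_pub_date(example))
```

Returning None for unrecognized strings, instead of raising an error, lets the later null check catch any page whose date was in an unexpected format.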

In [272]:
data.head(3)

Unnamed: 0,url,header,header_len,meta,meta_len,bytes,word_count,inlinks,unique_in,outlinks,unique_out,ext_links,unique_ext,pa,pub_date
0,https://www.mayoclinic.org/diseases-conditions...,digeorge syndrome (22q11.2 deletion syndrome),45,DiGeorge syndrome (22q11.2 deletion syndrome) ...,151,57561,2212,13,7,70,44,67,45,0.0,2017-07-18
1,https://www.mayoclinic.org/diseases-conditions...,atrial fibrillation,19,"Find out about atrial fibrillation, a heart co...",152,68770,2732,31,19,100,71,75,52,1.0,2019-06-20
2,https://www.mayoclinic.org/diseases-conditions...,abdominal aortic aneurysm,25,An abdominal aortic aneurysm can grow slowly a...,128,48752,1530,26,15,78,49,67,45,0.0,2019-03-15


In [273]:
data = data[data.pub_date.notnull()]

In [274]:
data.shape

(1082, 15)

## Part 6: Save dataset

In [275]:
data.to_csv('data.csv')
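
When the file is read back, the saved row index reappears as an extra unnamed column and the dates come back as plain strings. The sketch below shows one possible round trip that avoids both, using index=False when writing and parse_dates when reading. The toy frame and the in-memory buffer are assumptions made so that the example runs without touching the disk.

```python
import io

import pandas as pd

# toy frame with the same kinds of columns as the final dataset
frame = pd.DataFrame({
    "url": ["u1", "u2"],
    "pa": [0, 1],
    "pub_date": pd.to_datetime(["2017-07-18", "2019-06-20"]),
})

buffer = io.StringIO()             # stands in for a file on disk
frame.to_csv(buffer, index=False)  # don't write the row index as a column
buffer.seek(0)

# parse_dates turns the pub_date strings back into datetimes
restored = pd.read_csv(buffer, parse_dates=["pub_date"])
print(restored.dtypes)
```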