# Phase 2
## Data Collection and Description (Draft)

*Effectively communicate what your dataset is about. The technical details of the dataset will be described in your datasheet (follow a template as in the examples in sections 3.1-3.5 (Motivation to Uses) of this article on datasheets). Think of this as the “origin story” of your dataset. You can write this in any style as long as it's easy to read as a Q&A. The datasheet will be graded on content, not style.*

###  Why was this dataset created?
Since last summer, and now remotely, I have been working at MassGeneral Hospital for Children (MGHfC) as a digital marketing and web development intern. MGHfC struggles with brand recognition in competition with other children's hospitals and healthcare facilities. In response, its digital marketing department has been publishing more patient education pages on the website about conditions treated at the clinics of MGH's children's hospital, to try to rank in Google and generate more site traffic and appointments.

Although these pages are recent, they are often beaten in Google rankings by those of the Mayo Clinic, the widely acclaimed healthcare company based in Minnesota. Whereas the Mayo Clinic ranks in Google for 2.5m keywords, MGHfC ranks for 90,813 (data from a previous analysis on https://moz.com/). To study which site structure and SEO tactics help the Mayo Clinic rank so often and so highly for its education pages, in this project I will analyze the contents of the Symptoms and Causes pages for its listed diseases and conditions and look for patterns that may be boosting their page rank.

The final dataset I will be working with is a self-contained conglomeration of data I scraped from the Mayo Clinic's site and data provided by an SEO analysis helper tool that I have access to through my internship position, Screaming Frog SEO Spider (indeed, that is the name of the application).

### What processes might have influenced what data was observed and recorded and what was not?
With the goals outlined above in mind, SEO Spider provides valuable information that I would not otherwise be able to gather: in this case, data about the kinds of links on the pages (incoming, outgoing, unique...). However, because of the way the Mayo Clinic's symptoms and causes URLs are structured, the data Spider provides also includes other pages listed under Diseases and Conditions, so I have to filter it down to the pages that I want.
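
As a rough illustration of that filtering step, the sketch below keeps only the rows whose address contains the `/symptoms-causes/` path segment. The `Address` column name comes from the Spider export; the three rows and their URL IDs are made up for the example, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for a Spider export; only the column name is real.
crawl = pd.DataFrame({
    "Address": [
        "https://www.mayoclinic.org/diseases-conditions/example-condition/symptoms-causes/syc-00000000",
        "https://www.mayoclinic.org/diseases-conditions/example-condition/diagnosis-treatment/drc-00000000",
        "https://www.mayoclinic.org/diseases-conditions/index?letter=E",
    ]
})

# Plain substring match (regex=False) on the path segment of interest.
is_symptoms_page = crawl["Address"].str.contains("/symptoms-causes/", regex=False)
symptoms_only = crawl[is_symptoms_page].reset_index(drop=True)

print(symptoms_only)
```

A substring match is the simplest option here; a stricter check could parse the URL path instead.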

I could have done the project with only the Spider data, but I wanted to get more practice at webscraping! I'm proud of the result, even if the merging of the datasets is a bit redundant. Full disclosure: I had hoped to scrape more data by hand than I got done for this draft of the datasheet, and I hope to use BeautifulSoup and XML parsing to do more of that for the final version.

I scraped sample data from the particular pages that I wanted and merged it with data from Spider for the same pages. The resulting number of pages is fewer than either original set, weeding out some pages that I did not want to include in the analysis. The final product should be the full population of Symptoms and Causes pages on the Mayo Clinic site.

### What preprocessing was done, and how did the data come to be in the form that you are using?
For the SEO Spider data, I crawled https://www.mayoclinic.org/diseases-conditions to a depth of 2 and downloaded a .csv file, which I then cleaned in this notebook so that the dataframe only includes the columns I want. I left out: address (the "ugly", non-canonical version of the final url); status code (I can access it myself, and it should be 200, so it is not interesting for data analysis); page title and headers, which are standard across Symptoms and Causes pages; % total (code to content ratio); and empty columns that I asked Spider not to gather data for or that Spider left blank.
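
One possible way to spot and drop the blank columns mentioned above is `DataFrame.dropna` with `how="all"`. This is only a sketch on a tiny invented frame (the values and the name `raw_example` are assumptions), not the cleaning actually done later in this notebook, which selects columns by name.

```python
import numpy as np
import pandas as pd

# Invented stand-in for an export where one column was left entirely blank.
raw_example = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/b"],
    "Word Count": [2005, 2083],
    "Meta Refresh 1": [np.nan, np.nan],  # no values at all
})

# Drop every column in which all values are missing.
non_empty = raw_example.dropna(axis="columns", how="all")

print(list(non_empty.columns))
```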

### What are the observations (rows) and the attributes (columns)?
**Observations:** Each row contains information about the Symptoms and Causes page for a disease or condition listed on the Mayo Clinic website.

**Attributes:** The page's url, its primary header (H1, also the name of the condition described), the word count of the page content, the size of the page in bytes, the meta description of the page (what shows up in the Google result), the inlinks (internal links pointing to a given URL from the same subdomain that is being crawled), unique inlinks (count several internal links from the same subdomain page as 1), outlinks (internal links to other URLs on the same subdomain), unique outlinks, external outlinks (links to another subdomain), and unique external outlinks.
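
To make the difference between inlinks and unique inlinks concrete, here is a small sketch based on my reading of the definitions above. The link pairs are hypothetical, and this is not how SEO Spider itself computes the counts.

```python
from collections import Counter

# Hypothetical (source page, target page) pairs found during a crawl.
links = [
    ("/page-a", "/target"),
    ("/page-a", "/target"),  # second link from the same source page
    ("/page-b", "/target"),
    ("/page-b", "/other"),
]

# Inlinks: every link pointing at a page counts.
inlinks = Counter(target for _, target in links)

# Unique inlinks: several links from the same source page count as 1.
unique_inlinks = Counter(target for _, target in set(links))

print(inlinks["/target"], unique_inlinks["/target"])
```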

### If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
People are tangentially involved: the data I will be analyzing describes how webpages were designed by someone, whether that's web designers working at the Mayo Clinic or at a company that created its website template, and the structure of the content of those pages, which likely have many different authors. Data about the people themselves will not be involved; only data about their work.

### Who funded the creation of the dataset?
The license for SEO Spider is funded by the MGH Marketing Department. My internship is unpaid. I am not completing this work for compensation besides my standing in this course, and the idea of applying this project to the area of my internship was entirely my own.

To my knowledge, no one working for the Mayo Clinic is aware of this project, and none of the data used is confidential, private, or upsetting.

This dataset has not yet been used for any tasks. It could potentially be used by other competitors of the Mayo Clinic or the Mayo Clinic itself to attempt to improve marketing strategies. This data should not be used to compromise the Mayo Clinic or any associated groups.

## Part 1: Web Scraping

In the following section, I scrape the Mayo Clinic's Symptoms and Causes pages under all of their indexed diseases and conditions.

In [2]:
# import the libraries
import sys
#!conda install --yes --prefix {sys.prefix} requests
import requests
from bs4 import BeautifulSoup
#!conda install --yes --prefix {sys.prefix} lxml
from string import ascii_uppercase as upp
import re # for regular expressions in part 3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

The first step is to get the long list of symptoms-causes URLs that are linked from the Mayo Clinic's indexed Diseases and Conditions lookup. I begin by saving the url up to and including the query for the letter, but not the letter itself. I also save the root of the URL for use later.

In [1]:
# get URLs: diseases-conditions pages organized in an alphabetical index, with one # entry
url = 'https://www.mayoclinic.org/diseases-conditions/index?letter='
root = 'https://www.mayoclinic.org'
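
For reference, the full set of index URLs can be sketched as below. It assumes, as in the function call further down, that the `#` entry is requested with the value `"0"`; the names `index_letters` and `index_urls` are mine.

```python
from string import ascii_uppercase

url = "https://www.mayoclinic.org/diseases-conditions/index?letter="

# "0" stands for the # entry; A-Z cover the rest of the index.
index_letters = ["0"] + list(ascii_uppercase)
index_urls = [url + letter for letter in index_letters]

print(len(index_urls))
print(index_urls[0])
```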

Next, I write a function to extract the URLs of the pages I want (from the URLs of the letter index pages!). Since I am doing a sort of double scraping, I built in a cap on the number of pages scraped per index letter; that cap is commented out in the version below, so every linked page is scraped.

In [20]:
def extract(letter, addresses):
    """Add the page URLs linked from one index letter to addresses and save each page."""
    index = url + letter  # url of index page for this letter
    get = requests.get(index).content
    soup = BeautifulSoup(get, "lxml")
    # get list of articles on index page
    within = soup.find_all(class_="index content-within")
    # optional cap on index items per letter (disabled)
    # if len(within) > 50:
    #     stop = 50
    # else:
    #     stop = len(within)
    for elm in within:
        # gets letter's article links as strings in a list
        # (raw string so that \s is read as a regex escape)
        links = re.findall(r'(?<=a\shref=").*?(?=">)', str(elm))
        # for each of the links
        for page, link in enumerate(links):
            full = root + link  # symptoms-causes URL
            if full not in addresses:
                addresses.append(full)
                content = requests.get(full)
                # save as mayoA-000 etc.
                with open("mayo{}-{:03d}.xml".format(letter, page), "w", encoding="utf-8") as writer:
                    writer.write(content.text)
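
As an alternative to the regular expression, the links could also be read directly from the parsed tree. The sketch below runs on invented markup that imitates one index block (the real page structure and the URL IDs may differ) and uses Python's built-in `html.parser` so that it works without `lxml`.

```python
from bs4 import BeautifulSoup

root = "https://www.mayoclinic.org"

# Invented markup imitating one index block; not copied from the real site.
sample_html = """
<div class="index content-within">
  <ul>
    <li><a href="/diseases-conditions/example-a/symptoms-causes/syc-00000001">Example A</a></li>
    <li><a href="/diseases-conditions/example-b/symptoms-causes/syc-00000002">Example B</a></li>
    <li><a href="/diseases-conditions/example-a/symptoms-causes/syc-00000001">Example A (see also)</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
found = []
for block in soup.find_all(class_="index content-within"):
    # href=True skips anchors that have no href attribute
    for anchor in block.find_all("a", href=True):
        full = root + anchor["href"]
        if full not in found:  # skip repeated links
            found.append(full)

print(found)
```

Reading `href` attributes from the tree avoids depending on the exact attribute order and quoting that the regular expression expects.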

I use the function to extract the URL from the # index page, the only index item that isn't listed under a capital letter of the alphabet. This helps me to ensure the function is behaving correctly without scraping too much.

In [21]:
addresses = []
extract("0", addresses)
addresses

['https://www.mayoclinic.org/diseases-conditions/digeorge-syndrome/symptoms-causes/syc-20353543']

Now that I know it works, I extract more URLs:

In [22]:
for letter in upp:
    extract(letter, addresses)

Now it's time to put these URLs into an initial dataframe that I will merge with the relevant Spider data.

In [23]:
mayo_data = pd.DataFrame(addresses, columns=["url"])
mayo_data.head()

| | url |
|---|---|
| 0 | https://www.mayoclinic.org/diseases-conditions... |
| 1 | https://www.mayoclinic.org/diseases-conditions... |
| 2 | https://www.mayoclinic.org/diseases-conditions... |
| 3 | https://www.mayoclinic.org/diseases-conditions... |
| 4 | https://www.mayoclinic.org/diseases-conditions... |


In [24]:
mayo_data.shape

(1183, 1)

## Part 2: SEO Spider Data

### Raw Source Data
https://drive.google.com/open?id=1vlTTVOf3L2TnJxRma4TyJPCsgKpVvM19

Below are all of the columns provided by SEO Spider, not all of which will be useful for my purposes.

In [25]:
raw = pd.read_csv("symptoms-causes.csv")
raw.columns

Index(['Address', 'Content', 'Status Code', 'Status', 'Indexability',
       'Indexability Status', 'Title 1', 'Title 1 Length',
       'Title 1 Pixel Width', 'Meta Description 1',
       'Meta Description 1 Length', 'Meta Description 1 Pixel Width',
       'Meta Keyword 1', 'Meta Keywords 1 Length', 'H1-1', 'H1-1 length',
       'H1-2', 'H1-2 length', 'H2-1', 'H2-1 length', 'H2-2', 'H2-2 length',
       'Meta Robots 1', 'X-Robots-Tag 1', 'Meta Refresh 1',
       'Canonical Link Element 1', 'rel="next" 1', 'rel="prev" 1',
       'HTTP rel="next" 1', 'HTTP rel="prev" 1', 'Size (bytes)', 'Word Count',
       'Text Ratio', 'Crawl Depth', 'Link Score', 'Inlinks', 'Unique Inlinks',
       '% of Total', 'Outlinks', 'Unique Outlinks', 'External Outlinks',
       'Unique External Outlinks', 'Hash', 'Response Time', 'Last Modified',
       'Redirect URL', 'Redirect Type', 'URL Encoded Address'],
      dtype='object')

I select which columns I want to keep:

In [26]:
# .copy() gives an independent dataframe, so later edits do not touch raw
trimmed = raw[['H1-1', 'URL Encoded Address', 'Word Count', 'Size (bytes)',
               'Meta Description 1', 'Inlinks', 'Unique Inlinks', 'Outlinks',
               'Unique Outlinks', 'External Outlinks',
               'Unique External Outlinks']].copy()
trimmed.head()

| | H1-1 | URL Encoded Address | Word Count | Size (bytes) | Meta Description 1 | Inlinks | Unique Inlinks | Outlinks | Unique Outlinks | External Outlinks | Unique External Outlinks |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Congenital heart disease in adults | https://www.mayoclinic.org/diseases-conditions... | 2005 | 60740 | Learn about treatments and complications of he... | 53 | 29 | 112 | 83 | 70 | 48 |
| 1 | Pulmonary fibrosis | https://www.mayoclinic.org/diseases-conditions... | 2083 | 54587 | Pulmonary fibrosis — Learn about the symptoms,... | 36 | 18 | 83 | 53 | 67 | 45 |
| 2 | Epilepsy | https://www.mayoclinic.org/diseases-conditions... | 2749 | 63278 | Learn about epilepsy symptoms, possible causes... | 40 | 21 | 90 | 60 | 73 | 51 |
| 3 | Cirrhosis | https://www.mayoclinic.org/diseases-conditions... | 2008 | 55146 | Cirrhosis is an advanced stage of scarring and... | 35 | 17 | 82 | 53 | 67 | 45 |
| 4 | Heart arrhythmia | https://www.mayoclinic.org/diseases-conditions... | 3397 | 68502 | Learn about common heart disorders that can ca... | 54 | 34 | 94 | 65 | 67 | 45 |


...and convert them into more coding-friendly formats:

In [27]:
# lowercase, spaces to underscores
new_colnames = [x.lower().replace(' ', '_') for x in trimmed.columns]

# work on a copy so that trimmed keeps its original column names
cleaned = trimmed.copy()
cleaned.columns = new_colnames

# replace individual column names that need modifying
cleaned = cleaned.rename(columns={
    'h1-1': 'header',
    'url_encoded_address': 'url',
    'size_(bytes)': 'bytes',
    'meta_description_1': 'meta',
})

# reformat headers to lowercase
cleaned["header"] = cleaned["header"].str.lower()

cleaned.head(3)

| | header | url | word_count | bytes | meta | inlinks | unique_inlinks | outlinks | unique_outlinks | external_outlinks | unique_external_outlinks |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | congenital heart disease in adults | https://www.mayoclinic.org/diseases-conditions... | 2005 | 60740 | Learn about treatments and complications of he... | 53 | 29 | 112 | 83 | 70 | 48 |
| 1 | pulmonary fibrosis | https://www.mayoclinic.org/diseases-conditions... | 2083 | 54587 | Pulmonary fibrosis — Learn about the symptoms,... | 36 | 18 | 83 | 53 | 67 | 45 |
| 2 | epilepsy | https://www.mayoclinic.org/diseases-conditions... | 2749 | 63278 | Learn about epilepsy symptoms, possible causes... | 40 | 21 | 90 | 60 | 73 | 51 |


In [28]:
cleaned.shape

(1199, 11)

## Merge Datasets

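Because this is an inner merge on `url`, only pages that appear in both datasets are kept. This pares the data down just a little bit: from 1,183 scraped URLs and 1,199 Spider rows to 1,145 merged rows.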

In [29]:
data = pd.merge(mayo_data, cleaned, on='url', how='inner')
data.head(3)

| | url | header | word_count | bytes | meta | inlinks | unique_inlinks | outlinks | unique_outlinks | external_outlinks | unique_external_outlinks |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.mayoclinic.org/diseases-conditions... | digeorge syndrome (22q11.2 deletion syndrome) | 2212 | 57561 | DiGeorge syndrome (22q11.2 deletion syndrome) ... | 13 | 7 | 70 | 44 | 67 | 45 |
| 1 | https://www.mayoclinic.org/diseases-conditions... | atrial fibrillation | 2732 | 68770 | Find out about atrial fibrillation, a heart co... | 31 | 19 | 100 | 71 | 75 | 52 |
| 2 | https://www.mayoclinic.org/diseases-conditions... | abdominal aortic aneurysm | 1530 | 48752 | An abdominal aortic aneurysm can grow slowly a... | 26 | 15 | 78 | 49 | 67 | 45 |


In [30]:
data.shape

(1145, 11)
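
To see which rows an inner merge like this drops, one option is an outer merge with `indicator=True`, which labels each row by the side it came from. The sketch below uses two tiny invented frames (`left` and `right` are hypothetical stand-ins for the scraped and Spider data), so the counts are illustrative only.

```python
import pandas as pd

# Invented stand-ins: scraped URLs on the left, Spider rows on the right.
left = pd.DataFrame({"url": ["u1", "u2", "u3"]})
right = pd.DataFrame({"url": ["u2", "u3", "u4"], "word_count": [100, 200, 300]})

# Outer merge keeps everything; the _merge column records each row's origin.
check = pd.merge(left, right, on="url", how="outer", indicator=True)
counts = check["_merge"].value_counts()

# Rows marked "both" are exactly the ones an inner merge would keep.
inner = check[check["_merge"] == "both"].drop(columns="_merge")

print(counts)
```

Rows labelled `left_only` or `right_only` are the ones that would be weeded out, which makes it easy to inspect them before deciding that the inner merge is appropriate.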