# Medium Stats Data Collection

The goal is to scrape baseline stats on stories from https://medium.com/me/stats in order to get a better understanding of how readers engage with a writer's work. Note that this is a personal project and is in no way associated with Medium. Some general advice and implementations are adapted and revised from the following resources:
* https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071
* https://realpython.com/python-web-scraping-practical-introduction/

In [51]:
# Imports
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

In [52]:
# Input Google login info to get into Medium
USER = ''
PASS = ''

## Logging in with Selenium

Since Medium's statistics page is behind a login, we'll have to automate the sign-in in order to scrape the overview stats. The easiest way to do this in Python is widely considered to be Selenium. Selenium (not the chemical element) is a browser automation tool that ships a Python package. With a little time spent getting up to speed and learning to read the HTML in Google Chrome's inspect view, you'll be able to automate most of what you do in a browser. I struggled to get through Medium's email sign-in because of a captcha, so I decided to use my Google login instead.

This will only work if you use your Google login to get into Medium, though it should be fairly easy to follow similar steps for Facebook, Twitter, etc. I import my username and password from a separate, private Python script. Feel free to do the same or just enter your information manually. Some code and resources that I drew from throughout this portion:
* https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/
* https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python
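
One way to keep credentials out of the notebook is to read them from environment variables. The following is a minimal sketch, not the approach used in this project: the helper `load_credentials` and the variable names `MEDIUM_GOOGLE_USER` and `MEDIUM_GOOGLE_PASS` are hypothetical.

```python
import os


def load_credentials(user_var="MEDIUM_GOOGLE_USER", pass_var="MEDIUM_GOOGLE_PASS"):
    """Return (user, password) read from environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    user = os.environ.get(user_var, "")
    password = os.environ.get(pass_var, "")
    if not user or not password:
        # Fail early rather than typing empty strings into the login form
        raise RuntimeError(f"Set {user_var} and {pass_var} before running the notebook")
    return user, password
```

With both variables exported in the shell, `USER, PASS = load_credentials()` could stand in for the manual assignment above.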

In [6]:
# Goes through splash and login process
def splash_process(driver, email, password):
    
    # Goes to sign in page
    driver.get('https://medium.com/m/signin')

    # Clicks sign in button
    driver.find_element_by_xpath(".//button[contains(.,'Sign in')]").click()

    # Clicks sign in with Google
    driver.find_element_by_xpath(".//button[contains(.,'Sign in with Google')]").click()

    # Finds email field
    email_field = driver.find_element_by_id("identifierId")

    # Types in email
    email_field.send_keys(email)

    # Clicks next button
    driver.find_element_by_id("identifierNext").click()

    # Wait a sec
    time.sleep(1)

    # Finds password field
    pass_field = driver.find_element_by_name("password")

    # Types in password
    pass_field.send_keys(password)

    # Click next button
    driver.find_element_by_id("passwordNext").click()
    
    # Wait a few seconds for the login to finish
    time.sleep(3)
    
    # Go to stats page
    driver.get('https://medium.com/me/stats')
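
The fixed `time.sleep` calls above can fail when a page loads more slowly than expected. Selenium ships `WebDriverWait` for explicit waits; the sketch below shows the same idea without depending on any library. The helper name `wait_for` and its parameters are hypothetical and not part of the original notebook.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5):
    """Call find() repeatedly until it succeeds or `timeout` seconds pass.

    `find` is any zero-argument callable that raises while the element is
    not yet available, for example:
        lambda: driver.find_element_by_name("password")
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except Exception:
            # Give up and re-raise the last error once the deadline passes
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)
```

Inside `splash_process`, a call such as `pass_field = wait_for(lambda: driver.find_element_by_name("password"))` might replace the sleep before the password lookup.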

In [7]:
# Scrolls to bottom to get all posts into view
def scroll(driver):

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(5)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
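
The loop above stops when the page height no longer changes. If a page kept growing, it would never exit, so a capped variant may be safer. This is a sketch under that assumption; `scroll_bounded`, `pause`, and `max_rounds` are hypothetical names, and `driver` only needs an `execute_script` method.

```python
import time


def scroll_bounded(driver, pause=5.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `max_rounds` caps the loop so
    a page that keeps growing cannot hang the notebook.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Scroll down to the bottom and give the page time to load more rows
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```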

In [8]:
# Start the driver
driver = webdriver.Chrome('../chromedriver')

In [9]:
# Log in!
splash_process(driver, USER, PASS)
scroll(driver)

## Wrangling HTML with BeautifulSoup

And just like that, we are in! Now we have to scrape the data from the raw HTML. There are a couple of notable ways to do this, but the most common in Python is the `BeautifulSoup` package, which lets us parse raw HTML and pull out the specific pieces we want. In my case, I'm looking for title, read time, publication, views, reads, ratio, and fans. Resources and threads of note for this section:
* https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
* https://stackoverflow.com/questions/14444732/how-to-split-a-html-page-to-multiple-pages-using-python-and-beautiful-soup

In [10]:
# Grab the main table html from Medium stats 
table = driver.find_element_by_class_name('js-statsTableBody')

In [11]:
# Get the raw html from our table element
raw_html = table.get_attribute('innerHTML')

In [12]:
# Preview html
print(raw_html[0:2000])

<tr class="sortableTable-row sortableTable-row--dateBucket"><td><span class="sortableTable-value">1530208043198</span><header class="heading heading--allCaps"><div class="u-clearfix"><div class="heading-content u-floatLeft"><span class="heading-title heading-title--bold">June 2018</span></div></div></header></td></tr><tr class="sortableTable-row js-statsTableRow" data-action="show-post-graph" data-action-value="78d11456019" data-timestamp="1530024573261"><td class="sortableTable-rowTitle"><span class="sortableTable-value">1530208043198</span><h2 class="sortableTable-title">An Ode to the Type A Data Scientist</h2><span class="sortableTable-text"><span class="u-xs-hide"><span class="readingTime" title="7 min read"></span><span class="middotDivider"></span><em>In</em> <a class="sortableTable-link" href="https://towardsdatascience.com">Towards Data Science</a><span class="middotDivider"></span></span><a class="sortableTable-link" href="https://towardsdatascience.com/ode-to-the-type-a-data-

In [13]:
# Quit our driver
driver.quit()

In [14]:
# Parse html
soup = BeautifulSoup(raw_html, 'html.parser')

In [39]:
# Story titles
titles = [item.text for item in soup.select('h2')]
print(titles[0])

An Ode to the Type A Data Scientist


In [16]:
# Reading times
read_times = [item.get('title') for item in soup.find_all('span', {'class': 'readingTime'})]
print(read_times[0])

7 min read


In [17]:
# Publication names
pubs = []
h2tags = soup.find_all('h2')
for h2tag in h2tags:
    elem = h2tag.next_sibling
    while elem and elem.name != 'h2':
        if elem.text.split('View story')[0] == '':
            pubs.append('None')
        else:
            pubs.append(elem.text.split('View story')[0][3::])
        elem = elem.next_sibling
print(pubs[0])

Towards Data Science
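
The string handling in the loop above can be pulled into a small helper, which makes the two cases easier to see. This sketch assumes the text after a title looks like `In <publication>View story...` when the story belongs to a publication and starts with `View story` otherwise; `publication_from_text` is a hypothetical name. Unlike the slice `[3::]`, it only strips the prefix when the prefix is actually present.

```python
def publication_from_text(row_text, marker="View story", prefix="In "):
    """Extract the publication name from the text that follows a story title.

    Returns the string "None" when no publication precedes the marker,
    matching the placeholder used in the notebook.
    """
    head = row_text.split(marker)[0]
    if head.startswith(prefix):
        return head[len(prefix):]
    return head or "None"
```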


In [18]:
# Get all numerical metrics
nums = [item.text for item in soup.find_all('span', {'class': 'sortableTable-value'}) if (len(item.text) < 13 or '.' in item.text)]

In [19]:
# Views
views = nums[::4]
print(views[0])

3224


In [20]:
# Reads
reads = nums[1::4]
print(reads[0])

699


In [21]:
# Read ratio
ratio = nums[2::4]
print(ratio[0])

21.681141439205955


In [22]:
# Fans
fans = nums[3::4]
print(fans[0])

72
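
The filter and the `[::4]` slices above rely on two things: the 13-digit timestamps are dropped, and every story contributes exactly four values in the order views, reads, ratio, fans. A sketch that groups the values per story and checks that assumption is shown below. `group_metrics` is a hypothetical helper, and the sample list is illustrative: the numbers come from the first two rows of this notebook's output, and the timestamp positions are assumed.

```python
def group_metrics(values, per_story=4):
    """Drop 13-digit timestamps, then group the rest into per-story tuples."""
    nums = [v for v in values if len(v) < 13 or "." in v]
    if len(nums) % per_story != 0:
        # A leftover value means the slices would drift out of alignment
        raise ValueError("metric count is not a multiple of per_story")
    return [tuple(nums[i:i + per_story]) for i in range(0, len(nums), per_story)]


raw = ["1530208043198", "3224", "699", "21.681141439205955", "72",
       "1530208043198", "283", "115", "40.63604240282685", "7"]
rows = group_metrics(raw)
views, reads, ratio, fans = rows[0]

# In this data the read ratio matches reads / views as a percentage
assert abs(int(reads) / int(views) * 100 - float(ratio)) < 1e-9
```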


## Creating a Pandas DataFrame

Now that we have a list for each feature we wanted, it gets much easier from here! We just have to combine these lists into a DataFrame, and then we can manipulate and analyze the collected information for whatever purpose we want.

In [23]:
# Create dataframe
df = pd.DataFrame(data={'Title': titles, 'Read Time': read_times, 'Publication': pubs, 'Views': views, 
                        'Reads': reads, 'Read Ratio': ratio, 'Fans': fans})
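
Building a DataFrame from a dict of lists only works when every list has the same length; pandas raises a `ValueError` otherwise. Since each list here was scraped separately, a quick check beforehand can show which one is off. A minimal sketch, with `check_equal_lengths` as a hypothetical helper:

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists do not all have the same length.

    Returns the shared length (0 when called with no columns).
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)
```

For example, `check_equal_lengths(titles=titles, read_times=read_times, pubs=pubs, views=views)` would report the length of each list by name if they disagree.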

In [24]:
# Reorder columns
df = df[['Title', 'Publication', 'Read Time', 'Views', 'Reads', 'Read Ratio', 'Fans']]

In [25]:
# Preview
df.head()

| | Title | Publication | Read Time | Views | Reads | Read Ratio | Fans |
|---|---|---|---|---|---|---|---|
| 0 | An Ode to the Type A Data Scientist | Towards Data Science | 7 min read | 3224 | 699 | 21.681141439205955 | 72 |
| 1 | Self Driven Data Science - Issue #52 | Hacker Noon | 3 min read | 283 | 115 | 40.63604240282685 | 7 |
| 2 | Choosing Your First Job: Size Matters | Hacker Noon | 7 min read | 391 | 144 | 36.82864450127877 | 19 |
| 3 | Python for Data Science and Machine Learning B... | Towards Data Science | 6 min read | 2175 | 795 | 36.55172413793103 | 48 |
| 4 | The Key to Optimal Internships | Hacker Noon | 5 min read | 197 | 80 | 40.60913705583757 | 9 |


In [26]:
# Overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 7 columns):
Title          76 non-null object
Publication    76 non-null object
Read Time      76 non-null object
Views          76 non-null object
Reads          76 non-null object
Read Ratio     76 non-null object
Fans           76 non-null object
dtypes: object(7)
memory usage: 4.2+ KB


## Data Cleaning


Now that we have our information neatly placed in a Pandas DataFrame, we can't ignore data cleaning. Not every dataset will be as squeaky clean as Kaggle may lead you to believe, and this is especially true of scraped data, where anomalies can come up from time to time. As `df.info()` shows above, every column is currently stored as `object`, so the numeric features need to be converted.

In [27]:
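# Convert numerical features to ints and floats
num_cols = ['Views', 'Reads', 'Read Ratio', 'Fans']
df[num_cols] = df[num_cols].apply(pd.to_numeric)
# Keep only the leading number of minutes from strings like '7 min read'
df['Read Time'] = df['Read Time'].apply(lambda x: int(x.split()[0]))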
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 7 columns):
Title          76 non-null object
Publication    76 non-null object
Read Time      76 non-null int64
Views          76 non-null int64
Reads          76 non-null int64
Read Ratio     76 non-null float64
Fans           76 non-null int64
dtypes: float64(1), int64(4), object(2)
memory usage: 4.2+ KB
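
The `Read Time` conversion above assumes every label looks like `7 min read`. If a row were missing its label, `x.split()[0]` would fail. The sketch below is a more defensive variant; `parse_read_time` is a hypothetical helper, and returning `None` for unparseable labels is a design choice, not something the original notebook does.

```python
def parse_read_time(label):
    """Turn a label such as '7 min read' into the integer 7.

    Returns None when the label is empty, missing, or does not start
    with a whole number.
    """
    parts = label.split() if label else []
    if not parts or not parts[0].isdigit():
        return None
    return int(parts[0])
```

Note that if any row did come back as `None`, pandas would store the column as floats with `NaN` rather than as `int64`.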


In [28]:
# Another look
df.tail()

| | Title | Publication | Read Time | Views | Reads | Read Ratio | Fans |
|---|---|---|---|---|---|---|---|
| 71 | The 5 Proven Best Ways to Start Your Day | The Ascent | 4 | 109 | 69 | 63.302752 | 10 |
| 72 | 5 Lessons From Going Idea to Prototype in 24 H... | UX Planet | 5 | 1454 | 616 | 42.365887 | 28 |
| 73 | Self Driven Data Science — Issue #5 | Towards Data Science | 2 | 253 | 146 | 57.70751 | 4 |
| 74 | A Beginners Guide to the World Within Data Sci... | Towards Data Science | 13 | 1478 | 334 | 22.598106 | 29 |
| 75 | How I Created My Personal Logo as an Amateur D... | UX Planet | 5 | 601 | 384 | 63.893511 | 9 |


In [29]:
# Numerical overview
df.describe()

| | Read Time | Views | Reads | Read Ratio | Fans |
|---|---|---|---|---|---|
| count | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 |
| mean | 4.052632 | 2975.539474 | 763.947368 | 46.335564 | 60.644737 |
| std | 1.952057 | 11670.325013 | 2304.829156 | 12.793347 | 226.094476 |
| min | 2.0 | 109.0 | 60.0 | 10.426731 | 1.0 |
| 25% | 3.0 | 344.5 | 146.75 | 40.43207 | 9.0 |
| 50% | 3.0 | 603.0 | 282.0 | 49.007654 | 14.0 |
| 75% | 5.0 | 1096.5 | 441.0 | 56.062458 | 27.25 |
| max | 13.0 | 85373.0 | 14945.0 | 64.351852 | 1447.0 |


## Export Data

Now that we have a clean dataset, let's export it as a CSV file. We'll use it for exploratory data analysis in the next portion of this project, where we'll look at the subtler aspects of engagement with my work and, later on, at a larger scale.

In [30]:
# Export as csv
df.to_csv('mystats.csv', index=False)

## Wrapping Up
That's all for the scraping process. Check out the follow-up notebook titled `Medium Stats Data Analysis` in this repo as well. There is also a script that largely automates this process if you're interested in skipping straight to the analysis. Thanks for reading! Follow me on Medium if interested!

https://medium.com/@conordewey3