# Scraping the Wellesley News for Article Data

* Christine Pourheydarian
* January 20, 2021
* Tools used: this Jupyter notebook, Selenium, Python 3, Anaconda Navigator, WebDriver.
* Credits note: Much of this code is taken from a notebook shared by Francisca Moya Jimenez. 

### Selenium

Scraping will be done with Selenium. Selenium is mainly used for automated browser testing, but it is also useful for scraping data from websites with dynamic HTML.

Install Selenium:

In [1]:
!pip install selenium



In [2]:
import selenium 

The webdriver you use will depend on the browser you prefer. I downloaded Chrome's webdriver and stored it in a folder called "driver" that sits in the same folder as this Jupyter notebook, so that it would be easy to find and the PATH would be simple. (More information on Chrome's WebDriver can be found here: https://chromedriver.chromium.org/ )

In [3]:
from selenium import webdriver

# Customize this path based on the name of the webdriver you are using
# and on where you store it on your computer.
# Note: executable_path is a Selenium 3 argument; Selenium 4 deprecated it in favor of a Service object.
driver = webdriver.Chrome(executable_path='driver/chromedriver')

I noticed that the Wellesley News pages took a very long time to load. One useful strategy for faster scraping is to stop waiting for every resource to finish downloading before moving on.

In [4]:
from selenium.webdriver.chrome.options import Options

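An "eager" page load strategy means that Selenium waits only until the HTML document has been downloaded and parsed, without waiting for images, stylesheets, and other resources to finish loading. It is used here because the HTML content is all we need for getting the article data.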

In [5]:
options = Options()
options.page_load_strategy = 'eager' 

In [6]:
DRIVER_PATH = 'driver/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)

Since the website's pages take a long time to load, we will also use the time module's sleep function to pause our Python program whenever we want to give the website a bit of time to load.

In [7]:
from time import sleep
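
A fixed pause can be too short on a slow day and wastefully long on a fast one. As a rough sketch of an alternative, the hypothetical helper below (`wait_until` is not part of Selenium or of this notebook) polls a condition until it holds or a timeout expires. Selenium also has its own explicit-wait class, `WebDriverWait`, which works along the same lines.

```python
from time import monotonic, sleep


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)
```

With a live driver, it could be called as `wait_until(lambda: driver.find_elements_by_tag_name('article'))`, which returns as soon as at least one article element is on the page.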

### Making sure that the driver works and that you understand what it is

The driver is what you can use to scrape data from a website. Pretend the driver is a little character in a video game that you will control. You will tell the driver what to do, in order to gather all the data you need to gather. You will tell the driver what to do by using code as instructions.

The following line of code should send the browser window opened by the driver to the site: https://thewellesleynews.com. 
If you are using ChromeDriver, the window should say, "Chrome is being controlled by automated test software", which 
means that Chrome is being controlled by Selenium WebDriver in this case. 

In [8]:
driver.get('https://thewellesleynews.com')

In [9]:
dir(driver)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_file_detector',
 '_is_remote',
 '_mobile',
 '_switch_to',
 '_unwrap_value',
 '_web_element_cls',
 '_wrap_value',
 'add_cookie',
 'application_cache',
 'back',
 'capabilities',
 'close',
 'command_executor',
 'create_options',
 'create_web_element',
 'current_url',
 'current_window_handle',
 'delete_all_cookies',
 'delete_cookie',
 'desired_capabilities',
 'error_handler',
 'execute',
 'execute_async_script',
 'execute_cdp_cmd',
 'execute_script',
 'file_detector',
 'file_detector_context',
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 

In [10]:
driver.current_url

'https://thewellesleynews.com/'

In [11]:
driver.title

'The Wellesley News'

### Figuring out how to scrape the Wellesley News

#### Understanding the html

First, figure out what data you want to collect, then study the html to see how that data is stored and how to navigate to it. I used Chrome's Inspect tool.


I wanted to scrape all the articles from 2013 to 2020. For each article, I wanted to collect its title, contents/body, category, author, and publication date.

Initial observations on the html of the articles on The Wellesley News website:
* The articles are organized by year and pages within the year.
* The articles are contained in \<article> tags.
* Article titles have class "entry-title".
* Article categories have class "entry-category". There can be multiple categories.
* Article authors have class "author".
* An article's body is stored as paragraphs inside a div that has class "entry-content".
* An article's publication date is stored in a \<time> tag that has class "entry-date".
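
To make the mapping from class names to data concrete, here is a small offline sketch that uses only Python's standard library instead of Selenium. The markup in `SAMPLE_HTML` is invented for illustration and only imitates the class names listed above; the real pages are likely nested differently. `ArticleFieldParser` and `FIELDS` are likewise hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical markup that imitates the class names observed on the site.
SAMPLE_HTML = """
<article>
  <h2 class="entry-title">Sample headline</h2>
  <span class="entry-category">News, Features</span>
  <span class="author"><a href="#">Jane Doe</a></span>
  <time class="entry-date">2013-10-02</time>
  <div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

FIELDS = ("entry-title", "entry-category", "author", "entry-date", "entry-content")


class ArticleFieldParser(HTMLParser):
    """Collects the text found inside elements whose class is one of FIELDS."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELDS}
        # One entry per open tag: the field that tag opened, or None.
        # This simple stack assumes every tag in the input is explicitly closed.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(next((c for c in classes if c in FIELDS), None))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing field, if there is one.
        for field in reversed(self._stack):
            if field is not None:
                self.fields[field].append(text)
                break


parser = ArticleFieldParser()
parser.feed(SAMPLE_HTML)
print(parser.fields["entry-title"])    # ['Sample headline']
print(parser.fields["entry-content"])  # ['First paragraph.', 'Second paragraph.']
```

The Selenium helper functions later in this notebook do the same kind of lookup, but against the live page rather than a string.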

#### Create a game plan
Scrape all the articles from 2013 to 2020. For each article, get its title, contents/body, category, author, and publication date. Store all article data in a json file.  

### Functions

#### Helper functions 

In [21]:
# Helper functions 

def getNumPages():
    """Gets the total number of pages appearing on the bottom of the page"""
    sleep(2)
    # Assumes the fourth element with class 'page-numbers' holds the last page number
    return int(driver.find_elements_by_class_name('page-numbers')[3].text)

def getTitle(article):
    """Returns an article's title as a string"""
    return article.find_element_by_tag_name('h2').text

def getAuthors(article):
    """Returns a list with an article's authors"""
    allElem = []
    for elem in article.find_elements_by_class_name('author'):
        allElem.extend(elem.find_elements_by_tag_name('a'))
    return [elem.text for elem in allElem if elem.text != '']

def getCategories(article):
    """Returns a list with an article's categories"""
    return article.find_element_by_class_name('entry-category').text.split(', ')

def getLink(article):
    """Returns an article's link as a string"""
    return article.find_element_by_class_name('read-more-link').get_attribute('href')

def getText(article):
    """Returns an article's short text snippet as a string"""
    return article.find_element_by_class_name('entry-summary').text[:-11]

def getArticle(url):
    """Returns an article's contents as a string, given the article's url"""
    driver.get(url)
    sleep(2)
    content = driver.find_element_by_css_selector('.single-box.clearfix.entry-content')
    paragraphs = [p.text for p in content.find_elements_by_tag_name('p')]
    return ' '.join(paragraphs)

In [22]:
def getArticles(year):
    """Retrieves all articles in the Wellesley News website for a given year"""
    allArticles = []
    
    # Navigate to page
    driver.get('https://thewellesleynews.com/'+year+'/')
    sleep(2)
    pages = getNumPages()
        
    # Extract relevant article information
    def getInfo(article):
        """Scrapes the title, authors, categories, url, date and text for each article"""
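        url = getLink(article)
        # 'https://thewellesleynews.com/' is 29 characters long, so url[29:39]
        # is the 10 characters that follow it, used here as the date
        return {'title':getTitle(article), 'authors':getAuthors(article),
                'date':url[29:39], 'categories':getCategories(article),
                'url':url,'text':getText(article)}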
    
    # Visit and scrape pages
    for page in range(1,pages+1):
        if page == 1:
            for article in driver.find_elements_by_tag_name('article'):
                try:
                    allArticles.append(getInfo(article))
                except Exception:
                    # Close pop up window and run getInfo again
                    driver.find_element_by_id('close-icon').click()
                    allArticles.append(getInfo(article))
        else:
            driver.get('https://thewellesleynews.com/'+year+'/page/'+str(page)+'/')
            sleep(2)
            for article in driver.find_elements_by_tag_name('article'):
                allArticles.append(getInfo(article))
                       
    return allArticles
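
The `url[29:39]` slice in `getInfo` only works while every link starts with exactly `https://thewellesleynews.com/`. As a hedged alternative, the sketch below pulls the date out with a regular expression instead. It assumes article urls follow the pattern `.../YYYY/MM/DD/slug/`, which is what the slice implies; `date_from_url` and the example url are made up for illustration.

```python
import re
from datetime import date

# Assumption: the date appears in the url as /YYYY/MM/DD/
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def date_from_url(url):
    """Return the date embedded in an article url, or None if there is none."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Hypothetical url, shaped like the ones the slice expects.
example = "https://thewellesleynews.com/2013/10/02/example-article/"
print(example[29:39])           # 2013/10/02
print(date_from_url(example))   # 2013-10-02
```

Returning a `date` object rather than a raw string also means an impossible date in a url raises an error instead of passing through silently.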
            

In [23]:
articles = []
# Change the list below based on which years you want to collect data from.
# I collected data from 2013 to 2020:
# '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020'
for year in ['2013']:
    articles.extend(getArticles(year))

ValueError: invalid literal for int() with base 10: ''
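
The ValueError above comes from the `int()` call in `getNumPages`: the fourth `page-numbers` element had no text when it was read, so `int('')` failed. One possible, more forgiving approach is to look at all of the pagination labels and keep the largest number. This is only a sketch: `last_page_number` is a hypothetical helper, and it assumes that page numbers appear as plain digits and that a year without any numeric labels has a single page.

```python
def last_page_number(labels):
    """Return the largest page number among the given labels, or 1 if none is numeric."""
    numbers = []
    for label in labels:
        label = label.strip()
        # Skip empty strings and non-numeric labels such as 'Next'
        if label.isdecimal():
            numbers.append(int(label))
    return max(numbers, default=1)


print(last_page_number(['1', '2', '', '7', 'Next']))  # 7
```

With a live driver, the labels could come from `[e.text for e in driver.find_elements_by_class_name('page-numbers')]`.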

Checking that it got all of the correct article urls. 

In [76]:
articles[0]

IndexError: list index out of range

In [None]:
len(articles)

In [None]:
articleList = [] 

In [None]:
# Turn each scraped 2013 article into a dictionary that includes its full body text,
# and add it to a list that will be dumped into a json file once this has been done
# for all years from 2013 through 2020
for article in articles:
    data = {
        "title": article['title'],
        "authors": article['authors'],
        "date": article['date'],
        "categories": article['categories'],
        "url": article['url'],
        "body": getArticle(article['url']),
    }
    articleList.append(data)
          

Checking that all data of articles from the given year were added.
There are: 
* 171 articles from 2013
* 555 articles from 2014
* 467 articles from 2015
* 471 articles from 2016
* 398 articles from 2017
* 422 articles from 2018
* 431 articles from 2019
* 205 articles from 2020

so the length of the article list should be
171 + 555 + 467 + 471 + 398 + 422 + 431 + 205 = 3120, 
if data from all articles from 2013 to 2020 was collected. 
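
The same check can be written as code, which also shows which year is short if the total is off. This is a sketch rather than part of the original notebook: `EXPECTED_COUNTS`, `count_by_year`, and `find_mismatches` are hypothetical names, and the functions assume each record's `'date'` string starts with the four-digit year, as the `url[29:39]` slice produces.

```python
from collections import Counter

# Per-year article counts listed above
EXPECTED_COUNTS = {
    2013: 171, 2014: 555, 2015: 467, 2016: 471,
    2017: 398, 2018: 422, 2019: 431, 2020: 205,
}


def count_by_year(article_list):
    """Count records per year, reading the year from the first 4 characters of 'date'."""
    return Counter(int(article["date"][:4]) for article in article_list)


def find_mismatches(article_list, expected=EXPECTED_COUNTS):
    """Return {year: (expected, found)} for every year whose count differs."""
    found = count_by_year(article_list)
    return {
        year: (count, found.get(year, 0))
        for year, count in expected.items()
        if found.get(year, 0) != count
    }


print(sum(EXPECTED_COUNTS.values()))  # 3120
```

Calling `find_mismatches(articleList)` after scraping should return an empty dictionary when every year is complete.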

In [None]:
len(articleList) 

### Saving data

In [None]:
import json

In [None]:
with open('articleList.json', 'w') as outfile:
    json.dump(articleList, outfile)
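
After a long scrape, it can be reassuring to read the file back and confirm that nothing was lost on the way to disk. The sketch below does that round trip on a made-up record written to a temporary folder. `save_and_reload` is a hypothetical helper, and the `encoding`, `ensure_ascii`, and `indent` settings are my own choices, not something the original cell uses.

```python
import json
import os
import tempfile


def save_and_reload(records, path):
    """Write records to path as JSON, then read the file back and return its contents."""
    with open(path, "w", encoding="utf-8") as outfile:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(records, outfile, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


# Hypothetical record with the same keys as the dictionaries in articleList
sample = [{
    "title": "Example article",
    "authors": ["Jane Doe"],
    "date": "2013/10/02",
    "categories": ["News"],
    "url": "https://thewellesleynews.com/2013/10/02/example-article/",
    "body": "First paragraph. Second paragraph.",
}]

path = os.path.join(tempfile.mkdtemp(), "articleList.json")
reloaded = save_and_reload(sample, path)
print(reloaded == sample)  # True
```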

### Quitting the driver

In [None]:
driver.quit()