Webscrape: Green Party Website
- Author: Colin Macdonald (for RA2), June 2020

The objective is to scan the Green Party website for media posts and:
    1) return a user-defined number of posts (limited by count, by date, or by both)
    2) return a summary of each post:
        a) Date
        b) Header
        c) Content
        d) Link

In [1]:
# Import libraries (numpy, time, re, urlopen and HTMLParser are imported but not used below)
import requests
import numpy as np
import time
import pandas as pd
import re
from datetime import datetime
from urllib.request import urlopen
from bs4 import BeautifulSoup
from html.parser import HTMLParser

The next two sections request the Green Party Media Releases page and parse it with BeautifulSoup to find the total number of pages:
    - the pagination <ul> is searched for the "last page" <a href> tag, whose ?page= value is the index of the final page
    - page indexes start at 0, so:
        1) the total number of pages = last page index + 1
        2) this total is stored as last_page

In [2]:
# Request the Green Party Media Releases page and parse it
from requests import get
url = 'https://www.greenparty.ca/en/news/media-releases'
response = get(url)
print(response.text[:500])
media_soup = BeautifulSoup(response.text, 'html.parser')

<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="og: http://ogp.me/ns# content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
<head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta charset="utf-8">
  <meta name="viewport" content="width=devic


In [4]:
# Find the 'pager-last' link in the pagination list, read its page index, and add 1 to get the total page count
pages = int(media_soup.find('li',{'class':'pager-last'}).a['href'].replace('/en/news/media-releases?page=',''))
last_page = pages + 1
print(last_page)

137
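The page index can also be read from the link's query string rather than by string replacement. The following is a minimal sketch, assuming the href keeps the `?page=N` shape used above; the helper name `last_page_count` and the sample href are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def last_page_count(href):
    """Return the total number of pages implied by a 'pager-last' href.

    The notebook requests pages starting at ?page=0, so the count is the
    last page index plus one.
    """
    query = parse_qs(urlparse(href).query)
    return int(query['page'][0]) + 1

# Sample href in the same shape as the pager link read above (hypothetical value)
print(last_page_count('/en/news/media-releases?page=136'))  # prints 137
```

Parsing the query string keeps working if the path portion of the link changes, which the `.replace(...)` approach would not.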


This section scans the media post headings on every page in order to return:
    a) a count of the total number of blog posts
    b) a list of lists of all blog headers, which are then used to extract the URL of each individual article
    c) a flattened list of the above

In [5]:
# Runs through all pages to return a list of lists of media post headings
# & a running count of posts per page
media_posts_list = []
last_post = 0
page_index = 0

for page in range(0, last_page):
    response = get('https://www.greenparty.ca/en/news/media-releases?page={}'.format(page))
    post_soup = BeautifulSoup(response.text, 'html.parser')
    media_posts = post_soup.find_all('h3', class_='media-heading')
    media_posts_list.append(media_posts)
    last_post = len(media_posts) + last_post
    page_index = page_index + 1
    print('Posts Processed(Page/Page Posts/Total Posts): ', page_index,' ', len(media_posts),' ', last_post)

Posts Processed(Page/Page Posts/Total Posts):  1   10   10
Posts Processed(Page/Page Posts/Total Posts):  2   10   20
Posts Processed(Page/Page Posts/Total Posts):  3   10   30
Posts Processed(Page/Page Posts/Total Posts):  4   10   40
Posts Processed(Page/Page Posts/Total Posts):  5   10   50
Posts Processed(Page/Page Posts/Total Posts):  6   10   60
Posts Processed(Page/Page Posts/Total Posts):  7   10   70
Posts Processed(Page/Page Posts/Total Posts):  8   10   80
Posts Processed(Page/Page Posts/Total Posts):  9   10   90
Posts Processed(Page/Page Posts/Total Posts):  10   10   100
Posts Processed(Page/Page Posts/Total Posts):  11   10   110
Posts Processed(Page/Page Posts/Total Posts):  12   10   120
Posts Processed(Page/Page Posts/Total Posts):  13   10   130
Posts Processed(Page/Page Posts/Total Posts):  14   10   140
Posts Processed(Page/Page Posts/Total Posts):  15   10   150
Posts Processed(Page/Page Posts/Total Posts):  16   10   160
Posts Processed(Page/Page Posts/Total Post

In [6]:
media_posts_list_flat = [x for y in media_posts_list for x in y]
#print(media_posts_list_flat)
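The flattening step can be checked on a small stand-in list. This sketch compares the nested comprehension used above with `itertools.chain.from_iterable`, which does the same job; the `pages_of_posts` data is made up for illustration.

```python
from itertools import chain

# Stand-in for media_posts_list: one inner list per page (hypothetical data)
pages_of_posts = [['post-1', 'post-2'], ['post-3'], []]

# Same pattern as the notebook: outer loop over pages, inner loop over posts
flat_comprehension = [post for page in pages_of_posts for post in page]

# Equivalent flattening with the standard library
flat_chain = list(chain.from_iterable(pages_of_posts))

print(flat_chain)
```

Both forms preserve page order and skip empty pages, so either can feed the link-building step.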

The next three sections pull the date, header, content, and links. Since the Media Releases page does not contain these items directly within its tags, the workaround is to:
        a) concatenate a link for each media article from the base (static portion) of the URL + the article-specific URL suffix from the <a href> tag contained in the flat list above
        b) take the user input for a blog limit, a date limit, both, or neither
        c) iterate through each URL, parse it with BeautifulSoup, and extract the date, header, content, and URL into individual lists. Extra escape characters, leading and trailing spaces, non-vital information, etc. are cleaned in this step

In [7]:
# Build each article URL from the base + suffix
links_list = []
base_url = 'https://www.greenparty.ca'
for mp in media_posts_list_flat:
    suffix = mp.find('a').attrs['href']
    link = base_url + suffix
    links_list.append(link)
print(links_list)
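Plain concatenation works as long as every href is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` also copes with hrefs that are already absolute. The helper name `build_link` and the sample suffix below are hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://www.greenparty.ca'

def build_link(suffix):
    """Join the site base with an href taken from a heading's <a> tag."""
    return urljoin(base_url, suffix)

# Hypothetical suffix in the same shape as the article hrefs
print(build_link('/en/media-release/example-post'))
```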



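The user input is collected here: an optional blog limit and an optional date limit. Pressing enter skips a limit; the blog limit then defaults to the total number of posts found and the date limit defaults to 2011-03-03.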

In [16]:
user_blog_limit = input('Provide the number of blogs to retrieve. Hit enter to skip & scan all blogs: ')
if user_blog_limit == '':
    user_blog_limit = last_post
    
user_date_limit = input ('Provide the last date for blog retrieval(YYYY-MM-DD), hit enter to skip & scan all dates: ')
if user_date_limit == '':
    user_date_limit = '2011-03-03'
user_date_check = datetime.strptime(user_date_limit, "%Y-%m-%d")

Provide the number of blogs to retrieve. Hit enter to skip & scan all blogs: 50
Provide the last date for blog retrieval(YYYY-MM-DD), hit enter to skip & scan all dates: 
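The defaulting logic can be separated from `input()` so that it is easy to test. This is a sketch only: the function name `parse_limits` and its parameters are hypothetical, and the fallback date simply reuses the 2011-03-03 value from the cell above.

```python
from datetime import datetime

def parse_limits(count_text, date_text, total_posts, earliest='2011-03-03'):
    """Turn the two raw input strings into (blog_limit, date_limit).

    An empty count falls back to total_posts; an empty date falls back
    to the earliest date, so that no post is filtered out by date.
    """
    count_text = count_text.strip()
    date_text = date_text.strip()
    blog_limit = int(count_text) if count_text else total_posts
    date_limit = datetime.strptime(date_text or earliest, '%Y-%m-%d')
    return blog_limit, date_limit

# 100 is a placeholder for the running post count (last_post)
print(parse_limits('50', '', total_posts=100))
```

A malformed count or date still raises `ValueError`, as it would in the original cell.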


This section is where the data is pulled from the individual articles attached to each link. The loop checks the blog and date limits, iterates through the link list, uses BeautifulSoup to extract the date/header/content text (cleaning it a little, as mentioned above), and records the URL. The output is a series of item lists.

In [17]:
date_list = []
header_list = []
content_list = []
url_list = []

date_check = datetime.now()
user_blog_limit = int(user_blog_limit)
list_index = 0

        
        
# A single loop walks the link list. It stops at the blog limit, at the end of
# the list, or once the most recently processed post is no newer than the date limit.
while (list_index < min(user_blog_limit, len(links_list)) and user_date_check < date_check):
    url = links_list[list_index]
    response = get(url).text
    content_soup = BeautifulSoup(response, 'html.parser')

    # Date: also updates date_check, which is tested before the next post is fetched
    date = content_soup.find('div', class_='pane-content').get_text().replace('\n','').strip()
    date_check = datetime.strptime(date, '%B %d, %Y')
    date_list.append(date)

    header = content_soup.find('h1', class_='page-header title-container text-left visible').get_text().strip()
    header_list.append(header)

    # Content: strip line breaks, non-breaking spaces, known press-contact footers and stray dates
    content = content_soup.find('div', class_='field-body').get_text().strip().replace('\n','').replace('\xa0','')\
    .replace('/','').replace('# # #For more information or to arrange an interview:Rosie EmeryPress Secretary613-562-4916 ext, 204rosie.emery@greenparty.ca', '')\
    .replace('# # #For more information or to arrange an interview:Rod LeggettPress Secretary613-562-4916 ext, 204rod.leggett@greenparty.ca', '')\
    .replace('- 30 -For additional information or to arrange an interview, contact:Dan PalmerPress Secretary | Attaché de pressedan.palmer@greenparty.cam:(613) 614-4916', '')\
    .replace('-30-For more information or for interviews, please communicate withRobin Marty, attaché de presseParti Vert du Canada514.652.3669robin.marty@greenparty.caLisa Mintz-Sauvons la falaise- Amis du Parc Meadowbrook - Coalition Verte438.877.2470l-mintz@hotmail.com','')\
    .replace('# # #For more information or to arrange an interview:Rosie EmeryPress Secretary613-562-4916 ext.rosie.emery@greenparty.ca','')\
    .replace('June 17, 2020','').replace('May 27, 2020FOR IMMEDIATE RELEASE','').replace('â€',' ')
    content_list.append(content)

    url_list.append(url)

    list_index = list_index + 1

    print('Posts Processed: ',list_index,' ',date)
print(url_list)
print(header_list)
print(content_list)

Posts Processed:  1   June 22, 2020
Posts Processed:  2   June 19, 2020
Posts Processed:  3   June 17, 2020
Posts Processed:  4   June 16, 2020
Posts Processed:  5   June 12, 2020
Posts Processed:  6   June 11, 2020
Posts Processed:  7   June 11, 2020
Posts Processed:  8   June 10, 2020
Posts Processed:  9   June 04, 2020
Posts Processed:  10   June 04, 2020
Posts Processed:  11   June 03, 2020
Posts Processed:  12   June 02, 2020
Posts Processed:  13   May 29, 2020
Posts Processed:  14   May 29, 2020
Posts Processed:  15   May 28, 2020
Posts Processed:  16   May 27, 2020
Posts Processed:  17   May 26, 2020
Posts Processed:  18   May 21, 2020
Posts Processed:  19   May 13, 2020
Posts Processed:  20   May 13, 2020
Posts Processed:  21   May 11, 2020
Posts Processed:  22   May 11, 2020
Posts Processed:  23   May 07, 2020
Posts Processed:  24   May 07, 2020
Posts Processed:  25   May 05, 2020
Posts Processed:  26   April 30, 2020
Posts Processed:  27   April 28, 2020
Posts Processed:  28 
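The content cleaning above removes each known press-contact footer with its own exact `.replace(...)` call. A more general approach, shown here only as a sketch, is to cut the text at the first end-of-release marker. It assumes that `# # #` or `-30-` (with optional spaces) always begins the footer and never appears inside the body, which has not been checked against every release. The names `FOOTER_START` and `strip_footer` are hypothetical.

```python
import re

# End-of-release markers seen in the footers handled above: '# # #', '- 30 -', '-30-'
FOOTER_START = re.compile(r'(#\s*#\s*#|-\s*30\s*-)')

def strip_footer(text):
    """Drop everything from the first end-of-release marker onward."""
    match = FOOTER_START.search(text)
    cleaned = text[:match.start()] if match else text
    # Unlike the cell above, non-breaking spaces become ordinary spaces here
    return cleaned.replace('\xa0', ' ').strip()

# Hypothetical sample text in the shape of a scraped release
sample = 'The party welcomed the decision.\xa0# # #For more information or to arrange an interview: ...'
print(strip_footer(sample))
```

This avoids adding a new `.replace(...)` each time a press secretary or phone number changes, at the cost of trusting the markers.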

The next two sections:
    a) take the item lists and bring them into a pandas dataframe
    b) write the dataframe to csv

In [18]:
scrape_output = pd.DataFrame({'Date': date_list, 'Header': header_list, 'Content': content_list, 'Link': url_list})
print(scrape_output)

              Date  ...                                               Link
0    June 22, 2020  ...  https://www.greenparty.ca/en/media-release/202...
1    June 19, 2020  ...  https://www.greenparty.ca/en/media-release/202...
2    June 17, 2020  ...  https://www.greenparty.ca/en/media-release/202...
3    June 16, 2020  ...  https://www.greenparty.ca/en/media-release/202...
4    June 12, 2020  ...  https://www.greenparty.ca/en/media-release/202...
5    June 11, 2020  ...  https://www.greenparty.ca/en/media-release/202...
6    June 11, 2020  ...  https://www.greenparty.ca/en/media-release/202...
7    June 10, 2020  ...  https://www.greenparty.ca/en/media-release/202...
8    June 04, 2020  ...  https://www.greenparty.ca/en/media-release/202...
9    June 04, 2020  ...  https://www.greenparty.ca/en/media-release/202...
10   June 03, 2020  ...  https://www.greenparty.ca/en/media-release/202...
11   June 02, 2020  ...  https://www.greenparty.ca/en/media-release/202...
12    May 29, 2020  ...  

In [19]:
scrape_output.to_csv('greenpartywebscrape50.csv', encoding='utf-8')
print('Write Complete')

Write Complete
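Release text often contains commas and quotes, so it is worth confirming that a row survives a round trip through CSV. The sketch below uses the standard-library `csv` module on an in-memory buffer rather than pandas, purely to keep the check self-contained; the sample row is made up.

```python
import csv
import io

columns = ['Date', 'Header', 'Content', 'Link']

# Hypothetical row in the same shape as one line of scrape_output
rows = [{
    'Date': 'June 22, 2020',
    'Header': 'Sample header',
    'Content': 'Sample content, with a comma and a "quote"',
    'Link': 'https://www.greenparty.ca/en/media-release/sample',
}]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)

# Read the text back and confirm the content field is intact
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]['Content'])
```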
