# Information

In [1]:
#title                :DC Council Calendar Web Scrape (bs4)
#description          :Scrapes a web page of calendar items listed in a table, places them in a pandas DataFrame,
#                      and exports the DataFrame to a .csv file.
#author               :alisonthaung
#date created         :2017-07-10
#date last modified   :2017-07-25
#python_version       :'3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09)
#                      [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]'
#operating system     :MacOS Sierra 10.12.5
#==============================================================================

In [2]:
import sys
sys.version

'3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]'

# Import libraries

BeautifulSoup to parse the webpage

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Selenium to interact with webpage and select date

In [4]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Specify URL from which to scrape, create browser object

In [5]:
url = 'http://dccouncil.us/calendar'
browser = webdriver.Firefox()
browser.get(url)

wait = WebDriverWait(browser, 30)

# Set Date

By default, the page loads with the current date, and the calendar sometimes shows no entries for the current date or week. For this initial run, I set the date to June 1, 2017 to capture historical data up to the current date.

Note: when deploying this as a script to run daily, the code in this section can be deleted and the soup object created directly from the URL instead.

In [6]:
date = '2017-06-01'
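# find_element_by_* was removed in Selenium 4; find_element(By..., ...) works in old and new versions
# Wait for the date box to be present before typing into it
date_box = wait.until(EC.presence_of_element_located((By.ID, 'cal-date-select-dev')))
date_box.clear()
date_box.send_keys(date)
browser.find_element(By.XPATH, '//td//input[contains(@class, "site-button-small-dev submit")]').click()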

Grab the underlying page source of the dynamically filtered results. (The URL alone can't be used here because it doesn't point to the filtered data.)

In [7]:
source = browser.page_source
soup = BeautifulSoup(source, 'lxml')

Below is the code to run instead of the previous three code blocks once the scraper is in production and historical data is no longer needed.

In [8]:
# url = 'http://dccouncil.us/calendar'
# r = requests.get(url)
# soup = BeautifulSoup(r.text, 'lxml')

# Review underlying html structure of webpage

In [9]:
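# Without parentheses, this displays the bound method's repr, which includes the full document.
# Call print(soup.prettify()) instead to get indented output.
soup.prettify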

<bound method Tag.prettify of <html class=" js flexbox rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent" lang="en" style="height: 100%;"><!--<![endif]--><head>
<meta charset="utf-8"/>
<meta content="eSt4OvXWV9h88hl71sSKm4UBBRk9EBAy2R8__sAkCFc" name="google-site-verification"/>
<meta content="/files/site/images/fb_logo.jpeg" property="og:image"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<title>Council of the District of Columbia</title>
<link href="/files/site/assets/favicon.ico" rel="shortcut icon"/>
<link href="http://dccouncil.us/?css=styles/index.v.1444061289" rel="stylesheet"/>
<link href="//fonts.googleapis.com/css?family=Open+Sans:400,600,700" rel="stylesheet" type="text/css"/>
<link href="http://dccouncil.us/news/rss" rel="alternate" title="RSS" type="application/rss+xml"/>
<script async="" src="//c

# Create lists of data for columns

The date, time, and location of each meeting are listed in a single <td>, inside a <div> with the class "event-description-dev-metabox". Each item sits in its own successive <p> tag. The following code extracts the text of the first three <p> tags in each <div>.

In [10]:
# Find the metabox <div>s once and reuse them for all three columns
metaboxes = soup.find_all('div', {'class': 'event-description-dev-metabox'})

# find_next() is the bs4 name; findNext() is a legacy alias for the same method
date = [div.find('p').text for div in metaboxes]
time = [div.find('p').find_next('p').text for div in metaboxes]
location = [div.find('p').find_next('p').find_next('p').text for div in metaboxes]
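
To see how the successive `<p>` tags map onto columns, here is a minimal sketch on a hand-written snippet. The sample HTML is an assumption that imitates the structure described above, not a copy of the live page, and `html.parser` is used only so the snippet runs without lxml. It also takes a slightly different route from the cell above: it collects the first three `<p>` tags per metabox and transposes them, rather than chaining `find_next` calls.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two calendar entries; the real page may differ.
sample_html = """
<div class="event-description-dev-metabox">
  <p>Thursday, 6/1/2017</p>
  <p>9:30am</p>
  <p>Room 412</p>
</div>
<div class="event-description-dev-metabox">
  <p>Thursday, 6/1/2017</p>
  <p>2:00pm</p>
  <p>Room 123</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_boxes = sample_soup.find_all('div', class_='event-description-dev-metabox')

# Collect the text of the first three <p> tags in each metabox (one row per
# meeting), then transpose the rows into one tuple per column.
rows = [[p.text for p in box.find_all('p')[:3]] for box in sample_boxes]
sample_dates, sample_times, sample_locations = zip(*rows)
```

The `sample_` names are chosen so that running this in the notebook does not overwrite `date`, `time`, or `location`.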

In [11]:
date

['Thursday, 6/1/2017',
 'Thursday, 6/1/2017',
 'Monday, 6/5/2017',
 'Monday, 6/5/2017',
 'Monday, 6/5/2017',
 'Tuesday, 6/6/2017',
 'Tuesday, 6/6/2017',
 'Wednesday, 6/7/2017',
 'Thursday, 6/8/2017',
 'Friday, 6/9/2017',
 'Tuesday, 6/13/2017',
 'Tuesday, 6/13/2017',
 'Tuesday, 6/13/2017',
 'Tuesday, 6/13/2017',
 'Wednesday, 6/14/2017',
 'Wednesday, 6/14/2017',
 'Wednesday, 6/14/2017',
 'Wednesday, 6/14/2017',
 'Thursday, 6/15/2017',
 'Thursday, 6/15/2017',
 'Friday, 6/16/2017',
 'Monday, 6/19/2017',
 'Monday, 6/19/2017',
 'Tuesday, 6/20/2017',
 'Wednesday, 6/21/2017',
 'Wednesday, 6/21/2017',
 'Wednesday, 6/21/2017',
 'Wednesday, 6/21/2017',
 'Thursday, 6/22/2017',
 'Thursday, 6/22/2017',
 'Friday, 6/23/2017',
 'Monday, 6/26/2017',
 'Monday, 6/26/2017',
 'Tuesday, 6/27/2017',
 'Tuesday, 6/27/2017',
 'Tuesday, 6/27/2017',
 'Wednesday, 6/28/2017',
 'Wednesday, 6/28/2017',
 'Thursday, 6/29/2017',
 'Thursday, 6/29/2017',
 'Friday, 6/30/2017',
 'Wednesday, 7/5/2017',
 'Wednesday, 7/5/2017',

In [12]:
time

['9:30am',
 '2:00pm',
 '10:30am',
 '10:30am',
 '11:00am',
 '10:00am',
 '11:00am',
 '10:00am',
 '9:30am',
 '1:00pm',
 '10:00am',
 '1:00pm',
 '1:30pm',
 '3:00pm',
 '10:00am',
 '10:00am',
 '1:00pm',
 '2:30pm',
 '10:00am',
 '11:00am',
 '11:00am',
 '11:00am',
 '6:00pm',
 '10:00am',
 '9:30am',
 '10:00am',
 '1:00pm',
 '2:00pm',
 '10:00am',
 '10:30am',
 '2:00pm',
 '10:00am',
 '11:00am',
 '10:00am',
 '2:00pm',
 '2:00pm',
 '10:00am',
 '10:00am',
 '9:00am',
 '12:30pm',
 '10:00am',
 '9:45am',
 '10:30am',
 '1:00pm',
 '4:00pm',
 '9:30am',
 '9:30am',
 '10:30am',
 '1:00pm',
 '6:00pm',
 '10:00am',
 '11:00am',
 '9:30am',
 '11:00am',
 '1:00pm',
 '2:30pm',
 '6:00pm',
 '11:00am',
 '6:30pm',
 '11:00am',
 '10:00am',
 '10:00am',
 '10:00am',
 '10:00am',
 '10:00am',
 '10:00am',
 '10:00am',
 '10:00am',
 '10:00am']

In [13]:
location

['Room 412',
 'Room 123',
 'Room 412',
 'Room 123',
 'Room 500',
 'Room 500',
 'Room 500',
 'Room 412',
 'Room 120',
 'Room 412',
 'Room 500',
 'Roon 500',
 'Room 500',
 'Room 120',
 'Room 120',
 'Room 412',
 'Room 500',
 'Room 120',
 'Room 412',
 'Room 120',
 'Room 412',
 'Room 120',
 'Canceled',
 'Room 500',
 'Room 120',
 'Room 123',
 'Room 500',
 'Room 120',
 'Room 120',
 'Room 412',
 'Room 500',
 'Room 120',
 'Room 412',
 'Room 500',
 'Room 120',
 'Room 123',
 'Room 123',
 'Room 500',
 'Room 500',
 'Room 123',
 'Room 412',
 'Room 120',
 'Room 123',
 'Room 500',
 'Room 120',
 'Room 412',
 'Room 500',
 'Room 412',
 'Room 412',
 'Columbia Heights Education Campus Auditorium, 3101 16th Street, NW, Washington, D.C. 20010',
 'Room 500',
 'Room 500',
 'Room 412',
 '(Cancelled)',
 'Room 120',
 'Room 500',
 'University of District of Columbia Student Center; 4200 Connecticut Avenue, N.W.; Washington, DC 20008',
 'Room 412',
 'Edward J. Pryzbyla Center, Great Room A; 620 Michigan Avenue NE; 

# Extract title and contents of each meeting from div class: event-description-content-dev

In [14]:
titles = [div.find('a').text for div in soup.find_all('div', {'class': 'event-description-content-dev'})]

In [15]:
titles

['Judiciary & Public Safety Public  Hearing',
 'Health Public Roundtable',
 'Legislative Media Briefing',
 'Business & Economic Development Additional Meeting',
 'Health & Education Joint Public Oversight Roundtable',
 'Committee of the Whole Additional Meeting',
 'Legislative Meeting',
 'Housing & Neighborhood Revitalization Public Roundtable',
 'Judiciary & Public Safety Public Hearing',
 'Education Public Roundtable',
 'Legislative Meeting',
 'Committee of the Whole Public Roundtable',
 'Committee of the Whole Public Roundtable',
 'Education Meeting',
 'Finance & Revenue Public Roundtable',
 'Human Services Public Hearing',
 'Health Public Oversight Roundtable',
 'Transportation & the Environment Additional Meeting',
 'Human Services Public Hearing',
 'Judiciary & Public Safety Public  Hearing',
 'Transportation & the Environment Public Hearing',
 'Business & Economic Development Public Hearing',
 'Education Public Roundtable (Canceled)',
 'Committee of the Whole Meeting',
 'Finance

In [16]:
children = soup.find_all('div', {'class': 'event-description-content-dev'})

content = []
for item in children:
    # The title has already been extracted, so drop the first non-empty line (index 0) and keep the rest
    lines = [line for line in item.text.split('\n') if len(line) > 0][1:]

    # Join the remaining lines into a single string per meeting
    content.append(' '.join(lines))

content

['The Committee on the Judicuary & Public Safety will hold a Public Hearing on the following Legislation: Bill 22-0012, the "Revision of Guardianship of Minors and Creation of Supplemental Needs Trusts Act of 2017" Bill 22-0020, the "Consumer Disclosure Act of 2017" Bill 22-0049, the "Uniform Power of Attorney Amendment Act of 2017" Bill 22-0169, the "Electronic Signature Authorization Act of 2017" Bill 22-0198, the "Uniform Partition of Heirs\' Property Act of 2017" Bill 22-0199, the "Uniform Fiduciary Access to Digital Assets Act of 2017" The Committee invites the public to testify or to submit written testimony. Anyone wishing to\xa0testify at the hearing should contact the Committee via email at judiciary@dccouncil.us or at (202)\xa0727-8275, and provide their name, telephone number, organizational affiliation, and title (if any),\xa0by close of business Friday, May 26. Representatives of organizations will be allowed a\xa0maximum of five minutes for oral\xa0testimony, and individu
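
The split-and-join step can be checked on a small example. The snippet below is a sketch under assumptions: the markup is hand-written to imitate one "event-description-content-dev" block (the real page may be structured differently), and `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one entry; the real page may differ.
sample_html = """<div class="event-description-content-dev">
<a href="#">Health Public Roundtable</a>
<p>The Committee on Health will hold a Public Roundtable.</p>

<p>The Committee invites the public to testify.</p>
</div>"""

sample_div = BeautifulSoup(sample_html, 'html.parser').find(
    'div', class_='event-description-content-dev')

# .text concatenates every text node, so the newlines between tags survive
# and blank lines show up as empty strings after the split.
sample_lines = [line for line in sample_div.text.split('\n') if len(line) > 0]

# The first non-empty line is the title; everything after it is the body.
sample_title = sample_lines[0]
sample_body = ' '.join(sample_lines[1:])
```

This relies on the title being the first non-empty line of each block, which is the same assumption the `[1:]` slice in the cell above makes.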

# Check that all lists are the same length

Confirm that all lists are the same length as a sanity check on the scrape, so that each row of the DataFrame lines up across every column.

In [17]:
len(content) == len(titles) == len(location) == len(time) == len(date)

True
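
When the check returns False, the chained comparison does not say which list is off. A minimal sketch of a more informative check follows; `column_lengths` and `lengths_match` are hypothetical helper names, and the example lists are made up to stand in for a scrape where one location is missing.

```python
def column_lengths(**columns):
    """Map each column name to the length of its list."""
    return {name: len(values) for name, values in columns.items()}


def lengths_match(lengths):
    """True when there is at least one row and every column has the same length."""
    unique = set(lengths.values())
    return len(unique) == 1 and unique != {0}


# Hypothetical lists standing in for a scrape where one location is missing.
example_lengths = column_lengths(date=['6/1/2017', '6/5/2017'],
                                 time=['9:30am', '10:30am'],
                                 location=['Room 412'])
example_ok = lengths_match(example_lengths)
```

In the notebook this could be called as `column_lengths(date=date, time=time, location=location, titles=titles, content=content)`. Treating zero rows as a failure is a design choice here, since the calendar sometimes shows no entries for the selected date.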

# Create DataFrame for data and export to csv

In [18]:
data = {'Date': date,
        'Time': time,
        'Location': location,
        'Titles': titles,
        'Content': content}

df = pd.DataFrame(data, columns=['Date', 'Time', 'Location', 'Titles', 'Content'])

Look at the DataFrame to make sure it has the structure you want to export, and spot check it against the data on the website.

In [19]:
df.head()

|   | Date | Time | Location | Titles | Content |
|---|------|------|----------|--------|---------|
| 0 | Thursday, 6/1/2017 | 9:30am | Room 412 | Judiciary & Public Safety Public Hearing | The Committee on the Judicuary & Public Safety... |
| 1 | Thursday, 6/1/2017 | 2:00pm | Room 123 | Health Public Roundtable | The Committee on Health will hold a Public Rou... |
| 2 | Monday, 6/5/2017 | 10:30am | Room 412 | Legislative Media Briefing | Office of Chairman Mendelson Council of the Di... |
| 3 | Monday, 6/5/2017 | 10:30am | Room 123 | Business & Economic Development Additional Mee... | The Committee on Business & Economic Developme... |
| 4 | Monday, 6/5/2017 | 11:00am | Room 500 | Health & Education Joint Public Oversight Roun... | The Committee on Health & the Committee on Edu... |


In [20]:
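# Add the .csv extension so the file opens as a CSV; index=False leaves out the row-number column
df.to_csv('DC Council Calendar - 2017-07-25.csv', index=False)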
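
If the notebook is later run daily as a script, as the note under Set Date anticipates, the hard-coded date in the file name would need to change on every run. One possible sketch is below; `csv_filename` is a hypothetical helper, and the naming pattern is simply copied from the file name used above.

```python
import datetime


def csv_filename(run_date=None, prefix='DC Council Calendar'):
    """Build a file name such as 'DC Council Calendar - 2017-07-25.csv'."""
    if run_date is None:
        # Default to the day the script runs
        run_date = datetime.date.today()
    return '{} - {}.csv'.format(prefix, run_date.isoformat())


# Passing an explicit date reproduces the name used in this notebook.
example_name = csv_filename(datetime.date(2017, 7, 25))
```

The export would then read `df.to_csv(csv_filename(), index=False)`.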