# IRE Board members

The goal: Scrape [this list of IRE board members](https://www.ire.org/about-ire/past-ire-board-members/) into a CSV.

This project introduces a few new concepts:
- Scraping data that's not part of a table
- Specifying custom request headers to evade a bot detection rule on our server
- Using string methods and default values when parsing out the data

In [1]:
# standard library module we'll use to write the CSV file
import csv

# installed library to handle the HTTP traffic
import requests

# installed library to parse the HTML
from bs4 import BeautifulSoup

In [2]:
URL = 'https://www.ire.org/about-ire/past-ire-board-members/'

In [3]:
# set up request headers
# the IRE website rejects incoming requests with the
# `requests` library's default user-agent, so we
# need to pretend to be a browser -- we can do that by
# setting the `User-Agent` value to mimic a value that
# a browser would send, and add this to the headers
# of the request before it's sent
# read more: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}
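
To see why the override matters, here is a minimal sketch that compares the `User-Agent` that `requests` sends by default with the one attached to a prepared (but unsent) request. It makes no network call. The name `demo_headers` and the shortened `Mozilla/5.0 ...` string are illustrative stand-ins, not the values used in the notebook.

```python
import requests

URL = 'https://www.ire.org/about-ire/past-ire-board-members/'

# the User-Agent that `requests` sends when you don't override it,
# something like "python-requests/2.x.y" -- easy for a server
# to recognize as a script rather than a browser
default_ua = requests.utils.default_user_agent()
print(default_ua)

# hypothetical, shortened browser string for illustration only
demo_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# build the request without sending it, so we can inspect
# the headers that would go over the wire
prepared = requests.Request('GET', URL, headers=demo_headers).prepare()
print(prepared.headers['User-Agent'])
```

Inspecting a prepared request like this is a quick way to confirm a header override took effect before pointing the scraper at a live site.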

In [4]:
# send a GET request to fetch the page, using the URL and headers we just defined
r = requests.get(
    URL,
    headers=headers
)

# raise an error if the HTTP request returns an error code
# HTTP codes: https://http.cat
r.raise_for_status()
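
As a rough illustration of what `raise_for_status()` does when a server turns a request away, the sketch below builds a response object by hand rather than hitting the network. The 403 status code is an assumption chosen for the demo; the notebook does not say which code the IRE server returns for blocked requests.

```python
import requests

# build a fake response by hand instead of making a real request;
# the 403 status is an assumption chosen for illustration
fake_response = requests.Response()
fake_response.status_code = 403

try:
    # raises requests.HTTPError for any 4xx or 5xx status code
    fake_response.raise_for_status()
    blocked = False
except requests.HTTPError as error:
    blocked = True
    print(f'Request failed: {error}')
```

Failing loudly here means the scraper stops before it tries to parse an error page as if it were the board member list.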

In [5]:
# create a BeautifulSoup object by parsing the response text
# -- r.text -- with Python's built-in HTML parser
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
soup = BeautifulSoup(r.text, 'html.parser')

In [6]:
print(soup)

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!-- WP_HEAD() START -->
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<!-- This site is optimized with the Yoast SEO plugin v20.0 - https://yoast.com/wordpress/plugins/seo/ -->
<title>Past IRE Board Members - Investigative Reporters &amp; Editors</title>
<link href="https://www.ire.org/about-ire/past-ire-board-members/" rel="canonical"/>
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type"/>
<meta content="Past IRE Board Members - Investigative Reporters &amp; Editors" property="og:title"/>
<meta content="https://www.ire.org/about-ire/past-ire-board-members/" property="og:url"/>
<meta content="Investigative Reporters &amp; Editors" property="og:site_name"/>
<meta content="https://www.facebook.com/IRE.NICAR/" property="article:publisher"/>
<meta content=

In [7]:
# search the HTML tree to find the div
# with the `id` attribute of "past-ire-board-members"
target_div = soup.find(
    'div',
    {'id': 'past-ire-board-members'}
)

In [8]:
print(target_div)

<div class="ct-new-columns" id="past-ire-board-members"><div class="ct-div-block" id="div_block-175-2155"><div class="oxy-rich-text" id="_rich_text-178-2155"><p>Paul Adrian (2000-2006)</p><p>Matt Apuzzo (2017-2019)</p><p>Rosemary Armao (1997-2001)</p><p>Bethany Barnes (2019-2021)</p><p>Roberta Baskin (1998-2000)</p><p>*David Boardman (1997-2007)</p><p>Walt Bogdanich (1988-1992)</p><p>Ziva Branstetter (2013-2019)</p><p>Darla Cameron (2022-present)</p><p>John Camp (1990-1992)</p><p>Rose Ciotta (1993-2001)</p><p>Wendell Cochran (2006-2008)</p><p>*Sarah Cohen (2010-2018)</p><p>Robert Cribb (2009-2014)</p><p>Bill Dedman (1990-1996)</p><p>Matt Dempsey (2018-2020)</p><p>*David Dietz (1996-2004) (dec)</p><p>Steve Doig (2002-2006)</p><p>Andrew Donohue (2010-2018)</p><p>Leonard Downie Jr. (2009-2015)</p><p>Bob Drogin (1986-1989)</p><p>Bill Farr (1978-1985) (dec)</p><p>Renee Ferguson (2006-2008)</p><p>Jodie Fleischer (2019-present)</p><p>Jennifer Forsyth (2020-2022)</p><p>Ellen Gabler (2012-2018)

In [9]:
# within that div, find all the paragraph tags
members = target_div.find_all('p')
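
The two lookups above can be tried offline against a small hand-written snippet. This is a minimal sketch: `sample_html` is a made-up stand-in for the real page, and the `None` check is an optional safeguard that the notebook itself does not include.

```python
from bs4 import BeautifulSoup

# a tiny hand-written stand-in for the real page
sample_html = '''
<div id="past-ire-board-members">
  <p>Paul Adrian (2000-2006)</p>
  <p>*David Boardman (1997-2007)</p>
</div>
<div id="something-else"><p>Not a board member</p></div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# same lookup as above: the div with a specific `id` attribute
sample_div = sample_soup.find('div', {'id': 'past-ire-board-members'})

# find() returns None when nothing matches, so fail with a clear
# message instead of an AttributeError on the next line
if sample_div is None:
    raise ValueError('Could not find the board members div')

# only paragraphs inside the target div are returned
sample_members = [p.text for p in sample_div.find_all('p')]
print(sample_members)
```

Because `find_all()` is called on the div rather than on the whole document, the paragraph in the second div is left out.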

In [10]:
members

[<p>Paul Adrian (2000-2006)</p>,
 <p>Matt Apuzzo (2017-2019)</p>,
 <p>Rosemary Armao (1997-2001)</p>,
 <p>Bethany Barnes (2019-2021)</p>,
 <p>Roberta Baskin (1998-2000)</p>,
 <p>*David Boardman (1997-2007)</p>,
 <p>Walt Bogdanich (1988-1992)</p>,
 <p>Ziva Branstetter (2013-2019)</p>,
 <p>Darla Cameron (2022-present)</p>,
 <p>John Camp (1990-1992)</p>,
 <p>Rose Ciotta (1993-2001)</p>,
 <p>Wendell Cochran (2006-2008)</p>,
 <p>*Sarah Cohen (2010-2018)</p>,
 <p>Robert Cribb (2009-2014)</p>,
 <p>Bill Dedman (1990-1996)</p>,
 <p>Matt Dempsey (2018-2020)</p>,
 <p>*David Dietz (1996-2004) (dec)</p>,
 <p>Steve Doig (2002-2006)</p>,
 <p>Andrew Donohue (2010-2018)</p>,
 <p>Leonard Downie Jr. (2009-2015)</p>,
 <p>Bob Drogin (1986-1989)</p>,
 <p>Bill Farr (1978-1985) (dec)</p>,
 <p>Renee Ferguson (2006-2008)</p>,
 <p>Jodie Fleischer (2019-present)</p>,
 <p>Jennifer Forsyth (2020-2022)</p>,
 <p>Ellen Gabler (2012-2018)</p>,
 <p>Cindy Galli (2019-present)</p>,
 <p>*Manny Garcia (2006-2014)</p>,
 <p>L

In [11]:
# set up the CSV headers to write to file
csv_headers = [
    'name',
    'terms',
    'was_president',
    'is_deceased'
]
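
As an aside, the same header list could also drive a `csv.DictWriter`, which matches values to columns by name instead of by position. This is an alternative sketch, not the approach this notebook takes; it writes to an in-memory buffer so nothing lands on disk, and `demo_writer` and the sample row are illustrative.

```python
import csv
import io

csv_headers = ['name', 'terms', 'was_president', 'is_deceased']

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()

# DictWriter uses `fieldnames` to decide the column order
demo_writer = csv.DictWriter(buffer, fieldnames=csv_headers)
demo_writer.writeheader()

# each row is a dictionary keyed by column name
demo_writer.writerow({
    'name': 'Paul Adrian',
    'terms': '2000-2006',
    'was_president': False,
    'is_deceased': False
})

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the boolean values end up in the CSV as the strings `True` and `False`, which is also what happens with `csv.writer`.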

In [12]:
# next, set up the file to write the CSV data into
# https://docs.python.org/3/library/csv.html#csv.writer

# open the CSV file in write ('w') mode, specifying newline='' so the
# csv module controls the line endings -- without it, Windows can
# end up with an extra blank line between rows
with open('ire-board.csv', 'w', newline='') as outfile:

    # set up a csv.writer object tied to the file we just opened
    writer = csv.writer(outfile)

    # write the list of headers
    writer.writerow(csv_headers)

    # loop over the list of paragraphs we targeted above
    for member in members:

        # we don't want the entire Tag object, just the text
        text = member.text

        # set up some default values -- the member was not president
        was_president = False

        # and is not deceased
        is_deceased = False

        # IRE denotes past presidents with a leading asterisk,
        # so check to see if the string starts with '*'
        # https://docs.python.org/3/library/stdtypes.html?highlight=startswith#str.startswith
        if text.startswith('*'):

            # if so, switch the value for the `was_president` variable to True
            was_president = True

        # check to see if "(dec)" is anywhere in the text, which
        # indicates this person is deceased
        # https://docs.python.org/3/reference/expressions.html#in
        if '(dec)' in text:
            is_deceased = True

        # next, start parsing out the pieces
        # separate the name from the terms by splitting on "("
        text_split = text.split('(')

        # the name will be the first ([0]) item in the resulting list
        # while we're at it, strip off any leading asterisks
        # https://docs.python.org/3/library/stdtypes.html?highlight=lstrip#str.lstrip
        # and strip() off any leading or trailing whitespace
        # https://docs.python.org/3/library/stdtypes.html?highlight=lstrip#str.strip
        name = text_split[0].lstrip('*').strip()

        # the term(s) of service will be the second item ([1]) in that list,
        # and the term text always ends with a closing parenthesis,
        # so splitting on ")" and taking the first ([0]) item
        # in the resulting list will give us the term(s)
        terms = text_split[1].split(')')[0]

        # put the collected data into a list
        data = [
            name,
            terms,
            was_president,
            is_deceased
        ]

        # and write this row of data into the CSV file
        writer.writerow(data)
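
The parsing steps in the loop can also be pulled into a small function, which makes them easy to test without fetching the page. This is a minimal sketch: `parse_member` is a hypothetical helper, not part of the original notebook, and it swaps `split('(')` for `str.partition` so that a line with no parentheses yields empty terms instead of raising an `IndexError`.

```python
def parse_member(text):
    """Return [name, terms, was_president, is_deceased] for one line of text."""
    text = text.strip()

    # a leading asterisk marks a past president
    was_president = text.startswith('*')

    # "(dec)" anywhere in the text marks a deceased member
    is_deceased = '(dec)' in text

    # partition() always returns three pieces, so a line with
    # no "(" gives an empty `rest` rather than an IndexError
    name_part, _, rest = text.partition('(')
    name = name_part.lstrip('*').strip()

    # the terms run up to the first closing parenthesis
    terms = rest.split(')')[0].strip()

    return [name, terms, was_president, is_deceased]


print(parse_member('*David Dietz (1996-2004) (dec)'))
print(parse_member('Darla Cameron (2022-present)'))
```

With a helper like this, the body of the loop could shrink to `writer.writerow(parse_member(member.text))`.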