# Scrape Tennessean

The purpose of this notebook was to identify articles with quotes from candidates 
for the 2023 mayoral race, pull out those statements, and categorize them.

It uses Selenium to scrape search results and article text, and the ChatGPT API to parse the articles and identify statements.

In [16]:
import os
from dataclasses import dataclass
from time import sleep

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import (StaleElementReferenceException,
                                        TimeoutException,
                                        NoSuchElementException)
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm

In [2]:
# My chromium driver isn't in PATH, for some reason

os.environ['PATH'] += os.pathsep + '/opt/homebrew/bin'

In [22]:
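# Start a Selenium driver and navigate to the Tennessean homepage
options = Options()
# Chrome opens a visible window by default, so the deprecated
# `options.headless` attribute does not need to be set.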
options.add_argument("--window-size=1200,800")

driver = webdriver.Chrome(options=options)
driver.get('https://www.tennessean.com/')
wait = WebDriverWait(driver, 4)


In [4]:
articles = []


@dataclass
class Article:
    title: str
    link: str
    description: str

In [5]:
candidates = [
    'Floyd Bonner', 'Joe Brown', 'Karen Camper', 'Frank Colvett',
    'J.W. Gibson', 'Willie Herenton', 'Michelle McKissack', 'Van Turner',
    'Paul Young', 'James Harvey', 'Reggie Hall'
]

In [6]:
for candidate in tqdm(candidates):
    # Open up the search box
    driver.find_element(By.XPATH,
                        '/html/body/header/nav/div[2]/div[1]/a').click()
    # Search for articles containing the candidate's name and "statement"
    webdriver.ActionChains(driver).send_keys(f'"{candidate}" statement',
                                             Keys.ENTER).perform()
    # Go through up to 10 pages
    for i in range(10):

        # Get all the articles
        for article in driver.find_elements(By.XPATH,
                                            '/html/body/main/div[1]/a'):
            # skip subscriber-only articles
            if 'gnt_lbl_pm' in article.get_attribute('class'):
                continue
            articles.append(
                Article(title=article.text,
                        link=article.get_attribute('href'),
                        description=article.get_attribute('data-c-desc')))

        # Go to the next page
        for next_button in driver.find_elements(
                By.XPATH, '/html/body/main/div[1]/div[4]/a[3]'):
            next_button.click()
            break
        else:
            # If there are no more pages, exit the loop
            break

100%|██████████| 11/11 [00:25<00:00,  2.29s/it]
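The paging logic above relies on Python's `for`/`else`: the `else` branch runs only when the inner loop finishes without hitting `break`, which here means no next button was found. Below is a minimal, browser-free sketch of the same control flow. `FAKE_PAGES` and `collect_pages` are hypothetical stand-ins for the live site and the Selenium calls, not part of the original notebook.

```python
# Hypothetical stand-in for the search results: each page lists its
# article titles and whether a "next" link exists.
FAKE_PAGES = [
    {"articles": ["a1", "a2"], "has_next": True},
    {"articles": ["a3"], "has_next": True},
    {"articles": ["a4"], "has_next": False},
]


def collect_pages(pages, max_pages=10):
    """Walk up to max_pages pages, stopping early when there is no next link."""
    collected = []
    position = 0
    for _ in range(max_pages):
        page = pages[position]
        collected.extend(page["articles"])
        # Mirrors driver.find_elements(...): an empty list means no next button
        next_buttons = ["next"] if page["has_next"] else []
        for _button in next_buttons:
            position += 1  # "click" the first match only
            break
        else:
            # Runs only if the inner loop never hit `break` (no button found)
            break
    return collected


print(collect_pages(FAKE_PAGES))               # all four articles
print(collect_pages(FAKE_PAGES, max_pages=2))  # stops after two pages
```

Using `find_elements` (plural) rather than `find_element` is what makes this work: a missing button yields an empty list instead of raising `NoSuchElementException`.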


In [7]:
len(articles)

32

In [9]:
df = pd.DataFrame(articles).drop_duplicates()

df[df.link.str.contains('opinion')]

Unnamed: 0,title,link,description
30,"The brutal death of Tyre Nichols: horror, grie...",https://www.tennessean.com/story/opinion/colum...,"After each horrific episode, there is talk of ..."


In [12]:
# Write the links out to disk
links_df = pd.DataFrame(articles).drop_duplicates()
links_df['date'] = pd.to_datetime(
    links_df.link.str.extract(r'(\d{4}/\d{2}/\d{2})', expand=False))
links_df.to_csv('tn_links.csv', index=False)
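The date column comes from the `YYYY/MM/DD` segment that appears in story links. The sketch below shows the same extraction with only the standard library. `date_from_link` is a hypothetical helper and the URLs are made up for illustration; links without a date segment give `None`, which corresponds to a missing date in the DataFrame.

```python
import re
from datetime import date

# Same pattern as the pandas call, with one group per date part
DATE_IN_URL = re.compile(r"(\d{4})/(\d{2})/(\d{2})")


def date_from_link(link):
    """Return the first YYYY/MM/DD found in the link as a date, else None."""
    match = DATE_IN_URL.search(link)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Made-up URLs in the general shape of a story link
print(date_from_link("https://www.example.com/story/news/2017/08/22/some-slug/"))
print(date_from_link("https://www.example.com/podcasts/some-show/"))
```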

In [13]:
# Restart point
links_df = pd.read_csv('tn_links.csv')
print(len(links_df))
links_df.head()

32


Unnamed: 0,title,link,description,date
0,Settlement reached in Springfield High racial ...,https://www.tennessean.com/story/news/local/ro...,A settlement has been reached between a former...,2018-01-26
1,FBI questions second Nashville judge in Casey ...,https://www.tennessean.com/story/news/2017/08/...,At least one other Nashville judge is entangle...,2017-08-22
2,WKU: We knew call on Vandy 2-point conversion\...,https://www.tennessean.com/story/sports/colleg...,WKU players said they knew exactly what Vander...,2015-09-04
3,Mediation possible in Robertson County jail su...,https://www.tennessean.com/story/news/local/ro...,A private mediation could be in the works for ...,2016-07-03
4,"Feds: Moreland tried to pay woman $6,100 to re...",https://www.tennessean.com/story/news/crime/20...,Moreland will appear in court Friday for a hea...,2017-03-28


## Load article bodies

In [14]:
article_text = []

In [23]:
articles_df = links_df.copy()
for idx, link in enumerate(tqdm(articles_df.link.tolist())):
    if len(article_text) > idx:
        continue
    driver.get(link)
    try:
        wait.until(
            EC.visibility_of_element_located(
                (By.XPATH, '/html/body/div[2]/main/article/div[5]')))
        article_text.append(
            driver.find_element(By.XPATH,
                                '/html/body/div[2]/main/article/div[5]').text)
    except (TimeoutException, NoSuchElementException):  # hmm didn't load
        article_text.append('Failed to load')
articles_df['text'] = article_text

100%|██████████| 32/32 [00:14<00:00,  2.28it/s]
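Because `article_text` lives in its own cell, the loop above can be rerun after an interruption and will skip every link that already has an entry. The sketch below isolates that pattern. `fetch_all` and `fake_fetch` are hypothetical names, and `TimeoutError` stands in for the Selenium exceptions.

```python
def fetch_all(links, fetch, results):
    """Fill `results` in place so a rerun skips links that are already done."""
    for idx, link in enumerate(links):
        if len(results) > idx:
            continue  # fetched on an earlier run
        try:
            results.append(fetch(link))
        except TimeoutError:
            results.append("Failed to load")
    return results


calls = []


def fake_fetch(link):
    """Stand-in for driver.get + find_element; fails for the link 'bad'."""
    calls.append(link)
    if link == "bad":
        raise TimeoutError
    return f"text of {link}"


cache = []
fetch_all(["a", "bad"], fake_fetch, cache)       # first, partial run
fetch_all(["a", "bad", "c"], fake_fetch, cache)  # rerun only fetches "c"
print(cache)
print(calls)
```

One consequence of tracking progress by list length is that a `'Failed to load'` entry counts as done, so failed pages are not retried on a rerun.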


In [24]:
articles_df.to_csv('tn_articles.csv', index=False)

## Use GPT to parse out statements

In [29]:
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Model.list()

In [30]:
preface = """ Between triple quotes is an article from newspaper The Tennessean
'''
%s
'''
Please write out the speakers and statements in this article in an array, along with a category. 
The available categories are 'crime and safety', 'public education', 'housing', and 'none of the above'. 
Here is example output:
[('Elmer Fudd', 'I\'m going to get that rabbit a home.', 'housing'),
('Bugs Bunny', 'What\'s up, doc?', 'none of the above')]
"""

In [31]:
import time

completions = {}
articles_df = pd.read_csv('tn_articles.csv')
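One detail of the `preface` prompt is worth checking before sending it: in a normal (non-raw) Python string, `\'` is an escape sequence for a plain `'`, so the backslashes in the example output never reach the model. The sketch below, using only the first example tuple, shows that the text the model sees is not a valid Python literal, while a raw string would preserve the escaping. This is an illustration of string handling, not a claim about how the model responds.

```python
import ast

# Same escaping as the prompt: inside a normal string, \' is just '
shown_to_model = """[('Elmer Fudd', 'I\'m going to get that rabbit a home.', 'housing')]"""
print(shown_to_model)

try:
    ast.literal_eval(shown_to_model)
    example_is_valid = True
except SyntaxError:
    example_is_valid = False
print(example_is_valid)

# A raw string keeps the backslash, so the example is a valid Python literal
raw_example = r"""[('Elmer Fudd', 'I\'m going to get that rabbit a home.', 'housing')]"""
print(ast.literal_eval(raw_example))
```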

In [32]:
prev_start = time.time() - 30

for _, row in tqdm(list(articles_df.iterrows())):
    # Skip articles that failed to load or have no text
    if not isinstance(row.text, str) or row.text == 'Failed to load':
        continue
    if row.title in completions:
        continue  # already processed
    elif 'Geoffrey Redd' in row.title:
        continue  # something weird with this article?
    # Make sure we stay under 3 requests a minute
    time.sleep(max(21 - (time.time() - prev_start), 0))
    # Exclude paragraphs about author or other articles
    article_input = '\n'.join(
        line for line in row.text.splitlines()
        if not line.startswith('Related') and 'tennessean' not in line)
    # Use spaces as a rough estimate of words
    wordcount = article_input.count(' ')
    # Use a model with a larger limit for longer articles, otherwise the most recent model
    model = 'gpt-3.5-turbo-16k' if wordcount > 3000 else 'gpt-3.5-turbo'
    prev_start = time.time()
    messages = [
        # dict(role="system", content=system_msg),
        dict(role="user", content=preface % article_input),
    ]
    completions[row.title] = openai.ChatCompletion.create(model=model,
                                                          messages=messages)


100%|██████████| 32/32 [08:06<00:00, 15.20s/it]
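The loop above spaces requests at least 21 seconds apart to stay under 3 requests a minute. The same idea can be wrapped in a small helper, sketched below. `MinInterval` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only so the demo can run instantly with a fake clock.

```python
import time


class MinInterval:
    """Block so that consecutive calls to wait() are at least `seconds` apart."""

    def __init__(self, seconds, clock=time.monotonic, sleep=time.sleep):
        self.seconds = seconds
        self.clock = clock
        self.sleep = sleep
        self.last_start = None

    def wait(self):
        if self.last_start is not None:
            remaining = self.seconds - (self.clock() - self.last_start)
            if remaining > 0:
                self.sleep(remaining)
        self.last_start = self.clock()


# Demo with a fake clock so nothing actually sleeps
now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    now[0] += seconds


limiter = MinInterval(21, clock=lambda: now[0], sleep=fake_sleep)
limiter.wait()  # first call: no delay
now[0] += 5     # pretend the request took 5 seconds
limiter.wait()  # sleeps for the remaining 16 seconds
print(slept)
```

In real use, `limiter.wait()` would be called immediately before each API request, so skipped articles cost no waiting time.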


In [38]:
# Parse the ones that work well enough. Almost all issues are caused by
# quotes; next time have it use single quotes or another format.
import ast

parsed = {}
for title, completion in completions.items():
    content = completion.choices[0].message.content
    try:
        # literal_eval only accepts Python literals, so it cannot execute
        # arbitrary code contained in the model's reply (unlike eval)
        parsed[title] = ast.literal_eval(content)
    except (SyntaxError, ValueError):
        print(('"%s":' % title) + content + ',')

"FBI questions second Nashville judge in Casey Moreland case
Stacey Barchenger and Dave Boucher ":[('Peter Strianse', "Robinson, a fellow judge and longtime friend of Moreland's, has been interviewed by the FBI, said Peter Strianse, Moreland's lawyer. It's unclear if Robinson alerted the FBI about the affidavit or spoke with investigators for another reason.", 'crime and safety'), 
('Robinson', "I’m not going to comment on any federal investigation or that federal case," he said Tuesday.", 'crime and safety'), 
('Prosecutors and Moreland\'s lawyers', 'Prosecutors and Moreland\'s lawyers sparred in a brief Tuesday afternoon hearing during which Moreland sought to have a GPS monitor removed before his trial in June.', 'crime and safety'), 
('James Pedigo', 'The person who gave the woman the affidavit, James Pedigo, wore a wire for the FBI to record some of his interactions with the woman.', 'crime and safety'), 
('U.S. Magistrate Judge Joe Brown', "But even U.S. Magistrate Judge Joe Brow
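The note in the parsing cell attributes almost all failures to quote characters and suggests another format next time. One such option, sketched below as an assumption rather than what this notebook did, is to ask the model for a JSON array of objects and parse it with `json.loads`. `parse_statements` and the key names `speaker`, `statement`, and `category` are hypothetical, and both sample replies are made up.

```python
import json


def parse_statements(reply):
    """Parse a JSON array of {speaker, statement, category} objects.

    Returns a list of (speaker, statement, category) tuples, or None if the
    reply is not valid JSON in the expected shape.
    """
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    rows = []
    for item in items:
        if not isinstance(item, dict):
            return None
        try:
            rows.append((item["speaker"], item["statement"], item["category"]))
        except KeyError:
            return None
    return rows


# Made-up replies for illustration
good = '[{"speaker": "Elmer Fudd", "statement": "I\'m going to get that rabbit a home.", "category": "housing"}]'
bad = "[('Elmer Fudd', 'I'm going to get that rabbit a home.', 'housing')]"
print(parse_statements(good))
print(parse_statements(bad))
```

JSON strings are always double-quoted, so apostrophes inside a statement need no escaping. Double quotes inside a statement still do, so some replies may continue to need manual repair.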

In [42]:
# Add the ones I fixed manually
parsed.update({
    "FBI questions second Nashville judge in Casey Moreland case\nStacey Barchenger and Dave Boucher":
    [('Peter Strianse',
      "Robinson, a fellow judge and longtime friend of Moreland's, has been interviewed by the FBI, said Peter Strianse, Moreland's lawyer. It's unclear if Robinson alerted the FBI about the affidavit or spoke with investigators for another reason.",
      'crime and safety'),
     ('Robinson',
      "\"I’m not going to comment on any federal investigation or that federal case,\" he said Tuesday.",
      'crime and safety'),
     ('Prosecutors and Moreland\'s lawyers',
      'Prosecutors and Moreland\'s lawyers sparred in a brief Tuesday afternoon hearing during which Moreland sought to have a GPS monitor removed before his trial in June.',
      'crime and safety'),
     ('James Pedigo',
      'The person who gave the woman the affidavit, James Pedigo, wore a wire for the FBI to record some of his interactions with the woman.',
      'crime and safety'),
     ('U.S. Magistrate Judge Joe Brown',
      "But even U.S. Magistrate Judge Joe Brown called the home confinement liberal on Tuesday, after prosecutors said Moreland was able to leave home for things like running errands. ",
      'crime and safety'),
     ('Strianse',
      "And Moreland was back to work as of about one week ago, said Strianse, who declined to provide details about the ex-judge's employment except to say it's a 9-to-5 business job in Nashville.",
      'none of the above'),
     ('Cecil VanDevender',
      "\"The person in the best position to know who the witnesses are is Mr. Moreland,\" he argued in court.",
      'crime and safety')],
    "'This is shattering': Officials respond to elementary school shooting":
    [('Zulfat Suara',
      'As a parent, the last place you will expect your child to be hurt is at school. I cannot imagine the pain. Praying for the families involved.',
      'crime and safety'),
     ('John Cooper',
      "My heart goes out to the families of the victims. Our entire city stands with you.",
      'crime and safety'),
     ('Bob Mendes', 'This is shattering.', 'none of the above'),
     ('Adrienne Battle',
      "As a parent, as an educator, as a human being, I'm grieving today over the tragic murder of children and school staff right here in our community. My heart goes out to the entire Covenant School community and the parents grieving the unimaginable loss of life today. MNPS has been in close contact with Nashville police and is providing whatever support we can. We have invested considerable resources to strengthen security at our facilities. We will continue to reinforce safety protocols in response to the far too many, far too often instances of school shootings across the nation over the years.",
      'public education'),
     ('Joe Biden',
      "It's heartbreaking, a family's worst nightmare. We have to do more to stop gun violence. It's ripping our communities apart … ripping at the very soul of our nation. We have to do more to protect our schools so they aren't turned into prisons. So many members of the military (come) back with post-traumatic stress. Enough is enough.",
      'crime and safety'),
     ('Jill Biden',
      'I am truly without words and our children deserve better. We stand, all of us, with Nashville in prayer.',
      'none of the above'),
     ('Bob Freeman',
      "This is an unimaginable tragedy for the victims, all the children, families, teachers, staff and my entire community. I'm praying for my neighborhood, my city and my state. It is time to pull together and provide all the love and support that we can to those affected by this terrible catastrophe. It is time for serious action.",
      'none of the above'),
     ('Tonya Hancock',
      'Please help us pass stricter gun laws in Tennessee and in the Federal Government to save us from these senseless tragedies.',
      'crime and safety'),
     ('Sandra Sepulveda',
      "The politicians who would offer their 'thoughts and prayers' in this state and at the federal level, and do nothing are cowards. They are empty words.",
      'none of the above'),
     ('Jeff Yarbro',
      "Let's pray politicians with power to do something about school shooting will find the courage to act. Let's pray the anger we feel is transformed into demanding better. Let's pray the compassion we have for the families and teachers doesn't give way to cynicism or giving up.",
      'none of the above'),
     ('Karen Camper',
      "I will continue to work as the Democratic Leader to find REAL solutions to this very REAL problem of guns being used to harm our children. Governor Lee, we must make change now.",
      'crime and safety'),
     ('London Lamar',
      "Schools are supposed to be a safe place to send our kids. It hasn't been safe from gun violence for a long time.",
      'public education'),
     ('Bill Lee',
      'join us in praying for the school, congregation and Nashville community',
      'none of the above'),
     ('Marsha Blackburn',
      "Thank you to the first responders working on site. Please join us in prayer for those affected.",
      'none of the above'),
     ('Bill Hagerty',
      'I am devastated and heartbroken and thanked law enforcement and first responders for their heroic actions',
      'none of the above'),
     ('Mark Green',
      'offering prayer for all students, faculty and staff and the loved ones of those who were killed in this senseless loss of life. He also thanked Vanderbilt`s medical teams and first responders.',
      'none of the above'),
     ('Andy Ogles',
      "We are sending our thoughts and prayers to the families of those lost. As a father of three, I am utterly heartbroken by this senseless act of violence.",
      'none of the above'),
     ('Cameron Sexton',
      "With many colleagues and families impacted by the tragedy today, no bills will be considered during the Senate's floor session tonight.",
      'none of the above')],
    "Grand Divisions: A Tennessee politics podcast": [
        ('John Geer',
         "Trump's tariffs will have 'ripple impact' on Tennessee auto sector",
         'none of the above'),
        ('Jamie McGee',
         'Trump to Sen. Bob Corker: Abandon legislation challenging tariffs; Corker refuses',
         'none of the above'),
        ('Bob Corker',
         'Trump\'s tariffs will have \'ripple impact\' on Tennessee auto sector',
         'none of the above'),
        ('Phil Bredesen',
         'Gun rights and gun control key issue in Tennessee governor\'s race',
         'crime and safety'),
        ('Amanda Hunter',
         'Little sparring as candidates tackle gambling, crime in televised GOP governor race debate',
         'crime and safety'),
        ('Stephanie Teatro',
         "REGISTER TO VOTE\nAs Republican primary nears, Randy Boyd, Diane Black trade barbs in new TV ads\nDiane Black takes swipes at Randy Boyd, Bill Lee in her first attack ad of governor's race\nWhat the voting records reveal about Tennessee's candidates for governor\nAt a small Mexican restaurant in Trump country, a DACA recipient hopes to change minds\nTennessee's Senators, candidates weigh in on Justice Anthony Kennedy retirement",
         'none of the above'),
        ('Tom Lee',
         'Tennessee Elections: Craig Fitzhugh \'all about people\' in governor run\nKarl Dean is pitching pragmatism, not partisanship, in the governor\'s race. Will it fire up Democrats?\nHow Randy Boyd\'s company avoided paying millions in taxes using a \'double Irish\' loophole\nTennessee governor race: Where the candidates stand on the state\'s biggest education issues\nTNReady testing: How the state\'s gubernatorial candidates plan to fix the state\'s online test\nTennessee Rep. Ron Lollar dies at age 69\nJudge: Tennessee can\'t revoke driver\'s licenses from people who can\'t pay court costs',
         'public education'),
        ('Brent Leatherwood',
         'How Tennessee\'s campaign for governor has become the most expensive ever\nBlackburn strategist: ‘Death by 10,000 cuts’ will beat Bredesen in Tennessee US Senate race\nIn latest ad, Beth Harwell equates Republican rivals in race for governor to children\nPotentially illegal text messages attack Randy Boyd, Bill Lee as early voting begins\nBaptist public policy arm hires former Tennessee GOP leader',
         'none of the above'),
        ('Bill Haslam',
         'Tennessean candidate profiles and more\nGov. Bill Haslam weighs clemency for death row offenders, awaits Cyntoia Brown recommendation\nGov. Bill Haslam: At some point ahead, Tennessee can \'come out ahead\' in expanding Medicaid\nTennessee governor\'s race: See who\'s backing the candidates\nNew poll: Bill Lee leads in Tennessee\'s Republican campaign for governor\nEarly voting remains on the rise, still room for many last-minute decisions',
         'none of the above'),
        ('Michael Sullivan',
         'ALL Tennessean election coverage\nInside the Republican campaign for Tennessee governor as election day nears\nHow the Democratic race for Tennessee governor unfolded\nVice President Mike Pence on Tennessee governor\'s race: Diane Black \'has my support\'\nSpending in Tennessee\'s race for governor tops $50M\nHaslam: White House should stay out of Republican primary for Tennessee governor\nTennessee faces an opioids crisis. Here\'s how the candidates for governor plan to tackle it.\nFact check: How Bill Lee\'s company handled layoffs and the recession\nLee Company sends cease-and-desist letter to Black campaign over claims made in mailers',
         'none of the above'),
        ('Kyle Kondik',
         "On the campaign trail: How Marsha Blackburn hopes to win Tennessee's US Senate race\nOn the campaign trail: How Phil Bredesen hopes to win Tennessee's US Senate race\nTennessee Republican Party revealed as group behind mysterious Google ads targeting Phil Bredesen\nKarl Dean proposes plan to allow counties to implement their own gas tax\nKoch network launches $2 million anti-Bredesen ad in Tennessee's US Senate race\nPhil Bredesen fires back at Koch attack with own ad, calls out Marsha Blackburn\nBlackburn agrees to second debate against Bredesen in Tennessee US Senate race\nTennessee is...",
         'none of the above'),
    ]
})

In [43]:
parsed_df = pd.concat(
    [
        pd.DataFrame(quotes, columns=['candidate', 'statement', 'category'])
        .assign(title=title)
        for title, quotes in parsed.items()
    ],
    ignore_index=True,
).join(links_df.set_index('title'), on='title')

# Filter for actual candidates by last name
last_names = [candidate.split()[-1].casefold() for candidate in candidates]
is_candidate = (
    parsed_df.candidate.str.split().str[-1].str.casefold().isin(last_names)
    # exclude names that share partially with actual candidates
    & ~parsed_df.candidate.isin(['Kiki Hall', 'Kenneth A. Turner', 'Zamyra Hall'])
)

# Keep only statements longer than 20 characters
parsed_df = parsed_df[is_candidate & (parsed_df.statement.str.len() > 20)]

print(len(parsed_df))
parsed_df

14


Unnamed: 0,candidate,statement,category,title,link,description,date
9,House Minority Leader Karen Camper,Bill was well respected by members on both sid...,none of the above,Nashville's longtime state Rep. Bill Beck dies...,https://www.tennessean.com/story/news/politics...,"A native of Madison and Whites Creek, Bill Bec...",2023-06-04
17,Metro Council member Zach Young,Talk about a good man who truly cared about im...,none of the above,Nashville's longtime state Rep. Bill Beck dies...,https://www.tennessean.com/story/news/politics...,"A native of Madison and Whites Creek, Bill Bec...",2023-06-04
51,Tennessee House Minority Leader Karen Camper,“This is an unfortunate decision based on poli...,none of the above,Abortion in Tennessee updates: Protesters call...,https://www.tennessean.com/story/news/politics...,Following the Supreme Court's decision to over...,2022-06-24
57,Karen Camper,"""Attempting to nullify the choice of the peopl...",none of the above,Rep. John DeBerry's ouster from the Tennessee ...,https://www.tennessean.com/story/news/politics...,The Tennessee Democratic Party removed state R...,2020-04-13
72,Rep. Karen Camper,The caucus is pursuing a civil case against Ti...,crime and safety,Audit: Former Democratic legislative staffer a...,https://www.tennessean.com/story/news/politics...,Derrick Tibbs resigned after allegedly siphoni...,2019-01-10
73,Rep. Karen Camper,"“As the Democratic leader, I am sorry that thi...",none of the above,Audit: Former Democratic legislative staffer a...,https://www.tennessean.com/story/news/politics...,Derrick Tibbs resigned after allegedly siphoni...,2019-01-10
79,Karen Camper,Citizens of the State of Tennessee deserve to ...,none of the above,Tennessee Democrats call on Glen Casada to ste...,https://www.tennessean.com/story/news/politics...,Tennessee's legislative black caucus is holdin...,2019-05-07
88,Karen Camper,“Citizens of the State of Tennessee deserve to...,none of the above,Can Glen Casada be forced out as Tennessee's S...,https://www.tennessean.com/story/news/politics...,Calls continue to mount for Tennessee House Sp...,2019-05-09
97,Karen Camper,"""Himes’ work would ""ensure accountability with...",crime and safety,"Amid scandal involving speaker's office, House...",https://www.tennessean.com/story/news/politics...,"As part of the role, Doug Himes will serve as ...",2019-05-31
106,Memphis-Shelby County Schools board chair Mich...,Sending public dollars to private school syste...,public education,Tennessee Supreme Court rules in favor of Gov....,https://www.tennessean.com/story/news/educatio...,At the heart of the legal battle was whether t...,2022-05-18
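The last-name filter is deliberately loose, which is why the output above still contains rows such as "Metro Council member Zach Young" and why an explicit exclusion list was needed. The sketch below compares it with a stricter full-name rule. Both helper functions are hypothetical, and `CANDIDATES` is a subset of the notebook's list chosen for illustration.

```python
CANDIDATES = ['Karen Camper', 'Paul Young', 'Reggie Hall']  # subset, for illustration


def matches_last_name(speaker, candidates=CANDIDATES):
    """The notebook's rule: the speaker's last word is a candidate's last name."""
    last_names = {name.split()[-1].casefold() for name in candidates}
    words = speaker.split()
    return bool(words) and words[-1].casefold() in last_names


def matches_full_name(speaker, candidates=CANDIDATES):
    """Stricter rule: a candidate's full name appears somewhere in the speaker."""
    folded = speaker.casefold()
    return any(name.casefold() in folded for name in candidates)


for speaker in ['House Minority Leader Karen Camper',
                'Metro Council member Zach Young',
                'Kiki Hall',
                'Camper']:
    print(speaker, matches_last_name(speaker), matches_full_name(speaker))
```

The trade-off is recall: the full-name rule drops false matches like "Zach Young" and "Kiki Hall" without an exclusion list, but it also drops speakers the model labeled by last name only, such as "Camper".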


In [44]:
with pd.ExcelWriter('tn_statements.xlsx') as writer:
    parsed_df.to_excel(writer, index=False, sheet_name='final')

After this point the files were manually curated and delivered to the client.