# Scrape speeches

This notebook contains code to scrape speeches from a website. The code is currently focused on scraping speeches from Wellesley's website. Minor changes may need to happen based on 

In [1]:
# import Chrome webdriver
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

In [142]:
browser = webdriver.Chrome()

In [43]:
def scrapeSpeech(browserEl, url):
    """
    given a browser and url of a page extracts the speech text from that page
    """
    browserEl.get(url)
    # use the browser passed in (not the global one); this locator form also works in Selenium 4
    speech = browserEl.find_elements("xpath", '//div[@class="field-item even"]')[0]
    return speech.text

In [101]:
def extractYear(url):
    """
    returns the numbers included in a url
    """
    digits = [d for d in url if d.isdigit()]
    return ''.join(digits)
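
To see what `extractYear` yields, the sketch below applies the same digit filter to the 1902 URL used later in this notebook. It also shows a caveat: every digit in the URL is concatenated, so a URL containing other numbers would produce a longer string. The second URL is made up for illustration, and `extract_four_digit_year` is a hypothetical alternative (not part of the original code) that assumes the year is the first standalone four-digit run beginning with 18, 19, or 20.

```python
import re

def extractYear(url):
    """Return all digits in the url, concatenated in order."""
    return ''.join(d for d in url if d.isdigit())

def extract_four_digit_year(url):
    """Hypothetical alternative: first standalone 4-digit run that looks like a year, or ''."""
    match = re.search(r'(?<!\d)(18|19|20)\d{2}(?!\d)', url)
    return match.group(0) if match else ''

url_1902 = 'https://www.wellesley.edu/events/commencement/archives/1902commencement'
print(extractYear(url_1902))               # 1902

# made-up URL with an extra digit, for illustration only
url_extra = 'https://example.edu/v2/commencement/1995'
print(extractYear(url_extra))              # 21995
print(extract_four_digit_year(url_extra))  # 1995
```

For the Wellesley URLs used here the simple digit filter appears to be sufficient; the regex variant may only matter when adapting the code to sites whose URLs contain other numbers.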

In [161]:
def extractSpeech(url, browserEl, uniName, num=1, end=None):
    """
    extracts the speech text found under a particular url and writes it to a file
    whose name includes the university name and the year the speech was given
    """
    speech = scrapeSpeech(browserEl, url)
    year = extractYear(url)
    # drop the first num lines (speaker info, not part of speech); end=None keeps everything up to the last line
    clean_speech = ' '.join(speech.split('\n')[num:end])
    # save speech under file that includes university name + year of speech
    with open(uniName + year + ".txt", "w") as f:
        f.write(clean_speech.replace('"', ''))  # cleans speech text further by removing double quotes

In [139]:
def getSpeeches(browserElement, file, uniName, num=1):
    """
    given a file that lists urls line by line, extracts the speech found under each url
    and saves each speech to its own file, named with the university name (uniName)
    """
    with open(file) as urlFile:
        urls = urlFile.readlines()
    for link in urls:
        link = link.strip()  # remove the trailing newline left by readlines()
        if link:  # skip blank lines
            extractSpeech(link, browserElement, uniName, num)

In [133]:
# extract speeches from wellesley
getSpeeches(browser,'wellesley_urls.txt', 'wellesley')

On Wellesley's pages, the speech text is sometimes preceded by several lines that give the title of the person giving the speech as well as additional information. In most cases only the first line holds this information and the speech itself starts on the second line. Since this is not always the case, we split the urls into those that have information only on the first line and those that have multiple lines of information.
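
The effect of the `num` argument can be illustrated on a small made-up page text. This is only a sketch: the sample strings are invented, and `drop_header_lines` is a hypothetical stand-in for the slicing step inside `extractSpeech`.

```python
def drop_header_lines(text, num=1, end=None):
    """Join the lines of text from index num up to (not including) end."""
    return ' '.join(text.split('\n')[num:end])

# invented sample: one line of speaker information before the speech
one_header = "Jane Doe\nGood morning, graduates.\nThank you."
# invented sample: speaker name plus a title line before the speech
two_headers = "Jane Doe\nProfessor of History\nGood morning, graduates.\nThank you."

print(drop_header_lines(one_header, num=1))   # Good morning, graduates. Thank you.
print(drop_header_lines(two_headers, num=2))  # Good morning, graduates. Thank you.
```

Using `num=1` on the second sample would leave the title line at the start of the saved text, which is why the urls are kept in separate files.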

In [143]:
# extract speeches from wellesley with 2 lines before speech commences
getSpeeches(browser,'wellesley_urls2.txt', 'wellesley',2)

In [144]:
# extract speech from 1902
extractSpeech('https://www.wellesley.edu/events/commencement/archives/1902commencement', browser, 'wellesley', 3)

There is one special case: in 1990 there were two commencement speakers, Barbara Bush and Raisa Gorbachev, and both gave commencement speeches. Both speeches are included on the same page, which makes it a bit harder to extract each speech with the code written above. Instead, by examining the page text carefully, we can split it into the text of each individual speech.

In [145]:
# grab text of the 1990 speeches
speech1990 = scrapeSpeech(browser, 'https://www.wellesley.edu/events/commencement/archives/1990commencement/commencementaddress')

In [149]:
# split speech
speech1990.split('\n')

['Commencement Speakers:',
 '',
 'Barbara Bush',
 'Raisa Gorbachev',
 "Mrs. Bush's Commencement Address to the Wellesley College Class of 1990",
 'Thank you President Keohane, Mrs. Gorbachev, Trustees, Faculty, Parents, Julia Porter, and certainly my new best friend, Christine Bicknell, and, of course, the Class of 1990. I am really thrilled to be here today, and very excited, as I know you all must be, that Mrs. Gorbachev could join us.',
 'These are exciting times. They are exciting in Washington, and I have really looked forward to coming to Wellesley. I thought it was going to be fun -- I never dreamed it would be this much fun.',
 "More than ten years ago when I was invited here to talk about our experiences in the People's Republic of China, I was struck by both the natural beauty of your campus ... and the spirit of this place.",
 'Wellesley, you see, is not just a place ... but an idea ... an experiment in excellence in which diversity is not just tolerated, but is embraced.',


In [151]:
len(speech1990.split('\n'))

34
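
The slice boundaries for the two speeches are read off the printed list by hand. As a rough sketch of how such indices might be located programmatically, the hypothetical helper below (not part of the original code) returns the index of every line that contains a marker phrase. The marker "Commencement Address" comes from the heading line visible in the output above; that the second speech is introduced by a similar heading is an assumption, and the sample lines are an abbreviated, partly invented stand-in for the real page text.

```python
def find_marker_lines(lines, marker):
    """Return the indices of all lines that contain marker."""
    return [i for i, line in enumerate(lines) if marker in line]

# abbreviated, partly invented stand-in for speech1990.split('\n')
sample_lines = [
    'Commencement Speakers:',
    '',
    'Barbara Bush',
    'Raisa Gorbachev',
    "Mrs. Bush's Commencement Address to the Wellesley College Class of 1990",
    'First speech, paragraph one.',
    'First speech, paragraph two.',
    'Heading of the second Commencement Address (invented placeholder)',
    'Second speech, paragraph one.',
]

headings = find_marker_lines(sample_lines, 'Commencement Address')
print(headings)  # [4, 7]

# each speech runs from the line after its heading to the next heading (or the end)
first_speech = ' '.join(sample_lines[headings[0] + 1:headings[1]])
second_speech = ' '.join(sample_lines[headings[1] + 1:])
print(first_speech)
print(second_speech)
```

The resulting indices correspond to the `num` and `end` arguments of `extractSpeech`; on the real page they should still be checked against the printed list.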

In [165]:
# extract and save Barbara Bush's speech
extractSpeech('https://www.wellesley.edu/events/commencement/archives/1990commencement/commencementaddress', 
              browser,'wellesley_bush',num=5,end=26)

In [166]:
# extract and save Raisa Gorbachev's speech
extractSpeech('https://www.wellesley.edu/events/commencement/archives/1990commencement/commencementaddress', 
              browser,'wellesley_gorbachev',num=27)