## Speeches Project Guide

The ultimate goal of this project is to build a database of speeches by Barack Obama and George W. Bush or any other set of world leaders that you can find speech transcripts for.

The key elements that you need to obtain are the date, location, and text of the speech. In the coding steps below I take you through the process of creating this database for Obama and Bush. The end goal, in the next week or so, is to create a map of speech locations for each leader so that you can click through and see the text of the speech. Your goal should be to obtain at least 20 speeches per person, and limit yourself to 2 or 3 leaders (if you are choosing to investigate other leaders). 

Beyond creating a simple map--with the bare necessities of location, time, and transcript--you may want to investigate the kind of event, the size of the audience, whether these are national or international speeches: any kind of categorizing that will potentially deepen our insight into and interpretation of the speeches. You may also choose to run simple processes/aggregations on the speeches like most-frequent-words, types of phrases, groups of words that you choose to search for in order to bring your own point of view to this exploration of the speeches.

The programmatic journey for this project is relatively straightforward--especially if you choose Obama and Bush--but is quite open as far as how you want to interpret it. 

Below, I have outlined the methods for creating a database of speeches hosted on this site:

https://americanrhetoric.com/barackobamaspeeches.htm

https://americanrhetoric.com/gwbushspeeches.htm

Understand that this site does not contain all speeches and is not an authoritative resource for speeches, but it does provide a potentially useful data set for mapping and exploration.


### STEP 1
Scrape all of the necessary information from:

https://americanrhetoric.com/barackobamaspeeches.htm

and

https://americanrhetoric.com/gwbushspeeches.htm

You should end up with a list of dictionaries, one per speech, along with the links to the PDFs.

# Obama

In [52]:
###Import your scraping libraries

import requests
import time
import random
import lxml
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

In [53]:
response = requests.get("https://americanrhetoric.com/barackobamaspeeches.htm")
page = BeautifulSoup(response.text, 'html.parser')

In [54]:
###Write your scraping code here
def filter_link(link):
    href = link.get('href')
    if href:
        return href.startswith('speeches') and href.endswith('.htm')

links = page.find_all('a')
urls = filter(filter_link, links)
urls = [url.get('href') for url in urls]
print(urls)
len(urls)
#472 speeches in total



472

In [55]:
full_urls = []
for url in urls:
    url = f"http://www.americanrhetoric.com/{url}"
    full_urls.append(url)
full_urls[0]

'http://www.americanrhetoric.com/speeches/barackobama/barackobamairaqwarspeechfederalplaza.htm'

### Now I have all the urls!!!
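
The f-string above works because every matching href on the index page is a relative path. A slightly more defensive sketch, assuming the index page URL is the right base, uses `urljoin` from the standard library, which also copes with hrefs that are already absolute. The same filter idea can pick out PDF links. `BASE_URL`, `absolute_links`, and the `PDFFiles/example.pdf` href are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin

# Assumption: the index page URL is the base for every relative href
BASE_URL = "https://www.americanrhetoric.com/barackobamaspeeches.htm"

def absolute_links(hrefs, suffix):
    """Resolve hrefs against BASE_URL and keep those ending in suffix."""
    resolved = [urljoin(BASE_URL, href) for href in hrefs if href]
    return [url for url in resolved if url.lower().endswith(suffix)]

# Sample hrefs: the first comes from the output above, the second is hypothetical
sample_hrefs = [
    "speeches/barackobama/barackobamairaqwarspeechfederalplaza.htm",
    "PDFFiles/example.pdf",
    None,
]
speech_urls = absolute_links(sample_hrefs, ".htm")
pdf_urls = absolute_links(sample_hrefs, ".pdf")
```

In the notebook you would pass in the hrefs collected with `page.find_all('a')` instead of `sample_hrefs`.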

In [None]:
###Import that database into pandas
###Export/save as CSV

In [67]:
df = pd.DataFrame(full_urls)
df.columns = ['url']
df.head()


Unnamed: 0,url
0,http://www.americanrhetoric.com/speeches/barac...
1,http://www.americanrhetoric.com/speeches/conve...
2,http://www.americanrhetoric.com/speeches/barac...
3,http://www.americanrhetoric.com/speeches/barac...
4,http://www.americanrhetoric.com/speeches/barac...


In [68]:
df.to_csv("obama.csv", index= False)
df = pd.read_csv("obama.csv")

In [71]:
df.shape

(472, 1)

In [16]:
df.head()

Unnamed: 0,url
0,http://www.americanrhetoric.com/speeches/barac...
1,http://www.americanrhetoric.com/speeches/conve...
2,http://www.americanrhetoric.com/speeches/barac...
3,http://www.americanrhetoric.com/speeches/barac...
4,http://www.americanrhetoric.com/speeches/barac...


In [197]:
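speeches = []
for url in full_urls:
    response = requests.get(url)
    #Name the parser explicitly; lxml is what Beautiful Soup picks by default when it is installed
    page_data = BeautifulSoup(response.content, 'lxml')
    #find_all returns an empty list when nothing matches, so no try/except is needed
    speech_data = page_data.find_all("font", {"face": "Verdana"})
    speeches.append(speech_data)
    #Pause briefly between requests to go easy on the server
    time.sleep(random.uniform(0.2, 0.6))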

In [199]:
len(speeches)

472

In [200]:
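clean_speeches = []
for speech in speeches:
    #str(speech) turns the list of tags into one string: "[<font ...>...</font>, <font ...>...]"
    a = re.sub('<[^<]+?>', '', str(speech))
    b = re.sub('\r\n\t\t', '', a)
    #Drop the comma that str() puts between list items after a sentence end
    c = re.sub(r'\.,', '.', b)
    d = re.sub('\r\n', '', c)
    e = re.sub('\n', '', d)
    f = re.sub('\xa0', '', e)
    h = re.sub('\t', '', f)
    #Note: [Book/CDs] is a character class (any one of those characters), not the literal text
    i = re.sub(r'\s,\s([Book/CDs].*)', '', h)
    j = i.replace('\\', '')
    #Append the fully cleaned string j, not the intermediate i
    clean_speeches.append(j)
clean_speeches[34]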


"[Please be seated. Good morning, everybody. Yesterday we talked about the need to jump-start our economy. I speak to you today, mindful that we meet at a moment of great challenge for America, as our credit markets are stressed, and our families are struggling. But as difficult as these times are, I’m confident that we're going to rise to meet this challenge -- if we’re willing to band together and recognize that Wall Street cannot thrive so long as Main Street is struggling; if we’re willing to summon a new spirit of ingenuity and determination; and if Americans of great intellect, broad experience, and good character are willing to serve in our government at its hour of need. Yesterday, I announced four such Americans to help lead the economic team that will advise me as we seek to climb out of this crisis. Today, I'm pleased to announce two other key members of our team:Peter Orszag as Director, and Robert Nabors as Deputy Director of the Office of Management and Budget. Before I e

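The regex chain above deletes line breaks and non-breaking spaces without putting a space in their place, which is why words run together in the output ("our team:Peter Orszag"). Here is a minimal sketch of an alternative, assuming you are happy to collapse all whitespace to single spaces. `strip_tags_and_normalize` is a hypothetical helper, and `sample` is made-up markup built around a phrase from the output above.

```python
import re

def strip_tags_and_normalize(html):
    """Replace tags with spaces, then collapse all whitespace to single spaces."""
    no_tags = re.sub(r"<[^<]+?>", " ", html)
    # str.split() with no arguments splits on any whitespace, including \xa0
    return " ".join(no_tags.split())

# Made-up markup for illustration
sample = (
    '<font face="Verdana">two other key members of our team:\r\n\t\t'
    'Peter\xa0Orszag<br/>\r\nas Director</font>'
)
cleaned = strip_tags_and_normalize(sample)
```

In the loop you would call it as `strip_tags_and_normalize(str(speech))`.
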
In [137]:
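urltest = 'http://www.americanrhetoric.com/speeches/barackobama/barackobamaebenezerbaptist.htm'
response = requests.get(urltest)
page_data = BeautifulSoup(response.content, 'lxml')
dates = []
try:
    #Most pages put the date line in a red Arial <font>; find returns None when it is missing
    speech_date = page_data.find("font", face='Arial', color='#CE0A04')
    dates.append(speech_date)
    if speech_date is None:
        #Fall back to the first MsoNormal element; note this adds a second entry after the None
        speech_date = page_data.find(class_="MsoNormal").text.strip()
        dates.append(speech_date)
except AttributeError:
    #Neither pattern matched on this page
    pass

dates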
        

[None, 'Delivered 20 January 2008, \r\nEbenezer Baptist Church, Atlanta']


In [142]:
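dates = []
for url in full_urls:
    response = requests.get(url)
    page_data = BeautifulSoup(response.content, 'lxml')
    try:
        #Most pages put the date line in a red Arial <font>; find returns None when it is missing
        speech_date = page_data.find("font", face='Arial', color='#CE0A04')
        dates.append(speech_date)
        if speech_date is None:
            #Fall back to the first MsoNormal element; note this adds a second entry after the None
            speech_date = page_data.find(class_="MsoNormal").text.strip()
            dates.append(speech_date)
    except AttributeError:
        #Neither pattern matched on this page
        pass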


In [143]:
dates[18]

In [90]:
#response = requests.get("https://americanrhetoric.com/barackobamaspeeches.htm")
#page = BeautifulSoup(response.text, 'html.parser')

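#Use new names here so the speeches and dates lists built elsewhere are not overwritten
rows = page.find_all('tr')
talks = []
for row in rows:
    #The date cells on the index page are the ones with width="148"
    cells = row.find_all('td', attrs={'width':'148'})
    for cell in cells:
        #One dictionary per speech, so other fields can be added later
        talks.append({'date': cell.text.strip()})
talks[:1]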

In [155]:
clean_dates= []
for date in dates:
    a = re.sub('<[^<]+?>', '', str(date))
    b = re.sub('\r\n\t\t', '', str(a))  
    c = re.sub('\r\n', '', str(b))
    d = re.sub(r'\.,', '.', str(c))
    e = re.sub('\n', '', str(d))
    f = re.sub('\xa0', '', str(e))
    h = re.sub('\t', '', str(f))
    i = str(h).replace('\\', '')
    clean_dates.append(i)


In [156]:
clean_dates

['delivered 2 October 2002, Chicago, Illinois',
 'delivered 27 July 2004, Fleet Center, Boston',
 'delivered 6 January 2005, Washington, D.C.',
 'delivered 4 June 2005, Galesburg, Illinois',
 'delivered 25 October 2005, Washington, D.C.',
 'delivered 15 December 2005',
 'delivered 31 January 2006, Washington, D.C.',
 'delivered 20 July 2006',
 'delivered 16 January 2007',
 'Delivered 10 February 2007, Springfield, Illinois',
 'delivered 4 March 2007, Brown Chapel, Selma, Alabama',
 'delivered 13 March 2007',
 'delivered 21 March 2007, Washington, D.C.',
 'delivered 23 May 2007, Washington, D.C.',
 'delivered 1 August 2007, The Woodrow Wilson Intl. Center for Scholars, Washington, D.C.',
 'delivered 10 November 2007, Veterans Memorial Auditorium, Des Moines, Iowa',
 'delivered 3 January 2008, Des Moines, Iowa',
 'delivered 8 January 2008, Nashua, New Hampshire',
 'None',
 'Delivered 20 January 2008, Ebenezer Baptist Church, Atlanta',
 'delivered 26 January 2008',
 'delivered 18 March 20

In [183]:
import re
real_dates = []
locations = []
for date in clean_dates:
    good_date = re.findall(r"\d.?\s\w+\s\d\d\d\d\b", date, re.IGNORECASE)
    real_dates.append(good_date)
    location = re.findall(r",\s([\w].*)", date, re.IGNORECASE)
    locations.append(location)


[['Chicago, Illinois'], ['Fleet Center, Boston'], ['Washington, D.C.'], ['Galesburg, Illinois'], ['Washington, D.C.'], [], ['Washington, D.C.'], [], [], ['Springfield, Illinois'], ['Brown Chapel, Selma, Alabama'], [], ['Washington, D.C.'], ['Washington, D.C.'], ['The Woodrow Wilson Intl. Center for Scholars, Washington, D.C.'], ['Veterans Memorial Auditorium, Des Moines, Iowa'], ['Des Moines, Iowa'], ['Nashua, New Hampshire'], [], ['Ebenezer Baptist Church, Atlanta'], [], ['Philadelphia, PA'], ['Saint Paul, Minnesota'], ['Ronald Reagan Building, Washington, D.C.'], ['Victory Column'], ['INVESCO Field at Mile High Stadium, Denver, Colorado'], [], ['Grant Park, Chicago, Illinois'], ['Chicago, Illinois'], [], [], [], [], ['Chicago, Illinois'], ['Chicago, Illinois'], ['Chicago, Illinois'], ['Chicago, Illinois'], ['Chicago, Illinois'], ['Chicago, Illinois'], ['Chicago, Illinois'], [], [], [], [], [], [], ['Fairfax, Virginia'], [], ['Washington, D.C.'], ['Washington, D.C.'], ['Washington, D.

In [184]:
clean_dates[7]

'delivered 20 July 2006'
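
Some lines, like the one above, carry a date but no location, so it helps to parse each line into a small dictionary with explicit `None` values. The sketch below is one way to do that, assuming the date always appears as day, full English month name, and four-digit year. `parse_delivery_line` is a hypothetical helper; the location pattern is the same idea as the regex used above.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")
LOCATION_PATTERN = re.compile(r",\s(\w.*)")  # text after the first ", "

def parse_delivery_line(line):
    """Split a 'delivered <date>, <place>' line into a date and a location."""
    date_match = DATE_PATTERN.search(line)
    location_match = LOCATION_PATTERN.search(line)
    parsed_date = None
    if date_match:
        # %B expects a full English month name in the default locale
        date_text = " ".join(date_match.groups())
        parsed_date = datetime.strptime(date_text, "%d %B %Y").date()
    return {
        "date": parsed_date,
        "location": location_match.group(1) if location_match else None,
    }

with_place = parse_delivery_line("delivered 2 October 2002, Chicago, Illinois")
without_place = parse_delivery_line("delivered 20 July 2006")
```

Real `date` objects sort correctly and are easier to filter by year than the raw strings.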

In [186]:
len(locations)

473
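
There are 472 URLs but 473 entries here. The most likely cause is the date loop: when the red Arial font is missing, it appends `None` and then appends the fallback text as well, so that page contributes two entries and everything after it is shifted by one. A minimal sketch of the fix is to choose exactly one value per page before appending. `pick_date_text` is a hypothetical helper, and the tuples below stand in for the two `find` results per page.

```python
def pick_date_text(primary, fallback):
    """Return exactly one value per page: primary if found, else fallback."""
    if primary is not None:
        return primary
    return fallback

# Simulated lookups for three pages: (red Arial font result, MsoNormal result)
lookups = [
    ("delivered 2 October 2002, Chicago, Illinois", None),
    (None, "Delivered 20 January 2008, Ebenezer Baptist Church, Atlanta"),
    (None, None),
]
dates_one_per_page = [pick_date_text(primary, fallback) for primary, fallback in lookups]
```

With one entry per URL, `dates`, `locations`, and `full_urls` stay aligned by index.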

### Now I have all the speeches, all the dates and all the locations!
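
To get from these parallel lists to the list of dictionaries that Step 1 asks for, you can zip them together. This sketch assumes the lists are aligned by index and refuses to run when their lengths differ, since `zip` would otherwise silently drop the extras. `build_records` is a hypothetical helper and the text value is a placeholder.

```python
def build_records(urls, dates, locations, texts):
    """Zip the parallel lists into one dict per speech, refusing misaligned input."""
    lengths = {len(urls), len(dates), len(locations), len(texts)}
    if len(lengths) != 1:
        raise ValueError(f"lists are not the same length: {sorted(lengths)}")
    return [
        {"url": url, "date": date, "location": location, "text": text}
        for url, date, location, text in zip(urls, dates, locations, texts)
    ]

records = build_records(
    ["http://www.americanrhetoric.com/speeches/barackobama/barackobamairaqwarspeechfederalplaza.htm"],
    ["2 October 2002"],
    ["Chicago, Illinois"],
    ["(speech text goes here)"],  # placeholder
)
```

`pd.DataFrame(records)` then gives one row per speech, which can be saved with `to_csv` as in the earlier cell.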

# G.W. Bush

In [26]:
###Import your scraping libraries

import requests
import time
import random
import lxml
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

In [27]:
response = requests.get("https://www.americanrhetoric.com/gwbushspeeches.htm")
page = BeautifulSoup(response.text, 'html.parser')

In [28]:
def filter_link(link):
    href = link.get('href')
    if href:
        return href.startswith('speeches') and href.endswith('.htm')

links = page.find_all('a')
urls = filter(filter_link, links)
urls = [url.get('href') for url in urls]
print(urls)
len(urls)

['speeches/gwbush2000victoryspeech.htm', 'speeches/gwbfirstinaugural.htm', 'speeches/gwbushwhitehousestaffswearingin.htm', 'speeches/gwbushfaithbasedinitiatives.htm', 'speeches/gwbushfirstprayerbreakfast.htm', 'speeches/stateoftheunion2001.htm', 'speeches/gwbushstemcell.htm', 'speeches/gwbush911florida.htm', 'speeches/gwbush911barksdale.htm', 'speeches/gwbush911addresstothenation.htm', 'speeches/gwbush911groundzerobullhorn.htm', 'speeches/gwbush911prayer&memorialaddress.htm', 'speeches/gwbush911radioaddress.htm', 'speeches/gwbush911islamispeace.htm', 'speeches/gwbush911jointsessionspeech.htm', 'speeches/gwbush911intialafghanistanops.htm', 'speeches/gwbush911pentagonmemorial2001.htm', 'speeches/gwbushoct2001newsconfference.htm', 'speeches/gwbushusapatriotact2001.htm', 'speeches/gwbush911unitednations.htm', 'speeches/gwbushcitadelcadets.htm', 'speeches/gwbush911worldremembers.htm', 'speeches/gwbushnochildleftbehindsigning.htm', 'speeches/stateoftheunion2002.htm', 'speeches/gwbushcubainde

88

In [29]:
full_urls = []
for url in urls:
    url = f"http://www.americanrhetoric.com/{url}"
    full_urls.append(url)
full_urls

['http://www.americanrhetoric.com/speeches/gwbush2000victoryspeech.htm',
 'http://www.americanrhetoric.com/speeches/gwbfirstinaugural.htm',
 'http://www.americanrhetoric.com/speeches/gwbushwhitehousestaffswearingin.htm',
 'http://www.americanrhetoric.com/speeches/gwbushfaithbasedinitiatives.htm',
 'http://www.americanrhetoric.com/speeches/gwbushfirstprayerbreakfast.htm',
 'http://www.americanrhetoric.com/speeches/stateoftheunion2001.htm',
 'http://www.americanrhetoric.com/speeches/gwbushstemcell.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911florida.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911barksdale.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911addresstothenation.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911groundzerobullhorn.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911prayer&memorialaddress.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911radioaddress.htm',
 'http://www.americanrhetoric.com/speeches/gwbush911i

In [30]:
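speeches = []
for url in full_urls:
    response = requests.get(url)
    #Name the parser explicitly; lxml is what Beautiful Soup picks by default when it is installed
    page_data = BeautifulSoup(response.content, 'lxml')
    #find_all returns an empty list when nothing matches, so no try/except is needed
    speech_data = page_data.find_all("font", {"face": "Verdana"})
    speeches.append(speech_data)
    #Pause briefly between requests to go easy on the server
    time.sleep(random.uniform(0.2, 0.6))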

In [31]:
speeches[0]

[<font face="Verdana" size="2">Thank you very much.</font>,
 <font face="Verdana" size="2">Good evening, my fellow Americans. I appreciate so very 
 much the opportunity to speak with you tonight.<br/>
 <br/>
 Mr. Speaker, Lieutenant Governor, friends, distinguished guests, our country has 
 been through a long and trying period, with the outcome of the presidential 
 election not finalized for longer than any of us could ever imagine.<br/>
 <br/>
 Vice President Gore and I put our hearts and hopes into our campaigns. We both 
 gave it our all. We shared similar emotions, so I understand how difficult this 
 moment must be for Vice President Gore and his family.<br/>
 <br/>
 He has a distinguished record of service to our country as a congressman, a 
 senator and a vice president.<br/>
 <br/>
 This evening I received a gracious call from the vice president. We agreed to 
 meet early next week in Washington, and we agreed to do our best to heal our 
 country after this hard-fought conte

In [32]:
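clean_speeches = []
for speech in speeches:
    #str(speech) turns the list of tags into one string: "[<font ...>...</font>, <font ...>...]"
    a = re.sub('<[^<]+?>', '', str(speech))
    b = re.sub('\r\n\t\t', '', a)
    #Drop the comma that str() puts between list items after a sentence end
    c = re.sub(r'\.,', '.', b)
    d = re.sub('\r\n', '', c)
    e = re.sub('\n', '', d)
    f = re.sub('\xa0', '', e)
    h = re.sub('\t', '', f)
    #Note: [Book/CDs] is a character class (any one of those characters), not the literal text
    i = re.sub(r'\s,\s([Book/CDs].*)', '', h)
    j = i.replace('\\', '')
    #Append the fully cleaned string j, not the intermediate i
    clean_speeches.append(j)
clean_speeches[34]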

"[President Bush: This is George W Bush, the President of the United States. At this moment, the regime of Saddam Hussein is being removed from power, and a long era of fear and cruelty is ending. American and coalition forces are now operating inside Baghdad -- and we will not stop until Saddam’s corrupt gang is gone. The government of Iraq, and the future of your country, will soon belong to you. The goals of our coalition are clear and limited. We will end a brutal regime, whose aggression and weapons of mass destruction make it a unique threat to the world. Coalition forces will help maintain law and order, so that Iraqis can live in security. We will respect your great religious traditions, whose principles of equality and compassion are essential to Iraq’s future. We will help you build a peaceful and representative government that protects the rights of all citizens. And then our military forces will leave. Iraq will go forward as a unified, independent, and sovereign nation tha

In [33]:
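dates = []
for url in full_urls:
    response = requests.get(url)
    page_data = BeautifulSoup(response.content, 'lxml')
    try:
        #Most pages put the date line in a red Arial <font>; find returns None when it is missing
        speech_date = page_data.find("font", face='Arial', color='#CE0A04')
        dates.append(speech_date)
        if speech_date is None:
            #Fall back to the first MsoNormal element; note this adds a second entry after the None
            speech_date = page_data.find(class_="MsoNormal").text.strip()
            dates.append(speech_date)
    except AttributeError:
        #Neither pattern matched on this page
        pass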

In [35]:
dates[4]

<font color="#CE0A04" face="Arial" size="1">delivered 1 
February 2001, Washington Hilton Hotel, Washington, D.C.</font>

In [36]:
clean_dates= []
for date in dates:
    a = re.sub('<[^<]+?>', '', str(date))
    b = re.sub('\r\n\t\t', '', str(a))  
    c = re.sub('\r\n', '', str(b))
    d = re.sub(r'\.,', '.', str(c))
    e = re.sub('\n', '', str(d))
    f = re.sub('\xa0', '', str(e))
    h = re.sub('\t', '', str(f))
    i = str(h).replace('\\', '')
    clean_dates.append(i)

In [38]:
clean_dates[7]

'delivered 11 September 2001,9:30 A.M. EDT, Sarasota, FL'

In [46]:
import re
real_dates = []
locations = []
for date in clean_dates:
    good_date = re.findall(r"\d.?\s\w+\s\d\d\d\d\b", date, re.IGNORECASE)
    real_dates.append(good_date)
    location = re.findall(r",\s([\w].*)", date, re.IGNORECASE)
    locations.append(location)

In [42]:
real_dates[7]

['11 September 2001']

In [48]:
locations[17]

['White House, Washington, D.C.']

### STEP 2: Download the PDFs and transform them into text. 
See Guided_Project_Options.ipynb for instructions on how to do this. This method requires the **xpdf** tool to be installed on your command line. If you have Homebrew installed, it should be as easy as:

`brew install xpdf`

Note that **xpdf** is not a Python library: it is a command-line tool that you call from Python.
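
As a rough sketch of what calling a command-line tool from Python looks like, the code below builds and runs a `pdftotext` command with `subprocess`. It assumes the `pdftotext` program that ships with xpdf is on your PATH and that you want each text file written next to its PDF; the instructions in Guided_Project_Options.ipynb may do this differently. The helper names and the file name are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

def pdftotext_command(pdf_path):
    """Build the command line; the .txt file sits next to the PDF."""
    pdf_path = Path(pdf_path)
    return ["pdftotext", str(pdf_path), str(pdf_path.with_suffix(".txt"))]

def convert_pdf(pdf_path):
    """Run pdftotext if it is installed and return the path of the text file."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install xpdf first")
    command = pdftotext_command(pdf_path)
    # check=True raises CalledProcessError if the tool exits with an error
    subprocess.run(command, check=True)
    return Path(command[2])

# Hypothetical file name, for illustration only
example_command = pdftotext_command("pdf/example_speech.pdf")
```

Once the PDFs are downloaded, calling `convert_pdf` on each path in a loop would produce one text file per speech.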

In [None]:
#Code away!




### STEP 3: Clean up the text using regular expressions.
The text files extracted from the PDFs are messy. You do not need to get them perfect, but you need to clean them up enough that you can zero in on the speeches themselves.

Open one text file in your own text editor and look at the patterns and what needs to be extracted. Begin with that one file and develop regular expressions that give you a clean speech. Once that works, loop through all of the text files and run the same cleanup on each.

In [None]:
#Import the regular expression library

In [None]:
#Open a text file from your computer
#The with block closes the file for you when it finishes
with open('/Users/YOU/Documents/columbia_syllabus/pdf/15-777_1b82.txt', 'r') as f:
    sample_transcript = f.read()

In [None]:
#Take a look at the text file
sample_transcript

In [None]:
##Create regular expressions
##In most cases you will want to remove
##unnecessary text: re.sub(),
##which is like replace() but with regex,
##will likely be useful
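
Here is a minimal sketch of what a first cleanup pass with `re.sub()` might look like. The patterns are assumptions about typical `pdftotext` output (form feeds at page breaks, lines holding only a page number, runs of spaces), not rules taken from the actual files, so check them against your own text first. `clean_transcript` is a hypothetical helper and `raw` is made-up input.

```python
import re

def clean_transcript(raw):
    """Apply a few generic cleanup rules to text extracted from a PDF."""
    # pdftotext typically marks page breaks with form feed characters
    text = raw.replace("\f", "\n")
    # Drop lines that hold only a number (assumed to be page numbers)
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Made-up input for illustration
raw = "Thank you.\n\n\n\n12\n\fGood   evening.\n"
cleaned = clean_transcript(raw)
```

Adding one rule at a time and rereading the result makes it easier to see which pattern removed too much.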

In [None]:
#Once you have successfully cleaned up one text file
#Try to loop through all of them!!

In [None]:
#Join your dataframe of clean speeches
#With your original data frame
#Join them on the PDF name.
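
Joining on the PDF name needs the same key on both sides, for example the file name without its extension. The sketch below derives that key from either a URL or a local path and attaches the cleaned text to each record. It assumes each record stores its PDF link under a `pdf_url` field; that field, the helper names, and the sample values are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def pdf_key(path_or_url):
    """Return the file name without its extension, e.g. 'example_speech'."""
    return PurePosixPath(urlparse(path_or_url).path).stem

def attach_text(records, texts_by_file):
    """Add a 'text' field to each record whose PDF has a cleaned transcript."""
    by_key = {pdf_key(name): text for name, text in texts_by_file.items()}
    return [
        {**record, "text": by_key.get(pdf_key(record["pdf_url"]))}
        for record in records
    ]

# Hypothetical records and file names, for illustration only
records = [{"pdf_url": "https://www.americanrhetoric.com/PDFFiles/example_speech.pdf"}]
texts = {"pdf/example_speech.txt": "Thank you very much."}
joined = attach_text(records, texts)
```

With two data frames, the same key could go in a shared column and be passed to `pd.merge(..., on=..., how="left")`, so speeches without a transcript are kept with an empty text value.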