In [24]:
import requests
import re
from textstat.textstat import textstat
from bs4 import BeautifulSoup

In this project, I measured the reading ease, or in this context "comprehension ease," of the inaugural addresses of U.S. Presidents using the Flesch-Kincaid Readability Scale. News companies, educators, and data scientists use this measure to evaluate the accessibility of a text. So the data would be easy to interpret for someone unfamiliar with Flesch-Kincaid, I used the Flesch-Kincaid Grade Level measure, which corresponds to grade levels in the American schooling system. In general, a lower grade level is not a mark of shame but rather a sign that the text is accessible to the public, which is a good thing. A very low grade level, however, may point to a text that is overly simple.

I used the following site as a source for most of the speeches:

http://avalon.law.yale.edu/subject_menus/inaug.asp

It only goes up to 2009, President Obama's first inaugural address, so I knew I would have to find other sources for the 2013 and 2017 speeches.

Meanwhile, I used `BeautifulSoup` and `requests` to figure out how to follow the appropriate links on the main page.

In [25]:
page = requests.get("http://avalon.law.yale.edu/subject_menus/inaug.asp")
soup = BeautifulSoup(page.content, 'html.parser')
type(soup.find_all('a'))

bs4.element.ResultSet

I didn't see a benefit to keeping the `ResultSet` object, so I converted each link to a string and collected the strings in a list.

In [26]:
link_obj = soup.find_all('a')
link_html_all = []
for link in link_obj:
    link_html_all.append(str(link))
link_html_all

['<a href="../default.asp">Avalon Home</a>',
 '<a href="../subject_menus/major.asp">Document<br> Collections</br></a>',
 '<a href="../subject_menus/ancient.asp">Ancient <br>4000bce - 399</br></a>',
 '<a href="../subject_menus/medieval.asp">Medieval <br>400 - 1399</br></a>',
 '<a href="../subject_menus/15th.asp">15<sup>th</sup> Century <br>1400 - 1499</br></a>',
 '<a href="../subject_menus/16th.asp">16<sup>th</sup> Century <br>1500 - 1599</br></a>',
 '<a href="../subject_menus/17th.asp">17<sup>th</sup> Century <br>1600 - 1699</br></a>',
 '<a href="../subject_menus/18th.asp">18<sup>th</sup> Century <br>1700 - 1799</br></a>',
 '<a href="../subject_menus/19th.asp">19<sup>th</sup> Century <br>1800 - 1899</br></a>',
 '<a href="../subject_menus/20th.asp">20<sup>th</sup> Century <br>1900 - 1999</br></a>',
 '<a href="../subject_menus/21st.asp">21<sup>st</sup> Century <br>2000 - </br></a>',
 '<a href="../18th_century/wash1.asp">1789</a>',
 '<a href="../18th_century/wash2.asp">1793</a>',
 '<a hre

Here, I saw that I had a lot more links than I wanted. I noticed that the text of each link I wanted was a year, that is, a four-digit sequence. I used a regex to keep only those links.

I realized later on that using the year of the speech as a key would be a good way to keep the data in order. In redoing the project for this notebook, this seems like the natural place to start doing that.

In [27]:
speech_link_html = []
num_re = re.compile("(?<=>)[0-9]{4}(?=<)")
for item in link_html_all:
    if num_re.findall(item):
        speech_link_html.append({'id': num_re.findall(item)[0], 'html': item})
speech_link_html

[{'html': '<a href="../18th_century/wash1.asp">1789</a>', 'id': '1789'},
 {'html': '<a href="../18th_century/wash2.asp">1793</a>', 'id': '1793'},
 {'html': '<a href="../18th_century/adams.asp">1797</a>', 'id': '1797'},
 {'html': '<a href="../19th_century/jefinau1.asp">1801</a>', 'id': '1801'},
 {'html': '<a href="../19th_century/jefinau2.asp">1805</a>', 'id': '1805'},
 {'html': '<a href="../19th_century/madison1.asp">1809</a>', 'id': '1809'},
 {'html': '<a href="../19th_century/madison2.asp">1813</a>', 'id': '1813'},
 {'html': '<a href="../19th_century/monroe1.asp">1817</a>', 'id': '1817'},
 {'html': '<a href="../19th_century/monroe2.asp">1821</a>', 'id': '1821'},
 {'html': '<a href="../19th_century/qadams.asp">1825</a>', 'id': '1825'},
 {'html': '<a href="../19th_century/jackson1.asp">1829</a>', 'id': '1829'},
 {'html': '<a href="../19th_century/jackson2.asp">1833</a>', 'id': '1833'},
 {'html': '<a href="../19th_century/vanburen.asp">1837</a>', 'id': '1837'},
 {'html': '<a href="../19

I used another regex to isolate the url path from the HTML it was packaged in.

In [28]:
speech_links = []
link_re = re.compile(r"(?<=\.\.).*\.asp")  # raw string, so \. is not treated as an invalid string escape
for item in speech_link_html:
    speech_links.append({'id': item['id'], 'link': link_re.findall(item['html'])[0]})
speech_links

[{'id': '1789', 'link': '/18th_century/wash1.asp'},
 {'id': '1793', 'link': '/18th_century/wash2.asp'},
 {'id': '1797', 'link': '/18th_century/adams.asp'},
 {'id': '1801', 'link': '/19th_century/jefinau1.asp'},
 {'id': '1805', 'link': '/19th_century/jefinau2.asp'},
 {'id': '1809', 'link': '/19th_century/madison1.asp'},
 {'id': '1813', 'link': '/19th_century/madison2.asp'},
 {'id': '1817', 'link': '/19th_century/monroe1.asp'},
 {'id': '1821', 'link': '/19th_century/monroe2.asp'},
 {'id': '1825', 'link': '/19th_century/qadams.asp'},
 {'id': '1829', 'link': '/19th_century/jackson1.asp'},
 {'id': '1833', 'link': '/19th_century/jackson2.asp'},
 {'id': '1837', 'link': '/19th_century/vanburen.asp'},
 {'id': '1841', 'link': '/19th_century/harrison.asp'},
 {'id': '1845', 'link': '/19th_century/polk.asp'},
 {'id': '1849', 'link': '/19th_century/taylor.asp'},
 {'id': '1853', 'link': '/19th_century/pierce.asp'},
 {'id': '1857', 'link': '/19th_century/buchanan.asp'},
 {'id': '1861', 'link': '/19th_
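
The two regex passes above can also be collapsed into one. The following is a minimal sketch, not the code this notebook ran: `ANCHOR_RE`, `INDEX_URL`, and `parse_speech_links` are hypothetical names, and it assumes every speech anchor has exactly the shape shown in the output above. It captures the relative path and the year together, then resolves the `../` prefix with the standard library's `urljoin`, so each entry holds a full URL rather than a path.

```python
import re
from urllib.parse import urljoin

# Hypothetical names for illustration; the sample anchors are copied from the output above.
INDEX_URL = "http://avalon.law.yale.edu/subject_menus/inaug.asp"
ANCHOR_RE = re.compile(r'<a href="([^"]+)">(\d{4})</a>')

def parse_speech_links(anchor_html, base_url=INDEX_URL):
    """Keep anchors whose text is a four-digit year and resolve their hrefs."""
    links = []
    for html in anchor_html:
        match = ANCHOR_RE.fullmatch(html)
        if match:
            href, year = match.groups()
            # urljoin resolves "../" relative to the index page's URL.
            links.append({'id': year, 'link': urljoin(base_url, href)})
    return links

sample = [
    '<a href="../default.asp">Avalon Home</a>',
    '<a href="../18th_century/wash1.asp">1789</a>',
    '<a href="../18th_century/wash2.asp">1793</a>',
]
parsed = parse_speech_links(sample)
print(parsed)
```

Because the links are already absolute in this variant, later cells would pass `item['link']` straight to `requests.get` instead of prepending the host name.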

When I tried to get the text from these pages, I realized that some of them were dead.

In [29]:
for item in reversed(speech_links):
    url = "http://avalon.law.yale.edu/" + item['link']
    page = requests.get(url)
    if (page.status_code == 404):
        speech_links.remove(item)
        print(item)
speech_links

{'link': '/20th_century/coolidge.asp', 'id': '1925'}
{'link': '/19th_century/garfield.asp', 'id': '1881'}
{'link': '/19th_century/buchanan.asp', 'id': '1857'}
{'link': '/19th_century/vanburen.asp', 'id': '1837'}


[{'id': '1789', 'link': '/18th_century/wash1.asp'},
 {'id': '1793', 'link': '/18th_century/wash2.asp'},
 {'id': '1797', 'link': '/18th_century/adams.asp'},
 {'id': '1801', 'link': '/19th_century/jefinau1.asp'},
 {'id': '1805', 'link': '/19th_century/jefinau2.asp'},
 {'id': '1809', 'link': '/19th_century/madison1.asp'},
 {'id': '1813', 'link': '/19th_century/madison2.asp'},
 {'id': '1817', 'link': '/19th_century/monroe1.asp'},
 {'id': '1821', 'link': '/19th_century/monroe2.asp'},
 {'id': '1825', 'link': '/19th_century/qadams.asp'},
 {'id': '1829', 'link': '/19th_century/jackson1.asp'},
 {'id': '1833', 'link': '/19th_century/jackson2.asp'},
 {'id': '1841', 'link': '/19th_century/harrison.asp'},
 {'id': '1845', 'link': '/19th_century/polk.asp'},
 {'id': '1849', 'link': '/19th_century/taylor.asp'},
 {'id': '1853', 'link': '/19th_century/pierce.asp'},
 {'id': '1861', 'link': '/19th_century/lincoln1.asp'},
 {'id': '1865', 'link': '/19th_century/lincoln2.asp'},
 {'id': '1869', 'link': '/19th_

These were the 1925, 1881, 1857, and 1837 speeches. I removed them from the array and added them to the list of speeches I would have to get from another source.
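
The cell above removes items from `speech_links` while looping over it, which is only safe because it iterates in reverse. A sketch of an alternative that builds two new lists and leaves the input untouched is shown here. `split_by_status` and `get_status` are hypothetical names; in the notebook, `get_status` could wrap `requests.get(...).status_code`, but this example uses a dictionary lookup as a stand-in so that it runs without network access.

```python
def split_by_status(links, get_status):
    """Return (live, dead) lists without mutating `links`.

    `get_status` is any callable that maps a link path to an HTTP status code.
    """
    live, dead = [], []
    for item in links:
        if get_status(item['link']) == 404:
            dead.append(item)
        else:
            live.append(item)
    return live, dead

# Stand-in for real HTTP requests: a fixed table of status codes.
fake_statuses = {
    '/18th_century/wash1.asp': 200,
    '/19th_century/vanburen.asp': 404,
}
sample_links = [
    {'id': '1789', 'link': '/18th_century/wash1.asp'},
    {'id': '1837', 'link': '/19th_century/vanburen.asp'},
]
live, dead = split_by_status(sample_links, fake_statuses.get)
print(live)
print(dead)
```

The `dead` list then doubles as the list of speeches that have to come from another source.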

Now that I had the pages, I needed to get the text that I wanted from each page. First, I looked at the HTML of the first page to see what I was working with.

In [30]:
my_url = "http://avalon.law.yale.edu/18th_century/wash2.asp"
content = requests.get(my_url).content
content

b'<HTML>\r\n<HEAD>\r\n<TITLE>The Avalon Project : Second Inaugural Address of George Washington</TITLE>\r\n<link rel="stylesheet" type="text/css"  href="../css/site.css">\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<META NAME="DC.Title" CONTENT="Inaugural Addresses of the Presidents of the United States : from George Washington 1789 to George Bush 1989">\r\n<META NAME="DC.Title.Alternative" CONTENT="Senate document (United States. Congress. Senate); 101-10">\r\n<META NAME="DC.Creator.CorporateName" CONTENT="United States. Congress. Joint Congressional Committee on Inaugural Ceremonies">\r\n<META NAME="DC.Title.Alternate" CONTENT="The Papers of the Presidents of the United States">\r\n<META NAME="DC.Title.Alternate" CONTENT="The Inaugural Addresses of the Presidents of the United States">\r\n<META NAME="DC.Creator.PersonalName" CONTENT="">\r\n<META NAME="DC.Creator.CoporateName" CONTENT="">\r\n<META NAME="DC.Subject" CONTENT="Law">\r\n<META NAME="DC.Subject" CONTENT="History">\r\n<META NAME="DC.Subj

So, first off, all those tags needed to go. I used a regex to get just the text. In retrospect, there was probably a way to do that with `BeautifulSoup` (its `get_text()` method, for example), but anyway this is what I did. It was convenient to have the text broken up into a list of lines.

In [31]:
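# str(content) renders the bytes with line breaks as literal "\r\n" text,
# so excluding backslashes skips any chunk that contains an escape sequence.
text_re = re.compile(r"(?<=>)[^<>\\]*(?=<)")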
results = text_re.findall(str(content))
results = list(filter(lambda x: x != '', results))
results = [x.strip() for x in results]
results

['The Avalon Project : Second Inaugural Address of George Washington',
 'Avalon Home',
 'Document',
 'Collections',
 'Ancient',
 '4000bce - 399',
 'Medieval',
 '400 - 1399',
 '15',
 'th',
 'Century',
 '1400 - 1499',
 '16',
 'th',
 'Century',
 '1500 - 1599',
 '17',
 'th',
 'Century',
 '1600 - 1699',
 '18',
 'th',
 'Century',
 '1700 - 1799',
 '19',
 'th',
 'Century',
 '1800 - 1899',
 '20',
 'th',
 'Century',
 '1900 - 1999',
 '21',
 'st',
 'Century',
 '2000 -',
 'THE CITY OF PHILADELPHIA',
 'MONDAY, MARCH 4, 1793',
 'Fellow Citizens:',
 'I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.',
 'Previous to the execution of any official act of the President the',
 'Constitution',
 'requires an',
 'oath of office',
 '. This oath I am n
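
For comparison, here is a sketch of the same tag-stripping step done with an HTML parser instead of a regex. It is not what this notebook ran. To stay self-contained it uses the standard library's `html.parser` rather than `BeautifulSoup`, and `TextCollector` and `sample_html` are made-up names and data. Like the regex, it produces one list item per run of text between tags.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text chunks that appear between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip chunks that are only whitespace or line breaks
            self.chunks.append(text)

# Made-up fragment shaped like the speech pages.
sample_html = (
    "<html><body><h3>MONDAY, MARCH 4, 1793</h3><p>Fellow Citizens:</p>\r\n"
    "<p>I am again <i>called</i> upon.</p></body></html>"
)
collector = TextCollector()
collector.feed(sample_html)
collector.close()
print(collector.chunks)
```

A parser also keeps text that contains line breaks or apostrophes, which the backslash-excluding regex drops.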

For each speech, the text I wanted sat between a list item with the date and a list item with something like "Inaugural Speeches Page." Variations in the date formats and in the wording of the last item caused a couple of confusing errors, which I handled with the `isDate` function and a compound `if` statement, respectively. Because the number of list items above and below the speech varied from page to page, it helped to have one function that finds the start and end indices of the content and another that uses those indices to return the text.

In [32]:
def isDate(my_str):
    months = ["january", "march", "april"] # this works because inaugural addresses only happened in these months
    for month in months:
        if month in my_str.lower():
            return True
    return False

def indices(textlist):
    start = 0
    end = 0
    for i in range(len(textlist)):
        if start == 0 and isDate(textlist[i]):
            start = i + 1
        if textlist[i] == "Inaugural Speeches Page" or 'Inauguration Speeches Page' in textlist[i]:
            end = i
    return (start, end)

def generate_text(textlist):
    bounds = indices(textlist)
    text = ""
    for i in range(bounds[0], bounds[1]):
        text += textlist[i] + " "
    return text
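
To see how the start and end markers work, here is a small self-contained sketch run on a made-up list. `is_date`, `speech_bounds`, and `sample` are hypothetical stand-ins for the functions and data above. Unlike `indices`, it uses a substring check for both closing markers. As above, a page with no markers yields `(0, 0)` and therefore an empty speech.

```python
def is_date(line):
    """True if the line mentions a month in which inaugurations were held."""
    return any(month in line.lower() for month in ("january", "march", "april"))

def speech_bounds(lines):
    """Return (start, end): the item after the first date, and the last closing marker."""
    start = end = 0
    for i, line in enumerate(lines):
        if start == 0 and is_date(line):
            start = i + 1
        if "Inaugural Speeches Page" in line or "Inauguration Speeches Page" in line:
            end = i
    return start, end

# Made-up list shaped like the `results` output above.
sample = [
    'Avalon Home',
    'THE CITY OF PHILADELPHIA',
    'MONDAY, MARCH 4, 1793',
    'Fellow Citizens:',
    'I am again called upon by the voice of my country.',
    'Inaugural Speeches Page',
    'Avalon Home',
]
start, end = speech_bounds(sample)
speech = " ".join(sample[start:end])
print(start, end)
print(speech)
```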

Finally, I combined the work done above with the Flesch-Kincaid Grade Level test in a function that could work with the `speech_links` array.

Here, I'm going to take a moment to talk about how the Flesch-Kincaid Grade Level test works.

The test is based on a couple of simple ideas: 1) texts with longer sentences are harder to read, and 2) texts with longer words are harder to read. Sentence length is measured as the average number of words per sentence, and word length is measured as the average number of syllables per word.
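
The standard Flesch-Kincaid Grade Level formula combines those two averages as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it to counts supplied by hand. `fk_grade_from_counts` is a hypothetical helper, not part of `textstat`, which does its own sentence and syllable counting and may round differently.

```python
def fk_grade_from_counts(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# 100 words in 5 sentences with 150 syllables:
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91
example_grade = fk_grade_from_counts(total_words=100, total_sentences=5, total_syllables=150)
print(round(example_grade, 2))
```

Doubling the average sentence length to 40 words adds 7.8 grade levels, which is why the long sentences of the early addresses score so high.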

I got all my information about Flesch-Kincaid from this source: https://readable.io/

I originally built my own implementation of the algorithm, but I then decided to use `textstat`'s implementation because it had error handling. 

In [33]:
def url_to_grade(url):
    content = requests.get(url).content
    text_re = re.compile(r"(?<=>)[^<>\\]*(?=<)")
    results = text_re.findall(str(content))
    results = list(filter(lambda x: x != '', results))
    results = [x.strip() for x in results]
    text = generate_text(results)
    return textstat.flesch_kincaid_grade(text)

grades = []
for item in speech_links:
    url = "http://avalon.law.yale.edu/" + item['link']
    grades.append({'id': item['id'], 'grade': url_to_grade(url)})
grades

[{'grade': 27.5, 'id': '1789'},
 {'grade': 16.5, 'id': '1793'},
 {'grade': 27.4, 'id': '1797'},
 {'grade': 27.2, 'id': '1801'},
 {'grade': 22.9, 'id': '1805'},
 {'grade': 18.9, 'id': '1809'},
 {'grade': 17.6, 'id': '1813'},
 {'grade': 13.2, 'id': '1817'},
 {'grade': 15.3, 'id': '1821'},
 {'grade': 17.3, 'id': '1825'},
 {'grade': 20.3, 'id': '1829'},
 {'grade': 18.6, 'id': '1833'},
 {'grade': 18.3, 'id': '1841'},
 {'grade': 15.8, 'id': '1845'},
 {'grade': 22.6, 'id': '1849'},
 {'grade': 16.0, 'id': '1853'},
 {'grade': 12.6, 'id': '1861'},
 {'grade': 12.1, 'id': '1865'},
 {'grade': 13.3, 'id': '1869'},
 {'grade': 13.2, 'id': '1873'},
 {'grade': 19.6, 'id': '1877'},
 {'grade': 18.7, 'id': '1885'},
 {'grade': 18.0, 'id': '1893'},
 {'grade': 14.4, 'id': '1889'},
 {'grade': 15.1, 'id': '1897'},
 {'grade': 11.7, 'id': '1901'},
 {'grade': 13.6, 'id': '1905'},
 {'grade': 16.5, 'id': '1909'},
 {'grade': 12.0, 'id': '1913'},
 {'grade': 11.0, 'id': '1917'},
 {'grade': 11.8, 'id': '1921'},
 {'grade

There was one more thing that I nearly forgot! I had to add in the inaugural addresses that weren't on the website. I got those from the following websites and saved them in text files in the same folder as this Jupyter Notebook.

http://www.presidency.ucsb.edu/ws/index.php?pid=25812

https://www.whitehouse.gov/inaugural-address

https://obamawhitehouse.archives.gov/the-press-office/2013/01/21/inaugural-address-president-barack-obama

http://www.presidency.ucsb.edu/ws/index.php?pid=25817

http://www.presidency.ucsb.edu/ws/index.php?pid=25823

http://www.presidency.ucsb.edu/ws/index.php?pid=25834

In [34]:
filenames = ["1837", "1857", "1881", "1925", "2013", "2017"]
for filename in filenames:
    # The with block closes each file once it has been read.
    with open(filename + ".txt", 'r') as file:
        grade = textstat.flesch_kincaid_grade(file.read())
    grades.append({'id': filename, 'grade': grade})
# Four-digit year strings sort chronologically; this will probably be convenient later.
grades = sorted(grades, key=lambda k: k['id'])
grades

[{'grade': 27.5, 'id': '1789'},
 {'grade': 16.5, 'id': '1793'},
 {'grade': 27.4, 'id': '1797'},
 {'grade': 27.2, 'id': '1801'},
 {'grade': 22.9, 'id': '1805'},
 {'grade': 18.9, 'id': '1809'},
 {'grade': 17.6, 'id': '1813'},
 {'grade': 13.2, 'id': '1817'},
 {'grade': 15.3, 'id': '1821'},
 {'grade': 17.3, 'id': '1825'},
 {'grade': 20.3, 'id': '1829'},
 {'grade': 18.6, 'id': '1833'},
 {'grade': 18.5, 'id': '1837'},
 {'grade': 18.3, 'id': '1841'},
 {'grade': 15.8, 'id': '1845'},
 {'grade': 22.6, 'id': '1849'},
 {'grade': 16.0, 'id': '1853'},
 {'grade': 15.7, 'id': '1857'},
 {'grade': 12.6, 'id': '1861'},
 {'grade': 12.1, 'id': '1865'},
 {'grade': 13.3, 'id': '1869'},
 {'grade': 13.2, 'id': '1873'},
 {'grade': 19.6, 'id': '1877'},
 {'grade': 13.7, 'id': '1881'},
 {'grade': 18.7, 'id': '1885'},
 {'grade': 14.4, 'id': '1889'},
 {'grade': 18.0, 'id': '1893'},
 {'grade': 15.1, 'id': '1897'},
 {'grade': 11.7, 'id': '1901'},
 {'grade': 13.6, 'id': '1905'},
 {'grade': 16.5, 'id': '1909'},
 {'grade

Now, I'm going to write this final product out as a CSV to use for my data visualization.

In [35]:
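with open('inaugural_grades.txt', 'w') as f:
    f.write('year,grade\n')  # no trailing space, so the column name is exactly "grade"
    for item in grades:
        f.write(str(item['id']) + "," + str(item['grade']) + "\n")
    # No explicit close is needed; the with block closes the file.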
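
The same export can be done with the standard library's `csv` module, which handles quoting and delimiters. This is a minimal sketch under the assumption that each row is a dictionary with `id` and `grade` keys, as in `grades`. `write_grades` and `sample_grades` are hypothetical names, and the example writes to an in-memory buffer rather than a file.

```python
import csv
import io

def write_grades(rows, handle):
    """Write a year,grade header followed by one line per speech."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["year", "grade"])
    for row in rows:
        writer.writerow([row['id'], row['grade']])

# Two rows copied from the output above.
sample_grades = [{'id': '1789', 'grade': 27.5}, {'id': '1793', 'grade': 16.5}]
buffer = io.StringIO()
write_grades(sample_grades, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle opened with `open('inaugural_grades.txt', 'w', newline='')`, as the `csv` documentation recommends.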