# Using Classification to Determine Authorship
I read about the idea of using machine learning to determine the true author of the disputed Federalist Papers, so I wanted to go ahead and give it a try!

In [68]:
import urllib2
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [19]:
url = 'http://www.let.rug.nl/usa/documents/1786-1800/the-federalist-papers/'
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
soup = BeautifulSoup(urllib2.urlopen(req), 'lxml')
content = soup.find('div', {'id':'content'})
links = content.find_all('a')

In [27]:
partial_urls = [i['href'] for i in links]
print partial_urls[:5]

['introduction.php', 'the-federalist-1.php', 'the-federalist-2.php', 'the-federalist-3.php', 'the-federalist-4.php']
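
The hrefs are relative to the index page, so each one has to be combined with the base URL before it can be fetched. Here is a minimal sketch of doing that with the standard library's `urljoin` rather than string concatenation. It is written for Python 3, where `urljoin` lives in `urllib.parse` (in Python 2 it is in `urlparse`), and the names `BASE_URL` and `absolute_urls` are illustrative, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.let.rug.nl/usa/documents/1786-1800/the-federalist-papers/'

def absolute_urls(base, hrefs):
    """Resolve each relative href against the base URL."""
    return [urljoin(base, href) for href in hrefs]

# Two of the hrefs printed above
sample = ['introduction.php', 'the-federalist-1.php']
full_urls = absolute_urls(BASE_URL, sample)
print(full_urls[1])
```

Because the base URL ends in a slash, each href is appended to it. `urljoin` would also cope if the site ever used absolute or root-relative links.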


In [35]:
number = []
content_ = []
base_url = 'http://www.let.rug.nl/usa/documents/1786-1800/the-federalist-papers/'
for count, i in enumerate(partial_urls):
    # Fetch and parse each page, keeping its position in the link list
    req = urllib2.Request(base_url + i, headers={'User-Agent': "Magic Browser"})
    soup = BeautifulSoup(urllib2.urlopen(req), 'lxml')
    number.append(count)
    content_.append(soup)

In [36]:
number[-5:]

[82, 83, 84, 85, 86]

In [65]:
# Looking at how to pull in the text I want for features

content_[5].find('div', {'id':'content'}).text.replace('\n', ' ').replace('\r', '').replace('\\', '')

u" The Federalist 5    The Same Subject Continued  (Concerning Dangers From Foreign Force and Influence)  Jay for the Independent Journal.    To the People of the State of New York:  QUEEN ANNE, in her letter of the 1st July, 1706, to the Scotch   Parliament, makes some observations on the importance of the UNION   then forming between England and Scotland, which merit our attention.   I shall present the public with one or two extracts from it: ``An   entire and perfect union will be the solid foundation of lasting   peace: It will secure your religion, liberty, and property; remove   the animosities amongst yourselves, and the jealousies and   differences betwixt our two kingdoms. It must increase your   strength, riches, and trade; and by this union the whole island,   being joined in affection and free from all apprehensions of   different interest, will be enabled to resist all its enemies.''   ``We most earnestly recommend to you calmness and unanimity in this   great and weighty

In [171]:
# Get the titles
titles = []
for i in content_:
    titles.append(i.find('div', {'id':'content'}).find('h1').text.replace('The Federalist ', '').replace('70a','70').replace('70b','70'))
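
The chained `replace` calls handle the two suffixed headings by name. A slightly more general sketch pulls the first run of digits out of the heading with a regular expression. This assumes headings look like `The Federalist 5` or `The Federalist 70a`; the helper name `essay_number` and the `'Introduction'` heading are hypothetical.

```python
import re

def essay_number(heading):
    """Return the essay number found in a heading such as 'The Federalist 70a'.

    Returns None when the heading contains no digits at all.
    """
    match = re.search(r'(\d+)', heading)
    return int(match.group(1)) if match else None

print(essay_number('The Federalist 5'))
print(essay_number('The Federalist 70a'))
```

Returning an `int` up front would also remove the need for the repeated `int(i)` conversions later on.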

In [76]:
# Exploratory: skip child nodes until the first one containing '<p>', then keep the rest
text_stuff = []
flag = False
for i in content_[5]:
    if not flag:
        if '<p>' in i:
            flag = True
        else:
            continue
    text_stuff.append(i)

In [183]:
# Pages 27, 40 and 44 are special-cased because the markup on the scraped site is inconsistent

rough_text = []
for i in number:
    if i in (27, 40, 44):
        # On these pages every <p> is kept
        rough_text.append(content_[i].find_all('p'))
    else:
        # Everywhere else the first <p> is dropped
        rough_text.append(content_[i].find_all('p')[1:])
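
Hard-coding page indexes works but is brittle. Since each essay checked further down opens with the same salutation, one alternative is to keep everything from the paragraph containing it onward. This is only a sketch under that assumption: it works on a list of paragraph strings (for example `[p.text for p in soup.find_all('p')]`), the helper name `essay_paragraphs` is made up for illustration, and the sample page below is a hand-written stand-in rather than real scraped markup.

```python
SALUTATION = 'To the People of the State of New York'

def essay_paragraphs(paragraphs, marker=SALUTATION):
    """Return the paragraphs from the one containing the marker onward.

    Falls back to the full list when the marker is not found, so pages
    without the salutation are kept whole.
    """
    for pos, par in enumerate(paragraphs):
        if marker in par:
            return paragraphs[pos:]
    return list(paragraphs)

# Hypothetical paragraph split for one page
page = ['Jay for the Independent Journal.',
        'To the People of the State of New York:',
        'QUEEN ANNE, in her letter of the 1st July, 1706']
print(essay_paragraphs(page))
```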

In [184]:
text = []
for i in rough_text:
    pars = []
    for j in i:
        pars.append(j.text.replace('\n', ' ').replace('\r', '').replace('\\', ''))
    joined = ' '.join(pars)
    text.append(joined)
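
The chained `replace` calls leave runs of spaces in the text, which is visible in the outputs. If that matters for feature extraction, one option is to split on whitespace and rejoin. This is a small sketch; the helper name `clean_text` is an assumption, not something defined in the notebook.

```python
def clean_text(raw):
    """Drop backslashes, then collapse every run of whitespace to one space."""
    return ' '.join(raw.replace('\\', '').split())

sample = '  QUEEN ANNE, in her letter\r\n   of the 1st July, 1706,  to the Scotch   Parliament'
print(clean_text(sample))
```

Unlike the original chain, this treats a bare `\r` as a word separator and trims leading and trailing whitespace.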

In [185]:
# Looking at how the data looks

text[0][:1000]

u"    The delegates who signed the drafted Constitution in  Philadelphia on September 16, 1787, stipulated that it would  take effect only after approval by ratifying conventions in nine  of 13 states. Although not stipulated, a negative vote by either  of two key states-New York or Virginia-could destroy the  whole enterprise because of their size and power. Both New  York and Virginia delegates were sharply divided in their  opinions of the Constitution. And New York governor George  Clinton had already made clear his opposition.       One would imagine that a work so highly praised and so  influential as The Federalist Papers was the ripe fruit of a long  lifetime's experience in scholarship and government. In fact, it  was largely the product of two young men: Alexander  Hamilton of New York, age 32, and James Madison of Virginia, age  36, who wrote in great haste-sometimes as many as four  essays in a single week. An older scholar, John Jay, later named  as first chief justice of 

In [186]:
text_content = pd.DataFrame(text)
essay_no = pd.DataFrame(titles)
df = pd.concat([essay_no,text_content], axis=1)
df.columns = ['essay_no', 'text']

In [187]:
df = df[1:]  # drop row 0, the site's introduction page, which is not one of the essays
df.head()

   essay_no  text
1  1         To the People of the State of New York: A...
2  2         To the People of the State of New York: WHEN...
3  3         To the People of the State of New York: I...
4  4         To the People of the State of New York: MY ...
5  5         To the People of the State of New York: QUE...


In [188]:
# Check to make sure the correct text was obtained
for idx,i in enumerate(df['text']):
    print idx,i[:100]

0     To the People of the State of New York:  AFTER an unequivocal experience of the inefficiency of 
1   To the People of the State of New York: WHEN the people of America reflect that they are now calle
2     To the People of the State of New York:  IT IS not a new observation that the people of any coun
3   To the People of the State of New York:  MY LAST paper assigned several reasons why the safety of 
4   To the People of the State of New York:  QUEEN ANNE, in her letter of the 1st July, 1706, to the S
5   To the People of the State of New York:  THE three last numbers of this paper have been dedicated 
6     To the People of the State of New York:  IT IS sometimes asked, with an air of seeming triumph, 
7     To the People of the State of New York:  ASSUMING it therefore as an established truth that the 
8     To the People of the State of New York:  A FIRM Union will be of the utmost moment to the peace 
9   To the People of the State of New York:  AMONG the numerous advantage

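The consensus among scholars is that Jay wrote five essays (2-5 and 64), Madison wrote fourteen (10, 14, and 37-48), and Hamilton wrote fifty-one. Three (18-20) are thought to be joint efforts, and the remaining twelve (49-58, 62, and 63) are the ones whose disputed authorship is the source of interest. The scraped site appears to split essay 70 across two pages (70a and 70b), both of which were labeled 70 when the titles were extracted, so Hamilton's count in this DataFrame should come out one higher, at fifty-two.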
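
As a quick sanity check on those numbers, the attribution can be written down as sets of essay numbers and tallied over essays 1 through 85. This is an illustrative sketch: the names `expand`, `KNOWN`, and `author_of` are assumptions, and anything not listed is assigned to Hamilton.

```python
def expand(*parts):
    """Turn numbers and inclusive (start, end) pairs into a set of essay numbers."""
    numbers = set()
    for part in parts:
        if isinstance(part, tuple):
            numbers.update(range(part[0], part[1] + 1))
        else:
            numbers.add(part)
    return numbers

KNOWN = {
    'Jay': expand((2, 5), 64),
    'Madison': expand(10, 14, (37, 48)),
    'Joint': expand((18, 20)),
    'Disputed': expand((49, 58), 62, 63),
}

def author_of(essay_no):
    """Look up the attributed author; anything unlisted defaults to Hamilton."""
    for name, numbers in KNOWN.items():
        if essay_no in numbers:
            return name
    return 'Hamilton'

# Tally the attribution over the 85 essays
counts = {}
for n in range(1, 86):
    name = author_of(n)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

The tallies come to 5, 14, 3, 12, and 51, which add up to 85.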

In [196]:
author = []
for i in df['essay_no']:
    n = int(i)  # essay numbers were scraped as strings
    if 2 <= n <= 5 or n == 64:
        author.append('Jay')
    elif n == 10 or n == 14 or 37 <= n <= 48:
        author.append('Madison')
    elif 18 <= n <= 20:
        author.append('Joint')
    elif 49 <= n <= 58 or n == 62 or n == 63:
        author.append('Disputed')
    else:
        author.append('Hamilton')

In [204]:
# Reuse df's index (1 through 86) instead of hard-coding the range
author = pd.DataFrame(author, index=df.index, columns=['author'])

In [206]:
df = pd.concat([df, author], axis=1)

In [207]:
df['author'].value_counts()

Hamilton    52
Madison     14
Disputed    12
Jay          5
Joint        3
Name: author, dtype: int64

In [208]:
df.head()

   essay_no  text                                             author
1  1         To the People of the State of New York: A...     Hamilton
2  2         To the People of the State of New York: WHEN...  Jay
3  3         To the People of the State of New York: I...     Jay
4  4         To the People of the State of New York: MY ...   Jay
5  5         To the People of the State of New York: QUE...   Jay
