# Author Attribution Models for the Federalist Papers

This notebook compares various author attribution techniques previously tested on the Federalist Papers, using only the papers themselves and not outside sources of writing. The goal is to reproduce similar results and create models for use in later analyses.

Begin by importing relevant packages.

In [1]:
#import needed packages
import argparse
import re
import csv
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

Project Gutenberg supplies a plain-text document containing all of the Federalist Papers at http://www.gutenberg.org/cache/epub/18/pg18.txt. That document has been added to this repository as "18.txt". Because the file also contains Project Gutenberg's own text and descriptive information about each paper, the text of each paper must be extracted from it.

In [2]:
def separate_texts(fed_papers): #splits project gutenberg data into text files by paper
    with open(fed_papers) as fedpapers:
        papers=fedpapers.read()
        splits = re.split(r'FEDERALIST\.? No\.?', papers)
        for i, split in enumerate(splits):
            #write each piece to its own file; fednum0.txt holds the text before the first paper
            with open(f"fednum{i}.txt", "w") as out:
                out.write(split)
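
To make the splitting step concrete, here is a minimal sketch that applies the same regular expression to a made-up miniature of the Gutenberg file. The sample string is hypothetical; the real file has far more front matter, and paper i lands at index i only if the heading pattern appears nowhere else in the file.

```python
import re

# Hypothetical miniature of the Gutenberg file: a preamble followed by two papers.
sample = (
    "Project Gutenberg preamble\n"
    "FEDERALIST No. 1\n\nGeneral Introduction\n"
    "FEDERALIST. No. 2\n\nConcerning Dangers\n"
)

# Same pattern as separate_texts: the period after each word is optional.
pieces = re.split(r'FEDERALIST\.? No\.?', sample)

# pieces[0] is whatever precedes the first heading; each later piece starts
# with the paper number left over from the heading.
for i, piece in enumerate(pieces):
    print(i, repr(piece.strip()))
```

Because the text before the first heading becomes its own piece, the file numbered 0 is not a paper, which is why later steps start reading at file 1.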

The individual papers are then read in to create a TSV file recording the paper number, the author, and the text. John Jay's papers are disregarded in the analyses, since the disputed papers are known to have been written by either Hamilton or Madison. The papers on which Hamilton and Madison collaborated are also disregarded, as they would only add noise to the attribution models. To isolate just the body text, each paper is split at its opening line and at its signature, both of which are consistent across the papers.

In [3]:
def known_authors():
    #write one row per paper: number, author label, body text
    with open("labels.tsv", "w", newline="") as labels:
        writer=csv.writer(labels, delimiter='\t')
        for i in range(1,87):
            with open(f"fednum{i}.txt") as paper:
                paper=paper.read()
            paper=paper.replace('\n', ' ')
            text=re.split(r'To the People of the State of New York', paper)
            strip=re.split(r'PUBLIUS', text[1])
            if "HAMILTON OR MADISON" in text[0]:
                writer.writerow([i, "Unknown", strip[0]])
            elif "58" in text[0]: #Project Gutenberg classifies this disputed paper as Madison
                writer.writerow([i, "Unknown", strip[0]])
            elif "HAMILTON AND MADISON" in text[0]: #ignore collaborated texts
                continue
            elif "HAMILTON" in text[0]:
                writer.writerow([i, "Hamilton", strip[0]])
            elif "MADISON" in text[0]:
                writer.writerow([i, "Madison", strip[0]])
    #the with block has closed and flushed the file, so it is safe to reopen for reading
    labels=open("labels.tsv", "r")
    return labels
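
The body-extraction step described above can be sketched in isolation. The helper name `extract_body` and the sample paper are hypothetical and not part of this notebook; the sketch also returns `None` when the opening line is missing, a case in which indexing the split result directly would raise an `IndexError`.

```python
import re

def extract_body(raw):
    """Return (header, body) for one paper, or None if the opening line is missing."""
    flat = raw.replace('\n', ' ')
    parts = re.split(r'To the People of the State of New York', flat, maxsplit=1)
    if len(parts) < 2:
        return None
    header, rest = parts
    # Keep only the text before the signature.
    body = rest.split('PUBLIUS', 1)[0]
    return header.strip(), body.strip()

# Hypothetical stand-in for the contents of one fednum file.
sample = (
    " 10\n\nMADISON\n\nTo the People of the State of New York:\n\n"
    "AMONG the numerous advantages.\n\nPUBLIUS\n"
)
header, body = extract_body(sample)
print(header)
print(body)
```

The header keeps the paper number and author line, which is the part the labeling logic inspects, while the body is what ends up in the third TSV column.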

Lists of the texts and their corresponding authors are created.

In [4]:
def get_authortext(labels):
    authors=[]
    papers=[]
    numbers=[]
    for line in labels:
        fields = line.strip().split("\t")
        authors.append(fields[1])
        papers.append(fields[2])
        numbers.append(fields[0])
    return authors, papers, numbers

These lists are then split into training and testing data, which represent the known papers and the disputed papers, respectively.

In [5]:
def split_data(authors,papers, numbers):
    trainauth=[]
    traintext=[]
    test=[]
    testnums=[]
    for i, author in enumerate(authors):
        if author == 'Unknown':
            test.append(papers[i])
            testnums.append(numbers[i])
        else:
            trainauth.append(author)
            traintext.append(papers[i])
    return trainauth, traintext, test, testnums

Mosteller & Wallace (1964) is the original study that used quantitative analysis to determine the author of the twelve disputed Federalist Papers. They outlined a list of 70 function words whose frequencies could be used to discriminate between the authors. A text file containing these words was compiled previously and added to this repository. This list of words is used in the models that follow.

In [6]:
def load_function_words(resource_path): #reads in words from text file separated by new line
    f_words = []
    with open(resource_path, 'r') as f:
        for line in f:
            if line.strip():
                f_words.append(line.lower().strip())
    return f_words

Feature vectors are created for counts of the function words for each text. The authors are assigned a class index.

In [7]:
def feature_vecs(train_texts, train_labels, test, function_words): 
    #create matrix of zeros where row corresponds to paper and column to function word
    #the np.int alias was removed in NumPy 1.24, so use the built-in int
    paper_features = np.zeros((len(train_texts),len(function_words)), dtype=int)
    test_features = np.zeros((len(test),len(function_words)), dtype=int)
    
    #populate matrix with counts for each function word for each paper
    for i,text in enumerate(train_texts):
        text_tokens = text.lower().split() #tokenize once per paper
        for j,function_word in enumerate(function_words):
            paper_features[i,j] = text_tokens.count(function_word)
    
    for i,text in enumerate(test):
        text_tokens = text.lower().split() #tokenize once per paper
        for j,function_word in enumerate(function_words):
            test_features[i,j] = text_tokens.count(function_word)
    
    # load author data into a label array, assigning a class index per unique author
    labels = np.zeros(len(train_texts), dtype=int)
    unique_authors=np.unique(train_labels)
    classindex=range(0,len(unique_authors))
    for i, author in enumerate(train_labels):
        for j,unique in enumerate(unique_authors):
            if author == unique: 
                labels[i]=classindex[j]
    
    return paper_features, labels, unique_authors, test_features
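
As an illustration of what the count matrix looks like, the following sketch builds one for two toy sentences and three words. The function name `count_features` and the toy inputs are assumptions for the example only; it uses `collections.Counter` but keeps the same whitespace tokenization as the function above.

```python
from collections import Counter
import numpy as np

def count_features(texts, function_words):
    """Rows are texts, columns are function words, entries are raw counts."""
    features = np.zeros((len(texts), len(function_words)), dtype=int)
    for i, text in enumerate(texts):
        counts = Counter(text.lower().split())  # tokenize each text once
        features[i] = [counts[w] for w in function_words]
    return features

# Toy inputs, not taken from the Federalist Papers.
texts = ["Upon the whole, the plan is good", "to be or not to be"]
words = ["the", "to", "upon"]
print(count_features(texts, words))
```

Note that splitting on whitespace leaves punctuation attached to tokens, so a token such as "upon," is not counted as "upon". Stripping punctuation before counting would be one possible refinement.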

A Naïve Bayes model is trained on the known papers using the feature vector and corresponding author labels. This model was chosen because the original methods implemented by Mosteller and Wallace (1964) utilized Bayesian statistics and the general model has become one of the most commonly used for class prediction.

In [8]:
def naive_bayes(paper_features, labels, unique_authors, test_features, test_numbers):
    #predict author for test data using naive bayes model trained on training data
    clf = MultinomialNB()
    clf.fit(paper_features, labels)
    score=clf.score(paper_features, labels)
    print(f"Naive Bayes Training Accuracy: {score}")
    preds=clf.predict(test_features)
    print("Naive Bayes Predictions")
    for i, pred in enumerate(preds):
        num=test_numbers[i]
        if pred == 0:
            auth = unique_authors[0]
        else:
            auth = unique_authors[1]
        print(f"Federalist No {num} predicted author: {auth}")
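
Accuracy measured on the same papers the model was trained on tends to be optimistic. One possible complement, sketched here on synthetic counts rather than the real feature matrix, is leave-one-out cross-validation with scikit-learn's `cross_val_score`. The helper name `loo_accuracy` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def loo_accuracy(features, labels):
    """Mean leave-one-out accuracy of a multinomial naive Bayes classifier."""
    scores = cross_val_score(MultinomialNB(), features, labels, cv=LeaveOneOut())
    return scores.mean()

# Synthetic stand-in for paper_features and labels (not real Federalist counts):
# two classes that favor different columns.
rng = np.random.default_rng(0)
class0 = rng.poisson(lam=[8, 1, 3], size=(10, 3))
class1 = rng.poisson(lam=[1, 8, 3], size=(10, 3))
X = np.vstack([class0, class1])
y = np.array([0] * 10 + [1] * 10)

print(f"Leave-one-out accuracy: {loo_accuracy(X, y):.2f}")
```

In the notebook, the same helper could be called with `paper_features` and `labels` to estimate how well the model generalizes to papers it has not seen.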

A K-Nearest Neighbors model is also fit to the data to explore a common similarity-based method. However, because the feature vectors consist only of counts of the 70 function words, rather than a richer representation such as TF-IDF vectors, this model will likely be limited in performance.

In [9]:
def knn(paper_features, labels, unique_authors, test_features, test_numbers):  
    #predict author for test data using k-nearest neighbor model trained on training data     
    knn = KNeighborsClassifier(n_neighbors=2, weights='distance', algorithm = 'brute')
    mod=knn.fit(paper_features, labels)
    score=knn.score(paper_features, labels)
    print(f"KNN Training Accuracy: {score}")
    preds=mod.predict(test_features)
    print("KNN Predictions")
    for i, pred in enumerate(preds):
        num=test_numbers[i]
        if pred == 0:
            auth = unique_authors[0]
        else:
            auth = unique_authors[1]
        print(f"Federalist No {num} predicted author: {auth}")
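
KNN relies on distances between raw count vectors, so a longer paper sits farther from a shorter one even when both use a word at the same rate. A normalization that may help, sketched below, converts counts to rates per 1,000 tokens. The helper `rate_per_thousand` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def rate_per_thousand(counts, texts):
    """Convert raw function-word counts to occurrences per 1,000 tokens."""
    lengths = np.array([len(t.split()) for t in texts], dtype=float)
    lengths[lengths == 0] = 1.0  # avoid dividing by zero for empty texts
    return 1000.0 * np.asarray(counts) / lengths[:, None]

# Toy example: the second text is the first one repeated twice, so the
# raw counts differ but the rates are identical.
texts = ["a b c d", "a b c d a b c d"]
counts = np.array([[1], [2]])
rates = rate_per_thousand(counts, texts)
print(rates)
```

The rates could then be passed to the classifier in place of the raw counts, for both the training and the test matrices.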

Finally, a Support Vector Machine model is trained. The SVM uses only 3 of the 70 words defined by Mosteller & Wallace (1964). This follows Bosch & Smith (1998), who showed that a separating hyperplane defined on just three words (as, our, and upon) attributes all of the disputed papers to Madison, which is the standard, accepted conclusion.

In [10]:
def svm_mod(paper_features, labels, unique_authors, test_features, test_numbers):
    #predict author for test data using support vector machine trained on training data   
    
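    #use features selected by Bosch & Smith (1998)
    #NOTE: the column indices below depend on the word order in function_words.txt;
    #check that they correspond to as, our, and upon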
    train_feats=pd.DataFrame(paper_features)
    train_feats=train_feats.iloc[:,[57, 43, 59]]
    test_feats=pd.DataFrame(test_features)
    test_feats=test_feats.iloc[:,[57, 43, 59]]
    clf = svm.SVC(kernel='linear')
    clf.fit(train_feats, labels)
    score=clf.score(train_feats, labels)
    print(f"SVM Training Accuracy: {score}")
    preds=clf.predict(test_feats)
    print("SVM Predictions")
    for i, pred in enumerate(preds):
        num=test_numbers[i]
        if pred == 0:
            auth = unique_authors[0]
        else:
            auth = unique_authors[1]
        print(f"Federalist No {num} predicted author: {auth}")
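
Selecting columns by position ties the model to the order of the function-word file. A small lookup by word, sketched below with a made-up five-word list, is one way to avoid that dependence. The helper `columns_for` and the toy list are assumptions; the real list would come from `load_function_words`.

```python
def columns_for(words, function_words):
    """Return the column index of each requested word, in the order given."""
    positions = {w: i for i, w in enumerate(function_words)}
    missing = [w for w in words if w not in positions]
    if missing:
        raise ValueError(f"not in function word list: {missing}")
    return [positions[w] for w in words]

# Toy list for illustration; not the actual contents of function_words.txt.
toy_list = ["a", "as", "our", "the", "upon"]
print(columns_for(["as", "our", "upon"], toy_list))
```

The returned indices could be passed to `iloc` in place of the hard-coded positions, and the `ValueError` makes a missing word fail loudly instead of silently selecting the wrong column.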

All of the above functions are called to read in the data, process it, and make predictions on the disputed papers using the three methods.

In [11]:
def main(fed_papers_doc, function_words):
    separate_texts(fed_papers_doc)
    labels=known_authors()
    authors, papers, numbers = get_authortext(labels)
    trainauth, traintext, test, testnums = split_data(authors, papers, numbers)
    func_words=load_function_words(function_words)
    paper_feats, labels, unique_auth, test_feats=feature_vecs(traintext, trainauth, test, func_words)
    naive_bayes(paper_feats, labels, unique_auth, test_feats, testnums)
    knn(paper_feats, labels, unique_auth, test_feats, testnums)
    svm_mod(paper_feats, labels, unique_auth, test_feats, testnums)

In [14]:
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='federalist papers author attribution')
    parser.add_argument('--path', type=str, default="18.txt",
                        help='path to federalist papers text file')
    parser.add_argument('--function_words_path', type=str, default="function_words.txt",
                        help='path to the list of words to use as features')
    args = parser.parse_known_args()[0]

    main(args.path, args.function_words_path)

Naive Bayes Training Accuracy: 1.0
Naive Bayes Predictions
Federalist No 49 predicted author: Madison
Federalist No 50 predicted author: Madison
Federalist No 51 predicted author: Madison
Federalist No 52 predicted author: Madison
Federalist No 53 predicted author: Madison
Federalist No 54 predicted author: Madison
Federalist No 55 predicted author: Hamilton
Federalist No 56 predicted author: Madison
Federalist No 57 predicted author: Madison
Federalist No 58 predicted author: Madison
Federalist No 62 predicted author: Madison
Federalist No 63 predicted author: Madison
KNN Training Accuracy: 1.0
KNN Predictions
Federalist No 49 predicted author: Hamilton
Federalist No 50 predicted author: Hamilton
Federalist No 51 predicted author: Madison
Federalist No 52 predicted author: Hamilton
Federalist No 53 predicted author: Madison
Federalist No 54 predicted author: Madison
Federalist No 55 predicted author: Hamilton
Federalist No 56 predicted author: Hamilton
Federalist No 57 predicted a

If we compare these results against the widely accepted conclusion that all twelve papers were written by Madison (the result of Mosteller & Wallace (1964), which subsequent studies sought to replicate), the Naïve Bayes model appears to produce the highest accuracy on both the training and the testing data, attributing only one paper (No. 55) to Hamilton. The SVM is a close second, with slightly lower training accuracy and 9 of the 12 disputed papers correctly predicted. The K-Nearest Neighbors model classifies all of the training data correctly, but performs no better than chance (50%) on the testing data. Its perfect training score is expected regardless of model quality: with distance weighting, each training paper is its own nearest neighbor at distance zero. The poor test performance could be due to the marked difference in corpus size between the Hamilton papers and the Madison papers (i.e., 51 vs. 14).
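
The comparison against the accepted attribution can be made explicit with a few lines of code. The sketch below copies the Naïve Bayes predictions from the output above and computes the fraction that agree with Madison. The helper name `agreement` is an assumption for this example.

```python
def agreement(predictions, accepted="Madison"):
    """Fraction of predictions that match the accepted author."""
    return sum(p == accepted for p in predictions) / len(predictions)

# Naive Bayes predictions for the twelve disputed papers, copied from the
# notebook output.
nb_preds = {
    49: "Madison", 50: "Madison", 51: "Madison", 52: "Madison",
    53: "Madison", 54: "Madison", 55: "Hamilton", 56: "Madison",
    57: "Madison", 58: "Madison", 62: "Madison", 63: "Madison",
}
score = agreement(list(nb_preds.values()))
print(f"Naive Bayes agreement: {score:.3f}")
```

This prints 0.917, that is, 11 of the 12 disputed papers. The same helper could be applied to the predictions of the other two models if their functions returned their predictions instead of only printing them.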