# Author Attribution on RNN Generated Texts

Text generated by recurrent neural networks (RNNs) is analyzed using the previously trained author attribution models. Three RNNs were trained: one on the Federalist Papers written by Hamilton, one on the papers written by Madison, and one on Madison's papers combined with the disputed papers. The goal is to see whether the attribution models predict the author whose papers each RNN was trained on.

In [26]:
import pickle
import argparse
import numpy as np
import pandas as pd

First, the texts are read from the TSV file created from the output of the RNN models and separated into one list per training corpus.

In [27]:
def get_texts(generated_texts):
    #each line of the tsv holds an author label, a tab, and one generated text
    with open(generated_texts, "r") as texts:
        hamilton=[]
        madison=[]
        madison_disputed=[]
        for line in texts:
            fields = line.strip().split("\t")
            author=fields[0]
            #'madison1' marks the model trained on Madison plus the disputed papers
            if author == 'hamilton':
                hamilton.append(fields[1])
            elif author == 'madison':
                madison.append(fields[1])
            elif author == 'madison1':
                madison_disputed.append(fields[1])
    return hamilton, madison, madison_disputed
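
To make the expected file layout concrete, here is a minimal sketch of the same parsing step run on in-memory data. The helper name `group_by_author` and the placeholder rows are assumptions made for illustration; only the labels `hamilton`, `madison`, and `madison1` come from the function above. Unlike `get_texts`, this version skips lines that lack a text field.

```python
import io
from collections import defaultdict


def group_by_author(lines):
    """Group tab-separated `label<TAB>text` lines by their label (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        grouped[fields[0]].append(fields[1])
    return grouped


# Placeholder rows, invented for illustration; not real RNN output.
sample = io.StringIO(
    "hamilton\tplaceholder text a\n"
    "madison\tplaceholder text b\n"
    "madison1\tplaceholder text c\n"
    "\n"
)
groups = group_by_author(sample)
print({label: len(texts) for label, texts in groups.items()})
```

Grouping into a dictionary keyed by label is one possible design; the notebook's three separate lists carry the same information.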

The same function words used to train the models are read in.

In [28]:
def load_function_words(resource_path): #reads in words from text file separated by new line
    f_words = []
    with open(resource_path, 'r') as f:
        for line in f:
            if line.strip():
                f_words.append(line.lower().strip())
    return f_words

Each text is then converted into a vector of function-word counts, in the same format as the data used to train the models.

In [29]:
def feature_vecs(test_texts, function_words):
    #create matrix of zeros where row corresponds to paper and column to function word
    #(np.int was removed from recent NumPy releases; the builtin int is equivalent)
    test_features = np.zeros((len(test_texts),len(function_words)), dtype=int)

    #populate matrix with counts for each function word for each generated text
    for i,text in enumerate(test_texts):
        #tokenize once per text rather than once per function word
        text_tokens = text.lower().split()
        for j,function_word in enumerate(function_words):
            count = len([w for w in text_tokens if w == function_word])
            test_features[i,j] = count
    return test_features
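
As a quick check of the format, the sketch below builds the count matrix for two toy texts. The function name `count_features`, the toy sentences, and the three-word list are assumptions made for illustration, not the project data. It uses `collections.Counter` so that each text is tokenized and counted once, and it produces the same whole-token counts as `feature_vecs`.

```python
from collections import Counter

import numpy as np


def count_features(texts, function_words):
    """Return an (n_texts, n_function_words) matrix of whole-token counts."""
    features = np.zeros((len(texts), len(function_words)), dtype=int)
    for i, text in enumerate(texts):
        # Lowercase and split on whitespace, as feature_vecs does.
        counts = Counter(text.lower().split())
        for j, word in enumerate(function_words):
            features[i, j] = counts[word]  # Counter returns 0 for absent words
    return features


# Toy inputs, invented for illustration only.
toy_texts = ["Upon the whole the plan is good", "There is no doubt upon this point"]
toy_words = ["the", "upon", "there"]
toy_features = count_features(toy_texts, toy_words)
print(toy_features)
# [[2 1 0]
#  [0 1 1]]
```

Because tokens are split on whitespace only, a word with attached punctuation (for example `the,`) is not counted as `the`. Whichever tokenization produced the training features has to be applied here as well.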

The previously trained author attribution models are loaded and predictions made for the RNN-generated texts.

In [30]:
def load_model(path):
    #load a pickled model, closing the file once it has been read
    with open(path, 'rb') as f:
        return pickle.load(f)

def print_predicts(title, preds):
    #label 0 is Hamilton; any other label is reported as Madison
    print(title)
    for pred in preds:
        auth = "Hamilton" if pred == 0 else "Madison"
        print(f"Predicted author: {auth}")

def make_predicts(test_features):
    nb_mod=load_model("nb_mod.sav")
    print_predicts("Naive Bayes Predictions", nb_mod.predict(test_features))
    knn_mod=load_model("knn_mod.sav")
    print_predicts("K-Nearest Neighbors Predictions", knn_mod.predict(test_features))
    #the SVM model is given only feature columns 57, 43 and 59
    svm_mod=load_model("svm_mod.sav")
    test_feats=pd.DataFrame(test_features).iloc[:,[57, 43, 59]]
    print_predicts("SVM Predictions", svm_mod.predict(test_feats))

All of the functions defined above are called to produce the predictions on each text by each author attribution model.

In [31]:
def main(generated_texts, function_words):
    hamilton, madison, madison_disputed = get_texts(generated_texts)
    func_words=load_function_words(function_words)
    hamilton_features=feature_vecs(hamilton, func_words)
    madison_features=feature_vecs(madison, func_words)
    madison_disputed_features=feature_vecs(madison_disputed, func_words)
    print("Predictions for text generated by model trained on Hamilton papers")
    make_predicts(hamilton_features)
    print("Predictions for text generated by model trained on Madison papers")
    make_predicts(madison_features)
    print("Predictions for text generated by model trained on Madison and disputed papers")
    make_predicts(madison_disputed_features)

In [32]:
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='predicting author of rnn-generated text')
    parser.add_argument('--path', type=str, default="generated_texts.tsv",
                        help='path to rnn-generated texts')
    parser.add_argument('--function_words_path', type=str, default="function_words.txt",
                        help='path to the list of words to use as features')
    args = parser.parse_known_args()[0]

    main(args.path, args.function_words_path)

Predictions for text generated by model trained on Hamilton papers
Naive Bayes Predictions
Predicted author: Hamilton
Predicted author: Hamilton
Predicted author: Hamilton
K-Nearest Neighbors Predictions
Predicted author: Hamilton
Predicted author: Hamilton
Predicted author: Hamilton
SVM Predictions
Predicted author: Hamilton
Predicted author: Hamilton
Predicted author: Hamilton
Predictions for text generated by model trained on Madison papers
Naive Bayes Predictions
Predicted author: Madison
Predicted author: Madison
Predicted author: Madison
K-Nearest Neighbors Predictions
Predicted author: Hamilton
Predicted author: Hamilton
Predicted author: Hamilton
SVM Predictions
Predicted author: Hamilton
Predicted author: Hamilton
Predicted author: Hamilton
Predictions for text generated by model trained on Madison and disputed papers
Naive Bayes Predictions
Predicted author: Madison
Predicted author: Madison
Predicted author: Madison
K-Nearest Neighbors Predictions
Predicted author: Hamilton


The Naïve Bayes model correctly predicts the author of all 9 generated texts. The K-Nearest Neighbors and SVM models both predict Hamilton every time, so each is correct only on the 3 texts from the Hamilton-trained RNN (33% accuracy). The discrepancy between the Naïve Bayes and SVM models is notable, since their accuracies were much closer on the known papers used in training and on the disputed papers.
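
The accuracy figures follow from simple counting, sketched below. The label lists are reconstructed from the description above rather than read from the models, and they assume that text from the RNN trained on Madison plus the disputed papers is scored as Madison. The `accuracy` helper is a hypothetical name introduced for this example.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# 0 = Hamilton, 1 = Madison, matching the label encoding used for the predictions.
# Three texts per RNN: Hamilton, Madison, Madison plus disputed papers
# (the last group is assumed to count as Madison).
true_labels = [0, 0, 0, 1, 1, 1, 1, 1, 1]
nb_preds = [0, 0, 0, 1, 1, 1, 1, 1, 1]   # Naive Bayes: every text correct
always_hamilton = [0] * 9                # KNN and SVM: Hamilton every time

print(f"Naive Bayes: {accuracy(true_labels, nb_preds):.0%}")
print(f"KNN / SVM:   {accuracy(true_labels, always_hamilton):.0%}")
```

A classifier that always answers Hamilton gets exactly the three Hamilton texts right, which is where the 3 of 9 (33%) comes from.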