# LSTM Text Generation for the Federalist Papers

This notebook explores how well a recurrent neural network built with two long short-term memory (LSTM) layers can produce text mimicking the Federalist Papers. The model is trained on four corpora: all of the papers, just the papers written by Hamilton, just the papers written by Madison, and the papers written by Madison combined with the 12 disputed papers now attributed to Madison.

In [2]:
#import needed packages
import argparse
import re
import csv
import random
import numpy as np
from sklearn.utils import shuffle
import keras.backend as K
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Embedding
from keras.callbacks import ModelCheckpoint, EarlyStopping

First the papers must be read in. The number, author, and text are extracted from each text file. The text is reduced to the body of the paper by splitting on the opening line and on the signature that ends every paper. Punctuation is then separated from the preceding word with a space, so that once the text is tokenized the recurrent neural network learns punctuation as separate tokens and words are not embedded multiple times (i.e. with and without punctuation attached). A tab-delimited text file is created containing the paper number, author, and cleaned body text.

In [3]:
def getall_papers(file):
    #newline='' lets the csv module control line endings
    with open(file, "w", newline='') as all_papers:
        writer=csv.writer(all_papers, delimiter='\t')
        for i in range(1,87):
            with open(f"federalistpapers/papers/fednum{i}.txt") as paper:
                paper=paper.read()
            paper=paper.replace('\n', ' ')
            #text[0] is the header (number, author); text[1] is everything after the opening line
            text=re.split(r'To the People of the State of New York', paper)
            #drop the signature and anything after it
            strip=re.split(r'PUBLIUS', text[1])
            #add spaces between words and punctuation
            strip[0]=re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', strip[0])
            if "HAMILTON OR MADISON" in text[0]:
                writer.writerow([i, "Unknown", strip[0]])
            elif "58" in text[0]: #Project Gutenberg classifies this disputed paper as Madison
                writer.writerow([i, "Unknown", strip[0]])
            elif "HAMILTON AND MADISON" in text[0]: #label collaborations separately so the per-author lists skip them
                writer.writerow([i, "HamiltonandMadison", strip[0]])
            elif "HAMILTON" in text[0]:
                writer.writerow([i, "Hamilton", strip[0]])
            elif "MADISON" in text[0]:
                writer.writerow([i, "Madison", strip[0]])
            elif "JAY" in text[0]:
                writer.writerow([i, "Jay", strip[0]])
    #the write handle is closed (and flushed) above before the file is reopened for reading
    all_papers=open(file, "r")
    return all_papers
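
The punctuation handling can be checked in isolation. The sketch below wraps the two regular expressions used in this notebook (the one that adds spaces during cleaning and the one that removes them after generation) in two helper functions. The helper names `space_punctuation` and `unspace_punctuation` and the sample sentence are hypothetical and exist only for illustration.

```python
import re

def space_punctuation(text):
    # Insert a space before . , ; ? when the preceding character is
    # neither whitespace nor a digit (same pattern as in getall_papers).
    return re.sub(r'(?<=[^\s0-9])(?=[.,;?])', ' ', text)

def unspace_punctuation(text):
    # Inverse step applied to generated text: drop whitespace before punctuation.
    return re.sub(r'\s+([.,;?])', r'\1', text)

sample = "We, the people; in 1787. Why?"  # made-up sentence, not from the papers
spaced = space_punctuation(sample)
print(spaced)                       # punctuation becomes its own token
print(unspace_punctuation(spaced))  # original spacing restored
```

Because digits are excluded from the lookbehind, a number such as `1787.` or `3.5` is left untouched, so a year followed by a period stays a single token.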

From the tsv file, parallel lists of paper numbers, authors, and paper texts are created.

In [4]:
def get_authortext(all_papers):
    authors=[]
    papers=[]
    numbers=[]
    for line in all_papers:
        fields = line.strip().split("\t")
        authors.append(fields[1])
        papers.append(fields[2])
        numbers.append(fields[0])
    return authors, papers, numbers
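
As an alternative sketch, the same three lists could be built with `csv.reader`, which reverses any quoting that `csv.writer` applied when a body contained a quote character. The function name `read_rows` and the two sample rows are hypothetical, and the round trip uses an in-memory buffer rather than the real tsv file.

```python
import csv
import io

def read_rows(handle):
    # csv.reader undoes the quoting added by csv.writer, so each field
    # comes back exactly as it was written.
    authors, papers, numbers = [], [], []
    for number, author, body in csv.reader(handle, delimiter='\t'):
        numbers.append(number)
        authors.append(author)
        papers.append(body)
    return authors, papers, numbers

# Round trip through an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter='\t')
writer.writerow([1, "Hamilton", 'He said "union" , then paused .'])  # made-up row
writer.writerow([10, "Madison", "A plain body of text ."])           # made-up row
buffer.seek(0)
authors, papers, numbers = read_rows(buffer)
print(authors, numbers)
print(papers[0])
```

Note that the paper numbers come back as strings, just as they do when splitting each line on tabs.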

These lists are then used to build four lists of paper texts, one string per paper: Hamilton's papers, Madison's papers, Madison's papers plus the disputed papers, and all of the papers.

In [5]:
def split_authors(authors, papers):
    hamilton=[]
    madison=[]
    madisonpred=[]
    allpapers=[]
    for i,author in enumerate(authors):
        if author == 'Hamilton':
            hamilton.append(papers[i].strip('":')) #removing leading ": from text
        elif author == 'Madison':
            madison.append(papers[i].strip('":'))
            madisonpred.append(papers[i].strip('":'))
        elif author == 'Unknown':
            madisonpred.append(papers[i].strip('":'))
        allpapers.append(papers[i].strip('":'))
    return hamilton, madison, madisonpred, allpapers

The recurrent neural network function takes a list of text strings and prepares them for training. The preprocessing includes tokenizing and indexing the words, cutting the text into sequences of 50 words, and building the feature and label matrices, where each label is the word that follows a 50-word sequence. The features and labels are then split into training and validation sets (80% and 20%), so that the quality of the model can be assessed on held-out data. The network consists of an embedding layer that learns word embeddings for the vocabulary of the text, followed by two long short-term memory layers, with dropout to reduce overfitting. Training stops once the validation loss stops decreasing. Accuracy and perplexity are then calculated on the validation set. Finally, the model is used to generate three 500-word strings of text. These strings are printed and then returned as a list by the function.
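
The sequence and label construction described above can be sketched on a toy example. The helper name `make_windows`, the window of 3 (the notebook uses 50), and the toy token ids are assumptions made for illustration.

```python
import numpy as np

def make_windows(sequence, window, num_words):
    # Each run of `window` token ids is one feature row; the id that
    # immediately follows it is the label for that row.
    features, labels = [], []
    for i in range(window, len(sequence)):
        features.append(sequence[i - window:i])
        labels.append(sequence[i])
    features = np.array(features)
    labels = np.array(labels, dtype=int)
    # One-hot encode the labels: one row per example, one column per token id
    one_hot = np.zeros((len(labels), num_words), dtype=np.int8)
    one_hot[np.arange(len(labels)), labels] = 1
    return features, one_hot

toy = [1, 2, 3, 4, 5, 6]  # made-up token ids
features, one_hot = make_windows(toy, window=3, num_words=7)
print(features)                # rows [1 2 3], [2 3 4], [3 4 5]
print(one_hot.argmax(axis=1))  # labels 4, 5, 6
```

A sequence of length n yields n minus the window size training examples, which is why papers shorter than the window contribute nothing.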

The following code is adapted from https://github.com/WillKoehrsen/recurrent-neural-networks/blob/master/notebooks/Deep%20Dive%20into%20Recurrent%20Neural%20Networks.ipynb which provides an extensive outline on how to set up an LSTM recurrent neural network using Keras to generate text. 

In [None]:
def rnn_model(papers):
    #create and fit tokenizer on formatted papers    
    tokenizer=Tokenizer(num_words=None, filters='"#$%&()*+/<=>?@[\\]^_`{|}~\t\n', lower=False, 
                        split=' ', char_level=False)
    tokenizer.fit_on_texts(papers)
    
    #create lookup and reverse lookup dictionaries
    word_idx = tokenizer.word_index
    idx_word = tokenizer.index_word
    num_words=len(word_idx) + 1
    
    #convert text to sequences of integers & make features and labels
    sequences=tokenizer.texts_to_sequences(papers)
    features=[]
    labels=[]
    for sequence in sequences:
        for i in range(50, len(sequence)):
            extract = sequence[i - 50:i + 1]
            features.append(extract[:-1])
            labels.append(extract[-1])  
    features=np.array(features)
    
    #one hot encode labels
    labels_arr=np.zeros((len(labels), num_words), dtype=np.int8) #np.int was removed from recent NumPy releases
    for i, word_index in enumerate(labels):
        labels_arr[i, word_index] = 1
        
    #split into training and validation
    features, labels_arr=shuffle(features, labels_arr)
    idx=int(len(labels)*.8)
    train_x = features[:idx]
    train_y=labels_arr[:idx]
    valid_x=features[idx:]
    valid_y=labels_arr[idx:]
    
    #define perplexity to assess quality of models
    def perplexity(y_true, y_pred):
        #labels are one-hot encoded, so use the non-sparse cross-entropy
        cross_entropy = K.categorical_crossentropy(y_true, y_pred)
        #perplexity is the exponential of the mean cross-entropy
        return K.exp(K.mean(cross_entropy))

    #set up model
    model = Sequential()
    model.add(Embedding(num_words, 100, input_length=50, trainable=True))
    model.add(LSTM(256, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
    #the last LSTM layer returns only its final output so there is one prediction per sequence
    model.add(LSTM(256, return_sequences=False, dropout=0.1, recurrent_dropout=0.1))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_words, activation='softmax'))
    #categorical (not sparse) cross-entropy matches the one-hot labels
    model.compile(optimizer='adam',loss='categorical_crossentropy', metrics=['accuracy', perplexity])
    
    #define callbacks to stop training once validation loss stops decreasing
    def make_callbacks(model):
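        callbacks = [EarlyStopping(monitor='val_loss', patience=5)]
        #note: '{papers}model.h5' is a plain string, not an f-string, so every call
        #writes to the same file and a later model overwrites an earlier one
        callbacks.append(ModelCheckpoint('{papers}model.h5',save_best_only=True, 
                        save_weights_only=False)) #saves best model as loss decreases 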
        return callbacks
    callbacks = make_callbacks(model)
    
    #train model
    model.fit(train_x, train_y, epochs=150, batch_size=2048, verbose=0, callbacks=callbacks, validation_data=(valid_x, valid_y))
 
    #evaluate model
    r = model.evaluate(valid_x, valid_y, batch_size=2048, verbose=1)
    valid_accuracy=r[1]
    valid_perplexity=r[2]
    print("Evaluating Model on Validation Data")
    print(f"Accuracy: {round(100 * valid_accuracy, 2)}%")
    print(f"Perplexity: {valid_perplexity}")
    #make predictions
    seq = random.choice(sequences) #choose a random sequence
    seed_idx = random.randint(0, len(seq) - 60) # choose a random starting point
    end_idx = seed_idx + 50 #ending index for seed
    
    #generate three 500-word texts
    gen_list = []
    for n in range(3):
        seed = seq[seed_idx:end_idx] #extract the seed sequence
        generated = seed[:] + ['#']
        for i in range(500): #keep adding new words
            # Make a prediction from the seed
            preds = model.predict(np.array(seed).reshape(1, -1))[0].astype(np.float64)
            preds = np.log(preds)/0.75 #diversify
            exp_preds = np.exp(preds)
            preds = exp_preds / sum(exp_preds) #softmax
            #choose next word
            probas = np.random.multinomial(1, preds, 1)[0]
            next_idx = np.argmax(probas)
            #new seed adds on old word
            seed = seed[1:] + [next_idx]
            generated.append(next_idx)
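        #convert indices back to words; '#' marks where the seed ends and the generated text begins
        gen = []
        for i in generated:
            gen.append(idx_word.get(i, '#')) #default avoids None, which would break ' '.join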
        #remove spaces between punctuation added earlier
        gen=re.sub(r'\s+([.,;?])', r'\1', ' '.join(gen))
        print(gen)
        gen_list.append(gen)
    return gen_list
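
The "diversify" step in the generation loop is temperature sampling: the predicted probabilities are rescaled in log space, renormalized, and then sampled. A minimal NumPy sketch follows, assuming all probabilities are non-zero. The function name `sample_with_temperature`, the `rng` argument, and the toy distribution are hypothetical; the default of 0.75 mirrors the divisor used above.

```python
import numpy as np

def sample_with_temperature(probs, temperature=0.75, rng=None):
    # Returns a sampled index and the rescaled distribution.
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Dividing log-probabilities by a temperature below 1 sharpens the
    # distribution; a temperature above 1 flattens it.
    logits = np.log(probs) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    scaled /= scaled.sum()                  # renormalize so the values sum to 1
    return rng.choice(len(scaled), p=scaled), scaled

probs = [0.5, 0.3, 0.2]  # made-up next-word distribution
idx, sharp = sample_with_temperature(probs, temperature=0.75)
_, flat = sample_with_temperature(probs, temperature=1.5)
_, same = sample_with_temperature(probs, temperature=1.0)
print(sharp)  # most likely word gains probability mass
print(flat)   # probability mass spreads toward less likely words
```

With a temperature of 1 the distribution is unchanged, so the value 0.75 used in this notebook makes the generated text slightly more conservative than sampling directly from the model's output.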

A function writes all of the text generated by the recurrent neural networks to a tsv file.

In [None]:
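def save_generated_text(texts1, texts2, texts3):
    #labels follow the argument order used in main; the label strings themselves are illustrative
    labeled=[("Hamilton", texts1), ("Madison", texts2), ("MadisonDisputed", texts3)]
    #the with block closes the file so every row is flushed to disk
    with open("generated_texts.tsv", "w", newline='') as outfile:
        writer=csv.writer(outfile, delimiter='\t')
        for label, texts in labeled:
            for text in texts:
                writer.writerow([label, text]) #one row per generated text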

All of the above functions are called to read in, clean, and separate the text. Recurrent neural networks are trained on the entirety of the Federalist Papers, only the papers Hamilton wrote, only the papers originally attributed to Madison, and the papers originally attributed to Madison together with the disputed papers. The texts generated by the three author-specific models are saved to a tsv file for later use. Each training run also checkpoints its best model.

In [None]:
def main(file):
    papers=getall_papers(file)
    authors, papers, numbers = get_authortext(papers)
    hamilton, madison, madisonall, allpapers = split_authors(authors, papers)
    rnn_model(allpapers)
    hamilton_generated=rnn_model(hamilton)
    madison_generated=rnn_model(madison)
    madison1_generated=rnn_model(madisonall)
    save_generated_text(hamilton_generated, madison_generated, madison1_generated)

In [None]:
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='federalist papers recurrent neural network')
    parser.add_argument('--path', type=str, default="allpapers.tsv",
                        help='path to federalist papers dataset')
    args = parser.parse_known_args()[0]

    main(args.path)