# Programming 6 continued: Visualization of Burrows' Delta

In [None]:
import csv, sys, os
import spacy
from collections import Counter
import numpy, scipy.stats

from matplotlib import pyplot

nlp = spacy.load("en_core_web_sm")

I made some small edits to the metadata file, mostly to remove extra newlines and regularize author names. I saved a separate copy to avoid overwriting any changes you have made locally.

In [None]:
documents = []

###                                      note name change ↓ 
with open("../data/FederalistPapers/metadata_federalist_fixed.csv", encoding="utf-8") as reader:
    csv_reader = csv.DictReader(reader)
    for row in csv_reader:
        ## convert string to int
        row["Number"] = int(row["Number"])

        ## build the path to the essay text from its zero-padded number
        row["Filename"] = "../data/FederalistPapers/federalist_{:02d}.txt".format(row["Number"])
        if os.path.exists(row["Filename"]):
            documents.append(row)

In [None]:
for document in documents:
    try:
        with open(document["Filename"], encoding="utf-8") as reader:
            print(document["Title"])

            lines = []
            for line in reader:
                lines.append(line.rstrip())

            text = " ".join(lines)
            document["Spacy"] = nlp(text)
    except Exception as error:
        ## report what went wrong instead of silently swallowing every error
        print("Problem with {}: {}".format(document["Number"], error))

In [None]:
sentence_lengths = []

for document in documents:
    ## iterate over .sents: looping over the Doc itself yields tokens, not sentences
    lengths = [len(sentence) for sentence in document["Spacy"].sents]
    sentence_lengths.append(lengths)

author_sentence_lengths = {}
for doc_id, document in enumerate(documents):
    author = document["Author"]
    if author not in author_sentence_lengths:
        author_sentence_lengths[author] = []
    author_sentence_lengths[author].extend(sentence_lengths[doc_id])
    


In [None]:
all_counts = Counter()

for document in documents:
    doc_counter = Counter([token.text for token in document["Spacy"]])
    all_counts += doc_counter   
    document["TokenCounts"] = doc_counter

all_counts.most_common(15)

In [None]:
num_top_words = 150
top_words = [w for w, c in all_counts.most_common(num_top_words)]

doc_word_counts = numpy.zeros( (len(documents), num_top_words) )

for doc_id, document in enumerate(documents):
    for word_id, word in enumerate(top_words):
        doc_word_counts[doc_id,word_id] = document["TokenCounts"][word]

doc_word_counts[:5,:10]

In [None]:
## note: these lengths count only the top words, not every token in the essay
doc_lengths = doc_word_counts.sum(axis=1)
print(doc_lengths.shape)
doc_word_probs = doc_word_counts / doc_lengths[:,numpy.newaxis]
doc_word_probs[:5,:10]

In [None]:
word_means = doc_word_probs.mean(axis=0)
word_sds = doc_word_probs.std(axis=0)

doc_word_zscores = (doc_word_probs - word_means[numpy.newaxis,:]) / word_sds[numpy.newaxis,:]  ## subtract means, divide by std
doc_word_zscores[:5,:10]

### Part : Individual words

These plots show, for each of the ten most frequent words, the distribution of per-essay z-scores for Madison and for Hamilton. Be as specific as you can in each written answer. Feel free to add any additional measurements if you find them useful in making your arguments.

* Which words are used most differently between authors?
* Do you think these differences are consistent?
* Are the scores within a specific numeric range? Why or why not?
* Was this result surprising? 
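
One possible additional measurement is to rank words by the gap between two authors' mean z-scores. The following is only a minimal sketch: the function name `rank_words_by_author_gap` and the toy data are made up for illustration and are not part of the assignment code. It assumes a documents-by-words z-score array laid out like `doc_word_zscores`.

```python
import numpy

def rank_words_by_author_gap(zscores, authors, words, author_a, author_b):
    """Sort words by the absolute gap between two authors' mean z-scores."""
    zscores = numpy.asarray(zscores, dtype=float)
    authors = numpy.asarray(authors)
    ## mean z-score of each word over each author's documents
    mean_a = zscores[authors == author_a].mean(axis=0)
    mean_b = zscores[authors == author_b].mean(axis=0)
    gaps = mean_a - mean_b
    ## largest absolute gap first
    order = numpy.argsort(-numpy.abs(gaps))
    return [(words[i], float(gaps[i])) for i in order]

## Toy data (hypothetical): 4 documents x 3 words
toy_zscores = [[1.0, 0.1, -0.5],
               [0.8, -0.1, -0.3],
               [-1.0, 0.0, 0.5],
               [-0.8, 0.0, 0.3]]
toy_authors = ["A", "A", "B", "B"]
toy_words = ["word_a", "word_b", "word_c"]

ranking = rank_words_by_author_gap(toy_zscores, toy_authors, toy_words, "A", "B")
print(ranking)
```

With the notebook's variables, the call would look like `rank_words_by_author_gap(doc_word_zscores, author_list, top_words, "James Madison", "Alexander Hamilton")`. A large gap in means does not by itself show consistency, so it is worth checking the histograms as well.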

[Answer below]



In [None]:
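## author_list is also built in the heat-map cell below; define it here so this cell runs on its own
author_list = numpy.array([doc["Author"] for doc in documents])

for word_id, word in enumerate(top_words[:10]):
    madison_scores = doc_word_zscores[author_list == "James Madison", word_id]
    hamilton_scores = doc_word_zscores[author_list == "Alexander Hamilton", word_id]
    
    pyplot.xlim(-2.5, 2.5)
    pyplot.title("\"{}\" M:{:.2f} / H:{:.2f}".format(word, numpy.mean(madison_scores), numpy.mean(hamilton_scores)))
    pyplot.hist([madison_scores, hamilton_scores], bins=10, histtype='bar', label=["Madison", "Hamilton"])
    pyplot.legend()
    pyplot.show()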

### Part : Multiple words

These plots show heat maps of the z-scores for the 75 most frequent words, with one plot per author and one row per essay.

* Does this view change your perspective on the words you looked at in the previous section? 
* Why is John Jay so distinctive?
* Are authors consistent in their relative use of words over the different essays?
* How do the most frequent words compare to the less frequent words?
* How many words would you want to count before you would feel confident distinguishing any pair of these three authors?
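
For the last question, it may help to compute the Delta distance itself: Burrows' Delta between two documents is the mean absolute difference of their z-scores over the most frequent words. The sketch below is illustrative only. The function name `burrows_delta`, its `num_words` parameter, and the toy vectors are assumptions, not part of the assignment code.

```python
import numpy

def burrows_delta(zscores_a, zscores_b, num_words=None):
    """Mean absolute z-score difference over the first num_words words."""
    ## slicing with None keeps every word
    a = numpy.asarray(zscores_a, dtype=float)[:num_words]
    b = numpy.asarray(zscores_b, dtype=float)[:num_words]
    return float(numpy.abs(a - b).mean())

## Toy z-score vectors (hypothetical), ordered from most to least frequent word
doc_x = [1.0, -0.5, 0.2, 0.0]
doc_y = [0.0, 0.5, 0.2, 1.0]

delta_all = burrows_delta(doc_x, doc_y)
delta_top2 = burrows_delta(doc_x, doc_y, num_words=2)
print(delta_all, delta_top2)
```

Because the columns of `doc_word_zscores` follow the order of `top_words`, a call such as `burrows_delta(doc_word_zscores[0], doc_word_zscores[1], num_words=50)` would use only the 50 most frequent words. Repeating the call for several values of `num_words` is one way to see how the distance changes as more words are counted.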

[Answer below]

In [None]:
author_list = numpy.array([doc["Author"] for doc in documents])

for author in ["John Jay", "James Madison", "Alexander Hamilton or James Madison", "Alexander Hamilton"]:
    pyplot.figure(figsize=(14, 8))
    pyplot.title(author)
    pyplot.yticks([])
    pyplot.xticks(range(75), top_words[:75], rotation="vertical")
    pyplot.imshow(doc_word_zscores[author_list == author,:75], vmin=-4, vmax=4)
    pyplot.show()


### Part : Parts of Speech

In addition to the `text` of each token, spaCy predicts the token's part of speech (POS) and stores the tag as a string in the token attribute `pos_`. Repeat the Burrows' Delta analysis from the previous sections using POS tags in place of words. Do you see differences in use between Hamilton and Madison that are consistent enough that you could use them to identify the author of an unknown work?
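
As a starting point, POS tags can be counted the same way the word counts were built, by reading `pos_` instead of `text`. This is a minimal sketch, not a full solution: `count_pos_tags` is a hypothetical helper, and the stand-in tokens are made up so the example runs without loading a spaCy model.

```python
from collections import Counter
from types import SimpleNamespace

def count_pos_tags(tokens):
    """Count POS tags for any iterable of objects that have a pos_ attribute."""
    return Counter(token.pos_ for token in tokens)

## Stand-in tokens (hypothetical); in the notebook you would pass document["Spacy"]
fake_doc = [SimpleNamespace(text=text, pos_=pos) for text, pos in [
    ("The", "DET"), ("people", "NOUN"), ("govern", "VERB"),
    ("the", "DET"), ("states", "NOUN"),
]]

pos_counts = count_pos_tags(fake_doc)
print(pos_counts.most_common())
```

From here the remaining steps mirror the word analysis: build a documents-by-tags count matrix, normalize each row, and convert each column to z-scores.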

In [None]:
## Code here

[Written answer below]



### Part : Publications

We always want to make sure when doing quantitative analysis that we aren't just making stuff up. The measurements will always produce numbers, and we're really good at seeing patterns -- even when they don't exist.

What happens when we do the same type of analysis on a variable that we *don't* necessarily expect to affect word use? Repeat the Burrows' Delta analysis on words (not POS) from the previous sections, but group essays by the `"Publication"` field instead of the `"Author"` field. Compare the two main publication venues, the New York Packet and the Independent Journal. Based on the differences you do or do not observe, argue whether these two publications are statistically distinguishable.
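
One way to check whether an observed difference could be an accident of grouping is a permutation test: shuffle the group labels many times and see how often a difference at least as large appears by chance. This technique is not part of the assignment as written. The sketch below is an assumption about how you might approach it, and `permutation_p_value` and the toy numbers are hypothetical.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, num_shuffles=2000, seed=0):
    """Estimate how often a random split gives a mean gap at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(num_shuffles):
        ## reassign values to the two groups at random
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_large += 1
    ## add one to both counts so the estimate is never exactly zero
    return (at_least_as_large + 1) / (num_shuffles + 1)

## Toy z-scores for one word (hypothetical)
separated_a = [2.0, 2.1, 1.9, 2.2, 2.0]
separated_b = [-2.0, -1.9, -2.1, -2.2, -2.0]
identical_a = [0.1, -0.2, 0.3, 0.0]
identical_b = [0.1, -0.2, 0.3, 0.0]

p_separated = permutation_p_value(separated_a, separated_b)
p_identical = permutation_p_value(identical_a, identical_b)
print(p_separated, p_identical)
```

In the notebook, the two groups could be one word's column of `doc_word_zscores`, split by publication. Testing many words this way will produce some small p-values by chance alone, so a single low value is weak evidence.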

In [None]:
## Code here

[Written answer below]

