# Author attribution project wrap-up

In this class we'll complete our mini-project on the Federalist Papers and authorship attribution. We will cover the singular value decomposition (SVD) and how it lets us visualize differences between documents. We will then compare cosine similarities for each individual document of uncertain authorship.

You will *not* have to turn in this notebook, but it will be useful in completing the previous notebooks, especially last Wednesday's.

In [None]:
import csv, sys, os
import spacy
from collections import Counter
import numpy

from matplotlib import pyplot

nlp = spacy.load("en_core_web_sm")

I have updated the metadata file to include Federalist 74. If you have edited the metadata file, there may be a merge conflict.

In [None]:
documents = []

###                                      note name change ↓ 
with open("../data/FederalistPapers/metadata_federalist_fixed.csv", encoding="utf-8") as reader:
    csv_reader = csv.DictReader(reader)
    for row in csv_reader:
        ## convert string to int
        row["Number"] = int(row["Number"])
        row["Filename"] = "../data/FederalistPapers/federalist_{:02d}.txt".format(row["Number"])
        if os.path.exists(row["Filename"]):
            documents.append(row)

In [None]:
for document in documents:
    try:
        with open(document["Filename"], encoding="utf-8") as reader:
            print(document["Number"], document["Author"], document["Title"])

            lines = []
            for line in reader:
                lines.append(line.rstrip())

            text = " ".join(lines)
            document["Spacy"] = nlp(text)
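    except Exception as error:
        ## report which document failed and why, instead of silently hiding the cause
        print("Problem with {}: {}".format(document["Number"], error))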

In [None]:
all_counts = Counter()

for document in documents:
    ## count every token (words and punctuation) in this document
    doc_counter = Counter(token.text for token in document["Spacy"])
    all_counts += doc_counter
    document["TokenCounts"] = doc_counter

### Define our representation of documents

How many words should we consider when doing similarity comparisons? Be ready to rerun the following cell with differing values of `num_top_words`.

In [None]:
num_top_words = 150
top_words = [w for w, c in all_counts.most_common(num_top_words)]

doc_word_counts = numpy.zeros( (len(documents), num_top_words) )

for doc_id, document in enumerate(documents):
    for word_id, word in enumerate(top_words):
        doc_word_counts[doc_id,word_id] = document["TokenCounts"][word]

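## total count of the top words in each document (not the full document length)
doc_lengths = doc_word_counts.sum(axis=1)

## each row now sums to 1: proportions among the top words
doc_word_probs = doc_word_counts / doc_lengths[:,numpy.newaxis]

## per-word mean and standard deviation across documents
word_means = doc_word_probs.mean(axis=0)
word_sds = doc_word_probs.std(axis=0)

## subtract means, divide by std
doc_word_zscores = (doc_word_probs - word_means[numpy.newaxis,:]) / word_sds[numpy.newaxis,:]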
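
To see what the cell above does to the numbers, here is a minimal sketch on a made-up 3-document, 2-word count matrix. The `toy_` names and the counts are invented for illustration and are not part of the Federalist data.

```python
import numpy

# Toy counts: 3 documents (rows) by 2 words (columns). Values are made up.
toy_counts = numpy.array([[8.0, 2.0],
                          [5.0, 5.0],
                          [2.0, 8.0]])

# Divide each row by its total so every row becomes word proportions.
toy_probs = toy_counts / toy_counts.sum(axis=1)[:, numpy.newaxis]

# Standardize each column: subtract the column mean, divide by the column SD.
toy_zscores = (toy_probs - toy_probs.mean(axis=0)) / toy_probs.std(axis=0)

print(toy_probs)
print(toy_zscores)
```

After standardizing, every column has mean 0 and standard deviation 1, so a very frequent word no longer dominates simply because its counts are larger. One caveat: a word whose proportion is identical in every document would have a standard deviation of 0, and the division would fail for that column.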

In [None]:
U,S,Vt = numpy.linalg.svd(doc_word_zscores, full_matrices=False)
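
It can help to check what the three SVD outputs are on a small matrix before plotting them. The sketch below uses a random 6-by-4 matrix as a stand-in for the documents-by-words z-score matrix; the `_toy` names and the random data are assumptions for illustration only.

```python
import numpy

# Stand-in for a documents-by-words matrix: 6 "documents", 4 "words".
rng = numpy.random.default_rng(0)
toy = rng.normal(size=(6, 4))

U_toy, S_toy, Vt_toy = numpy.linalg.svd(toy, full_matrices=False)

# U_toy has one row per document, Vt_toy has one column per word,
# and S_toy holds the singular values, largest first.
print(U_toy.shape, S_toy.shape, Vt_toy.shape)

# Multiplying the three pieces back together recovers the original matrix.
reconstructed = (U_toy * S_toy) @ Vt_toy
print(numpy.allclose(toy, reconstructed))

# Keeping only the first k dimensions gives a rank-k approximation.
k = 2
approx = (U_toy[:, :k] * S_toy[:k]) @ Vt_toy[:k, :]
error = numpy.linalg.norm(toy - approx)
print(error)
```

Because the singular values are sorted from largest to smallest, the first few columns of `U` and rows of `Vt` carry the most variation, which is why we plot only the first handful of dimensions.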

In [None]:
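## show the first five rows of Vt over the most frequent words
## (cap at num_top_words so the plot still works with fewer than 75 words)
num_shown = min(75, num_top_words)
pyplot.figure(figsize=(14, 8))
pyplot.yticks([])
pyplot.xticks(range(num_shown), top_words[:num_shown], rotation="vertical")
pyplot.imshow(Vt[:5,:num_shown])
pyplot.show()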


In [None]:
author_list = numpy.array([doc["Author"] for doc in documents])

pyplot.figure(figsize=(14, 14))
pyplot.xticks([])
pyplot.yticks(range(len(author_list)), author_list)
pyplot.imshow(U[:,:5])
pyplot.show()


In [None]:
colormap = {"Alexander Hamilton": "red", "James Madison": "blue",
            "John Jay": "green", "Alexander Hamilton and James Madison": "purple",
            "Alexander Hamilton or James Madison": "gray",}

authors = [colormap[doc["Author"]] for doc in documents]

def show_2d(dimension1, dimension2):
    pyplot.figure(figsize=(8,8))
    pyplot.scatter(U[:,dimension1], U[:,dimension2], c=authors)
    pyplot.show()

In [None]:
show_2d(0, 1)

### In-class exercise: Attribute a document

Each table should "adopt" one document of unknown authorship. Use the `nearest` function to find the closest documents, and be ready to report the authors of the five closest documents *not including* the document itself. 

Do the same for one document of known authorship.

Vary the number of top words. Do the closest authors change?

In [None]:
## Python lists are indexed from 0, but the
##  Federalist Papers are numbered from 1, and some are missing.
## This list comprehension gives us (list index, paper number) pairs
##  for the documents of uncertain authorship.

[(i, doc["Number"]) for i, doc in enumerate(documents)
 if doc["Author"] == "Alexander Hamilton or James Madison"]

The formula for cosine similarity is

$$\cos(x, y) = \frac{x^T y}{\|x\|\|y\|}$$

$\|x\|$ is the *norm* of $x$, which is also its length: $\|x\| = \sqrt{x^T x}$.
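
As a quick check of the formula, here is a small sketch with three made-up vectors. The helper name `cosine` and the vectors `a`, `b`, and `c` are illustrative only.

```python
import numpy

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    return x.dot(y) / (numpy.linalg.norm(x) * numpy.linalg.norm(y))

a = numpy.array([1.0, 2.0, 0.0])
b = numpy.array([2.0, 4.0, 0.0])   # same direction as a, twice as long
c = numpy.array([-2.0, 1.0, 0.0])  # perpendicular to a

print(cosine(a, b))   # close to 1: same direction
print(cosine(a, c))   # close to 0: perpendicular
print(cosine(a, -a))  # close to -1: opposite direction
```

Note that `b` is twice as long as `a` but still has a cosine of about 1 with it. Dividing by the norms means only the direction of a vector matters, not its length.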

In [None]:
descriptors = ["{} {}, {}".format(doc["Number"], doc["Author"], doc["Title"][:30]) for doc in documents]

zscore_norms = numpy.linalg.norm(doc_word_zscores, axis=1)

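def nearest(query_id):
    ## dot product of every document with the query document
    dot_products = doc_word_zscores.dot(doc_word_zscores[query_id,:])

    ## product of the norms: the denominator of the cosine formula
    normalizers = zscore_norms * zscore_norms[query_id]

    cosines = dot_products / normalizers

    ## print every document, most similar first
    ## (the query itself should appear at the top with cosine 1.00)
    for comparison in sorted(zip(cosines, descriptors), reverse=True):
        print("{:.2f} {}".format(comparison[0], comparison[1]))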
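
For the exercise you need the closest documents *not including* the query itself. One possible way to get that is sketched below on toy data. The function `nearest_k` and the `toy_vectors` and `toy_labels` arrays are hypothetical names for illustration, not part of the notebook.

```python
import numpy

def nearest_k(vectors, labels, query_id, k=5):
    """Return (cosine, label) pairs for the k rows most similar to the query row,
    skipping the query row itself."""
    norms = numpy.linalg.norm(vectors, axis=1)
    cosines = vectors.dot(vectors[query_id]) / (norms * norms[query_id])

    # Sort row indexes from most to least similar, then drop the query.
    order = [i for i in numpy.argsort(-cosines) if i != query_id]
    return [(float(cosines[i]), labels[i]) for i in order[:k]]

# Toy data: four 2-dimensional "documents" with made-up labels.
toy_vectors = numpy.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [-1.0, 0.0]])
toy_labels = ["A", "B", "C", "D"]

result = nearest_k(toy_vectors, toy_labels, query_id=0, k=2)
for cosine_value, label in result:
    print("{:.2f} {}".format(cosine_value, label))
```

With the notebook's data, the same idea would presumably be called with `doc_word_zscores` and `descriptors` in place of the toy arrays.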

In [None]:
## 52 is a list index into documents, not a Federalist paper number
nearest(52)