# Authorship identification
- toc: true 
- badges: false
- comments: true
- categories: [stylometry, machine learning, natural language processing]

Last year I was able to take a few elective courses in the computer science department: Data Mining in the spring semester and Intro to Machine Learning in the fall semester. It was a nice change of pace to study something different from physics. One project I worked on was to perform authorship identification using support vector machines. It was a pretty fun project, so I thought I would post the results here. The goal of the project was to reproduce the results of [this paper](https://www.researchgate.net/publication/221655968_N-Gram_Feature_Selection_for_Authorship_Identification). Reproducing someone else's results is a good thing to do because 1) it helps validate their research and 2) it is a chance to actively learn about a topic.

### Stylometry

#### A classic example: the Federalist Papers

The Federalist Papers are an important collection of 85 essays written by Hamilton, Madison, and Jay during 1787 and 1788. They were published under the alias "Publius" at the time, and although it became well known that the three contributed to the papers, the authorship of each individual paper was kept hidden for over a decade. It was actually in the interest of Hamilton and Madison, both politicians, to keep the authorship a secret (they had both changed their positions on a number of issues and they didn't want their political opponents to use their old arguments against them). Days before his death, however, Hamilton allegedly wrote down who he believed to be the correct author of each essay, claiming over 60 for himself. Madison waited a number of years before publishing his own list, and in the end there were 12 essays for which both Madison and Hamilton claimed authorship. Many interesting details on the controversy can be found in a paper by [Adair](https://www.jstor.org/stable/1921883?read-now=1&seq=23#page_scan_tab_contents).

<img src=images_next/Hamilton_and_Madison.png width=550>
<figcaption>Alexander Hamilton (left) and James Madison (right). Credit: Wikipedia.</figcaption>

There are two ways we could go about resolving this dispute. The first approach is to analyze the actual *content* of the text. For example, perhaps an essay draws from a reference which only Madison was intimately familiar with, or perhaps an essay is similar to some previous work by Hamilton. This was done many times over the next 150 years, but perhaps the final word on the subject was by Adair in 1944, who concluded that Madison likely wrote all 12 essays. An alternative approach is to determine the authorship using only the words on the page, i.e., to analyze the *style* of the text. For example, maybe Madison used many more commas than Hamilton. The field of *stylometry* analyzes these stylistic differences statistically. David Holmes writes the following about stylometry:
> *At its heart lies an assumption that authors have an unconscious aspect to their style, an aspect which cannot consciously be manipulated but which possesses features which are quantifiable and which may be distinctive.*

I think this is a valid assumption. The question is which features best characterize the author's style and which methods are best suited to analyzing these features. Let's go back in time a bit to the first attempts at stylometry.

#### Early attempts

[Thomas Mendenhall](https://en.wikipedia.org/wiki/Thomas_Corwin_Mendenhall), a physicist, is considered the first to statistically analyze large literary texts. He presented the following interesting idea in an 1887 paper titled [*The Characteristic Curves of Composition*](https://www.jstor.org/stable/pdf/1764604.pdf): it is well known that each element has a unique distribution of wavelengths in the light it emits when heated; perhaps each author has a unique distribution of word lengths in the texts they have written? It's a really cool idea, and I highly recommend reading his original paper. Mendenhall tallied word lengths by hand for various books, usually in samples of 1000 words or so. Here is Fig. 2 from his paper, which shows the characteristic curves for a few excerpts of *Oliver Twist*.

<img src=images_next/Mendenhall_fig2.png width=400>
<figcaption>Distribution of word lengths in "Oliver Twist". Each curve is for a different sample of 1000 words.</figcaption>

He showed that the curves from different works by the same author resemble one another. The use of these statistics for authorship identification was left for future work.
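
To make the idea of a characteristic curve concrete, here is a minimal sketch of how one might tally word lengths today. It is not Mendenhall's procedure (he counted by hand), and the function name `word_length_distribution`, the `sample_size` parameter, and the rule for what counts as a word are all my own assumptions for illustration.

```python
import re
from collections import Counter


def word_length_distribution(text, sample_size=1000):
    """Return {word_length: count} for the first `sample_size` words of `text`.

    Assumption: a "word" is a run of letters, optionally containing internal
    apostrophes (so "don't" is one word of four letters).
    """
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    sample = words[:sample_size]
    # Apostrophes are not letters, so strip them before measuring length.
    return Counter(len(word.replace("'", "")) for word in sample)


# Tiny illustrative input; a real analysis would use a full book.
text = "Please, sir, I want some more."
dist = word_length_distribution(text)
for length in sorted(dist):
    print(length, dist[length])
```

Plotting the counts against word length for several 1000-word samples would give curves analogous to the ones in the figure above.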

The next significant advance was made by Zipf. [Discuss Zipf's Law.] 

In the 1960s, researchers used stylometry to support Adair's conclusion that Madison wrote the 12 disputed Federalist Papers. This was apparently something of a breakthrough case for the discipline.

#### Modern methods

There are a few problems with the attempts to find analytic relationships between various textual features. First, there are a huge number of features to choose from, and second, the relationships may not be analytic. There are now many algorithms available that build prediction models directly from data, i.e., machine learning algorithms, and these can be readily applied to the problem of authorship identification. Here we focus on the use of the Support Vector Machine (SVM) to perform the supervised learning task. Let's briefly review the idea behind the SVM.

[Describe SVM.]

### Reuters 50/50 dataset

#### Dataset description

#### Feature extraction 

#### Feature selection

#### Results

### Conclusion

I think what I'll try next is to identify artists from images of their paintings.