# Exercises in Authorship Detection

This notebook provides a small introduction to authorship detection. It is by no means comprehensive, but there should be enough to get you started on this subject. **THIS IS A DRAFT**

## Responsibility

First, a point on responsibility. I want to make this very clear (especially since some of you might end up in Forensics): **Never blindly trust a machine's judgment!** Computers are stupid, but they can easily impress humans with their ability to crunch numbers. Therefore, it's very important to manually inspect the data and the results as well. Try to understand *why* you got a particular result; only then can you have any confidence in the conclusion.

The methods illustrated in this notebook are by no means state-of-the-art, but they are foundational to the field of authorship detection.

## Features

By *features*, we mean properties or characteristics of a text that are useful inputs for a machine to make predictions about that text (in this case: predicting authorship). We recommend that you read [A survey of modern authorship attribution methods](http://onlinelibrary.wiley.com/doi/10.1002/asi.21001/full) by Efstathios Stamatatos. This paper provides a nice overview of features that are commonly used in authorship attribution. The paper dates back to 2008, and newer methods have been developed, but these features are still important. We'll work with a selection of features from the literature. Most important for us here is how to extract features *efficiently* and how to *represent* them. These are the features we will use:

* Character counts
* Mean word length
* Mean sentence length
* Median sentence length
* Standard deviation of sentence length
* Type-token ratio
* Hapax legomena ratio
* Stopword counts
* Whitespace

Some of these features sound really basic, but don't let their basic nature fool you. [This piece on punctuation in novels](https://medium.com/@neuroecology/punctuation-in-novels-8f316d542ec4) shows you how informative punctuation can be. And here is an excellent quote by Gary Provost (from *100 Ways To Improve Your Writing*) that shows the power of sentence length to change the character of a text:

> **VARY SENTENCE LENGTH**
> 
> This sentence has five words. Here are five more words. Five-word sentences are fine. But several together become monotonous. Listen to what is happening. The writing is getting boring. The sound of it drones. It's like a stuck record. The ear demands some variety.
>
> Now listen. I vary the sentence length, and I create music. Music. The writing sings. It has a pleasant rhythm, a lilt, a harmony. I use short sentences. And I use sentences of medium length. And sometimes when I am certain the reader is rested, I will engage him with a sentence of considerable length, a sentence that burns with energy and builds with all the impetus of a crescendo, the roll of the drums, the crash of the cymbals--sounds that say listen to this, it is important.
>
> So write with a combination of short, medium, and long sentences. Create a sound that pleases the reader's ear. Don't just write words. Write music.

(Fun fact: the use of sentence length to determine authorship goes back to [1939](http://www.jstor.org/stable/2332655?seq=1#page_scan_tab_contents)!)

Of course, there are many more possible features, for example part-of-speech features (does the author use a lot of adjectives?) or n-gram features (what combinations of words does the author use?). Feel free to experiment with these after you've finished the exercises. Another approach is language modeling, e.g. using a Hidden Markov Model (HMM) or a Long Short-Term Memory network (LSTM). If you take this approach, the goal is to learn which sequences of tokens are typical for an author. With enough text, these models are really powerful.
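As a rough illustration of the n-gram idea mentioned above, the sketch below counts word bigrams with `Counter`. The helper name `word_ngrams` and the whitespace-based tokenization are assumptions made for this example; they are not part of the exercises.

```python
from collections import Counter

def word_ngrams(tokens, n=2):
    """Return a list of n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, only for illustration.
tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = Counter(word_ngrams(tokens, n=2))

# The most frequent bigram and its count.
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

The counts of a fixed set of frequent n-grams could then serve as features, in the same way as the character counts later in this notebook.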

## Exercises

Now let's get to work! The plan for this notebook is to write functions to extract the different features. However, we will not write a separate function for each feature! Instead, we'll group features that are naturally related to each other. Read the list of features again and try to think of the requirements each of them has: **what do you need to do in order to compute these features?**

* Character counts
* Mean word length
* Mean sentence length
* Median sentence length
* Standard deviation of sentence length
* Type-token ratio
* Hapax legomena ratio
* Stopword counts
* Whitespace

### Data: Reuter 50_50

We're using the [Reuter 50_50 dataset](https://archive.ics.uci.edu/ml/datasets/Reuter_50_50). Here's the description of this dataset:

> The dataset is the subset of RCV1. These corpus has already been used in author identification experiments. In the top 50 authors (with respect to total size of articles) were selected. 50 authors of texts labeled with at least one subtopic of the class CCAT(corporate/industrial) were selected.That way, it is attempted to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author) and the test corpus includes other 2,500 texts (50 per author) non-overlapping with the training texts.

In [8]:
import glob
from collections import defaultdict

def data_dict(path):
    "Get the data as a dictionary with key: author, value: list of texts."
    data_by_author = defaultdict(list)
    folders = glob.glob(path)
    for folder in folders:
        author = folder.split('/')[-2]
        filenames = glob.glob(folder + '*.txt')
        for filename in filenames:
            with open(filename) as f:
                data_by_author[author].append(f.read())
    return data_by_author

train_data = data_dict('../Data/authorship/C50/C50train/*/')
test_data = data_dict('../Data/authorship/C50/C50test/*/')

In [10]:
authors = list(train_data.keys())
first_author = authors[0]
first_text = train_data[first_author][0]
print("Example:", first_author,'\n')
print(first_text)

Example: LynneO'Donnell 

China has emerged from the cocoon of last week's National Day celebrations to swoop on the world's soybean and soymeal markets, snapping up a total of between 600,000 and 800,000 tonnes, traders said on Tuesday.
One trader in the region took the credit for a full half of 800,000 tonnes of beans and meal business that he said had been done with Chinese buyers in the past four trading days.
"There has been a chunk of business going on," he said.
"I guess total meal and beans done since last Wednesday is 800,000 tonnes, with 200,000 to 300,000 of it beans and 500,000 to 600,000 tonnes of it meal," he said.
Chinese buyers jumped into the market as soon as the National Day holiday was over, he said, expressing his surprise at how fast the usually cautious Chinese reacted to price falls.
China celebrated the 47th anniversary of the founding of the communist-ruled People's Republic on October 1 with an extended public holiday that saw government, business and trade c

### Tools: Counter

We will make use of the `Counter` object from the `collections` module in the standard library. `Counter` is a subclass of `dict`, and has all the functionality you would expect in a dictionary, plus some cool additions. Let's first import it:

In [None]:
from collections import Counter

There are two ways to use it for counting.

**Method 1**: Initialize the counter with an iterable object (string, list, set, tuple, generator). Example:

In [None]:
chars = Counter('This is a string whose characters will be counted.')

**Method 2**: After having initialized the counter, use the update method to increment the counts. Example:

In [None]:
cnt = Counter()
cnt.update(['cat','dog','cat','cat'])
print(cnt['cat'])

**Question**: How can you use `chars` to get the 10 most frequent characters? (Use the docs or the `dir` and `help` functions to find out.)

**Question**: What do you expect Python to do if you use `cnt['mouse']`?

**Question**: The built-in Python function `sum()` returns the sum of the contents of whatever iterable it is given (provided those contents are numbers, such as ints or floats). So `sum([1,2,3])` returns `6`. How can you get the sum of all counts from a counter?
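Once you have tried the questions above yourself, the following sketch shows one way to check your answers. It only uses the standard library, and it rebuilds `chars` and `cnt` so that it can be run on its own.

```python
from collections import Counter

chars = Counter('This is a string whose characters will be counted.')
cnt = Counter(['cat', 'dog', 'cat', 'cat'])

# most_common(n) returns the n most frequent items as (item, count) pairs.
print(chars.most_common(10))

# A missing key does not raise a KeyError; the count is simply 0.
print(cnt['mouse'])  # 0

# The counts are the values of the counter, so they can be summed directly.
print(sum(cnt.values()))  # 4
```

Note that looking up a missing key does not add it to the counter, so `'mouse' in cnt` is still `False` afterwards.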

### Tools: Tokenization and stopwords using the NLTK

The Natural Language Toolkit (NLTK) is probably the most famous NLP library out there. Their FAQ tells the origin story:

> The NLTK project began when Steven Bird was teaching CIS-530 at the University of Pennsylvania in 2001, and hired his star student, Edward Loper, from the previous offering of the course to be the teaching assistant (TA). They agreed a plan for developing software infrastructure for NLP teaching that could be easily maintained over time.

Since these early beginnings, many people have contributed to the library, which now covers all areas of computational linguistics. Because of its size, we can only give you a small preview of what you can do with the NLTK. If you're hungry for more, feel free to read [the (freely available!) book](http://www.nltk.org/book/).

If you are using Anaconda, NLTK is already installed. Here's how to perform sentence and word tokenization.

In [15]:
import nltk

# Use the following command to download the NLTK data, if you haven't already:
# nltk.download('all')

In [16]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [17]:
# Text from the Moby Dick page on Wikipedia.
moby_dick = """Ishmael travels in December from Manhattan Island to New Bedford with plans to sign up for a whaling voyage. The inn where he arrives is so crowded, he must share a bed with the tattooed Polynesian Queequeg, a harpooneer whose father was king of the (fictional) island of Rokovoko. The next morning, Ishmael and Queequeg attend Father Mapple's sermon on Jonah, then head for Nantucket. Ishmael signs up with the Quaker ship-owners Bildad and Peleg for a voyage on their whaler Pequod. Peleg describes Captain Ahab: "He's a grand, ungodly, god-like man" who nevertheless "has his humanities". They hire Queequeg the following morning. A man named Elijah prophesies a dire fate should Ishmael and Queequeg join Ahab. While provisions are loaded, shadowy figures board the ship. On a cold Christmas Day, the Pequod leaves the harbor."""

In [18]:
# Get all the sentences.
sentences = sent_tokenize(moby_dick)

In [19]:
# Get the words from the first sentence.
words_from_first_sentence = word_tokenize(sentences[0])
print(words_from_first_sentence)

['Ishmael', 'travels', 'in', 'December', 'from', 'Manhattan', 'Island', 'to', 'New', 'Bedford', 'with', 'plans', 'to', 'sign', 'up', 'for', 'a', 'whaling', 'voyage', '.']


### Tools: Regular expressions

We won't need this for the current exercise, but regular expressions are generally useful for quickly extracting interesting patterns from text.

Let's start by importing the `re` module, which enables us to search using regular expressions.

In [12]:
import re

If you have no prior experience with regular expressions, we recommend that you follow the tutorial at [regexone.com](https://regexone.com/). If you do know regular expressions, please read the documentation on [this page](https://docs.python.org/3/library/re.html). (Note especially the part where they explain *greedy* versus *non-greedy* matching. This difference is very important to have in the back of your mind. Someday this difference *will* be relevant to you.)
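To make the greedy versus non-greedy distinction concrete, here is a small sketch. The example sentence is made up for this illustration.

```python
import re

text = 'He said "yes" and then "no".'

# Greedy: .* grabs as much as possible, so the match runs from the
# first quotation mark to the last one.
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops at the earliest possible closing quotation mark.
non_greedy = re.findall(r'".*?"', text)

print(greedy)      # ['"yes" and then "no"']
print(non_greedy)  # ['"yes"', '"no"']
```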

In [14]:
# You can compile a pattern like this:
pattern = re.compile('cat[a-z]*')

# And find all occurrences of the pattern like this.
# findall() returns a list of all matches of the pattern in the string.
results = pattern.findall('I am a cat person. I have two cats, and I pet them every day.')
print(results)

['cat', 'cats']


**Question**: how can you find all sequences of whitespace characters in a document?
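One possible answer, as a sketch: the character class `\s` matches any whitespace character, and `+` asks for one or more of them in a row. The example string is invented for this illustration.

```python
import re

document = "One  space,\ttab,\n\nand a blank line."

# \s matches any whitespace character; + means "one or more in a row".
whitespace_runs = re.findall(r'\s+', document)
print(whitespace_runs)  # ['  ', '\t', '\n\n', ' ', ' ', ' ']
```

Counting how often each kind of run occurs (for example with a `Counter`) is one way to turn whitespace into a feature.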

### Feature extraction
Now we'll write the functions to extract and combine all features.

**Character counts**

Write a function that takes a text, and produces a vector with relative character counts.

In [None]:
from string import punctuation, ascii_lowercase

# The text is lowercased below, so only lowercase letters are needed here.
reference_chars = punctuation + ascii_lowercase

def char_counts(text):
    "Function returning relative character counts for letters and punctuation marks."

    text = text.lower()
    # First count the characters
    counts = None  # YOUR CODE HERE

    # Then determine the total number of characters
    total_chars = None  # YOUR CODE HERE

    relative_values = []
    for char in reference_chars:
        pass  # YOUR CODE HERE

    # And return the relative values.
    return relative_values
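If you get stuck, the following is one possible way to compute relative character counts. It is a sketch, not the reference solution: the function name `relative_char_counts` is made up for this example, and it assumes that only lowercase letters and punctuation are used as reference characters, since the text is lowercased first.

```python
from collections import Counter
from string import punctuation, ascii_lowercase

# Assumption: lowercase letters only, since the text is lowercased first.
reference_chars = punctuation + ascii_lowercase

def relative_char_counts(text):
    """Return the relative frequency of each reference character in the text."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty text.
        return [0.0] * len(reference_chars)
    return [counts[char] / total for char in reference_chars]

vector = relative_char_counts("Hello, world!")
print(len(vector))  # 58: 32 punctuation marks plus 26 letters
```

In this sketch the total includes every character in the text (also digits and whitespace), so the values in the vector do not necessarily sum to 1. Dividing by the number of reference characters found in the text would be an equally defensible choice.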

**Processing the text** Now we need to do some preliminary processing, so that it's possible to compute more statistics. Write a function that turns a text into a list of tokenized sentences. Example:

* Input text: "There once was a man. His name was Piek."
* Output: `[['There', 'once', 'was', 'a', 'man', '.'], ['His', 'name', 'was', 'Piek', '.']]`

In [13]:
def tokenized_sents(text):
    "Function turning a text into a list of tokenized sentences."
    # YOUR CODE HERE.

For the rest of this section, let's work with this example.

In [None]:
example = "There once was a man. His name was Piek. He was pretty good with computers!"
tokenized_text = tokenized_sents(example)
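To see the expected output format without NLTK, here is a rough regex-based stand-in. It is not the NLTK tokenizer: the name `simple_tokenized_sents` and both patterns are assumptions made for this sketch, and it will mis-split abbreviations such as "Dr." or "e.g.". For the exercise itself, `sent_tokenize` and `word_tokenize` are the better tools.

```python
import re

def simple_tokenized_sents(text):
    """Rough stand-in: split into sentences, then into word and punctuation tokens."""
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # A token is either a run of word characters or a single punctuation mark.
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences if sentence]

example = "There once was a man. His name was Piek. He was pretty good with computers!"
for sentence in simple_tokenized_sents(example):
    print(sentence)
```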

**Mean sentence length** Write a function to compute the mean sentence length in tokens. (You are allowed to use NumPy or SciPy functions.) For the example above, the output should look like this: `[6.0]`

**Median sentence length** Update the function to compute the median sentence length as well. The output should now be a list with two values: `[mean, median]`

**Standard deviation of sentence length** Update the function to also compute the standard deviation of the sentence length. The output should now be a list with *three* values: `[mean, median, std]`
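As a sketch of what the finished function might look like, the version below uses only the standard-library `statistics` module. The function name `sentence_length_stats` is an assumption, and the tokenized example is written out by hand so that the block runs on its own.

```python
from statistics import fmean, median, pstdev

def sentence_length_stats(tokenized_text):
    """Return [mean, median, std] of the sentence lengths, measured in tokens."""
    lengths = [len(sentence) for sentence in tokenized_text]
    # pstdev is the population standard deviation (dividing by N),
    # which is also what numpy.std computes by default.
    return [fmean(lengths), float(median(lengths)), pstdev(lengths)]

tokenized_text = [
    ['There', 'once', 'was', 'a', 'man', '.'],
    ['His', 'name', 'was', 'Piek', '.'],
    ['He', 'was', 'pretty', 'good', 'with', 'computers', '!'],
]
print(sentence_length_stats(tokenized_text))
```

The sentence lengths here are 6, 5 and 7 tokens, so the mean and median are both 6.0. The function raises a `StatisticsError` for a text without sentences, so you may want to handle that case separately.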

**Mean word length** Write a function to compute the mean word length. (You are allowed to use NumPy or SciPy functions.) For this function, you should exclude punctuation marks from the tokens.

(Hint: you might want to use `set(punctuation)`.)

**Type-token ratio** is the number of *different* tokens in the text divided by the *total* number of tokens. Update the word length function to also include this number. For this task, you should also exclude the punctuation marks from the tokens.

(Hint: you might find the `Counter` object useful.)

**Hapax legomena ratio** is the proportion of words in the vocabulary (i.e. of all *different* words) that occur only once in the text. Update the word length function to compute this ratio.

Note: this is only a useful measure if you're comparing texts of similar size. See [this paper](http://www.aclweb.org/anthology/J10-4003) for explanation.

(Hint: use a `Counter` object to accomplish this. You can loop over the items and see which words occur only once. Divide the number of hapaxes by the length of the word counter.)
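The three word-level measures above can be sketched in one function, following the hints. The name `word_stats` is hypothetical, the comparison is case-sensitive (so "He" and "he" would count as different types), and the tokenized example is written out by hand.

```python
from collections import Counter
from string import punctuation

punctuation_marks = set(punctuation)

def word_stats(tokenized_text):
    """Return [mean word length, type-token ratio, hapax legomena ratio]."""
    # Flatten the sentences and drop tokens that are single punctuation marks.
    words = [token for sentence in tokenized_text
             for token in sentence
             if token not in punctuation_marks]
    if not words:
        return [0.0, 0.0, 0.0]
    word_counts = Counter(words)
    mean_length = sum(len(word) for word in words) / len(words)
    type_token_ratio = len(word_counts) / len(words)
    hapaxes = [word for word, count in word_counts.items() if count == 1]
    hapax_ratio = len(hapaxes) / len(word_counts)
    return [mean_length, type_token_ratio, hapax_ratio]

tokenized_text = [
    ['There', 'once', 'was', 'a', 'man', '.'],
    ['His', 'name', 'was', 'Piek', '.'],
    ['He', 'was', 'pretty', 'good', 'with', 'computers', '!'],
]
print(word_stats(tokenized_text))
```

One caveat: `set(punctuation)` only contains single characters, so multi-character punctuation tokens (such as `--`) are not filtered out by this sketch.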

**Stopword counts** Count the number of times each of the stopwords in the NLTK list is used.

In [None]:
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')

# Your code here.
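A possible shape for the stopword feature is sketched below. To keep the block self-contained it uses a tiny, made-up stopword list (`example_stopwords`) instead of `stopwords.words('english')`; the function name `stopword_counts` is also an assumption. The important design point is that the counts are returned in the fixed order of the stopword list, so that every text yields a vector of the same length.

```python
from collections import Counter

# Hypothetical mini stopword list, standing in for stopwords.words('english').
example_stopwords = ['a', 'he', 'his', 'the', 'was', 'with']

def stopword_counts(tokenized_text, stopword_list):
    """Return the count of each stopword, in the order of stopword_list."""
    # Lowercase the tokens, because the stopword list is lowercase.
    token_counts = Counter(token.lower()
                           for sentence in tokenized_text
                           for token in sentence)
    return [token_counts[word] for word in stopword_list]

tokenized_text = [
    ['There', 'once', 'was', 'a', 'man', '.'],
    ['His', 'name', 'was', 'Piek', '.'],
    ['He', 'was', 'pretty', 'good', 'with', 'computers', '!'],
]
print(stopword_counts(tokenized_text, example_stopwords))  # [1, 1, 1, 0, 3, 1]
```

With texts of different lengths, you may prefer relative counts (dividing by the total number of tokens) over raw counts.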

**Adding it all together**
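One way to combine the features, sketched under assumptions: each feature function takes a text and returns a list of numbers, and the lists are concatenated into a single vector per text. The names `extract_features` and `build_dataset` are hypothetical, and the two toy feature functions only stand in for the real ones you wrote above.

```python
def extract_features(text, feature_functions):
    """Concatenate the outputs of all feature functions into one flat vector."""
    vector = []
    for function in feature_functions:
        vector.extend(function(text))
    return vector

def build_dataset(data_by_author, feature_functions):
    """Turn {author: [texts]} into a list of feature vectors and a list of labels."""
    vectors, labels = [], []
    for author, texts in data_by_author.items():
        for text in texts:
            vectors.append(extract_features(text, feature_functions))
            labels.append(author)
    return vectors, labels

# Toy stand-ins for the real feature functions and data.
def text_length(text):
    return [len(text)]

def space_ratio(text):
    return [text.count(' ') / len(text) if text else 0.0]

toy_data = {'author_a': ['A short text.', 'Another one.'],
            'author_b': ['Something else entirely.']}
X, y = build_dataset(toy_data, [text_length, space_ratio])
print(X)
print(y)
```

Applied to `train_data` and `test_data` with your own feature functions, the resulting vectors and labels have the shape expected by the `train_classifier` and `score_classifier` functions in the next section.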

### Predicting authorship

Sit back and relax. Your job is done! Now let's see how well we can predict who wrote which text.

In [5]:
from sklearn.ensemble import RandomForestClassifier

def train_classifier(train, train_target):
    "Train a random forest classifier on the given data."
    rf = RandomForestClassifier(n_estimators=100)   
    rf = rf.fit(train, train_target)
    return rf

def score_classifier(classifier, test, test_target):
    "Print and return the mean accuracy of the model on the test set."
    score = classifier.score(test, test_target)
    print("The model scored: ", score)
    return score

## Python libraries

There are already several libraries to perform authorship detection. Below is a short list of libraries that I've found while researching this subject. If you want to delve deeper into authorship detection, I'd recommend you look at these in more detail. (Even if you won't use these libraries, it's nice to see how others have tackled the problem. No need to reinvent the wheel!)

* Mike Kestemont's [PyStyl](https://github.com/mikekestemont/pystyl) is probably the most complete library for authorship detection.
* The Information Sciences Institute provides the [digStylometry](https://github.com/usc-isi-i2/dig-stylometry) package.

It's also possible to determine authorship using neural networks. But that goes way beyond the scope of our course.

## Miscellaneous scripts
Here are some other scripts that I've used as inspiration for the exercises in this notebook.

* [Stylometry](https://github.com/jpotts18/stylometry) is a small set of scripts from Jeff Potter.
* [Here](https://github.com/d10genes/Authorship-Attribution) is another pair of scripts from 'Chris'.
* [Here](http://www.aicbt.com/authorship-attribution/) is a blog from a consultancy company called AICBT Consulting.