# DH Downunder: Distant Reading

## Notebook 3: Stylistics, or Computer Magic

One of the most interesting applications of distant reading techniques is authorship attribution. In this third and final notebook for our Distant Reading course, we will consider how text analysis can be used to work out the true identity of a text's author.

Authorship attribution will also give us a chance to introduce several concepts from data science, such as vectors, distance metrics and clustering. These concepts are useful in a number of areas beyond authorship attribution.
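
As a first taste of what a vector and a distance metric look like in practice, here is a minimal sketch. It assumes we describe each text by how often it uses a handful of common words, then measure how far apart two such descriptions are. The helper `frequency_vector`, the two toy texts and the four-word vocabulary are all made up for illustration and are not part of the course corpus.

```python
import numpy

def frequency_vector(tokens, vocabulary):
    """Return the relative frequency of each vocabulary word in a list of tokens."""
    counts = numpy.array([tokens.count(word) for word in vocabulary], dtype=float)
    return counts / len(tokens)

# Two tiny made-up "texts", already split into tokens
text_a = "the cat sat on the mat and the dog sat too".split()
text_b = "a dog and a cat sat on a mat".split()

# Hypothetical vocabulary of common words to compare
vocabulary = ["the", "a", "and", "on"]

# Each text becomes a vector: one number per vocabulary word
vector_a = frequency_vector(text_a, vocabulary)
vector_b = frequency_vector(text_b, vocabulary)

# Euclidean distance: one common distance metric between two vectors
distance = numpy.linalg.norm(vector_a - vector_b)

print(vector_a)
print(vector_b)
print(distance)
```

The smaller the distance, the more alike the two texts are in their use of these words. Euclidean distance is only one of several possible metrics, chosen here because it is the easiest to picture.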

Execute the cell below to load all the packages required for this notebook.

In [1]:
import pickle # for loading data
import numpy # for doing maths
import nltk # for natural language processing functions
import matplotlib # for displaying graphs
import random # some useful randomisation functions
from import_corpus import import_corpus # helper function that loads the corpus of novels

# So graphs render properly
%matplotlib inline

### Section 1: Load and Inspect Our Corpus

For this notebook, I have prepared a corpus for you in advance. Execute the cell below to import all the novels into your workspace.

In [2]:
novels = import_corpus()

Wyllard's Weird, by Mary Braddon successfully imported.
Castle Rackrent, by Maria Edgeworth successfully imported.
The Hungry Stones, and Other Stories, by Rabindranath Tagore successfully imported.
The Talisman, by Sir Walter Scott successfully imported.
Ivanhoe, by Sir Walter Scott successfully imported.
Belinda, by Maria Edgeworth successfully imported.
The Absentee, by Maria Edgeworth successfully imported.
The Blithdale Romance, by Nathaniel Hawthorne successfully imported.
Old Mortality, by Sir Walter Scott successfully imported.
The Crimson Cryptogram, by Fergus Hume successfully imported.
The Heart of Mid-Lothian, by Sir Walter Scott successfully imported.
Lady Audley's Secret, by Mary Braddon successfully imported.
Waverley; or, 'tis Sixty Years Since, by ??? successfully imported.
The Disappearing Eye, by Fergus Hume successfully imported.
The Bride of Lammermoor, by Sir Walter Scott successfully imported.
The Lost Parchment, by Fergus Hume successfully imported.
The Mystery 

One novel in our corpus is a mystery! The author of *Waverley; or, 'tis Sixty Years Since* is listed only as ???. In this session we are going to use statistical analysis to work out who really wrote this anonymously published novel.

Each novel in the corpus is stored as a dict. Execute the cell below to see the structure of a typical entry:

In [38]:
i = random.randint(0, len(novels)-1) # Pick a random number between 0 and the number of novels -1

# We will use this number to look at one of the novels in the list:

print('Each novel is a dict with six keys.') # no placeholders here, so no f-string is needed
print(f'For example, consider the {i}th novel in the corpus:\n')
print(f'novels[{i}]["title"] is the title of novel {i}: {novels[i]["title"]}\n')
print(f'novels[{i}]["author"] is the author of novel {i}: {novels[i]["author"]}\n')
print(f'novels[{i}]["tokens"] is a list of the words in novel {i}: ...{novels[i]["tokens"][100:115]}...\n')
print(f'novels[{i}]["header"] is the Project Gutenberg header of novel {i}: {novels[i]["header"][0:50]}...\n')
print(f'novels[{i}]["licence"] is the licence of novel {i}: {novels[i]["licence"][0:50]}...\n')
print(f'novels[{i}]["body"] is the text of novel {i}: ...{novels[i]["body"][1000:1200]}...\n')


Each novel is a dict with six keys.
For example, consider the 12th novel in the corpus:

novels[12]["title"] is the title of novel 12: Waverley; or, 'tis Sixty Years Since

novels[12]["author"] is the author of novel 12: ???

novels[12]["tokens"] is a list of the words in novel 12: ...['Author', 'of', 'this', 'collection', 'of', 'Works', 'of', 'Fiction', 'would', 'not', 'have', 'presumed', 'to', 'solicit', 'for']...

novels[12]["header"] is the Project Gutenberg header of novel 12: ﻿The Project Gutenberg EBook of Waverley,  Or, 'Ti...

novels[12]["licence"] is the licence of novel 12: *** END OF THIS PROJECT GUTENBERG EBOOK WAVERLEY *...

novels[12]["body"] is the text of novel 12: ...ded the warmest wish of your Majesty's
heart, by contributing in however small a degree to the happiness of your
people.

They are therefore humbly dedicated to your Majesty, agreeably to your
graciou...
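
Because every entry shares the same keys, the whole corpus can be summarised with a few short loops. The sketch below is one possible way to do this. It runs on a small stand-in list called `toy_corpus`, which has only three of the six keys and placeholder tokens rather than real text. The helper names `titles_by_author` and `find_unattributed` are hypothetical and are not part of the course code.

```python
from collections import defaultdict

def titles_by_author(corpus):
    """Group novel titles under the name stored in each novel's "author" key."""
    grouped = defaultdict(list)
    for novel in corpus:
        grouped[novel["author"]].append(novel["title"])
    return dict(grouped)

def find_unattributed(corpus, placeholder="???"):
    """Return the titles of novels whose author is recorded as the placeholder."""
    return [novel["title"] for novel in corpus if novel["author"] == placeholder]

# Toy stand-in for the real corpus: same key names, placeholder tokens
toy_corpus = [
    {"title": "Ivanhoe", "author": "Sir Walter Scott", "tokens": ["w1", "w2", "w3"]},
    {"title": "The Talisman", "author": "Sir Walter Scott", "tokens": ["w1", "w2"]},
    {"title": "Belinda", "author": "Maria Edgeworth", "tokens": ["w1", "w2", "w3", "w4"]},
    {"title": "Waverley; or, 'tis Sixty Years Since", "author": "???", "tokens": ["w1"]},
]

grouped = titles_by_author(toy_corpus)
unattributed = find_unattributed(toy_corpus)

# The length of the "tokens" list gives the size of each novel in words
token_counts = {novel["title"]: len(novel["tokens"]) for novel in toy_corpus}

for author, titles in grouped.items():
    print(f"{author}: {titles}")
print(f"Unattributed: {unattributed}")
print(f"Token counts: {token_counts}")
```

In the notebook itself you could pass `novels` in place of `toy_corpus`, since the functions rely only on the `"title"`, `"author"` and `"tokens"` keys shown above.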

