# Authorship Attribution

Authorship analysis is predominantly a text mining task that aims to identify certain aspects of an author based only on the content of their writing. These could include characteristics such as age, gender, or background. In the specific task of authorship attribution, we aim to identify which of a set of candidate authors wrote a particular document. This is a classic classification task. In many ways, authorship analysis is performed using standard data mining methodologies such as cross-validation, feature extraction, and classification algorithms.

In this chapter, we will use the problem of authorship attribution to piece together the parts of the data mining methodology we developed in the previous chapters. We identify the problem and discuss its background. This lets us choose features to extract, and we will build a pipeline to extract them. We will test two different types of features: function words and character n-grams. Finally, we will perform an in-depth analysis of the results. We will work first with a book dataset and then with a very messy real-world corpus of e-mails.

The topics we will cover in this chapter are as follows:
- Feature engineering and how the features differ based on application
- Revisiting the bag-of-words model with a specific goal in mind
- Feature types and the character n-grams model
- Support vector machines
- Cleaning up a messy dataset for data mining

## Getting the data

The data we will use for this chapter is a set of books from Project Gutenberg at www.gutenberg.org, a repository of public domain literary works.

The books I used for these experiments come from a variety of authors:
- Booth Tarkington (22 titles)
- Charles Dickens (44 titles)
- Edith Nesbit (10 titles)
- Arthur Conan Doyle (51 titles)
- Mark Twain (29 titles)
- Sir Richard Francis Burton (11 titles)
- Emile Gaboriau (10 titles)

Overall, there are 177 documents from 7 authors, giving a significant amount of text to work with. A full list of the titles, along with download links and a script to automatically fetch them, is given in the code bundle. We use the requests library to download the files into our data directory. First, set up the data directory and ensure the following code points to it:

In [1]:
import csv
from collections import Counter
from io import BytesIO, StringIO
from pathlib import Path
import requests
import pandas as pd
import re
import os


In [2]:
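import gzip

GUTENBERG_CSV_URL = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv.gz"

r = requests.get(GUTENBERG_CSV_URL)
r.raise_for_status()
raw = r.content
# Decompress only if the payload still starts with the gzip magic bytes
if raw[:2] == b"\x1f\x8b":
    raw = gzip.decompress(raw)
csv_text = raw.decode("utf-8")

print(f"Total size: {len(raw) / 1024**2:0.2f}MB")
print(csv_text[:400])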

Total size: 19.28MB
Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,1971-12-01,The Declaration of Independence of the United States of America,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence",E201; JK,Politics; American Revolutionary War; United States Law; Browsing: History - American; Browsing: His


In [3]:
books_info = pd.DataFrame(csv.DictReader(StringIO(csv_text)))
books_info.head()

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...
1,2,Text,1972-12-01,The United States Bill of Rights\r\nThe Ten Or...,en,United States,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,"Kennedy, John F. (John Fitzgerald), 1917-1963",United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics
3,4,Text,1973-11-01,Lincoln's Gettysburg Address\r\nGiven November...,en,"Lincoln, Abraham, 1809-1865",Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...
4,5,Text,1975-12-01,The United States Constitution,en,United States,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...


In [4]:
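# Drop commas and parentheses so each author entry becomes space-separated tokens
books_info['Authors_parsed'] = (books_info['Authors']
                                .str.replace(',', '', regex=False)
                                .str.replace('(', '', regex=False)
                                .str.replace(')', '', regex=False))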
rel_authors = ['Booth Tarkington', 'Charles Dickens', 'Edith Nesbit', 
    'Arthur Conan Doyle', 'Mark Twain', 'Sir Richard Francis Burton', 
    'Emile Gaboriau']

authors = [tuple(authors.split()) for authors in rel_authors]
authors

[('Booth', 'Tarkington'),
 ('Charles', 'Dickens'),
 ('Edith', 'Nesbit'),
 ('Arthur', 'Conan', 'Doyle'),
 ('Mark', 'Twain'),
 ('Sir', 'Richard', 'Francis', 'Burton'),
 ('Emile', 'Gaboriau')]

In [5]:
authors_to_check = books_info['Authors_parsed'].unique()
authors_to_check = [tuple(authors.split()) for authors in authors_to_check]
authors_to_check[:2]

[('Jefferson', 'Thomas', '1743-1826'), ('United', 'States')]

In [6]:
from collections import defaultdict

# Map each target author to every catalog spelling that contains all of the
# author's name tokens, in any order
authors_to_keep = defaultdict(set)
authors_to_keep_for_df = []
for author_to_check in authors_to_check:
    for author in authors:
        if set(author).issubset(set(author_to_check)):
            authors_to_keep[author].add(' '.join(author_to_check))
            authors_to_keep_for_df.append(' '.join(author_to_check))
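
The matching rule in this cell treats each name as a bag of tokens: a catalog entry is kept when it contains every token of a target author, regardless of order. The following is a minimal standalone sketch of that rule; the helper name `matching_author` is ours, and the sample strings are parsed catalog names that appear in the outputs of this chapter.

```python
# Toy illustration of the token-subset rule used to match catalog names.
wanted = [("Mark", "Twain"), ("Arthur", "Conan", "Doyle")]
catalog_names = [
    "Twain Mark 1835-1910",
    "Doyle Arthur Conan 1859-1930",
    "Jefferson Thomas 1743-1826",
]

def matching_author(catalog_name, wanted_authors):
    """Return the first wanted author whose tokens all occur in catalog_name."""
    tokens = set(catalog_name.split())
    for author in wanted_authors:
        if set(author).issubset(tokens):
            return " ".join(author)
    return None

matches = {name: matching_author(name, wanted) for name in catalog_names}
print(matches)
```

Because the rule ignores token order and any extra tokens, it can also keep entries that merely share all of an author's name tokens, such as co-authored or translated works, so it is worth inspecting the matched names before downloading.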

In [7]:
books_for_training = books_info[books_info['Authors_parsed'].isin(authors_to_keep_for_df)]
book_id = books_for_training.iloc[1]["Text#"]

In [8]:
alternative_author_list = []
for author in authors_to_keep.keys():
    for alternative in authors_to_keep[author]:
        alternative_author_list.append([' '.join(author), alternative])
alternatives = pd.DataFrame(alternative_author_list, columns=['Alternative_names', 'Authors_parsed'])
books_for_training = pd.merge(books_for_training, alternatives, on='Authors_parsed')

In [9]:
books_for_training['Alternative_names'].value_counts(dropna=False)

Mark Twain                    257
Charles Dickens               207
Arthur Conan Doyle            156
Sir Richard Francis Burton     60
Edith Nesbit                   42
Emile Gaboriau                 34
Booth Tarkington               33
Name: Alternative_names, dtype: int64

In [10]:
from sklearn.model_selection import train_test_split

_, books_for_training = train_test_split(books_for_training, test_size=.2, random_state=42)
books_for_training['Alternative_names'].value_counts(dropna=False)

Mark Twain                    47
Charles Dickens               39
Arthur Conan Doyle            38
Sir Richard Francis Burton    15
Edith Nesbit                   7
Emile Gaboriau                 6
Booth Tarkington               6
Name: Alternative_names, dtype: int64

In [11]:
GUTENBERG_ROBOT_URL = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
r = requests.get(GUTENBERG_ROBOT_URL)
# Use the scheme and host of the first .zip link on the harvest page as the mirror
GUTENBERG_MIRROR = re.search(r'(https?://[^/]+)[^"]*\.zip', r.text).group(1)
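
The mirror lookup relies on a regular expression capture group that keeps only the scheme and host of a link ending in `.zip`. Here is a small sketch on a hypothetical HTML fragment (the real harvest page may look different), using a pattern of the same shape with the dot before `zip` escaped:

```python
import re

# Hypothetical fragment of the harvest page; the real page content may differ.
sample_html = '<a href="http://aleph.gutenberg.org/7/70/70.zip">70.zip</a>'

# Group 1 captures the scheme and host; the rest of the link up to ".zip" is discarded.
match = re.search(r'(https?://[^/]+)[^"]*\.zip', sample_html)
mirror = match.group(1) if match else None
print(mirror)
```

Checking `match` before calling `.group(1)` avoids an `AttributeError` if the page ever contains no `.zip` link.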

In [12]:
def gutenberg_text_urls(id, mirror=GUTENBERG_MIRROR, suffixes=("", "-8", "-0")):
    path = "/".join(id[:-1]) or "0"
    return [f"{mirror}/{path}/{id}/{id}{suffix}.zip" for suffix in suffixes]

gutenberg_text_urls(book_id)

['http://aleph.gutenberg.org/7/70/70.zip',
 'http://aleph.gutenberg.org/7/70/70-8.zip',
 'http://aleph.gutenberg.org/7/70/70-0.zip']
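
The directory layout is easiest to see with a few different IDs. The sketch below is a standalone copy of the same URL scheme, renamed `text_urls` so it does not clash with the function above; the mirror host is taken from the sample output and may not be the one you are given. IDs are strings here, as they are when read from the catalog CSV.

```python
# Standalone copy of the URL scheme, for illustration only.
MIRROR = "http://aleph.gutenberg.org"  # host from the sample output; yours may differ

def text_urls(book_id, mirror=MIRROR, suffixes=("", "-8", "-0")):
    # Every digit except the last becomes a directory level;
    # single-digit IDs have no leading digits and live under "0".
    path = "/".join(book_id[:-1]) or "0"
    return [f"{mirror}/{path}/{book_id}/{book_id}{suffix}.zip" for suffix in suffixes]

print(text_urls("1342")[0])
print(text_urls("5")[0])
```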

In [13]:
import logging
import zipfile

def download_gutenberg(id):
    """Return the text of book `id`, or None if no usable zip file is found."""
    for url in gutenberg_text_urls(id):
        r = requests.get(url)
        if r.status_code == 404:
            logging.warning(f"404 for {url}")
            continue
        r.raise_for_status()
        break
    else:
        # Every candidate URL returned 404
        return None

    try:
        z = zipfile.ZipFile(BytesIO(r.content))
    except zipfile.BadZipFile:
        print('file not a zip')
        return None
    if len(z.namelist()) != 1:
        print(f"Expected 1 file in {z.namelist()}")
        return None

    print(f'success with {id}')
    return z.read(z.namelist()[0]).decode('utf-8')

In [14]:
text = download_gutenberg(book_id)
print(text[:1500])



success with 70
﻿The Project Gutenberg eBook of What Is Man? And Other Stories, by Mark Twain (Samuel Clemens)

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: What Is Man? And Other Stories

Author: Mark Twain (Samuel Clemens)

Release Date: June, 1993 [eBook #70]
[Most recently updated: May 26, 2022]

Language: English

Character set encoding: UTF-8

Produced by: An Anonymous Volunteer and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK WHAT IS MAN? AND OTHER STORIES ***




WHAT IS MAN?
AND OTHER ESSAYS

By Mark Twain

(Samuel Langhorne Clemens, 1835-1910)


CONTENTS

 WHAT IS MAN

## Downloading all the files

Now we can download all the files in a simple loop. First, let's create a function that strips the Project Gutenberg header and footer from a text:

In [16]:
GUTENBERG_TEXT = "PROJECT GUTENBERG EBOOK "

def strip_headers(text):
    in_text = False
    output = []
    
    for line in text.splitlines():        
        if GUTENBERG_TEXT in line:
            if not in_text:
                in_text = True
            else:
                break
        else:
            if in_text:
                output.append(line)

    return "\n".join(output).strip()

stripped_text = strip_headers(text)
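
To see what this function keeps and drops, here is a standalone copy applied to a made-up text (the sample book and the name `MARKER` are illustrative). It also shows that a text without a matching marker line yields an empty string, which is worth checking for if a download reports zero characters.

```python
MARKER = "PROJECT GUTENBERG EBOOK "

def strip_headers(text):
    """Keep only the lines between the first and second marker lines."""
    in_text = False
    output = []
    for line in text.splitlines():
        if MARKER in line:
            if in_text:
                break          # second marker: end of the book
            in_text = True     # first marker: start of the book
        elif in_text:
            output.append(line)
    return "\n".join(output).strip()

sample = "\n".join([
    "Title: A Made-Up Book",
    "*** START OF THE PROJECT GUTENBERG EBOOK A MADE-UP BOOK ***",
    "",
    "CHAPTER I",
    "It was a quiet evening.",
    "",
    "*** END OF THE PROJECT GUTENBERG EBOOK A MADE-UP BOOK ***",
    "License text follows here.",
])
body = strip_headers(sample)
no_marker = strip_headers("A file whose header uses different wording.")
print(repr(body))
print(repr(no_marker))
```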

In [17]:
GUTENBERG_TEXT_URL = "https://www.gutenberg.org/ebooks/{id}.txt.utf-8"


def book_text(book_id):
    """Download one book and return its text without the Gutenberg header and footer."""
    r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
    return strip_headers(r.text)

data_path = Path("data/author_texts")
data_path.mkdir(parents=True, exist_ok=True)  # also creates "data" if it is missing

for count, (idx, book) in enumerate(books_for_training.iterrows()):
    if count % 25 == 0:
        print(f'finished {count}/{books_for_training.shape[0]}')
    text_id = book["Text#"]
    text = book_text(text_id)
    print(f"Saving {book['Title']} by {book['Authors_parsed']} containing {len(text):_} characters")
    # Write UTF-8 explicitly so the result does not depend on the platform default
    with open(data_path / (text_id + ".txt"), "wt", encoding="utf-8") as f:
        f.write(text)

finished 0/158
Saving New Treasure Seekers; Or, The Bastable Children in Search of a Fortune by Nesbit E. Edith 1858-1924 containing 367_421 characters
Saving 1601: Conversation as it was by the Social Fireside in the Time of the Tudors by Twain Mark 1835-1910 containing 67_227 characters
Saving Speeches: Literary and Social by Dickens Charles 1812-1870 containing 567_696 characters
Saving The Haunted Man and the Ghost's Bargain by Dickens Charles 1812-1870 containing 186_280 characters
Saving Doctor Marigold by Dickens Charles 1812-1870 containing 63_064 characters
Saving Mudfog and Other Sketches by Dickens Charles 1812-1870 containing 185_802 characters
Saving The Book of the Thousand Nights and a Night — Volume 16 by Burton Richard Francis Sir 1821-1890 [Translator] containing 945_509 characters
Saving The Parasite by Doyle Arthur Conan 1859-1930 containing 0 characters
Saving A Christmas Carol by Dickens Charles 1812-1870 containing 0 characters
Saving On the Decay of the Art of L

In [18]:
data_folder = Path("data/author_texts")

In [19]:
def clean_book(document):
    """Return only the lines between the Gutenberg START and END markers, if present."""
    lines = document.split("\n")
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF THIS PROJECT GUTENBERG"):
            start = i + 1
        elif line.startswith("*** END OF THIS PROJECT GUTENBERG"):
            end = i  # the slice end is exclusive, so the marker line itself is dropped
    return "\n".join(lines[start:end])

In [20]:
import glob

txt_files = glob.glob('data/author_texts/*.txt')
print("Text files:", txt_files)


Text files: ['data/author_texts/8473.txt', 'data/author_texts/65044.txt', 'data/author_texts/922.txt', 'data/author_texts/50162.txt', 'data/author_texts/5785.txt', 'data/author_texts/59813.txt', 'data/author_texts/24026.txt', 'data/author_texts/61751.txt', 'data/author_texts/102.txt', 'data/author_texts/9021.txt', 'data/author_texts/3450.txt', 'data/author_texts/50361.txt', 'data/author_texts/11301.txt', 'data/author_texts/7154.txt', 'data/author_texts/9743.txt', 'data/author_texts/53254.txt', 'data/author_texts/65043.txt', 'data/author_texts/5838.txt', 'data/author_texts/66991.txt', 'data/author_texts/66952.txt', 'data/author_texts/675.txt', 'data/author_texts/18718.txt', 'data/author_texts/3441.txt', 'data/author_texts/8528.txt', 'data/author_texts/61193.txt', 'data/author_texts/66159.txt', 'data/author_texts/50164.txt', 'data/author_texts/7157.txt', 'data/author_texts/51252.txt', 'data/author_texts/17398.txt', 'data/author_texts/139.txt', 'data/author_texts/5813.txt', 'data/author_t

In [21]:
# Encode each author name as an integer class label
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
books_for_training['Alternative_names_code'] = label_encoder.fit_transform(books_for_training['Alternative_names'])
  


In [22]:
import numpy as np

def load_books_data(folder=data_folder):
    """Read every saved book in `folder`; return the documents and their author codes."""
    documents = []
    authors = []
    for filename in glob.glob(str(Path(folder) / '*.txt')):
        text_id = Path(filename).stem  # the file name without ".txt" is the Gutenberg ID
        codes = books_for_training.loc[books_for_training['Text#'] == text_id, 'Alternative_names_code']
        with open(filename, encoding="utf-8") as inf:
            documents.append(clean_book(inf.read()))
        authors.append(int(codes.iloc[0]))
    return documents, np.array(authors, dtype='int')

documents, classes = load_books_data(data_folder)
len(documents), len(classes)

(158, 158)

In [23]:
function_words = ["a", "able", "aboard", "about", "above", "absent",
                  "according" , "accordingly", "across", "after", "against",
                  "ahead", "albeit", "all", "along", "alongside", "although",
                  "am", "amid", "amidst", "among", "amongst", "amount", "an",
                    "and", "another", "anti", "any", "anybody", "anyone",
                    "anything", "are", "around", "as", "aside", "astraddle",
                    "astride", "at", "away", "bar", "barring", "be", "because",
                    "been", "before", "behind", "being", "below", "beneath",
                    "beside", "besides", "better", "between", "beyond", "bit",
                    "both", "but", "by", "can", "certain", "circa", "close",
                    "concerning", "consequently", "considering", "could",
                    "couple", "dare", "deal", "despite", "down", "due", "during",
                    "each", "eight", "eighth", "either", "enough", "every",
                    "everybody", "everyone", "everything", "except", "excepting",
                    "excluding", "failing", "few", "fewer", "fifth", "first",
                    "five", "following", "for", "four", "fourth", "from", "front",
                    "given", "good", "great", "had", "half", "have", "he",
                    "heaps", "hence", "her", "hers", "herself", "him", "himself",
                    "his", "however", "i", "if", "in", "including", "inside",
                    "instead", "into", "is", "it", "its", "itself", "keeping",
                    "lack", "less", "like", "little", "loads", "lots", "majority",
                    "many", "masses", "may", "me", "might", "mine", "minority",
                    "minus", "more", "most", "much", "must", "my", "myself",
                    "near", "need", "neither", "nevertheless", "next", "nine",
                    "ninth", "no", "nobody", "none", "nor", "nothing",
                    "notwithstanding", "number", "numbers", "of", "off", "on",
                    "once", "one", "onto", "opposite", "or", "other", "ought",
                    "our", "ours", "ourselves", "out", "outside", "over", "part",
                    "past", "pending", "per", "pertaining", "place", "plenty",
                    "plethora", "plus", "quantities", "quantity", "quarter",
                    "regarding", "remainder", "respecting", "rest", "round",
                    "save", "saving", "second", "seven", "seventh", "several",
                    "shall", "she", "should", "similar", "since", "six", "sixth",
                    "so", "some", "somebody", "someone", "something", "spite",
                    "such", "ten", "tenth", "than", "thanks", "that", "the",
                    "their", "theirs", "them", "themselves", "then", "thence",
                    "therefore", "these", "they", "third", "this", "those",
                    "though", "three", "through", "throughout", "thru", "thus",
                    "till", "time", "to", "tons", "top", "toward", "towards",
                    "two", "under", "underneath", "unless", "unlike", "until",
                    "unto", "up", "upon", "us", "used", "various", "versus",
                    "via", "view", "wanting", "was", "we", "were", "what",
                    "whatever", "when", "whenever", "where", "whereas",
                    "wherever", "whether", "which", "whichever", "while",
                    "whilst", "who", "whoever", "whole", "whom", "whomever",
                    "whose", "will", "with", "within", "without", "would", "yet",
                    "you", "your", "yours", "yourself", "yourselves"]

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
extractor = CountVectorizer(vocabulary=function_words)
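
With a fixed vocabulary, the extractor turns each document into a vector of counts, one per function word. The sketch below imitates that in pure Python on a made-up sentence with a tiny subset of the list; it is an approximation of the idea, not scikit-learn's implementation.

```python
import re
from collections import Counter

# A tiny subset of the function-word list, for illustration only.
vocabulary = ["the", "of", "and", "to", "a", "which"]

def function_word_counts(text, vocab):
    """Count occurrences of each vocabulary word, in vocabulary order."""
    # Lowercase, then take runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

sentence = "The owner of the house, which stood empty, wanted a tenant and a gardener."
vector = function_word_counts(sentence, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Note that "a" is counted as zero even though it occurs twice: the default token pattern of `CountVectorizer` only keeps tokens of two or more characters, so single-letter entries such as "a" and "i" in the vocabulary never receive counts. If those words matter for your experiment, one option is to pass `token_pattern=r"(?u)\b\w+\b"`, keeping in mind that this will change the scores reported in this chapter.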

In [25]:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline

In [26]:

# Search over the kernel type and the regularization strength C
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = SVC()
grid = GridSearchCV(svc, parameters)

In [27]:
pipeline1 = Pipeline([('feature_extraction', extractor),
                      ('clf', grid)])

In [28]:
from sklearn.metrics import make_scorer, f1_score

# Weighted F1: per-author F1 scores averaged by the number of documents per author
f1_ = make_scorer(f1_score, average='weighted')
scores = cross_val_score(pipeline1, documents, classes, scoring=f1_)




In [29]:
print(np.mean(scores))

0.5066009257894807
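
The score above is a weighted F1: an F1 score is computed for each author separately, and the results are averaged with weights proportional to how many documents each author has. This matters here because the classes are unbalanced. A small hand-rolled sketch on made-up labels (the function `weighted_f1` is ours, written to illustrate the arithmetic):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)

# Made-up labels: three documents by author 0, two by author 1, one by author 2.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(f"{score:.3f}")
```

Here authors 0 and 1 each get an F1 of 0.8 and author 2 gets 1.0, so the weighted average is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6 ≈ 0.833.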


In [30]:
pipeline = Pipeline([('feature_extraction',
                      CountVectorizer(analyzer='char', ngram_range=(3, 3))),
                      ('classifier', grid)])
scores = cross_val_score(pipeline, documents, classes, scoring=f1_)
print(f"Score: {np.mean(scores):.3f}")



Score: 0.632
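
Character n-grams are overlapping windows of n characters, including spaces and punctuation. The sketch below counts character trigrams in pure Python to show what the features look like; it only approximates what `CountVectorizer(analyzer='char', ngram_range=(3, 3))` does (lowercasing and collapsing whitespace), and the function name `char_ngrams` is ours.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("The theme")
print(counts["the"], counts[" th"], sum(counts.values()))
```

The nine characters of "the theme" produce seven trigrams, with "the" appearing twice. Unlike function words, these features need no predefined vocabulary, at the cost of a much larger feature space.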
