# Authorship Attribution

Authorship analysis is predominantly a text mining task that aims to identify certain aspects of an author based only on the content of their writing. This could include characteristics such as age, gender, or background. In the specific authorship attribution task, we aim to identify which of a set of authors wrote a particular document. This is a classic classification task. In many ways, authorship analysis tasks are performed using standard data mining methodologies, such as cross-validation, feature extraction, and classification algorithms.

In this chapter, we will use the problem of authorship attribution to piece together the parts of the data mining methodology we developed in the previous chapters. We identify the problem and discuss its background. This lets us choose which features to extract, and we will build a pipeline to extract them. We will test two different types of features: function words and character n-grams. Finally, we will perform an in-depth analysis of the results. We will work with a book dataset first, and then with a very messy real-world corpus of e-mails.

The topics we will cover in this chapter are as follows:
- Feature engineering and how the features differ based on application
- Revisiting the bag-of-words model with a specific goal in mind
- Feature types and the character n-grams model
- Support vector machines
- Cleaning up a messy dataset for data mining
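To make the two feature types concrete before we start, here is a minimal, dependency-free sketch. The helper names `function_word_counts` and `char_ngrams`, the six-word vocabulary, and the sample sentence are assumptions made for illustration only; later in the chapter both feature types are extracted with scikit-learn's `CountVectorizer`.

```python
from collections import Counter
import re

# A tiny, illustrative subset of function words (an assumption for this sketch).
SAMPLE_FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in"]


def function_word_counts(text, vocabulary=SAMPLE_FUNCTION_WORDS):
    """Count how often each word in `vocabulary` occurs in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in vocabulary}


def char_ngrams(text, n=3):
    """Count every overlapping run of `n` characters, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


sample = "The cat sat in the hat and looked to the door."
word_features = function_word_counts(sample)
ngram_features = char_ngrams(sample, n=3)
print(word_features)
print(ngram_features.most_common(3))
```

Function-word counts ignore what a text is about and keep only how often common grammatical words appear, while character n-grams also pick up punctuation, spacing, and spelling habits.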

## Getting the data

The data we will use for this chapter is a set of books from Project Gutenberg (www.gutenberg.org), a repository of public-domain literary works.

The books I used for these experiments come from a variety of authors:
- Booth Tarkington (22 titles)
- Charles Dickens (44 titles)
- Edith Nesbit (10 titles)
- Arthur Conan Doyle (51 titles)
- Mark Twain (29 titles)
- Sir Richard Francis Burton (11 titles)
- Emile Gaboriau (10 titles)

Overall, there are 177 documents from 7 authors, giving a significant amount of text to work with. A full list of the titles, along with download links and a script to automatically fetch them, is given in the code bundle. To download these books, we use the requests library to download the files into our data directory. First, set up the data directory and ensure the following code links to it:

In [1]:
import csv
from collections import Counter
from io import BytesIO
from pathlib import Path
import requests

In [2]:
GUTENBERG_CSV_URL = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv.gz"

r = requests.get(GUTENBERG_CSV_URL)
csv_text = r.content.decode("utf-8")

f"Total size: {len(r.content) / 1024**2:0.2f}MB"

'Total size: 19.28MB'

In [3]:
import requests

url = "https://www.gutenberg.org/feeds/catalog.rdf.bz2"
response = requests.get(url)
with open("catalog.rdf.bz2", "wb") as f:
    f.write(response.content)


In [4]:
print(csv_text[:400])

Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,1971-12-01,The Declaration of Independence of the United States of America,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence",E201; JK,Politics; American Revolutionary War; United States Law; Browsing: History - American; Browsing: His


In [5]:
import csv
from io import StringIO
import pandas as pd

books_info = pd.DataFrame(csv.DictReader(StringIO(csv_text)))
books_info.head()

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...
1,2,Text,1972-12-01,The United States Bill of Rights\r\nThe Ten Or...,en,United States,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,"Kennedy, John F. (John Fitzgerald), 1917-1963",United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics
3,4,Text,1973-11-01,Lincoln's Gettysburg Address\r\nGiven November...,en,"Lincoln, Abraham, 1809-1865",Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...
4,5,Text,1975-12-01,The United States Constitution,en,United States,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...


In [6]:
books_info.shape

(74652, 9)

In [7]:
len(books_info['Authors'].unique())

35807

In [None]:
# Strip commas and parentheses so each author entry can be split into name parts
books_info['Authors_parsed'] = books_info['Authors'].apply(
    lambda x: x.replace(',', '').replace('(', '').replace(')', ''))
books_info.head()

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves,Authors_parsed
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,Jefferson Thomas 1743-1826
1,2,Text,1972-12-01,The United States Bill of Rights\r\nThe Ten Or...,en,United States,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,United States
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,"Kennedy, John F. (John Fitzgerald), 1917-1963",United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,Kennedy John F. John Fitzgerald 1917-1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address\r\nGiven November...,en,"Lincoln, Abraham, 1809-1865",Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,Lincoln Abraham 1809-1865
4,5,Text,1975-12-01,The United States Constitution,en,United States,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,United States


In [9]:
rel_authors = ['Booth Tarkington', 'Charles Dickens', 'Edith Nesbit', 
    'Arthur Conan Doyle', 'Mark Twain', 'Sir Richard Francis Burton', 
    'Emile Gaboriau']

authors = [tuple(authors.split()) for authors in rel_authors]
authors

[('Booth', 'Tarkington'),
 ('Charles', 'Dickens'),
 ('Edith', 'Nesbit'),
 ('Arthur', 'Conan', 'Doyle'),
 ('Mark', 'Twain'),
 ('Sir', 'Richard', 'Francis', 'Burton'),
 ('Emile', 'Gaboriau')]

In [10]:
authors_to_check = books_info['Authors_parsed'].unique()
authors_to_check = [tuple(authors.split()) for authors in authors_to_check]
authors_to_check[:2]

[('Jefferson', 'Thomas', '1743-1826'), ('United', 'States')]

In [11]:
from collections import defaultdict

authors_to_keep = defaultdict(set)
authors_to_keep_for_df = []
for author_to_check in authors_to_check:
    for author in authors:
        if set(author).issubset(set(author_to_check)):
            authors_to_keep[author].add(' '.join(author_to_check))
            authors_to_keep_for_df.append( ' '.join(author_to_check))


authors_to_keep[('Edith', 'Nesbit')]

{'Bland Rosamund E. Nesbit Rosamund Edith Nesbit 1886-1950; Hardy E. Stuart 1865-1935 [Illustrator]',
 'Nesbit E. Edith 1858-1924',
 'Nesbit E. Edith 1858-1924 [Editor]',
 'Nesbit E. Edith 1858-1924; Barraud George [Illustrator]',
 'Nesbit E. Edith 1858-1924; Birch Reginald Bathurst 1856-1943 [Illustrator]',
 'Nesbit E. Edith 1858-1924; Bland Hubert 1856-1914',
 'Nesbit E. Edith 1858-1924; Bowley May 1864-1960 [Illustrator]; Brundage Frances 1854-1937 [Illustrator]',
 'Nesbit E. Edith 1858-1924; Brock C. E. Charles Edmund 1870-1938 [Illustrator]; Millar H. R. Harold Robert 1869-1942 [Illustrator]',
 'Nesbit E. Edith 1858-1924; Brooke Caris 1844-1899; Smith H. Bellingham [Illustrator]',
 'Nesbit E. Edith 1858-1924; Fell Herbert Granville 1872-1951 [Illustrator]; Millar H. R. Harold Robert 1869-1942 [Illustrator]',
 'Nesbit E. Edith 1858-1924; Kemp-Welch Lucy 1869-1958 [Illustrator]',
 'Nesbit E. Edith 1858-1924; Millar H. R. Harold Robert 1869-1942 [Illustrator]',
 'Nesbit E. Edith 1858
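The subset test above also accepts entries where the target is not the main author: the first entry in this output belongs to Rosamund Edith Nesbit Bland, yet it contains both `Edith` and `Nesbit`. One possible tightening, sketched below, looks only at the first listed author and requires the surname to come first. The helper `matches_primary_author` is hypothetical, and the surname-first layout is an assumption based on the entries shown above.

```python
def matches_primary_author(name_parts, catalog_entry):
    """Return True if the first listed author in `catalog_entry` matches.

    `name_parts` is a tuple such as ('Edith', 'Nesbit'); `catalog_entry` is an
    'Authors_parsed' style string with co-authors separated by ';'.
    """
    primary = catalog_entry.split(";")[0].split()
    if not primary:
        return False
    # Assumption: catalog entries list the surname first; the query lists it last.
    surname_matches = primary[0] == name_parts[-1]
    return surname_matches and set(name_parts).issubset(primary)


entries = [
    "Nesbit E. Edith 1858-1924",
    "Nesbit E. Edith 1858-1924; Bland Hubert 1856-1914",
    "Bland Rosamund E. Nesbit Rosamund Edith Nesbit 1886-1950; "
    "Hardy E. Stuart 1865-1935 [Illustrator]",
]
results = [matches_primary_author(("Edith", "Nesbit"), entry) for entry in entries]
for entry, matched in zip(entries, results):
    print(matched, "-", entry)
```

The second entry still matches even though it is a co-authored work, so whether to keep such titles remains a separate decision.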

In [12]:
len(authors_to_keep_for_df)

167

In [13]:
books_for_training = books_info[books_info['Authors_parsed'].isin(authors_to_keep_for_df)]
books_for_training.shape

(787, 10)

In [14]:
books_for_training.head()

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves,Authors_parsed
45,46,Text,2004-08-11,A Christmas Carol in Prose; Being a Ghost Stor...,en,"Dickens, Charles, 1812-1870; Leech, John, 1817...",Christmas stories; London (England) -- Fiction...,PR,Children's Literature; Christmas; Browsing: Ch...,Dickens Charles 1812-1870; Leech John 1817-186...
69,70,Text,2004-09-13,What Is Man? and Other Essays,en,"Twain, Mark, 1835-1910",American essays,PS,Browsing: History - American; Browsing: Litera...,Twain Mark 1835-1910
73,74,Text,2004-07-01,"The Adventures of Tom Sawyer, Complete",en,"Twain, Mark, 1835-1910",Humorous stories; Bildungsromans; Boys -- Fict...,PS,Banned Books from Anne Haight's list; Banned B...,Twain Mark 1835-1910
75,76,Text,2004-06-29,Adventures of Huckleberry Finn,en,"Twain, Mark, 1835-1910; Kemble, E. W. (Edward ...",Humorous stories; Bildungsromans; Boys -- Fict...,PS,Best Books Ever Listings; Banned Books from An...,Twain Mark 1835-1910; Kemble E. W. Edward Wind...
85,86,Text,2004-07-07,A Connecticut Yankee in King Arthur's Court,en,"Twain, Mark, 1835-1910",Fantasy fiction; Satire; Knights and knighthoo...,PS,Precursors of Science Fiction; Arthurian Legen...,Twain Mark 1835-1910


In [15]:
GUTENBERG_TEXT_URL = "https://www.gutenberg.org/ebooks/{id}.txt.utf-8"

book_id = books_for_training.iloc[0]["Text#"]

# r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
# text = r.text

GUTENBERG_ROBOT_URL = "https://www.gutenberg.org/robot/harvest?filetypes[]=txt"
r = requests.get(GUTENBERG_ROBOT_URL)

print(r.text[:1000])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>All Files (offset: 0, filetypes: txt) - Project Gutenberg</title>
  </head>
  <body>
    <h1>All Files (offset: 0, filetypes: txt)</h1>    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12372/12372-8.zip">http://aleph.gutenberg.org/1/2/3/7/12372/12372-8.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12372/12372.zip">http://aleph.gutenberg.org/1/2/3/7/12372/12372.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12373/12373-8.zip">http://aleph.gutenberg.org/1/2/3/7/12373/12373-8.zip</a></p>

    <p><a hr

In [16]:
import re

# Capture the scheme and host of the first .zip link (the dot is escaped to match literally)
GUTENBERG_MIRROR = re.search(r'(https?://[^/]+)[^"]*\.zip', r.text).group(1)
GUTENBERG_MIRROR

'http://aleph.gutenberg.org'

In [17]:
def gutenberg_text_urls(id: str, mirror=GUTENBERG_MIRROR, suffixes=("", "-8", "-0")):
    path = "/".join(id[:-1]) or "0"
    return [f"{mirror}/{path}/{id}/{id}{suffix}.zip" for suffix in suffixes]

gutenberg_text_urls(book_id) 

['http://aleph.gutenberg.org/4/46/46.zip',
 'http://aleph.gutenberg.org/4/46/46-8.zip',
 'http://aleph.gutenberg.org/4/46/46-0.zip']

In [18]:
import logging
import zipfile

def download_gutenberg(id):
    for url in gutenberg_text_urls(id):
        r = requests.get(url)
        if r.status_code == 404:
            logging.warning(f"404 for {url}")
            continue
        r.raise_for_status()
        break
    
    z = zipfile.ZipFile(BytesIO(r.content))
    
    if len(z.namelist()) != 1:
        raise Exception(f"Expected 1 file in {z.namelist()}")
        
    return z.read(z.namelist()[0]).decode('utf-8')

In [19]:
text = download_gutenberg(book_id)

print(text[:1500])



BadZipFile: File is not a zip file
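The notebook does not record why this call failed. One way to get a more informative message is to check the payload before handing it to `zipfile.ZipFile`. The sketch below is an assumption-laden illustration rather than a fix for the cell above: `read_single_text_from_zip` is a hypothetical helper, and the in-memory archive stands in for a real download.

```python
import zipfile
from io import BytesIO


def read_single_text_from_zip(payload, encoding="utf-8"):
    """Return the text of the only file in a zip archive held in `payload`.

    Raises ValueError with a readable message when `payload` is not a zip
    archive (for example, an HTML error page) or holds several files.
    """
    buffer = BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        preview = payload[:60]
        raise ValueError(f"Response is not a zip archive; starts with {preview!r}")
    with zipfile.ZipFile(buffer) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError(f"Expected 1 file in {names}")
        return archive.read(names[0]).decode(encoding)


# Build a small archive in memory to exercise the helper without the network.
demo = BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("46.txt", "MARLEY was dead: to begin with.")
demo_text = read_single_text_from_zip(demo.getvalue())
print(demo_text)
```

Printing the first bytes of a rejected payload usually shows at a glance whether the server returned an HTML page instead of an archive.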

In [None]:
import os
import sys
data_folder = os.path.join(os.path.expanduser("~"), "data", "books")

In [None]:
from nltk.corpus import gutenberg

# List of available books
print(gutenberg.fileids())

# Access a specific book
book_id = "austen-persuasion.txt"
text = gutenberg.raw(book_id)

# Print the length of the book in characters
print(len(text))


In [None]:
# %load getdata.py
# Downloads the books and stores them in the below folder
import os
from time import sleep
import urllib.request

titles = {}


titles['burton'] = [4657, 2400, 5760, 6036, 7111, 8821,
                    18506, 4658, 5761, 6886, 7113]
titles['dickens'] = [24022, 1392, 1414, 1467, 2324, 580,
                     786, 888, 963, 27924, 1394, 1415, 15618,
                     25985, 588, 807, 914, 967, 30127, 1400,
                     1421, 16023, 28198, 644, 809, 917, 968, 1023,
                     1406, 1422, 17879, 30368, 675, 810, 924, 98,
                     1289, 1413, 1423, 17880, 32241, 699, 821, 927]
titles['doyle'] = [2349, 11656, 1644, 22357, 2347, 290, 34627, 5148,
                   8394, 26153, 12555, 1661, 23059, 2348, 294, 355,
                   5260, 8727, 10446, 126, 17398, 2343, 2350, 3070,
                   356, 5317, 903, 10581, 13152, 2038, 2344, 244, 32536,
                   423, 537, 108, 139, 2097, 2345, 24951, 32777, 4295,
                   7964, 11413, 1638, 21768, 2346, 2845, 3289, 439, 834]
titles['gaboriau'] = [1748, 1651, 2736, 3336, 4604, 4002, 2451,
                      305, 3802, 547]
titles['nesbit'] = [34219, 23661, 28804, 4378, 778, 20404, 28725,
                    33028, 4513, 794]
titles['tarkington'] = [1098, 15855, 1983, 297, 402, 5798,
                        8740, 980, 1158, 1611, 2326, 30092,
                        483, 5949, 8867, 13275, 18259, 2595,
                        3428, 5756, 6401, 9659]
titles['twain'] = [1044, 1213, 245, 30092, 3176, 3179, 3183, 3189, 74,
                   86, 1086, 142, 2572, 3173, 3177, 3180, 3186, 3192,
                   76, 91, 119, 1837, 2895, 3174, 3178, 3181, 3187, 3432,
                   8525]



assert len(titles) == 7
assert len(titles['tarkington']) == 22
assert len(titles['dickens']) == 44
assert len(titles['nesbit']) == 10
assert len(titles['doyle']) == 51
assert len(titles['twain']) == 29
assert len(titles['burton']) == 11
assert len(titles['gaboriau']) == 10


url_base = "https://www.gutenberg.myebook.bg/"
url_format = "{url_base}{idstring}/{id}/{id}.txt"

fixes = {}
fixes[1044] = url_base + "1/0/4/1044/1044-0.txt"
fixes[5148] = url_base + "5/1/4/5148/5148-0.txt"
fixes[4657] = "https://archive.org/stream/personalnarrativ04657gut/pnpa110.txt"
fixes[1467] = "https://archive.org/stream/somechristmassto01467gut/cdscs10p_djvu.txt"

# Make parent folder if not exists
data_folder = "books"
if not os.path.exists(data_folder):
    os.makedirs(data_folder)

for author in titles:
    print(f"Downloading titles from {author}")
    # Make author's folder if not exists
    author_folder = os.path.join(data_folder, author)
    if not os.path.exists(author_folder):
        os.makedirs(author_folder)
    
    # Download each title to this folder
    for bookid in titles[author]:
        try:
            print(f" - Getting book with id {bookid}")
            idstring = "/".join([str(bookid)[i] for i in range(len(str(bookid))-1)])
            url = url_format.format(url_base=url_base, idstring=idstring, id=bookid)
            
            filename = os.path.join(author_folder, f"{bookid}.txt")
            if os.path.exists(filename):
                print(f" - File already exists, skipping")
            else:
                urllib.request.urlretrieve(url, filename)
                sleep(60*5)
        except Exception as e:
            print(f"Error downloading book {bookid}: {str(e)}")

print("Download complete")

In [None]:
def clean_book(document):
    """Strip the Project Gutenberg header and footer from a book's text."""
    lines = document.split("\n")
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF THIS PROJECT GUTENBERG"):
            start = i + 1
        elif line.startswith("*** END OF THIS PROJECT GUTENBERG"):
            # The slice below already excludes index `end`, so this drops only the marker line
            end = i
    return "\n".join(lines[start:end])
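The marker lines are not worded identically in every Project Gutenberg file; some use `THE` where others use `THIS`. A regular expression can accept both. The following is a sketch under that assumption: `clean_book_tolerant`, the `MARKER` pattern, and the sample text are illustrative and should be checked against the files you actually download.

```python
import re

# Accepts "*** START OF THIS PROJECT GUTENBERG ..." as well as
# "*** START OF THE PROJECT GUTENBERG ..." (and the END equivalents).
MARKER = re.compile(r"^\*\*\*\s*(START|END) OF (?:THIS|THE) PROJECT GUTENBERG",
                    re.IGNORECASE)


def clean_book_tolerant(document):
    """Return the lines between the START and END markers, if present."""
    lines = document.split("\n")
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        match = MARKER.match(line)
        if match is None:
            continue
        if match.group(1).upper() == "START":
            start = i + 1
        else:
            end = i
    return "\n".join(lines[start:end])


sample = "\n".join([
    "Header and license text",
    "*** START OF THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL ***",
    "MARLEY was dead: to begin with.",
    "*** END OF THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL ***",
    "Footer and license text",
])
cleaned = clean_book_tolerant(sample)
print(cleaned)
```

If neither marker is found, the function returns the document unchanged, which makes missing markers easy to spot when inspecting the loaded data.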

In [None]:
import numpy as np

def load_books_data(folder=data_folder):
    """Load every book under `folder`, using one subfolder per author."""
    documents = []
    authors = []
    # Sort the subfolders so each author always receives the same class number
    subfolders = sorted(subfolder for subfolder in os.listdir(folder)
                        if os.path.isdir(os.path.join(folder, subfolder)))
    for author_number, subfolder in enumerate(subfolders):
        full_subfolder_path = os.path.join(folder, subfolder)
        for document_name in os.listdir(full_subfolder_path):
            with open(os.path.join(full_subfolder_path, document_name)) as inf:
                documents.append(clean_book(inf.read()))
                authors.append(author_number)
    return documents, np.array(authors, dtype='int')

In [None]:
documents, classes = load_books_data(data_folder)

In [None]:
function_words = ["a", "able", "aboard", "about", "above", "absent",
                  "according", "accordingly", "across", "after", "against",
                  "ahead", "albeit", "all", "along", "alongside", "although",
                  "am", "amid", "amidst", "among", "amongst", "amount", "an",
                  "and", "another", "anti", "any", "anybody", "anyone",
                  "anything", "are", "around", "as", "aside", "astraddle",
                  "astride", "at", "away", "bar", "barring", "be", "because",
                  "been", "before", "behind", "being", "below", "beneath",
                  "beside", "besides", "better", "between", "beyond", "bit",
                  "both", "but", "by", "can", "certain", "circa", "close",
                  "concerning", "consequently", "considering", "could",
                  "couple", "dare", "deal", "despite", "down", "due", "during",
                  "each", "eight", "eighth", "either", "enough", "every",
                  "everybody", "everyone", "everything", "except", "excepting",
                  "excluding", "failing", "few", "fewer", "fifth", "first",
                  "five", "following", "for", "four", "fourth", "from", "front",
                  "given", "good", "great", "had", "half", "have", "he",
                  "heaps", "hence", "her", "hers", "herself", "him", "himself",
                  "his", "however", "i", "if", "in", "including", "inside",
                  "instead", "into", "is", "it", "its", "itself", "keeping",
                  "lack", "less", "like", "little", "loads", "lots", "majority",
                  "many", "masses", "may", "me", "might", "mine", "minority",
                  "minus", "more", "most", "much", "must", "my", "myself",
                  "near", "need", "neither", "nevertheless", "next", "nine",
                  "ninth", "no", "nobody", "none", "nor", "nothing",
                  "notwithstanding", "number", "numbers", "of", "off", "on",
                  "once", "one", "onto", "opposite", "or", "other", "ought",
                  "our", "ours", "ourselves", "out", "outside", "over", "part",
                  "past", "pending", "per", "pertaining", "place", "plenty",
                  "plethora", "plus", "quantities", "quantity", "quarter",
                  "regarding", "remainder", "respecting", "rest", "round",
                  "save", "saving", "second", "seven", "seventh", "several",
                  "shall", "she", "should", "similar", "since", "six", "sixth",
                  "so", "some", "somebody", "someone", "something", "spite",
                  "such", "ten", "tenth", "than", "thanks", "that", "the",
                  "their", "theirs", "them", "themselves", "then", "thence",
                  "therefore", "these", "they", "third", "this", "those",
                  "though", "three", "through", "throughout", "thru", "thus",
                  "till", "time", "to", "tons", "top", "toward", "towards",
                  "two", "under", "underneath", "unless", "unlike", "until",
                  "unto", "up", "upon", "us", "used", "various", "versus",
                  "via", "view", "wanting", "was", "we", "were", "what",
                  "whatever", "when", "whenever", "where", "whereas",
                  "wherever", "whether", "which", "whichever", "while",
                  "whilst", "who", "whoever", "whole", "whom", "whomever",
                  "whose", "will", "with", "within", "without", "would", "yet",
                  "you", "your", "yours", "yourself", "yourselves"]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
extractor = CountVectorizer(vocabulary=function_words)

In [None]:

from sklearn.svm import SVC
# GridSearchCV lives in sklearn.model_selection; the old grid_search module was removed
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

In [None]:

# Search over the SVM kernel and the regularization strength C
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = SVC()
grid = GridSearchCV(svr, parameters)

In [None]:
pipeline1 = Pipeline([('feature_extraction', extractor),
                      ('clf', grid)])

In [None]:
# With more than two authors, F1 needs an averaging strategy; 'f1_macro' is one choice
scores = cross_val_score(pipeline1, documents, classes,
                         scoring='f1_macro')

In [None]:
print(np.mean(scores))

In [None]:
pipeline = Pipeline([('feature_extraction',
                      CountVectorizer(analyzer='char', ngram_range=(3, 3))),
                     ('classifier', grid)
                     ])
# With more than two authors, F1 needs an averaging strategy; 'f1_macro' is one choice
scores = cross_val_score(pipeline, documents, classes,
                         scoring='f1_macro')
print("Score: {:.3f}".format(np.mean(scores)))
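With seven authors, a single F1 value has to be an average over classes. Macro averaging, which scikit-learn exposes as the `f1_macro` scoring string, computes F1 for each class separately and takes the unweighted mean. The dependency-free sketch below shows the arithmetic; the function names and the toy labels are assumptions for illustration only.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels for three authors (made up for this example).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Because every class counts equally, an author with only 10 books influences the macro score as much as one with 51.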

In [None]:
enron_data_folder = os.path.join(os.path.expanduser("~"), "Data",
                                 "enron_mail_20110402", "maildir")

In [None]:

from email.parser import Parser
p = Parser()

In [None]:
from sklearn.utils import check_random_state

In [None]:
def get_enron_corpus(num_authors=10, data_folder=enron_data_folder,
                     min_docs_author=10, max_docs_author=100,
                     random_state=None):
    random_state = check_random_state(random_state)
    # Sort before shuffling so a given random_state always selects the same authors
    email_addresses = sorted(os.listdir(data_folder))
    random_state.shuffle(email_addresses)
    documents = []
    classes = []
    author_num = 0
    authors = {}
    for user in email_addresses:
        users_email_folder = os.path.join(data_folder, user)
        # Only the "sent" folders hold e-mails written by this user
        mail_folders = [os.path.join(users_email_folder, subfolder)
                        for subfolder in os.listdir(users_email_folder)
                        if "sent" in subfolder]
        try:
            authored_emails = [open(os.path.join(mail_folder, email_filename), encoding='cp1252').read()
                               for mail_folder in mail_folders
                               for email_filename in os.listdir(mail_folder)]
        except IsADirectoryError:
            # Skip users whose sent folders contain nested directories
            continue
        if len(authored_emails) < min_docs_author:
            continue
        if len(authored_emails) > max_docs_author:
            authored_emails = authored_emails[:max_docs_author]
        # Use the public get_payload() accessor rather than the private _payload attribute
        contents = [p.parsestr(email).get_payload() for email in authored_emails]
        documents.extend(contents)
        classes.extend([author_num] * len(authored_emails))
        authors[user] = author_num
        author_num += 1
        if author_num >= num_authors or author_num >= len(email_addresses):
            break
    return documents, np.array(classes), authors

In [None]:
documents, classes, authors = get_enron_corpus(data_folder=enron_data_folder, random_state=14)

In [None]:
documents[100]

In [None]:
import quotequail

In [None]:

def remove_replies(email_contents):
    r = quotequail.unwrap(email_contents)
    if r is None:
        return email_contents
    if 'text_top' in r:
        return r['text_top']
    elif 'text' in r:
        return r['text']
    return email_contents

In [None]:
documents = [remove_replies(document) for document in documents]

In [None]:
# With more than two authors, F1 needs an averaging strategy; 'f1_macro' is one choice
scores = cross_val_score(pipeline, documents, classes, scoring='f1_macro')
print("Score: {:.3f}".format(np.mean(scores)))

In [None]:
from sklearn.model_selection import train_test_split
training_documents, testing_documents, y_train, y_test = train_test_split(documents, classes, random_state=14)

In [None]:
pipeline.fit(training_documents, y_train)
y_pred = pipeline.predict(testing_documents)

In [None]:
print(pipeline.named_steps['classifier'].best_params_)

In [None]:
from sklearn.metrics import confusion_matrix
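# confusion_matrix expects the true labels first, so rows are actual authors
cm = confusion_matrix(y_test, y_pred)
# Normalize each row by its total; keepdims makes the division broadcast row-wise
cm = cm / cm.sum(axis=1, keepdims=True)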
sorted_authors = sorted(authors.keys(), key=lambda x:authors[x])
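Row normalization divides each row of the confusion matrix by that row's total, so that when rows are the actual authors, the diagonal reads as the fraction of each author's documents that were attributed correctly. The sketch below shows the arithmetic without NumPy; `normalize_rows` and the counts are made up for illustration.

```python
def normalize_rows(matrix):
    """Divide each row by its own total so that every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Rows are actual authors, columns are predicted authors (made-up counts).
counts = [
    [8, 2, 0],
    [1, 3, 0],
    [0, 5, 5],
]
normalized_counts = normalize_rows(counts)
for row in normalized_counts:
    print([round(value, 2) for value in row])
```

Here the third author is recognized only half of the time, which raw counts can hide when the authors have very different numbers of documents.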

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.figure(figsize=(30, 30))
plt.imshow(cm, cmap='Blues')
tick_marks = np.arange(len( sorted_authors ))
plt.xticks(tick_marks, sorted_authors )
plt.yticks(tick_marks, sorted_authors )
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()