# Authorship attribution

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

## Dataset
Instructions on how to get the "Fifty Victorian Era Novelists Authorship Attribution Data" dataset can be found at:
https://github.com/agungor2/Authorship_Attribution

In [2]:
# error_bad_lines was removed in pandas 2.0; malformed rows still raise an error by default
df = pd.read_csv("./data/Gungor_2018_VictorianAuthorAttribution_data-train.csv", encoding="ISO-8859-1")

In [3]:
df

Unnamed: 0,text,author
0,ou have time to listen i will give you the ent...,1
1,wish for solitude he was twenty years of age a...,1
2,and the skirt blew in perfect freedom about th...,1
3,of san and the rows of shops opposite impresse...,1
4,an hour s walk was as tiresome as three in a s...,1
...,...,...
53673,after surrounding and searching the whole plac...,50
53674,giant who could make a young earthquake or a w...,50
53675,waters of the lake at the bottom of the hill c...,50
53676,fingers and thumb in it exactly as it came out...,50


In [4]:
df['text']

0        ou have time to listen i will give you the ent...
1        wish for solitude he was twenty years of age a...
2        and the skirt blew in perfect freedom about th...
3        of san and the rows of shops opposite impresse...
4        an hour s walk was as tiresome as three in a s...
                               ...                        
53673    after surrounding and searching the whole plac...
53674    giant who could make a young earthquake or a w...
53675    waters of the lake at the bottom of the hill c...
53676    fingers and thumb in it exactly as it came out...
53677    giant s sake he won t meet with for if he does...
Name: text, Length: 53678, dtype: object

### Splitting the dataset into random train and test subsets

In [5]:
X = df['text']
y = df['author']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
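
As a rough illustration of what a random split does, the sketch below shuffles the row indices with a fixed seed and cuts off the test fraction. The helper `simple_split` is hypothetical and only mimics the idea; it will not select the same rows as `train_test_split`.

```python
import math
import random


def simple_split(items, labels, test_size=0.33, seed=0):
    """Shuffle indices with a fixed seed, then cut off the test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)  # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


texts = [f"text {i}" for i in range(10)]
authors = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_split(texts, authors)
print(len(X_tr), len(X_te))
```

With 53678 rows and `test_size=0.33`, rounding the test share up gives 17714 test rows and 35964 training rows, which are the lengths shown in this notebook.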

In [6]:
X_train

42944    fresh and we go down below and drive very fast...
17703    stop here not stop here â eh eh we must tell â...
13133    bowl you shall know â mrs told me and how did ...
33289    tell you more than my lips could possibly do s...
13992    i when he s all o thee i can get sight on come...
                               ...                        
45891    from belonged to the and order of and they pro...
52416    there were little knots of authors who lived w...
42613    i felt myself that if this unhappy animal had ...
43567    she tow iâ she tow i yon s her night signal sh...
2732     foot on the yes said i while jim burst out lau...
Name: text, Length: 35964, dtype: object

In [7]:
y_train

42944    39
17703    15
13133    12
33289    30
13992    14
         ..
45891    42
52416    48
42613    39
43567    39
2732      4
Name: author, Length: 35964, dtype: int64

### Turning the text content into numerical feature vectors
(The Bag of Words representation)

A corpus of documents can be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.  
More information: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

CountVectorizer converts a collection of text documents to a matrix of token counts.
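
To make the matrix layout concrete, here is a minimal standard-library sketch of a bag-of-words representation for two toy sentences. The tokenizer (lowercasing, tokens of two or more word characters) and the alphabetically sorted vocabulary are assumptions chosen to roughly mirror CountVectorizer's defaults; the names `tokenize`, `vocabulary` and `counts` are illustrative only.

```python
import re
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# Count matrix: one row per document, one column per vocabulary entry.
counts = []
for doc in docs:
    freq = Counter(tokenize(doc))
    counts.append([freq.get(tok, 0) for tok in all_tokens])

print(vocabulary)
for row in counts:
    print(row)
```

Each row has as many columns as there are distinct tokens, so most entries are zero for a real corpus. This is why scikit-learn returns a sparse matrix.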

In [8]:
%%time
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

Wall time: 29.4 s


(35964, 9972)

`vocabulary_` maps each word to its column index in the count matrix; that column holds the word's counts across the training corpus.

In [9]:
print(count_vect.vocabulary_.get('author'))

635


Some improvements:

1. Divide the number of occurrences of each word in a document by the total number of words in that document. Longer documents otherwise have higher average counts than shorter ones, even when they cover the same topics. The result is **tf**.

2. Downscale the weights of words that occur in many documents in the corpus, since they are less informative than words that occur in only a small portion of it. The result is **tf–idf**.

**tf** - Term Frequencies  
**tf–idf** - Term Frequency times Inverse Document Frequency

TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation.

In [10]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(35964, 9972)

## Naive Bayes
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
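
The idea behind the classifier can be sketched in a few lines of plain Python. This is a simplified illustration with additive smoothing (`alpha=1.0`), not scikit-learn's implementation; the function names and the toy features are hypothetical.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(X, y, alpha=1.0):
    """Return log priors and smoothed log likelihoods per class."""
    n_features = len(X[0])
    class_rows = defaultdict(list)
    for row, label in zip(X, y):
        class_rows[label].append(row)

    log_prior, log_likelihood = {}, {}
    for label, rows in class_rows.items():
        log_prior[label] = math.log(len(rows) / len(X))
        # Sum feature values over the class; works for counts or tf-idf weights.
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_likelihood[label] = [math.log(t / denom) for t in totals]
    return log_prior, log_likelihood


def predict_multinomial_nb(model, row):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    log_prior, log_likelihood = model
    scores = {
        label: log_prior[label]
        + sum(v * w for v, w in zip(row, log_likelihood[label]))
        for label in log_prior
    }
    return max(scores, key=scores.get)


# Toy data: the columns could be counts of three marker words.
X = [[3, 1, 0], [2, 2, 0], [0, 1, 3], [0, 0, 2]]
y = ["author_a", "author_a", "author_b", "author_b"]
model = fit_multinomial_nb(X, y)
pred_counts = predict_multinomial_nb(model, [2, 0, 0])
pred_fractional = predict_multinomial_nb(model, [0.0, 0.3, 0.9])
print(pred_counts, pred_fractional)
```

The second prediction uses fractional feature values, which shows why tf-idf weights can be plugged into the same formula even though the model is derived for integer counts.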

In [11]:
%%time
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

Wall time: 769 ms


MultinomialNB()

### Extracting features from test data
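We call transform() instead of fit_transform(), because the vectorizer and the tf-idf transformer have already been fitted on the training set. Refitting on the test data would change the vocabulary and the idf weights that the classifier was trained on.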
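
The difference between fitting and transforming can be illustrated with a toy stand-in for the vectorizer. `TinyCountVectorizer` is a hypothetical class written for this sketch: `fit()` learns the vocabulary from the training texts, and `transform()` only reuses it, so test texts get the same columns and unseen words are dropped.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in: learns a vocabulary in fit(), reuses it in transform()."""

    def fit(self, docs):
        tokens = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                if tok in self.vocabulary_:  # words unseen during fit are dropped
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows


train_docs = ["he said nothing", "she said nothing more"]
test_docs = ["she said something"]

vec = TinyCountVectorizer().fit(train_docs)
train_rows = vec.transform(train_docs)
test_rows = vec.transform(test_docs)

print(vec.vocabulary_)
print(test_rows)
```

The test row has the same number of columns as the training rows, which is what lets a classifier trained on one be applied to the other.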

In [12]:
%%time
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

Wall time: 13.6 s


Prediction:

In [13]:
%%time
y_predicted = clf.predict(X_test_tfidf)

Wall time: 294 ms


The mean accuracy:

In [14]:
np.mean(y_predicted == y_test)

0.4706446878175454

Performance analysis:

In [15]:
print(metrics.classification_report(y_true=y_test, y_pred=y_predicted, zero_division=0))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00       307
           2       0.00      0.00      0.00       116
           3       0.00      0.00      0.00        71
           4       0.91      0.11      0.19       496
           6       0.00      0.00      0.00       139
           8       0.25      0.98      0.39      2297
           9       1.00      0.01      0.02       384
          10       0.00      0.00      0.00       262
          11       1.00      0.01      0.02       127
          12       0.00      0.00      0.00       206
          13       0.00      0.00      0.00       143
          14       0.84      0.52      0.64       858
          15       1.00      0.21      0.35       489
          16       0.00      0.00      0.00        63
          17       0.00      0.00      0.00       210
          18       1.00      0.06      0.10       361
          19       0.80      0.74      0.77       504
          20       1.00    

In [16]:
#metrics.confusion_matrix(y_true=y_test, y_pred=y_predicted)
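
To help read the report, the per-class figures can be computed by hand. The sketch below is a simplified illustration, not scikit-learn's code: `per_class_report` is a hypothetical helper, it only covers labels present in `y_true`, and the toy labels are invented to mimic a classifier that predicts one large class most of the time.

```python
def per_class_report(y_true, y_pred, zero_division=0.0):
    """Return {label: (precision, recall, f1, support)} for every true label."""
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        # Precision is undefined (0/0) when the label is never predicted.
        precision = tp / predicted if predicted else zero_division
        recall = tp / support
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else zero_division)
        report[label] = (precision, recall, f1, support)
    return report


y_true = [8, 8, 8, 8, 1, 1, 14, 14]
y_pred = [8, 8, 8, 8, 8, 8, 14, 8]
report = per_class_report(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1, support) in report.items():
    print(f"{label:>3} {precision:.2f} {recall:.2f} {f1:.2f} {support}")
print(f"accuracy: {accuracy:.3f}")
```

In the toy data, label 1 is never predicted, so its precision falls back to the `zero_division` value, while label 8 has perfect recall but low precision because it absorbs the other classes' samples.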

## SVM
SGDClassifier implements linear classifiers (SVM, logistic regression, etc.) trained with stochastic gradient descent (SGD).  
With loss='hinge' it fits a linear SVM.
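
The training rule can be sketched for the binary case in plain Python. This is a simplified illustration under stated assumptions, not SGDClassifier's implementation: it uses a constant learning rate `lr`, visits the samples in a fixed order without shuffling, and expects labels of +1 or -1. As far as we know, the multi-class case is handled by combining such binary classifiers in a one-versus-all scheme. The function names are hypothetical.

```python
def train_linear_svm_sgd(X, y, alpha=1e-3, lr=0.1, epochs=5):
    """SGD on the L2-regularised hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty always shrinks the weights a little.
            w = [wi * (1 - lr * alpha) for wi in w]
            # The hinge loss only contributes when the margin is below 1.
            if margin < 1:
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def predict(w, b, x):
    """Classify by the sign of the linear decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Linearly separable toy data.
X = [(2, 2), (3, 1), (2, 3), (-2, -1), (-3, -2), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm_sgd(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

Samples that already lie beyond the margin leave the weights almost unchanged, so only the hard or misclassified samples drive the updates.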

In [17]:
%%time
clf2 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=0, max_iter=5, tol=None)
clf2.fit(X_train_tfidf, y_train)

Wall time: 7.51 s


SGDClassifier(alpha=0.001, max_iter=5, random_state=0, tol=None)

Prediction:

In [18]:
%%time
y_predicted2 = clf2.predict(X_test_tfidf)

Wall time: 326 ms


The mean accuracy:

In [19]:
np.mean(y_predicted2 == y_test)

0.809021113243762

Performance analysis:

In [20]:
print(metrics.classification_report(y_true=y_test, y_pred=y_predicted2, zero_division=0))

              precision    recall  f1-score   support

           1       0.97      0.87      0.92       307
           2       1.00      0.13      0.23       116
           3       0.91      0.14      0.24        71
           4       0.92      0.69      0.79       496
           6       1.00      0.04      0.08       139
           8       0.68      0.94      0.79      2297
           9       0.94      0.79      0.86       384
          10       0.98      0.76      0.86       262
          11       0.82      0.59      0.69       127
          12       0.99      0.75      0.85       206
          13       0.97      0.55      0.70       143
          14       0.83      0.79      0.81       858
          15       0.90      0.92      0.91       489
          16       0.00      0.00      0.00        63
          17       0.98      0.50      0.67       210
          18       0.94      0.89      0.92       361
          19       0.81      0.90      0.85       504
          20       0.98    

In [21]:
#metrics.confusion_matrix(y_true=y_test, y_pred=y_predicted2)

## Demonstration

In [22]:
X_test

27984    john presently with some he seems to have a gr...
9637     to publish their and misfortunes except on the...
3696     then several weeks elapsed before i saw the le...
8325     the formation of a catalogue pen in hand wm ma...
12886    friend extended pale and lifeless on the carpe...
                               ...                        
47704    idea tell him you mean to be your own counsel ...
38997    its sunny day social social tendencies ths div...
1915     the influence which he commanded had won him a...
8756     short whenever they had nothing else to do â w...
2554     whirl which drifted swiftly like a cloud on a ...
Name: text, Length: 17714, dtype: object

In [23]:
y_test

27984    26
9637      8
3696      8
8325      8
12886    12
         ..
47704    44
38997    37
1915      4
8756      8
2554      4
Name: author, Length: 17714, dtype: int64

In [24]:
position = 4
print(f'The position of the text in the data set: {X_test.index[position]}')
print(f'The position of the text in the test set: {position}\n')

text = X_test.iloc[position]
print(f'Text:\n{text}\n')

actual_author = y_test.iloc[position]
print(f'Actual author: {actual_author}\n')


text = [text]
text_counts = count_vect.transform(text)
text_tfidf = tfidf_transformer.transform(text_counts)
#or
#text_tfidf = X_test_tfidf[position]


author_pred = clf.predict(text_tfidf)
print(f'Prediction by MultinomialNB(): {author_pred}\n')

author_pred2 = clf2.predict(text_tfidf)
print(f'Prediction by SGDClassifier(): {author_pred2}\n')

The position of the text in the data set: 12886
The position of the text in the test set: 4

Text:
friend extended pale and lifeless on the carpet she turned on george with a furious face what have you been doing to the poor darling she demanded you she raised her hand to strike but caught her by the wrist i have been doing nothing he declared the rage of the she bear by the power of his glance miss bull fainted unexpectedly thank goodness here is some one it was one of the servants but waved her off no one but me â no one but me she cried and took the slender form of her friend up in her arms wait here she added to george i ll be down soon when she left the room george looked at the servant who was a quiet respectable old woman is that girl mad he asked she s queer poor soul sir replied the woman and entirely devoted to miss bull and well she may be for it is miss bull who the house the girl is a natural sir the red man she looks like it replied george sitting down you can go i shall 

## Sources
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html