# NLP - Author Attribution

## Identifying authors through excerpts

------

A notebook using Python with sklearn and NLTK to build an NLP classifier. Given a collection of texts by six Portuguese authors, we first analyze the data and make some general decisions on how to tackle the task. We clean the data with NLTK and a few Bash scripts. The models are built and trained with sklearn.

Final predictions based on best model:

1000 word excerpts:

- testData/1000words/text1_clean.txt: joseSaramago
- testData/1000words/text2_clean.txt: almadaNegreiros
- testData/1000words/text3_clean.txt: luisaMarquesSilva
- testData/1000words/text4_clean.txt: ecaDeQueiros
- testData/1000words/text5_clean.txt: camiloCasteloBranco
- testData/1000words/text6_clean.txt: joseRodriguesSantos

500 word excerpts:

- testData/500words/text1_clean.txt: joseSaramago
- testData/500words/text2_clean.txt: almadaNegreiros
- testData/500words/text3_clean.txt: luisaMarquesSilva
- testData/500words/text4_clean.txt: ecaDeQueiros
- testData/500words/text5_clean.txt: camiloCasteloBranco
- testData/500words/text6_clean.txt: joseRodriguesSantos

------
Authors: 
- Davide Montali M20190201
- Francisco Cruz M20190637
- Umberto Tammaro M20190806

Course: Text Mining -- Nova IMS

------

# Requirements and Imports

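Please ensure the dependencies below are installed and download the required NLTK data. The download cell below fetches the stopwords corpus; the RSLP stemmer used later also needs the rslp resource (nltk.download('rslp')).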

In [1]:
import pandas as pd
import numpy as np
import nltk
import glob
import re

from itertools import count

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline

In [4]:
# Please make sure you have the following parts downloaded
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/d4ve/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Normalising

We use a quick Bash script to normalize the data. First we transliterate the text to ASCII, which removes the various Portuguese accents. Then we replace every character outside the set 'A-Za-z0-9-,.!?"' with a space - we keep the punctuation because the authors differ in how they use it. Finally we lowercase the documents and merge each author's books into a single corpus document.

In [3]:
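%%bash
# Remove the outputs of any previous run
rm -rf trainData/*/*clean.txt
rm -rf trainData/*/corpus.txt
for dir in trainData/*/; do
    for book in "$dir"*.txt; do
        # Transliterate to ASCII, squeeze disallowed characters into spaces, lowercase
        iconv -f utf8 -t ascii//TRANSLIT "$book" | tr -sc 'A-Za-z0-9-,.!?"' ' ' | tr A-Z a-z  > "${book%.*}"_clean.txt
        # Append the cleaned book to the author's corpus
        cat "${book%.*}"_clean.txt >> "${dir}corpus.txt"
    done
done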

We apply the same script to the excerpts to be identified, but without concatenating them into a corpus document, since each excerpt has to be classified on its own.

In [4]:
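%%bash
# Remove the outputs of any previous run
rm -rf testData/*/*clean.txt
for dir in testData/*/; do
    for book in "$dir"*.txt; do
        # Same normalisation as for the training data, one output file per excerpt
        iconv -f utf8 -t ascii//TRANSLIT "$book" | tr -sc 'A-Za-z0-9-,.!?"' ' ' | tr A-Z a-z  > "${book%.*}"_clean.txt
    done
done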
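For readers without iconv and tr, the normalisation done by the two Bash cells can be approximated in pure Python. This is a sketch and not the code used to produce the results: normalize_text is a hypothetical helper, and stripping accents through Unicode NFKD decomposition is an assumption that only approximates iconv's //TRANSLIT (the two can differ for characters that have no decomposition).

```python
import re
import unicodedata

# Characters kept by the tr command: letters, digits and - , . ! ? "
_NOT_ALLOWED = re.compile(r'[^A-Za-z0-9\-,.!?"]+')

def normalize_text(text):
    """Approximate the iconv | tr | tr pipeline (hypothetical helper)."""
    # Decompose accented letters, then drop the combining marks (ã -> a).
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of disallowed characters with a single space,
    # mirroring `tr -sc SET ' '`.
    squeezed = _NOT_ALLOWED.sub(" ", ascii_text)
    return squeezed.lower()

print(normalize_text("Não, disse o João!\nE saiu..."))
# nao, disse o joao! e saiu...
```

Whitespace and newlines are outside the allowed set, so they collapse into single spaces just as they do in the Bash version.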

We define the functions and variables used throughout the notebook, and load the NLTK Portuguese stemmer and stopwords.

In [9]:
def map_author(path):
    """Determine the author of a book through its file path."""
    for author in authors.values():
        if author in path:
            return author

def clean_doc(doc, remove_stopwords=True):
    """Stem a document and, optionally, remove its stopwords."""
    doc = stem_doc(doc)
    # Stopwords are matched after stemming, so the comparison
    # is made on stems, not on the original words.
    if remove_stopwords:
        doc = stop_doc(doc)
    return doc

def stem_doc(doc):
    """Split a document on whitespace, stem each word and
    join the stems back into a single string."""
    stems = [stemmer.stem(word) for word in doc.split()]
    return ' '.join(stems)

def stop_doc(doc):
    """Remove all stopwords from a document."""
    kept = [word for word in doc.split() if word not in stopwords]
    return ' '.join(kept)

def clean_new_data(doc):
    """Separate punctuation from words, then stem and remove stopwords."""
    # The capturing group keeps the separators, so punctuation
    # becomes its own whitespace-delimited token after the join.
    doc = ' '.join(re.split(r'(\W+)', doc))
    return clean_doc(doc)

authors = {
            1: "almadaNegreiros",
            2: "ecaDeQueiros",
            3: "joseSaramago",
            4: "camiloCasteloBranco",
            5: "joseRodriguesSantos",
            6: "luisaMarquesSilva"}

# NLTK tools (a set makes the stopword lookup fast)
stopwords = set(nltk.corpus.stopwords.words('portuguese'))
stemmer = nltk.stem.RSLPStemmer()

# Training data paths
paths = glob.glob('trainData/*/')
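Both clean_new_data and the segmentation step further down rely on re.split(r'(\W+)', ...). Because the pattern sits inside a capturing group, the separators (spaces and punctuation) are kept in the result instead of being discarded. A small illustration on a made-up sentence; the helper name split_keep_punct is hypothetical and not part of the notebook:

```python
import re

def split_keep_punct(text):
    """Split on runs of non-word characters, keeping the separators."""
    return re.split(r'(\W+)', text)

parts = split_keep_punct("ola, mundo!")
print(parts)     # ['ola', ', ', 'mundo', '!', '']

# Joining with spaces and re-splitting on whitespace isolates punctuation.
tokens = ' '.join(parts).split()
print(tokens)    # ['ola', ',', 'mundo', '!']
```

Note that the separators count as list items, so a slice of 500 items taken from such a list holds fewer than 500 actual words.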

## Balancing

Given the large imbalance in terms of words per author we have, we decided to split the corpus of each author into a number of smaller documents. Below we split into 500 word documents - the same count as the shortest excerpts we are looking to predict.

We also decided to undersample our data, motivated by the large imbalance between the different authors. We explored oversampling as well, but since it would mean generating new data for the minority classes from the already scarce text we have, we decided undersampling was the more sensible route.

We store our data in a separate folder, with documents within a subfolder named after the authors (our target variable) - using this format we can easily read our data into sklearn.

In [57]:
import os

# Write to path - keep the folder layout expected by sklearn's load_files
w_path = 'cleanData/'

# n defines the number of items (tokens and separators) in each doc
n = 500

# Get the corpus for each author, split it into n-item files and
# save them in the cleanData folder.
for path in paths:
    author = map_author(path)
    print(f'Segmenting {author}:')
    with open(path + 'corpus.txt') as file:
        corpus = file.read()
    corpus = clean_doc(corpus)
    # The capturing group keeps the separators as list items too
    corpus = re.split(r'(\W+)', corpus)
    if len(corpus) // n < 300:
        splits = len(corpus) // n
        if len(corpus) % n > 0: splits += 1
    else:
        splits = 300
    # NOTE: this increment runs for both branches, so large corpora can
    # yield 301 files and small ones can end with an empty file.
    if len(corpus) % n > 0: splits += 1
    # Make sure the author's folder exists before writing
    os.makedirs(w_path + author, exist_ok=True)
    filename = ("/corpus_part_%03i.txt" % i for i in count(1))

    for i in range(splits):
        seg = ' '.join(corpus[(n*i):n*(i+1)])
        with open(w_path + author + next(filename), "w") as file:
            file.write(seg)

100%|██████████| 126/126 [00:00<00:00, 18994.44it/s]
Segmenting {map_author(path)}:
100%|██████████| 110/110 [00:00<00:00, 20200.24it/s]
Segmenting {map_author(path)}:
100%|██████████| 301/301 [00:00<00:00, 21189.04it/s]
Segmenting {map_author(path)}:
100%|██████████| 301/301 [00:00<00:00, 21367.64it/s]
Segmenting {map_author(path)}:
100%|██████████| 301/301 [00:00<00:00, 20218.85it/s]
Segmenting {map_author(path)}:
100%|██████████| 301/301 [00:00<00:00, 16581.76it/s]
Segmenting {map_author(path)}:
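The segmentation and undersampling idea can be summarised as: cut the item list into consecutive chunks of n items and keep at most a fixed number of chunks per author. A minimal sketch under those assumptions; chunk_tokens and max_docs are hypothetical names, and unlike the cell above this version never emits an empty trailing chunk and caps strictly at max_docs.

```python
def chunk_tokens(tokens, n=500, max_docs=300):
    """Return up to max_docs consecutive chunks of at most n items each."""
    chunks = []
    for start in range(0, len(tokens), n):
        if len(chunks) == max_docs:
            break  # undersample: ignore the rest of a large corpus
        chunks.append(tokens[start:start + n])
    return chunks

small = chunk_tokens(list(range(1200)), n=500, max_docs=300)
print([len(c) for c in small])    # [500, 500, 200]

large = chunk_tokens(list(range(5000)), n=10, max_docs=300)
print(len(large))                 # 300
```

Keeping the first chunks, as done here and in the notebook, is the simplest form of undersampling; drawing the chunks at random would be an alternative that samples the whole corpus.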





## Naive Bayes

With our cleaned and balanced data, we load the data into sklearn with the load_files function. We split the data into a training and a test set (train_test_split holds out 25% by default) and fit the NB model with an sklearn pipeline.



In [3]:
# Use sklearn load_files - it deduces the target variable
# from the folder names - in our case the authors
book_data = load_files('cleanData/', encoding="UTF-8")

# Split into train and test set
X_train, X_test, y_train, y_test = train_test_split(book_data.data, book_data.target) #random_state=93

NB_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1,1))),
    ('clf', MultinomialNB(alpha=0.01)),
])

NB_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))],
         verbose=False)

To validate our score, we use sklearn cross-validation on the training data. Thereafter, we use the model to predict the values of the aforementioned held-out test set.

In [5]:
scores = cross_val_score(NB_clf, X_train, y_train, cv=10)
scores

array([1.        , 1.        , 0.98148148, 1.        , 0.98148148,
       1.        , 0.98148148, 0.99074074, 0.99074074, 1.        ])

In [7]:
predicted = NB_clf.predict(X_test)
np.mean(predicted == y_test)

0.9944444444444445

## NB Prediction

In [10]:
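words1000 = glob.glob('testData/1000words/*clean.txt')

# Read and preprocess each excerpt, closing the files afterwards
docs_new = []
for path in words1000:
    with open(path, 'r') as file:
        docs_new.append(clean_new_data(file.read()))

# Predict all excerpts at once and map each class index to the author name
for path, auth in zip(words1000, NB_clf.predict(docs_new)):
    print(path, book_data.target_names[auth])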

testData/1000words/text6_clean.txt joseRodriguesSantos
testData/1000words/text5_clean.txt camiloCasteloBranco
testData/1000words/text2_clean.txt almadaNegreiros
testData/1000words/text3_clean.txt luisaMarquesSilva
testData/1000words/text1_clean.txt joseSaramago
testData/1000words/text4_clean.txt ecaDeQueiros


In [11]:
print(metrics.classification_report(y_test, predicted, target_names=book_data.target_names))

                     precision    recall  f1-score   support

    almadaNegreiros       1.00      0.96      0.98        25
camiloCasteloBranco       1.00      1.00      1.00        78
       ecaDeQueiros       1.00      1.00      1.00        75
joseRodriguesSantos       1.00      1.00      1.00        77
       joseSaramago       0.97      1.00      0.99        69
  luisaMarquesSilva       1.00      0.97      0.99        36

           accuracy                           0.99       360
          macro avg       1.00      0.99      0.99       360
       weighted avg       0.99      0.99      0.99       360



In [12]:
metrics.confusion_matrix(y_test, predicted)

array([[24,  0,  0,  0,  1,  0],
       [ 0, 78,  0,  0,  0,  0],
       [ 0,  0, 75,  0,  0,  0],
       [ 0,  0,  0, 77,  0,  0],
       [ 0,  0,  0,  0, 69,  0],
       [ 0,  0,  0,  0,  1, 35]])
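The classification report and the confusion matrix above carry the same information: in sklearn's confusion matrix the rows are the true authors and the columns the predicted ones, so recall is the diagonal entry divided by its row sum and precision is the diagonal entry divided by its column sum. The following sketch recomputes the figures from the matrix printed above; per_class_scores is a hypothetical helper written for this illustration.

```python
# Confusion matrix copied from the output above
# (rows: true author, columns: predicted author).
labels = ["almadaNegreiros", "camiloCasteloBranco", "ecaDeQueiros",
          "joseRodriguesSantos", "joseSaramago", "luisaMarquesSilva"]
cm = [
    [24,  0,  0,  0,  1,  0],
    [ 0, 78,  0,  0,  0,  0],
    [ 0,  0, 75,  0,  0,  0],
    [ 0,  0,  0, 77,  0,  0],
    [ 0,  0,  0,  0, 69,  0],
    [ 0,  0,  0,  0,  1, 35],
]

def per_class_scores(matrix):
    """Return a (precision, recall) pair per class of a square matrix."""
    scores = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)   # column sum
        actual_k = sum(matrix[k])                     # row sum
        scores.append((true_pos / predicted_k, true_pos / actual_k))
    return scores

for name, (precision, recall) in zip(labels, per_class_scores(cm)):
    print(f"{name:>20} precision={precision:.2f} recall={recall:.2f}")

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")   # accuracy=0.9944
```

The two misclassified excerpts both land in the joseSaramago column, which is why that author is the only one with precision below 1.00.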

## SVM 

In [24]:
SVM_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1,2))),
    ('clf', SGDClassifier(alpha=0.01)),
])

SVM_clf.fit(X_train, y_train)

predicted = SVM_clf.predict(X_test)
np.mean(predicted == y_test)

0.9472222222222222

In [25]:
print(metrics.classification_report(y_test, predicted, target_names=book_data.target_names))

                     precision    recall  f1-score   support

    almadaNegreiros       1.00      0.80      0.89        25
camiloCasteloBranco       0.99      1.00      0.99        78
       ecaDeQueiros       0.99      1.00      0.99        75
joseRodriguesSantos       0.97      0.97      0.97        77
       joseSaramago       0.82      1.00      0.90        69
  luisaMarquesSilva       1.00      0.67      0.80        36

           accuracy                           0.95       360
          macro avg       0.96      0.91      0.93       360
       weighted avg       0.95      0.95      0.95       360



In [26]:
metrics.confusion_matrix(y_test, predicted)

array([[20,  1,  1,  0,  3,  0],
       [ 0, 78,  0,  0,  0,  0],
       [ 0,  0, 75,  0,  0,  0],
       [ 0,  0,  0, 75,  2,  0],
       [ 0,  0,  0,  0, 69,  0],
       [ 0,  0,  0,  2, 10, 24]])

In [27]:
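words500 = glob.glob('testData/500words/*clean.txt')

# Read and preprocess each excerpt, closing the files afterwards
docs_new = []
for path in words500:
    with open(path, 'r') as file:
        docs_new.append(clean_new_data(file.read()))

# Predict all excerpts at once and map each class index to the author name
for path, auth in zip(words500, SVM_clf.predict(docs_new)):
    print(path, book_data.target_names[auth])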

testData/500words/text6_clean.txt joseSaramago
testData/500words/text5_clean.txt camiloCasteloBranco
testData/500words/text2_clean.txt almadaNegreiros
testData/500words/text3_clean.txt joseSaramago
testData/500words/text1_clean.txt joseSaramago
testData/500words/text4_clean.txt ecaDeQueiros
