<H1>HC in Cross-Domain Authorship Attribution Challenge</H1>

- Use an HC-based (Higher Criticism) test to attribute authorship in the PAN2018 cross-domain authorship attribution challenge https://pan.webis.de/clef18/pan18-web/author-identification.html#cross-domain
- Only the English part of this challenge (problems 1-4) is considered.
- We use a lemmatized version of the data, obtained with Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/).


In [1]:
import pandas as pd
import numpy as np
import os
import re
import codecs
from tqdm import tqdm

#import auxiliary functions for python
import sys
sys.path.append('../')
from AuthAttLib import *

<H2>Load Data</H2>
The data below was obtained by lemmatizing the original challenge test data using the Stanford CoreNLP lemmatizer https://stanfordnlp.github.io/CoreNLP/. 

In [6]:
#load data PAN2018 (after lemmatization using the Stanford NLP) 
raw_data = pd.read_csv("/Users/kipnisal/Google Drive/Data/PAN2018/PAN2018_lemmatized.csv")
raw_data.loc[:,'lem_text'] = raw_data.loc[:,'lem_text'] # + ' ' + raw_data.loc[:,'POS']
data = raw_data.filter(['dataset', 'author', 'doc_id', 'lem_text']).rename(columns = {'lem_text' : 'text'})
data.loc[:,'type'] = 'train'
data.loc[data.doc_id.str.find('test')>-1,'type'] = 'test'
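
The train/test labelling above relies only on whether `doc_id` contains the substring `test`. A minimal sketch of the same rule on a toy frame follows; the `doc_id` values and texts here are made up for illustration and are not taken from the challenge data.

```python
import pandas as pd

# Toy frame with hypothetical doc_id values (an assumption, not the real data).
toy = pd.DataFrame({
    'dataset': ['PAN-problem00001'] * 3,
    'author': ['candidate00001', 'candidate00002', 'candidate00001'],
    'doc_id': ['known00001', 'known00002', 'test00001'],
    'text': ['first known text', 'second known text', 'a disputed text'],
})

# Same rule as in the notebook: everything is 'train' unless doc_id contains 'test'.
toy['type'] = 'train'
toy.loc[toy['doc_id'].str.contains('test', regex=False), 'type'] = 'test'

print(toy[['doc_id', 'type']])
```

`str.contains('test', regex=False)` selects the same rows as `str.find('test') > -1` when `doc_id` has no missing values.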

Ignore proper names, numbers, and some pronouns: 

In [14]:
# list of proper names, pronouns, and capitalized words.
with open("../Data/list_of_names.txt") as f:
    proper_names = f.read().split(', ')

other_words = ['she', 'he', 'him', 'we', 'it', 'they', 'you', 'me', 'myself',
               'am', 'is', 'are', 'be', 'was','were','her', 'his', 'their',
               'theirs', 'our', 'ours', 'your', 'yours', 'sister', 'brother',
               'dad', 'mom', 'husband', 'wife', 'mother', 'woman', 'women',
               'daughter', 'father', 'aunt', 'child','wife','child', 'girl',
               'girls', 'son','captain', 'colonel', 'lady', 'mr', 'mrs', 'miss',
               'sir', 'gentleman', 'publius','st','god','lord', 'chapter',
               'queen', 'goblins', 'wynn', 'pokemon']

numbers = ['one','two','three','four','five','six','seven','eight','nine','ten',
           'hundred', 'thousand', 'million', 'billion']

words_to_ignore = proper_names + other_words + numbers
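
How `AuthAttLib` applies `words_to_ignore` internally is not shown here. As a rough illustration only, the sketch below drops ignored tokens first and then counts n-grams over what remains; the helper name `ngram_counts`, the whitespace tokenization, and the order of the two steps are all assumptions.

```python
from collections import Counter

def ngram_counts(text, ignore, ngram_range=(1, 3)):
    """Count word n-grams after dropping ignored tokens (hypothetical helper)."""
    ignore = {w.lower() for w in ignore}
    # Assumption: tokens are whitespace-separated and compared in lower case.
    tokens = [t for t in text.lower().split() if t not in ignore]
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

example = ngram_counts("she walk to the old house", ignore=['she'])
print(example.most_common(3))
```

Dropping tokens before forming n-grams makes neighbours of a removed word adjacent; filtering n-grams after they are formed would behave differently, so this choice matters when comparing against the library.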

<H2>Multi-author model</H2>

- For each problem in the challenge, train a model and evaluate it over a test set. <br>
- The following implementation opts not to use the UNKNOWN option (`<UNK>`), hence the recall is always 100%.
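
For orientation, here is a minimal sketch of a Higher Criticism statistic computed from a list of per-feature p-values, assuming the standard Donoho-Jin form. The function name `higher_criticism` and the parameters `gamma` and `stable` are hypothetical; in particular, it is only an assumption that the `stbl` flag used below selects between the two denominators shown, and `AuthAttLib` may compute its p-values and its statistic differently.

```python
import numpy as np

def higher_criticism(pvals, gamma=0.2, stable=True):
    """HC over the smallest gamma-fraction of sorted p-values (illustrative sketch)."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    if n < 2:
        raise ValueError("need at least two p-values")
    # Look only at the k smallest p-values; keep k < n so that i/n < 1.
    k = max(1, min(int(gamma * n), n - 1))
    i = np.arange(1, k + 1)
    u = i / n                      # expected quantile under the null
    p_k = np.clip(p[:k], 1e-12, 1 - 1e-12)
    if stable:
        denom = np.sqrt(u * (1 - u))       # denominator based on i/n
    else:
        denom = np.sqrt(p_k * (1 - p_k))   # denominator based on p_(i)
    z = np.sqrt(n) * (u - p_k) / denom
    return float(np.max(z))

# Evenly spread p-values versus the same list with a few very small ones.
null_p = np.linspace(0.01, 0.99, 50)
signal_p = null_p.copy()
signal_p[:5] = 1e-6
print(higher_criticism(null_p), higher_criticism(signal_p))
```

A larger value indicates a stronger departure from the null, which is why the attribution below picks the corpus with the smallest HC against the tested document.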

In [None]:
data_train = data[data['type'] == 'train']

lo_problem = pd.unique(data_train['dataset'])

lo_F1 = []
lo_acc = []

for prob in tqdm(lo_problem[0:4]) :
    data_prob = data_train[data_train['dataset'] == prob]
    
    #compute model for each problem:
    model = AuthorshipAttributionMulti(data_prob, 
                                       vocab_size = 3000,  #uses 3000 most frequent ngram
                                       stbl = True,  #type of HC statistic
                                      ngram_range = (1,3), #mono-, bi-, and tri- grams
                                     words_to_ignore = words_to_ignore # exclude these words
                                                )
    
    #attribute test documents:
    data_prob_test = data[(data['dataset'] == prob) & (data['type'] == 'test')]
    lo_test_docs = pd.unique(data_prob_test.doc_id)
    rows = [] #collect one dict per test document

    for doc in tqdm(lo_test_docs) :
        sm = data_prob_test[data_prob_test.doc_id == doc]

        pred,_ = model.predict(sm.text.values[0], unk_thresh = 1e6) 
                # 'unk_thresh' can be used to return '<UNK>' instead of the
                # name of the corpus with the smallest HC whenever that
                # smallest HC is above 'unk_thresh'.

        auth = sm.author.values[0]
        rows.append({'doc_id' : doc,
                     'author' : auth,
                     'predicted' : pred,
                    })

    df = pd.DataFrame(rows) #save results in this dataframe


    # evaluate accuracy and F1 score
    df_r = df[df.predicted != '<UNK>']
    recall = len(df_r) / len(df)
    acc = np.mean((df_r.predicted == df_r.author).values)
    
    print("problem = {}".format(prob))
    print("recall = {}".format(recall))
    print("accuracy = {}".format(acc))
    print("F1 = {}".format(2*recall*acc / (recall + acc)))
    lo_F1 += [2*recall*acc / (recall + acc)]
    lo_acc += [acc]

#prob1: F1 = 0.6017, acc = 0.43 |W| = 3000, ng = (1,3)
#prob2: F1 = 0.666, acc = 0.5 |W| = 3000, ng = (1,3)
#prob3: F1 = 0.8732, acc = 0.775, |W| = 3000, ng = (1,3)
#prob4: F1 = 0.857, acc = 0.75, |W| = 3000, ng = (1,3)
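
The decision rule described in the comments above (pick the corpus with the smallest HC, or `<UNK>` if even that value exceeds `unk_thresh`) can be sketched as follows. The function `attribute` and its dictionary input are hypothetical stand-ins, not the `AuthAttLib` interface.

```python
def attribute(hc_by_author, unk_thresh=1e6):
    """Return (prediction, smallest HC) from a dict of author -> HC score."""
    best = min(hc_by_author, key=hc_by_author.get)
    smallest = hc_by_author[best]
    if smallest > unk_thresh:
        # Even the closest corpus is too far away: decline to answer.
        return '<UNK>', smallest
    return best, smallest

# Made-up scores for illustration.
scores = {'candidate00001': 5.3, 'candidate00002': 2.1, 'candidate00003': 4.0}
print(attribute(scores))                  # very large default threshold
print(attribute(scores, unk_thresh=1.0))  # strict threshold
```

With a threshold as large as `1e6`, as used in the cell above, the rule effectively never returns `<UNK>`, which is consistent with the recall of 100%.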


<h2>Multi-author with head-to-head comparisons</h2>

Compare each pair of corpora, using only the features that distinguish the two corpora in each test. Attribute the tested document to whichever corpus wins the largest number of pairwise comparisons.
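
The voting step can be sketched independently of how each pairwise comparison is decided. In the sketch below, `pair_winner` is a hypothetical callback standing in for a pairwise model, and ties are broken by the order of `authors`; neither detail is taken from `AuthAttLib`.

```python
from collections import Counter
from itertools import combinations

def vote(authors, pair_winner):
    """Attribute to the author with the most wins over all unordered pairs."""
    wins = Counter({a: 0 for a in authors})
    for a, b in combinations(authors, 2):
        wins[pair_winner(a, b)] += 1
    # most_common keeps insertion order among equal counts.
    return wins.most_common(1)[0][0], dict(wins)

# Made-up discrepancy scores: in each pair the smaller score wins.
toy_scores = {'c1': 3.0, 'c2': 1.0, 'c3': 2.0, 'c4': 5.0, 'c5': 4.0}
closer = lambda a, b: a if toy_scores[a] < toy_scores[b] else b

winner, tally = vote(list(toy_scores), closer)
print(winner, tally)
```

With five candidate authors there are 5 * 4 / 2 = 10 unordered pairs, so ten pairwise models are needed per problem.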

In [13]:
data_train = data[data['type'] == 'train']

lo_problem = pd.unique(data_train['dataset'])

lo_F1 = []
lo_acc = []

for prob in tqdm(lo_problem[3:4]) :
    data_prob = data_train[data_train['dataset'] == prob]
    
    #compute model for each problem:
    model = AuthorshipAttributionMultiBinary(data_prob, 
                                       vocab_size = 100,  #uses 100 most frequent ngram
                                       stbl = True,  #type of HC statistic
                                      ngram_range = (1,3), #mono-, bi-, and tri- grams
                                     words_to_ignore = words_to_ignore # exclude these words
                                                )
    
    #attribute test documents:
    data_prob_test = data[(data['dataset'] == prob) & (data['type'] == 'test')]
    lo_test_docs = pd.unique(data_prob_test.doc_id)
    rows = [] #collect one dict per test document

    for doc in tqdm(lo_test_docs) :
        sm = data_prob_test[data_prob_test.doc_id == doc]

        pred = model.predict(sm.text.values[0], method = 'HC') 
                # can use 'unk_thresh' to get '<UNK>' instead of the name 
                # of the corpus with smallest HC in the case when the smallest
                # HC is above 'unk_thresh'. 

        auth = sm.author.values[0]
        rows.append({'doc_id' : doc,
                     'author' : auth,
                     'predicted' : pred,
                    })

    df = pd.DataFrame(rows) #save results in this dataframe


    # evaluate accuracy and F1 score
    df_r = df[df.predicted != '<UNK>']
    recall = len(df_r) / len(df)
    acc = np.mean((df_r.predicted == df_r.author).values)
    
    print("problem = {}".format(prob))
    print("recall = {}".format(recall))
    print("accuracy = {}".format(acc))
    print("F1 = {}".format(2*recall*acc / (recall + acc)))
    lo_F1 += [2*recall*acc / (recall + acc)]
    lo_acc += [acc]

#prob4: F1 = 0.72, acc = 0.5625, |W| = 100, ng = (1,3)








Found 10 author-pairs
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00002
	 Creating author-model for candidate00001 using 1774 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00002 using 1774 features
		found 7 documents and 5549 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00003
	 Creating author-model for candidate00001 using 1811 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00003 using 1811 features
		found 7 documents and 5753 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00004
	 Creating author-model for candidate00001 using 2169 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00004 using 2169 features
		found 7 documents and 6041 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00005
	 Creating author-model for candidate00001 using 1682 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00005 using 1682 features
		found 7 documents and 5019 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00002 vs candidate00003
	 Creating author-model for candidate00002 using 1867 features
		found 7 documents and 5549 relevant tokens
	 Creating author-model for candidate00003 using 1867 features
		found 7 documents and 5753 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00002 vs candidate00004
	 Creating author-model for candidate00002 using 2265 features
		found 7 documents and 5549 relevant tokens
	 Creating author-model for candidate00004 using 2265 features
		found 7 documents and 6041 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00002 vs candidate00005
	 Creating author-model for candidate00002 using 1782 features
		found 7 documents and 5549 relevant tokens
	 Creating author-model for candidate00005 using 1782 features
		found 7 documents and 5019 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00003 vs candidate00004
	 Creating author-model for candidate00003 using 2229 features
		found 7 documents and 5753 relevant tokens
	 Creating author-model for candidate00004 using 2229 features
		found 7 documents and 6041 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00003 vs candidate00005
	 Creating author-model for candidate00003 using 1782 features
		found 7 documents and 5753 relevant tokens
	 Creating author-model for candidate00005 using 1782 features
		found 7 documents and 5019 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00004 vs candidate00005
	 Creating author-model for candidate00004 using 2170 features
		found 7 documents and 6041 relevant tokens
	 Creating author-model for candidate00005 using 2170 features
		found 7 documents and 5019 relevant tokens

problem = PAN-problem00004
recall = 1.0
accuracy = 0.5625
F1 = 0.72
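
The figures printed above follow from the evaluation code in the cell: recall is the share of documents that received an answer other than `<UNK>`, accuracy is computed on the answered documents only, and the reported F1 is the harmonic mean of the two. This is the notebook's own definition and may differ from the official challenge measure. A small sketch reproducing the arithmetic on a toy frame (the helper `evaluate` and the toy labels are illustrative assumptions):

```python
import pandas as pd

def evaluate(df, unk='<UNK>'):
    """Recall, accuracy on answered documents, and their harmonic mean."""
    answered = df[df['predicted'] != unk]
    recall = len(answered) / len(df)
    acc = float((answered['predicted'] == answered['author']).mean()) if len(answered) else 0.0
    f1 = 2 * recall * acc / (recall + acc) if (recall + acc) > 0 else 0.0
    return recall, acc, f1

# Toy frame: 16 documents, all answered, 9 attributed correctly.
toy_results = pd.DataFrame({
    'author': ['a'] * 16,
    'predicted': ['a'] * 9 + ['b'] * 7,
})
recall, acc, f1 = evaluate(toy_results)
print(recall, acc, round(f1, 2))  # 1.0 0.5625 0.72
```

With recall fixed at 1, the formula reduces to 2 * acc / (1 + acc), so 0.5625 accuracy gives 1.125 / 1.5625 = 0.72.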
