<H1>HC in Cross-Domain Authorship Attribution Challenge</H1>
Use this notebook to replicate the results reported in the paper <br>
<ul>
    <li>[1] <a href="https://arxiv.org/abs/1911.01208">
    Kipnis, A., "Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship", 2019
    </a></li>
</ul>

- Use an HC-based test to attribute authorship in the PAN2018 cross-domain authorship attribution challenge: https://pan.webis.de/clef18/pan18-web/author-identification.html#cross-domain
- Only the English part of this challenge (problems 1-4) is considered.
- We use a lemmatized version of the data, obtained with Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/).
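
For orientation, here is a minimal sketch of a Higher Criticism (HC) statistic computed from a list of p-values. It is not the code in `AuthAttLib`. The function name `higher_criticism`, the `gamma` cutoff, and the choice of the denominator `sqrt(i/N (1 - i/N))` are assumptions made for illustration; the library may compute p-values and normalize differently.

```python
import math

def higher_criticism(pvals, gamma=0.2):
    """Illustrative HC statistic: the largest standardized gap between the
    uniform quantile i/N and the i-th smallest p-value, over i <= gamma*N."""
    if not 0 < gamma < 1:
        raise ValueError("gamma must lie strictly between 0 and 1")
    p = sorted(pvals)
    n = len(p)
    if n < 2:
        raise ValueError("need at least two p-values")
    k = max(1, int(gamma * n))  # only the smallest p-values are scanned
    best = -math.inf
    for i in range(1, k + 1):
        frac = i / n
        z = math.sqrt(n) * (frac - p[i - 1]) / math.sqrt(frac * (1 - frac))
        best = max(best, z)
    return best

# Evenly spread p-values (no signal) versus the same list with 20 tiny p-values.
null = [(i - 0.5) / 1000 for i in range(1, 1001)]
sparse = [1e-8] * 20 + null[20:]
hc_null = higher_criticism(null)
hc_sparse = higher_criticism(sparse)
```

A small HC suggests that the tested document is consistent with a corpus, while a few very small p-values (a handful of strongly discriminating words) push HC up. This is why the attribution below picks the corpus with the smallest HC.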



In [2]:
import pandas as pd
import numpy as np
import os
import re
import codecs
from tqdm import tqdm

#import auxiliary functions for python
import sys
sys.path.append('../')
from AuthAttLib import *

<H2>Load Data</H2>
The data below was obtained by lemmatizing the original challenge test data using the Stanford CoreNLP lemmatizer https://stanfordnlp.github.io/CoreNLP/. 

In [5]:
#load data PAN2018 (from lemmatized) 
raw_data = pd.read_csv("./Data/PAN2018_lemmatized.csv")
raw_data.loc[:,'lem_text'] = raw_data.loc[:,'lem_text'] # + ' ' + raw_data.loc[:,'POS']
data = raw_data.filter(['dataset', 'author', 'doc_id'])

data.loc[:,'split'] = 'train'
data.loc[data.doc_id.str.find('test')>-1,'split'] = 'test'

Ignore proper names. The cell below drops only tokens tagged `PROPN`; numbers and pronouns are left in place:

In [6]:
#remove proper names
data.loc[:,'text'] = raw_data.apply(
    lambda r : " ".join([w[0] for w in zip(r['lem_text'].split(),r['POS'].split()) if w[1] != 'PROPN']), axis=1)
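
The filter above pairs each lemma with its POS tag and keeps the lemma unless the tag is `PROPN`. The sketch below shows the same idea on a single toy row. The helper name `drop_tagged_tokens` and the sample strings are hypothetical. Unlike the `zip` in the cell above, which silently truncates to the shorter sequence, this version raises an error when the lemma and tag counts differ.

```python
def drop_tagged_tokens(lemmas, tags, excluded=("PROPN",)):
    """Return the lemma string with every token whose POS tag is excluded removed."""
    tokens = lemmas.split()
    pos = tags.split()
    if len(tokens) != len(pos):
        # misaligned rows would otherwise shift tags onto the wrong lemmas
        raise ValueError("lemma and POS sequences have different lengths")
    return " ".join(tok for tok, tag in zip(tokens, pos) if tag not in excluded)

cleaned = drop_tagged_tokens("Alice see the cat", "PROPN VERB DET NOUN")
```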


<H2>Multi-author model</H2>

- For each problem in the challenge, train a model and evaluate it over a test set. <br>
- The following implementation opts not to use the UNKNOWN option, so recall is always 100%.
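
The sketch below illustrates the two ideas in this list: attributing a document to the corpus with the smallest HC unless that value exceeds a threshold, and scoring with the recall, accuracy, and F1 formulas used in the cells of this notebook. The function names `attribute` and `evaluate` and the toy scores are assumptions for illustration, not part of `AuthAttLib`.

```python
def attribute(hc_by_author, unk_thresh=float("inf"), unk_label="<UNK>"):
    """Pick the author with the smallest HC; answer unk_label if even the
    smallest HC is above unk_thresh."""
    best = min(hc_by_author, key=hc_by_author.get)
    return best if hc_by_author[best] <= unk_thresh else unk_label

def evaluate(predicted, truth, unk_label="<UNK>"):
    """Recall = share of documents answered, accuracy = share of answered
    documents that are correct, F1 = harmonic mean of the two."""
    answered = [(p, t) for p, t in zip(predicted, truth) if p != unk_label]
    if not answered:
        return 0.0, 0.0, 0.0
    recall = len(answered) / len(predicted)
    acc = sum(p == t for p, t in answered) / len(answered)
    f1 = 2 * recall * acc / (recall + acc) if recall + acc > 0 else 0.0
    return recall, acc, f1

scores = {"candidate00001": 2.0, "candidate00002": 3.5}
always_answer = attribute(scores)                 # very large threshold
may_abstain = attribute(scores, unk_thresh=1.0)   # smallest HC is above 1.0
recall, acc, f1 = evaluate(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```

With a very large threshold every document receives an author, so recall is 1 and F1 reduces to `2*acc / (1 + acc)`; for example, an accuracy of 0.875 gives an F1 of about 0.933 under this definition.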

In [10]:

lo_problem = pd.unique(data['dataset']) #data_train is only defined inside the loop below

lo_F1 = []
lo_acc = []

for prob in tqdm(lo_problem[:4]) :
    data_prob = data[data['dataset'] == prob]
    
    data_train = data_prob[data_prob['split'] == 'train']
    #compute model for each problem:
    model = AuthorshipAttributionMulti(data_train, #fit on the training split only
                                       vocab_size = 1500,  #uses the 1500 most frequent n-grams
                                       stbl = True,  #type of HC statistic
                                       ngram_range = (1,3), #mono-, bi-, and tri-grams
                                       flat = True  # compress counts in each corpus (faster)
                                      )
    
    #attribute test documents:
    print("Evaluate on test dataset:")
    data_test = data_prob[data_prob['split'] == 'test']
    lo_test_docs = pd.unique(data_test.doc_id)
    records = [] #collect one result row per test document

    for doc in tqdm(lo_test_docs) :
        sm = data_test[data_test.doc_id == doc]

        pred,_ = model.predict(sm.text.values[0], unk_thresh = 1e6) 
                # 'unk_thresh' makes predict return '<UNK>' instead of the name 
                # of the corpus with the smallest HC whenever that smallest
                # HC is above 'unk_thresh'. 

        auth = sm.author.values[0]
        records.append({'doc_id' : doc,
                        'author' : auth,
                        'predicted' : pred,
                       })

    df = pd.DataFrame(records) #save results in this dataframe (DataFrame.append was removed in pandas 2.0)


    # evaluate accuracy and F1 score
    df_r = df[df.predicted != '<UNK>']
    recall = len(df_r) / len(df)
    acc = np.mean((df_r.predicted == df_r.author).values)
    
    print("problem = {}".format(prob))
    print("recall = {}".format(recall))
    print("accuracy = {}".format(acc))
    print("F1 = {}".format(2*recall*acc / (recall + acc)))
    lo_F1 += [2*recall*acc / (recall + acc)]
    lo_acc += [acc]

#prob1: |W| = 3000, ng = (1,3) --> F1 = 0.6017, acc = 0.43 
#       |W| = 1500, ng = (1,3) --> F1 = 0.672, acc = 0.5063
#prob2: |W| = 3000, ng = (1,3) --> F1 = 0.666, acc = 0.5 
#       |W| = 1500, ng = (1,3) --> F1 = 0.67857, acc = 0.5135
#prob3: |W| = 3000, ng = (1,3) --> F1 = 0.8732, acc = 0.775
#       |W| = 1500, ng = (1,3) --> F1 = 0.80597014, acc = 0.675
#prob4: |W| = 3000, ng = (1,3) --> F1 = 0.857, acc = 0.75
#       |W| = 1500, ng = (1,3) --> F1 = 0.933, acc = 0.875
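
The comments above compare vocabulary sizes |W| for an n-gram range of (1,3). As a rough illustration of what such a vocabulary is, the sketch below counts all mono-, bi-, and tri-grams in a set of texts and keeps the most frequent ones. The helper names `ngrams` and `top_ngrams` are hypothetical, and `AuthAttLib` may select its features differently.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Yield every contiguous n-gram with n_min <= n <= n_max as a string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def top_ngrams(texts, vocab_size, ngram_range=(1, 3)):
    """Return the vocab_size most frequent n-grams over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), *ngram_range))
    return [gram for gram, _ in counts.most_common(vocab_size)]

toy_texts = ["the cat sit", "the cat run"]
vocab = top_ngrams(toy_texts, vocab_size=3)
```

In this toy example the three n-grams that occur twice ("the", "cat", "the cat") form the vocabulary; every other n-gram occurs once and is dropped.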


	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 8588 relevant tokens.
	 Creating author-model for candidate00002 using 1500 features...
		found 7 documents and 8498 relevant tokens.
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 7619 relevant tokens.
	 Creating author-model for candidate00004 using 1500 features...
		found 7 documents and 8383 relevant tokens.
	 Creating author-model for candidate00005 using 1500 features...
		found 7 documents and 7809 relevant tokens.
	 Creating author-model for candidate00006 using 1500 features...
		found 7 documents and 8443 relevant tokens.
	 Creating author-model for candidate00007 using 1500 features...
		found 7 documents and 7806 relevant tokens.
	 Creating author-model for candidate00008 using 1500 features...
		found 7 documents and 7433 relevant tokens.
	 Creating author-model for candidate00009 using 1500 features...
		found 7 documents and 7846 relevant 


		found 7 documents and 8249 relevant tokens.
Evaluate on test dataset:




problem = PAN-problem00001
recall = 1.0
accuracy = 0.5063291139240507
F1 = 0.6722689075630253
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 8852 relevant tokens.
	 Creating author-model for candidate00002 using 1500 features...
		found 7 documents and 6824 relevant tokens.
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 7603 relevant tokens.
	 Creating author-model for candidate00004 using 1500 features...
		found 7 documents and 8565 relevant tokens.
	 Creating author-model for candidate00005 using 1500 features...
		found 7 documents and 7114 relevant tokens.
	 Creating author-model for candidate00006 using 1500 features...
		found 7 documents and 8233 relevant tokens.
	 Creating author-model for candidate00007 using 1500 features...
		found 7 documents and 8456 relevant tokens.
	 Creating author-model for candidate00008 using 1500 features...
		found 7 documents and 8445 relevant tokens.
	 Creating


		found 7 documents and 8665 relevant tokens.
Evaluate on test dataset:




problem = PAN-problem00002
recall = 1.0
accuracy = 0.5135135135135135
F1 = 0.6785714285714285
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 7296 relevant tokens.
	 Creating author-model for candidate00002 using 1500 features...
		found 7 documents and 8874 relevant tokens.
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 6842 relevant tokens.
	 Creating author-model for candidate00004 using 1500 features...
		found 7 documents and 8715 relevant tokens.
	 Creating author-model for candidate00005 using 1500 features...
		found 7 documents and 8562 relevant tokens.
	 Creating author-model for candidate00006 using 1500 features...
		found 7 documents and 8347 relevant tokens.
	 Creating author-model for candidate00007 using 1500 features...
		found 7 documents and 7874 relevant tokens.
	 Creating author-model for candidate00008 using 1500 features...
		found 7 documents and 8688 relevant tokens.
	 Creating


		found 7 documents and 8751 relevant tokens.
Evaluate on test dataset:




problem = PAN-problem00003
recall = 1.0
accuracy = 0.675
F1 = 0.8059701492537313
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 7858 relevant tokens.
	 Creating author-model for candidate00002 using 1500 features...
		found 7 documents and 7814 relevant tokens.
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 8925 relevant tokens.
	 Creating author-model for candidate00004 using 1500 features...



		found 7 documents and 7056 relevant tokens.
	 Creating author-model for candidate00005 using 1500 features...
		found 7 documents and 7376 relevant tokens.
Evaluate on test dataset:




problem = PAN-problem00004
recall = 1.0
accuracy = 0.875
F1 = 0.9333333333333333





<h2>Multi-author with head-to-head comparisons</h2>

Compare each pair of corpora, using only the features that distinguish the two corpora in each test. Attribute the tested document to the corpus with the most wins across all pairwise comparisons.
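
The voting scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the `AuthorshipAttributionMultiBinary` implementation: the function `attribute_by_pairwise_wins`, the set-based toy corpora, and `toy_distance` (a stand-in for the HC-based discrepancy) are all hypothetical, and ties within a pair are broken in favor of the first corpus.

```python
from collections import Counter
from itertools import combinations

def attribute_by_pairwise_wins(doc, corpora, distance):
    """corpora maps author name -> corpus; distance(doc, corpus, rival) returns
    a discrepancy (smaller = closer) computed for that specific pair."""
    wins = Counter({name: 0 for name in corpora})
    for a, b in combinations(sorted(corpora), 2):
        d_a = distance(doc, corpora[a], corpora[b])
        d_b = distance(doc, corpora[b], corpora[a])
        wins[a if d_a <= d_b else b] += 1
    return wins.most_common(1)[0][0], dict(wins)

def toy_distance(doc_tokens, corpus_vocab, rival_vocab):
    """Count document tokens that distinguish the pair and are absent from the corpus."""
    distinguishing = corpus_vocab ^ rival_vocab  # words in exactly one of the two
    return sum(t not in corpus_vocab for t in doc_tokens if t in distinguishing)

corpora = {
    "A": {"the", "whale", "sea"},
    "B": {"the", "tea", "garden"},
    "C": {"the", "ship", "sea"},
}
winner, wins = attribute_by_pairwise_wins("the whale swim in the sea".split(),
                                          corpora, toy_distance)
n_pairs_for_five_authors = len(list(combinations(range(5), 2)))
```

With n candidate authors there are n(n-1)/2 pairwise models, so a five-author problem requires 10 of them.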

In [13]:

lo_problem = pd.unique(data['dataset']) #list all problems, not only those in a leftover data_train

lo_F1 = []
lo_acc = []

for prob in tqdm(lo_problem[3:4]) :
    data_prob = data[data['dataset'] == prob]
    data_train = data_prob[data_prob['split'] == 'train']
    
    #compute model for each problem:
    model = AuthorshipAttributionMultiBinary(data_train, #fit on the training split only
                                       vocab_size = 100,  #vocabulary size setting
                                       stbl = True,  #type of HC statistic
                                       ngram_range = (1,3), #mono-, bi-, and tri-grams
                                       reduce_features = True
                                      )
    
    #attribute test documents:
    data_test = data_prob[data_prob['split'] == 'test']
    lo_test_docs = pd.unique(data_test.doc_id)
    records = [] #collect one result row per test document

    for doc in tqdm(lo_test_docs) :
        sm = data_test[data_test.doc_id == doc]

        pred = model.predict(sm.text.values[0], method = 'HC') 
                # can use 'unk_thresh' to get '<UNK>' instead of the name 
                # of the corpus with smallest HC in the case when the smallest
                # HC is above 'unk_thresh'. 

        auth = sm.author.values[0]
        records.append({'doc_id' : doc,
                        'author' : auth,
                        'predicted' : pred,
                       })

    df = pd.DataFrame(records) #save results in this dataframe (DataFrame.append was removed in pandas 2.0)


    # evaluate accuracy and F1 score
    df_r = df[df.predicted != '<UNK>']
    recall = len(df_r) / len(df)
    acc = np.mean((df_r.predicted == df_r.author).values)
    
    print("problem = {}".format(prob))
    print("recall = {}".format(recall))
    print("accuracy = {}".format(acc))
    print("F1 = {}".format(2*recall*acc / (recall + acc)))
    lo_F1 += [2*recall*acc / (recall + acc)]
    lo_acc += [acc]

#prob4: F1 = 0.72, acc = 0.5625, |W| = 100, ng = (1,3)








Found 10 author-pairs
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00002
	 Creating author-model for candidate00001 using 1774 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00002 using 1774 features
		found 7 documents and 5549 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00003
	 Creating author-model for candidate00001 using 1811 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00003 using 1811 features
		found 7 documents and 5753 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00004
	 Creating author-model for candidate00001 using 2169 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00004 using 2169 features
		found 7 documents and 6041 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00005
	 Creating author-model for candidate00001 using 1682 features
		found 7 documents and 5715 relevant tokens
	 Creating author-model for candidate00005 using 1682 features
		found 7 documents and 5019 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00002 vs candidate00003
	 Creating author-model for candidate00002 using 1867 features
		found 7 documents and 5549 relevant tokens
	 Creating author-model for candidate00003 using 1867 features
		found 7 documents and 5753 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00002 vs candidate00004
	 Creating author-model for candidate00002 using 2265 features
		found 7 documents and 5549 relevant tokens
	 Creating author-model for candidate00004 using 2265 features
		found 7 documents and 6041 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00002 vs candidate00005
	 Creating author-model for candidate00002 using 1782 features
		found 7 documents and 5549 relevant tokens
	 Creating author-model for candidate00005 using 1782 features
		found 7 documents and 5019 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00003 vs candidate00004
	 Creating author-model for candidate00003 using 2229 features
		found 7 documents and 5753 relevant tokens
	 Creating author-model for candidate00004 using 2229 features
		found 7 documents and 6041 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00003 vs candidate00005
	 Creating author-model for candidate00003 using 1782 features
		found 7 documents and 5753 relevant tokens
	 Creating author-model for candidate00005 using 1782 features
		found 7 documents and 5019 relevant tokens
MultiBinaryAuthorModel: Creating model for candidate00004 vs candidate00005
	 Creating author-model for candidate00004 using 2170 features
		found 7 documents and 6041 relevant tokens
	 Creating author-model for candidate00005 using 2170 features
		found 7 documents and 5019 relevant tokens

problem = PAN-problem00004
recall = 1.0
accuracy = 0.5625
F1 = 0.72
