<H1>HC in Cross-Domain Authorship Attribution Challenge</H1>
Use this notebook to replicate the results reported in the paper <br>
<ul>
    <li>[1] <a href="https://arxiv.org/abs/1911.01208">
    Kipnis, A., "Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship", 2019
    </a></li>
</ul>

- Use an HC-based test to attribute authorship in the PAN2018 cross-domain authorship attribution challenge: https://pan.webis.de/clef18/pan18-web/author-identification.html#cross-domain
- Only the English part (problems 1-4) of this challenge is considered.
- We use a lemmatized version of the data, obtained with Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/).
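
For readers unfamiliar with the statistic, here is a minimal sketch of how a Higher Criticism (HC) value can be computed from a collection of P-values. It is an illustration only, not the code in `AuthAttLib`: the function name, the `gamma` search fraction, and the example P-values are assumptions, and the `stbl` flag only mirrors the option of the same name used later in this notebook.

```python
import numpy as np

def higher_criticism(pvals, gamma=0.2, stbl=True):
    """Return (HC, P-value at which HC is attained).

    Illustrative sketch: needs at least two P-values and gamma < 1.
    """
    ps = np.sort(np.asarray(pvals, dtype=float))
    n = len(ps)
    k = max(1, int(gamma * n))      # search only the k smallest P-values
    uu = np.arange(1, k + 1) / n    # uniform quantiles i/n
    p = ps[:k]
    # stbl=True normalizes by the uniform quantiles, otherwise by the P-values
    denom = np.sqrt(uu * (1 - uu)) if stbl else np.sqrt(p * (1 - p))
    z = np.sqrt(n) * (uu - p) / denom
    i_star = int(np.argmax(z))
    return float(z[i_star]), float(p[i_star])

# hypothetical P-values: a near-uniform set, and the same set with 5 tiny values
null_pvals = np.linspace(0.01, 0.99, 99)
signal_pvals = null_pvals.copy()
signal_pvals[:5] = 1e-6

hc_null, _ = higher_criticism(null_pvals)
hc_signal, p_star = higher_criticism(signal_pvals)
print(hc_null, hc_signal, p_star)
```

A few unusually small P-values are enough to raise HC well above its value on the near-uniform set, which is why the statistic suits rare, weak differences between word-frequency tables.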



In [1]:
import pandas as pd
import numpy as np
import os
import re
import codecs
from tqdm import tqdm

#import auxiliary functions for python
import sys
sys.path.append('../')
from AuthAttLib import *

<H2>Load Data</H2>
<a href = https://www.uni-weimar.de/medien/webis/corpora/corpus-pan-labs-09-today/pan-18/pan18-data/pan18-cross-domain-authorship-attribution-test-dataset2-2018-04-20-password-protected.zip>Link</a> to dataset


In [26]:
#load dataset 
raw_data = pd.read_csv("../Data/PAN2018_probs_1_to_4.csv")

In [27]:
# optional: remove proper names, cardinal digits, and punctuation
from text_processing import remove_parts_of_speach
punct = [':',';','"','(',')','-',',','.','`','\`','!']

def text_proc_loc(text) : 
    return remove_parts_of_speach(text, 
                        to_remove=['NNP', 'NNPS', 'CD']
                    )
def lemmatize_vocab(list_of_words) :
    return text_proc_loc(" ".join(list_of_words)).split()

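def get_n_most_common_words(n = 5000) :
    # keep the order of the file (assumed to be sorted by frequency);
    # converting to a set would scramble it before slicing
    most_common_list = pd.read_csv('~/Data/5000_most_common_english_words.csv')
    return most_common_list.Word.drop_duplicates().tolist()[:n]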


Ignore proper nouns and cardinal numbers (POS tags NNP, NNPS, and CD), and mark each document as train or test:
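
The effect of dropping these POS tags can be sketched on a hand-tagged sentence. This is only an illustration: `drop_pos` and the tagged tokens are hypothetical, whereas the notebook's own `remove_parts_of_speach` helper works on raw text.

```python
def drop_pos(tagged_tokens, to_remove=("NNP", "NNPS", "CD")):
    """Join the words whose POS tag is not in to_remove."""
    removed = set(to_remove)
    return " ".join(word for word, tag in tagged_tokens if tag not in removed)

# hypothetical, hand-tagged lemmatized sentence
tagged = [("Alice", "NNP"), ("buy", "VBD"), ("3", "CD"),
          ("book", "NNS"), ("in", "IN"), ("London", "NNP")]

print(drop_pos(tagged))  # buy book in
```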

In [28]:
data = raw_data.filter(['dataset', 'author', 'doc_id', 'prob'])
data.loc[:,'split'] = 'train'
data.loc[data.doc_id.str.find('test')>-1,'split'] = 'test'

In [29]:
# remove POS: proper names and cardinal digits (can take a while)
data.loc[:,'text'] = raw_data.text.apply(text_proc_loc)

<H2>Multi-author model</H2>

- For each problem in the challenge, train a model and evaluate it on the test set. <br>
- The following implementation opts not to use the UNKNOWN option, hence the recall is always 100%.
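
The decision rule described in the code comments below (pick the corpus with the smallest HC, or report `<UNK>` when even the smallest HC exceeds a threshold) can be sketched as follows. The function name and the scores are hypothetical; the actual logic lives in `model.predict`.

```python
def attribute(hc_by_author, unk_thresh=1e6):
    """Return the author with the smallest HC, or '<UNK>' if it exceeds unk_thresh."""
    best = min(hc_by_author, key=hc_by_author.get)
    if hc_by_author[best] > unk_thresh:
        return "<UNK>"
    return best

# hypothetical HC scores of one test document against three author corpora
scores = {"candidate00001": 3.2, "candidate00002": 1.4, "candidate00003": 2.7}

print(attribute(scores))                  # candidate00002
print(attribute(scores, unk_thresh=1.0))  # <UNK>
```

With a very large threshold such as `1e6`, the `<UNK>` branch is effectively never taken, which is why recall stays at 100% in the run below.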

In [57]:
lo_problem = pd.unique(data['prob'])

lo_F1 = []
lo_acc = []

for prob in tqdm(lo_problem[:4]) :
    data_prob = data[data['prob'] == prob]
    
    data_train = data_prob[data_prob['split'] == 'train']
    #compute model for each problem:
    model = AuthorshipAttributionMulti(
        data_train,         # training documents arranged in a data frame
        vocab_size=3000,    # use the 3000 most frequent n-grams
        stbl=True,          # type of HC statistic
        ngram_range=(1,1),  # unigrams only; (1,3) would add bi- and tri-grams
        randomize=False     # do not use randomized P-values
        )
    
    #attribute test documents:
    print("Evaluate on test set:")
    data_test = data_prob[data_prob['split'] == 'test']
    lo_test_docs = pd.unique(data_test.doc_id)
    rows = [] # collect one result per test document

    for doc in tqdm(lo_test_docs) :
        sm = data_test[data_test.doc_id == doc]
        
        pred,_ = model.predict(sm.text.values[0], unk_thresh = 1e6, method = 'HC') 
                # can use 'unk_thresh' to get '<UNK>' instead of the name 
                # of the corpus with smallest HC in the case when the smallest
                # HC is above 'unk_thresh'. 

        auth = sm.author.values[0]
        rows.append({'doc_id' : doc,
                     'author' : auth,
                     'predicted' : pred,
                    })

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    df = pd.DataFrame(rows) #save results in this dataframe


    # evaluate accuracy and F1 score
    df_r = df[df.predicted != '<UNK>']
    recall = len(df_r) / len(df)
    acc = np.mean((df_r.predicted == df_r.author).values)
    
    print("problem = {}".format(prob))
    print("recall = {}".format(recall))
    print("accuracy = {}".format(acc))
    print("F1 = {}".format(2*recall*acc / (recall + acc)))
    lo_F1 += [2*recall*acc / (recall + acc)]
    lo_acc += [acc]

#prob1: |W| = 1500, ng = (1,3) --> F1 = 0.661, acc = 0.493
#prob2: |W| = 1500, ng = (1,3) --> F1 = 0.678, acc = 0.513
#prob3: |W| = 1500, ng = (1,3) --> F1 = 0.7878, acc = 0.65
#prob4: |W| = 1500, ng = (1,3) --> F1 = 0.814, acc = 0.6875

print("\n mean F1 = ", np.mean(lo_F1))


	 Creating author-model for candidate00001 using 3000 features...
		found 7 documents and 5316 relevant tokens.
	 Creating author-model for candidate00002 using 3000 features...
		found 7 documents and 5012 relevant tokens.
	 Creating author-model for candidate00003 using 3000 features...
		found 7 documents and 5093 relevant tokens.
	 Creating author-model for candidate00004 using 3000 features...
		found 7 documents and 5097 relevant tokens.
	 Creating author-model for candidate00005 using 3000 features...
		found 7 documents and 4578 relevant tokens.
	 Creating author-model for candidate00006 using 3000 features...
		found 7 documents and 5095 relevant tokens.
	 Creating author-model for candidate00007 using 3000 features...
		found 7 documents and 5221 relevant tokens.
	 Creating author-model for candidate00008 using 3000 features...
		found 7 documents and 5526 relevant tokens.
	 Creating author-model for candidate00009 using 3000 features...
		found 7 documents and 5203 relevant 



		found 7 documents and 5194 relevant tokens.
	 Creating author-model for candidate00016 using 3000 features...
		found 7 documents and 4719 relevant tokens.
	 Creating author-model for candidate00017 using 3000 features...
		found 7 documents and 5218 relevant tokens.
	 Creating author-model for candidate00018 using 3000 features...
		found 7 documents and 5033 relevant tokens.
	 Creating author-model for candidate00019 using 3000 features...
		found 7 documents and 4742 relevant tokens.
	 Creating author-model for candidate00020 using 3000 features...
		found 7 documents and 4905 relevant tokens.
Evaluate on test set:




problem = problem00001
recall = 1.0
accuracy = 0.5443037974683544
F1 = 0.7049180327868853
	 Creating author-model for candidate00001 using 3000 features...
		found 7 documents and 5364 relevant tokens.
	 Creating author-model for candidate00002 using 3000 features...
		found 7 documents and 5257 relevant tokens.
	 Creating author-model for candidate00003 using 3000 features...
		found 7 documents and 5110 relevant tokens.
	 Creating author-model for candidate00004 using 3000 features...
		found 7 documents and 5014 relevant tokens.
	 Creating author-model for candidate00005 using 3000 features...
		found 7 documents and 4746 relevant tokens.
	 Creating author-model for candidate00006 using 3000 features...
		found 7 documents and 4957 relevant tokens.
	 Creating author-model for candidate00007 using 3000 features...




		found 7 documents and 5111 relevant tokens.
	 Creating author-model for candidate00008 using 3000 features...
		found 7 documents and 5034 relevant tokens.
	 Creating author-model for candidate00009 using 3000 features...
		found 7 documents and 4416 relevant tokens.
	 Creating author-model for candidate00010 using 3000 features...
		found 7 documents and 4880 relevant tokens.
	 Creating author-model for candidate00011 using 3000 features...
		found 7 documents and 4619 relevant tokens.
	 Creating author-model for candidate00012 using 3000 features...
		found 7 documents and 5173 relevant tokens.
	 Creating author-model for candidate00013 using 3000 features...
		found 7 documents and 4921 relevant tokens.
	 Creating author-model for candidate00014 using 3000 features...
		found 7 documents and 5230 relevant tokens.
	 Creating author-model for candidate00015 using 3000 features...
		found 7 documents and 5327 relevant tokens.
Evaluate on test set:




problem = problem00002
recall = 1.0
accuracy = 0.5675675675675675
F1 = 0.7241379310344828
	 Creating author-model for candidate00001 using 3000 features...
		found 7 documents and 4497 relevant tokens.
	 Creating author-model for candidate00002 using 3000 features...
		found 7 documents and 5392 relevant tokens.
	 Creating author-model for candidate00003 using 3000 features...
		found 7 documents and 5316 relevant tokens.
	 Creating author-model for candidate00004 using 3000 features...
		found 7 documents and 4799 relevant tokens.
	 Creating author-model for candidate00005 using 3000 features...
		found 7 documents and 5191 relevant tokens.
	 Creating author-model for candidate00006 using 3000 features...
		found 7 documents and 4977 relevant tokens.
	 Creating author-model for candidate00007 using 3000 features...
		found 7 documents and 4639 relevant tokens.
	 Creating author-model for candidate00008 using 3000 features...




		found 7 documents and 5048 relevant tokens.
	 Creating author-model for candidate00009 using 3000 features...
		found 7 documents and 4926 relevant tokens.
	 Creating author-model for candidate00010 using 3000 features...
		found 7 documents and 5417 relevant tokens.
Evaluate on test set:




problem = problem00003
recall = 1.0
accuracy = 0.75
F1 = 0.8571428571428571
	 Creating author-model for candidate00001 using 3000 features...
		found 7 documents and 5654 relevant tokens.
	 Creating author-model for candidate00002 using 3000 features...
		found 7 documents and 5230 relevant tokens.
	 Creating author-model for candidate00003 using 3000 features...
		found 7 documents and 5494 relevant tokens.
	 Creating author-model for candidate00004 using 3000 features...
		found 7 documents and 5503 relevant tokens.
	 Creating author-model for candidate00005 using 3000 features...
		found 7 documents and 4838 relevant tokens.
Evaluate on test set:




problem = problem00004
recall = 1.0
accuracy = 0.875
F1 = 0.9333333333333333

 mean F1 =  0.8048830385743897
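
As a check on the arithmetic, the F1 values above follow from the printed accuracies: the cell treats accuracy on the attributed documents as precision, and recall is 1.0 for every problem. The helper name below is hypothetical; the accuracies are copied from the output.

```python
def f1_from_recall_accuracy(recall, acc):
    """Harmonic mean of recall and accuracy, as computed in the cell above."""
    return 2 * recall * acc / (recall + acc)

# accuracies printed above for problems 1-4 (recall is 1.0 in every case)
reported_acc = [0.5443037974683544, 0.5675675675675675, 0.75, 0.875]

f1 = [f1_from_recall_accuracy(1.0, a) for a in reported_acc]
mean_f1 = sum(f1) / len(f1)
print(f1)
print(mean_f1)
```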





In [36]:
PAN2018_English_results = {
'Custódio and Paraboni' : 0.744,
'Murauer et al.' : 0.762,
'Halvani and Graner' :  0.679,
'Mosavat' : 0.685,
'Yigal et al.' : 0.672,
'Martín dCR et al.' : 0.601,
'PAN18-BASELINE' : 0.697,
'Miller et al.' : 0.573,
'Schaetti' :  0.538}

#from 

from plotnine import *

PAN2018_English_results['HC-based test'] = np.mean(lo_F1)

df = pd.DataFrame.from_dict(PAN2018_English_results, orient='index')
df = df.rename(columns = {0 : 'F1'}).reset_index()
dfs = df.sort_values('F1').reset_index()
df['cat'] = pd.Categorical(df.F1, categories=dfs['F1'].values, ordered=True)


df['type'] = 'other'
df.loc[df['index'] == 'HC-based test','type']='HC'

p = (ggplot(aes(x = 'cat', y = 'F1', fill = 'type', label = 'index'), data = df)
     + geom_bar(position='dodge', stat="identity", show_legend=False)
     + geom_text(nudge_y = -0.25) + coord_flip() + ggtitle('PAN 2018 Authorship Challenge')
     + xlab('') + theme(axis_text_y=element_blank())
     + ylab('F1 score')
    )
print(p)

path_to_plots = ""  # directory for the saved figure; an empty string means the current directory
p.save(path_to_plots + 'PAN2018_F1.png')

[Figure: horizontal bar chart titled 'PAN 2018 Authorship Challenge' showing the F1 score of each method; saved as PAN2018_F1.png]


<h2>Multi-author with head-to-head comparisons</h2>

Compare each pair of corpora, using only the features that distinguish the two corpora in each test. Attribute the tested document to whichever corpus wins the largest number of pairwise comparisons.
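
The voting scheme can be sketched as follows. The names `attribute_by_wins` and `pairwise_winner` and the scores are hypothetical; in the notebook the pairwise decisions are made inside `AuthorshipAttributionMultiBinary`.

```python
from collections import Counter
from itertools import combinations

def attribute_by_wins(authors, pairwise_winner):
    """Run every head-to-head comparison and return (top author, win counts)."""
    wins = Counter({a: 0 for a in authors})
    for a, b in combinations(authors, 2):
        wins[pairwise_winner(a, b)] += 1
    return wins.most_common(1)[0][0], dict(wins)

# hypothetical per-author scores for one test document; the smaller score wins a pair
scores = {"A": 2.0, "B": 1.0, "C": 3.0}
predicted, wins = attribute_by_wins(
    list(scores), lambda a, b: a if scores[a] < scores[b] else b)
print(predicted, wins)

# 10 candidate authors give 45 pairs, and 5 authors give 10 pairs
n_pairs_10 = len(list(combinations(range(10), 2)))
n_pairs_5 = len(list(combinations(range(5), 2)))
```

The pair counts agree with the "Found 45 author-pairs" and "Found 10 author-pairs" lines in the log below.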

In [11]:
lo_problem = pd.unique(data['dataset'])

lo_F1 = []
lo_acc = []

for prob in tqdm(lo_problem[2:4]) :
    data_prob = data[data['dataset'] == prob]
    data_train = data_prob[data_prob['split'] == 'train']
    
    #compute model for each problem:
    model = AuthorshipAttributionMultiBinary(
        data_train,
        vocab_size = 1500,       # use the 1500 most frequent n-grams
        stbl = True,             # type of HC statistic
        ngram_range = (1,3),     # mono-, bi-, and tri-grams
        reduce_features = True,  # keep only features that distinguish each pair
        randomize = True         # use randomized P-values
        )
    
    #attribute test documents:
    data_test = data_prob[data_prob['split'] == 'test']
    lo_test_docs = pd.unique(data_test.doc_id)
    rows = [] # collect one result per test document

    for doc in tqdm(lo_test_docs) :
        sm = data_test[data_test.doc_id == doc]

        pred = model.predict(sm.text.values[0], method = 'chisq_pval', LOO = False) 
                # can use 'unk_thresh' to get '<UNK>' instead of the name 
                # of the corpus with smallest HC in the case when the smallest
                # HC is above 'unk_thresh'. 

        auth = sm.author.values[0]
        rows.append({'doc_id' : doc,
                     'author' : auth,
                     'predicted' : pred,
                    })

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    df = pd.DataFrame(rows) #save results in this dataframe


    # evaluate accuracy and F1 score
    df_r = df[df.predicted != '<UNK>']
    recall = len(df_r) / len(df)
    acc = np.mean((df_r.predicted == df_r.author).values)
    
    print("problem = {}".format(prob))
    print("recall = {}".format(recall))
    print("accuracy = {}".format(acc))
    print("F1 = {}".format(2*recall*acc / (recall + acc)))
    lo_F1 += [2*recall*acc / (recall + acc)]
    lo_acc += [acc]

#prob4: F1 = 0.72, acc = 0.5625, |W| = 100, ng = (1,3)



Found 45 author-pairs
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00002...
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 8126 relevant tokens.
	 Creating author-model for candidate00002 using 1500 features...
		found 7 documents and 9304 relevant tokens.
Changing vocabulary for candidate00001. Found 1335 relevant tokens.
Changing vocabulary for candidate00002. Found 1496 relevant tokens.
Reduced to 51 features...
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00003...
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 8145 relevant tokens.
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 7406 relevant tokens.
Changing vocabulary for candidate00001. Found 1393 relevant tokens.
Changing vocabulary for candidate00003. Found 1446 relevant tokens.
Reduced to 41 features...
MultiBinaryAuthorModel: Creating model for candi

Changing vocabulary for candidate00004. Found 3163 relevant tokens.
Reduced to 53 features...
MultiBinaryAuthorModel: Creating model for candidate00003 vs candidate00005...
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 7274 relevant tokens.
	 Creating author-model for candidate00005 using 1500 features...
		found 7 documents and 9705 relevant tokens.
Changing vocabulary for candidate00003. Found 3711 relevant tokens.
Changing vocabulary for candidate00005. Found 4560 relevant tokens.
Reduced to 532 features...
MultiBinaryAuthorModel: Creating model for candidate00003 vs candidate00006...
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 7517 relevant tokens.
	 Creating author-model for candidate00006 using 1500 features...
		found 7 documents and 8901 relevant tokens.
Changing vocabulary for candidate00003. Found 3627 relevant tokens.
Changing vocabulary for candidate00006. Found 4180 relevant tokens.
R

	 Creating author-model for candidate00006 using 1500 features...
		found 7 documents and 8697 relevant tokens.
	 Creating author-model for candidate00008 using 1500 features...
		found 7 documents and 9657 relevant tokens.
Changing vocabulary for candidate00006. Found 2257 relevant tokens.
Changing vocabulary for candidate00008. Found 2182 relevant tokens.
Reduced to 61 features...
MultiBinaryAuthorModel: Creating model for candidate00006 vs candidate00009...
	 Creating author-model for candidate00006 using 1500 features...
		found 7 documents and 8913 relevant tokens.
	 Creating author-model for candidate00009 using 1500 features...
		found 7 documents and 8347 relevant tokens.
Changing vocabulary for candidate00006. Found 1358 relevant tokens.
Changing vocabulary for candidate00009. Found 1084 relevant tokens.
Reduced to 36 features...
MultiBinaryAuthorModel: Creating model for candidate00006 vs candidate00010...
	 Creating author-model for candidate00006 using 1500 features...
		fo



		found 7 documents and 9502 relevant tokens.
Changing vocabulary for candidate00009. Found 1543 relevant tokens.
Changing vocabulary for candidate00010. Found 2145 relevant tokens.
Reduced to 50 features...



problem = PAN-problem00003
recall = 1.0
accuracy = 0.7
F1 = 0.8235294117647058
Found 10 author-pairs
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00002...
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 8389 relevant tokens.
	 Creating author-model for candidate00002 using 1500 features...
		found 7 documents and 8290 relevant tokens.
Changing vocabulary for candidate00001. Found 2720 relevant tokens.
Changing vocabulary for candidate00002. Found 2901 relevant tokens.
Reduced to 417 features...
MultiBinaryAuthorModel: Creating model for candidate00001 vs candidate00003...
	 Creating author-model for candidate00001 using 1500 features...
		found 7 documents and 8302 relevant tokens.
	 Creating author-model for candidate00003 using 1500 features...
		found 7 documents and 9295 relevant tokens.
Changing vocabulary for candidate00001. Found 1592 relevant tokens.
Changing vocabulary for candidate00003. Found 2073 relevant to



Changing vocabulary for candidate00004. Found 4442 relevant tokens.
Changing vocabulary for candidate00005. Found 4485 relevant tokens.
Reduced to 548 features...




problem = PAN-problem00004
recall = 1.0
accuracy = 0.625
F1 = 0.7692307692307693





In [10]:
np.mean(lo_F1)

0.8048830385743897

<H1>Other Classifiers</H1>

In [42]:
import pandas as pd
import numpy as np
import os
import pickle
import nltk
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
import sys


from AuthAttLib import AuthorshipAttributionMulti, to_docTermCounts
from FreqTable import FreqTable

def evaluate_classifyer(Cls, X_train, y_train, X_test, y_test, **kwargs) :
    clf = Cls(**kwargs)
    clf.fit(X_train,y_train)
    return clf.score(X_test, y_test)

def evaluate_Clf(Cls, X_train, y_train,
                 X_test, y_test,
                 vocab, **kwargs) :
    # fit a classifier of type Cls and return its accuracy on the test set
    # ('vocab' is accepted for convenience but is not used here)
    print('using {}'.format(Cls))
    clf = Cls(**kwargs)
    clf.fit(X_train,y_train)
    return clf.score(X_test, y_test)

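def get_n_most_common_words(n = 5000) :
    # keep the order of the file (assumed to be sorted by frequency);
    # converting to a set would scramble it before slicing
    most_common_list = pd.read_csv('~/Data/5000_most_common_english_words.csv')
    return most_common_list.Word.drop_duplicates().tolist()[:n]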

def get_counts_labels(df, vocab) :
    # build one count vector (over vocab) and one author label per document
    X = []
    y = []
    for _, row in df.iterrows() :
        dt = to_docTermCounts([row.text], vocab=vocab)
        X += [FreqTable(dt[0], dt[1])._counts]
        y += [row.author]
    
    return X, y

classifiers = {#'logistic_regression' : LogisticRegression,
               'multinomial_NB' : MultinomialNB,
               'SVM' : LinearSVC,
               'KNN' : KNeighborsClassifier,
               }

lo_args = {'logistic_regression' : {'solver': 'lbfgs',
                                    'max_iter' : 150},
           'multinomial_NB' : {},
           'SVM' : {'dual' : False},
           'KNN' : {'metric' : 'cosine',
                    'n_neighbors' : 5},
           }

df = pd.DataFrame()

lo_problem = pd.unique(data['prob'])

for prob in tqdm(lo_problem[:4]) :
    data_prob = data[data['prob'] == prob]
    vocab = model._vocab     # reuse the vocabulary of the most recently trained HC model
    vocab_size = len(vocab)
    print("vocab size = {}".format(vocab_size))
    acc = {}

    
    X_train, y_train = get_counts_labels(data_prob[data_prob.split=='train'], vocab)
    X_test, y_test = get_counts_labels(data_prob[data_prob.split=='test'], vocab)

    for cls in classifiers : 
        Cls = classifiers[cls]
        args = lo_args[cls]
        acc[cls] = evaluate_Clf(Cls, X_train,y_train, X_test, y_test, vocab, **args)

    acc['vocab_size'] = vocab_size

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, pd.DataFrame(acc, index = [0])], ignore_index = True)
    
# convert accuracy to F1 (recall is 1); vocab_size is not a score, so exclude it
df.drop(columns='vocab_size').apply(lambda x : 2*x/(1+x)).mean()

vocab size = 3000
using <class 'sklearn.naive_bayes.MultinomialNB'>
using <class 'sklearn.svm.classes.LinearSVC'>
using <class 'sklearn.neighbors.classification.KNeighborsClassifier'>
vocab size = 3000
using <class 'sklearn.naive_bayes.MultinomialNB'>
using <class 'sklearn.svm.classes.LinearSVC'>
using <class 'sklearn.neighbors.classification.KNeighborsClassifier'>
vocab size = 3000
using <class 'sklearn.naive_bayes.MultinomialNB'>
using <class 'sklearn.svm.classes.LinearSVC'>
using <class 'sklearn.neighbors.classification.KNeighborsClassifier'>
vocab size = 3000
using <class 'sklearn.naive_bayes.MultinomialNB'>
using <class 'sklearn.svm.classes.LinearSVC'>
using <class 'sklearn.neighbors.classification.KNeighborsClassifier'>

multinomial_NB    0.822121
SVM               0.773446
KNN               0.719383
dtype: float64