# Guess corresponding author
* Load the article metadata and the topics assigned to each article
* Define a baseline
* Train a classifier
* Evaluate the results on the test dataset

In [1]:
import pandas as pd
import pickle
import os
from sklearn.dummy import DummyClassifier

In [2]:
DATA_PATH = '../data'
MODELS_PATH = '../models'

## Load the article metadata and the topics assigned to each article

Load the article abstracts and metadata for the train/validate/test datasets

In [3]:
train_df = pd.read_csv(os.path.join(DATA_PATH, 'arxiv_train.csv'), index_col=0)
validate_df = pd.read_csv(os.path.join(DATA_PATH, 'arxiv_validate.csv'), index_col=0)
test_df = pd.read_csv(os.path.join(DATA_PATH, 'arxiv_test.csv'), index_col=0)

Load the topics assigned to each article

In [4]:
topics_train_df = pd.read_csv(os.path.join(DATA_PATH, 'topics_train.csv'), index_col=0)
topics_validate_df = pd.read_csv(os.path.join(DATA_PATH, 'topics_validate.csv'), index_col=0)
topics_test_df = pd.read_csv(os.path.join(DATA_PATH, 'topics_test.csv'), index_col=0)

The "labels" in this case are the corresponding authors, taken from the "submitter" column.

In [5]:
labels_train = train_df["submitter"]
labels_validate = validate_df["submitter"]
labels_test = test_df["submitter"]

In [6]:
print(f"The train dataset has {train_df.shape[0]} articles by {len(set(labels_train))} unique corresponding authors")
print(f"The validate dataset has {validate_df.shape[0]} articles by {len(set(labels_validate))} unique corresponding authors")
print(f"The test dataset has {test_df.shape[0]} articles by {len(set(labels_test))} unique corresponding authors")

The train dataset has 35000 articles by 27356 unique corresponding authors
The validate dataset has 17500 articles by 15197 unique corresponding authors
The test dataset has 17500 articles by 15165 unique corresponding authors


## Baseline: 10 most prolific submitters
Take the 10 most prolific authors in the train dataset.
A paper in the validation dataset is then classified as written by one of these top-10 authors or by "other", which gives 11 classes.
Compute the "most frequent" baseline on the validation dataset, i.e. the accuracy of always predicting the most probable class (which is of course "other").
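
The class construction described above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the helper `collapse_to_top_n` and the toy author names are hypothetical, and `n=2` stands in for the 10 authors mentioned in the text.

```python
import pandas as pd

def collapse_to_top_n(train_labels: pd.Series, labels: pd.Series, n: int) -> pd.Series:
    """Keep the labels of the n most frequent train authors; map all others to 'other'."""
    # value_counts() sorts by frequency, so the first n index entries are the top-n authors
    top_authors = train_labels.value_counts().index[:n]
    # where() keeps entries for which the condition holds and replaces the rest
    return labels.where(labels.isin(top_authors), "other")

# Toy data (assumption: not taken from the arXiv files)
toy_train = pd.Series(["ann", "ann", "ann", "bob", "bob", "cy"])
toy_validate = pd.Series(["ann", "dee", "bob", "eve", "cy"])

collapsed = collapse_to_top_n(toy_train, toy_validate, n=2)
print(collapsed.tolist())    # ['ann', 'other', 'bob', 'other', 'other']
print(collapsed.nunique())   # 3 classes here: the 2 top authors plus 'other'
```

With `n=10` the same mapping would yield at most 11 classes, matching the description above.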

In [28]:
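# the most prolific authors from the "submitter" column, based on the train dataset
# NOTE: the slice keeps the first 10000 entries of value_counts(), although the variable
# name and the printed message say 100 and the section heading says 10
top100 = list(train_df.submitter.value_counts()[:10000].index)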

In [29]:
# count papers in the validation dataset written by these authors
# NOTE: "author in paper_authors" is a substring test on the submitter string, not an exact match
papers_top100_count = 0
for paper_authors in validate_df['submitter']:
    for author in top100:
        if author in paper_authors:
            papers_top100_count += 1
            break
papers_top100_perc = papers_top100_count / validate_df.shape[0] * 100
print(f"The top 100 authors in the train dataset wrote {papers_top100_count} out of {validate_df.shape[0]} papers in the validate dataset, \
so {100 - papers_top100_perc:.0f}% is the baseline for a classifier that can distinguish between one of the top 100 authors and 'other'.")

The top 100 authors in the train dataset wrote 3493 out of 17500 papers in the validate dataset, so 80% is the baseline for a classifier that can distinguish between one of the top 100 authors and 'other'.
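
Because the loop above tests `author in paper_authors` on strings, a short author name can match inside a longer one. The sketch below, on hypothetical names, contrasts that substring test with an exact match via `Series.isin`; whether the two differ on the real data is not checked here.

```python
import pandas as pd

# Hypothetical names chosen so that one is a prefix of another
top_authors = ["Li", "A. Smith"]
submitters = pd.Series(["Li", "Liang Chen", "A. Smith", "B. Jones"])

# Substring test, as in the loop: "Li" also matches "Liang Chen"
substring_hits = sum(any(author in s for author in top_authors) for s in submitters)

# Exact test: the submitter string must equal one of the top authors
exact_hits = int(submitters.isin(top_authors).sum())

print(substring_hits, exact_hits)  # 3 2
```

The exact test is also vectorized, so it avoids the nested Python loop over papers and authors.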


In [19]:
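# DummyClassifier ignores the features and always predicts the most frequent label.
# The labels here are the raw submitter names (15197 unique among 17500 articles),
# not the top-author/"other" classes, so the most frequent one covers almost no papers.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(validate_df, labels_validate)
baseline_score_validate = dummy.score(validate_df, labels_validate)
print("The accuracy of the 'most frequent' baseline for the validation dataset is {:.2f}.".format(baseline_score_validate))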

The accuracy of the 'most frequent' baseline for the validation dataset is 0.00.
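
The 0.00 accuracy comes from scoring against raw submitter names. A possible variant, sketched here on toy data, first maps every author outside the top group to "other" and fits the dummy on the train labels. The toy names and the choice of 2 top authors are assumptions for illustration only.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy data (assumption: not taken from the arXiv files)
toy_train = pd.Series(["ann", "ann", "ann", "bob", "bob", "cy", "dee", "eve", "fay"])
toy_validate = pd.Series(["ann", "gus", "bob", "hal", "ida"])

# Collapse every author outside the top 2 (by train frequency) into "other"
top_authors = toy_train.value_counts().index[:2]
y_train = toy_train.where(toy_train.isin(top_authors), "other")
y_validate = toy_validate.where(toy_validate.isin(top_authors), "other")

# DummyClassifier ignores the features, so a placeholder column is enough
X_train = [[0]] * len(y_train)
X_validate = [[0]] * len(y_validate)

# Baseline on collapsed labels: always predicts "other"
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
collapsed_score = dummy.score(X_validate, y_validate)

# Baseline on raw labels: always predicts the single most frequent author
raw_dummy = DummyClassifier(strategy="most_frequent")
raw_dummy.fit(X_train, toy_train)
raw_score = raw_dummy.score(X_validate, toy_validate)

print(f"collapsed labels: {collapsed_score:.2f}, raw labels: {raw_score:.2f}")
```

On this toy data the collapsed baseline equals the share of "other" papers in the validation labels, which is the quantity the percentage computed earlier is meant to capture.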
