# Test: Using a topic model to guess authors of a paper submitted for blind peer review

Goal: **Guess the authors of a paper submitted for blind peer review.**

A guess should be made based solely on the text (abstract), citations, year of submission and target journal. No other information is available at the time of submission; notably, the article classification used by the journal is not always available at that point.

Assumption: **An individual author writes about similar topics within a short time range.**

This assumption drives the choice of dataset for the experiment. Data was obtained by querying arXiv for articles in physics (the largest category on arXiv) published in the last 2 years. The 2-year limit is arbitrary; it follows from the assumption that most researchers will not publish on an unrelated topic within such a short period.

Therefore: **It should be possible to compute a "topic profile" for individual authors by extracting the topics of all their articles.**

Based on these topic profiles, it should be possible to compute the "topic distance" between two authors. Guessing the author of an article then amounts to extracting the topics from the article and finding the authors with the most similar topic profile, i.e. those with the shortest topic distance to the article.
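
The idea above can be illustrated with a minimal sketch. It assumes that each article has already been reduced to a topic distribution (a list of weights, one per topic), that an author's profile is the mean of their articles' distributions, and that topic distance is cosine distance. None of these choices is fixed by the text above; the function names and the toy data are hypothetical.

```python
import math


def topic_profile(article_topic_vectors):
    """Average the topic vectors of one author's articles (assumed profile definition)."""
    n_articles = len(article_topic_vectors)
    n_topics = len(article_topic_vectors[0])
    return [
        sum(vec[i] for vec in article_topic_vectors) / n_articles
        for i in range(n_topics)
    ]


def cosine_distance(p, q):
    """Cosine distance between two topic vectors (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0


def guess_authors(article_vector, profiles, top_n=3):
    """Return the top_n authors whose profile is closest to the article."""
    ranked = sorted(
        profiles,
        key=lambda author: cosine_distance(article_vector, profiles[author]),
    )
    return ranked[:top_n]


# Toy data: three hypothetical authors, three topics.
articles_by_author = {
    "author_a": [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]],
    "author_b": [[0.1, 0.8, 0.1]],
    "author_c": [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]],
}
profiles = {
    author: topic_profile(vectors)
    for author, vectors in articles_by_author.items()
}
ranking = guess_authors([0.7, 0.2, 0.1], profiles, top_n=2)
```

In this toy example the unseen article is dominated by the first topic, so `author_a` ranks first. Other distance measures for probability distributions could be substituted in `cosine_distance` without changing the rest of the sketch.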

### Workflow

#### Data acquisition ([01_process_snapshot](./01_process_snapshot.ipynb))
A data dump was obtained from arXiv [1]. The data contains metadata and abstracts for 2,412,624 articles. The data was augmented with columns for:
* year of submission to arXiv
* binary columns for broad categories ("Computer Science", "Economics", "Mathematics", "Physics", etc.); an article can be classified within multiple categories.
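
The two augmentation steps could look roughly like the sketch below. It assumes each record carries a space-separated `categories` string and a `versions` list whose first entry has a `created` date in RFC 2822 form; these field names, the partial prefix-to-category mapping and the sample record are assumptions for illustration, and the notebook may implement this differently.

```python
from email.utils import parsedate_to_datetime

# Archives that the arXiv taxonomy groups under Physics.
PHYSICS_ARCHIVES = {
    "astro-ph", "cond-mat", "gr-qc", "hep-ex", "hep-lat", "hep-ph", "hep-th",
    "math-ph", "nlin", "nucl-ex", "nucl-th", "physics", "quant-ph",
}

# Partial, illustrative mapping from archive prefix to broad category.
OTHER_ARCHIVES = {
    "cs": "Computer Science",
    "econ": "Economics",
    "math": "Mathematics",
}

BROAD_CATEGORIES = ["Computer Science", "Economics", "Mathematics", "Physics"]


def broad_categories(categories):
    """Map a space-separated category string to a set of broad categories."""
    found = set()
    for code in categories.split():
        archive = code.split(".")[0]  # "cs.LG" -> "cs", "hep-th" -> "hep-th"
        if archive in PHYSICS_ARCHIVES:
            found.add("Physics")
        elif archive in OTHER_ARCHIVES:
            found.add(OTHER_ARCHIVES[archive])
    return found


def augment(record):
    """Return a copy of the record with a year and one boolean flag per broad category."""
    created = record["versions"][0]["created"]  # date of the first version
    year = parsedate_to_datetime(created).year
    found = broad_categories(record["categories"])
    flags = {name: name in found for name in BROAD_CATEGORIES}
    return {**record, "year": year, **flags}


# Hypothetical record, shaped like the assumed input format.
sample = {
    "id": "example-id",
    "categories": "hep-th cs.LG",
    "versions": [{"version": "v1", "created": "Tue, 3 Jan 2023 10:00:00 GMT"}],
}
augmented = augment(sample)
```

Because an article can carry several category codes, the flags are independent booleans rather than a single label; the sample record above is flagged as both Physics and Computer Science.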

#### Data preparation ([02_prepare_data](./02_prepare_data.ipynb))
The data was filtered for all Physics articles (the largest category on arXiv) submitted since the beginning of 2023 (assuming that researchers are active in the present). The resulting dataset has 90,530 articles, written by 246,443 authors. It was split into train (50%), validate (25%) and test (25%) datasets.
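
A 50/25/25 split can be sketched as follows. This assumes a random split at the article level with a fixed seed; the text does not state how the split was actually drawn, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, fractions=(0.5, 0.25, 0.25), seed=42):
    """Shuffle the items reproducibly and cut them into train/validate/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_validate = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # remainder, absorbs rounding
    return train, validate, test


# Stand-in for the 90,530 article identifiers.
article_ids = list(range(90530))
train, validate, test = split_dataset(article_ids)
```

Since authors usually have several articles, an article-level split like this one lets the same author appear in more than one subset, which is what allows profiles built on the training set to be matched against held-out articles.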

In order to prepare the article abstracts for topic analysis, the abstracts were processed as follows:
* Apply pre-processing filters: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short. For the exact function of each filter see [2].
* Apply lemmatization, using the NLTK library [3].



### References
* [1] arXiv.org submitters. (2024). arXiv Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7548853
* [2] [The gensim preprocessing module](https://github.com/piskvorky/gensim/blob/develop/gensim/parsing/preprocessing.py)
* [3] [WordNetLemmatizer in NLTK library documentation](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet)