# Test: Using a topic model to guess authors of a paper submitted for blind peer review
Goal: **Guess the authors of a paper submitted for blind peer review.**

Assumption: **Articles written by the same authors are about similar topics.**

Therefore: **Articles written by the same authors should be close together in topic space.**

This is interesting because: **The true authors of an article submitted for blind peer review are more likely to be found among the authors of papers that lie close to it in topic space.**

### Workflow

#### Data acquisition (Notebook 00_process_snapshot)
arXiv was queried for all articles in Physics (the largest category on arXiv) submitted since the beginning of 2023, on the assumption that their authors are currently active. The dataset has 70,000 articles, written by approximately 200,000 authors.

Source: arXiv.org submitters. (2024). arXiv Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7548853

#### Data preparation (Notebook 01_prepare_data)
The data was split into train (50%), validate (25%) and test (25%) datasets. The abstracts were processed as follows:
* Apply pre-processing filters: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
* Apply lemmatization

#### Fit topic model (Notebook 02_fit_topic_model)
An LDA topic model was used to discover latent topics in the data. The model was fitted on 35,000 abstracts from the train dataset. The number of topics was chosen by minimizing the perplexity measure. The model and the perplexity measurement were implemented in Python using the gensim library.
* [LDA implementation from gensim library](https://radimrehurek.com/gensim/models/ldamodel.html)
* [Original LDA paper](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

#### Assign topics to articles (Notebook 03_assign_topics)
Topics were assigned to the articles in the train, validate and test datasets. Each article is assigned a list of probabilities, one per topic, each giving the probability that the topic is relevant to the article. An article can therefore have multiple topics.
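
gensim's `LdaModel.get_document_topics` returns a sparse list of `(topic_id, probability)` pairs and, by default, leaves out topics with very low probability. For distance computations a fixed-length vector is more convenient. The helper below is a minimal sketch of that conversion; `to_dense` is a hypothetical name, not something taken from the notebook.

```python
import numpy as np


def to_dense(sparse_topics, num_topics):
    """Convert [(topic_id, probability), ...] into a vector of length num_topics.

    Topics missing from the sparse list get probability 0.
    """
    vector = np.zeros(num_topics)
    for topic_id, probability in sparse_topics:
        vector[topic_id] = probability
    return vector


# Example: an article dominated by topic 0 with some weight on topic 3.
article_vector = to_dense([(0, 0.7), (3, 0.3)], num_topics=5)
```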

#### Measure distances (Notebook 04_measure_distances)
The distance between two articles in topic space is measured by computing the Euclidean distance between their topic probability vectors as assigned by the model. The hypothesis that **articles written by the same authors should be close together in topic space** can be tested by:
* measuring the distance between articles written by at least one common author
* measuring a baseline distance: between articles and the center of the topic space (?) or between articles and all other articles (?)
* comparing the two distances
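
The comparison above can be sketched as follows. This version assumes the baseline is the mean distance over all article pairs, which is only one of the options still marked as open above. The function name and the toy data are made up for illustration, and the loop over all pairs is quadratic, so a dataset of this size would likely need sampling.

```python
from itertools import combinations

import numpy as np


def mean_pair_distances(topic_vectors, author_sets):
    """Return (mean distance of pairs sharing an author, mean distance of all pairs)."""
    shared, everyone = [], []
    for i, j in combinations(range(len(topic_vectors)), 2):
        distance = float(np.linalg.norm(topic_vectors[i] - topic_vectors[j]))
        everyone.append(distance)
        if author_sets[i] & author_sets[j]:  # at least one common author
            shared.append(distance)
    mean_shared = float(np.mean(shared)) if shared else float("nan")
    mean_all = float(np.mean(everyone)) if everyone else float("nan")
    return mean_shared, mean_all


# Toy example: articles 0 and 1 share author "b" and have similar topics.
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
authors = [{"a", "b"}, {"b"}, {"c"}]
mean_shared, mean_all = mean_pair_distances(vectors, authors)
```

A significance test on the two distance samples would be a natural next step, but is left out of this sketch.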

Preliminary results: the distance between articles written by at least one common author is significantly shorter than the mean distance between articles. Therefore: assuming that articles written by the same authors should be about similar topics, the topic model is measuring something real.

#### To do
* Formalize all this
* Improve the model capability by using the distance between articles as fit criterion
* Compute the probability of guessing the correct author
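
One possible way to approach the last item, given only as a sketch under assumptions: rank candidate authors by the distance from the blind paper to their closest known article, then estimate how often a true author appears among the top `k` candidates. Nothing here is implemented in the notebooks; the ranking rule, the function names and the choice of `k` are all hypothetical.

```python
import numpy as np


def rank_candidate_authors(query_vector, known_vectors, known_authors):
    """Rank authors by the distance from the query to their closest known article."""
    best = {}
    for vector, authors in zip(known_vectors, known_authors):
        distance = float(np.linalg.norm(np.asarray(query_vector) - np.asarray(vector)))
        for author in authors:
            best[author] = min(distance, best.get(author, float("inf")))
    return sorted(best, key=best.get)


def top_k_hit_rate(queries, true_authors, known_vectors, known_authors, k=10):
    """Fraction of queries with at least one true author among the top-k candidates."""
    hits = 0
    for query, truth in zip(queries, true_authors):
        top = set(rank_candidate_authors(query, known_vectors, known_authors)[:k])
        hits += bool(top & truth)
    return hits / len(queries)


# Toy example with two known articles and one blind submission.
known_vectors = [[0.9, 0.1], [0.1, 0.9]]
known_authors = [{"alice"}, {"bob"}]
ranking = rank_candidate_authors([0.8, 0.2], known_vectors, known_authors)
rate = top_k_hit_rate([[0.8, 0.2]], [{"alice"}], known_vectors, known_authors, k=1)
```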
