# Test: Using a BERT topic model to guess authors of a paper submitted for blind peer review

*Goal*
> **Guess the authors of a paper submitted for blind peer review.**

A guess should be based solely on:
* text (abstract)
* references to other papers
* year of submission
* target journal, i.e. the journal that asked for the review

No other information is available at the time of submission. Notably, the article classification used by the journal is not always available. However, a broad category such as "Physics" or "Computer Science" can be deduced from the abstract or the target journal.

*Assumptions*
> 1. **An individual author writes about similar topics within a short time range.**
> 2. **The author of the paper submitted for review is actively involved in research at the present time.**
> 3. **Individual authors can be identified by name.**

These assumptions guide the choice of dataset for the experiment. Data was obtained by querying arXiv for articles that meet these inclusion criteria:
* Articles have to be in the broad category "Physics" (the largest category on arXiv).
* Articles have to be published in the last 2 years. The 2-year span is arbitrary and is an attempt to balance assumptions 1 and 2.

*Hypothesis*
> **It should be possible to compute a "topic distribution" for individual authors by extracting the topics from all their articles combined. Based on these topic distributions, it should be possible to compute the "topic distance" between two authors. Guessing the author of an article then could be accomplished by extracting the topics from the newly submitted article and finding the authors with the most similar topic distributions, i.e. with the shortest topic distance to the article.**

Topics can be extracted by applying the LDA algorithm [4] to the abstracts of the articles. The number of topics can be chosen by minimizing the perplexity of the model. Neither LDA nor perplexity needs to be implemented from scratch, as the Gensim library [5] provides a well-tested implementation. Topic distances are Euclidean distances in topic space.
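The hypothesis can be illustrated with a minimal sketch. It assumes that an author's topic distribution is the mean of the topic vectors of their articles (one possible reading of "all their articles combined") and ranks authors by Euclidean distance to a new article. The function names, the data layout and the toy values are hypothetical and do not come from the notebooks.

```python
import numpy as np


def author_topic_distributions(doc_topics, doc_authors):
    """Average the topic vectors of each author's documents.

    doc_topics:  array of shape (n_docs, n_topics), one topic vector per document
    doc_authors: list of author-name lists, one per document (assumed layout)
    """
    sums, counts = {}, {}
    for topics, authors in zip(doc_topics, doc_authors):
        for author in authors:
            sums[author] = sums.get(author, 0.0) + topics
            counts[author] = counts.get(author, 0) + 1
    return {author: sums[author] / counts[author] for author in sums}


def rank_authors(article_topics, author_topics, top_n=3):
    """Return the top_n (author, distance) pairs closest to the article."""
    distances = {
        author: float(np.linalg.norm(article_topics - topics))
        for author, topics in author_topics.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])[:top_n]


# Toy data: 4 documents, 3 topics (values are made up for illustration).
doc_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
])
doc_authors = [["Alice"], ["Alice", "Bob"], ["Carol"], ["Carol", "Bob"]]

author_topics = author_topic_distributions(doc_topics, doc_authors)
new_article = np.array([0.7, 0.2, 0.1])
ranking = rank_authors(new_article, author_topics)
```

In this toy example the new article sits closest to the author whose articles concentrate on the first topic. Averaging weights every article equally; weighting by recency would be another option under assumption 1.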

### Workflow

#### Data acquisition ([01_process_snapshot](./01_process_snapshot.ipynb))
A data dump was obtained from arXiv [1]. The data contains metadata and abstracts for 2,486,206 articles. The data was augmented with columns for:
* year of submission to arXiv
* binary columns for broad categories ("Computer Science", "Economics", "Mathematics", "Physics", etc.); an article can be classified within multiple categories.
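The augmentation can be sketched as follows. The sketch assumes that each record carries a space-separated `categories` string and a `versions` list whose first entry holds the original submission date. These field names, the prefix-to-category mapping (a subset only) and the sample record are assumptions for illustration, not taken from the notebook.

```python
from email.utils import parsedate_to_datetime

# Assumed mapping from arXiv archive prefix to broad category (subset only).
BROAD_CATEGORIES = {
    "Computer Science": {"cs"},
    "Economics": {"econ"},
    "Mathematics": {"math"},
    "Physics": {
        "astro-ph", "cond-mat", "gr-qc", "hep-ex", "hep-lat", "hep-ph",
        "hep-th", "math-ph", "nlin", "nucl-ex", "nucl-th", "physics",
        "quant-ph",
    },
}


def augment(record):
    """Add a submission year and one boolean column per broad category."""
    # Assumption: the first entry in "versions" is the original submission.
    created = record["versions"][0]["created"]
    row = dict(record, year=parsedate_to_datetime(created).year)
    # Reduce each category to its archive prefix: "cs.IT" -> "cs".
    archives = {cat.split(".")[0] for cat in record["categories"].split()}
    for broad, prefixes in BROAD_CATEGORIES.items():
        row[broad] = bool(archives & prefixes)
    return row


record = {
    "id": "0000.00000",  # placeholder, not a real article
    "categories": "quant-ph cs.IT",
    "versions": [{"version": "v1", "created": "Mon, 2 Jan 2023 10:00:00 GMT"}],
}
row = augment(record)
```

The sample record is cross-listed, so it is flagged as both "Physics" and "Computer Science".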

#### Data preparation ([02_prepare_data](./02_prepare_data.ipynb))
The data was filtered to all Physics articles (the largest category on arXiv) submitted since the beginning of 2023 (assuming that researchers are active in the present). The resulting data has 90,530 articles, written by 246,443 authors. It was split into train (50%), validate (25%) and test (25%) datasets.
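A 50/25/25 split can be sketched as below, assuming a simple random shuffle with a fixed seed. Whether the notebook shuffles, stratifies or splits by author is not stated, so the function name and the seed are hypothetical.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle and split into train (50%), validate (25%) and test (25%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) // 2
    n_validate = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test


article_ids = list(range(90_530))  # stand-in for the real article identifiers
train, validate, test = split_dataset(article_ids)
```

With 90,530 articles this yields 45,265 training, 22,632 validation and 22,633 test articles. Every article lands in exactly one subset.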

#### Fit topic model ([03_fit_topic_model](./03_fit_topic_model.ipynb))
Topic distributions are computed by applying BERT. These are soft topic distributions, where each document can have multiple topics. The result is a matrix in which each row is a document and each column holds the probability that the document belongs to a topic.
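To make the shape of such a matrix concrete, the following sketch builds a soft document-topic matrix from embeddings. It is a generic illustration and not necessarily how the notebook or any particular BERT-based library computes the distributions. It assumes that document and topic embeddings are already available and applies a softmax over cosine similarities. The function name, the `temperature` parameter and the random embeddings are hypothetical.

```python
import numpy as np


def soft_topic_distribution(doc_embeddings, topic_embeddings, temperature=0.1):
    """Turn cosine similarities into one probability row per document."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    similarity = docs @ topics.T                  # shape (n_docs, n_topics)
    logits = similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(4, 16))       # 4 made-up topics
# Three made-up documents lying near topics 0, 2 and 2 respectively.
doc_embeddings = topic_embeddings[[0, 2, 2]] + 0.1 * rng.normal(size=(3, 16))
doc_topic_matrix = soft_topic_distribution(doc_embeddings, topic_embeddings)
```

Each row of `doc_topic_matrix` sums to 1, and a lower `temperature` concentrates more of the probability mass on the most similar topic.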

### References
* [1] arXiv.org submitters. (2024). arXiv Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7548853
* [2] [The gensim preprocessing module](https://github.com/piskvorky/gensim/blob/develop/gensim/parsing/preprocessing.py)
* [3] [WordNetLemmatizer in NLTK library 3.8](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet)
* [4] [Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
* [5] [GENSIM library 4.3](https://radimrehurek.com/gensim/models/ldamodel.html)
* [6] Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., ... & Adam, S. (2021). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. In Computational methods for communication science (pp. 13-38). Routledge.