# Test: Using a topic model to guess authors of a paper submitted for blind peer review

*Goal*
> **Guess the authors of a paper submitted for blind peer review.**

A guess should be made based solely on: 
* text (abstract)
* references to other papers 
* year of submission
* target journal, i.e. the journal that asked for the review

No other information is available at the time of submission. In particular, the article classification used by the journal is not always available. However, a broad category such as "Physics" or "Computer Science" can be deduced from the abstract or the target journal.

*Assumptions*
> 1. **An individual author writes about similar topics within a short time range.**
> 2. **The author of the paper submitted for review is actively involved in research at the present time.**
> 3. **Individual authors can be identified by name.**

These assumptions guided the choice of dataset for the experiment. Data was obtained by querying arXiv for articles using these inclusion criteria:
* Articles have to be in the broad category "Physics" (the largest category on arXiv).
* Articles have to be published in the last 2 years. This time span is arbitrary and is an attempt to balance assumptions 1 and 2.

*Hypothesis*
> **It should be possible to compute a "topic distribution" for individual authors by extracting the topics from all their articles combined. Based on these topic distributions, it should be possible to compute the "topic distance" between two authors. Guessing the author of an article then could be accomplished by extracting the topics from the newly submitted article and finding the authors with the most similar topic distributions, i.e. with the shortest topic distance to the article.**

Extracting topics can be accomplished by applying the LDA algorithm [4] to the abstracts of the articles. The best number of topics can be found by minimizing the perplexity of the model. LDA and perplexity need not be implemented from scratch, as the GENSIM library [5] provides a well-tested implementation. Topic distances are Euclidean distances in the topic space.
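The guessing step in the hypothesis can be sketched as a nearest-neighbour search in topic space. This is only an illustration: the function name `rank_authors`, the author names and the three-topic distributions are made up for the example and do not come from the notebooks.

```python
import math


def rank_authors(article_topics, author_topics, top_n=3):
    """Return the top_n (author, distance) pairs closest to the article."""
    distances = [
        (author, math.dist(article_topics, topics))  # Euclidean distance
        for author, topics in author_topics.items()
    ]
    return sorted(distances, key=lambda pair: pair[1])[:top_n]


# Hypothetical topic distributions over 3 topics (each sums to 1).
author_topics = {
    "Author A": [0.8, 0.1, 0.1],
    "Author B": [0.1, 0.8, 0.1],
    "Author C": [0.3, 0.3, 0.4],
}
article_topics = [0.7, 0.2, 0.1]

ranking = rank_authors(article_topics, author_topics, top_n=2)
for author, distance in ranking:
    print(f"{author}: {distance:.3f}")
```

Here "Author A" ranks first because its distribution is closest to that of the article, followed by "Author C".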

### Workflow

#### Data acquisition ([01_process_snapshot](./01_process_snapshot.ipynb))
A data dump was obtained from arXiv [1]. The data contains metadata and abstracts for 2,486,206 articles. The data was augmented with:
* a column for the year of submission to arXiv
* binary columns for broad categories ("Computer Science", "Economics", "Mathematics", "Physics", etc.); an article can be classified within multiple categories.
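A minimal sketch of this augmentation for a single metadata record is shown below. It assumes that each record has a space-separated `categories` string and a `versions` list whose first entry carries a `created` timestamp; the prefix-to-category mapping is a partial one written for this example, and the notebook may derive these columns differently.

```python
from email.utils import parsedate_to_datetime

# Assumed, partial mapping from arXiv archive prefixes to broad categories.
BROAD_CATEGORIES = {
    "Computer Science": {"cs"},
    "Economics": {"econ"},
    "Mathematics": {"math"},
    "Physics": {
        "astro-ph", "cond-mat", "gr-qc", "hep-ex", "hep-lat", "hep-ph",
        "hep-th", "math-ph", "nlin", "nucl-ex", "nucl-th", "physics",
        "quant-ph",
    },
}


def augment(record):
    """Return a copy of the record with a year and binary category columns."""
    # "cs.LG" -> "cs", "quant-ph" -> "quant-ph"
    archives = {cat.split(".")[0] for cat in record["categories"].split()}
    first_version = record["versions"][0]["created"]
    row = dict(record)
    row["year"] = parsedate_to_datetime(first_version).year
    for broad, prefixes in BROAD_CATEGORIES.items():
        row[broad] = int(bool(archives & prefixes))
    return row


# Hypothetical record with a placeholder id.
record = {
    "id": "0000.00000",
    "categories": "quant-ph cs.LG",
    "versions": [{"version": "v1", "created": "Mon, 2 Jan 2023 10:00:00 GMT"}],
}
row = augment(record)
print(row["year"], row["Physics"], row["Computer Science"], row["Mathematics"])
```

The example record is cross-listed, so both the "Physics" and the "Computer Science" columns are set to 1.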

#### Data preparation ([02_prepare_data](./02_prepare_data.ipynb))
The data was queried for all articles on Physics (the largest category on arXiv) submitted since the beginning of 2023 (assuming that researchers are active in the present). The resulting data has 90,530 articles, written by 246,443 authors. It was split into train (50%), validate (25%) and test (25%) datasets.
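The query and the 50/25/25 split could look roughly like the following sketch. The row layout (`year` and `Physics` keys), the function name `select_and_split` and the fixed seed are assumptions for illustration; the notebook may use a different splitting utility.

```python
import random


def select_and_split(rows, min_year=2023, seed=42):
    """Keep Physics rows from min_year onwards and split them 50/25/25."""
    selected = [r for r in rows if r["Physics"] == 1 and r["year"] >= min_year]
    random.Random(seed).shuffle(selected)  # reproducible shuffle
    n_train = len(selected) // 2
    n_validate = len(selected) // 4
    train = selected[:n_train]
    validate = selected[n_train:n_train + n_validate]
    test = selected[n_train + n_validate:]
    return train, validate, test


# Hypothetical rows: odd ids are from 2023, even ids from 2022.
rows = [{"id": i, "year": 2022 + i % 2, "Physics": 1} for i in range(16)]
rows.append({"id": 100, "year": 2023, "Physics": 0})

train, validate, test = select_and_split(rows)
print(len(train), len(validate), len(test))
```

Eight rows pass the filter, so the three parts contain 4, 2 and 2 rows.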

To prepare the article abstracts for topic analysis, the abstracts were processed as follows:
* Apply pre-processing filters: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, following a procedure delineated in [6]. For the exact function of each filter see [2].
* Apply lemmatization. Lemmatization was accomplished using the NLTK Library [3]. Note that stemming was not applied.

#### Fit topic model ([03_fit_topic_model](./03_fit_topic_model.ipynb))
An LDA topic model [4] was used to discover latent topics in the data. The model was fitted on 45,265 abstracts from the train dataset. The best number of topics was determined by minimizing the perplexity measure. The model and the perplexity measure were implemented in Python using the GENSIM library [5]. *For a summary explanation of perplexity in the context of LDA, see the note below.*

#### Assign topic distributions to authors ([04_assign_topics](./04_assign_topics.ipynb))
List all authors in the dataset. For each author, merge the abstracts of all their articles, then apply the pre-processing filters and lemmatization. Finally, apply the trained LDA topic model to the merged text and assign the resulting topic distribution to that author.
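The per-author aggregation can be sketched as follows. The article layout (`authors` list and `abstract` string) is an assumption, and `infer_topics` is a hypothetical callable standing in for pre-processing, lemmatization and inference with the trained LDA model; the toy version below only counts two keywords so that the example runs on its own.

```python
from collections import defaultdict


def merge_abstracts_by_author(articles):
    """Concatenate the abstracts of all articles of each author."""
    merged = defaultdict(list)
    for article in articles:
        for author in article["authors"]:
            merged[author].append(article["abstract"])
    return {author: " ".join(texts) for author, texts in merged.items()}


def assign_topics(articles, infer_topics):
    """Map each author to the topic distribution of their merged text."""
    merged = merge_abstracts_by_author(articles)
    return {author: infer_topics(text) for author, text in merged.items()}


def toy_infer_topics(text):
    """Stand-in for the LDA model: relative counts of two keywords."""
    words = text.lower().split()
    counts = [words.count("quantum"), words.count("galaxy")]
    total = sum(counts) or 1
    return [count / total for count in counts]


# Hypothetical articles and author names.
articles = [
    {"authors": ["A. Author", "B. Author"], "abstract": "quantum quantum galaxy"},
    {"authors": ["A. Author"], "abstract": "galaxy"},
]
author_topics = assign_topics(articles, toy_infer_topics)
print(author_topics)
```

With gensim, the stand-in could be replaced by a function that converts the merged text to a bag of words and queries the trained model for its topic distribution.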

### References
* [1] arXiv.org submitters. (2024). arXiv Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7548853
* [2] [The gensim preprocessing module](https://github.com/piskvorky/gensim/blob/develop/gensim/parsing/preprocessing.py)
* [3] [WordNetLemmatizer in NLTK library 3.8](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet)
* [4] [Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
* [5] [GENSIM library 4.3](https://radimrehurek.com/gensim/models/ldamodel.html)
* [6] Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., ... & Adam, S. (2021). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. In Computational methods for communication science (pp. 13-38). Routledge.

### Note: Perplexity in the Context of LDA
In the context of Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, 
**perplexity** is a measurement used to evaluate the goodness of fit of a statistical model. 
Specifically, in LDA, perplexity measures how well the model predicts a sample of unseen data.

#### Understanding Perplexity
1. **Definition**:
   - Perplexity is a measure of how well a probabilistic model predicts a set of data. 
     It is often used in natural language processing to evaluate language models.
   - Mathematically, it is the exponential of the average negative log-likelihood of a test set 
     under the model.
2. **In LDA**:
   - LDA is used to discover hidden topics in a collection of documents. It assumes that 
     documents are a mixture of topics and that topics are a mixture of words.
   - After training an LDA model on a training set, perplexity is used to assess the model’s 
     performance on a test set. A lower perplexity score indicates a better model because it 
     means the model is better at predicting the words in the test set.

#### Calculation of Perplexity
To calculate the perplexity of an LDA model on a set of documents, follow these steps:
1. **Estimate the Log-Likelihood**:

   Compute the log-likelihood of the test set using the trained LDA model. The log-likelihood 
   measures how likely the model thinks the test data is, given the topics discovered during 
   training.

2. **Average Negative Log-Likelihood**:

   Compute the average negative log-likelihood per word in the test set.

3. **Exponentiation**:

   Take the exponential of the average negative log-likelihood to get the perplexity score.

Mathematically, perplexity $P(D)$ for a set of documents $D$ is given by:

$$ 
P(D) = \exp \left( -\frac{1}{N} \sum_{d \in D} \log p(w_d) \right) 
$$

*where*:
- $N$ is the total number of words in $D$.
- $w_d$ represents the words in document $d$.
- $p(w_d)$ is the probability of the words in $d$ given the LDA model.
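
As a small worked example of this formula, the sketch below computes the perplexity from per-document log-likelihoods. The numbers are made up: a model that assigns probability 1/4 to every word, as if choosing uniformly among 4 words, should have a perplexity of about 4 regardless of document length.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative log-likelihood averaged over all N words."""
    total_words = sum(doc_lengths)                  # N
    total_log_likelihood = sum(doc_log_likelihoods)  # sum of log p(w_d)
    return math.exp(-total_log_likelihood / total_words)


# Two hypothetical documents with 3 and 5 words; each word has probability 1/4.
doc_lengths = [3, 5]
doc_log_likelihoods = [n * math.log(1 / 4) for n in doc_lengths]

result = perplexity(doc_log_likelihoods, doc_lengths)
print(round(result, 6))
```

This matches the common reading of perplexity as the effective number of equally likely words the model is choosing between.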

#### Interpretation

- **Lower Perplexity**: Indicates the model is good at predicting the words in the test set, 
  suggesting a better fit to the data.
- **Higher Perplexity**: Indicates the model struggles to predict the words in the test set, 
  suggesting a worse fit to the data.

#### Use in Model Selection

When developing an LDA model, you might train multiple models with different numbers of topics 
or different hyperparameters. Perplexity is used to compare these models and select the one 
that best fits the data. However, it is important to note that perplexity is not the only 
metric for evaluating LDA models. Coherence scores and human interpretability of the topics 
are also important considerations.

