# Reproducibility

The model used in our paper is not entirely reproducible. In this notebook, we outline why this is the case and why your results may differ from ours when you attempt to reproduce them using the Jupyter notebook (2024_CDC_Topic_Model_Code_Workbook.ipynb).

## 1. Library Dependencies:
Results will vary depending on your versions of Python and of the other libraries used to generate the model. If you want to match libraries and dependencies as closely as possible, please consult Topic_Modeling_Library_Versions.txt.
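
As a convenience, the sketch below prints the versions installed in your current environment so you can compare them by hand against Topic_Modeling_Library_Versions.txt. The package list and the helper name `installed_versions` are our assumptions for illustration; adjust the list to whatever that file actually contains.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Assumed list of relevant PyPI distribution names; edit to match the versions file.
PACKAGES = ["bertopic", "umap-learn", "hdbscan", "sentence-transformers", "numpy", "pandas"]


def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print the Python version followed by one line per package.
print("python", sys.version.split()[0])
for name, ver in installed_versions(PACKAGES).items():
    print(name, ver if ver is not None else "not installed")
```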

## 2. UMAP is inherently stochastic:
Within a single environment, setting the random state of UMAP should ensure that the same results are generated every time the notebook is run. This does not hold when switching between environments (a different OS, etc.). Even with the random state set to the same value as in our Jupyter notebook (2024_CDC_Topic_Model_Code_Workbook.ipynb), you will likely not reproduce our UMAP results in a different environment, which means the final topic model results will also differ.

For more discussion about this issue, please see:
- https://github.com/MaartenGr/BERTopic/issues/559
- https://github.com/lmcinnes/umap/issues/153
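
If you rerun the full pipeline and want to check whether two runs grouped the documents identically, one option is to compare the topic assignments up to relabeling, since topic numbers can differ even when the groupings match. This is a minimal sketch, not part of our workflow; the helper name `same_partition` is hypothetical, and the inputs are two lists of topic labels in the same document order.

```python
def same_partition(labels_a, labels_b):
    """Return True if both label lists group the documents identically,
    allowing the topic numbers themselves to differ between runs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both runs must label the same number of documents.")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in run A must map to exactly one label in run B, and vice versa.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Toy example: same grouping with topics 0 and 1 swapped, then a different grouping.
print(same_partition([0, 0, 1, -1], [1, 1, 0, -1]))
print(same_partition([0, 0, 1], [0, 1, 1]))
```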

## 3. Our loaded model is a reduced model and will categorize the same documents differently:
We saved our model in a lightweight format (see more info: https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html). As a result, the saved model does not include the full fitted UMAP and HDBSCAN models.

Instead, the loaded model assigns topics using cosine similarity between the cluster representations (from keywords) and the document embeddings (see the documentation of BERTopic's transform function here: https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py). This reduced model usually results in fewer publications being classified as unclustered.
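
To make the idea concrete, here is a simplified sketch of similarity-based assignment: each document receives the label of the cluster representation it is most similar to. This is an illustration only, not BERTopic's actual code. The function names and the toy 2-D vectors are hypothetical, and we assume for illustration that the unclustered label (-1) has its own representation row.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_topics(doc_embeddings, topic_embeddings, topic_ids):
    """Give each document the id of its most similar topic representation."""
    assigned = []
    for doc in doc_embeddings:
        sims = [cosine_similarity(doc, topic) for topic in topic_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        assigned.append(topic_ids[best])
    return assigned


# Toy data: three topic representations (including -1) and three documents.
topic_ids = [-1, 0, 1]
topic_embeddings = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
doc_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(assign_topics(doc_embeddings, topic_embeddings, topic_ids))  # [0, 1, -1]
```

Because every document is simply matched to its nearest representation, there is no density-based step that would flag a document as noise, which is consistent with fewer publications ending up unclustered.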

Please consult the example code below for loading the reduced model:

In [None]:
# Standard Libraries and Others
import pandas as pd
import numpy as np
import requests
import os

In [None]:
print(os.getcwd())

In [None]:
from bertopic import BERTopic
from transformers.pipelines import pipeline
from transformers import AutoTokenizer, AutoModel
from bertopic.vectorizers import ClassTfidfTransformer

Here, we import Science Clips. 

Science Clips is available as an Excel download from: https://www.cdc.gov/library/sciclips/download/.

Because we used a specific version of Science Clips (accessed 5/3/2024), your results may vary if you use a fresh copy of Science Clips; to best replicate our results, limit the "Date" field of your download to 5/3/2024 or earlier. 

Our copy has a large file size that isn't stored in this GitHub repo. If you would like a copy of the specific version of Science Clips we used, please contact us.

In [None]:
df_clips = pd.read_excel('ScienceClips_accessed20240503.xlsx')

In [None]:
# Limit the dataset to publications from 2014 to 2023 (inclusive)
df_clips = df_clips[df_clips['Year'] <= 2023]
df_clips = df_clips[df_clips['Year'] >= 2014]

In [None]:
df_clips.info()

In [None]:
docs = list(df_clips['Title'] + ' ' + df_clips['Abstract'])

In [None]:
type(docs)

In [None]:
len(docs)

In [None]:
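# Make sure every document is a string; rows with a missing title or abstract
# are NaN after the concatenation above and become the string 'nan' here
docs = [str(doc) for doc in docs]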

### At this step, we embed the documents using the same embedding model as the original full model:

In [None]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("sentence-transformers/allenai-specter")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

### At this step, we load the reduced model from the directory: 

In [None]:
# Load from directory
loaded_model = BERTopic.load("FinalModel_SPECTER_20240824", embedding_model=embedding_model)

### In the loaded model, we can see the original counts in each cluster from the full model:

In [None]:
loaded_model.get_topic_info()

### We can also see the original topic labels for each document (in the same order as initial input):

In [None]:
# In BERTopic v0.9.2 or higher:
topics_original = loaded_model.topics_

In [None]:
topics_original

### However, if we transform the documents using the reduced model:

In [None]:
topics, probs = loaded_model.transform(docs, embeddings=embeddings)

### We see that fewer documents are unclustered (-1) and counts for other clusters have changed:

In [None]:
pd.Series(topics).value_counts()

The reduced model moves many publications from unclustered into various topics. It works reasonably well, though it does not behave the same as the full model. We checked a random sample of 100 publications that changed labels in three categories: from unclustered to clustered, from clustered to unclustered, and from one cluster to another. Here is a summary of what we found:

- When it moved an unclustered ("noise") paper into a cluster, the move seemed reasonable about 82% of the time in the random sample. That is not bad given how much noise the clusters contained to begin with.
- When it moved a clustered paper to unclustered ("noise"), the move seemed reasonable 34% of the time (about 1 in 3) in the random sample. Most of the time, the paper looked like it did belong in its original cluster. However, this happened fewer than 300 times in the dataset (0.1% of papers), so the behavior was rare.
- When it moved a clustered paper to another cluster, the move seemed reasonable about 72% of the time, which we consider good. There were some clear-cut examples, such as a paper titled "Diabetes and tuberculosis in the Pacific Islands region" moving from "Tuberculosis" to "Diabetes or Cardiovascular Health", where the paper could plausibly belong to either cluster or both.
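
The three categories above can be tallied by comparing the original labels with the transformed labels, document by document. The sketch below is a minimal illustration on toy lists; the helper names are hypothetical, and in the notebook you would pass `topics_original` and `topics` instead.

```python
from collections import Counter


def classify_change(old, new):
    """Name the kind of label change for one document (-1 means unclustered)."""
    if old == new:
        return "unchanged"
    if old == -1:
        return "unclustered_to_clustered"
    if new == -1:
        return "clustered_to_unclustered"
    return "cluster_to_cluster"


def summarize_changes(labels_original, labels_new):
    """Count how many documents fall into each kind of label change."""
    if len(labels_original) != len(labels_new):
        raise ValueError("Both label lists must cover the same documents.")
    return Counter(classify_change(o, n) for o, n in zip(labels_original, labels_new))


# Toy example with five documents.
summary = summarize_changes([-1, -1, 0, 1, 2], [0, -1, -1, 1, 0])
for kind, count in sorted(summary.items()):
    print(kind, count)
```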