# Math Research Compass: Data Processing Workflow

This notebook demonstrates the complete data processing workflow for the Math Research Compass project, from raw arXiv data to topic modeling and visualization.

## 1. Data Collection and Filtering

The project uses data from the [Kaggle ArXiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv), a weekly-updated mirror of the metadata for approximately 2.7 million arXiv papers. Each record contains the following fields:
* `id`: ArXiv ID (can be used to access the paper, see below)
* `submitter`: Who submitted the paper
* `authors`: Authors of the paper
* `title`: Title of the paper
* `comments`: Additional info, such as number of pages and figures
* `journal-ref`: Information about the journal the paper was published in
* `doi`: [Digital Object Identifier](https://www.doi.org)
* `abstract`: The abstract of the paper
* `categories`: Categories / tags in the ArXiv system
* `versions`: A version history

You can access each paper directly on ArXiv using these links:

* `https://arxiv.org/abs/{id}`: Page for this paper including its abstract and further links
* `https://arxiv.org/pdf/{id}`: Direct link to download the PDF
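
As a small illustration of the two URL patterns above, the following sketch builds both links from an `id` value. The helper name `arxiv_links` is hypothetical and not part of the project code.

```python
def arxiv_links(arxiv_id: str) -> dict[str, str]:
    """Build the abstract-page and PDF URLs for one arXiv ID (hypothetical helper)."""
    arxiv_id = arxiv_id.strip()
    return {
        "abs": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }


links = arxiv_links("0704.0001")
print(links["abs"])
print(links["pdf"])
```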


We filter this dataset to focus only on mathematics papers.

In [2]:
import pandas as pd
from pathlib import Path

# Read the arXiv dataset
df = pd.read_json('../data/raw/arxiv-metadata-oai-snapshot.json', lines=True)

# Select only relevant columns
cols = ['id', 'authors', 'title', 'categories', 'abstract', 'update_date', 'authors_parsed']
df = df[cols]

# Filter to only include math papers
math_df = df[df['categories'].str.contains('math', na=False)]
print(f"Found {len(math_df)} mathematics papers")


Found 696603 mathematics papers
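
Reading the whole snapshot with `pd.read_json` holds every record and every column in memory before the filter runs. If that becomes a problem, the same selection could be made while streaming the file one line at a time. The sketch below is one possible approach under that assumption, not the notebook's method; `iter_math_records` and `KEEP_FIELDS` are hypothetical names, and the substring test mirrors the `str.contains('math')` filter used above.

```python
import json
from typing import Iterator

# Same columns the notebook keeps after loading.
KEEP_FIELDS = ("id", "authors", "title", "categories",
               "abstract", "update_date", "authors_parsed")


def iter_math_records(path: str) -> Iterator[dict]:
    """Yield trimmed records whose `categories` string contains 'math'."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "math" in (record.get("categories") or ""):
                # Missing fields become None rather than raising KeyError.
                yield {field: record.get(field) for field in KEEP_FIELDS}
```

The resulting records can be collected with `pd.DataFrame(list(iter_math_records(path)))`, so only the math subset and the selected columns are ever materialized.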


In [3]:

from datetime import datetime

# Currently a no-op: the `year` column has not yet been extracted from `update_date`,
# so filter_recent_years returns the dataframe unchanged.
def filter_recent_years(df, years):
    """Filter dataframe to only include papers from the last N years."""
    if df.empty or "year" not in df.columns:
        return df
        
    current_year = datetime.now().year
    start_year = current_year - years
    
    filtered_df = df[df["year"] >= start_year].copy()
    print(f"Filtered to {len(filtered_df)} papers from {start_year}-{current_year}")
    
    return filtered_df

filtered_df = filter_recent_years(math_df,5)
print(f"Found {len(filtered_df)} mathematics papers from the past 5 years")

# Save the filtered dataset
filepath = Path('data/cleaned/math_arxiv_snapshot.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
filtered_df.to_csv(filepath)
print(f"Saved cleaned math dataset to {filepath}")

Found 696603 mathematics papers from the past 5 years
Saved cleaned math dataset to data/cleaned/math_arxiv_snapshot.csv
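
Because `math_df` has no `year` column, the filter above returns all 696603 papers. A minimal sketch of how the column could be derived is shown below, assuming `update_date` holds ISO-style date strings. The helper `add_year_column` is hypothetical. Note also that `update_date` records a paper's most recent update rather than its first submission, so filtering on it selects recently updated papers.

```python
from datetime import datetime

import pandas as pd


def add_year_column(df: pd.DataFrame, date_col: str = "update_date") -> pd.DataFrame:
    """Return a copy of df with a `year` column parsed from `date_col`."""
    out = df.copy()
    # Unparseable or missing dates become NaT, which yields NaN for the year.
    out["year"] = pd.to_datetime(out[date_col], errors="coerce").dt.year
    return out


# Tiny illustrative frame; the IDs and dates are made up.
sample = pd.DataFrame({
    "id": ["a", "b", "c"],
    "update_date": ["2008-11-26", "2024-03-01", None],
})
with_year = add_year_column(sample)

# Same cutoff rule as filter_recent_years with years=5.
start_year = datetime.now().year - 5
recent = with_year[with_year["year"] >= start_year]
print(recent[["id", "year"]])
```

Calling `filter_recent_years(add_year_column(math_df), 5)` would then apply the intended cutoff; rows with a missing date are dropped because a NaN year never satisfies the comparison.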


## 2. Topic Modeling with BERTopic

The next step uses BERTopic to identify coherent research topics within the mathematics papers. BERTopic combines transformer-based embeddings with clustering algorithms to discover topics and their representative keywords.

We run this analysis using the `topic_trends_analyzer.py` script, which:
1. Processes the abstracts and titles of papers
2. Uses Sentence-BERT to create embeddings
3. Applies dimensionality reduction (UMAP)
4. Performs clustering (HDBSCAN)
5. Extracts representative keywords for each topic

The script saves the topic summary used in the next step as `topic_info_{timestamp}.csv` in the `results/topics/` folder.
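
The five steps listed above map onto BERTopic's pluggable components. The sketch below shows one way they could be wired together; it is not the contents of `topic_trends_analyzer.py`. The function names, the embedding model `all-MiniLM-L6-v2`, and every hyperparameter value are assumptions chosen for illustration.

```python
import pandas as pd


def build_documents(df: pd.DataFrame) -> list[str]:
    """Step 1: join title and abstract into one whitespace-normalized string per paper."""
    text = df["title"].fillna("") + ". " + df["abstract"].fillna("")
    return text.str.replace(r"\s+", " ", regex=True).str.strip().tolist()


def fit_topic_model(docs: list[str], min_cluster_size: int = 50):
    """Steps 2-5: embed, reduce, cluster, and extract keywords (illustrative settings)."""
    # Heavy dependencies are imported lazily so build_documents stays lightweight.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    topic_model = BERTopic(embedding_model=embedding_model,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

After fitting, `topic_model.get_topic_info()` returns a dataframe with one row per topic, including the outlier topic `-1`, which is the kind of table a `topic_info_{timestamp}.csv` file would hold.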

In [None]:
# After running topic_trends_analyzer.py, we can load the results:
topic_df = pd.read_csv('results/topics/topic_info_20250509_193929.csv')
print(f"Discovered {len(topic_df[topic_df['Topic'] != -1])} topics (excluding outliers)")

# Display the first few topics
topic_df.rename(columns=str.lower, inplace=True)
topic_df.head()


The output of topic modeling includes:
- Topic IDs (numerical labels)
- Topic counts (number of papers in each topic)
- Topic names (generated from keywords)
- Representative keywords for each topic
- Representative documents for each topic

## 3. Merging Document-Topic Assignments

Next, we match each paper with the topic assigned to it during topic modeling. The assignments are stored in `document_topics_{timestamp}.csv`:

In [None]:
# Load document-topic assignments
docs_df = pd.read_csv('results/topics/document_topics_20250509_221839.csv', low_memory=False)

# Merge topic assignments with original paper metadata
doc_topic_df = pd.merge(
    math_df,
    docs_df[['id', 'topic']],
    on='id',
    how='inner'
)

# Select and reorder columns
columns = ['id', 'title', 'categories', 'abstract',
           'update_date', 'authors_parsed', 'topic']
doc_topic_df = doc_topic_df[columns]
print(f"Matched {len(doc_topic_df)} papers with their topic assignments")
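
An inner merge silently drops any paper whose `id` does not match exactly, so it is worth confirming that both sides store IDs as strings. When a CSV column contains only new-style IDs such as `0704.0001`, pandas can parse it as floats, which loses the leading zero. The sketch below uses made-up in-memory data to show one way to guard against this and to list unmatched papers; it is an illustration, not a change to the cell above.

```python
import io

import pandas as pd

# Made-up stand-in for document_topics_{timestamp}.csv.
csv_text = "id,topic\n0704.0001,3\n0704.0002,-1\nmath/0406594,7\n"

# Force string IDs so values like 0704.0001 keep their leading zero.
docs_df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# For contrast: a purely numeric-looking column is inferred as float.
lossy = pd.read_csv(io.StringIO("id,topic\n0704.0001,3\n"))

# Made-up stand-in for the paper metadata.
meta_df = pd.DataFrame({
    "id": ["0704.0001", "0704.0002", "math/0406594", "0704.0003"],
    "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
})

# A left merge with an indicator column shows which papers found no topic.
merged = meta_df.merge(docs_df, on="id", how="left",
                       indicator=True, validate="one_to_one")
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} paper(s) without a topic assignment")
```

Here `validate="one_to_one"` makes the merge raise an error if either side contains duplicate IDs, which would otherwise multiply rows.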

## 4. Enhancing Topic Labels with Claude

The raw topic labels from BERTopic are informative but can be improved. We use Claude (Anthropic's LLM) to generate more descriptive and human-readable topic labels.

For each topic, Claude is prompted with:
- The list of top keywords for that topic
- The mathematical subject areas
- A request to generate both a concise and a detailed descriptive label

Specifically, we use the following prompt:

> You are a mathematician and data scientist specializing in interpreting topic modeling results.
>    
>I have a set of topics generated from a BERTopic model analyzing mathematical research papers from arXiv.
>The papers are primarily from these mathematical subject areas: {', '.join(subjects)}.
>    
>For Topic {topic_id}, the top terms (with their weights) are:
>{keywords_text}
>    
>Based on these keywords, please identify what mathematical research topic this represents.
>Provide two labels:
>1. A concise label (3-5 words max) that captures the essence of this topic
>2. A more descriptive name that specifies the mathematical subfield (e.g., "Algebraic Topology: Persistent Homology")
>    
>Format your response like this:
>
>SHORT_LABEL: [Your concise label]
>
>DESCRIPTIVE_LABEL: [Your more descriptive label]
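
The prompt fixes a two-line response format, which makes the reply easy to parse. The sketch below shows one way the `{keywords_text}` placeholder could be filled and the response split into its two labels. It does not reproduce `topic_labeling.py`: the function names, the `term (weight)` line format, and the sample response are all assumptions for illustration.

```python
import re


def format_keywords(keywords: list[tuple[str, float]]) -> str:
    """Render (term, weight) pairs as one line each for the {keywords_text} slot."""
    return "\n".join(f"- {term} ({weight:.3f})" for term, weight in keywords)


# Matches lines such as "SHORT_LABEL: Persistent Homology".
LABEL_PATTERN = re.compile(
    r"^(SHORT_LABEL|DESCRIPTIVE_LABEL):[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def parse_labels(response_text: str) -> dict[str, str]:
    """Extract both labels from a response that follows the requested format."""
    found = {key.lower(): value for key, value in LABEL_PATTERN.findall(response_text)}
    missing = {"short_label", "descriptive_label"} - found.keys()
    if missing:
        raise ValueError(f"Response is missing: {sorted(missing)}")
    return found


# Made-up response text in the requested format.
example_response = (
    "SHORT_LABEL: Persistent Homology\n\n"
    "DESCRIPTIVE_LABEL: Algebraic Topology: Persistent Homology\n"
)
print(parse_labels(example_response))
```

Raising an error on a missing label, rather than returning a partial result, keeps malformed responses from turning into empty cells in the labeled CSV.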

In [None]:
# Load the enhanced topic labels generated by topic_labeling.py
topic_labels = pd.read_csv('results/topics/topic_info_20250509_221839_claude_labeled.csv')
topic_labels.rename(columns=str.lower, inplace=True)
topic_labels.rename(columns={"shortlabel": "short_label", "descriptivelabel": "descriptive_label"}, inplace=True)

# Display sample of enhanced labels
rel_cols = ['topic', 'short_label', 'descriptive_label']
topic_labels[rel_cols].head()

## 5. Creating the Final Dataset

With all components ready, we create the final dataset containing papers with their topic assignments and enhanced labels:

In [None]:
# Add descriptive labels to our document dataset
papers_by_topic_df = pd.merge(
    doc_topic_df,
    topic_labels[rel_cols],
    on="topic",
    how="left"
)

# Remove outlier topics (labeled as -1)
papers_by_topic_no_outliers = papers_by_topic_df[papers_by_topic_df['topic'] != -1]
print(f"Final dataset contains {len(papers_by_topic_no_outliers)} papers across {papers_by_topic_no_outliers['topic'].nunique()} topics")

# Save the final dataset
papers_by_topic_no_outliers.to_parquet("results/topics/papers_by_topic_no_outliers.parquet", index=False)
print("Saved final dataset with topic labels")
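
Since the labels are attached with a left merge, a topic that is missing from the label file would end up with empty label columns rather than being dropped. A quick check for that case could look like the sketch below, which runs on made-up data and only illustrates the idea.

```python
import pandas as pd

# Made-up stand-ins for doc_topic_df and topic_labels[rel_cols].
papers = pd.DataFrame({"id": ["p1", "p2", "p3", "p4"], "topic": [0, 0, 1, -1]})
labels = pd.DataFrame({
    "topic": [-1, 0],
    "short_label": ["Outliers", "Graph Colorings"],
    "descriptive_label": ["Outlier Documents", "Combinatorics: Graph Colorings"],
})

# validate="many_to_one" raises if a topic appears twice in the label table.
merged = papers.merge(labels, on="topic", how="left", validate="many_to_one")
kept = merged[merged["topic"] != -1]

# Topics that have papers but no label row.
unlabeled_topics = sorted(
    kept.loc[kept["descriptive_label"].isna(), "topic"].unique().tolist()
)
print(f"Topics without labels: {unlabeled_topics}")
```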

## 6. Preparing Data for Visualization

The next step prepares a summary dataset for the dashboard visualization:

In [None]:
# Create a summary dataset of topics for visualization
plot_df = topic_labels[['topic', 'count', 'descriptive_label']]
plot_df = plot_df[plot_df['topic'] != -1]  # Remove outlier topic

# Save the summary dataset
filepath = Path('results/topics/common_topics.csv')
plot_df.to_csv(filepath, index=False)
print(f"Saved topic summary to {filepath}")

## 7. Category Distribution Analysis

The final processing step calculates the distribution of arXiv categories within each topic and determines the primary category for each topic. This is handled by the `category_distribution.py` script, which:

1. Reads the papers by topic (from `papers_by_topic_no_outliers.parquet`) and calculates the frequency of each arXiv category within each topic
2. Creates a nested dictionary mapping topics to their category distributions
3. Determines the primary category for each topic (most frequent category)
4. Updates the `common_topics.csv` file with the `primary_category` column
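
The four steps above can be sketched as follows. This is an illustration on made-up data rather than the contents of `category_distribution.py`; the function names are hypothetical. It assumes that `categories` is a space-separated string and that every listed category counts equally, whereas the actual script may instead count only the first category of each paper.

```python
import pandas as pd


def category_distribution(papers: pd.DataFrame) -> dict[int, dict[str, int]]:
    """Map each topic to a {category: paper count} dictionary."""
    # One row per (paper, category) pair.
    exploded = papers.assign(
        category=papers["categories"].str.split()
    ).explode("category")
    distribution = {}
    for topic, group in exploded.groupby("topic"):
        counts = group["category"].value_counts()
        distribution[int(topic)] = {cat: int(n) for cat, n in counts.items()}
    return distribution


def primary_categories(distribution: dict[int, dict[str, int]]) -> dict[int, str]:
    """Pick the most frequent category for each topic."""
    return {topic: max(counts, key=counts.get)
            for topic, counts in distribution.items()}


# Made-up papers and topic summary.
papers = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1],
    "categories": ["math.CO cs.DM", "math.CO", "math.CO math.PR",
                   "math.AG", "math.AG math.NT"],
})
distribution = category_distribution(papers)
primary = primary_categories(distribution)

topics = pd.DataFrame({"topic": [0, 1], "count": [3, 2]})
topics["primary_category"] = topics["topic"].map(primary)
print(topics)
```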

After running this script, the `common_topics.csv` file includes:
- `topic`: numerical topic ID
- `count`: number of papers in the topic
- `descriptive_label`: human-readable topic description
- `primary_category`: the most common arXiv category in that topic

This enhanced dataset is then used by the Shiny dashboard for visualization and exploration.