# SciSciNet UMD Data Preprocessing

This notebook documents the data preprocessing steps for the UMD Computer Science visualization project.

## Data Source
- **Dataset**: SciSciNet v2 (Northwestern University)
- **Source**: https://huggingface.co/datasets/Northwestern-CSSI/sciscinet-v2

## Filtering Strategy
We filter the full SciSciNet dataset to include only:
1. Papers affiliated with University of Maryland (UMD)
2. Papers in the Computer Science field
3. Related citations, authors, and affiliations

In [None]:
import pandas as pd
import pyarrow.parquet as pq
from pathlib import Path

DATA_DIR = Path('../sciscinet_data')

## Step 1: Identify UMD and CS Field IDs

From the SciSciNet metadata:
- **UMD Institution ID**: I66946132
- **Computer Science Field ID**: C41008148

In [None]:
UMD_ID = 'I66946132'
CS_FIELD_ID = 'C41008148'

print(f"UMD Institution ID: {UMD_ID}")
print(f"CS Field ID: {CS_FIELD_ID}")

## Step 2: Filter Papers

We filter papers that are:
1. Affiliated with UMD (via paper_author_affiliation table)
2. In the Computer Science field (via paper_fields table)
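
The cell below only loads the already-filtered file, so here is a minimal sketch of how the two conditions above could be combined. It uses toy in-memory tables instead of the full dataset, and the column names `institutionid` and `fieldid` are assumptions for illustration, not confirmed SciSciNet schema names.

```python
import pandas as pd

UMD_ID = 'I66946132'
CS_FIELD_ID = 'C41008148'

# Toy stand-ins for the full tables (column names are assumed).
paa = pd.DataFrame({
    'paperid': ['W1', 'W1', 'W2', 'W3'],
    'authorid': ['A1', 'A2', 'A3', 'A1'],
    'institutionid': [UMD_ID, 'I999', 'I999', UMD_ID],
})
paper_fields = pd.DataFrame({
    'paperid': ['W1', 'W2', 'W3'],
    'fieldid': [CS_FIELD_ID, CS_FIELD_ID, 'C000'],
})

# Papers with at least one UMD-affiliated author.
umd_paper_ids = set(paa.loc[paa['institutionid'] == UMD_ID, 'paperid'])

# Papers tagged with the Computer Science field.
cs_paper_ids = set(paper_fields.loc[paper_fields['fieldid'] == CS_FIELD_ID, 'paperid'])

# A paper is kept only if it satisfies both conditions.
umd_cs_paper_ids = umd_paper_ids & cs_paper_ids
print(sorted(umd_cs_paper_ids))
```

In this toy data only `W1` is both UMD-affiliated and tagged as Computer Science.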

In [None]:
# Load filtered UMD CS papers
df_papers = pd.read_parquet(DATA_DIR / 'umd_cs_papers.parquet')
print(f"Total UMD CS papers: {len(df_papers):,}")
print(f"\nColumns: {df_papers.columns.tolist()}")
print(f"\nYear range: {df_papers['year'].min()} - {df_papers['year'].max()}")

In [None]:
# Papers per year
papers_by_year = df_papers.groupby('year').size()
print("Papers by year (last 10 years):")
print(papers_by_year[papers_by_year.index >= 2014])

## Step 3: Filter Citation Network

We keep only **internal citations**: references in which both the citing and the cited paper are UMD CS papers.
This shrinks the citation table from roughly 2.3B rows to 126,892 while preserving the citation links within the UMD CS corpus.
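
As a rough sketch of this filter, an edge is kept only when both endpoints are in the set of UMD CS paper IDs. The column names `citing_paperid` and `cited_paperid` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical set of UMD CS paper IDs produced by the paper filter.
umd_cs_paper_ids = {'W1', 'W2', 'W3'}

# Toy citation table (column names are assumed).
refs = pd.DataFrame({
    'citing_paperid': ['W1', 'W1', 'W2', 'W9'],
    'cited_paperid': ['W2', 'W8', 'W3', 'W1'],
})

# Keep an edge only if BOTH endpoints are UMD CS papers.
is_internal = (
    refs['citing_paperid'].isin(umd_cs_paper_ids)
    & refs['cited_paperid'].isin(umd_cs_paper_ids)
)
internal_refs = refs[is_internal].reset_index(drop=True)
print(internal_refs)
```

Edges that enter or leave the corpus (here `W1 -> W8` and `W9 -> W1`) are dropped, which is why the internal network is so much smaller than the full citation table.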

In [None]:
# Load internal citations
df_refs = pd.read_parquet(DATA_DIR / 'umd_cs_paperrefs.parquet')
print(f"Internal citations: {len(df_refs):,}")
print(f"\nColumns: {df_refs.columns.tolist()}")

## Step 4: Filter Authors

We include all authors who have co-authored at least one UMD CS paper.
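
One way this selection could look, assuming the paper-author table has `paperid` and `authorid` columns (`paperid` appears later in this notebook; `authorid` and the toy rows are assumptions):

```python
import pandas as pd

# Hypothetical set of UMD CS paper IDs produced by the paper filter.
umd_cs_paper_ids = {'W1', 'W2'}

# Toy paper-author links and author table (column names are assumed).
paa = pd.DataFrame({
    'paperid': ['W1', 'W1', 'W2', 'W5'],
    'authorid': ['A1', 'A2', 'A1', 'A7'],
})
authors = pd.DataFrame({'authorid': ['A1', 'A2', 'A7']})

# Authors appearing on at least one UMD CS paper.
umd_author_ids = set(paa.loc[paa['paperid'].isin(umd_cs_paper_ids), 'authorid'])

# Restrict the author table to those IDs.
umd_authors = authors[authors['authorid'].isin(umd_author_ids)]
print(umd_authors)
```

Under this reading, every author listed on a retained paper is kept, whatever their own affiliation.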

In [None]:
# Load author data
df_authors = pd.read_parquet(DATA_DIR / 'umd_authors.parquet')
print(f"UMD authors: {len(df_authors):,}")
print(f"\nColumns: {df_authors.columns.tolist()}")

In [None]:
# Load paper-author relationships
df_paa = pd.read_parquet(DATA_DIR / 'umd_paper_author_affiliation.parquet')
print(f"Paper-author relationships: {len(df_paa):,}")
print(f"\nAverage authors per paper: {df_paa.groupby('paperid').size().mean():.2f}")

## Data Summary

| Dataset | Original Size | Filtered Size | Reduction |
|---------|---------------|---------------|----------|
| Papers | ~270M | 87,738 | 99.97% |
| Citations | ~2.3B | 126,892 | 99.99% |
| Authors | ~270M | 78,079 | 99.97% |

This filtering allows us to focus on UMD CS research while keeping the dataset manageable.
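
The Reduction column can be reproduced as `100 * (1 - filtered / original)`. The short check below uses the approximate original sizes from the table, so the results are only as precise as those estimates; `reduction_pct` is a helper introduced here for illustration.

```python
def reduction_pct(original: float, filtered: float) -> float:
    """Percentage of rows removed by filtering."""
    return 100.0 * (1 - filtered / original)

# (approximate original size, filtered size), taken from the table above
sizes = {
    'Papers': (270e6, 87_738),
    'Citations': (2.3e9, 126_892),
    'Authors': (270e6, 78_079),
}

for name, (original, filtered) in sizes.items():
    print(f"{name}: {reduction_pct(original, filtered):.2f}%")
```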

## Scalability Solutions

### For Visualization:
1. **Node Filtering**: Show only the top N most-cited papers or most prolific authors
2. **Time Filtering**: Restrict networks to the most recent 5 years
3. **Collision Detection**: Prevent node overlap in force-directed layout
4. **Level of Detail**: Show labels only for important nodes
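
A small sketch of how node filtering, time filtering, and level of detail might be prepared on the data side before the nodes reach the layout. The thresholds `MIN_YEAR`, `TOP_N`, and `LABEL_N`, the `show_label` column, and the toy rows are all hypothetical.

```python
import pandas as pd

# Toy paper table; 'year' and 'cited_by_count' follow the columns used in this notebook.
papers = pd.DataFrame({
    'paperid': ['W1', 'W2', 'W3', 'W4', 'W5'],
    'year': [2015, 2020, 2021, 2022, 2023],
    'cited_by_count': [500, 40, 10, 90, 5],
})

MIN_YEAR = 2019  # time filtering (hypothetical cutoff)
TOP_N = 3        # node filtering: number of nodes to draw
LABEL_N = 1      # level of detail: number of nodes that get a label

# Time filtering, then keep the TOP_N most-cited papers as nodes.
recent = papers[papers['year'] >= MIN_YEAR]
nodes = recent.nlargest(TOP_N, 'cited_by_count').copy()

# Flag only the most-cited nodes for labeling.
label_ids = set(nodes.nlargest(LABEL_N, 'cited_by_count')['paperid'])
nodes['show_label'] = nodes['paperid'].isin(label_ids)
print(nodes)
```

Collision detection is handled by the layout itself, so it is not covered in this sketch.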

### For Data Processing:
1. **Batch Processing**: Read Parquet files in chunks instead of loading them whole
2. **Set-based Filtering**: Use Python sets for average O(1) membership lookup
3. **Parquet Format**: Column-oriented storage for efficient queries
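
Batch processing and set-based filtering can be combined as in the sketch below. To stay self-contained it takes any iterable of DataFrames; `filter_in_batches` and `toy_batches` are hypothetical helpers, not part of the project code.

```python
import pandas as pd

def filter_in_batches(batches, keep_ids, column='paperid'):
    """Filter each batch against a set of IDs and concatenate the survivors."""
    kept = [batch[batch[column].isin(keep_ids)] for batch in batches]
    return pd.concat(kept, ignore_index=True)

def toy_batches():
    """Stand-in for batches read from a large Parquet file."""
    yield pd.DataFrame({'paperid': ['W1', 'W2'], 'year': [2019, 2020]})
    yield pd.DataFrame({'paperid': ['W3', 'W4'], 'year': [2021, 2022]})

result = filter_in_batches(toy_batches(), keep_ids={'W2', 'W3'})
print(result)
```

With a real file, the batches could come from `pq.ParquetFile(path).iter_batches(batch_size=..., columns=[...])`, converting each record batch with `.to_pandas()`. Only one batch is held in memory at a time, and passing `columns` takes advantage of the column-oriented layout by reading just the fields needed.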

In [None]:
# Example: Filter for past 5 years network
recent_papers = df_papers[df_papers['year'] >= 2019]
print(f"Papers in past 5 years: {len(recent_papers):,}")

# Top-cited papers among the recent ones (at most 200)
top_cited = recent_papers.nlargest(200, 'cited_by_count')
print(f"Top-cited papers selected for visualization: {len(top_cited):,}")