
DBLP Research Network Analysis

A comprehensive data pipeline and graph analysis system for processing and analyzing the DBLP Computer Science Bibliography dataset using Neo4j. This project builds a knowledge graph of research publications, authors, venues, and their relationships, enabling advanced queries and graph algorithms for research network analysis.

Overview

This project processes XML data from the DBLP database, transforms it into a Neo4j graph database, and performs sophisticated analyses including:

  • Research publication network modeling
  • Author collaboration analysis
  • Citation network analysis
  • Community detection using the Louvain algorithm
  • PageRank-based influence scoring
  • AI research community recommendations
  • Journal impact factor calculations
  • Author h-index computations

Key Features

  • Automated Data Pipeline: End-to-end processing from raw DBLP XML to enriched Neo4j graph
  • Synthetic Data Generation: Enriches the graph with topics, citations, and peer reviews
  • Schema Evolution: Dynamic schema updates with data migration
  • Advanced Querying: Pre-built queries for common research network analyses
  • Graph Algorithms: Community detection (Louvain) and importance ranking (PageRank)
  • AI Research Recommender: Identifies top papers, influential authors, and research communities

Prerequisites

  • Python 3.10 or higher
  • Docker and Docker Compose
  • At least 8 GB of RAM (for processing the DBLP dataset)
  • ~20 GB of free disk space (for the dataset and Neo4j data)

Dependencies

The project uses the following key Python libraries:

  • Neo4j & Graph: neo4j, neomodel, networkx
  • Data Processing: pandas, numpy, lxml
  • Machine Learning: scikit-learn, nltk
  • Utilities: typer, tqdm, python-dotenv, tabulate

Full dependencies are listed in requirements.txt.

Installation & Setup

1. Clone the Repository

git clone https://github.com/akossch0/sdm-1.git
cd sdm-1

2. Configure Environment Variables

Copy the template and configure your Neo4j credentials:

cp .env-template .env
# Edit .env with your preferred credentials

Default configuration:

NEO4J_HOST=localhost
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
NEO4J_DATABASE=neo4j

3. Set Up Neo4j with Docker

Pull and start the Neo4j Enterprise container:

docker pull neo4j:enterprise
docker compose up -d

The Neo4j browser will be available at http://localhost:7474.

4. Create Python Virtual Environment

make venv

This creates a virtual environment in .venv/ and installs all required dependencies.
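With Neo4j running and the environment installed, a quick sanity check can confirm the credentials work. The sketch below reads the same variables as the .env template and assumes the default Bolt port 7687; the helper names are illustrative, not part of the repository.

```python
import os

def neo4j_settings():
    """Read Neo4j connection settings from the environment, falling back to the
    defaults from the .env template (assumes the standard Bolt port 7687)."""
    return {
        "uri": f'bolt://{os.getenv("NEO4J_HOST", "localhost")}:7687',
        "auth": (os.getenv("NEO4J_USERNAME", "neo4j"),
                 os.getenv("NEO4J_PASSWORD", "password")),
        "database": os.getenv("NEO4J_DATABASE", "neo4j"),
    }

def check_connection(cfg):
    """Open a driver and verify the server is reachable (needs `pip install neo4j`)."""
    from neo4j import GraphDatabase  # deferred so settings can be inspected without the driver
    with GraphDatabase.driver(cfg["uri"], auth=cfg["auth"]) as driver:
        driver.verify_connectivity()

print(neo4j_settings()["uri"])
```

If `check_connection(neo4j_settings())` raises, double-check that the Docker container is up and that the .env values match docker-compose.yml.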

Usage

The project includes a Makefile to orchestrate the entire pipeline. You can run individual steps or the complete workflow.

Complete Pipeline (Recommended)

To run the entire pipeline from data download to analysis:

make all

This executes all stages: download → subset → load → evolve → query → recommender → algorithms.

Step-by-Step Execution

For more control, run individual stages:

# 1. Download DBLP dataset (~3.5GB compressed)
make download

# 2. Create a subset of the data (optional)
make subset

# 3. Process and load data into Neo4j
make load              # Runs: clean → xml_to_csv → synthetic → bulk_load

# 4. Evolve the schema (add Review nodes)
make evolve

# 5. Run analytical queries
make query

# 6. Build AI research community recommender
make recommender

# 7. Execute graph algorithms (Louvain, PageRank)
make algorithms

Individual Pipeline Stages

make clean            # Clean processed data directory
make xml_to_csv       # Convert DBLP XML to CSV format
make synthetic        # Generate synthetic topics, citations, and reviews
make bulk_load        # Import CSV data into Neo4j

Getting Help

make help             # Display all available commands

Project Structure

sdm-1/
├── src/
│   ├── part_A2_*/        # Data acquisition and loading
│   │   ├── 00_download_dblp.py
│   │   ├── 01_subset_dblp.py
│   │   ├── 02_xml_to_csv.py
│   │   ├── 03_synthetic.py
│   │   └── 04_bulk_load.py
│   ├── part_A3_*/        # Schema evolution
│   │   └── 05_schema_evolving.py
│   ├── part_B_*/         # Querying system
│   │   └── querying.py
│   ├── part_C_*/         # AI community recommender
│   │   └── recommender.py
│   └── part_D_*/         # Graph algorithms
│       └── app.py
├── data/
│   ├── raw/              # Downloaded DBLP data
│   ├── interim/          # Intermediate processed files
│   └── processed/        # Final CSV files for Neo4j
├── notebooks/            # Jupyter notebooks for exploration
├── Makefile              # Pipeline orchestration
├── docker-compose.yml    # Neo4j container configuration
└── requirements.txt      # Python dependencies

Example Queries

The system includes several pre-built analytical queries:

  1. Top Cited Papers: Find the top 3 most cited papers per conference/workshop
  2. Conference Communities: Identify research communities for each conference
  3. Journal Impact Factor: Calculate impact factors for journals
  4. Author H-Index: Compute h-index for authors in the graph
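The last two are standard bibliometrics. As a reference point, their textbook definitions (not necessarily the exact logic in querying.py) can be sketched in Python:

```python
def h_index(citation_counts):
    """Largest h such that the author has at least h papers with >= h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def impact_factor(citations_to_last_two_years, papers_last_two_years):
    """Citations in year Y to items published in Y-1 and Y-2, divided by the item count."""
    return citations_to_last_two_years / papers_last_two_years

print(h_index([10, 8, 5, 4, 3]))   # 4: four papers have at least 4 citations each
print(impact_factor(200, 80))      # 2.5
```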

Run queries interactively:

make query
# Or for a specific query:
.venv/bin/python -m src.part_B_IgnasiCervero_AkosSchneider.querying --query 1

Graph Algorithms

Louvain Community Detection

Analyzes author collaboration networks to discover research communities based on co-authorship patterns.
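To illustrate the idea on a toy co-authorship graph (hypothetical author names, using networkx rather than the Neo4j Graph Data Science procedures the pipeline runs):

```python
from networkx.algorithms.community import louvain_communities
import networkx as nx

# Toy co-authorship graph: two tightly knit author clusters joined by one weak bridge.
G = nx.Graph()
G.add_edges_from([
    ("ana", "bob"), ("bob", "cara"), ("ana", "cara"),  # cluster 1
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),    # cluster 2
    ("cara", "dan"),                                   # bridge between clusters
])
communities = louvain_communities(G, seed=42)
print(sorted(sorted(c) for c in communities))  # one community per triangle
```

On the real graph, the same modularity-maximization step groups authors who co-publish far more within their community than across the bridge edges.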

PageRank

Identifies influential papers and authors based on citation network structure.
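A minimal sketch of the same idea with networkx (hypothetical paper ids; the pipeline itself runs PageRank inside Neo4j):

```python
import networkx as nx

# Toy citation network; edges point from the citing paper to the cited paper.
citations = nx.DiGraph([
    ("p2", "p1"), ("p3", "p1"), ("p4", "p1"),  # p1 is cited by three papers
    ("p4", "p3"), ("p2", "p3"),
])
scores = nx.pagerank(citations, alpha=0.85)
most_influential = max(scores, key=scores.get)
print(most_influential)  # p1: the most-cited paper accumulates the highest score
```

Because rank flows along citation edges, a paper cited by other well-cited papers scores higher than raw citation counts alone would suggest.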

Run algorithms:

make algorithms

Authors

  • Ignasi Cervero
  • Ákos Schneider

License

This project is for academic and research purposes.
