
DBLP Research Network Analysis

A comprehensive data pipeline and graph analysis system for processing and analyzing the DBLP Computer Science Bibliography dataset using Neo4j. This project builds a knowledge graph of research publications, authors, venues, and their relationships, enabling advanced queries and graph algorithms for research network analysis.

Overview

This project processes XML data from the DBLP database, transforms it into a Neo4j graph database, and performs sophisticated analyses including:

  • Research publication network modeling
  • Author collaboration analysis
  • Citation network analysis
  • Community detection using the Louvain algorithm
  • PageRank-based influence scoring
  • AI research community recommendations
  • Journal impact factor calculations
  • Author h-index computations

Key Features

  • Automated Data Pipeline: End-to-end processing from raw DBLP XML to enriched Neo4j graph
  • Synthetic Data Generation: Enriches the graph with topics, citations, and peer reviews
  • Schema Evolution: Dynamic schema updates with data migration
  • Advanced Querying: Pre-built queries for common research network analyses
  • Graph Algorithms: Community detection (Louvain) and importance ranking (PageRank)
  • AI Research Recommender: Identifies top papers, influential authors, and research communities

Prerequisites

  • Python 3.10 or higher
  • Docker and Docker Compose
  • At least 8 GB of RAM (for processing the DBLP dataset)
  • ~20 GB of free disk space (for the dataset and Neo4j data)

Dependencies

The project uses the following key Python libraries:

  • Neo4j & Graph: neo4j, neomodel, networkx
  • Data Processing: pandas, numpy, lxml
  • Machine Learning: scikit-learn, nltk
  • Utilities: typer, tqdm, python-dotenv, tabulate

Full dependencies are listed in requirements.txt.

Installation & Setup

1. Clone the Repository

git clone https://github.com/akossch0/sdm-1.git
cd sdm-1

2. Configure Environment Variables

Copy the template and configure your Neo4j credentials:

cp .env-template .env
# Edit .env with your preferred credentials

Default configuration:

NEO4J_HOST=localhost
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
NEO4J_DATABASE=neo4j

3. Set Up Neo4j with Docker

Pull and start the Neo4j Enterprise container:

docker pull neo4j:enterprise
docker compose up -d

The Neo4j browser will be available at http://localhost:7474.

4. Create Python Virtual Environment

make venv

This creates a virtual environment in .venv/ and installs all required dependencies.
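With Neo4j running and the environment installed, a quick sanity check can confirm the credentials work. The sketch below reads the same variables as the .env template and assumes the default Bolt port 7687; the helper names are illustrative, not part of the repository.

```python
import os

def neo4j_settings():
    """Read Neo4j connection settings from the environment, falling back to the
    defaults from the .env template (assumes the standard Bolt port 7687)."""
    return {
        "uri": f'bolt://{os.getenv("NEO4J_HOST", "localhost")}:7687',
        "auth": (os.getenv("NEO4J_USERNAME", "neo4j"),
                 os.getenv("NEO4J_PASSWORD", "password")),
        "database": os.getenv("NEO4J_DATABASE", "neo4j"),
    }

def check_connection(cfg):
    """Open a driver and verify the server is reachable (needs `pip install neo4j`)."""
    from neo4j import GraphDatabase  # deferred so settings can be inspected without the driver
    with GraphDatabase.driver(cfg["uri"], auth=cfg["auth"]) as driver:
        driver.verify_connectivity()

print(neo4j_settings()["uri"])
```

If `check_connection(neo4j_settings())` raises, double-check that the Docker container is up and that the .env values match docker-compose.yml.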

Usage

The project includes a Makefile to orchestrate the entire pipeline. You can run individual steps or the complete workflow.

Complete Pipeline (Recommended)

To run the entire pipeline from data download to analysis:

make all

This executes all stages: download → subset → load → evolve → query → recommender → algorithms.

Step-by-Step Execution

For more control, run individual stages:

# 1. Download DBLP dataset (~3.5GB compressed)
make download

# 2. Create a subset of the data (optional)
make subset

# 3. Process and load data into Neo4j
make load              # Runs: clean → xml_to_csv → synthetic → bulk_load

# 4. Evolve the schema (add Review nodes)
make evolve

# 5. Run analytical queries
make query

# 6. Build AI research community recommender
make recommender

# 7. Execute graph algorithms (Louvain, PageRank)
make algorithms

Individual Pipeline Stages

make clean            # Clean processed data directory
make xml_to_csv       # Convert DBLP XML to CSV format
make synthetic        # Generate synthetic topics, citations, and reviews
make bulk_load        # Import CSV data into Neo4j

Getting Help

make help             # Display all available commands

Project Structure

sdm-1/
├── src/
│   ├── part_A2_*/        # Data acquisition and loading
│   │   ├── 00_download_dblp.py
│   │   ├── 01_subset_dblp.py
│   │   ├── 02_xml_to_csv.py
│   │   ├── 03_synthetic.py
│   │   └── 04_bulk_load.py
│   ├── part_A3_*/        # Schema evolution
│   │   └── 05_schema_evolving.py
│   ├── part_B_*/         # Querying system
│   │   └── querying.py
│   ├── part_C_*/         # AI community recommender
│   │   └── recommender.py
│   └── part_D_*/         # Graph algorithms
│       └── app.py
├── data/
│   ├── raw/              # Downloaded DBLP data
│   ├── interim/          # Intermediate processed files
│   └── processed/        # Final CSV files for Neo4j
├── notebooks/            # Jupyter notebooks for exploration
├── Makefile              # Pipeline orchestration
├── docker-compose.yml    # Neo4j container configuration
└── requirements.txt      # Python dependencies

Example Queries

The system includes several pre-built analytical queries:

  1. Top Cited Papers: Find the top 3 most cited papers per conference/workshop
  2. Conference Communities: Identify research communities for each conference
  3. Journal Impact Factor: Calculate impact factors for journals
  4. Author H-Index: Compute h-index for authors in the graph
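The last two are standard bibliometrics. As a reference point, their textbook definitions (not necessarily the exact logic in querying.py) can be sketched in Python:

```python
def h_index(citation_counts):
    """Largest h such that the author has at least h papers with >= h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def impact_factor(citations_to_last_two_years, papers_last_two_years):
    """Citations in year Y to items published in Y-1 and Y-2, divided by the item count."""
    return citations_to_last_two_years / papers_last_two_years

print(h_index([10, 8, 5, 4, 3]))   # 4: four papers have at least 4 citations each
print(impact_factor(200, 80))      # 2.5
```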

Run queries interactively:

make query
# Or for a specific query:
.venv/bin/python -m src.part_B_IgnasiCervero_AkosSchneider.querying --query 1

Graph Algorithms

Louvain Community Detection

Analyzes author collaboration networks to discover research communities based on co-authorship patterns.
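To illustrate the idea on a toy co-authorship graph (hypothetical author names, using networkx rather than the Neo4j Graph Data Science procedures the pipeline runs):

```python
from networkx.algorithms.community import louvain_communities
import networkx as nx

# Toy co-authorship graph: two tightly knit author clusters joined by one weak bridge.
G = nx.Graph()
G.add_edges_from([
    ("ana", "bob"), ("bob", "cara"), ("ana", "cara"),  # cluster 1
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),    # cluster 2
    ("cara", "dan"),                                   # bridge between clusters
])
communities = louvain_communities(G, seed=42)
print(sorted(sorted(c) for c in communities))  # one community per triangle
```

On the real graph, the same modularity-maximization step groups authors who co-publish far more within their community than across the bridge edges.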

PageRank

Identifies influential papers and authors based on citation network structure.
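A minimal sketch of the same idea with networkx (hypothetical paper ids; the pipeline itself runs PageRank inside Neo4j):

```python
import networkx as nx

# Toy citation network; edges point from the citing paper to the cited paper.
citations = nx.DiGraph([
    ("p2", "p1"), ("p3", "p1"), ("p4", "p1"),  # p1 is cited by three papers
    ("p4", "p3"), ("p2", "p3"),
])
scores = nx.pagerank(citations, alpha=0.85)
most_influential = max(scores, key=scores.get)
print(most_influential)  # p1: the most-cited paper accumulates the highest score
```

Because rank flows along citation edges, a paper cited by other well-cited papers scores higher than raw citation counts alone would suggest.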

Run algorithms:

make algorithms

Authors

  • Ignasi Cervero
  • Ákos Schneider

License

This project is for academic and research purposes.
