A comprehensive data pipeline and graph analysis system for processing and analyzing the DBLP Computer Science Bibliography dataset using Neo4j. This project builds a knowledge graph of research publications, authors, venues, and their relationships, enabling advanced queries and graph algorithms for research network analysis.
This project processes XML data from the DBLP database, transforms it into a Neo4j graph database, and performs sophisticated analyses including:
- Research publication network modeling
- Author collaboration analysis
- Citation network analysis
- Community detection using Louvain algorithm
- PageRank-based influence scoring
- AI research community recommendations
- Journal impact factor calculations
- Author h-index computations
- Automated Data Pipeline: End-to-end processing from raw DBLP XML to enriched Neo4j graph
- Synthetic Data Generation: Enriches the graph with topics, citations, and peer reviews
- Schema Evolution: Dynamic schema updates with data migration
- Advanced Querying: Pre-built queries for common research network analyses
- Graph Algorithms: Community detection (Louvain) and importance ranking (PageRank)
- AI Research Recommender: Identifies top papers, influential authors, and research communities
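As a rough mental model of the graph these features operate on, sketched as plain Python objects (the node types and relationship names are assumptions based on the feature list above, not the project's exact Neo4j schema):

```python
from dataclasses import dataclass, field

@dataclass
class Author:
    name: str

@dataclass
class Paper:
    title: str
    year: int
    authors: list = field(default_factory=list)  # stands in for an (:Author)-[:WROTE]->(:Paper) edge
    cites: list = field(default_factory=list)    # stands in for a (:Paper)-[:CITES]->(:Paper) edge

alice = Author("Alice")
p1 = Paper("Graph Basics", 2020, authors=[alice])
p2 = Paper("Advanced Graphs", 2021, authors=[alice], cites=[p1])

# A paper's citation count is then just its number of incoming CITES edges.
citation_count = sum(p1 in p.cites for p in [p1, p2])
```

In the real graph the same relationships also connect papers to venues and topics, which is what makes the collaboration, citation, and community analyses above possible.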
- Python 3.10 or higher
- Docker and Docker Compose
- At least 8GB RAM (for processing DBLP dataset)
- ~20GB free disk space (for dataset and Neo4j data)
The project uses the following key Python libraries:
- Neo4j & Graph: `neo4j`, `neomodel`, `networkx`
- Data Processing: `pandas`, `numpy`, `lxml`
- Machine Learning: `scikit-learn`, `nltk`
- Utilities: `typer`, `tqdm`, `python-dotenv`, `tabulate`
Full dependencies are listed in requirements.txt.
```bash
git clone https://github.com/akossch0/sdm-1.git
cd sdm-1
```

Copy the template and configure your Neo4j credentials:

```bash
cp .env-template .env
# Edit .env with your preferred credentials
```

Default configuration:

```
NEO4J_HOST=localhost
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
NEO4J_DATABASE=neo4j
```
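`python-dotenv` is listed among the dependencies, presumably to pull these values into the pipeline scripts. A minimal stdlib sketch of what that loading amounts to (illustrative only, not the project's code):

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env-style file, skipping blanks and comments.

    Illustrative sketch -- the project depends on python-dotenv for this.
    """
    settings: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings
```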
Pull and start the Neo4j Enterprise container:

```bash
docker pull neo4j:enterprise
docker compose up -d
```

The Neo4j browser will be available at http://localhost:7474.
Set up the Python environment:

```bash
make venv
```

This creates a virtual environment in `.venv/` and installs all required dependencies.
The project includes a Makefile to orchestrate the entire pipeline. You can run individual steps or the complete workflow.
To run the entire pipeline from data download to analysis:
```bash
make all
```

This executes all stages: download → subset → load → evolve → query → recommender → algorithms.
For more control, run individual stages:
```bash
# 1. Download DBLP dataset (~3.5GB compressed)
make download

# 2. Create a subset of the data (optional)
make subset

# 3. Process and load data into Neo4j
make load         # Runs: clean → xml_to_csv → synthetic → bulk_load

# 4. Evolve the schema (add Review nodes)
make evolve

# 5. Run analytical queries
make query

# 6. Build AI research community recommender
make recommender

# 7. Execute graph algorithms (Louvain, PageRank)
make algorithms
```

The loading stage can also be run target by target:

```bash
make clean        # Clean processed data directory
make xml_to_csv   # Convert DBLP XML to CSV format
make synthetic    # Generate synthetic topics, citations, and reviews
make bulk_load    # Import CSV data into Neo4j
```

Run `make help` to display all available commands.

Project structure:

```
sdm-1/
├── src/
│   ├── part_A2_*/                 # Data acquisition and loading
│   │   ├── 00_download_dblp.py
│   │   ├── 01_subset_dblp.py
│   │   ├── 02_xml_to_csv.py
│   │   ├── 03_synthetic.py
│   │   └── 04_bulk_load.py
│   ├── part_A3_*/                 # Schema evolution
│   │   └── 05_schema_evolving.py
│   ├── part_B_*/                  # Querying system
│   │   └── querying.py
│   ├── part_C_*/                  # AI community recommender
│   │   └── recommender.py
│   └── part_D_*/                  # Graph algorithms
│       └── app.py
├── data/
│   ├── raw/                       # Downloaded DBLP data
│   ├── interim/                   # Intermediate processed files
│   └── processed/                 # Final CSV files for Neo4j
├── notebooks/                     # Jupyter notebooks for exploration
├── Makefile                       # Pipeline orchestration
├── docker-compose.yml             # Neo4j container configuration
└── requirements.txt               # Python dependencies
```
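The `xml_to_csv` stage flattens DBLP's XML records into CSV files that Neo4j can bulk-import. A stdlib sketch of the core idea (the real script uses `lxml`, streams the multi-gigabyte file, handles DBLP's character entities, and covers many more record types and fields than this toy sample):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny hand-made sample in the shape of a DBLP <article> record.
DBLP_SAMPLE = """<dblp>
  <article key="journals/x/Doe21">
    <author>Jane Doe</author>
    <title>Graphs Everywhere</title>
    <year>2021</year>
    <journal>X Journal</journal>
  </article>
</dblp>"""

def xml_to_csv(xml_text: str) -> str:
    """Flatten <article> records into CSV rows (key, title, year, journal)."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["key", "title", "year", "journal"])
    for art in ET.fromstring(xml_text).iter("article"):
        writer.writerow([
            art.get("key"),
            art.findtext("title"),
            art.findtext("year"),
            art.findtext("journal"),
        ])
    return out.getvalue()

csv_text = xml_to_csv(DBLP_SAMPLE)
```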
The system includes several pre-built analytical queries:
- Top Cited Papers: Find the top 3 most cited papers per conference/workshop
- Conference Communities: Identify research communities for each conference
- Journal Impact Factor: Calculate impact factors for journals
- Author H-Index: Compute h-index for authors in the graph
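The impact-factor and h-index queries follow the standard bibliometric definitions. As plain-Python sketches of the arithmetic (the project computes these against the graph itself, so the function names and inputs here are illustrative):

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that the author has h papers with at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

def impact_factor(cites_to_prev_two_years: int, papers_prev_two_years: int) -> float:
    """Citations received in year Y by papers from Y-1 and Y-2,
    divided by the number of papers published in Y-1 and Y-2."""
    if papers_prev_two_years == 0:
        return 0.0
    return cites_to_prev_two_years / papers_prev_two_years
```

For example, an author whose papers have citation counts [10, 8, 5, 4, 3] has an h-index of 4, and a journal with 150 citations to 60 recent papers has an impact factor of 2.5.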
Run queries interactively:
```bash
make query

# Or run a specific query directly:
.venv/bin/python -m src.part_B_IgnasiCervero_AkosSchneider.querying --query 1
```

- Community Detection (Louvain): Analyzes author collaboration networks to discover research communities based on co-authorship patterns.
- PageRank: Identifies influential papers and authors based on citation network structure.
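These algorithms run inside Neo4j (typically via the Graph Data Science library), but the idea behind PageRank on a citation network can be shown with a toy power-iteration sketch in pure Python (the 0.85 damping factor and fixed iteration count are conventional choices, not the project's settings):

```python
def pagerank(adj: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Toy power-iteration PageRank over {paper: [papers it cites]}.

    Every node must appear as a key in `adj`, even if it cites nothing.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Each node starts with its teleport share, then collects citation mass.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in adj.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling paper with no outgoing citations: spread its mass evenly.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Tiny citation network: D cites A and B but is never cited itself.
scores = pagerank({"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A", "B"]})
```

Papers inside the A→B→C citation cycle accumulate rank, while the never-cited D keeps only its teleport share, which is exactly the intuition behind using PageRank as an influence score.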
Run the algorithms:

```bash
make algorithms
```

Authors:

- Ignasi Cervero
- Ákos Schneider
This project is for academic and research purposes.