This repository is the artifact of CoGraf. It contains the simulator environment, tracing methodology, and experimental infrastructure.
This project implements and compares multiple Fine-Grained Processing-in-Memory (FGPIM) architectures for graph processing:
- CoGraf (our work): Fine-grained PIM with multi-column tuple-update commands and apply phase optimizations
- AIM: Column-level coalescing baseline
- Optimal: Optimal-granularity coalescing, obtained by running all possible granularity configurations
- Multi: Multi-column FGPIM instructions added on top of the Optimal-granularity configuration
- PHI: CPU baseline
- Naive FGPIM: FGPIM without optimizations
The evaluation infrastructure combines modified graph analytics frameworks (GAPBS, Ligra) with trace-driven DRAM simulation (Ramulator) and power analysis (DRAMPower).
- C++20 compatible compiler (g++ 10+, clang++ 10+)
- CMake 3.10+
- Python 3.6+
- OpenMP support
- (Optional) SLURM for HPC cluster execution
- (Optional) x86 machine for generating your own traces to simulate
The experiments were originally designed for a SLURM environment to enable task-level parallelism, and the scripts automatically detect SLURM availability and adapt accordingly. If you are running on a local machine without SLURM, set MAX_CONCURRENT_JOBS in run_main_experiments.sh to match your system resources (the default is 4). You may also need to adjust the SLURM job parameters given as #SBATCH comments at the top of the ramulator_run_traces.sh and ramulator_run_drampower.sh scripts, as sketched below.
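For illustration, such a job header might look like the sketch below. The directive values here are placeholders, not the defaults shipped in the scripts; adapt partition, memory, and time limits to your cluster.

```bash
#!/bin/bash
# Illustrative #SBATCH header for ramulator_run_traces.sh / ramulator_run_drampower.sh.
# These values are placeholders and may differ from the shipped scripts.
#SBATCH --job-name=cograf-sim   # hypothetical job name
#SBATCH --cpus-per-task=4       # cores per simulation job
#SBATCH --mem=32G               # memory per job; large graph traces may need more
#SBATCH --time=24:00:00         # wall-clock limit per simulation
```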
```bash
# Run the complete experimental pipeline
./run_main_experiments.sh
```

This will:
- Download pre-generated memory traces
- Compile all simulator variants
- Run simulations for configured architectures
- Perform energy analysis via DRAMPower and HBM2 model
- Consolidate results in the `experiment-results/main-results` directory

The outputs are organized as follows:
- Simulation statistics: `experiment-results/main-results/*.stats`
- Static energy analysis (DRAMPower): `experiment-results/main-results/drampower-outputs/`
- Dynamic energy analysis (HBM2 model): `experiment-results/main-results/hbm2model-outputs/`
- DRAMPower traces: `DRAMPower-traces/main-results/`
- Logs: `experiment-results/main-results/ramulator-outputs/`
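As a quick sanity check after the pipeline finishes, the output directories can be listed directly; the commands below only assume the directory layout given above.

```bash
# Quick sanity check of the consolidated outputs (paths follow the layout above).
ls experiment-results/main-results/*.stats
ls experiment-results/main-results/drampower-outputs/
ls experiment-results/main-results/hbm2model-outputs/
ls experiment-results/main-results/ramulator-outputs/   # simulation logs
```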
```
Graph Workloads (GAPBS/Ligra)
        ↓
Intel Pin Tracer
        ↓
Compressed Memory Traces (.gz)
        ↓
FIFO Streaming (gzip2fifo.py)
        ↓
Ramulator Variants (ramulator/ramulator-phi/ramulator-tuple)
        ↓
DRAMPower Command Traces (.cmdtrace)
        ↓
Energy Analysis (DRAMPower/HBM2Model)
```
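To make the streaming stage of this pipeline concrete, the sketch below shows roughly how a compressed trace could be fed to a simulator through a named FIFO. It is only conceptual: the trace file name, the gzip2fifo.py arguments, and the ramulator flags are assumptions, and ramulator_run_traces.sh performs the real orchestration for you.

```bash
# Conceptual sketch of the trace-streaming stage. The trace file name, the
# gzip2fifo.py arguments, and the ramulator flags below are assumptions;
# ramulator_run_traces.sh handles the actual invocation and config selection.
mkfifo /tmp/cograf_trace.fifo
python3 simulators/gzip2fifo.py \
    traces/output_pr_push_pim_twitter7_16t_traces/trace.gz /tmp/cograf_trace.fifo &
./simulators/ramulator/ramulator simulators/ramulator/configs/DDR_PIM-config_gran16.cfg \
    --mode=cpu --stats my-run.stats /tmp/cograf_trace.fifo
rm /tmp/cograf_trace.fifo
```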
Note: Pre-generated memory traces are provided with this artifact to facilitate reproduction. The trace generation tools (Intel Pin-based tracer) are included in the workloads/ directory for reference and potential regeneration if needed.
```
CoGraf/
├── simulators/                  # DRAM simulators and power models
│   ├── ramulator/               # Main FGPIM simulator (used for the CoGraf, Multi-column, Optimal, AIM, and Naive FGPIM configurations)
│   ├── ramulator-phi/           # PHI baseline simulator
│   ├── ramulator-tuple/         # CoGraf with tuple coalescing applied in private caches (sensitivity analysis)
│   ├── DRAMPower/               # Modified for multi-column commands
│   ├── HBM2Model/               # Python-based HBM2 energy model
│   └── gzip2fifo.py             # Trace streaming utility
├── workloads/                   # Modified graph frameworks
│   ├── gapbs/                   # GAPBS with PIM/PHI variants; adds PageRank-Delta; BFS and PR modified to push-based update-apply
│   ├── ligra-push/              # Ligra, modified for push-only algorithms
│   └── tracer/                  # Intel Pin-based tracer that generates traces compatible with the simulators
├── traces/                      # Pre-generated memory traces
├── run_main_experiments.sh      # Main experiment runner
├── ramulator_run_traces.sh      # Individual simulation runner
└── ramulator_run_drampower.sh   # Individual power analysis runner
```
ramulator/ - Main FGPIM simulator supporting multiple architectures:
- Mode 0: Ultimate CoGraf (our work) and Multi configuration (apply phase ignored)
- Mode 2: AiM baseline (column-level coalescing)
- Mode 3: Naive FGPIM baseline
- Mode 7: Sensitivity analysis where all tuple commands become simple writes
- Mode 8: Main configuration where tuple cachelines translate into multiple commands per column
The other modes (1, 4, 5) were used in preliminary studies for the paper but are not included in the evaluations. Mode 6 corresponds to tuple coalescing applied in private caches, but a separate simulator (ramulator-tuple) is used for that purpose.
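As a hedged example of switching between these modes, the two invocations below follow the parameter order documented in the run example further down and would simulate the same trace once as CoGraf (mode 0) and once as the AiM baseline (mode 2); the experiment name and EXTRA_ID values are just illustrative choices.

```bash
# Compare mode 0 (CoGraf) against mode 2 (AiM baseline) on the same trace.
# Parameter order follows the ramulator_run_traces.sh example below;
# "mode-compare", "gran16_pr", and "opt" are illustrative values.
./ramulator_run_traces.sh pr_push_pim twitter7 0 16 ramulator mode-compare 1 gran16_pr opt
./ramulator_run_traces.sh pr_push_pim twitter7 2 16 ramulator mode-compare 1 gran16_pr opt
```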
Configuration files in simulators/ramulator/configs/:
- `DDR_PIM-config_gran{N}.cfg`: granularity variants (2/4/8/16/32/64 columns)
- `HBM2-config*.cfg`: HBM2 memory configurations
- Suffixes:
  - `_opt`: conservative cache-pipelining model for processing the updates
  - `_simple`: practical cache-pipelining model as described in the paper
  - `_perfect`: perfect pipelining
  - `_pr`: PageRank- and PageRank-Delta-specific, with increased FGPIM command latency due to multiply operations
Note: Apply-phase optimizations are enabled in all FGPIM modes by default. Post-processing of the results takes the apply-phase statistics from the PHI baseline.
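To see which configurations are available, the config directory can simply be listed; the file names follow the naming scheme above, and the exact composition of granularity and suffix pieces for a given run is handled by the run scripts.

```bash
# List the available Ramulator configurations (names follow the scheme above).
ls simulators/ramulator/configs/DDR_PIM-config_gran*.cfg
ls simulators/ramulator/configs/HBM2-config*.cfg
```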
ramulator-phi/ - PHI baseline
ramulator-tuple/ - Tuple-based coalescing extended to all cache levels
DRAMPower - Modified to support multi-column DRAM commands
The workloads are mainly modified for push-based update-apply algorithms. The remaining modifications mark the region of interest and communicate with the Intel Pin tracer tool.
GAPBS (workloads/gapbs/) - Graph Analytics Benchmark Suite with PIM variants:
- Applications: BFS, PageRank, PageRank-Delta
- Variants: `{app}_pim.cc` (FGPIM), `{app}_phi.cc` (PHI), `{app}.cc` (default)
- Push-based algorithms with an update-apply phase structure
Ligra-push (workloads/ligra-push/) - Modified Ligra framework:
- Applications: Components and Radii in the `apps/` directory
- Variants: `{app}-push-pim.cc` (FGPIM), `{app}-push-phi.cc` (PHI), `{app}.cc` (default)
- Push-based graph processing primitives
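A rough build-and-run sketch for a GAPBS PIM variant follows. The binary name is inferred from the `{app}_pim.cc` convention, and `-f`/`-n` are standard GAPBS flags for the input graph and trial count; the modified suite may require additional options, so treat this as an assumption rather than the exact workflow.

```bash
# Hypothetical native run of a GAPBS PIM variant (no tracing).
# Binary name inferred from the {app}_pim.cc convention; -f selects the input
# graph and -n the number of trials (standard GAPBS flags). The modified suite
# may need extra options, e.g., for the region-of-interest markers.
cd workloads/gapbs
make
./pr_pim -f /path/to/twitter7.sg -n 1
```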
Input graphs are based on the GAP benchmark suite (https://doi.org/10.48550/arXiv.1508.03619)
| Graph | Vertices | Edges |
|---|---|---|
| twitter7 | 41M | 1468M |
| sk-2005 | 50M | 1930M |
| USA-road-d-USA | 23M | 57M |
| rMatGraph_27 | 134M | 2125M |
| uniformGraph_27 | 134M | 2128M |
See the tracer/ folder for downloading the inputs and generating your own traces. Alternatively, the traces generated from these inputs and used in the evaluation can be downloaded via the traces/download_traces.sh script.
See the run_main_experiments.sh script to run all relevant configurations used in the evaluations.
```bash
# Arguments, in order: application, graph input, simulation mode, thread count,
# simulator variant, experiment name, IS_DDR (1=DDR, 0=HBM2),
# EXTRA_ID (mainly the granularity), EXTRA_ID2 (mainly the cache sensitivity analysis)
./ramulator_run_traces.sh pr_push_pim twitter7 0 16 ramulator my-experiment 1 gran16_pr opt
```

The command trace must first be generated by a ramulator run before the power analysis:
```bash
./ramulator_run_drampower.sh [same parameters as above]
```

Set MAX_CONCURRENT_JOBS in run_main_experiments.sh:

```bash
MAX_CONCURRENT_JOBS=4  # Adjust based on available resources
```

The submit_job() function automatically detects SLURM availability and falls back to background job execution with concurrency control.
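The sketch below illustrates that fallback logic under stated assumptions; the actual submit_job() in run_main_experiments.sh may be structured differently.

```bash
# Minimal sketch of the SLURM-detection fallback described above; the real
# submit_job() in run_main_experiments.sh may differ in details.
submit_job_sketch() {
  if command -v sbatch >/dev/null 2>&1; then
    sbatch "$@"                       # SLURM available: submit as a batch job
  else
    # No SLURM: throttle background jobs to MAX_CONCURRENT_JOBS.
    while [ "$(jobs -rp | wc -l)" -ge "${MAX_CONCURRENT_JOBS:-4}" ]; do
      sleep 5
    done
    "$@" &
  fi
}
```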
Traces are pre-generated using Intel Pin and compressed:
- Location: `traces/output_{APP}_{INPUT}_{THREADS}t_traces/*.gz`
- Streaming: via named FIFOs using `gzip2fifo.py`
- Download: `./traces/download_traces.sh`
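For example, after downloading, the traces for one application/input pair should appear under a directory following the pattern above; the specific path below is just one instance of that pattern.

```bash
# Download the pre-generated traces, then check one trace directory
# (this path instantiates the output_{APP}_{INPUT}_{THREADS}t_traces pattern above).
./traces/download_traces.sh
ls traces/output_pr_push_pim_twitter7_16t_traces/
```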
Applications use markers from workloads/gapbs/intel_pin_headers.h:
```cpp
PIN_Start();
PIN_RegisterDataType(addr, IRREGDATA, n_elements, elem_size);
PIN_PIMOperations(PR_APPLY_STEP_PIM_CODE);
PIN_Stop();
```

Data types: IRREGDATA, REGDATA, CSR_OFFSETS, CSR_COORDS, FRONTIER
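To regenerate traces, the marked applications are run under the Pin-based tracer. The sketch below uses the generic Intel Pin invocation pattern; the tool path, its options, and the application flags are placeholders, so consult the tracer/ README for the actual command.

```bash
# Hypothetical Pin invocation of the tracer (tool path and flags are placeholders;
# see workloads/tracer/ for the actual build output and options).
$PIN_ROOT/pin -t workloads/tracer/obj-intel64/tracer.so -- \
    ./workloads/gapbs/pr_pim -f /path/to/twitter7.sg -n 1
```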
See the README files under the subfolders.
The simulation statistics contain cycle counts, memory bandwidth, command distributions, and PIM-specific metrics. The energy analysis reports the breakdown of activation, precharge, read/write, refresh, and tuple-update energy.
- DDR_PIM: 8 channels
- HBM2: 32 pseudo-channels
See LICENSE file for details.
For questions or issues, please open a GitHub issue or contact the authors.
This work builds upon: