alisemi/CoGraf

CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM

This repository is the artifact for CoGraf. It contains the simulator environment, tracing methodology, and experimental infrastructure.

Overview

This project implements and compares multiple Fine-Grained Processing-in-Memory (FGPIM) architectures for graph processing:

  • CoGraf (our work): Fine-grained PIM with multi-column tuple-update commands and apply phase optimizations
  • AIM: Column-level coalescing baseline
  • Optimal: Coalescing at the best granularity, found by running every possible granularity configuration
  • Multi: The Optimal granularity configuration extended with multi-column FGPIM instructions
  • PHI: CPU Baseline
  • Naive FGPIM: FGPIM without optimizations

The evaluation infrastructure combines modified graph analytics frameworks (GAPBS, Ligra) with trace-driven DRAM simulation (Ramulator) and power analysis (DRAMPower).

Quick Start

Prerequisites

  • C++20 compatible compiler (g++ 10+, clang++ 10+)
  • CMake 3.10+
  • Python 3.6+
  • OpenMP support
  • (Optional) SLURM for HPC cluster execution
  • (Optional) x86 machine, needed only to generate your own traces for simulation

Running Experiments

The experiments were originally designed for a SLURM environment to exploit task-level parallelism. The run_main_experiments.sh script detects whether SLURM is available and adapts accordingly. When running on a local machine without SLURM, set MAX_CONCURRENT_JOBS in run_main_experiments.sh to match your system resources (default: 4). On a cluster, you may need to adjust the SLURM job parameters given in the #SBATCH comments at the top of ramulator_run_traces.sh and ramulator_run_drampower.sh.

# Run complete experimental pipeline
./run_main_experiments.sh

This will:

  1. Download pre-generated memory traces
  2. Compile all simulator variants
  3. Run simulations for configured architectures
  4. Perform energy analysis via DRAMPower and HBM2 model
  5. Consolidate results in the experiment-results/main-results directory

Results Location

  • Simulation statistics: experiment-results/main-results/*.stats
  • Static Energy analysis (DRAMPower): experiment-results/main-results/drampower-outputs/
  • Dynamic Energy analysis (HBM2 Model): experiment-results/main-results/hbm2model-outputs/
  • DRAMPower traces: DRAMPower-traces/main-results/
  • Logs: experiment-results/main-results/ramulator-outputs/

Architecture

Workflow Pipeline

Graph Workloads (GAPBS/Ligra)
    ↓
Intel Pin Tracer
    ↓
Compressed Memory Traces (.gz)
    ↓
FIFO Streaming (gzip2fifo.py)
    ↓
Ramulator Variants (ramulator/ramulator-phi/ramulator-tuple)
    ↓
DRAMPower Command Traces (.cmdtrace)
    ↓
Energy Analysis (DRAMPower/HBM2Model)

Note: Pre-generated memory traces are provided with this artifact to facilitate reproduction. The trace generation tools (Intel Pin-based tracer) are included in the workloads/ directory for reference and potential regeneration if needed.

Directory Structure

CoGraf/
├── simulators/           # DRAM simulators and power models
│   ├── ramulator/        # Main FGPIM simulator (used for CoGraf, Multi-column, Optimal, AIM, and Naive FGPIM configurations)
│   ├── ramulator-phi/    # PHI baseline simulator (for PHI baseline)
│   ├── ramulator-tuple/  # CoGraf with tuple coalescing applied in private caches as part of a sensitivity analysis.
│   ├── DRAMPower/        # Modified for multi-column commands
│   ├── HBM2Model/        # Python-based HBM2 energy model
│   └── gzip2fifo.py      # Trace streaming utility
├── workloads/            # Modified graph frameworks
│   ├── gapbs/            # GAPBS with PIM/PHI variants and an added PageRank-Delta implementation; BFS and PR modified to push-based update-apply
│   ├── ligra-push/       # Ligra, modified for push-only algorithms
│   └── tracer/           # Intel Pin-based tracer tool that generates traces compatible with the simulators
├── traces/               # Pre-generated memory traces
├── run_main_experiments.sh          # Main experiment runner
├── ramulator_run_traces.sh          # Individual simulation runner
└── ramulator_run_drampower.sh       # Individual Power analysis runner

Key Components

Simulators

ramulator/ - Main FGPIM simulator supporting multiple architectures:

  • Mode 0: Ultimate CoGraf (our work) and Multi configuration (apply phase ignored)
  • Mode 2: AiM baseline (column-level coalescing)
  • Mode 3: Naive FGPIM baseline
  • Mode 7: Sensitivity analysis where all tuple commands become simple writes
  • Mode 8: Main configuration where tuple cachelines translate into multiple commands per column

The other modes (1, 4, 5) were used in preliminary studies for the paper but are not included in the evaluations. Mode 6 corresponds to tuple coalescing applied in private caches; however, a separate simulator (ramulator-tuple) is used for that purpose.
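For scripting around the simulator, the mode list above can be captured in a small lookup table. This is only a sketch: the names EVALUATED_MODES, PRELIMINARY_MODES and describe_mode are hypothetical and do not exist in the repository; the descriptions are copied from this README.

```python
# Simulation modes of simulators/ramulator/ as documented in this README.
# All names below are illustrative; they are not part of the repository.
EVALUATED_MODES = {
    0: "Ultimate CoGraf (our work) and Multi configuration (apply phase ignored)",
    2: "AiM baseline (column-level coalescing)",
    3: "Naive FGPIM baseline",
    7: "Sensitivity analysis where all tuple commands become simple writes",
    8: "Main configuration where tuple cachelines translate into multiple commands per column",
}
PRELIMINARY_MODES = {1, 4, 5}  # preliminary studies only, not evaluated
PRIVATE_CACHE_MODE = 6         # covered by the separate ramulator-tuple simulator


def describe_mode(mode: int) -> str:
    """Return the README description for a simulation mode number."""
    if mode in EVALUATED_MODES:
        return EVALUATED_MODES[mode]
    if mode in PRELIMINARY_MODES:
        return "Preliminary study mode, not included in the evaluations"
    if mode == PRIVATE_CACHE_MODE:
        return "Tuple coalescing in private caches; use ramulator-tuple instead"
    raise ValueError(f"unknown simulation mode: {mode}")


if __name__ == "__main__":
    for m in range(9):
        print(m, describe_mode(m))
```

A wrapper script could call describe_mode before launching a run to warn about modes that were not part of the evaluation.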

Configuration files in simulators/ramulator/configs/:

  • DDR_PIM-config_gran{N}.cfg: Granularity variants (2/4/8/16/32/64 columns)
  • HBM2-config*.cfg: HBM2 memory configurations
  • Suffixes: _opt (conservative cache pipelining model for processing the updates), _simple (practical cache pipelining model as described in the paper), _perfect (perfect pipelining), _pr (specific to PageRank and PageRank-Delta: increased latency of FGPIM commands due to multiply operations)

Note: Apply phase optimizations are used in all FGPIM modes by default. Post-processing of the results uses PHI's apply phase statistics.

ramulator-phi/ - PHI baseline

ramulator-tuple/ - Tuple-based coalescing extended to all cache levels

DRAMPower - Modified to support multi-column DRAM commands

Workloads

The workloads are modified mainly to use push-based update-apply algorithms. The remaining modifications mark the region of interest and communicate with the Intel Pin tracer tool.

GAPBS (workloads/gapbs/) - Graph Analytics Benchmark Suite with PIM variants:

  • Applications: BFS, PageRank, PageRank-Delta
  • Variants: {app}_pim.cc (FGPIM), {app}_phi.cc (PHI), {app}.cc (default)
  • Push-based algorithms with update-apply phase structure

Ligra-push (workloads/ligra-push/) - Modified Ligra framework:

  • Applications: Components and Radii in apps/ directory
  • Variants: {app}-push-pim.cc (FGPIM), {app}-push-phi.cc (PHI), {app}.cc (default)
  • Push-based graph processing primitives

Configuration

Standard Graph Inputs

Input graphs are based on the GAP benchmark suite (https://doi.org/10.48550/arXiv.1508.03619)

Graph             Vertices   Edges
twitter7          41M        1468M
sk-2005           50M        1930M
USA-road-d-USA    23M        57M
rMatGraph_27      134M       2125M
uniformGraph_27   134M       2128M
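To get a feel for how different these inputs are, the edges-per-vertex ratio can be computed directly from the table. The sketch below uses only the rounded figures listed above (in millions), so the ratios are approximate; the names GRAPHS and edges_per_vertex are illustrative.

```python
# (vertices, edges) in millions, copied from the table in this README.
GRAPHS = {
    "twitter7": (41, 1468),
    "sk-2005": (50, 1930),
    "USA-road-d-USA": (23, 57),
    "rMatGraph_27": (134, 2125),
    "uniformGraph_27": (134, 2128),
}


def edges_per_vertex(name: str) -> float:
    """Approximate number of edges per vertex for a listed graph."""
    vertices, edges = GRAPHS[name]
    return edges / vertices


if __name__ == "__main__":
    for name in GRAPHS:
        print(f"{name:16s} {edges_per_vertex(name):6.1f}")
```

The road network comes out at fewer than three edges per vertex, far sparser than the social and web graphs.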

See the workloads/tracer/ folder for instructions on downloading and generating your own traces. Alternatively, the traces generated from these inputs and used in the evaluation can be downloaded with the traces/download_traces.sh script.

Advanced Usage

See the run_main_experiments.sh script to run all relevant configurations used in the evaluations.

Running Individual Simulations

# Positional arguments, in order:
#   1. Application            (pr_push_pim)
#   2. Graph input            (twitter7)
#   3. Simulation mode        (0)
#   4. Thread count           (16)
#   5. Simulator variant      (ramulator)
#   6. Experiment name        (my-experiment)
#   7. IS_DDR, 1=DDR, 0=HBM2  (1)
#   8. EXTRA_ID, mainly for the granularity              (gran16_pr)
#   9. EXTRA_ID2, mainly for cache sensitivity analysis  (opt)
./ramulator_run_traces.sh pr_push_pim twitter7 0 16 ramulator my-experiment 1 gran16_pr opt
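Because the nine arguments are positional, it is easy to swap two of them by accident. The following Python sketch builds the argument list in the documented order from named fields. The SimulationRun class is hypothetical and not part of the repository; only the argument order and the example values come from this README.

```python
from typing import List, NamedTuple


class SimulationRun(NamedTuple):
    """Named fields for the positional arguments of ramulator_run_traces.sh."""
    application: str
    graph_input: str
    mode: int
    threads: int
    simulator: str
    experiment: str
    is_ddr: bool      # True -> "1" (DDR), False -> "0" (HBM2)
    extra_id: str     # mainly the granularity
    extra_id2: str    # mainly the cache sensitivity setting

    def argv(self, script: str = "./ramulator_run_traces.sh") -> List[str]:
        """Return the command line in the order the script expects."""
        return [
            script,
            self.application,
            self.graph_input,
            str(self.mode),
            str(self.threads),
            self.simulator,
            self.experiment,
            "1" if self.is_ddr else "0",
            self.extra_id,
            self.extra_id2,
        ]


run = SimulationRun("pr_push_pim", "twitter7", 0, 16, "ramulator",
                    "my-experiment", True, "gran16_pr", "opt")
# To launch: subprocess.run(run.argv(), check=True)
```

The same object could be reused for ramulator_run_drampower.sh by passing that script name to argv, since it takes the same parameters.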

Power Analysis Only

The command trace must first be generated by Ramulator:

./ramulator_run_drampower.sh [same parameters as above]

Local Execution (without SLURM)

Set MAX_CONCURRENT_JOBS in run_main_experiments.sh:

MAX_CONCURRENT_JOBS=4  # Adjust based on available resources

The submit_job() function automatically detects SLURM availability and falls back to background job execution with concurrency control.
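The fallback behaviour described above can be illustrated with a short Python sketch. It is not the repository's submit_job() implementation, which lives in the shell script; it assumes SLURM is detected by looking for sbatch on PATH, and the names slurm_available and run_jobs are hypothetical.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

MAX_CONCURRENT_JOBS = 4  # same default as mentioned in this README


def slurm_available() -> bool:
    """Assumption: SLURM counts as available when sbatch is on PATH."""
    return shutil.which("sbatch") is not None


def run_jobs(commands: List[List[str]],
             max_jobs: int = MAX_CONCURRENT_JOBS,
             use_slurm: Optional[bool] = None) -> List[int]:
    """Run each command (an argv list) and return the exit codes in order.

    With SLURM, every command is handed to sbatch, which returns as soon as
    the job is queued. Without SLURM, commands run locally with at most
    max_jobs running at the same time.
    """
    if use_slurm is None:
        use_slurm = slurm_available()
    if use_slurm:
        return [subprocess.run(["sbatch", *cmd]).returncode for cmd in commands]
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

A thread pool is enough here because each worker only waits on an external process; the pool size plays the role of MAX_CONCURRENT_JOBS.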

Memory Trace Format

Traces are pre-generated using Intel Pin and compressed:

  • Location: traces/output_{APP}_{INPUT}_{THREADS}t_traces/*.gz
  • Streaming: Via named FIFOs using gzip2fifo.py
  • Download: ./traces/download_traces.sh
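Streaming through a named FIFO lets the simulator read a trace as if it were a plain file while the compressed .gz stays on disk. The sketch below shows the general technique on a POSIX system; it is not the code of gzip2fifo.py, and the function name start_stream is hypothetical.

```python
import gzip
import os
import shutil
import threading


def start_stream(gz_path: str, fifo_path: str) -> threading.Thread:
    """Decompress gz_path into a named FIFO in a background thread.

    The FIFO is created before the thread starts, so a reader can open
    fifo_path as soon as this function returns.
    """
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path)

    def pump() -> None:
        # Opening a FIFO for writing blocks until a reader opens it.
        with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    writer = threading.Thread(target=pump, daemon=True)
    writer.start()
    return writer
```

The consumer then opens fifo_path like a regular file and reads until end of file, which arrives when the writer thread finishes and closes its end.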

PIN Instrumentation

Applications use markers from workloads/gapbs/intel_pin_headers.h:

PIN_Start();
PIN_RegisterDataType(addr, IRREGDATA, n_elements, elem_size);
PIN_PIMOperations(PR_APPLY_STEP_PIM_CODE);
PIN_Stop();

Data types: IRREGDATA, REGDATA, CSR_OFFSETS, CSR_COORDS, FRONTIER

Building from Source

See the README files under the subfolders.

Output Interpretation

Statistics Files (.stats)

Each file contains cycle counts, memory bandwidth, command distributions, and PIM-specific metrics.

DRAMPower Output

Energy breakdown: activation, precharge, read/write, refresh, and tuple-update energy.

Memory Configuration

  • DDR_PIM: 8 channels
  • HBM2: 32 pseudo-channels

License

See LICENSE file for details.

Contact

For questions or issues, please open a GitHub issue or contact the authors.

Acknowledgments

This work builds upon:
