This repository is the artifact of CoGraf. It contains the simulator environment, tracing methodology, and experimental infrastructure.
This project implements and compares multiple Fine-Grained Processing-in-Memory (FGPIM) architectures for graph processing:
- CoGraf (our work): Fine-grained PIM with multi-column tuple-update commands and apply phase optimizations
- AIM: Column-level coalescing baseline
- Optimal: Optimal-granularity coalescing, obtained by running all possible granularity configurations
- Multi: Multi-column FGPIM instructions added on top of the Optimal-granularity configuration
- PHI: CPU baseline
- Naive FGPIM: FGPIM without optimizations
The evaluation infrastructure combines modified graph analytics frameworks (GAPBS, Ligra) with trace-driven DRAM simulation (Ramulator) and power analysis (DRAMPower).
- C++20 compatible compiler (g++ 10+, clang++ 10+)
- CMake 3.10+
- Python 3.6+
- OpenMP support
- (Optional) SLURM for HPC cluster execution
- (Optional) x86 machine for generating your own traces to simulate
The experiments were originally designed for a SLURM environment to enable task-level parallelism, and the scripts automatically detect SLURM availability and adapt accordingly. If you are running on a local machine without SLURM, set MAX_CONCURRENT_JOBS in run_main_experiments.sh to match your system resources (the default is 4). You may also need to adjust the SLURM job parameters given as #SBATCH comments at the top of the ramulator_run_traces.sh and ramulator_run_drampower.sh scripts, as sketched below.
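For illustration, such a job header might look like the sketch below. The directive values here are placeholders, not the defaults shipped in the scripts; adapt partition, memory, and time limits to your cluster.

```bash
#!/bin/bash
# Illustrative #SBATCH header for ramulator_run_traces.sh / ramulator_run_drampower.sh.
# These values are placeholders and may differ from the shipped scripts.
#SBATCH --job-name=cograf-sim   # hypothetical job name
#SBATCH --cpus-per-task=4       # cores per simulation job
#SBATCH --mem=32G               # memory per job; large graph traces may need more
#SBATCH --time=24:00:00         # wall-clock limit per simulation
```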
```bash
# Run the complete experimental pipeline
./run_main_experiments.sh
```

This will:
- Download pre-generated memory traces
- Compile all simulator variants
- Run simulations for configured architectures
- Perform energy analysis via DRAMPower and HBM2 model
- Consolidate results in the `experiment-results/main-results` directory

The outputs are organized as follows:
- Simulation statistics: `experiment-results/main-results/*.stats`
- Static energy analysis (DRAMPower): `experiment-results/main-results/drampower-outputs/`
- Dynamic energy analysis (HBM2 model): `experiment-results/main-results/hbm2model-outputs/`
- DRAMPower traces: `DRAMPower-traces/main-results/`
- Logs: `experiment-results/main-results/ramulator-outputs/`
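As a quick sanity check after the pipeline finishes, the output directories can be listed directly; the commands below only assume the directory layout given above.

```bash
# Quick sanity check of the consolidated outputs (paths follow the layout above).
ls experiment-results/main-results/*.stats
ls experiment-results/main-results/drampower-outputs/
ls experiment-results/main-results/hbm2model-outputs/
ls experiment-results/main-results/ramulator-outputs/   # simulation logs
```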
```
Graph Workloads (GAPBS/Ligra)
        ↓
Intel Pin Tracer
        ↓
Compressed Memory Traces (.gz)
        ↓
FIFO Streaming (gzip2fifo.py)
        ↓
Ramulator Variants (ramulator/ramulator-phi/ramulator-tuple)
        ↓
DRAMPower Command Traces (.cmdtrace)
        ↓
Energy Analysis (DRAMPower/HBM2Model)
```
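To make the streaming stage of this pipeline concrete, the sketch below shows roughly how a compressed trace could be fed to a simulator through a named FIFO. It is only conceptual: the trace file name, the gzip2fifo.py arguments, and the ramulator flags are assumptions, and ramulator_run_traces.sh performs the real orchestration for you.

```bash
# Conceptual sketch of the trace-streaming stage. The trace file name, the
# gzip2fifo.py arguments, and the ramulator flags below are assumptions;
# ramulator_run_traces.sh handles the actual invocation and config selection.
mkfifo /tmp/cograf_trace.fifo
python3 simulators/gzip2fifo.py \
    traces/output_pr_push_pim_twitter7_16t_traces/trace.gz /tmp/cograf_trace.fifo &
./simulators/ramulator/ramulator simulators/ramulator/configs/DDR_PIM-config_gran16.cfg \
    --mode=cpu --stats my-run.stats /tmp/cograf_trace.fifo
rm /tmp/cograf_trace.fifo
```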
Note: Pre-generated memory traces are provided with this artifact to facilitate reproduction. The trace generation tools (Intel Pin-based tracer) are included in the workloads/ directory for reference and potential regeneration if needed.
```
CoGraf/
├── simulators/                  # DRAM simulators and power models
│   ├── ramulator/               # Main FGPIM simulator (used for the CoGraf, Multi-column, Optimal, AIM, and Naive FGPIM configurations)
│   ├── ramulator-phi/           # PHI baseline simulator
│   ├── ramulator-tuple/         # CoGraf with tuple coalescing applied in private caches (sensitivity analysis)
│   ├── DRAMPower/               # Modified for multi-column commands
│   ├── HBM2Model/               # Python-based HBM2 energy model
│   └── gzip2fifo.py             # Trace streaming utility
├── workloads/                   # Modified graph frameworks
│   ├── gapbs/                   # GAPBS with PIM/PHI variants; adds PageRank-Delta; BFS and PR modified to push-based update-apply
│   ├── ligra-push/              # Ligra, modified for push-only algorithms
│   └── tracer/                  # Intel Pin-based tracer that generates traces compatible with the simulators
├── traces/                      # Pre-generated memory traces
├── run_main_experiments.sh      # Main experiment runner
├── ramulator_run_traces.sh      # Individual simulation runner
└── ramulator_run_drampower.sh   # Individual power analysis runner
```
ramulator/ - Main FGPIM simulator supporting multiple architectures:
- Mode 0: Ultimate CoGraf (our work) and Multi configuration (apply phase ignored)
- Mode 2: AiM baseline (column-level coalescing)
- Mode 3: Naive FGPIM baseline
- Mode 7: Sensitivity analysis where all tuple commands become simple writes
- Mode 8: Main configuration where tuple cachelines translate into multiple commands per column
The other modes (1, 4, 5) were used in preliminary studies for the paper but are not included in the evaluations. Mode 6 corresponds to tuple coalescing applied in private caches, but a separate simulator (ramulator-tuple) is used for that purpose.
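As a hedged example of switching between these modes, the two invocations below follow the parameter order documented in the run example further down and would simulate the same trace once as CoGraf (mode 0) and once as the AiM baseline (mode 2); the experiment name and EXTRA_ID values are just illustrative choices.

```bash
# Compare mode 0 (CoGraf) against mode 2 (AiM baseline) on the same trace.
# Parameter order follows the ramulator_run_traces.sh example below;
# "mode-compare", "gran16_pr", and "opt" are illustrative values.
./ramulator_run_traces.sh pr_push_pim twitter7 0 16 ramulator mode-compare 1 gran16_pr opt
./ramulator_run_traces.sh pr_push_pim twitter7 2 16 ramulator mode-compare 1 gran16_pr opt
```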
Configuration files in simulators/ramulator/configs/:
- `DDR_PIM-config_gran{N}.cfg`: granularity variants (2/4/8/16/32/64 columns)
- `HBM2-config*.cfg`: HBM2 memory configurations
- Suffixes:
  - `_opt`: conservative cache-pipelining model for processing the updates
  - `_simple`: practical cache-pipelining model as described in the paper
  - `_perfect`: perfect pipelining
  - `_pr`: PageRank- and PageRank-Delta-specific, with increased FGPIM command latency due to multiply operations
Note: Apply-phase optimizations are enabled in all FGPIM modes by default. Post-processing of the results takes the apply-phase statistics from the PHI baseline.
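To see which configurations are available, the config directory can simply be listed; the file names follow the naming scheme above, and the exact composition of granularity and suffix pieces for a given run is handled by the run scripts.

```bash
# List the available Ramulator configurations (names follow the scheme above).
ls simulators/ramulator/configs/DDR_PIM-config_gran*.cfg
ls simulators/ramulator/configs/HBM2-config*.cfg
```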
ramulator-phi/ - PHI baseline
ramulator-tuple/ - Tuple-based coalescing extended to all cache levels
DRAMPower - Modified to support multi-column DRAM commands
The workloads are mainly modified for push-based update-apply algorithms. The remaining modifications mark the region of interest and communicate with the Intel Pin tracer tool.
GAPBS (workloads/gapbs/) - Graph Analytics Benchmark Suite with PIM variants:
- Applications: BFS, PageRank, PageRank-Delta
- Variants: `{app}_pim.cc` (FGPIM), `{app}_phi.cc` (PHI), `{app}.cc` (default)
- Push-based algorithms with an update-apply phase structure
Ligra-push (workloads/ligra-push/) - Modified Ligra framework:
- Applications: Components and Radii in the `apps/` directory
- Variants: `{app}-push-pim.cc` (FGPIM), `{app}-push-phi.cc` (PHI), `{app}.cc` (default)
- Push-based graph processing primitives
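A rough build-and-run sketch for a GAPBS PIM variant follows. The binary name is inferred from the `{app}_pim.cc` convention, and `-f`/`-n` are standard GAPBS flags for the input graph and trial count; the modified suite may require additional options, so treat this as an assumption rather than the exact workflow.

```bash
# Hypothetical native run of a GAPBS PIM variant (no tracing).
# Binary name inferred from the {app}_pim.cc convention; -f selects the input
# graph and -n the number of trials (standard GAPBS flags). The modified suite
# may need extra options, e.g., for the region-of-interest markers.
cd workloads/gapbs
make
./pr_pim -f /path/to/twitter7.sg -n 1
```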
Input graphs are based on the GAP benchmark suite (https://doi.org/10.48550/arXiv.1508.03619)
| Graph | Vertices | Edges |
|---|---|---|
| twitter7 | 41M | 1468M |
| sk-2005 | 50M | 1930M |
| USA-road-d-USA | 23M | 57M |
| rMatGraph_27 | 134M | 2125M |
| uniformGraph_27 | 134M | 2128M |
See the tracer/ folder for downloading the inputs and generating your own traces. Alternatively, the traces generated from these inputs and used in the evaluation can be downloaded via the traces/download_traces.sh script.
See the run_main_experiments.sh script to run all relevant configurations used in the evaluations.
```bash
# Arguments, in order: application, graph input, simulation mode, thread count,
# simulator variant, experiment name, IS_DDR (1=DDR, 0=HBM2),
# EXTRA_ID (mainly the granularity), EXTRA_ID2 (mainly the cache sensitivity analysis)
./ramulator_run_traces.sh pr_push_pim twitter7 0 16 ramulator my-experiment 1 gran16_pr opt
```

The command trace must first be generated by a ramulator run before the power analysis:
```bash
./ramulator_run_drampower.sh [same parameters as above]
```

Set MAX_CONCURRENT_JOBS in run_main_experiments.sh:

```bash
MAX_CONCURRENT_JOBS=4  # Adjust based on available resources
```

The submit_job() function automatically detects SLURM availability and falls back to background job execution with concurrency control.
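The sketch below illustrates that fallback logic under stated assumptions; the actual submit_job() in run_main_experiments.sh may be structured differently.

```bash
# Minimal sketch of the SLURM-detection fallback described above; the real
# submit_job() in run_main_experiments.sh may differ in details.
submit_job_sketch() {
  if command -v sbatch >/dev/null 2>&1; then
    sbatch "$@"                       # SLURM available: submit as a batch job
  else
    # No SLURM: throttle background jobs to MAX_CONCURRENT_JOBS.
    while [ "$(jobs -rp | wc -l)" -ge "${MAX_CONCURRENT_JOBS:-4}" ]; do
      sleep 5
    done
    "$@" &
  fi
}
```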
Traces are pre-generated using Intel Pin and compressed:
- Location: `traces/output_{APP}_{INPUT}_{THREADS}t_traces/*.gz`
- Streaming: via named FIFOs using `gzip2fifo.py`
- Download: `./traces/download_traces.sh`
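For example, after downloading, the traces for one application/input pair should appear under a directory following the pattern above; the specific path below is just one instance of that pattern.

```bash
# Download the pre-generated traces, then check one trace directory
# (this path instantiates the output_{APP}_{INPUT}_{THREADS}t_traces pattern above).
./traces/download_traces.sh
ls traces/output_pr_push_pim_twitter7_16t_traces/
```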
Applications use markers from workloads/gapbs/intel_pin_headers.h:
```cpp
PIN_Start();
PIN_RegisterDataType(addr, IRREGDATA, n_elements, elem_size);
PIN_PIMOperations(PR_APPLY_STEP_PIM_CODE);
PIN_Stop();
```

Data types: IRREGDATA, REGDATA, CSR_OFFSETS, CSR_COORDS, FRONTIER
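To regenerate traces, the marked applications are run under the Pin-based tracer. The sketch below uses the generic Intel Pin invocation pattern; the tool path, its options, and the application flags are placeholders, so consult the tracer/ README for the actual command.

```bash
# Hypothetical Pin invocation of the tracer (tool path and flags are placeholders;
# see workloads/tracer/ for the actual build output and options).
$PIN_ROOT/pin -t workloads/tracer/obj-intel64/tracer.so -- \
    ./workloads/gapbs/pr_pim -f /path/to/twitter7.sg -n 1
```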
See the README files under the subfolders.
The simulation statistics contain cycle counts, memory bandwidth, command distributions, and PIM-specific metrics. The energy analysis reports the breakdown of activation, precharge, read/write, refresh, and tuple-update energy.
- DDR_PIM: 8 channels
- HBM2: 32 pseudo-channels
See LICENSE file for details.
For questions or issues, please open a GitHub issue or contact the authors.
This work builds upon: