CaTSCAN: Calculating Typed Subgraph Counts Analytically for Networks

CaTSCAN is an efficient algorithm for counting exact frequencies of typed graphlets of up to 5 nodes in heterogeneous information networks. It derives counts of larger graphlets from smaller ones using combinatorics, avoiding excessive enumeration while preserving type information, and outperforms existing typed graphlet solutions by up to 4 orders of magnitude.

This repository contains the code accompanying the paper:

Christian Beth*, Mattis thor Straten, Felix Rodriguez Rasmussen, Mohammad Matin Najafi, Reynold Cheng, and Matthias Renz. CaTSCAN: Calculating Typed Subgraph Counts Analytically for Networks. In Proceedings of ECML PKDD 2026. In press.

*Corresponding author: Christian Beth

It also includes our implementation of the algorithm for typed graphlets following the paper by Rossi et al. (TKDD 2020), which counts typed 3- and 4-node subgraphs, used as a reference implementation and baseline.

Citation

@inproceedings{beth2026catscan,
  author    = {Beth, Christian
               and {thor Straten}, Mattis
               and Rodriguez Rasmussen, Felix
               and Najafi, Mohammad Matin
               and Cheng, Reynold
               and Renz, Matthias},
  title     = {{CaTSCAN}: Calculating Typed Subgraph Counts Analytically for Networks},
  booktitle = {Proceedings of the European Conference on Machine Learning and
               Principles and Practice of Knowledge Discovery in Databases
               (ECML PKDD)},
  year      = {2026},
  note      = {In press}
}

Installation

Install your favorite IDE and clone this repository.
Install CMake: CMake is a cross-platform build system. Depending on your operating system, the installation process might vary: This is the process on Ubuntu:
```
sudo apt-get install cmake
```

Build the project:

~/.../catscan$ cmake -S . -B build
~/.../catscan$ cd build
~/.../catscan/build$ make

Usage of our proposed solution (supports graphlet sizes 3, 4, and 5):

~/.../catscan/build$ cd ..
~/.../catscan$ ./build/CaTSCAN <dataset_dir> <output_dir> <size of graphlets>

Usage of our implementation of the algorithm by [Rossi et al.(https://dl.acm.org/doi/abs/10.1145/3418773)] (supports graphlet sizes 3 and 4):
```
~/.../catscan/build$ cd ..
~/.../catscan$ ./build/Rossi <dataset_dir> <output_dir> <size of graphlets>
```

Input Format

The dataset directory for the desired graph should contain the following files:

meta.csv: The first line contains the number of nodes, the second line the number of edges, and the third the number of node types.
nodes.csv: Each line contains the type ID of the corresponding node (0-indexed, as integer).
edges.csv: Each line contains an edge in the form <source node id>, <edge type id>, <target node id> (where node ids are 0-indexed, as integers, and edge type id is also 0-indexed, as integer). The file should contain no multi-edges, self-loops, reverse edges, or dangling edges.
node_types.csv (optional): Each line contains the type name (as string) of the corresponding node type ID (line 0 corresponds to node type 0).
edge_types.csv (optional): Each line contains the type name (as string) of the corresponding edge type ID (line 0 corresponds to edge type 0).

An example dataset is provided in catscan/data/DBLP4areas/. To run CaTSCAN on it for 4-node graphlets:

~/.../catscan$ ./build/CaTSCAN data/DBLP4areas/ output/ 4

Output Format

The directory will contain a file CaTSCAN_times_<graphlet size>.tsv or Rossi_times_<graphlet size>.tsv, where each line contains the runtime (in milliseconds) to load the graph, and for the counting of all graphlets of the specified size, separated by a tab. Please note that for each run a new line is appended instead of overwriting the file.

When running CaTSCAN (not for Rossi), the output directory will contain a file for each graphlet type, named <graphlet name>.tsv. Each line contains a type description, e.g., (3, 0, 1, 2, 3) and, separated by a tab character, the count of occurrences of that typed graphlet in the input graph. The type description is a tuple where each entry corresponds to a node type ID, where the type position is indicated by the graphlet figure. For example, the line (3, 0, 1, 2, 3) \t 42 in the output file four_stars.tsv would indicate that there are 42 occurrences of the 4-star where the center node has type ID 3, surrounded by nodes with type IDs 0, 1, and 2 respectively.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data/DBLP4areas		data/DBLP4areas
src		src
.gitignore		.gitignore
3-node_graphlets.png		3-node_graphlets.png
4-node_graphlets.png		4-node_graphlets.png
5-node_graphlets.png		5-node_graphlets.png
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CaTSCAN: Calculating Typed Subgraph Counts Analytically for Networks

Citation

Installation

Input Format

Output Format

Three-Node Graphlets

Four-Node Graphlets

Five-Node Graphlets

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CaTSCAN: Calculating Typed Subgraph Counts Analytically for Networks

Citation

Installation

Input Format

Output Format

Three-Node Graphlets

Four-Node Graphlets

Five-Node Graphlets

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages