GitHub - atrapos-hin/artifacts

ATRAPOS: Metapath-Based Pattern Mining in HINs

More details about our work can be found in our publication:

S. Chatzopoulos, T. Vergoulis, D. Skoutas, T. Dalamagas, C. Tryfonopoulos, P. Karras. Atrapos: Real-time Evaluation of Metapath Query Workloads. The Web Conference (WWW), 2023

-- We kindly ask that any published research that makes use of our work cites the paper above.

Many mining algorithms in HINs have a common computationally intensive step, that consists of finding the pairs of nodes connected given a particular metapath. This repository provides open-source implementations of the Sequential and Dynamic Programming matrix multiplication strategies proposed in [1] for this problem. As data scientists usually submit multiple metapath queries, we further propose the OTree algorithm that exploits metapath overlaps to eliminate redundant computations.

Input File Formats

Input files can be one of the following types:

File containing node attributes. These are tab-separated files containing all node attributes. The first line is the header that contains all attribute names. The first column should be an incremental integer identifier, denoted as "id" in the header. These files should be named with the first letter of the entity they are representing. For example, the file that contains the attributes for node type Author should be named A.csv. An example of a file containing node attributes is the following:

id	name    surname
0	Makoto  Satoh
1	Ryo Muramatsu
...

File containing node relations. These are tab-separated files needed to construct the relations among nodes. These files contain two columns, the source and target identidiers respectively and should be sorted based on the first column and named with the extension .crs. They do not contain a header and they should be named according to the source and target node types. For example, the file with the relations between node types Author and Paper should be named AP.crs. An example of file containing node relations is the following:

Example Input Files

All methods use the same format of input files.

You can download sample data for the DBLP from: http://andrea.imsi.athenarc.gr/atrapos_sample_data.tar.gz Pleas copy and paste this address in your browser to download the file. Then extract it into the ./data folder.

Clone & Build

# clone repository
git clone https://github.com/atrapos-hin/artifacts.git

# navigate into the cloned folder
cd artifacts

# execute the following script from the root directory
bash setup.sh

# build project
make

Execution

All the algorithms can be executed with the following command:

./run -qf [query_file] -indir [nodes_input_dir] -idir [relation_input_dir] -algo [algorithm] -ad [adaptive] -mem [max_mem_in_adaptive_mode] -exp [expansion_strategy] -dopt [heuristic_of_dynamic_programming] -cache [cache_size] -cachep [cache_policy]

Parameters are explained in the table below:

Parameter	Description
-qf	the query file that contains all metapaths to be executed, each in a new line
-indir	the directory that contains the input files for node attributes
-idir	teh directory that contains the input files for node relations
-algo	the algorithm to be used; choose one of: `Seq` for HRank (see [1]) using sequential matrix multiplication `DynP` for HRank (see [1]) using the dynamic programming optimisation `Baseline1` for CBS1 (caches only the result) `CBS2` for CBS2 (caches final and intermediate results) `OTree` for ATRAPOS (see [2])
-ad	enable disk-based matrix multiplication, values: `0` (note that `1` is in experimental condition)
-mem	max memory to be used when the adapative (disk-based) matrix multiplcation is enabled, i.e, `-ad 1` is given
-exp	expansion strategy used in the OTree algorithm when trying to expand a previsously cached result. This parameter can be set to: `DynP` (also `Seq`, `Exp` and `ExpSparse` are in experimental condition)
-dopt	heustistic to be used for Dynamic Programming matrix multiplication. This parameter can be set to: `Dense`, `Sparse`, ( also `MNC` in experimental condition)
-cache	cache size given in MB
-cachep	the cache policy to be used; one of: `LRU` for the Least Recently Used `GDS` for the GreedyDual-Size `PGDSU` for the custom Popularity-aware GreedyDual-Size used in [2]

You can use the example.sh script to execute a sample query workload for DBLP. It writes all output results under the ./results folder and logs the overall time for each run in ./results/overall.csv

References

[1] Y. P. S. W. B. Shi Chuan, Li Yitong. Constrained-meta-path-based rankingin heterogeneous information network. Knowledge and Information Systems, 2016

[2] S. Chatzopoulos, T. Vergoulis, D. Skoutas, T. Dalamagas, C. Tryfonopoulos, P. Karras. Atrapos: Real-time Evaluation of Metapath Query Workloads. The Web Conference (WWW), 2023

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Base		Base
Cache		Cache
CacheBaselines		CacheBaselines
HRank		HRank
MNCSketches		MNCSketches
MatrixMultiplier		MatrixMultiplier
OTree		OTree
TransitionMatrix		TransitionMatrix
data		data
experiments		experiments
ext_libs/Eigen		ext_libs/Eigen
results		results
workloads		workloads
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
Utils.cpp		Utils.cpp
Utils.h		Utils.h
example.sh		example.sh
main.cpp		main.cpp
setup.sh		setup.sh

atrapos-hin/artifacts

Folders and files

Latest commit

History

Repository files navigation

ATRAPOS: Metapath-Based Pattern Mining in HINs

Input File Formats

Example Input Files

Clone & Build

Execution

References

About

Resources

Stars

Watchers

Forks

Languages