A large-scale database for graph representation learning

ghj1976/malnet-graph

 
 

A Large-Scale Database for Graph Representation Learning

Accepted for an oral presentation in the NeurIPS 2021 Datasets and Benchmarks Track

MalNet: Advancing State-of-the-art Graph Databases

Recent research focusing on developing graph kernels, neural networks and spectral methods to capture graph topology has revealed a number of shortcomings of existing graph benchmark datasets, which often contain graphs that are relatively:

  • limited in number,
  • small in scale in terms of nodes and edges, and
  • restricted in class diversity.

To address these issues, we have been working at Georgia Tech’s Polo Club of Data Science to develop the world's largest public graph representation learning database to date. We release MalNet, which contains over 1.2 million function call graphs averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families of classes (see Figure 1 below).

Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x more classes.

Figure 1: Comparing graph databases

What is a function call graph (FCG)?

Function call graphs represent the control flow of programs (see Figure 2 below), and can be statically extracted from many types of software (e.g., EXE, PE, APK). We use the Android ecosystem due to its large market share, easy accessibility, and diversity of malicious software. With the generous permission of AndroZoo, we collected 1,262,024 Android APK files, specifically selecting APKs that have both a family and a type label obtained from the Euphony classification structure.

Figure 2: Function call graph
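
To make the idea of a function call graph concrete, here is a minimal sketch that builds a directed graph from a handful of caller/callee pairs and counts its nodes and edges. The function names and the toy call list are purely illustrative and are not taken from MalNet; the sketch also assumes repeated calls between the same two functions collapse into a single edge, which may differ from how the released graphs were extracted.

```python
def build_call_graph(calls):
    """Build a directed graph (adjacency sets) from (caller, callee) pairs."""
    graph = {}
    for caller, callee in calls:
        graph.setdefault(caller, set()).add(callee)
        # Make sure functions that call nothing still appear as nodes.
        graph.setdefault(callee, set())
    return graph


def graph_size(graph):
    """Return (number of nodes, number of directed edges)."""
    num_nodes = len(graph)
    num_edges = sum(len(callees) for callees in graph.values())
    return num_nodes, num_edges


# Hypothetical toy program, for illustration only.
toy_calls = [
    ("main", "parse_args"),
    ("main", "run"),
    ("run", "load_data"),
    ("run", "process"),
    ("process", "load_data"),
    ("main", "run"),  # repeated call site; collapses into the existing edge
]

toy_graph = build_call_graph(toy_calls)
num_nodes, num_edges = graph_size(toy_graph)
print(num_nodes, num_edges)  # prints: 5 5
```

A MalNet graph has the same shape, only with thousands of functions as nodes instead of five.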

How do we download and explore MalNet?

We have designed and developed MalNet Explorer, an interactive graph exploration and visualization tool to help people easily explore the data before downloading. Figure 3 shows MalNet Explorer’s desktop web interface and its main components. MalNet Explorer and the data are available online at www.mal-net.org.

Figure 3: MalNet Explorer

How to run the code?

The code is split into two separate directories: (1) data mining experiments and (2) graph neural network experiments. The experiments in each directory can be run using 'dm_experiments.py' and 'gnn_experiments.py', respectively. In addition, we aggregate the key parameters for each method into the respective 'config.py' files.

Before running the code, download the data from www.mal-net.org and set the paths to the 'full' and 'tiny' datasets in both the dm and gnn 'config.py' files using the parameters 'malnet_dir' and 'malnet_tiny_dir', respectively. In addition, download the split info controlling the train/val/test splits from www.mal-net.org and place it in this directory.
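
As a rough illustration of the setup step above, the two path parameters might look like the sketch below. Only the names 'malnet_dir' and 'malnet_tiny_dir' come from this README; the example paths, the helper 'missing_dirs', and the idea of checking the paths up front are assumptions, and the actual 'config.py' files may be organized differently.

```python
from pathlib import Path

# Hypothetical locations; point these at wherever the downloads were extracted.
malnet_dir = "/path/to/malnet-graphs"
malnet_tiny_dir = "/path/to/malnet-graphs-tiny"


def missing_dirs(paths):
    """Return the names of config entries whose directory does not exist."""
    return [name for name, path in paths.items() if not Path(path).is_dir()]


# Optional sanity check (not part of the original code) before a long run.
problems = missing_dirs(
    {"malnet_dir": malnet_dir, "malnet_tiny_dir": malnet_tiny_dir}
)
if problems:
    print("Update config.py; directory not found for:", ", ".join(problems))
```

Checking the paths before launching 'dm_experiments.py' or 'gnn_experiments.py' can catch a misconfigured directory early rather than partway through an experiment.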
