# Polaris: Detecting Advanced Persistent Threat on Provenance Graphs via Siamese Masked Graph Representation

This is the official code for the Polaris framework.

We propose Polaris, a novel semi-supervised detection framework based on provenance graphs. At its core is a Siamese Graph Masked Autoencoder (SGMAE). This model leverages masked graph representation learning to model system entities and benign behaviors, while concurrently employing an angular contrastive learning mechanism that uses a small number of attack samples to separate the embeddings of easily confused benign and malicious behaviors. Finally, an anomaly detection algorithm is applied in the learned embedding space to identify APT attacks.
## Requirements

- Python 3.8
- PyTorch 1.12.1
- DGL 1.0.0
- scikit-learn 1.2.2
- networkx 3.1
- tqdm 4.67.1
- gensim 4.3.1
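For convenience, the versions above can be pinned in a `requirements.txt` (a sketch — the exact PyTorch/DGL wheels depend on your CUDA version, so you may need the matching index URLs from the PyTorch and DGL install pages):

```
torch==1.12.1
dgl==1.0.0
scikit-learn==1.2.2
networkx==3.1
tqdm==4.67.1
gensim==4.3.1
```

Install with `pip install -r requirements.txt`.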
## Datasets

### StreamSpot

- Download and unzip the `all.tar.gz` dataset from StreamSpot; you will get `all.tsv`.
- Copy `all.tsv` to `./data/streamspot/`.
- Go to the directory `datahandle` and run `streamspot_parser.py` to get 600 graph data files in JSON format.
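As a rough illustration of what this step does (the repo's `streamspot_parser.py` is authoritative), a minimal splitter could group the edges of `all.tsv` by graph id and emit one JSON file per graph. The TSV column order below follows the StreamSpot dataset description, and the output JSON schema here is an assumption, not necessarily Polaris's exact format:

```python
import json
from collections import defaultdict

def split_streamspot_tsv(tsv_path, out_dir):
    """Group StreamSpot edges by graph id and dump one JSON file per graph.

    Assumed TSV layout per line (tab-separated):
    source-id, source-type, destination-id, destination-type, edge-type, graph-id.
    """
    graphs = defaultdict(list)
    with open(tsv_path) as f:
        for line in f:
            src, src_type, dst, dst_type, etype, gid = line.rstrip("\n").split("\t")
            graphs[gid].append(
                {"src": src, "src_type": src_type,
                 "dst": dst, "dst_type": dst_type, "type": etype}
            )
    for gid, edges in graphs.items():
        with open(f"{out_dir}/{gid}.json", "w") as out:
            json.dump({"graph_id": gid, "edges": edges}, out)
    return sorted(graphs)
```

Running it over the full `all.tsv` should yield one JSON file per graph id (600 for the complete dataset).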
### Unicorn Wget

- Download and unzip `attack_baseline.tar.gz` and `benign.tar.gz` from Unicorn wget; you will get many `*.log` files.
- Copy the `.log` files to `data/wget/raw/`. Ignore the contents in `base` and `stream`.
- Go to the directory `datahandle` and run `wget_parser.py` to get 150 graph data files in JSON format.
### DARPA TC E3

- Go to the DARPA TC E3 datasets.
- Download and unzip `ta1-trace-e3-official-1.json.tar.gz` into `data/trace/`.
- Download and unzip `ta1-theia-e3-official-6r.json.tar.gz` into `data/theia/`.
- Download and unzip `ta1-cadets-e3-official-2.json.tar.gz` and `ta1-cadets-e3-official.json.tar.gz` into `data/cadets/`.
- Go to the directory `datahandle` and run `darpa_trace.py` to get the train set and test set.
- Polaris uses labels from ThreaTrace. Go to that repo and download the `.txt` labels files from the folder `ground_truth`, then put each labels file into the corresponding data folder. For example, put `trace.txt` into `data/trace/`.
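After the preparation steps above, the `data/` directory is expected to look roughly like the sketch below (the unpacked DARPA archives may split the JSON across several numbered parts, and the theia/cadets label file names are assumed to follow the `trace.txt` pattern):

```
data/
├── streamspot/
│   └── all.tsv
├── wget/
│   └── raw/
│       └── *.log
├── trace/
│   ├── ta1-trace-e3-official-1.json
│   └── trace.txt
├── theia/
│   └── ta1-theia-e3-official-6r.json
└── cadets/
    ├── ta1-cadets-e3-official-2.json
    └── ta1-cadets-e3-official.json
```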
## Run

To facilitate evaluation, we have saved the processed datasets and trained weights to `/data/dataset`. Just run:

```shell
python ./evaluate.py --dataset <selected dataset>
```

To train the model from scratch, run:

```shell
python ./train.py --dataset <selected dataset>
```