The paper was accepted for publication in DAMI (the Data Mining and Knowledge Discovery journal) in July 2023 and is available via this link.
The QCAD repo includes two folders, Code and Data.
Specifically, the Code folder contains the following sub-folders:
- Implementation: contains the implementations of contextual and traditional anomaly detection algorithms, as follows:
  - QCAD.py: algorithm proposed and implemented by us.
  - CAD.py: algorithm proposed by Song, Xiuyao, et al. (2007); implemented by us.
  - ROCOD.py: algorithm proposed by Liang, Jiongqian, and Srinivasan Parthasarathy (2016); implemented by us.
  - LoPAD.py: algorithm proposed by Lu, Sha, et al.; implemented by us.
  - PyODtest.py: other traditional anomaly detection algorithms such as KNN, LOF, SOD, IForest, and HBOS, implemented by Yue Zhao, Zain Nasrullah, and Zheng Li (2019); API written by us.
- Utilities: contains utility functions and scripts, as follows:
  - SynDataGen.py: generates synthetic datasets.
  - ContextualAnomalyInject.py: injects contextual anomalies.
  - FindMB.R: finds Markov Blankets for the LoPAD algorithm.
- Examples: contains the scripts used to generate the examples in our paper:
  - ExampleFootball.py: generates the football application example in the Experiment Results section.
  - ExampleQuantileHeight.py: generates the figures in the Introduction section.
  - ExampleBeanPlot.py: generates the beanplot in the Method section.
- MultipleRunningAverage: runs all detection algorithms involved 10 times independently.
  - AverageTest.py: runs every anomaly detection algorithm except CAD 10 times on each of the 20 real-world datasets.
  - AverageTestCAD.py: runs CAD 10 times on each of the 20 real-world datasets. CAD is run separately because it takes a long time.
  - SynAverageTest.py: runs every anomaly detection algorithm except CAD 10 times on each of the 10 synthetic datasets.
  - SynAverageTestCAD.py: runs CAD 10 times on each of the 10 synthetic datasets. CAD is run separately because it takes a long time.
- AblationStuides: investigates the impact of different components on detection performance.
  - AblationStudy.py: conducts two ablation studies.
- RuntimeAnalysis: inspects the computational cost of QCAD and CAD.
  - RuntimeAnalysis.py: measures the running time while varying the number of behavioural features, contextual features, or samples, respectively.
- SensitivityStudies: investigates the impact of the parameter k.
  - SensitivityOfNeighbours.py: inspects the detection accuracy in terms of ROC AUC, PR AUC, and P@n while varying the number of neighbours.
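
The evaluation scripts above report ROC AUC, PR AUC, and P@n averaged over 10 independent runs. The following is a minimal, dependency-free sketch of how such metrics and their averages can be computed. It is not taken from the repo: `toy_detector`, `averaged_metrics`, and all other names are hypothetical, PR AUC is approximated here by average precision, and P@n is assumed to use n equal to the number of true anomalies.

```python
import random
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, used here as an estimate of the area under the PR curve."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def precision_at_n(labels, scores, n=None):
    """Fraction of true anomalies among the n highest-scored samples."""
    if n is None:
        n = sum(labels)  # assumption: n defaults to the number of true anomalies
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:n]) / n


def toy_detector(n_samples, n_anomalies, seed):
    """Hypothetical stand-in for a detector: returns labels and noisy anomaly scores."""
    rng = random.Random(seed)
    labels = [1] * n_anomalies + [0] * (n_samples - n_anomalies)
    scores = [rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]
    return labels, scores


def averaged_metrics(n_runs=10):
    """Run the toy detector n_runs times with different seeds and average each metric."""
    results = {"roc_auc": [], "pr_auc": [], "p_at_n": []}
    for seed in range(n_runs):
        labels, scores = toy_detector(n_samples=200, n_anomalies=20, seed=seed)
        results["roc_auc"].append(roc_auc(labels, scores))
        results["pr_auc"].append(average_precision(labels, scores))
        results["p_at_n"].append(precision_at_n(labels, scores))
    return {name: mean(values) for name, values in results.items()}


if __name__ == "__main__":
    print(averaged_metrics())
```

In practice, a real detector's scores would replace `toy_detector`, and a library implementation of these metrics could be used instead of the hand-written versions.
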
Specifically, the Data folder contains the following sub-folders:
- RawData: 20 real-world datasets, assumed to contain no contextual anomalies.
- SynData: 10 synthetic datasets without contextual anomalies.
- GenData: 20 real-world datasets with injected contextual anomalies, 10 synthetic datasets with contextual anomalies, and the Markov Blankets of these 30 datasets in the subfolder ~/MB/.
- Examples: the football dataset with unknown real-world contextual anomalies.
- TempFiles: temporary or intermediate results.
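
The GenData datasets are described above as containing injected contextual anomalies. As a rough illustration of what such injection can look like, the sketch below uses a simple swap-based perturbation: for each selected sample, it draws a few random candidates and copies the behavioural values of the candidate that is farthest away in behavioural space, while leaving the contextual values untouched. This is an assumption for illustration only and may differ from what ContextualAnomalyInject.py actually does; the function name and all parameters are hypothetical.

```python
import math
import random


def inject_contextual_anomalies(rows, behavioural_idx, n_anomalies,
                                n_candidates=5, seed=0):
    """Return (new_rows, labels), where labels[i] == 1 marks an injected anomaly.

    rows            : list of equal-length lists of floats
    behavioural_idx : column indices treated as behavioural features
                      (all other columns are treated as contextual)
    """
    rng = random.Random(seed)
    new_rows = [list(r) for r in rows]  # copy so the input is not mutated
    labels = [0] * len(rows)

    for i in rng.sample(range(len(rows)), n_anomalies):
        others = [j for j in range(len(rows)) if j != i]
        candidates = rng.sample(others, min(n_candidates, len(others)))

        def behavioural_distance(j):
            # Euclidean distance between samples i and j on behavioural columns only
            return math.dist([rows[i][c] for c in behavioural_idx],
                             [rows[j][c] for c in behavioural_idx])

        donor = max(candidates, key=behavioural_distance)
        for c in behavioural_idx:
            new_rows[i][c] = rows[donor][c]  # swap in the donor's behaviour
        labels[i] = 1

    return new_rows, labels


if __name__ == "__main__":
    # Toy data: column 0 is contextual, column 1 is behavioural.
    data = [[float(i), float(i * 10)] for i in range(20)]
    perturbed, y = inject_contextual_anomalies(data, behavioural_idx=[1], n_anomalies=3)
    print(sum(y), "anomalies injected")
```

The perturbed samples keep a plausible context and a plausible behaviour taken from real data, but the combination of the two is unusual, which is what makes them contextual rather than global anomalies.
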