SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream
SAND is a semi-supervised framework for classifying evolving data streams. Unlike many existing approaches, it detects concept drift in an unsupervised way by monitoring changes in classifier confidence on test instances. It also addresses the concept evolution problem by detecting outliers that have strong cohesion among themselves. Please refer to the paper below for a detailed description of the approach.
SAND requires that:
- The input file is provided in .arff format.
- All features are numeric. Any non-numeric feature should be converted using standard techniques before it is used with SAND.
- Features should be normalized for better performance.
- Java SDK v1.7+
- Weka 3.6+
- Apache Commons Math library v2.2
- Apache Logging Services v1.2.15
All of the above except the Java SDK are included in the SRC_SAND_v_0_1 and DIST_SAND_v_0_1 folders.
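Since SAND expects normalized numeric features, a simple min-max scaler can be applied to each feature column before writing the .arff file. The sketch below is a minimal illustration in plain Python (the function name is hypothetical and not part of SAND):

```python
def min_max_normalize(columns):
    """Scale each numeric feature column to the range [0, 1]."""
    normalized = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # Map constant columns to 0.0 to avoid division by zero.
        normalized.append([(v - lo) / span if span else 0.0 for v in col])
    return normalized
```

Any standard scaling technique (e.g., z-score) would work equally well; min-max is shown only because it keeps values in a bounded range.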
To execute the program, use the following steps:
- Open a command prompt inside the DIST_SAND_v_0_1 folder.
- Run the command "java -jar SAND_v_0_1.jar [OPTION(S)]"
The options are as follows:
- Input file path, without the .arff extension.
- Chunk size for the warm-up period. The default is 2000 instances.
- Maximum number of models in the ensemble. Default value is 6.
- Confidence threshold value. Default value is 0.90. Please refer to the paper for a description of the confidence threshold.
- Use 1 to execute SAND-D, or 0 to execute SAND-F. Default value is 1. Please refer to the paper for descriptions of SAND-D and SAND-F.
- Labeling delay, in number of instances. The default (for classification only) is 1; use an appropriate value for novel class detection.
- Classification delay, in number of instances. The default (for classification only) is 0; use an appropriate value for novel class detection.
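Assuming the options are passed positionally in the order listed above (an assumption — check the jar's usage message or the paper to confirm), an invocation could be assembled with a hypothetical helper like this:

```python
import subprocess  # needed only if the commented run() call is enabled


def build_sand_command(input_path, chunk_size=2000, max_models=6,
                       threshold=0.90, variant=1, label_delay=1,
                       clf_delay=0):
    """Assemble the java invocation; defaults mirror the list above."""
    return ["java", "-jar", "SAND_v_0_1.jar", input_path,
            str(chunk_size), str(max_models), str(threshold),
            str(variant), str(label_delay), str(clf_delay)]


# Example (uncomment to actually launch the jar from DIST_SAND_v_0_1):
# subprocess.run(build_sand_command("data/sample"), check=True)
```

The input path `data/sample` is purely illustrative; note that it omits the .arff extension, as required.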
The program produces the following output:
- Progress and any change points detected throughout execution are printed to the console.
- At the end, it reports the percentage of labeled data used.
- The .log file contains important debug information.
- The .tmpres file contains the error rates for each chunk, in six columns as follows:
- Chunk # = the current chunk number. Each chunk contains 1000 instances.
- FP = number of existing class instances misclassified as novel class in this chunk.
- FN = number of novel class instances misclassified as existing class in this chunk.
- NC = actual number of novel class instances in this chunk.
- Err = number of misclassified instances (including FP and FN) in this chunk.
- GlobErr = cumulative % Err up to the current chunk.
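Assuming the six columns are whitespace-separated (an assumption about the file layout), a .tmpres row can be parsed, and GlobErr recomputed as a sanity check, with a small sketch like this (helper names are hypothetical; chunk size is the 1000 instances noted above):

```python
def parse_tmpres_line(line):
    """Split one .tmpres row into its six named columns."""
    chunk, fp, fn, nc, err, glob_err = line.split()
    return {"chunk": int(chunk), "FP": int(fp), "FN": int(fn),
            "NC": int(nc), "Err": int(err), "GlobErr": float(glob_err)}


def global_error(err_counts, chunk_size=1000):
    """Cumulative % error over all chunks seen so far (GlobErr)."""
    return 100.0 * sum(err_counts) / (len(err_counts) * chunk_size)
```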
- The .res file contains the summary results, i.e., the following error rates:
- FP% = % of existing class instances misclassified as novel class.
- FN% = % of novel class instances misclassified as existing class instances.
- NC (total) = total number of (actual) novel class instances.
- ERR% = % classification error (including FP, FN, and misclassification within existing class).
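The summary rates above can be recomputed from raw counts. The denominators used below (existing class instances for FP%, actual novel class instances for FN%, all instances for ERR%) follow the definitions given, but the helper itself is only an illustrative sketch, not part of SAND:

```python
def summary_rates(fp, fn, nc_total, total_err, n_existing, n_total):
    """FP%, FN%, and ERR% as defined for the .res summary."""
    return {
        "FP%": 100.0 * fp / n_existing,    # over existing class instances
        "FN%": 100.0 * fn / nc_total,      # over actual novel instances
        "ERR%": 100.0 * total_err / n_total,  # over all instances
    }
```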