A PyTorch implementation of Interpretable Federated Transformer Log Learning for Threat Forensics.
This repository includes our original Cyber Threat Detection Dataset 2021 (CTDD). The dataset is composed of syslogs collected from systems running in three clusters, namely ITESM, ML-7063, and Practicum. We have included the raw logs, the cleaned and filtered logs pre-processed with the Spell log parser, and a sample of the parsed cyber threats and normal syslogs for experimental reproduction. In addition, we include the parsed HDFS dataset used to compare the performance of our proposed model with state-of-the-art works. The original HDFS logs can be found at http://people.iiis.tsinghua.edu.cn/~weixu/sospdata.html.
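For reference, a minimal loading sketch, assuming the usual log-key-sequence convention (as with the parsed HDFS data) in which each line of a parsed file holds one session as a space-separated sequence of integer log key IDs; the file path below is illustrative:

```python
# Minimal sketch for loading a parsed log file, assuming each line is one
# session given as space-separated integer log key IDs (the convention of
# the parsed HDFS data); the file path is illustrative.
def load_sessions(path):
    sessions = []
    with open(path) as f:
        for line in f:
            keys = [int(k) for k in line.split()]
            if keys:
                sessions.append(keys)
    return sessions

sessions = load_sessions('Dataset/Linux/linux_train')
print(len(sessions), 'sessions loaded')
```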
The baseline experiment trains the model in the conventional, centralized way:
python train.py --num_classes=449 --num_layers=4 --num_heads=2 --epochs=5 --batch_size=2048
The federated experiment trains the model collaboratively across several clients over multiple aggregation rounds:

python federated_train.py --num_classes=449 --epochs=5 --batch_size=2048 --num_layers=4 --num_heads=2 --clients=4 --rounds=10
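Conceptually, each round lets every selected client train locally on its own shard, after which the server averages the client weights. A minimal sketch of that FedAvg aggregation step (illustrative only, not the exact code in federated_train.py, which may weight clients by sample count):

```python
import copy
import torch

def federated_average(client_states):
    # Equal-weight FedAvg over the clients' state_dicts. This is a sketch of
    # the per-round aggregation; the repository's implementation may differ.
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        stacked = torch.stack([state[key].float() for state in client_states])
        avg[key] = stacked.mean(dim=0)
    return avg
```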
The only required parameter for training is --num_classes: it is 29 for HDFS and 449 for CTDD.
The default values for the various parameters passed to the experiment are given in train.py. Details of some of those parameters are listed below, followed by a short argparse sketch:
--log_file, default='Linux/linux_train', type=str: parsed log file for training
--log_normal, default='Linux/linux_test_normal', type=str: parsed log file of normal testing data
--log_abnormal, default='Linux/linux_abnormal', type=str: parsed log file of abnormal testing data
--window_size, default=10, type=int: length of the training window
--batch_size, default=512, type=int: input batch size for training
--epochs, default=10, type=int: number of epochs to train
--dropout, default=0.2, type=float: dropout probability
--num_layers, default=1, type=int: number of encoders and decoders
--num_heads, default=1, type=int: number of attention heads
--seed, default=1, type=int: random seed
--num_classes, type=int: number of total log keys (required)
--num_candidates, default=10, type=int: number of top predicted log keys within which the true next key must fall for a prediction to count as correct
--federated, default=False, type=bool: whether federated training is used
--num_gpus, default=1, type=int: number of GPUs to train on
--model_dir, default='Model', type=str: directory in which to store the model
--data_dir, default='Dataset', type=str: directory where the training data is stored
--clients, default=2, type=int: number of clients
--rounds, default=2, type=int: number of federated rounds
--frac, default=1.0, type=float: fraction of clients to use per round
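For orientation, the defaults above would typically be declared in train.py with argparse along these lines (an abbreviated sketch showing only a few of the arguments, not the full list):

```python
import argparse

parser = argparse.ArgumentParser()
# The only required argument: the size of the log key vocabulary.
parser.add_argument('--num_classes', type=int, required=True,
                    help='number of total log keys (29 for HDFS, 449 for CTDD)')
parser.add_argument('--window_size', default=10, type=int,
                    help='length of the training window')
parser.add_argument('--num_candidates', default=10, type=int,
                    help='top predictions within which the true next key must fall')
args = parser.parse_args()
```

In this DeepLog-style evaluation setup, a test window is treated as normal when the true next log key appears among the model's top --num_candidates predictions, and flagged as anomalous otherwise.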
The repository also covers the rest of the pipeline: data preparation, log templates, log key sequences, and model training and evaluation. In particular, it includes utilities for:
- generating the data and log key sequences from the given log files (see the sketch after this list)
- splitting the dataset into n clients for the federated experiments (also sketched below)
- changing the length of the sequences over time with time_seq
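The first two of these steps are easy to sketch under the assumptions above (space-separated key sequences; a uniform random client split). Both functions are illustrative, not the repository's exact code:

```python
import random

def sliding_windows(session, window_size=10):
    # Turn one session of log key IDs into (window, next key) pairs,
    # the standard setup for next-log-key prediction.
    return [(session[i:i + window_size], session[i + window_size])
            for i in range(len(session) - window_size)]

def split_for_clients(sessions, num_clients, seed=1):
    # Shuffle the sessions and deal them out round-robin, giving each
    # client an approximately equal IID shard.
    rng = random.Random(seed)
    shuffled = list(sessions)
    rng.shuffle(shuffled)
    return [shuffled[i::num_clients] for i in range(num_clients)]
```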
The notebooks for interpreting the model's decision-making process are Forensic_Investigation.ipynb and Forensic_Investigation_Figures.ipynb. The first notebook makes use of the attention-based weights saved after the model's evaluation of an input sequence, together with the log templates generated by the Spell parser. The second notebook uses the attention-based weights to generate the interpretability saliency maps.
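As a rough illustration of what the second notebook produces, the saved attention weights for one input window can be rendered as a heat-map-style saliency map. The array shape and file name below are assumptions, not the notebooks' exact format:

```python
import numpy as np
import matplotlib.pyplot as plt

attn = np.load('attention_weights.npy')   # assumed shape: (num_heads, window, window)
avg = attn.mean(axis=0)                   # average the heads into a single map

fig, ax = plt.subplots()
im = ax.imshow(avg, cmap='viridis')
ax.set_xlabel('attended log key position')
ax.set_ylabel('query log key position')
fig.colorbar(im, ax=ax, label='attention weight')
fig.savefig('saliency_map.png')
```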
Saved centralized models:
- centralized_model: 2 encoders/decoders
- centralized_models: 1 encoder/decoder
- centralized_models_practicum: 6 encoders/decoders
Saved global (federated) models, where a name such as 2c_1l_2h stands for 2 clients, 1 layer, 2 heads:
- global_model_2c_1l_2h: 1 encoder/decoder
- global_models: 1 encoder/decoder
- global_models_practicum: 4 encoders/decoders
- global_models_practicum2: 4 encoders/decoders
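To evaluate one of these saved models, a checkpoint can be loaded with torch.load; the file name inside the directory is hypothetical, and depending on how the checkpoint was saved it may be a full pickled model or a state_dict:

```python
import torch

# The directory name comes from the list above; the file name is hypothetical.
checkpoint = torch.load('Model/global_model_2c_1l_2h/model.pt', map_location='cpu')
# If this is a state_dict, first rebuild the matching architecture (layers and
# heads as described above) and then call model.load_state_dict(checkpoint).
```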