Datasets used for the paper A Robust Cybersecurity Topic Classification Tool
Training and validation data (unprocessed English text) for the Cybersecurity Topic Classification (CTC) tool.
All validation data is English natural text, written to individual text files.
The training dataset is too large to upload to GitHub. The full training text can be downloaded from https://zenodo.org/records/10655913. The training data is a JSON file containing an array of English text samples.
All data is compressed into tar.gz format to save storage space.
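As a rough illustration of the formats described above, the following Python sketch unpacks a tar.gz archive, loads a training file as a JSON array, and reads validation samples from individual text files. The function names, the `.txt` extension, the UTF-8 encoding, and any archive or file names you pass in are assumptions made for this example; they are not specified by this repository.

```python
import json
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest, **kwargs)
    return dest


def load_training_samples(json_path):
    """Load the training file, assumed to be a JSON array of text samples."""
    with open(json_path, encoding="utf-8") as f:
        samples = json.load(f)
    if not isinstance(samples, list):
        raise ValueError("expected a JSON array of text samples")
    return samples


def load_validation_texts(root_dir, pattern="*.txt"):
    """Read every matching text file under root_dir into a {relative path: text} dict.

    The "*.txt" pattern is an assumption; adjust it to the actual file names.
    """
    root = Path(root_dir)
    return {
        str(p.relative_to(root)): p.read_text(encoding="utf-8")
        for p in sorted(root.rglob(pattern))
    }
```

To use it, call `extract_archive` on a downloaded archive, then pass the extracted JSON file to `load_training_samples` or the extracted directory to `load_validation_texts`.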
top_500_subreddits.txt
This list was used to crawl a large corpus of Reddit text. Note that it is certainly out of date now.