epelofske-student/CTC

CTC

Datasets used for the paper "A Robust Cybersecurity Topic Classification Tool".

Training and validation data (unprocessed English text) for the Cybersecurity Topic Classification (CTC) tool.

All validation data is natural-language English text, stored as individual text files.

The training dataset is too large to upload to GitHub. The full training text can be downloaded from https://zenodo.org/records/10655913. The training data is a JSON file containing an array of English text samples.

All data is compressed as tar.gz archives to save storage space.
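
As a rough illustration of working with these files, the following Python sketch unpacks a tar.gz archive, loads a JSON array of text samples, and reads a directory of text files. It is not part of the CTC tool. The function names are hypothetical, and the `.txt` extension for validation files is an assumption; adjust both to match the files you actually download.

```python
import json
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives obtained from a source you trust.
        tar.extractall(dest)
    return dest


def load_training_samples(json_path):
    """Load a JSON file that holds a single array of English text samples."""
    with open(json_path, encoding="utf-8") as handle:
        samples = json.load(handle)
    if not isinstance(samples, list):
        raise ValueError("expected the JSON file to contain an array")
    return samples


def read_validation_texts(directory):
    """Map each text file under directory (relative path) to its contents.

    Assumption: validation files use a .txt extension.
    """
    root = Path(directory)
    texts = {}
    for path in sorted(root.rglob("*.txt")):
        texts[path.relative_to(root).as_posix()] = path.read_text(encoding="utf-8")
    return texts
```

A typical use would be to call `extract_archive` on a downloaded archive, then pass the extracted JSON file to `load_training_samples` or the extracted folder to `read_validation_texts`. Because the full training file is large, loading it this way holds every sample in memory at once.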

The list in top_500_subreddits.txt was used to crawl a large corpus of Reddit text. Note that this list is certainly out of date now.
