A labeled dataset of UAV-related vulnerabilities, categorized by CWE class, is created from two sources. The first source is SARD, which contains test cases written in C, C++, Java, PHP, and C# and covers more than 150 categories of weaknesses according to the CWE taxonomy [25]. The second source is the CVE real-world dataset [13], which is maintained in the NVD [14]. The dataset was created in three phases:
- Sampling from SARD: This phase collects function-level code samples from SARD that correspond to the 52 unique CWE categories identified by the three SATs and the NVD review. Benign code samples were also collected to balance the dataset. This phase yielded 78,987 CWE samples and 362,845 benign samples, for a total of 441,832 samples.
- Sampling from NVD: This phase collects samples of vulnerable code associated with UAV systems that have been reported to the NVD, for example, CVE-2024-38951 [17], CVE-2023-46256 [18], and CVE-2023-47625 [19]. Because the NVD contains few code samples, additional code samples of the same CWE types were collected from the CVE dataset [13], which is part of the NVD [14]. This phase yielded 8,525 CWE samples.
- Final dataset assembly: This phase combines all samples from phases 1 and 2 into a unified final dataset called UAVulDB01. The result is 87,512 CWE samples, representing vulnerable code in 52 unique CWE categories, and 362,845 benign samples, for a total of 450,357 samples. Each sample in UAVulDB01 consists of a function-level code snippet and two tags. The code snippet is the sample's feature. The first tag, "Mtag", holds the CWE category or marks the code as benign, so that UAVulCode can be evaluated on multi-class classification. The second tag, "Btag", marks the code as Vulnerable or Benign, so that UAVulCode can be evaluated on binary classification. For illustration, Example 1 shows a sample from the dataset associated with CWE-120. The sample's feature is the function's code snippet, its Mtag is "CWE-120", and its Btag is "Vulnerable".

Example 1: A CWE-120 example from UAVulDB01

    void Function() {
        char data[100];
        /* BAD SOURCE: gets() allows buffer overflow */
        gets(data);
        char dest[50];
        /* BAD SINK: unbounded copy */
        strcpy(dest, data);
        printf("%s\n", dest);
    }
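The assembly phase and the two-tag labeling scheme can be illustrated with a minimal Python sketch. This is not the authors' tooling: the names `Sample`, `assemble`, `summarize`, and the `"Benign"` Mtag value are hypothetical, and deriving the Btag from the Mtag is an assumption made here for brevity. The toy samples stand in for the SARD and NVD phases.

```python
from collections import Counter
from dataclasses import dataclass

BENIGN = "Benign"  # assumed Mtag value for non-vulnerable code


@dataclass(frozen=True)
class Sample:
    """One dataset record (hypothetical layout)."""
    code: str  # function-level code snippet (the feature)
    mtag: str  # multi-class tag: a CWE ID such as "CWE-120", or BENIGN

    @property
    def btag(self) -> str:
        # Binary tag, derived here from the multi-class tag (assumption).
        return "Benign" if self.mtag == BENIGN else "Vulnerable"


def assemble(*sources):
    """Concatenate the per-phase sample lists into one dataset."""
    dataset = []
    for source in sources:
        dataset.extend(source)
    return dataset


def summarize(dataset):
    """Count samples per binary class and distinct CWE categories."""
    counts = Counter(sample.btag for sample in dataset)
    return {
        "vulnerable": counts["Vulnerable"],
        "benign": counts["Benign"],
        "total": len(dataset),
        "cwe_classes": len({s.mtag for s in dataset if s.mtag != BENIGN}),
    }


# Toy stand-ins for phase 1 (SARD) and phase 2 (NVD) samples.
sard_samples = [
    Sample("void f() { char d[100]; gets(d); }", "CWE-120"),
    Sample('void g() { puts("ok"); }', BENIGN),
]
nvd_samples = [
    Sample("void h(char *s) { char b[8]; strcpy(b, s); }", "CWE-120"),
]

dataset = assemble(sard_samples, nvd_samples)
summary = summarize(dataset)
```

Applied to the full data, the same summary should reproduce the reported totals: 78,987 + 8,525 = 87,512 vulnerable samples, and 87,512 + 362,845 = 450,357 samples overall.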