The security assurance of AI-enabled software systems, particularly those using deep learning (DL) techniques as a functional core, is pivotal for defending against adversarial attacks that exploit software vulnerabilities. However, little attention has been paid to systematically investigating vulnerabilities in such systems. A common practice observed in the open-source software community is that deep learning engineers frequently integrate off-the-shelf or open-source learning frameworks into their ecosystems. In this work, we specifically look into deep learning frameworks and perform the first systematic study of vulnerabilities in DL systems through a comprehensive analysis of vulnerabilities identified in Common Vulnerabilities and Exposures (CVE) records and in open-source DL tools, including TensorFlow, Caffe, OpenCV, Keras, and PyTorch. To this end, we propose a two-stream data-analysis framework to explore vulnerability patterns across several databases. We investigate the development ecosystems of DL frameworks and libraries, which appear to be decentralized and fragmented. By revisiting the Common Weakness Enumeration (CWE) List, which captures traditional software-vulnerability practices, we observe that it is more challenging to detect and fix vulnerabilities throughout the DL system lifecycle. Moreover, we conduct a large-scale empirical study of 3,049 DL vulnerabilities to better understand their patterns and the challenges in fixing them.
- RQ1: What are the common root causes of deep-learning-specific vulnerabilities in DL frameworks?
- RQ2: What are the challenges in detecting these vulnerabilities?
- RQ3: What are the main challenges in fixing these vulnerabilities, and how can they be addressed?
Figure 1: Data collection and process steps
To collect the data for the manual analysis in our empirical study of DL system vulnerabilities, we implement the data collection process shown in Figure 1. We collect data from two sources: the official vulnerability database NVD and the frameworks' GitHub repositories.
- For the official data, we implement an `NVD Crawler` to query CVE records from the NVD database by DL framework name. In addition, for TensorFlow, which maintains well-curated security advisories, we implement a `TF Security Advisory Parser` to extract the useful information from all security advisory files. These data serve as a reference to improve the performance of our latent-vulnerability detection keywords.
- For the framework repositories, we clone the five DL framework projects to a local machine and execute `GH Filter`, integrated with the GitHub CLI command `gh`, to extract all Pull Request (PR), commit, and issue logs from these repositories. As the `gh` command returns only simplified log information, we then run `GH Crawler`, which calls the GitHub REST API, to harvest complete PR records. Next, we apply a latent-vulnerability search over the PR records to find vulnerability patches. Finally, `GH Crawler` crawls the complete commit information for the corresponding vulnerability patches.

After the aforementioned steps, we have collected all data required for the manual analysis, which includes independent manual labeling and disagreement resolution.
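The latent-vulnerability search over PR records described above can be sketched as a simple keyword filter. The keyword list and record fields below are illustrative assumptions, not the exact ones used by `vuln_detector`:

```python
# Hypothetical sketch of a latent-vulnerability keyword search over PR records.
# The keyword list and dictionary fields are illustrative assumptions only.
VULN_KEYWORDS = ["overflow", "out-of-bounds", "use-after-free", "segfault",
                 "CVE", "security", "vulnerability"]

def is_latent_vuln_patch(pr: dict) -> bool:
    """Return True if a PR's title or body mentions a vulnerability keyword."""
    text = (pr.get("title", "") + " " + (pr.get("body") or "")).lower()
    return any(kw.lower() in text for kw in VULN_KEYWORDS)

def filter_vuln_patches(prs: list) -> list:
    """Keep only PR records that look like vulnerability patches."""
    return [pr for pr in prs if is_latent_vuln_patch(pr)]
```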
- `data/raw` contains data collected directly from the sources, without manipulation.
- `data/distilled` contains data automatically generated by scripts or programs.
- `data/manual` contains data produced by the manual analysis.
To run the project, follow these steps:
- Before running the code, please ensure these directories exist (create them if they do not):
  - `data`
  - `data/raw`
  - `data/distilled`
  - `data/manual`
- A `python3` environment is required to run the project.
- To use `gh_filter`, the directory pointing to the local clone of each framework repository has to be properly specified.
- To use `gh_crawler`, the `auth` parameter in `settings.py` must be configured. The `auth` parameter should hold a GitHub account and token in tuple format, e.g. `auth: (<ACCOUNT_NAME>, <TOKEN>)`.
- This code has been tested on the `Windows 10` and `macOS` platforms.
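As a sketch, the `auth` entry in `settings.py` might look like the following; only the tuple format is described above, so the surrounding file layout is an assumption:

```python
# settings.py (illustrative fragment -- only `auth` is documented in the README).
# GitHub account name and personal access token as a tuple.
auth = ("<ACCOUNT_NAME>", "<TOKEN>")  # replace the placeholders with your own values
```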
Step 1: Query CVE records from the NVD database

Run the script:

```shell
$ cd src/nvd_crawler
$ python app.py
```

Step 2: Parse the TensorFlow security advisories

You need to manually download the security advisory files and save them into `raw/security_advisory_tensorflow`. Then run the script:

```shell
$ cd src/tf_security_advisory_parser
$ python app.py
```

Step 3: Extract PR, commit, and issue logs from the framework repositories

You need to manually clone those repositories on your local machine and set up the GitHub CLI environment (the script calls the `gh` command). Then configure `settings.py` in `src/gh_filter` and run the script:

```shell
$ cd src/gh_filter
$ python app.py
```

Step 4: Harvest complete PR information

Run the script:

```shell
$ cd src/gh_crawler
$ python crawler_pr.py
```

Step 5: Detect latent vulnerability patches among the PRs

Run the script:

```shell
$ cd src/vuln_detector
$ python app.py
```

Step 6: Harvest completed commits information for vulnerabilities patches (pull requests from Step 5)

Run the script:

```shell
$ cd src/gh_crawler
$ python crawler_commit.py
```

After the steps mentioned above, all data required for the manual analysis are ready.
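For reference, harvesting the commits of a vulnerability-patch PR through the GitHub REST API can be sketched as follows. This is a minimal illustration using the public `pulls/{number}/commits` endpoint, not the actual `crawler_commit.py` implementation:

```python
# Minimal sketch of fetching a PR's commits via the GitHub REST API.
# Not the project's crawler_commit.py -- endpoint usage only.
import json
import urllib.request

API = "https://api.github.com"

def pr_commits_url(owner: str, repo: str, pr_number: int) -> str:
    """Endpoint that lists the commits belonging to a pull request."""
    return f"{API}/repos/{owner}/{repo}/pulls/{pr_number}/commits"

def fetch_pr_commits(owner: str, repo: str, pr_number: int, token: str = None):
    """Fetch the commit list; a token raises the authenticated rate limit."""
    req = urllib.request.Request(pr_commits_url(owner, repo, pr_number))
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```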
- If you encounter an `SSL` error while running the crawlers, you can configure a `proxy` in the settings file and enable it in the code.
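A hypothetical shape for such a proxy entry in the settings file; the key names below are assumptions, so check how the crawler code reads them before enabling:

```python
# Illustrative proxy fragment for settings.py -- names are assumptions,
# not the project's actual configuration keys.
enable_proxy = True
proxy = {
    "http": "http://127.0.0.1:8080",   # replace with your proxy address
    "https": "http://127.0.0.1:8080",
}
```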
