Kaitlyn DeValk | March 17, 2022
Scraping the ExploitDB repository to use the data in a visualization project. The database metrics will change depending on when you pull the data.
Add the repository to my own as a submodule:
git submodule add https://github.com/offensive-security/exploitdb
Fields needed for each exploit:
- Date of release
- Exploit Title
- Type
- Platform
- Tag(s)
- Filename
- File Extension
- Filesize
- File hash
The file files_exploits.csv and the repository data contain the following variables with accurate information:
- Filename, part of the `file` variable
- File Extension, part of the `file` variable
- Exploit Title, the `description` variable
- Type
- Platform
- Filesize (Can extract this from the repository files)
- Hash (Can extract this from the files)
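As a rough illustration of pulling the last two fields from the files themselves, here is a minimal sketch. The helper name `file_size_and_hash` is hypothetical, and SHA-256 is an assumption, since the hash algorithm is not specified here.

```python
import hashlib
from pathlib import Path


def file_size_and_hash(path, algorithm="sha256", chunk_size=65536):
    """Return (size_in_bytes, hex_digest) for the file at `path`.

    The algorithm defaults to SHA-256, which is an assumption; any name
    accepted by hashlib.new() can be passed instead.
    """
    path = Path(path)
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return path.stat().st_size, digest.hexdigest()
```

Hashing the raw bytes means two files are treated as identical only when their contents match exactly.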
The Date was included as a field, but every exploit was labeled with the same release date: 1970-01-01.
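One quick way to confirm this kind of problem is to count the distinct values in the date column. The sketch below assumes the column is named `date`; adjust `date_column` if the header differs in your copy of the CSV. The function name is hypothetical.

```python
import csv
from collections import Counter


def date_counts(csv_file, date_column="date"):
    """Count how often each value appears in the date column.

    `csv_file` is any open text file (or file-like object) with a header row.
    The column name "date" is an assumption about the CSV layout.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[date_column] for row in reader)


# Example usage (path is whatever your local checkout uses):
# with open(path_to_csv, newline="", encoding="utf-8") as f:
#     print(date_counts(f).most_common(3))
```

If the result has a single key, every row carries the same date and the column is not usable.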
The two remaining fields that I would have needed to scrape from the GitHub repository or the website were:
- Date
- Tag(s)
After working with the data in the CSV, I decided to scrape everything except the exploit files themselves from the ExploitDB website. There appeared to be no easy way to get accurate dates or tags from the GitHub repository, and merging two separate datasets would have been more work than getting everything from one location. This was done using the scraper.py script.
This scraper was built to create a dataset that could be imported into Tableau for visualization. The parser.py script extracts the relevant data fields, removes duplicate exploits (those that share the same files), and writes the result to a CSV file for easy import into Tableau. Exploits written for the same CVE are not considered duplicates if they are written in different languages or use a different attack vector, even when they target the same vulnerability, because multiple exploits for one vulnerability show how impactful it was.
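The duplicate rule can be sketched as a filter on the file hash. This is an illustration only, not the actual parser.py code; the function name and the `file_hash` key are hypothetical.

```python
def drop_duplicate_files(exploits, hash_key="file_hash"):
    """Keep the first exploit seen for each file hash and drop the rest.

    `exploits` is an iterable of dicts; `hash_key` names the field holding
    the file hash (a hypothetical field name for this sketch).
    """
    seen = set()
    unique = []
    for exploit in exploits:
        file_hash = exploit[hash_key]
        if file_hash in seen:
            continue  # Same file contents as an earlier entry: a duplicate.
        seen.add(file_hash)
        unique.append(exploit)
    return unique
```

Because the comparison is on file contents rather than on the CVE, two exploits for the same CVE written in different languages have different hashes and both stay in the dataset.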
I also added columns or "tags" for the exploits to compare each one to the MITRE CWE Software Top 25 list for 2021.
I weight a tag match at double a title match because, when a tag is present, it is an incredibly good indicator of the exploit's category, but tags are not always available.
An exploit that scores 0 for all 25 CWEs is not mapped at all, which removes about half of the data points for 2021.
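The weighting can be sketched as follows. The keyword lists are hypothetical placeholders covering only two CWEs (the real matching rules are not described here), and simple substring matching is an assumption.

```python
TAG_WEIGHT = 2    # A tag match counts double.
TITLE_WEIGHT = 1  # A title match counts once.

# Hypothetical keyword lists for illustration; only two of the 25 CWEs shown.
CWE_KEYWORDS = {
    "CWE-79": ["xss", "cross-site scripting"],
    "CWE-89": ["sql injection", "sqli"],
}


def score_exploit(title, tags, cwe_keywords=CWE_KEYWORDS):
    """Return a {cwe_id: score} dict for one exploit."""
    title_lc = title.lower()
    tags_lc = [tag.lower() for tag in tags]
    scores = {}
    for cwe, keywords in cwe_keywords.items():
        in_tag = any(kw in tag for kw in keywords for tag in tags_lc)
        in_title = any(kw in title_lc for kw in keywords)
        scores[cwe] = TAG_WEIGHT * in_tag + TITLE_WEIGHT * in_title
    return scores


def is_mapped(scores):
    """An exploit is mapped only if at least one CWE scored above 0."""
    return any(value > 0 for value in scores.values())
```

With these weights, a score of 3 means both the tag and the title matched, 2 means only a tag matched, and 1 means only the title matched.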
exploitdb/exploits is organized into platform subdirectories, each of which contains the type subdirectories underneath:
+ exploits
----+ platform
--------+ type
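Given that layout, the platform and type can be read straight from a file's path. A minimal sketch, assuming forward-slash paths and a hypothetical helper name; the example filename in the comment is made up.

```python
from pathlib import PurePosixPath


def platform_and_type(exploit_path, root="exploits"):
    """Return (platform, type) from a path like exploits/<platform>/<type>/<file>.

    Raises ValueError if `root` is not a component of the path, and
    IndexError if the path has fewer than two components after `root`.
    """
    parts = PurePosixPath(exploit_path).parts
    root_index = parts.index(root)
    return parts[root_index + 1], parts[root_index + 2]


# Example with a made-up filename:
# platform_and_type("exploitdb/exploits/windows/remote/12345.py")
```

Looking up the `exploits` component by name, rather than by position, keeps the function working whether the path is relative to the submodule or to the parent repository.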
The data is not uploaded because it is nearly 50 MB.