ExploitDBScraper

Kaitlyn DeValk | March 17, 2022

Scraping the ExploitDB repository to use the data in a visualization project. The database metrics will change depending on when you pull the data.

Add ExploitDB Repository as submodule

Add the ExploitDB repository to this one as a submodule:

git submodule add https://github.com/offensive-security/exploitdb

Important Data Fields

Date of release
Exploit Title
Type
Platform
Tag(s)
Filename
File Extension
Filesize
File hash

The file files_exploits.csv and the repository data contain the following variables with accurate information:

Filename (part of the file variable)
File Extension (part of the file variable)
Exploit Title (the description variable)
Type
Platform
Filesize (can be extracted from the repository files)
Hash (can be extracted from the repository files)
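As a minimal sketch of how the filename, extension, filesize, and hash could be pulled from a file in the repository: the function name describe_file is hypothetical, and SHA-256 is an assumption, since the hash algorithm actually used is not stated here.

```python
import hashlib
from pathlib import Path


def describe_file(path):
    """Return filename, extension, size in bytes, and a hash for one file.

    Hypothetical helper; SHA-256 is an assumed choice of hash algorithm.
    """
    p = Path(path)
    digest = hashlib.sha256()
    with p.open("rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return {
        "filename": p.name,
        "extension": p.suffix.lstrip("."),
        "filesize": p.stat().st_size,
        "hash": digest.hexdigest(),
    }
```

Calling describe_file on any file under exploitdb/exploits returns a dictionary that can be merged with the corresponding CSV row.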

The Date was included as a field, but every exploit was labeled with the same release date: 1970-01-01.

That left two fields that I still needed to scrape from the GitHub repository or the website:

Date
Tag(s)

After working with the data in the CSV, I decided to scrape everything except the actual files from the ExploitDB website. There appeared to be no easy way to get accurate dates or tags from the GitHub repository, and merging two separate datasets would have been more work than getting everything from one location. This was done using the scraper.py script.

Data Parsing

This scraper was made to create a dataset that can be imported into Tableau for visualization. The parser.py script extracts the relevant data fields, removes duplicate exploits (those that share the same files), and writes the result to a CSV file for easy import into Tableau. Exploits created for the same CVE are not considered duplicates if they are written in different languages or use a different attack vector, even when they target the same vulnerability, because multiple exploits for one vulnerability show how impactful it was.
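The duplicate rule above could be sketched as follows. This is not the code from parser.py; it assumes each record is a dictionary with a "hash" key holding the hash of the exploit file, and the function name drop_duplicates is hypothetical.

```python
def drop_duplicates(records):
    """Keep the first record seen for each distinct file hash.

    Assumes every record is a dict with a "hash" key (an assumption,
    not the actual parser.py data layout).
    """
    seen = set()
    unique = []
    for rec in records:
        if rec["hash"] in seen:
            # Same file contents as an earlier record: treat as a duplicate.
            continue
        seen.add(rec["hash"])
        unique.append(rec)
    return unique
```

Because the key is the file hash and not the CVE, two exploits for the same CVE are both kept as long as their files differ.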

I also added columns, or "tags", that compare each exploit to the MITRE CWE Software Top 25 list for 2021.

I weight a tag match at double a title match because a tag, when present, is an incredibly good indicator of the exploit's category, but tags are not always available.
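A minimal sketch of this weighting, assuming matches are found by simple keyword search: the keyword lists, the constants, and the function name score_exploit are all hypothetical and only illustrate the two-to-one ratio between tag and title.

```python
TAG_WEIGHT = 2    # a tag match counts double
TITLE_WEIGHT = 1  # a title match counts once

# Hypothetical keyword lists for two CWE entries, for illustration only.
CWE_KEYWORDS = {
    "CWE-79": ["xss", "cross-site scripting"],
    "CWE-89": ["sql injection", "sqli"],
}


def score_exploit(title, tags):
    """Return a score per CWE from keyword matches in the title and tags."""
    title_l = title.lower()
    tags_l = [t.lower() for t in tags]
    scores = {}
    for cwe, keywords in CWE_KEYWORDS.items():
        score = 0
        if any(k in title_l for k in keywords):
            score += TITLE_WEIGHT
        if any(k in t for t in tags_l for k in keywords):
            score += TAG_WEIGHT
        scores[cwe] = score
    return scores
```

With these weights, an exploit matched by both its title and a tag scores 3 for that CWE, one matched only by a tag scores 2, and one matched only by its title scores 1.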

An exploit that scores 0 for all 25 CWEs is treated as not mapped at all, which takes out about half the data points for 2021.
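Separating the unmapped exploits could look like the sketch below. It assumes the scores are held as a dictionary from exploit ID to a per-CWE score dictionary; that layout and both function names are hypothetical.

```python
def is_unmapped(scores):
    """True when an exploit scored 0 against every CWE."""
    return all(value == 0 for value in scores.values())


def split_by_mapping(scored):
    """Split exploit IDs into (mapped, unmapped) lists.

    `scored` maps an exploit ID to its {cwe: score} dict (assumed layout).
    """
    mapped, unmapped = [], []
    for exploit_id, scores in scored.items():
        if is_unmapped(scores):
            unmapped.append(exploit_id)
        else:
            mapped.append(exploit_id)
    return mapped, unmapped
```

Comparing the lengths of the two lists gives the share of exploits that could not be mapped to any of the 25 CWEs.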

Notes

exploitdb/exploits is organized into platform subdirectories, each of which contains type subdirectories:

+ exploits
----+ platform
	----+ type
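Given this layout, the platform and type of an exploit can be read from its path. The sketch below assumes paths of the form exploitdb/exploits/platform/type/file; the function name platform_and_type and the default root are hypothetical.

```python
from pathlib import Path


def platform_and_type(exploit_path, root="exploitdb/exploits"):
    """Return (platform, type) for a path under the exploits directory.

    Assumes the layout root/platform/type/file described above.
    """
    rel = Path(exploit_path).relative_to(root)
    if len(rel.parts) < 3:
        raise ValueError(f"expected platform/type/file under {root}: {exploit_path}")
    # The first two path components are the platform and the type.
    return rel.parts[0], rel.parts[1]
```

For example, a file stored at exploitdb/exploits/linux/remote/12345.py would yield the platform linux and the type remote.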

The data is not uploaded because it is nearly 50 MB.

