Leverages publicly available datasets from Google BigQuery to generate content discovery and subdomain wordlists
infosec-au Merge pull request #4 from mhmdiaa/master
Add deleted files dataset based on GitHub commit messages
Latest commit f0aad23 Aug 24, 2018




Commonspeak2 leverages publicly available datasets from Google BigQuery to generate content discovery and subdomain wordlists.

As these datasets are updated on a regular basis, the wordlists generated via Commonspeak2 reflect the current technologies used on the web.

By using the Golang client for BigQuery, we can stream the data and process it very quickly. The future of this project will revolve around improving the quality of wordlists generated by creating automated filters and substitution functions.

Let's turn wordlist creation from a manual task into a reproducible and reliable science with BigQuery.

I just want the wordlists...

We will update the commonspeak2-wordlists repo with any wordlists generated by the Commonspeak2 tool.

More infrastructure will be developed to deliver wordlists continuously and this section will be updated in the future.

Instructions & Usage

If you're compiling or running Commonspeak2 from source:

If you're using the pre-built binaries:

  • Download the newest release here

Upon completing the above steps, Commonspeak2 can be used in the following ways:


Subdomains

Currently, subdomains are extracted from HackerNews and HTTPArchive's latest scans. Unlike the previous revision of Commonspeak, the datasets and queries have been optimised to return valid data that occurs often in the wild.

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json subdomains -o subdomains.txt

INFO[0000] Generated SQL template for HackerNews.        Mode=Subdomains
INFO[0000] Generated SQL template for HTTPArchive.       Mode=Subdomains
INFO[0000] Executing BigQuery SQL... this could take some time.  Mode=Subdomains Source=hackernews
INFO[0019] Total rows extracted 71415.                   Mode=Subdomains Silent=false Source=hackernews Verbose=false
INFO[0019] Executing BigQuery SQL... this could take some time.  Mode=Subdomains Source=httparchive
INFO[0075] Total rows extracted 484701.                  Mode=Subdomains Silent=false Source=httparchive Verbose=false
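
To illustrate the idea behind subdomain extraction, here is a minimal Python sketch that turns a list of URLs into a frequency-ordered list of subdomain labels. It is not Commonspeak2's implementation (the tool is written in Go and does this work in BigQuery SQL). The function names are hypothetical, and the sketch assumes the registered domain is always the last two host labels, which is wrong for suffixes such as .co.uk.

```python
from collections import Counter
from urllib.parse import urlparse


def subdomain_labels(url):
    """Return the host labels to the left of the last two labels.

    Assumption: the registered domain is the last two labels
    (e.g. example.com). A public suffix list is needed to handle
    multi-part suffixes such as .co.uk correctly.
    """
    host = urlparse(url).hostname
    # Skip URLs without a host and plain IPv4 addresses.
    if not host or host.replace(".", "").isdigit():
        return []
    labels = host.lower().split(".")
    return labels[:-2]


def build_wordlist(urls, min_count=1):
    """Count subdomain labels across all URLs, most frequent first."""
    counts = Counter()
    for url in urls:
        counts.update(subdomain_labels(url))
    return [word for word, n in counts.most_common() if n >= min_count]
```

Raising `min_count` is one simple way to keep only labels that occur often, in the spirit of the "occurs often in the wild" goal described above.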

Words with extensions

Using a single query on GitHub's dataset, we can extract every path filtered by file extension. This can be done with:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json ext-wordlist -e jsp -l 100000 -o jsp.txt

INFO[0000] Executing BigQuery SQL... this could take some time.  Extensions=jsp Limit=100000 Mode=WordsWithExt Source=Github
INFO[0013] Total rows extracted 100000.                  Mode=WordsWithExt Source=Github

Any set of extensions can be passed via the -e flag as a comma-separated list, e.g. -e aspx,php,html,js.
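
The filtering step can be pictured with the following Python sketch, which keeps file names whose extension appears in a comma-separated list. This is an illustration only: Commonspeak2 performs the filtering inside a BigQuery query, and `filter_by_extension` is a hypothetical helper whose `extensions` and `limit` parameters merely mirror the -e and -l flags.

```python
import posixpath


def filter_by_extension(paths, extensions, limit=None):
    """Return unique file names whose extension is listed in `extensions`.

    `extensions` is a comma-separated string such as "aspx,php,html,js".
    Matching is case-insensitive and input order is preserved.
    """
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions.split(",") if ext}
    seen = set()
    words = []
    for path in paths:
        # Stop once the requested number of words has been collected.
        if limit is not None and len(words) >= limit:
            break
        name = posixpath.basename(path)
        ext = posixpath.splitext(name)[1].lower()
        if ext in wanted and name not in seen:
            seen.add(name)
            words.append(name)
    return words
```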

Deleted files

Using GitHub's commits dataset, we can extract files that developers appear to have deleted from their public repositories. These files may contain sensitive data. This can be done with:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json deleted-files -l 50000 -o deleted.txt

INFO[0000] Executing BigQuery SQL... this could take some time.  Limit=50000 Mode=DeletedFiles Source=Github
INFO[0013] Total rows extracted 50000.                  Mode=DeletedFiles Source=Github
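
As a rough illustration of how deleted files could be recovered from commit messages, the Python sketch below looks for subject lines of the form "Delete <file>" or "Remove <file>". The pattern and the `deleted_paths` helper are assumptions made for this example; the actual query Commonspeak2 runs against the GitHub dataset may match differently.

```python
import re

# Assumption: the subject line names exactly one file with an extension,
# e.g. "Delete config.php" or "Removed secrets/.env.bak".
DELETE_RE = re.compile(
    r"^(?:delete|deleted|remove|removed)\s+(\S+\.\w+)\s*$",
    re.IGNORECASE,
)


def deleted_paths(messages):
    """Return the file paths named in delete-style commit subject lines."""
    found = []
    for message in messages:
        lines = message.strip().splitlines()
        if not lines:
            continue
        # Only the first line (the commit subject) is inspected.
        match = DELETE_RE.match(lines[0])
        if match:
            found.append(match.group(1))
    return found
```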

Features in Active Development

Feel free to send pull requests to complete the features below, add datasets or improve the architecture of this project. Thank you!

Routes Based Extraction

We can create SQL statements that cover routing patterns in almost any web framework. For now, paths can be extracted from the following web frameworks:

  • Rails [working implementation]
  • NodeJS [to be implemented]
  • Tomcat [to be implemented]

This data can be extracted using the following command:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json routes --frameworks rails -l 100000 -o rails-routes.txt

WARNING: running the above query is expensive (over $20 per framework). Commonspeak2 will prompt you to confirm that this is OK. To skip this prompt, use the --silent flag.

When this is run for Rails routes, Commonspeak2 does the following:

  1. Pulls Rails routes from config/routes.rb using a regular expression and the latest GitHub dataset.
  2. Processes the data, converts it into paths and makes contextual replacements so that each path is valid (e.g. converting /:id to /1234).
  3. Normalizes the path, finally saving to disk after all the processing is complete.
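
The three steps above can be sketched in Python as follows. This is a simplified illustration rather than the tool's Go and SQL implementation: the regular expressions and the `routes_to_paths` name are assumptions, only explicit verb routes (get, post and so on) are handled, and every dynamic segment is replaced with 1234.

```python
import re

# Matches lines such as: get '/users/:id', to: 'users#show'
ROUTE_RE = re.compile(
    r"""^\s*(?:get|post|put|patch|delete|match)\s+['"]([^'"]+)['"]""",
    re.MULTILINE,
)
OPTIONAL_RE = re.compile(r"\([^)]*\)")   # optional segments like (.:format)
PARAM_RE = re.compile(r":[A-Za-z_]\w*")  # dynamic segments like :id


def routes_to_paths(routes_rb):
    """Extract route patterns from routes.rb text and return usable paths."""
    paths = []
    seen = set()
    for raw in ROUTE_RE.findall(routes_rb):
        # Step 2: drop optional segments, then replace parameters.
        path = OPTIONAL_RE.sub("", raw)
        path = PARAM_RE.sub("1234", path)
        # Step 3: normalize slashes and remove duplicates.
        path = "/" + re.sub(r"/{2,}", "/", path).strip("/")
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths
```

Optional segments are removed before parameters are replaced so that a pattern such as (.:format) does not leave a stray "(.1234)" in the output.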

Scheduled Wordlist Generation

This planned feature will use a cron-like system so that wordlists can be generated from BigQuery continuously.

When this feature is introduced, the --schedule parameter can be added to any of the pre-existing commands covered in this README, like so:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json --schedule weekly routes --frameworks nodejs,tomcat -l 100000 -o nodejs-tomcat-routes.txt

The above command will run the BigQuery query weekly and save the output to ./nodejs-tomcat-routes.txt.

Substitutions and Alterations

Generate smart substitutions and alterations for the datasets where it makes sense. For example, converting a route pattern such as /admin/users/:id to /admin/users/1234 (aware from context that the placeholder is a number).
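
One possible shape for such context-aware substitution is sketched below in Python. Since this feature is still in development, everything here is hypothetical: the `SAMPLE_VALUES` table, the default value and the `substitute` function are assumptions chosen for illustration, not the project's design.

```python
import re

# Hypothetical mapping from placeholder-name patterns to sample values.
# Order matters: "uuid" also ends in "id", so it is checked first.
SAMPLE_VALUES = [
    (re.compile(r"uuid$"), "123e4567-e89b-12d3-a456-426614174000"),
    (re.compile(r"id$"), "1234"),
    (re.compile(r"(slug|name)$"), "example"),
]
DEFAULT_VALUE = "1"


def substitute(path):
    """Replace each :placeholder with a value suited to its name."""
    def pick(match):
        name = match.group(1).lower()
        for pattern, value in SAMPLE_VALUES:
            if pattern.search(name):
                return value
        return DEFAULT_VALUE

    return re.sub(r":([A-Za-z_]\w*)", pick, path)
```

A table of name patterns keeps the replacement rules in one place, so new placeholder types can be supported by adding a row rather than changing the substitution logic.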


Shubham Shah @infosec_au

Michael Gianarakis @mgianarakis


   Copyright 2018 Assetnote

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

Assetnote Pty. Ltd. - Twitter @assetnote