A domain munger to munge domains

dMunge (Domain Munger)

dMunge is a utility that takes a plain text file and extracts domain name information wherever it can. If you think this is simple, try it yourself sometime.

Designed by Spamfighter, Coded by Adam.

Notable Features

  • Reduces domains down to the correct TLD or 2nd- or 3rd-level TLD to make abuse reports informative and allow them to be channeled to the appropriate responsible parties.
  • dMunge has a whitelist to help avoid false positives as well as an 'Alexa Top 1 Million' domain cross-check for the same reason.
  • Fast. On a 2012 MacBook, dMunge can process 30,000 lines in 3 seconds and 15M lines in 15 minutes.
  • Robust. It doesn't care if the file is full of garbage. It will do what it can.
  • Friendly to both individual users and multi-user (server-based) setups.

How it works

dMunge first tries to extract a top level domain (TLD). It uses the Public Suffix List (which is the ICANN list and some private 3rd/4th-level domains) to do this. The PSL is extended by any domain names you put into a file named splist.csv.

If it succeeds in extracting a TLD, it looks for a hostname immediately to the left of the TLD. If it succeeds in finding a hostname, it joins the two together as a "reduced domain name" and compares it to a whitelist and to the Alexa Top 1M.

If the reduced domain name is found in either the whitelist or the Alexa Top 1M, it is not processed, and the original line is added to a results file named "Unprocessed YYYY-MM-DD HH/MM".

Otherwise, it is added to a file named "Reduced Hostnames YYYY-MM-DD HH/MM".
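
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not dMunge's actual code: the small `SUFFIXES` set stands in for the Public Suffix List plus splist.csv, `ALLOWED` stands in for the whitelist and the Alexa Top 1M combined, and the names `HOST_RE`, `reduce_domain` and `classify` are hypothetical.

```python
import re

# Hypothetical stand-ins: the real program loads the Public Suffix List,
# splist.csv, white.txt and top-1m.csv from disk.
SUFFIXES = {"com", "uk", "co.uk", "ml", "co.vu"}
ALLOWED = {"example.com"}

# Rough pattern for a dotted hostname; an assumption for this sketch only.
HOST_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z0-9-]+", re.IGNORECASE)


def reduce_domain(line):
    """Return 'label.suffix' for the first hostname in line, or None."""
    match = HOST_RE.search(line)
    if match is None:
        return None
    labels = match.group(0).lower().split(".")
    # Try the longest candidate suffix first so 'co.uk' beats 'uk'.
    for start in range(1, len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in SUFFIXES:
            # Join the label immediately left of the suffix to the suffix.
            return labels[start - 1] + "." + suffix
    return None


def classify(line):
    """Return (bucket, value) mirroring the two results files."""
    reduced = reduce_domain(line)
    if reduced is None:
        return ("unprocessed", "No domain info")
    if reduced in ALLOWED:
        return ("unprocessed", "Whitelist")
    return ("reduced", reduced)
```

With these stand-in sets, `classify("spam from http://mail.bad.co.uk/x")` yields `("reduced", "bad.co.uk")`, while a line mentioning `www.example.com` is left unprocessed because its reduced form is in `ALLOWED`.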

PLEASE TAKE NOTE OF CAVEATS BELOW.

Updates

=> HIDDEN FEATURE! Press 2 at the menu prompt to update the TLD and Alexa Whitelist files.

These are the necessary files:

| File | Location | Purpose |
|---|---|---|
| dmunge.py | Any directory you want, including a shared directory on a server | The main program. |
| config.py | The same directory as the main program | Specifies a subdirectory of a user's home directory (~) where all files to be processed and all results files will be stored. |
| tldcache | Automatically generated in the same directory as the main program | Contains the Public Suffix List. |
| top-1m.csv | Automatically generated in the same directory as the main program | Contains the Alexa Top 1 Million. Any reduced domain name that matches a line in this file will NOT be processed, and will be added to the file "Unprocessed YYYY-MM-DD HH/MM" with a reason code of Alex. |
| white.txt | The same directory as the main program (even if the file is empty, it must be there) | Any reduced domain name that matches a line in this file will NOT be processed, and will be added to the file "Unprocessed YYYY-MM-DD HH/MM" with a reason code of Whitelist. |
| splist.csv | The same directory as the main program (even if the file is empty, it must be there) | Any domain name in this file will be treated as a TLD when extracting reduced domain names. |
| dmunge.txt | The path that is specified in config.py | The program will step through this file line by line and attempt to extract a TLD/hostname pair. If it fails to do so, the line will be added to the file "Unprocessed YYYY-MM-DD HH/MM" with a reason code of "No domain info". |
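
Because white.txt and splist.csv must exist even when empty, a quick pre-flight check can save a confusing first run. The sketch below is an assumption about how one might do that; `REQUIRED_FILES` and `missing_files` are hypothetical names and are not part of dMunge.

```python
import tempfile
from pathlib import Path

# File names taken from the list of necessary files; both must exist,
# even if they are empty.
REQUIRED_FILES = ("white.txt", "splist.csv")


def missing_files(program_dir):
    """Return the required files that are absent from program_dir."""
    base = Path(program_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


# Demonstration in a throwaway directory: create only white.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "white.txt").touch()
    print(missing_files(tmp))  # prints ['splist.csv']
```

Creating an empty file with `Path.touch()` is enough to satisfy the "must be there" requirement.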

File Formats

| File | Format | Notes |
|---|---|---|
| white.txt | domain.name | 1 per line |
| splist.csv | domain.name, service provider | Comma-separated values, 1 pair of values per line. The service provider is a note of some sort, such as contact information or the registrar name for that domain. Quotation marks WILL NOT be ignored; if you use them, they will print out in the final results. (For an example of the file format, see below.) |
| dmunge.txt | unstructured | No special structure. NB: IP addresses will be treated as hostnames and will be added to the file "Reduced Hostnames YYYY-MM-DD HH/MM". |

Example splist.csv:

ml,FREENOM

tktxt,FREENOM

co.vu,VARIA

so1.cc,VARIA
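
To make the format concrete, here is a minimal sketch of reading such a file into a lookup table. It assumes each line is split on its first comma and that quotation marks are kept verbatim, which matches the note about quotes but is not confirmed to be how dMunge itself parses the file; `load_splist` is a hypothetical name.

```python
import io


def load_splist(lines):
    """Map each extra suffix to its service-provider note."""
    providers = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only; quotation marks are kept as-is.
        suffix, _, note = line.partition(",")
        providers[suffix.strip().lower()] = note.strip()
    return providers


sample = io.StringIO('ml,FREENOM\nco.vu,VARIA\nso1.cc,"VARIA"\n')
print(load_splist(sample))
# prints {'ml': 'FREENOM', 'co.vu': 'VARIA', 'so1.cc': '"VARIA"'}
```

Note how the quoted `"VARIA"` keeps its quotation marks, which is why the README advises against using them.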

Requirements

You must be connected to the Internet for the first run, and any time you want to update TLDs or the Alexa Top 1M.

You must have Python 3.6 or higher.

  • The following open-source 3rd-party Python modules:
    • The excellent arrow for better time handling than the built-in python modules
    • The equally excellent tldextract
    • The superlative tqdm which makes all progress visible
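
A short script can confirm these requirements before the first run. This is a sketch, not something shipped with dMunge; `REQUIRED_MODULES`, `missing_modules` and `python_ok` are hypothetical names, and the `pip install` hint assumes the modules are installed from PyPI under the same names.

```python
import importlib.util
import sys

REQUIRED_MODULES = ("arrow", "tldextract", "tqdm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(minimum=(3, 6)):
    """True when the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    if not python_ok():
        print("Python 3.6 or higher is required")
    for name in missing_modules():
        print("Missing module:", name, "(try: pip install " + name + ")")
```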

Caveats

Please note:

dMunge expects no more than one domain per line of text.

If a line contains more than one domain name or URL, or a combination of the two, dMunge will fail silently for that line.
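
Since the failure is silent, it may help to flag multi-domain lines before feeding a file to dMunge. The sketch below uses a deliberately loose regular expression as a stand-in for real domain detection; `DOMAIN_RE` and `split_by_domain_count` are hypothetical names, and the pattern will also match things like file names (`notes.txt`) while ignoring bare IP addresses.

```python
import re

# Loose pattern for dotted names; an assumption, not dMunge's own matcher.
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def split_by_domain_count(lines):
    """Return (single, multiple): lines with at most one candidate, and the rest."""
    single, multiple = [], []
    for line in lines:
        candidates = DOMAIN_RE.findall(line)
        # More than one domain-like token means dMunge may skip the line.
        (multiple if len(candidates) > 1 else single).append(line)
    return single, multiple
```

Lines that land in `multiple` can then be split by hand, one domain per line, before processing.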

License (MIT)

Copyright (c) 2018 Adam Z. Wasserman, Neil Schwartzman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.