Privacy-enhanced resolver for shortened URLs

AnonShort

Work in progress: the functionality currently consists only of bootstrapping and random User-Agent (UA) generation.

License

The whole project is licensed under the MIT license.

Gathering User-Agents from NCSA (e.g. Apache) access logs

Execute the following command in the directory where the logs are (usually /var/log/apache).

zcat -f * | egrep '(MSIE|Firefox|Safari)' | sed 's/^.*"\([^"]*\)"$/\1/' \
    | egrep -iv '(bot|spider|gfe)' | sort | uniq -c >agents.txt

The first command, zcat, decompresses gzipped files; the -f parameter makes it pass uncompressed files through unchanged instead of failing on them. The first egrep keeps the lines that look like they come from real browsers, then sed extracts the last quoted field of each log line (the User-Agent). Some bots (even Google's) send UA strings that contain strings like MSIE, so a second egrep drops every line that mentions bot, spider, or gfe, ignoring case. The result gets sorted so that uniq -c can collapse the duplicates and count them. In the end, you'll have a text file that contains every unique User-Agent that passed the filters, along with the number of hits generated by each one.
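
For readers who prefer to follow the logic in Python, the filtering steps can be sketched as below. This is only an illustration of the pipeline, not a script from the project: the names count_agents, BROWSER, BOT, and LAST_QUOTED are made up here, the function expects already-decompressed log lines, and the code is written for a current Python 3 rather than the 2.6+ the project targets.

```python
import re
from collections import Counter

# Same patterns as the two egrep calls and the sed expression.
BROWSER = re.compile(r"(MSIE|Firefox|Safari)")
BOT = re.compile(r"(bot|spider|gfe)", re.IGNORECASE)
LAST_QUOTED = re.compile(r'^.*"([^"]*)"$')


def count_agents(lines):
    """Return a Counter mapping each kept User-Agent to its hit count."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # First egrep: keep lines that mention a browser token anywhere.
        if not BROWSER.search(line):
            continue
        # sed: reduce the line to its last quoted field; like sed, leave
        # the line unchanged when the pattern does not match.
        agent = LAST_QUOTED.sub(r"\1", line)
        # Second egrep (-iv): drop anything that looks like a crawler.
        if BOT.search(agent):
            continue
        # sort | uniq -c: count identical strings.
        counts[agent] += 1
    return counts
```

Calling count_agents on an open log file yields the same kind of data that ends up in agents.txt, only as a mapping instead of count-prefixed text lines.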

There's one step left: converting the textual representation to a SQLite database, which is faster to query. Execute the following command.

python agents2db.py agents.txt agents.db

It should take less than ten seconds. If everything went well, you'll have an agents.db file that contains all the UAs and is fast to query. You can test it by running the following command, which should print a single random User-Agent string.

python random_agent.py agents.db
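
To make the conversion and the random lookup more concrete, here is a minimal sketch of how such a database could be built and queried. It does not reproduce agents2db.py or random_agent.py: the table name agents, its columns agent and hits, and the helper names are assumptions made for this example, and the pick is uniform, which may differ from what the project's scripts do. It is written for a current Python 3.

```python
import sqlite3


def parse_counts(lines):
    """Yield (agent, hits) pairs from `uniq -c` style lines."""
    for line in lines:
        parts = line.strip().split(None, 1)  # "  12 Mozilla/..." -> count, agent
        if len(parts) == 2 and parts[0].isdigit():
            yield parts[1], int(parts[0])


def build_db(lines, path=":memory:"):
    """Create the (hypothetical) agents table and fill it from the lines."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE agents (agent TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO agents (agent, hits) VALUES (?, ?)", parse_counts(lines)
    )
    conn.commit()
    return conn


def random_agent(conn):
    """Return one uniformly chosen User-Agent, or None if the table is empty."""
    row = conn.execute(
        "SELECT agent FROM agents ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

Passing an open agents.txt file and a file name to build_db would produce an on-disk database; the default of ":memory:" keeps the sketch free of side effects.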

Gathering browser signatures

Capture the TCP stream generated by each browser you want to add, and save each stream in a separate file in the bootstrap/browser-sigs directory. I used Wireshark, which is one of the easiest ways to do this.

  • Set the capture filter to tcp port 80; this will minimize the noise.
  • Start the capture and open a URL in the browser.
  • Stop the capture after the page has loaded, and select the first packet that has the path part of the URL in its description field.
  • Right-click it and select Follow TCP Stream.
  • In the new window, there's an option to save the whole stream in a file.

From this point, it's pretty easy: cd into bootstrap/browser-sigs and run make, which should convert your bin file(s) into tpl file(s). You might also need to edit the transform.sed file there to match the locale-specific values used by your browser. (The current values assume an English or Hungarian browser.)

Matching browser signatures to User-Agents

Browser signatures are matched to User-Agents using a simple set of rules written in a JSON file; this document assumes that you're familiar with the format. The basic structure is an array of dictionaries. The matcher evaluates these in the order they appear in the file, so it's important to always put the more specific rules first (for example, every Chrome browser has the string "Safari" in its User-Agent, so it's better to test for Chrome first and Safari afterwards). An example file, apply.sample.json, can be found in the bootstrap/browser-sigs directory.

Each dictionary has to have two entries: input specifies the name of the template file, and rules contains an array of rules, all of which must be satisfied for the template to match. Each rule is a dictionary that must contain a type entry, which is a string; two types are currently supported. The first, ua-contains, matches if the User-Agent contains the string in the value entry (case-sensitive), while ua-matches does the same but evaluates value as an extended regular expression.
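
The rule format described above can be illustrated with a small sketch. The JSON content and the template file names (chrome.tpl, safari.tpl, firefox.tpl) are invented for this example and are not taken from apply.sample.json; rule_matches and find_template are hypothetical helpers, not the project's matcher. One caveat: Python's re module is used for ua-matches here, and its syntax is close to, but not identical with, POSIX extended regular expressions.

```python
import json
import re

# Hypothetical rule file: more specific entries (Chrome) come first.
RULES_JSON = """
[
  {"input": "chrome.tpl",
   "rules": [{"type": "ua-contains", "value": "Chrome"}]},
  {"input": "safari.tpl",
   "rules": [{"type": "ua-contains", "value": "Safari"}]},
  {"input": "firefox.tpl",
   "rules": [{"type": "ua-matches", "value": "Firefox/[0-9]+"}]}
]
"""


def rule_matches(rule, user_agent):
    """Evaluate a single rule dictionary against a User-Agent string."""
    kind = rule["type"]
    if kind == "ua-contains":
        return rule["value"] in user_agent  # case-sensitive substring test
    if kind == "ua-matches":
        return re.search(rule["value"], user_agent) is not None
    raise ValueError("unsupported rule type: %r" % (kind,))


def find_template(entries, user_agent):
    """Return the input of the first entry whose rules all match, else None."""
    for entry in entries:
        if all(rule_matches(rule, user_agent) for rule in entry["rules"]):
            return entry["input"]
    return None


ENTRIES = json.loads(RULES_JSON)
```

With these entries, a Chrome User-Agent (which also contains "Safari") resolves to chrome.tpl only because the Chrome entry is listed first; swapping the first two entries would send it to safari.tpl instead.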

The coverage of the templates and the JSON can be tested with the apply.py script in the bootstrap/browser-sigs directory. The two parameters are the file names of the User-Agent database and the JSON file, respectively.

python apply.py agents.db apply.json

If everything goes well, the script finishes with an exit code of 0 and without any output. If an I/O error occurs or the JSON file doesn't cover every User-Agent in the database, it prints a standard Python stack trace, which in the latter case includes the User-Agent in question.
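
The "silent on success, stack trace on failure" behaviour can be sketched as follows. This is not the code of apply.py: check_coverage and demo_matcher are hypothetical names, the exception type is an assumption, and demo_matcher is a stand-in for whatever matcher is built from the JSON file.

```python
def check_coverage(agents, find_template):
    """Raise on the first User-Agent that no template covers.

    `find_template` is any callable that returns a template name or None.
    Returning normally means full coverage; an uncaught exception makes
    Python print a stack trace and exit with a non-zero status.
    """
    for agent in agents:
        if find_template(agent) is None:
            raise LookupError("no template covers User-Agent: %r" % (agent,))


def demo_matcher(user_agent):
    """Stand-in matcher for this sketch: it only knows Firefox."""
    return "firefox.tpl" if "Firefox" in user_agent else None
```

Running check_coverage over every User-Agent in the database and letting any exception propagate gives the described result: no output and exit code 0 when everything is covered, and a trace naming the offending User-Agent otherwise.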

Dependencies

  • Python 2.6+ (tested on 2.6 and 2.7)