WIP: the functionality currently consists only of bootstrapping and random UA generation.
The whole project is licensed under the MIT license.
Gathering User-Agents from NCSA (e.g. Apache) access logs
Execute the following command in the directory where the logs are located:
zcat -f * | egrep '(MSIE|Firefox|Safari)' | sed 's/^.*"\([^"]*\)"$/\1/' | egrep -iv '(bot|spider|gfe)' | sort | uniq -c >agents.txt
The first command, zcat, decompresses gzipped files; the -f flag makes it pass uncompressed files through as well. In the next step, the first egrep keeps (mostly) real browsers, then sed selects the last quoted field of each log line, which is the User-Agent. Some bots (even Google's) use UA strings that contain MSIE, Firefox or Safari, so a second egrep filters those out; after sorting, uniq -c counts each unique string. In the end, you'll have a text file that contains every unique User-Agent your HTTPd encountered and the number of hits generated by that UA.
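If you'd rather not (or can't) use the shell pipeline, a rough Python equivalent is sketched below. It is only an illustration of the same logic, not part of the project, and it is written for Python 2 like the rest of the code; run it with the log file names as arguments and redirect its output to agents.txt. Whether agents2db.py accepts this output depends on how closely it expects the uniq -c layout, so prefer the one-liner above when you can.

import gzip
import re
import sys

# the last quoted field of an NCSA log line is the User-Agent
UA_FIELD = re.compile(r'"([^"]*)"\s*$')
BROWSERS = re.compile(r'MSIE|Firefox|Safari')
BOTS = re.compile(r'bot|spider|gfe', re.IGNORECASE)

counts = {}
for name in sys.argv[1:]:
    opener = gzip.open if name.endswith('.gz') else open
    handle = opener(name)
    for line in handle:
        match = UA_FIELD.search(line)
        if match is None:
            continue
        ua = match.group(1)
        if BROWSERS.search(ua) and not BOTS.search(ua):
            counts[ua] = counts.get(ua, 0) + 1
    handle.close()

# mimic the "count, then User-Agent" layout produced by uniq -c
for ua in sorted(counts):
    print('%7d %s' % (counts[ua], ua))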
There's one step left: converting the textual representation to a SQLite database, which is faster to query. Execute the following command:
python agents2db.py agents.txt agents.db
It should take less than ten seconds, and if everything went well, you'll have an agents.db file that contains all the UAs and is fast to query. You can test it by running the following command, which should print a single random User-Agent string.
python random_agent.py agents.db
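random_agent.py is the intended interface, but if you want to query the database from your own code, the sketch below shows one way to do a hit-weighted random pick. The table and column names (agents, ua, hits) are assumptions for illustration only; inspect the schema actually created by agents2db.py before relying on them, and note that random_agent.py itself may select differently.

import random
import sqlite3

# WARNING: the table/column names below are guesses for illustration;
# check the schema produced by agents2db.py before using this.
conn = sqlite3.connect('agents.db')
rows = conn.execute('SELECT ua, hits FROM agents').fetchall()

# weighted pick: User-Agents with more hits are returned more often
threshold = random.uniform(0, sum(hits for _, hits in rows))
for ua, hits in rows:
    threshold -= hits
    if threshold <= 0:
        print(ua)
        break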
Gathering browser signatures
Capture the TCP stream generated by each browser you want to add, and save each stream in a separate file in the bootstrap/browser-sigs directory. I used Wireshark, which is one of the easiest ways to do this.
- Set the capture filter to tcp port 80; this will minimize the noise.
- Start the capture and open a URL in the browser.
- Stop the capture after the page has loaded, and select the first packet that has the path part of the URL in its description field.
- Right-click it and select Follow TCP stream.
- In the new window, there's an option to save the whole stream to a file.
From this point, it's pretty easy: just cd to bootstrap/browser-sigs and issue the make command, which should convert your bin file(s) into tpl file(s). You might also need to edit the transform.sed file there to match the locale-specific values used by your browser. (The current values assume an English or Hungarian browser.)
Matching browser signatures to User-Agents
Browser signatures are matched to User-Agents using a simple set of rules written in a JSON file; this document assumes that you're familiar with the format. The basic structure is an array of dictionaries; the matcher evaluates these in the order they appear in the file, so it's important to always put the more specific rules first (for example, every Chrome browser has the string "Safari" in its User-Agent, so it's better to test for Chrome first and for Safari afterwards). An example file, apply.sample.json, can be found in the repository.
Each dictionary has to have two entries: input specifies the name of the template file, and rules contains an array of rules, all of which must be satisfied in order to match the template. Each rule is a dictionary that must contain a type entry of string type; currently two types are supported. ua-contains matches if the User-Agent contains the string in the value entry (case-sensitive), while ua-matches does the same but evaluates value as an extended regular expression.
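To make the structure above concrete, here is a hypothetical rule set, written as a Python literal so that it can carry comments (real JSON allows none); the template file names are made up, see apply.sample.json for a real example.

# the same structure as the JSON file, shown as a Python literal so it can
# be commented; the template names are made up for illustration
RULES = [
    {
        "input": "chrome.tpl",   # template used when all rules below match
        "rules": [
            # more specific rule set first: Chrome UAs also contain "Safari"
            {"type": "ua-contains", "value": "Chrome"},
        ],
    },
    {
        "input": "safari.tpl",
        "rules": [
            {"type": "ua-contains", "value": "Safari"},
            # ua-matches evaluates value as an extended regular expression
            {"type": "ua-matches", "value": "Version/[0-9]+"},
        ],
    },
]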
The coverage of the templates and the JSON can be tested with the apply.py script in the bootstrap/browser-sigs directory. The two parameters are the file names of the User-Agent database and the JSON file, respectively.
python apply.py agents.db apply.json
If everything goes well, the script finishes with an exit code of 0 and produces no output. If an I/O error occurs or the JSON file doesn't cover every User-Agent in the database, it prints a standard Python stack trace; in the latter case, the stack trace includes the User-Agent in question.
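Conceptually the check boils down to the sketch below; this is not the actual apply.py source, and the database table/column names in it are assumptions. Every User-Agent in the database has to satisfy all rules of at least one entry, otherwise an exception (and therefore a stack trace) is raised.

import json
import re
import sqlite3

def rule_matches(rule, ua):
    # the two rule types described above (Python re used here as a stand-in
    # for whatever regex flavor apply.py actually uses)
    if rule['type'] == 'ua-contains':
        return rule['value'] in ua          # case-sensitive substring test
    if rule['type'] == 'ua-matches':
        return re.search(rule['value'], ua) is not None
    raise ValueError('unknown rule type: ' + rule['type'])

def check_coverage(db_file, json_file):
    entries = json.load(open(json_file))
    conn = sqlite3.connect(db_file)
    # table/column names are guesses; the real schema comes from agents2db.py
    for (ua,) in conn.execute('SELECT ua FROM agents'):
        covered = any(all(rule_matches(rule, ua) for rule in entry['rules'])
                      for entry in entries)
        if not covered:
            raise ValueError('no template matches: ' + ua)

check_coverage('agents.db', 'apply.json')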
- Python 2.6+ (tested on 2.6 and 2.7)