WIP: the functionality currently consists only of bootstrapping and random UA generation.
The whole project is licensed under the MIT license.
Execute the following command in the directory where the logs are
(usually /var/log/apache).
zcat -f * | egrep '(MSIE|Firefox|Safari)' | sed 's/^.*"\([^"]*\)"$/\1/' \
| egrep -iv '(bot|spider|gfe)' | sort | uniq -c >agents.txt
The first command, zcat, decompresses gzipped files; the -f parameter makes it
pass uncompressed files through unchanged instead of rejecting them. In the
next step, egrep keeps only the lines that mention a real browser, then sed
selects the last field of the log (User-Agent). Some bots (even Google's) use
UA strings that contain strings like MSIE, so a second egrep filters those
out. The result is then sorted so that uniq can do its task. In the end,
you'll have a text file that contains every unique User-Agent your HTTPd
encountered and the number of hits generated by that UA.
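If it helps to see the pipeline spelled out step by step, here is a rough Python sketch of the same filtering and counting. It is only an illustration, not part of the project: the function name count_agents is made up, and it assumes the User-Agent is the last double-quoted field on each log line, just like the sed expression does.

```python
import re
from collections import defaultdict

# Same patterns as the two egrep calls and the sed expression above.
BROWSER = re.compile(r'(MSIE|Firefox|Safari)')
BOT = re.compile(r'(bot|spider|gfe)', re.IGNORECASE)
LAST_QUOTED = re.compile(r'^.*"([^"]*)"$')


def count_agents(lines):
    """Return {user_agent: hits} for log lines that look like real browsers.

    count_agents is a hypothetical helper, not something the project ships.
    """
    counts = defaultdict(int)
    for line in lines:
        line = line.rstrip('\n')
        # First egrep: keep only lines mentioning a known browser token.
        if not BROWSER.search(line):
            continue
        # sed: reduce the line to its last quoted field; like sed, leave
        # the line untouched when the pattern does not match.
        match = LAST_QUOTED.match(line)
        agent = match.group(1) if match else line
        # Second egrep (-iv): drop anything that still looks like a bot.
        if BOT.search(agent):
            continue
        # sort | uniq -c: count identical User-Agent strings.
        counts[agent] += 1
    return dict(counts)
```

Calling count_agents(open('access.log')) on an uncompressed log would give a dictionary that maps each User-Agent to its hit count, which is the same information agents.txt holds.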
There's one step left: converting the textual representation to a SQLite database, which is faster to query. Execute the following command.
python agents2db.py agents.txt agents.db
It should take less than ten seconds, and if everything went well, you'll have
a nice agents.db file that contains all the UAs and is fast to query. You
can test it by running the following command, which should print a single
random User-Agent string.
python random_agent.py agents.db
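To give an idea of what these two steps amount to, here is a minimal sketch using Python's sqlite3 module. Everything specific in it is an assumption made for illustration: the table name agents, the columns ua and hits, and the helper names are hypothetical, and the real agents2db.py and random_agent.py may store and select the data differently (for example, by weighting the choice by hits).

```python
import sqlite3


def parse_counts(lines):
    """Yield (user_agent, hits) pairs from `uniq -c` style lines."""
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        # uniq -c pads the count with spaces: "      2 Mozilla/5.0 ..."
        hits, agent = line.lstrip().split(None, 1)
        yield agent, int(hits)


def build_db(conn, lines):
    """Create the (hypothetical) agents table and fill it from the lines."""
    conn.execute('CREATE TABLE agents (ua TEXT PRIMARY KEY, hits INTEGER)')
    conn.executemany('INSERT INTO agents (ua, hits) VALUES (?, ?)',
                     parse_counts(lines))
    conn.commit()


def random_agent(conn):
    """Return one User-Agent picked uniformly at random, or None if empty."""
    row = conn.execute(
        'SELECT ua FROM agents ORDER BY RANDOM() LIMIT 1').fetchone()
    return row[0] if row else None
```

With conn = sqlite3.connect('agents.db'), calling build_db(conn, open('agents.txt')) once and random_agent(conn) afterwards mirrors the two commands above.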
Capture the TCP stream generated by each browser to be added, and save each
one in a separate file in the bootstrap/browser-sigs directory. I used
Wireshark, which is one of the easiest methods.
- Set the capture filter to tcp port 80; this will minimize the noise.
- Start the capture and open a URL in the browser.
- Stop the capture after the page has loaded, and select the first packet that has the path part of the URL in the description field.
- Right-click it and select Follow TCP stream.
- In the new window, there's an option to save the whole stream in a file.
From this point, it's pretty easy: just cd bootstrap/browser-sigs and issue
a make command, which should convert your bin file(s) into tpl file(s). You
might also need to edit the transform.sed file there to match the
locale-specific values used by your browser. (The current values assume an
English or Hungarian browser.)
Browser signatures are matched to User-Agents using a simple set of rules
written in a JSON file; this document assumes that you're familiar with the
format. The basic structure is an array of dictionaries. The matcher evaluates
these in the order they appear in the file, so it's important to always put
the more specific rules first (for example, every Chrome browser has the
string "Safari" in its User-Agent, so it's better to test for Chrome first,
and Safari afterwards). An example file, apply.sample.json, can be found in
the bootstrap/browser-sigs directory.
Each dictionary has to have two entries: input specifies the name of the
template file, and rules contains an array of rules, which must all be
satisfied in order to match the template. Each rule is a dictionary, which
must contain a type entry of string type; currently two types are supported.
The first, ua-contains, matches if the User-Agent contains the string in the
value entry (case-sensitive), while ua-matches does the same but evaluates
value as an extended regular expression.
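The following sketch shows how such a file could be evaluated, with Chrome placed before Safari as recommended above. Treat it as an illustration under assumptions, not as the project's matcher: the template names (chrome.tpl and so on) and the helper names are made up, Python's re syntax stands in for extended regular expressions, and it is assumed that ua-matches may match anywhere in the string.

```python
import json
import re

# Hypothetical rules file; the template names are invented for this example.
RULES_JSON = '''
[
  {"input": "chrome.tpl",
   "rules": [{"type": "ua-contains", "value": "Chrome"}]},
  {"input": "firefox.tpl",
   "rules": [{"type": "ua-matches", "value": "Firefox/[0-9]+"}]},
  {"input": "safari.tpl",
   "rules": [{"type": "ua-contains", "value": "Safari"}]}
]
'''


def rule_matches(rule, user_agent):
    """Evaluate a single rule dictionary against a User-Agent string."""
    kind = rule['type']
    if kind == 'ua-contains':
        # Case-sensitive substring test.
        return rule['value'] in user_agent
    if kind == 'ua-matches':
        # Assumption: unanchored search, using Python's regex dialect.
        return re.search(rule['value'], user_agent) is not None
    raise ValueError('unsupported rule type: %r' % kind)


def find_template(signatures, user_agent):
    """Return the input of the first entry whose rules all match, else None."""
    for signature in signatures:
        if all(rule_matches(rule, user_agent) for rule in signature['rules']):
            return signature['input']
    return None
```

Because entries are tried in file order, find_template(json.loads(RULES_JSON), ua) picks chrome.tpl for a Chrome User-Agent even though that string also contains "Safari". Swapping the first and last entries would send every Chrome browser to the Safari template.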
The coverage of the templates and the JSON can be tested with the apply.py
script in the bootstrap/browser-sigs directory. The two parameters are the
file names of the User-Agent database and the JSON file, respectively.
python apply.py agents.db apply.json
If everything goes well, the script finishes with an exit code of 0 and without any output. If an I/O error occurs or the JSON file doesn't cover every User-Agent in the database, it prints a standard Python stack trace, which in the latter case includes the User-Agent in question.
- Python 2.6+ (tested on 2.6 and 2.7)