Fast parsing of log files in Ruby

Teeth

Teeth is a library for fast parsing of log files such as Apache access and error logs. It uses C extensions generated by flex (as in Flex and Bison). If you only want to use the built-in scanners, you don't need flex. If you want to add support for new/different log formats, you'll need to have flex installed.

Example

require "teeth"

access_log = %q{myhost.localdomain:80 172.16.115.1 - - [13/Dec/2008:19:26:11 -0500] "GET /favicon.ico HTTP/1.1" 404 241 "http://172.16.115.130/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1"}
access_log.scan_apache_logs

=> {:strings=>["241"],
    :apache_access_datetime=>["13/Dec/2008:19:26:11 -0500"],
    :absolute_url=>,
    :message=>"myhost.localdomain:80 172.16.115.1 - - [13/Dec/2008:19:26:11 -0500] \"GET /favicon.ico HTTP/1.1\" 404 241 \"http://172.16.115.130/\" \"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1\"",
    :http_method=>["GET"],
    :browser_string=>["Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1"],
    :relative_url=>,
    :http_version=>,
    :host=>,
    :id=>"8AD5CBCC1CB011DE8CE10017F22FF48F",
    :http_response=>["404"],
    :ipv4_addr=>}
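Since each token type maps to an array of matched strings, the results are easy to aggregate with plain Ruby. The following is a minimal sketch, assuming the hash shape shown above. The `sample_results` data is hand-written for illustration (it is not real Teeth output), and `tally_values` is a hypothetical helper, not part of Teeth.

```ruby
# Hand-written stand-ins shaped like the hash shown above. They are not real
# Teeth output, and only keys that appear in that example are used.
sample_results = [
  { :http_method => ["GET"],  :http_response => ["404"], :strings => ["241"] },
  { :http_method => ["GET"],  :http_response => ["200"], :strings => ["1024"] },
  { :http_method => ["POST"], :http_response => ["404"], :strings => ["17"] }
]

# Hypothetical helper: count how often each value appears under a given key.
def tally_values(results, key)
  counts = Hash.new(0)
  results.each do |result|
    # Token values are arrays. Treating a missing key as empty is a defensive
    # assumption, since not every line need contain every token type.
    (result[key] || []).each { |value| counts[value] += 1 }
  end
  counts
end

response_counts = tally_values(sample_results, :http_response)
method_counts = tally_values(sample_results, :http_method)
puts response_counts.inspect
puts method_counts.inspect
```

In real use, the array would presumably be built by calling `scan_apache_logs` on each line of a log file.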

Supported Log Formats

  • Apache (access and error logs)

  • Rails

Support for other web servers, app servers, and applications as well as other types of servers (e.g., SMTP, etc.) and generic syslog logs is planned for the future.

Creating Your Own Scanners

Teeth includes a library that can generate a flex scanner definition from a simplified definition written in Ruby. This cuts down on the repetition involved in writing all the C code by hand. The included scanners for Apache and Rails logs are defined this way. You can find them in the scanners directory.

Here's an example based on the definition for the Rails log scanner:

require File.dirname(__FILE__) + "/../lib/teeth"
scanner = Teeth::Scanner.new(:rails_logs, File.dirname(__FILE__) + '/../ext/scan_rails_logs/')
# Flex definitions are kinda like macros for regular expressions. We include
# some of the available defaults here to make writing the scanner easier
scanner.load_default_definitions_for(:whitespace, :ip, :time, :web)
# Add some more definitions
scanner.definitions do |define|
  define.RAILS_TEASER '(processing|filter\ chain\ halted|rendered)'
  define.CONTROLLER_ACTION '[a-z0-9]+#[a-z0-9]+'
  # Scanner is case insensitive
  define.RAILS_ERROR_CLASS '([a-z]+::)*[a-z]+error'
  # "Start conditions" are a feature of flex that allows us to have some
  # regular expressions that are only active when we tell the scanner to
  # enter a certain state. Here we define the "REQUEST_COMPLETED" state, and
  # specify that it is exclusive. This means that if the scanner is in this
  # state, it only matches rules written for this state
  define.REQUEST_COMPLETED :start_condition => :exclusive
end
# Define rules. These are the actions that the scanner executes when it sees
# text that matches a regular expression. The default action is to add
# :action_name => [matched_text] to the results Hash, or push the matched
# text on the end of the array if it already exists.
scanner.rules do |r|
  # This will add something like :teaser => ["Processing"] to the results
  r.teaser '{RAILS_TEASER}'
  r.controller_action '{CONTROLLER_ACTION}'
  # Use some of the default definitions we added above.
  r.datetime '{YEAR}"-"{MONTH_NUM}"-"{MDAY}{WS}{HOUR}":"{MINSEC}":"{MINSEC}'
  r.http_method '{HTTP_VERB}'
  # With :skip_line => true, scanner stops processing the line immediately
  r.skip_lines '{RAILS_SKIP_LINES}', :skip_line => true
  r.error '{RAILS_ERROR_CLASS}'
  # With :strip_ends => true, scanner removes first and last characters from
  # matched text
  r.error_message '(({WS}|{NON_WS})+)', :strip_ends => true
  # Puts scanner in the "REQUEST_COMPLETED" state we defined above.
  # The scanner only matches rules beginning with "<REQUEST_COMPLETED>" now
  r.teaser 'completed\ in', :begin => "REQUEST_COMPLETED"
  # These rules only apply to the "REQUEST_COMPLETED" state
  r.duration_s '<REQUEST_COMPLETED>[0-9]+.+'
  r.duration_ms '<REQUEST_COMPLETED>[0-9]+/ms'
  r.http_response '<REQUEST_COMPLETED>{HTTPCODE}'
  # Need a "catchall" rule: flex scanner "jams" if there isn't a default rule
  # (the catchall rule for the default/INITIAL state is automatically
  # included). Note that :ignore => true makes the scanner ignore what it
  # matches but doesn't stop processing of the line.
  r.ignore_others '<REQUEST_COMPLETED>{CATCHALL}', :ignore => true
  # The "strings" action is special. It keeps track of whether the last token
  # was also a string, and if it was, the new string is appended to the last
  # string instead of being pushed to the array. For example, when scanning
  # an apache error log, "Invalid URI in request" will be extracted as a
  # complete string (instead of ["Invalid", "URI", "in", "request"])
  r.strings '{NON_WS}{NON_WS}*'
end
# Writes the generated scanner and an extconf.rb for it to the directory we
# specified when we initialized the scanner.
scanner.write!

There's not much in the way of documentation for the scanner generator, but you can refer to the specs and the definitions for Apache and Rails logs to get a sense of how it works. It would probably help to learn about flex's regex syntax and other features.
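The default action and the special "strings" action described in the scanner definition can be modeled in a few lines of plain Ruby. This is only an illustrative sketch: Teeth does this work in generated C, the `ResultsSketch` class and its method names are hypothetical, and joining consecutive words with a single space is an assumption based on the "Invalid URI in request" example.

```ruby
# Illustrative model only. Teeth implements these actions in generated C;
# this class and its method names are hypothetical.
class ResultsSketch
  attr_reader :results

  def initialize
    @results = {}
    @last_was_string = false
  end

  # Default action: create :name => [text], or push text onto the existing
  # array for that name.
  def default_action(name, text)
    (@results[name] ||= []) << text
    @last_was_string = false
  end

  # "strings" action: if the previous token was also a string, extend that
  # string instead of pushing a new array element.
  def strings_action(text)
    strings = (@results[:strings] ||= [])
    if @last_was_string && !strings.empty?
      # Assumption: consecutive words are joined with a single space.
      strings[-1] = strings[-1] + " " + text
    else
      strings << text
    end
    @last_was_string = true
  end
end

sketch = ResultsSketch.new
sketch.default_action(:teaser, "Processing")
%w[Invalid URI in request].each { |word| sketch.strings_action(word) }
sketch.default_action(:http_response, "404")
sketch.strings_action("trailing")
puts sketch.results.inspect
```

Any non-string token resets the flag, which is why "trailing" starts a new element instead of being appended to "Invalid URI in request".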

Ruby 1.9

Ruby 1.9 is supported on the master branch. Don't use the ruby1.9 branch; it is orphaned.

Shortcomings and Known Issues

In addition to the lack of support for formats other than Apache and Rails described above:

  • It's a new project, so expect lots of API changes

  • Does not convert datetimes to Ruby Time objects

  • Does not always use context or knowledge of the log format to its advantage. This is improving now that the scanner can utilize start conditions.
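Until datetime conversion is built in, the extracted strings can be converted by the caller. Here is a minimal sketch, assuming the :apache_access_datetime format shown in the example output earlier. `parse_apache_datetime` and `APACHE_ACCESS_FORMAT` are hypothetical names, not part of Teeth.

```ruby
require "time"

# strptime pattern for values such as "13/Dec/2008:19:26:11 -0500".
APACHE_ACCESS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

# Hypothetical helper: convert a scanned datetime string to a Time object.
# Time.strptime raises ArgumentError if the text does not match the format.
def parse_apache_datetime(text)
  Time.strptime(text, APACHE_ACCESS_FORMAT)
end

parsed = parse_apache_datetime("13/Dec/2008:19:26:11 -0500")
puts parsed.year
puts parsed.utc_offset
```

Other log formats, such as Rails logs, would need their own format strings.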

Performance

On my laptop, a white MacBook 2.0 GHz Intel Core Duo, Teeth can process more than 30k lines of Apache access logs per second. So it's pretty fast. If modified to not create a UUID or keep the full message, this can be increased to around 45k lines/sec. One could potentially do pretty well on the wide_finder2
