ane edited this page May 13, 2011 · 9 revisions

What?

hisg is an IRC statistics generator written in Haskell. Its main objective is to provide informative charts about channel activity, with emphasis on pertinent data. That is, hisg seeks to answer most of the following questions:

  • Who spoke the most lines and characters in the channel? (Line and word count, pie charts, tables)
  • When did the users talk? (Activity by hour, bar charts)
  • What were the most active hours in the channel? (The above, but generalized to a per-hour bar chart)
  • What were the usual characters-per-line ratios? Who were the most verbose (longest lines) and who were the most talkative? (Scatter plot)
  • How active has the channel been?
  • How have daily quarter activities changed as a function of time, i.e., how has activity during night-time changed during the whole observation period?

hisg relies on charts and tables to present its data. The charts are rendered with Google Charts.
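As an illustration of the kind of aggregation behind the "activity by hour" question above, here is a minimal sketch in Haskell. It is not taken from hisg's source. It assumes every log line of interest starts with an `HH:MM` timestamp (as in a typical irssi log), and the names `lineHour` and `hourly` are hypothetical. It counts every timestamped line, not only spoken messages.

```haskell
import qualified Data.Map.Strict as M
import Data.Char (isDigit)

-- Extract the hour from a line that starts with "HH:MM".
-- Lines without such a prefix (e.g. day-change markers) yield Nothing.
lineHour :: String -> Maybe Int
lineHour (a : b : ':' : _)
  | isDigit a && isDigit b && h < 24 = Just h
  where
    h = read [a, b] :: Int
lineHour _ = Nothing

-- Build an hour -> line count histogram, skipping unparseable lines.
hourly :: [String] -> M.Map Int Int
hourly ls = M.fromListWith (+) [(h, 1) | Just h <- map lineHour ls]

main :: IO ()
main = do
  let sample =
        [ "09:15 <alice> morning"
        , "09:47 <bob> hi"
        , "21:03 <alice> night"
        , "--- Day changed Fri May 13 2011"
        ]
  -- One (hour, count) pair per line of output.
  mapM_ print (M.toList (hourly sample))
```

The resulting map could then be fed to whatever produces the per-hour bar chart.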

Motivation

pisg is the most versatile and feature-rich of all IRC stats generators, but it is also the slowest. This is because pisg is written in Perl (the name is an acronym of Perl IRC Statistics Generator). Speed isn't usually that important in statistics generators, as the data is analyzed once a day at most. However, it makes one wonder how fast analysis of plain text data can get.

Haskell is as functional as a functional language can possibly get. It is also very fast, with good parallelism support and overall high reliability. Thus, out of curiosity and general eagerness to write Haskell, the author set out to write his own IRC statistics generator.

Performance

An average-size irssi-format log file holding one year's worth of 24/7 data, totaling 1 million lines in 40 MB, is processed by hisg in under 5 seconds on an Athlon II X4 3.0 GHz machine (using all four cores), using at most 24 MB of memory. On the same system, pisg takes more than two minutes on the same data and uses 100 MB of memory, although it does provide more features.

An immense log file covering the same period from a more active channel, totaling 2.6 million lines in 131 MB, is processed in just 15 seconds, using at most 40 MB of memory. The same in pisg took 11 minutes! But this is only half of the truth, as pisg is written in Perl, and Perl is not a compiled language! Moreover, I doubt pisg was written with performance in mind (whereas hisg was).

Thus, on top of good performance, hisg aims to provide informative data about channel and per-user activity in a very short amount of time, using few resources (aside from CPU). Its other goal is to serve as a proof of concept of the parallel power of Map/Reduce applied to log analysis.
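To make the Map/Reduce idea concrete, here is a minimal sketch of per-nick line counting in Haskell. It is not hisg's actual implementation. It assumes a simplified line format of `HH:MM <nick> message` with no mode prefix inside the brackets, and all names (`nickOf`, `mapChunk`, `reduceCounts`, `lineCounts`) are hypothetical. The map step runs sequentially here. In a parallel version, the `map mapChunk` call is the natural place to substitute a parallel map, for example `parMap` from `Control.Parallel.Strategies` in the `parallel` package.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

type Counts = M.Map String Int

-- Hypothetical parser: take the nick from "HH:MM <nick> message".
-- Anything else (joins, parts, day changes) yields Nothing.
nickOf :: String -> Maybe String
nickOf line = case words line of
  (_ : ('<' : rest) : _)
    | not (null rest), last rest == '>' -> Just (init rest)
  _ -> Nothing

-- Map step: count lines per nick within a single chunk.
mapChunk :: [String] -> Counts
mapChunk = foldl' bump M.empty
  where
    bump acc line = case nickOf line of
      Just n  -> M.insertWith (+) n 1 acc
      Nothing -> acc

-- Reduce step: merge partial counts by summing per nick.
reduceCounts :: [Counts] -> Counts
reduceCounts = M.unionsWith (+)

-- Split a list into chunks of at most n elements (n is clamped to >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | null xs   = []
  | otherwise = let (h, t) = splitAt (max 1 n) xs in h : chunksOf n t

-- Whole pipeline: chunk, map each chunk, reduce the partial results.
lineCounts :: Int -> [String] -> Counts
lineCounts n = reduceCounts . map mapChunk . chunksOf n

main :: IO ()
main = do
  let sample =
        [ "09:15 <alice> morning"
        , "09:47 <bob> hi"
        , "21:03 <alice> night"
        , "21:04 -!- carol has joined"
        ]
  mapM_ print (M.toList (lineCounts 2 sample))
```

Because summing counts is associative and commutative, the chunks can be processed in any order and merged afterwards, which is what makes this kind of log analysis a good fit for Map/Reduce.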

Features

  • Fast parallel data processing using Map/Reduce techniques
  • Informative charts focusing on user channel activity for monitoring daily/monthly trends

Planned features

  • A better template engine
  • Customizable themes
  • Achievements to make non-pertinent stuff interesting (unlikely)