Use a Bayesian classifier to determine source code language



Source classifier identifies programming languages using a Bayesian classifier trained on a corpus generated from the Computer Language Benchmarks Game. It is written in Ruby and available as a gem. To train the classifier to identify new languages, download the sources from GitHub.

Out of the box, SourceClassifier recognises CSS, C, Java, JavaScript, Perl, PHP, Python and Ruby.
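
To give a feel for how a Bayesian classifier can tell languages apart, here is a minimal sketch of a naive Bayes classifier over source tokens. It is an illustration only, not this gem's implementation (which delegates to the Classifier gem); the class name @TinyBayes@, its tokenizer and the add-one smoothing are all assumptions made for the example.

```ruby
# A toy multinomial naive Bayes classifier, for illustration only.
# TinyBayes and its methods are hypothetical; they are not part of SourceClassifier.
class TinyBayes
  def initialize
    # label => { token => count }
    @token_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    # label => number of training documents seen
    @doc_counts = Hash.new(0)
    # set of every token seen during training
    @vocabulary = {}
  end

  # Split text into identifier-like words and single punctuation characters.
  def tokenize(text)
    text.scan(/[A-Za-z_]+|[^\sA-Za-z_0-9]/)
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |token|
      @token_counts[label][token] += 1
      @vocabulary[token] = true
    end
  end

  # Return the label with the highest log posterior probability.
  def identify(text)
    total_docs = @doc_counts.values.sum
    tokens = tokenize(text)
    scores = @doc_counts.keys.map do |label|
      counts = @token_counts[label]
      label_total = counts.values.sum
      # Start from the log prior for this label.
      score = Math.log(@doc_counts[label].to_f / total_docs)
      tokens.each do |token|
        # Add-one smoothing so an unseen token does not zero out the product.
        score += Math.log((counts[token] + 1.0) / (label_total + @vocabulary.size))
      end
      [label, score]
    end
    scores.max_by { |_, score| score }.first
  end
end

bayes = TinyBayes.new
bayes.train("Ruby", "def greet(name)\n  puts name\nend\n")
bayes.train("Gcc", "#include <stdio.h>\nint main() { printf(\"hi\"); return 0; }\n")
guess = bayes.identify("def add(a, b)\n  a + b\nend\n")
puts guess # prints "Ruby"
```

With only one training example per language the toy model is fragile; the real classifier is trained on many programs per language, which is why the size and variety of the training corpus matters.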


First, install the gem using GitHub as a source:

$ gem sources -a
$ sudo gem install chrislo-sourceclassifier

Then, to use

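  require 'rubygems'
  require 'sourceclassifier'

  s = SourceClassifier.new

  # <<- lets the closing EOT be indented
  ruby_text = <<-EOT
  def my_sorting_function(a)
    a.sort
  end
  EOT

  # The quoted tag keeps the \n escape literal inside the C string
  c_text = <<-'EOT'
  #include <unistd.h>

  int main() {
    write(1, "hello world\n", 12);
    return 0;
  }
  EOT

  s.identify(ruby_text) #=> Ruby
  s.identify(c_text) #=> Gcc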


Download the sources from GitHub and, in that directory, run the training rake task:

$ rake train

The ./sources directory contains one subdirectory for each language you want to identify. Each subdirectory holds example programs written in that language. The name of the directory is significant: it is the value returned by the identify method, which is why C is reported as Gcc.
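
The directory-name-as-label convention can be sketched as follows. This is a hedged illustration of the layout, not the code behind @rake train@; the helper @training_examples@ is a hypothetical name, and the temporary directory stands in for ./sources.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: map each language directory name (the label)
# to the example files found directly inside that directory.
def training_examples(sources_dir)
  examples = {}
  Dir.glob(File.join(sources_dir, '*')).sort.each do |path|
    next unless File.directory?(path) # ignore stray files next to the language dirs
    label = File.basename(path)
    examples[label] = Dir.glob(File.join(path, '*')).select { |f| File.file?(f) }.sort
  end
  examples
end

# Build a throwaway layout that mimics ./sources and list what would be trained.
Dir.mktmpdir do |root|
  { 'Ruby' => 'hello.rb', 'Gcc' => 'hello.c' }.each do |label, file|
    FileUtils.mkdir_p(File.join(root, label))
    File.write(File.join(root, label, file), "example\n")
  end
  training_examples(root).each do |label, files|
    puts "#{label}: #{files.map { |f| File.basename(f) }.join(', ')}"
  end
end
```

Adding a new language then amounts to creating a new subdirectory, filling it with example programs, and retraining.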

The rake task populate:shootout can be used to build these subdirectories from a checkout of the computer language shootout sources, but you are free to train the classifier using any available examples. Edit the Rakefile to point to your checkout of the shootout sources.

Run rake populate:css to grab the css files used to train the classifier from

To populate the sources directory using all available sources run

$ rake populate:all


This library depends heavily on the great Classifier gem by Lucas Carlson and David Fayram II.


This gem is released under the MIT license (see LICENSE). The training
examples retain their original licenses.