An unsupervised language identification algorithm in Ruby, built originally for detecting English-language tweets.
Ruby
Latest commit f151806 Aug 20, 2011 Edwin Chen Removing extra line of code.

What?

Given a set of strings from different languages, build a detector for the majority language (often, but not necessarily, English). More information on the algorithm here.
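The detector in the example below is configured with `:ngram_size => 3`. As a rough illustration of what n-gram features of that size could look like, here is a minimal sketch that assumes character n-grams over lowercased text. The helpers `char_ngrams` and `ngram_counts` are hypothetical names used only for illustration; they are not part of this library, and the library's own feature extraction may differ.

```ruby
# Hypothetical helpers for illustration only; not part of the gem.

# Split text into overlapping character n-grams (assumes lowercasing
# and trimming surrounding whitespace, which the real detector may not do).
def char_ngrams(text, n = 3)
  text.downcase.strip.chars.each_cons(n).map(&:join)
end

# Count how often each n-gram occurs in the text.
def ngram_counts(text, n = 3)
  char_ngrams(text, n).each_with_object(Hash.new(0)) do |gram, counts|
    counts[gram] += 1
  end
end

p char_ngrams("Hola", 3)   # prints ["hol", "ola"]
p ngram_counts("aaaa", 3)  # prints {"aaa"=>2} (spacing varies by Ruby version)
```

Strings shorter than `n` characters yield no n-grams in this sketch.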

Example

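# Train a detector with n-grams of size 3 on the training sentences.
training_sentences = File.readlines("datasets/gutenberg-training.txt")
detector = LanguageDetector.new(:ngram_size => 3)
detector.train(30, training_sentences)

# Classify held-out English sentences and count how many are labeled "majority".
puts "Testing on English sentences..."
true_english = 0   # English sentences classified as the majority language
false_spanish = 0  # English sentences not classified as the majority language
IO.foreach("datasets/gutenberg-test-en.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    true_english += 1
  else
    puts line  # show each misclassified sentence
    false_spanish += 1
  end
end
puts "Misclassified: #{false_spanish}"
puts "Correct: #{true_english}"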
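The two counters printed above can be combined into a single accuracy figure. The following is a minimal sketch; `accuracy` is a hypothetical helper, not something the library provides, and the counts passed to it are made-up placeholders rather than real results.

```ruby
# Hypothetical helper for illustration; not part of the gem.
# Returns the fraction of sentences that were classified correctly.
def accuracy(correct, incorrect)
  total = correct + incorrect
  return 0.0 if total.zero?  # avoid dividing by zero on an empty test set
  correct.to_f / total
end

# Placeholder counts; in the example above you would pass
# true_english and false_spanish instead.
puts format("Accuracy: %.1f%%", 100 * accuracy(9, 1))  # prints "Accuracy: 90.0%"
```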


Using the Gem

gem install unsupervised-language-detection

require 'rubygems'
require 'unsupervised-language-detection'

UnsupervisedLanguageDetection.is_english_tweet?("I am an English sentence.") # => true
UnsupervisedLanguageDetection.is_english_tweet?("Hola, me llamo Edwin.") # => false
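A common use of a predicate like `is_english_tweet?` is filtering a batch of tweets. The sketch below takes the classifier as a callable so it runs without the gem installed; `select_english` and the keyword-based `stand_in` lambda are hypothetical and exist only for illustration. With the gem loaded, `UnsupervisedLanguageDetection.method(:is_english_tweet?)` could be passed in place of the stand-in.

```ruby
# Hypothetical helper for illustration; not part of the gem.
# Keeps only the non-blank tweets for which the classifier returns true.
def select_english(tweets, classifier)
  tweets.reject { |tweet| tweet.strip.empty? }
        .select { |tweet| classifier.call(tweet) }
end

# Crude stand-in classifier (an assumption, NOT the gem's algorithm):
# treats a tweet as English if it contains a few common English words.
stand_in = ->(text) { text.match?(/\b(the|is|am|an)\b/i) }

tweets = ["I am an English sentence.", "Hola, me llamo Edwin.", "   "]
p select_english(tweets, stand_in)  # prints ["I am an English sentence."]
```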

Demo

See a demo here.
