An unsupervised language identification algorithm in Ruby, built originally for detecting English-language tweets.

What?

Given a set of strings from different languages, build a detector for the majority language (often, but not necessarily, English). More information on the algorithm here.

Example

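# Assumes LanguageDetector is already loaded (for example, via the gem; see "Using the Gem").
# Train on unlabeled sentences, using n-grams of size 3.
training_sentences = File.readlines("datasets/gutenberg-training.txt")
detector = LanguageDetector.new(:ngram_size => 3)
detector.train(30, training_sentences)

puts "Testing on English sentences..."
true_english = 0   # English lines classified as the majority language
false_spanish = 0  # English lines classified as anything else
IO.foreach("datasets/gutenberg-test-en.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    true_english += 1
  else
    puts line  # show each misclassified sentence
    false_spanish += 1
  end
end
puts "Misclassified: #{false_spanish}"
puts "Correct: #{true_english}"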
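
To give a feel for what an n-gram size of 3 means, here is a minimal sketch that splits a string into overlapping three-character sequences and counts them. The helpers `char_ngrams` and `ngram_counts` are hypothetical and written for illustration only; they assume character-level n-grams and are not the detector's actual internals.

```ruby
# Hypothetical helper (not part of this library): return the overlapping
# character n-grams of a string, lowercased and stripped of outer whitespace.
def char_ngrams(text, n = 3)
  text.downcase.strip.chars.each_cons(n).map(&:join)
end

# Hypothetical helper: count how often each n-gram occurs in the string.
def ngram_counts(text, n = 3)
  char_ngrams(text, n).each_with_object(Hash.new(0)) do |gram, counts|
    counts[gram] += 1
  end
end

p char_ngrams("Hello")
p ngram_counts("banana")
```

Strings shorter than `n` characters simply yield no n-grams, so very short inputs carry little or no signal under this scheme.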

Using the Gem

gem install unsupervised-language-detection

require 'rubygems'
require 'unsupervised-language-detection'

UnsupervisedLanguageDetection.is_english_tweet?("I am an English sentence.") # => true
UnsupervisedLanguageDetection.is_english_tweet?("Hola, me llamo Edwin.") # => false
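
A common use is filtering a batch of tweets down to the English ones. The sketch below uses a hypothetical helper, `select_english`, that takes the classifier as a block, so it runs without the gem installed; the stand-in classifier at the bottom is a placeholder, not the gem's logic.

```ruby
# Hypothetical helper (not part of the gem): keep only the non-blank strings
# for which the given block returns true.
def select_english(tweets)
  tweets.reject { |t| t.strip.empty? }.select { |t| yield(t) }
end

# With the gem loaded, the block would delegate to it:
#   select_english(tweets) { |t| UnsupervisedLanguageDetection.is_english_tweet?(t) }

# Stand-in classifier for demonstration only.
tweets = ["I am an English sentence.", "", "Hola, me llamo Edwin."]
puts select_english(tweets) { |t| t.start_with?("I ") }
```

Passing the classifier as a block keeps the filtering logic independent of any particular detector.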

Demo

See a demo here.