Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
An unsupervised language identification algorithm in Ruby, built originally for detecting English-language tweets.
Ruby
branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
datasets Adding more data and website demo.
lib
test
Gemfile
README.md Changing to trigrams for English tweet detection.
Rakefile Gemifying.
language-detector-demo.rb
unsupervised-language-detection.gemspec Adding link.

README.md

What?

Given a set of strings from different languages, build a detector for the majority language (often, but not necessarily, English). More information on the algorithm here.

Example

training_sentences = File.readlines("datasets/gutenberg-training.txt")
detector = LanguageDetector.new(:ngram_size => 3)
detector.train(30, training_sentences)

puts "Testing on English sentences..."
true_english = 0
false_spanish = 0
IO.foreach("datasets/gutenberg-test-en.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    true_english += 1
  else
    puts line
    false_spanish += 1    
  end
end
puts false_spanish
puts true_english

Example

Using the Gem

gem install unsupervised-language-detection

require 'rubygems'
require 'unsupervised-language-detection'

UnsupervisedLanguageDetection.is_english_tweet?("I am an English sentence.") # => true
UnsupervisedLanguageDetection.is_english_tweet?("Hola, me llamo Edwin.") # => false

Demo

See a demo here.

Something went wrong with that request. Please try again.