An unsupervised language identification algorithm in Ruby, built originally for detecting English-language tweets.


What?

Given a set of strings from different languages, build a detector for the majority language (often, but not necessarily, English). More information on the algorithm here.

Example

training_sentences = File.readlines("datasets/gutenberg-training.txt")
detector = LanguageDetector.new(:ngram_size => 3)
detector.train(30, training_sentences)

puts "Testing on English sentences..."
true_english = 0
false_spanish = 0
IO.foreach("datasets/gutenberg-test-en.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    true_english += 1
  else
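    puts line  # print each misclassified sentence for inspection
    false_spanish += 1
  end
end
puts "Misclassified as non-majority: #{false_spanish}"
puts "Classified as majority (English): #{true_english}"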
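
The counting loop above can be factored into a small reusable helper. The following is only a sketch: `majority_tally` is a hypothetical name, not part of this library, and it assumes nothing beyond a block that returns the label "majority" for majority-language lines, as `detector.classify` does in the example.

```ruby
# Hypothetical helper (not part of the gem): counts how many non-blank
# lines the given block labels "majority" and how many it does not.
def majority_tally(lines)
  counts = { majority: 0, other: 0 }
  lines.each do |line|
    next if line.strip.empty?          # skip blank lines, as in the example
    if yield(line) == "majority"
      counts[:majority] += 1
    else
      counts[:other] += 1
    end
  end
  total = counts[:majority] + counts[:other]
  # Fraction of non-blank lines labeled "majority" (0.0 when there are none).
  counts.merge(fraction_majority: total.zero? ? 0.0 : counts[:majority].to_f / total)
end

# Stand-in classifier so the sketch runs on its own. With a trained
# detector, pass { |line| detector.classify(line) } instead.
sample = ["the cat sat\n", "\n", "el gato\n"]
p majority_tally(sample) { |line| line.include?("the") ? "majority" : "other" }
```

Running the helper on an all-English test file gives the fraction of sentences correctly kept; running it on a non-English file gives the fraction wrongly kept.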

Using the Gem

gem install unsupervised-language-detection

require 'rubygems'
require 'unsupervised-language-detection'

UnsupervisedLanguageDetection.is_english_tweet?("I am an English sentence.") # => true
UnsupervisedLanguageDetection.is_english_tweet?("Hola, me llamo Edwin.") # => false
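
A common use of a predicate like `is_english_tweet?` is to split a batch of texts into English and non-English groups. The sketch below is illustrative only: `split_by_language` and `looks_english` are hypothetical names, and the stand-in predicate is a crude keyword check so that the snippet runs without the gem installed.

```ruby
# Hypothetical helper (not part of the gem): splits texts into those the
# predicate block accepts and those it rejects.
def split_by_language(texts, &english)
  texts.partition { |text| english.call(text) }
end

# Stand-in predicate for illustration only. With the gem installed, pass
#   { |t| UnsupervisedLanguageDetection.is_english_tweet?(t) }
# as the block instead.
looks_english = ->(text) { text.match?(/\b(the|am|is|an)\b/i) }

english, other = split_by_language(
  ["I am an English sentence.", "Hola, me llamo Edwin."],
  &looks_english
)
puts "English: #{english.inspect}"
puts "Other:   #{other.inspect}"
```

Taking the predicate as a block keeps the helper independent of any particular detector, so a custom-trained `LanguageDetector` could be used in the same way.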

Demo

See a demo here.
