A small Go library that detects the language of a text.

dutchcoders/go-lang-detector

 
 

Language Detector

This Go library analyzes text and recognizes the language it is written in.

The implementation is based on the following paper:
N-Gram-Based Text Categorization
William B. Cavnar and John M. Trenkle
Environmental Research Institute of Michigan P.O. Box 134001
Ann Arbor MI 48113-4001

Language profile

A language profile is a map[string]int that maps each n-gram token to its occurrence rank. For the most frequent token 'X' of the analyzed text, map["X"] will be 1.
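
As a rough illustration of the rank mapping described above, the sketch below builds such a profile from a short text. The helper name buildProfile, the use of 1-to-maxN character n-grams, the whitespace tokenization and the alphabetical tie-break are all assumptions made for this example; langdet's own tokenization may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildProfile is a hypothetical helper, not part of langdet's API.
// It counts every n-gram of 1..maxN runes inside each word and maps
// each n-gram to its rank, where rank 1 is the most frequent token.
func buildProfile(text string, maxN int) map[string]int {
	counts := make(map[string]int)
	for _, word := range strings.Fields(strings.ToLower(text)) {
		runes := []rune(word)
		for n := 1; n <= maxN; n++ {
			for i := 0; i+n <= len(runes); i++ {
				counts[string(runes[i:i+n])]++
			}
		}
	}

	tokens := make([]string, 0, len(counts))
	for tok := range counts {
		tokens = append(tokens, tok)
	}
	// Sort by descending count; ties are broken alphabetically so the
	// result is deterministic (an assumption of this sketch).
	sort.Slice(tokens, func(i, j int) bool {
		if counts[tokens[i]] != counts[tokens[j]] {
			return counts[tokens[i]] > counts[tokens[j]]
		}
		return tokens[i] < tokens[j]
	})

	profile := make(map[string]int, len(tokens))
	for i, tok := range tokens {
		profile[tok] = i + 1
	}
	return profile
}

func main() {
	profile := buildProfile("banana bandana", 2)
	// "a" occurs most often, so it gets rank 1.
	fmt.Println(profile["a"], profile["an"], profile["n"])
}
```

Ranks rather than raw counts are stored, so profiles built from samples of different sizes stay comparable.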

Usage

Detect

Get the closest language:

The default detector supports the following languages: Arabic, English, French, German, Hebrew, Russian, Turkish

    detector := langdet.NewDefaultDetector()
    testString := "do not care about quantity"
    result := detector.GetClosestLanguage(testString)
    fmt.Println(result)

output:
    english

Set langdet.MinimumConfidence (a value from 0 to 1) to choose the accepted confidence level. For example, with 0.7 a language is returned only if langdet is at least 70% sure that it matches; otherwise 'undefined' is returned.
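
The threshold rule can be sketched in isolation as follows. This is only an illustration of the behavior described above: the match type, the closest function and the example scores are hypothetical and are not taken from langdet's code.

```go
package main

import "fmt"

// match pairs a language name with a confidence between 0 and 1.
// Hypothetical type for illustration; not langdet's own result type.
type match struct {
	Name       string
	Confidence float64
}

// closest returns the name with the highest confidence, or "undefined"
// when even the best confidence is below minConfidence.
func closest(matches []match, minConfidence float64) string {
	var best match
	found := false
	for _, m := range matches {
		if !found || m.Confidence > best.Confidence {
			best, found = m, true
		}
	}
	if !found || best.Confidence < minConfidence {
		return "undefined"
	}
	return best.Name
}

func main() {
	// Made-up scores for the example.
	matches := []match{{"english", 0.79}, {"french", 0.86}, {"german", 0.71}}
	fmt.Println(closest(matches, 0.7)) // best score 0.86 passes the threshold
	fmt.Println(closest(matches, 0.9)) // best score 0.86 is below the threshold
}
```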

Get Language Probabilities

GetClosestLanguage returns the language that most probably matches. To get the results for all analyzed languages, use GetLanguages, which returns every analyzed language together with the percentage to which it matches the input snippet.

    testString := "ont permis d'identifier"

GetLanguages returns:
    french 86 %
    english 79 %
    german 71 %
    turkish 54 %
    hebrew 39 %
    arabic 8 %
    russian 5 %
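
One way to read these percentages is as a rank-based similarity between the profile of the input and each language profile. The paper cited above compares profiles with an "out-of-place" measure; the sketch below shows that measure and one possible way to scale it to a percentage. The function names, the capped penalty and the normalization are assumptions for illustration and are not taken from langdet's code.

```go
package main

import "fmt"

// outOfPlace sums, for every token in the document profile, the distance
// between its rank there and its rank in the language profile. Tokens that
// are missing from the language profile cost maxPenalty, and each single
// distance is capped at maxPenalty (the cap is an assumption of this sketch).
func outOfPlace(doc, lang map[string]int, maxPenalty int) int {
	total := 0
	for tok, docRank := range doc {
		langRank, ok := lang[tok]
		if !ok {
			total += maxPenalty
			continue
		}
		d := docRank - langRank
		if d < 0 {
			d = -d
		}
		if d > maxPenalty {
			d = maxPenalty
		}
		total += d
	}
	return total
}

// similarity scales the distance to a percentage: 100 means all ranks
// agree, 0 means every token received the maximum penalty.
func similarity(doc, lang map[string]int, maxPenalty int) int {
	if len(doc) == 0 || maxPenalty <= 0 {
		return 0
	}
	worst := len(doc) * maxPenalty
	return 100 * (worst - outOfPlace(doc, lang, maxPenalty)) / worst
}

func main() {
	// Tiny hand-made profiles, for illustration only.
	doc := map[string]int{"e": 1, "t": 2, "th": 3}
	langA := map[string]int{"e": 1, "t": 3, "th": 2, "a": 4}
	langB := map[string]int{"a": 1, "e": 4}
	fmt.Println(similarity(doc, langA, 5), similarity(doc, langB, 5))
}
```

A language whose frequent n-grams appear at similar ranks in the input scores high, while a language that lacks most of the input's n-grams accumulates penalties and scores low.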


Use default languages

To use the default languages, place the file default_languages.json in the same directory as the binary. Alternatively, keep the file anywhere on the filesystem and initialize the detector by calling InitWithDefault with the file path.

Analyze new language

Random Wikipedia articles in the target language are ideal samples for analyzing a new language. The result is a Language object containing the specified name and the profile. Example:

    language := langdet.Analyze(text_sample, "french")
    language.Profile // language profile in form of map[string]int as defined above
    language.Name // the name that was given as parameter

Add more languages

New languages can directly be analyzed and added to a detector by providing a text sample:

    text_sample := GetTextFromFile("samples/polish.txt")
    detector.AddLanguageFrom(text_sample, "polish")

The text sample should be larger than 200 KB and can be "dirty" (special characters, lists, etc.), but it should not switch to another language for long stretches.

Alternatively, Analyze can be used and the resulting language can be added with the AddLanguage method:

    text_sample := GetTextFromFile("samples/polish.txt")
    polish := langdet.Analyze(text_sample, "polish")

    // the language can be added selectively to detectors
    detectorA.AddLanguage(polish)
    detectorC.AddLanguage(polish)

Contribution

Suggestions and bug reports can be made through GitHub issues. Contributions are welcome. There is currently no need to open an issue before contributing, but please follow the code style and include descriptive tests written with GoConvey.

License

Licensed under Apache 2.0.
