langdet - Language Detection for Go

Overview

Package langdet detects natural languages in text using a straightforward implementation of trigram-based text categorization. The most commonly used languages worldwide are supported out of the box, but the code is flexible enough to accept any set of languages.
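To make the idea of trigram-based categorization concrete, here is a minimal, standard-library-only sketch. It is not langdet's code: the helper names (trigramCounts, rankTrigrams, outOfPlace) are hypothetical, and the rank-order "out-of-place" distance is one common way to compare trigram profiles, assumed here for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramCounts counts every overlapping three-rune window in the
// lowercased text. (Hypothetical helper, not part of langdet.)
func trigramCounts(text string) map[string]int {
	runes := []rune(strings.ToLower(text))
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// rankTrigrams orders trigrams from most to least frequent,
// breaking ties alphabetically so the result is deterministic.
func rankTrigrams(counts map[string]int) []string {
	ranked := make([]string, 0, len(counts))
	for t := range counts {
		ranked = append(ranked, t)
	}
	sort.Slice(ranked, func(i, j int) bool {
		ci, cj := counts[ranked[i]], counts[ranked[j]]
		if ci != cj {
			return ci > cj
		}
		return ranked[i] < ranked[j]
	})
	return ranked
}

// outOfPlace sums, for each sample trigram, how far its rank is from
// its rank in the profile. Trigrams missing from the profile receive
// the maximum penalty len(profile). Lower means a closer match.
func outOfPlace(profile, sample []string) int {
	rank := make(map[string]int, len(profile))
	for i, t := range profile {
		rank[t] = i
	}
	dist := 0
	for i, t := range sample {
		r, ok := rank[t]
		switch {
		case !ok:
			dist += len(profile)
		case r > i:
			dist += r - i
		default:
			dist += i - r
		}
	}
	return dist
}

func main() {
	// Toy "training" texts; real profiles need far more data.
	english := rankTrigrams(trigramCounts("the quick brown fox jumps over the lazy dog and the cat"))
	dutch := rankTrigrams(trigramCounts("de snelle bruine vos springt over de luie hond en de kat"))
	sample := rankTrigrams(trigramCounts("the dog and the fox"))

	fmt.Println("distance to english profile:", outOfPlace(english, sample))
	fmt.Println("distance to dutch profile:", outOfPlace(dutch, sample))
}
```

A detector built along these lines would pick the profile with the smallest distance and could turn the distances into a confidence score; how langdet itself scores profiles is documented on pkg.go.dev.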

Langdet first detects the writing script to narrow down the number of languages to test against. Some writing scripts are used by only a single language (Korean, Greek, etc.). In that case, the language is returned directly without any trigram analysis. Otherwise, langdet matches each language profile registered under the detected writing script against the input text and returns a result set listing the languages ordered by confidence.
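The script detection step can be sketched with the standard library's unicode range tables. This is an illustration under assumptions, not langdet's implementation: the namedScript type, the scripts list, and the dominantScript function are hypothetical, and picking the script with the most matching runes is just one plausible strategy.

```go
package main

import (
	"fmt"
	"unicode"
)

// namedScript pairs a range table with a printable name.
// (Hypothetical type, not part of langdet.)
type namedScript struct {
	name  string
	table *unicode.RangeTable
}

// scripts lists the candidate writing scripts to test against.
var scripts = []namedScript{
	{"Latin", unicode.Latin},
	{"Greek", unicode.Greek},
	{"Hangul", unicode.Hangul},
	{"Cyrillic", unicode.Cyrillic},
}

// dominantScript returns the name of the candidate script that matches
// the most runes in text, or "" if no rune belongs to any candidate.
func dominantScript(text string, candidates []namedScript) string {
	counts := make([]int, len(candidates))
	for _, r := range text {
		for i, s := range candidates {
			if unicode.Is(s.table, r) {
				counts[i]++
				break
			}
		}
	}
	best, bestCount := "", 0
	for i, c := range counts {
		if c > bestCount {
			best, bestCount = candidates[i].name, c
		}
	}
	return best
}

func main() {
	fmt.Println(dominantScript("Καλημέρα κόσμε", scripts)) // prints "Greek"
	fmt.Println(dominantScript("안녕하세요", scripts))          // prints "Hangul"
	fmt.Println(dominantScript("hello world", scripts))    // prints "Latin"
}
```

When the detected script maps to exactly one language, as with Greek or Hangul above, a detector can return that language immediately; only scripts shared by several languages, such as Latin, need the trigram comparison.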

Install

go get -u github.com/askeladdk/langdet

Quickstart

Use DetectLanguage to detect the language of a string. It returns the BCP 47 language tag of the language with the highest probability. If no language was detected, the function returns language.Und.

detectedLanguage := langdet.DetectLanguage(s)

Use DetectLanguageWithOptions if you need more control. DetectLanguage is shorthand for this function called with DefaultOptions. Unlike DetectLanguage, DetectLanguageWithOptions returns a slice of Results with the probability of every language that uses the detected writing script, ordered by probability.

results := langdet.DetectLanguageWithOptions(s, langdet.DefaultOptions)

Use Options to configure the detector. Any number of writing scripts and languages can be detected by setting the Scripts and Languages fields. Use the Train function to build language profiles. Use MinConfidence and MinRelConfidence to filter languages by confidence.

// Define a custom language whose profile is trained from your own data.
myLang := langdet.Language{
    Tag:      language.Make("zz"),
    Trigrams: langdet.Train(trainingSet),
}

// Detect only the Latin script, and only these three languages under it.
options := langdet.Options{
    Scripts: []*unicode.RangeTable{
        unicode.Latin,
    },
    Languages: map[*unicode.RangeTable]langdet.Languages{
        unicode.Latin: {
            Languages: []langdet.Language{
                langdet.Dutch,
                langdet.French,
                myLang,
            },
        },
    },
}

results := langdet.DetectLanguageWithOptions(s, options)

Read the rest of the documentation on pkg.go.dev. It's easy-peasy!

License

Package langdet is released under the terms of the ISC license.