A Java program that determines the natural language of text using n-grams

daniel-keogh/language-detector

Language Detector

Y3S1 Object-Oriented Programming Assignment

Description

A Java program that compares some query text against an n-gram collection of subject text and attempts to determine the natural language of the query.
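As a minimal sketch of what an n-gram collection looks like, the example below counts every contiguous run of n characters in a string. The class and method names (`NGramSketch`, `countNGrams`) are hypothetical and not taken from this project, and how the program actually scores a query against the subject n-grams is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the project's own classes.
public class NGramSketch {

    /**
     * Counts each contiguous run of n chars in text.
     * Note: this works on UTF-16 code units, so characters outside the
     * Basic Multilingual Plane could be split across two n-grams.
     */
    public static Map<String, Integer> countNGrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            // merge() inserts 1 for a new n-gram or adds 1 to the existing count
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The bigrams of "banana" are ba, an, na, an, na
        System.out.println(countNGrams("banana", 2));
    }
}
```

Building one such map for the query and one per language in the dataset gives two sets of counts that can then be compared.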

Features

  • The query and dataset are both parsed in separate threads, with each line of the dataset file parsed in its own worker thread using an ExecutorService.

  • The query may be a text file, or it may be a string typed directly from the console menu.

  • To reduce the time and space needed to parse the dataset file, the program first tries to determine which script the query is written in. For instance, if the query is written in Cyrillic, there is no need to parse the dataset entries for languages that don't use the Cyrillic script. The script is determined by finding the most common Character.UnicodeScript among the query's characters.
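The threading and script-detection features above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's actual code: the class and method names are hypothetical, only letters are counted when choosing the script, and the per-line task simply counts character n-grams into a shared concurrent map.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the project's own classes.
public class FeatureSketch {

    /** Returns the script used by the most letters in text, or UNKNOWN if it has no letters. */
    public static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter) // assumption: skip digits, spaces and punctuation
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(UnicodeScript.UNKNOWN);
    }

    /** Submits one task per line; each task adds that line's n-grams to a shared map. */
    public static Map<String, Integer> countNGramsPerLine(List<String> lines, int n) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (String line : lines) {
            pool.execute(() -> {
                for (int i = 0; i + n <= line.length(); i++) {
                    // ConcurrentHashMap.merge is atomic, so tasks can share the map
                    counts.merge(line.substring(i, i + n), 1, Integer::sum);
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let the queued ones finish
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(dominantScript("Привет, world"));
        System.out.println(countNGramsPerLine(List.of("abab", "ab"), 2));
    }
}
```

Once the query's dominant script is known, dataset entries for languages written in other scripts can be skipped before any tasks are submitted, which is where the time and space savings come from.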

Build

Requirements

  • Java SE 11 or higher.

Download the project and run the following from inside the bin/ directory to create a JAR file.

$ jar -cf ./oop.jar *

Run

$ java -cp ./oop.jar ie.gmit.sw.Runner

Example Output

Although the application's accuracy is limited, it often detects the language of a query correctly. For instance, testing with the Japanese (jp.txt) and French (fr.txt) samples provided in the samples/ directory produces the following output.

Japanese

$ Choose WiLi dataset ('L' for Large, 'S' for Small): S
$ Enter the query text/file: samples/jp.txt
Query file entered.
Processing query...
Building subject database...
Finished processing query.
Finished building subject database.

The text appears to be written in Japanese.
Time: 0.433 (s)

French

$ Choose WiLi dataset ('L' for Large, 'S' for Small): L
$ Enter the query text/file: samples/fr.txt
Query file entered.
Processing query...
Building subject database...
Finished processing query.
Finished building subject database.

The text appears to be written in French.
Time: 34.957 (s)
