Language Detector

Y3S1 Object-Oriented Programming Assignment

Description

A Java program that compares query text against a collection of n-grams built from subject text and attempts to determine which natural language the query is written in.
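As a rough illustration of what an n-gram collection holds, the sketch below counts every contiguous run of n characters in a string. The class name `NGramSketch` and the method `ngrams` are hypothetical and are not taken from the project's source; the way the project actually stores and compares n-grams may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the project's source.
class NGramSketch {
    // Counts each contiguous run of n chars in the text.
    // Works on UTF-16 code units, so a surrogate pair may be split;
    // that simplification is an assumption of this sketch.
    static Map<String, Integer> ngrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the bigram counts for a short sample word.
        System.out.println(ngrams("banana", 2));
    }
}
```

A query and a subject text written in the same language would be expected to share many of their most frequent n-grams, which is what makes the counts useful for comparison.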

Features

The query and the dataset are parsed in separate threads, and each line of the dataset file is parsed in its own worker thread using an ExecutorService.
The query may be a text file, or it may be a string typed directly into the console menu.
To reduce the time and space needed to parse the dataset file, the program first tries to determine which script the query is written in. For instance, if the query is written in Cyrillic, there is no need to parse the dataset entries for languages that don't use the Cyrillic script. The script is determined by finding the most common Character.UnicodeScript among the query's characters.
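The script check and the per-line worker tasks described above could be sketched roughly as follows. This is a minimal sketch, not the project's code: the class `ScriptFilterSketch` and its methods are hypothetical, and filtering each dataset line by its own dominant script is an assumption made for illustration (the project may map languages to scripts differently).

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of the project's source.
class ScriptFilterSketch {
    // Returns the script that most letters in the text belong to,
    // or COMMON when the text contains no letters.
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(UnicodeScript.COMMON);
    }

    // Submits one task per dataset line and keeps the lines whose dominant
    // script matches the wanted one. Assumption: a line's own script stands
    // in for "a language that uses this script".
    static List<String> linesInScript(List<String> lines, UnicodeScript wanted)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String line : lines) {
                tasks.add(() -> dominantScript(line) == wanted);
            }
            // invokeAll returns the futures in the same order as the tasks.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < lines.size(); i++) {
                if (results.get(i).get()) {
                    kept.add(lines.get(i));
                }
            }
            return kept;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // "\u041F\u0440\u0438\u0432\u0435\u0442" is a Cyrillic sample word.
        String query = "\u041F\u0440\u0438\u0432\u0435\u0442";
        UnicodeScript script = dominantScript(query);
        List<String> dataset = List.of("Hello", query, "Bonjour");
        System.out.println(script + " -> " + linesInScript(dataset, script));
    }
}
```

Counting only letters keeps digits, punctuation and whitespace (which mostly fall under the COMMON script) from outvoting the actual writing system of the query.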

Build

Requirements

Java SE 11 or higher.

Download the project and run the following from inside the bin/ directory to create a JAR file.

$ jar -cf ./oop.jar *

Run

$ java -cp ./oop.jar ie.gmit.sw.Runner

Example Output

Although the application's accuracy is limited, it often detects the language of a query correctly. For instance, running it on the Japanese (jp.txt) and French (fr.txt) samples provided in the samples/ directory produces the following output.

Japanese

$ Choose WiLi dataset ('L' for Large, 'S' for Small): S
$ Enter the query text/file: samples/jp.txt
Query file entered.
Processing query...
Building subject database...
Finished processing query.
Finished building subject database.

The text appears to be written in Japanese.
Time: 0.433 (s)

French

$ Choose WiLi dataset ('L' for Large, 'S' for Small): L
$ Enter the query text/file: samples/fr.txt
Query file entered.
Processing query...
Building subject database...
Finished processing query.
Finished building subject database.

The text appears to be written in French.
Time: 34.957 (s)

daniel-keogh/language-detector
