The Language Detection REST Server is an HTTP server written in Julia that detects the language of text sent as HTTP PUT request data. The server responds with a JSON body containing the ISO 639-1 code of the detected language. Language detection (multiclass classification) is performed with the Margin-Infused Relaxed Algorithm (MIRA) over word and character n-grams, using MIT Lincoln Laboratory's Text.jl library (numerous tools for text processing).
- Install `Julia v0.3` from http://julialang.org/downloads/oldreleases.html. You can either use the pre-built binaries or build it from source. (Note: This package requires Julia v0.3 because some of the prerequisite packages of Text.jl are not compatible with Julia v0.4 and above.)
- Add the `julia/bin` directory to your PATH. You can either run the following commands or add them to `~/.bashrc`:

```
export JULIA_BIN=PATH_TO_JULIA/bin
export PATH=${PATH}:${JULIA_BIN}
```
- Run `install.sh` OR run the following commands:

```
if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
julia -e 'versioninfo(); Pkg.init();'
julia -e 'Pkg.clone("https://github.com/saltpork/Stage.jl"); cd(Pkg.dir("Stage")); run(`git checkout last-0.3-release`);'
julia -e 'Pkg.clone("https://github.com/mit-nlp/Ollam.jl");'
julia -e 'Pkg.clone("https://github.com/mit-nlp/Text.jl");'
julia -e 'Pkg.clone("https://github.com/trevorlewis/TextREST.jl");'
julia -e 'Pkg.build();'
```

  (Note: This will install the prerequisite packages and the TextREST package in the `$HOME/.julia/v0.3` directory.)
- Add `WikiExtractor.py` from [wikipedia-extractor](https://github.com/bwbaugh/wikipedia-extractor) into the `TextREST/data` directory:

```
wget https://raw.githubusercontent.com/bwbaugh/wikipedia-extractor/master/WikiExtractor.py -O $HOME/.julia/v0.3/TextREST/data/WikiExtractor.py
```

  (Note: This script is required to generate training data from the Wikipedia database dump.)
- To test the installation, run the following command:

```
julia --color=yes -e 'Pkg.test("TextREST");'
```
- To start the REST server, run the following commands from the `$HOME/.julia/v0.3/TextREST/test` directory:

```
cd $HOME/.julia/v0.3/TextREST/test
julia --color=yes -e 'using TextREST; bkgmodel, fextractor, model = lid_train("data/text.tsv"); server = text_rest_server(fextractor, model); run(server, host=ip"127.0.0.1", port=8000);'
```

  Please wait till you see the message `Running on http://127.0.0.1:8000 (Press CTRL+C to quit)`.
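  The same entry points can be pointed at your own training data (see the training data notes below) or bound to a different address. The sketch below only swaps parameter values in the calls shown above; `data/my_text.tsv` is a hypothetical file name used for illustration:

```
# Minimal sketch (assumptions: a custom TSV named data/my_text.tsv exists in the
# same two-column format as data/text.tsv; serve on all interfaces, port 8080).
using TextREST

bkgmodel, fextractor, model = lid_train("data/my_text.tsv")  # train the MIRA language-ID model
server = text_rest_server(fextractor, model)                 # build the REST server around it
run(server, host=ip"0.0.0.0", port=8080)                     # runs until you press CTRL+C
```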
- Open http://127.0.0.1:8000/lid in a browser. You should see a JSON array of 18 language codes that can be detected by the Language Detection REST Server.
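  The same list can be fetched from the command line. The array below is only an illustrative sketch, since the actual codes depend on the languages the model was trained on:

```
$ curl http://127.0.0.1:8000/lid
["en","fr","de", ...]
```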
- To test the language detection, run the following command from another terminal:

```
curl -X PUT -d "hello world" http://127.0.0.1:8000/lid
```
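  The server should reply with a JSON response containing the ISO 639-1 code of the detected language, which for the request above is the code for English. The exact shape of the reply shown below is only an illustrative placeholder:

```
$ curl -X PUT -d "hello world" http://127.0.0.1:8000/lid
"en"
```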
See `test/runtests.jl` for detailed usage.
- The training data file should be a TSV file with the first column containing the language of the text and the second column containing the text. The training data can be generated using `data.sh`, or you can use your own training data TSV file (an illustrative snippet of the format is shown after these notes).
- `data.sh`, located in the `TextREST/data` directory, can be used to prepare the training data for language detection.
  - It downloads the Wikipedia database XML dumps of the languages listed in the `languages.txt` file and then extracts plain text from the dumps using `WikiExtractor.py` from [wikipedia-extractor](https://github.com/bwbaugh/wikipedia-extractor).
  - Then it preprocesses the plain text files into TSV files, each containing the language of the article text, the article ID, the article URL, the article title and the article text.
  - Finally, it combines `n` random lines from all of these files into a single TSV file containing the language of the article text and the article text.
- `languages.txt`, located in the `TextREST/data` directory, contains the ISO 639-1 codes of the languages, one on each line.
- Note: The size of the files generated by `data.sh` for the 18 languages listed in `languages.txt` is as follows:

```
$ du -sch xml_bz2/ text_xml/ tsv/
32G    xml_bz2/
35G    text_xml/
32G    tsv/
97G    total
```
- Note: This script downloads the Wikipedia database XML dumps, extracts plain text from the dumps and preprocesses the plain text files into TSV files only once, and then uses these preprocessed TSV files to generate the training data. So you can delete the folders containing the Wikipedia database XML dumps (the `xml_bz2` folder) and the preprocessed plain text files (the `text_xml` folder), but keep the preprocessed TSV files in order to generate new training or test data files.
- Note: Run this script from the `TextREST/data` directory.
**Usage:** `data.sh [options]`

**Options:**
- `-l, --lang` : languages file name
- `-f, --file` : output file name
- `-n, --num` : number of lines for each file

**Example:**
```
$ ./data.sh -l=languages.txt -f=text.tsv -n=1000
```
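For illustration, the two file formats described above might look like the snippets below. These are made-up example lines, not the actual shipped files: the real `languages.txt` lists 18 codes, and the real training TSV is generated by `data.sh`.

```
$ cat languages.txt    # one ISO 639-1 code per line (illustrative subset)
en
fr
de

$ head -n 2 text.tsv   # <language code><TAB><article text> (illustrative lines)
en	The quick brown fox jumps over the lazy dog.
fr	Le renard brun saute par-dessus le chien paresseux.
```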