Language Detection REST Server using MIT Lincoln Lab’s Text.jl library

The Language Detection REST Server is an HTTP server written in Julia that detects the language of text sent as the body of an HTTP PUT request. The server replies with a JSON response containing the ISO 639-1 code of the detected language. Detection is multiclass classification with the margin-infused relaxed algorithm (MIRA) over word and character n-grams, built on MIT Lincoln Lab’s Text.jl (TEXT: Numerous tools for text processing) library.
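
For example, with a server running locally on port 8000 (as set up under Testing Your Installation below), a detection request is a single PUT whose body is the text to classify; the JSON response carries the ISO 639-1 code of the detected language:

     # send the text to classify as the PUT body; the server answers with JSON
     curl -X PUT -d "bonjour tout le monde" http://127.0.0.1:8000/lid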

Installation

  1. Install Julia v0.3 from http://julialang.org/downloads/oldreleases.html. You can either use the pre-built binaries or build it from source. (Note: This package requires Julia v0.3 because some of the prerequisite packages of Text.jl are not compatible with Julia v0.4 and above.)
  2. Add julia/bin to your PATH. You can either run the following commands or add them to ~/.bashrc:
     export JULIA_BIN=PATH_TO_JULIA/bin
     export PATH=${PATH}:${JULIA_BIN}
     
  3. Run install.sh OR run the following commands:
     if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
     julia -e 'versioninfo(); Pkg.init();'
     julia -e 'Pkg.clone("https://github.com/saltpork/Stage.jl"); cd(Pkg.dir("Stage")); run(`git checkout last-0.3-release`);'
     julia -e 'Pkg.clone("https://github.com/mit-nlp/Ollam.jl");'
     julia -e 'Pkg.clone("https://github.com/mit-nlp/Text.jl");'
     julia -e 'Pkg.clone("https://github.com/trevorlewis/TextREST.jl");'
     julia -e 'Pkg.build();'
     
    (Note: This will install the prerequisite packages and the TextREST package in the $HOME/.julia/v0.3 directory.)
  4. Add WikiExtractor.py from wikipedia-extractor (https://github.com/bwbaugh/wikipedia-extractor) into the TextREST/data directory.
     wget https://raw.githubusercontent.com/bwbaugh/wikipedia-extractor/master/WikiExtractor.py -O $HOME/.julia/v0.3/TextREST/data/WikiExtractor.py
     
    (Note: This script is required to generate training data from the Wikipedia database dump.)
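
As a quick sanity check for step 3, you can list the installed packages and confirm that Stage, Ollam, Text, and TextREST all appear (this assumes the default $HOME/.julia/v0.3 package directory):

     # print the registered and cloned packages known to Julia v0.3's package manager
     julia -e 'Pkg.status();'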

Testing Your Installation

  1. To test the installation run the following command:
     julia --color=yes -e 'Pkg.test("TextREST");'
     
  2. To start the REST server run the following commands from the $HOME/.julia/v0.3/TextREST/test directory:
     cd $HOME/.julia/v0.3/TextREST/test
     julia --color=yes -e 'using TextREST; bkgmodel, fextractor, model = lid_train("data/text.tsv"); server = text_rest_server(fextractor, model); run(server, host=ip"127.0.0.1", port=8000);'
     
     Please wait until you see the message Running on http://127.0.0.1:8000 (Press CTRL+C to quit)
  3. Open http://127.0.0.1:8000/lid in a browser (or fetch it from the command line, as shown after these steps). You should see a JSON array of the 18 language codes that the Language Detection REST Server can detect.
  4. To test the language detection run the following command from another terminal:
     curl -X PUT -d "hello world" http://127.0.0.1:8000/lid
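
The list of supported languages from step 3 can also be fetched from the command line; a plain GET on /lid returns the same JSON array of language codes:

     # no request body needed; the server answers with the JSON array of detectable languages
     curl http://127.0.0.1:8000/lid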
     

Usage

See test/runtests.jl for detailed usage.
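
The snippet below is the same flow used in the testing steps above, written out as a plain Julia script instead of a one-liner (the file name run_server.jl is arbitrary; the path data/text.tsv assumes you run it from the TextREST/test directory, as in step 2 above):

     # run_server.jl
     using TextREST

     # Train the language-identification model from a TSV file whose first column
     # is the language code and whose second column is the text.
     bkgmodel, fextractor, model = lid_train("data/text.tsv")

     # Wrap the trained feature extractor and model in a REST server and serve it
     # on localhost:8000; PUT text to /lid to get back the detected ISO 639-1 code.
     server = text_rest_server(fextractor, model)
     run(server, host=ip"127.0.0.1", port=8000)

Run it with julia --color=yes run_server.jl from the TextREST/test directory.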

Training Data File

  • The training data file must be a TSV file whose first column contains the language of the text and whose second column contains the text itself (a sample of the format is shown after this list). The training data can be generated with data.sh, or you can supply your own training data TSV file.
  • data.sh, located in the TextREST/data directory, can be used to prepare the training data for language detection.
  • It downloads the Wikipedia database XML dumps of the languages listed in the languages.txt file and then extracts plain text from the dumps using WikiExtractor.py from wikipedia-extractor (https://github.com/bwbaugh/wikipedia-extractor).
  • Then it preprocesses the plain text files into TSV files, each line containing the language of the article text, the article ID, the article URL, the article title, and the article text.
  • Finally, it combines n random lines from each of these files into a single TSV file containing only the language of the article text and the article text.
  • languages.txt, located in the TextREST/data directory, contains the ISO 639-1 codes of the languages, one per line.
  • Note: The size of the files generated by data.sh for the 18 languages listed in languages.txt is as follows:
      $ du -sch xml_bz2/ text_xml/ tsv/
      32G	xml_bz2/
      35G	text_xml/
      32G	tsv/
      97G	total
      
  • Note: This script downloads the Wikipedia database XML dumps, extracts plain text from them, and preprocesses the plain text into TSV files only once; it then reuses these preprocessed TSV files to generate the training data. You can therefore delete the folders containing the Wikipedia database XML dumps (xml_bz2) and the preprocessed plain text files (text_xml), but keep the preprocessed TSV files (tsv) if you want to generate new training or test data files.
  • Note: Run this script from the TextREST/data directory.
     **Usage:**
       data.sh [options]
     **Options:**
       -l, --lang  : languages file name
       -f, --file  : output file name
       -n, --num   : number of lines for each file
     **Example:**
       $ ./data.sh -l=languages.txt -f=text.tsv -n=1000
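
For reference, the generated training file (or one you provide yourself) is plain tab-separated text with the language code in the first column and the text in the second. The two lines below are purely illustrative:

     en	This is an example English sentence taken from a Wikipedia article.
     fr	Ceci est un exemple de phrase française tirée d'un article de Wikipédia.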
