Language Detection REST Server using MIT Lincoln Lab’s Text.jl library

The Language Detection REST Server is an HTTP server written in Julia that detects the language of text sent as the body of an HTTP PUT request. The server replies with a JSON response containing the ISO 639-1 code of the detected language. Detection is multiclass classification with the margin-infused relaxed algorithm (MIRA) over word and character n-grams, built on MIT Lincoln Lab’s Text.jl (TEXT: Numerous tools for text processing) library.
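
For example, with a server running locally on port 8000 (as set up under Testing Your Installation below), a detection request is a single PUT whose body is the text to classify; the JSON response carries the ISO 639-1 code of the detected language:

     # send the text to classify as the PUT body; the server answers with JSON
     curl -X PUT -d "bonjour tout le monde" http://127.0.0.1:8000/lid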

Installation

  1. Install Julia v0.3 from http://julialang.org/downloads/oldreleases.html. You can either use the pre-built binaries or build it from source. (Note: This package requires Julia v0.3 because some of the prerequisite packages of Text.jl are not compatible with Julia v0.4 and above.)
  2. Add julia/bin to your PATH. You can either run the following commands or add them to ~/.bashrc:
     export JULIA_BIN=PATH_TO_JULIA/bin
     export PATH=${PATH}:${JULIA_BIN}
     
  3. Run install.sh OR run the following commands:
     if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
     julia -e 'versioninfo(); Pkg.init();'
     julia -e 'Pkg.clone("https://github.com/saltpork/Stage.jl"); cd(Pkg.dir("Stage")); run(`git checkout last-0.3-release`);'
     julia -e 'Pkg.clone("https://github.com/mit-nlp/Ollam.jl");'
     julia -e 'Pkg.clone("https://github.com/mit-nlp/Text.jl");'
     julia -e 'Pkg.clone("https://github.com/trevorlewis/TextREST.jl");'
     julia -e 'Pkg.build();'
     
    (Note: This will install the prerequisite packages and the TextREST package in the $HOME/.julia/v0.3 directory.)
  4. Add WikiExtractor.py from wikipedia-extractor (https://github.com/bwbaugh/wikipedia-extractor) into the TextREST/data directory.
     wget https://raw.githubusercontent.com/bwbaugh/wikipedia-extractor/master/WikiExtractor.py -O $HOME/.julia/v0.3/TextREST/data/WikiExtractor.py
     
    (Note: This script is required to generate training data from the Wikipedia database dump.)
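
As a quick sanity check for step 3, you can list the installed packages and confirm that Stage, Ollam, Text, and TextREST all appear (this assumes the default $HOME/.julia/v0.3 package directory):

     # print the registered and cloned packages known to Julia v0.3's package manager
     julia -e 'Pkg.status();'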

Testing Your Installation

  1. To test the installation run the following command:
     julia --color=yes -e 'Pkg.test("TextREST");'
     
  2. To start the REST server run the following commands from the $HOME/.julia/v0.3/TextREST/test directory:
     cd $HOME/.julia/v0.3/TextREST/test
     julia --color=yes -e 'using TextREST; bkgmodel, fextractor, model = lid_train("data/text.tsv"); server = text_rest_server(fextractor, model); run(server, host=ip"127.0.0.1", port=8000);'
     
     Please wait until you see the message Running on http://127.0.0.1:8000 (Press CTRL+C to quit)
  3. Open http://127.0.0.1:8000/lid in a browser (or fetch it from the command line, as shown after these steps). You should see a JSON array of the 18 language codes that the Language Detection REST Server can detect.
  4. To test the language detection run the following command from another terminal:
     curl -X PUT -d "hello world" http://127.0.0.1:8000/lid
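
The list of supported languages from step 3 can also be fetched from the command line; a plain GET on /lid returns the same JSON array of language codes:

     # no request body needed; the server answers with the JSON array of detectable languages
     curl http://127.0.0.1:8000/lid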
     

Usage

See test/runtests.jl for detailed usage.
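
The snippet below is the same flow used in the testing steps above, written out as a plain Julia script instead of a one-liner (the file name run_server.jl is arbitrary; the path data/text.tsv assumes you run it from the TextREST/test directory, as in step 2 above):

     # run_server.jl
     using TextREST

     # Train the language-identification model from a TSV file whose first column
     # is the language code and whose second column is the text.
     bkgmodel, fextractor, model = lid_train("data/text.tsv")

     # Wrap the trained feature extractor and model in a REST server and serve it
     # on localhost:8000; PUT text to /lid to get back the detected ISO 639-1 code.
     server = text_rest_server(fextractor, model)
     run(server, host=ip"127.0.0.1", port=8000)

Run it with julia --color=yes run_server.jl from the TextREST/test directory.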

Training Data File

  • The training data file must be a TSV file whose first column contains the language of the text and whose second column contains the text itself (a sample of the format is shown after this list). The training data can be generated with data.sh, or you can supply your own training data TSV file.
  • data.sh, located in the TextREST/data directory, can be used to prepare the training data for language detection.
  • It downloads the Wikipedia database XML dumps of the languages listed in the languages.txt file and then extracts plain text from the dumps using WikiExtractor.py from wikipedia-extractor (https://github.com/bwbaugh/wikipedia-extractor).
  • Then it preprocesses the plain text files into TSV files, each line containing the language of the article text, the article ID, the article URL, the article title, and the article text.
  • Finally, it combines n random lines from each of these files into a single TSV file containing only the language of the article text and the article text.
  • languages.txt, located in the TextREST/data directory, contains the ISO 639-1 codes of the languages, one per line.
  • Note: The size of the files generated by data.sh for the 18 languages listed in languages.txt is as follows:
      $ du -sch xml_bz2/ text_xml/ tsv/
      32G	xml_bz2/
      35G	text_xml/
      32G	tsv/
      97G	total
      
  • Note: This script downloads the Wikipedia database XML dumps, extracts plain text from them, and preprocesses the plain text into TSV files only once; it then reuses these preprocessed TSV files to generate the training data. You can therefore delete the folders containing the Wikipedia database XML dumps (xml_bz2) and the preprocessed plain text files (text_xml), but keep the preprocessed TSV files (tsv) if you want to generate new training or test data files.
  • Note: Run this script from the TextREST/data directory.
     **Usage:**
       data.sh [options]
     **Options:**
       -l, --lang  : languages file name
       -f, --file  : output file name
       -n, --num   : number of lines for each file
     **Example:**
       $ ./data.sh -l=languages.txt -f=text.tsv -n=1000
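
For reference, the generated training file (or one you provide yourself) is plain tab-separated text with the language code in the first column and the text in the second. The two lines below are purely illustrative:

     en	This is an example English sentence taken from a Wikipedia article.
     fr	Ceci est un exemple de phrase française tirée d'un article de Wikipédia.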
