Text processing library for natural language processing.

eedsp/tlx

tlx

A library for tokenizing UTF-8 text using regular expressions.

Requirements

  • CMake -- Build, test and package software
  • PCRE2 -- Perl Compatible Regular Expressions
  • spdlog -- Super fast C++ logging library
  • ICU -- Library for Unicode and Globalization
  • Jansson -- C library for working with JSON
  • APR -- Apache Portable Runtime
  • APR-util -- Apache Portable Runtime Utility

Build

mkdir ./build
cd ./build
cmake ..
make    # append -j to build in parallel

Testing

Text tokenizer

Create a dictionary for text segmentation

cd ./scripts
./build.dic.sh

# verify that the dictionary files were created
ls ../db
   phrase.db       token.sgmt.db

Import dictionary files into shared memory

cd ./scripts
./db.import.sh

Run the tests

cd ./tests

# test the tokenizer and extract phrase patterns from text
./test.phrase.sh

# test the tokenizer only
./test.token.sh

Remove dictionary from shared memory

cd ./scripts
./db.free.sh
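
The steps above can be chained so that the dictionary is always removed from shared memory, even when a test fails. The following is a minimal sketch, not part of tlx: the function name `run_tlx_tests` is hypothetical, and it assumes that each `cd` in the steps above starts from the repository root and that the scripts exit with a non-zero status on failure.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with tlx): run the full test cycle.
# Assumes scripts/ and tests/ sit directly under the repository root.
run_tlx_tests() {
    # Resolve the repository root (default: current directory) to an absolute path.
    root=$(cd "${1:-.}" && pwd) || return 1

    # Build the dictionary and load it into shared memory; stop if either step fails.
    (cd "$root/scripts" && ./build.dic.sh && ./db.import.sh) || return 1

    # Run the test scripts in order, remembering a failure instead of returning early.
    rc=0
    (cd "$root/tests" && ./test.phrase.sh && ./test.token.sh) || rc=$?

    # Always release the shared memory, even if a test failed.
    (cd "$root/scripts" && ./db.free.sh) || rc=$?
    return "$rc"
}
```

Called as `run_tlx_tests .` from the repository root, the function returns the exit status of the first failing step, or 0 if everything succeeded.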

License

Licensed under the Apache-2.0 license.
