Text Preprocessors aims to implement text preprocessing algorithms that work reliably, even when ported from other libraries, and that use the best available statistical methods. Text mining algorithms have the drawback of often being abstract and subjective, and existing libraries tend to implement some tasks well and others without much care.
The tool also aims to make these algorithms distributable via Spark using mapPartitions.
These are not SimplrTeks/SimplrTerms trade secrets, just the best ways to prepare data for use in the mining process.
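
As a rough illustration of the per-partition approach, the sketch below shows a cleanup function shaped for Spark's `mapPartitions`: it takes an iterator over the records of one partition and yields cleaned records. It is written in Python and is not this tool's actual code. The names `replace_numbers`, `remove_punctuation`, `clean_partition`, and the `NUM` placeholder token are hypothetical, chosen only for this example.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Matches integers and numbers with internal separators, e.g. 42, 3.50, 1,000.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def replace_numbers(text, token="NUM"):
    """Replace each number with a placeholder token (hypothetical default)."""
    return _NUMBER.sub(token, text)


def remove_punctuation(text):
    """Strip ASCII punctuation from the text."""
    return text.translate(_PUNCT_TABLE)


def clean_partition(lines):
    """Clean every line of one partition.

    Numbers are replaced first so that "3.50" becomes a single token
    before the decimal point is stripped as punctuation.
    """
    for line in lines:
        yield remove_punctuation(replace_numbers(line))


# Local check without Spark: a plain iterator stands in for a partition.
cleaned = list(clean_partition(iter(["Pay $3.50, now!", "In 1,000 years."])))
print(cleaned)

# With PySpark available (an assumption), the same function could be passed as:
#     cleaned_rdd = lines_rdd.mapPartitions(clean_partition)
```

Working on a whole partition at a time means any expensive setup, such as loading a model, can be done once per partition rather than once per record.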
This tool includes the following, whether custom, ported, or wrapped:
- Smoothing (triangular, rectangular, simple exponential with moving average, Hamming window based, Hanning window based)
- Named Entity extraction and replacement using Epic*
- Text segmentation ported from Python NLTK, with more custom smoothing and a few changes *
- Punctuation Removal
- Number replacement
- SimplrTerms similarity-based word replacement (also generates a word replacement model)*
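
The smoothing options listed above can be sketched as weighted moving averages over a sequence of scores. The example below is a minimal Python illustration using the standard window formulas, not this tool's implementation. The function names, the edge handling (truncating the window and renormalizing), and the reading of "simple exponential" as basic exponential smoothing with a factor `alpha` are all assumptions.

```python
import math


def window_weights(kind, size):
    """Return unnormalized weights for a window of the given kind and size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if size == 1:
        return [1.0]
    if kind == "rectangular":
        return [1.0] * size
    if kind == "triangular":
        center = (size - 1) / 2
        return [1.0 - abs(n - center) / ((size + 1) / 2) for n in range(size)]
    if kind == "hamming":
        return [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))
                for n in range(size)]
    if kind == "hanning":
        return [0.5 - 0.5 * math.cos(2 * math.pi * n / (size - 1))
                for n in range(size)]
    raise ValueError("unknown window kind: " + kind)


def smooth(values, kind="rectangular", size=3):
    """Weighted moving average; the window is truncated at the edges."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    weights = window_weights(kind, size)
    half = size // 2
    out = []
    for i in range(len(values)):
        total = 0.0
        norm = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(values):
                total += w * values[j]
                norm += w
        # Fall back to the raw value if every in-range weight is zero.
        out.append(total / norm if norm else values[i])
    return out


def exponential_smooth(values, alpha=0.5):
    """Simple exponential smoothing: s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    out = []
    for x in values:
        out.append(x if not out else alpha * x + (1 - alpha) * out[-1])
    return out


scores = [0.0, 4.0, 0.0, 2.0, 2.0]
print(smooth(scores, "triangular", 3))
print(exponential_smooth(scores, alpha=0.5))
```

Note that a Hanning window has zero weight at its endpoints, so a size of 3 leaves the input unchanged; a size of 5 or more is needed for it to have any smoothing effect.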