asevans48/TextPreprocessors

Goals

Text Preprocessors aims to implement text-preprocessing algorithms that work reliably, even when they are ported from other libraries, and that use the best available statistical methods. Text mining algorithms are often abstract and subjective, and existing libraries tend to implement some tasks well and others with little care.

The tool also aims to make these algorithms distributable on Spark by running them through mapPartitions.
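As a rough illustration of the partition-wise pattern, the sketch below defines a function that consumes an iterator of lines and yields cleaned lines, which is the shape Spark's mapPartitions expects. It is written in Python for brevity and is not taken from this repository; the names clean_line, clean_partition and NUMBER_TOKEN, the placeholder value, and the number pattern are all assumptions chosen for the example.

```python
import re
import string
from typing import Iterable, Iterator

# Hypothetical placeholder; the project's actual replacement token is not documented here.
NUMBER_TOKEN = "NUM"

# Matches integers and simple decimals such as 42 or 3.50 (an assumed definition of "number").
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_line(line: str) -> str:
    """Replace numbers with a placeholder, then strip ASCII punctuation."""
    replaced = _NUMBER_RE.sub(NUMBER_TOKEN, line)
    return replaced.translate(_PUNCT_TABLE)


def clean_partition(lines: Iterable[str]) -> Iterator[str]:
    """Process one partition lazily: take an iterator of lines, yield cleaned lines."""
    for line in lines:
        yield clean_line(line)


if __name__ == "__main__":
    # Stand-in for a single partition; no Spark installation is needed to run this.
    partition = iter(["It cost 3.50, really!", "42 cats; 7 dogs."])
    for cleaned in clean_partition(partition):
        print(cleaned)
```

With PySpark available, a function of this shape could be passed as rdd.mapPartitions(clean_partition), so any per-partition setup (such as loading a model) happens once per partition rather than once per record.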

The algorithms here are not SimplrTeks/SimplrTerms trade secrets, just the best ways to prepare data for use in the mining process.

Implementations and Algorithms

This tool includes the following, whether custom, ported, or wrapped:

  • Smoothing (triangular, rectangular, simple exponential with moving average, Hamming window based, Hanning window based)
  • Named entity extraction and replacement using Epic *
  • Text segmentation ported from Python NLTK, with more custom smoothing and a few changes *
  • Punctuation removal
  • Number replacement
  • SimplrTerms similarity-based word replacement (also generates a word replacement model) *
* indicates implementations that can write files to a non-HDFS file system (most return an RDD[String])
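To make the smoothing options more concrete, here is a minimal Python sketch of window-based smoothing over a numeric series (for example, per-gap similarity scores in text segmentation). It is not the repository's implementation: the function names, the odd-window restriction, the edge handling (repeating the boundary value) and the plain exponential recurrence are assumptions made for illustration. The Hamming and Hanning weights use the standard cosine-window formulas.

```python
import math
from typing import List, Sequence


def window_weights(kind: str, size: int) -> List[float]:
    """Return weights for an odd-sized smoothing window, normalized to sum to 1."""
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    if size == 1:
        return [1.0]
    n = size - 1
    if kind == "rectangular":
        raw = [1.0] * size
    elif kind == "triangular":
        half = size // 2
        raw = [float(half + 1 - abs(i - half)) for i in range(size)]
    elif kind == "hamming":
        raw = [0.54 - 0.46 * math.cos(2 * math.pi * i / n) for i in range(size)]
    elif kind == "hanning":
        raw = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(size)]
    else:
        raise ValueError(f"unknown window kind: {kind}")
    total = sum(raw)
    return [w / total for w in raw]


def smooth(values: Sequence[float], kind: str = "rectangular", size: int = 3) -> List[float]:
    """Weighted moving average; indices past either end reuse the boundary value."""
    weights = window_weights(kind, size)
    half = size // 2
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = min(max(i + k - half, 0), last)  # clamp to the valid index range
            acc += w * values[j]
        out.append(acc)
    return out


def exponential_smooth(values: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out: List[float] = []
    for x in values:
        out.append(float(x) if not out else alpha * x + (1 - alpha) * out[-1])
    return out


if __name__ == "__main__":
    scores = [0.0, 0.0, 4.0, 0.0, 0.0]
    print(smooth(scores, "triangular", 3))
    print(exponential_smooth(scores, 0.5))
```

Normalizing the weights keeps the smoothed series on the same scale as the input, which makes the different window shapes directly comparable.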

About

Text segmentation algorithms (initially including C99) and the associated preprocessors, presented by SimplrTek.
