GitHub - asevans48/TextPreprocessors: Text segmentation algorithm (including C99 at the outset) and the associated pre-processors presented by SimplrTek.

Goals

TextPreprocessors aims to implement algorithms that work correctly even when they are ported, using the best available statistical methods. Text mining algorithms are often abstract and subjective, and existing libraries tend to implement some tasks well and others without much care.

The tool also aims to make these algorithms distributable via Spark using map partitions.
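
As a rough illustration of the map-partitions approach, the sketch below defines a function that processes one partition of text lines at a time. It is written in Python rather than taken from this repository; `clean_partition` and `NUMBER_TOKEN` are hypothetical names, and the cleaning rules (replace numbers with a token, then strip ASCII punctuation) are assumptions chosen only to keep the example small.

```python
import re
import string
from typing import Iterable, Iterator

# Hypothetical placeholder token; the value the project actually uses is not documented here.
NUMBER_TOKEN = "NUM"

# Integers and simple decimals such as 42 or 3.14.
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_partition(lines: Iterable[str]) -> Iterator[str]:
    """Clean every line of one partition, yielding results lazily."""
    for line in lines:
        # Replace numbers first so decimal points are consumed
        # before punctuation is stripped.
        line = _NUMBER_RE.sub(NUMBER_TOKEN, line)
        yield line.translate(_PUNCT_TABLE)


# With PySpark, this function could be passed to an RDD of strings:
#     cleaned = rdd.mapPartitions(clean_partition)
# Locally, any iterable of strings can stand in for a partition.
if __name__ == "__main__":
    print(list(clean_partition(["Pi is 3.14, roughly!", "No digits here."])))
```

Working on a whole partition rather than on single records lets setup costs, such as compiling patterns or loading a model, be paid once per partition instead of once per line.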

These are not SimplrTek/SimplrTerms trade secrets, just the best ways to prepare data for use in the mining process.

Implementations and Algorithms

This tool includes the following, whether custom, ported, or wrapped:

Smoothing (triangular, rectangular, simple exponential with moving average, Hamming window based, Hanning window based)
Named Entity extraction and replacement using Epic*
Text segmentation ported from Python NLTK, with more custom smoothing and a few changes*
Punctuation Removal
Number replacement
SimplrTerms similarity based word replacement (also generates a word replacement model)*

* indicates where files can be written to a non-HDFS file system (most return an RDD[String])
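
The smoothing options above are listed by name only. The following Python sketch shows one common way that window-based smoothers of this kind can be defined, as a weighted moving average over a score sequence. It is not this repository's code: the function names are hypothetical, the Hamming and Hanning weights use the standard textbook formulas, and the edge handling (truncate the window and renormalize) is an assumption. The exponential variant is not covered.

```python
import math
from typing import List, Sequence


def rectangular(n: int) -> List[float]:
    # Equal weights: a plain moving average.
    return [1.0] * n


def triangular(n: int) -> List[float]:
    # Weights rise linearly to the center and fall symmetrically,
    # e.g. n=5 gives 1, 2, 3, 2, 1.
    return [float(min(k + 1, n - k)) for k in range(n)]


def hamming(n: int) -> List[float]:
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def hanning(n: int) -> List[float]:
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def smooth(values: Sequence[float], window: Sequence[float]) -> List[float]:
    """Weighted moving average centered on each point.

    Near the edges the window is truncated and the remaining
    weights are renormalized (an assumed edge policy).
    """
    if len(window) % 2 == 0:
        raise ValueError("window length must be odd")
    half = len(window) // 2
    out = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for j, w in enumerate(window):
            idx = i + j - half
            if 0 <= idx < len(values):
                total += w * values[idx]
                weight_sum += w
        # Fall back to the raw value if every usable weight is zero.
        out.append(total / weight_sum if weight_sum else float(values[i]))
    return out


if __name__ == "__main__":
    scores = [0.0, 0.0, 4.0, 0.0, 0.0]
    print(smooth(scores, rectangular(3)))
    print(smooth(scores, triangular(3)))
    print(smooth(scores, hamming(5)))
```

Separating the window shape from the averaging loop means a new smoother only needs a new weight function.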

