This software is the second part of a triad of tools for analyzing strings. The first part handles normalization, the second (this one) the representation level of text, and the third defines equality and distance measures used to classify text representations.
- Input: a string representing text (Latin and polytonic Greek)
- Decomposition: split the string into syllables, n-grams, padded n-grams, letter selections, head-body-tail (see the n-gram sketch after this list)
- Transformation: letter transforms, word selections, stop-word masking, vector/co-occurrence representations
- Shallow neighbourhood representation
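As a taste of the decomposition step, here is a minimal sketch of a letter n-gram builder with optional padding. It is not the textdecomp.py API; the name `letter_ngrams` and its parameters are hypothetical.

```python
# A minimal sketch of the letter n-gram idea, assuming nothing about the
# real textdecomp.py interface: letter_ngrams and pad_char are hypothetical.
def letter_ngrams(text, n=3, pad_char=None):
    """Return all letter n-grams of text; pad both ends when pad_char is set."""
    if pad_char is not None:
        text = pad_char * (n - 1) + text + pad_char * (n - 1)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(letter_ngrams("logos", 3))                # ['log', 'ogo', 'gos']
print(letter_ngrams("logos", 3, pad_char="_"))  # ['__l', '_lo', 'log', 'ogo', 'gos', 'os_', 's__']
```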
To get a notion of what the software is good for and how to use it, visit this page:
http://ecomparatio.net/~khk/NORM-DECOMP-DIST/zerl.html
To run a minimal Python test, call on the command line: python3 textdecomp.py
- ngramWhole: letter n-grams for the entire string
- ngramWords: letter n-grams at word level, with or without padding
- genngram: general n-gram builder
- skipgram: general skip-gram builder
- silben: pseudo-syllables
- ohneKon: string without consonants
- ohnVoka: string without vowels
- justKLEIN: string of just the stop words
- justGROSZ: string with stop words removed
- earasegram: skip-gram over justKLEIN or justGROSZ
- toKKC: every word split into head, body, and tail of equal length
- toKKCnSufixWords: every word as an array of possible head, body, and tail letter sequences
- fnb: shallow neighbourhood representation
- gettransformarray: get an array of a string with letter transformations
- pseudosyntagma: larger tokens from the longest possible sequence of words
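The sketches below illustrate a few of the operations listed above in plain Python. All names, parameters, and word lists in them are assumptions for illustration, not the textdecomp.py interface. First, a general skip-gram builder in the spirit of skipgram:

```python
from itertools import combinations

# A hedged sketch of a general skip-gram builder; the real signature of
# skipgram in textdecomp.py may differ.
def skipgrams(tokens, n=2, k=1):
    """Return n-grams whose elements may skip up to k tokens in total."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        # number of skipped positions = covered span minus the n picked tokens
        if idx[-1] - idx[0] - (n - 1) <= k:
            grams.append(tuple(tokens[i] for i in idx))
    return grams

print(skipgrams(list("abcd"), n=2, k=1))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```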
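Next, one possible reading of the equal head/body/tail separation described for toKKC:

```python
# A hypothetical equal-thirds split as described for toKKC; how
# textdecomp.py treats very short words is not specified here.
def head_body_tail(word):
    """Split a word into head, body, and tail of roughly equal length."""
    third = max(1, len(word) // 3)
    return word[:third], word[third:len(word) - third], word[len(word) - third:]

print(head_body_tail("representation"))  # ('repr', 'esenta', 'tion')
```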
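Finally, letter selections and stop-word masking in the spirit of ohneKon, ohnVoka, justKLEIN, and justGROSZ:

```python
# Sketches only: the vowel set and the stop-word list are illustrative
# assumptions (a polytonic Greek vowel set would of course differ).
VOWELS = set("aeiouAEIOU")
STOPWORDS = {"the", "and", "of"}  # hypothetical stop-word list

def without_consonants(s):   # cf. ohneKon
    return "".join(c for c in s if c in VOWELS or not c.isalpha())

def without_vowels(s):       # cf. ohnVoka
    return "".join(c for c in s if c not in VOWELS)

def only_stopwords(s):       # cf. justKLEIN
    return " ".join(w for w in s.split() if w.lower() in STOPWORDS)

def no_stopwords(s):         # cf. justGROSZ
    return " ".join(w for w in s.split() if w.lower() not in STOPWORDS)

text = "the analysis of strings"
print(without_vowels(text))  # 'th nlyss f strngs'
print(no_stopwords(text))    # 'analysis strings'
```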
Include textnorm.py if you use this in your Python 3 software.
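A minimal import sketch, assuming textdecomp.py and textnorm.py lie next to your own script; no further API is assumed here:

```python
import textnorm    # normalization, part one of the triad
import textdecomp  # this module
```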