Python package for subregular language processing

Disclaimer: the first version of the package is currently under development. For comments, questions, or suggestions, please email alena.aksenova@stonybrook.edu.

Readme updated: March 29, 2018.

This repository contains a package for subregular language processing. Below is a brief overview of what it consists of (for now) and how to use it. To get access to the language classes and the methods defined for them, run the following commands in the terminal after downloading the folder slp. Make sure you are running Python 3.6 or newer.

$ cd ~/slp
$ python3
>>> from main import *

Tools for strictly local (SL) languages

SL languages are generated by SL grammars. The core idea behind SL grammars is to allow or prohibit substrings up to a certain length in the generated language. Grammars that list prohibited substrings are negative, and grammars that list allowed substrings are positive.
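
As a plain-Python illustration of this idea (independent of the package's own classes and methods), a word is well-formed with respect to a positive SL-2 grammar if every bigram of the edge-marked word is listed in the grammar; with respect to a negative grammar, no listed bigram may occur. The function names below are hypothetical and used only for this sketch:

    # Plain-Python illustration of SL-2 membership; not the package's implementation.

    def bigrams(word, left=">", right="<"):
        """All adjacent symbol pairs of the edge-marked word."""
        marked = left + word + right
        return [(marked[i], marked[i + 1]) for i in range(len(marked) - 1)]

    def fits_positive(word, grammar):
        # every bigram must be licensed by the positive grammar
        return all(b in grammar for b in bigrams(word))

    def fits_negative(word, grammar):
        # no bigram may be listed in the negative grammar
        return not any(b in grammar for b in bigrams(word))

    pos = {('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')}
    print(fits_positive("abab", pos))   # True
    print(fits_positive("aba", pos))    # False: ('a', '<') is not licensed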

Initialize positive or negative SL grammar:

    >>> posSL = PosSL()
    >>> negSL = NegSL()

Set the grammar, the data, the edge symbols in use, or the n-gram length k:

    >>> posSL.grammar = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]
    >>> posSL.data = ["ab", "ababab"]
    >>> posSL.edges = [">", "<"]
    >>> posSL.k = 2

Extract grammar given the data

In order to extract the grammar automatically, one needs data from the language. For this purpose, use the learn method provided for the SL classes.

    >>> posSL.data = ["ab", "ababab"]
    >>> posSL.learn()
    >>> posSL.grammar
    [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')] 
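
For k = 2, the learner essentially collects all bigrams that occur in the edge-marked training words. The snippet below is a simplified sketch of this idea, not the package's own code:

    # Sketch of 2-local grammar extraction from data; a simplification shown
    # only to illustrate what learn() computes for k = 2.

    def extract_bigrams(data, left=">", right="<"):
        grammar = set()
        for word in data:
            marked = left + word + right
            grammar.update(zip(marked, marked[1:]))
        return sorted(grammar)

    print(extract_bigrams(["ab", "ababab"]))
    # [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]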

Generating a data sample

The generate_sample method generates a data sample. By default, it generates 10 words and does not avoid repetitions, but both settings can be changed.

Arguments:

  • n:int = 10 (number of words to be generated)
  • rep:bool = True (whether repetitions are allowed in the sample)

    >>> posSL.grammar = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]
    >>> posSL.generate_sample(n=5, rep=False)
    >>> posSL.data_sample
    ['>ababab<', '>abab<', '>ab<', '>abababab<', '>abababababab<']
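
Conceptually, such a sample can be produced by randomly walking over the licensed bigrams until the right edge is reached. The sketch below illustrates this idea only; generate_sample's actual implementation may differ, and the function name random_word is hypothetical:

    # Sketch of sampling by a random walk over a positive 2-local grammar.
    import random

    def random_word(grammar, left=">", right="<", max_len=20):
        word, current = "", left
        while len(word) < max_len:
            options = [y for (x, y) in grammar if x == current]
            if not options:                 # dead end: no licensed continuation
                return None
            current = random.choice(options)
            if current == right:            # reached the right edge: word is done
                return left + word + right
            word += current
        return None                         # gave up: the word grew too long

    g = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]
    print(random_word(g))   # e.g. >abab< (varies between runs)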

Scanning a word

Given a word, the scan method tells whether or not it can be generated by the given grammar.

    >>> posSL.grammar = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')] 
    >>> posSL.scan("aba")
    False
    >>> posSL.scan("abab")
    True

Cleaning the grammar

If the given grammar contains useless n-grams, the clean method detects and removes them.

    >>> posSL.grammar = [(">", "a"), ("a", "b"), ("b", "a"), ("b", "<"), (">", "c"), ("c", "d"), 
        ("e", "f"), ("f", "<")]
    >>> posSL.clean()
    >>> posSL.grammar
    [('b', '<'), ('a', 'b'), ('>', 'a'), ('b', 'a')]
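
Conceptually, a 2-gram is useless if it cannot occur on any path from the left edge > to the right edge <. Below is a rough sketch of such a reachability check; the clean method's actual implementation may differ, and the function name prune_useless is hypothetical:

    # Sketch of the reachability idea behind clean() for 2-grams.

    def prune_useless(grammar, left=">", right="<"):
        forward = {left}       # symbols reachable from the start edge
        backward = {right}     # symbols from which the end edge is reachable
        changed = True
        while changed:
            changed = False
            for x, y in grammar:
                if x in forward and y not in forward:
                    forward.add(y)
                    changed = True
                if y in backward and x not in backward:
                    backward.add(x)
                    changed = True
        # keep only bigrams that can occur on some path from > to <
        return [(x, y) for (x, y) in grammar if x in forward and y in backward]

    g = [(">", "a"), ("a", "b"), ("b", "a"), ("b", "<"),
         (">", "c"), ("c", "d"), ("e", "f"), ("f", "<")]
    print(prune_useless(g))
    # [('>', 'a'), ('a', 'b'), ('b', 'a'), ('b', '<')]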

Changing polarity of the grammar

If the polarity of the grammar needs to be changed (positive to negative, or vice versa), use the change_polarity method. It changes both the grammar and the class of the grammar object.

    >>> posSL.grammar = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')] 
    >>> negSL = posSL
    >>> negSL.change_polarity()
    >>> negSL.grammar
    [('b', 'b'), ('a', 'a'), ('a', '<'), ('>', '<'), ('>', 'b')]
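
For k = 2, switching polarity amounts to complementing the grammar with respect to the set of bigrams that can in principle occur in edge-marked words. The sketch below illustrates this idea and is not the package's implementation; the alphabet is passed explicitly here only for the sake of the illustration:

    # Sketch of polarity switching for 2-grams: the negative grammar lists every
    # in-principle-possible bigram that the positive grammar does not contain.

    def complement_bigrams(grammar, alphabet, left=">", right="<"):
        firsts = [left] + list(alphabet)     # '<' can never start a bigram
        seconds = list(alphabet) + [right]   # '>' can never end one
        possible = {(x, y) for x in firsts for y in seconds}
        return sorted(possible - set(grammar))

    pos = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]
    print(complement_bigrams(pos, ["a", "b"]))
    # [('>', '<'), ('>', 'b'), ('a', '<'), ('a', 'a'), ('b', 'b')]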

Corresponding Finite State Machine

In order to build a Finite State Machine (FSM) corresponding to a given grammar, use the fsmize method. The transitions can then be accessed as posSL.fsm.transitions.

    >>> posSL.grammar = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]
    >>> posSL.fsmize()
    >>> posSL.fsm.transitions
    [(('>',), 'a', ('a',)), (('a',), 'b', ('b',)), (('b',), '<', ('<',)), (('b',), 'a', ('a',))]
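
Each transition is a triple of the form (source state, symbol, target state), where for k = 2 a state records the previously seen symbol. The sketch below shows how such triples could be stepped through to accept or reject a word; it is an illustration only, not the package's scan method, and the function name accepts is hypothetical:

    # Illustration only: stepping through the transition triples shown above.

    def accepts(transitions, word, left=">", right="<"):
        state = (left,)                       # start in the state after the left edge
        for symbol in word + right:           # feed the word plus the right edge
            for src, label, dst in transitions:
                if src == state and label == symbol:
                    state = dst
                    break
            else:
                return False                  # no transition for this symbol
        return True

    fsm = [(('>',), 'a', ('a',)), (('a',), 'b', ('b',)),
           (('b',), '<', ('<',)), (('b',), 'a', ('a',))]
    print(accepts(fsm, "abab"))   # True
    print(accepts(fsm, "aba"))    # False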

Tools for tier-based strictly local (TSL) languages

Tier-based strictly local (TSL) languages are generated by TSL grammars; see (Heinz et al. 2011; Jardine and Heinz 2016). For this type of language, the notion of a tier is crucial. For every word, its tier image is obtained by applying the erasing function, which removes all non-tier symbols from the word. This approach makes it possible to view long-distance processes as local ones by ignoring all the intervening material.
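
A minimal sketch of such an erasing function (tier projection), shown only as an illustration and not as the package's internal code:

    # Tier projection: erase everything that is not a tier symbol.

    def tier_image(word, tier):
        return "".join(symbol for symbol in word if symbol in tier)

    print(tier_image("aaaabaaaa", {"b"}))    # prints: b
    print(tier_image("abcabc", {"b", "c"}))  # prints: bcbc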

Initialize positive TSL grammar:

    >>> posTSL = PosTSL()

Extract grammar given the data

In order to extract the grammar automatically, one needs data from the language. For this purpose, use the learn method provided for the TSL classes.

The algorithm used in this package for extracting tier-based strictly k-local grammars from data is taken from (Jardine and McMullin 2017).

    >>> posTSL.data = ["aaaaab", "baaaa", "aaaabaaaa", "b"]
    >>> posTSL.learn()
    >>> posTSL.grammar
    [('b', '<'), ('>', 'b')]

Generating a data sample

The generate_sample method generates a data sample. By default, it generates 10 words and does not avoid repetitions, but both settings can be changed.

Arguments:

  • n:int = 10 (number of words to be generated)
  • rep:bool = True (whether repetitions are allowed in the sample)

    >>> posTSL.tier = ["b"]
    >>> posTSL.alphabet = ["a", "b"]
    >>> posTSL.grammar = [('b', '<'), ('>', 'b')]
    >>> posTSL.generate_sample(n=5, rep=False)
    >>> posTSL.data_sample
    ['>aabaa<', '>baa<', '>b<', '>abaaa<', '>aaabaaa<']

Scanning a word

Given a word, the scan method tells whether or not it can be generated by the given grammar.

    >>> posTSL.tier = ["b"]
    >>> posTSL.alphabet = ["a", "b"]
    >>> posTSL.grammar = [('b', '<'), ('>', 'b')]
    >>> posTSL.scan("aaaaba")
    True
    >>> posTSL.scan("aaabaacaa")
    False
    >>> posTSL.scan("aaa")
    False

Cleaning the grammar

If the given grammar contains useless n-grams, the clean method detects and removes them. It works in the same way as its SL counterpart.

    >>> posTSL.grammar = [(">", "a"), ("a", "b"), ("b", "a"), ("b", "<"), (">", "c"), ("c", "d"), 
        ("e", "f"), ("f", "<")]
    >>> posTSL.clean()
    >>> posTSL.grammar
    [('b', '<'), ('a', 'b'), ('>', 'a'), ('b', 'a')]

Changing polarity of the grammar

If the polarity of the grammar needs to be changed (positive to negative, or vice versa), use the change_polarity method. It changes both the grammar and the class of the grammar object.

    >>> posTSL.grammar = [('>', 'b'), ('b', '<')]
    >>> negTSL = posTSL
    >>> negTSL.change_polarity()
    >>> negTSL.grammar
    [('>', '<')]

Corresponding Finite State Machine

In order to build a Finite State Machine (FSM) corresponding to a given grammar, use the fsmize method. The resulting FSM corresponds to a TSL grammar and ignores non-tier symbols. The transitions can then be accessed as posTSL.fsm.transitions.

    >>> posTSL.grammar = [('>', 'a'), ('a', 'b'), ('b', '<'), ('b', 'a')]
    >>> posTSL.fsmize()
    >>> posTSL.fsm.transitions
    [(('>',), 'a', ('a',)), (('a',), 'b', ('b',)), (('b',), '<', ('<',)), (('b',), 'a', ('a',))]

Tools for strictly piecewise (SP) languages

Strictly piecewise (SP) languages are generated by SP grammars. The main difference between SP grammars and SL grammars is that the latter are based on the successor relation (x is immediately next to y), whereas the former operate with the precedence relation (x is somewhere before y). For more information, see (Fu, Heinz and Tanner 2011).

Initialize positive or negative SP grammar:

    >>> posSP = PosSP()
    >>> negSP = NegSP()

The class is under construction!
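
Until the SP class is finished, the intuition behind SP-2 grammars can be illustrated in plain Python: instead of adjacent bigrams, one collects precedence pairs, i.e. pairs (x, y) such that x occurs somewhere before y. The sketch below is an illustration only and is not part of the package; its function names are hypothetical:

    # Illustration of the precedence relation behind SP-2 grammars.
    from itertools import combinations

    def precedence_pairs(word):
        """All pairs (x, y) such that x occurs somewhere before y in the word."""
        return {(x, y) for x, y in combinations(word, 2)}

    def fits_negative_sp(word, forbidden):
        # under a negative SP grammar, no forbidden precedence pair may occur
        return not (precedence_pairs(word) & set(forbidden))

    forbidden = [("a", "b")]                     # ban 'a' ... 'b' at any distance
    print(fits_negative_sp("bcba", forbidden))   # True
    print(fits_negative_sp("acccb", forbidden))  # False: 'a' precedes 'b'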

To-do list

Coming soon:

  • class of FSM family (needed for SP learning)
  • SP learner
  • SP fsmizer
  • SP scanner
  • SP sample generator
  • SP grammar cleaning
  • SP polarity switcher

References

[1] Jie Fu, Jeffrey Heinz, and Herbert G. Tanner (2011). An Algebraic Characterization of Strictly Piecewise Languages. In Mitsunori Ogihara and Jun Tarui, editors, Theory and Applications of Models of Computation, volume 6648 of Lecture Notes in Computer Science, pages 252--263. Springer, Berlin / Heidelberg.

[2] Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner (2011). Tier-based Strictly Local Constraints for Phonology. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 58--64, Portland, Oregon, USA. Association for Computational Linguistics.

[3] Adam Jardine and Jeffrey Heinz (2016). Learning Tier-based Strictly 2-Local Languages. Transactions of the Association for Computational Linguistics, 4:87--98.

[4] Adam Jardine and Kevin McMullin (2017). Efficient Learning of Tier-based Strictly k-Local Languages. In Frank Drewes, Carlos Martín-Vide, and Bianca Truthe, editors, Proceedings of Language and Automata Theory and Applications, 11th International Conference, pages 64--76. Lecture Notes in Computer Science. Springer.

© Alëna Aksënova
