Qumin (QUantitative Modelling of INflection) is a collection of scripts for the computational modelling of the inflectional morphology of languages. It was developed by me (Sacha Beniamine) for my PhD, which was supervised by Olivier Bonami.
The documentation has moved to ReadTheDocs at: https://qumin.readthedocs.io/
For more detail, you can refer to my dissertation (in French):
First, open a terminal and navigate to the folder where you want to store the Qumin code. Clone the repository from GitHub:
git clone https://github.com/XachaB/Qumin.git
Make sure all the Python dependencies are installed. They are listed in environment.yml. A simple solution is to use conda and create a new environment from the environment.yml file:
conda env create -f environment.yml
There is now a new conda environment named Qumin. It needs to be activated before using any Qumin script:
conda activate Qumin
The scripts expect full paradigm data in phonemic transcription, as well as a feature key for the transcription.
To provide a data sample in the correct format, Qumin includes a subset of the French flexique lexicon, distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license.
For Russian nouns, see the Inflected lexicon of Russian Nouns in IPA notation.
Alternation patterns serve as a basis for all the other scripts. The algorithm to find the patterns was presented in: Sacha Beniamine. "Un algorithme universel pour l'abstraction automatique d'alternances morphophonologiques." 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), vol. 2, Jun 2017, Orléans, France.
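To give a rough intuition of what an alternation between two forms looks like, here is a deliberately naive sketch. It is not the algorithm from the paper cited above and not what `find_patterns.py` does: it only strips the longest shared prefix and suffix of two strings and returns what differs. The function name `naive_alternation` and the example words are hypothetical and used for illustration only.

```python
import os


def naive_alternation(form_a, form_b):
    """Return the differing material of two forms as a pair (a_part, b_part).

    Naive illustration only: removes the longest common prefix, then the
    longest common suffix of what remains.
    """
    prefix = os.path.commonprefix([form_a, form_b])
    rest_a = form_a[len(prefix):]
    rest_b = form_b[len(prefix):]
    # The common suffix is the common prefix of the reversed remainders.
    suffix = os.path.commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    a_part = rest_a[:len(rest_a) - len(suffix)]
    b_part = rest_b[:len(rest_b) - len(suffix)]
    return a_part, b_part


# Hypothetical orthographic examples (Qumin itself expects phonemic transcriptions).
print(naive_alternation("sing", "sang"))
print(naive_alternation("walk", "walked"))
```

A real pattern also has to describe the contexts in which the alternation applies, which this sketch ignores entirely.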
Computing automatically aligned patterns for paradigm entropy or macroclass:
bin/$ python3 find_patterns.py <paradigm.csv> <segments.csv>
Computing automatically aligned patterns for lattices:
bin/$ python3 find_patterns.py -d -o <paradigm.csv> <segments.csv>
To visualize the microclasses and their similarities, you can use the new script `microclass_heatmap.py`:
Computing a microclass heatmap:
bin/$ python3 microclass_heatmap.py <paradigm.csv> <output_path>
Computing a microclass heatmap, comparing with class labels:
bin/$ python3 microclass_heatmap.py -l <labels.csv> -- <paradigm.csv> <output_path>
The labels file is a csv file. The first column gives lexeme names, the second column provides inflection class labels. This makes it possible to visually compare a manual classification with pattern-based similarity. This script relies heavily on seaborn's clustermap function.
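As an illustration of the two-column layout described above, the following sketch writes a small labels file with Python's standard `csv` module. The lexemes, the class labels and the absence of a header row are assumptions made for the example; check the documentation for the exact format expected by `microclass_heatmap.py`.

```python
import csv
import io

# Hypothetical lexemes and inflection class labels, for illustration only.
rows = [
    ("manger", "first_conjugation"),
    ("finir", "second_conjugation"),
    ("aller", "irregular"),
]


def write_labels(handle, rows):
    """Write one 'lexeme,label' row per lexeme (no header row: an assumption)."""
    writer = csv.writer(handle)
    writer.writerows(rows)


# Write to an in-memory buffer here; pass an open file to produce labels.csv.
buffer = io.StringIO()
write_labels(buffer, rows)
labels_csv = buffer.getvalue()
print(labels_csv)
```

To write an actual file, open it with `open("labels.csv", "w", newline="")` and pass the handle to `write_labels`.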
This script was used in:
- Bonami, Olivier, and S. Beniamine. "Joint predictiveness in inflectional paradigms." Word Structure 9, no. 2 (2016): 156-182. Some improvements have been implemented since then.
Computing entropies from one cell:
bin/$ python3 calc_paradigm_entropy.py -n 1 -- <patterns.csv> <paradigm.csv> <segments.csv>
Computing entropies from two cells (you can specify any number of predictors, e.g. -n 1 2 3 works too):
bin/$ python3 calc_paradigm_entropy.py -n 2 -- <patterns.csv> <paradigm.csv> <segments.csv>
Adding a file with features to help prediction (for example gender; the features are added to the known information when predicting):
bin/$ python3 calc_paradigm_entropy.py -n 2 --features <features.csv> -- <patterns.csv> <paradigm.csv> <segments.csv>
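The quantity behind these commands is a conditional entropy: how uncertain the form of one cell remains once other cells are known. The sketch below is a minimal, generic estimate of H(Y | X) from observed pairs. It is not Qumin's implementation, and the toy observations (a class for the known form, a pattern label for the target form) are hypothetical.

```python
from collections import Counter
from math import log2


def conditional_entropy(pairs):
    """Estimate H(Y | X) in bits from a list of (x, y) observations."""
    joint = Counter(pairs)                      # counts of each (x, y) pair
    marginal = Counter(x for x, _ in pairs)     # counts of each x
    total = len(pairs)
    entropy = 0.0
    for (x, _y), n_xy in joint.items():
        # p(x, y) * log2 p(y | x), summed over all observed pairs
        entropy -= (n_xy / total) * log2(n_xy / marginal[x])
    return entropy


# Hypothetical toy data: (class of the known form, pattern leading to the target form).
observations = [
    ("ends_in_e", "A"),
    ("ends_in_e", "A"),
    ("ends_in_e", "B"),
    ("ends_in_e", "B"),
    ("ends_in_i", "C"),
    ("ends_in_i", "C"),
]

print(conditional_entropy(observations))
```

In this toy data, lexemes of the first class are split evenly between two patterns (1 bit of uncertainty) and lexemes of the second class are fully predictable (0 bits), so the weighted average is 4/6 × 1 = 2/3 of a bit.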
Our work on the automatic inference of macroclasses was published in Beniamine, Sacha, Olivier Bonami, and Benoît Sagot. "Inferring Inflection Classes with Description Length." Journal of Language Modelling (2018).
Inferring macroclasses:
bin/$ python3 find_macroclasses.py <patterns.csv> <segments.csv>
This script was used in:
- Beniamine, Sacha. (in press) "One lexeme, many classes: inflection class systems as lattices", In: One-to-Many Relations in Morphology, Syntax and Semantics, Ed. by Berthold Crysmann and Manfred Sailer. Berlin: Language Science Press.
Inferring a lattice of inflection classes, with HTML output:
bin/$ python3 make_lattice.py --html <patterns.csv> <segments.csv>