Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Supervised learning of morphology
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Type||Name||Latest commit message||Commit time|
|Failed to load latest commit information.|
INTRODUCTION ============ Morfette website: https://code.google.com/p/morfette/ Morfette is a tool for supervised learning of inflectional morphology. Given a corpus of sentences annotated with lemmas and morphological labels, and optionally a lexicon, morfette learns how to morphologically analyze new sentences. In the learning stage Morfette fits two separate logistic regression models: one for morphological tagging and one for lemmatization. The predictions of the models are combined dynamically and produce a globally plausible sequence of morphological-tag - lemma pairs for a sentence. In Morfette lemmatization is cast as a classification task where a a lemmatization class corresponds to the specification of the edit operations which are needed to transform the inflected word form into the corresponding lemma. The basic approach is described in (Chrupała et al 2008 and Chrupała 2008). The current version of Morfette uses an averaged perceptron to fit the models, rather than Maximum Entropy training. The lemmatization classes are Edit-Tree-based as described in (Chrupała 2008). LICENSE ======= The source code in the src directory is licensed under the BSD license. INSTALLATION ============ The easiest way to install Morfette is to first install the Haskell Platform <http://www.haskell.org/platform/> then execute the following commands from within the morfette directory: > cabal update > cabal install --bindir=$HOME/bin This will compile Morfette and install the executable in $HOME/bin. Cabal can also download Morfette from the source code repository Hackage: > cabal install morfette --bindir=$HOME/bin --datadir=$HOME/share This will download Morfette, compile it, install the executable in $HOME/bin, and install the data files in a $HOME/share. There are also pre-built binaries available from the project website. USAGE ===== Usage: morfette command [OPTION...] [ARG...] train: train models train [OPTION...] TRAIN-FILE MODEL-DIR --dict-file=PATH path to optional dictionary --language-configuration=es|pl|tr|.. language configuration --iter-pos=NUM iterations for POS model --iter-lemma=NUM iterations for Lemma model extract-features: extract features extract-features [OPTION...] MODEL-DIR --dict-file=PATH path to optional dictionary --model-id=pos|lemma model id (`pos' or `lemma') predict: predict postags and lemmas using saved model data predict [OPTION...] MODEL-DIR --beam=+INT beam size to use --tokenize tokenize input --multi=+INT n-best output eval: evaluate morpho-tagging and lemmatization results eval [OPTION...] TRAIN-FILE GOLD-FILE TEST-FILE --ignore-case ignore case for evaluation --baseline-file=PATH path to baseline results --dict-file=PATH path to optional dictionary --ignore-punctuation ignore punctuation for evaluation --ignore-pos=POS-prefix ignore POS starting with POS-prefix for evaluation version: show version version [OPTION...] EXAMPLE USAGE ============= To train a new model: > morfette train --dict-file=DICT TRAINING-FILE MODEL-DIR To use the model in MODEL-DIR to analyze new data: > morfette predict MODEL-DIR < TEST-DATA > ANALYZED-TEST-DATA PRETRAINED MODELS ================= Pretrained models for Spanish and French are available in the data directory: data/es/model and data/fr/model. For example you can use the Spanish model like this: > morfette predict data/es/model < TEST-DATA > ANALYZER-TEST-DATA DATA FORMAT =========== Morfette expects both training and testing data to be tokenized and split into sentences. The format of training data look like this: Gómez Gómez np0000p sostiene sostener vmip3s0 que que cs la el da0fs0 propuesta propuesta ncfs000 no no rn cambiará cambiar vmif3s0 . . Fp La el da0fs0 propuesta propuesta ncfs000 será ser vsif3s0 la el da0fs0 misma mismo pi0fs000 There is one token per line, with three columns separated by spaces or tabs. The columns contain word form, lemma and morphological tag respectively. Sentences are separated by an empty line. Text should be encoded in UTF-8. Test data format is similar, except only the first column is needed: Gómez sostiene que la propuesta no cambiará . La propuesta será la misma Optionally, Morfette can also use vector representations of words for training and prediction, in addition to character strings, as described in . In this case, the second column of is used for the vectors. Inside this field vector components are separated by commas: Gómez 0.1,0.0,0.9 Gómez np0000p sostiene 0.0,0.0,0.2 sostener vmip3s0 For prediction the format is similar, but only the first two columns are used: Gómez 0.1,0.0,0.9 sostiene 0.0,0.0,0.2 Each vector should have the same number of components. Additionally, Morfette can use a dictionary with mappings between word-forms, lemmas and morphological tags. The tags do NOT need to be of the same kind and the tags used in the training data. The format of the dictionary file is one record per line with the following fields: FORM LEMMA1 POS1 LEMMA2 POS2 ... Example from the Spanish dictionary: sientes sentar VMSP2S0 sentir VMIP2S0 siento sentar VMIP1S0 sentir VMIP1S0 References ==========  Grzegorz Chrupała, Georgiana Dinu and Josef van Genabith. 2008. Learning Morphology with Morfette. In Proceedings of LREC 2008. http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf  Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. Chapter 6. PhD dissertation, Dublin City University. http://grzegorz.chrupala.me/papers/phd.pdf  Grzegorz Chrupała. 2010. Efficient induction of probabilistic word classes with LDA. IJCNLP. http://grzegorz.chrupala.me/papers/ijcnlp-2011.pdf