Myaso

Myaso [ˈmʲæ.sə] is a morphological analysis and synthesis library, written in Ruby.

Installation

Add this line to your application's Gemfile:

gem 'myaso'

And then execute:

$ bundle

Or install it:

$ gem install myaso

Usage

At the moment, Myaso has a pretty fast part-of-speech (POS) tagger built on hidden Markov models (HMMs). Tagging requires a trained statistical model.

Myaso supports trained models in the TnT format. One can be obtained from the resource by Serge Sharoff et al. called Russian statistical taggers and parsers.

Analysis

Since Yandex released the Mystem analyzer as a shared library, the analyzer can be used through the foreign function interface.

First, read and agree to the mystem EULA. Second, download and install the shared library for your operating system. Finally, use Myaso and enjoy the benefits.

Analysis API

Myaso uses the mystem library to process Russian words. That is quite simple.

pp Myaso::Mystem.analyze('котёночка')
=begin
[#<struct Myaso::Mystem::Lemma
  lemma="котеночек",
  form="котёночка",
  quality=:dictionary,
  msd=#<Myasorubka::MSD::Russian msd="Ncmsay">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[168, 174, 166],
  flex_length=6,
  rule_id=1525>]
=end

Myaso works fine even when the given word is ambiguous or does not appear in mystem's dictionary.

pp Myaso::Mystem.analyze('аудисты')
=begin
[#<struct Myaso::Mystem::Lemma
  lemma="аудист",
  form="аудисты",
  quality=:bastard,
  msd=#<Myasorubka::MSD::Russian msd="Ncmpny">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[165, 175],
  flex_length=1,
  rule_id=25>,
 #<struct Myaso::Mystem::Lemma
  lemma="аудистый",
  form="аудисты",
  quality=:bastard,
  msd=#<Myasorubka::MSD::Russian msd="A---p-s">,
  stem_grammemes=[128],
  flex_grammemes=[175, 183],
  flex_length=1,
  rule_id=65>]
=end
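
When an analysis returns several candidates, a caller may want to prefer dictionary entries over guessed ones. The following is a minimal sketch of that idea. It does not call Myaso: `Lemma` here is a stand-in Struct that only mirrors the `lemma`, `form`, `quality` and `msd` fields visible in the output above (with `msd` simplified to a string), and `preferred` is a hypothetical helper name.

```ruby
# Stand-in for the structs printed above; not Myaso's own class.
Lemma = Struct.new(:lemma, :form, :quality, :msd)

# Keep dictionary analyses when there are any; otherwise keep every guess.
def preferred(lemmas)
  dictionary = lemmas.select { |l| l.quality == :dictionary }
  dictionary.empty? ? lemmas : dictionary
end

guessed = [
  Lemma.new('аудист', 'аудисты', :bastard, 'Ncmpny'),
  Lemma.new('аудистый', 'аудисты', :bastard, 'A---p-s')
]
known = [Lemma.new('котеночек', 'котёночка', :dictionary, 'Ncmsay')]

preferred(guessed).map(&:lemma)         # both guesses are kept
preferred(known + guessed).map(&:lemma) # only the dictionary entry is kept
```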

Synthesis

Given an analyzed word, it is possible to retrieve all of its possible forms and use them to inflect the word. This is implemented using the above-mentioned mystem shared library.

Synthesis API

In the general case, all possible word forms can be extracted given a word and the identifier of its inflection rule.

pp Myaso::Mystem.forms('человеком', 3890)
=begin
[#<struct Myaso::Mystem::Form
  form="людей",
  msd=#<Myasorubka::MSD::Russian msd="Ncmpay">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[168, 175, 166]>,
 ...
 #<struct Myaso::Mystem::Form
  form="человеку",
  msd=#<Myasorubka::MSD::Russian msd="Ncmsdy">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[167, 174]>]
=end

There is a more convenient way of doing this, which requires a previously lemmatized word.

lemmas = Myaso::Mystem.analyze('кот') # => [#<Myaso::Mystem::Lemma lemma="кот" msd="Ncmsny">]
pp lemmas[0].forms
=begin
[#<struct Myaso::Mystem::Form
  form="кот",
  msd=#<Myasorubka::MSD::Russian msd="Ncmsny">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[165, 174]>,
 ...
 #<struct Myaso::Mystem::Form
  form="коты",
  msd=#<Myasorubka::MSD::Russian msd="Ncmpny">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[165, 175]>]
=end

Moreover, Myaso can find the forms that exactly match the given grammemes. Be careful, though: computational linguistics is a hard field, and a single set of grammemes may match more than one form.

lemmas = Myaso::Mystem.analyze('человек') # => [#<Myaso::Mystem::Lemma lemma="человек" msd="Ncmpay">]
pp lemmas[0].inflect(:number => :plural, :case => :dative)
=begin
[#<struct Myaso::Mystem::Form
  form="людям",
  msd=#<Myasorubka::MSD::Russian msd="Ncmpdy">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[167, 175]>,
 #<struct Myaso::Mystem::Form
  form="человекам",
  msd=#<Myasorubka::MSD::Russian msd="Ncmpdy">,
  stem_grammemes=[136, 192, 201],
  flex_grammemes=[167, 175]>]
=end
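
The idea of exact grammeme matching can be illustrated on the positional MSD strings shown in the outputs above. This is only a sketch under stated assumptions, not Myaso's implementation: `Form` is a stand-in Struct, `select_forms` is a hypothetical helper, and the positions and letter codes (number at index 3, case at index 4 of a noun MSD) are inferred from examples such as "Ncmsdy" and "Ncmpny".

```ruby
# Stand-in for the printed structs, with msd simplified to a string.
Form = Struct.new(:form, :msd)

# Character positions inside a noun MSD such as "Ncmpdy" (inferred, see above).
NOUN_POSITIONS = { number: 3, case: 4 }
CODES = {
  number: { singular: 's', plural: 'p' },
  case:   { nominative: 'n', accusative: 'a', dative: 'd' }
}

# Keep the forms whose MSD carries every requested grammeme.
def select_forms(forms, wanted)
  forms.select do |f|
    wanted.all? do |attribute, value|
      f.msd[NOUN_POSITIONS.fetch(attribute)] == CODES.fetch(attribute).fetch(value)
    end
  end
end

forms = [
  Form.new('человеку', 'Ncmsdy'),
  Form.new('людей', 'Ncmpay'),
  Form.new('людям', 'Ncmpdy'),
  Form.new('человекам', 'Ncmpdy')
]
matches = select_forms(forms, number: :plural, case: :dative)
```

Both plural dative forms survive the filter, which is why the result is a list rather than a single form.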

Tagging

Myaso performs POS tagging using its own implementation of the Viterbi algorithm on HMMs. The output has the following format: token<TAB>tag.

Please remember that the tagger's command line interface accepts only tokenized text, one token per line. The Greeb tokenizer can help with that, and any other tokenization or segmentation tool will work as well.

% echo 'Как поспал, проголодался наверное?' | greeb | myaso -n snyat-msd.123 -l snyat-msd.lex tagger
Как	P-----r
поспал	Vmis-sma
,	,
проголодался	Vmis-sma
наверное	R
?	SENT
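
Because the output is plain token<TAB>tag lines, it is easy to post-process. The sketch below parses such output into pairs and picks out the verbs; `parse_tagged` is a hypothetical helper, and the sample string simply repeats a few lines of the output above.

```ruby
# Split "token<TAB>tag" lines into [token, tag] pairs, skipping blank lines.
def parse_tagged(output)
  output.each_line.map(&:chomp).reject(&:empty?).map do |line|
    token, tag = line.split("\t", 2)
    [token, tag.to_s]
  end
end

sample = "Как\tP-----r\nпоспал\tVmis-sma\n,\t,\n"
pairs = parse_tagged(sample)

# In the MSD tags shown above, verb tags start with "V".
verbs = pairs.select { |_, tag| tag.start_with?('V') }.map(&:first)
```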

Unfortunately, the current implementation of the tagger has two significant drawbacks:

The tagger does not handle unknown words well. Sorry.
Tagging is fast in itself, but it requires a pretty slow training procedure, which runs only once.

Tagging API

It is possible to embed POS tagging in your own application using the API.

model = Myaso::Tagger::TnT.new('model.123', 'model.lex')
tagger = Myaso::Tagger.new(model)
pp tagger.annotate(%w(Как поспал , проголодался наверное ?))
=begin
["P-----r", "Vmis-sma", ",", "Vmis-sma", "R", "SENT"]
=end

It is possible to significantly speed up initialization by explicitly setting the interpolation vector. For instance, the TnT model from http://corpus.leeds.ac.uk/mocky/ has the following approximate linear interpolation coefficients: k1 = 0.14, k2 = 0.30, k3 = 0.56. The example below provides these values at full precision.

interpolations = [0.14095796503456284, 0.3032174211273352, 0.555824613838102]
model = Myaso::Tagger::TnT.new('model.123', 'model.lex', interpolations)
tagger = Myaso::Tagger.new(model)
pp tagger.annotate(%w(Как поспал , проголодался наверное ?))
=begin
["P-----r", "Vmis-sma", ",", "Vmis-sma", "R", "SENT"]
=end
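
To show what such coefficients are for, here is a small sketch of linear interpolation as used in TnT-style smoothing, where a trigram probability is a weighted sum of unigram, bigram and trigram estimates. It assumes that k1, k2 and k3 weight those three estimates in that order; `interpolate` is a hypothetical helper and the input probabilities are made up.

```ruby
# Rounded coefficients from the text above.
K1, K2, K3 = 0.14, 0.30, 0.56

# P(t3 | t1, t2) ~ k1 * P(t3) + k2 * P(t3 | t2) + k3 * P(t3 | t1, t2)
def interpolate(unigram, bigram, trigram, k = [K1, K2, K3])
  k[0] * unigram + k[1] * bigram + k[2] * trigram
end

# A trigram never seen in training still gets a non-zero probability.
smoothed = interpolate(0.10, 0.20, 0.0)
```

The coefficients sum to one, so the result remains a probability even when the trigram estimate is zero.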

Acknowledgement

This work is partially supported by the Ural Branch of the Russian Academy of Sciences, grant no. РЦП-12-П10.

Contributing

Fork it;
Create your feature branch (git checkout -b my-new-feature);
Commit your changes (git commit -am 'Added some feature');
Push to the branch (git push origin my-new-feature);
Create new Pull Request.
