Normatex - Russian text normalization

Normatex is a set of finite-state transducers (FSTs) that normalize Russian text for speech synthesis, machine translation, and other natural language processing tasks.

The FSTs are developed using Unitex, a corpus processor.

To normalize a Russian text:

  1. Copy your text (e.g. example.txt) to the Corpus folder, open it in Unitex, and preprocess it with the following resources:
     • apply Graphs/Preprocessing/Sentence/SentenceUniver.grf in MERGE mode
     • apply Graphs/Preprocessing/Replace/replace.grf in REPLACE mode
  2. Apply lexical resources:
  3. Create a cascade (Text\Apply CasSys Cascade... menu, then New) that applies the following FSTs to your text sequentially, each in REPLACE mode:
     • Graphs/numbers.fst2
     • Graphs/abbr/abbr_w.fst2
     • Graphs/abbr/acronyms_w.fst2
     • Graphs/Postprocessing/replace.fst2
  4. Launch the cascade of FSTs.
  5. Find the normalized text in Corpus/example_csc/example_4_0.snt.
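
The file handling around these steps can be scripted. The Python sketch below is not part of Normatex and does not run Unitex: it only copies an input text into the Corpus folder and computes where the result is expected to appear. The helper names `stage_text` and `expected_output` are hypothetical, and the output naming rule (`<name>_csc/<name>_<number of FSTs>_0.snt`) is an assumption generalized from the single example path above.

```python
from pathlib import Path
import shutil

# Cascade order as listed in the steps above.
CASCADE = [
    "Graphs/numbers.fst2",
    "Graphs/abbr/abbr_w.fst2",
    "Graphs/abbr/acronyms_w.fst2",
    "Graphs/Postprocessing/replace.fst2",
]


def stage_text(source: Path, repo_root: Path) -> Path:
    """Copy the input text into <repo_root>/Corpus and return the new path."""
    corpus = repo_root / "Corpus"
    corpus.mkdir(parents=True, exist_ok=True)
    target = corpus / source.name
    shutil.copyfile(source, target)
    return target


def expected_output(staged: Path, n_graphs: int = len(CASCADE)) -> Path:
    """Return the assumed location of the normalized text.

    Assumption: the cascade writes <stem>_csc/<stem>_<n_graphs>_0.snt
    next to the staged text, as in Corpus/example_csc/example_4_0.snt.
    """
    stem = staged.stem
    return staged.parent / f"{stem}_csc" / f"{stem}_{n_graphs}_0.snt"
```

For example, `expected_output(stage_text(Path("example.txt"), Path(".")))` yields `Corpus/example_csc/example_4_0.snt`. Preprocessing and the cascade itself still have to be run in Unitex as described.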

Slides