datasets-br/unitex-pt-br

Note: dictionary data in this repo is a read-only mirror (translated to open formats for data interchange) of the official Unitex repository, where active development is ongoing.

unitex-pt-br

The Unitex primary sources for the Brazilian Portuguese (pt-BR) vocabulary and its morphological definitions, in an open data (FrictionlessData) interchange format.

Controlled primary sources:

  • pt-BR Alphabet: Alphabet.csv and Alphabet_sort.csv

  • pt-BR DELAS: DELA for Simple words, "Dicionário de Palavras Simples para o Português Brasileiro" (Dictionary of Simple Words for Brazilian Portuguese). ~67500 canonical words and their inflection rules. DELAS.csv.

  • pt-BR DELACF: DELA for Compound Forms, "Dicionário de Palavras Compostas Flexionadas para o Português Brasileiro" (Dictionary of Inflected Compound Words for Brazilian Portuguese). ~4000 compound words and their morphological classification. DELACF.csv.

  • pt-BR Inflections: all *.fst2 (finite state transducer v2) files, the compiled format for inflection graphs (see chapter 14.3 of the Unitex Manual). Each file contains only the basic representation of the graph's transitions, so it is unaffected by graph-layout editing and changes only when the topology or classification is modified. A JSON format is under construction; see the dumps folder.
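As a quick way to inspect the CSV sources listed above, here is a minimal Python sketch that reports the column names and row count of a file. The `summarize_csv` helper is hypothetical (not part of this repository), and the `data/` location and UTF-8 encoding are assumptions. The sketch reads the header from the file itself, since the column layout is not described here.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    # Assumption: the file is UTF-8 encoded and its first line is a header.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        row_count = sum(1 for _ in reader)
    return columns, row_count


if __name__ == "__main__":
    # Hypothetical location: assumes the repository was cloned and the
    # script runs from its root, with the CSV files under data/.
    for name in ("DELAS.csv", "DELACF.csv"):
        csv_path = Path("data") / name
        if csv_path.exists():
            columns, row_count = summarize_csv(csv_path)
            print(f"{name}: {row_count} rows, columns = {columns}")
```

The printed row counts can be compared against the approximate sizes quoted in the list (~67500 and ~4000 entries).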

References

Updating sources

See the spreadsheets to download here as data/*.csv.

Any other file must be validated by software (see SQL back-end).
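The repository's own validation runs in the SQL back-end and is not reproduced here. As a rough illustration of the kind of structural check a tool might run before accepting a CSV file, the following sketch flags records whose field count differs from the header. The `find_ragged_records` helper is hypothetical.

```python
import csv
import io


def find_ragged_records(text):
    """Return 1-based record indexes whose field count differs from the header.

    The header is record 1, so the first data record is record 2.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    return [
        index
        for index, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]
```

For example, `find_ragged_records("a,b\n1,2\n3\n")` flags record 3, because it has one field where the header has two.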

License

  • Unitex sources: LGPLLR - Lesser General Public License For Linguistic Resources.

  • Other texts and sources: CC-BY-4.0 - Attribution 4.0 International.
