The Brazilian Portuguese language, Unitex primary sources for the vocabulary and dictionary definitions
assets
data
docs
dump
src
.gitignore
README.md
datapackage.json


Note: dictionary data in this repo is a read-only mirror (translated to open formats for data interchange) of the official Unitex repository, where active development is ongoing.

unitex-pt-br

Unitex primary sources for the Brazilian Portuguese (pt-BR) vocabulary and its morphological definitions, in an open data (Frictionless Data) interchange format.

Controlled primary sources:

  • pt-BR Alphabet: Alphabet.csv and Alphabet_sort.csv

  • pt-BR DELAS: DELA for Simple words, "Dicionário de Palavras Simples para o Português Brasileiro" (Dictionary of Simple Words for Brazilian Portuguese). About 67,500 canonical words and their inflection rules. DELAS.csv.

  • pt-BR DELACF: DELA for Compound Forms, "Dicionário de Palavras Compostas Flexionadas para o Português Brasileiro" (Dictionary of Inflected Compound Words for Brazilian Portuguese). About 4,000 compound words and their morphological classification. DELACF.csv.

  • pt-BR Inflections: all *.fst2 (finite state transducer v2) files, the compiled format for inflection graphs (see chapter 14.3 of the Unitex Manual). Each file contains only the basic representation of the graph's transitions, so it is unaffected by graph-layout editing and changes only when the topology or classification is modified. A JSON format is under construction; see the dump folder.
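
As a minimal sketch of how the CSV sources listed above might be consumed, the snippet below reads a Data Package descriptor and loads one resource using only Python's standard library. It assumes that `datapackage.json` follows the usual Frictionless Data Package layout (a `resources` array whose entries carry `name` and `path`). The helper names `list_resources` and `read_resource` are hypothetical, and no particular column names are assumed for any CSV.

```python
import csv
import json
from pathlib import Path


def list_resources(package_dir):
    """Map resource name -> CSV path, read from datapackage.json.

    Assumes the descriptor has a top-level "resources" array whose
    entries each carry a "name" and a string "path".
    """
    root = Path(package_dir)
    with open(root / "datapackage.json", encoding="utf-8") as fh:
        descriptor = json.load(fh)
    return {
        res["name"]: root / res["path"]
        for res in descriptor.get("resources", [])
        if "name" in res and isinstance(res.get("path"), str)
    }


def read_resource(csv_path):
    """Yield each CSV row as a dict keyed by the file's own header line."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)
```

Run from the repository root, `list_resources(".")` would return the declared resources, and `read_resource` can then iterate over any one of the returned paths.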

References

Updating sources

See the spreadsheets to download here as data/*.csv.

Any other file must be validated by software (see SQL back-end).
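
One simple software check, shown here only as an illustration and not as the repository's SQL back-end, is to compare a CSV file's header with the field names declared for that resource in `datapackage.json`. The sketch assumes the descriptor follows the Table Schema convention (`schema.fields[].name`); `check_header` is a hypothetical helper.

```python
import csv


def check_header(csv_path, resource):
    """Return a list of problems; an empty list means the header matches.

    `resource` is one entry of the descriptor's "resources" array,
    assumed to carry a Table Schema under resource["schema"]["fields"].
    """
    fields = resource.get("schema", {}).get("fields", [])
    expected = [field["name"] for field in fields]
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # First row of the file is taken as the header; empty file -> [].
        actual = next(csv.reader(fh), [])
    problems = []
    for name in expected:
        if name not in actual:
            problems.append(f"missing column: {name}")
    for name in actual:
        if name not in expected:
            problems.append(f"undeclared column: {name}")
    return problems
```

A check like this only covers column names; type and constraint validation would need a fuller tool.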

License

  • Unitex sources: LGPLLR - Lesser General Public License For Linguistic Resources.

  • Other texts and sources: CC-BY-4.0 - Attribution 4.0 International.
