This is a vocabulary containing a large number of Brazilian Portuguese words, made to expand Matheus73's vocabulario by collecting all words supplied by the sources.
Note1: Even though this repo uses the same scraping sources as Matheus73's repo, the scraping was done with lxml instead of BS4 (BeautifulSoup4), and it includes a much wider range of words.
Note2: This is intended to be a PT-BR vocabulary, but the scraped sources also contain PT-PT words. 🇵🇹
File | Content | Number of words | Source | Source code |
---|---|---|---|---|
conjugacao-verbos.txt | Infinitive form verbs | 20526 | Conjugacao | conjugacao-verbos_mais_usados.py |
conjugacao-verbos_conjugados.txt | All conjugated forms of all verbs from verbos.txt | 1703200 (Unique) | Conjugacao | conjugacao-verbos_conjugados.py |
dicio.txt | All words listed as palavras mais buscadas | 160923 | Dicio | dicio-palavras_mais_buscadas.py |
File | Content | Number of words | Source | Source code |
---|---|---|---|---|
vocabulary.txt | Whole words collection | 1864779 (Unique) | Dicio, Conjugacao | file_merge.py |
vocabulary_cleansed.txt | "Cleansed" words collection | 1703863 (Unique) | Dicio, Conjugacao | Cleanse.py |
As Matheus73 noticed, these files contain some expressions which are not words. That is a good reason to discard them, which is why vocabulary_cleansed.txt exists.
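The merge and cleansing steps described above can be sketched roughly as follows. This is only an illustration, not the actual contents of file_merge.py or Cleanse.py: the function names `merge_unique`, `is_single_word` and `cleanse` are hypothetical, and the rule used to decide what counts as a word (letters only, optionally joined by hyphens) is an assumption made for the example.

```python
import re
from typing import Iterable, List

# Assumption: a "word" consists of letters only, optionally joined by
# hyphens (e.g. "guarda-chuva"). Cleanse.py may apply different rules.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def merge_unique(*word_lists: Iterable[str]) -> List[str]:
    """Merge several word lists into one sorted list without duplicates."""
    unique = set()
    for words in word_lists:
        for word in words:
            word = word.strip()  # drop newlines and surrounding spaces
            if word:  # skip blank entries
                unique.add(word)
    return sorted(unique)


def is_single_word(entry: str) -> bool:
    """Return True if the entry matches the assumed single-word pattern."""
    return WORD_PATTERN.fullmatch(entry) is not None


def cleanse(words: Iterable[str]) -> List[str]:
    """Keep only the entries that look like single words."""
    return [word for word in words if is_single_word(word)]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the scraped word lists.
    dicio = ["casa\n", "amar", "a fim de"]
    conjugacao = ["amar", "", "guarda-chuva "]
    merged = merge_unique(dicio, conjugacao)
    print(len(merged), "unique entries,", len(cleanse(merged)), "after cleansing")
```

With real data, each list would be read from a file with one entry per line (for example via `open(path, encoding="utf-8")`), and the results written back out the same way. Filtering by pattern is a deliberately simple choice: it drops multi-word expressions and entries containing digits, but it cannot tell PT-BR words from PT-PT ones.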