
Voz-bonita/Vocabulary_PT-BR


Web Scraping Vocabulary PT-BR 🇧🇷

What is it?

This is a vocabulary containing a large number of Brazilian Portuguese words, made to expand Matheus73's vocabulario by collecting all the words supplied by the sources.

Note1: Even though this repo uses the same scraping sources as Matheus73's repo, the scraping was done with lxml instead of BS4 (BeautifulSoup4), and the result includes a much wider range of words.

Note2: This is intended to be a PT-BR vocabulary, but the scraped sources also contain PT-PT words. 🇵🇹

Side files

| File | Content | Number of words | Source | Source code |
| --- | --- | --- | --- | --- |
| conjugacao-verbos.txt | Infinitive form verbs | 20526 | Conjugacao | conjugacao-verbos_mais_usados.py |
| conjugacao-verbos_conjugados.txt | All conjugated forms of all verbs from verbos.txt | 1703200 (Unique) | Conjugacao | conjugacao-verbos_conjugados.py |
| dicio.txt | All words listed as palavras mais buscadas | 160923 | Dicio | dicio-palavras_mais_buscadas.py |

Main files

| File | Content | Number of words | Source | Source code |
| --- | --- | --- | --- | --- |
| vocabulary.txt | Whole words collection | 1864779 (Unique) | Dicio, Conjugacao | file_merge.py |
| vocabulary_cleansed.txt | "Cleansed" words collection | 1703863 (Unique) | Dicio, Conjugacao | Cleanse.py |
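
Combining several word lists into one collection of unique entries could look roughly like the sketch below. It is not the contents of file_merge.py: it assumes each file holds one word per line in UTF-8, and the function name `merge_word_files` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_word_files(paths):
    """Return the sorted unique, non-empty lines found across all files.

    Assumption: one word per line, UTF-8 encoded.
    """
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                word = line.strip()
                if word:  # skip blank lines
                    words.add(word)
    return sorted(words)


# Small self-contained demo using temporary files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    side_a = Path(tmp) / "side_a.txt"
    side_b = Path(tmp) / "side_b.txt"
    side_a.write_text("amar\ncasa\n", encoding="utf-8")
    side_b.write_text("casa\namo\n\n", encoding="utf-8")
    merged = merge_word_files([side_a, side_b])

print(merged)  # ['amar', 'amo', 'casa']
```

Using a set keeps each word once even when it appears in more than one source, which is why a merged total can be smaller than the sum of the individual file counts.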

About vocabulary_cleansed.txt

As Matheus73 noticed, those files contain some expressions which are not words. That is a good reason to discard them, which is why vocabulary_cleansed.txt exists.
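
The exact rules applied by Cleanse.py are not described here, so the sketch below only illustrates one possible filter. It assumes that a "word" is a run of letters, optionally joined by single hyphens, and that anything containing spaces, digits, or punctuation is an expression to discard. The names `WORD_PATTERN`, `looks_like_word`, and `cleanse` are hypothetical.

```python
import re

# Hypothetical rule: word characters other than digits and underscore
# (so accented letters are accepted), optionally joined by single hyphens.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def looks_like_word(entry):
    """Return True when the whole entry matches the assumed word pattern."""
    return WORD_PATTERN.fullmatch(entry) is not None


def cleanse(entries):
    """Keep only the entries that look like single words."""
    return [entry for entry in entries if looks_like_word(entry)]


sample = ["coração", "guarda-chuva", "a fim de", "covid19", "pé-de-moleque", "etc."]
print(cleanse(sample))  # ['coração', 'guarda-chuva', 'pé-de-moleque']
```

A stricter or looser pattern would change which entries survive, so the rule should be adjusted to whatever counts as a word for the intended use.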

About

Web scraping service built to collect PT-BR words
