ptwikiwords

Words used in Portuguese Wikipedia

This data package contains a dataset of the words found in a random sample of about 15,000 pages from the Portuguese Wikipedia.

Installing

It can be installed using:

devtools::install_github("dfalbel/ptwikiwords")

Using

After installing the package, you can load the dataset using:

library(ptwikiwords)
data(ptwikiwords)
head(ptwikiwords)
#> # A tibble: 6 × 3
#>    word  count check
#>   <chr>  <int> <lgl>
#> 1    de 210954  TRUE
#> 2     a 109652  TRUE
#> 3     e 100028  TRUE
#> 4     o  87839  TRUE
#> 5    em  67040  TRUE
#> 6    do  59489  TRUE

The dataset contains 3 columns:

  • word: word, as is, found in Wikipedia pages
  • count: number of times the word was found in the sample of Wikipedia pages
  • check: whether the word exists in the Portuguese language
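
As a minimal sketch of how the three columns fit together, the snippet below filters on check and computes each word's share of the total count. To stay runnable without the package, it uses a stand-in data frame (the hypothetical name sample_words) built from the six rows printed above; with the package installed, the real ptwikiwords object can be used in its place.

```r
# Stand-in for the first six rows printed above (hypothetical object name).
# With the package installed, replace `sample_words` with `ptwikiwords`.
sample_words <- data.frame(
  word  = c("de", "a", "e", "o", "em", "do"),
  count = c(210954L, 109652L, 100028L, 87839L, 67040L, 59489L),
  check = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the words flagged as existing in Portuguese.
valid_words <- sample_words[sample_words$check, ]

# Share of each word within this subset (not within the full dataset).
valid_words$share <- valid_words$count / sum(valid_words$count)

print(valid_words)
```

The shares here are relative to the six stand-in rows only, so they will differ from shares computed over the full dataset.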

Here is a word cloud of the first 300 words flagged as existing in Portuguese:

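suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(wordcloud))
# Keep words flagged as valid Portuguese, then take the first 300 rows.
# `filter(check)` avoids `T`, which is an ordinary variable and can be reassigned.
words_filter <- ptwikiwords %>%
  filter(check) %>%
  slice(1:300)
wordcloud(words_filter$word, words_filter$count)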

Here is a word cloud of the first 100 2-grams:

data(ngrams)
words_filter <- ngrams %>%
  slice(1:100)
wordcloud(words_filter$ngrams, words_filter$count)
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): com o could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): o primeiro
#> could not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): é um could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): para a could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): de um could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): janeiro de
#> could not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): é uma could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): setembro de
#> could not be fit on page. It will not be plotted.
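
The warnings above mean that some 2-grams were too large to place and were left out of the plot. One common way to reduce this is to shrink the text with the scale argument of wordcloud(), which sets the largest and smallest word sizes (the default is c(4, 0.5)). The sketch below assumes the package is installed; the scale values and the name top_ngrams are illustrative choices, not settings from the package author, and may need tuning for your plot device.

```r
library(ptwikiwords)
suppressPackageStartupMessages(library(wordcloud))

data(ngrams)

# First 100 rows, as in the example above.
top_ngrams <- head(ngrams, 100)

# Word placement is random, so fix the seed for a repeatable layout.
set.seed(42)

# scale = c(largest, smallest) text size; smaller values leave more room
# for long 2-grams. These values are a guess, not a recommendation.
wordcloud(top_ngrams$ngrams, top_ngrams$count, scale = c(2.5, 0.4))
```

If warnings still appear, lowering the first scale value further or enlarging the plot device are the usual next steps.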
