version 0.2.0
Fabio Ashtar Telarico authored and cran-robot committed Jun 12, 2023
0 parents commit fc1948f
Showing 19 changed files with 516 additions and 0 deletions.
26 changes: 26 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,26 @@
Package: morestopwords
Type: Package
Title: All Stop Words in One Place
Version: 0.2.0
Authors@R:
    c(person('Fabio Ashtar', 'Telarico', email = 'Fabio-Ashtar.Telarico@fdv.uni-lj.si',
             role = c('aut', 'cre'), comment = c(ORCID = '0000-0002-8740-7078')),
      person('Kohei', 'Watanabe', email = 'watanabe.kohei@gmail.com',
             role = c('aut')))
Maintainer: Fabio Ashtar Telarico <Fabio-Ashtar.Telarico@fdv.uni-lj.si>
Description: A standalone package combining several stop-word lists for 65 languages with a median of 329 stop words per language and over 1,000 entries for English, Breton, Latin, Slovenian, and Ancient Greek! The user automatically gets access to all the unique stop words contained in: the 'StopwordISO' repository; the 'Natural Language Toolkit' for 'python'; the 'Snowball' stop-word list; the R package 'quanteda'; the 'marimo' repository; the 'Perseus' project; and A. Berra's list of stop words for Ancient Greek and Latin.
License: MIT + file LICENSE
URL: https://fatelarico.github.io/morestopwords.html
BugReports: https://github.com/FATelarico/morestopwords/issues
Encoding: UTF-8
Depends: R (>= 2.10)
LazyData: no
RoxygenNote: 7.2.3
Suggests: cld2
NeedsCompilation: no
Packaged: 2023-06-11 13:37:35 UTC; fabio
Author: Fabio Ashtar Telarico [aut, cre]
  (<https://orcid.org/0000-0002-8740-7078>),
  Kohei Watanabe [aut]
Repository: CRAN
Date/Publication: 2023-06-12 09:30:02 UTC
2 changes: 2 additions & 0 deletions LICENSE
@@ -0,0 +1,2 @@
YEAR: 2023
COPYRIGHT HOLDER: stopwords authors
18 changes: 18 additions & 0 deletions MD5
@@ -0,0 +1,18 @@
628e458c27b3fb9d1c7a2ddef1b5fb7f *DESCRIPTION
0556d96bd69f29842f7f1264e7940e35 *LICENSE
b494f5e4f940ae03a2a1ff991509374e *NAMESPACE
feeef9ec67004fadaa128c8eda1b0a26 *R/data.R
fec0f4c69aabeafd84b47762f05ba329 *R/internal.R
06a701dee4cba228e52d13bbf67cd78f *R/stopwords.R
14b0530bc369f81d6cc5a963214454b0 *R/sysdata.rda
c386e3e779a9fe6fcdc3f2594d56aaa2 *README.md
439bf689fa27cf9affd0335332142165 *build/partial.rdb
0d9a79da00e8d03163b8cb1eb9023883 *data/stopwordsISO.rda
524856af477322b99c5b4f363bebb467 *man/del.stopwords.Rd
4c5a77d10372f0d43666efb37111e5ce *man/figures/compare_stopwords_lists.png
70fe8b86f517858016d0e3ffcc3befa9 *man/figures/logo.png
b397ba665c491aef3cdb835fec48169c *man/languages.Rd
cc0165ee8f722e54752677ffbf7d0198 *man/match.lang.Rd
027a754837b760cfdeceb3aa75b76c20 *man/remove.stopwords.Rd
25d128cfc325af9c2a8e87a102680856 *man/stopwords.Rd
fc95238c73b4f2fec2003a1b84d797da *man/stopwordsISO.Rd
5 changes: 5 additions & 0 deletions NAMESPACE
@@ -0,0 +1,5 @@
# Generated by roxygen2: do not edit by hand

export(languages)
export(remove.stopwords)
export(stopwords)
28 changes: 28 additions & 0 deletions R/data.R
@@ -0,0 +1,28 @@
#' Combined stop words for all languages
#'
#' A list of stop words in each of the supported languages
#'
#' Note: All Unicode characters are escaped. To un-escape them, consider using:
#'
#' \preformatted{
#' library(morestopwords)
#' if(!requireNamespace('stringi')){
#' install.packages('stringi')
#' }
#' data('stopwordsISO')
#' stopwords_unescaped <- lapply(stopwordsISO,
#' stringi::stri_unescape_unicode)
#' }
#'
#' @source All unique stopwords in the following databases:\itemize{
#' \item the StopwordISO \href{https://github.com/stopwords-iso/stopwords-iso}{repository};
#' \item python's Natural Language Toolkit (\href{https://www.nltk.org/}{nltk});
#' \item the \href{http://snowball.tartarus.org/algorithms/english/stop.txt}{Snowball} stop-word list;
#' \item the R package \href{https://quanteda.io/}{quanteda};
#' \item the marimo \href{https://github.com/koheiw/marimo}{repository};
#' \item the \href{https://www.perseus.tufts.edu/hopper/stopwords}{Perseus} project; and
#' \item Aurélien Berra's list of stop words for {Ancient Greek and Latin} (\doi{10.5281/zenodo.3860343}).
#' }
#'
#' @author Each stop-word list's Authors
'stopwordsISO'
64 changes: 64 additions & 0 deletions R/internal.R
@@ -0,0 +1,64 @@
#' Matches a string to an ISO 639-1 code available in this package
#'
#' See \url{https://en.wikipedia.org/wiki/ISO_639-1} for details of the language code.
#'
#' @param lang Either an ISO 639-1/2/3 code or a language name from which to derive an ISO 639-1 code. Language names are resolved by string matching.
#'
#' @returns A character vector containing the two-letter ISO 639-1 code associated with the requested language.
#'
#' @keywords internal

match.lang <- function(lang){
  df <- languages()
  df$name <- tolower(df$name)
  lang <- tolower(lang)

  # Try the code columns first, then fall back to the language name
  pos <- if(lang %in% df$`ISO639-1`){
    # ISO 639-1 code (including package-specific ones such as 'zh-cn')
    which(df$`ISO639-1` == lang)
  } else if(lang %in% df$`ISO639-2`){
    # Is it an ISO 639-2 code?
    which(df$`ISO639-2` == lang)
  } else if(lang %in% df$`ISO639-3`){
    # Otherwise, try as an ISO 639-3 code
    which(df$`ISO639-3` == lang)
  } else {
    # Possible language name: exact or unambiguous partial match
    pmatch(lang, df$name)
  }

  if(length(pos) == 0 || is.na(pos[1])){
    # No match
    stop('Not a valid language (code): ', lang)
  } else {
    # Return match
    df$`ISO639-1`[pos[1]]
  }
}

#' Removes stop words from a string whose language is known
#'
#' @param str The string from which to delete the stop words
#' @param lang Either an ISO 639-1/2/3 code or a language name from which to derive an ISO 639-1 code. Language names are resolved by string matching.
#'
#' @returns A character vector corresponding to the string \code{str} without stop words for the language \code{lang}
#'
#' @keywords internal

del.stopwords <- function(str, lang){

  # Find stop words
  stpwrds <- stopwords(lang = lang)

  # Remove stop words
  y <- str

  for(w in stpwrds){
    y <- gsub(paste0('\\b', w, '\\b'), '', y)
  }

  # Collapse the runs of spaces left behind by the removed words
  y <- gsub(' {2,}', ' ', y)

  trimws(y)
}
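
The removal step in `del.stopwords()` can be illustrated in isolation. The block below is a minimal sketch, not the package's code: `toy_stopwords` and `strip_words()` are hypothetical names, and the three-word list stands in for the real data.

```r
# Hypothetical toy list; the package draws its words from `stopwordsISO`
toy_stopwords <- c('the', 'a', 'of')

# Delete each word only where it stands alone (\\b marks a word boundary),
# then collapse the leftover runs of spaces and trim the ends
strip_words <- function(x, words){
  for(w in words){
    x <- gsub(paste0('\\b', w, '\\b'), '', x)
  }
  trimws(gsub(' {2,}', ' ', x))
}

strip_words('the history of a city', toy_stopwords)
# returns "history city"
```

The word boundaries are what keep the `a` inside `history` from being removed along with the standalone `a`.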
147 changes: 147 additions & 0 deletions R/stopwords.R
@@ -0,0 +1,147 @@
#' Collection of stopwords in multiple languages
#'
#' This function returns stop words contained in the \href{https://github.com/stopwords-iso/stopwords-iso}{StopwordsISO} repository.
#'
#' @param lang Language for which to retrieve the stop words among those supported. This parameter supports: \itemize{
#' \item three-letter ISO 639-2/3 codes (e.g., \code{'eng'});
#' \item two-letter ISO 639-1 codes (\code{'en'});
#' \item names based on ISO 639-2 codes (\code{'English'} or \code{'english'}) and their unambiguous substrings (\code{'engl'}, \code{'engli'}, etc.).
#' }
#'
#' @return A character vector containing the stop words from the selected language as listed in the \href{https://github.com/stopwords-iso/stopwords-iso}{StopwordISO} repository.
#'
#' @export
#'
#' @examples
#' # They all return the correct list of stop words!
#'
#' stopwords('German')
#' stopwords('germ')
#' stopwords('de')
#' stopwords('deu')

stopwords <- function(lang = 'en') {

  lang <- match.lang(lang = lang)

  if (lang %in% names(stopwordsISO)){
    stopwordsISO[[lang]]
  } else {
    stop(paste0(lang, ' is not supported by `StopwordsISO`!'))
  }

}

#' Returns ISO codes and names for all languages or only those available in this package
#'
#' See the relevant \href{https://en.wikipedia.org/wiki/ISO_639-1}{Wikipedia article} for details on the language codes.
#'
#' Note that: \itemize{
#' \item the ISO 639-1 code for mainland Chinese was changed to \code{zh-cn}.
#' \item A list of stop words in the variety of Chinese spoken on the island of Taiwan is accessible using the ISO 639-1 code \code{zh-tw} or the name \code{'Chinese Taiwan'}.
#' \item Ancient Greek has been assigned an artificial ISO 639-1 code (\code{gr}) because it had none. Its ISO 639-2 and 639-3 codes are both \code{grc}.
#' }
#'
#' @param available \emph{logical}, whether to return only the languages supported in this package.
#'
#' @returns A data frame with a row for each language (only those supported if \code{available} is \code{TRUE}) and columns for the several ISO codes (639-2, 639-3, 639-1) and the name.
#'
#' @export
#'
#' @examples
#' # Return the languages supported in this package (the default)
#' languages()

languages <- function(available = TRUE) {

  if(available){
    # Keep only the rows for the languages that have a stop-word list
    code <- ISOcodes[match(names(stopwordsISO), ISOcodes$`ISO639-1`), ]
    rownames(code) <- NULL
    code
  } else {
    # Return the full table of ISO codes and names
    ISOcodes
  }
}

#' Removes stop words from one or more strings, optionally detecting their language
#'
#' @param str A string or a vector of strings from which to delete the stop words
#' @param lang Either: \itemize{
#' \item \code{'auto'} in which case \code{cld2} is used to perform language detection; or
#' \item A string (or a vector of strings, depending on \code{str}) representing an ISO 639-1/2/3 code or a language name from which to derive an ISO 639-1 code (for language names, string matching is performed)
#' }
#' @param fallback Fallback language used when \code{cld2} is not installed or fails to detect the language, or when an entry of \code{lang} is \code{NA}. Defaults to \code{'English'}.
#'
#' @returns A string (or a vector of strings, depending on \code{str}) corresponding to \code{str} without the stop words for the language/s \code{lang}.
#'
#' @export
#'
#' @examples
#' # Multiple strings in different languages
#' remove.stopwords(str = c(Gibberish = 'dadas',
#' Catalan = 'Adeu amic meu',
#' Irish = 'Slan a chara',
#' French = 'Je suis en Allemagne',
#' German = 'Ich liebe Deutschland'),
#' # Various ways of indicating the language
#' lang = c(NA, 'cata', 'Iris', 'fr', 'deu'),
#' # Yet another way
#' fallback = 'english'
#' )
#'
remove.stopwords <- function(str, lang = 'auto', fallback = 'English'){
  # Code of the fallback language
  fallback <- match.lang(fallback)

  # Language detection
  # `identical()` also copes with `lang` being a single `NA`
  if(identical(lang, 'auto')){
    # Whether it is possible to use `cld2`
    has_cld2 <- requireNamespace('cld2', quietly = TRUE)

    if(has_cld2){ # Possible
      # Detect language
      lang <- cld2::detect_language(str, lang_code = TRUE)

      # If unknown, use the fallback language
      # Works both when `str` is a string and when it is a vector of strings
      lang[is.na(lang)] <- fallback
    } else { # Impossible
      # Use fallback language
      fallback_name <- ISOcodes$name[which(ISOcodes$`ISO639-1` == fallback)]
      lang <- fallback
      # Warn the user
      warning(paste('Language detection requires the package `cld2`\n',
                    'Reverting to fallback language:', fallback_name))
    }
  } else {
    # If unknown, use the fallback language
    lang[is.na(lang)] <- fallback

    # Code language/s (works for one or several languages)
    lang <- unlist(lapply(lang, match.lang))
  }

  # Recycle a single language over all the strings
  lang <- rep(lang, length.out = length(str))

  # Remove the stop words from each string
  vapply(seq_along(str), function(w){
    del.stopwords(str = str[[w]], lang = lang[w])
  }, character(1))
}
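
The way a `lang` argument is turned into a two-letter code can be sketched on its own. The example below is an illustration under stated assumptions, not the package's implementation: `codes` is a hypothetical three-row stand-in for the internal code table, and `resolve_lang()` is a hypothetical helper that tries codes first and names last.

```r
# Hypothetical stand-in for the package's internal table of ISO codes
codes <- data.frame(
  iso1 = c('de', 'fr', 'sl'),
  iso3 = c('deu', 'fra', 'slv'),
  name = c('german', 'french', 'slovenian'),
  stringsAsFactors = FALSE
)

resolve_lang <- function(lang, table = codes){
  lang <- tolower(lang)
  # 1. Exact match on a two-letter code
  pos <- match(lang, table$iso1)
  # 2. Exact match on a three-letter code
  if(is.na(pos)) pos <- match(lang, table$iso3)
  # 3. Exact or unambiguous partial match on the language name
  if(is.na(pos)) pos <- pmatch(lang, table$name)
  if(is.na(pos)) stop('Not a valid language (code): ', lang)
  table$iso1[pos]
}

# All four spellings resolve to 'de'
sapply(c('German', 'germ', 'de', 'deu'), resolve_lang)
```

Trying the codes before the names means a short input such as `'de'` is never mistaken for a name prefix, while `pmatch()` returns `NA` for ambiguous prefixes so that they fail loudly instead of picking an arbitrary language.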
Binary file added R/sysdata.rda
Binary file not shown.
50 changes: 50 additions & 0 deletions README.md
@@ -0,0 +1,50 @@
# morestopwords: All Stop Words in One Place <img src="man/figures/logo.png" align="right" width="120"/>

Author: Fabio Ashtar Telarico, University of Ljubljana, FDV

<!-- badges: start -->

![](https://img.shields.io/badge/Original%20author-koheiw-lightgrey)\
![](https://img.shields.io/badge/R%20CMD-passing-brightgreen)\
![](https://img.shields.io/badge/version-0.2.0-orange)\
![](https://img.shields.io/badge/CRAN-0.2.0-blue)\
![](https://img.shields.io/github/last-commit/fatelarico/stopwords?logo=GitHub&logoColor=orange&style=plastic)

<!-- badges: end -->

# Introduction

`stopwords` is an R package originally developed by [Kohei Watanabe](https://github.com/koheiw) of the Waseda Institute for Advanced Study (check out his publications [here](https://scholar.google.com/citations?user=9BGfT7EAAAAJ&hl=en)) that provides easy access to stopwords in more than 50 languages in the Stopwords ISO library.

The original package has not been updated since Dec 22, 2017 and could no longer be installed from `GitHub`. This reboot gives the project continuity.

# Installation

### CRAN (Stable release)

```
install.packages('morestopwords')
```

### GitHub (Development version)

```
if(requireNamespace('remotes'))
remotes::install_github('fatelarico/morestopwords')
```

# Usage

The code base has changed since version 0.1.0 (the last maintained by Dr. Watanabe). Now, the function `morestopwords::stopwords()` supports not only two-letter ISO codes, but also three-letter ones. Moreover, it can identify languages by their ISO name (e.g., German, not *Deutsch*; Swedish, not *svenska*, etc.).
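
As a brief illustration, the calls below request the German list in four ways, mirroring the examples in the package's own documentation. This is a sketch that assumes `morestopwords` is installed; the variable names are only illustrative, and no output is shown because the contents depend on the bundled data.

```r
library(morestopwords)

# Four ways of naming the same language
by_name   <- stopwords('German')  # ISO language name
by_prefix <- stopwords('germ')    # unambiguous prefix of the name
by_iso2   <- stopwords('de')      # two-letter ISO 639-1 code
by_iso3   <- stopwords('deu')     # three-letter code

# Every spelling is resolved to the same two-letter code first,
# so the four results are expected to be the same list
identical(by_name, by_iso3)
```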

# Comparison to similar packages

The package [`stopwords`](https://CRAN.R-project.org/package=stopwords) is also based on Watanabe's archived GitHub repository, which makes it the package most similar to `morestopwords`. However, the two differ in both design choices and features:

1. `morestopwords` has no dependencies and integrates with the package [`cld2`](https://CRAN.R-project.org/package=cld2).
2. `morestopwords` can (if `cld2` is installed) identify the language of one (or more) string(s) automatically.
3. `morestopwords` can remove stop words from one or more strings either in conjunction with language detection or independently.
4. `morestopwords` does not allow the user to choose a list of stop words to use. Rather, it tries to provide the most comprehensive list in an intuitive way.
5. `morestopwords`'s lists include more stop words than any single list included in `stopwords`.

![](./man/figures/compare_stopwords_lists.png)
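
Points 2 and 3 above can be tried with `remove.stopwords()`. The snippet is a hedged sketch that assumes `morestopwords` is installed. The automatic branch runs only when `cld2` is available, and short strings may not be detected reliably, in which case the `fallback` language is used.

```r
library(morestopwords)

# Explicit language: no detection involved
remove.stopwords('Je suis en Allemagne', lang = 'fr')

# Automatic detection, attempted only when `cld2` is installed;
# strings whose language cannot be detected use the fallback language
if (requireNamespace('cld2', quietly = TRUE)) {
  remove.stopwords(c('Je suis en Allemagne', 'Ich liebe Deutschland'),
                   lang = 'auto', fallback = 'English')
}
```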
Binary file added build/partial.rdb
Binary file not shown.
Binary file added data/stopwordsISO.rda
Binary file not shown.
20 changes: 20 additions & 0 deletions man/del.stopwords.Rd


Binary file added man/figures/compare_stopwords_lists.png
Binary file added man/figures/logo.png
28 changes: 28 additions & 0 deletions man/languages.Rd

