Commit fc1948f (0 parents)

This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Showing 19 changed files with 516 additions and 0 deletions.
DESCRIPTION (new file)
@@ -0,0 +1,26 @@
Package: morestopwords
Type: Package
Title: All Stop Words in One Place
Version: 0.2.0
Authors@R:
    c(person('Fabio Ashtar', 'Telarico', email = 'Fabio-Ashtar.Telarico@fdv.uni-lj.si',
             role = c('aut', 'cre'), comment = c(ORCID = '0000-0002-8740-7078')),
      person('Kohei', 'Watanabe', email = 'watanabe.kohei@gmail.com',
             role = c('aut')))
Maintainer: Fabio Ashtar Telarico <Fabio-Ashtar.Telarico@fdv.uni-lj.si>
Description: A standalone package combining several stop-word lists for 65 languages, with a median of 329 stop words per language and over 1,000 entries for English, Breton, Latin, Slovenian, and Ancient Greek. The user automatically gets access to all the unique stop words contained in: the 'StopwordISO' repository; the 'Natural Language Toolkit' for 'python'; the 'Snowball' stop-word list; the R package 'quanteda'; the 'marimo' repository; the 'Perseus' project; and A. Berra's list of stop words for Ancient Greek and Latin.
License: MIT + file LICENSE
URL: https://fatelarico.github.io/morestopwords.html
BugReports: https://github.com/FATelarico/morestopwords/issues
Encoding: UTF-8
Depends: R (>= 2.10)
LazyData: no
RoxygenNote: 7.2.3
Suggests: cld2
NeedsCompilation: no
Packaged: 2023-06-11 13:37:35 UTC; fabio
Author: Fabio Ashtar Telarico [aut, cre]
    (<https://orcid.org/0000-0002-8740-7078>),
  Kohei Watanabe [aut]
Repository: CRAN
Date/Publication: 2023-06-12 09:30:02 UTC
LICENSE (new file)
@@ -0,0 +1,2 @@
YEAR: 2023
COPYRIGHT HOLDER: stopwords authors
MD5 (new file)
@@ -0,0 +1,18 @@
628e458c27b3fb9d1c7a2ddef1b5fb7f *DESCRIPTION
0556d96bd69f29842f7f1264e7940e35 *LICENSE
b494f5e4f940ae03a2a1ff991509374e *NAMESPACE
feeef9ec67004fadaa128c8eda1b0a26 *R/data.R
fec0f4c69aabeafd84b47762f05ba329 *R/internal.R
06a701dee4cba228e52d13bbf67cd78f *R/stopwords.R
14b0530bc369f81d6cc5a963214454b0 *R/sysdata.rda
c386e3e779a9fe6fcdc3f2594d56aaa2 *README.md
439bf689fa27cf9affd0335332142165 *build/partial.rdb
0d9a79da00e8d03163b8cb1eb9023883 *data/stopwordsISO.rda
524856af477322b99c5b4f363bebb467 *man/del.stopwords.Rd
4c5a77d10372f0d43666efb37111e5ce *man/figures/compare_stopwords_lists.png
70fe8b86f517858016d0e3ffcc3befa9 *man/figures/logo.png
b397ba665c491aef3cdb835fec48169c *man/languages.Rd
cc0165ee8f722e54752677ffbf7d0198 *man/match.lang.Rd
027a754837b760cfdeceb3aa75b76c20 *man/remove.stopwords.Rd
25d128cfc325af9c2a8e87a102680856 *man/stopwords.Rd
fc95238c73b4f2fec2003a1b84d797da *man/stopwordsISO.Rd
NAMESPACE (new file)
@@ -0,0 +1,5 @@
# Generated by roxygen2: do not edit by hand

export(languages)
export(remove.stopwords)
export(stopwords)
R/data.R (new file)
@@ -0,0 +1,28 @@
#' Combined stop words for all languages
#'
#' A list of stop words in each of the supported languages.
#'
#' Note: All Unicode characters are escaped. To un-escape them, consider using:
#'
#' \preformatted{
#' library(morestopwords)
#' if(!requireNamespace('stringi')){
#'   install.packages('stringi')
#' }
#' data('stopwordsISO')
#' stopwords_unescaped <- lapply(stopwordsISO,
#'                               stringi::stri_unescape_unicode)
#' }
#'
#' @source All unique stop words in the following databases:\itemize{
#'   \item the StopwordISO \href{https://github.com/stopwords-iso/stopwords-iso}{repository};
#'   \item python's Natural Language Toolkit (\href{https://www.nltk.org/}{nltk});
#'   \item the \href{http://snowball.tartarus.org/algorithms/english/stop.txt}{Snowball} stop-word list;
#'   \item the R package \href{https://quanteda.io/}{quanteda};
#'   \item the marimo \href{https://github.com/koheiw/marimo}{repository};
#'   \item the \href{https://www.perseus.tufts.edu/hopper/stopwords}{Perseus} project; and
#'   \item Aurélien Berra's list of stop words for Ancient Greek and Latin (\doi{10.5281/zenodo.3860343}).
#' }
#'
#' @author The authors of each stop-word list
'stopwordsISO'
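The note above says the stored lists keep non-ASCII stop words in escaped form. A minimal base-R illustration of what un-escaping does, using the French word "été" as a hypothetical entry (the package docs recommend `stringi::stri_unescape_unicode()` for real use; the `eval(parse())` trick below is only a dependency-free stand-in):

``` r
# Escaped form, as stored in the data: literal backslash-u sequences
esc <- "\\u00e9t\\u00e9"
nchar(esc)    # 13 characters: '\', 'u', '0', '0', 'e', '9', 't', ...

# One-off un-escape by re-parsing the text as an R string literal
# (a hedged stand-in for stringi::stri_unescape_unicode)
unesc <- eval(parse(text = paste0('"', esc, '"')))
nchar(unesc)  # 3 characters: the actual word "été"
```

This is why a raw `grepl('été', ...)` against the stored lists would miss matches unless the lists are un-escaped first.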
R/internal.R (new file)
@@ -0,0 +1,64 @@
#' Matches a string with the ISO 639-1 code available in this library
#'
#' See \url{https://en.wikipedia.org/wiki/ISO_639-1} for details of the language codes.
#'
#' @param lang Either an ISO 639-2/3 code or a language name from which to derive an ISO 639-1 code. For language names, partial string matching is performed.
#'
#' @returns A character vector containing the two-letter ISO 639-1 code associated with the requested language.
#'
#' @keywords internal

match.lang <- function(lang){
  df <- languages()
  df$name <- tolower(df$name)
  lang <- tolower(lang)

  pos <- ifelse(test = nchar(lang)==2,
                # Possible 2-letter code
                yes = which(df$`ISO639-1`==lang),
                no = ifelse(test = nchar(lang)>3,
                            # Possible language name
                            yes = which(df$name==match.arg(lang, df$name)),
                            # Possible 3-letter code
                            no = ifelse(test = any(lang%in%df$`ISO639-2`),
                                        # Is it an ISO 639-2 code?
                                        yes = which(df$`ISO639-2`==lang),
                                        # Otherwise, try as an ISO 639-3 code
                                        no = which(df$`ISO639-3`==lang))))

  if(is.na(pos)){
    # No match
    stop('Not a valid language (code): ', lang)
  } else {
    # Return match
    df$`ISO639-1`[pos]
  }
}

#' Removes stop words from a string whose language is known
#'
#' @param str The string from which to delete the stop words
#' @param lang Either an ISO 639-2/3 code or a language name from which to derive an ISO 639-1 code. For language names, partial string matching is performed.
#'
#' @returns A character vector corresponding to the string \code{str} without stop words for the language \code{lang}
#'
#' @keywords internal

del.stopwords <- function(str, lang){

  # Find stop words
  stpwrds <- stopwords(lang = lang)

  # Remove stop words
  y <- str

  for(w in stpwrds){
    y <- gsub(paste0('\\b', w, '\\b'), '', y)
  }

  # Collapse the double spaces left behind by the removals
  while(any(grepl('  ', y))){
    y <- gsub('  ', ' ', y, fixed = TRUE)
  }

  trimws(y)
}
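The nested `ifelse()` dispatch in `match.lang()` can be hard to follow. A minimal sketch of the same three-way lookup (2-letter code, name, 3-letter code), written as a plain `if`/`else` chain against a hypothetical two-row code table rather than the package's full ISO data:

``` r
# Hypothetical mini code table; the real one comes from languages()
df <- data.frame(
  `ISO639-1` = c('en', 'de'),
  `ISO639-2` = c('eng', 'deu'),
  `ISO639-3` = c('eng', 'deu'),
  name       = c('english', 'german'),
  check.names = FALSE
)

match_lang <- function(lang, df) {
  lang <- tolower(lang)
  pos <- if (nchar(lang) == 2) {
    which(df$`ISO639-1` == lang)                 # 2-letter code
  } else if (nchar(lang) > 3) {
    which(df$name == match.arg(lang, df$name))   # name, partial matching
  } else if (lang %in% df$`ISO639-2`) {
    which(df$`ISO639-2` == lang)                 # 3-letter ISO 639-2 code
  } else {
    which(df$`ISO639-3` == lang)                 # 3-letter ISO 639-3 code
  }
  if (length(pos) == 0) stop('Not a valid language (code): ', lang)
  df$`ISO639-1`[pos]
}

match_lang('germ', df)  # "de" - unambiguous substring of "german"
match_lang('eng', df)   # "en"
```

The `if`/`else` chain also sidesteps the quirk the original relies on: `ifelse(TRUE, integer(0), ...)` returns `NA`, which is why `match.lang()` tests `is.na(pos)` instead of `length(pos) == 0`.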
R/stopwords.R (new file)
@@ -0,0 +1,147 @@
#' Collection of stop words in multiple languages
#'
#' This function returns stop words contained in the \href{https://github.com/stopwords-iso/stopwords-iso}{StopwordsISO} repository.
#'
#' @param lang Language for which to retrieve the stop words, among those supported. This parameter accepts: \itemize{
#'   \item three-letter ISO 639-2/3 codes (e.g., \code{'eng'});
#'   \item two-letter ISO 639-1 codes (\code{'en'});
#'   \item names based on ISO 639-2 codes (\code{'English'} or \code{'english'}) and their unambiguous substrings (\code{'engl'}, \code{'engli'}, etc.).
#' }
#'
#' @return A character vector containing the stop words for the selected language as listed in the \href{https://github.com/stopwords-iso/stopwords-iso}{StopwordISO} repository.
#'
#' @export
#'
#' @examples
#' # They all return the same list of stop words
#' stopwords('German')
#' stopwords('germ')
#' stopwords('de')
#' stopwords('deu')

stopwords <- function(lang = 'en') {

  lang <- match.lang(lang = lang)

  if (lang %in% names(stopwordsISO)){
    stopwordsISO[[lang]]
  } else {
    stop(paste0(lang, ' is not supported by `StopwordsISO`!'))
  }

}

#' Returns ISO codes and names for all languages or only those available in this package
#'
#' See the relevant \href{https://en.wikipedia.org/wiki/ISO_639-1}{Wikipedia article} for details on the language codes.
#'
#' Note that: \itemize{
#'   \item the ISO 639-1 code for mainland Chinese was changed to \code{zh-cn};
#'   \item a list of stop words in the variety of Chinese spoken on the island of Taiwan is accessible using the ISO 639-1 code \code{zh-tw} or the name \code{'Chinese Taiwan'};
#'   \item Ancient Greek has been assigned an artificial ISO 639-1 code (\code{gr}) because it had none. Its ISO 639-2 and 639-3 codes are both \code{grc}.
#' }
#'
#' @param available \emph{logical}, whether to return only the languages supported in this package.
#'
#' @returns A data frame with a row for each language (only the supported ones if \code{available} is \code{TRUE}) and columns for the several ISO codes (639-2, 639-3, 639-1) and the name.
#'
#' @export
#'
#' @examples
#' # Return the languages supported in this package
#' languages()

languages <- function(available = TRUE) {

  # Extract language codes
  code <- names(stopwordsISO)

  # Prepare the table
  if(available){
    code <- ISOcodes[match(code, ISOcodes$`ISO639-1`),]
    rownames(code) <- NULL
  }

  code
}

#' Removes stop words from one or more strings
#'
#' @param str A string, or a vector of strings, from which to delete the stop words
#' @param lang Either: \itemize{
#'   \item \code{'auto'}, in which case \code{cld2} is used to perform language detection; or
#'   \item a string (or a vector of strings, depending on \code{str}) representing an ISO 639-2/3 code or a language name from which to derive an ISO 639-1 code (for language names, partial string matching is performed).
#' }
#' @param fallback Fallback language used when \code{cld2} fails to detect the language or the manually specified string does not match a supported language. Defaults to \code{'English'}.
#'
#' @returns A string (or a vector of strings, depending on \code{str}) corresponding to \code{str} without stop words for the language(s) \code{lang}.
#'
#' @export
#'
#' @examples
#' # Multiple strings in different languages
#' remove.stopwords(str = c(Gibberish = 'dadas',
#'                          Catalan = 'Adeu amic meu',
#'                          Irish = 'Slan a chara',
#'                          French = 'Je suis en Allemagne',
#'                          German = 'Ich liebe Deutschland'),
#'                  # Various ways of indicating the language
#'                  lang = c(NA, 'cata', 'Iris', 'fr', 'deu'),
#'                  # Yet another way
#'                  fallback = 'english'
#'                  )
#'
remove.stopwords <- function(str, lang = 'auto', fallback = 'English'){
  # Code of the fallback language
  fallback <- match.lang(fallback)

  # Language detection
  if(length(lang) == 1 && lang == 'auto'){
    # Whether it is possible to use `cld2`
    has_cld2 <- requireNamespace('cld2', quietly = TRUE)

    if(has_cld2){ # Possible
      # Detect language
      lang <- cld2::detect_language(str, lang_code = TRUE)

      # If unknown, use the fallback language
      # Works both when `str` is a string and when it is a vector of strings
      if(any(is.na(lang))){
        lang[is.na(lang)] <- fallback
      }
    } else { # Impossible
      # Use fallback language
      fallback_name <- ISOcodes$name[which(ISOcodes$`ISO639-1`==fallback)]
      lang <- fallback
      # Warn the user
      warning(paste('Language detection requires the package `cld2`\n',
                    'Reverting to fallback language:', fallback_name))

    }
  } else {
    # If unknown, use the fallback language
    # Works both when `str` is a string and when it is a vector of strings
    if(any(is.na(lang))){
      lang[is.na(lang)] <- fallback
    }

    # Code language/s
    lang <- if(length(lang) > 1){
      lapply(lang, match.lang) |> unlist()
    } else {
      match.lang(lang)
    }

  }

  out <- if(length(str) > 1){
    lapply(seq_along(str), function(w){
      del.stopwords(str = str[[w]], lang = lang[w])
    })
  } else {
    del.stopwords(str = str, lang = lang)
  }

  out
}
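The removal step that `remove.stopwords()` ultimately delegates to is a word-boundary `gsub()` loop followed by whitespace cleanup. A self-contained sketch of that core logic, using a hypothetical three-word stop list instead of the package's `stopwords('en')` data:

``` r
# Hypothetical mini stop list standing in for stopwords('en')
stpwrds <- c('the', 'a', 'of')
y <- 'the quick brown fox of a farm'

# Delete each stop word only at word boundaries, so 'a' does not
# match the 'a' inside 'farm'
for (w in stpwrds) {
  y <- gsub(paste0('\\b', w, '\\b'), '', y)
}

# Collapse the double spaces left behind, then trim the ends
while (any(grepl('  ', y))) {
  y <- gsub('  ', ' ', y, fixed = TRUE)
}
y <- trimws(y)
y  # "quick brown fox farm"
```

The `while` loop matters because removing adjacent stop words (here "of a") leaves runs of more than two spaces, which a single `gsub('  ', ' ', ...)` pass would not fully collapse.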
README.md (new file)
@@ -0,0 +1,50 @@
# morestopwords: All Stop Words in One Place <img src="man/figures/logo.png" align="right" width="120"/>

Author: Fabio Ashtar Telarico, University of Ljubljana, FDV

<!-- badges: start -->

![](https://img.shields.io/badge/Original%20author-koheiw-lightgrey)\
![](https://img.shields.io/badge/R%20CMD-passing-brightgreen)\
![](https://img.shields.io/badge/version-0.2.0-orange)\
![](https://img.shields.io/badge/CRAN-0.2.0-blue)\
![](https://img.shields.io/github/last-commit/fatelarico/stopwords?logo=GitHub&logoColor=orange&style=plastic)

<!-- badges: end -->

# Introduction

`stopwords` is an R package originally developed by [Kohei Watanabe](https://github.com/koheiw) of the Waseda Institute for Advanced Study (see his publications [here](https://scholar.google.com/citations?user=9BGfT7EAAAAJ&hl=en)) that provides easy access to stop words in more than 50 languages from the Stopwords ISO library.

The original package has not been updated since December 22, 2017 and could no longer be installed from GitHub, so this reboot was created to give the project continuity.

# Installation

### CRAN (stable release)

``` r
install.packages('morestopwords')
```

### GitHub (development version)

``` r
if(requireNamespace('remotes'))
  remotes::install_github('fatelarico/morestopwords')
```

# Usage

The code base has changed since version 0.1.0 (the last version maintained by Dr. Watanabe). The function `morestopwords::stopwords()` now supports not only two-letter ISO codes, but also three-letter ones. Moreover, it can identify languages by their English ISO name (e.g., German, not *Deutsch*; Swedish, not *Svenska*, etc.).

# Comparison to similar packages

The package [`stopwords`](https://CRAN.R-project.org/package=stopwords) is also based on Watanabe's archived GitHub repository, so it is the package most similar to `morestopwords`. However, the two packages differ in both design choices and features:

1. `morestopwords` has no dependencies and integrates with the package [`cld2`](https://CRAN.R-project.org/package=cld2).
2. `morestopwords` can (if `cld2` is installed) identify the language of one or more strings automatically.
3. `morestopwords` can remove stop words from one or more strings, either in conjunction with language detection or independently.
4. `morestopwords` does not let the user choose among several stop-word lists; rather, it tries to provide the most comprehensive list in an intuitive way.
5. `morestopwords`'s lists include more stop words than any single list included in `stopwords`.

![](./man/figures/compare_stopwords_lists.png)